This page records the v1.0/v1.1 implementation contract, validation criteria, and future roadmap. Items marked done are included in the shipped releases (v1.0.0 and v1.1.0). Items marked wip or future are deferred to v2.0+. The goal: Django for adaptive multi-model LLM systems — batteries included, deeply configurable, extensible without editing AUA internals.
✓ Status: v1.1.0 shipped.
All items marked done are included in v1.0.0/v1.1.0 and validated. v1.1.0 adds the complete AUA-Veritas backport (V-P1–V-P3), end-to-end session IDs (#15), live secrets-provider tests (#19), and YAML-wired plugins/hooks/middleware (F-09–F-11): 297 tests, 50+ REST endpoints, a 40-check live-router E2E suite, and every tutorial command verified as written. v1.0 record: docs/v1_validation_report.md (197 tests, 23 REST endpoints, 8 plugin protocols, 18 Prometheus metrics).
Reading guide. Items are ordered so each step unlocks the next with minimal rework. Original roadmap numbers (#01–#73) are preserved where they exist; new items introduced by the planning documents use prefixed IDs (P-, F-, E-, U-, D-). Items already implemented (v0.5 POC) are marked done. Items needing production hardening are marked wip.
Milestones
Items #01–#10 are production-complete in v1.0.0. This milestone records the hardening work completed: quality, contracts, correctness, and packaging. All items are shipped.
Repo hygiene & packaging
Must happen first — CI, installs, and all other work depend on this being clean.
Run black aua tests and isort aua tests across the entire repo. Add [tool.black], [tool.isort], [tool.ruff], and [tool.mypy] sections to pyproject.toml. line-length=100, target-version=py310. This is a prerequisite for CI.
Package name adaptive-utility-agent. Optional extras: pip install aua[vllm], aua[dev], aua[ui], aua[postgres], aua[otel]. vllm must be optional — it is not Mac-friendly. Include aua/tiers/*.yaml in wheel data. Add Makefile with install, test, lint, format, typecheck, build targets.
Create aua/version.py containing __version__ = "1.0.0". All other references import from here. Create aua/py.typed marker file. Add version consistency test: importlib.metadata.version("adaptive-utility-agent") == aua.__version__.
Matrix: Python 3.10, 3.11, 3.12. Steps: pip install -e ".[dev]" → ruff → black check → mypy → pytest. Must pass without GPUs using fake specialists. Also add wheel-build validation step.
Public API alignment & config strictness
Config correctness unblocks Docker, tiers, presets, and all downstream work.
Export exact roadmap API: Arbiter = ArbiterAgent (alias), BlueGreenDeployment, CorrectionLoop. Expose all endpoint models from aua.endpoints. Stable __all__. Add import smoke tests in tests/test_imports.py. Add docs/public_api.md separating stable from internal APIs.
Remove hardcoded localhost from endpoint construction. Add host, scheme, endpoint_path, endpoint_override fields to SpecialistConfig and ArbiterConfig. Add strict unknown-key validation (catches typos). Add duplicate port validation. Add threshold range validation. Add RuntimeConfig (.aua/logs, .aua/pids, .aua/state, .aua/checkpoints). Add APIConfig with cors_origins.
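The strict validation rules above (unknown keys, duplicate ports, threshold ranges, no hardcoded localhost) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the shipped aua config loader: the default endpoint_path, the confidence_threshold field name, and the ConfigError and load_specialists names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


class ConfigError(ValueError):
    """Raised when a config mapping fails strict validation (hypothetical name)."""


@dataclass
class SpecialistConfig:
    name: str
    port: int
    host: str = "127.0.0.1"
    scheme: str = "http"
    endpoint_path: str = "/v1/chat/completions"  # assumed default path
    endpoint_override: Optional[str] = None
    confidence_threshold: float = 0.5  # hypothetical threshold field

    @property
    def endpoint(self) -> str:
        # An explicit override wins; otherwise build the URL from its parts.
        if self.endpoint_override:
            return self.endpoint_override
        return f"{self.scheme}://{self.host}:{self.port}{self.endpoint_path}"


def load_specialists(raw_items):
    """Validate a list of raw mappings and return SpecialistConfig objects."""
    allowed = {f.name for f in fields(SpecialistConfig)}
    seen_ports = {}
    specialists = []
    for raw in raw_items:
        # Unknown keys are rejected so that typos fail loudly.
        unknown = sorted(set(raw) - allowed)
        if unknown:
            raise ConfigError(f"unknown key(s) {unknown} in specialist config")
        try:
            spec = SpecialistConfig(**raw)
        except TypeError as exc:  # a required key is missing
            raise ConfigError(str(exc)) from exc
        if not 0.0 <= spec.confidence_threshold <= 1.0:
            raise ConfigError(f"{spec.name}: confidence_threshold must be in [0, 1]")
        if spec.port in seen_ports:
            raise ConfigError(
                f"{spec.name}: port {spec.port} already used by {seen_ports[spec.port]}"
            )
        seen_ports[spec.port] = spec.name
        specialists.append(spec)
    return specialists
```

Rejecting unknown keys before constructing the dataclass is what turns a typo such as prot: 8001 into an immediate error instead of a silently ignored setting.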
Canonical tier names: macbook, single-4090, quad-4090, a100-cluster. Backward-compatible aliases: rtx4090 → single-4090, a100 → a100-cluster. Add quad-4090.yaml template. Every tier template must load in CI. Annotate generated YAML with inline comments.
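A minimal sketch of the alias resolution described above. The tier names and aliases come from this page; the resolve_tier function name is hypothetical.

```python
CANONICAL_TIERS = ("macbook", "single-4090", "quad-4090", "a100-cluster")

# Backward-compatible aliases mapped to their canonical tier names.
TIER_ALIASES = {"rtx4090": "single-4090", "a100": "a100-cluster"}


def resolve_tier(name: str) -> str:
    """Return the canonical tier name, accepting legacy aliases."""
    canonical = TIER_ALIASES.get(name, name)
    if canonical not in CANONICAL_TIERS:
        valid = ", ".join(CANONICAL_TIERS)
        raise ValueError(f"unknown tier {name!r}; expected one of: {valid}")
    return canonical
```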
Runtime hardening — serve, router, rollback, CLI, tests
Production lifecycle, API contracts, and test suite.
SIGINT/SIGTERM handler that terminates child processes (TERM → 15s grace → KILL). Write .aua/pids/ and .aua/logs/ on startup. Async readiness polling per specialist. Port-conflict detection before start with --reuse-running flag. Deterministic --dry-run output (exit 0). Document foreground-only mode explicitly in help.
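The TERM, grace period, KILL sequence can be sketched with subprocess and signal from the standard library. This is a sketch, not the aua serve implementation: the function names are hypothetical, and the 128 + signum exit code is a common shell convention assumed here rather than something this page specifies.

```python
import signal
import subprocess
import sys
import time


def terminate_children(children, grace_seconds=15.0):
    """Send TERM to every child, wait out a shared grace period, then KILL."""
    for child in children:
        if child.poll() is None:  # still running
            child.terminate()
    deadline = time.monotonic() + grace_seconds
    exit_codes = []
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            child.kill()  # grace period exhausted
            child.wait()
        exit_codes.append(child.returncode)
    return exit_codes


def install_signal_handlers(children):
    """Register the shutdown sequence for SIGINT and SIGTERM (main thread only)."""

    def _handler(signum, frame):
        terminate_children(children)
        sys.exit(128 + signum)  # assumed exit-code convention

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _handler)
```

The grace period is shared across all children rather than applied per child, so shutdown takes at most roughly grace_seconds regardless of how many specialists are running.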
Move CORS from wildcard to APIConfig.cors_origins. Add session_id (auto-generated UUID if not supplied) to every request/response. Add ErrorResponse model with stable error codes. Distinguish 503 (unreachable) from 504 (timeout). Add GET /version. Redact secrets from GET /config. Make POST /deploy/green honest (return dry_run_only until harness exists). Preserve batch result order with per-item index and ok fields.
Move promotions log to .aua/state/promotions.jsonl with UUID per record. File locking via filelock to prevent concurrent rollback+deploy races. Atomic config writes: write to .tmp then os.replace(). Add --dry-run to rollback. Add POST /deploy/rollback REST endpoint or document CLI-only clearly.
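The write-to-.tmp-then-os.replace() pattern and the UUID-per-record log can be sketched as follows. The helper names are hypothetical, and the filelock-based locking mentioned above is deliberately left out to keep the sketch free of third-party dependencies.

```python
import json
import os
import uuid
from pathlib import Path


def atomic_write_text(path, text):
    """Write text to a sibling .tmp file, then swap it into place."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes reach disk before the swap
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def append_promotion(log_path, record):
    """Append one JSON line with a fresh UUID and return that UUID."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"id": str(uuid.uuid4()), **record}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry["id"]
```

A reader of the config file sees either the old content or the new content, never a half-written file, because os.replace() swaps the directory entry in one step.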
All CLI commands: correct exit codes (0=pass, 1=fail, 2=warn-in-strict). Add --json to aua doctor and aua status. Add --strict to aua doctor. Add --once, --url, --refresh to aua status. Streaming: add SSE named event fields (event: chunk), heartbeat comments (: keep-alive every 15s), client disconnect handling, and a robust data: [DONE] parser. Add a Content-Encoding: none header on SSE routes to keep gzip middleware from compressing and buffering the stream.
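A client-side sketch of the streaming rules above: named events, heartbeat comments that must be ignored, and the data: [DONE] terminator. It follows the general Server-Sent Events line format; the parse_sse name is hypothetical and this is not the parser shipped in aua.

```python
def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines.

    Comment lines (starting with ':') are heartbeats and are skipped.
    A data payload of '[DONE]' ends the stream.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # heartbeat comment such as ': keep-alive'
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                payload = "\n".join(data)
                if payload.strip() == "[DONE]":
                    return
                yield event, payload
            event, data = "message", []
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is part of the framing
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

Because heartbeats are comments, they never produce an event, so a consumer can treat the absence of any bytes for longer than the heartbeat interval as a dead connection.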
Create tests/fakes/openai_server.py: FastAPI fake with GET /v1/models, POST /v1/chat/completions (buffered + streaming). Create fixture configs: minimal, rtx4090, macbook, invalid_duplicate_ports, invalid_unknown_key, invalid_threshold. Test files: test_imports, test_config, test_cli_init, test_cli_doctor, test_router_api, test_streaming, test_status, test_rollback. CI must pass without GPUs.
v0.6-alpha definition of done: pip install -e ".[dev]" && python -c "from aua import Router, Arbiter, UtilityScorer, BlueGreenDeployment, CorrectionLoop; print('ok')" && aua init --tier macbook --force && aua doctor --strict && aua serve --dry-run && pytest -q && ruff check aua tests && black --check aua tests && mypy aua all pass from a fresh clone.
Packaging is clean. Now add the deployment layer (#11–#13) and the Django-like configuration system
(presets, model registry, field registry, aua config command group). These are ordered so the
registry items come before the commands that reference them.
Docker + hardware tier templates
Run everything in one command. Depends on clean pyproject.toml (P-02) and correct tier names (P-07).
docker-compose up starts router, specialists, arbiter. docker-compose --profile gpu up for GPU variant. Profiles: cpu (Ollama/local), gpu (vLLM), observability (Prometheus + Grafana optional), secure. Health checks on all containers. Volume mounts for .aua/ state. Environment file support. Structured logs to stdout.
Canonical tiers: macbook (Ollama), single-4090 (RTX 4090, vLLM AWQ), quad-4090 (4× RTX 4090, multi-GPU), a100-cluster (A100 80GB, fp16). Each specifies specialists, arbiter, GPU assignment, memory split, ports, promotion thresholds, default models, observability defaults, security defaults. Annotated with inline YAML comments.
Django config layer — registries, presets, config commands
Registries before presets. Presets before config commands. Config commands before tutorial.
Built-in aliases: qwen-coder-7b-awq, qwen-math-7b-awq, qwen-14b-awq, llama3-8b, etc. Each entry: provider, full model ID, backend, quantization, recommended VRAM. User-defined registry entries in YAML. Hardware compatibility checks (AWQ + Ollama → error). CLI: aua models list, aua models inspect <alias>. Compact config uses aliases: model: qwen-coder-7b-awq.
Built-in fields: software_engineering, mathematics, general, research, medicine, law, finance. Each has aliases (swe, coding), description, default utility weights, default confidence threshold, default arbiter policy. User-defined custom fields. CLI: aua fields list, aua fields inspect <field>.
Built-in presets: coding, math, research, generalist, medical-safe, legal-safe, local-ollama. Each preset selects fields, default models (via registry aliases), routing thresholds, utility weights, arbiter policy. Presets live in aua/defaults/presets/. Compact config: preset: coding expands to full config. CLI: aua presets list, aua presets inspect <name>. aua init --preset coding --tier single-4090 works end-to-end.
aua config validate (strict schema check), aua config expand (compact → full YAML printed to stdout), aua config show (current loaded config, secrets redacted), aua config diff config_a.yaml config_b.yaml, aua config explain routing.fanout_threshold (human-readable explanation with default, range, higher/lower meaning). JSON output option on all commands. Config schema registry with explanation strings for every key.
Hot reload + tutorial rewrite
Tutorial written last so it reflects the final config system and presets.
SIGHUP triggers reload. aua config reload CLI command. Hot-reloadable without restart: routing thresholds, utility weights, promotion thresholds, logging level, rate limits, CORS origins, webhook config. Partial restart required for: new specialist, changed model, changed backend, changed mTLS certs. Reload is atomic — validate new config before applying.
Replaces tutorial.html. Do not start with theory. Part 1: quickstart in <10 min. Parts 2–12 progressively teach: models & fields, routing & utility, arbiter & corrections, blue-green, plugins, hooks/middleware, security, observability, Docker deployment, expert deployment (full custom config without editing AUA internals). Tutorial verifies commands from each section actually run.
v0.7-beta definition of done: docker-compose up works. aua init --preset coding --tier single-4090 && aua config validate && aua config expand && aua models list && aua fields list && aua presets list && aua serve --dry-run all pass. Tutorial Part 1 tested from scratch.
The configuration layer exists. Now add the extensibility layer: architecture spec, contracts, error taxonomy, state abstraction, plugin system, hooks, middleware. These are ordered so lower-level items (architecture, contracts, errors, state) come before the systems that depend on them (plugins, hooks, middleware).
Framework discipline — architecture, contracts, errors, state, config versioning
These define the contracts everything else implements. Must come before plugins and hooks.
Defines: runtime architecture, full request lifecycle (Input → Middleware → Session → Correction Retrieval → Classifier → Routing → Specialist Calls → Utility Scoring → Arbiter → Hooks → Correction Logging → Response → Metrics/Logs/Traces/Audit), component boundaries, component ownership, plugin loading lifecycle, hook/middleware execution order, observability flow, security boundary. This doc prevents implementation drift.
Define formal Python protocols for all extension points: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin, ModelBackendPlugin, HookPlugin, AUAMiddleware. Create docs/public_api.md. Guarantee: v1.x will not break public import paths or plugin method signatures. Deprecated APIs get one minor release of warning.
Define stable error code list: AUA_CONFIG_INVALID, AUA_BACKEND_UNREACHABLE, AUA_SPECIALIST_TIMEOUT, AUA_ARBITER_TIMEOUT, AUA_PLUGIN_LOAD_FAILED, AUA_PLUGIN_CONTRACT_INVALID, AUA_HOOK_FAILED, AUA_MIDDLEWARE_FAILED, AUA_AUTH_REQUIRED, AUA_FORBIDDEN, AUA_RATE_LIMITED, AUA_PROMOTION_REJECTED, AUA_ROLLBACK_FAILED, AUA_STATE_STORE_UNAVAILABLE, etc. Map each code to HTTP status and CLI exit code. REST error format: {"error": "AUA_*", "message": "...", "trace_id": "...", "details": {}}. Create AUA_Framework_v1_Error_Codes.md.
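A sketch of building the REST error body in the format given above. The 503 and 504 statuses follow the unreachable/timeout distinction on this page and 429 follows the rate-limiting item; the remaining status values (400, 401, 403) are assumptions chosen from standard HTTP semantics, and error_response is a hypothetical helper name.

```python
ERROR_HTTP_STATUS = {
    "AUA_CONFIG_INVALID": 400,       # assumed mapping
    "AUA_AUTH_REQUIRED": 401,        # assumed mapping
    "AUA_FORBIDDEN": 403,            # assumed mapping
    "AUA_RATE_LIMITED": 429,
    "AUA_BACKEND_UNREACHABLE": 503,
    "AUA_SPECIALIST_TIMEOUT": 504,
    "AUA_ARBITER_TIMEOUT": 504,
}


def error_response(code, message, trace_id, details=None):
    """Return (http_status, body) for a known AUA_* error code."""
    if code not in ERROR_HTTP_STATUS:
        raise ValueError(f"unregistered error code: {code}")
    body = {
        "error": code,
        "message": message,
        "trace_id": trace_id,
        "details": details or {},
    }
    return ERROR_HTTP_STATUS[code], body
```

Refusing unregistered codes keeps the code list stable: a new error cannot reach clients until it has been added to the table and documented.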
Define scopes: aua:query, aua:stream, aua:status, aua:config:read, aua:config:write, aua:corrections:read, aua:corrections:write, aua:deploy, aua:rollback, aua:extensions:read, aua:extensions:write, aua:tokens:read, aua:tokens:write, aua:admin. Map every endpoint to required scope. Extension/runtime import endpoints require aua:extensions:write or aua:admin. Create scope matrix table in docs.
Define state store interface. State categories: chat sessions, corrections, promotion logs, audit logs, DPO pairs, config snapshots, token metadata, cost records, eval runs, routing traces. Implementations: local files (default), SQLite (local dev), Postgres (production). Config: state: {backend: sqlite, path: .aua/state/aua.db}. Migration support. Doctor checks. Removes the scattered .jsonl files in favor of structured storage.
Add config_version: "1.0" field to YAML. CLI: aua config check-version, aua config migrate --from 0.6 --to 1.0, aua config migrate --dry-run. Clear error for unsupported config versions. Migration tests against old config fixtures. Required before v1.0 to support users upgrading from v0.x.
Extension system — plugins, backends, hooks, middleware, extension CLI
Ordered: plugin interfaces → backend plugins → import system → hooks → middleware → runtime API → extension CLI.
Define base protocols in aua/plugins/: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin. Each has: required method signatures, type hints, docstrings, example implementations, test harness. Version compatibility guarantees documented.
ModelBackendPlugin protocol: complete(request) -> dict, stream(request) -> AsyncIterator, health() -> dict. Built-in backends: vLLM (existing), Ollama (existing), OpenAI-compatible generic. YAML registration: backends: {my_gateway: {plugin: my_company.backends:GatewayBackend, base_url: ..., auth_secret: ...}}. Doctor checks backend health for all registered backends.
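The three-method backend contract can be expressed with typing.Protocol. The method names and return types come from the item above; whether complete() and health() are coroutines is an assumption made here, and EchoBackend is a hypothetical toy backend used only to show conformance.

```python
import asyncio
from typing import AsyncIterator, Protocol, runtime_checkable


@runtime_checkable
class ModelBackendPlugin(Protocol):
    async def complete(self, request: dict) -> dict: ...

    def stream(self, request: dict) -> AsyncIterator[str]: ...

    async def health(self) -> dict: ...


class EchoBackend:
    """Toy backend that echoes the prompt back (illustration only)."""

    async def complete(self, request: dict) -> dict:
        return {"text": request["prompt"]}

    async def stream(self, request: dict) -> AsyncIterator[str]:
        # An async generator: calling it returns an AsyncIterator of chunks.
        for word in request["prompt"].split():
            yield word

    async def health(self) -> dict:
        return {"status": "ok"}
```

Note that isinstance() against a runtime_checkable protocol only checks that the attributes exist, not their signatures, so a loader that wants strict validation still needs its own signature checks.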
YAML syntax: import_path: plugins.custom_utility:RiskWeightedUtilityScorer. Config injection: config: {risk_weight: 0.7} passed to constructor. Strict interface validation on load (not on import). Clear error messages on mismatch. Doctor checks all import paths. Reload support where safe. Allowlisted paths only in production mode. Users never edit AUA source files.
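A sketch of resolving a module:Class import path, injecting config into the constructor, and validating the interface at load time. The exception and function names are hypothetical, and the required_methods check is a simplified stand-in for full contract validation.

```python
import importlib


class PluginLoadError(ImportError):
    """The import path could not be resolved (hypothetical name)."""


class PluginContractError(TypeError):
    """The loaded object does not satisfy the expected interface (hypothetical name)."""


def load_plugin(import_path, config=None, required_methods=()):
    """Resolve 'package.module:ClassName', instantiate it, and check its methods."""
    module_name, sep, attr = import_path.partition(":")
    if not (sep and module_name and attr):
        raise PluginLoadError(f"expected 'module:Class', got {import_path!r}")
    try:
        module = importlib.import_module(module_name)
        cls = getattr(module, attr)
    except (ImportError, AttributeError) as exc:
        raise PluginLoadError(f"cannot resolve {import_path!r}: {exc}") from exc
    instance = cls(**(config or {}))  # config block is passed to the constructor
    missing = [m for m in required_methods if not callable(getattr(instance, m, None))]
    if missing:
        raise PluginContractError(f"{import_path} is missing method(s): {missing}")
    return instance
```

For example, load_plugin("fractions:Fraction", {"numerator": 3, "denominator": 4}) uses a standard-library class as a stand-in plugin and returns Fraction(3, 4).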
Hooks: pre_query, post_query, pre_route, post_route, pre_specialist_call, post_specialist_call, pre_arbiter, post_arbiter, pre_response, post_response, on_correction, on_promotion, on_rollback. Hook interface: async def __call__(self, event: dict) -> dict. Configurable ordering. Error handling policy (fail-open vs fail-closed). Timeout policy per hook. Audit logged. Metrics tracked. YAML registration under hooks:.
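The hook interface, per-hook timeout, and fail-open versus fail-closed policy can be sketched as a small runner. The async callable signature comes from the item above; run_hooks, HookFailed, and the parameter names are hypothetical.

```python
import asyncio


class HookFailed(RuntimeError):
    """Raised in fail-closed mode when a hook errors or times out (hypothetical name)."""


async def run_hooks(hooks, event, *, timeout_s=1.0, fail_open=True):
    """Run hooks in list order, threading the event dict through each one."""
    for hook in hooks:
        try:
            event = await asyncio.wait_for(hook(event), timeout=timeout_s)
        except Exception as exc:  # timeouts and hook errors land here
            if not fail_open:
                name = getattr(hook, "__name__", repr(hook))
                raise HookFailed(f"hook {name} failed: {exc!r}") from exc
            # Fail-open: skip this hook and keep the last good event.
    return event
```

In fail-open mode a broken hook degrades to a no-op, which suits logging or metrics hooks; policy hooks such as PII redaction are better run fail-closed so a failure blocks the request.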
Interface: before_query(request) -> request, after_response(response) -> response. Ordered execution (list order in YAML). Short-circuit support (return early without calling downstream). Error behavior: configurable (skip, fail). Metrics per middleware. Built-in examples: PIIRedactionMiddleware, TenantPolicyMiddleware, AuditMiddleware. YAML: middleware: [plugins.middleware:PIIRedactionMiddleware].
GET /extensions, GET /extensions/{name}, POST /extensions/test, POST /extensions/reload. Disabled by default in production. Requires aua:extensions:write scope. Only loads allowlisted paths. Audit logged. Warns loudly if bound to 0.0.0.0. Development-only: POST /extensions/import.
aua extensions list, aua extensions inspect <name>, aua extensions test --kind utility_scorer --import-path plugins.custom_utility:RiskWeightedUtilityScorer, aua extensions reload, aua plugins doctor. Also: management command system stub: aua run <command> dispatching to import_path registered in YAML.
Framework defaults — prompt templates, safety policy, defaults registry
Prompt templates and safety policy before presets reference them.
Built-in templates in aua/templates/prompts/: classifier_v1.txt, arbiter_balanced_v1.txt, arbiter_conservative_v1.txt, correction_injection_v1.txt, abstention_v1.txt, medical_safe_v1.txt, legal_safe_v1.txt. Config: prompts: {arbiter_template: arbiter_conservative_v1}. Advanced: arbiter_template_path: ./prompts/custom.txt. Template registry with version tracking. Prompts are framework behavior, not implementation details.
Config: safety: {abstention_enabled: true, high_risk_fields: [medicine, law, finance], require_arbiter_for_high_risk: true, min_confidence_for_direct_answer: 0.90}. Behavior: low confidence → abstain or clarify. High-risk field → force arbiter. Contradiction detected → do not answer confidently. Required by medical-safe and legal-safe presets. Abstention response model in endpoints.py.
aua/defaults/: presets/, models.yaml, fields.yaml, utility.yaml, routing.yaml, security.yaml, prompts.yaml. CLI: aua defaults show, aua defaults show preset coding, aua defaults show models. Makes batteries-included defaults inspectable and avoids hidden magic. Underpins aua config expand.
Required examples: quickstart_coding/, custom_utility_plugin/, custom_field_classifier/, custom_backend/, blue_green_demo/, secure_docker_deploy/. Each: aua_config.yaml, README.md, sample curl commands, expected output, troubleshooting. Examples validate the plugin system, docs, and API simultaneously. Users copy these before reading advanced docs.
v0.8-framework-beta definition of done: aua extensions test --kind utility_scorer --import-path plugins.example:ExampleUtilityScorer passes. aua defaults show and aua models list work. Custom middleware and hook examples in examples/ run. Architecture spec exists and is accurate.
Security items ordered so the foundation (session IDs, scopes) comes before the features that use them (audit log, rate limiting, mTLS). Observability ordered so the pipeline (OTEL) precedes the outputs (Prometheus, Grafana, Datadog).
Security foundation — identity, auth, audit, rate limiting
Session IDs first (everything else references them). Auth before audit. Secrets before mTLS certs.
Every query gets session_id, trace_id, request_id. Propagated through: router, field classifier, specialist calls, arbiter, correction loop, response, logs, metrics, audit log. Returned in every API response. Generated UUID if not supplied by client. Context available to all hooks and middleware.
Config references secret names, not values. Supported: environment variables, local encrypted file, HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager. YAML: secrets: {provider: env}, api_key_secret: OPENAI_API_KEY. Doctor checks secret resolution. Redaction in GET /config and logs.
External endpoints require auth. JWT or signed token support. Scope validation per endpoint (using permission matrix from F-04). Token ID in logs and audit. Admin endpoints strongly protected. Extension endpoints require aua:extensions:write. Local dev mode can opt out of auth with explicit warning.
aua token create --scope aua:query --expires 30d, aua token list, aua token revoke <token-id>, aua token inspect <token-id>. Token metadata stored in state store (F-05). Revocation list checked on every request.
Events: query, routing decision, arbiter verdict, correction injection, config reload, token create/revoke, promotion, rollback, extension load, hook failure, auth failure. Required fields: timestamp, session_id, trace_id, token_id, event_type, field, specialist, utility_score, confidence, latency_ms, hash_chain_previous, hash_chain_current. Append-only to state store. Hash chain provides tamper evidence.
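A minimal sketch of the hash chain behind the hash_chain_previous and hash_chain_current fields. SHA-256 over canonical JSON and the all-zero genesis value are assumptions; the page does not state which digest or serialization the framework uses.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed previous-hash value for the first record


def _digest(body):
    # Canonical JSON so the same record always hashes to the same value.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the hash of the previous record."""
    previous = log[-1]["hash_chain_current"] if log else GENESIS
    body = {**event, "hash_chain_previous": previous}
    record = {**body, "hash_chain_current": _digest(body)}
    log.append(record)
    return record


def verify_chain(log):
    """Return True only if every record hashes correctly and links to its predecessor."""
    previous = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash_chain_current"}
        if body.get("hash_chain_previous") != previous:
            return False
        if _digest(body) != record["hash_chain_current"]:
            return False
        previous = record["hash_chain_current"]
    return True
```

Editing or deleting any earlier record changes its hash and breaks every later link, which is what makes tampering evident without preventing it.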
Router ↔ specialists, router ↔ arbiter, arbiter ↔ correction loop. Config: security: {mtls: {enabled: true, cert_dir: .aua/certs, auto_generate: true}}. CLI: aua certs generate, aua certs rotate. Doctor check: aua doctor --check-certs. Auto-generated dev certs with warning. Production: bring your own CA.
Configurable per token/scope. Config: rate_limits: {aua:query: {requests_per_minute: 60}, aua:admin: {requests_per_minute: 10}}. Behavior options: reject, queue, warn, alert. Returns 429 with Retry-After header. Metrics tracked per scope. Audit logged when limits hit.
CORS configurable via api.cors_origins (already in P-06/P-09). External endpoints explicit. Internal specialist ports not exposed externally by default. Admin endpoints protected. Extension endpoints protected. Unsafe settings (wildcard CORS + 0.0.0.0 + no auth) produce loud doctor warnings.
Encrypt at rest: correction payloads, assertion store entries, DPO pairs, plugin secrets, token metadata, sensitive audit fields. AES-256-GCM. Key managed per deployment. mTLS (#16) covers in-transit encryption. Config: security: {encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}}.
Observability — OTEL, Prometheus, Grafana, logging, alerting, cost
OTEL pipeline first (Prometheus and Datadog export from it). Alerting after metrics. Cost after metrics.
JSON log emission. Fields: timestamp, level, message, session_id, trace_id, request_id, token_id, field, specialist, utility_score, confidence, contradiction, verdict, latency_ms, plugin, hook, middleware. Apply LoggingConfig from config at startup. ELK/Splunk compatible.
Instrument: query, routing, classification, specialist call, arbiter, correction loop, blue-green decision, plugin execution, hook execution, middleware execution. Trace propagation through all components. Span IDs in API response. Optional extra: pip install aua[otel].
Full W3C trace context propagation from router through all specialist calls. Trace ID in every API response header and body. Jaeger/Tempo exporter via OTEL collector. Debugger panel links to trace UI per request.
Metrics: aua_query_total, aua_query_latency_seconds, aua_utility_score, aua_contradiction_rate, aua_routing_field_distribution, aua_specialist_confidence, aua_bluegreen_traffic_split, aua_correction_count, aua_abstention_rate, aua_arbiter_verdict_distribution, aua_dpo_pairs_accumulated, aua_token_requests_total, aua_token_requests_rejected, aua_specialist_vram_utilization, aua_plugin_execution_seconds, aua_hook_failures_total.
Dashboard JSON in docker/grafana/. Panels: query volume, latency p50/p95/p99, routing distribution, utility score trends, contradiction rate, arbiter verdict distribution, specialist health, VRAM usage, blue-green split, auth failures, cost metrics, plugin latency.
OTEL collector config preset for Datadog exporter. Docs. Env var secret handling (DD_API_KEY). Doctor check for Datadog connectivity. Optional extra: pip install aua[datadog].
Alert rules: AUA_HighContradictionRate, AUA_SpecialistDown, AUA_LowUtilityScore, AUA_BlueGreenStalled, AUA_ArbiterCaseFourSpike, AUA_TokenRejectionSpike, AUA_PluginFailureSpike, AUA_HookFailureSpike. Alert rules YAML in docker/prometheus/alerts.yaml.
Events: specialist_promoted, rollback_completed, contradiction_threshold_exceeded, arbiter_inconclusive_spike, specialist_down, plugin_failure, security_auth_failure_spike. Config: webhooks: {slack: {url_secret: SLACK_WEBHOOK_URL, events: [specialist_down, contradiction_threshold_exceeded]}}. Retry with backoff. Audit logged.
Track: GPU hours per query, cost per specialist call, cost per arbiter call, cost per blue-green cycle, cost per plugin execution. Config: cost: {gpu_hour_rates: {rtx4090: 0.50, a100: 2.50}}. Exposed in GET /metrics/cost and aua status.
v0.9-rc1 definition of done: aua token create --scope aua:query --expires 30d && aua certs generate && aua doctor --strict && curl http://localhost:8000/metrics && curl http://localhost:8000/metrics/cost all pass. Grafana dashboard loads in Docker Compose observability profile.
Evaluation harness before Chat UI because the UI's debugger panel references eval runs and DPO pairs. Chat Session API before Chat UI frontend (API-first).
Evaluation + correction export
Required by blue-green promotion demo, tutorial Part 6, and the UI's blue-green debug panel.
aua eval run --dataset evals/coding.yaml --config aua_config.yaml, aua eval report .aua/evals/latest.json, aua eval compare --baseline blue --candidate green. Dataset format: YAML with cases (id, field, prompt, expected_properties). Runner routes each case through the framework, scores with utility function, checks expected properties. JSON output for CI. Regression detection. Wires up to POST /deploy/green harness.
aua corrections export --format jsonl, aua dpo export --format preference-pairs. Output format: {prompt, chosen, rejected, field, utility_chosen, utility_rejected, correction_ids, trace_id}. Traceability to source queries. Safe redaction option. Not a training pipeline — just export. Users bring their own fine-tuning infrastructure.
evals/coding_smoke.yaml, math_smoke.yaml, routing_smoke.yaml, correction_smoke.yaml, arbiter_smoke.yaml, safety_smoke.yaml. Small but useful. Run by aua eval run. Used in CI where feasible (fake specialists). Used in tutorial Part 6. Used in blue-green promotion demo in examples/blue_green_demo/.
Chat UI — session API + reference frontend
Chat Session API (U-01) before the frontend (U-02). Frontend consumes the API.
POST /sessions, GET /sessions, GET /sessions/{id}, DELETE /sessions/{id}, GET /sessions/{id}/messages, POST /sessions/{id}/messages, POST /sessions/{id}/stream. Session store backed by state store abstraction (F-05). Schema: chat_sessions, chat_messages, routing_traces tables. Chat request/response schema includes session_id, trace_id, routing, utility, debug fields. SSE streaming events: start, route, specialist_start, chunk, specialist_done, arbiter_start, arbiter_done, done, error.
Three-zone layout: session sidebar (left) + main chat window (center) + Framework Debugger panel (right). Left pop-up AUA Controls drawer exposes: preset, fields, models, routing thresholds, utility weights, arbiter policy, corrections, blue-green, backends, security, observability, extensions. Right debugger shows: route summary, field classifier output, specialist calls, candidate responses (optional), utility breakdown, arbiter decision, corrections injected, blue-green debug, latency breakdown, cost estimate, trace metadata, plugin/hook/middleware execution. Tech: React/Next.js + TypeScript + Tailwind + shadcn/ui. CLI: aua serve --with-ui, aua ui --port 3000. Config: ui: {enabled: true, host: ..., port: 3000, persist_sessions: true}. Auth required in production. Layout in apps/aua_chat/.
v0.9-rc2 definition of done: aua serve --with-ui starts everything. Open http://localhost:3000, create a chat, ask a coding question, watch streaming response, open Framework Debugger, see route + utility + arbiter + trace, open AUA Controls drawer, change routing threshold, reload config, ask again, see behavior change.
Final polish, documentation, release discipline. Everything else is implemented; this milestone makes it shippable.
Release readiness — compatibility, deployment profiles, examples, release docs
Four profiles documented in AUA_Framework_v1_Deployment_Profiles.md: (1) Local Developer — localhost only, auth optional, SQLite; (2) Single GPU Workstation — auth recommended, local file/SQLite state; (3) Team Server — auth required, mTLS required, Postgres state, Prometheus/Grafana; (4) Enterprise — custom backends, IAM, secrets manager, strict audit, runtime import disabled. Doctor validates profile-specific requirements.
Document in docs/compatibility.md: Python 3.10/3.11/3.12, CUDA versions, vLLM versions, Ollama versions, macOS (Ollama only), Linux (vLLM + GPU), GPU memory recommendations per tier, supported model formats (AWQ, GPTQ, fp16, GGUF), supported backends, browser support for UI, Docker support, Postgres versions, Prometheus/Grafana versions. Doctor references this matrix.
Create: CHANGELOG.md, RELEASE.md (release checklist: tests, lint, types, docs, tutorial, examples, Docker, fresh-clone, UI smoke, security profile, migration notes), MIGRATIONS.md, DEPRECATIONS.md. Versioning policy: semver, v1.x preserves public API, v2 may break plugin contracts with migration guide, deprecated features warn before removal.
All of the following must pass from a fresh clone: install, import, CLI (init, config, doctor, serve), REST (health, version, config, metrics), query, streaming, plugin test, security (token, certs), observability (metrics, cost, status). Chat UI smoke test. Tutorial Part 1 verified from scratch. All examples run. Docker Compose profiles verified. mypy, black, ruff, pytest all pass.
v1.0.0 — shipped: The one-liner from objective_v1.md works — pip install adaptive-utility-agent && aua init --preset coding --tier single-4090 && aua serve — and the expert path works without editing any AUA source files.
These 49 items were discovered, built, and battle-tested in AUA-Veritas — a macOS desktop AI assistant built on top of the AUA framework. Every item here was required to make a real user-facing product work. All four priorities are done: P0 shipped with v1.0.0; P1–P3 were backported and shipped in v1.1.0 (296+ tests, 40-check live-router E2E suite, CI green across Python 3.10–3.12).
Implementation rules discovered through production. Each warning below represents a real bug that caused silent failures in AUA-Veritas. The rules are carried forward into the framework to prevent recurrence.
Priority 0 — Required for any production deployment
All items implemented and tested against real Claude API calls (75 E2E tests, 0 failures).
New tables: conversations (with project_id FK),
messages, model_runs, token_counters,
message_keywords, context_backups, projects.
Endpoints: POST /conversations, GET /conversations (with
?project_id= filter), PATCH /conversations/{id}/title,
GET /conversations/{id}/messages (paginated with before/after
timestamp cursors), POST /projects, GET /projects.
All implemented in aua/state.py and aua/router.py.
model_runs.conversation_id — required join column done
model_runs previously had query_id but no conversation_id,
so the database had no join path between a conversation and its model runs.
Added conversation_id TEXT column + idx_runs_conv index.
Also added domain_l0 and domain_path for hierarchical domain tracking.
Implementation rule: every method that stores a model run must receive
conversation_id as an explicit parameter — never rely on closure
capture. The _peer_review() method in Veritas silently dropped conversation IDs
for 3 months because of this.
fire_and_forget() done
Model score updates, model_runs inserts, and token counter increments
were blocking the response path (1–8 ms each, compounding). Moved off the critical
path using asyncio.create_task() via the new fire_and_forget(coro)
helper in aua/state.py. Saves 15–25 ms per query.
Score thresholds validated in production: 3 correct responses → +1 point,
2 wrong → −1 point. A threshold of 5 made score movement invisible
within a typical session.
Implementation rule: asyncio must be imported at the
top of lifespan() (or locally in each use site) — never as a
lazy mid-body import. Any startup code that runs before the lazy import will crash with
NameError. This bug manifested in 2 separate places in Veritas.
MessageCache class done
Implemented MessageCache in aua/state.py using
collections.OrderedDict + move_to_end() on every cache hit
(true LRU). A plain FIFO dict evicts the oldest-inserted conversation even when
it is the most accessed, which is the opposite of what you want. Capacity: 500 conversations.
Cache bypass rule: only serve from cache when
limit ≥ 50 (the default). A custom ?limit=1 query must
bypass the cache and hit the DB with the actual limit — otherwise pagination is
silently broken (limit=1 returns all cached messages). This bug was found
by the E2E test suite.
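The LRU behavior and the cache bypass rule can be sketched together. This is an illustration rather than the class in aua/state.py: the get_messages wrapper, its fetch_from_db callback, and populating the cache only for default-sized reads are assumptions.

```python
from collections import OrderedDict

DEFAULT_LIMIT = 50


class MessageCache:
    """LRU cache keyed by conversation ID."""

    def __init__(self, capacity=500):
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, conversation_id):
        if conversation_id not in self._items:
            return None
        self._items.move_to_end(conversation_id)  # mark as most recently used
        return self._items[conversation_id]

    def put(self, conversation_id, messages):
        self._items[conversation_id] = messages
        self._items.move_to_end(conversation_id)
        while len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used


def get_messages(cache, conversation_id, limit, fetch_from_db):
    """Serve from cache only when limit >= DEFAULT_LIMIT; otherwise hit the DB."""
    use_cache = limit >= DEFAULT_LIMIT
    if use_cache:
        cached = cache.get(conversation_id)
        if cached is not None:
            return cached
    messages = fetch_from_db(conversation_id, limit)
    if use_cache:
        cache.put(conversation_id, messages)
    return messages
```

Without the move_to_end() call in get(), the structure degrades to FIFO and a conversation that is read on every request can still be the first one evicted.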
The context backup generator previously used conversation_history from the
client request body — whatever messages the frontend happened to have loaded.
If the UI had 20 messages loaded from an 80-message conversation, the backup was written
from those 20. Fixed: backup reads the full conversation from the DB (last 60 messages,
decrypted). The get_messages() method now serves as the canonical source.
Also: max backup tokens raised 600 → 900 to accommodate the structured 6-section
template (GOAL / DECISIONS / STATUS / ACTIVE FILE / PREFERENCES / RESUME INSTRUCTION).
Priority 1 — Required for desktop / long-running sessions
Backported and shipped in v1.1.0. Verified by tests/test_v11_veritas.py plus a 40-check live-router E2E suite; CI green on Python 3.10/3.11/3.12.
In-memory inverted index (keyword → {conversation_id, ...})
backed by a sorted keyword list for O(log n) prefix matching via bisect.
Multi-word AND semantics. Average query latency: 4–10 ms (vs 1.07 ms DB fallback).
Queue-based async worker (asyncio.Queue) drains keyword extraction
off the response path in 50 ms batches.
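The index structure described above (inverted index plus a sorted keyword list searched with bisect, with multi-word AND semantics) can be sketched as follows. The KeywordIndex class and its method names are hypothetical, and the asyncio.Queue worker is omitted.

```python
import bisect


class KeywordIndex:
    """In-memory inverted index with prefix matching over a sorted keyword list."""

    def __init__(self):
        self._postings = {}  # keyword -> set of conversation IDs
        self._sorted = []    # all keywords, kept sorted for bisect

    def add(self, conversation_id, keywords):
        for keyword in keywords:
            keyword = keyword.lower()
            if keyword not in self._postings:
                self._postings[keyword] = set()
                bisect.insort(self._sorted, keyword)
            self._postings[keyword].add(conversation_id)

    def _prefix_matches(self, prefix):
        # bisect_left finds the first keyword >= prefix in O(log n);
        # every keyword sharing the prefix sits contiguously after it.
        hits = set()
        i = bisect.bisect_left(self._sorted, prefix)
        while i < len(self._sorted) and self._sorted[i].startswith(prefix):
            hits |= self._postings[self._sorted[i]]
            i += 1
        return hits

    def search(self, query):
        """Return conversations matching every word in the query (AND semantics)."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = self._prefix_matches(terms[0])
        for term in terms[1:]:
            result &= self._prefix_matches(term)
        return result
```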
Closure-scope trap (Phase 13): the _kw_worker closure
inside lifespan() used _tm.time() where _tm was
only imported later in the same body. The task crashed on its first item in every
session — search was 100% broken for 3 months. Fix: add
import time as _kw_tm inside the closure.
Startup backfill required: the async worker is killed on process
restart before flushing. At startup, scan messages for rows not yet in
message_keywords and index them. Without this, search returns empty after
every rebuild.
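The backfill scan could look roughly like this. The table names messages and message_keywords come from the text above; the column names (id, conversation_id, content, message_id, keyword) and the extract_keywords placeholder are assumptions made for illustration.

```python
import sqlite3


def extract_keywords(text):
    # Placeholder tokenizer (assumption); the real extractor is not shown here.
    return sorted({w.lower() for w in text.split() if len(w) > 3})


def backfill_keywords(conn: sqlite3.Connection) -> int:
    """Index every message that has no rows in message_keywords yet."""
    rows = conn.execute(
        """
        SELECT m.id, m.conversation_id, m.content
        FROM messages AS m
        LEFT JOIN message_keywords AS k ON k.message_id = m.id
        WHERE k.message_id IS NULL
        """
    ).fetchall()
    for message_id, conversation_id, content in rows:
        conn.executemany(
            "INSERT INTO message_keywords (message_id, conversation_id, keyword) "
            "VALUES (?, ?, ?)",
            [(message_id, conversation_id, kw) for kw in extract_keywords(content)],
        )
    conn.commit()
    return len(rows)  # number of messages that were missing from the index
```

A message that yields no keywords would be rescanned on every startup in this sketch; a real implementation would likely record a marker row or a processed flag to avoid that.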
Background job that runs every 6 hours (first run: 60 s after startup).
SQL validity check: backup valid ⇔ MAX(context_backups.created_at) > MAX(messages.created_at)
for that conversation + specialist pair. Finds all stale or missing backups and generates
them at 1/s (rate-limit pacing). Exposes GET /context/backup/coverage and
POST /context/backup/run-coverage-job.
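The validity check can be expressed as a small query loop. This sketch assumes that context_backups carries conversation_id, specialist, and created_at columns, that messages carries conversation_id and created_at, and that the list of specialists is supplied by the caller; none of these details are confirmed by this page.

```python
import sqlite3


def find_stale_pairs(conn: sqlite3.Connection, specialists):
    """Return (conversation_id, specialist) pairs whose backup is missing or stale."""
    stale = []
    conversations = conn.execute(
        "SELECT conversation_id, MAX(created_at) FROM messages "
        "GROUP BY conversation_id ORDER BY conversation_id"
    ).fetchall()
    for conversation_id, last_message_at in conversations:
        for specialist in specialists:
            (last_backup_at,) = conn.execute(
                "SELECT MAX(created_at) FROM context_backups "
                "WHERE conversation_id = ? AND specialist = ?",
                (conversation_id, specialist),
            ).fetchone()
            # Valid only when the newest backup is strictly newer than the newest message.
            if last_backup_at is None or last_backup_at <= last_message_at:
                stale.append((conversation_id, specialist))
    return stale
```

The job would then regenerate each returned pair with a pause between calls to respect the 1/s pacing.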
Method signature hygiene: when refactoring method signatures,
grep all call sites. A stale conversation_id= kwarg passed to a method
that no longer accepts it crashed every coverage job run silently.
POST /corrections/confirm-implicit
Layer 1 trigger detector catches implicit corrections (short reply + negation pattern +
semantic similarity). Instead of asking the user to re-type their intent, a modal
presents Accept / Reject buttons. POST /corrections/confirm-implicit
handles the response.
Explicit prefix rule: correction: X is a preference
statement — store it regardless of whether a prior AI turn exists.
The _handle_correction() early-return
(if not last_ai_response: return None) must be guarded by
and not explicit_prefix. Without this guard, corrections sent at the
start of a conversation are silently discarded.
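The guard can be illustrated with a simplified stand-in for _handle_correction(). The function below is hypothetical: its signature, the prefix constant, and the returned dict shape are assumptions, and only the early-return condition mirrors the rule above.

```python
EXPLICIT_PREFIX = "correction:"


def handle_correction(message, last_ai_response):
    """Return a correction record, or None when there is nothing to correct."""
    text = message.strip()
    explicit_prefix = text.lower().startswith(EXPLICIT_PREFIX)
    # Return early only when there is no prior AI turn AND no explicit prefix.
    if not last_ai_response and not explicit_prefix:
        return None
    if explicit_prefix:
        text = text[len(EXPLICIT_PREFIX):].strip()
    return {"text": text, "explicit": explicit_prefix, "in_reply_to": last_ai_response}
```

With the guard in place, "correction: use tabs" sent as the very first message of a conversation is stored, while an unprefixed "no, wrong" with no prior turn is still ignored.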
Regex pattern rule: patterns ending in punctuation or space
(e.g. actually, no, in fact,)
cannot use a trailing \b word boundary: the position between a comma and a space is not a
word boundary, so the pattern never matches. Split CORRECTION_PATTERNS into two groups: word-terminated
(use \b...\b) and punct/space-terminated (use the \b prefix only).
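A short demonstration of why the split matters, using Python's re module. The phrase lists are illustrative placeholders, not the contents of CORRECTION_PATTERNS.

```python
import re

# Illustrative phrases only (assumptions), grouped by how they end.
WORD_TERMINATED = ["that's wrong", "not what i meant"]
PUNCT_TERMINATED = ["actually, ", "no, ", "in fact, "]

# Word-terminated phrases get a boundary on both sides.
WORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, WORD_TERMINATED)) + r")\b", re.IGNORECASE
)
# Punct/space-terminated phrases get a leading boundary only.
PUNCT_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, PUNCT_TERMINATED)) + r")", re.IGNORECASE
)
# The broken form: a comma followed by a space has no word boundary between them.
BROKEN_RE = re.compile(r"\bactually,\b", re.IGNORECASE)


def is_correction(text):
    return bool(WORD_RE.search(text) or PUNCT_RE.search(text))
```

BROKEN_RE fails on "Actually, use Postgres" because both the comma and the following space are non-word characters, while PUNCT_RE matches the same input.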
Replaces the free-form backup prompt with a structured template that forces the model to capture: GOAL (objective + constraints), DECISIONS MADE (“Decided X because Y. Rejected Z because W.”), CURRENT STATUS (✅ completed / 🔄 in progress / ❓ unresolved), ACTIVE FILE / CODE CONTEXT (exact path + function + next step), USER PREFERENCES LEARNED, and RESUME INSTRUCTION (one sentence for the new window). Max tokens raised 600 → 900 to accommodate all 6 sections without truncation.
On startup: write status='running' sentinel to DB.
On clean shutdown: update to status='clean'.
On next startup: detect status='running' → previous session
crashed → send report async to GitHub Contents API.
Includes pending_error_reports table for queuing errors from crashed sessions.
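The sentinel lifecycle can be sketched with a single-row SQLite table. The table name session_status and both function names are hypothetical; report delivery and the pending_error_reports queue are omitted.

```python
import sqlite3


def mark_startup(conn: sqlite3.Connection) -> bool:
    """Write the 'running' sentinel; return True if the previous session crashed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_status "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), status TEXT NOT NULL)"
    )
    row = conn.execute("SELECT status FROM session_status WHERE id = 1").fetchone()
    # A leftover 'running' row means the last session never reached clean shutdown.
    previous_crashed = row is not None and row[0] == "running"
    conn.execute("INSERT OR REPLACE INTO session_status (id, status) VALUES (1, 'running')")
    conn.commit()
    return previous_crashed


def mark_clean_shutdown(conn: sqlite3.Connection) -> None:
    conn.execute("UPDATE session_status SET status = 'clean' WHERE id = 1")
    conn.commit()
```

The startup check reads the old value before overwriting it, so the crash signal survives exactly one restart.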
Three-level fallback: remote JSON (e.g. GitHub Pages, fetched at startup + every 24 h) → DB-cached config (last successful fetch, kept 7 days) → hardcoded fallback. Allows model IDs, pricing, and context windows to be updated without a rebuild. Critical for production: Gemini 1.5 Pro was silently deprecated with a 404 and required a full rebuild to fix without this system.
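A minimal sketch of the three-level lookup, with the fetch and cache operations injected as callables. The function name, the callables, and the shape of the fallback dict are assumptions; only the ordering and the 7-day cache window come from the description above.

```python
import time

HARDCODED_FALLBACK = {"models": {}}  # placeholder shape (assumption)
CACHE_MAX_AGE_S = 7 * 24 * 3600  # cached config is kept for 7 days


def load_model_config(fetch_remote, read_cache, write_cache, now=None):
    """Return (config, source), where source is 'remote', 'cache', or 'hardcoded'."""
    now = time.time() if now is None else now
    try:
        config = fetch_remote()  # level 1: remote JSON
        write_cache(config, now)  # remember the last successful fetch
        return config, "remote"
    except Exception:
        pass  # network errors, 404s, bad JSON: fall through to the cache
    cached = read_cache()  # level 2: (config, fetched_at) or None
    if cached is not None:
        config, fetched_at = cached
        if now - fetched_at <= CACHE_MAX_AGE_S:
            return config, "cache"
    return HARDCODED_FALLBACK, "hardcoded"  # level 3
```

Returning the source alongside the config lets callers log which level served the request, which helps when diagnosing a silently deprecated model ID.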
Priority 2 — UX quality
Backported and shipped in v1.1.0 — review_notes on RouterResponse, the analytics suite (/analytics, /reliability, /usage, /pricing), update management, and corrections CRUD with event-history evidence.
review_notes field
When the peer reviewer flags an issue, surface the findings to the client:
reviewer model name, ISSUES found, CORRECTION suggested.
Add review_notes: str | None to RouterResponse.
Parse ISSUES: / CORRECTION: structured sections from
reviewer output. Before v1.1.0 the framework discarded reviewer findings after
parsing the verdict.
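Parsing the structured sections might look like the sketch below. It assumes each section starts on its own line with an uppercase label followed by a colon; the function name and the layout of the resulting string are hypothetical.

```python
import re
from typing import Optional

# A section runs from "LABEL:" at the start of a line to the next label or end of text.
_SECTION_RE = re.compile(
    r"^([A-Z]+):[ \t]*(.*?)(?=^[A-Z]+:|\Z)", re.MULTILINE | re.DOTALL
)


def build_review_notes(reviewer_model: str, reviewer_output: str) -> Optional[str]:
    """Collect the ISSUES and CORRECTION sections into one review_notes string."""
    sections = {
        m.group(1): m.group(2).strip() for m in _SECTION_RE.finditer(reviewer_output)
    }
    issues = sections.get("ISSUES")
    correction = sections.get("CORRECTION")
    if not issues and not correction:
        return None  # nothing flagged, so review_notes stays None
    parts = [f"Reviewer: {reviewer_model}"]
    if issues:
        parts.append(f"ISSUES: {issues}")
    if correction:
        parts.append(f"CORRECTION: {correction}")
    return "\n".join(parts)
```

A body line that itself begins with an uppercase word and a colon would be read as a new section here, so a stricter label whitelist may be preferable in practice.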
GET /analytics — session stats, agreement rate, domain distribution,
correction stats, confidence distribution.
GET /reliability — per-specialist win rate and
effective_u trajectory.
GET /usage — token usage summary.
GET /pricing — per-specialist token pricing for cost estimation.
GET /version/check — check GitHub releases for newer version.
POST /update/skip — store skipped_version so update
banner doesn't re-appear. GET /update/skipped.
PATCH /corrections/{id} — update correction text.
DELETE /corrections/{id} — soft-delete (sets
scope='superseded', row stays in DB, excluded from
retrieval). GET /corrections/evidence — per-correction evidence
and application history.
Priority 3 — Production hardening
Backported and shipped in v1.1.0 — structured bug reporting, project scoping, local model management, and the dynamic domain ontology with 4-gate candidate promotion.
POST /bug-report: collects system log tail, API log tail, JS console errors,
optional conversation exchange. Submits via GitHub Contents API to a private repo.
Falls back gracefully when PAT not configured (returns 200 with error message, never 500).
project_id FK
conversations.project_id FK (schema already in V-P0.1).
GET /conversations?project_id=X filters correctly.
POST /conversations accepts project_id in body.
Sidebar filters to that project’s conversations.
“All chats” option shows conversations with project_id=NULL.
GET /local/models — list connected local models with
specialist_domain + specialist_depth fields.
GET /local/settings, POST /local/settings.
PATCH /local/specialist/{id} — tag a local model as specialist
for a domain node. Before v1.1.0 the framework had ollama_backend.py but no
discovery, management, or specialist routing API.
Dynamic domain tree with candidate promotion: domain nodes start as aliases, get
promoted when query volume and divergence thresholds are met.
GET /domain-tree — full ontology with node stats and candidate queue.
Background ontology job runs divergence test, applies promotion criteria.
v1.1-veritas P0 definition of done:
from aua.state import SQLiteStateStore, MessageCache, fire_and_forget imports cleanly.
POST /conversations, GET /conversations,
GET /conversations/{id}/messages?limit=1 (bypasses cache),
POST /projects, GET /context/backup/coverage
all return correct responses. Schema includes conversations,
messages, model_runs (with conversation_id),
token_counters, message_keywords, context_backups,
projects. fire_and_forget() schedules background tasks without
blocking the response path.
v1.1.0 definition of done — verified:
All 14 V-P1–V-P3 items live behind 28 new REST endpoints
(/search, /context/backup/run-coverage-job,
/corrections/confirm-implicit, corrections CRUD + evidence,
/analytics suite, /version/check,
/bug-report, /local/*, /domain-tree).
Session/trace/request IDs returned on every response and propagated to
specialists, hooks, audit, and logs (#15). secrets: config block
with live Vault (wire-faithful KV v2 + real hvac) and AWS Secrets Manager
(moto + real boto3) integration tests in CI (#19). plugins:,
hooks:, middleware:, state:, and
security: blocks parse with strict validation and wire at
startup; GET /extensions reports server truth (F-09–F-11).
297 tests passing, ruff/black/isort/mypy clean, CI green on
Python 3.10/3.11/3.12.
Original roadmap items #33–#73. These are real and valuable, but are post-v1. They require v1 to be stable first. Grouped by concern for clarity.
High-availability — consensus, leader election, service discovery, failover
Kafka KRaft-based coordinator for multi-node router/arbiter election without ZooKeeper dependency.
Active-standby leader election with automatic failover for router and arbiter replicas.
Distributed configuration and state via etcd. Enables multi-node deployments to share routing state.
Specialists register in Consul. Router discovers and load-balances across specialist replicas automatically.
Prevent cascade failures. Open circuit after threshold failures, half-open probe, close on recovery.
When primary specialist fails, route to healthy replica or fall back to arbiter with degraded-mode flag.
Per-specialist retry policy. Configurable max retries, base delay, jitter. Audit logged.
aua k8s generate --tier a100-cluster produces ready-to-apply manifests for all AUA components.
Production-grade Helm chart with values files for each hardware tier and deployment profile.
Backup state store (corrections, promotions, audit, sessions) to S3/GCS/local. Point-in-time restore.
Unified HA configuration block: replica counts, circuit breaker policy, leader election backend, service discovery backend.
Advanced platform features — multi-tenancy, caching, experiment tracking, batch, testing
Tenant isolation: separate correction stores, promotion logs, audit logs, rate limits, and model bindings per tenant.
Superseded. The keyword search index (AUA) and conversation search (AUA-Veritas) already let users check whether a query has been asked before. Adding Redis as an ops dependency for marginal cache-hit benefit is not warranted. Closed.
Pull model versions from HuggingFace Hub or MLflow registry directly into specialist config. Version pinning.
Log utility scores, routing decisions, arbiter verdicts, and blue-green metrics to MLflow or W&B automatically.
GREEN receives traffic but responses are not shown to users. Score and log GREEN silently. Promote when score threshold met.
Full regression suite against the eval datasets. Run on every specialist promotion attempt. Block promotion on regression.
Built-in load test command. Configurable concurrency, duration, query mix. Reports p50/p95/p99 latency and error rate.
Plugin types beyond v1: custom contradiction detector, custom assertion store, custom routing strategy, custom scoring components.
Async middleware, streaming middleware (intercepts SSE chunks), batch middleware, multi-tenant middleware primitives.
Allow fully replacing the utility function (not just weights) via plugin. Supports non-linear utility models and field-specific scoring architectures.
aua test command running the full test suite against running services (not fakes). Integration test harness with configurable query fixtures.
Automated compatibility test matrix run in CI across model formats, hardware tiers, and backends. Published as docs and referenced by doctor.
Research & advanced — VCG, distillation, SDKs, GPU cluster, auto-generated docs — 7 of 18 done (#56, #57, #58, #61, #65, #66, #67)
Persistent batch queue. Result polling. Priority lanes. Cost-optimized batching across specialists.
aua serve automatically downloads missing models from HuggingFace Hub before starting specialists.
Full Vickrey-Clarke-Groves mechanism for multi-specialist arbitration as described in the whitepaper theorems S1–S3.
Automated pipeline: collect DPO pairs → distill specialist → evaluate → blue-green candidate. End-to-end self-improvement loop.
Dynamically partition the specialist graph across available hardware based on VRAM and latency constraints.
Integrate PubMed, arXiv, SymPy as external ground truth sources for the Arbiter's empirical check (currently stubbed).
Node.js SDK with full TypeScript types. Mirrors Python API. SSE streaming support.
Go client library with streaming support. Targeting platform engineers building Go-based AI infrastructure.
Java client library. Enterprise-focused. Spring Boot integration example.
Tier template for 8×H100 cluster. NVLink, tensor parallelism, pipeline parallelism configuration.
vLLM tensor parallelism configuration for large models across multiple GPUs per specialist.
Optimized Ollama configuration for consumer MacBooks and gaming PCs (RTX 3080/4080 class).
Updated tutorial covering HA deployment, multi-tenancy, advanced plugins, and distillation pipeline.
Already provided by FastAPI. Expand with richer examples, field explanations, and code samples in all SDK languages.
Per-domain documentation pages covering field-specific utility weights, prompt templates, arbiter policies, and example queries.
Update whitepaper with v1 framework results, production metrics, and framework architecture. Update site links.
Separate repo with more complete example applications: medical assistant, legal assistant, full-stack coding assistant, research agent.
Public changelog and SemVer policy published on the framework site.
Wire ModelBackendPlugin at the per-specialist level. Each SpecialistConfig and ArbiterConfig carries an optional backend_plugin: block. When present, the router calls plugin.complete(request) instead of the built-in httpx _call() path for that specialist only — zero change for specialists without a plugin. Design: backend_plugin field on SpecialistConfig/ArbiterConfig; _call() gains backend_plugin=None kwarg; plugin receives OpenAI-compat request dict, returns {response: str, confidence: float}. Enables: custom inference stacks, multi-model ensembles per specialist, math specialist backed by 3 models + arXiv cross-check, domain-specific retrieval augmentation, or any external inference API.
Wire StateStorePlugin so the SQLite default can be replaced via aua_config.yaml. Requires restructuring Router.__init__ to load the state_store plugin before constructing BatchQueue, ShadowStore, DomainTree, OntologyJob, and RemoteModelConfig (all of which currently receive self._state_store at construction time). Two candidate designs: (a) two-phase init — load state_store plugin first, pass to dependents; (b) late-bind factory — dependents accept a callable that returns the store on first access. Enables Postgres, Redis, or cloud-native persistence backends as drop-in community plugins without any source modifications.
Wire ModelBackendPlugin at per-specialist level. Each SpecialistConfig carries an optional backend_plugin block. Router dispatches to plugin.complete() instead of built-in httpx/_call() path for that specialist only. Enables custom inference stacks, ensembles, and domain-specific pipelines (e.g. math specialist backed by 3 models + arXiv cross-check). Design: backend_plugin field on SpecialistConfig + ArbiterConfig; _call() gains backend_plugin=None kwarg; plugin receives OpenAI-compat request dict, returns {response: str, confidence: float}.
Wire StateStorePlugin so the SQLite backend can be replaced via YAML config. Requires restructuring Router.__init__ to load plugins before constructing BatchQueue, ShadowStore, DomainTree, OntologyJob, and RemoteModelConfig (all currently receive self._state_store at construction time). Design options: (a) two-phase init — load state_store plugin first, then construct dependents; (b) late-bind — dependents accept a factory callable. Enables Postgres, Redis, or cloud-native persistence backends as community plugins.
| Milestone | Items | Theme | Gate |
|---|---|---|---|
| v0.6-alpha | P-01 – P-12 (12 items) | Production-harden #01–#10 | Fresh-clone validation + CI passes |
| v0.7-beta | #11–#13, #05A–#05D, #14 (8 items) | Docker + Django config foundation | aua init --preset coding && docker-compose up |
| v0.8-framework-beta | F-01 – F-17 (17 items) | Architecture, contracts, plugins, hooks, middleware | Custom plugin test passes; no AUA source edits required |
| v0.9-rc1 | #15–#22, #26, #23, #28, #24–#25, #27, #29–#32 (17 items) | Security + observability | Token + certs + Prometheus + Grafana all working |
| v0.9-rc2 | E-01–E-03, U-01–U-02 (5 items) | Evaluation + chat UI | aua serve --with-ui + debugger panel working end-to-end |
| v1.0.0 — Shipped | D-01–D-04 (4 items) | Release readiness | All v1.0 definition-of-done commands pass |
| v1.1.0 — Shipped | V-P0–V-P3 (49 items) + #15, #19, F-09–F-11 completion | AUA-Veritas production backport — persistence, search, corrections, backup, analytics, ontology — plus end-to-end session IDs, live secrets-provider tests, and YAML-wired plugins/hooks/middleware | All 4 priorities shipped. 297 tests passing; every documented endpoint verified against a live router. |
| v2.0+ | #33–#73 (41 items) | HA, multi-tenancy, SDKs, research, cluster | Post-v1 — requires stable v1 first |
Key ordering rationale: Config strictness (P-06) before Docker (#11) because Docker needs correct config. Model registry (#05C) before presets (#05B) because presets reference aliases. Field registry (#05D) before config commands (#05A) because explain needs field knowledge. Secrets (#19) before auth (#17) because auth tokens are secrets. OTEL (#23) before Prometheus (#24) because Prometheus can export from OTEL. State store (F-05) before Chat Session API (U-01) because sessions need storage. Plugin interfaces (F-07) before import system (F-09) because the importer validates against the interfaces. Error taxonomy (F-03) before security (#17–#22) because auth errors need stable codes. Architecture spec (F-01) before all framework items because it defines component boundaries that prevent drift.