This page records the v1.0 implementation contract, validation criteria, and future roadmap. Items marked done are included in v1.0.0. Items marked wip or future are deferred to v1.1+ or v2.0. The goal: Django for adaptive multi-model LLM systems — batteries included, deeply configurable, extensible without editing AUA internals.
✓ Status: v1.0.0 shipped.
All items marked done are included in v1.0.0 and validated. See docs/v1_validation_report.md for the full validation record: 132 tests, 23 REST endpoints, 8 plugin protocols, 15 Prometheus metrics, CLI reference, install transcript, Docker Compose, Chat UI, security, and observability.
Reading guide. Items are ordered so each step unlocks the next with minimal rework. Original roadmap numbers (#01–#73) are preserved where they exist; new items introduced by the planning documents use prefixed IDs (P-, F-, E-, U-, D-). Items already implemented (v0.5 POC) are marked done. Items needing production hardening are marked wip.
Milestones
Items #01–#10 are production-complete in v1.0.0. This milestone records the hardening work completed: quality, contracts, correctness, and packaging. All items are shipped.
Repo hygiene & packaging
Must happen first — CI, installs, and all other work depend on this being clean.
Run black aua tests and isort aua tests across the entire repo. Add [tool.black], [tool.isort], [tool.ruff], and [tool.mypy] sections to pyproject.toml. line-length=100, target-version=py310. This is a prerequisite for CI.
Package name adaptive-utility-agent. Optional extras: pip install aua[vllm], aua[dev], aua[ui], aua[postgres], aua[otel]. vllm must be optional — it is not Mac-friendly. Include aua/tiers/*.yaml in wheel data. Add Makefile with install, test, lint, format, typecheck, build targets.
Create aua/version.py containing __version__ = "1.0.0". All other references import from here. Create aua/py.typed marker file. Add version consistency test: importlib.metadata.version("adaptive-utility-agent") == aua.__version__.
Matrix: Python 3.10, 3.11, 3.12. Steps: pip install -e ".[dev]" → ruff → black check → mypy → pytest. Must pass without GPUs using fake specialists. Also add wheel-build validation step.
Public API alignment & config strictness
Config correctness unblocks Docker, tiers, presets, and all downstream work.
Export exact roadmap API: Arbiter = ArbiterAgent (alias), BlueGreenDeployment, CorrectionLoop. Expose all endpoint models from aua.endpoints. Stable __all__. Add import smoke tests in tests/test_imports.py. Add docs/public_api.md separating stable from internal APIs.
Remove hardcoded localhost from endpoint construction. Add host, scheme, endpoint_path, endpoint_override fields to SpecialistConfig and ArbiterConfig. Add strict unknown-key validation (catches typos). Add duplicate port validation. Add threshold range validation. Add RuntimeConfig (.aua/logs, .aua/pids, .aua/state, .aua/checkpoints). Add APIConfig with cors_origins.
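A minimal sketch of the strict validation described above, using only the standard library. The `confidence_threshold` field, the default values, and the `load_specialists` helper are illustrative assumptions, not AUA's actual schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SpecialistConfig:
    # host, scheme, endpoint_path, endpoint_override come from the item above;
    # the defaults and confidence_threshold are assumptions for illustration.
    name: str
    port: int
    host: str = "127.0.0.1"
    scheme: str = "http"
    endpoint_path: str = "/v1"
    endpoint_override: Optional[str] = None
    confidence_threshold: float = 0.5

    @property
    def endpoint(self) -> str:
        # No hardcoded localhost: the URL is built from configured parts.
        if self.endpoint_override:
            return self.endpoint_override
        return f"{self.scheme}://{self.host}:{self.port}{self.endpoint_path}"


def load_specialists(raw: list) -> list:
    """Build configs, rejecting unknown keys, duplicate ports, bad thresholds."""
    allowed = {f.name for f in fields(SpecialistConfig)}
    seen_ports = {}
    configs = []
    for entry in raw:
        unknown = set(entry) - allowed
        if unknown:  # catches typos such as "prot" for "port"
            raise ValueError(f"unknown config key(s): {sorted(unknown)}")
        cfg = SpecialistConfig(**entry)
        if not 0.0 <= cfg.confidence_threshold <= 1.0:
            raise ValueError(f"{cfg.name}: threshold out of range [0, 1]")
        if cfg.port in seen_ports:
            raise ValueError(
                f"duplicate port {cfg.port}: {seen_ports[cfg.port]} and {cfg.name}"
            )
        seen_ports[cfg.port] = cfg.name
        configs.append(cfg)
    return configs
```

Failing at load time keeps a typo from silently falling back to a default.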
Canonical tier names: macbook, single-4090, quad-4090, a100-cluster. Backward-compatible aliases: rtx4090 → single-4090, a100 → a100-cluster. Add quad-4090.yaml template. Every tier template must load in CI. Annotate generated YAML with inline comments.
Runtime hardening — serve, router, rollback, CLI, tests
Production lifecycle, API contracts, and test suite.
SIGINT/SIGTERM handler that terminates child processes (TERM → 15s grace → KILL). Write .aua/pids/ and .aua/logs/ on startup. Async readiness polling per specialist. Port-conflict detection before start with --reuse-running flag. Deterministic --dry-run output (exit 0). Document foreground-only mode explicitly in help.
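The TERM, grace period, KILL sequence could be sketched as follows. This is an assumption-laden illustration: `terminate_children` and `install_handlers` are hypothetical helper names, and the real `aua serve` may organise its process handling differently.

```python
import signal
import subprocess
import sys
import time


def terminate_children(procs, grace_seconds=15.0):
    """Send TERM to every live child, wait up to the grace period, then KILL."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()  # SIGTERM on POSIX
    deadline = time.monotonic() + grace_seconds
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()


def install_handlers(procs):
    """Route SIGINT and SIGTERM to the shutdown sequence (main thread only)."""
    def handler(signum, frame):
        terminate_children(procs)
        sys.exit(0)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

The grace period is shared across all children, so shutdown takes at most about 15 seconds in total rather than 15 seconds per specialist.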
Move CORS from wildcard to APIConfig.cors_origins. Add session_id (auto-generated UUID if not supplied) to every request/response. Add ErrorResponse model with stable error codes. Distinguish 503 (unreachable) from 504 (timeout). Add GET /version. Redact secrets from GET /config. Make POST /deploy/green honest (return dry_run_only until harness exists). Preserve batch result order with per-item index and ok fields.
Move promotions log to .aua/state/promotions.jsonl with UUID per record. File locking via filelock to prevent concurrent rollback+deploy races. Atomic config writes: write to .tmp then os.replace(). Add --dry-run to rollback. Add POST /deploy/rollback REST endpoint or document CLI-only clearly.
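A small sketch of the atomic-write and UUID-per-record ideas, assuming hypothetical helper names `atomic_write_text` and `append_promotion`. The `filelock` locking mentioned above is a third-party dependency and is left out to keep the example to the standard library.

```python
import json
import os
import uuid
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a sibling .tmp file, then swap it into place with os.replace()."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the swap
    os.replace(tmp, path)  # readers see either the old file or the new one


def append_promotion(log_path: Path, record: dict) -> str:
    """Append one JSON line with a fresh UUID and return that UUID."""
    record_id = str(uuid.uuid4())
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"id": record_id, **record}) + "\n")
    return record_id
```

Because the temporary file lives next to the target, `os.replace()` stays on one filesystem, which is what makes the rename atomic.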
All CLI commands: correct exit codes (0=pass, 1=fail, 2=warn-in-strict). Add --json to aua doctor and aua status. Add --strict to aua doctor. Add --once, --url, --refresh to aua status. Stream: add SSE named event fields (event: chunk), heartbeat comments (: keep-alive every 15s), client disconnect handling, and a robust data: [DONE] parser. Add a Content-Encoding: none header on SSE routes so that gzip middleware leaves the stream uncompressed.
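On the client side, a parser that tolerates heartbeat comments, named events, and the `[DONE]` sentinel might look like the sketch below. `parse_sse` is a hypothetical helper, not part of the AUA API, and it handles only the `event` and `data` fields.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from SSE lines, stopping at data: [DONE]."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment line, e.g. the ": keep-alive" heartbeat
        if line == "":
            # A blank line dispatches the buffered event.
            if data:
                payload = "\n".join(data)
                if payload.strip() == "[DONE]":
                    return
                yield event, payload
            event, data = "message", []
            continue
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if name == "event":
            event = value
        elif name == "data":
            data.append(value)
```

An event that is still buffered when the stream ends is dropped, which matches how incomplete events are treated by browsers.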
Create tests/fakes/openai_server.py: FastAPI fake with GET /v1/models, POST /v1/chat/completions (buffered + streaming). Create fixture configs: minimal, rtx4090, macbook, invalid_duplicate_ports, invalid_unknown_key, invalid_threshold. Test files: test_imports, test_config, test_cli_init, test_cli_doctor, test_router_api, test_streaming, test_status, test_rollback. CI must pass without GPUs.
v0.6-alpha definition of done: pip install -e ".[dev]" && python -c "from aua import Router, Arbiter, UtilityScorer, BlueGreenDeployment, CorrectionLoop; print('ok')" && aua init --tier macbook --force && aua doctor --strict && aua serve --dry-run && pytest -q && ruff check aua tests && black --check aua tests && mypy aua all pass from a fresh clone.
Packaging is clean. Now add the deployment layer (#11–#13) and the Django-like configuration system (presets, model registry, field registry, aua config command group). These are ordered so the registry items come before the commands that reference them.
Docker + hardware tier templates
Run everything in one command. Depends on clean pyproject.toml (P-02) and correct tier names (P-07).
docker-compose up starts router, specialists, arbiter. docker-compose --profile gpu up for GPU variant. Profiles: cpu (Ollama/local), gpu (vLLM), observability (Prometheus + Grafana optional), secure. Health checks on all containers. Volume mounts for .aua/ state. Environment file support. Structured logs to stdout.
Canonical tiers: macbook (Ollama), single-4090 (RTX 4090, vLLM AWQ), quad-4090 (4× RTX 4090, multi-GPU), a100-cluster (A100 80GB, fp16). Each specifies specialists, arbiter, GPU assignment, memory split, ports, promotion thresholds, default models, observability defaults, security defaults. Annotated with inline YAML comments.
Django config layer — registries, presets, config commands
Registries before presets. Presets before config commands. Config commands before tutorial.
Built-in aliases: qwen-coder-7b-awq, qwen-math-7b-awq, qwen-14b-awq, llama3-8b, etc. Each entry: provider, full model ID, backend, quantization, recommended VRAM. User-defined registry entries in YAML. Hardware compatibility checks (AWQ + Ollama → error). CLI: aua models list, aua models inspect <alias>. Compact config uses aliases: model: qwen-coder-7b-awq.
Built-in fields: software_engineering, mathematics, general, research, medicine, law, finance. Each has aliases (swe, coding), description, default utility weights, default confidence threshold, default arbiter policy. User-defined custom fields. CLI: aua fields list, aua fields inspect <field>.
Built-in presets: coding, math, research, generalist, medical-safe, legal-safe, local-ollama. Each preset selects fields, default models (via registry aliases), routing thresholds, utility weights, arbiter policy. Presets live in aua/defaults/presets/. Compact config: preset: coding expands to full config. CLI: aua presets list, aua presets inspect <name>. aua init --preset coding --tier single-4090 works end-to-end.
aua config validate (strict schema check), aua config expand (compact → full YAML printed to stdout), aua config show (current loaded config, secrets redacted), aua config diff config_a.yaml config_b.yaml, aua config explain routing.fanout_threshold (human-readable explanation with default, range, higher/lower meaning). JSON output option on all commands. Config schema registry with explanation strings for every key.
Hot reload + tutorial rewrite
Tutorial written last so it reflects the final config system and presets.
SIGHUP triggers reload. aua config reload CLI command. Hot-reloadable without restart: routing thresholds, utility weights, promotion thresholds, logging level, rate limits, CORS origins, webhook config. Partial restart required for: new specialist, changed model, changed backend, changed mTLS certs. Reload is atomic — validate new config before applying.
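One way to picture "validate before applying" is a holder that swaps the config reference only after validation succeeds. The section names in `HOT_RELOADABLE` and the `ConfigHolder` class are assumptions that mirror the list above, not the actual AUA config keys.

```python
# Top-level section names are illustrative assumptions.
HOT_RELOADABLE = frozenset({
    "routing", "utility_weights", "promotion", "logging",
    "rate_limits", "cors_origins", "webhooks",
})


class ConfigHolder:
    def __init__(self, config, validate):
        self.config = config
        self._validate = validate  # callable that raises on an invalid config

    def reload(self, candidate):
        """Validate, then swap. Return changed sections that need a restart."""
        self._validate(candidate)  # raises before any state changes
        changed = {
            key
            for key in self.config.keys() | candidate.keys()
            if self.config.get(key) != candidate.get(key)
        }
        self.config = candidate  # single reference swap
        return sorted(changed - HOT_RELOADABLE)
```

If validation raises, `self.config` still points at the previous config, so a bad file never takes effect partially.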
Replaces tutorial.html. Do not start with theory. Part 1: quickstart in <10 min. Parts 2–12 progressively teach: models & fields, routing & utility, arbiter & corrections, blue-green, plugins, hooks/middleware, security, observability, Docker deployment, expert deployment (full custom config without editing AUA internals). Tutorial verifies commands from each section actually run.
v0.7-beta definition of done: docker-compose up works. aua init --preset coding --tier single-4090 && aua config validate && aua config expand && aua models list && aua fields list && aua presets list && aua serve --dry-run all pass. Tutorial Part 1 tested from scratch.
The configuration layer exists. Now add the extensibility layer: architecture spec, contracts, error taxonomy, state abstraction, plugin system, hooks, middleware. These are ordered so lower-level items (architecture, contracts, errors, state) come before the systems that depend on them (plugins, hooks, middleware).
Framework discipline — architecture, contracts, errors, state, config versioning
These define the contracts everything else implements. Must come before plugins and hooks.
Defines: runtime architecture, full request lifecycle (Input → Middleware → Session → Correction Retrieval → Classifier → Routing → Specialist Calls → Utility Scoring → Arbiter → Hooks → Correction Logging → Response → Metrics/Logs/Traces/Audit), component boundaries, component ownership, plugin loading lifecycle, hook/middleware execution order, observability flow, security boundary. This doc prevents implementation drift.
Define formal Python protocols for all extension points: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin, ModelBackendPlugin, HookPlugin, AUAMiddleware. Create docs/public_api.md. Guarantee: v1.x will not break public import paths or plugin method signatures. Deprecated APIs get one minor release of warning.
Define stable error code list: AUA_CONFIG_INVALID, AUA_BACKEND_UNREACHABLE, AUA_SPECIALIST_TIMEOUT, AUA_ARBITER_TIMEOUT, AUA_PLUGIN_LOAD_FAILED, AUA_PLUGIN_CONTRACT_INVALID, AUA_HOOK_FAILED, AUA_MIDDLEWARE_FAILED, AUA_AUTH_REQUIRED, AUA_FORBIDDEN, AUA_RATE_LIMITED, AUA_PROMOTION_REJECTED, AUA_ROLLBACK_FAILED, AUA_STATE_STORE_UNAVAILABLE, etc. Map each code to HTTP status and CLI exit code. REST error format: {"error": "AUA_*", "message": "...", "trace_id": "...", "details": {}}. Create AUA_Framework_v1_Error_Codes.md.
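The REST error format above can be produced by a small helper. In this sketch the 503, 504, and 429 statuses follow other items on this page; the 400, 401, and 403 mappings are assumptions, and `error_response` is a hypothetical name.

```python
# Partial mapping for illustration; 400/401/403 are assumed, not specified above.
HTTP_STATUS = {
    "AUA_CONFIG_INVALID": 400,
    "AUA_AUTH_REQUIRED": 401,
    "AUA_FORBIDDEN": 403,
    "AUA_RATE_LIMITED": 429,
    "AUA_BACKEND_UNREACHABLE": 503,
    "AUA_SPECIALIST_TIMEOUT": 504,
    "AUA_ARBITER_TIMEOUT": 504,
}


def error_response(code, message, trace_id, details=None):
    """Return (http_status, body) in the documented REST error format."""
    if code not in HTTP_STATUS:
        raise KeyError(f"unregistered error code: {code}")
    body = {
        "error": code,
        "message": message,
        "trace_id": trace_id,
        "details": details or {},
    }
    return HTTP_STATUS[code], body
```

Rejecting unregistered codes keeps ad hoc error strings from leaking into a contract that clients are expected to match on.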
Define scopes: aua:query, aua:stream, aua:status, aua:config:read, aua:config:write, aua:corrections:read, aua:corrections:write, aua:deploy, aua:rollback, aua:extensions:read, aua:extensions:write, aua:tokens:read, aua:tokens:write, aua:admin. Map every endpoint to required scope. Extension/runtime import endpoints require aua:extensions:write or aua:admin. Create scope matrix table in docs.
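A scope matrix lends itself to a lookup table plus a single check. The sketch below is illustrative only: the three endpoint-to-scope rows and the rule that `aua:admin` satisfies every requirement are assumptions (the text above states the admin rule only for extension endpoints).

```python
# Illustrative rows; the real matrix maps every endpoint.
REQUIRED_SCOPE = {
    ("GET", "/config"): "aua:config:read",
    ("POST", "/deploy/rollback"): "aua:rollback",
    ("POST", "/extensions/reload"): "aua:extensions:write",
}


def is_allowed(method, path, token_scopes):
    """Return True if the token may call the endpoint."""
    required = REQUIRED_SCOPE.get((method, path))
    if required is None:
        return False  # deny endpoints that are missing from the matrix
    return required in token_scopes or "aua:admin" in token_scopes
```

Denying unmapped endpoints by default means a newly added route is unreachable until someone assigns it a scope.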
Define state store interface. State categories: chat sessions, corrections, promotion logs, audit logs, DPO pairs, config snapshots, token metadata, cost records, eval runs, routing traces. Implementations: local files (default), SQLite (local dev), Postgres (production). Config: state: {backend: sqlite, path: .aua/state/aua.db}. Migration support. Doctor checks. Removes the scattered .jsonl files in favor of structured storage.
Add config_version: "1.0" field to YAML. CLI: aua config check-version, aua config migrate --from 0.6 --to 1.0, aua config migrate --dry-run. Clear error for unsupported config versions. Migration tests against old config fixtures. Required before v1.0 to support users upgrading from v0.x.
Extension system — plugins, backends, hooks, middleware, extension CLI
Ordered: plugin interfaces → backend plugins → import system → hooks → middleware → runtime API → extension CLI.
Define base protocols in aua/plugins/: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin. Each has: required method signatures, type hints, docstrings, example implementations, test harness. Version compatibility guarantees documented.
ModelBackendPlugin protocol: complete(request) -> dict, stream(request) -> AsyncIterator, health() -> dict. Built-in backends: vLLM (existing), Ollama (existing), OpenAI-compatible generic. YAML registration: backends: {my_gateway: {plugin: my_company.backends:GatewayBackend, base_url: ..., auth_secret: ...}}. Doctor checks backend health for all registered backends.
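The three-method contract can be expressed with `typing.Protocol`. This is a sketch under assumptions: the methods are shown as async, the request is a plain dict, and `EchoBackend` is a made-up fake used only to exercise the shape.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Protocol, runtime_checkable


@runtime_checkable
class ModelBackendPlugin(Protocol):
    async def complete(self, request: Dict[str, Any]) -> Dict[str, Any]: ...
    def stream(self, request: Dict[str, Any]) -> AsyncIterator[str]: ...
    async def health(self) -> Dict[str, Any]: ...


class EchoBackend:
    """Fake backend: no inheritance needed, it only has to match the shape."""

    async def complete(self, request):
        return {"text": request["prompt"].upper()}

    async def stream(self, request):
        for word in request["prompt"].split():
            yield word

    async def health(self):
        return {"status": "ok"}


async def collect(backend: ModelBackendPlugin, prompt: str) -> list:
    """Drain a backend's stream into a list of chunks."""
    return [chunk async for chunk in backend.stream({"prompt": prompt})]
```

Note that `isinstance` against a runtime-checkable protocol only confirms the methods exist; it does not check signatures, so a loader would still need its own contract validation.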
YAML syntax: import_path: plugins.custom_utility:RiskWeightedUtilityScorer. Config injection: config: {risk_weight: 0.7} passed to constructor. Strict interface validation on load (not on import). Clear error messages on mismatch. Doctor checks all import paths. Reload support where safe. Allowlisted paths only in production mode. Users never edit AUA source files.
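Resolving a `module:Class` path with config injection could look roughly like this. `load_plugin` and its `allowlist` and `required_methods` parameters are hypothetical; the method check here is a shallow stand-in for the strict interface validation described above.

```python
import importlib


def load_plugin(import_path, config=None, allowlist=None, required_methods=()):
    """Import 'module:Name', construct it with config, and check its methods."""
    module_name, sep, attr = import_path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Name', got {import_path!r}")
    if allowlist is not None and not any(
        module_name == prefix or module_name.startswith(prefix + ".")
        for prefix in allowlist
    ):
        raise PermissionError(f"{module_name!r} is not on the allowlist")
    module = importlib.import_module(module_name)
    cls = getattr(module, attr)
    instance = cls(**(config or {}))  # config block passed to the constructor
    for method in required_methods:
        if not callable(getattr(instance, method, None)):
            raise TypeError(f"{import_path} is missing required method {method!r}")
    return instance
```

For example, `load_plugin("fractions:Fraction", {"numerator": 1, "denominator": 2})` builds a standard-library object through the same path a plugin would take. The allowlist check runs before the import, so a disallowed module is never executed.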
Hooks: pre_query, post_query, pre_route, post_route, pre_specialist_call, post_specialist_call, pre_arbiter, post_arbiter, pre_response, post_response, on_correction, on_promotion, on_rollback. Hook interface: async def __call__(self, event: dict) -> dict. Configurable ordering. Error handling policy (fail-open vs fail-closed). Timeout policy per hook. Audit logged. Metrics tracked. YAML registration under hooks:.
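The interaction between the per-hook timeout and the fail-open versus fail-closed policy can be sketched as below. `run_hooks` is a hypothetical runner; audit logging and metrics are omitted.

```python
import asyncio


async def run_hooks(hooks, event, timeout=1.0, fail_open=True):
    """Run hooks in order, threading the event dict through each one."""
    for hook in hooks:
        try:
            event = await asyncio.wait_for(hook(event), timeout=timeout)
        except Exception:
            # Covers hook errors and timeouts. Fail-open skips the hook and
            # keeps the last good event; fail-closed aborts the request.
            if not fail_open:
                raise
    return event


async def tag(event):
    return {**event, "tagged": True}


async def broken(event):
    raise RuntimeError("boom")
```

With `fail_open=True`, `asyncio.run(run_hooks([tag, broken], {"q": 1}))` returns the event as `tag` left it; with `fail_open=False` the `RuntimeError` propagates to the caller.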
Interface: before_query(request) -> request, after_response(response) -> response. Ordered execution (list order in YAML). Short-circuit support (return early without calling downstream). Error behavior: configurable (skip, fail). Metrics per middleware. Built-in examples: PIIRedactionMiddleware, TenantPolicyMiddleware, AuditMiddleware. YAML: middleware: [plugins.middleware:PIIRedactionMiddleware].
GET /extensions, GET /extensions/{name}, POST /extensions/test, POST /extensions/reload. Disabled by default in production. Requires aua:extensions:write scope. Only loads allowlisted paths. Audit logged. Warns loudly if bound to 0.0.0.0. Development-only: POST /extensions/import.
aua extensions list, aua extensions inspect <name>, aua extensions test --kind utility_scorer --import-path plugins.custom_utility:RiskWeightedUtilityScorer, aua extensions reload, aua plugins doctor. Also: management command system stub: aua run <command> dispatching to import_path registered in YAML.
Framework defaults — prompt templates, safety policy, defaults registry
Prompt templates and safety policy before presets reference them.
Built-in templates in aua/templates/prompts/: classifier_v1.txt, arbiter_balanced_v1.txt, arbiter_conservative_v1.txt, correction_injection_v1.txt, abstention_v1.txt, medical_safe_v1.txt, legal_safe_v1.txt. Config: prompts: {arbiter_template: arbiter_conservative_v1}. Advanced: arbiter_template_path: ./prompts/custom.txt. Template registry with version tracking. Prompts are framework behavior, not implementation details.
Config: safety: {abstention_enabled: true, high_risk_fields: [medicine, law, finance], require_arbiter_for_high_risk: true, min_confidence_for_direct_answer: 0.90}. Behavior: low confidence → abstain or clarify. High-risk field → force arbiter. Contradiction detected → do not answer confidently. Required by medical-safe and legal-safe presets. Abstention response model in endpoints.py.
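The three behaviors can be read as a small decision function. The sketch below uses the config keys from the item above; the precedence (contradiction, then low confidence, then high-risk field) and the `decide` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyPolicy:
    abstention_enabled: bool = True
    high_risk_fields: tuple = ("medicine", "law", "finance")
    require_arbiter_for_high_risk: bool = True
    min_confidence_for_direct_answer: float = 0.90


def decide(policy, field_name, confidence, contradiction=False):
    """Return 'abstain', 'arbiter', or 'direct' for one candidate answer."""
    # Never answer confidently on a contradiction or on low confidence.
    if contradiction or confidence < policy.min_confidence_for_direct_answer:
        return "abstain" if policy.abstention_enabled else "arbiter"
    # Confident answers in high-risk fields still go through the arbiter.
    if policy.require_arbiter_for_high_risk and field_name in policy.high_risk_fields:
        return "arbiter"
    return "direct"
```

When abstention is disabled, this sketch escalates to the arbiter instead of answering directly, which keeps the "do not answer confidently" rule intact.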
aua/defaults/: presets/, models.yaml, fields.yaml, utility.yaml, routing.yaml, security.yaml, prompts.yaml. CLI: aua defaults show, aua defaults show preset coding, aua defaults show models. Makes batteries-included defaults inspectable and avoids hidden magic. Underpins aua config expand.
Required examples: quickstart_coding/, custom_utility_plugin/, custom_field_classifier/, custom_backend/, blue_green_demo/, secure_docker_deploy/. Each: aua_config.yaml, README.md, sample curl commands, expected output, troubleshooting. Examples validate the plugin system, docs, and API simultaneously. Users copy these before reading advanced docs.
v0.8-framework-beta definition of done: aua extensions test --kind utility_scorer --import-path plugins.example:ExampleUtilityScorer passes. aua defaults show and aua models list work. Custom middleware and hook examples in examples/ run. Architecture spec exists and is accurate.
Security items ordered so the foundation (session IDs, scopes) comes before the features that use them (audit log, rate limiting, mTLS). Observability ordered so the pipeline (OTEL) precedes the outputs (Prometheus, Grafana, Datadog).
Security foundation — identity, auth, audit, rate limiting
Session IDs first (everything else references them). Auth before audit. Secrets before mTLS certs.
Every query gets session_id, trace_id, request_id. Propagated through: router, field classifier, specialist calls, arbiter, correction loop, response, logs, metrics, audit log. Returned in every API response. Generated UUID if not supplied by client. Context available to all hooks and middleware.
Config references secret names, not values. Supported: environment variables, local encrypted file, HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager. YAML: secrets: {provider: env}, api_key_secret: OPENAI_API_KEY. Doctor checks secret resolution. Redaction in GET /config and logs.
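For the `env` provider, name-based resolution and redaction might be sketched as follows. `resolve_secret` and `redact` are hypothetical helpers, and redacting by resolved value is one possible strategy, not necessarily the one AUA uses.

```python
import os


def resolve_secret(name, environ=None):
    """Look up a secret by name; the config stores only the name."""
    env = os.environ if environ is None else environ
    if name not in env:
        raise KeyError(f"secret {name!r} is not set")
    return env[name]


def redact(obj, secret_values, mask="***"):
    """Return a copy of obj with every resolved secret value masked."""
    if isinstance(obj, dict):
        return {key: redact(value, secret_values, mask) for key, value in obj.items()}
    if isinstance(obj, list):
        return [redact(value, secret_values, mask) for value in obj]
    if isinstance(obj, str):
        for secret in secret_values:
            if secret:
                obj = obj.replace(secret, mask)
    return obj
```

Because the YAML holds `api_key_secret: OPENAI_API_KEY` rather than the key itself, the name passes through redaction untouched while any resolved value that leaked into a string is masked.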
External endpoints require auth. JWT or signed token support. Scope validation per endpoint (using permission matrix from F-04). Token ID in logs and audit. Admin endpoints strongly protected. Extension endpoints require aua:extensions:write. Local dev mode can opt out of auth with explicit warning.
aua token create --scope aua:query --expires 30d, aua token list, aua token revoke <token-id>, aua token inspect <token-id>. Token metadata stored in state store (F-05). Revocation list checked on every request.
Events: query, routing decision, arbiter verdict, correction injection, config reload, token create/revoke, promotion, rollback, extension load, hook failure, auth failure. Required fields: timestamp, session_id, trace_id, token_id, event_type, field, specialist, utility_score, confidence, latency_ms, hash_chain_previous, hash_chain_current. Append-only to state store. Hash chain provides tamper evidence.
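The tamper-evidence property of the hash chain can be shown with SHA-256 from the standard library. This sketch keeps the chain in a list for simplicity; the canonical JSON encoding and the all-zero genesis value are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed value of hash_chain_previous for the first record


def _digest(body):
    """Hash a record body using a canonical (sorted, compact) JSON encoding."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(chain, event):
    """Append an event, linking it to the hash of the previous record."""
    previous = chain[-1]["hash_chain_current"] if chain else GENESIS
    body = {**event, "hash_chain_previous": previous}
    record = {**body, "hash_chain_current": _digest(body)}
    chain.append(record)
    return record


def verify(chain):
    """Return True if every record links to its predecessor and hashes correctly."""
    previous = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash_chain_current"}
        if body.get("hash_chain_previous") != previous:
            return False
        if _digest(body) != record["hash_chain_current"]:
            return False
        previous = record["hash_chain_current"]
    return True
```

Editing any earlier record changes its hash, which breaks the link stored in every later record, so `verify` fails from that point on.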
Router ↔ specialists, router ↔ arbiter, arbiter ↔ correction loop. Config: security: {mtls: {enabled: true, cert_dir: .aua/certs, auto_generate: true}}. CLI: aua certs generate, aua certs rotate. Doctor check: aua doctor --check-certs. Auto-generated dev certs with warning. Production: bring your own CA.
Configurable per token/scope. Config: rate_limits: {aua:query: {requests_per_minute: 60}, aua:admin: {requests_per_minute: 10}}. Behavior options: reject, queue, warn, alert. Returns 429 with Retry-After header. Metrics tracked per scope. Audit logged when limits hit.
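A fixed-window limiter is one simple way to get the `reject` behavior with a `Retry-After` header. This is a sketch, not the AUA implementation: the window algorithm, the `RateLimiter` class, and the injectable clock are assumptions.

```python
import math
import time


class RateLimiter:
    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits    # scope -> requests_per_minute
        self.clock = clock      # injectable so tests can control time
        self.windows = {}       # (token_id, scope) -> (window_start, count)

    def check(self, token_id, scope):
        """Return (http_status, headers) for one request."""
        limit = self.limits.get(scope)
        if limit is None:
            return 200, {}  # no limit configured for this scope
        now = self.clock()
        key = (token_id, scope)
        start, count = self.windows.get(key, (now, 0))
        if now - start >= 60.0:
            start, count = now, 0  # a new one-minute window begins
        if count >= limit:
            retry_after = max(1, math.ceil(60.0 - (now - start)))
            return 429, {"Retry-After": str(retry_after)}
        self.windows[key] = (start, count + 1)
        return 200, {}
```

A fixed window allows short bursts at window boundaries; a token bucket would smooth those out at the cost of slightly more state.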
CORS configurable via api.cors_origins (already in P-06/P-09). External endpoints explicit. Internal specialist ports not exposed externally by default. Admin endpoints protected. Extension endpoints protected. Unsafe settings (wildcard CORS + 0.0.0.0 + no auth) produce loud doctor warnings.
Encrypt at rest: correction payloads, assertion store entries, DPO pairs, plugin secrets, token metadata, sensitive audit fields. AES-256-GCM. Key managed per deployment. mTLS (#16) covers in-transit encryption. Config: security: {encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}}.
Observability — OTEL, Prometheus, Grafana, logging, alerting, cost
OTEL pipeline first (Prometheus and Datadog export from it). Alerting after metrics. Cost after metrics.
JSON log emission. Fields: timestamp, level, message, session_id, trace_id, request_id, token_id, field, specialist, utility_score, confidence, contradiction, verdict, latency_ms, plugin, hook, middleware. Apply LoggingConfig from config at startup. ELK/Splunk compatible.
Instrument: query, routing, classification, specialist call, arbiter, correction loop, blue-green decision, plugin execution, hook execution, middleware execution. Trace propagation through all components. Span IDs in API response. Optional extra: pip install aua[otel].
Full W3C trace context propagation from router through all specialist calls. Trace ID in every API response header and body. Jaeger/Tempo exporter via OTEL collector. Debugger panel links to trace UI per request.
Metrics: aua_query_total, aua_query_latency_seconds, aua_utility_score, aua_contradiction_rate, aua_routing_field_distribution, aua_specialist_confidence, aua_bluegreen_traffic_split, aua_correction_count, aua_abstention_rate, aua_arbiter_verdict_distribution, aua_dpo_pairs_accumulated, aua_token_requests_total, aua_token_requests_rejected, aua_specialist_vram_utilization, aua_plugin_execution_seconds, aua_hook_failures_total.
Dashboard JSON in docker/grafana/. Panels: query volume, latency p50/p95/p99, routing distribution, utility score trends, contradiction rate, arbiter verdict distribution, specialist health, VRAM usage, blue-green split, auth failures, cost metrics, plugin latency.
OTEL collector config preset for Datadog exporter. Docs. Env var secret handling (DD_API_KEY). Doctor check for Datadog connectivity. Optional extra: pip install aua[datadog].
Alert rules: AUA_HighContradictionRate, AUA_SpecialistDown, AUA_LowUtilityScore, AUA_BlueGreenStalled, AUA_ArbiterCaseFourSpike, AUA_TokenRejectionSpike, AUA_PluginFailureSpike, AUA_HookFailureSpike. Alert rules YAML in docker/prometheus/alerts.yaml.
Events: specialist_promoted, rollback_completed, contradiction_threshold_exceeded, arbiter_inconclusive_spike, specialist_down, plugin_failure, security_auth_failure_spike. Config: webhooks: {slack: {url_secret: SLACK_WEBHOOK_URL, events: [specialist_down, contradiction_threshold_exceeded]}}. Retry with backoff. Audit logged.
Track: GPU hours per query, cost per specialist call, cost per arbiter call, cost per blue-green cycle, cost per plugin execution. Config: cost: {gpu_hour_rates: {rtx4090: 0.50, a100: 2.50}}. Exposed in GET /metrics/cost and aua status.
v0.9-rc1 definition of done: aua token create --scope aua:query --expires 30d && aua certs generate && aua doctor --strict && curl http://localhost:8000/metrics && curl http://localhost:8000/metrics/cost all pass. Grafana dashboard loads in Docker Compose observability profile.
Evaluation harness before Chat UI because the UI's debugger panel references eval runs and DPO pairs. Chat Session API before Chat UI frontend (API-first).
Evaluation + correction export
Required by blue-green promotion demo, tutorial Part 6, and the UI's blue-green debug panel.
aua eval run --dataset evals/coding.yaml --config aua_config.yaml, aua eval report .aua/evals/latest.json, aua eval compare --baseline blue --candidate green. Dataset format: YAML with cases (id, field, prompt, expected_properties). Runner routes each case through the framework, scores with utility function, checks expected properties. JSON output for CI. Regression detection. Wires up to POST /deploy/green harness.
aua corrections export --format jsonl, aua dpo export --format preference-pairs. Output format: {prompt, chosen, rejected, field, utility_chosen, utility_rejected, correction_ids, trace_id}. Traceability to source queries. Safe redaction option. Not a training pipeline — just export. Users bring their own fine-tuning infrastructure.
evals/coding_smoke.yaml, math_smoke.yaml, routing_smoke.yaml, correction_smoke.yaml, arbiter_smoke.yaml, safety_smoke.yaml. Small but useful. Run by aua eval run. Used in CI where feasible (fake specialists). Used in tutorial Part 6. Used in blue-green promotion demo in examples/blue_green_demo/.
Chat UI — session API + reference frontend
Chat Session API (U-01) before the frontend (U-02). Frontend consumes the API.
POST /sessions, GET /sessions, GET /sessions/{id}, DELETE /sessions/{id}, GET /sessions/{id}/messages, POST /sessions/{id}/messages, POST /sessions/{id}/stream. Session store backed by state store abstraction (F-05). Schema: chat_sessions, chat_messages, routing_traces tables. Chat request/response schema includes session_id, trace_id, routing, utility, debug fields. SSE streaming events: start, route, specialist_start, chunk, specialist_done, arbiter_start, arbiter_done, done, error.
Three-zone layout: session sidebar (left) + main chat window (center) + Framework Debugger panel (right). Left pop-up AUA Controls drawer exposes: preset, fields, models, routing thresholds, utility weights, arbiter policy, corrections, blue-green, backends, security, observability, extensions. Right debugger shows: route summary, field classifier output, specialist calls, candidate responses (optional), utility breakdown, arbiter decision, corrections injected, blue-green debug, latency breakdown, cost estimate, trace metadata, plugin/hook/middleware execution. Tech: React/Next.js + TypeScript + Tailwind + shadcn/ui. CLI: aua serve --with-ui, aua ui --port 3000. Config: ui: {enabled: true, host: ..., port: 3000, persist_sessions: true}. Auth required in production. Layout in apps/aua_chat/.
v0.9-rc2 definition of done: aua serve --with-ui starts everything. Open http://localhost:3000, create a chat, ask a coding question, watch streaming response, open Framework Debugger, see route + utility + arbiter + trace, open AUA Controls drawer, change routing threshold, reload config, ask again, see behavior change.
Final polish, documentation, release discipline. Everything else is implemented; this milestone makes it shippable.
Release readiness — compatibility, deployment profiles, examples, release docs
Four profiles documented in AUA_Framework_v1_Deployment_Profiles.md: (1) Local Developer — localhost only, auth optional, SQLite; (2) Single GPU Workstation — auth recommended, local file/SQLite state; (3) Team Server — auth required, mTLS required, Postgres state, Prometheus/Grafana; (4) Enterprise — custom backends, IAM, secrets manager, strict audit, runtime import disabled. Doctor validates profile-specific requirements.
Document in docs/compatibility.md: Python 3.10/3.11/3.12, CUDA versions, vLLM versions, Ollama versions, macOS (Ollama only), Linux (vLLM + GPU), GPU memory recommendations per tier, supported model formats (AWQ, GPTQ, fp16, GGUF), supported backends, browser support for UI, Docker support, Postgres versions, Prometheus/Grafana versions. Doctor references this matrix.
Create: CHANGELOG.md, RELEASE.md (release checklist: tests, lint, types, docs, tutorial, examples, Docker, fresh-clone, UI smoke, security profile, migration notes), MIGRATIONS.md, DEPRECATIONS.md. Versioning policy: semver, v1.x preserves public API, v2 may break plugin contracts with migration guide, deprecated features warn before removal.
All of the following must pass from a fresh clone: install, import, CLI (init, config, doctor, serve), REST (health, version, config, metrics), query, streaming, plugin test, security (token, certs), observability (metrics, cost, status). Chat UI smoke test. Tutorial Part 1 verified from scratch. All examples run. Docker Compose profiles verified. mypy, black, ruff, pytest all pass.
v1.0.0 — shipped: The one-liner from objective_v1.md works — pip install adaptive-utility-agent && aua init --preset coding --tier single-4090 && aua serve — and the expert path works without editing any AUA source files.
Original roadmap items #33–#73. These are real and valuable, but are post-v1. They require v1 to be stable first. Grouped by concern for clarity.
High-availability — consensus, leader election, service discovery, failover
Kafka KRaft-based coordinator for multi-node router/arbiter election without ZooKeeper dependency.
Active-standby leader election with automatic failover for router and arbiter replicas.
Distributed configuration and state via etcd. Enables multi-node deployments to share routing state.
Specialists register in Consul. Router discovers and load-balances across specialist replicas automatically.
Prevent cascade failures. Open circuit after threshold failures, half-open probe, close on recovery.
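The closed, open, half-open cycle can be sketched as a small state machine. Since this item is post-v1, everything here is illustrative: the `CircuitBreaker` class, its parameter names, and the defaults are assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self):
        """Return True if a call may proceed; moves open to half_open on timeout."""
        if self.state == "open" and self.clock() - self.opened_at >= self.reset_timeout:
            self.state = "half_open"  # let a probe request through
        return self.state != "open"

    def record_success(self):
        """A successful call (including a probe) closes the circuit."""
        self.state = "closed"
        self.failures = 0

    def record_failure(self):
        """Open after threshold failures, or immediately if a probe fails."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

This simplified version lets every caller through while half-open; a stricter variant would admit a single probe at a time.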
When primary specialist fails, route to healthy replica or fall back to arbiter with degraded-mode flag.
Per-specialist retry policy. Configurable max retries, base delay, jitter. Audit logged.
aua k8s generate --tier a100-cluster produces ready-to-apply manifests for all AUA components.
Production-grade Helm chart with values files for each hardware tier and deployment profile.
Backup state store (corrections, promotions, audit, sessions) to S3/GCS/local. Point-in-time restore.
Unified HA configuration block: replica counts, circuit breaker policy, leader election backend, service discovery backend.
Advanced platform features — multi-tenancy, caching, experiment tracking, batch, testing
Tenant isolation: separate correction stores, promotion logs, audit logs, rate limits, and model bindings per tenant.
Cache specialist responses by semantic similarity. Configurable TTL per field. Cache-miss threshold. Significant latency reduction for repeated similar queries.
Pull model versions from HuggingFace Hub or MLflow registry directly into specialist config. Version pinning.
Log utility scores, routing decisions, arbiter verdicts, and blue-green metrics to MLflow or W&B automatically.
GREEN receives traffic but responses are not shown to users. Score and log GREEN silently. Promote when score threshold met.
Full regression suite against the eval datasets. Run on every specialist promotion attempt. Block promotion on regression.
Built-in load test command. Configurable concurrency, duration, query mix. Reports p50/p95/p99 latency and error rate.
Plugin types beyond v1: custom contradiction detector, custom assertion store, custom routing strategy, custom scoring components.
Async middleware, streaming middleware (intercepts SSE chunks), batch middleware, multi-tenant middleware primitives.
Allow fully replacing the utility function (not just weights) via plugin. Supports non-linear utility models and field-specific scoring architectures.
aua test command running the full test suite against running services (not fakes). Integration test harness with configurable query fixtures.
Automated compatibility test matrix run in CI across model formats, hardware tiers, and backends. Published as docs and referenced by doctor.
Research & advanced — VCG, distillation, SDKs, GPU cluster, auto-generated docs
Persistent batch queue. Result polling. Priority lanes. Cost-optimized batching across specialists.
aua serve automatically downloads missing models from HuggingFace Hub before starting specialists.
Full Vickrey-Clarke-Groves mechanism for multi-specialist arbitration as described in the whitepaper theorems S1–S3.
Automated pipeline: collect DPO pairs → distill specialist → evaluate → blue-green candidate. End-to-end self-improvement loop.
Dynamically partition the specialist graph across available hardware based on VRAM and latency constraints.
Integrate PubMed, arXiv, SymPy as external ground truth sources for the Arbiter's empirical check (currently stubbed).
Node.js SDK with full TypeScript types. Mirrors Python API. SSE streaming support.
Go client library with streaming support. Targeting platform engineers building Go-based AI infrastructure.
Java client library. Enterprise-focused. Spring Boot integration example.
Tier template for 8×H100 cluster. NVLink, tensor parallelism, pipeline parallelism configuration.
vLLM tensor parallelism configuration for large models across multiple GPUs per specialist.
Optimized Ollama configuration for consumer MacBooks and gaming PCs (RTX 3080/4080 class).
Updated tutorial covering HA deployment, multi-tenancy, advanced plugins, and distillation pipeline.
The interactive API reference (OpenAPI) is already provided by FastAPI. Expand it with richer examples, field explanations, and code samples in all SDK languages.
Per-domain documentation pages covering field-specific utility weights, prompt templates, arbiter policies, and example queries.
Update whitepaper with v1 framework results, production metrics, and framework architecture. Update site links.
Separate repo with more complete example applications: medical assistant, legal assistant, full-stack coding assistant, research agent.
Public changelog and SemVer policy published on the framework site.
| Milestone | Items | Theme | Gate |
|---|---|---|---|
| v0.6-alpha | P-01 – P-12 (12 items) | Production-harden #01–#10 | Fresh-clone validation + CI passes |
| v0.7-beta | #11–#13, #05A–#05D, #14 (8 items) | Docker + Django config foundation | aua init --preset coding && docker-compose up |
| v0.8-framework-beta | F-01 – F-17 (17 items) | Architecture, contracts, plugins, hooks, middleware | Custom plugin test passes; no AUA source edits required |
| v0.9-rc1 | #15–#22, #26, #23, #28, #24–#25, #27, #29–#32 (18 items) | Security + observability | Token + certs + Prometheus + Grafana all working |
| v0.9-rc2 | E-01–E-03, U-01–U-02 (5 items) | Evaluation + chat UI | aua serve --with-ui + debugger panel working end-to-end |
| v1.0.0 — Shipped | D-01–D-04 (4 items) | Release readiness | All v1.0 definition-of-done commands pass |
| v2.0+ | #33–#73 (41 items) | HA, multi-tenancy, SDKs, research, cluster | Post-v1 — requires stable v1 first |
Key ordering rationale: Config strictness (P-06) before Docker (#11) because Docker needs correct config. Model registry (#05C) before presets (#05B) because presets reference aliases. Field registry (#05D) before config commands (#05A) because explain needs field knowledge. Secrets (#19) before auth (#17) because auth tokens are secrets. OTEL (#23) before Prometheus (#24) because Prometheus can export from OTEL. State store (F-05) before Chat Session API (U-01) because sessions need storage. Plugin interfaces (F-07) before import system (F-09) because the importer validates against the interfaces. Error taxonomy (F-03) before security (#17–#22) because auth errors need stable codes. Architecture spec (F-01) before all framework items because it defines component boundaries that prevent drift.