AUA Framework v1.0 — Release Roadmap & Validation Matrix

This page records the v1.0 implementation contract, validation criteria, and future roadmap. Items marked done are included in v1.0.0; items marked wip or future are deferred to v1.1+ or v2.0. The goal is a Django for adaptive multi-model LLM systems: batteries included, deeply configurable, and extensible without editing AUA internals.

Source documents: aua_roadmap.html · AUA_Framework_v1_Roadmap.md · objective_v1.md · aua_roadmap_01_10_production_completeness.md · missing_from_v1.md · Front_end_for_UA_Framework_v1.md

✓ Status: v1.0.0 shipped.

All items marked done are included in v1.0.0 and validated. See docs/v1_validation_report.md for the full validation record: 132 tests, 23 REST endpoints, 8 plugin protocols, 15 Prometheus metrics, CLI reference, install transcript, Docker Compose, Chat UI, security, and observability.

Reading guide. Items are ordered so each step unlocks the next with minimal rework. Original roadmap numbers (#01–#73) are preserved where they exist; new items introduced by the planning documents use prefixed IDs (P-, F-, E-, U-, D-). Items already implemented (v0.5 POC) are marked done. Items needing production hardening are marked wip.

73 original roadmap items · 55+ new items from planning docs · 7 release milestones · Released: production framework v1.0.0

v0.6-alpha — #01–#10 Production-Complete

Items #01–#10 are production-complete in v1.0.0. This milestone records the hardening work completed: quality, contracts, correctness, and packaging. All items are shipped.

P-01–P-04

Repo hygiene & packaging

Must happen first — CI, installs, and all other work depend on this being clean.

Do first
P-01
Code formatting — black, isort, ruff

Run black aua tests and isort aua tests across the entire repo. Add [tool.black], [tool.isort], [tool.ruff], and [tool.mypy] sections to pyproject.toml. line-length=100, target-version=py310. This is a prerequisite for CI.

code
P-02
pyproject.toml — production-complete package metadata

Package name adaptive-utility-agent. Optional extras: pip install aua[vllm], aua[dev], aua[ui], aua[postgres], aua[otel]. vllm must be optional — it is not Mac-friendly. Include aua/tiers/*.yaml in wheel data. Add Makefile with install, test, lint, format, typecheck, build targets.

infra
P-03
aua/version.py — single version source + py.typed

Create aua/version.py containing __version__ = "1.0.0". All other references import from here. Create aua/py.typed marker file. Add version consistency test: importlib.metadata.version("adaptive-utility-agent") == aua.__version__.

code
P-04
CI workflow — .github/workflows/ci.yml

Matrix: Python 3.10, 3.11, 3.12. Steps: pip install -e ".[dev]" → ruff → black check → mypy → pytest. Must pass without GPUs using fake specialists. Also add wheel-build validation step.

infra
P-05–P-07

Public API alignment & config strictness

Config correctness unblocks Docker, tiers, presets, and all downstream work.

Do second
P-05
Public API alignment — aua/__init__.py

Export exact roadmap API: Arbiter = ArbiterAgent (alias), BlueGreenDeployment, CorrectionLoop. Expose all endpoint models from aua.endpoints. Stable __all__. Add import smoke tests in tests/test_imports.py. Add docs/public_api.md separating stable from internal APIs.

code
P-06
Config strictness — validation, host/scheme, runtime paths

Remove hardcoded localhost from endpoint construction. Add host, scheme, endpoint_path, endpoint_override fields to SpecialistConfig and ArbiterConfig. Add strict unknown-key validation (catches typos). Add duplicate port validation. Add threshold range validation. Add RuntimeConfig (.aua/logs, .aua/pids, .aua/state, .aua/checkpoints). Add APIConfig with cors_origins.
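
The three validation rules above (unknown keys, duplicate ports, threshold range) can be sketched with the standard library alone. This is a minimal illustration, not the shipped loader: the confidence_threshold field, the default values, and the ConfigError and load_specialists names are assumptions made for the example.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


class ConfigError(ValueError):
    """Raised when a config mapping fails strict validation (hypothetical name)."""


@dataclass
class SpecialistConfig:
    # host, scheme and endpoint_path come from the roadmap text;
    # the defaults and confidence_threshold are illustrative assumptions.
    name: str
    port: int
    host: str = "127.0.0.1"
    scheme: str = "http"
    endpoint_path: str = "/v1/chat/completions"
    confidence_threshold: float = 0.7

    @property
    def endpoint(self) -> str:
        # Built from config fields instead of a hardcoded localhost URL.
        return f"{self.scheme}://{self.host}:{self.port}{self.endpoint_path}"


def load_specialists(raw: List[dict]) -> List[SpecialistConfig]:
    allowed = {f.name for f in fields(SpecialistConfig)}
    seen_ports: Dict[int, str] = {}
    result = []
    for entry in raw:
        # Strict unknown-key check: a typo such as "prot" fails loudly.
        unknown = set(entry) - allowed
        if unknown:
            raise ConfigError(f"unknown key(s) {sorted(unknown)} in specialist config")
        spec = SpecialistConfig(**entry)
        # Duplicate port check across all specialists.
        if spec.port in seen_ports:
            raise ConfigError(
                f"port {spec.port} is used by both {seen_ports[spec.port]!r} and {spec.name!r}"
            )
        seen_ports[spec.port] = spec.name
        # Threshold range check.
        if not 0.0 <= spec.confidence_threshold <= 1.0:
            raise ConfigError(f"{spec.name}: confidence_threshold must be within [0, 1]")
        result.append(spec)
    return result
```

Failing at load time keeps a mistyped key from silently falling back to a default.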

code
P-07
Tier name alignment & aliases

Canonical tier names: macbook, single-4090, quad-4090, a100-cluster. Backward-compatible aliases: rtx4090 → single-4090, a100 → a100-cluster. Add quad-4090.yaml template. Every tier template must load in CI. Annotate generated YAML with inline comments.
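
Alias resolution of this kind amounts to a lookup table in front of the canonical list. The sketch below uses the tier names and aliases listed above; the resolve_tier function name is hypothetical.

```python
CANONICAL_TIERS = ("macbook", "single-4090", "quad-4090", "a100-cluster")

# Backward-compatible aliases map old names onto canonical ones.
TIER_ALIASES = {"rtx4090": "single-4090", "a100": "a100-cluster"}


def resolve_tier(name: str) -> str:
    """Return the canonical tier name for a canonical name or a known alias."""
    key = name.strip().lower()
    key = TIER_ALIASES.get(key, key)
    if key not in CANONICAL_TIERS:
        raise ValueError(
            f"unknown tier {name!r}; expected one of {list(CANONICAL_TIERS)} "
            f"or an alias in {sorted(TIER_ALIASES)}"
        )
    return key
```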

code
P-08–P-12

Runtime hardening — serve, router, rollback, CLI, tests

Production lifecycle, API contracts, and test suite.

Do third
P-08
serve.py — process lifecycle hardening

SIGINT/SIGTERM handler that terminates child processes (TERM → 15s grace → KILL). Write .aua/pids/ and .aua/logs/ on startup. Async readiness polling per specialist. Port-conflict detection before start with --reuse-running flag. Deterministic --dry-run output (exit 0). Document foreground-only mode explicitly in help.
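
The TERM, grace period, KILL sequence can be sketched with subprocess alone. This is an illustration of the shutdown step only, under the assumption that each specialist is a subprocess.Popen child; stop_child is a hypothetical helper name, and signal-handler registration and PID-file writing are left out.

```python
import subprocess
import sys


def stop_child(proc: subprocess.Popen, grace_seconds: float = 15.0) -> int:
    """Ask a child process to exit, then force-kill it if the grace period expires."""
    if proc.poll() is None:
        proc.terminate()  # sends SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # sends SIGKILL on POSIX
            proc.wait()  # reap the process so it does not linger as a zombie
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a specialist process: a Python child that sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_child(child, grace_seconds=5.0))
```

A SIGINT/SIGTERM handler in the parent would call a helper like this once per tracked child.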

code
P-09
router.py — API contract hardening

Move CORS from wildcard to APIConfig.cors_origins. Add session_id (auto-generated UUID if not supplied) to every request/response. Add ErrorResponse model with stable error codes. Distinguish 503 (unreachable) from 504 (timeout). Add GET /version. Redact secrets from GET /config. Make POST /deploy/green honest: return dry_run_only until the evaluation harness (E-01) exists. Preserve batch result order with per-item index and ok fields.

code
P-10
rollback.py — durable atomic state

Move promotions log to .aua/state/promotions.jsonl with UUID per record. File locking via filelock to prevent concurrent rollback+deploy races. Atomic config writes: write to .tmp then os.replace(). Add --dry-run to rollback. Add POST /deploy/rollback REST endpoint or document CLI-only clearly.
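
A minimal sketch of the two durability techniques named above: an atomic config write via a temporary file plus os.replace(), and an append-only JSONL log with one UUID per record. The function names and the record fields other than the UUID are assumptions; file locking with filelock is not shown.

```python
import json
import os
import uuid
from datetime import datetime, timezone
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the new one, never a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem


def append_promotion(log_path: Path, event: dict) -> str:
    """Append one promotion record to a JSONL log and return its UUID."""
    record_id = str(uuid.uuid4())
    record = {"id": record_id, "ts": datetime.now(timezone.utc).isoformat(), **event}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record_id
```

In a concurrent rollback and deploy scenario, both helpers would be called while holding the lock the item describes.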

code
P-11
CLI discipline — exit codes, flags, streaming robustness

All CLI commands: correct exit codes (0=pass, 1=fail, 2=warn-in-strict). Add --json to aua doctor and aua status. Add --strict to aua doctor. Add --once, --url, --refresh to aua status. Stream: add SSE named event fields (event: chunk), heartbeat comments (: keep-alive every 15s), client disconnect handling, robust data: [DONE] parser. Add Content-Encoding: none header on SSE routes so gzip middleware does not compress the stream.
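
On the client side, a tolerant parser for such a stream might look like the sketch below. It follows the standard Server-Sent Events line format (comment lines start with a colon, a blank line ends an event) and treats a data: [DONE] payload as end of stream; the parse_sse name is hypothetical and this is not the CLI's actual parser.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs until the stream ends or [DONE] arrives."""
    event = "message"  # SSE default event name when no event: field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                payload = "\n".join(data)
                if payload.strip() == "[DONE]":
                    return
                yield event, payload
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, e.g. the ": keep-alive" heartbeat
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is optional
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
    # Stream closed without a trailing blank line: flush what is pending.
    if data:
        payload = "\n".join(data)
        if payload.strip() != "[DONE]":
            yield event, payload
```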

code
P-12
Test suite — fake specialists, fixtures, contract tests

Create tests/fakes/openai_server.py: FastAPI fake with GET /v1/models, POST /v1/chat/completions (buffered + streaming). Create fixture configs: minimal, rtx4090, macbook, invalid_duplicate_ports, invalid_unknown_key, invalid_threshold. Test files: test_imports, test_config, test_cli_init, test_cli_doctor, test_router_api, test_streaming, test_status, test_rollback. CI must pass without GPUs.

code

v0.6-alpha definition of done: pip install -e ".[dev]" && python -c "from aua import Router, Arbiter, UtilityScorer, BlueGreenDeployment, CorrectionLoop; print('ok')" && aua init --tier macbook --force && aua doctor --strict && aua serve --dry-run && pytest -q && ruff check aua tests && black --check aua tests && mypy aua all pass from a fresh clone.

v0.7-beta — Docker + Django Config Foundation

Packaging is clean. Now add the deployment layer (#11–#13) and the Django-like configuration system (presets, model registry, field registry, aua config command group). These are ordered so the registry items come before the commands that reference them.

#11–#12

Docker + hardware tier templates

Run everything in one command. Depends on clean pyproject.toml (P-02) and correct tier names (P-07).

v0.7-beta
#11
Official Dockerfile + docker-compose

docker-compose up starts router, specialists, arbiter. docker-compose --profile gpu up for GPU variant. Profiles: cpu (Ollama/local), gpu (vLLM), observability (Prometheus + Grafana optional), secure. Health checks on all containers. Volume mounts for .aua/ state. Environment file support. Structured logs to stdout.

infra
#12
Hardware tier config templates — complete set

Canonical tiers: macbook (Ollama), single-4090 (RTX 4090, vLLM AWQ), quad-4090 (4× RTX 4090, multi-GPU), a100-cluster (A100 80GB, fp16). Each specifies specialists, arbiter, GPU assignment, memory split, ports, promotion thresholds, default models, observability defaults, security defaults. Annotated with inline YAML comments.

infra
#05C · #05D · #05B · #05A

Django config layer — registries, presets, config commands

Registries before presets. Presets before config commands. Config commands before tutorial.

v0.7-beta
#05C
Model registry — aliases, compatibility, aua models CLI

Built-in aliases: qwen-coder-7b-awq, qwen-math-7b-awq, qwen-14b-awq, llama3-8b, etc. Each entry: provider, full model ID, backend, quantization, recommended VRAM. User-defined registry entries in YAML. Hardware compatibility checks (AWQ + Ollama → error). CLI: aua models list, aua models inspect <alias>. Compact config uses aliases: model: qwen-coder-7b-awq.
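
A registry lookup with the AWQ + Ollama compatibility rule could be sketched as follows. The entry fields mirror the list above, but the provider names, model IDs, and VRAM figures are placeholders invented for the example, and resolve_model is a hypothetical function name.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ModelEntry:
    provider: str
    model_id: str
    backend: str
    quantization: str
    recommended_vram_gb: int


# Placeholder entries: providers, model IDs and VRAM numbers are illustrative only.
REGISTRY = {
    "qwen-coder-7b-awq": ModelEntry("example-provider", "example/qwen-coder-7b-awq", "vllm", "awq", 8),
    "llama3-8b": ModelEntry("example-provider", "example/llama3-8b", "ollama", "none", 8),
}


def resolve_model(alias: str, backend: Optional[str] = None) -> ModelEntry:
    """Resolve an alias, optionally overriding the backend, and check compatibility."""
    if alias not in REGISTRY:
        raise KeyError(f"unknown model alias {alias!r}; known aliases: {sorted(REGISTRY)}")
    entry = REGISTRY[alias]
    if backend is not None:
        entry = replace(entry, backend=backend)
    # Compatibility rule from the roadmap: AWQ-quantized models cannot run on Ollama.
    if entry.quantization == "awq" and entry.backend == "ollama":
        raise ValueError(f"{alias}: AWQ quantization is not compatible with the Ollama backend")
    return entry
```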

code
#05D
Field registry — canonical IDs, aliases, aua fields CLI

Built-in fields: software_engineering, mathematics, general, research, medicine, law, finance. Each has aliases (swe, coding), description, default utility weights, default confidence threshold, default arbiter policy. User-defined custom fields. CLI: aua fields list, aua fields inspect <field>.

code
#05B
Preset system — built-in presets, aua presets CLI

Built-in presets: coding, math, research, generalist, medical-safe, legal-safe, local-ollama. Each preset selects fields, default models (via registry aliases), routing thresholds, utility weights, arbiter policy. Presets live in aua/defaults/presets/. Compact config: preset: coding expands to full config. CLI: aua presets list, aua presets inspect <name>. aua init --preset coding --tier single-4090 works end-to-end.

code
#05A
aua config command group — validate, expand, show, diff, explain

aua config validate (strict schema check), aua config expand (compact → full YAML printed to stdout), aua config show (current loaded config, secrets redacted), aua config diff config_a.yaml config_b.yaml, aua config explain routing.fanout_threshold (human-readable explanation with default, range, higher/lower meaning). JSON output option on all commands. Config schema registry with explanation strings for every key.

code
#13 · #14

Hot reload + tutorial rewrite

Tutorial written last so it reflects the final config system and presets.

v0.7-beta
#13
Hot reload for aua_config.yaml

SIGHUP triggers reload. aua config reload CLI command. Hot-reloadable without restart: routing thresholds, utility weights, promotion thresholds, logging level, rate limits, CORS origins, webhook config. Partial restart required for: new specialist, changed model, changed backend, changed mTLS certs. Reload is atomic — validate new config before applying.
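
The validate-then-swap behaviour can be sketched as below. The top-level key names in HOT_RELOADABLE are assumptions that loosely follow the list above, and ConfigHolder, RestartRequired, and require_routing are hypothetical names; the point is that nothing changes unless the candidate config passes validation and touches only hot-reloadable sections.

```python
from typing import Callable, Set

# Assumed top-level sections that can change without a restart.
HOT_RELOADABLE = {"routing", "utility", "promotion", "logging", "rate_limits", "api", "webhooks"}


class RestartRequired(Exception):
    """Raised when a changed section cannot be applied without a (partial) restart."""


def require_routing(config: dict) -> None:
    # Stand-in validator; a real one would run the full schema check.
    if "routing" not in config:
        raise ValueError("config is missing the 'routing' section")


class ConfigHolder:
    def __init__(self, config: dict, validate: Callable[[dict], None]):
        self._config = config
        self._validate = validate

    @property
    def current(self) -> dict:
        return self._config

    def reload(self, candidate: dict) -> Set[str]:
        """Apply candidate atomically and return the set of changed sections."""
        self._validate(candidate)  # raises before anything is modified
        keys = set(self._config) | set(candidate)
        changed = {k for k in keys if self._config.get(k) != candidate.get(k)}
        blocked = changed - HOT_RELOADABLE
        if blocked:
            raise RestartRequired(f"restart needed for: {sorted(blocked)}")
        self._config = candidate  # single reference swap, no partial state
        return changed
```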

code
#14
Tutorial rewrite — Django-style, 12 parts

Replaces tutorial.html. Do not start with theory. Part 1: quickstart in <10 min. Parts 2–12 progressively teach: models & fields, routing & utility, arbiter & corrections, blue-green, plugins, hooks/middleware, security, observability, Docker deployment, expert deployment (full custom config without editing AUA internals). Tutorial verifies commands from each section actually run.

doc

v0.7-beta definition of done: docker-compose up works. aua init --preset coding --tier single-4090 && aua config validate && aua config expand && aua models list && aua fields list && aua presets list && aua serve --dry-run all pass. Tutorial Part 1 tested from scratch.

v0.8-framework-beta — Django-Like Framework UX

The configuration layer exists. Now add the extensibility layer: architecture spec, contracts, error taxonomy, state abstraction, plugin system, hooks, middleware. These are ordered so lower-level items (architecture, contracts, errors, state) come before the systems that depend on them (plugins, hooks, middleware).

F-01–F-06

Framework discipline — architecture, contracts, errors, state, config versioning

These define the contracts everything else implements. Must come before plugins and hooks.

v0.8-framework
F-01
Architecture spec — AUA_Framework_v1_Architecture.md

Defines: runtime architecture, full request lifecycle (Input → Middleware → Session → Correction Retrieval → Classifier → Routing → Specialist Calls → Utility Scoring → Arbiter → Hooks → Correction Logging → Response → Metrics/Logs/Traces/Audit), component boundaries, component ownership, plugin loading lifecycle, hook/middleware execution order, observability flow, security boundary. This doc prevents implementation drift.

doc
F-02
Stable public API contracts — plugin interfaces and compatibility guarantee

Define formal Python protocols for all extension points: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin, ModelBackendPlugin, HookPlugin, AUAMiddleware. Create docs/public_api.md. Guarantee: v1.x will not break public import paths or plugin method signatures. Deprecated APIs get one minor release of warning.

doc · code
F-03
Error taxonomy — stable AUA_* error codes

Define stable error code list: AUA_CONFIG_INVALID, AUA_BACKEND_UNREACHABLE, AUA_SPECIALIST_TIMEOUT, AUA_ARBITER_TIMEOUT, AUA_PLUGIN_LOAD_FAILED, AUA_PLUGIN_CONTRACT_INVALID, AUA_HOOK_FAILED, AUA_MIDDLEWARE_FAILED, AUA_AUTH_REQUIRED, AUA_FORBIDDEN, AUA_RATE_LIMITED, AUA_PROMOTION_REJECTED, AUA_ROLLBACK_FAILED, AUA_STATE_STORE_UNAVAILABLE, etc. Map each code to HTTP status and CLI exit code. REST error format: {"error": "AUA_*", "message": "...", "trace_id": "...", "details": {}}. Create AUA_Framework_v1_Error_Codes.md.

code · doc
F-04
Permission/scope matrix — fine-grained API scopes

Define scopes: aua:query, aua:stream, aua:status, aua:config:read, aua:config:write, aua:corrections:read, aua:corrections:write, aua:deploy, aua:rollback, aua:extensions:read, aua:extensions:write, aua:tokens:read, aua:tokens:write, aua:admin. Map every endpoint to required scope. Extension/runtime import endpoints require aua:extensions:write or aua:admin. Create scope matrix table in docs.

sec · doc
F-05
State store abstraction — SQLite → Postgres, pluggable

Define state store interface. State categories: chat sessions, corrections, promotion logs, audit logs, DPO pairs, config snapshots, token metadata, cost records, eval runs, routing traces. Implementations: local files (default), SQLite (local dev), Postgres (production). Config: state: {backend: sqlite, path: .aua/state/aua.db}. Migration support. Doctor checks. Removes the scattered .jsonl files in favor of structured storage.

code
F-06
Config versioning + migration CLI

Add config_version: "1.0" field to YAML. CLI: aua config check-version, aua config migrate --from 0.6 --to 1.0, aua config migrate --dry-run. Clear error for unsupported config versions. Migration tests against old config fixtures. Required before v1.0 to support users upgrading from v0.x.

code
F-07–F-13

Extension system — plugins, backends, hooks, middleware, extension CLI

Ordered: plugin interfaces → backend plugins → import system → hooks → middleware → runtime API → extension CLI.

v0.8-framework
F-07
Plugin interfaces — formal protocols for all plugin types

Define base protocols in aua/plugins/: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin. Each has: required method signatures, type hints, docstrings, example implementations, test harness. Version compatibility guarantees documented.

code
F-08
Model backend plugin system

ModelBackendPlugin protocol: complete(request) -> dict, stream(request) -> AsyncIterator, health() -> dict. Built-in backends: vLLM (existing), Ollama (existing), OpenAI-compatible generic. YAML registration: backends: {my_gateway: {plugin: my_company.backends:GatewayBackend, base_url: ..., auth_secret: ...}}. Doctor checks backend health for all registered backends.
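
The three-method protocol can be expressed with typing.Protocol. The method names and return types come from the description above; whether the methods are coroutines, the shape of the request dict, and the EchoBackend class are assumptions made for this sketch.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Protocol, runtime_checkable


@runtime_checkable
class ModelBackendPlugin(Protocol):
    async def complete(self, request: Dict[str, Any]) -> Dict[str, Any]: ...

    def stream(self, request: Dict[str, Any]) -> AsyncIterator[str]: ...

    async def health(self) -> Dict[str, Any]: ...


class EchoBackend:
    """Toy backend used only to show the shape; it needs no model server."""

    async def complete(self, request: Dict[str, Any]) -> Dict[str, Any]:
        return {"text": request["prompt"].upper()}

    async def stream(self, request: Dict[str, Any]) -> AsyncIterator[str]:
        # An async generator: calling it returns an AsyncIterator of chunks.
        for word in request["prompt"].split():
            yield word

    async def health(self) -> Dict[str, Any]:
        return {"status": "ok"}


async def demo() -> Dict[str, Any]:
    backend = EchoBackend()
    chunks = [chunk async for chunk in backend.stream({"prompt": "hello world"})]
    completion = await backend.complete({"prompt": "hi"})
    return {"completion": completion, "chunks": chunks, "health": await backend.health()}


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that isinstance() against a runtime_checkable protocol only checks that the methods exist, not their signatures, so a stricter load-time check would still be needed.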

code
F-09
Extension import system — import_path syntax, config injection

YAML syntax: import_path: plugins.custom_utility:RiskWeightedUtilityScorer. Config injection: config: {risk_weight: 0.7} passed to constructor. Strict interface validation on load (not on import). Clear error messages on mismatch. Doctor checks all import paths. Reload support where safe. Allowlisted paths only in production mode. Users never edit AUA source files.
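
Resolving a module:Class import path and injecting config into the constructor needs only importlib. The sketch below is one way to do it; load_extension and its required_methods parameter are hypothetical, and the allowlist check for production mode is omitted.

```python
import importlib
from typing import Any, Dict, Iterable, Optional


def load_extension(
    import_path: str,
    config: Optional[Dict[str, Any]] = None,
    required_methods: Iterable[str] = (),
) -> Any:
    """Import "package.module:ClassName", build it with config, and check its interface."""
    module_name, sep, attr = import_path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"import_path must look like 'module:Class', got {import_path!r}")
    module = importlib.import_module(module_name)
    try:
        cls = getattr(module, attr)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {attr!r}") from exc
    instance = cls(**(config or {}))  # config injection: YAML keys become keyword arguments
    # Interface validation happens here, at load time, with a clear message.
    missing = [m for m in required_methods if not callable(getattr(instance, m, None))]
    if missing:
        raise TypeError(f"{import_path} is missing required method(s): {missing}")
    return instance
```

For example, load_extension("fractions:Fraction", {"numerator": 3, "denominator": 4}) builds a standard-library Fraction the same way a plugin class would be built from its YAML config block.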

code
F-10
Hook system — lifecycle hooks with ordering and timeout policy

Hooks: pre_query, post_query, pre_route, post_route, pre_specialist_call, post_specialist_call, pre_arbiter, post_arbiter, pre_response, post_response, on_correction, on_promotion, on_rollback. Hook interface: async def __call__(self, event: dict) -> dict. Configurable ordering. Error handling policy (fail-open vs fail-closed). Timeout policy per hook. Audit logged. Metrics tracked. YAML registration under hooks:.
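
A hook runner with a per-hook timeout and a fail-open or fail-closed policy might look like this sketch. The hook signature matches the interface above; run_hooks, its parameters, and the add_tag hook are hypothetical, and audit logging and metrics are omitted.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence

Hook = Callable[[Dict], Awaitable[Dict]]


async def run_hooks(
    hooks: Sequence[Hook],
    event: Dict,
    timeout_s: float = 1.0,
    fail_open: bool = True,
) -> Dict:
    """Run hooks in order; each hook receives the event returned by the previous one."""
    for hook in hooks:
        try:
            event = await asyncio.wait_for(hook(event), timeout=timeout_s)
        except Exception as exc:  # includes the timeout raised by wait_for
            if not fail_open:
                # Fail-closed: abort the request with the stable error code.
                raise RuntimeError(f"AUA_HOOK_FAILED: {hook!r}") from exc
            # Fail-open: skip this hook and keep the event as it was before it ran.
    return event


async def add_tag(event: Dict) -> Dict:
    # Example hook: returns a new event dict with one extra key.
    return {**event, "tagged": True}


if __name__ == "__main__":
    print(asyncio.run(run_hooks([add_tag], {"query": "hi"})))
```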

code
F-11
Middleware system — ordered pipeline, before/after, short-circuit

Interface: before_query(request) -> request, after_response(response) -> response. Ordered execution (list order in YAML). Short-circuit support (return early without calling downstream). Error behavior: configurable (skip, fail). Metrics per middleware. Built-in examples: PIIRedactionMiddleware, TenantPolicyMiddleware, AuditMiddleware. YAML: middleware: [plugins.middleware:PIIRedactionMiddleware].
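
An ordered pipeline with short-circuit support can be sketched as follows. The before_query and after_response method names come from the interface above; the ShortCircuit wrapper, the reverse-order unwinding of after_response, and the two example middlewares are assumptions chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ShortCircuit:
    """Returned from before_query to answer immediately without calling downstream."""
    response: Dict


class RejectEmptyQuery:
    def before_query(self, request: Dict):
        if not request.get("query", "").strip():
            return ShortCircuit({"answer": None, "error": "empty query"})
        return request

    def after_response(self, response: Dict) -> Dict:
        return response


class StampMiddleware:
    """Records its name on the way in and on the way out, to make ordering visible."""

    def __init__(self, name: str):
        self.name = name

    def before_query(self, request: Dict):
        return {**request, "seen_by": request.get("seen_by", []) + [self.name]}

    def after_response(self, response: Dict) -> Dict:
        return {**response, "post": response.get("post", []) + [self.name]}


def run_pipeline(middlewares: List, request: Dict, handler: Callable[[Dict], Dict]) -> Dict:
    ran = []
    response: Optional[Dict] = None
    for mw in middlewares:  # list order, as in the YAML
        ran.append(mw)
        result = mw.before_query(request)
        if isinstance(result, ShortCircuit):
            response = result.response
            break
        request = result
    if response is None:
        response = handler(request)
    for mw in reversed(ran):  # unwind only the middlewares that actually ran
        response = mw.after_response(response)
    return response
```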

code
F-12
Extension runtime API — GET/POST /extensions

GET /extensions, GET /extensions/{name}, POST /extensions/test, POST /extensions/reload. Disabled by default in production. Requires aua:extensions:write scope. Only loads allowlisted paths. Audit logged. Warns loudly if bound to 0.0.0.0. Development-only: POST /extensions/import.

code · sec
F-13
Extension CLI — aua extensions commands

aua extensions list, aua extensions inspect <name>, aua extensions test --kind utility_scorer --import-path plugins.custom_utility:RiskWeightedUtilityScorer, aua extensions reload, aua plugins doctor. Also adds a management-command stub: aua run <command> dispatches to an import_path registered in YAML.

code
F-14–F-17

Framework defaults — prompt templates, safety policy, defaults registry

Prompt templates and safety policy before presets reference them.

v0.8-framework
F-14
Prompt template system — versioned, field-specific, config override

Built-in templates in aua/templates/prompts/: classifier_v1.txt, arbiter_balanced_v1.txt, arbiter_conservative_v1.txt, correction_injection_v1.txt, abstention_v1.txt, medical_safe_v1.txt, legal_safe_v1.txt. Config: prompts: {arbiter_template: arbiter_conservative_v1}. Advanced: arbiter_template_path: ./prompts/custom.txt. Template registry with version tracking. Prompts are framework behavior, not implementation details.

code
F-15
Safety/abstention policy — high-risk field gating

Config: safety: {abstention_enabled: true, high_risk_fields: [medicine, law, finance], require_arbiter_for_high_risk: true, min_confidence_for_direct_answer: 0.90}. Behavior: low confidence → abstain or clarify. High-risk field → force arbiter. Contradiction detected → do not answer confidently. Required by medical-safe and legal-safe presets. Abstention response model in endpoints.py.
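
The gating rules can be sketched as a small decision function. The config keys and default values are taken from the example above; the precedence between the rules, the three outcome labels, and the choice to escalate to the arbiter when abstention is disabled are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SafetyPolicy:
    abstention_enabled: bool = True
    high_risk_fields: FrozenSet[str] = frozenset({"medicine", "law", "finance"})
    require_arbiter_for_high_risk: bool = True
    min_confidence_for_direct_answer: float = 0.90


def decide(policy: SafetyPolicy, field_name: str, confidence: float,
           contradiction: bool = False) -> str:
    """Return "direct", "arbiter" or "abstain" (labels are illustrative)."""
    # Assumed fallback: without abstention, uncertain cases escalate to the arbiter.
    cautious = "abstain" if policy.abstention_enabled else "arbiter"
    if contradiction:
        return cautious  # contradiction detected: do not answer confidently
    if policy.require_arbiter_for_high_risk and field_name in policy.high_risk_fields:
        return "arbiter"  # high-risk field: force the arbiter
    if confidence < policy.min_confidence_for_direct_answer:
        return cautious  # low confidence: abstain or clarify
    return "direct"
```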

code · sec
F-16
Defaults registry — aua/defaults/ + aua defaults CLI

aua/defaults/: presets/, models.yaml, fields.yaml, utility.yaml, routing.yaml, security.yaml, prompts.yaml. CLI: aua defaults show, aua defaults show preset coding, aua defaults show models. Makes batteries-included defaults inspectable and avoids hidden magic. Underpins aua config expand.

code
F-17
Built-in example projects — examples/ directory

Required examples: quickstart_coding/, custom_utility_plugin/, custom_field_classifier/, custom_backend/, blue_green_demo/, secure_docker_deploy/. Each: aua_config.yaml, README.md, sample curl commands, expected output, troubleshooting. Examples validate the plugin system, docs, and API simultaneously. Users copy these before reading advanced docs.

doc

v0.8-framework-beta definition of done: aua extensions test --kind utility_scorer --import-path plugins.example:ExampleUtilityScorer passes. aua defaults show and aua models list work. Custom middleware and hook examples in examples/ run. Architecture spec exists and is accurate.

v0.9-rc1 — Security + Observability

Security items ordered so the foundation (session IDs, scopes) comes before the features that use them (audit log, rate limiting, mTLS). Observability ordered so the pipeline (OTEL) precedes the outputs (Prometheus, Grafana, Datadog).

#15–#22

Security foundation — identity, auth, audit, rate limiting

Session IDs first (everything else references them). Auth before audit. Secrets before mTLS certs.

v0.9-rc1
#15
Session IDs — end-to-end query tracking

Every query gets session_id, trace_id, request_id. Propagated through: router, field classifier, specialist calls, arbiter, correction loop, response, logs, metrics, audit log. Returned in every API response. Generated UUID if not supplied by client. Context available to all hooks and middleware.

sec
#19
Secrets management — no plaintext in config (moved up: auth depends on this)

Config references secret names, not values. Supported: environment variables, local encrypted file, HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager. YAML: secrets: {provider: env}, api_key_secret: OPENAI_API_KEY. Doctor checks secret resolution. Redaction in GET /config and logs.

sec
#17
IAM and bearer token authentication

External endpoints require auth. JWT or signed token support. Scope validation per endpoint (using permission matrix from F-04). Token ID in logs and audit. Admin endpoints strongly protected. Extension endpoints require aua:extensions:write. Local dev mode can opt out of auth with explicit warning.

sec
#18
Token management CLI

aua token create --scope aua:query --expires 30d, aua token list, aua token revoke <token-id>, aua token inspect <token-id>. Token metadata stored in state store (F-05). Revocation list checked on every request.

sec
#20
Audit log — append-only, tamper-evident, hash chain

Events: query, routing decision, arbiter verdict, correction injection, config reload, token create/revoke, promotion, rollback, extension load, hook failure, auth failure. Required fields: timestamp, session_id, trace_id, token_id, event_type, field, specialist, utility_score, confidence, latency_ms, hash_chain_previous, hash_chain_current. Append-only to state store. Hash chain provides tamper evidence.
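
A hash chain of this kind links each record to the hash of the one before it, so editing or deleting an earlier record invalidates every later hash. The sketch below uses SHA-256 over a canonical JSON encoding, which is an assumption (the roadmap does not name the hash function), and keeps the chain in a list rather than the state store.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record
CHAIN_KEYS = ("hash_chain_previous", "hash_chain_current")


def _digest(previous: str, event: Dict) -> str:
    # Canonical encoding so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous + body).encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], event: Dict) -> Dict:
    previous = chain[-1]["hash_chain_current"] if chain else GENESIS
    record = {
        **event,
        "hash_chain_previous": previous,
        "hash_chain_current": _digest(previous, event),
    }
    chain.append(record)
    return record


def verify_chain(chain: List[Dict]) -> bool:
    previous = GENESIS
    for record in chain:
        event = {k: v for k, v in record.items() if k not in CHAIN_KEYS}
        expected = _digest(previous, event)
        if record["hash_chain_previous"] != previous or record["hash_chain_current"] != expected:
            return False
        previous = expected
    return True
```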

sec
#16
mTLS — encrypted internal communication

Router ↔ specialists, router ↔ arbiter, arbiter ↔ correction loop. Config: security: {mtls: {enabled: true, cert_dir: .aua/certs, auto_generate: true}}. CLI: aua certs generate, aua certs rotate. Doctor check: aua doctor --check-certs. Auto-generated dev certs with warning. Production: bring your own CA.

sec
#21
Rate limiting and quota management

Configurable per token/scope. Config: rate_limits: {aua:query: {requests_per_minute: 60}, aua:admin: {requests_per_minute: 10}}. Behavior options: reject, queue, warn, alert. Returns 429 with Retry-After header. Metrics tracked per scope. Audit logged when limits hit.
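
The reject behaviour can be sketched with an in-memory sliding window keyed by token and scope. The window algorithm is an assumption (the roadmap only specifies requests_per_minute and the 429 with Retry-After), and SlidingWindowLimiter is a hypothetical name; the returned retry_after value is what a router would place in the Retry-After header.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class SlidingWindowLimiter:
    def __init__(self, requests_per_minute: Dict[str, int],
                 window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limits = requests_per_minute
        self.window_s = window_s
        self.clock = clock  # injectable so tests do not need to sleep
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def check(self, token_id: str, scope: str) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request."""
        limit = self.limits.get(scope)
        if limit is None:
            return True, 0.0  # no limit configured for this scope
        now = self.clock()
        hits = self._hits[(token_id, scope)]
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= limit:
            # Time until the oldest request in the window expires.
            return False, self.window_s - (now - hits[0])
        hits.append(now)
        return True, 0.0
```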

sec
#22
CORS and network policy

CORS configurable via api.cors_origins (already in P-06/P-09). External endpoints explicit. Internal specialist ports not exposed externally by default. Admin endpoints protected. Extension endpoints protected. Unsafe settings (wildcard CORS + 0.0.0.0 + no auth) produce loud doctor warnings.

sec
#31
Encrypted internal values at rest

Encrypt at rest: correction payloads, assertion store entries, DPO pairs, plugin secrets, token metadata, sensitive audit fields. AES-256-GCM. Key managed per deployment. mTLS (#16) covers in-transit encryption. Config: security: {encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}}.

sec
#23–#30 · #32

Observability — OTEL, Prometheus, Grafana, logging, alerting, cost

OTEL pipeline first (Prometheus and Datadog export from it). Alerting after metrics. Cost after metrics.

v0.9-rc1
#26
Structured logging (moved up: needed by all other observability items)

JSON log emission. Fields: timestamp, level, message, session_id, trace_id, request_id, token_id, field, specialist, utility_score, confidence, contradiction, verdict, latency_ms, plugin, hook, middleware. Apply LoggingConfig from config at startup. ELK/Splunk compatible.

ops
#23
OpenTelemetry instrumentation

Instrument: query, routing, classification, specialist call, arbiter, correction loop, blue-green decision, plugin execution, hook execution, middleware execution. Trace propagation through all components. Span IDs in API response. Optional extra: pip install aua[otel].

ops
#28
Distributed tracing — trace propagation and UI link

Full W3C trace context propagation from router through all specialist calls. Trace ID in every API response header and body. Jaeger/Tempo exporter via OTEL collector. Debugger panel links to trace UI per request.

ops
#24
Prometheus metrics endpoint — GET /metrics

Metrics: aua_query_total, aua_query_latency_seconds, aua_utility_score, aua_contradiction_rate, aua_routing_field_distribution, aua_specialist_confidence, aua_bluegreen_traffic_split, aua_correction_count, aua_abstention_rate, aua_arbiter_verdict_distribution, aua_dpo_pairs_accumulated, aua_token_requests_total, aua_token_requests_rejected, aua_specialist_vram_utilization, aua_plugin_execution_seconds, aua_hook_failures_total.

ops
#25
Grafana dashboard — pre-built, shipped with package

Dashboard JSON in docker/grafana/. Panels: query volume, latency p50/p95/p99, routing distribution, utility score trends, contradiction rate, arbiter verdict distribution, specialist health, VRAM usage, blue-green split, auth failures, cost metrics, plugin latency.

ops
#27
Datadog integration — OTEL collector preset

OTEL collector config preset for Datadog exporter. Docs. Env var secret handling (DD_API_KEY). Doctor check for Datadog connectivity. Optional extra: pip install aua[datadog].

ops
#29
Prometheus alert rules — shipped with package

Alert rules: AUA_HighContradictionRate, AUA_SpecialistDown, AUA_LowUtilityScore, AUA_BlueGreenStalled, AUA_ArbiterCaseFourSpike, AUA_TokenRejectionSpike, AUA_PluginFailureSpike, AUA_HookFailureSpike. Alert rules YAML in docker/prometheus/alerts.yaml.

ops
#30
Webhook events — external system integration

Events: specialist_promoted, rollback_completed, contradiction_threshold_exceeded, arbiter_inconclusive_spike, specialist_down, plugin_failure, security_auth_failure_spike. Config: webhooks: {slack: {url_secret: SLACK_WEBHOOK_URL, events: [specialist_down, contradiction_threshold_exceeded]}}. Retry with backoff. Audit logged.

ops
#32
Cost tracking endpoint — GET /metrics/cost

Track: GPU hours per query, cost per specialist call, cost per arbiter call, cost per blue-green cycle, cost per plugin execution. Config: cost: {gpu_hour_rates: {rtx4090: 0.50, a100: 2.50}}. Exposed in GET /metrics/cost and aua status.

ops

v0.9-rc1 definition of done: aua token create --scope aua:query --expires 30d && aua certs generate && aua doctor --strict && curl http://localhost:8000/metrics && curl http://localhost:8000/metrics/cost all pass. Grafana dashboard loads in Docker Compose observability profile.

v0.9-rc2 — Evaluation + Chat UI

Evaluation harness before Chat UI because the UI's debugger panel references eval runs and DPO pairs. Chat Session API before Chat UI frontend (API-first).

E-01–E-03

Evaluation + correction export

Required by blue-green promotion demo, tutorial Part 6, and the UI's blue-green debug panel.

v0.9-rc2
E-01
Evaluation harness — aua eval CLI

aua eval run --dataset evals/coding.yaml --config aua_config.yaml, aua eval report .aua/evals/latest.json, aua eval compare --baseline blue --candidate green. Dataset format: YAML with cases (id, field, prompt, expected_properties). Runner routes each case through the framework, scores with utility function, checks expected properties. JSON output for CI. Regression detection. Wires up to POST /deploy/green harness.

ml
E-02
Correction / DPO export — aua corrections export

aua corrections export --format jsonl, aua dpo export --format preference-pairs. Output format: {prompt, chosen, rejected, field, utility_chosen, utility_rejected, correction_ids, trace_id}. Traceability to source queries. Safe redaction option. Not a training pipeline — just export. Users bring their own fine-tuning infrastructure.

ml
E-03
Built-in smoke eval datasets

evals/coding_smoke.yaml, math_smoke.yaml, routing_smoke.yaml, correction_smoke.yaml, arbiter_smoke.yaml, safety_smoke.yaml. Small but useful. Run by aua eval run. Used in CI where feasible (fake specialists). Used in tutorial Part 6. Used in blue-green promotion demo in examples/blue_green_demo/.

ml
U-01–U-02

Chat UI — session API + reference frontend

Chat Session API (U-01) before the frontend (U-02). Frontend consumes the API.

v0.9-rc2
U-01
#10B Chat Session API — session persistence endpoints

POST /sessions, GET /sessions, GET /sessions/{id}, DELETE /sessions/{id}, GET /sessions/{id}/messages, POST /sessions/{id}/messages, POST /sessions/{id}/stream. Session store backed by state store abstraction (F-05). Schema: chat_sessions, chat_messages, routing_traces tables. Chat request/response schema includes session_id, trace_id, routing, utility, debug fields. SSE streaming events: start, route, specialist_start, chunk, specialist_done, arbiter_start, arbiter_done, done, error.

code · ui
U-02
#10A Chat UI Reference App — React/Next.js framework debugger

Three-zone layout: session sidebar (left) + main chat window (center) + Framework Debugger panel (right). Left pop-up AUA Controls drawer exposes: preset, fields, models, routing thresholds, utility weights, arbiter policy, corrections, blue-green, backends, security, observability, extensions. Right debugger shows: route summary, field classifier output, specialist calls, candidate responses (optional), utility breakdown, arbiter decision, corrections injected, blue-green debug, latency breakdown, cost estimate, trace metadata, plugin/hook/middleware execution. Tech: React/Next.js + TypeScript + Tailwind + shadcn/ui. CLI: aua serve --with-ui, aua ui --port 3000. Config: ui: {enabled: true, host: ..., port: 3000, persist_sessions: true}. Auth required in production. Layout in apps/aua_chat/.

ui · code

v0.9-rc2 definition of done: aua serve --with-ui starts everything. Open http://localhost:3000, create a chat, ask a coding question, watch streaming response, open Framework Debugger, see route + utility + arbiter + trace, open AUA Controls drawer, change routing threshold, reload config, ask again, see behavior change.

v1.0.0 — Shipped

Final polish, documentation, release discipline. Everything else is implemented; this milestone makes it shippable.

D-01–D-04

Release readiness — compatibility, deployment profiles, examples, release docs

v1.0.0 — Shipped
D-01
Deployment profiles documentation

Four profiles documented in AUA_Framework_v1_Deployment_Profiles.md: (1) Local Developer — localhost only, auth optional, SQLite; (2) Single GPU Workstation — auth recommended, local file/SQLite state; (3) Team Server — auth required, mTLS required, Postgres state, Prometheus/Grafana; (4) Enterprise — custom backends, IAM, secrets manager, strict audit, runtime import disabled. Doctor validates profile-specific requirements.

doc
D-02
Compatibility matrix

Document in docs/compatibility.md: Python 3.10/3.11/3.12, CUDA versions, vLLM versions, Ollama versions, macOS (Ollama only), Linux (vLLM + GPU), GPU memory recommendations per tier, supported model formats (AWQ, GPTQ, fp16, GGUF), supported backends, browser support for UI, Docker support, Postgres versions, Prometheus/Grafana versions. Doctor references this matrix.

doc
D-03
Release engineering docs — CHANGELOG, RELEASE, MIGRATIONS, DEPRECATIONS

Create: CHANGELOG.md, RELEASE.md (release checklist: tests, lint, types, docs, tutorial, examples, Docker, fresh-clone, UI smoke, security profile, migration notes), MIGRATIONS.md, DEPRECATIONS.md. Versioning policy: semver, v1.x preserves public API, v2 may break plugin contracts with migration guide, deprecated features warn before removal.

doc
D-04
Final fresh-clone validation — v1.0 definition of done

All of the following must pass from a fresh clone: install, import, CLI (init, config, doctor, serve), REST (health, version, config, metrics), query, streaming, plugin test, security (token, certs), observability (metrics, cost, status). Chat UI smoke test. Tutorial Part 1 verified from scratch. All examples run. Docker Compose profiles verified. mypy, black, ruff, pytest all pass.

ops

v1.0.0 — shipped: The one-liner from objective_v1.md works — pip install adaptive-utility-agent && aua init --preset coding --tier single-4090 && aua serve — and the expert path works without editing any AUA source files.

v2.0+ — High Availability, Distributed, Advanced

Original roadmap items #33–#73. These remain planned but are post-v1: each builds on v1 interfaces and requires v1 to be stable first. Grouped by concern for clarity.

#33–#43

High-availability — consensus, leader election, service discovery, failover

v2.0
#33
KRaft consensus for coordinator election

Kafka KRaft-based coordinator for multi-node router/arbiter election without a ZooKeeper dependency.

#34
Leader election for router and arbiter

Active-standby leader election with automatic failover for router and arbiter replicas.

#35
Shared state via etcd

Distributed configuration and state via etcd. Enables multi-node deployments to share routing state.

#36
Service discovery via Consul

Specialists register in Consul. Router discovers and load-balances across specialist replicas automatically.

#37
Circuit breaker per specialist endpoint

Prevent cascade failures. Open circuit after threshold failures, half-open probe, close on recovery.
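The closed, open, and half-open cycle described above can be sketched as a small state machine. This is an illustrative sketch, not the planned AUA implementation: the class name, parameter names, and defaults are assumptions, and the half-open state here does not limit the number of concurrent probes.

```python
import time


class CircuitBreaker:
    """Closed -> open after `failure_threshold` consecutive failures;
    open -> half-open after `reset_timeout` seconds; half-open -> closed on success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        # Closed: always allow. Half-open: allow a probe. Open: reject.
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            # Failed probe: reopen and restart the timeout.
            self._opened_at = self._clock()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A router would keep one breaker per specialist endpoint, call `allow_request()` before dispatching, and report the outcome with `record_success()` or `record_failure()`.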

#38
Automatic failover with health-based routing

When primary specialist fails, route to healthy replica or fall back to arbiter with degraded-mode flag.
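The fallback order described above (primary, then a healthy replica, then the arbiter in degraded mode) can be expressed as a single selection function. This is a sketch: `Endpoint` and `choose_target` are hypothetical names, and health is reduced to a boolean that some health checker is assumed to maintain.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool


def choose_target(primary: Endpoint, replicas: list[Endpoint], arbiter: Endpoint):
    """Return (endpoint, degraded).

    Prefer the primary, then the first healthy replica. If neither is
    available, fall back to the arbiter and set the degraded-mode flag.
    """
    if primary.healthy:
        return primary, False
    for replica in replicas:
        if replica.healthy:
            return replica, False
    return arbiter, True
```

The degraded flag lets callers surface the condition in the response metadata or audit log rather than failing silently.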

#39
Retry logic with exponential backoff

Per-specialist retry policy. Configurable max retries, base delay, jitter. Audit logged.
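A retry policy with the three knobs named above (max retries, base delay, jitter) might look like the following. This is a sketch that uses the common "full jitter" variant of exponential backoff; the function names, the `max_delay` cap, and the `on_retry` audit hook are assumptions rather than AUA's API.

```python
import random
import time


def backoff_delay(attempt: int, base_delay: float, max_delay: float,
                  rng=random.random) -> float:
    """Full-jitter delay: uniform in [0, min(max_delay, base_delay * 2**attempt)]."""
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng() * ceiling


def call_with_retry(func, *, max_retries=3, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, on_retry=None):
    """Call func(); on an exception retry up to max_retries times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            delay = backoff_delay(attempt, base_delay, max_delay)
            if on_retry is not None:
                on_retry(attempt + 1, delay, exc)  # hook for audit logging
            sleep(delay)
```

Jitter spreads retries from many clients over time so that a recovering specialist is not hit by synchronized bursts. In practice the `except` clause would be narrowed to transient errors such as timeouts.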

#40
Kubernetes manifest generation

aua k8s generate --tier a100-cluster produces ready-to-apply manifests for all AUA components.

#41
Helm charts

Production-grade Helm chart with values files for each hardware tier and deployment profile.

#42
Backup and restore

Backup state store (corrections, promotions, audit, sessions) to S3/GCS/local. Point-in-time restore.

#43
HA config section in aua_config.yaml

Unified HA configuration block: replica counts, circuit breaker policy, leader election backend, service discovery backend.

#44–#55

Advanced platform features — multi-tenancy, caching, experiment tracking, batch, testing

v2.0
#44
Multi-tenancy — isolated namespaces per tenant

Tenant isolation: separate correction stores, promotion logs, audit logs, rate limits, and model bindings per tenant.

#45
Semantic caching via Redis

Cache specialist responses by semantic similarity. Configurable TTL per field. Cache-miss threshold. Intended to reduce latency for repeated or near-duplicate queries by skipping specialist inference on a hit.
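The lookup logic of a semantic cache can be shown without Redis. The sketch below is an in-memory stand-in with several assumptions: the caller supplies embeddings, "cache-miss threshold" is interpreted as a minimum cosine similarity below which a lookup counts as a miss, and `SemanticCache` with its parameters is a hypothetical name.

```python
import math
import time


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for a Redis-backed cache; linear scan, for illustration only."""

    def __init__(self, threshold=0.9, ttl_by_field=None, default_ttl=300.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.ttl_by_field = ttl_by_field or {}
        self.default_ttl = default_ttl
        self._clock = clock
        self._entries = []  # (field, embedding, response, expires_at)

    def put(self, field, embedding, response):
        ttl = self.ttl_by_field.get(field, self.default_ttl)
        self._entries.append((field, list(embedding), response, self._clock() + ttl))

    def get(self, field, embedding):
        now = self._clock()
        self._entries = [e for e in self._entries if e[3] > now]  # drop expired entries
        best, best_sim = None, self.threshold
        for entry_field, entry_emb, response, _ in self._entries:
            if entry_field != field:
                continue
            sim = cosine(embedding, entry_emb)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best  # None signals a cache miss
```

A production version would replace the linear scan with a vector index and let the store handle expiry; the threshold trades hit rate against the risk of returning an answer to a question that only looks similar.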

#46
Model registry integration — HuggingFace Hub, MLflow

Pull model versions from HuggingFace Hub or MLflow registry directly into specialist config. Version pinning.

#47
Experiment tracking — MLflow / Weights & Biases

Log utility scores, routing decisions, arbiter verdicts, and blue-green metrics to MLflow or W&B automatically.

#48
Shadow mode for blue-green

GREEN receives traffic but responses are not shown to users. Score and log GREEN silently. Promote when score threshold met.
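Shadow mode as described above amounts to running both variants and returning only BLUE's answer. The following sketch makes that concrete under assumptions: `handle_query` and `ready_to_promote` are hypothetical names, a GREEN failure is logged as a score of 0.0, and a minimum sample count is added so that a single good score cannot trigger promotion.

```python
def handle_query(query, blue, green, score, shadow_log):
    """Serve BLUE's answer; run GREEN on the same query and log its score only."""
    blue_answer = blue(query)
    try:
        green_answer = green(query)
        shadow_log.append(score(query, green_answer))
    except Exception:
        # A GREEN failure must never affect the user-facing response.
        shadow_log.append(0.0)
    return blue_answer


def ready_to_promote(shadow_log, threshold, min_samples=50):
    """Promote once enough shadow samples exist and their mean meets the threshold."""
    if len(shadow_log) < min_samples:
        return False
    return sum(shadow_log) / len(shadow_log) >= threshold
```

In a real deployment the GREEN call would run off the request path (for example in a background task) so that shadowing adds no user-visible latency.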

#49
Automated regression test suite

Full regression suite against the eval datasets. Run on every specialist promotion attempt. Block promotion on regression.

#50
Load testing preset — aua loadtest

Built-in load test command. Configurable concurrency, duration, query mix. Reports p50/p95/p99 latency and error rate.
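The reporting half of a load test is a small computation over recorded latencies. This sketch uses the nearest-rank percentile definition; `percentile` and `summarize` are hypothetical helper names, and the output shape is an assumption, not the format of `aua loadtest`.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, errors):
    """latencies_ms: latencies of successful requests; errors: count of failed requests."""
    ordered = sorted(latencies_ms)
    total = len(ordered) + errors
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "error_rate": errors / total,
    }
```

Reporting p95 and p99 alongside the median matters because tail latency, not the average, is what users of a multi-specialist pipeline tend to notice.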

#51
Extended plugin system — custom specialists and detectors

Plugin types beyond v1: custom contradiction detector, custom assertion store, custom routing strategy, custom scoring components.

#52
Extended middleware pipeline

Async middleware, streaming middleware (intercepts SSE chunks), batch middleware, multi-tenant middleware primitives.

#53
Custom utility function support — full replacement

Allow fully replacing the utility function (not just weights) via plugin. Supports non-linear utility models and field-specific scoring architectures.
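The difference between tuning weights and replacing the function can be illustrated with a callable protocol. Everything here is hypothetical: the v2 plugin contract is not defined in this roadmap, and the `UtilityFunction` protocol, the component names `quality` and `confidence`, and both example functions are assumptions made for the sketch.

```python
from typing import Mapping, Protocol


class UtilityFunction(Protocol):
    """Hypothetical plugin protocol: score components in, one utility value out."""

    def __call__(self, components: Mapping[str, float], field: str) -> float: ...


def weighted_sum(weights: Mapping[str, float]) -> UtilityFunction:
    """Linear utility: only the weights are configurable."""
    def utility(components: Mapping[str, float], field: str) -> float:
        return sum(weights.get(name, 0.0) * value for name, value in components.items())
    return utility


def confidence_gated(components: Mapping[str, float], field: str) -> float:
    """A non-linear replacement: confidence multiplies the score instead of adding to it,
    so a low-confidence answer cannot be rescued by a high quality estimate."""
    return components["confidence"] * (components["quality"] ** 0.5)
```

Both callables satisfy the same protocol, so a router written against `UtilityFunction` would accept either without changes; a field-specific architecture could branch on the `field` argument.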

#54
Built-in test harness — aua test

aua test command running the full test suite against running services (not fakes). Integration test harness with configurable query fixtures.

#55
Extended compatibility matrix — model × hardware × backend

Automated compatibility test matrix run in CI across model formats, hardware tiers, and backends. Published as docs and referenced by doctor.

#56–#73

Research & advanced — VCG, distillation, SDKs, GPU cluster, auto-generated docs

v2.x / Research
#56
Batch inference endpoint — production-grade

Persistent batch queue. Result polling. Priority lanes. Cost-optimized batching across specialists.

#57
Auto-setup on aua serve — model download integration

aua serve automatically downloads missing models from HuggingFace Hub before starting specialists.

#58
VCG arbitration mechanism — full implementation (whitepaper §S1)

Full Vickrey-Clarke-Groves mechanism for multi-specialist arbitration as described in the whitepaper theorems S1–S3.
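For orientation, the textbook VCG mechanism with Clarke pivot payments fits in a few lines. This sketch is the generic mechanism, not the whitepaper's formulation: how specialists report valuations and what "payment" means operationally in AUA are not specified here. The mechanism picks the outcome with the highest total reported value and charges each agent the welfare loss its presence imposes on the others.

```python
def vcg(valuations):
    """valuations: {agent: {outcome: value}}; outcomes missing for an agent count as 0.

    Returns (chosen_outcome, {agent: payment}) using the Clarke pivot rule:
    payment_i = max_o sum_{j != i} v_j(o) - sum_{j != i} v_j(chosen).
    """
    outcomes = set().union(*(v.keys() for v in valuations.values()))

    def welfare(outcome, exclude=None):
        return sum(v.get(outcome, 0.0) for agent, v in valuations.items()
                   if agent != exclude)

    # sorted() makes tie-breaking deterministic (first outcome in sort order wins).
    chosen = max(sorted(outcomes), key=welfare)
    payments = {}
    for agent in valuations:
        best_without_agent = max(welfare(o, exclude=agent) for o in outcomes)
        payments[agent] = best_without_agent - welfare(chosen, exclude=agent)
    return chosen, payments
```

When each specialist only values the outcome in which its own answer is selected, this reduces to a second-price auction: the highest-valued specialist wins and pays the second-highest reported value, which is what makes truthful reporting a dominant strategy.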

#59
Release-level distillation pipeline

Automated pipeline: collect DPO pairs → distill specialist → evaluate → blue-green candidate. End-to-end self-improvement loop.

#60
Hardware-adaptive graph decomposition

Dynamically partition the specialist graph across available hardware based on VRAM and latency constraints.

#61
Empirical check — Arbiter Stage 4 (external API integration)

Integrate PubMed, arXiv, SymPy as external ground truth sources for the Arbiter's empirical check (currently stubbed).

#62
JavaScript / TypeScript SDK

Node.js SDK with full TypeScript types. Mirrors Python API. SSE streaming support.

#63
Go SDK

Go client library with streaming support. Targeting platform engineers building Go-based AI infrastructure.

#64
Java SDK

Java client library. Enterprise-focused. Spring Boot integration example.

#65
Multi-node H100 cluster template

Tier template for 8×H100 cluster. NVLink, tensor parallelism, pipeline parallelism configuration.

#66
Tensor parallel specialist serving

vLLM tensor parallelism configuration for large models across multiple GPUs per specialist.

#67
MacBook Pro / gaming PC template — consumer hardware tier

Optimized Ollama configuration for consumer MacBooks and gaming PCs (RTX 3080/4080 class).

#68
Full tutorial rewrite — v2 framework edition

Updated tutorial covering HA deployment, multi-tenancy, advanced plugins, and distillation pipeline.

#69
Auto-generated API reference at /docs and /redoc

Already provided by FastAPI. Expand with richer examples, field explanations, and code samples in all SDK languages.

#70
Domain deep-dive pages — all 9 domains linked from tutorial

Per-domain documentation pages covering field-specific utility weights, prompt templates, arbiter policies, and example queries.

#71
SSRN paper update — link to v1 site and measured results

Update whitepaper with v1 framework results, production metrics, and framework architecture. Update site links.

#72
Example applications repository

Separate repo with more complete example applications: medical assistant, legal assistant, full-stack coding assistant, research agent.

#73
Changelog and versioning policy — public commitments

Public changelog and SemVer policy published on the framework site.


Summary — Full Item Count by Milestone

Milestone | Items | Theme | Gate
v0.6-alpha | P-01–P-12 (12 items) | Production-harden #01–#10 | Fresh-clone validation + CI passes
v0.7-beta | #11–#13, #05A–#05D, #14 (8 items) | Docker + Django config foundation | aua init --preset coding && docker-compose up
v0.8-framework-beta | F-01–F-17 (17 items) | Architecture, contracts, plugins, hooks, middleware | Custom plugin test passes; no AUA source edits required
v0.9-rc1 | #15–#22, #26, #23, #28, #24–#25, #27, #29–#32 (18 items) | Security + observability | Token + certs + Prometheus + Grafana all working
v0.9-rc2 | E-01–E-03, U-01–U-02 (5 items) | Evaluation + chat UI | aua serve --with-ui + debugger panel working end-to-end
v1.0.0 — Shipped | D-01–D-04 (4 items) | Release readiness | All v1.0 definition-of-done commands pass
v2.0+ | #33–#73 (41 items) | HA, multi-tenancy, SDKs, research, cluster | Post-v1; requires stable v1 first

Key ordering rationale:
Config strictness (P-06) before Docker (#11) because Docker needs correct config.
Model registry (#05C) before presets (#05B) because presets reference aliases.
Field registry (#05D) before config commands (#05A) because explain needs field knowledge.
Secrets (#19) before auth (#17) because auth tokens are secrets.
OTEL (#23) before Prometheus (#24) because Prometheus can export from OTEL.
State store (F-05) before Chat Session API (U-01) because sessions need storage.
Plugin interfaces (F-07) before import system (F-09) because the importer validates against the interfaces.
Error taxonomy (F-03) before security (#17–#22) because auth errors need stable codes.
Architecture spec (F-01) before all framework items because it defines component boundaries that prevent drift.
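The ordering constraints in the rationale above form a dependency graph, so a valid build order can be derived and checked mechanically with a topological sort. The sketch below uses the standard-library `graphlib` module. Two simplifications are assumptions of the example: "security (#17–#22)" is represented by #17 alone, and "all framework items" is limited to the F-items named in the rationale.

```python
from graphlib import TopologicalSorter

# item -> set of items that must be completed first
PREREQUISITES = {
    "#11": {"P-06"},         # Docker needs strict config
    "#05B": {"#05C"},        # presets reference model registry aliases
    "#05A": {"#05D"},        # config commands need the field registry
    "#17": {"#19", "F-03"},  # auth needs secrets and stable error codes
    "#24": {"#23"},          # Prometheus can export from OTEL
    "U-01": {"F-05"},        # chat sessions need the state store
    "F-09": {"F-07"},        # importer validates against plugin interfaces
    "F-03": {"F-01"},        # architecture spec precedes framework items
    "F-05": {"F-01"},
    "F-07": {"F-01"},
}


def respects(order, prerequisites):
    """True if every prerequisite appears before the item that depends on it."""
    position = {item: index for index, item in enumerate(order)}
    return all(position[dep] < position[item]
               for item, deps in prerequisites.items()
               for dep in deps)


# static_order() raises graphlib.CycleError if the constraints contradict each other.
ORDER = list(TopologicalSorter(PREREQUISITES).static_order())
```

Many orders satisfy the same constraints, so the useful check is `respects(ORDER, PREREQUISITES)` rather than a comparison against one fixed sequence.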