AUA Framework v1.1 — Release Roadmap & Validation Matrix

This page records the v1.0/v1.1 implementation contract, validation criteria, and future roadmap. Items marked done are included in the shipped releases (v1.0.0 and v1.1.0). Items marked wip or future are deferred to v2.0+. The goal: Django for adaptive multi-model LLM systems — batteries included, deeply configurable, extensible without editing AUA internals.

Source documents: aua_roadmap.html · AUA_Framework_v1_Roadmap.md · objective_v1.md · aua_roadmap_01_10_production_completeness.md · missing_from_v1.md · Front_end_for_UA_Framework_v1.md

✓ Status: v1.1.0 shipped.

All items marked done are included in v1.0.0/v1.1.0 and validated. v1.1.0 adds the complete AUA-Veritas backport (V-P1–V-P3), end-to-end session IDs (#15), live secrets-provider tests (#19), and YAML-wired plugins/hooks/middleware (F-09–F-11): 297 tests, 50+ REST endpoints, a 40-check live-router E2E suite, and every tutorial command verified as written. v1.0 record: docs/v1_validation_report.md (197 tests, 23 REST endpoints, 8 plugin protocols, 18 Prometheus metrics).

Reading guide. Items are ordered so each step unlocks the next with minimal rework. Original roadmap numbers (#01–#73) are preserved where they exist; new items introduced by the planning documents use prefixed IDs (P-, F-, E-, U-, D-). Items already implemented (v0.5 POC) are marked done. Items needing production hardening are marked wip.

73 original roadmap items
55+ new items from planning docs
7 release milestones
v1.0.0 released (production framework)

v0.6-alpha — #01–#10 Production-Complete

Items #01–#10 are production-complete in v1.0.0. This milestone records the hardening work completed: quality, contracts, correctness, and packaging. All items are shipped.

P-01–P-04

Repo hygiene & packaging

Must happen first — CI, installs, and all other work depend on this being clean.

Do first
P-01
Code formatting — black, isort, ruff

Run black aua tests and isort aua tests across the entire repo. Add [tool.black], [tool.isort], [tool.ruff], and [tool.mypy] sections to pyproject.toml. line-length=100, target-version=py310. This is a prerequisite for CI.

code
P-02
pyproject.toml — production-complete package metadata

Package name adaptive-utility-agent. Optional extras: pip install aua[vllm], aua[dev], aua[ui], aua[postgres], aua[otel]. vllm must be optional — it is not Mac-friendly. Include aua/tiers/*.yaml in wheel data. Add Makefile with install, test, lint, format, typecheck, build targets.

infra
P-03
aua/version.py — single version source + py.typed

Create aua/version.py containing __version__ = "1.0.0". All other references import from here. Create aua/py.typed marker file. Add version consistency test: importlib.metadata.version("adaptive-utility-agent") == aua.__version__.

code
P-04
CI workflow — .github/workflows/ci.yml

Matrix: Python 3.10, 3.11, 3.12. Steps: pip install -e ".[dev]" → ruff → black check → mypy → pytest. Must pass without GPUs using fake specialists. Also add wheel-build validation step.

infra
P-05–P-07

Public API alignment & config strictness

Config correctness unblocks Docker, tiers, presets, and all downstream work.

Do second
P-05
Public API alignment — aua/__init__.py

Export exact roadmap API: Arbiter = ArbiterAgent (alias), BlueGreenDeployment, CorrectionLoop. Expose all endpoint models from aua.endpoints. Stable __all__. Add import smoke tests in tests/test_imports.py. Add docs/public_api.md separating stable from internal APIs.

code
P-06
Config strictness — validation, host/scheme, runtime paths

Remove hardcoded localhost from endpoint construction. Add host, scheme, endpoint_path, endpoint_override fields to SpecialistConfig and ArbiterConfig. Add strict unknown-key validation (catches typos). Add duplicate port validation. Add threshold range validation. Add RuntimeConfig (.aua/logs, .aua/pids, .aua/state, .aua/checkpoints). Add APIConfig with cors_origins.

code
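The three checks named in P-06 (unknown keys, duplicate ports, threshold ranges) can be sketched as follows. This is a minimal illustration over a plain dict, not the AUA schema: `validate_config`, `ALLOWED_SPECIALIST_KEYS`, and the `confidence_threshold` key are hypothetical names, and the [0, 1] threshold range is an assumption.

```python
from typing import Any

# Hypothetical key set for illustration; the real SpecialistConfig may differ.
ALLOWED_SPECIALIST_KEYS = {
    "name", "model", "host", "scheme", "port",
    "endpoint_path", "endpoint_override", "confidence_threshold",
}


def validate_config(config: dict[str, Any]) -> list[str]:
    """Return human-readable validation errors (an empty list means valid)."""
    errors: list[str] = []
    seen_ports: dict[int, str] = {}
    for spec in config.get("specialists", []):
        name = spec.get("name", "<unnamed>")
        # Strict unknown-key validation catches typos such as "prot".
        for key in sorted(set(spec) - ALLOWED_SPECIALIST_KEYS):
            errors.append(f"{name}: unknown key '{key}'")
        # Duplicate port validation across specialists.
        port = spec.get("port")
        if port is not None:
            if port in seen_ports:
                errors.append(
                    f"{name}: port {port} already used by {seen_ports[port]}"
                )
            else:
                seen_ports[port] = name
        # Threshold range validation (assumed range: 0.0 to 1.0 inclusive).
        threshold = spec.get("confidence_threshold")
        if threshold is not None and not 0.0 <= threshold <= 1.0:
            errors.append(
                f"{name}: confidence_threshold {threshold} outside [0, 1]"
            )
    return errors
```

Collecting every error before reporting, rather than failing on the first, lets a command like aua doctor show all problems in one pass.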
P-07
Tier name alignment & aliases

Canonical tier names: macbook, single-4090, quad-4090, a100-cluster. Backward-compatible aliases: rtx4090 → single-4090, a100 → a100-cluster. Add quad-4090.yaml template. Every tier template must load in CI. Annotate generated YAML with inline comments.

code
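Alias resolution for the tier names above could look like this. The tier names and aliases come from P-07; the function name `resolve_tier` and the error wording are assumptions made for illustration.

```python
CANONICAL_TIERS = ("macbook", "single-4090", "quad-4090", "a100-cluster")
TIER_ALIASES = {"rtx4090": "single-4090", "a100": "a100-cluster"}


def resolve_tier(name: str) -> str:
    """Map a tier name or backward-compatible alias to its canonical name."""
    canonical = TIER_ALIASES.get(name, name)
    if canonical not in CANONICAL_TIERS:
        valid = ", ".join(CANONICAL_TIERS)
        raise ValueError(f"unknown tier '{name}'; expected one of: {valid}")
    return canonical
```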
P-08–P-12

Runtime hardening — serve, router, rollback, CLI, tests

Production lifecycle, API contracts, and test suite.

Do third
P-08
serve.py — process lifecycle hardening

SIGINT/SIGTERM handler that terminates child processes (TERM → 15s grace → KILL). Write .aua/pids/ and .aua/logs/ on startup. Async readiness polling per specialist. Port-conflict detection before start with --reuse-running flag. Deterministic --dry-run output (exit 0). Document foreground-only mode explicitly in help.

code
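The TERM, grace period, KILL sequence from P-08 can be sketched with the standard library alone. This is an illustrative helper, not the code in serve.py; `stop_child` is a hypothetical name, and the 15-second default mirrors the grace period stated above.

```python
import subprocess


def stop_child(proc: subprocess.Popen, grace_seconds: float = 15.0) -> int:
    """Send TERM, wait up to grace_seconds, then KILL. Returns the exit code."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()
```

A SIGINT/SIGTERM handler in the parent would call this once per recorded child PID before exiting.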
P-09
router.py — API contract hardening

Move CORS from wildcard to APIConfig.cors_origins. Add session_id (auto-generated UUID if not supplied) to every request/response. Add ErrorResponse model with stable error codes. Distinguish 503 (unreachable) from 504 (timeout). Add GET /version. Redact secrets from GET /config. Make POST /deploy/green honest (return dry_run_only until harness exists). Preserve batch result order with per-item index and ok fields.

code
P-10
rollback.py — durable atomic state

Move promotions log to .aua/state/promotions.jsonl with UUID per record. File locking via filelock to prevent concurrent rollback+deploy races. Atomic config writes: write to .tmp then os.replace(). Add --dry-run to rollback. Add POST /deploy/rollback REST endpoint or document CLI-only clearly.

code
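A minimal sketch of the two durability techniques in P-10: writing to a temporary file and swapping it in with os.replace(), and appending promotion records with a UUID per record. The function names are hypothetical, and the filelock-based locking mentioned above is deliberately left out to keep the sketch standard-library only.

```python
import json
import os
import uuid
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a sibling .tmp file, then os.replace() it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())  # flush to disk before the rename
    # Atomic when both paths live on the same filesystem.
    os.replace(tmp_path, path)


def append_promotion(log_path: Path, record: dict) -> str:
    """Append one JSON line carrying a fresh UUID; return that UUID."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record_id = str(uuid.uuid4())
    line = json.dumps({"id": record_id, **record})
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return record_id
```

Readers of the config file see either the old contents or the new contents, never a half-written file, which is the property a concurrent rollback and deploy need.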
P-11
CLI discipline — exit codes, flags, streaming robustness

All CLI commands: correct exit codes (0=pass, 1=fail, 2=warn-in-strict). Add --json to aua doctor and aua status. Add --strict to aua doctor. Add --once, --url, --refresh to aua status. Stream: add SSE named event fields (event: chunk), heartbeat comments (: keep-alive every 15s), client disconnect handling, robust data: [DONE] parser. Add Content-Encoding: none header on SSE routes to prevent gzip middleware.

code
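The client-side parsing requirements in P-11 (named events, heartbeat comments, a data: [DONE] sentinel) can be illustrated with a small line-based parser. This is a sketch under the usual Server-Sent Events framing rules, not the AUA client; `iter_sse_data` is a hypothetical name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from SSE lines, stopping at [DONE]."""
    event = "message"  # default SSE event name
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the accumulated event, if any.
            if data_parts:
                data = "\n".join(data_parts)
                if data.strip() == "[DONE]":
                    return
                yield event, data
            event, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment line, e.g. a ": keep-alive" heartbeat
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per SSE framing, one leading space after the colon is dropped.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # Tolerate a stream that ends without a trailing blank line.
    if data_parts and "\n".join(data_parts).strip() != "[DONE]":
        yield event, "\n".join(data_parts)
```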
P-12
Test suite — fake specialists, fixtures, contract tests

Create tests/fakes/openai_server.py: FastAPI fake with GET /v1/models, POST /v1/chat/completions (buffered + streaming). Create fixture configs: minimal, rtx4090, macbook, invalid_duplicate_ports, invalid_unknown_key, invalid_threshold. Test files: test_imports, test_config, test_cli_init, test_cli_doctor, test_router_api, test_streaming, test_status, test_rollback. CI must pass without GPUs.

code

v0.6-alpha definition of done: pip install -e ".[dev]" && python -c "from aua import Router, Arbiter, UtilityScorer, BlueGreenDeployment, CorrectionLoop; print('ok')" && aua init --tier macbook --force && aua doctor --strict && aua serve --dry-run && pytest -q && ruff check aua tests && black --check aua tests && mypy aua all pass from a fresh clone.

v0.7-beta — Docker + Django Config Foundation

Packaging is clean. Now add the deployment layer (#11–#13) and the Django-like configuration system (presets, model registry, field registry, aua config command group). These are ordered so the registry items come before the commands that reference them.

#11–#12

Docker + hardware tier templates

Run everything in one command. Depends on clean pyproject.toml (P-02) and correct tier names (P-07).

v0.7-beta
#11
Official Dockerfile + docker-compose

docker-compose up starts router, specialists, arbiter. docker-compose --profile gpu up for GPU variant. Profiles: cpu (Ollama/local), gpu (vLLM), observability (Prometheus + Grafana optional), secure. Health checks on all containers. Volume mounts for .aua/ state. Environment file support. Structured logs to stdout.

infra
#12
Hardware tier config templates — complete set

Canonical tiers: macbook (Ollama), single-4090 (RTX 4090, vLLM AWQ), quad-4090 (4× RTX 4090, multi-GPU), a100-cluster (A100 80GB, fp16). Each specifies specialists, arbiter, GPU assignment, memory split, ports, promotion thresholds, default models, observability defaults, security defaults. Annotated with inline YAML comments.

infra
#05C · #05D · #05B · #05A

Django config layer — registries, presets, config commands

Registries before presets. Presets before config commands. Config commands before tutorial.

v0.7-beta
#05C
Model registry — aliases, compatibility, aua models CLI

Built-in aliases: qwen-coder-7b-awq, qwen-math-7b-awq, qwen-14b-awq, llama3-8b, etc. Each entry: provider, full model ID, backend, quantization, recommended VRAM. User-defined registry entries in YAML. Hardware compatibility checks (AWQ + Ollama → error). CLI: aua models list, aua models inspect <alias>. Compact config uses aliases: model: qwen-coder-7b-awq.

code
#05D
Field registry — canonical IDs, aliases, aua fields CLI

Built-in fields: software_engineering, mathematics, general, research, medicine, law, finance. Each has aliases (swe, coding), description, default utility weights, default confidence threshold, default arbiter policy. User-defined custom fields. CLI: aua fields list, aua fields inspect <field>.

code
#05B
Preset system — built-in presets, aua presets CLI

Built-in presets: coding, math, research, generalist, medical-safe, legal-safe, local-ollama. Each preset selects fields, default models (via registry aliases), routing thresholds, utility weights, arbiter policy. Presets live in aua/defaults/presets/. Compact config: preset: coding expands to full config. CLI: aua presets list, aua presets inspect <name>. aua init --preset coding --tier single-4090 works end-to-end.

code
#05A
aua config command group — validate, expand, show, diff, explain

aua config validate (strict schema check), aua config expand (compact → full YAML printed to stdout), aua config show (current loaded config, secrets redacted), aua config diff config_a.yaml config_b.yaml, aua config explain routing.fanout_threshold (human-readable explanation with default, range, higher/lower meaning). JSON output option on all commands. Config schema registry with explanation strings for every key.

code
#13 · #14

Hot reload + tutorial rewrite

Tutorial written last so it reflects the final config system and presets.

v0.7-beta
#13
Hot reload for aua_config.yaml

SIGHUP triggers reload. aua config reload CLI command. Hot-reloadable without restart: routing thresholds, utility weights, promotion thresholds, logging level, rate limits, CORS origins, webhook config. Partial restart required for: new specialist, changed model, changed backend, changed mTLS certs. Reload is atomic — validate new config before applying.

code
#14
Tutorial rewrite — Django-style, 12 parts

Replaces tutorial.html. Do not start with theory. Part 1: quickstart in <10 min. Parts 2–12 progressively teach: models & fields, routing & utility, arbiter & corrections, blue-green, plugins, hooks/middleware, security, observability, Docker deployment, expert deployment (full custom config without editing AUA internals). Tutorial verifies commands from each section actually run.

doc

v0.7-beta definition of done: docker-compose up works. aua init --preset coding --tier single-4090 && aua config validate && aua config expand && aua models list && aua fields list && aua presets list && aua serve --dry-run all pass. Tutorial Part 1 tested from scratch.

v0.8-framework-beta — Django-Like Framework UX

The configuration layer exists. Now add the extensibility layer: architecture spec, contracts, error taxonomy, state abstraction, plugin system, hooks, middleware. These are ordered so lower-level items (architecture, contracts, errors, state) come before the systems that depend on them (plugins, hooks, middleware).

F-01–F-06

Framework discipline — architecture, contracts, errors, state, config versioning

These define the contracts everything else implements. Must come before plugins and hooks.

v0.8-framework
F-01
Architecture spec — AUA_Framework_v1_Architecture.md

Defines: runtime architecture, full request lifecycle (Input → Middleware → Session → Correction Retrieval → Classifier → Routing → Specialist Calls → Utility Scoring → Arbiter → Hooks → Correction Logging → Response → Metrics/Logs/Traces/Audit), component boundaries, component ownership, plugin loading lifecycle, hook/middleware execution order, observability flow, security boundary. This doc prevents implementation drift.

doc
F-02
Stable public API contracts — plugin interfaces and compatibility guarantee

Define formal Python protocols for all extension points: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin, ModelBackendPlugin, HookPlugin, AUAMiddleware. Create docs/public_api.md. Guarantee: v1.x will not break public import paths or plugin method signatures. Deprecated APIs get one minor release of warning.

doc · code
F-03
Error taxonomy — stable AUA_* error codes

Define stable error code list: AUA_CONFIG_INVALID, AUA_BACKEND_UNREACHABLE, AUA_SPECIALIST_TIMEOUT, AUA_ARBITER_TIMEOUT, AUA_PLUGIN_LOAD_FAILED, AUA_PLUGIN_CONTRACT_INVALID, AUA_HOOK_FAILED, AUA_MIDDLEWARE_FAILED, AUA_AUTH_REQUIRED, AUA_FORBIDDEN, AUA_RATE_LIMITED, AUA_PROMOTION_REJECTED, AUA_ROLLBACK_FAILED, AUA_STATE_STORE_UNAVAILABLE, etc. Map each code to HTTP status and CLI exit code. REST error format: {"error": "AUA_*", "message": "...", "trace_id": "...", "details": {}}. Create AUA_Framework_v1_Error_Codes.md.

code · doc
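One way to tie error codes to HTTP statuses and to the REST error format above is a single exception type. This is a sketch only: the `AUAError` class and most status assignments are assumptions. The roadmap itself fixes only 503 for unreachable backends, 504 for timeouts (P-09), and 429 for rate limits (#21).

```python
import uuid
from typing import Any, Optional

# Assumed mapping for illustration; only 503, 504, and 429 are stated
# in the roadmap. Unlisted codes fall back to 500 in this sketch.
HTTP_STATUS = {
    "AUA_BACKEND_UNREACHABLE": 503,
    "AUA_SPECIALIST_TIMEOUT": 504,
    "AUA_ARBITER_TIMEOUT": 504,
    "AUA_RATE_LIMITED": 429,
    "AUA_AUTH_REQUIRED": 401,
    "AUA_FORBIDDEN": 403,
}


class AUAError(Exception):
    """Hypothetical exception carrying a stable AUA_* code."""

    def __init__(
        self,
        code: str,
        message: str,
        details: Optional[dict[str, Any]] = None,
        trace_id: Optional[str] = None,
    ) -> None:
        super().__init__(message)
        self.code = code
        self.message = message
        self.details = details or {}
        self.trace_id = trace_id or str(uuid.uuid4())

    @property
    def http_status(self) -> int:
        return HTTP_STATUS.get(self.code, 500)

    def to_response(self) -> dict[str, Any]:
        """Render the REST error body described in F-03."""
        return {
            "error": self.code,
            "message": self.message,
            "trace_id": self.trace_id,
            "details": self.details,
        }
```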
F-04
Permission/scope matrix — fine-grained API scopes

Define scopes: aua:query, aua:stream, aua:status, aua:config:read, aua:config:write, aua:corrections:read, aua:corrections:write, aua:deploy, aua:rollback, aua:extensions:read, aua:extensions:write, aua:tokens:read, aua:tokens:write, aua:admin. Map every endpoint to required scope. Extension/runtime import endpoints require aua:extensions:write or aua:admin. Create scope matrix table in docs.

sec · doc
F-05
State store abstraction — SQLite → Postgres, pluggable

Define state store interface. State categories: chat sessions, corrections, promotion logs, audit logs, DPO pairs, config snapshots, token metadata, cost records, eval runs, routing traces. Implementations: local files (default), SQLite (local dev), Postgres (production). Config: state: {backend: sqlite, path: .aua/state/aua.db}. Migration support. Doctor checks. Removes the scattered .jsonl files in favor of structured storage.

code
F-06
Config versioning + migration CLI

Add config_version: "1.0" field to YAML. CLI: aua config check-version, aua config migrate --from 0.6 --to 1.0, aua config migrate --dry-run. Clear error for unsupported config versions. Migration tests against old config fixtures. Required before v1.0 to support users upgrading from v0.x.

code
F-07–F-13

Extension system — plugins, backends, hooks, middleware, extension CLI

Ordered: plugin interfaces → backend plugins → import system → hooks → middleware → runtime API → extension CLI.

v0.8-framework
F-07
Plugin interfaces — formal protocols for all plugin types

Define base protocols in aua/plugins/: FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin. Each has: required method signatures, type hints, docstrings, example implementations, test harness. Version compatibility guarantees documented.

code
F-08
Model backend plugin system

ModelBackendPlugin protocol: complete(request) -> dict, stream(request) -> AsyncIterator, health() -> dict. Built-in backends: vLLM (existing), Ollama (existing), OpenAI-compatible generic. YAML registration: backends: {my_gateway: {plugin: my_company.backends:GatewayBackend, base_url: ..., auth_secret: ...}}. Doctor checks backend health for all registered backends.

code
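The ModelBackendPlugin protocol can be expressed with typing.Protocol. The three method names and return shapes come from F-08; making complete and health coroutines, using dict requests, and the `EchoBackend` class are assumptions for illustration.

```python
from typing import Any, AsyncIterator, Protocol, runtime_checkable


@runtime_checkable
class ModelBackendPlugin(Protocol):
    """Sketch of the backend contract; async-ness is an assumption."""

    async def complete(self, request: dict[str, Any]) -> dict[str, Any]: ...

    def stream(self, request: dict[str, Any]) -> AsyncIterator[str]: ...

    async def health(self) -> dict[str, Any]: ...


class EchoBackend:
    """Hypothetical backend that echoes the prompt back."""

    async def complete(self, request: dict[str, Any]) -> dict[str, Any]:
        return {"text": request["prompt"]}

    async def stream(self, request: dict[str, Any]) -> AsyncIterator[str]:
        # An async generator: calling it returns an AsyncIterator.
        for word in request["prompt"].split():
            yield word

    async def health(self) -> dict[str, Any]:
        return {"ok": True}
```

Note that isinstance() against a runtime_checkable protocol only checks that the method names exist, not their signatures, so a load-time validator would still need to inspect parameters separately.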
F-09
Extension import system — import_path syntax, config injection

YAML syntax: import_path: plugins.custom_utility:RiskWeightedUtilityScorer. Config injection: config: {risk_weight: 0.7} passed to constructor. Strict interface validation on load (not on import). Clear error messages on mismatch. Doctor checks all import paths. Reload support where safe. Allowlisted paths only in production mode. Users never edit AUA source files.

code
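The import_path syntax (module:attribute) with constructor config injection can be sketched using importlib. This is an illustration, not the AUA loader: `load_extension` and its `required_methods` parameter are hypothetical, and the interface check here only confirms that the named methods are callable.

```python
import importlib
from typing import Any, Optional


def load_extension(
    import_path: str,
    config: Optional[dict[str, Any]] = None,
    required_methods: tuple[str, ...] = (),
) -> Any:
    """Resolve 'package.module:ClassName', build it, and validate it."""
    module_name, sep, attr = import_path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(
            f"import_path must look like 'module:Name', got {import_path!r}"
        )
    module = importlib.import_module(module_name)
    try:
        factory = getattr(module, attr)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {attr!r}") from exc
    # Config injection: YAML 'config' keys become constructor keyword args.
    instance = factory(**(config or {}))
    # Interface validation happens on load, after the object is constructed.
    missing = [
        name for name in required_methods
        if not callable(getattr(instance, name, None))
    ]
    if missing:
        raise TypeError(f"{import_path} is missing methods: {', '.join(missing)}")
    return instance
```

For example, a YAML entry with import_path: plugins.custom_utility:RiskWeightedUtilityScorer and config: {risk_weight: 0.7} would correspond to calling load_extension with that path and that dict.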
F-10
Hook system — lifecycle hooks with ordering and timeout policy

Hooks: pre_query, post_query, pre_route, post_route, pre_specialist_call, post_specialist_call, pre_arbiter, post_arbiter, pre_response, post_response, on_correction, on_promotion, on_rollback. Hook interface: async def __call__(self, event: dict) -> dict. Configurable ordering. Error handling policy (fail-open vs fail-closed). Timeout policy per hook. Audit logged. Metrics tracked. YAML registration under hooks:.

code
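Ordered hook execution with a per-hook timeout and a fail-open or fail-closed policy can be sketched with asyncio. The hook signature (event dict in, event dict out) comes from F-10; `run_hooks`, `HookFailed`, and the tuple layout are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable

Hook = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


class HookFailed(RuntimeError):
    """Raised when a fail-closed hook errors or times out."""


async def run_hooks(
    hooks: list[tuple[Hook, float, bool]],
    event: dict[str, Any],
) -> dict[str, Any]:
    """Run hooks in list order; each entry is (hook, timeout_s, fail_open)."""
    for hook, timeout_seconds, fail_open in hooks:
        try:
            # Pass a copy so a failing hook cannot half-mutate the event.
            event = await asyncio.wait_for(
                hook(dict(event)), timeout=timeout_seconds
            )
        except Exception as exc:  # includes asyncio.TimeoutError
            if not fail_open:
                name = getattr(hook, "__name__", repr(hook))
                raise HookFailed(f"hook {name} failed: {exc!r}") from exc
            # Fail-open: keep the last good event and move on.
    return event
```

A fail-open hook that times out leaves the event untouched, while a fail-closed hook aborts the request, which is where an AUA_HOOK_FAILED error would be raised.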
F-11
Middleware system — ordered pipeline, before/after, short-circuit

Interface: before_query(request) -> request, after_response(response) -> response. Ordered execution (list order in YAML). Short-circuit support (return early without calling downstream). Error behavior: configurable (skip, fail). Metrics per middleware. Built-in examples: PIIRedactionMiddleware, TenantPolicyMiddleware, AuditMiddleware. YAML: middleware: [plugins.middleware:PIIRedactionMiddleware].

code
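An ordered before/after pipeline with short-circuit support might be structured as below. The before_query and after_response method names come from F-11; the `ShortCircuit` marker, the `run_pipeline` function, the two example middleware classes, and the choice to run after_response in reverse order are all assumptions for illustration.

```python
from typing import Any, Callable


class ShortCircuit:
    """Hypothetical marker returned by before_query to skip downstream."""

    def __init__(self, response: dict[str, Any]) -> None:
        self.response = response


def run_pipeline(
    middleware: list[Any],
    handler: Callable[[dict], dict],
    request: dict,
) -> dict:
    """Run before_query in list order, then after_response in reverse."""
    entered: list[Any] = []
    response = None
    for mw in middleware:
        result = mw.before_query(request)
        entered.append(mw)
        if isinstance(result, ShortCircuit):
            response = result.response  # later middleware and handler skipped
            break
        request = result
    if response is None:
        response = handler(request)
    # Only middleware that already ran before_query sees the response.
    for mw in reversed(entered):
        response = mw.after_response(response)
    return response


class BlockEmptyMiddleware:
    """Example: answer empty prompts without calling any specialist."""

    def before_query(self, request):
        if not request["prompt"].strip():
            return ShortCircuit({"text": "empty prompt", "seen_by": []})
        return request

    def after_response(self, response):
        return {**response, "seen_by": response.get("seen_by", []) + ["block"]}


class UppercaseMiddleware:
    """Example: rewrite the request before it reaches the handler."""

    def before_query(self, request):
        return {**request, "prompt": request["prompt"].upper()}

    def after_response(self, response):
        return {**response, "seen_by": response.get("seen_by", []) + ["upper"]}
```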
F-12
Extension runtime API — GET/POST /extensions

GET /extensions, GET /extensions/{name}, POST /extensions/test, POST /extensions/reload. Disabled by default in production. Requires aua:extensions:write scope. Only loads allowlisted paths. Audit logged. Warns loudly if bound to 0.0.0.0. Development-only: POST /extensions/import.

code · sec
F-13
Extension CLI — aua extensions commands

aua extensions list, aua extensions inspect <name>, aua extensions test --kind utility_scorer --import-path plugins.custom_utility:RiskWeightedUtilityScorer, aua extensions reload, aua plugins doctor. Also: management command system stub: aua run <command> dispatching to import_path registered in YAML.

code
F-14–F-17

Framework defaults — prompt templates, safety policy, defaults registry

Prompt templates and safety policy before presets reference them.

v0.8-framework
F-14
Prompt template system — versioned, field-specific, config override

Built-in templates in aua/templates/prompts/: classifier_v1.txt, arbiter_balanced_v1.txt, arbiter_conservative_v1.txt, correction_injection_v1.txt, abstention_v1.txt, medical_safe_v1.txt, legal_safe_v1.txt. Config: prompts: {arbiter_template: arbiter_conservative_v1}. Advanced: arbiter_template_path: ./prompts/custom.txt. Template registry with version tracking. Prompts are framework behavior, not implementation details.

code
F-15
Safety/abstention policy — high-risk field gating

Config: safety: {abstention_enabled: true, high_risk_fields: [medicine, law, finance], require_arbiter_for_high_risk: true, min_confidence_for_direct_answer: 0.90}. Behavior: low confidence → abstain or clarify. High-risk field → force arbiter. Contradiction detected → do not answer confidently. Required by medical-safe and legal-safe presets. Abstention response model in endpoints.py.

code · sec
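The gating behavior in F-15 can be read as a small decision function. The config keys and defaults mirror the safety block above; the `decide` function, its return labels, and the order in which the rules are checked are assumptions, since the roadmap does not state a precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyPolicy:
    abstention_enabled: bool = True
    high_risk_fields: frozenset = frozenset({"medicine", "law", "finance"})
    require_arbiter_for_high_risk: bool = True
    min_confidence_for_direct_answer: float = 0.90


def decide(
    policy: SafetyPolicy,
    field_name: str,
    confidence: float,
    contradiction: bool = False,
) -> str:
    """Return 'abstain', 'arbiter', or 'direct' (assumed labels)."""
    fallback = "abstain" if policy.abstention_enabled else "arbiter"
    if contradiction:
        # Contradiction detected: do not answer confidently.
        return fallback
    if policy.require_arbiter_for_high_risk and field_name in policy.high_risk_fields:
        # High-risk field: force the arbiter regardless of confidence.
        return "arbiter"
    if confidence < policy.min_confidence_for_direct_answer:
        # Low confidence: abstain (or escalate when abstention is disabled).
        return fallback
    return "direct"
```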
F-16
Defaults registry — aua/defaults/ + aua defaults CLI

aua/defaults/: presets/, models.yaml, fields.yaml, utility.yaml, routing.yaml, security.yaml, prompts.yaml. CLI: aua defaults show, aua defaults show preset coding, aua defaults show models. Makes batteries-included defaults inspectable and avoids hidden magic. Underpins aua config expand.

code
F-17
Built-in example projects — examples/ directory

Required examples: quickstart_coding/, custom_utility_plugin/, custom_field_classifier/, custom_backend/, blue_green_demo/, secure_docker_deploy/. Each: aua_config.yaml, README.md, sample curl commands, expected output, troubleshooting. Examples validate the plugin system, docs, and API simultaneously. Users copy these before reading advanced docs.

doc

v0.8-framework-beta definition of done: aua extensions test --kind utility_scorer --import-path plugins.example:ExampleUtilityScorer passes. aua defaults show and aua models list work. Custom middleware and hook examples in examples/ run. Architecture spec exists and is accurate.

v0.9-rc1 — Security + Observability

Security items ordered so the foundation (session IDs, scopes) comes before the features that use them (audit log, rate limiting, mTLS). Observability ordered so the pipeline (OTEL) precedes the outputs (Prometheus, Grafana, Datadog).

#15–#22

Security foundation — identity, auth, audit, rate limiting

Session IDs first (everything else references them). Auth before audit. Secrets before mTLS certs.

v0.9-rc1
#15
Session IDs — end-to-end query tracking

Every query gets session_id, trace_id, request_id. Propagated through: router, field classifier, specialist calls, arbiter, correction loop, response, logs, metrics, audit log. Returned in every API response. Generated UUID if not supplied by client. Context available to all hooks and middleware.

sec
#19
Secrets management — no plaintext in config (moved up: auth depends on this)

Config references secret names, not values. Supported: environment variables, local encrypted file, HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager. YAML: secrets: {provider: env}, api_key_secret: OPENAI_API_KEY. Doctor checks secret resolution. Redaction in GET /config and logs.

sec
#17
IAM and bearer token authentication

External endpoints require auth. JWT or signed token support. Scope validation per endpoint (using permission matrix from F-04). Token ID in logs and audit. Admin endpoints strongly protected. Extension endpoints require aua:extensions:write. Local dev mode can opt out of auth with explicit warning.

sec
#18
Token management CLI

aua token create --scope aua:query --expires 30d, aua token list, aua token revoke <token-id>, aua token inspect <token-id>. Token metadata stored in state store (F-05). Revocation list checked on every request.

sec
#20
Audit log — append-only, tamper-evident, hash chain

Events: query, routing decision, arbiter verdict, correction injection, config reload, token create/revoke, promotion, rollback, extension load, hook failure, auth failure. Required fields: timestamp, session_id, trace_id, token_id, event_type, field, specialist, utility_score, confidence, latency_ms, hash_chain_previous, hash_chain_current. Append-only to state store. Hash chain provides tamper evidence.

sec
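The tamper evidence provided by the hash chain can be sketched with SHA-256. The hash_chain_previous and hash_chain_current field names come from #20; the all-zero genesis value, the canonical JSON encoding, and the function names are assumptions, and the in-memory list stands in for the state store.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed sentinel for the first record
CHAIN_KEYS = ("hash_chain_previous", "hash_chain_current")


def _digest(previous_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous record."""
    previous = log[-1]["hash_chain_current"] if log else GENESIS_HASH
    record = {
        **event,
        "hash_chain_previous": previous,
        "hash_chain_current": _digest(previous, event),
    }
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every link; any edited or reordered record breaks it."""
    previous = GENESIS_HASH
    for record in log:
        event = {k: v for k, v in record.items() if k not in CHAIN_KEYS}
        if record["hash_chain_previous"] != previous:
            return False
        if record["hash_chain_current"] != _digest(previous, event):
            return False
        previous = record["hash_chain_current"]
    return True
```

Because each record's hash covers the previous hash, changing one historical record invalidates every record after it, which is what makes silent edits detectable.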
#16
mTLS — encrypted internal communication

Router ↔ specialists, router ↔ arbiter, arbiter ↔ correction loop. Config: security: {mtls: {enabled: true, cert_dir: .aua/certs, auto_generate: true}}. CLI: aua certs generate, aua certs rotate. Doctor check: aua doctor --check-certs. Auto-generated dev certs with warning. Production: bring your own CA.

sec
#21
Rate limiting and quota management

Configurable per token/scope. Config: rate_limits: {aua:query: {requests_per_minute: 60}, aua:admin: {requests_per_minute: 10}}. Behavior options: reject, queue, warn, alert. Returns 429 with Retry-After header. Metrics tracked per scope. Audit logged when limits hit.

sec
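A per-scope limit with a Retry-After value can be illustrated with a fixed-window counter. This is a sketch of the reject behavior only, assuming a fixed 60-second window; the roadmap does not say which algorithm AUA uses, and `FixedWindowLimiter` is a hypothetical name.

```python
import math
import time
from typing import Callable


class FixedWindowLimiter:
    """Allow at most requests_per_minute calls per key per 60 s window."""

    def __init__(
        self,
        requests_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = requests_per_minute
        self.clock = clock  # injectable so tests can control time
        self.windows: dict[str, tuple[float, int]] = {}  # key -> (start, count)

    def check(self, key: str) -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds); 0 seconds when allowed."""
        now = self.clock()
        start, count = self.windows.get(key, (now, 0))
        if now - start >= 60.0:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            self.windows[key] = (start, count)
            # Seconds until the window resets, for the Retry-After header.
            return False, max(1, math.ceil(start + 60.0 - now))
        self.windows[key] = (start, count + 1)
        return True, 0
```

When check() returns (False, n), the router would respond with 429 and Retry-After: n.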
#22
CORS and network policy

CORS configurable via api.cors_origins (already in P-06/P-09). External endpoints explicit. Internal specialist ports not exposed externally by default. Admin endpoints protected. Extension endpoints protected. Unsafe settings (wildcard CORS + 0.0.0.0 + no auth) produce loud doctor warnings.

sec
#31
Encrypted internal values at rest

Encrypt at rest: correction payloads, assertion store entries, DPO pairs, plugin secrets, token metadata, sensitive audit fields. AES-256-GCM. Key managed per deployment. mTLS (#16) covers in-transit encryption. Config: security: {encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}}.

sec
#23–#30 · #32

Observability — OTEL, Prometheus, Grafana, logging, alerting, cost

OTEL pipeline first (Prometheus and Datadog export from it). Alerting after metrics. Cost after metrics.

v0.9-rc1
#26
Structured logging (moved up: needed by all other observability items)

JSON log emission. Fields: timestamp, level, message, session_id, trace_id, request_id, token_id, field, specialist, utility_score, confidence, contradiction, verdict, latency_ms, plugin, hook, middleware. Apply LoggingConfig from config at startup. ELK/Splunk compatible.

ops
#23
OpenTelemetry instrumentation

Instrument: query, routing, classification, specialist call, arbiter, correction loop, blue-green decision, plugin execution, hook execution, middleware execution. Trace propagation through all components. Span IDs in API response. Optional extra: pip install aua[otel].

ops
#28
Distributed tracing — trace propagation and UI link

Full W3C trace context propagation from router through all specialist calls. Trace ID in every API response header and body. Jaeger/Tempo exporter via OTEL collector. Debugger panel links to trace UI per request.

ops
#24
Prometheus metrics endpoint — GET /metrics

Metrics: aua_query_total, aua_query_latency_seconds, aua_utility_score, aua_contradiction_rate, aua_routing_field_distribution, aua_specialist_confidence, aua_bluegreen_traffic_split, aua_correction_count, aua_abstention_rate, aua_arbiter_verdict_distribution, aua_dpo_pairs_accumulated, aua_token_requests_total, aua_token_requests_rejected, aua_specialist_vram_utilization, aua_plugin_execution_seconds, aua_hook_failures_total.

ops
#25
Grafana dashboard — pre-built, shipped with package

Dashboard JSON in docker/grafana/. Panels: query volume, latency p50/p95/p99, routing distribution, utility score trends, contradiction rate, arbiter verdict distribution, specialist health, VRAM usage, blue-green split, auth failures, cost metrics, plugin latency.

ops
#27
Datadog integration — OTEL collector preset

OTEL collector config preset for Datadog exporter. Docs. Env var secret handling (DD_API_KEY). Doctor check for Datadog connectivity. Optional extra: pip install aua[datadog].

ops
#29
Prometheus alert rules — shipped with package

Alert rules: AUA_HighContradictionRate, AUA_SpecialistDown, AUA_LowUtilityScore, AUA_BlueGreenStalled, AUA_ArbiterCaseFourSpike, AUA_TokenRejectionSpike, AUA_PluginFailureSpike, AUA_HookFailureSpike. Alert rules YAML in docker/prometheus/alerts.yaml.

ops
#30
Webhook events — external system integration

Events: specialist_promoted, rollback_completed, contradiction_threshold_exceeded, arbiter_inconclusive_spike, specialist_down, plugin_failure, security_auth_failure_spike. Config: webhooks: {slack: {url_secret: SLACK_WEBHOOK_URL, events: [specialist_down, contradiction_threshold_exceeded]}}. Retry with backoff. Audit logged.

ops
#32
Cost tracking endpoint — GET /metrics/cost

Track: GPU hours per query, cost per specialist call, cost per arbiter call, cost per blue-green cycle, cost per plugin execution. Config: cost: {gpu_hour_rates: {rtx4090: 0.50, a100: 2.50}}. Exposed in GET /metrics/cost and aua status.

ops

v0.9-rc1 definition of done: aua token create --scope aua:query --expires 30d && aua certs generate && aua doctor --strict && curl http://localhost:8000/metrics && curl http://localhost:8000/metrics/cost all pass. Grafana dashboard loads in Docker Compose observability profile.

v0.9-rc2 — Evaluation + Chat UI

Evaluation harness before Chat UI because the UI's debugger panel references eval runs and DPO pairs. Chat Session API before Chat UI frontend (API-first).

E-01–E-03

Evaluation + correction export

Required by blue-green promotion demo, tutorial Part 6, and the UI's blue-green debug panel.

v0.9-rc2
E-01
Evaluation harness — aua eval CLI

aua eval run --dataset evals/coding.yaml --config aua_config.yaml, aua eval report .aua/evals/latest.json, aua eval compare --baseline blue --candidate green. Dataset format: YAML with cases (id, field, prompt, expected_properties). Runner routes each case through the framework, scores with utility function, checks expected properties. JSON output for CI. Regression detection. Wires up to POST /deploy/green harness.

ml
E-02
Correction / DPO export — aua corrections export

aua corrections export --format jsonl, aua dpo export --format preference-pairs. Output format: {prompt, chosen, rejected, field, utility_chosen, utility_rejected, correction_ids, trace_id}. Traceability to source queries. Safe redaction option. Not a training pipeline — just export. Users bring their own fine-tuning infrastructure.

ml
E-03
Built-in smoke eval datasets

evals/coding_smoke.yaml, math_smoke.yaml, routing_smoke.yaml, correction_smoke.yaml, arbiter_smoke.yaml, safety_smoke.yaml. Small but useful. Run by aua eval run. Used in CI where feasible (fake specialists). Used in tutorial Part 6. Used in blue-green promotion demo in examples/blue_green_demo/.

ml
U-01–U-02

Chat UI — session API + reference frontend

Chat Session API (U-01) before the frontend (U-02). Frontend consumes the API.

v0.9-rc2
U-01
#10B Chat Session API — session persistence endpoints

POST /sessions, GET /sessions, GET /sessions/{id}, DELETE /sessions/{id}, GET /sessions/{id}/messages, POST /sessions/{id}/messages, POST /sessions/{id}/stream. Session store backed by state store abstraction (F-05). Schema: chat_sessions, chat_messages, routing_traces tables. Chat request/response schema includes session_id, trace_id, routing, utility, debug fields. SSE streaming events: start, route, specialist_start, chunk, specialist_done, arbiter_start, arbiter_done, done, error.

code · ui
U-02
#10A Chat UI Reference App — React/Next.js framework debugger

Three-zone layout: session sidebar (left) + main chat window (center) + Framework Debugger panel (right). Left pop-up AUA Controls drawer exposes: preset, fields, models, routing thresholds, utility weights, arbiter policy, corrections, blue-green, backends, security, observability, extensions. Right debugger shows: route summary, field classifier output, specialist calls, candidate responses (optional), utility breakdown, arbiter decision, corrections injected, blue-green debug, latency breakdown, cost estimate, trace metadata, plugin/hook/middleware execution. Tech: React/Next.js + TypeScript + Tailwind + shadcn/ui. CLI: aua serve --with-ui, aua ui --port 3000. Config: ui: {enabled: true, host: ..., port: 3000, persist_sessions: true}. Auth required in production. Layout in apps/aua_chat/.

ui · code

v0.9-rc2 definition of done: aua serve --with-ui starts everything. Open http://localhost:3000, create a chat, ask a coding question, watch streaming response, open Framework Debugger, see route + utility + arbiter + trace, open AUA Controls drawer, change routing threshold, reload config, ask again, see behavior change.

v1.0.0 — Shipped

Final polish, documentation, release discipline. Everything else is implemented; this milestone makes it shippable.

D-01–D-04

Release readiness — compatibility, deployment profiles, examples, release docs

v1.0.0 — Shipped
D-01
Deployment profiles documentation

Four profiles documented in AUA_Framework_v1_Deployment_Profiles.md: (1) Local Developer — localhost only, auth optional, SQLite; (2) Single GPU Workstation — auth recommended, local file/SQLite state; (3) Team Server — auth required, mTLS required, Postgres state, Prometheus/Grafana; (4) Enterprise — custom backends, IAM, secrets manager, strict audit, runtime import disabled. Doctor validates profile-specific requirements.

doc
D-02
Compatibility matrix

Document in docs/compatibility.md: Python 3.10/3.11/3.12, CUDA versions, vLLM versions, Ollama versions, macOS (Ollama only), Linux (vLLM + GPU), GPU memory recommendations per tier, supported model formats (AWQ, GPTQ, fp16, GGUF), supported backends, browser support for UI, Docker support, Postgres versions, Prometheus/Grafana versions. Doctor references this matrix.

doc
D-03
Release engineering docs — CHANGELOG, RELEASE, MIGRATIONS, DEPRECATIONS

Create: CHANGELOG.md, RELEASE.md (release checklist: tests, lint, types, docs, tutorial, examples, Docker, fresh-clone, UI smoke, security profile, migration notes), MIGRATIONS.md, DEPRECATIONS.md. Versioning policy: semver, v1.x preserves public API, v2 may break plugin contracts with migration guide, deprecated features warn before removal.

doc
D-04
Final fresh-clone validation — v1.0 definition of done

All of the following must pass from a fresh clone: install, import, CLI (init, config, doctor, serve), REST (health, version, config, metrics), query, streaming, plugin test, security (token, certs), observability (metrics, cost, status). Chat UI smoke test. Tutorial Part 1 verified from scratch. All examples run. Docker Compose profiles verified. mypy, black, ruff, pytest all pass.

ops

v1.0.0 — shipped: The one-liner from objective_v1.md works — pip install adaptive-utility-agent && aua init --preset coding --tier single-4090 && aua serve — and the expert path works without editing any AUA source files.

v1.1-veritas — AUA-Veritas Production Backport (Phase 13)

These 49 items were discovered, built, and battle-tested in AUA-Veritas, a macOS desktop AI assistant built on top of the AUA framework. Every item here was required to make a real user-facing product work. All four priorities are done: P0 shipped with v1.0.0; P1–P3 were backported and shipped in v1.1.0 (297 tests, 40-check live-router E2E suite, CI green across Python 3.10–3.12).

Implementation rules discovered through production. Each warning below represents a real bug that caused silent failures in AUA-Veritas. The rules are carried forward into the framework to prevent recurrence.

V-P0

Priority 0 — Required for any production deployment

All items implemented and tested against real Claude API calls (75 E2E tests, 0 failures).

done v1.1-veritas
Persistence layer
V-P0.1
Conversation + message persistence API done

New tables: conversations (with project_id FK), messages, model_runs, token_counters, message_keywords, context_backups, projects. Endpoints: POST /conversations, GET /conversations (with ?project_id= filter), PATCH /conversations/{id}/title, GET /conversations/{id}/messages (paginated with before/after timestamp cursors), POST /projects, GET /projects. All implemented in aua/state.py and aua/router.py.

code new
V-P0.2
Schema: model_runs.conversation_id — required join column done

model_runs previously had query_id but no conversation_id, so the database had no join path between a conversation and its model runs. Added conversation_id TEXT column + idx_runs_conv index. Also added domain_l0 and domain_path for hierarchical domain tracking.
Implementation rule: every method that stores a model run must receive conversation_id as an explicit parameter — never rely on closure capture. The _peer_review() method in Veritas silently dropped conversation IDs for 3 months because of this.

code new
V-P0.3
Off-critical-path writes via fire_and_forget() done

Model score updates, model_runs inserts, and token counter increments were blocking the response path (1–8 ms each, compounding). Moved off the critical path using asyncio.create_task() via the new fire_and_forget(coro) helper in aua/state.py. Saves 15–25 ms per query. Score thresholds validated in production: 3 correct responses → +1 point, 2 wrong → −1 point. A threshold of 5 made score movement invisible within a typical session.
Implementation rule: asyncio must be imported at the top of lifespan() (or locally in each use site) — never as a lazy mid-body import. Any startup code that runs before the lazy import will crash with NameError. This bug manifested in 2 separate places in Veritas.

code new
V-P0.4
LRU message cache — MessageCache class done

Implemented MessageCache in aua/state.py using collections.OrderedDict with move_to_end() on every cache hit (true LRU). A plain FIFO dict evicts the oldest-inserted conversation even when it is the most-accessed one, which is the opposite of what you want. Capacity: 500 conversations.
Cache bypass rule: only serve from cache when limit ≥ 50 (the default). A custom ?limit=1 query must bypass the cache and hit the DB with the actual limit — otherwise pagination is silently broken (limit=1 returns all cached messages). This bug was found by the E2E test suite.

code new
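A minimal sketch of the V-P0.4 cache and its bypass rule, assuming a simple get/put interface. The class name, OrderedDict, move_to_end(), the 500-conversation capacity, and the limit threshold of 50 come from the text; the method names, the get_messages wrapper, and the load_from_db callable are hypothetical.

```python
from collections import OrderedDict

DEFAULT_LIMIT = 50  # default page size named in the text


class MessageCache:
    """LRU cache mapping conversation_id to a list of messages."""

    def __init__(self, capacity: int = 500) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, conversation_id: str):
        if conversation_id not in self._data:
            return None
        self._data.move_to_end(conversation_id)  # mark as most recently used
        return self._data[conversation_id]

    def put(self, conversation_id: str, messages: list) -> None:
        self._data[conversation_id] = messages
        self._data.move_to_end(conversation_id)
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


def get_messages(cache, load_from_db, conversation_id, limit=DEFAULT_LIMIT):
    """Serve from cache only for limit >= 50; smaller limits go to the DB."""
    use_cache = limit >= DEFAULT_LIMIT
    if use_cache:
        cached = cache.get(conversation_id)
        if cached is not None:
            return cached
    rows = load_from_db(conversation_id, limit)
    if use_cache:
        cache.put(conversation_id, rows)
    return rows
```

With this shape, a ?limit=1 request never reads or populates the cache, so it cannot return a full cached page.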
V-P0.5
Context backup reads full DB history, not request slice done

The context backup generator previously used conversation_history from the client request body — whatever messages the frontend happened to have loaded. If the UI had 20 messages loaded from an 80-message conversation, the backup was written from those 20. Fixed: backup reads the full conversation from the DB (last 60 messages, decrypted). The get_messages() method now serves as the canonical source. Also: max backup tokens raised 600 → 900 to accommodate the structured 6-section template (GOAL / DECISIONS / STATUS / ACTIVE FILE / PREFERENCES / RESUME INSTRUCTION).

code
V-P1

Priority 1 — Required for desktop / long-running sessions

Backported and shipped in v1.1.0. Verified by tests/test_v11_veritas.py plus a 40-check live-router E2E suite; CI green on Python 3.10/3.11/3.12.

done v1.1-veritas
V-P1.1
Full-text keyword search — in-memory inverted index + bisect prefix

In-memory inverted index (keyword → {conversation_id, ...}) backed by a sorted keyword list for O(log n) prefix matching via bisect. Multi-word AND semantics. Average query latency: 4–10 ms (vs 1.07 ms DB fallback). Queue-based async worker (asyncio.Queue) drains keyword extraction off the response path in 50 ms batches.
Closure-scope trap (Phase 13): the _kw_worker closure inside lifespan() used _tm.time() where _tm was only imported later in the same body. The task crashed on its first item in every session — search was 100% broken for 3 months. Fix: add import time as _kw_tm inside the closure.
Startup backfill required: the async worker is killed on process restart before flushing. At startup, scan messages for rows not yet in message_keywords and index them. Without this, search returns empty after every rebuild.

code new
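The V-P1.1 index structure (inverted index plus a sorted keyword list searched with bisect) might look roughly like this. It is a sketch under assumptions: the KeywordIndex class and its method names are hypothetical, and the async queue worker and startup backfill are omitted.

```python
import bisect
from collections import defaultdict


class KeywordIndex:
    def __init__(self) -> None:
        self._postings: dict = defaultdict(set)  # keyword -> conversation ids
        self._sorted_keywords: list = []         # kept sorted for bisect

    def add(self, keyword: str, conversation_id: str) -> None:
        keyword = keyword.lower()
        if keyword not in self._postings:
            bisect.insort(self._sorted_keywords, keyword)
        self._postings[keyword].add(conversation_id)

    def _prefix_matches(self, prefix: str) -> set:
        # Find the first keyword >= prefix in O(log n), then walk the
        # contiguous run of keywords that share the prefix.
        i = bisect.bisect_left(self._sorted_keywords, prefix)
        matches: set = set()
        while i < len(self._sorted_keywords):
            keyword = self._sorted_keywords[i]
            if not keyword.startswith(prefix):
                break
            matches |= self._postings[keyword]
            i += 1
        return matches

    def search(self, query: str) -> set:
        terms = query.lower().split()
        if not terms:
            return set()
        result = self._prefix_matches(terms[0])
        for term in terms[1:]:
            result &= self._prefix_matches(term)  # multi-word AND semantics
        return result
```

Because keywords sharing a prefix sit next to each other in sorted order, one binary search locates the whole run.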
V-P1.2
Context backup coverage job — 6-hour sweep

Background job that runs every 6 hours (first run: 60 s after startup). SQL validity check: backup valid ⇔ MAX(context_backups.created_at) > MAX(messages.created_at) for that conversation + specialist pair. Finds all stale or missing backups and generates them at 1/s (rate-limit pacing). Exposes GET /context/backup/coverage and POST /context/backup/run-coverage-job.
Method signature hygiene: when refactoring method signatures, grep all call sites. A stale conversation_id= kwarg passed to a method that no longer accepted it silently crashed every coverage job run.

code new
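The V-P1.2 validity rule (a backup is valid only if it is newer than the latest message) can be expressed as one query. The sketch below uses an in-memory SQLite database; the column names (conversation_id, specialist, created_at) and the numeric timestamps are assumptions for illustration, not the shipped schema.

```python
import sqlite3

# Conversations whose newest backup for a specialist is missing or not
# strictly newer than the newest message.
STALE_SQL = """
SELECT conversation_id FROM (
    SELECT m.conversation_id AS conversation_id,
           MAX(m.created_at) AS last_message,
           (SELECT MAX(b.created_at)
              FROM context_backups b
             WHERE b.conversation_id = m.conversation_id
               AND b.specialist = ?) AS last_backup
      FROM messages m
     GROUP BY m.conversation_id
)
WHERE last_backup IS NULL OR last_backup <= last_message
ORDER BY conversation_id
"""


def stale_conversations(conn: sqlite3.Connection, specialist: str) -> list:
    return [row[0] for row in conn.execute(STALE_SQL, (specialist,))]


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE messages (conversation_id TEXT, created_at REAL);
        CREATE TABLE context_backups (
            conversation_id TEXT, specialist TEXT, created_at REAL);
        INSERT INTO messages VALUES
            ('c1', 1.0), ('c1', 2.0), ('c2', 5.0), ('c3', 1.0);
        INSERT INTO context_backups VALUES
            ('c1', 'coding', 3.0), ('c2', 'coding', 4.0);
    """)
    return stale_conversations(conn, "coding")
```

In the demo data, c1 has a fresh backup, c2 has a backup older than its last message, and c3 has none.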
V-P1.3
Implicit correction confirm/reject — POST /corrections/confirm-implicit

Layer 1 trigger detector catches implicit corrections (short reply + negation pattern + semantic similarity). Instead of asking the user to re-type their intent, a modal presents Accept / Reject buttons. POST /corrections/confirm-implicit handles the response.
Explicit prefix rule: correction: X is a preference statement — store it regardless of whether a prior AI turn exists. The _handle_correction() early-return (if not last_ai_response: return None) must be guarded by and not explicit_prefix. Without this guard, corrections sent at the start of a conversation are silently discarded.
Regex pattern rule: patterns ending in punctuation or a space (e.g. "actually,", "no,", "in fact,") cannot use a trailing \b word boundary, because the position between a comma and a space is not a word boundary. Split CORRECTION_PATTERNS into two groups: word-terminated (use \b...\b) and punct/space-terminated (use a \b prefix only).

code new
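The regex pattern rule from V-P1.3 can be illustrated with a small sketch. The two pattern lists below are hypothetical examples, not the real CORRECTION_PATTERNS; the point is only how the \b anchors differ between the two groups.

```python
import re

# Hypothetical pattern groups for illustration only.
WORD_TERMINATED = ["wrong", "not quite"]
PUNCT_TERMINATED = ["actually,", "no,", "in fact,"]


def compile_patterns(word_terminated, punct_terminated):
    # Word-terminated patterns get a boundary on both sides.
    parts = [rf"\b{re.escape(p)}\b" for p in word_terminated]
    # Punctuation-terminated patterns get a leading boundary only: a \b
    # after the comma would require a word character right after it.
    parts += [rf"\b{re.escape(p)}" for p in punct_terminated]
    return re.compile("|".join(parts), re.IGNORECASE)


CORRECTION_RE = compile_patterns(WORD_TERMINATED, PUNCT_TERMINATED)

# The trap: a trailing \b after the comma never matches before a space.
BROKEN_RE = re.compile(r"\bactually,\b", re.IGNORECASE)
```

For the input "Actually, use tabs", BROKEN_RE finds nothing while CORRECTION_RE matches; the leading \b still prevents "casino, sure" from matching the "no," pattern.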
V-P1.4
Structured context backup prompt — 6-section handoff template

Replaces the free-form backup prompt with a structured template that forces the model to capture: GOAL (objective + constraints), DECISIONS MADE (“Decided X because Y. Rejected Z because W.”), CURRENT STATUS (✅ completed / 🔄 in progress / ❓ unresolved), ACTIVE FILE / CODE CONTEXT (exact path + function + next step), USER PREFERENCES LEARNED, and RESUME INSTRUCTION (one sentence for the new window). Max tokens raised 600 → 900 to accommodate all 6 sections without truncation.

code new
V-P1.5
Crash reporter — startup sentinel + auto-report on next launch

On startup: write status=‘running’ sentinel to DB. On clean shutdown: update to status=‘clean’. On next startup: detect status=‘running’ → previous session crashed → send report async to GitHub Contents API. Includes pending_error_reports table for queuing errors from crashed sessions.

code new ops
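The V-P1.5 startup sentinel can be sketched with a single-row table. The table name session_sentinel and the function names are hypothetical; the status values 'running' and 'clean' come from the text, and the report upload step is left out.

```python
import sqlite3


def on_startup(conn: sqlite3.Connection) -> bool:
    """Write the 'running' sentinel; return True if the last session crashed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_sentinel ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), status TEXT NOT NULL)"
    )
    row = conn.execute(
        "SELECT status FROM session_sentinel WHERE id = 1"
    ).fetchone()
    # A leftover 'running' row means the previous session never shut down cleanly.
    previous_crashed = row is not None and row[0] == "running"
    conn.execute(
        "INSERT OR REPLACE INTO session_sentinel (id, status) VALUES (1, 'running')"
    )
    conn.commit()
    return previous_crashed


def on_clean_shutdown(conn: sqlite3.Connection) -> None:
    conn.execute("UPDATE session_sentinel SET status = 'clean' WHERE id = 1")
    conn.commit()
```

A caller would queue the crash report when on_startup() returns True, then continue with normal startup.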
V-P1.6
Remote model config — fetch → DB cache → hardcoded fallback chain

Three-level fallback: remote JSON (e.g. GitHub Pages, fetched at startup + every 24 h) → DB-cached config (last successful fetch, kept 7 days) → hardcoded fallback. Allows model IDs, pricing, and context windows to be updated without a rebuild. Critical for production: before this system existed, the silent deprecation of Gemini 1.5 Pro (requests returned 404) required a full rebuild to fix.

code new
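One way to picture the V-P1.6 fallback chain is the sketch below. The function name, the dict-based cache standing in for the DB, and the HARDCODED_CONFIG contents are all hypothetical; only the remote → cache (7 days) → hardcoded order comes from the text.

```python
import time

CACHE_TTL_SECONDS = 7 * 24 * 3600  # cached config is kept 7 days
HARDCODED_CONFIG = {"models": ["fallback-model"]}  # hypothetical placeholder


def resolve_model_config(fetch_remote, cache: dict, now=None):
    """Return (config, source), trying remote, then cache, then hardcoded."""
    now = time.time() if now is None else now
    try:
        config = fetch_remote()
    except Exception:
        config = None  # network error, 404, bad JSON, and so on
    if config is not None:
        cache["config"] = config
        cache["fetched_at"] = now
        return config, "remote"
    fetched_at = cache.get("fetched_at")
    if fetched_at is not None and now - fetched_at <= CACHE_TTL_SECONDS:
        return cache["config"], "cache"
    return HARDCODED_CONFIG, "hardcoded"
```

Returning the source alongside the config makes it easy to log which level of the chain answered.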
V-P2

Priority 2 — UX quality

Backported and shipped in v1.1.0 — review_notes on RouterResponse, the analytics suite (/analytics, /reliability, /usage, /pricing), update management, and corrections CRUD with event-history evidence.

done v1.1-veritas
V-P2.1
Peer review notes in response — review_notes field

When the peer reviewer flags an issue, surface the findings to the client: reviewer model name, ISSUES found, CORRECTION suggested. Add review_notes: str | None to RouterResponse. Parse ISSUES: / CORRECTION: structured sections from reviewer output. Before v1.1.0, the framework discarded reviewer findings after parsing the verdict.

code
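Parsing the ISSUES: / CORRECTION: sections for V-P2.1 might look like the following sketch. The exact reviewer output format, the function names, and the layout of the resulting review_notes string are assumptions; only the section labels and the str | None field type come from the text.

```python
import re
from typing import Optional

# Capture each labelled section up to the next label or the end of the text.
_SECTION = re.compile(
    r"^(ISSUES|CORRECTION):[ \t]*(.*?)(?=^(?:ISSUES|CORRECTION):|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_review_sections(reviewer_output: str) -> dict:
    return {
        label: body.strip()
        for label, body in _SECTION.findall(reviewer_output)
    }


def build_review_notes(reviewer_model: str, reviewer_output: str) -> Optional[str]:
    """Return a review_notes string, or None when nothing was flagged."""
    sections = parse_review_sections(reviewer_output)
    if not sections:
        return None
    parts = [f"reviewer: {reviewer_model}"]
    for label in ("ISSUES", "CORRECTION"):
        if label in sections:
            parts.append(f"{label}: {sections[label]}")
    return " | ".join(parts)
```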
V-P2.2
Analytics, reliability, and usage endpoints

GET /analytics — session stats, agreement rate, domain distribution, correction stats, confidence distribution. GET /reliability — per-specialist win rate and effective_u trajectory. GET /usage — token usage summary. GET /pricing — per-specialist token pricing for cost estimation.

code new
V-P2.3
Update management — version check, skip-version persistence

GET /version/check — check GitHub releases for newer version. POST /update/skip — store skipped_version so update banner doesn't re-appear. GET /update/skipped.

code
V-P2.4
Corrections CRUD — edit, soft-delete, evidence endpoint

PATCH /corrections/{id} — update correction text. DELETE /corrections/{id} — soft-delete (sets scope=‘superseded’, row stays in DB, excluded from retrieval). GET /corrections/evidence — per-correction evidence and application history.

code
V-P3

Priority 3 — Production hardening

Backported and shipped in v1.1.0 — structured bug reporting, project scoping, local model management, and the dynamic domain ontology with 4-gate candidate promotion.

done v1.1-veritas
V-P3.1
Bug reporting — structured report to GitHub Contents API

POST /bug-report: collects system log tail, API log tail, JS console errors, optional conversation exchange. Submits via GitHub Contents API to a private repo. Falls back gracefully when PAT not configured (returns 200 with error message, never 500).

code ops
V-P3.2
Projects — conversation scoping via project_id FK

conversations.project_id FK (schema already in V-P0.1). GET /conversations?project_id=X filters correctly. POST /conversations accepts project_id in body. Sidebar filters to that project’s conversations. “All chats” option shows conversations with project_id=NULL.

code
V-P3.3
Local model management API — Ollama discovery + specialist tagging

GET /local/models — list connected local models with specialist_domain + specialist_depth fields. GET /local/settings, POST /local/settings. PATCH /local/specialist/{id} — tag a local model as specialist for a domain node. Before v1.1.0, the framework had ollama_backend.py but no discovery, management, or specialist routing API.

code new
V-P3.4
Dynamic domain ontology — candidate queue + promotion

Dynamic domain tree with candidate promotion: domain nodes start as aliases, get promoted when query volume and divergence thresholds are met. GET /domain-tree — full ontology with node stats and candidate queue. Background ontology job runs divergence test, applies promotion criteria.

ml new

v1.1-veritas P0 definition of done: from aua.state import SQLiteStateStore, MessageCache, fire_and_forget imports cleanly. POST /conversations, GET /conversations, GET /conversations/{id}/messages?limit=1 (bypasses cache), POST /projects, GET /context/backup/coverage all return correct responses. Schema includes conversations, messages, model_runs (with conversation_id), token_counters, message_keywords, context_backups, projects. fire_and_forget() schedules background tasks without blocking the response path.

v1.1.0 definition of done — verified: All 14 V-P1–V-P3 items live behind 28 new REST endpoints (/search, /context/backup/run-coverage-job, /corrections/confirm-implicit, corrections CRUD + evidence, /analytics suite, /version/check, /bug-report, /local/*, /domain-tree). Session/trace/request IDs returned on every response and propagated to specialists, hooks, audit, and logs (#15). secrets: config block with live Vault (wire-faithful KV v2 + real hvac) and AWS Secrets Manager (moto + real boto3) integration tests in CI (#19). plugins:, hooks:, middleware:, state:, and security: blocks parse with strict validation and wire at startup; GET /extensions reports server truth (F-09–F-11). 297 tests passing, ruff/black/isort/mypy clean, CI green on Python 3.10/3.11/3.12.

v2.0+ — High Availability, Distributed, Advanced

Original roadmap items #33–#73. These were scoped as post-v1 work that depends on a stable v1; items already implemented are marked done. Grouped by concern for clarity.

#33–#43

High-availability — consensus, leader election, service discovery, failover

v2.0
#33
KRaft consensus for coordinator election

Kafka KRaft-based coordinator for multi-node router/arbiter election without ZooKeeper dependency.

#34
Leader election for router and arbiter

Active-standby leader election with automatic failover for router and arbiter replicas.

#35
Shared state via etcd

Distributed configuration and state via etcd. Enables multi-node deployments to share routing state.

#36
Service discovery via Consul

Specialists register in Consul. Router discovers and load-balances across specialist replicas automatically.

#37
Circuit breaker per specialist endpoint done

Prevent cascade failures. Open circuit after threshold failures, half-open probe, close on recovery.
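The closed → open → half-open → closed cycle in #37 can be sketched as a small state machine. The class and parameter names are hypothetical, and the injected clock is an assumption made so the behavior can be exercised without waiting.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"  # cooldown elapsed: allow a probe
                return True
            return False
        return True  # closed, or half-open awaiting the probe outcome

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A router would keep one breaker per specialist endpoint and skip any endpoint whose allow_request() returns False.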

#38
Automatic failover with health-based routing done

When primary specialist fails, route to healthy replica or fall back to arbiter with degraded-mode flag.

#39
Retry logic with exponential backoff done

Per-specialist retry policy. Configurable max retries, base delay, jitter. Audit logged.
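A retry policy with the three knobs named in #39 (max retries, base delay, jitter) could be sketched like this. The function name and the injectable sleep and rng parameters are hypothetical; audit logging is omitted.

```python
import random
import time


def call_with_retry(fn, max_retries=3, base_delay=0.5, jitter=0.1,
                    sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on any exception with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Delay doubles each attempt; jitter spreads out retry bursts.
            delay = base_delay * (2 ** attempt) + rng() * jitter
            sleep(delay)
            attempt += 1
```

With base_delay=0.5 and jitter disabled, the waits are 0.5 s, 1.0 s, 2.0 s, and so on.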

#40
Kubernetes manifest generation

aua k8s generate --tier a100-cluster produces ready-to-apply manifests for all AUA components.

#41
Helm charts

Production-grade Helm chart with values files for each hardware tier and deployment profile.

#42
Backup and restore

Backup state store (corrections, promotions, audit, sessions) to S3/GCS/local. Point-in-time restore.

#43
HA config section in aua_config.yaml

Unified HA configuration block: replica counts, circuit breaker policy, leader election backend, service discovery backend.

#44–#55

Advanced platform features — multi-tenancy, caching, experiment tracking, batch, testing

v2.0
#44
Multi-tenancy — isolated namespaces per tenant done

Tenant isolation: separate correction stores, promotion logs, audit logs, rate limits, and model bindings per tenant.

#45
Semantic caching via Redis n/a

Superseded. The keyword search index (AUA) and conversation search (AUA-Veritas) already let users check whether a query has been asked before. Adding Redis as an ops dependency for marginal cache-hit benefit is not warranted. Closed.

#46
Model registry integration — HuggingFace Hub, MLflow done

Pull model versions from HuggingFace Hub or MLflow registry directly into specialist config. Version pinning.

#47
Experiment tracking — MLflow / Weights & Biases done

Log utility scores, routing decisions, arbiter verdicts, and blue-green metrics to MLflow or W&B automatically.

#48
Shadow mode for blue-green done

GREEN receives traffic but responses are not shown to users. Score and log GREEN silently. Promote when score threshold met.

#49
Automated regression test suite done

Full regression suite against the eval datasets. Run on every specialist promotion attempt. Block promotion on regression.

#50
Load testing preset — aua loadtest done

Built-in load test command. Configurable concurrency, duration, query mix. Reports p50/p95/p99 latency and error rate.

#51
Extended plugin system — custom specialists and detectors done

Plugin types beyond v1: custom contradiction detector, custom assertion store, custom routing strategy, custom scoring components.

#52
Extended middleware pipeline done

Async middleware, streaming middleware (intercepts SSE chunks), batch middleware, multi-tenant middleware primitives.

#53
Custom utility function support — full replacement done

Allow fully replacing the utility function (not just weights) via plugin. Supports non-linear utility models and field-specific scoring architectures.

#54
Built-in test harness — aua test done

aua test command running the full test suite against running services (not fakes). Integration test harness with configurable query fixtures.

#55
Extended compatibility matrix (model × hardware × backend) done

Automated compatibility test matrix run in CI across model formats, hardware tiers, and backends. Published as docs and referenced by doctor.

#56–#73

Research & advanced — VCG, distillation, SDKs, GPU cluster, auto-generated docs — 7 of 18 done (#56, #57, #58, #61, #65, #66, #67)

v2.x / Research
#56
Batch inference endpoint — production-grade done

Persistent batch queue. Result polling. Priority lanes. Cost-optimized batching across specialists.

#57
Auto-setup on aua serve — model download integration done

aua serve automatically downloads missing models from HuggingFace Hub before starting specialists.

#58
VCG arbitration mechanism — full implementation (whitepaper §S1) done

Full Vickrey-Clarke-Groves mechanism for multi-specialist arbitration as described in the whitepaper theorems S1–S3.

#59
Release-level distillation pipeline

Automated pipeline: collect DPO pairs → distill specialist → evaluate → blue-green candidate. End-to-end self-improvement loop.

#60
Hardware-adaptive graph decomposition

Dynamically partition the specialist graph across available hardware based on VRAM and latency constraints.

#61
Empirical check — Arbiter Stage 4 (external API integration) done

Integrate PubMed, arXiv, SymPy as external ground truth sources for the Arbiter's empirical check (previously stubbed).

#62
JavaScript / TypeScript SDK

Node.js SDK with full TypeScript types. Mirrors Python API. SSE streaming support.

#63
Go SDK

Go client library with streaming support. Targeting platform engineers building Go-based AI infrastructure.

#64
Java SDK

Java client library. Enterprise-focused. Spring Boot integration example.

#65
Multi-node H100 cluster template done

Tier template for 8×H100 cluster. NVLink, tensor parallelism, pipeline parallelism configuration.

#66
Tensor parallel specialist serving done

vLLM tensor parallelism configuration for large models across multiple GPUs per specialist.

#67
MacBook Pro / gaming PC template — consumer hardware tier done

Optimized Ollama configuration for consumer MacBooks and gaming PCs (RTX 3080/4080 class).

#68
Full tutorial rewrite — v2 framework edition

Updated tutorial covering HA deployment, multi-tenancy, advanced plugins, and distillation pipeline.

#69
Auto-generated API reference at /docs and /redoc

Already provided by FastAPI. Expand with richer examples, field explanations, and code samples in all SDK languages.

#70
Domain deep-dive pages — all 9 domains linked from tutorial

Per-domain documentation pages covering field-specific utility weights, prompt templates, arbiter policies, and example queries.

#71
SSRN paper update — link to v1 site and measured results

Update whitepaper with v1 framework results, production metrics, and framework architecture. Update site links.

#72
Example applications repository

Separate repo with more complete example applications: medical assistant, legal assistant, full-stack coding assistant, research agent.

#73
Changelog and versioning policy — public commitments

Public changelog and SemVer policy published on the framework site.

#74
Per-specialist ModelBackendPlugin wiring

Wire ModelBackendPlugin at the per-specialist level. Each SpecialistConfig and ArbiterConfig carries an optional backend_plugin: block. When present, the router calls plugin.complete(request) instead of the built-in httpx _call() path for that specialist only — zero change for specialists without a plugin. Design: backend_plugin field on SpecialistConfig/ArbiterConfig; _call() gains backend_plugin=None kwarg; plugin receives OpenAI-compat request dict, returns {response: str, confidence: float}. Enables: custom inference stacks, multi-model ensembles per specialist, math specialist backed by 3 models + arXiv cross-check, domain-specific retrieval augmentation, or any external inference API.
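The per-specialist dispatch proposed in #74 can be sketched as below. This is an illustration of the design, not framework code: the Protocol definition, call_specialist, builtin_call, and EchoPlugin are hypothetical stand-ins, and the sketch is synchronous for brevity. Only plugin.complete(request), the backend_plugin=None kwarg, and the {response, confidence} return shape come from the text.

```python
from typing import Callable, Optional, Protocol


class ModelBackendPlugin(Protocol):
    """Assumed shape of the plugin contract described above."""

    def complete(self, request: dict) -> dict: ...


def call_specialist(
    request: dict,
    builtin_call: Callable[[dict], dict],
    backend_plugin: Optional[ModelBackendPlugin] = None,
) -> dict:
    # Specialists without a plugin keep the built-in path unchanged.
    if backend_plugin is None:
        return builtin_call(request)
    result = backend_plugin.complete(request)
    # Normalize to the documented return shape.
    return {
        "response": str(result["response"]),
        "confidence": float(result["confidence"]),
    }


class EchoPlugin:
    """Toy plugin that echoes the last message of an OpenAI-compatible request."""

    def complete(self, request: dict) -> dict:
        last = request["messages"][-1]["content"]
        return {"response": f"echo: {last}", "confidence": 0.5}
```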

#75
StateStorePlugin wiring — pluggable persistence backend

Wire StateStorePlugin so the SQLite default can be replaced via aua_config.yaml. Requires restructuring Router.__init__ to load the state_store plugin before constructing BatchQueue, ShadowStore, DomainTree, OntologyJob, and RemoteModelConfig (all of which currently receive self._state_store at construction time). Two candidate designs: (a) two-phase init — load state_store plugin first, pass to dependents; (b) late-bind factory — dependents accept a callable that returns the store on first access. Enables Postgres, Redis, or cloud-native persistence backends as drop-in community plugins without any source modifications.
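Design option (b) in #75, the late-bind factory, can be sketched as follows. LazyStore and the simplified BatchQueue below are hypothetical stand-ins used only to show the binding order; they do not reflect the real Router.__init__ or BatchQueue.

```python
class LazyStore:
    """Defers building the state store until something first needs it."""

    def __init__(self, factory):
        self._factory = factory
        self._store = None

    def get(self):
        if self._store is None:
            self._store = self._factory()  # built once, on first access
        return self._store


class BatchQueue:
    """Stand-in dependent: holds a provider, not a concrete store."""

    def __init__(self, store_provider: LazyStore) -> None:
        self._store_provider = store_provider

    def enqueue(self, item: str) -> None:
        self._store_provider.get().append(item)


def demo() -> tuple:
    built = []

    def factory():
        built.append("built")
        return []  # a plain list stands in for the plugin-provided store

    provider = LazyStore(factory)
    queue = BatchQueue(provider)      # constructed before any store exists
    built_before_use = len(built)
    queue.enqueue("job-1")
    queue.enqueue("job-2")
    return built_before_use, len(built), provider.get()
```

Compared with two-phase init, this lets dependents be constructed in any order, at the cost of an extra indirection on each store access.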


Summary — Full Item Count by Milestone

| Milestone | Items | Theme | Gate |
|---|---|---|---|
| v0.6-alpha | P-01 – P-12 (12 items) | Production-harden #01–#10 | Fresh-clone validation + CI passes |
| v0.7-beta | #11–#13, #05A–#05D, #14 (8 items) | Docker + Django config foundation | aua init --preset coding && docker-compose up |
| v0.8-framework-beta | F-01 – F-17 (17 items) | Architecture, contracts, plugins, hooks, middleware | Custom plugin test passes; no AUA source edits required |
| v0.9-rc1 | #15–#22, #26, #23, #28, #24–#25, #27, #29–#32 (17 items) | Security + observability | Token + certs + Prometheus + Grafana all working |
| v0.9-rc2 | E-01–E-03, U-01–U-02 (5 items) | Evaluation + chat UI | aua serve --with-ui + debugger panel working end-to-end |
| v1.0.0 (shipped) | D-01–D-04 (4 items) | Release readiness | All v1.0 definition-of-done commands pass |
| v1.1.0 (shipped) | V-P0–V-P3 (49 items) + #15, #19, F-09–F-11 completion | AUA-Veritas production backport (persistence, search, corrections, backup, analytics, ontology), plus end-to-end session IDs, live secrets-provider tests, and YAML-wired plugins/hooks/middleware | All 4 priorities shipped. 297 tests passing; every documented endpoint verified against a live router. |
| v2.0+ | #33–#73 (41 items) | HA, multi-tenancy, SDKs, research, cluster | Post-v1; requires stable v1 first |

Key ordering rationale: Config strictness (P-06) before Docker (#11) because Docker needs correct config. Model registry (#05C) before presets (#05B) because presets reference aliases. Field registry (#05D) before config commands (#05A) because explain needs field knowledge. Secrets (#19) before auth (#17) because auth tokens are secrets. OTEL (#23) before Prometheus (#24) because Prometheus can export from OTEL. State store (F-05) before Chat Session API (U-01) because sessions need storage. Plugin interfaces (F-07) before import system (F-09) because the importer validates against the interfaces. Error taxonomy (F-03) before security (#17–#22) because auth errors need stable codes. Architecture spec (F-01) before all framework items because it defines component boundaries that prevent drift.