AUA Framework v1.1 — Documentation

Technical Documentation

Architecture specification, deployment profiles, compatibility matrix, permission scopes, validation report, and supplemental roadmap — all rendered here for easy reading.

AUA Framework v1 — Architecture Specification

Version: 1.1.0

Status: Canonical. Implementation must match this document. Any divergence is a bug.

1. System Overview

AUA (Adaptive Utility Agents) is a multi-specialist LLM routing framework. It routes queries to domain-expert models, scores outputs using a utility function, detects contradictions, resolves them with an arbiter, and feeds verified corrections back into training.

The design goal is to be the Django of adaptive multi-model LLM systems: batteries included, deeply configurable, and extensible without editing framework internals.

2. Component Boundaries

┌─────────────────────────────────────────────────────┐
│                    AUA Router                        │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │Middleware │  │  Session  │  │  Correction     │   │
│  │ Pipeline │  │  Manager  │  │  Retrieval      │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │          Field Classifier                    │    │
│  │  (pluggable via FieldClassifierPlugin)       │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │  Router  │  │ Specialist│  │ Utility Scorer  │   │
│  │ Decision │  │  Calls    │  │ (pluggable)     │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │              Arbiter Agent                   │    │
│  │  (pluggable policy via ArbiterPolicyPlugin)  │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │   Hook   │  │Correction │  │  State Store    │   │
│  │ Registry │  │  Logger   │  │  (pluggable)    │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
└─────────────────────────────────────────────────────┘

External:
  Specialist servers  (vLLM / Ollama / custom ModelBackendPlugin)
  Arbiter server      (same backends)
  State store         (files / SQLite / Postgres)
  Observability       (stdout / Prometheus / OTEL)

2.1 v1.1 production services (AUA-Veritas backport)

v1.1.0 adds eight services that run inside the router process. Each starts in the FastAPI lifespan and shuts down cleanly; none blocks the query path.

Service	Module	What it does
Keyword index	`aua/keywords.py`	Message-level inverted index with async batch worker (50 ms), startup backfill, and DB fallback. Serves `GET /search`.
Context backups	`aua/context_backup.py`	Per-(specialist, conversation) token counters; 6-section handoff notes on token/message/time-gap triggers; 6-hour coverage sweep.
Trigger detector	`aua/trigger_detector.py`	Two-layer correction detection: regex Layer 1 ships built-in, Layer 2 is pluggable. Feeds `POST /corrections/confirm-implicit`.
Crash reporter	`aua/crash_reporter.py`	Startup sentinel plus clean-shutdown marking; a crash in the previous session is detected before the new sentinel is written and is reported together with a flush of queued errors.
Remote model config	`aua/remote_config.py`	Model registry refresh with a remote → DB-cache (7-day) → builtin fallback chain; field allowlist; 24-hour refresh job.
Domain ontology	`aua/domain_tree.py`	10 fixed L0 roots; alias map + edit-distance resolution; candidate queue with 4-gate promotion (volume, diversity, coverage, divergence); hourly maintenance job. Serves `GET /domain-tree`.
Session ID middleware	`aua/session.py`	Per-request SessionContext (session/trace/request IDs): client-supplied IDs are honored, missing ones are generated as UUIDs, and all three are returned as headers on every response and propagated to specialists, hooks, audit, and logs (#15).
YAML extension loader	`aua/router.py` + `aua/plugins/registry.py`	Loads `plugins:`, `hooks:`, and `middleware:` from config at startup with contract validation and Django-style project-dir imports (F-09–F-11). `GET /extensions` reports server truth.
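
The remote → DB-cache → builtin fallback chain used by the remote model config service can be pictured roughly as follows. This is a minimal sketch, not the code in `aua/remote_config.py`: the function name `resolve_registry`, the `(stored_at, registry)` cache shape, and the `BUILTIN_REGISTRY` contents are hypothetical. Only the ordering of the chain and the 7-day cache lifetime come from the table above.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day DB-cache lifetime (from the table above)
BUILTIN_REGISTRY = {"models": ["builtin-default"]}  # hypothetical builtin fallback


def resolve_registry(
    fetch_remote: Callable[[], dict],
    cached: Optional[tuple[float, dict]],
    now: Optional[float] = None,
) -> tuple[str, dict]:
    """Return (source, registry): remote first, then a fresh cache, then builtin."""
    now = time.time() if now is None else now
    try:
        return "remote", fetch_remote()
    except Exception:
        pass  # remote unreachable or invalid: fall through to the cache
    if cached is not None:
        stored_at, registry = cached
        if now - stored_at <= CACHE_TTL_SECONDS:
            return "cache", registry
    # Cache missing or older than 7 days: use the builtin registry.
    return "builtin", BUILTIN_REGISTRY
```

Returning the source alongside the registry makes it easy to log which link of the chain actually served the configuration.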

3. Full Request Lifecycle

Every query follows this pipeline in order. Steps marked [pluggable] can be replaced or extended via plugins/hooks.

 1. HTTP Request arrives at router
    └─ session_id / trace_id / request_id assigned (UUID if not supplied)

 2. Middleware pipeline — before_query() [pluggable]
    └─ PII redaction, tenant policy, rate limiting, auth check

 3. Session lookup
    └─ Retrieve prior session context from state store (if session_id known)

 4. Correction retrieval
    └─ Load relevant verified claims from AssertionsStore for this domain

 5. Field Classifier [pluggable]
    └─ Scores query against all known fields → domain_distribution dict
    └─ Emits: primary_domain, domain_distribution, routing_mode decision

 6. Routing decision
    ├─ single: one field above single_domain_threshold → one specialist
    ├─ fanout: multiple fields above fanout_threshold → multiple specialists
    └─ force_domain: override from request

 7. Specialist calls [pluggable via ModelBackendPlugin]
    └─ POST to specialist endpoint with correction context injected
    └─ Timeout: specialist_timeout (default 60s) → AUA_SPECIALIST_TIMEOUT

 8. Utility Scoring [pluggable]
    └─ U = w_e·E + w_c·C + w_k·K per specialist response
    └─ Kalman filter updates confidence estimate

 9. Arbiter [pluggable policy]
    └─ Runs if: fanout + contradiction detected
    └─ 4 checks: logical, mathematical, cross-session, empirical
    └─ Issues Case 1/2/3/4 verdict → correction signal

10. Hook registry — on_correction / on_promotion / etc. [pluggable]
    └─ Fire registered hooks for this event type

11. Correction logging
    └─ Store DPO pair to state store (if arbiter issued correction)
    └─ Update AssertionsStore with verified claim

12. Response assembly
    └─ RouterResponse model with session_id, u_score, routing_mode, response

13. Middleware pipeline — after_response() [pluggable]
    └─ Response transformation, audit logging

14. Metrics / Logs / Traces / Audit
    └─ Structured JSON to stdout
    └─ Prometheus metrics (if observability profile enabled)
    └─ OTEL traces (if otel extra installed)
    └─ Audit log entry written to state store (append-only, hash chain)

Background (off the request path, v1.1):
    keyword index worker · backup coverage job (6h) · remote config
    refresh (24h) · ontology job (1h) · crash report on startup
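
Steps 6 and 8 of the lifecycle can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the router's actual code: the function names are hypothetical, "above" is read as a strict comparison, and the fallback to the top-scoring domain when no threshold is met is an assumption the specification does not state. The meaning of E, C, and K is left to the scorer; the sketch only shows the weighted sum.

```python
from typing import Optional


def routing_decision(
    domain_distribution: dict[str, float],
    single_domain_threshold: float,
    fanout_threshold: float,
    force_domain: Optional[str] = None,
) -> tuple[str, list[str]]:
    """Return (routing_mode, domains) for a classified query."""
    if force_domain is not None:
        return "force_domain", [force_domain]  # override from the request
    if not domain_distribution:
        raise ValueError("domain_distribution is empty")
    ranked = sorted(domain_distribution, key=domain_distribution.get, reverse=True)
    top = ranked[0]
    if domain_distribution[top] > single_domain_threshold:
        return "single", [top]
    fanout = [d for d in ranked if domain_distribution[d] > fanout_threshold]
    if len(fanout) > 1:
        return "fanout", fanout
    return "single", [top]  # assumed fallback, not specified above


def utility(e: float, c: float, k: float, w_e: float, w_c: float, w_k: float) -> float:
    """U = w_e*E + w_c*C + w_k*K for one specialist response."""
    return w_e * e + w_c * c + w_k * k
```

With thresholds of 0.7 and 0.3 (illustrative values), a distribution of `{"code": 0.5, "law": 0.4, "math": 0.1}` fans out to `code` and `law`, while `{"code": 0.9, "law": 0.05}` routes to `code` alone.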

4. Component Ownership

Component	Module	Owner interface
Field classifier	`aua.field_classifier`	`FieldClassifierPlugin`
Utility scorer	`aua.utility_scorer`	`UtilityScorerPlugin`
Arbiter policy	`aua.arbiter`	`ArbiterPolicyPlugin`
Promotion policy	`aua.blue_green`	`PromotionPolicyPlugin`
Correction store	`aua.assertions_store`	`CorrectionStorePlugin`
Model backend	`aua.router` (http calls)	`ModelBackendPlugin`
State store	`aua.state`	`StateStorePlugin`
Hooks	`aua.hooks`	`HookPlugin`
Middleware	`aua.middleware`	`AUAMiddleware`

All plugin types are defined in aua/plugins/interfaces.py as Python Protocol classes.

5. Plugin Loading Lifecycle

1. Config loaded (load_config)
2. For each plugin reference in config:
   a. Resolve import_path: "module.path:ClassName"
   b. Import module
   c. Instantiate class with config dict injected
   d. Validate against Protocol (runtime isinstance check)
   e. Register in plugin registry
3. Router initialised with plugin registry
4. On SIGHUP: reload config → re-run steps 1-3 atomically

Plugins are validated at startup. A failed plugin load causes startup to abort with AUA_PLUGIN_LOAD_FAILED.
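
Steps 2a to 2d can be approximated with the standard library alone. The sketch below is not the framework's loader: `load_plugin` is a hypothetical name, and the `score` method on the example Protocol is invented for illustration (the real interfaces live in aua/plugins/interfaces.py). Note that an `isinstance` check against a `runtime_checkable` Protocol only verifies that the required methods exist, not their signatures.

```python
import importlib
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class UtilityScorerPlugin(Protocol):
    # Hypothetical method, for illustration only.
    def score(self, response: str) -> float: ...


def load_plugin(import_path: str, config: dict[str, Any], protocol: type) -> Any:
    """Resolve 'module.path:ClassName', instantiate it, and validate it."""
    module_path, _, class_name = import_path.partition(":")
    if not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {import_path!r}")
    module = importlib.import_module(module_path)  # step b: import module
    cls = getattr(module, class_name)
    instance = cls(config)                         # step c: inject config dict
    if not isinstance(instance, protocol):         # step d: runtime Protocol check
        raise TypeError(f"{import_path} does not satisfy {protocol.__name__}")
    return instance
```

A caller would then register the returned instance (step 2e) and treat any exception as a fatal startup error.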

6. Hook Execution Order

For each hook point, hooks fire in YAML registration order:

pre_query → [middleware.before_query] → post_route → pre_specialist_call
→ post_specialist_call → pre_arbiter → post_arbiter → on_correction
→ pre_response → [middleware.after_response] → post_response
→ on_promotion / on_rollback (async, not in request path)

Hook failures default to fail-open (log + continue). Set hooks.{name}.fail_closed: true to abort on failure.
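
The fail-open versus fail-closed behaviour can be sketched as a small dispatcher. This is an assumption-laden illustration, not the hook registry in `aua.hooks`: `fire_hooks` and `HookAborted` are hypothetical names, and each hook is modelled as a plain callable paired with its `fail_closed` flag.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("aua.hooks.sketch")


class HookAborted(RuntimeError):
    """Hypothetical error raised when a fail_closed hook fails."""


def fire_hooks(hooks: list[tuple[Callable[[dict], Any], bool]], event: dict) -> int:
    """Run hooks in registration order; return how many succeeded."""
    succeeded = 0
    for hook, fail_closed in hooks:
        try:
            hook(event)
            succeeded += 1
        except Exception as exc:
            if fail_closed:
                # fail_closed: true -> abort the remaining hooks and the request
                raise HookAborted(f"hook {hook!r} failed") from exc
            # default fail-open: log and continue with the next hook
            logger.warning("hook %r failed, continuing: %s", hook, exc)
    return succeeded
```

In this sketch a failing fail-open hook never prevents later hooks from running, whereas a failing fail-closed hook stops the chain at that point.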

7. Observability Flow

Every request → structured JSON log line (stdout)
             → Prometheus counter/histogram increment (if enabled)
             → OTEL span (if aua[otel] installed)
             → Audit log entry (state store, append-only)

Key metrics:
  aua_queries_total{domain, routing_mode, status}
  aua_query_latency_seconds{domain, routing_mode}
  aua_utility_score{domain}
  aua_contradiction_rate{domain}
  aua_arbiter_verdict_total{case}
  aua_specialist_errors_total{specialist, error_code}

8. Security Boundary

Only the router port (default 8000) is public-facing.

Specialist ports are internal — bind to 127.0.0.1 or Docker internal network.

Extension endpoints (/extensions/*) are disabled in production mode.

All non-public external endpoints require bearer token auth (v0.9+). Public endpoints such as /health/* and /version are listed in the scope matrix.

Secrets are never logged, traced, or returned via GET /config.

Audit log is append-only with a hash chain for tamper detection (v0.9+).
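
To make the tamper-detection idea concrete, here is a minimal hash-chain sketch. It is not the framework's audit log format: the entry fields, the all-zero genesis value, and the use of SHA-256 over canonical JSON are assumptions chosen for illustration. The point is only that each entry commits to the hash of the one before it, so editing any earlier entry invalidates the chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical seed used as prev_hash of the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)  # canonical form of the event
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```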

9. State Store

All persistent state goes through the StateStore interface:

Data	v0.7 location	v0.8+ (default)
Promotion log	`.aua/state/promotions.jsonl`	SQLite: `promotions` table
Correction pairs	`dpo_pairs/*.jsonl`	SQLite: `corrections` table
Assertions	In-memory (AssertionsStore)	SQLite: `assertions` table
Sessions	None	SQLite: `sessions` table
Audit log	None	SQLite: `audit_log` table

Migration from v0.7 flat files: aua config migrate --from 0.7 --to 0.8

10. Extension Points Summary

Users extend AUA by adding YAML entries — never by editing framework source files.

# Custom utility scorer
utility_scorer:
  import_path: plugins.custom_utility:RiskWeightedUtilityScorer
  config:
    risk_weight: 0.7

# Custom middleware
middleware:
  - import_path: plugins.middleware:PIIRedactionMiddleware
  - import_path: plugins.middleware:AuditMiddleware

# Custom hook
hooks:
  on_correction:
    - import_path: plugins.hooks:SlackNotificationHook
      config:
        webhook_url_secret: SLACK_WEBHOOK_URL

# Custom backend
backends:
  my_gateway:
    import_path: plugins.backends:GatewayBackend
    base_url: https://gateway.internal
    auth_secret: GATEWAY_API_KEY

Document maintained by: Praneeth Tota. Last updated: v0.8.0b0. For implementation questions, check this document first.

AUA Framework v1 — Deployment Profiles

Version: 1.1.0

Status: Canonical. Each profile defines minimum requirements and recommended settings.

Overview

AUA ships with four deployment profiles. Choose the profile that matches your environment, then use aua init with the appropriate tier and configure security accordingly.

Profile	Auth	State	mTLS	Observability	Use case
Local Developer	Optional	SQLite	No	Optional	Solo dev, experimentation
Single GPU Workstation	Recommended	SQLite	No	Optional	Personal GPU server
Team Server	Required	Postgres/SQLite	Required	Required	Shared team deployment
Enterprise	Required + IAM	Postgres	Required	Required	Production, regulated

Profile 1 — Local Developer

Target: MacBook Pro / laptop. Ollama backend. No GPU required.

# aua_config.yaml
aua:
  version: "1.0"
  backend: ollama

security:
  auth_enabled: false   # acceptable for localhost-only

state:
  backend: sqlite
  path: .aua/state/aua.db

logging:
  level: INFO
  format: text           # human-readable for local dev

Setup:

brew install ollama
aua init . --tier macbook --preset coding
aua doctor
aua serve

Doctor checks for this profile:

Ollama reachable at port 11434

Required models pulled

Auth disabled warning (non-fatal on localhost)

Limitations:

Not suitable for network exposure

Single user only

No authentication enforced

Profile 2 — Single GPU Workstation

Target: RTX 4090 or similar consumer GPU. vLLM backend. Single user or small team on LAN.

aua:
  version: "1.0"
  backend: vllm

security:
  auth_enabled: true
  token_secret_env: AUA_TOKEN_SECRET

state:
  backend: sqlite
  path: .aua/state/aua.db

logging:
  level: INFO
  format: json

Setup:

export AUA_TOKEN_SECRET=$(python3 -c "import secrets; print(secrets.token_hex(32))")
aua init . --tier single-4090 --preset coding
aua token create --scope aua:query --expires 90d --label "primary"
aua doctor --strict
aua serve

Doctor checks for this profile:

CUDA available

VRAM sufficient for configured specialists

Auth enabled

Token secret set

Profile 3 — Team Server

Target: Dedicated Linux server, RTX 4090 or A100. Shared team access. Prometheus + Grafana monitoring.

aua:
  version: "1.0"
  backend: vllm

security:
  auth_enabled: true
  token_secret_env: AUA_TOKEN_SECRET
  mtls:
    enabled: true
    cert_dir: /etc/aua/certs
    auto_generate: false    # use your own CA in production

state:
  backend: sqlite           # or postgres for HA
  path: /var/lib/aua/state/aua.db

logging:
  level: INFO
  format: json
  output: /var/log/aua/router.log

rate_limits:
  aua:query:
    requests_per_minute: 120
  aua:admin:
    requests_per_minute: 10

Setup:

# Generate certs (or use your own CA)
aua certs generate --cert-dir /etc/aua/certs

# Create tokens per team member
aua token create --scope aua:query --scope aua:stream --expires 30d --label "team-alice"
aua token create --scope aua:admin --expires 1d --label "ci-deploy"

# Start with observability
docker compose --profile obs up prometheus grafana -d
aua serve

Doctor checks for this profile:

Auth enabled (fatal if disabled)

mTLS certs present and not expired

Rate limits configured

Prometheus reachable

Profile 4 — Enterprise

Target: Multi-GPU cluster, regulated environment, audit requirements.

aua:
  version: "1.0"
  backend: vllm

secrets:
  provider: vault          # or: aws, gcp
  vault_url: https://vault.internal
  token_env: VAULT_TOKEN

security:
  auth_enabled: true
  token_secret_env: AUA_TOKEN_SECRET
  mtls:
    enabled: true
    cert_dir: /etc/aua/certs
    auto_generate: false
  encryption:
    enabled: true
    key_secret: AUA_ENCRYPTION_KEY

state:
  backend: sqlite           # postgres recommended for HA
  path: /var/lib/aua/state/aua.db

logging:
  level: INFO
  format: json
  output: stdout            # forward to ELK/Splunk via log aggregator

rate_limits:
  aua:query:
    requests_per_minute: 300
  aua:admin:
    requests_per_minute: 5

# Disable development features
extensions:
  runtime_import_enabled: false   # never allow runtime plugin loading
  allowlist_only: true

Additional requirements:

All secrets via Vault/AWS SM/GCP SM — no plaintext in config

Encryption at rest enabled (AES-256-GCM)

Audit log verified via hash chain integrity check

Extension runtime API disabled

mTLS between all components

Prometheus + Grafana + alert routing to PagerDuty/Slack

Token expiry ≤ 30 days, rotation enforced

Doctor checks for this profile:

All of Profile 3 checks

Secrets provider reachable

Encryption key set

Runtime import disabled

Audit log hash chain valid

Doctor Profile Validation

Run with --strict to enforce profile requirements:

aua doctor --strict

Exit codes:

0 — all checks pass

1 — one or more checks failed

2 — warnings in strict mode (treated as failures)

The doctor automatically detects which profile you're running based on your config and applies the appropriate check set.
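
The exit-code rules above reduce to a short decision function. The sketch below is illustrative only: `doctor_exit_code` and the `'pass'`/`'warn'`/`'fail'` labels are hypothetical, and giving a failed check precedence over a warning is an assumption.

```python
def doctor_exit_code(results: list[str], strict: bool = False) -> int:
    """Map per-check results ('pass', 'warn', 'fail') to a process exit code."""
    if "fail" in results:
        return 1  # one or more checks failed
    if strict and "warn" in results:
        return 2  # warnings are treated as failures in strict mode
    return 0      # all checks pass (warnings tolerated without --strict)
```

A CI job that runs `aua doctor --strict` can therefore treat any non-zero exit code as a blocked deployment.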

Upgrading Between Profiles

Profile 1 → 2: Enable auth, set AUA_TOKEN_SECRET, create tokens.

Profile 2 → 3: Add mTLS, configure rate limits, add observability stack.

Profile 3 → 4: Add secrets manager, enable encryption, disable runtime imports.

Migration: aua config migrate --from 0.9 --to 1.0

AUA Framework v1 — Compatibility Matrix

Version: 1.1.0

Python

Version	Status	Notes
3.10	✓ Supported	Minimum version
3.11	✓ Supported	Recommended
3.12	✓ Supported	Tested in CI
3.9	✗ Not supported	f-string syntax incompatible
3.13	⚠ Experimental	Not yet in CI matrix

Operating Systems

OS	Backend	Notes
macOS (Apple Silicon M1/M2/M3/M4)	Ollama only	vLLM has no macOS support
macOS (Intel)	Ollama only	CPU inference only
Ubuntu 20.04+	vLLM + Ollama	Recommended for production
Debian 11+	vLLM + Ollama
RHEL/Rocky 8+	vLLM + Ollama
Windows	Not tested	Use WSL2 + Ubuntu

GPU & VRAM

GPU	VRAM	Tier	Max simultaneous specialists
Apple M-series (unified)	16–128 GB	macbook	3 (via Ollama, sequential)
NVIDIA RTX 4090	24 GB	single-4090	3 (AWQ, concurrent)
NVIDIA RTX 3090/4080	24 GB / 16 GB	single-4090	2–3 (may need lower util)
4× NVIDIA RTX 4090	96 GB total	quad-4090	6–8
NVIDIA A100 80 GB	80 GB	a100-cluster	4–6 (fp16)
NVIDIA H100 80 GB	80 GB	a100-cluster	4–6 (fp16)

VRAM estimates (AWQ 4-bit):

3B model: ~2.5 GB

7B model: ~5 GB

14B model: ~9 GB

32B model: ~20 GB
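
These estimates make it easy to sanity-check a specialist line-up against a GPU. The sketch below uses the AWQ 4-bit figures listed above; the function name `weights_fit` and the 0.9 utilisation factor are assumptions. It budgets model weights only: KV cache and activations need additional memory, so a line-up that passes here can still run out of VRAM under load.

```python
# AWQ 4-bit weight estimates in GB, taken from the list above.
AWQ_4BIT_VRAM_GB = {"3B": 2.5, "7B": 5.0, "14B": 9.0, "32B": 20.0}


def weights_fit(
    specialists: list[str],
    vram_gb: float,
    gpu_memory_utilization: float = 0.9,  # assumed usable fraction of VRAM
) -> tuple[float, bool]:
    """Return (GB needed for weights, whether that fits the usable budget)."""
    needed = sum(AWQ_4BIT_VRAM_GB[size] for size in specialists)
    budget = vram_gb * gpu_memory_utilization
    return needed, needed <= budget
```

For example, two 7B specialists plus one 14B specialist need about 19 GB of weights, which fits the roughly 21.6 GB usable on a 24 GB RTX 4090 at 0.9 utilisation, whereas a 32B model plus a 7B model (about 25 GB) does not.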

LLM Backends

Ollama

Version	Status	Notes
0.3.x	✓ Supported
0.4.x	✓ Supported	Recommended
0.5.x+	✓ Supported

Supported model formats via Ollama: GGUF (Q4, Q5, Q8), fp16

vLLM

Version	Status	Notes
0.4.x	✓ Supported
0.5.x	✓ Supported	Recommended
0.6.x+	✓ Supported

Supported model formats via vLLM: AWQ, GPTQ, fp16, bf16

Model Formats

Format	Ollama	vLLM	Notes
GGUF (Q4_K_M)	✓	✗	Ollama default
GGUF (Q5_K_M)	✓	✗	Higher quality
AWQ	✗	✓	Fastest on GPU
GPTQ	✗	✓
fp16	✓ (via Ollama)	✓	Full precision
bf16	✗	✓	A100/H100 only

CUDA

CUDA Version	Status	Notes
11.8	✓ Supported
12.0	✓ Supported
12.1	✓ Supported	Recommended
12.2+	✓ Supported

Requires: nvidia-driver >= 520

State Store

Backend	Status	Notes
SQLite (WAL)	✓ Default	All deployments
Files (JSONL)	✓ Legacy	v0.7 compatibility
PostgreSQL 14+	✓ Supported	Team/Enterprise profiles
PostgreSQL 13	⚠ Partial	No JSON operators
MySQL/MariaDB	✗ Not supported

Observability

Tool	Version	Status
Prometheus	2.x / 3.x	✓ Supported
Grafana	9.x / 10.x / 13.x	✓ Supported
OpenTelemetry Collector	0.80+	✓ Supported
Datadog	Any (via OTEL)	✓ Supported
Jaeger	1.x	✓ Supported

Docker

Tool	Version	Status
Docker Engine	24+	✓ Supported
Docker Desktop (Mac)	4.x	✓ Supported
Docker Compose	v2.x	✓ Required
Podman	4+	⚠ Experimental

Chat UI

Browser	Status
Chrome / Chromium 110+	✓ Supported
Firefox 110+	✓ Supported
Safari 16+	✓ Supported
Edge 110+	✓ Supported

Runtime: Node.js 18+ required for aua ui / aua serve --with-ui

Python Dependencies (key packages)

Package	Min version	Notes
fastapi	0.100+
uvicorn	0.20+
httpx	0.25+
pydantic	2.0+	v1 not supported
click	8.0+
rich	13.0+
pyyaml	6.0+
cryptography	41.0+	Optional — certs + encryption
prometheus-client	0.17+	Optional — metrics
opentelemetry-sdk	1.20+	Optional — aua[otel]

Tested Hardware (v1.0 — unchanged in v1.1)

Hardware	OS	Backend	Status
MacBook Pro M1 Max (32 GB)	macOS 14	Ollama	✓ Primary dev platform
MacBook Pro M2 (16 GB)	macOS 14	Ollama	✓
Desktop RTX 4090	Ubuntu 22.04	vLLM	✓
RunPod RTX 4090 (24 GB)	Ubuntu 22.04	vLLM	✓ CI validation

AUA Framework — Permission / Scope Matrix

Version: 1.1.0

Status: Canonical. Authentication implemented in v0.9-rc1.

Scopes

Scope	Description
`aua:query`	Send queries via `POST /query`
`aua:stream`	Send streaming queries via `POST /query/stream`
`aua:batch`	Send batch queries via `POST /query/batch`
`aua:status`	Read `GET /status`, `GET /health/*`, `GET /version`
`aua:config:read`	Read `GET /config` (secrets redacted)
`aua:config:write`	Reload config via `POST /config/reload`
`aua:corrections:read`	Read `GET /corrections`
`aua:corrections:write`	Inject corrections via `POST /corrections`
`aua:deploy`	Trigger green evaluation via `POST /deploy/green`
`aua:rollback`	Execute rollback (CLI + REST)
`aua:extensions:read`	Read `GET /extensions`, `GET /extensions/{name}`
`aua:extensions:write`	Load/reload extensions, test imports
`aua:tokens:read`	List and inspect tokens (CLI: `aua token list`)
`aua:tokens:write`	Create and revoke tokens (CLI: `aua token create/revoke`)
`aua:admin`	All scopes — for operator/admin use only

Endpoint → Required Scope

Endpoint	Method	Required Scope	Notes
`/query`	POST	`aua:query`
`/query/stream`	POST	`aua:stream`
`/query/batch`	POST	`aua:batch`
`/health/live`	GET	none	Public — used by load balancers
`/health/ready`	GET	none	Public
`/health/startup`	GET	none	Public
`/version`	GET	none	Public
`/docs`	GET	none	Disable in production via config
`/status`	GET	`aua:status`
`/config`	GET	`aua:config:read`	Secrets always redacted
`/config/reload`	POST	`aua:config:write`
`/corrections`	GET	`aua:corrections:read`
`/corrections`	POST	`aua:corrections:write`
`/deploy/green`	POST	`aua:deploy`
`/deploy/rollback`	POST	`aua:rollback`
`/extensions`	GET	`aua:extensions:read`	Disabled in production
`/extensions/{name}`	GET	`aua:extensions:read`	Disabled in production
`/extensions/reload`	POST	`aua:extensions:write`	Disabled in production
`/extensions/test`	POST	`aua:extensions:write`	Dev only
`/metrics`	GET	`aua:status`	Prometheus scrape endpoint
v1.1 — persistence, search & production ops
`/conversations`	POST / GET	`aua:query`
`/conversations/{id}/title`	PATCH	`aua:query`
`/conversations/{id}/messages`	GET / POST	`aua:query`
`/projects`	POST / GET	`aua:query`
`/search`	GET	`aua:query`
`/context/backup/coverage`	GET	`aua:status`
`/context/backup/run-coverage-job`	POST	`aua:query`
`/corrections/confirm-implicit`	POST	`aua:corrections:write`
`/corrections/{id}`	PATCH / DELETE	`aua:corrections:write`	DELETE is a soft delete (scope='superseded')
`/corrections/evidence`	GET	`aua:corrections:read`
`/analytics`, `/reliability`, `/usage`, `/pricing`	GET	`aua:status`
`/version/check`, `/update/skipped`	GET	none	Public
`/update/skip`	POST	`aua:config:write`
`/bug-report`	POST	none	Returns 200 even without a PAT configured
`/local/models`, `/local/settings`	GET	`aua:status`
`/local/models`, `/local/settings`	POST	`aua:config:write`
`/local/specialist/{id}`	PATCH	`aua:config:write`
`/domain-tree`	GET	`aua:status`

Default Token Scopes by Role

Role	Scopes granted
`reader`	`aua:query aua:stream aua:status`
`operator`	All except `aua:admin aua:extensions:write`
`admin`	`aua:admin` (all scopes)
`ci-deploy`	`aua:deploy aua:rollback aua:config:read`
`monitoring`	`aua:status`
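
The scope and role tables translate into a simple authorization check. The sketch below is an illustration, not the framework's auth code: `is_authorized` is a hypothetical name, and the two dictionaries copy only a handful of rows from the tables above. A required scope of `None` stands for a public endpoint, and `aua:admin` is treated as granting every scope.

```python
from typing import Optional

# A few rows from the endpoint table; None marks a public endpoint.
REQUIRED_SCOPE: dict[tuple[str, str], Optional[str]] = {
    ("POST", "/query"): "aua:query",
    ("GET", "/health/live"): None,
    ("GET", "/config"): "aua:config:read",
    ("POST", "/deploy/green"): "aua:deploy",
}

# A few rows from the role table.
ROLE_SCOPES: dict[str, set[str]] = {
    "reader": {"aua:query", "aua:stream", "aua:status"},
    "ci-deploy": {"aua:deploy", "aua:rollback", "aua:config:read"},
    "admin": {"aua:admin"},
}


def is_authorized(token_scopes: set[str], method: str, path: str) -> bool:
    """Check a token's scopes against the scope required by an endpoint."""
    required = REQUIRED_SCOPE[(method, path)]
    if required is None:
        return True  # public endpoint, no token needed
    # aua:admin implies all scopes
    return "aua:admin" in token_scopes or required in token_scopes
```

Under these tables a `reader` token can call `POST /query` but not `POST /deploy/green`, while a `ci-deploy` token can do the reverse.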

Auth Behaviour

v0.7: No auth. All endpoints are open. Not suitable for public exposure.

v0.8: Scope matrix defined. Auth implementation ships in v0.9-rc1.

v0.9: Bearer token required on all non-public endpoints. Local dev can disable with security: {auth_enabled: false} and explicit warning.

v1.0: mTLS between router and specialists. Audit log for every auth event.

Local Development

# aua_config.yaml — local dev only
security:
  auth_enabled: false   # NEVER set this in production

When auth_enabled: false, aua doctor prints a prominent WARNING. The warning cannot be suppressed.

AUA Framework v1.0 — Validation Report

Version: 1.0.0

Date: 2026-05-11

Status: All validation criteria met. v1.0.0 shipped.

v1.1.0 addendum — shipped 2026-06-10

The report below is the v1.0.0 record, preserved as-is. v1.1.0 adds: the complete AUA-Veritas backport (V-P1–V-P3 — persistence/search, context backups, correction lifecycle, analytics suite, update management, bug reporting, projects, local models, domain ontology), end-to-end session/trace/request IDs (#15), live Vault + AWS Secrets Manager integration tests and the secrets: config block (#19), and YAML-wired plugins/hooks/middleware with strict config validation (F-09–F-11). Validation: 297 tests across Python 3.10/3.11/3.12, a 40-check live-router E2E suite, 28 new REST endpoints (50+ total), and every tutorial command verified against a live router. See the v1.1 roadmap section and CHANGELOG.md for the item-by-item record.

1. Test Suite — 208 tests, 0 failures

pytest -v --tb=short

Matrix: Python 3.10, 3.11, 3.12. All green on CI (GitHub Actions).

New tests added (76 total since 1.0.0): test_guard.py (32), test_policy.py (20), test_hooks_wired.py (21), test_vcg.py (10 — RouterConfig defaults, _vcg_select winner selection, n≥3 welfare calculation, tie-breaking, no-history prior_u=1.0, prior history used, non-negative scores, single specialist, version bump).

test_cli_doctor.py::test_doctor_runs_without_crash                         PASSED
test_cli_doctor.py::test_doctor_config_check_passes                        PASSED
test_cli_doctor.py::test_doctor_config_check_fails_missing_file            PASSED
test_cli_doctor.py::test_doctor_hardware_vllm_on_apple_fails               PASSED
test_cli_doctor.py::test_doctor_hardware_ollama_on_apple_passes            PASSED
test_cli_doctor.py::test_doctor_hardware_nvidia_vllm_passes                PASSED
test_cli_doctor.py::test_doctor_returns_integer                            PASSED
test_cli_doctor.py::test_doctor_json_output                                PASSED
test_cli_doctor.py::test_doctor_strict_exits_2_on_warn                    PASSED
test_cli_init.py::test_init_creates_directory                              PASSED
test_cli_init.py::test_init_creates_expected_files                         PASSED
test_cli_init.py::test_init_gitignore_content                              PASSED
test_cli_init.py::test_init_default_tier_is_single_4090                   PASSED
test_cli_init.py::test_init_macbook_tier                                   PASSED
test_cli_init.py::test_init_force_overwrites                               PASSED
test_cli_init.py::test_init_refuses_overwrite_without_force               PASSED
test_cli_init.py::test_init_existing_dir_is_reused                        PASSED
test_cli_init.py::test_init_all_tiers[macbook]                            PASSED
test_cli_init.py::test_init_all_tiers[single-4090]                        PASSED
test_cli_init.py::test_init_all_tiers[quad-4090]                          PASSED
test_cli_init.py::test_init_all_tiers[a100-cluster]                       PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[macbook]      PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[single-4090]  PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[quad-4090]    PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[a100-cluster] PASSED
test_config.py::test_load_minimal_config                                   PASSED
test_config.py::test_specialist_endpoint_url                               PASSED
test_config.py::test_specialist_for_field                                  PASSED
test_config.py::test_vllm_command                                          PASSED
test_config.py::test_blue_green_for                                        PASSED
test_config.py::test_all_endpoints                                         PASSED
test_config.py::test_available_tiers                                       PASSED
test_config.py::test_load_tier[macbook]                                    PASSED
test_config.py::test_load_tier[single-4090]                               PASSED
test_config.py::test_load_tier[quad-4090]                                  PASSED
test_config.py::test_load_tier[a100-cluster]                               PASSED
test_config.py::test_macbook_tier_uses_ollama                              PASSED
test_config.py::test_single_4090_tier_uses_vllm                           PASSED
test_config.py::test_a100_cluster_tier_no_enforce_eager                   PASSED
test_config.py::test_unknown_tier_raises                                   PASSED
test_config.py::test_missing_config_raises                                 PASSED
test_config.py::test_unknown_specialist_raises                             PASSED
test_config.py::test_specialist_endpoint_uses_host                        PASSED
test_config.py::test_specialist_models_url_uses_host                      PASSED
test_config.py::test_arbiter_endpoint_uses_host                           PASSED
test_config.py::test_endpoint_override                                     PASSED
test_config.py::test_custom_scheme                                         PASSED
test_config.py::test_runtime_config_defaults                               PASSED
test_config.py::test_runtime_ensure_creates_dirs                           PASSED
test_config.py::test_router_cors_defaults_to_wildcard                     PASSED
test_config.py::test_duplicate_ports_raises                                PASSED
test_config.py::test_unknown_key_raises                                    PASSED
test_config.py::test_invalid_threshold_raises                              PASSED
test_config.py::test_gpu_memory_utilization_zero_raises                   PASSED
test_config.py::test_tier_aliases_imported                                 PASSED
test_config.py::test_alias_rtx4090_loads_single_4090                     PASSED
test_config.py::test_alias_a100_loads_a100_cluster                        PASSED
test_config.py::test_quad_4090_has_multiple_gpus                          PASSED
test_config.py::test_quad_4090_has_law_specialist                         PASSED
test_config.py::test_single_4090_uses_awq                                 PASSED
test_config.py::test_a100_cluster_uses_fp16                               PASSED
test_config.py::test_unknown_tier_error_mentions_aliases                  PASSED
test_imports.py::test_core_imports                                         PASSED
test_imports.py::test_arbiter_alias                                        PASSED
test_imports.py::test_version_export                                       PASSED
test_imports.py::test_endpoint_models_exported                             PASSED
test_imports.py::test_stream_models_exported                               PASSED
test_imports.py::test_config_submodule                                     PASSED
test_imports.py::test_no_private_imports_required                          PASSED
test_rollback.py::test_record_promotion_creates_log                        PASSED
test_rollback.py::test_load_promotions_empty                               PASSED
test_rollback.py::test_load_promotions_after_record                        PASSED
test_rollback.py::test_rollback_no_history_returns_1                      PASSED
test_rollback.py::test_rollback_success                                    PASSED
test_rollback.py::test_rollback_updates_config                             PASSED
test_rollback.py::test_rollback_marks_promotion_reverted                  PASSED
test_rollback.py::test_rollback_appends_rollback_event                    PASSED
test_rollback.py::test_double_rollback_returns_1                          PASSED
test_rollback.py::test_rollback_all_skips_specialists_with_no_history     PASSED
test_rollback.py::test_rollback_cli_no_restart                            PASSED
test_rollback.py::test_promotions_saved_as_jsonl                          PASSED
test_rollback.py::test_promotion_id_is_uuid                               PASSED
test_rollback.py::test_rollback_dry_run                                    PASSED
test_rollback.py::test_atomic_config_write_no_tmp_left                    PASSED
test_router_api.py::test_health_live                                       PASSED
test_router_api.py::test_health_ready_with_fake_server                    PASSED
test_router_api.py::test_health_startup_after_ready                       PASSED
test_router_api.py::test_health_legacy_endpoint                            PASSED
test_router_api.py::test_version_endpoint                                  PASSED
test_router_api.py::test_config_endpoint                                   PASSED
test_router_api.py::test_config_does_not_expose_secrets                   PASSED
test_router_api.py::test_post_correction                                   PASSED
test_router_api.py::test_get_corrections_empty                             PASSED
test_router_api.py::test_get_corrections_after_post                       PASSED
test_router_api.py::test_status_endpoint_structure                        PASSED
test_router_api.py::test_query_single_domain                               PASSED
test_router_api.py::test_query_response_contains_text                     PASSED
test_router_api.py::test_query_batch                                       PASSED
test_router_api.py::test_reset_endpoint                                    PASSED
test_router_api.py::test_openapi_json_accessible                          PASSED
test_router_api.py::test_docs_accessible                                   PASSED
test_router_api.py::test_redoc_accessible                                  PASSED
test_router_api.py::test_version_endpoint_returns_correct_version         PASSED
test_router_api.py::test_cors_uses_config_origins                         PASSED
test_router_api.py::test_stream_named_event_fields                        PASSED
test_router_api.py::test_stream_content_encoding_none                     PASSED
test_status.py::test_fmt_uptime[0-0s]                                     PASSED
test_status.py::test_fmt_uptime[45-45s]                                   PASSED
test_status.py::test_fmt_uptime[90-1m 30s]                                PASSED
test_status.py::test_fmt_uptime[3600-1h 0m]                               PASSED
test_status.py::test_fmt_uptime[3723-1h 2m]                               PASSED
test_status.py::test_fmt_uptime[7200-2h 0m]                               PASSED
test_status.py::test_mini_bar_full                                         PASSED
test_status.py::test_mini_bar_empty                                        PASSED
test_status.py::test_mini_bar_half                                         PASSED
test_status.py::test_mini_bar_width                                        PASSED
test_status.py::test_render_returns_panel                                  PASSED
test_status.py::test_render_shows_up_down                                  PASSED
test_status.py::test_render_shows_utility_score                            PASSED
test_status.py::test_render_shows_memory                                   PASSED
test_status.py::test_render_none_shows_error_panel                        PASSED
test_streaming.py::test_stream_returns_200                                 PASSED
test_streaming.py::test_stream_emits_start_event                          PASSED
test_streaming.py::test_stream_emits_chunk_events                         PASSED
test_streaming.py::test_stream_emits_done_event                           PASSED
test_streaming.py::test_stream_event_order                                PASSED
test_streaming.py::test_stream_chunks_concatenate_to_response             PASSED
test_streaming.py::test_stream_sse_headers                                PASSED
test_version.py::test_version_format                                       PASSED
test_version.py::test_init_re_exports_version                             PASSED
test_version.py::test_version_in_all                                       PASSED
test_guard.py::test_assertion_decorator_creates_fn                         PASSED
test_guard.py::test_assertion_string_level                                 PASSED
test_guard.py::test_assertion_registered_in_registry                      PASSED
test_guard.py::test_assertion_callable                                     PASSED
test_guard.py::test_blocking_level                                         PASSED
test_guard.py::test_soft_level                                             PASSED
test_guard.py::test_info_level                                             PASSED
test_guard.py::test_python_syntax_check_passes_good_code                  PASSED
test_guard.py::test_python_syntax_check_fails_bad_code                    PASSED
test_guard.py::test_python_syntax_check_passes_non_code                   PASSED
test_guard.py::test_analogy_bonus_fires_on_analogy                        PASSED
test_guard.py::test_analogy_bonus_neutral_without_analogy                 PASSED
test_guard.py::test_no_refusal_soft_flags                                 PASSED
test_guard.py::test_no_refusal_passes_normal                              PASSED
test_guard.py::test_min_length_soft_flags_short                           PASSED
test_guard.py::test_min_length_passes_normal                              PASSED
test_guard.py::test_list_assertions_returns_list                          PASSED
test_guard.py::test_policy_run_info_bonus_applied                         PASSED
test_guard.py::test_policy_run_multiple_bonuses_sum                       PASSED
test_guard.py::test_policy_run_bonus_capped_by_max_total                  PASSED
test_guard.py::test_policy_run_no_bonus_if_neutral                        PASSED
test_guard.py::test_policy_run_blocking_pass                              PASSED
test_guard.py::test_policy_run_blocking_fail_no_retry_fn                  PASSED
test_guard.py::test_policy_run_blocking_retry_succeeds                    PASSED
test_guard.py::test_policy_gold_standard_flag                             PASSED
test_guard.py::test_policy_not_gold_standard_if_blocking_failed           PASSED
test_guard.py::test_policy_chaining                                       PASSED
test_guard.py::test_policy_summary                                        PASSED
test_policy.py::test_policy_defaults                                      PASSED
test_policy.py::test_policy_add_wrong_type_raises                         PASSED
test_policy.py::test_load_policy_basic                                    PASSED
test_policy.py::test_load_policy_weight_overrides                         PASSED
test_policy.py::test_load_policy_not_found_raises                         PASSED
test_policy.py::test_load_policy_missing_name_raises                      PASSED
test_policy.py::test_load_policy_bad_yaml_raises                          PASSED
test_policy.py::test_validate_policy_yaml_valid                           PASSED
test_policy.py::test_validate_policy_yaml_missing_name                    PASSED
test_policy.py::test_validate_policy_yaml_invalid_level                   PASSED
test_policy.py::test_validate_policy_yaml_bonus_out_of_range              PASSED
test_policy.py::test_validate_policy_yaml_unknown_weight_key              PASSED
test_policy.py::test_validate_policy_yaml_not_found                       PASSED
test_policy.py::test_validate_policy_yaml_missing_import_path             PASSED
test_policy.py::test_policy_utility_overrides_accessible                  PASSED
test_policy.py::test_policy_summary_includes_all_fields                   PASSED
test_hooks_wired.py::test_all_11_hook_points_defined                      PASSED
test_hooks_wired.py::test_unknown_hook_point_raises                       PASSED
test_hooks_wired.py::test_hook_receives_event_dict                        PASSED
test_hooks_wired.py::test_hook_can_modify_event                           PASSED
test_hooks_wired.py::test_multiple_hooks_chain                            PASSED
test_hooks_wired.py::test_fail_open_hook_continues_on_error               PASSED
test_hooks_wired.py::test_fail_closed_hook_propagates_error               PASSED
test_hooks_wired.py::test_timeout_fail_open                               PASSED
test_hooks_wired.py::test_timeout_fail_closed_raises                      PASSED
test_hooks_wired.py::test_fire_background_does_not_block                  PASSED
test_hooks_wired.py::test_registered_hooks_summary                        PASSED
test_hooks_wired.py::test_on_correction_event_fields                      PASSED
test_hooks_wired.py::test_on_promotion_event_fields                       PASSED
test_hooks_wired.py::test_on_rollback_event_fields                        PASSED
test_hooks_wired.py::test_pre_query_event_fields                          PASSED
test_hooks_wired.py::test_post_route_event_fields                         PASSED
test_hooks_wired.py::test_pre_specialist_call_event_fields                PASSED
test_hooks_wired.py::test_post_specialist_call_event_fields               PASSED
test_hooks_wired.py::test_pre_arbiter_event_fields                        PASSED
test_hooks_wired.py::test_post_arbiter_event_fields                       PASSED
test_hooks_wired.py::test_pre_response_event_fields                       PASSED
test_vcg.py::test_router_config_default_arbitration_mode                  PASSED
test_vcg.py::test_router_config_accepts_vcg                               PASSED
test_vcg.py::test_vcg_select_winner_has_highest_welfare                   PASSED
test_vcg.py::test_vcg_select_welfare_dict_contains_all_specialists        PASSED
test_vcg.py::test_vcg_select_n2_correct_winner                            PASSED
test_vcg.py::test_vcg_select_tie_broken_by_confidence                     PASSED
test_vcg.py::test_vcg_select_no_history_defaults_prior_u_to_1             PASSED
test_vcg.py::test_vcg_select_with_prior_history                           PASSED
test_vcg.py::test_vcg_welfare_scores_are_non_negative                     PASSED
test_vcg.py::test_vcg_select_single_specialist                            PASSED
test_vcg.py::test_version_is_102                                          PASSED
test_version.py::test_cli_version                                          PASSED

======================== 208 passed, 6 warnings in 11.20s ========================

Matrix: Python 3.10, 3.11, 3.12. All green on CI (GitHub Actions).

2. REST API — 23 endpoints

Method	Path	Description
`POST`	`/query`	Route a single query through the specialist graph
`POST`	`/query/stream`	Stream a query response token-by-token (SSE)
`POST`	`/query/batch`	Route multiple queries in parallel
`GET`	`/health/live`	Liveness probe — is the router process alive?
`GET`	`/health/ready`	Readiness probe — are all specialists reachable?
`GET`	`/health/startup`	Startup probe — has the framework finished initialising?
`GET`	`/health`	Legacy liveness alias
`POST`	`/corrections`	Inject a correction into the assertions store
`GET`	`/corrections`	List stored corrections
`GET`	`/config`	Return the running configuration (read-only)
`POST`	`/deploy/green`	Trigger a blue-green promotion evaluation
`GET`	`/status`	Full telemetry snapshot (powers `aua status` dashboard)
`POST`	`/reset`	Reset domain confidence and classifier history
`GET`	`/stats`	Telemetry alias (legacy)
`GET`	`/version`	Return the running AUA Framework version
`POST`	`/sessions`	Create a new chat session
`GET`	`/sessions`	List all chat sessions
`GET`	`/sessions/{session_id}`	Get session metadata
`DELETE`	`/sessions/{session_id}`	Delete a session
`GET`	`/sessions/{session_id}/messages`	List messages in a session
`POST`	`/sessions/{session_id}/messages`	Post a message to a session
`GET`	`/metrics`	Prometheus metrics scrape endpoint
`GET`	`/metrics/cost`	Cost tracking metrics (GPU hours, USD per query)

Interactive docs: http://localhost:8000/docs (Swagger UI) · http://localhost:8000/redoc
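A client consuming `POST /query/stream` has to split the SSE byte stream into events. The following is a minimal sketch of that step, not the framework's own client. The event names `start`, `chunk`, and `done` are inferred from the streaming test names in section 1; the JSON payload shape (a `text` key on each chunk) is a hypothetical stand-in and may differ from what the router actually emits.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        # Tolerate a stream that ends without a final blank line.
        yield event, "\n".join(data)


# Hypothetical stream; the payload keys are assumptions for illustration.
sample = [
    "event: start\n", 'data: {"domain": "software_engineering"}\n', "\n",
    "event: chunk\n", 'data: {"text": "Hello, "}\n', "\n",
    "event: chunk\n", 'data: {"text": "world"}\n', "\n",
    "event: done\n", "data: {}\n", "\n",
]
events = list(parse_sse(sample))
text = "".join(json.loads(d)["text"] for name, d in events if name == "chunk")
```

Concatenating the `chunk` payloads in arrival order should reproduce the full response, which is the property `test_stream_chunks_concatenate_to_response` appears to check.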

3. Plugin Protocol Interfaces — 8 protocols + 1 middleware

Defined in aua/plugins/interfaces.py. All use Python typing.Protocol for structural subtyping — no base class required.

Protocol	Description
`FieldClassifierPlugin`	Replaces the built-in field classifier. Implement `classify(query) -> dict[str, float]` and `top_field(query) -> str`.
`UtilityScorerPlugin`	Replaces the built-in `U = w_e·E + w_c·C + w_k·K` scorer. Implement `score(response, field, prior_u) -> float` and `weights(field) -> dict`.
`ArbiterPolicyPlugin`	Replaces the built-in 4-check arbitration policy. Implement `arbitrate(claims, context) -> ArbiterVerdict` and `should_escalate(claims, context) -> bool`.
`PromotionPolicyPlugin`	Decides whether a GREEN candidate should be promoted to BLUE. Implement `should_promote(blue_stats, green_stats, config) -> bool` and `promotion_reason(…) -> str`.
`CorrectionStorePlugin`	Replaces the built-in in-memory `AssertionsStore`. Implement `store(claim)`, `query(subject, domain) -> list`, and `export_dpo(domain) -> list`.
`ModelBackendPlugin`	Replaces the built-in vLLM/Ollama HTTP backend. Implement `generate(prompt, model, params) -> str`, `stream(…)`, `health() -> bool`, and `models() -> list`.
`StateStorePlugin`	Pluggable persistent state store (SQLite default, Postgres via `asyncpg`). Implement `get`, `set`, `append`, `delete`, and `query`.
`HookPlugin`	Lifecycle hook. Fires at 11 named points in the request pipeline. Implement `hook_name() -> str` and `__call__(context) -> None`.
`AUAMiddleware`	Request/response middleware. Runs before and after the query pipeline. Implement `before_query`, `after_query`, `before_specialist`, `after_specialist`.

extensions:
  - import_path: "mypackage.myplugin:MyClassifierPlugin"
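Because the interfaces use structural subtyping, a plugin only needs to provide methods with the right names and signatures. The sketch below illustrates this for `FieldClassifierPlugin`. The Protocol is redeclared locally so the example runs on its own; the real definition lives in `aua/plugins/interfaces.py` and may carry additional members. `KeywordClassifier`, its keyword table, and the `mathematics` field name are hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FieldClassifierPlugin(Protocol):
    # Local stand-in mirroring the two methods listed in the table above.
    def classify(self, query: str) -> dict[str, float]: ...
    def top_field(self, query: str) -> str: ...


class KeywordClassifier:
    """Hypothetical plugin that scores fields by keyword hits (no base class)."""

    KEYWORDS = {
        "software_engineering": ("python", "bug", "function", "compile"),
        "mathematics": ("integral", "prove", "matrix", "theorem"),
    }

    def classify(self, query: str) -> dict[str, float]:
        q = query.lower()
        hits = {f: sum(k in q for k in kws) for f, kws in self.KEYWORDS.items()}
        total = sum(hits.values())
        if total == 0:
            # No keyword matched: fall back to a uniform distribution.
            return {f: 1.0 / len(hits) for f in hits}
        return {f: n / total for f, n in hits.items()}

    def top_field(self, query: str) -> str:
        scores = self.classify(query)
        return max(scores, key=scores.get)


plugin = KeywordClassifier()
conforms = isinstance(plugin, FieldClassifierPlugin)
```

A class like this would then be referenced from the `extensions` list by its import path, in the same way as the `MyClassifierPlugin` entry above.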

4. Prometheus Metrics — 18 metrics

Scraped at GET /metrics. All metrics prefixed aua_.

Metric	Type	Labels	Description
`aua_queries_total`	Counter	`domain`, `routing_mode`, `status`	Total queries routed
`aua_query_latency_seconds`	Histogram	`domain`, `routing_mode`	End-to-end query latency
`aua_utility_score`	Gauge	`domain`	Last U score per domain
`aua_contradiction_rate`	Gauge	`domain`	Arbiter contradiction rate
`aua_routing_field_distribution`	Counter	`field`	Classifier field assignment counts
`aua_specialist_confidence`	Gauge	`specialist`	Per-specialist confidence score
`aua_correction_count`	Counter	`domain`	Corrections stored per domain
`aua_arbiter_verdict_distribution`	Counter	`case`	Verdict cases (A/B/C/D) distribution
`aua_dpo_pairs_accumulated`	Gauge	—	Total DPO training pairs in store
`aua_token_requests_total`	Counter	`scope`, `status`	Token auth requests
`aua_hook_failures_total`	Counter	`hook_point`	Hook execution failures by hook point
`aua_plugin_execution_seconds`	Histogram	`plugin`, `kind`	Plugin execution latency
`aua_specialist_vram_utilization`	Gauge	`specialist`	VRAM utilisation (0–1)
`aua_cost_gpu_hours_total`	Counter	`specialist`	Cumulative GPU hours per specialist
`aua_cost_usd_total`	Counter	`specialist`	Cumulative USD cost per specialist
`aua_assertion_results_total`	Counter	`assertion_name`, `level`, `passed`, `domain`	Assertion results by name, level, and outcome
`aua_assertion_retries_total`	Counter	`assertion_name`	Retry attempts triggered by BLOCKING assertions
`aua_assertion_bonus_applied`	Histogram	`policy_name`	E-score bonus applied by INFO assertions per session

Grafana dashboard: docker/grafana/aua_dashboard.json — 20 panels, pre-provisioned.

5. CLI — 22 command groups, 55+ subcommands

aua --version  # 1.0.0

Group	Subcommands	Description
`aua init`	_(positional: name)_ `--preset` `--tier` `--force`	Scaffold a new AUA project
`aua serve`	`--config` `--tier` `--dry-run` `--with-ui` `--ui-port` `--reuse-running` `--router-only`	Start specialists + router
`aua doctor`	`--config` `--strict` `--json`	Pre-flight readiness check
`aua status`	`--config` `--interval` `--once`	Live terminal dashboard
`aua config`	`validate` · `expand` · `reload`	Config management
`aua eval`	`run` · `report` · `compare`	Evaluation harness
`aua token`	`create` · `list` · `inspect` · `revoke`	API token management
`aua certs`	`generate` · `inspect`	mTLS certificate management
`aua dpo`	`export`	Export DPO pairs for fine-tuning
`aua corrections`	`export`	Export stored corrections
`aua rollback`	_(positional: specialist)_ `--all` `--no-restart`	Blue-green rollback
`aua extensions`	`list` · `inspect` · `test`	Plugin/hook management
`aua models`	`list`	Model pull status
`aua fields`	`list`	Field config introspection
`aua presets`	`list`	Preset introspection
`aua defaults`	`show`	Framework defaults
`aua ui`	`--port` `--install-only`	Chat UI (standalone)
`aua guard`	`list` · `test`	List/test registered assertions
`aua policy`	`list` · `validate` · `apply`	Policy management
`aua calibrate`	`--layer 1/2/3` `--force` `--dry-run`	Calibration cycles
`aua logs`	`sessions` · `assertions` · `export`	Query session/assertion logs
`aua metrics`	`--compare`	Compare metrics across time windows

6. Fresh-Clone Install

# Environment: Python 3.11.10 (pyenv), macOS Apple Silicon
git clone https://github.com/praneethtota/Adaptive-Utility-Agent.git
cd Adaptive-Utility-Agent

pip install -e ".[dev]"
# Successfully installed adaptive-utility-agent-1.0.0 ...

aua --version
# aua, version 1.0.0

aua init my-test-project --preset coding --tier macbook
# ✓ Created my-test-project/
# ✓ aua_config.yaml written (tier: macbook)
# ✓ evals/ scaffolded

cd my-test-project && aua doctor
# ✓ Config valid
# ✓ Ollama reachable on port 11434
# ✓ All checks passed

pytest -q
# 132 passed, 6 warnings in 15.69s

7. Docker Compose Validation

# CPU/Ollama profile (macOS / CPU servers)
docker compose --profile ollama up -d
# ✓ aua-ollama       healthy (30s)
# ✓ aua-model-puller exited 0 (models pulled)
# ✓ aua-router       healthy

curl http://localhost:8000/health/live
# {"status":"ok","version":"1.0.0"}

# Observability stack
docker compose --profile obs up -d
# ✓ aua-prometheus   healthy (port 9090)
# ✓ aua-grafana      healthy (port 3000)
# Dashboard auto-provisioned at http://localhost:3000

# GPU profile (Linux + NVIDIA)
docker compose -f docker-compose.gpu.yml up -d
# ✓ aua-router       healthy (vLLM backend)

8. Chat UI Startup Validation

# Terminal 1 — AUA router
aua serve --tier macbook
# ✓ ollama healthy (3s)
# ✓ qwen2.5-coder:7b already pulled
# ✓ qwen2.5:7b already pulled
# ✓ qwen2.5:3b already pulled
# INFO: Uvicorn running on http://0.0.0.0:8000

# Terminal 2 — Next.js Chat UI (Node.js 18+)
cd apps/aua_chat && npm install && npm run dev
# ▲ Next.js 14.x
# - Local: http://localhost:3001
# ✓ Ready in 4.2s

# Browser: http://localhost:3001
# Login: admin / aua-admin
# ✓ Three-panel layout: Sidebar | Chat | Framework Debugger
# ✓ AUA Controls drawer opens on click
# ✓ Query routed, debugger shows domain, U score, latency

Note: aua serve --with-ui attempts to start the Next.js process automatically. On macOS with nvm/homebrew, the manual two-terminal approach above is recommended if --with-ui does not produce a ✓ Chat UI confirmation line. UI startup log: .aua/logs/ui.log.

9. Security Validation

# Token auth
aua token create --scope aua:query --expires 30d
# Token: aua_tk_...

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer aua_tk_..." \
  -d '{"query": "test"}'
# ✓ 200 OK

curl -X POST http://localhost:8000/query \
  -d '{"query": "test"}'
# ✓ 401 Unauthorized

# 14 auth scopes: aua:query, aua:query:stream, aua:query:batch,
#   aua:corrections:read, aua:corrections:write, aua:config:read,
#   aua:deploy, aua:status, aua:reset, aua:sessions:read,
#   aua:sessions:write, aua:metrics, aua:tokens:manage, aua:admin

# mTLS certificates
aua certs generate
# ✓ ca.crt, router.crt, router.key written to .aua/certs/

# Encryption at rest (AES-256-GCM)
export AUA_ENCRYPTION_KEY=$(python3 -c "import os; print(os.urandom(32).hex())")
# Corrections, assertions, DPO pairs encrypted in state store

# Config redaction — secrets never exposed via API
curl http://localhost:8000/config | jq '.security'
# {"auth":{"enabled":true},"encryption":{"enabled":true,"key_secret":"[REDACTED]"}}
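The redaction behaviour shown for `GET /config` can be approximated with a recursive walk over the configuration. This is a sketch of one possible approach, not the framework's implementation: the marker list used to decide which keys are sensitive is an assumption, and the key value in the example is a placeholder.

```python
from typing import Any

# Assumed heuristic: a key is sensitive if its name contains one of these markers.
SENSITIVE_MARKERS = ("secret", "password", "token", "api_key")
REDACTED = "[REDACTED]"


def _is_sensitive(key: str) -> bool:
    return any(marker in key.lower() for marker in SENSITIVE_MARKERS)


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive leaf values masked."""
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            if _is_sensitive(key) and not isinstance(item, (dict, list)):
                out[key] = REDACTED
            else:
                out[key] = redact(item)
        return out
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Placeholder config; the input is left unmodified by redact().
running = {
    "security": {
        "auth": {"enabled": True},
        "encryption": {"enabled": True, "key_secret": "0f3a"},
    }
}
public_view = redact(running)
```

Building a new structure rather than mutating in place keeps the live configuration usable by the router while only the API response is masked.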

10. Observability Validation

# Prometheus scrape
curl http://localhost:8000/metrics | grep "^aua_"
# aua_queries_total{domain="software_engineering",routing_mode="single",status="ok"} 12.0
# aua_query_latency_seconds_bucket{domain="software_engineering",...} ...
# aua_utility_score{domain="software_engineering"} 0.7831
# aua_contradiction_rate{domain="software_engineering"} 0.0
# aua_routing_field_distribution{field="software_engineering"} 10.0
# aua_specialist_confidence{specialist="swe"} 0.823
# aua_correction_count{domain="software_engineering"} 0.0
# aua_arbiter_verdict_distribution{case="A"} 12.0
# aua_dpo_pairs_accumulated 0.0
# aua_cost_gpu_hours_total{specialist="swe"} 0.0114
# aua_cost_usd_total{specialist="swe"} 0.0079

# Live status dashboard
aua status
# ✓ All specialists up, U scores, VRAM, uptime displayed

# OTEL export (optional)
# Set OTEL_EXPORTER_OTLP_ENDPOINT to export traces to Jaeger/Tempo

# Grafana: http://localhost:3000 (admin / aua-admin)
# ✓ 20 pre-configured panels
# ✓ AUA dashboard auto-provisioned from docker/grafana/aua_dashboard.json

11. Assertions Engine Validation

from aua.guard import assertion, AssertionLevel, list_assertions
from aua.policy import Policy

# ── Register a BLOCKING assertion ─────────────────────────────────────────
@assertion(name="PythonSyntaxCheck", level=AssertionLevel.BLOCKING)
def validate_syntax(output: str, context: dict) -> tuple[bool, str | None]:
    import ast, re
    blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
    if not blocks:
        return True, None
    for block in blocks:
        try:
            ast.parse(block)
        except SyntaxError as e:
            return False, f"Syntax error at line {e.lineno}: {e.msg}"
    return True, None

# ── Register an INFO (positive) assertion ─────────────────────────────────
@assertion(name="AnalogyBonus", level=AssertionLevel.INFO, bonus=0.10)
def reward_analogy(output: str, context: dict) -> tuple[bool, str | None]:
    if any(p in output.lower() for p in ["like a", "similar to", "imagine", "think of it as"]):
        return True, "Positive: analogy used"
    return True, None  # neutral — no bonus

# ── Bundle into a Policy ──────────────────────────────────────────────────
policy = Policy(name="SafeCoding", max_total_bonus=0.30)
policy.add(validate_syntax)
policy.add(reward_analogy)

# ── Run against a response ────────────────────────────────────────────────
context = {"query": "Write binary search.", "session_id": "s1",
           "domain": "software_engineering", "field": "software_engineering"}

result = policy.run("Think of it as halving your search space each time.", context)
# ✓ passed=True, e_bonus=0.10 (analogy fired), gold_standard=True

result2 = policy.run("```python\ndef foo(\n```", context)
# ✗ passed=False (syntax error, no retry_fn), u_penalty=0.15

# ── List built-in assertions ──────────────────────────────────────────────
items = list_assertions()
# ✓ Returns: PythonSyntaxCheck, NoRefusal, MinLength, AnalogyBonus, ConciseBonus
#             + any user-registered assertions

# CLI validation
aua guard list
# ┌──────────────────┬──────────┬───────┬─────────────┐
# │ Name             │ Level    │ Bonus │ Description │
# ├──────────────────┼──────────┼───────┼─────────────┤
# │ PythonSyntaxCheck│ blocking │   —   │ Blocks ...  │
# │ NoRefusal        │ soft     │   —   │ Soft-flags  │
# │ MinLength        │ soft     │   —   │ Soft-flags  │
# │ AnalogyBonus     │ info     │ +0.08 │ Rewards ... │
# │ ConciseBonus     │ info     │ +0.06 │ Rewards ... │
# └──────────────────┴──────────┴───────┴─────────────┘

aua guard test --import-path aua.guard:python_syntax_check
# Assertion: PythonSyntaxCheck (blocking)
# Result:    ✓ PASSED

aua guard test --import-path aua.guard:analogy_bonus \
    --output "Think of it as a balanced binary tree."
# Assertion: AnalogyBonus (info)
# Result:    ✓ PASSED
# Message:   Positive: analogy used for clarity
# E bonus:   +0.08 would be applied

12. Policy System Validation

# YAML policy file
cat policies/safe_coding.yaml
# name: SafeCoding
# version: "1.0"
# max_retries: 3
# max_total_bonus: 0.30
# assertions:
#   - import_path: mypackage.policies:validate_syntax
#   - import_path: mypackage.policies:reward_analogy
#     bonus: 0.10
# utility_overrides:
#   w_k: 0.30

aua policy validate policies/safe_coding.yaml
# ✓ policies/safe_coding.yaml is valid

aua policy apply policies/safe_coding.yaml --dry-run
# Policy: SafeCoding v1.0
#   Max retries:     3
#   Max E bonus:     +0.3
#   Weight overrides: {'w_k': 0.3}
#   Assertions (2):
#     [BLOCKING] PythonSyntaxCheck
#     [INFO] AnalogyBonus  +0.10 E bonus
# --dry-run: policy NOT activated

aua policy apply policies/safe_coding.yaml
# ✓ Policy activated. Restart or hot-reload to apply.
#   Pointer: .aua/active_policy

aua policy list
# ┌──────────────────────┬───────────┬──────────────┬────────────┐
# │ File                 │ Status    │ Name         │ Assertions │
# ├──────────────────────┼───────────┼──────────────┼────────────┤
# │ safe_coding.yaml     │ ✓ valid   │ SafeCoding   │          2 │
# └──────────────────────┴───────────┴──────────────┴────────────┘

Option B bonus math verified:

Two INFO assertions each declaring bonus=0.15 with max_total_bonus=0.25

Both fire → sum = 0.30 → capped to max_total_bonus=0.25

E_final = min(1.0, E_base + 0.25) ✓
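The capping arithmetic above can be written out as a small function. This is a sketch of the stated rule only; the function name and the `E_base` values are illustrative assumptions, not framework identifiers.

```python
def apply_info_bonuses(e_base: float, bonuses: list[float], max_total_bonus: float) -> float:
    """Sum the fired INFO bonuses, cap the sum, then cap E at 1.0."""
    total = min(sum(bonuses), max_total_bonus)
    return min(1.0, e_base + total)


# Two INFO assertions fire with bonus=0.15 each; the policy caps the sum at 0.25.
e_mid = apply_info_bonuses(0.60, [0.15, 0.15], max_total_bonus=0.25)

# With a high base score the outer min() keeps E within [0, 1].
e_high = apply_info_bonuses(0.90, [0.15, 0.15], max_total_bonus=0.25)
```

With an assumed `E_base` of 0.60 the capped bonus gives 0.85; with 0.90 the result saturates at 1.0.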

Gold-standard detection: a session in which every INFO assertion fired and no BLOCKING assertion failed is marked gold_standard=True. aua calibrate --layer 3 uses this flag to identify DPO chosen pairs. ✓

13. Calibrate / Logs / Metrics Validation

# Layer 1 — eval harness
aua calibrate --layer 1 --dataset evals/coding_smoke.yaml
# ✓ Layer 1 calibration complete.

# Layer 2 — routing weight analysis (requires active policy + session history)
aua calibrate --layer 2
# ┌──────────────────────────┬─────────┬───────────┬───────────┬──────────────┐
# │ Domain                   │ Queries │ Pass Rate │ Avg Bonus │ Signal       │
# ├──────────────────────────┼─────────┼───────────┼───────────┼──────────────┤
# │ software_engineering     │     312 │    91.3%  │  +0.087   │ ↑ Strong     │
# └──────────────────────────┴─────────┴───────────┴───────────┴──────────────┘

# Layer 3 — DPO export dry-run
aua calibrate --layer 3 --dry-run
# Gold-standard sessions:   47
# Exportable pairs:         12
# --dry-run: would export 12 DPO pairs → dpo_pairs/calibration.jsonl

# Logs
aua logs sessions
# ✓ Shows recent sessions with U scores, domain, latency

aua logs assertions --filter passed=false
# ✓ Shows only failed assertion events

aua logs assertions --assertion PythonSyntaxCheck --tail 10
# ✓ Shows last 10 events for named assertion

aua logs export --output my_logs.json
# ✓ Exported N records → my_logs.json

# Metrics comparison
aua metrics --compare 30d
# ┌─────────────────────────────┬──────────┬──────────┬──────────────────┐
# │ Metric                      │ Prior    │ Current  │ Trend            │
# ├─────────────────────────────┼──────────┼──────────┼──────────────────┤
# │ Mean U score                │  0.6213  │  0.6891  │ ↑ +0.0678        │
# │ Assertion fail rate         │  0.2341  │  0.1102  │ ↓ -0.1239        │
# │ Retry rate (BLOCKING)       │  0.1820  │  0.0890  │ ↓ -0.0930        │
# └─────────────────────────────┴──────────┴──────────┴──────────────────┘

aua metrics --compare 7d --json
# ✓ Returns JSON with current/prior stats for external charting

assertion_events table: All assertion results are persisted to SQLite with columns (session_id, assertion_name, level, passed, bonus_applied, retries_used, message, domain, policy_name, created_at).

Three indexes: session, assertion_name, created_at. Queryable by aua logs and aua calibrate. ✓
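For orientation, a table with these columns and indexes could be declared as follows. Only the column names and the three indexed columns come from the description above; the column types, defaults, the surrogate `id` key, and the index names are assumptions, so the shipped schema may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS assertion_events (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed surrogate key
    session_id     TEXT    NOT NULL,
    assertion_name TEXT    NOT NULL,
    level          TEXT    NOT NULL,                   -- blocking / soft / info
    passed         INTEGER NOT NULL,                   -- 0 or 1
    bonus_applied  REAL    NOT NULL DEFAULT 0.0,
    retries_used   INTEGER NOT NULL DEFAULT 0,
    message        TEXT,
    domain         TEXT,
    policy_name    TEXT,
    created_at     TEXT    NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_ae_session   ON assertion_events (session_id);
CREATE INDEX IF NOT EXISTS idx_ae_assertion ON assertion_events (assertion_name);
CREATE INDEX IF NOT EXISTS idx_ae_created   ON assertion_events (created_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

rows = [
    ("s1", "PythonSyntaxCheck", "blocking", 0, "Syntax error at line 1", "software_engineering", "SafeCoding"),
    ("s1", "AnalogyBonus", "info", 1, "Positive: analogy used", "software_engineering", "SafeCoding"),
]
conn.executemany(
    "INSERT INTO assertion_events "
    "(session_id, assertion_name, level, passed, message, domain, policy_name) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Roughly the kind of query a failed-assertion filter would need.
failed = conn.execute(
    "SELECT assertion_name, message FROM assertion_events WHERE passed = 0"
).fetchall()
index_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx_ae_%'"
).fetchone()[0]
```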

14. VCG Welfare Maximization Validation

# Welfare formula: W_i = P(domain_i) × confidence_i × prior_mean_u_i
# Specialist with highest W_i wins fanout routing

from aua.config import RouterConfig

# Default is pairwise
cfg = RouterConfig()
assert cfg.arbitration_mode == "pairwise"

# VCG mode
cfg_vcg = RouterConfig(arbitration_mode="vcg")
assert cfg_vcg.arbitration_mode == "vcg"

# Activate via CLI
aua serve --arbitration-mode vcg

# Activate via YAML
router:
  arbitration_mode: vcg

# Activate via REST (persists to file)
curl -X PATCH http://localhost:8000/config \
  -H "Content-Type: application/json" \
  -d '{"arbitration_mode": "vcg", "persist": true}'
# → {"patched": {"arbitration_mode": "vcg"}, "persisted": true}

VCG response shape:

{
  "routing_mode": "vcg",
  "primary_domain": "software_engineering",
  "response": "...",
  "u_score": 0.748,
  "welfare_scores": {
    "swe":  0.5440,
    "math": 0.1800
  }
}
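The welfare formula stated at the top of this section can be sketched as a selection function. This is an illustration under stated assumptions, not the code in the framework: the `Candidate` type and `vcg_select` signature are hypothetical, the default prior of 1.0 and the confidence tie-break are inferred from the test names `test_vcg_select_no_history_defaults_prior_u_to_1` and `test_vcg_select_tie_broken_by_confidence`, and the input values are chosen only so that the resulting scores match the response shape above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    domain_prob: float                    # P(domain_i) from the field classifier
    confidence: float                     # specialist confidence
    prior_mean_u: Optional[float] = None  # None means no history yet


def vcg_select(candidates: list[Candidate]) -> tuple[str, dict[str, float]]:
    """Return (winner_name, welfare_scores) with W_i = P * confidence * prior_mean_u."""
    if not candidates:
        raise ValueError("at least one specialist is required")
    welfare = {}
    for c in candidates:
        prior = 1.0 if c.prior_mean_u is None else c.prior_mean_u
        welfare[c.name] = c.domain_prob * c.confidence * prior
    # Highest welfare wins; equal welfare is broken by higher confidence.
    winner = max(candidates, key=lambda c: (welfare[c.name], c.confidence))
    return winner.name, welfare


winner, welfare_scores = vcg_select([
    Candidate("swe", domain_prob=0.8, confidence=0.85, prior_mean_u=0.8),
    Candidate("math", domain_prob=0.2, confidence=0.9),  # no history yet
])
```

With these illustrative inputs `swe` scores about 0.544 and `math` about 0.180, so `swe` is selected.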

Validated results (RTX 4090 hardware experiment):

VCG arbitration (Arm D) vs no routing (Arm A): +43.3pp correctness (p = 0.0003, d = 1.02)

VCG outperformed oracle matched routing by 10pp (not significant at n=30)

VCG dominates per-domain: 100% math (5/5), 84% SWE (21/25)

Chat UI:

ControlsDrawer: Pairwise/VCG segmented toggle (greyed if <2 specialists)

DebuggerPanel: mint → indigo colour shift when routing_mode == 'vcg'

Welfare Scores section shows W_i per specialist with winner highlighted ✓

AUA Framework — Supplemental Roadmap

This document captures future enhancement ideas that are too specific or experimental for the main roadmap. Items are added as they are identified during development of AUA Framework, AUA-Veritas, or related products.

Each item includes context, rationale, and a suggested implementation approach. Items here do not have committed timelines; they feed into version planning as priorities are assessed.

Item 1 — Model incentive transparency via running score feedback

Origin: Identified during AUA-Veritas design session (2026-05-14).

Already implemented in AUA-Veritas Phase 1. Not in the v1.1.0 scope (shipped without it) — proposed for v1.2+.

The problem

AUA's specialist models currently receive queries with injected corrections but no information about:

That they are being evaluated

What they are being evaluated on

How their past performance has been (their running U score)

That a different specialist may be selected over them

This means specialists have no incentive signal beyond the raw query.

The proposed mechanism

Tell each specialist model, in the system context block of every prompt, that it is being scored, and show it its running reliability score as a trajectory (not the raw formula or weights).

Game theory basis:

VCG welfare maximization makes truthfulness the dominant strategy in both single-shot and repeated settings. A specialist that hallucinates or over-claims certainty will see its score drop, lose future routing selections, and end up worse off than a truthful response would have yielded. Adversarial behaviour between specialists is similarly self-punishing: deception is eventually caught by the correction store and costs the deceiver more than honesty would.

What specialists see (answer round):

You are one of several specialist models answering this query.

Your reliability score: 72  (previous: 65 → improved)

Scores increase when:
  - Your answers are accurate (verified by arbiter and cross-session corrections)
  - You correctly express uncertainty when you are not sure
  - You are consistent with verified corrections on this topic

Scores decrease when:
  - Your answer is flagged as incorrect by the arbiter
  - You claim certainty about something later found to be wrong
  - You contradict a verified past correction

The specialist with the highest combined welfare score handles this query.
Do not mention this scoring context in your response.

What the arbiter sees (arbitration round):

You are reviewing two specialist responses for accuracy.
Your arbiter reliability score: 81  (previous: 78 → improved)

Your score as arbiter increases when:
  - You correctly identify which specialist is right
  - Your verdict is later confirmed by the correction store

Your score decreases when:
  - You rule for the wrong specialist
  - Your verdict contradicts a verified correction added afterward

Be precise. Identify what is specifically wrong, not just which is better.

What is NOT shown:

The exact welfare formula (W_i = P × C × U_mean) — prevents metric gaming

Which specific specialist they are competing against — prevents adversarial targeting

Absolute U score (0.0–1.0) — only the trajectory integer (0–100) is shown
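The selection rule behind the hidden formula can be sketched as follows. This is a minimal illustration, assuming the three factors are available as plain floats; the `Candidate` class and `select_specialist` function are hypothetical names, and P and C are treated as opaque inputs because this section does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one specialist's welfare inputs."""
    name: str
    p: float       # the P factor from W_i = P × C × U_mean (opaque here)
    c: float       # the C factor (opaque here)
    u_mean: float  # running mean utility, 0.0–1.0


def welfare(cand: Candidate) -> float:
    """Compute W_i = P × C × U_mean for one candidate."""
    return cand.p * cand.c * cand.u_mean


def select_specialist(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest welfare score."""
    if not candidates:
        raise ValueError("no candidates to select from")
    # On a tie, max() keeps the first candidate; tie-breaking is an assumption.
    return max(candidates, key=welfare)


candidates = [
    Candidate("swe", p=0.9, c=0.8, u_mean=0.72),
    Candidate("math", p=0.6, c=0.9, u_mean=0.81),
]
winner = select_specialist(candidates)
```

With these made-up inputs, `swe` is selected despite the lower U_mean, because the product of all three factors decides the outcome.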

Score mapping:

U (0.0–1.0) → integer 0–100 via mean_u * 100. Previous score retrieved

from domain_states in UtilityScorer. Shown as "72 (previous: 65 → improved)"

or "58 (previous: 63 → dropped)".
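A minimal sketch of this mapping, assuming the product is rounded to the nearest integer and that an unchanged score gets its own label (the source specifies neither). The function names are hypothetical.

```python
def to_display_score(mean_u: float) -> int:
    """Map a U score in [0.0, 1.0] to an integer 0–100 via mean_u * 100."""
    # Clamping guards against out-of-range inputs; rounding is an assumption.
    clamped = min(max(mean_u, 0.0), 1.0)
    return round(clamped * 100)


def format_trajectory(current: int, previous: int) -> str:
    """Render the trajectory string shown to the model."""
    if current > previous:
        label = "improved"
    elif current < previous:
        label = "dropped"
    else:
        label = "unchanged"  # assumption: the source shows only two labels
    return f"{current} (previous: {previous} → {label})"


improved = format_trajectory(to_display_score(0.72), to_display_score(0.65))
dropped = format_trajectory(to_display_score(0.58), to_display_score(0.63))
```

Note that Python's `round()` rounds exact halves to the nearest even integer, so a different rounding rule may be preferable if half-point scores matter.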

Implementation in AUA

File	Change
`aua/router.py`	`_handle_single`, `_handle_fanout`: prepend system context block to specialist prompt
`aua/arbiter.py`	`arbitrate()`: prepend arbiter score context to arbitration prompt
`aua/utility_scorer.py`	Add `get_score_for_display(domain) → tuple[int, int]` returning (current, previous)
`aua/config.py`	Add `router.model_incentive_transparency: bool` (default: true)

YAML opt-out (for use cases where this is undesirable):

router:
  model_incentive_transparency: false
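How the flag might gate the prompt change can be sketched as below. This assumes the parsed YAML is available as a nested dict and that the context block is joined to the query with a blank line; `build_specialist_prompt` is a hypothetical helper, not an existing function in `aua/router.py`, and the context text is abbreviated.

```python
def build_specialist_prompt(query: str, trajectory: str, config: dict) -> str:
    """Prepend the scoring context to a specialist prompt unless opted out."""
    router_cfg = config.get("router", {})
    # A missing key falls back to True, matching the proposed default.
    if not router_cfg.get("model_incentive_transparency", True):
        return query

    context = "\n".join([
        "You are one of several specialist models answering this query.",
        "",
        f"Your reliability score: {trajectory}",
        "",
        # The "Scores increase when" / "Scores decrease when" lists are
        # omitted here for brevity.
        "The specialist with the highest combined welfare score handles this query.",
        "Do not mention this scoring context in your response.",
    ])
    # Assumption: context and query are separated by one blank line.
    return f"{context}\n\n{query}"


query = "What is the time complexity of Timsort?"
opted_out = {"router": {"model_incentive_transparency": False}}
plain = build_specialist_prompt(query, "72 (previous: 65 → improved)", opted_out)
scored = build_specialist_prompt(query, "72 (previous: 65 → improved)", {})
```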

Target version: v1.2+ (not shipped in v1.1.0)

Item 2 — "Look Under the Hood" — user-facing model reliability panel

Origin: Identified during AUA-Veritas design session (2026-05-14).

Already in AUA-Veritas Phase 4 roadmap. The backing data endpoints (GET /reliability, GET /analytics — per-specialist win rates and welfare trajectories) shipped in v1.1.0 (V-P2.2); the Chat UI panel itself is proposed for v1.2+.

The problem

AUA's Framework Debugger panel is aimed at developers — it shows U scores,

welfare scores, domain distributions and routing mode. Average users of the

Chat UI (non-MLE operators) have no visibility into how models are performing

over time and no way to understand why one specialist was picked over another

without reading documentation.

The proposed mechanism

A "Look Under the Hood" button in the Chat UI that opens a model reliability

panel — showing the same 0–100 reliability scores that specialists see in their

system prompts, as time-series graphs with clickable data points.

What's inside:

One time-series graph per specialist with ≥10 queries and ≥2–3 time instances

Clickable data points that expand an event card: query (truncated), peer verdict,

correction stored y/n, score delta

Plain-language explainer section below all graphs

Models with <10 queries show "Not enough data yet"

Y-axis fixed 0–100 across all models so scores are comparable
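The eligibility rules in this list can be sketched as a small filter. The names are hypothetical, the minimum number of time instances is taken as 2 (the lower end of the "≥2–3" range), and showing the placeholder when there are too few time instances is an assumption.

```python
from dataclasses import dataclass
from typing import Union

NOT_ENOUGH_DATA = "Not enough data yet"
Y_AXIS_RANGE = (0, 100)  # fixed across all models so scores are comparable


@dataclass(frozen=True)
class SpecialistHistory:
    """Hypothetical per-specialist record backing one graph."""
    name: str
    query_count: int
    scores: list[int]  # one 0–100 reliability score per time instance


def is_graphable(history: SpecialistHistory,
                 min_queries: int = 10,
                 min_points: int = 2) -> bool:
    """A graph needs enough queries and enough distinct time instances."""
    return history.query_count >= min_queries and len(history.scores) >= min_points


def panel_entries(histories: list[SpecialistHistory]) -> dict[str, Union[list[int], str]]:
    """Map each specialist to its score series or to the placeholder text."""
    return {
        h.name: (h.scores if is_graphable(h) else NOT_ENOUGH_DATA)
        for h in histories
    }


entries = panel_entries([
    SpecialistHistory("swe", query_count=12, scores=[65, 72, 70]),
    SpecialistHistory("math", query_count=4, scores=[58]),
])
```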

Event card on point click:

Score event — May 9, specialist swe: 72 → 70 (dropped)
Query:   "What is the time complexity of Timsort?" [truncated]
Verdict: Incorrect — arbiter flagged incorrect worst-case complexity
Correction stored: yes
Effect: reliability score −2
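A card like the one above could be rendered from a single audit event roughly as follows. This is a sketch, assuming the event is a dict carrying the `audit_log` fields and that the date label is supplied separately; the function names and the "unchanged" case are hypothetical, and the "[truncated]" marker is left out.

```python
def describe_delta(before: int, after: int) -> tuple[str, str]:
    """Return the trajectory label and a signed score effect."""
    delta = after - before
    if delta > 0:
        return "improved", f"+{delta}"
    if delta < 0:
        # U+2212 MINUS SIGN, matching the card text above.
        return "dropped", f"\u2212{-delta}"
    return "unchanged", "0"  # assumption: not shown in the source


def render_event_card(event: dict, date_label: str) -> str:
    """Format one score event as the five-line card."""
    label, effect = describe_delta(event["score_before"], event["score_after"])
    stored = "yes" if event["correction_stored"] else "no"
    return "\n".join([
        f"Score event — {date_label}, specialist {event['specialist']}: "
        f"{event['score_before']} → {event['score_after']} ({label})",
        f'Query:   "{event["query_preview"]}"',
        f"Verdict: {event['verdict']}",
        f"Correction stored: {stored}",
        f"Effect: reliability score {effect}",
    ])


event = {
    "query_preview": "What is the time complexity of Timsort?",
    "specialist": "swe",
    "score_before": 72,
    "score_after": 70,
    "verdict": "Incorrect",
    "correction_stored": True,
}
card = render_event_card(event, "May 9")
```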

Design rules:

Audit events are stored in the audit_log table with the canonical query truncated to 60 characters

Raw prompt text never stored — only the canonical form snippet

No export — read-only, local only

Implementation in AUA Chat UI

File	Change
`apps/aua_chat/src/components/UnderTheHood.tsx`	New component — reliability graphs
`apps/aua_chat/src/components/ScoreEventCard.tsx`	Clickable point event card
`aua/router.py`	Write score delta events to `audit_log` after each query
`aua/router.py`	Endpoint `GET /reliability` returning per-specialist score history (shipped in v1.1.0)
`aua/state.py`	`audit_log` entries: `query_preview`, `specialist`, `score_before`, `score_after`, `verdict`, `correction_stored`
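The `audit_log` entry shape and the 60-character rule can be sketched together. The field names come from the table above; the field types, the `AuditLogEntry` class, and the `make_audit_entry` helper are assumptions for illustration.

```python
from dataclasses import dataclass

PREVIEW_MAX_CHARS = 60  # canonical query truncation from the design rules


@dataclass(frozen=True)
class AuditLogEntry:
    """One score event; field types are assumed, names follow the table."""
    query_preview: str
    specialist: str
    score_before: int
    score_after: int
    verdict: str
    correction_stored: bool


def make_audit_entry(canonical_query: str, specialist: str,
                     score_before: int, score_after: int,
                     verdict: str, correction_stored: bool) -> AuditLogEntry:
    """Build an entry, keeping only the truncated canonical snippet."""
    # The raw prompt text is never passed in or stored here.
    return AuditLogEntry(
        query_preview=canonical_query[:PREVIEW_MAX_CHARS],
        specialist=specialist,
        score_before=score_before,
        score_after=score_after,
        verdict=verdict,
        correction_stored=correction_stored,
    )


long_query = "explain the worst-case time complexity of timsort " * 3
entry = make_audit_entry(long_query, "swe", 72, 70, "Incorrect", True)
```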

Target version: AUA Chat UI v1.2+ (after model incentive transparency, Item 1). Backend shipped in v1.1.0.

Item 3 — (add future items here)

Template:

### Title
**Origin:** Where the idea came from, when.
### The problem
### The proposed mechanism
### Implementation in AUA
**Target version:**