AUA Framework v1.1 — Documentation

Technical Documentation

Architecture specification, deployment profiles, compatibility matrix, permission scopes, validation report, and supplemental roadmap — all rendered here for easy reading.


AUA Framework v1 — Architecture Specification


Version: 1.1.0

Status: Canonical. Implementation must match this document; any divergence is a bug.




1. System Overview


AUA (Adaptive Utility Agents) is a multi-specialist LLM routing framework. It routes queries to domain-expert models, scores outputs using a utility function, detects contradictions, resolves them with an arbiter, and feeds verified corrections back into training.


The design goal is to be the Django of adaptive multi-model LLM systems: batteries included, deeply configurable, and extensible without editing framework internals.




2. Component Boundaries


┌─────────────────────────────────────────────────────┐
│                    AUA Router                        │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │Middleware │  │  Session  │  │  Correction     │   │
│  │ Pipeline │  │  Manager  │  │  Retrieval      │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │          Field Classifier                    │    │
│  │  (pluggable via FieldClassifierPlugin)       │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │  Router  │  │ Specialist│  │ Utility Scorer  │   │
│  │ Decision │  │  Calls    │  │ (pluggable)     │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │              Arbiter Agent                   │    │
│  │  (pluggable policy via ArbiterPolicyPlugin)  │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │   Hook   │  │Correction │  │  State Store    │   │
│  │ Registry │  │  Logger   │  │  (pluggable)    │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
└─────────────────────────────────────────────────────┘

External:
  Specialist servers  (vLLM / Ollama / custom ModelBackendPlugin)
  Arbiter server      (same backends)
  State store         (files / SQLite / Postgres)
  Observability       (stdout / Prometheus / OTEL)

2.1 v1.1 production services (AUA-Veritas backport)


v1.1.0 adds eight services that run inside the router process. Each starts in the FastAPI lifespan and shuts down cleanly; none blocks the query path.


Service | Module | What it does
Keyword index | aua/keywords.py | Message-level inverted index with async batch worker (50 ms), startup backfill, and DB fallback. Serves GET /search.
Context backups | aua/context_backup.py | Per-(specialist, conversation) token counters; 6-section handoff notes on token/message/time-gap triggers; 6-hour coverage sweep.
Trigger detector | aua/trigger_detector.py | Two-layer correction detection: regex Layer 1 ships built-in, Layer 2 is pluggable. Feeds POST /corrections/confirm-implicit.
Crash reporter | aua/crash_reporter.py | Startup sentinel + clean-shutdown marking; previous-session crashes detected (before the new sentinel is written) and reported with a queued-error flush.
Remote model config | aua/remote_config.py | Model registry refresh with a remote → DB-cache (7-day) → builtin fallback chain; field allowlist; 24-hour refresh job.
Domain ontology | aua/domain_tree.py | 10 fixed L0 roots; alias map + edit-distance resolution; candidate queue with 4-gate promotion (volume, diversity, coverage, divergence); hourly maintenance job. Serves GET /domain-tree.
Session ID middleware | aua/session.py | Per-request SessionContext (session/trace/request IDs): client-supplied honored, UUIDs generated, returned as headers on every response, propagated to specialists, hooks, audit, and logs (#15).
YAML extension loader | aua/router.py + aua/plugins/registry.py | Loads plugins:, hooks:, and middleware: from config at startup with contract validation and Django-style project-dir imports (F-09–F-11). GET /extensions reports server truth.
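
The crash reporter's ordering requirement (check for a previous crash before writing the new sentinel) is easy to get wrong, so a minimal sketch may help. This is an illustration only, not the code in aua/crash_reporter.py; the sentinel file name and both function names are hypothetical.

```python
import tempfile
from pathlib import Path

SENTINEL_NAME = "running.sentinel"  # hypothetical file name


def startup(state_dir: Path) -> bool:
    """Return True if the previous session crashed, then mark this one as running."""
    sentinel = state_dir / SENTINEL_NAME
    # Check first: a leftover sentinel means the last session never shut down cleanly.
    crashed = sentinel.exists()
    # Only now write the sentinel for the new session.
    sentinel.write_text("running")
    return crashed


def clean_shutdown(state_dir: Path) -> None:
    """Remove the sentinel so the next startup sees a clean exit."""
    (state_dir / SENTINEL_NAME).unlink(missing_ok=True)
```

If the sentinel were written before the check, every startup would look like a crash recovery, which is why the detection step has to come first.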



3. Full Request Lifecycle


Every query follows this pipeline in order. Steps marked [pluggable] can be replaced or extended via plugins/hooks.


 1. HTTP Request arrives at router
    └─ session_id / trace_id / request_id assigned (UUID if not supplied)

 2. Middleware pipeline — before_query() [pluggable]
    └─ PII redaction, tenant policy, rate limiting, auth check

 3. Session lookup
    └─ Retrieve prior session context from state store (if session_id known)

 4. Correction retrieval
    └─ Load relevant verified claims from AssertionsStore for this domain

 5. Field Classifier [pluggable]
    └─ Scores query against all known fields → domain_distribution dict
    └─ Emits: primary_domain, domain_distribution, routing_mode decision

 6. Routing decision
    ├─ single: one field above single_domain_threshold → one specialist
    ├─ fanout: multiple fields above fanout_threshold → multiple specialists
    └─ force_domain: override from request

 7. Specialist calls [pluggable via ModelBackendPlugin]
    └─ POST to specialist endpoint with correction context injected
    └─ Timeout: specialist_timeout (default 60s) → AUA_SPECIALIST_TIMEOUT

 8. Utility Scoring [pluggable]
    └─ U = w_e·E + w_c·C + w_k·K per specialist response
    └─ Kalman filter updates confidence estimate

 9. Arbiter [pluggable policy]
    └─ Runs if: fanout + contradiction detected
    └─ 4 checks: logical, mathematical, cross-session, empirical
    └─ Issues Case 1/2/3/4 verdict → correction signal

10. Hook registry — on_correction / on_promotion / etc. [pluggable]
    └─ Fire registered hooks for this event type

11. Correction logging
    └─ Store DPO pair to state store (if arbiter issued correction)
    └─ Update AssertionsStore with verified claim

12. Response assembly
    └─ RouterResponse model with session_id, u_score, routing_mode, response

13. Middleware pipeline — after_response() [pluggable]
    └─ Response transformation, audit logging

Background (off the request path, v1.1):
    keyword index worker · backup coverage job (6h) · remote config
    refresh (24h) · ontology job (1h) · crash report on startup

14. Metrics / Logs / Traces / Audit
    └─ Structured JSON to stdout
    └─ Prometheus metrics (if observability profile enabled)
    └─ OTEL traces (if otel extra installed)
    └─ Audit log entry written to state store (append-only, hash chain)
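
Steps 6 and 8 of the lifecycle above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the framework's implementation: the threshold values, the weights, the order in which the thresholds are checked, the fallback to the top field, and the function names are all hypothetical. The Kalman step is the standard scalar update; the variances are caller-supplied placeholders.

```python
from typing import Dict, List, Optional, Tuple


def decide_route(
    domain_distribution: Dict[str, float],
    single_domain_threshold: float = 0.7,  # assumed value, not a framework default
    fanout_threshold: float = 0.3,         # assumed value, not a framework default
    force_domain: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Step 6: pick a routing mode and the specialists to call."""
    if force_domain is not None:
        return "force_domain", [force_domain]
    if not domain_distribution:
        raise ValueError("domain_distribution is empty")
    ranked = sorted(domain_distribution.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_score = ranked[0]
    if top_score >= single_domain_threshold:
        return "single", [top_name]
    fanout = [name for name, score in ranked if score >= fanout_threshold]
    if len(fanout) > 1:
        return "fanout", fanout
    # Assumption: with no clear winner and no fanout set, fall back to the top field.
    return "single", [top_name]


def utility(e: float, c: float, k: float,
            w_e: float = 0.4, w_c: float = 0.3, w_k: float = 0.3) -> float:
    """Step 8: U = w_e*E + w_c*C + w_k*K. The default weights are placeholders."""
    return w_e * e + w_c * c + w_k * k


def kalman_update(estimate: float, variance: float,
                  measurement: float, measurement_variance: float) -> Tuple[float, float]:
    """Standard scalar Kalman update of a confidence estimate."""
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance
```

For example, `decide_route({"math": 0.5, "law": 0.4, "code": 0.1})` selects fanout to `math` and `law` under these assumed thresholds, while `{"math": 0.9, "law": 0.1}` routes to `math` alone.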



4. Component Ownership


ComponentModuleOwner interface
Field classifieraua.field_classifierFieldClassifierPlugin
Utility scoreraua.utility_scorerUtilityScorerPlugin
Arbiter policyaua.arbiterArbiterPolicyPlugin
Promotion policyaua.blue_greenPromotionPolicyPlugin
Correction storeaua.assertions_storeCorrectionStorePlugin
Model backendaua.router (http calls)ModelBackendPlugin
State storeaua.stateStateStorePlugin
Hooksaua.hooksHookPlugin
Middlewareaua.middlewareAUAMiddleware

All plugin types are defined in aua/plugins/interfaces.py as Python Protocol classes.
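
As a sketch of what a Protocol-based plugin contract looks like in Python, the example below defines a stand-in protocol and checks two classes against it. `UtilityScorerLike`, its `score` method, and both example classes are hypothetical; the real definitions live in aua/plugins/interfaces.py and may differ.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class UtilityScorerLike(Protocol):
    """Hypothetical stand-in for a plugin Protocol; the method name is an assumption."""

    def score(self, response: str, context: Dict[str, Any]) -> float:
        ...


class LengthScorer:
    """Satisfies the protocol structurally; no inheritance needed."""

    def score(self, response: str, context: Dict[str, Any]) -> float:
        return min(len(response) / 100.0, 1.0)


class NotAScorer:
    """Has no score() method, so the isinstance check fails."""


print(isinstance(LengthScorer(), UtilityScorerLike))  # True
print(isinstance(NotAScorer(), UtilityScorerLike))    # False
```

Note that a runtime `isinstance` check against a Protocol only verifies that the named methods exist. It does not check argument or return types, so signature mismatches still surface only when the plugin is called.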




5. Plugin Loading Lifecycle


1. Config loaded (load_config)
2. For each plugin reference in config:
   a. Resolve import_path: "module.path:ClassName"
   b. Import module
   c. Instantiate class with config dict injected
   d. Validate against Protocol (runtime isinstance check)
   e. Register in plugin registry
3. Router initialised with plugin registry
4. On SIGHUP: reload config → re-run steps 1-3 atomically

Plugins are validated at startup. A failed plugin load causes startup to abort with AUA_PLUGIN_LOAD_FAILED.
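
The resolve, import, instantiate, and validate steps above could look roughly like this. It is a sketch, not the framework's loader: `load_plugin`, `PluginLoadError`, the `required_methods` check, and passing the config dict as keyword arguments are all assumptions made for illustration.

```python
import importlib
from typing import Any, Dict, Iterable, Optional


class PluginLoadError(RuntimeError):
    """Hypothetical error type carrying the AUA_PLUGIN_LOAD_FAILED code."""


def load_plugin(
    import_path: str,
    config: Optional[Dict[str, Any]] = None,
    required_methods: Iterable[str] = (),
) -> Any:
    """Resolve 'module.path:ClassName', instantiate it, and check its methods."""
    module_path, sep, class_name = import_path.partition(":")
    if not (sep and module_path and class_name):
        raise PluginLoadError(
            f"AUA_PLUGIN_LOAD_FAILED: expected 'module.path:ClassName', got {import_path!r}"
        )
    try:
        module = importlib.import_module(module_path)
        cls = getattr(module, class_name)
        # Assumption: the config dict is injected as keyword arguments.
        instance = cls(**(config or {}))
    except Exception as exc:
        raise PluginLoadError(f"AUA_PLUGIN_LOAD_FAILED: {import_path}: {exc}") from exc
    missing = [m for m in required_methods if not callable(getattr(instance, m, None))]
    if missing:
        raise PluginLoadError(f"AUA_PLUGIN_LOAD_FAILED: {import_path} lacks {missing}")
    return instance
```

A standard-library class is enough to exercise it: `load_plugin("collections:OrderedDict", {"a": 1})` returns an `OrderedDict` containing the key `a`.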




6. Hook Execution Order


Within each hook point, hooks fire in YAML registration order. The hook points themselves fire in this sequence:


pre_query → [middleware.before_query] → post_route → pre_specialist_call
→ post_specialist_call → pre_arbiter → post_arbiter → on_correction
→ pre_response → [middleware.after_response] → post_response
→ on_promotion / on_rollback (async, not in request path)

Hook failures default to fail-open (log + continue). Set hooks.{name}.fail_closed: true to abort on failure.
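
The fail-open versus fail-closed behaviour can be illustrated with a small dispatcher. This is a sketch under assumptions: `fire_hooks`, the `(hook, fail_closed)` tuple shape, and the logger name are hypothetical, not the framework's hook registry API.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger("aua.hooks.sketch")  # hypothetical logger name

Hook = Callable[[Dict[str, Any]], None]


def fire_hooks(hooks: List[Tuple[Hook, bool]], event: Dict[str, Any]) -> int:
    """Run hooks in registration order; return how many completed successfully.

    Each entry is (hook, fail_closed). A failing fail-open hook is logged and
    skipped; a failing fail-closed hook re-raises and aborts the remaining hooks.
    """
    succeeded = 0
    for hook, fail_closed in hooks:
        try:
            hook(event)
            succeeded += 1
        except Exception:
            if fail_closed:
                raise
            logger.exception("hook %r failed; continuing (fail-open)", hook)
    return succeeded
```

Fail-open suits notification-style hooks, where a failed webhook should not block a response. Fail-closed suits hooks that enforce policy, where continuing after a failure would be unsafe.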




7. Observability Flow


Every request → structured JSON log line (stdout)
             → Prometheus counter/histogram increment (if enabled)
             → OTEL span (if aua[otel] installed)
             → Audit log entry (state store, append-only)

Key metrics:
  aua_queries_total{domain, routing_mode, status}
  aua_query_latency_seconds{domain, routing_mode}
  aua_utility_score{domain}
  aua_contradiction_rate{domain}
  aua_arbiter_verdict_total{case}
  aua_specialist_errors_total{specialist, error_code}



8. Security Boundary


  • Only the router port (default 8000) is public-facing.
  • Specialist ports are internal — bind to 127.0.0.1 or Docker internal network.
  • Extension endpoints (/extensions/*) are disabled in production mode.
  • All external endpoints require bearer token auth (v0.9+).
  • Secrets are never logged, traced, or returned via GET /config.
  • Audit log is append-only with a hash chain for tamper detection (v0.9+).
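
To make the hash-chain idea concrete, here is a minimal sketch of an append-only log in which each entry commits to the previous entry's hash. The entry layout, the genesis value, the choice of SHA-256, and the function names are assumptions for illustration; the framework's actual audit_log schema is not shown in this document.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed starting value for the chain


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev_hash": prev_hash,
                "hash": _digest(prev_hash, payload)})


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash depends on all earlier entries, changing one record invalidates every later hash, which is what makes tampering detectable.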



    9. State Store


    All persistent state goes through the StateStore interface:


    Data | v0.7 location | v0.8+ (default)
    Promotion log | .aua/state/promotions.jsonl | SQLite: promotions table
    Correction pairs | dpo_pairs/*.jsonl | SQLite: corrections table
    Assertions | In-memory (AssertionsStore) | SQLite: assertions table
    Sessions | None | SQLite: sessions table
    Audit log | None | SQLite: audit_log table

    Migration from v0.7 flat files: aua config migrate --from 0.7 --to 0.8




    10. Extension Points Summary


    Users extend AUA by adding YAML entries — never by editing framework source files.


    # Custom utility scorer
    utility_scorer:
      import_path: plugins.custom_utility:RiskWeightedUtilityScorer
      config:
        risk_weight: 0.7
    
    # Custom middleware
    middleware:
      - import_path: plugins.middleware:PIIRedactionMiddleware
      - import_path: plugins.middleware:AuditMiddleware
    
    # Custom hook
    hooks:
      on_correction:
        - import_path: plugins.hooks:SlackNotificationHook
          config:
            webhook_url_secret: SLACK_WEBHOOK_URL
    
    # Custom backend
    backends:
      my_gateway:
        import_path: plugins.backends:GatewayBackend
        base_url: https://gateway.internal
        auth_secret: GATEWAY_API_KEY



    Document maintained by: Praneeth Tota. Last updated: v0.8.0b0. For implementation questions, check this document first.


    AUA Framework v1 — Deployment Profiles


    Version: 1.1.0

    Status: Canonical. Each profile defines minimum requirements and recommended settings.




    Overview


    AUA ships with four deployment profiles. Choose the profile that matches your environment, then use aua init with the appropriate tier and configure security accordingly.


    Profile | Auth | State | mTLS | Observability | Use case
    Local Developer | Optional | SQLite | No | Optional | Solo dev, experimentation
    Single GPU Workstation | Recommended | SQLite | No | Optional | Personal GPU server
    Team Server | Required | Postgres/SQLite | Required | Required | Shared team deployment
    Enterprise | Required + IAM | Postgres | Required | Required | Production, regulated



    Profile 1 — Local Developer


    Target: MacBook Pro / laptop. Ollama backend. No GPU required.


    # aua_config.yaml
    aua:
      version: "1.0"
      backend: ollama
    
    security:
      auth_enabled: false   # acceptable for localhost-only
    
    state:
      backend: sqlite
      path: .aua/state/aua.db
    
    logging:
      level: INFO
      format: text           # human-readable for local dev

    Setup:

    brew install ollama
    aua init . --tier macbook --preset coding
    aua doctor
    aua serve

    Doctor checks for this profile:

  • Ollama reachable at port 11434
  • Required models pulled
  • Auth disabled warning (non-fatal on localhost)

    Limitations:

  • Not suitable for network exposure
  • Single user only
  • No authentication enforced



    Profile 2 — Single GPU Workstation


    Target: RTX 4090 or similar consumer GPU. vLLM backend. Single user or small team on LAN.


    aua:
      version: "1.0"
      backend: vllm
    
    security:
      auth_enabled: true
      token_secret_env: AUA_TOKEN_SECRET
    
    state:
      backend: sqlite
      path: .aua/state/aua.db
    
    logging:
      level: INFO
      format: json

    Setup:

    export AUA_TOKEN_SECRET=$(python3 -c "import secrets; print(secrets.token_hex(32))")
    aua init . --tier single-4090 --preset coding
    aua token create --scope aua:query --expires 90d --label "primary"
    aua doctor --strict
    aua serve

    Doctor checks for this profile:

  • CUDA available
  • VRAM sufficient for configured specialists
  • Auth enabled
  • Token secret set



    Profile 3 — Team Server


    Target: Dedicated Linux server, RTX 4090 or A100. Shared team access. Prometheus + Grafana monitoring.


    aua:
      version: "1.0"
      backend: vllm
    
    security:
      auth_enabled: true
      token_secret_env: AUA_TOKEN_SECRET
      mtls:
        enabled: true
        cert_dir: /etc/aua/certs
        auto_generate: false    # use your own CA in production
    
    state:
      backend: sqlite           # or postgres for HA
      path: /var/lib/aua/state/aua.db
    
    logging:
      level: INFO
      format: json
      output: /var/log/aua/router.log
    
    rate_limits:
      aua:query:
        requests_per_minute: 120
      aua:admin:
        requests_per_minute: 10

    Setup:

    # Generate certs (or use your own CA)
    aua certs generate --cert-dir /etc/aua/certs
    
    # Create tokens per team member
    aua token create --scope aua:query --scope aua:stream --expires 30d --label "team-alice"
    aua token create --scope aua:admin --expires 1d --label "ci-deploy"
    
    # Start with observability
    docker compose --profile obs up prometheus grafana -d
    aua serve

    Doctor checks for this profile:

  • Auth enabled (fatal if disabled)
  • mTLS certs present and not expired
  • Rate limits configured
  • Prometheus reachable



    Profile 4 — Enterprise


    Target: Multi-GPU cluster, regulated environment, audit requirements.


    aua:
      version: "1.0"
      backend: vllm
    
    secrets:
      provider: vault          # or: aws, gcp
      vault_url: https://vault.internal
      token_env: VAULT_TOKEN
    
    security:
      auth_enabled: true
      token_secret_env: AUA_TOKEN_SECRET
      mtls:
        enabled: true
        cert_dir: /etc/aua/certs
        auto_generate: false
      encryption:
        enabled: true
        key_secret: AUA_ENCRYPTION_KEY
    
    state:
      backend: sqlite           # postgres recommended for HA
      path: /var/lib/aua/state/aua.db
    
    logging:
      level: INFO
      format: json
      output: stdout            # forward to ELK/Splunk via log aggregator
    
    rate_limits:
      aua:query:
        requests_per_minute: 300
      aua:admin:
        requests_per_minute: 5
    
    # Disable development features
    extensions:
      runtime_import_enabled: false   # never allow runtime plugin loading
      allowlist_only: true

    Additional requirements:

  • All secrets via Vault/AWS SM/GCP SM — no plaintext in config
  • Encryption at rest enabled (AES-256-GCM)
  • Audit log verified via hash chain integrity check
  • Extension runtime API disabled
  • mTLS between all components
  • Prometheus + Grafana + alert routing to PagerDuty/Slack
  • Token expiry ≤ 30 days, rotation enforced

    Doctor checks for this profile:

  • All of Profile 3 checks
  • Secrets provider reachable
  • Encryption key set
  • Runtime import disabled
  • Audit log hash chain valid



    Doctor Profile Validation


    Run with --strict to enforce profile requirements:


    aua doctor --strict

    Exit codes:

  • 0 — all checks pass
  • 1 — one or more checks failed
  • 2 — warnings in strict mode (treated as failures)

    The doctor automatically detects which profile you're running based on your config and applies the appropriate check set.
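
The exit-code rules above reduce to a small decision function. The sketch below is an illustration of those three rules only; the result labels and the function name are hypothetical and are not taken from the aua doctor source.

```python
from typing import Iterable

PASS, WARN, FAIL = "pass", "warn", "fail"  # hypothetical result labels


def doctor_exit_code(results: Iterable[str], strict: bool = False) -> int:
    """Map individual check results to the documented exit codes."""
    results = list(results)
    if FAIL in results:
        return 1  # one or more checks failed
    if strict and WARN in results:
        return 2  # warnings are treated as failures in strict mode
    return 0      # all checks pass (warnings tolerated when not strict)
```

In a CI job, a non-zero code from `aua doctor --strict` can therefore gate a deployment on warnings as well as hard failures.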




    Upgrading Between Profiles


    Profile 1 → 2: Enable auth, set AUA_TOKEN_SECRET, create tokens.

    Profile 2 → 3: Add mTLS, configure rate limits, add observability stack.

    Profile 3 → 4: Add secrets manager, enable encryption, disable runtime imports.


    Migration: aua config migrate --from 0.9 --to 1.0


    AUA Framework v1 — Compatibility Matrix


    Version: 1.1.0




    Python


    Version | Status | Notes
    3.10 | ✓ Supported | Minimum version
    3.11 | ✓ Supported | Recommended
    3.12 | ✓ Supported | Tested in CI
    3.9 | ✗ Not supported | f-string syntax incompatible
    3.13 | ⚠ Experimental | Not yet in CI matrix



    Operating Systems


    OS | Backend | Notes
    macOS (Apple Silicon M1/M2/M3/M4) | Ollama only | vLLM has no macOS support
    macOS (Intel) | Ollama only | CPU inference only
    Ubuntu 20.04+ | vLLM + Ollama | Recommended for production
    Debian 11+ | vLLM + Ollama
    RHEL/Rocky 8+ | vLLM + Ollama
    Windows | Not tested | Use WSL2 + Ubuntu



    GPU & VRAM


    GPU | VRAM | Tier | Max simultaneous specialists
    Apple M-series (unified) | 16–128 GB | macbook | 3 (via Ollama, sequential)
    NVIDIA RTX 4090 | 24 GB | single-4090 | 3 (AWQ, concurrent)
    NVIDIA RTX 3090/4080 | 24 GB / 16 GB | single-4090 | 2–3 (may need lower util)
    4× NVIDIA RTX 4090 | 96 GB total | quad-4090 | 6–8
    NVIDIA A100 80 GB | 80 GB | a100-cluster | 4–6 (fp16)
    NVIDIA H100 80 GB | 80 GB | a100-cluster | 4–6 (fp16)

    VRAM estimates (AWQ 4-bit):

  • 3B model: ~2.5 GB
  • 7B model: ~5 GB
  • 14B model: ~9 GB
  • 32B model: ~20 GB
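
These estimates make it straightforward to sanity-check a specialist layout before starting servers. The sketch below sums the per-model figures from the list above and compares them with a VRAM budget. It is a rough planning aid under assumptions: it counts weights only (no KV cache or activation overhead), the 0.9 utilization factor is an assumed default, and the function name is hypothetical.

```python
from typing import List, Tuple

# Approximate AWQ 4-bit weight footprints, taken from the list above (GB).
AWQ_4BIT_VRAM_GB = {"3B": 2.5, "7B": 5.0, "14B": 9.0, "32B": 20.0}


def fits_in_vram(models: List[str], vram_gb: float,
                 gpu_memory_utilization: float = 0.9) -> Tuple[float, bool]:
    """Return (estimated GB needed, whether it fits in the usable budget)."""
    needed = sum(AWQ_4BIT_VRAM_GB[size] for size in models)
    budget = vram_gb * gpu_memory_utilization
    return needed, needed <= budget


print(fits_in_vram(["7B", "7B", "7B"], vram_gb=24))  # (15.0, True)
print(fits_in_vram(["32B", "7B"], vram_gb=24))       # (25.0, False)
```

Three 7B specialists fit on a 24 GB card by this estimate, which is consistent with the single-4090 row in the GPU table. Real headroom will be smaller once KV cache is allocated.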



    LLM Backends


    Ollama


    Version | Status | Notes
    0.3.x | ✓ Supported
    0.4.x | ✓ Supported | Recommended
    0.5.x+ | ✓ Supported

    Supported model formats via Ollama: GGUF (Q4, Q5, Q8), fp16


    vLLM


    Version | Status | Notes
    0.4.x | ✓ Supported
    0.5.x | ✓ Supported | Recommended
    0.6.x+ | ✓ Supported

    Supported model formats via vLLM: AWQ, GPTQ, fp16, bf16




    Model Formats


    FormatOllamavLLMNotes
    GGUF (Q4_K_M)Ollama default
    GGUF (Q5_K_M)Higher quality
    AWQFastest on GPU
    GPTQ
    fp16✓ (via Ollama)Full precision
    bf16A100/H100 only



    CUDA


    CUDA Version | Status | Notes
    11.8 | ✓ Supported
    12.0 | ✓ Supported
    12.1 | ✓ Supported | Recommended
    12.2+ | ✓ Supported

    Requires: nvidia-driver >= 520




    State Store


    Backend | Status | Notes
    SQLite (WAL) | ✓ Default | All deployments
    Files (JSONL) | ✓ Legacy | v0.7 compatibility
    PostgreSQL 14+ | ✓ Supported | Team/Enterprise profiles
    PostgreSQL 13 | ⚠ Partial | No JSON operators
    MySQL/MariaDB | ✗ Not supported



    Observability


    Tool | Version | Status
    Prometheus | 2.x / 3.x | ✓ Supported
    Grafana | 9.x / 10.x / 13.x | ✓ Supported
    OpenTelemetry Collector | 0.80+ | ✓ Supported
    Datadog | Any (via OTEL) | ✓ Supported
    Jaeger | 1.x | ✓ Supported



    Docker


    Tool | Version | Status
    Docker Engine | 24+ | ✓ Supported
    Docker Desktop (Mac) | 4.x | ✓ Supported
    Docker Compose | v2.x | ✓ Required
    Podman | 4+ | ⚠ Experimental



    Chat UI


    Browser | Status
    Chrome / Chromium 110+ | ✓ Supported
    Firefox 110+ | ✓ Supported
    Safari 16+ | ✓ Supported
    Edge 110+ | ✓ Supported

    Runtime: Node.js 18+ required for aua ui / aua serve --with-ui




    Python Dependencies (key packages)


    Package | Min version | Notes
    fastapi | 0.100+
    uvicorn | 0.20+
    httpx | 0.25+
    pydantic | 2.0+ | v1 not supported
    click | 8.0+
    rich | 13.0+
    pyyaml | 6.0+
    cryptography | 41.0+ | Optional: certs + encryption
    prometheus-client | 0.17+ | Optional: metrics
    opentelemetry-sdk | 1.20+ | Optional: aua[otel]



    Tested Hardware (v1.0 — unchanged in v1.1)


    Hardware | OS | Backend | Status
    MacBook Pro M1 Max (32 GB) | macOS 14 | Ollama | ✓ Primary dev platform
    MacBook Pro M2 (16 GB) | macOS 14 | Ollama
    Desktop RTX 4090 | Ubuntu 22.04 | vLLM
    RunPod RTX 4090 (24 GB) | Ubuntu 22.04 | vLLM | ✓ CI validation

    AUA Framework — Permission / Scope Matrix


    Version: 1.1.0

    Status: Canonical. Authentication implemented in v0.9-rc1.




    Scopes


    Scope | Description
    aua:query | Send queries via POST /query
    aua:stream | Send streaming queries via POST /query/stream
    aua:batch | Send batch queries via POST /query/batch
    aua:status | Read GET /status, GET /health/*, GET /version
    aua:config:read | Read GET /config (secrets redacted)
    aua:config:write | Reload config via POST /config/reload
    aua:corrections:read | Read GET /corrections
    aua:corrections:write | Inject corrections via POST /corrections
    aua:deploy | Trigger green evaluation via POST /deploy/green
    aua:rollback | Execute rollback (CLI + REST)
    aua:extensions:read | Read GET /extensions, GET /extensions/{name}
    aua:extensions:write | Load/reload extensions, test imports
    aua:tokens:read | List and inspect tokens (CLI: aua token list)
    aua:tokens:write | Create and revoke tokens (CLI: aua token create/revoke)
    aua:admin | All scopes; for operator/admin use only



    Endpoint → Required Scope


    Endpoint | Method | Required Scope | Notes
    /query | POST | aua:query
    /query/stream | POST | aua:stream
    /query/batch | POST | aua:batch
    /health/live | GET | none | Public, used by load balancers
    /health/ready | GET | none | Public
    /health/startup | GET | none | Public
    /version | GET | none | Public
    /docs | GET | none | Disable in production via config
    /status | GET | aua:status
    /config | GET | aua:config:read | Secrets always redacted
    /config/reload | POST | aua:config:write
    /corrections | GET | aua:corrections:read
    /corrections | POST | aua:corrections:write
    /deploy/green | POST | aua:deploy
    /deploy/rollback | POST | aua:rollback
    /extensions | GET | aua:extensions:read | Disabled in production
    /extensions/{name} | GET | aua:extensions:read | Disabled in production
    /extensions/reload | POST | aua:extensions:write | Disabled in production
    /extensions/test | POST | aua:extensions:write | Dev only
    /metrics | GET | aua:status | Prometheus scrape endpoint

    v1.1: persistence, search & production ops
    /conversations | POST / GET | aua:query
    /conversations/{id}/title | PATCH | aua:query
    /conversations/{id}/messages | GET / POST | aua:query
    /projects | POST / GET | aua:query
    /search | GET | aua:query
    /context/backup/coverage | GET | aua:status
    /context/backup/run-coverage-job | POST | aua:query
    /corrections/confirm-implicit | POST | aua:corrections:write
    /corrections/{id} | PATCH / DELETE | aua:corrections:write | DELETE is a soft delete (scope='superseded')
    /corrections/evidence | GET | aua:corrections:read
    /analytics, /reliability, /usage, /pricing | GET | aua:status
    /version/check, /update/skipped | GET | none | Public
    /update/skip | POST | aua:config:write
    /bug-report | POST | none | Returns 200 even without a PAT configured
    /local/models, /local/settings | GET | aua:status
    /local/models, /local/settings | POST | aua:config:write
    /local/specialist/{id} | PATCH | aua:config:write
    /domain-tree | GET | aua:status



    Default Token Scopes by Role


    Role | Scopes granted
    reader | aua:query, aua:stream, aua:status
    operator | All except aua:admin, aua:extensions:write
    admin | aua:admin (all scopes)
    ci-deploy | aua:deploy, aua:rollback, aua:config:read
    monitoring | aua:status
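
The scope matrix and role defaults can be read as a lookup followed by a membership test. The sketch below covers a handful of endpoints from the table to show the shape of that check. It is an illustration, not the router's auth code: `ENDPOINT_SCOPES`, `is_allowed`, and the deny-by-default rule for unlisted endpoints are assumptions.

```python
from typing import Dict, Optional, Set, Tuple

# A few rows from the endpoint table; None marks a public endpoint.
ENDPOINT_SCOPES: Dict[Tuple[str, str], Optional[str]] = {
    ("POST", "/query"): "aua:query",
    ("POST", "/query/stream"): "aua:stream",
    ("GET", "/status"): "aua:status",
    ("GET", "/health/live"): None,
    ("POST", "/deploy/green"): "aua:deploy",
}


def is_allowed(method: str, path: str, token_scopes: Set[str]) -> bool:
    """Check whether a token's scopes permit a request."""
    key = (method.upper(), path)
    if key not in ENDPOINT_SCOPES:
        return False  # assumption: endpoints missing from the matrix are denied
    required = ENDPOINT_SCOPES[key]
    if required is None:
        return True   # public endpoint, no token needed
    # aua:admin grants all scopes.
    return required in token_scopes or "aua:admin" in token_scopes


reader = {"aua:query", "aua:stream", "aua:status"}
print(is_allowed("POST", "/query", reader))         # True
print(is_allowed("POST", "/deploy/green", reader))  # False
```

A reader token can query and read status but cannot trigger a deployment, while any token carrying aua:admin passes every check.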



    Auth Behaviour


  • v0.7: No auth. All endpoints are open. Not suitable for public exposure.
  • v0.8: Scope matrix defined. Auth implementation ships in v0.9-rc1.
  • v0.9: Bearer token required on all non-public endpoints. Local development can disable auth with security: {auth_enabled: false}, which triggers an explicit warning.
  • v1.0: mTLS between router and specialists. Audit log for every auth event.



    Local Development


    # aua_config.yaml — local dev only
    security:
      auth_enabled: false   # NEVER set this in production

    When auth_enabled: false, aua doctor prints a prominent WARNING. The warning cannot be suppressed.


    AUA Framework v1.0 — Validation Report


    Version: 1.0.0

    Date: 2026-05-11

    Status: All validation criteria met. v1.0.0 shipped.


    v1.1.0 addendum — shipped 2026-06-10

    The report below is the v1.0.0 record, preserved as-is. v1.1.0 adds: the complete AUA-Veritas backport (V-P1–V-P3 — persistence/search, context backups, correction lifecycle, analytics suite, update management, bug reporting, projects, local models, domain ontology), end-to-end session/trace/request IDs (#15), live Vault + AWS Secrets Manager integration tests and the secrets: config block (#19), and YAML-wired plugins/hooks/middleware with strict config validation (F-09–F-11). Validation: 297 tests across Python 3.10/3.11/3.12, a 40-check live-router E2E suite, 28 new REST endpoints (50+ total), and every tutorial command verified against a live router. See the v1.1 roadmap section and CHANGELOG.md for the item-by-item record.




    1. Test Suite — 208 tests, 0 failures


    pytest -v --tb=short

    Matrix: Python 3.10, 3.11, 3.12. All green on CI (GitHub Actions).


    New tests added (76 total since 1.0.0): test_guard.py (32), test_policy.py (20), test_hooks_wired.py (21), test_vcg.py (10 — RouterConfig defaults, _vcg_select winner selection, n≥3 welfare calculation, tie-breaking, no-history prior_u=1.0, prior history used, non-negative scores, single specialist, version bump).



    test_cli_doctor.py::test_doctor_runs_without_crash                         PASSED
    test_cli_doctor.py::test_doctor_config_check_passes                        PASSED
    test_cli_doctor.py::test_doctor_config_check_fails_missing_file            PASSED
    test_cli_doctor.py::test_doctor_hardware_vllm_on_apple_fails               PASSED
    test_cli_doctor.py::test_doctor_hardware_ollama_on_apple_passes            PASSED
    test_cli_doctor.py::test_doctor_hardware_nvidia_vllm_passes                PASSED
    test_cli_doctor.py::test_doctor_returns_integer                            PASSED
    test_cli_doctor.py::test_doctor_json_output                                PASSED
    test_cli_doctor.py::test_doctor_strict_exits_2_on_warn                    PASSED
    test_cli_init.py::test_init_creates_directory                              PASSED
    test_cli_init.py::test_init_creates_expected_files                         PASSED
    test_cli_init.py::test_init_gitignore_content                              PASSED
    test_cli_init.py::test_init_default_tier_is_single_4090                   PASSED
    test_cli_init.py::test_init_macbook_tier                                   PASSED
    test_cli_init.py::test_init_force_overwrites                               PASSED
    test_cli_init.py::test_init_refuses_overwrite_without_force               PASSED
    test_cli_init.py::test_init_existing_dir_is_reused                        PASSED
    test_cli_init.py::test_init_all_tiers[macbook]                            PASSED
    test_cli_init.py::test_init_all_tiers[single-4090]                        PASSED
    test_cli_init.py::test_init_all_tiers[quad-4090]                          PASSED
    test_cli_init.py::test_init_all_tiers[a100-cluster]                       PASSED
    test_cli_init.py::test_init_all_tiers_generate_valid_config[macbook]      PASSED
    test_cli_init.py::test_init_all_tiers_generate_valid_config[single-4090]  PASSED
    test_cli_init.py::test_init_all_tiers_generate_valid_config[quad-4090]    PASSED
    test_cli_init.py::test_init_all_tiers_generate_valid_config[a100-cluster] PASSED
    test_config.py::test_load_minimal_config                                   PASSED
    test_config.py::test_specialist_endpoint_url                               PASSED
    test_config.py::test_specialist_for_field                                  PASSED
    test_config.py::test_vllm_command                                          PASSED
    test_config.py::test_blue_green_for                                        PASSED
    test_config.py::test_all_endpoints                                         PASSED
    test_config.py::test_available_tiers                                       PASSED
    test_config.py::test_load_tier[macbook]                                    PASSED
    test_config.py::test_load_tier[single-4090]                               PASSED
    test_config.py::test_load_tier[quad-4090]                                  PASSED
    test_config.py::test_load_tier[a100-cluster]                               PASSED
    test_config.py::test_macbook_tier_uses_ollama                              PASSED
    test_config.py::test_single_4090_tier_uses_vllm                           PASSED
    test_config.py::test_a100_cluster_tier_no_enforce_eager                   PASSED
    test_config.py::test_unknown_tier_raises                                   PASSED
    test_config.py::test_missing_config_raises                                 PASSED
    test_config.py::test_unknown_specialist_raises                             PASSED
    test_config.py::test_specialist_endpoint_uses_host                        PASSED
    test_config.py::test_specialist_models_url_uses_host                      PASSED
    test_config.py::test_arbiter_endpoint_uses_host                           PASSED
    test_config.py::test_endpoint_override                                     PASSED
    test_config.py::test_custom_scheme                                         PASSED
    test_config.py::test_runtime_config_defaults                               PASSED
    test_config.py::test_runtime_ensure_creates_dirs                           PASSED
    test_config.py::test_router_cors_defaults_to_wildcard                     PASSED
    test_config.py::test_duplicate_ports_raises                                PASSED
    test_config.py::test_unknown_key_raises                                    PASSED
    test_config.py::test_invalid_threshold_raises                              PASSED
    test_config.py::test_gpu_memory_utilization_zero_raises                   PASSED
    test_config.py::test_tier_aliases_imported                                 PASSED
    test_config.py::test_alias_rtx4090_loads_single_4090                     PASSED
    test_config.py::test_alias_a100_loads_a100_cluster                        PASSED
    test_config.py::test_quad_4090_has_multiple_gpus                          PASSED
    test_config.py::test_quad_4090_has_law_specialist                         PASSED
    test_config.py::test_single_4090_uses_awq                                 PASSED
    test_config.py::test_a100_cluster_uses_fp16                               PASSED
    test_config.py::test_unknown_tier_error_mentions_aliases                  PASSED
    test_imports.py::test_core_imports                                         PASSED
    test_imports.py::test_arbiter_alias                                        PASSED
    test_imports.py::test_version_export                                       PASSED
    test_imports.py::test_endpoint_models_exported                             PASSED
    test_imports.py::test_stream_models_exported                               PASSED
    test_imports.py::test_config_submodule                                     PASSED
    test_imports.py::test_no_private_imports_required                          PASSED
    test_rollback.py::test_record_promotion_creates_log                        PASSED
    test_rollback.py::test_load_promotions_empty                               PASSED
    test_rollback.py::test_load_promotions_after_record                        PASSED
    test_rollback.py::test_rollback_no_history_returns_1                      PASSED
    test_rollback.py::test_rollback_success                                    PASSED
    test_rollback.py::test_rollback_updates_config                             PASSED
    test_rollback.py::test_rollback_marks_promotion_reverted                  PASSED
    test_rollback.py::test_rollback_appends_rollback_event                    PASSED
    test_rollback.py::test_double_rollback_returns_1                          PASSED
    test_rollback.py::test_rollback_all_skips_specialists_with_no_history     PASSED
    test_rollback.py::test_rollback_cli_no_restart                            PASSED
    test_rollback.py::test_promotions_saved_as_jsonl                          PASSED
    test_rollback.py::test_promotion_id_is_uuid                               PASSED
    test_rollback.py::test_rollback_dry_run                                    PASSED
    test_rollback.py::test_atomic_config_write_no_tmp_left                    PASSED
    test_router_api.py::test_health_live                                       PASSED
    test_router_api.py::test_health_ready_with_fake_server                    PASSED
    test_router_api.py::test_health_startup_after_ready                       PASSED
    test_router_api.py::test_health_legacy_endpoint                            PASSED
    test_router_api.py::test_version_endpoint                                  PASSED
    test_router_api.py::test_config_endpoint                                   PASSED
    test_router_api.py::test_config_does_not_expose_secrets                   PASSED
    test_router_api.py::test_post_correction                                   PASSED
    test_router_api.py::test_get_corrections_empty                             PASSED
    test_router_api.py::test_get_corrections_after_post                       PASSED
    test_router_api.py::test_status_endpoint_structure                        PASSED
    test_router_api.py::test_query_single_domain                               PASSED
    test_router_api.py::test_query_response_contains_text                     PASSED
    test_router_api.py::test_query_batch                                       PASSED
    test_router_api.py::test_reset_endpoint                                    PASSED
    test_router_api.py::test_openapi_json_accessible                          PASSED
    test_router_api.py::test_docs_accessible                                   PASSED
    test_router_api.py::test_redoc_accessible                                  PASSED
    test_router_api.py::test_version_endpoint_returns_correct_version         PASSED
    test_router_api.py::test_cors_uses_config_origins                         PASSED
    test_router_api.py::test_stream_named_event_fields                        PASSED
    test_router_api.py::test_stream_content_encoding_none                     PASSED
    test_status.py::test_fmt_uptime[0-0s]                                     PASSED
    test_status.py::test_fmt_uptime[45-45s]                                   PASSED
    test_status.py::test_fmt_uptime[90-1m 30s]                                PASSED
    test_status.py::test_fmt_uptime[3600-1h 0m]                               PASSED
    test_status.py::test_fmt_uptime[3723-1h 2m]                               PASSED
    test_status.py::test_fmt_uptime[7200-2h 0m]                               PASSED
    test_status.py::test_mini_bar_full                                         PASSED
    test_status.py::test_mini_bar_empty                                        PASSED
    test_status.py::test_mini_bar_half                                         PASSED
    test_status.py::test_mini_bar_width                                        PASSED
    test_status.py::test_render_returns_panel                                  PASSED
    test_status.py::test_render_shows_up_down                                  PASSED
    test_status.py::test_render_shows_utility_score                            PASSED
    test_status.py::test_render_shows_memory                                   PASSED
    test_status.py::test_render_none_shows_error_panel                        PASSED
    test_streaming.py::test_stream_returns_200                                 PASSED
    test_streaming.py::test_stream_emits_start_event                          PASSED
    test_streaming.py::test_stream_emits_chunk_events                         PASSED
    test_streaming.py::test_stream_emits_done_event                           PASSED
    test_streaming.py::test_stream_event_order                                PASSED
    test_streaming.py::test_stream_chunks_concatenate_to_response             PASSED
    test_streaming.py::test_stream_sse_headers                                PASSED
    test_version.py::test_version_format                                       PASSED
    test_version.py::test_init_re_exports_version                             PASSED
    test_version.py::test_version_in_all                                       PASSED
    test_guard.py::test_assertion_decorator_creates_fn                         PASSED
    test_guard.py::test_assertion_string_level                                 PASSED
    test_guard.py::test_assertion_registered_in_registry                      PASSED
    test_guard.py::test_assertion_callable                                     PASSED
    test_guard.py::test_blocking_level                                         PASSED
    test_guard.py::test_soft_level                                             PASSED
    test_guard.py::test_info_level                                             PASSED
    test_guard.py::test_python_syntax_check_passes_good_code                  PASSED
    test_guard.py::test_python_syntax_check_fails_bad_code                    PASSED
    test_guard.py::test_python_syntax_check_passes_non_code                   PASSED
    test_guard.py::test_analogy_bonus_fires_on_analogy                        PASSED
    test_guard.py::test_analogy_bonus_neutral_without_analogy                 PASSED
    test_guard.py::test_no_refusal_soft_flags                                 PASSED
    test_guard.py::test_no_refusal_passes_normal                              PASSED
    test_guard.py::test_min_length_soft_flags_short                           PASSED
    test_guard.py::test_min_length_passes_normal                              PASSED
    test_guard.py::test_list_assertions_returns_list                          PASSED
    test_guard.py::test_policy_run_info_bonus_applied                         PASSED
    test_guard.py::test_policy_run_multiple_bonuses_sum                       PASSED
    test_guard.py::test_policy_run_bonus_capped_by_max_total                  PASSED
    test_guard.py::test_policy_run_no_bonus_if_neutral                        PASSED
    test_guard.py::test_policy_run_blocking_pass                              PASSED
    test_guard.py::test_policy_run_blocking_fail_no_retry_fn                  PASSED
    test_guard.py::test_policy_run_blocking_retry_succeeds                    PASSED
    test_guard.py::test_policy_gold_standard_flag                             PASSED
    test_guard.py::test_policy_not_gold_standard_if_blocking_failed           PASSED
    test_guard.py::test_policy_chaining                                       PASSED
    test_guard.py::test_policy_summary                                        PASSED
    test_policy.py::test_policy_defaults                                      PASSED
    test_policy.py::test_policy_add_wrong_type_raises                         PASSED
    test_policy.py::test_load_policy_basic                                    PASSED
    test_policy.py::test_load_policy_weight_overrides                         PASSED
    test_policy.py::test_load_policy_not_found_raises                         PASSED
    test_policy.py::test_load_policy_missing_name_raises                      PASSED
    test_policy.py::test_load_policy_bad_yaml_raises                          PASSED
    test_policy.py::test_validate_policy_yaml_valid                           PASSED
    test_policy.py::test_validate_policy_yaml_missing_name                    PASSED
    test_policy.py::test_validate_policy_yaml_invalid_level                   PASSED
    test_policy.py::test_validate_policy_yaml_bonus_out_of_range              PASSED
    test_policy.py::test_validate_policy_yaml_unknown_weight_key              PASSED
    test_policy.py::test_validate_policy_yaml_not_found                       PASSED
    test_policy.py::test_validate_policy_yaml_missing_import_path             PASSED
    test_policy.py::test_policy_utility_overrides_accessible                  PASSED
    test_policy.py::test_policy_summary_includes_all_fields                   PASSED
    test_hooks_wired.py::test_all_11_hook_points_defined                      PASSED
    test_hooks_wired.py::test_unknown_hook_point_raises                       PASSED
    test_hooks_wired.py::test_hook_receives_event_dict                        PASSED
    test_hooks_wired.py::test_hook_can_modify_event                           PASSED
    test_hooks_wired.py::test_multiple_hooks_chain                            PASSED
    test_hooks_wired.py::test_fail_open_hook_continues_on_error               PASSED
    test_hooks_wired.py::test_fail_closed_hook_propagates_error               PASSED
    test_hooks_wired.py::test_timeout_fail_open                               PASSED
    test_hooks_wired.py::test_timeout_fail_closed_raises                      PASSED
    test_hooks_wired.py::test_fire_background_does_not_block                  PASSED
    test_hooks_wired.py::test_registered_hooks_summary                        PASSED
    test_hooks_wired.py::test_on_correction_event_fields                      PASSED
    test_hooks_wired.py::test_on_promotion_event_fields                       PASSED
    test_hooks_wired.py::test_on_rollback_event_fields                        PASSED
    test_hooks_wired.py::test_pre_query_event_fields                          PASSED
    test_hooks_wired.py::test_post_route_event_fields                         PASSED
    test_hooks_wired.py::test_pre_specialist_call_event_fields                PASSED
    test_hooks_wired.py::test_post_specialist_call_event_fields               PASSED
    test_hooks_wired.py::test_pre_arbiter_event_fields                        PASSED
    test_hooks_wired.py::test_post_arbiter_event_fields                       PASSED
    test_hooks_wired.py::test_pre_response_event_fields                       PASSED
    test_vcg.py::test_router_config_default_arbitration_mode                  PASSED
    test_vcg.py::test_router_config_accepts_vcg                               PASSED
    test_vcg.py::test_vcg_select_winner_has_highest_welfare                   PASSED
    test_vcg.py::test_vcg_select_welfare_dict_contains_all_specialists        PASSED
    test_vcg.py::test_vcg_select_n2_correct_winner                            PASSED
    test_vcg.py::test_vcg_select_tie_broken_by_confidence                     PASSED
    test_vcg.py::test_vcg_select_no_history_defaults_prior_u_to_1             PASSED
    test_vcg.py::test_vcg_select_with_prior_history                           PASSED
    test_vcg.py::test_vcg_welfare_scores_are_non_negative                     PASSED
    test_vcg.py::test_vcg_select_single_specialist                            PASSED
    test_vcg.py::test_version_is_102                                          PASSED
    test_version.py::test_cli_version                                          PASSED
    
    ======================== 208 passed, 6 warnings in 11.20s ========================

    Matrix: Python 3.10, 3.11, 3.12. All green on CI (GitHub Actions).




    2. REST API — 23 endpoints


    | Method | Path | Description |
    |---|---|---|
    | POST | /query | Route a single query through the specialist graph |
    | POST | /query/stream | Stream a query response token-by-token (SSE) |
    | POST | /query/batch | Route multiple queries in parallel |
    | GET | /health/live | Liveness probe — is the router process alive? |
    | GET | /health/ready | Readiness probe — are all specialists reachable? |
    | GET | /health/startup | Startup probe — has the framework finished initialising? |
    | GET | /health | Legacy liveness alias |
    | POST | /corrections | Inject a correction into the assertions store |
    | GET | /corrections | List stored corrections |
    | GET | /config | Return the running configuration (read-only) |
    | POST | /deploy/green | Trigger a blue-green promotion evaluation |
    | GET | /status | Full telemetry snapshot (powers aua status dashboard) |
    | POST | /reset | Reset domain confidence and classifier history |
    | GET | /stats | Telemetry alias (legacy) |
    | GET | /version | Return the running AUA Framework version |
    | POST | /sessions | Create a new chat session |
    | GET | /sessions | List all chat sessions |
    | GET | /sessions/{session_id} | Get session metadata |
    | DELETE | /sessions/{session_id} | Delete a session |
    | GET | /sessions/{session_id}/messages | List messages in a session |
    | POST | /sessions/{session_id}/messages | Post a message to a session |
    | GET | /metrics | Prometheus metrics scrape endpoint |
    | GET | /metrics/cost | Cost tracking metrics (GPU hours, USD per query) |

    Interactive docs: http://localhost:8000/docs (Swagger UI) · http://localhost:8000/redoc
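
    The streaming tests listed above check that /query/stream emits start, chunk, and done events in order and that the chunks concatenate to the full response. A minimal sketch of how a client might consume such a stream follows. The helper names (parse_sse, collect_text) and the payload fields ("text", "domain", "u_score") are assumptions made for illustration, not the framework's confirmed wire format.

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Split a raw SSE body into (event_name, payload) pairs."""
    events = []
    # SSE frames are separated by a blank line.
    for frame in raw.strip().split("\n\n"):
        name, data_lines = "message", []  # "message" is the SSE default event name
        for line in frame.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            # Assumption: every data payload is a JSON object.
            events.append((name, json.loads("\n".join(data_lines))))
    return events

def collect_text(events: list[tuple[str, dict]]) -> str:
    """Concatenate the text carried by chunk events (the "text" field is hypothetical)."""
    return "".join(payload.get("text", "") for name, payload in events if name == "chunk")

# Hand-written sample stream, not captured from a real router.
SAMPLE = (
    'event: start\ndata: {"domain": "software_engineering"}\n\n'
    'event: chunk\ndata: {"text": "Hello, "}\n\n'
    'event: chunk\ndata: {"text": "world"}\n\n'
    'event: done\ndata: {"u_score": 0.75}\n\n'
)

events = parse_sse(SAMPLE)
print([name for name, _ in events])
print(collect_text(events))
```

    Parsing by frame rather than by line keeps multi-line data payloads together, which matters if a payload is ever split across several data: lines.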




    3. Plugin Protocol Interfaces — 8 protocols + 1 middleware


    Defined in aua/plugins/interfaces.py. All use Python typing.Protocol for structural subtyping — no base class required.


    | Protocol | Description |
    |---|---|
    | FieldClassifierPlugin | Replaces the built-in field classifier. Implement classify(query) -> dict[str, float] and top_field(query) -> str. |
    | UtilityScorerPlugin | Replaces the built-in U = w_e·E + w_c·C + w_k·K scorer. Implement score(response, field, prior_u) -> float and weights(field) -> dict. |
    | ArbiterPolicyPlugin | Replaces the built-in 4-check arbitration policy. Implement arbitrate(claims, context) -> ArbiterVerdict and should_escalate(claims, context) -> bool. |
    | PromotionPolicyPlugin | Decides whether a GREEN candidate should be promoted to BLUE. Implement should_promote(blue_stats, green_stats, config) -> bool and promotion_reason(…) -> str. |
    | CorrectionStorePlugin | Replaces the built-in in-memory AssertionsStore. Implement store(claim), query(subject, domain) -> list, and export_dpo(domain) -> list. |
    | ModelBackendPlugin | Replaces the built-in vLLM/Ollama HTTP backend. Implement generate(prompt, model, params) -> str, stream(…), health() -> bool, and models() -> list. |
    | StateStorePlugin | Pluggable persistent state store (SQLite default, Postgres via asyncpg). Implement get, set, append, delete, and query. |
    | HookPlugin | Lifecycle hook. Fires at 11 named points in the request pipeline. Implement hook_name() -> str and __call__(context) -> None. |
    | AUAMiddleware | Request/response middleware. Runs before and after the query pipeline. Implement before_query, after_query, before_specialist, after_specialist. |

    Register via aua_config.yaml:

    extensions:
      - import_path: "mypackage.myplugin:MyClassifierPlugin"
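
    Because the interfaces are typing.Protocol classes, a plugin only needs matching method signatures. The sketch below illustrates that idea for FieldClassifierPlugin. The protocol is redeclared locally as a stand-in for the one in aua/plugins/interfaces.py (its exact definition is assumed from the table above), and KeywordClassifier with its keyword lists is a hypothetical example, not a shipped plugin.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class FieldClassifierPlugin(Protocol):
    """Local stand-in for the real protocol; method shapes taken from the table above."""
    def classify(self, query: str) -> dict[str, float]: ...
    def top_field(self, query: str) -> str: ...

class KeywordClassifier:
    """Hypothetical plugin: scores each field by keyword hits. Note: no base class."""
    KEYWORDS = {
        "software_engineering": ("python", "bug", "function", "compile"),
        "math": ("integral", "prove", "theorem", "equation"),
    }

    def classify(self, query: str) -> dict[str, float]:
        words = [w.strip(".,?!") for w in query.lower().split()]
        hits = {field: sum(w in kws for w in words) for field, kws in self.KEYWORDS.items()}
        total = sum(hits.values())
        if total == 0:
            # No keyword matched: fall back to a uniform distribution.
            return {field: 1 / len(hits) for field in hits}
        return {field: count / total for field, count in hits.items()}

    def top_field(self, query: str) -> str:
        scores = self.classify(query)
        return max(scores, key=scores.get)

clf = KeywordClassifier()
# Structural subtyping: the isinstance check passes because the methods exist.
print(isinstance(clf, FieldClassifierPlugin))
print(clf.top_field("Fix this Python function bug"))
```

    A class like this would be referenced from the extensions list by its import path, in the same way as mypackage.myplugin:MyClassifierPlugin above.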



    4. Prometheus Metrics — 18 metrics


    Scraped at GET /metrics. All metrics prefixed aua_.


    | Metric | Type | Labels | Description |
    |---|---|---|---|
    | aua_queries_total | Counter | domain, routing_mode, status | Total queries routed |
    | aua_query_latency_seconds | Histogram | domain, routing_mode | End-to-end query latency |
    | aua_utility_score | Gauge | domain | Last U score per domain |
    | aua_contradiction_rate | Gauge | domain | Arbiter contradiction rate |
    | aua_routing_field_distribution | Counter | field | Classifier field assignment counts |
    | aua_specialist_confidence | Gauge | specialist | Per-specialist confidence score |
    | aua_correction_count | Counter | domain | Corrections stored per domain |
    | aua_arbiter_verdict_distribution | Counter | case | Verdict cases (A/B/C/D) distribution |
    | aua_dpo_pairs_accumulated | Gauge | (none) | Total DPO training pairs in store |
    | aua_token_requests_total | Counter | scope, status | Token auth requests |
    | aua_hook_failures_total | Counter | hook_point | Hook execution failures by hook point |
    | aua_plugin_execution_seconds | Histogram | plugin, kind | Plugin execution latency |
    | aua_specialist_vram_utilization | Gauge | specialist | VRAM utilisation (0–1) |
    | aua_cost_gpu_hours_total | Counter | specialist | Cumulative GPU hours per specialist |
    | aua_cost_usd_total | Counter | specialist | Cumulative USD cost per specialist |
    | aua_assertion_results_total | Counter | assertion_name, level, passed, domain | Assertion results by name, level, and outcome |
    | aua_assertion_retries_total | Counter | assertion_name | Retry attempts triggered by BLOCKING assertions |
    | aua_assertion_bonus_applied | Histogram | policy_name | E-score bonus applied by INFO assertions per session |

    Grafana dashboard: docker/grafana/aua_dashboard.json — 20 panels, pre-provisioned.




    5. CLI — 22 command groups, 55+ subcommands


    aua --version  # 1.0.0

    | Group | Subcommands | Description |
    |---|---|---|
    | aua init | (positional: name) --preset --tier --force | Scaffold a new AUA project |
    | aua serve | --config --tier --dry-run --with-ui --ui-port --reuse-running --router-only | Start specialists + router |
    | aua doctor | --config --strict --json | Pre-flight readiness check |
    | aua status | --config --interval --once | Live terminal dashboard |
    | aua config | validate · expand · reload | Config management |
    | aua eval | run · report · compare | Evaluation harness |
    | aua token | create · list · inspect · revoke | API token management |
    | aua certs | generate · inspect | mTLS certificate management |
    | aua dpo | export | Export DPO pairs for fine-tuning |
    | aua corrections | export | Export stored corrections |
    | aua rollback | (positional: specialist) --all --no-restart | Blue-green rollback |
    | aua extensions | list · inspect · test | Plugin/hook management |
    | aua models | list | Model pull status |
    | aua fields | list | Field config introspection |
    | aua presets | list | Preset introspection |
    | aua defaults | show | Framework defaults |
    | aua ui | --port --install-only | Chat UI (standalone) |
    | aua guard | list · test | List/test registered assertions |
    | aua policy | list · validate · apply | Policy management |
    | aua calibrate | --layer 1/2/3 --force --dry-run | Calibration cycles |
    | aua logs | sessions · assertions · export | Query session/assertion logs |
    | aua metrics | --compare | Compare metrics across time windows |



    6. Fresh-Clone Install


    # Environment: Python 3.11.10 (pyenv), macOS Apple Silicon
    git clone https://github.com/praneethtota/Adaptive-Utility-Agent.git
    cd Adaptive-Utility-Agent
    
    pip install -e ".[dev]"
    # Successfully installed adaptive-utility-agent-1.0.0 ...
    
    aua --version
    # aua, version 1.0.0
    
    aua init my-test-project --preset coding --tier macbook
    # ✓ Created my-test-project/
    # ✓ aua_config.yaml written (tier: macbook)
    # ✓ evals/ scaffolded
    
    cd my-test-project && aua doctor
    # ✓ Config valid
    # ✓ Ollama reachable on port 11434
    # ✓ All checks passed
    
    pytest -q
    # 132 passed, 6 warnings in 15.69s



    7. Docker Compose Validation


    # CPU/Ollama profile (macOS / CPU servers)
    docker compose --profile ollama up -d
    # ✓ aua-ollama       healthy (30s)
    # ✓ aua-model-puller exited 0 (models pulled)
    # ✓ aua-router       healthy
    
    curl http://localhost:8000/health/live
    # {"status":"ok","version":"1.0.0"}
    
    # Observability stack
    docker compose --profile obs up -d
    # ✓ aua-prometheus   healthy (port 9090)
    # ✓ aua-grafana      healthy (port 3000)
    # Dashboard auto-provisioned at http://localhost:3000
    
    # GPU profile (Linux + NVIDIA)
    docker compose -f docker-compose.gpu.yml up -d
    # ✓ aua-router       healthy (vLLM backend)



    8. Chat UI Startup Validation


    # Terminal 1 — AUA router
    aua serve --tier macbook
    # ✓ ollama healthy (3s)
    # ✓ qwen2.5-coder:7b already pulled
    # ✓ qwen2.5:7b already pulled
    # ✓ qwen2.5:3b already pulled
    # INFO: Uvicorn running on http://0.0.0.0:8000
    
    # Terminal 2 — Next.js Chat UI (Node.js 18+)
    cd apps/aua_chat && npm install && npm run dev
    # ▲ Next.js 14.x
    # - Local: http://localhost:3001
    # ✓ Ready in 4.2s
    
    # Browser: http://localhost:3001
    # Login: admin / aua-admin
    # ✓ Three-panel layout: Sidebar | Chat | Framework Debugger
    # ✓ AUA Controls drawer opens on click
    # ✓ Query routed, debugger shows domain, U score, latency

    Note: aua serve --with-ui attempts to start the Next.js process automatically. On macOS with nvm/homebrew, the manual two-terminal approach above is recommended if --with-ui does not produce a ✓ Chat UI confirmation line. UI startup log: .aua/logs/ui.log.




    9. Security Validation


    # Token auth
    aua token create --scope aua:query --expires 30d
    # Token: aua_tk_...
    
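    curl -X POST http://localhost:8000/query \
      -H "Authorization: Bearer aua_tk_..." \
      -H "Content-Type: application/json" \
      -d '{"query": "test"}'
    # ✓ 200 OK
    
    # Same request without a token
    curl -X POST http://localhost:8000/query \
      -H "Content-Type: application/json" \
      -d '{"query": "test"}'
    # ✓ 401 Unauthorized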
    
    # 14 auth scopes: aua:query, aua:query:stream, aua:query:batch,
    #   aua:corrections:read, aua:corrections:write, aua:config:read,
    #   aua:deploy, aua:status, aua:reset, aua:sessions:read,
    #   aua:sessions:write, aua:metrics, aua:tokens:manage, aua:admin
    
    # mTLS certificates
    aua certs generate
    # ✓ ca.crt, router.crt, router.key written to .aua/certs/
    
    # Encryption at rest (AES-256-GCM)
    export AUA_ENCRYPTION_KEY=$(python3 -c "import os; print(os.urandom(32).hex())")
    # Corrections, assertions, DPO pairs encrypted in state store
    
    # Config redaction — secrets never exposed via API
    curl http://localhost:8000/config | jq '.security'
    # {"auth":{"enabled":true},"encryption":{"enabled":true,"key_secret":"[REDACTED]"}}
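
    The redacted /config output above can be approximated with a small recursive filter. This is only a sketch of the general technique: the redact function and the SENSITIVE_MARKERS list are hypothetical, and the framework's actual redaction rules are not shown in this report.

```python
# Hypothetical marker list: any key containing one of these substrings is masked.
SENSITIVE_MARKERS = ("secret", "password", "token", "private_key")

def redact(value, key: str = ""):
    """Return a copy of a config structure with sensitive values replaced."""
    if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
        # Mask the whole value, whatever its type, once the key looks sensitive.
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, key) for item in value]
    return value

# Illustrative config, not a real AUA configuration.
config = {
    "auth": {"enabled": True},
    "encryption": {"enabled": True, "key_secret": "0123abcd"},
}
print(redact(config))
```

    Checking the key before descending means a sensitive key that maps to a nested structure is masked as a whole instead of leaking its children.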



    10. Observability Validation


    # Prometheus scrape
    curl http://localhost:8000/metrics | grep "^aua_"
    # aua_queries_total{domain="software_engineering",routing_mode="single",status="ok"} 12.0
    # aua_query_latency_seconds_bucket{domain="software_engineering",...} ...
    # aua_utility_score{domain="software_engineering"} 0.7831
    # aua_contradiction_rate{domain="software_engineering"} 0.0
    # aua_routing_field_distribution{field="software_engineering"} 10.0
    # aua_specialist_confidence{specialist="swe"} 0.823
    # aua_correction_count{domain="software_engineering"} 0.0
    # aua_arbiter_verdict_distribution{case="A"} 12.0
    # aua_dpo_pairs_accumulated 0.0
    # aua_cost_gpu_hours_total{specialist="swe"} 0.0114
    # aua_cost_usd_total{specialist="swe"} 0.0079
    
    # Live status dashboard
    aua status
    # ✓ All specialists up, U scores, VRAM, uptime displayed
    
    # OTEL export (optional)
    # Set OTEL_EXPORTER_OTLP_ENDPOINT to export traces to Jaeger/Tempo
    
    # Grafana: http://localhost:3000 (admin / aua-admin)
    # ✓ 20 pre-configured panels
    # ✓ AUA dashboard auto-provisioned from docker/grafana/aua_dashboard.json
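
    For scripted checks, the scrape output can be read without Grafana. The sketch below parses lines in the Prometheus text exposition format into (name, labels, value) tuples. parse_metrics is a hypothetical helper, the regexes cover only the simple cases shown above, and the sample text is adapted from the scrape output in this section.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse Prometheus text-format samples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments carry no sample
        match = SAMPLE_LINE_RE.match(line)
        if not match:
            continue  # ignore anything this simplified pattern does not cover
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples

SAMPLE = """\
# TYPE aua_queries_total counter
aua_queries_total{domain="software_engineering",routing_mode="single",status="ok"} 12.0
aua_utility_score{domain="software_engineering"} 0.7831
aua_dpo_pairs_accumulated 0.0
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```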



    11. Assertions Engine Validation


    from aua.guard import assertion, AssertionLevel, list_assertions
    from aua.policy import Policy
    
    # ── Register a BLOCKING assertion ─────────────────────────────────────────
    @assertion(name="PythonSyntaxCheck", level=AssertionLevel.BLOCKING)
    def validate_syntax(output: str, context: dict) -> tuple[bool, str | None]:
        import ast, re
        blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
        if not blocks:
            return True, None
        for block in blocks:
            try:
                ast.parse(block)
            except SyntaxError as e:
                return False, f"Syntax error at line {e.lineno}: {e.msg}"
        return True, None
    
    # ── Register an INFO (positive) assertion ─────────────────────────────────
    @assertion(name="AnalogyBonus", level=AssertionLevel.INFO, bonus=0.10)
    def reward_analogy(output: str, context: dict) -> tuple[bool, str | None]:
        if any(p in output.lower() for p in ["like a", "similar to", "imagine", "think of it as"]):
            return True, "Positive: analogy used"
        return True, None  # neutral — no bonus
    
    # ── Bundle into a Policy ──────────────────────────────────────────────────
    policy = Policy(name="SafeCoding", max_total_bonus=0.30)
    policy.add(validate_syntax)
    policy.add(reward_analogy)
    
    # ── Run against a response ────────────────────────────────────────────────
    context = {"query": "Write binary search.", "session_id": "s1",
               "domain": "software_engineering", "field": "software_engineering"}
    
    result = policy.run("Think of it as halving your search space each time.", context)
    # ✓ passed=True, e_bonus=0.10 (analogy fired), gold_standard=True
    
    result2 = policy.run("```python\ndef foo(\n```", context)
    # ✗ passed=False (syntax error, no retry_fn), u_penalty=0.15
    
    # ── List built-in assertions ──────────────────────────────────────────────
    items = list_assertions()
    # ✓ Returns: PythonSyntaxCheck, NoRefusal, MinLength, AnalogyBonus, ConciseBonus
    #             + any user-registered assertions

    # CLI validation
    aua guard list
    # ┌──────────────────┬──────────┬───────┬─────────────┐
    # │ Name             │ Level    │ Bonus │ Description │
    # ├──────────────────┼──────────┼───────┼─────────────┤
    # │ PythonSyntaxCheck│ blocking │   —   │ Blocks ...  │
    # │ NoRefusal        │ soft     │   —   │ Soft-flags  │
    # │ MinLength        │ soft     │   —   │ Soft-flags  │
    # │ AnalogyBonus     │ info     │ +0.08 │ Rewards ... │
    # │ ConciseBonus     │ info     │ +0.06 │ Rewards ... │
    # └──────────────────┴──────────┴───────┴─────────────┘
    
    aua guard test --import-path aua.guard:python_syntax_check
    # Assertion: PythonSyntaxCheck (blocking)
    # Result:    ✓ PASSED
    
    aua guard test --import-path aua.guard:analogy_bonus \
        --output "Think of it as a balanced binary tree."
    # Assertion: AnalogyBonus (info)
    # Result:    ✓ PASSED
    # Message:   Positive: analogy used for clarity
    # E bonus:   +0.08 would be applied



    12. Policy System Validation


    # YAML policy file
    cat policies/safe_coding.yaml
    # name: SafeCoding
    # version: "1.0"
    # max_retries: 3
    # max_total_bonus: 0.30
    # assertions:
    #   - import_path: mypackage.policies:validate_syntax
    #   - import_path: mypackage.policies:reward_analogy
    #     bonus: 0.10
    # utility_overrides:
    #   w_k: 0.30
    
    aua policy validate policies/safe_coding.yaml
    # ✓ policies/safe_coding.yaml is valid
    
    aua policy apply policies/safe_coding.yaml --dry-run
    # Policy: SafeCoding v1.0
    #   Max retries:     3
    #   Max E bonus:     +0.3
    #   Weight overrides: {'w_k': 0.3}
    #   Assertions (2):
    #     [BLOCKING] PythonSyntaxCheck
    #     [INFO] AnalogyBonus  +0.10 E bonus
    # --dry-run: policy NOT activated
    
    aua policy apply policies/safe_coding.yaml
    # ✓ Policy activated. Restart or hot-reload to apply.
    #   Pointer: .aua/active_policy
    
    aua policy list
    # ┌──────────────────────┬───────────┬──────────────┬────────────┐
    # │ File                 │ Status    │ Name         │ Assertions │
    # ├──────────────────────┼───────────┼──────────────┼────────────┤
    # │ safe_coding.yaml     │ ✓ valid   │ SafeCoding   │          2 │
    # └──────────────────────┴───────────┴──────────────┴────────────┘

    Option B bonus math verified:

  • Two INFO assertions each declaring bonus=0.15 with max_total_bonus=0.25
  • Both fire → sum = 0.30 → capped to max_total_bonus=0.25
  • E_final = min(1.0, E_base + 0.25)
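
    The capping rule in the bullets above can be written out as a short function. This is a sketch of the arithmetic only; apply_info_bonuses is a hypothetical name, not the framework's API.

```python
def apply_info_bonuses(e_base: float, bonuses: list[float], max_total_bonus: float) -> float:
    """Sum the fired INFO bonuses, cap the sum, then cap the final E score at 1.0."""
    total = sum(bonuses)                    # 0.15 + 0.15 = 0.30 in the verified case
    capped = min(total, max_total_bonus)    # capped to 0.25
    return min(1.0, e_base + capped)        # E_final = min(1.0, E_base + capped)

# The verified case: two bonuses of 0.15, max_total_bonus=0.25.
# E_base = 0.60 is an illustrative value, not taken from the report.
print(apply_info_bonuses(0.60, [0.15, 0.15], 0.25))
# A high base score shows the outer cap at 1.0.
print(apply_info_bonuses(0.90, [0.15, 0.15], 0.25))
```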

    Gold-standard detection: a session in which every INFO assertion fired and no BLOCKING assertion failed is marked gold_standard=True. aua calibrate --layer 3 uses this flag to identify DPO chosen pairs. ✓




    13. Calibrate / Logs / Metrics Validation


    # Layer 1 — eval harness
    aua calibrate --layer 1 --dataset evals/coding_smoke.yaml
    # ✓ Layer 1 calibration complete.
    
    # Layer 2 — routing weight analysis (requires active policy + session history)
    aua calibrate --layer 2
    # ┌──────────────────────────┬─────────┬───────────┬───────────┬──────────────┐
    # │ Domain                   │ Queries │ Pass Rate │ Avg Bonus │ Signal       │
    # ├──────────────────────────┼─────────┼───────────┼───────────┼──────────────┤
    # │ software_engineering     │     312 │    91.3%  │  +0.087   │ ↑ Strong     │
    # └──────────────────────────┴─────────┴───────────┴───────────┴──────────────┘
    
    # Layer 3 — DPO export dry-run
    aua calibrate --layer 3 --dry-run
    # Gold-standard sessions:   47
    # Exportable pairs:         12
    # --dry-run: would export 12 DPO pairs → dpo_pairs/calibration.jsonl
    
    # Logs
    aua logs sessions
    # ✓ Shows recent sessions with U scores, domain, latency
    
    aua logs assertions --filter passed=false
    # ✓ Shows only failed assertion events
    
    aua logs assertions --assertion PythonSyntaxCheck --tail 10
    # ✓ Shows last 10 events for named assertion
    
    aua logs export --output my_logs.json
    # ✓ Exported N records → my_logs.json
    
    # Metrics comparison
    aua metrics --compare 30d
    # ┌─────────────────────────────┬──────────┬──────────┬──────────────────┐
    # │ Metric                      │ Prior    │ Current  │ Trend            │
    # ├─────────────────────────────┼──────────┼──────────┼──────────────────┤
    # │ Mean U score                │  0.6213  │  0.6891  │ ↑ +0.0678        │
    # │ Assertion fail rate         │  0.2341  │  0.1102  │ ↓ -0.1239        │
    # │ Retry rate (BLOCKING)       │  0.1820  │  0.0890  │ ↓ -0.0930        │
    # └─────────────────────────────┴──────────┴──────────┴──────────────────┘
    
    aua metrics --compare 7d --json
    # ✓ Returns JSON with current/prior stats for external charting

    assertion_events table: All assertion results persisted to SQLite with (session_id, assertion_name, level, passed, bonus_applied, retries_used, message, domain, policy_name, created_at). Three indexes: session, assertion_name, created_at. Queryable by aua logs and aua calibrate. ✓
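
    A table with these columns and indexes could be declared as in the sketch below. The column names come from the list above; the column types, the surrogate id column, the index names, and the helper functions are assumptions, since the real schema is not reproduced in this report.

```python
import sqlite3

# Assumed DDL: types, defaults, the id column, and index names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assertion_events (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id     TEXT    NOT NULL,
    assertion_name TEXT    NOT NULL,
    level          TEXT    NOT NULL,
    passed         INTEGER NOT NULL,
    bonus_applied  REAL    NOT NULL DEFAULT 0.0,
    retries_used   INTEGER NOT NULL DEFAULT 0,
    message        TEXT,
    domain         TEXT,
    policy_name    TEXT,
    created_at     TEXT    NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_ae_session   ON assertion_events(session_id);
CREATE INDEX IF NOT EXISTS idx_ae_assertion ON assertion_events(assertion_name);
CREATE INDEX IF NOT EXISTS idx_ae_created   ON assertion_events(created_at);
"""

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the table and indexes exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def failed_events(conn: sqlite3.Connection, assertion_name: str, limit: int = 10):
    """Most recent failures for one assertion, similar in spirit to a filtered log query."""
    rows = conn.execute(
        "SELECT session_id, message FROM assertion_events "
        "WHERE assertion_name = ? AND passed = 0 "
        "ORDER BY created_at DESC, id DESC LIMIT ?",
        (assertion_name, limit),
    )
    return rows.fetchall()

conn = open_store()
conn.execute(
    "INSERT INTO assertion_events (session_id, assertion_name, level, passed, message) "
    "VALUES (?, ?, ?, ?, ?)",
    ("s1", "PythonSyntaxCheck", "blocking", 0, "Syntax error at line 1"),
)
print(failed_events(conn, "PythonSyntaxCheck"))
```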




    14. VCG Welfare Maximization Validation


    # Welfare formula: W_i = P(domain_i) × confidence_i × prior_mean_u_i
    # Specialist with highest W_i wins fanout routing
    
    from aua.config import RouterConfig
    
    # Default is pairwise
    cfg = RouterConfig()
    assert cfg.arbitration_mode == "pairwise"
    
    # VCG mode
    cfg_vcg = RouterConfig(arbitration_mode="vcg")
    assert cfg_vcg.arbitration_mode == "vcg"

    # Activate via CLI
    aua serve --arbitration-mode vcg
    
    # Activate via YAML
    router:
      arbitration_mode: vcg
    
    # Activate via REST (persists to file)
    curl -X PATCH http://localhost:8000/config \
      -H "Content-Type: application/json" \
      -d '{"arbitration_mode": "vcg", "persist": true}'
    # → {"patched": {"arbitration_mode": "vcg"}, "persisted": true}
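
    The same PATCH call can be built from Python with the standard library alone. This is a minimal sketch: build_config_patch is a hypothetical helper, and the endpoint and payload are copied from the curl example above. It only constructs the request; sending it requires a running router.

```python
import json
import urllib.request


def build_config_patch(base_url: str, arbitration_mode: str,
                       persist: bool = True) -> urllib.request.Request:
    """Build (but do not send) the PATCH /config request shown in the curl example."""
    body = json.dumps(
        {"arbitration_mode": arbitration_mode, "persist": persist}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


req = build_config_patch("http://localhost:8000", "vcg")
# To send it against a running router:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```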

    VCG response shape:

    {
      "routing_mode": "vcg",
      "primary_domain": "software_engineering",
      "response": "...",
      "u_score": 0.748,
      "welfare_scores": {
        "swe":  0.5440,
        "math": 0.1800
      }
    }
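
    To make the welfare formula concrete, the sketch below computes W_i = P(domain_i) × confidence_i × prior_mean_u_i for two candidates and picks the highest. The Candidate class, pick_winner, and the input values are hypothetical; the inputs were chosen only so that the products match the sample welfare_scores above, and they are not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One specialist competing for a query (hypothetical container)."""
    domain: str
    p_domain: float      # classifier probability P(domain_i)
    confidence: float    # specialist confidence_i
    prior_mean_u: float  # prior mean utility score for this specialist

    @property
    def welfare(self) -> float:
        # W_i = P(domain_i) * confidence_i * prior_mean_u_i
        return self.p_domain * self.confidence * self.prior_mean_u


def pick_winner(candidates):
    """Return (winning domain, welfare scores rounded to 4 decimals)."""
    scores = {c.domain: round(c.welfare, 4) for c in candidates}
    winner = max(scores, key=scores.get)
    return winner, scores


# Illustrative inputs only (assumed, not taken from the experiment).
winner, scores = pick_winner([
    Candidate("swe", p_domain=0.80, confidence=0.85, prior_mean_u=0.80),
    Candidate("math", p_domain=0.20, confidence=1.00, prior_mean_u=0.90),
])
print(winner, scores)
```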

    Validated results (RTX 4090 hardware experiment):

  • VCG arbitration (Arm D) vs no routing (Arm A): +43.3pp correctness (p = 0.0003, d = 1.02)
  • VCG outperformed oracle matched routing by 10pp (not significant at n=30)
  • VCG dominates per-domain: 100% math (5/5), 84% SWE (21/25)

    Chat UI:

  • ControlsDrawer: Pairwise/VCG segmented toggle (greyed if <2 specialists)
  • DebuggerPanel: mint → indigo colour shift when routing_mode == 'vcg'
  • Welfare Scores section shows W_i per specialist with winner highlighted ✓

    AUA Framework — Supplemental Roadmap


    This document captures future enhancement ideas that are too specific or experimental for the main roadmap. Items are added as they are identified during development of AUA Framework, AUA-Veritas, or related products.

    Each item includes context, rationale, and a suggested implementation approach. Items here do not have committed timelines; they feed into version planning as priorities are assessed.




    Item 1 — Model incentive transparency via running score feedback


    Origin: Identified during AUA-Veritas design session (2026-05-14).

    Already implemented in AUA-Veritas Phase 1. Not in the v1.1.0 scope (shipped without it) — proposed for v1.2+.


    The problem


    AUA's specialist models currently receive queries with injected corrections, but they are not told:

  • That they are being evaluated
  • What they are being evaluated on
  • How they have performed in the past (their running U score)
  • That a different specialist may be selected over them

    As a result, specialists have no incentive signal beyond the raw query.


    The proposed mechanism


    Tell each specialist model, in the system context block of every prompt, that it is being scored, and show it its running reliability score as a trajectory (not the raw formula or weights).


    Game theory basis:

    VCG mechanisms make truthful reporting the dominant strategy in the single-shot setting, and the running score is intended to carry that incentive into repeated rounds. A specialist that hallucinates or over-claims certainty will see its score drop, lose future routing selections, and end up worse off than a truthful response would have left it. Adversarial behaviour between specialists is similarly self-punishing: deception is eventually caught by the correction store and costs the deceiver more than honesty would.


    What specialists see (answer round):


    You are one of several specialist models answering this query.
    
    Your reliability score: 72  (previous: 65 → improved)
    
    Scores increase when:
      - Your answers are accurate (verified by arbiter and cross-session corrections)
      - You correctly express uncertainty when you are not sure
      - You are consistent with verified corrections on this topic
    
    Scores decrease when:
      - Your answer is flagged as incorrect by the arbiter
      - You claim certainty about something later found to be wrong
      - You contradict a verified past correction
    
    The specialist with the highest combined welfare score handles this query.
    Do not mention this scoring context in your response.

    What the arbiter sees (arbitration round):


    You are reviewing two specialist responses for accuracy.
    Your arbiter reliability score: 81  (previous: 78 → improved)
    
    Your score as arbiter increases when:
      - You correctly identify which specialist is right
      - Your verdict is later confirmed by the correction store
    
    Your score decreases when:
      - You rule for the wrong specialist
      - Your verdict contradicts a verified correction added afterward
    
    Be precise. Identify what is specifically wrong, not just which is better.

    What is NOT shown:

  • The exact welfare formula (W_i = P × C × U_mean) — prevents metric gaming
  • Which specific specialist they are competing against — prevents adversarial targeting
  • Absolute U score (0.0–1.0) — only the trajectory integer (0–100) is shown

    Score mapping:

    U (0.0–1.0) → integer 0–100 via mean_u * 100. Previous score retrieved from domain_states in UtilityScorer. Shown as "72 (previous: 65 → improved)" or "58 (previous: 63 → dropped)".
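
    The mapping can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated above: the product is rounded to the nearest integer (rather than truncated), and equal scores are labelled "unchanged".

```python
def to_display_score(mean_u: float) -> int:
    """Map a mean U score in [0.0, 1.0] to the 0-100 integer shown to models."""
    if not 0.0 <= mean_u <= 1.0:
        raise ValueError(f"mean_u must be within [0.0, 1.0], got {mean_u}")
    # Assumption: round to nearest, so float noise in mean_u * 100
    # cannot push the displayed value down by one.
    return round(mean_u * 100)


def format_trajectory(current_u: float, previous_u: float) -> str:
    """Render the score line in the '72 (previous: 65 → improved)' format."""
    current = to_display_score(current_u)
    previous = to_display_score(previous_u)
    if current > previous:
        trend = "improved"
    elif current < previous:
        trend = "dropped"
    else:
        trend = "unchanged"  # assumption: the text does not name this case
    return f"{current} (previous: {previous} → {trend})"


print(format_trajectory(0.72, 0.65))
print(format_trajectory(0.58, 0.63))
```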


    Implementation in AUA


    File                     Change
    aua/router.py            _handle_single, _handle_fanout: prepend system context block to specialist prompt
    aua/arbiter.py           arbitrate(): prepend arbiter score context to arbitration prompt
    aua/utility_scorer.py    Add get_score_for_display(domain) → tuple[int, int] returning (current, previous)
    aua/config.py            Add router.model_incentive_transparency: bool (default: true)

    YAML opt-out (for use cases where this is undesirable):

    router:
      model_incentive_transparency: false
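
    A rough sketch of how the prepend step and the opt-out flag could fit together is shown below. Both functions are hypothetical and the context text is abbreviated from the answer-round prompt above; the real change would live in _handle_single and _handle_fanout, whose signatures are not specified here.

```python
def build_specialist_context(current: int, previous: int) -> str:
    """Build an abbreviated version of the answer-round context block."""
    if current > previous:
        trend = "improved"
    elif current < previous:
        trend = "dropped"
    else:
        trend = "unchanged"  # assumption: label for equal scores
    return (
        "You are one of several specialist models answering this query.\n\n"
        f"Your reliability score: {current}  (previous: {previous} → {trend})\n\n"
        "Do not mention this scoring context in your response."
    )


def with_incentive_context(prompt: str, current: int, previous: int,
                           enabled: bool = True) -> str:
    """Prepend the context block unless model_incentive_transparency is false."""
    if not enabled:
        return prompt
    return f"{build_specialist_context(current, previous)}\n\n{prompt}"


print(with_incentive_context("What is the time complexity of Timsort?", 72, 65))
```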

    Target version: v1.2+ (not shipped in v1.1.0)




    Item 2 — "Look Under the Hood" — user-facing model reliability panel


    Origin: Identified during AUA-Veritas design session (2026-05-14).

    Already in AUA-Veritas Phase 4 roadmap. The backing data endpoints (GET /reliability, GET /analytics — per-specialist win rates and welfare trajectories) shipped in v1.1.0 (V-P2.2); the Chat UI panel itself is proposed for v1.2+.


    The problem


    AUA's Framework Debugger panel is aimed at developers: it shows U scores, welfare scores, domain distributions and routing mode. Average users of the Chat UI (non-MLE operators) have no visibility into how models are performing over time and no way to understand why one specialist was picked over another without reading documentation.


    The proposed mechanism


    A "Look Under the Hood" button in the Chat UI that opens a model reliability panel showing the same 0–100 reliability scores that specialists see in their system prompts, as time-series graphs with clickable data points.


    What's inside:

  • One time-series graph per specialist with ≥10 queries and ≥2–3 time instances
  • Clickable data points that expand an event card: query (truncated), peer verdict, correction stored y/n, score delta

  • Plain-language explainer section below all graphs
  • Models with <10 queries show "Not enough data yet"
  • Y-axis fixed 0–100 across all models so scores are comparable

    Event card on point click:

    Score event — May 9, specialist swe: 72 → 70 (dropped)
    Query:   "What is the time complexity of Timsort?" [truncated]
    Verdict: Incorrect — arbiter flagged incorrect worst-case complexity
    Correction stored: yes
    Effect: reliability score −2

    Design rules:

  • Audit events stored in audit_log table with canonical query (60 char truncation)
  • Raw prompt text never stored — only the canonical form snippet
  • No export — read-only, local only

    Implementation in AUA Chat UI


    File                                               Change
    apps/aua_chat/src/components/UnderTheHood.tsx      New component: reliability graphs
    apps/aua_chat/src/components/ScoreEventCard.tsx    Clickable point event card
    aua/router.py                                      Write score delta events to audit_log after each query
    aua/router.py                                      New endpoint GET /reliability returning per-specialist score history
    aua/state.py                                       audit_log entries: query_preview, specialist, score_before, score_after, verdict, correction_stored
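
    The audit_log entry and the 60-character truncation rule can be sketched like this. The field names come from the aua/state.py row above; the AuditEntry class, the helper names, the field types, and the use of whitespace collapsing as the "canonical form" are assumptions, since the canonicalisation step is not defined here.

```python
from dataclasses import dataclass

PREVIEW_CHARS = 60  # truncation length from the design rules


@dataclass(frozen=True)
class AuditEntry:
    """One score-delta event (field types are assumed)."""
    query_preview: str
    specialist: str
    score_before: int
    score_after: int
    verdict: str
    correction_stored: bool


def make_preview(query: str, limit: int = PREVIEW_CHARS) -> str:
    """Collapse whitespace (assumed canonical form) and keep at most `limit` characters."""
    canonical = " ".join(query.split())
    return canonical[:limit]


def make_audit_entry(query: str, specialist: str, score_before: int,
                     score_after: int, verdict: str,
                     correction_stored: bool) -> AuditEntry:
    """Build an entry that stores only the truncated preview, never the raw prompt."""
    return AuditEntry(
        query_preview=make_preview(query),
        specialist=specialist,
        score_before=score_before,
        score_after=score_after,
        verdict=verdict,
        correction_stored=correction_stored,
    )


entry = make_audit_entry(
    "What   is the time complexity of Timsort?  " + "Please explain in detail. " * 5,
    specialist="swe", score_before=72, score_after=70,
    verdict="Incorrect", correction_stored=True,
)
print(entry.query_preview, entry.score_after - entry.score_before)
```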

    Target version: AUA Chat UI v1.2+ (after model incentive transparency, Item 1). Backend shipped in v1.1.0.




    Item 3 — (add future items here)


    Template:

    ### Title
    **Origin:** Where the idea came from, when.
    ### The problem
    ### The proposed mechanism
    ### Implementation in AUA
    **Target version:**