Architecture specification, deployment profiles, compatibility matrix, permission scopes, validation report, and supplemental roadmap — all rendered here for easy reading.
Version: 1.1.0
Status: Canonical. Implementation must match this document. Divergence = a bug.
AUA (Adaptive Utility Agents) is a multi-specialist LLM routing framework. It routes queries to domain-expert models, scores outputs using a utility function, detects contradictions, resolves them with an arbiter, and feeds verified corrections back into training.
The design goal is Django for adaptive multi-model LLM systems — batteries included, deeply configurable, extensible without editing framework internals.
┌─────────────────────────────────────────────────────┐
│                     AUA Router                      │
│                                                     │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │Middleware│  │  Session  │  │   Correction    │   │
│  │ Pipeline │  │  Manager  │  │   Retrieval     │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │               Field Classifier               │   │
│  │    (pluggable via FieldClassifierPlugin)     │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │  Router  │  │ Specialist│  │ Utility Scorer  │   │
│  │ Decision │  │   Calls   │  │   (pluggable)   │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │                Arbiter Agent                 │   │
│  │  (pluggable policy via ArbiterPolicyPlugin)  │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────┐  ┌───────────┐  ┌─────────────────┐   │
│  │   Hook   │  │Correction │  │   State Store   │   │
│  │ Registry │  │  Logger   │  │   (pluggable)   │   │
│  └──────────┘  └───────────┘  └─────────────────┘   │
└─────────────────────────────────────────────────────┘

External:
  Specialist servers (vLLM / Ollama / custom ModelBackendPlugin)
  Arbiter server (same backends)
  State store (files / SQLite / Postgres)
  Observability (stdout / Prometheus / OTEL)
v1.1.0 adds eight services that run inside the router process. Each starts in the FastAPI lifespan and shuts down cleanly; none blocks the query path.
| Service | Module | What it does |
|---|---|---|
| Keyword index | aua/keywords.py | Message-level inverted index with async batch worker (50 ms), startup backfill, and DB fallback. Serves GET /search. |
| Context backups | aua/context_backup.py | Per-(specialist, conversation) token counters; 6-section handoff notes on token/message/time-gap triggers; 6-hour coverage sweep. |
| Trigger detector | aua/trigger_detector.py | Two-layer correction detection: regex Layer 1 ships built-in, Layer 2 is pluggable. Feeds POST /corrections/confirm-implicit. |
| Crash reporter | aua/crash_reporter.py | Startup sentinel + clean-shutdown marking; previous-session crashes detected (before the new sentinel is written) and reported with a queued-error flush. |
| Remote model config | aua/remote_config.py | Model registry refresh with a remote → DB-cache (7-day) → builtin fallback chain; field allowlist; 24-hour refresh job. |
| Domain ontology | aua/domain_tree.py | 10 fixed L0 roots; alias map + edit-distance resolution; candidate queue with 4-gate promotion (volume, diversity, coverage, divergence); hourly maintenance job. Serves GET /domain-tree. |
| Session ID middleware | aua/session.py | Per-request SessionContext (session/trace/request IDs) — client-supplied honored, UUIDs generated, returned as headers on every response, propagated to specialists, hooks, audit, and logs (#15). |
| YAML extension loader | aua/router.py + aua/plugins/registry.py | Loads plugins:, hooks:, and middleware: from config at startup with contract validation and Django-style project-dir imports (F-09–F-11). GET /extensions reports server truth. |
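The remote → DB-cache → builtin fallback chain listed for the remote model config service can be sketched as follows. This is only an illustration under assumptions: `resolve_registry`, `BUILTIN_REGISTRY`, and the registry shape are hypothetical names, not the contents of aua/remote_config.py; only the ordering and the 7-day cache lifetime come from the table.

```python
import time
from typing import Callable, Optional

SEVEN_DAYS = 7 * 24 * 3600  # cache lifetime from the table, in seconds

# Hypothetical built-in registry used as the last resort.
BUILTIN_REGISTRY = {"models": ["builtin-default"]}


def resolve_registry(
    fetch_remote: Callable[[], dict],
    cached: Optional[dict],
    cached_at: Optional[float],
    now: Optional[float] = None,
) -> tuple[dict, str]:
    """Return (registry, source), trying remote, then cache, then builtin."""
    now = time.time() if now is None else now
    try:
        return fetch_remote(), "remote"
    except Exception:
        pass  # remote unavailable: fall through to the cached copy
    if cached is not None and cached_at is not None and now - cached_at <= SEVEN_DAYS:
        return cached, "cache"
    return BUILTIN_REGISTRY, "builtin"
```

A caller would pass whatever function performs the remote fetch; when it raises, the cached copy is used only while it is younger than seven days.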
Every query follows this pipeline in order. Steps marked [pluggable] can be replaced or extended via plugins/hooks.
1. HTTP Request arrives at router
└─ session_id / trace_id / request_id assigned (UUID if not supplied)
2. Middleware pipeline — before_query() [pluggable]
└─ PII redaction, tenant policy, rate limiting, auth check
3. Session lookup
└─ Retrieve prior session context from state store (if session_id known)
4. Correction retrieval
└─ Load relevant verified claims from AssertionsStore for this domain
5. Field Classifier [pluggable]
└─ Scores query against all known fields → domain_distribution dict
└─ Emits: primary_domain, domain_distribution, routing_mode decision
6. Routing decision
├─ single: one field above single_domain_threshold → one specialist
├─ fanout: multiple fields above fanout_threshold → multiple specialists
└─ force_domain: override from request
7. Specialist calls [pluggable via ModelBackendPlugin]
└─ POST to specialist endpoint with correction context injected
└─ Timeout: specialist_timeout (default 60s) → AUA_SPECIALIST_TIMEOUT
8. Utility Scoring [pluggable]
└─ U = w_e·E + w_c·C + w_k·K per specialist response
└─ Kalman filter updates confidence estimate
9. Arbiter [pluggable policy]
└─ Runs if: fanout + contradiction detected
└─ 4 checks: logical, mathematical, cross-session, empirical
└─ Issues Case 1/2/3/4 verdict → correction signal
10. Hook registry — on_correction / on_promotion / etc. [pluggable]
└─ Fire registered hooks for this event type
11. Correction logging
└─ Store DPO pair to state store (if arbiter issued correction)
└─ Update AssertionsStore with verified claim
12. Response assembly
└─ RouterResponse model with session_id, u_score, routing_mode, response
13. Middleware pipeline — after_response() [pluggable]
└─ Response transformation, audit logging
Background (off the request path, v1.1):
keyword index worker · backup coverage job (6h) · remote config
refresh (24h) · ontology job (1h) · crash report on startup
14. Metrics / Logs / Traces / Audit
└─ Structured JSON to stdout
└─ Prometheus metrics (if observability profile enabled)
└─ OTEL traces (if otel extra installed)
└─ Audit log entry written to state store (append-only, hash chain)
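Steps 6 and 8 of the pipeline above can be sketched roughly as below. The threshold values, the weights, the precedence of single over fanout, and the fallback when nothing clears a threshold are all assumptions made for illustration; the source only names `single_domain_threshold`, `fanout_threshold`, `force_domain`, and the formula U = w_e·E + w_c·C + w_k·K.

```python
from typing import Optional


def decide_route(
    domain_distribution: dict[str, float],
    single_domain_threshold: float = 0.7,  # assumed value
    fanout_threshold: float = 0.3,         # assumed value
    force_domain: Optional[str] = None,
) -> tuple[str, list[str]]:
    """Return (routing_mode, domains) for a classifier distribution."""
    if force_domain is not None:
        return "force_domain", [force_domain]
    ranked = sorted(domain_distribution.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return "single", []
    top_domain, top_score = ranked[0]
    if top_score >= single_domain_threshold:
        return "single", [top_domain]
    above_fanout = [d for d, score in ranked if score >= fanout_threshold]
    if len(above_fanout) > 1:
        return "fanout", above_fanout
    # Assumption: with no clear winner, fall back to the top-ranked domain.
    return "single", [top_domain]


def utility(e: float, c: float, k: float,
            w_e: float = 0.4, w_c: float = 0.3, w_k: float = 0.3) -> float:
    """U = w_e*E + w_c*C + w_k*K; the default weights are placeholders."""
    return w_e * e + w_c * c + w_k * k
```

With these placeholder thresholds, a distribution of `{"code": 0.5, "law": 0.4}` fans out to both specialists, while `{"code": 0.9, "law": 0.1}` goes to a single one.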
| Component | Module | Owner interface |
|---|---|---|
| Field classifier | aua.field_classifier | FieldClassifierPlugin |
| Utility scorer | aua.utility_scorer | UtilityScorerPlugin |
| Arbiter policy | aua.arbiter | ArbiterPolicyPlugin |
| Promotion policy | aua.blue_green | PromotionPolicyPlugin |
| Correction store | aua.assertions_store | CorrectionStorePlugin |
| Model backend | aua.router (http calls) | ModelBackendPlugin |
| State store | aua.state | StateStorePlugin |
| Hooks | aua.hooks | HookPlugin |
| Middleware | aua.middleware | AUAMiddleware |
All plugin types are defined in aua/plugins/interfaces.py as Python Protocol classes.
1. Config loaded (load_config)
2. For each plugin reference in config:
   a. Resolve import_path: "module.path:ClassName"
   b. Import module
   c. Instantiate class with config dict injected
   d. Validate against Protocol (runtime isinstance check)
   e. Register in plugin registry
3. Router initialised with plugin registry
4. On SIGHUP: reload config → re-run the steps above atomically
Plugins are validated at startup. A failed plugin load causes startup to abort with AUA_PLUGIN_LOAD_FAILED.
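The load sequence above (resolve, import, instantiate, validate) might look like the following sketch. The `UtilityScorerPlugin` shape shown here, the `score` method, `PluginLoadError`, and `load_plugin` are assumptions for illustration; the real protocols live in aua/plugins/interfaces.py and may differ.

```python
import importlib
import sys
import types
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class UtilityScorerPlugin(Protocol):
    """Hypothetical protocol shape, used only for this sketch."""

    def score(self, response: str) -> float: ...


class PluginLoadError(RuntimeError):
    """Stands in for the AUA_PLUGIN_LOAD_FAILED startup failure."""


def load_plugin(import_path: str, config: dict, protocol: type) -> Any:
    """Resolve 'module.path:ClassName', instantiate it, and validate it."""
    module_path, sep, class_name = import_path.partition(":")
    if not sep or not module_path or not class_name:
        raise PluginLoadError(f"AUA_PLUGIN_LOAD_FAILED: bad import_path {import_path!r}")
    try:
        module = importlib.import_module(module_path)
        instance = getattr(module, class_name)(config)  # config dict injected
    except Exception as exc:
        raise PluginLoadError(f"AUA_PLUGIN_LOAD_FAILED: {import_path}: {exc}") from exc
    if not isinstance(instance, protocol):  # runtime Protocol check
        raise PluginLoadError(f"AUA_PLUGIN_LOAD_FAILED: {import_path} does not match {protocol.__name__}")
    return instance


class LengthScorer:
    """Toy plugin: scores a response by its scaled length."""

    def __init__(self, config: dict):
        self.scale = config.get("scale", 1.0)

    def score(self, response: str) -> float:
        return self.scale * len(response)


# Register a throwaway module so the import path resolves in this demo.
demo_module = types.ModuleType("demo_plugins")
demo_module.LengthScorer = LengthScorer
sys.modules["demo_plugins"] = demo_module

scorer = load_plugin("demo_plugins:LengthScorer", {"scale": 0.5}, UtilityScorerPlugin)
```

Note that a runtime `isinstance` check against a Protocol only verifies that the named methods exist, not their signatures, so a real loader may want additional contract checks.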
For each hook point, hooks fire in YAML registration order:
pre_query → [middleware.before_query] → post_route → pre_specialist_call → post_specialist_call → pre_arbiter → post_arbiter → on_correction → pre_response → [middleware.after_response] → post_response → on_promotion / on_rollback (async, not in request path)
Hook failures default to fail-open (log + continue). Set hooks.{name}.fail_closed: true to abort on failure.
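The fail-open versus fail-closed behaviour can be illustrated with a small dispatcher. `RegisteredHook` and `fire` are hypothetical names, and the convention that a hook receives and returns an event dict is an assumption rather than the documented HookPlugin contract.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("aua.hooks.sketch")


@dataclass
class RegisteredHook:
    name: str
    fn: Callable[[dict], Optional[dict]]
    fail_closed: bool = False  # mirrors hooks.{name}.fail_closed


def fire(hooks: list[RegisteredHook], event: dict) -> dict:
    """Run hooks in registration order, chaining the event through them."""
    for hook in hooks:
        try:
            result = hook.fn(event)
            if result is not None:
                event = result  # a hook may return a modified event
        except Exception:
            if hook.fail_closed:
                raise  # abort the request on failure
            logger.exception("hook %s failed; continuing (fail-open)", hook.name)
    return event
```

Under this sketch a failing fail-open hook is logged and skipped, and later hooks still see the event as it stood before the failure.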
Every request → structured JSON log line (stdout)
→ Prometheus counter/histogram increment (if enabled)
→ OTEL span (if aua[otel] installed)
→ Audit log entry (state store, append-only)
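An append-only, hash-chained audit log of the kind mentioned for the state store can be sketched as below. The entry layout, the genesis value, and the use of SHA-256 over canonical JSON are assumptions for illustration, not the framework's actual audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed previous-hash value for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], payload: dict) -> dict:
    """Append an entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev_hash": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Because each hash depends on the one before it, modifying an earlier entry invalidates every later entry during verification.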
Key metrics:
aua_queries_total{domain, routing_mode, status}
aua_query_latency_seconds{domain, routing_mode}
aua_utility_score{domain}
aua_contradiction_rate{domain}
aua_arbiter_verdict_total{case}
aua_specialist_errors_total{specialist, error_code}
… 127.0.0.1 or Docker internal network.
… (/extensions/*) are disabled in production mode.
… GET /config.

All persistent state goes through the StateStore interface:
| Data | v0.7 location | v0.8+ (default) |
|---|---|---|
| Promotion log | .aua/state/promotions.jsonl | SQLite: promotions table |
| Correction pairs | dpo_pairs/*.jsonl | SQLite: corrections table |
| Assertions | In-memory (AssertionsStore) | SQLite: assertions table |
| Sessions | None | SQLite: sessions table |
| Audit log | None | SQLite: audit_log table |
Migration from v0.7 flat files: aua config migrate --from 0.7 --to 0.8
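Conceptually, moving a v0.7 JSONL file into a SQLite table looks something like the sketch below. It does not describe what aua config migrate does internally: the `promotions` schema, the `promotion_id` field, and `migrate_promotions` are hypothetical, chosen only to show the flat-file-to-table idea.

```python
import json
import sqlite3
from pathlib import Path


def migrate_promotions(jsonl_path: Path, db_path: Path) -> int:
    """Copy JSONL records into a promotions table; return the row count."""
    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS promotions ("
            "id TEXT PRIMARY KEY, record TEXT NOT NULL)"  # assumed schema
        )
        rows = []
        for line in jsonl_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # tolerate blank lines in the flat file
            record = json.loads(line)
            rows.append((record["promotion_id"], json.dumps(record, sort_keys=True)))
        with conn:  # one transaction: all rows land or none do
            conn.executemany("INSERT OR IGNORE INTO promotions VALUES (?, ?)", rows)
        return conn.execute("SELECT COUNT(*) FROM promotions").fetchone()[0]
    finally:
        conn.close()
```

`INSERT OR IGNORE` keeps the sketch idempotent, so running it twice over the same file does not duplicate rows.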
Users extend AUA by adding YAML entries — never by editing framework source files.
# Custom utility scorer
utility_scorer:
  import_path: plugins.custom_utility:RiskWeightedUtilityScorer
  config:
    risk_weight: 0.7

# Custom middleware
middleware:
  - import_path: plugins.middleware:PIIRedactionMiddleware
  - import_path: plugins.middleware:AuditMiddleware

# Custom hook
hooks:
  on_correction:
    - import_path: plugins.hooks:SlackNotificationHook
      config:
        webhook_url_secret: SLACK_WEBHOOK_URL

# Custom backend
backends:
  my_gateway:
    import_path: plugins.backends:GatewayBackend
    base_url: https://gateway.internal
    auth_secret: GATEWAY_API_KEY
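A class referenced by a middleware entry such as plugins.middleware:PIIRedactionMiddleware might look like the sketch below. The method names `before_query` and `after_response` come from the pipeline description; their signatures, the `placeholder` config key, and the email-only redaction are assumptions, not the AUAMiddleware contract.

```python
import re
from typing import Optional

# Deliberately simple email pattern; real PII detection needs much more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class PIIRedactionMiddleware:
    """Hypothetical middleware: redacts email addresses before routing."""

    def __init__(self, config: Optional[dict] = None):
        # 'placeholder' is an assumed config key for this sketch.
        self.placeholder = (config or {}).get("placeholder", "[REDACTED_EMAIL]")

    def before_query(self, query: str) -> str:
        return EMAIL_RE.sub(self.placeholder, query)

    def after_response(self, response: str) -> str:
        return response  # nothing to transform on the way out


redacted = PIIRedactionMiddleware().before_query("mail bob@example.com now")
```

Keeping the class free of framework imports means it can be unit-tested on its own before being wired in through YAML.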
Document maintained by: Praneeth Tota. Last updated: v0.8.0b0. For implementation questions, check this document first.
Version: 1.1.0
Status: Canonical. Each profile defines minimum requirements and recommended settings.
AUA ships with four deployment profiles. Choose the profile that matches your environment, then use aua init with the appropriate tier and configure security accordingly.
| Profile | Auth | State | mTLS | Observability | Use case |
|---|---|---|---|---|---|
| Local Developer | Optional | SQLite | No | Optional | Solo dev, experimentation |
| Single GPU Workstation | Recommended | SQLite | No | Optional | Personal GPU server |
| Team Server | Required | Postgres/SQLite | Required | Required | Shared team deployment |
| Enterprise | Required + IAM | Postgres | Required | Required | Production, regulated |
Target: MacBook Pro / laptop. Ollama backend. No GPU required.
# aua_config.yaml
aua:
  version: "1.0"
  backend: ollama
security:
  auth_enabled: false  # acceptable for localhost-only
state:
  backend: sqlite
  path: .aua/state/aua.db
logging:
  level: INFO
  format: text  # human-readable for local dev
Setup:
brew install ollama
aua init . --tier macbook --preset coding
aua doctor
aua serve
Doctor checks for this profile:
Limitations:
Target: RTX 4090 or similar consumer GPU. vLLM backend. Single user or small team on LAN.
aua:
  version: "1.0"
  backend: vllm
security:
  auth_enabled: true
  token_secret_env: AUA_TOKEN_SECRET
state:
  backend: sqlite
  path: .aua/state/aua.db
logging:
  level: INFO
  format: json
Setup:
export AUA_TOKEN_SECRET=$(python3 -c "import secrets; print(secrets.token_hex(32))")
aua init . --tier single-4090 --preset coding
aua token create --scope aua:query --expires 90d --label "primary"
aua doctor --strict
aua serve
Doctor checks for this profile:
Target: Dedicated Linux server, RTX 4090 or A100. Shared team access. Prometheus + Grafana monitoring.
aua:
  version: "1.0"
  backend: vllm
security:
  auth_enabled: true
  token_secret_env: AUA_TOKEN_SECRET
  mtls:
    enabled: true
    cert_dir: /etc/aua/certs
    auto_generate: false  # use your own CA in production
state:
  backend: sqlite  # or postgres for HA
  path: /var/lib/aua/state/aua.db
logging:
  level: INFO
  format: json
  output: /var/log/aua/router.log
rate_limits:
  aua:query:
    requests_per_minute: 120
  aua:admin:
    requests_per_minute: 10
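One way a per-scope requests_per_minute limit could be enforced is a sliding-window counter, sketched below. Whether AUA limits per token or globally, and which algorithm it uses, is not stated in this document; `SlidingWindowLimiter`, the per-(token, scope) keying, and the unlimited default for unlisted scopes are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional

# Values taken from the Team Server config above.
RATE_LIMITS = {"aua:query": 120, "aua:admin": 10}


class SlidingWindowLimiter:
    """Allow at most N requests per (token, scope) within a rolling window."""

    def __init__(self, limits: dict[str, int], window_seconds: float = 60.0):
        self.limits = limits
        self.window = window_seconds
        self._hits: dict[tuple[str, str], deque] = defaultdict(deque)

    def allow(self, token_id: str, scope: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        limit = self.limits.get(scope)
        if limit is None:
            return True  # assumption: scopes without a limit are unrestricted
        hits = self._hits[(token_id, scope)]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop timestamps that left the window
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True
```

The `now` parameter exists so the limiter can be tested deterministically without sleeping.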
Setup:
# Generate certs (or use your own CA)
aua certs generate --cert-dir /etc/aua/certs

# Create tokens per team member
aua token create --scope aua:query --scope aua:stream --expires 30d --label "team-alice"
aua token create --scope aua:admin --expires 1d --label "ci-deploy"

# Start with observability
docker compose --profile obs up prometheus grafana -d
aua serve
Doctor checks for this profile:
Target: Multi-GPU cluster, regulated environment, audit requirements.
aua:
  version: "1.0"
  backend: vllm
secrets:
  provider: vault  # or: aws, gcp
  vault_url: https://vault.internal
  token_env: VAULT_TOKEN
security:
  auth_enabled: true
  token_secret_env: AUA_TOKEN_SECRET
  mtls:
    enabled: true
    cert_dir: /etc/aua/certs
    auto_generate: false
  encryption:
    enabled: true
    key_secret: AUA_ENCRYPTION_KEY
state:
  backend: sqlite  # postgres recommended for HA
  path: /var/lib/aua/state/aua.db
logging:
  level: INFO
  format: json
  output: stdout  # forward to ELK/Splunk via log aggregator
rate_limits:
  aua:query:
    requests_per_minute: 300
  aua:admin:
    requests_per_minute: 5
# Disable development features
extensions:
  runtime_import_enabled: false  # never allow runtime plugin loading
  allowlist_only: true
Additional requirements:
Doctor checks for this profile:
Run with --strict to enforce profile requirements:
aua doctor --strict
Exit codes:
0: all checks pass
1: one or more checks failed
2: warnings in strict mode (treated as failures)

The doctor automatically detects which profile you're running based on your config and applies the appropriate check set.
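The exit-code rules for the doctor can be expressed as a small function. `Status` and `doctor_exit_code` are hypothetical names, and returning 0 for warnings outside strict mode is an assumption inferred from the wording of the list.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


def doctor_exit_code(results: list[Status], strict: bool = False) -> int:
    """Map check results to the exit codes 0, 1, and 2."""
    if Status.FAIL in results:
        return 1  # one or more checks failed
    if strict and Status.WARN in results:
        return 2  # warnings are treated as failures in strict mode
    return 0  # all checks pass (warnings tolerated when not strict)
```

A CI job could then branch on the code, for example treating 2 as a blocking result only on release branches.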
Profile 1 → 2: Enable auth, set AUA_TOKEN_SECRET, create tokens.
Profile 2 → 3: Add mTLS, configure rate limits, add observability stack.
Profile 3 → 4: Add secrets manager, enable encryption, disable runtime imports.
Migration: aua config migrate --from 0.9 --to 1.0
Version: 1.1.0
| Version | Status | Notes |
|---|---|---|
| 3.10 | ✓ Supported | Minimum version |
| 3.11 | ✓ Supported | Recommended |
| 3.12 | ✓ Supported | Tested in CI |
| 3.9 | ✗ Not supported | f-string syntax incompatible |
| 3.13 | ⚠ Experimental | Not yet in CI matrix |
| OS | Backend | Notes |
|---|---|---|
| macOS (Apple Silicon M1/M2/M3/M4) | Ollama only | vLLM has no macOS support |
| macOS (Intel) | Ollama only | CPU inference only |
| Ubuntu 20.04+ | vLLM + Ollama | Recommended for production |
| Debian 11+ | vLLM + Ollama | |
| RHEL/Rocky 8+ | vLLM + Ollama | |
| Windows | Not tested | Use WSL2 + Ubuntu |
| GPU | VRAM | Tier | Max simultaneous specialists |
|---|---|---|---|
| Apple M-series (unified) | 16–128 GB | macbook | 3 (via Ollama, sequential) |
| NVIDIA RTX 4090 | 24 GB | single-4090 | 3 (AWQ, concurrent) |
| NVIDIA RTX 3090/4080 | 24 GB / 16 GB | single-4090 | 2–3 (may need lower util) |
| 4× NVIDIA RTX 4090 | 96 GB total | quad-4090 | 6–8 |
| NVIDIA A100 80 GB | 80 GB | a100-cluster | 4–6 (fp16) |
| NVIDIA H100 80 GB | 80 GB | a100-cluster | 4–6 (fp16) |
VRAM estimates (AWQ 4-bit):
| Version | Status | Notes |
|---|---|---|
| 0.3.x | ✓ Supported | |
| 0.4.x | ✓ Supported | Recommended |
| 0.5.x+ | ✓ Supported | |
Supported model formats via Ollama: GGUF (Q4, Q5, Q8), fp16
| Version | Status | Notes |
|---|---|---|
| 0.4.x | ✓ Supported | |
| 0.5.x | ✓ Supported | Recommended |
| 0.6.x+ | ✓ Supported | |
Supported model formats via vLLM: AWQ, GPTQ, fp16, bf16
| Format | Ollama | vLLM | Notes |
|---|---|---|---|
| GGUF (Q4_K_M) | ✓ | ✗ | Ollama default |
| GGUF (Q5_K_M) | ✓ | ✗ | Higher quality |
| AWQ | ✗ | ✓ | Fastest on GPU |
| GPTQ | ✗ | ✓ | |
| fp16 | ✓ (via Ollama) | ✓ | Full precision |
| bf16 | ✗ | ✓ | A100/H100 only |
| CUDA Version | Status | Notes |
|---|---|---|
| 11.8 | ✓ Supported | |
| 12.0 | ✓ Supported | |
| 12.1 | ✓ Supported | Recommended |
| 12.2+ | ✓ Supported | |
Requires: nvidia-driver >= 520
| Backend | Status | Notes |
|---|---|---|
| SQLite (WAL) | ✓ Default | All deployments |
| Files (JSONL) | ✓ Legacy | v0.7 compatibility |
| PostgreSQL 14+ | ✓ Supported | Team/Enterprise profiles |
| PostgreSQL 13 | ⚠ Partial | No JSON operators |
| MySQL/MariaDB | ✗ Not supported | |
| Tool | Version | Status |
|---|---|---|
| Prometheus | 2.x / 3.x | ✓ Supported |
| Grafana | 9.x / 10.x / 13.x | ✓ Supported |
| OpenTelemetry Collector | 0.80+ | ✓ Supported |
| Datadog | Any (via OTEL) | ✓ Supported |
| Jaeger | 1.x | ✓ Supported |
| Tool | Version | Status |
|---|---|---|
| Docker Engine | 24+ | ✓ Supported |
| Docker Desktop (Mac) | 4.x | ✓ Supported |
| Docker Compose | v2.x | ✓ Required |
| Podman | 4+ | ⚠ Experimental |
| Browser | Status |
|---|---|
| Chrome / Chromium 110+ | ✓ Supported |
| Firefox 110+ | ✓ Supported |
| Safari 16+ | ✓ Supported |
| Edge 110+ | ✓ Supported |
Runtime: Node.js 18+ required for aua ui / aua serve --with-ui
| Package | Min version | Notes |
|---|---|---|
| fastapi | 0.100+ | |
| uvicorn | 0.20+ | |
| httpx | 0.25+ | |
| pydantic | 2.0+ | v1 not supported |
| click | 8.0+ | |
| rich | 13.0+ | |
| pyyaml | 6.0+ | |
| cryptography | 41.0+ | Optional — certs + encryption |
| prometheus-client | 0.17+ | Optional — metrics |
| opentelemetry-sdk | 1.20+ | Optional — aua[otel] |
| Hardware | OS | Backend | Status |
|---|---|---|---|
| MacBook Pro M1 Max (32 GB) | macOS 14 | Ollama | ✓ Primary dev platform |
| MacBook Pro M2 (16 GB) | macOS 14 | Ollama | ✓ |
| Desktop RTX 4090 | Ubuntu 22.04 | vLLM | ✓ |
| RunPod RTX 4090 (24 GB) | Ubuntu 22.04 | vLLM | ✓ CI validation |
Version: 1.1.0
Status: Canonical. Authentication implemented in v0.9-rc1.
| Scope | Description |
|---|---|
| aua:query | Send queries via POST /query |
| aua:stream | Send streaming queries via POST /query/stream |
| aua:batch | Send batch queries via POST /query/batch |
| aua:status | Read GET /status, GET /health/*, GET /version |
| aua:config:read | Read GET /config (secrets redacted) |
| aua:config:write | Reload config via POST /config/reload |
| aua:corrections:read | Read GET /corrections |
| aua:corrections:write | Inject corrections via POST /corrections |
| aua:deploy | Trigger green evaluation via POST /deploy/green |
| aua:rollback | Execute rollback (CLI + REST) |
| aua:extensions:read | Read GET /extensions, GET /extensions/{name} |
| aua:extensions:write | Load/reload extensions, test imports |
| aua:tokens:read | List and inspect tokens (CLI: aua token list) |
| aua:tokens:write | Create and revoke tokens (CLI: aua token create/revoke) |
| aua:admin | All scopes, for operator/admin use only |
| Endpoint | Method | Required Scope | Notes |
|---|---|---|---|
| /query | POST | aua:query | |
| /query/stream | POST | aua:stream | |
| /query/batch | POST | aua:batch | |
| /health/live | GET | none | Public, used by load balancers |
| /health/ready | GET | none | Public |
| /health/startup | GET | none | Public |
| /version | GET | none | Public |
| /docs | GET | none | Disable in production via config |
| /status | GET | aua:status | |
| /config | GET | aua:config:read | Secrets always redacted |
| /config/reload | POST | aua:config:write | |
| /corrections | GET | aua:corrections:read | |
| /corrections | POST | aua:corrections:write | |
| /deploy/green | POST | aua:deploy | |
| /deploy/rollback | POST | aua:rollback | |
| /extensions | GET | aua:extensions:read | Disabled in production |
| /extensions/{name} | GET | aua:extensions:read | Disabled in production |
| /extensions/reload | POST | aua:extensions:write | Disabled in production |
| /extensions/test | POST | aua:extensions:write | Dev only |
| /metrics | GET | aua:status | Prometheus scrape endpoint |
| v1.1: persistence, search & production ops | | | |
| /conversations | POST / GET | aua:query | |
| /conversations/{id}/title | PATCH | aua:query | |
| /conversations/{id}/messages | GET / POST | aua:query | |
| /projects | POST / GET | aua:query | |
| /search | GET | aua:query | |
| /context/backup/coverage | GET | aua:status | |
| /context/backup/run-coverage-job | POST | aua:query | |
| /corrections/confirm-implicit | POST | aua:corrections:write | |
| /corrections/{id} | PATCH / DELETE | aua:corrections:write | DELETE is a soft delete (scope='superseded') |
| /corrections/evidence | GET | aua:corrections:read | |
| /analytics, /reliability, /usage, /pricing | GET | aua:status | |
| /version/check, /update/skipped | GET | none | Public |
| /update/skip | POST | aua:config:write | |
| /bug-report | POST | none | Returns 200 even without a PAT configured |
| /local/models, /local/settings | GET | aua:status | |
| /local/models, /local/settings | POST | aua:config:write | |
| /local/specialist/{id} | PATCH | aua:config:write | |
| /domain-tree | GET | aua:status | |
| Role | Scopes granted |
|---|---|
| reader | aua:query aua:stream aua:status |
| operator | All except aua:admin aua:extensions:write |
| admin | aua:admin (all scopes) |
| ci-deploy | aua:deploy aua:rollback aua:config:read |
| monitoring | aua:status |
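The scope and role tables above imply an authorization check along the lines of the sketch below. The dictionaries cover only a few rows for brevity, and `is_allowed` plus the deny-by-default rule for unlisted endpoints are assumptions, not the router's actual enforcement code.

```python
from typing import Optional

# A subset of the role table above.
ROLE_SCOPES: dict[str, set[str]] = {
    "reader": {"aua:query", "aua:stream", "aua:status"},
    "ci-deploy": {"aua:deploy", "aua:rollback", "aua:config:read"},
    "monitoring": {"aua:status"},
    "admin": {"aua:admin"},
}

# A subset of the endpoint table; None marks a public endpoint.
ENDPOINT_SCOPES: dict[tuple[str, str], Optional[str]] = {
    ("POST", "/query"): "aua:query",
    ("GET", "/status"): "aua:status",
    ("POST", "/deploy/green"): "aua:deploy",
    ("POST", "/config/reload"): "aua:config:write",
    ("GET", "/health/live"): None,
}


def is_allowed(token_scopes: set[str], method: str, path: str) -> bool:
    """Check a token's scopes against the scope an endpoint requires."""
    key = (method, path)
    if key not in ENDPOINT_SCOPES:
        return False  # assumption: unknown endpoints are denied
    required = ENDPOINT_SCOPES[key]
    if required is None:
        return True  # public endpoint
    # aua:admin implies every other scope.
    return "aua:admin" in token_scopes or required in token_scopes
```

For example, a ci-deploy token can trigger POST /deploy/green but cannot call POST /query under this mapping.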
… security: {auth_enabled: false} and explicit warning.

# aua_config.yaml (local dev only)
security:
  auth_enabled: false  # NEVER set this in production
When auth_enabled: false, aua doctor prints a prominent WARNING. The warning cannot be suppressed.
Version: 1.0.0
Date: 2026-05-11
Status: All validation criteria met. v1.0.0 shipped.
v1.1.0 addendum — shipped 2026-06-10
The report below is the v1.0.0 record, preserved as-is. v1.1.0 adds: the complete AUA-Veritas backport (V-P1–V-P3 — persistence/search, context backups, correction lifecycle, analytics suite, update management, bug reporting, projects, local models, domain ontology), end-to-end session/trace/request IDs (#15), live Vault + AWS Secrets Manager integration tests and the secrets: config block (#19), and YAML-wired plugins/hooks/middleware with strict config validation (F-09–F-11). Validation: 297 tests across Python 3.10/3.11/3.12, a 40-check live-router E2E suite, 28 new REST endpoints (50+ total), and every tutorial command verified against a live router. See the v1.1 roadmap section and CHANGELOG.md for the item-by-item record.
pytest -v --tb=short
Matrix: Python 3.10, 3.11, 3.12. All green on CI (GitHub Actions).
New tests added (76 total since 1.0.0): test_guard.py (32), test_policy.py (20), test_hooks_wired.py (21), test_vcg.py (10 — RouterConfig defaults, _vcg_select winner selection, n≥3 welfare calculation, tie-breaking, no-history prior_u=1.0, prior history used, non-negative scores, single specialist, version bump).
test_cli_doctor.py::test_doctor_runs_without_crash PASSED
test_cli_doctor.py::test_doctor_config_check_passes PASSED
test_cli_doctor.py::test_doctor_config_check_fails_missing_file PASSED
test_cli_doctor.py::test_doctor_hardware_vllm_on_apple_fails PASSED
test_cli_doctor.py::test_doctor_hardware_ollama_on_apple_passes PASSED
test_cli_doctor.py::test_doctor_hardware_nvidia_vllm_passes PASSED
test_cli_doctor.py::test_doctor_returns_integer PASSED
test_cli_doctor.py::test_doctor_json_output PASSED
test_cli_doctor.py::test_doctor_strict_exits_2_on_warn PASSED
test_cli_init.py::test_init_creates_directory PASSED
test_cli_init.py::test_init_creates_expected_files PASSED
test_cli_init.py::test_init_gitignore_content PASSED
test_cli_init.py::test_init_default_tier_is_single_4090 PASSED
test_cli_init.py::test_init_macbook_tier PASSED
test_cli_init.py::test_init_force_overwrites PASSED
test_cli_init.py::test_init_refuses_overwrite_without_force PASSED
test_cli_init.py::test_init_existing_dir_is_reused PASSED
test_cli_init.py::test_init_all_tiers[macbook] PASSED
test_cli_init.py::test_init_all_tiers[single-4090] PASSED
test_cli_init.py::test_init_all_tiers[quad-4090] PASSED
test_cli_init.py::test_init_all_tiers[a100-cluster] PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[macbook] PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[single-4090] PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[quad-4090] PASSED
test_cli_init.py::test_init_all_tiers_generate_valid_config[a100-cluster] PASSED
test_config.py::test_load_minimal_config PASSED
test_config.py::test_specialist_endpoint_url PASSED
test_config.py::test_specialist_for_field PASSED
test_config.py::test_vllm_command PASSED
test_config.py::test_blue_green_for PASSED
test_config.py::test_all_endpoints PASSED
test_config.py::test_available_tiers PASSED
test_config.py::test_load_tier[macbook] PASSED
test_config.py::test_load_tier[single-4090] PASSED
test_config.py::test_load_tier[quad-4090] PASSED
test_config.py::test_load_tier[a100-cluster] PASSED
test_config.py::test_macbook_tier_uses_ollama PASSED
test_config.py::test_single_4090_tier_uses_vllm PASSED
test_config.py::test_a100_cluster_tier_no_enforce_eager PASSED
test_config.py::test_unknown_tier_raises PASSED
test_config.py::test_missing_config_raises PASSED
test_config.py::test_unknown_specialist_raises PASSED
test_config.py::test_specialist_endpoint_uses_host PASSED
test_config.py::test_specialist_models_url_uses_host PASSED
test_config.py::test_arbiter_endpoint_uses_host PASSED
test_config.py::test_endpoint_override PASSED
test_config.py::test_custom_scheme PASSED
test_config.py::test_runtime_config_defaults PASSED
test_config.py::test_runtime_ensure_creates_dirs PASSED
test_config.py::test_router_cors_defaults_to_wildcard PASSED
test_config.py::test_duplicate_ports_raises PASSED
test_config.py::test_unknown_key_raises PASSED
test_config.py::test_invalid_threshold_raises PASSED
test_config.py::test_gpu_memory_utilization_zero_raises PASSED
test_config.py::test_tier_aliases_imported PASSED
test_config.py::test_alias_rtx4090_loads_single_4090 PASSED
test_config.py::test_alias_a100_loads_a100_cluster PASSED
test_config.py::test_quad_4090_has_multiple_gpus PASSED
test_config.py::test_quad_4090_has_law_specialist PASSED
test_config.py::test_single_4090_uses_awq PASSED
test_config.py::test_a100_cluster_uses_fp16 PASSED
test_config.py::test_unknown_tier_error_mentions_aliases PASSED
test_imports.py::test_core_imports PASSED
test_imports.py::test_arbiter_alias PASSED
test_imports.py::test_version_export PASSED
test_imports.py::test_endpoint_models_exported PASSED
test_imports.py::test_stream_models_exported PASSED
test_imports.py::test_config_submodule PASSED
test_imports.py::test_no_private_imports_required PASSED
test_rollback.py::test_record_promotion_creates_log PASSED
test_rollback.py::test_load_promotions_empty PASSED
test_rollback.py::test_load_promotions_after_record PASSED
test_rollback.py::test_rollback_no_history_returns_1 PASSED
test_rollback.py::test_rollback_success PASSED
test_rollback.py::test_rollback_updates_config PASSED
test_rollback.py::test_rollback_marks_promotion_reverted PASSED
test_rollback.py::test_rollback_appends_rollback_event PASSED
test_rollback.py::test_double_rollback_returns_1 PASSED
test_rollback.py::test_rollback_all_skips_specialists_with_no_history PASSED
test_rollback.py::test_rollback_cli_no_restart PASSED
test_rollback.py::test_promotions_saved_as_jsonl PASSED
test_rollback.py::test_promotion_id_is_uuid PASSED
test_rollback.py::test_rollback_dry_run PASSED
test_rollback.py::test_atomic_config_write_no_tmp_left PASSED
test_router_api.py::test_health_live PASSED
test_router_api.py::test_health_ready_with_fake_server PASSED
test_router_api.py::test_health_startup_after_ready PASSED
test_router_api.py::test_health_legacy_endpoint PASSED
test_router_api.py::test_version_endpoint PASSED
test_router_api.py::test_config_endpoint PASSED
test_router_api.py::test_config_does_not_expose_secrets PASSED
test_router_api.py::test_post_correction PASSED
test_router_api.py::test_get_corrections_empty PASSED
test_router_api.py::test_get_corrections_after_post PASSED
test_router_api.py::test_status_endpoint_structure PASSED
test_router_api.py::test_query_single_domain PASSED
test_router_api.py::test_query_response_contains_text PASSED
test_router_api.py::test_query_batch PASSED
test_router_api.py::test_reset_endpoint PASSED
test_router_api.py::test_openapi_json_accessible PASSED
test_router_api.py::test_docs_accessible PASSED
test_router_api.py::test_redoc_accessible PASSED
test_router_api.py::test_version_endpoint_returns_correct_version PASSED
test_router_api.py::test_cors_uses_config_origins PASSED
test_router_api.py::test_stream_named_event_fields PASSED
test_router_api.py::test_stream_content_encoding_none PASSED
test_status.py::test_fmt_uptime[0-0s] PASSED
test_status.py::test_fmt_uptime[45-45s] PASSED
test_status.py::test_fmt_uptime[90-1m 30s] PASSED
test_status.py::test_fmt_uptime[3600-1h 0m] PASSED
test_status.py::test_fmt_uptime[3723-1h 2m] PASSED
test_status.py::test_fmt_uptime[7200-2h 0m] PASSED
test_status.py::test_mini_bar_full PASSED
test_status.py::test_mini_bar_empty PASSED
test_status.py::test_mini_bar_half PASSED
test_status.py::test_mini_bar_width PASSED
test_status.py::test_render_returns_panel PASSED
test_status.py::test_render_shows_up_down PASSED
test_status.py::test_render_shows_utility_score PASSED
test_status.py::test_render_shows_memory PASSED
test_status.py::test_render_none_shows_error_panel PASSED
test_streaming.py::test_stream_returns_200 PASSED
test_streaming.py::test_stream_emits_start_event PASSED
test_streaming.py::test_stream_emits_chunk_events PASSED
test_streaming.py::test_stream_emits_done_event PASSED
test_streaming.py::test_stream_event_order PASSED
test_streaming.py::test_stream_chunks_concatenate_to_response PASSED
test_streaming.py::test_stream_sse_headers PASSED
test_version.py::test_version_format PASSED
test_version.py::test_init_re_exports_version PASSED
test_version.py::test_version_in_all PASSED
test_guard.py::test_assertion_decorator_creates_fn PASSED
test_guard.py::test_assertion_string_level PASSED
test_guard.py::test_assertion_registered_in_registry PASSED
test_guard.py::test_assertion_callable PASSED
test_guard.py::test_blocking_level PASSED
test_guard.py::test_soft_level PASSED
test_guard.py::test_info_level PASSED
test_guard.py::test_python_syntax_check_passes_good_code PASSED
test_guard.py::test_python_syntax_check_fails_bad_code PASSED
test_guard.py::test_python_syntax_check_passes_non_code PASSED
test_guard.py::test_analogy_bonus_fires_on_analogy PASSED
test_guard.py::test_analogy_bonus_neutral_without_analogy PASSED
test_guard.py::test_no_refusal_soft_flags PASSED
test_guard.py::test_no_refusal_passes_normal PASSED
test_guard.py::test_min_length_soft_flags_short PASSED
test_guard.py::test_min_length_passes_normal PASSED
test_guard.py::test_list_assertions_returns_list PASSED
test_guard.py::test_policy_run_info_bonus_applied PASSED
test_guard.py::test_policy_run_multiple_bonuses_sum PASSED
test_guard.py::test_policy_run_bonus_capped_by_max_total PASSED
test_guard.py::test_policy_run_no_bonus_if_neutral PASSED
test_guard.py::test_policy_run_blocking_pass PASSED
test_guard.py::test_policy_run_blocking_fail_no_retry_fn PASSED
test_guard.py::test_policy_run_blocking_retry_succeeds PASSED
test_guard.py::test_policy_gold_standard_flag PASSED
test_guard.py::test_policy_not_gold_standard_if_blocking_failed PASSED
test_guard.py::test_policy_chaining PASSED
test_guard.py::test_policy_summary PASSED
test_policy.py::test_policy_defaults PASSED
test_policy.py::test_policy_add_wrong_type_raises PASSED
test_policy.py::test_load_policy_basic PASSED
test_policy.py::test_load_policy_weight_overrides PASSED
test_policy.py::test_load_policy_not_found_raises PASSED
test_policy.py::test_load_policy_missing_name_raises PASSED
test_policy.py::test_load_policy_bad_yaml_raises PASSED
test_policy.py::test_validate_policy_yaml_valid PASSED
test_policy.py::test_validate_policy_yaml_missing_name PASSED
test_policy.py::test_validate_policy_yaml_invalid_level PASSED
test_policy.py::test_validate_policy_yaml_bonus_out_of_range PASSED
test_policy.py::test_validate_policy_yaml_unknown_weight_key PASSED
test_policy.py::test_validate_policy_yaml_not_found PASSED
test_policy.py::test_validate_policy_yaml_missing_import_path PASSED
test_policy.py::test_policy_utility_overrides_accessible PASSED
test_policy.py::test_policy_summary_includes_all_fields PASSED
test_hooks_wired.py::test_all_11_hook_points_defined PASSED
test_hooks_wired.py::test_unknown_hook_point_raises PASSED
test_hooks_wired.py::test_hook_receives_event_dict PASSED
test_hooks_wired.py::test_hook_can_modify_event PASSED
test_hooks_wired.py::test_multiple_hooks_chain PASSED
test_hooks_wired.py::test_fail_open_hook_continues_on_error PASSED
test_hooks_wired.py::test_fail_closed_hook_propagates_error PASSED
test_hooks_wired.py::test_timeout_fail_open PASSED
test_hooks_wired.py::test_timeout_fail_closed_raises PASSED
test_hooks_wired.py::test_fire_background_does_not_block PASSED
test_hooks_wired.py::test_registered_hooks_summary PASSED
test_hooks_wired.py::test_on_correction_event_fields PASSED
test_hooks_wired.py::test_on_promotion_event_fields PASSED
test_hooks_wired.py::test_on_rollback_event_fields PASSED
test_hooks_wired.py::test_pre_query_event_fields PASSED
test_hooks_wired.py::test_post_route_event_fields PASSED
test_hooks_wired.py::test_pre_specialist_call_event_fields PASSED
test_hooks_wired.py::test_post_specialist_call_event_fields PASSED
test_hooks_wired.py::test_pre_arbiter_event_fields PASSED
test_hooks_wired.py::test_post_arbiter_event_fields PASSED
test_hooks_wired.py::test_pre_response_event_fields PASSED
test_vcg.py::test_router_config_default_arbitration_mode PASSED
test_vcg.py::test_router_config_accepts_vcg PASSED
test_vcg.py::test_vcg_select_winner_has_highest_welfare PASSED
test_vcg.py::test_vcg_select_welfare_dict_contains_all_specialists PASSED
test_vcg.py::test_vcg_select_n2_correct_winner PASSED
test_vcg.py::test_vcg_select_tie_broken_by_confidence PASSED
test_vcg.py::test_vcg_select_no_history_defaults_prior_u_to_1 PASSED
test_vcg.py::test_vcg_select_with_prior_history PASSED
test_vcg.py::test_vcg_welfare_scores_are_non_negative PASSED
test_vcg.py::test_vcg_select_single_specialist PASSED
test_vcg.py::test_version_is_102 PASSED
test_version.py::test_cli_version PASSED
======================== 208 passed, 6 warnings in 11.20s ========================
Matrix: Python 3.10, 3.11, 3.12. All green on CI (GitHub Actions).
| Method | Path | Description |
|---|---|---|
POST | /query | Route a single query through the specialist graph |
POST | /query/stream | Stream a query response token-by-token (SSE) |
POST | /query/batch | Route multiple queries in parallel |
GET | /health/live | Liveness probe — is the router process alive? |
GET | /health/ready | Readiness probe — are all specialists reachable? |
GET | /health/startup | Startup probe — has the framework finished initialising? |
GET | /health | Legacy liveness alias |
POST | /corrections | Inject a correction into the assertions store |
GET | /corrections | List stored corrections |
GET | /config | Return the running configuration (read-only) |
POST | /deploy/green | Trigger a blue-green promotion evaluation |
GET | /status | Full telemetry snapshot (powers aua status dashboard) |
POST | /reset | Reset domain confidence and classifier history |
GET | /stats | Telemetry alias (legacy) |
GET | /version | Return the running AUA Framework version |
POST | /sessions | Create a new chat session |
GET | /sessions | List all chat sessions |
GET | /sessions/{session_id} | Get session metadata |
DELETE | /sessions/{session_id} | Delete a session |
GET | /sessions/{session_id}/messages | List messages in a session |
POST | /sessions/{session_id}/messages | Post a message to a session |
GET | /metrics | Prometheus metrics scrape endpoint |
GET | /metrics/cost | Cost tracking metrics (GPU hours, USD per query) |
Interactive docs: http://localhost:8000/docs (Swagger UI) · http://localhost:8000/redoc
Defined in aua/plugins/interfaces.py. All use Python typing.Protocol for structural subtyping — no base class required.
| Protocol | Description |
|---|---|
FieldClassifierPlugin | Replaces the built-in field classifier. Implement classify(query) -> dict[str, float] and top_field(query) -> str. |
UtilityScorerPlugin | Replaces the built-in U = w_e·E + w_c·C + w_k·K scorer. Implement score(response, field, prior_u) -> float and weights(field) -> dict. |
ArbiterPolicyPlugin | Replaces the built-in 4-check arbitration policy. Implement arbitrate(claims, context) -> ArbiterVerdict and should_escalate(claims, context) -> bool. |
PromotionPolicyPlugin | Decides whether a GREEN candidate should be promoted to BLUE. Implement should_promote(blue_stats, green_stats, config) -> bool and promotion_reason(…) -> str. |
CorrectionStorePlugin | Replaces the built-in in-memory AssertionsStore. Implement store(claim), query(subject, domain) -> list, and export_dpo(domain) -> list. |
ModelBackendPlugin | Replaces the built-in vLLM/Ollama HTTP backend. Implement generate(prompt, model, params) -> str, stream(…), health() -> bool, and models() -> list. |
StateStorePlugin | Pluggable persistent state store (SQLite default, Postgres via asyncpg). Implement get, set, append, delete, and query. |
HookPlugin | Lifecycle hook. Fires at 11 named points in the request pipeline. Implement hook_name() -> str and __call__(context) -> None. |
AUAMiddleware | Request/response middleware. Runs before and after the query pipeline. Implement before_query, after_query, before_specialist, after_specialist. |
Register via aua_config.yaml:
extensions:
  - import_path: "mypackage.myplugin:MyClassifierPlugin"
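Because the interfaces are structural, a plugin only needs methods with matching names and signatures. The following is a minimal sketch, not the framework's own code: the `FieldClassifierPlugin` protocol is re-declared locally as a stand-in for the one in aua/plugins/interfaces.py (its exact shape is assumed from the table above), and `KeywordClassifier` with its keyword lists is a hypothetical plugin.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FieldClassifierPlugin(Protocol):
    """Local stand-in for the real interface (assumed shape)."""

    def classify(self, query: str) -> dict[str, float]: ...
    def top_field(self, query: str) -> str: ...


class KeywordClassifier:
    """Hypothetical plugin: scores fields by keyword hits. No base class needed."""

    KEYWORDS = {
        "software_engineering": ("python", "bug", "compile", "function"),
        "math": ("integral", "prove", "matrix", "equation"),
    }

    def classify(self, query: str) -> dict[str, float]:
        text = query.lower()
        hits = {f: sum(k in text for k in kws) for f, kws in self.KEYWORDS.items()}
        total = sum(hits.values())
        if total == 0:
            # No keyword matched: fall back to a uniform distribution.
            return {f: 1.0 / len(hits) for f in hits}
        return {f: n / total for f, n in hits.items()}

    def top_field(self, query: str) -> str:
        scores = self.classify(query)
        return max(scores, key=scores.get)


plugin = KeywordClassifier()
# Structural check: passes because both methods are present.
print(isinstance(plugin, FieldClassifierPlugin))
print(plugin.top_field("Fix the bug in this Python function"))
```

A class like this would then be referenced from `import_path` in aua_config.yaml, as in the registration snippet above.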
Scraped at GET /metrics. All metrics prefixed aua_.
| Metric | Type | Labels | Description |
|---|---|---|---|
aua_queries_total | Counter | domain, routing_mode, status | Total queries routed |
aua_query_latency_seconds | Histogram | domain, routing_mode | End-to-end query latency |
aua_utility_score | Gauge | domain | Last U score per domain |
aua_contradiction_rate | Gauge | domain | Arbiter contradiction rate |
aua_routing_field_distribution | Counter | field | Classifier field assignment counts |
aua_specialist_confidence | Gauge | specialist | Per-specialist confidence score |
aua_correction_count | Counter | domain | Corrections stored per domain |
aua_arbiter_verdict_distribution | Counter | case | Verdict cases (A/B/C/D) distribution |
aua_dpo_pairs_accumulated | Gauge | — | Total DPO training pairs in store |
aua_token_requests_total | Counter | scope, status | Token auth requests |
aua_hook_failures_total | Counter | hook_point | Hook execution failures by hook point |
aua_plugin_execution_seconds | Histogram | plugin, kind | Plugin execution latency |
aua_specialist_vram_utilization | Gauge | specialist | VRAM utilisation (0–1) |
aua_cost_gpu_hours_total | Counter | specialist | Cumulative GPU hours per specialist |
aua_cost_usd_total | Counter | specialist | Cumulative USD cost per specialist |
aua_assertion_results_total | Counter | assertion_name, level, passed, domain | Assertion results by name, level, and outcome |
aua_assertion_retries_total | Counter | assertion_name | Retry attempts triggered by BLOCKING assertions |
aua_assertion_bonus_applied | Histogram | policy_name | E-score bonus applied by INFO assertions per session |
Grafana dashboard: docker/grafana/aua_dashboard.json — 20 panels, pre-provisioned.
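For ad-hoc checks outside Grafana, the scrape output can be filtered with a few lines of standard-library Python. This is a minimal sketch that assumes the usual Prometheus text exposition format; `parse_aua_samples` is a hypothetical helper, the HELP text in the sample is illustrative, and the parser does not handle escaped quotes in label values or trailing timestamps.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_aua_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (name, labels, value) for every aua_-prefixed sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics from other exporters.
        if not line or line.startswith("#") or not line.startswith("aua_"):
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


scrape = '''# HELP aua_queries_total Total queries routed
# TYPE aua_queries_total counter
aua_queries_total{domain="software_engineering",routing_mode="single",status="ok"} 12.0
aua_utility_score{domain="software_engineering"} 0.7831
aua_dpo_pairs_accumulated 0.0
python_gc_objects_collected_total{generation="0"} 411.0
'''

for name, labels, value in parse_aua_samples(scrape):
    print(name, labels, value)
```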
aua --version # 1.0.0
| Group | Subcommands | Description |
|---|---|---|
aua init | _(positional: name)_ --preset --tier --force | Scaffold a new AUA project |
aua serve | --config --tier --dry-run --with-ui --ui-port --reuse-running --router-only | Start specialists + router |
aua doctor | --config --strict --json | Pre-flight readiness check |
aua status | --config --interval --once | Live terminal dashboard |
aua config | validate · expand · reload | Config management |
aua eval | run · report · compare | Evaluation harness |
aua token | create · list · inspect · revoke | API token management |
aua certs | generate · inspect | mTLS certificate management |
aua dpo | export | Export DPO pairs for fine-tuning |
aua corrections | export | Export stored corrections |
aua rollback | _(positional: specialist)_ --all --no-restart | Blue-green rollback |
aua extensions | list · inspect · test | Plugin/hook management |
aua models | list | Model pull status |
aua fields | list | Field config introspection |
aua presets | list | Preset introspection |
aua defaults | show | Framework defaults |
aua ui | --port --install-only | Chat UI (standalone) |
aua guard | list · test | List/test registered assertions |
aua policy | list · validate · apply | Policy management |
aua calibrate | --layer 1/2/3 --force --dry-run | Calibration cycles |
aua logs | sessions · assertions · export | Query session/assertion logs |
aua metrics | --compare | Compare metrics across time windows |
# Environment: Python 3.11.10 (pyenv), macOS Apple Silicon
git clone https://github.com/praneethtota/Adaptive-Utility-Agent.git
cd Adaptive-Utility-Agent
pip install -e ".[dev]"
# Successfully installed adaptive-utility-agent-1.0.0 ...
aua --version
# aua, version 1.0.0
aua init my-test-project --preset coding --tier macbook
# ✓ Created my-test-project/
# ✓ aua_config.yaml written (tier: macbook)
# ✓ evals/ scaffolded
cd my-test-project && aua doctor
# ✓ Config valid
# ✓ Ollama reachable on port 11434
# ✓ All checks passed
pytest -q
# 132 passed, 6 warnings in 15.69s
# CPU/Ollama profile (macOS / CPU servers)
docker compose --profile ollama up -d
# ✓ aua-ollama healthy (30s)
# ✓ aua-model-puller exited 0 (models pulled)
# ✓ aua-router healthy
curl http://localhost:8000/health/live
# {"status":"ok","version":"1.0.0"}
# Observability stack
docker compose --profile obs up -d
# ✓ aua-prometheus healthy (port 9090)
# ✓ aua-grafana healthy (port 3000)
# Dashboard auto-provisioned at http://localhost:3000
# GPU profile (Linux + NVIDIA)
docker compose -f docker-compose.gpu.yml up -d
# ✓ aua-router healthy (vLLM backend)
# Terminal 1 — AUA router
aua serve --tier macbook
# ✓ ollama healthy (3s)
# ✓ qwen2.5-coder:7b already pulled
# ✓ qwen2.5:7b already pulled
# ✓ qwen2.5:3b already pulled
# INFO: Uvicorn running on http://0.0.0.0:8000
# Terminal 2 — Next.js Chat UI (Node.js 18+)
cd apps/aua_chat && npm install && npm run dev
# ▲ Next.js 14.x
# - Local: http://localhost:3001
# ✓ Ready in 4.2s
# Browser: http://localhost:3001
# Login: admin / aua-admin
# ✓ Three-panel layout: Sidebar | Chat | Framework Debugger
# ✓ AUA Controls drawer opens on click
# ✓ Query routed, debugger shows domain, U score, latency
Note: aua serve --with-ui attempts to start the Next.js process automatically. On macOS with nvm/homebrew, the manual two-terminal approach above is recommended if --with-ui does not produce a ✓ Chat UI confirmation line. UI startup log: .aua/logs/ui.log.
# Token auth
aua token create --scope aua:query --expires 30d
# Token: aua_tk_...
curl -X POST http://localhost:8000/query \
-H "Authorization: Bearer aua_tk_..." \
-d '{"query": "test"}'
# ✓ 200 OK
curl -X POST http://localhost:8000/query \
-d '{"query": "test"}'
# ✓ 401 Unauthorized
# 14 auth scopes: aua:query, aua:query:stream, aua:query:batch,
# aua:corrections:read, aua:corrections:write, aua:config:read,
# aua:deploy, aua:status, aua:reset, aua:sessions:read,
# aua:sessions:write, aua:metrics, aua:tokens:manage, aua:admin
# mTLS certificates
aua certs generate
# ✓ ca.crt, router.crt, router.key written to .aua/certs/
# Encryption at rest (AES-256-GCM)
export AUA_ENCRYPTION_KEY=$(python3 -c "import os; print(os.urandom(32).hex())")
# Corrections, assertions, DPO pairs encrypted in state store
# Config redaction — secrets never exposed via API
curl http://localhost:8000/config | jq '.security'
# {"auth":{"enabled":true},"encryption":{"enabled":true,"key_secret":"[REDACTED]"}}
# Prometheus scrape
curl http://localhost:8000/metrics | grep "^aua_"
# aua_queries_total{domain="software_engineering",routing_mode="single",status="ok"} 12.0
# aua_query_latency_seconds_bucket{domain="software_engineering",...} ...
# aua_utility_score{domain="software_engineering"} 0.7831
# aua_contradiction_rate{domain="software_engineering"} 0.0
# aua_routing_field_distribution{field="software_engineering"} 10.0
# aua_specialist_confidence{specialist="swe"} 0.823
# aua_correction_count{domain="software_engineering"} 0.0
# aua_arbiter_verdict_distribution{case="A"} 12.0
# aua_dpo_pairs_accumulated 0.0
# aua_cost_gpu_hours_total{specialist="swe"} 0.0114
# aua_cost_usd_total{specialist="swe"} 0.0079
# Live status dashboard
aua status
# ✓ All specialists up, U scores, VRAM, uptime displayed
# OTEL export (optional)
# Set OTEL_EXPORTER_OTLP_ENDPOINT to export traces to Jaeger/Tempo
# Grafana: http://localhost:3000 (admin / aua-admin)
# ✓ 20 pre-configured panels
# ✓ AUA dashboard auto-provisioned from docker/grafana/aua_dashboard.json
from aua.guard import assertion, AssertionLevel, list_assertions
from aua.policy import Policy
# ── Register a BLOCKING assertion ─────────────────────────────────────────
@assertion(name="PythonSyntaxCheck", level=AssertionLevel.BLOCKING)
def validate_syntax(output: str, context: dict) -> tuple[bool, str | None]:
    import ast, re
    blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
    if not blocks:
        return True, None
    for block in blocks:
        try:
            ast.parse(block)
        except SyntaxError as e:
            return False, f"Syntax error at line {e.lineno}: {e.msg}"
    return True, None
# ── Register an INFO (positive) assertion ─────────────────────────────────
@assertion(name="AnalogyBonus", level=AssertionLevel.INFO, bonus=0.10)
def reward_analogy(output: str, context: dict) -> tuple[bool, str | None]:
    # "think of it as" is listed so that the sample response used in the run below fires the bonus.
    if any(p in output.lower() for p in ["like a", "similar to", "imagine", "think of it as"]):
        return True, "Positive: analogy used"
    return True, None  # neutral: no bonus
# ── Bundle into a Policy ──────────────────────────────────────────────────
policy = Policy(name="SafeCoding", max_total_bonus=0.30)
policy.add(validate_syntax)
policy.add(reward_analogy)
# ── Run against a response ────────────────────────────────────────────────
context = {"query": "Write binary search.", "session_id": "s1",
"domain": "software_engineering", "field": "software_engineering"}
result = policy.run("Think of it as halving your search space each time.", context)
# ✓ passed=True, e_bonus=0.10 (analogy fired), gold_standard=True
result2 = policy.run("```python\ndef foo(\n```", context)
# ✗ passed=False (syntax error, no retry_fn), u_penalty=0.15
# ── List built-in assertions ──────────────────────────────────────────────
items = list_assertions()
# ✓ Returns: PythonSyntaxCheck, NoRefusal, MinLength, AnalogyBonus, ConciseBonus
# + any user-registered assertions
# CLI validation
aua guard list
# ┌──────────────────┬──────────┬───────┬─────────────┐
# │ Name │ Level │ Bonus │ Description │
# ├──────────────────┼──────────┼───────┼─────────────┤
# │ PythonSyntaxCheck│ blocking │ — │ Blocks ... │
# │ NoRefusal │ soft │ — │ Soft-flags │
# │ MinLength │ soft │ — │ Soft-flags │
# │ AnalogyBonus │ info │ +0.08 │ Rewards ... │
# │ ConciseBonus │ info │ +0.06 │ Rewards ... │
# └──────────────────┴──────────┴───────┴─────────────┘
aua guard test --import-path aua.guard:python_syntax_check
# Assertion: PythonSyntaxCheck (blocking)
# Result: ✓ PASSED
aua guard test --import-path aua.guard:analogy_bonus \
--output "Think of it as a balanced binary tree."
# Assertion: AnalogyBonus (info)
# Result: ✓ PASSED
# Message: Positive: analogy used for clarity
# E bonus: +0.08 would be applied
# YAML policy file
cat policies/safe_coding.yaml
# name: SafeCoding
# version: "1.0"
# max_retries: 3
# max_total_bonus: 0.30
# assertions:
# - import_path: mypackage.policies:validate_syntax
# - import_path: mypackage.policies:reward_analogy
# bonus: 0.10
# utility_overrides:
# w_k: 0.30
aua policy validate policies/safe_coding.yaml
# ✓ policies/safe_coding.yaml is valid
aua policy apply policies/safe_coding.yaml --dry-run
# Policy: SafeCoding v1.0
# Max retries: 3
# Max E bonus: +0.3
# Weight overrides: {'w_k': 0.3}
# Assertions (2):
# [BLOCKING] PythonSyntaxCheck
# [INFO] AnalogyBonus +0.10 E bonus
# --dry-run: policy NOT activated
aua policy apply policies/safe_coding.yaml
# ✓ Policy activated. Restart or hot-reload to apply.
# Pointer: .aua/active_policy
aua policy list
# ┌──────────────────────┬───────────┬──────────────┬────────────┐
# │ File │ Status │ Name │ Assertions │
# ├──────────────────────┼───────────┼──────────────┼────────────┤
# │ safe_coding.yaml │ ✓ valid │ SafeCoding │ 2 │
# └──────────────────────┴───────────┴──────────────┴────────────┘
Option B bonus math verified:
bonus=0.15 with max_total_bonus=0.25
max_total_bonus=0.25
E_final = min(1.0, E_base + 0.25) ✓
Gold-standard detection: a session in which every INFO assertion fired and no BLOCKING assertion failed is marked gold_standard=True. aua calibrate --layer 3 uses this flag to identify DPO chosen pairs. ✓
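The capping rule behind the bonus math can be sketched in a few lines. This is an illustrative reading of the formula above, not the framework's implementation: `apply_info_bonuses` is a hypothetical helper, and the scenario of two INFO assertions firing at 0.15 each is an assumed example.

```python
def apply_info_bonuses(e_base: float, bonuses: list[float], max_total_bonus: float) -> float:
    """Sum the fired INFO bonuses, cap the sum, then clamp E to 1.0."""
    total = min(sum(bonuses), max_total_bonus)
    return min(1.0, e_base + total)


# Assumed scenario: two INFO assertions fire at 0.15 each.
# The raw sum (0.30) exceeds max_total_bonus, so only 0.25 is added.
print(apply_info_bonuses(0.60, [0.15, 0.15], max_total_bonus=0.25))

# With a high base score the result is clamped at 1.0.
print(apply_info_bonuses(0.90, [0.15, 0.15], max_total_bonus=0.25))
```

Capping the summed bonus before clamping keeps a single response from inflating E arbitrarily when many INFO assertions fire at once.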
# Layer 1 — eval harness
aua calibrate --layer 1 --dataset evals/coding_smoke.yaml
# ✓ Layer 1 calibration complete.
# Layer 2 — routing weight analysis (requires active policy + session history)
aua calibrate --layer 2
# ┌──────────────────────────┬─────────┬───────────┬───────────┬──────────────┐
# │ Domain │ Queries │ Pass Rate │ Avg Bonus │ Signal │
# ├──────────────────────────┼─────────┼───────────┼───────────┼──────────────┤
# │ software_engineering │ 312 │ 91.3% │ +0.087 │ ↑ Strong │
# └──────────────────────────┴─────────┴───────────┴───────────┴──────────────┘
# Layer 3 — DPO export dry-run
aua calibrate --layer 3 --dry-run
# Gold-standard sessions: 47
# Exportable pairs: 12
# --dry-run: would export 12 DPO pairs → dpo_pairs/calibration.jsonl
# Logs
aua logs sessions
# ✓ Shows recent sessions with U scores, domain, latency
aua logs assertions --filter passed=false
# ✓ Shows only failed assertion events
aua logs assertions --assertion PythonSyntaxCheck --tail 10
# ✓ Shows last 10 events for named assertion
aua logs export --output my_logs.json
# ✓ Exported N records → my_logs.json
# Metrics comparison
aua metrics --compare 30d
# ┌─────────────────────────────┬──────────┬──────────┬──────────────────┐
# │ Metric │ Prior │ Current │ Trend │
# ├─────────────────────────────┼──────────┼──────────┼──────────────────┤
# │ Mean U score │ 0.6213 │ 0.6891 │ ↑ +0.0678 │
# │ Assertion fail rate │ 0.2341 │ 0.1102 │ ↓ -0.1239 │
# │ Retry rate (BLOCKING) │ 0.1820 │ 0.0890 │ ↓ -0.0930 │
# └─────────────────────────────┴──────────┴──────────┴──────────────────┘
aua metrics --compare 7d --json
# ✓ Returns JSON with current/prior stats for external charting
assertion_events table: All assertion results persisted to SQLite with
(session_id, assertion_name, level, passed, bonus_applied, retries_used, message, domain, policy_name, created_at).
Three indexes: session, assertion_name, created_at. Queryable by aua logs and aua calibrate. ✓
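To make the table layout concrete, the sketch below creates an equivalent table in an in-memory SQLite database and runs the kind of filter that `aua logs assertions --filter passed=false` implies. The column names come from the list above; the column types, defaults, and index names are assumptions, not the framework's actual DDL.

```python
import sqlite3

# Assumed DDL: column names from the spec, types and index names are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assertion_events (
    session_id     TEXT NOT NULL,
    assertion_name TEXT NOT NULL,
    level          TEXT NOT NULL,
    passed         INTEGER NOT NULL,
    bonus_applied  REAL DEFAULT 0.0,
    retries_used   INTEGER DEFAULT 0,
    message        TEXT,
    domain         TEXT,
    policy_name    TEXT,
    created_at     TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_ae_session   ON assertion_events (session_id);
CREATE INDEX IF NOT EXISTS idx_ae_assertion ON assertion_events (assertion_name);
CREATE INDEX IF NOT EXISTS idx_ae_created   ON assertion_events (created_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Record one failed BLOCKING assertion.
conn.execute(
    "INSERT INTO assertion_events "
    "(session_id, assertion_name, level, passed, message, domain, policy_name) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("s1", "PythonSyntaxCheck", "blocking", 0,
     "Syntax error at line 1", "software_engineering", "SafeCoding"),
)

# Equivalent of filtering the log for failed assertions.
failed = conn.execute(
    "SELECT assertion_name, message FROM assertion_events WHERE passed = 0"
).fetchall()
print(failed)
```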
# Welfare formula: W_i = P(domain_i) × confidence_i × prior_mean_u_i
# Specialist with highest W_i wins fanout routing
from aua.config import RouterConfig
# Default is pairwise
cfg = RouterConfig()
assert cfg.arbitration_mode == "pairwise"
# VCG mode
cfg_vcg = RouterConfig(arbitration_mode="vcg")
assert cfg_vcg.arbitration_mode == "vcg"
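The welfare formula can be worked through by hand. The sketch below is a standalone illustration, not the framework's selector: `Candidate` and `vcg_select` are hypothetical names, the default prior of 1.0 for a specialist without history and the confidence tie-break are assumptions suggested by the test names in the validation log, and the input values are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    p_domain: float            # classifier probability for the specialist's domain
    confidence: float          # specialist confidence
    prior_mean_u: float = 1.0  # assumed default when there is no history


def welfare(c: Candidate) -> float:
    """W_i = P(domain_i) x confidence_i x prior_mean_u_i."""
    return c.p_domain * c.confidence * c.prior_mean_u


def vcg_select(cands: list[Candidate]) -> tuple[str, dict[str, float]]:
    """Return the winner and every specialist's welfare score."""
    scores = {c.name: welfare(c) for c in cands}
    # Highest welfare wins; ties fall back to higher confidence (assumed rule).
    winner = max(cands, key=lambda c: (scores[c.name], c.confidence))
    return winner.name, scores


candidates = [
    Candidate("swe", p_domain=0.8, confidence=0.85, prior_mean_u=0.8),
    Candidate("math", p_domain=0.2, confidence=0.9),  # no history yet
]
winner, welfare_scores = vcg_select(candidates)
print(winner, welfare_scores)
```

Returning the full score dictionary alongside the winner mirrors the `welfare_scores` field that the VCG response includes.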
# Activate via CLI
aua serve --arbitration-mode vcg
# Activate via YAML
router:
  arbitration_mode: vcg
# Activate via REST (persists to file)
curl -X PATCH http://localhost:8000/config \
-H "Content-Type: application/json" \
-d '{"arbitration_mode": "vcg", "persist": true}'
# → {"patched": {"arbitration_mode": "vcg"}, "persisted": true}
VCG response shape:
{
"routing_mode": "vcg",
"primary_domain": "software_engineering",
"response": "...",
"u_score": 0.748,
"welfare_scores": {
"swe": 0.5440,
"math": 0.1800
}
}
Validated results (RTX 4090 hardware experiment):
Chat UI:
This document captures future enhancement ideas that are too specific or
experimental for the main roadmap. Items are added as they are identified
during development of AUA Framework, AUA-Veritas, or related products.
Each item includes context, rationale, and suggested implementation approach.
Items here do not have committed timelines — they feed into version planning
as priorities are assessed.
Origin: Identified during AUA-Veritas design session (2026-05-14).
Already implemented in AUA-Veritas Phase 1. Not in the v1.1.0 scope (shipped without it) — proposed for v1.2+.
AUA's specialist models currently receive queries with injected corrections
but no information about:
This means specialists have no incentive signal beyond the raw query.
Tell each specialist model, in the system context block of every prompt,
that it is being scored — and show it its running reliability score as a
trajectory (not the raw formula or weights).
Game theory basis:
VCG welfare maximization makes truthfulness the dominant strategy in both
single-shot and repeated settings. A specialist that hallucinates or
over-claims certainty will see its score drop, lose future routing selections,
and end up worse off than a truthful response would have yielded. Adversarial
behaviour between specialists is similarly self-punishing — deception is
eventually caught by the correction store and costs the deceiver more than
honesty would.
What specialists see (answer round):
You are one of several specialist models answering this query.
Your reliability score: 72 (previous: 65 → improved)
Scores increase when:
- Your answers are accurate (verified by arbiter and cross-session corrections)
- You correctly express uncertainty when you are not sure
- You are consistent with verified corrections on this topic
Scores decrease when:
- Your answer is flagged as incorrect by the arbiter
- You claim certainty about something later found to be wrong
- You contradict a verified past correction
The specialist with the highest combined welfare score handles this query.
Do not mention this scoring context in your response.
What the arbiter sees (arbitration round):
You are reviewing two specialist responses for accuracy.
Your arbiter reliability score: 81 (previous: 78 → improved)
Your score as arbiter increases when:
- You correctly identify which specialist is right
- Your verdict is later confirmed by the correction store
Your score decreases when:
- You rule for the wrong specialist
- Your verdict contradicts a verified correction added afterward
Be precise. Identify what is specifically wrong, not just which is better.
What is NOT shown:
Score mapping:
U (0.0–1.0) → integer 0–100 via mean_u * 100. Previous score retrieved
from domain_states in UtilityScorer. Shown as "72 (previous: 65 → improved)"
or "58 (previous: 63 → dropped)".
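The mapping and the trajectory string can be sketched as follows. This is a hypothetical illustration of the rule above, not the proposed `get_score_for_display` itself: the function names are invented, rounding to the nearest integer is assumed (the text only says `mean_u * 100`), and the "unchanged" label for equal scores is an assumption the text does not cover.

```python
def to_display_score(mean_u: float) -> int:
    """Map U in [0.0, 1.0] to an integer 0-100 (rounding assumed)."""
    clamped = max(0.0, min(1.0, mean_u))
    return round(clamped * 100)


def format_trajectory(current_u: float, previous_u: float) -> str:
    """Render the score line a specialist would see in its system context."""
    cur, prev = to_display_score(current_u), to_display_score(previous_u)
    if cur > prev:
        trend = "improved"
    elif cur < prev:
        trend = "dropped"
    else:
        trend = "unchanged"  # assumption: not specified in the text
    return f"{cur} (previous: {prev} → {trend})"


print(format_trajectory(0.72, 0.65))  # 72 (previous: 65 → improved)
print(format_trajectory(0.58, 0.63))  # 58 (previous: 63 → dropped)
```

Rounding rather than truncating matters here because values such as 0.58 * 100 evaluate to slightly less than 58 in floating point.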
| File | Change |
|---|---|
aua/router.py | _handle_single, _handle_fanout: prepend system context block to specialist prompt |
aua/arbiter.py | arbitrate(): prepend arbiter score context to arbitration prompt |
aua/utility_scorer.py | Add get_score_for_display(domain) → tuple[int, int] returning (current, previous) |
aua/config.py | Add router.model_incentive_transparency: bool (default: true) |
YAML opt-out (for use cases where this is undesirable):
router:
  model_incentive_transparency: false
Target version: v1.2+ (not shipped in v1.1.0)
Origin: Identified during AUA-Veritas design session (2026-05-14).
Already in AUA-Veritas Phase 4 roadmap. The backing data endpoints (GET /reliability, GET /analytics — per-specialist win rates and welfare trajectories) shipped in v1.1.0 (V-P2.2); the Chat UI panel itself is proposed for v1.2+.
AUA's Framework Debugger panel is aimed at developers — it shows U scores,
welfare scores, domain distributions and routing mode. Average users of the
Chat UI (non-MLE operators) have no visibility into how models are performing
over time and no way to understand why one specialist was picked over another
without reading documentation.
A "Look Under the Hood" button in the Chat UI that opens a model reliability
panel — showing the same 0–100 reliability scores that specialists see in their
system prompts, as time-series graphs with clickable data points.
What's inside:
correction stored y/n, score delta
Event card on point click:
Score event — May 9, specialist swe: 72 → 70 (dropped)
Query: "What is the time complexity of Timsort?" [truncated]
Verdict: Incorrect — arbiter flagged incorrect worst-case complexity
Correction stored: yes
Effect: reliability score −2
Design rules:
audit_log table with canonical query (60 char truncation)
| File | Change |
|---|---|
apps/aua_chat/src/components/UnderTheHood.tsx | New component — reliability graphs |
apps/aua_chat/src/components/ScoreEventCard.tsx | Clickable point event card |
aua/router.py | Write score delta events to audit_log after each query |
aua/router.py | New endpoint GET /reliability returning per-specialist score history |
aua/state.py | audit_log entries: query_preview, specialist, score_before, score_after, verdict, correction_stored |
Target version: AUA Chat UI v1.2+ (after model incentive transparency, Item 1). Backend shipped in v1.1.0.
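As an illustration of what one audit_log entry might carry, the sketch below builds a record with the fields listed in the table and the 60-character query preview. `ScoreEvent` and `make_preview` are hypothetical names, and collapsing whitespace is only one assumed reading of "canonical query".

```python
from dataclasses import dataclass, asdict

PREVIEW_LEN = 60  # "60 char truncation" from the design rules


def make_preview(query: str, limit: int = PREVIEW_LEN) -> str:
    """Collapse whitespace (assumed canonical form) and cut to the limit."""
    text = " ".join(query.split())
    return text if len(text) <= limit else text[:limit]


@dataclass
class ScoreEvent:
    query_preview: str
    specialist: str
    score_before: int
    score_after: int
    verdict: str
    correction_stored: bool

    @property
    def delta(self) -> int:
        """Score change shown on the event card."""
        return self.score_after - self.score_before


event = ScoreEvent(
    query_preview=make_preview("What is the time complexity of Timsort?"),
    specialist="swe",
    score_before=72,
    score_after=70,
    verdict="Incorrect",
    correction_stored=True,
)
print(asdict(event), event.delta)
```

Storing only the truncated preview keeps full query text out of the audit trail while still letting an operator recognise the event.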
Template:
### Title
**Origin:** Where the idea came from, when.
### The problem
### The proposed mechanism
### Implementation in AUA
**Target version:**