Your LLM makes the same mistake twice.
AUA makes sure it doesn't make it three times.
Most frameworks give you a model call. AUA gives you a control layer around that call — routing, scoring, correction, and policy enforcement that runs on every query and gets smarter over time.
Here's the problem this framework exists to solve. You deploy an LLM. It gives a wrong answer on Tuesday. You notice on Thursday. You add it to the system prompt on Friday. Next Tuesday — different user, same wrong answer. The prompt didn't stick, or the context window dropped it, or a slightly different phrasing triggered a different path. The error lives.
AUA closes that loop without waiting for a new model release:
Routes to the right specialist. Scores the response with a utility function. Injects prior verified corrections into the context so past mistakes don't repeat. Enforces your policy — blocking bad output and retrying before the user ever sees it.
Accumulates what the model consistently gets wrong. Tracks which sessions followed your policy perfectly. Exports those gold-standard sessions as DPO training pairs — ready to fine-tune the model so the corrections become permanent.
Write a Policy. Say what must never appear (BLOCKING). Say what you want to see rewarded (INFO, with an E-score bonus). The framework enforces it on every call, tracks adherence over every session, and uses your policy as a curriculum for the next fine-tuning cycle.
It's designed like Django — you get a working system in five commands, and you can customise routing thresholds, utility weights, arbiter behaviour, correction stores, model backends, hooks, middleware, and deployment policy without touching framework internals. The quickstart below takes ten minutes. Parts 10–12 show how to teach the framework what good output looks like and watch it improve over time.
5-minute quickstart
Mac / Apple Silicon prerequisites. The macbook tier uses Ollama to serve models locally. Install and start it before running aua serve:
```bash
brew install ollama
ollama serve &                  # start in background
ollama pull qwen2.5-coder:7b    # ~4 GB — main coding specialist
ollama pull qwen2.5:7b          # ~4 GB — math specialist
ollama pull qwen2.5:3b          # ~2 GB — arbiter
ollama list                     # confirm all three are present
```
aua doctor will detect a missing Ollama and tell you exactly what to install. aua serve does not install Ollama automatically.
Five commands. A live routing endpoint. No GPU required to start.
```bash
# 1. Install
pip install adaptive-utility-agent

# 2. Scaffold — Mac/CPU uses Ollama; swap --tier for GPU
aua init my-aua-project --preset coding --tier macbook
cd my-aua-project

# 3. Validate setup
aua doctor

# 4. Start
aua serve

# 5. First query (new terminal)
# Auth is disabled by default — the Authorization header is optional until you enable it
curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Write binary search in Python", "session_id": "qs-demo"}' \
  | python3 -m json.tool
```
What is now running: A multi-specialist LLM router with utility scoring, contradiction detection, assertions store, rate limiting, structured logging, and a Prometheus metrics endpoint — all from one command. Read on to understand each piece.
Part 1 Install & scaffold ~8 min
1.1 Python version
Python 3.10, 3.11, or 3.12 required. If you use pyenv: pyenv local 3.11.10 before installing, or aua may not appear in your PATH.
```bash
# Runtime only (Ollama / CPU)
pip install adaptive-utility-agent

# GPU backend (Linux + CUDA required)
pip install "adaptive-utility-agent[vllm]"

# Dev tools — tests, linting, type checks
pip install "adaptive-utility-agent[dev]"

# Verify
aua --version
aua, version 1.0.0
```
1.2 Scaffold a project
Pick the tier that matches your hardware. Pick the preset that matches your domain. Together they set models, field weights, routing thresholds, and observability defaults.
| Tier | Hardware | Backend | Notes |
|---|---|---|---|
| macbook | Mac M-series / Intel | Ollama | Best starting point |
| single-4090 | 1× RTX 4090 24 GB | vLLM AWQ | Production-grade |
| quad-4090 | 4× RTX 4090 | vLLM AWQ | One GPU per specialist |
| a100-cluster | 1× A100 80 GB | vLLM fp16 | Highest accuracy |
| Preset | Fields configured | Use for |
|---|---|---|
| coding | software_engineering | Code generation, dev tools |
| math | mathematics | Proofs, computation |
| research | general, mathematics | Research assistance |
| medical-safe | medicine (c_min=0.95) | Medical Q&A with abstention |
| legal-safe | law (c_min=0.85) | Legal Q&A with abstention |
| generalist | software_engineering, mathematics, general | Multi-domain assistant |
aua init my-aua-project --preset coding --tier macbook
cd my-aua-project
# See what init created
aua config expand
1.3 Validate and start
Auth behavior. By default, auth is disabled — the Authorization header is optional and all endpoints are open. This is fine for local development. To enable auth:
```bash
aua token create --scope aua:admin --expires 30d
export AUA_TOKEN="aua_tk_..."
# then include in curl: -H "Authorization: Bearer $AUA_TOKEN"
```
Examples throughout this tutorial show Authorization: Bearer $AUA_TOKEN. On a local dev install without auth enabled, you can omit that header entirely.
```bash
aua doctor            # PASS / FAIL / WARN per check, with fixes
aua doctor --strict   # warnings as failures — use in CI
aua doctor --json     # machine-readable output

aua serve             # start specialists + router + arbiter
aua serve --with-ui   # also start Chat UI at :3001 (see note below)
aua serve --dry-run   # print commands without executing
```
What you can build with this
- A working multi-model AI system running locally in under ten minutes — one command to scaffold, one to serve.
- A project that's ready to customise: config, eval folder, and .gitignore all in place.
- A pre-flight check (aua doctor) you can drop into CI to catch config problems before they reach production.
Part 2 shows how to wire in any model — from a frontier API to a 1.5B model on a laptop — and tell the framework what you want it to optimize for.
Part 2 Models & fields ~12 min
2.1 What's in aua_config.yaml
```yaml
aua:
  version: "0.5"    # version field generated by aua init — do not edit manually
  mode: local
  backend: ollama
  specialists:
    - name: swe
      model: qwen-coder-7b-awq   # registry alias → full model ID
      port: 11434
      field: software_engineering
      gpu: 0
  arbiter:
    model: qwen2.5:3b
    port: 11434
  router:
    port: 8000
    single_domain_threshold: 0.75
    fanout_threshold: 0.30
  security:
    cors_origins: ["http://localhost:3001"]   # Chat UI port; Grafana is on :3000
  state:
    backend: sqlite
    path: .aua/state/aua.db
```
2.2 Model registry — inspect aliases
```bash
aua models list

NAME               PROVIDER  BACKEND  VRAM
qwen-coder-7b-awq  ollama    ollama   ~6 GB
qwen-math-7b-awq   ollama    ollama   ~6 GB
qwen-14b-awq       vllm      vllm     ~12 GB
llama3-8b          ollama    ollama   ~6 GB
...

aua models inspect qwen-coder-7b-awq
```
2.3 Field registry — weights and thresholds
```bash
aua fields list
aua fields inspect software_engineering
```
| Field | w_e | w_c | w_k | c_min | Penalty |
|---|---|---|---|---|---|
| surgery | 0.20 | 0.70 | 0.10 | 0.95 | 10× |
| aviation | 0.20 | 0.70 | 0.10 | 0.95 | 10× |
| law | 0.30 | 0.60 | 0.10 | 0.85 | 5× |
| mathematics | 0.50 | 0.40 | 0.10 | 0.75 | 3× |
| software_engineering | 0.55 | 0.35 | 0.10 | 0.70 | 2× |
| creative_writing | 0.80 | 0.05 | 0.15 | 0.05 | 1× |
What you can build with this
- Swap any model in or out without changing application code — frontier API, 7B local, or a tiny 1.5B model for fast low-stakes queries.
- Tell the framework what matters for your domain: accuracy (w_c), output quality (w_e), or exploration (w_k).
- Set how fast domain knowledge decays — fast for security practices, slow for physics principles — so the system stays calibrated over time.
- Turn off exploration entirely for safety-critical domains so routing is always consistent and predictable.
Part 3 shows how the routing decision actually gets made — and gives you the knobs to control how aggressively the system compares specialists.
Part 3 Routing & utility ~15 min
3.1 The routing pipeline
Every query follows this path: middleware → session lookup → correction retrieval → field classifier → routing decision → specialist calls → utility scoring → arbiter (if needed) → hooks → response.
| Mode | Trigger | What happens |
|---|---|---|
| single | One field above single_domain_threshold | One specialist call, utility scored |
| fanout | Two+ fields above fanout_threshold | All qualifying specialists called; best U wins |
| arbiter | Fanout returned contradictory answers | Arbiter resolves; correction stored; both models updated |
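The decision rule implied by the two thresholds is small enough to sketch. This is an illustration of the semantics described above, not framework source — the function name and the classifier-score input shape are assumptions:

```python
def pick_routing_mode(field_scores: dict[str, float],
                      single_domain_threshold: float = 0.75,
                      fanout_threshold: float = 0.30) -> tuple[str, list[str]]:
    """Illustrative sketch of the single/fanout decision described above."""
    ranked = sorted(field_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_field, top_score = ranked[0]
    if top_score >= single_domain_threshold:
        return "single", [top_field]             # one confident field wins outright
    qualifying = [f for f, s in ranked if s >= fanout_threshold]
    if len(qualifying) >= 2:
        return "fanout", qualifying              # call all qualifying specialists
    return "single", [top_field]                 # no contest — use the best guess

# A query that straddles coding and math triggers fanout:
print(pick_routing_mode({"software_engineering": 0.48, "mathematics": 0.41}))
# ('fanout', ['software_engineering', 'mathematics'])
```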
3.2 The utility function
Every candidate response gets a single utility score before it reaches the user. The score combines three things:
- How useful the answer appears — does it correctly address the query for this domain?
- How consistent it is with prior verified knowledge — does it contradict things the system already knows?
- Whether exploring this area is valuable — is this a domain where the system has low confidence and should weight novelty?
In practice, you rarely touch the formula directly. The defaults work well for most domains. You tune it when you want stricter answer quality (raise w_e), more caution (raise w_c), or more exploration (raise w_k).
The formal expression:
U = w_e(f)·E + w_c(f)·C + w_k(f)·K
- E (Efficacy) — Mann-Whitney dominance probability over prior outputs [0, 1]
- C (Confidence) — Kalman-filtered internal consistency, penalized per contradiction [0, 1]
- K (Curiosity) — UCB exploration bonus for novel domains [capped at 50% of U]
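As a minimal sketch of the arithmetic — the exact cap mechanics are an assumption based on the bullet above (K contributes at most half of the final U):

```python
def utility(e: float, c: float, k: float,
            w_e: float, w_c: float, w_k: float) -> float:
    """U = w_e·E + w_c·C + w_k·K, with K capped at 50% of U (assumed reading)."""
    base = w_e * e + w_c * c        # exploitation terms
    k_term = min(w_k * k, base)     # k_term ≤ base ⇒ K is at most half of U
    return base + k_term

# software_engineering defaults: w_e=0.55, w_c=0.35, w_k=0.10
print(round(utility(e=0.8, c=0.7, k=0.5, w_e=0.55, w_c=0.35, w_k=0.10), 3))  # 0.735
```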
3.3 A full query response
curl -s -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AUA_TOKEN" \
-d '{
"query": "Write binary search in Python. State the time complexity.",
"session_id": "demo-session"
}' | python3 -m json.tool
{
"session_id": "demo-session",
"trace_id": "01HXYZ...",
"request_id": "req_abc123",
"routing_mode": "single",
"primary_field": "software_engineering",
"response": "...",
"u_score": 0.641,
"confidence": 0.76,
"contradictions_detected": 0,
"corrections_injected": 1,
"latency_ms": 287.4,
"cost_estimate_usd": 0.00012
}
3.4 Live status and U scores
```bash
aua status          # auto-refreshing terminal UI
aua status --once   # single snapshot
aua status --json   # machine-readable
curl http://localhost:8000/status | python3 -m json.tool
```
What you can build with this
- A system that routes each question to the right specialist, scores every answer with a real number, and shows you exactly why — not vibes.
- Tune how aggressively the system compares multiple specialists vs. committing fast to a single one.
- Force a specific specialist for known query types — useful for dedicated deployments where you already know the domain.
- Read U scores in API responses to build your own routing analytics — low scores on a domain are an early signal the specialist needs attention.
Part 4 adds persistent memory: the framework learns from its mistakes and stops making the same error twice.
Part 4 Arbiter & corrections ~15 min
When fanout routing produces contradictory responses, the Arbiter runs four checks, issues a verdict, injects correction signals, and stores the verified claim in the assertions store.
4.1 The four arbitration checks
| Check | Weight | What it detects |
|---|---|---|
| Logical | 0.30 | Output contradicts its own premises |
| Mathematical | 0.40 | Complexity or numerical claims provably wrong |
| Cross-session | 0.20 | Contradicts a prior verified assertion |
| Empirical | 0.10 | External ground truth check (v2.0) |
4.2 The four verdict cases
| Case | Meaning | Action |
|---|---|---|
| Case 1 | A correct, B wrong | Correct B, reinforce A, store claim |
| Case 2 | B correct, A wrong | Correct A, reinforce B, store claim |
| Case 3 | Both wrong | Correct both + open curiosity gap bonus |
| Case 4 | Inconclusive | Flag for external escalation, hedge response |
4.3 Using the Arbiter directly
# ArbiterAgent runs automatically inside aua serve.
# Use directly in Python for testing or custom workflows:
from aua import ArbiterAgent, AssertionsStore
store = AssertionsStore()
arbiter = ArbiterAgent(store)
verdict = arbiter.arbitrate( # sync — call directly
subject="bubble_sort_complexity",
domain="software_engineering",
output_A="Bubble sort is O(n) average case.",
output_B="Bubble sort is O(n²) average case.",
field_penalty_multiplier=2.0,
)
print(verdict.case.value) # "case_1"
print(verdict.verified_claim) # "Bubble sort is O(n²) average case."
print(verdict.external_response) # safe to return to user
4.4 The assertions store — decay classes
| Class | Decay | Used for |
|---|---|---|
| A | Never | Mathematical proofs, algorithm complexity |
| B | 10 years | Classical physics, structural engineering |
| C | 3 years | Medicine, law, architecture |
| D | 6 months | Security CVEs, clinical guidelines, ML benchmarks |
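One way to picture what a decay class means in practice — the half-life interpretation below is an illustrative assumption, not the store's actual implementation:

```python
import math

# Assumed horizons per class, in days (Class A never decays)
DECAY_DAYS = {"A": None, "B": 3650, "C": 1095, "D": 180}

def decayed_confidence(confidence: float, decay_class: str, age_days: float) -> float:
    """Exponential discount, treating the class horizon as a half-life (illustrative)."""
    horizon = DECAY_DAYS[decay_class]
    if horizon is None:
        return confidence   # Class A: proofs and complexity results never decay
    return confidence * math.exp(-math.log(2) * age_days / horizon)

print(round(decayed_confidence(0.99, "D", 180), 3))   # 0.495 — a CVE claim at six months
print(round(decayed_confidence(0.99, "A", 3650), 3))  # 0.99  — a proof, unchanged
```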
4.5 Manual corrections via REST
curl -X POST http://localhost:8000/corrections \
-H "Authorization: Bearer $AUA_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"subject": "heapsort_complexity",
"domain": "software_engineering",
"claim": "Heapsort is O(n log n) worst-case, O(1) extra space.",
"confidence": 0.99
}'
Corrections are global in v1.0. A correction stored via POST /corrections is a verified fact about the world — it is injected into every future query on that subject, across all users and sessions. For internal tools, dedicated agents, and single-tenant deployments this is the intended behaviour. For multi-tenant products where users need isolated correction contexts, per-user scoping is a v1.1 roadmap item.
What you can build with this
- A system that catches its own contradictions, stores what it learns, and injects that knowledge into every future query on that subject.
- A verified fact store for your domain — inject something once and every user benefits from it on every future query.
- Audit what the system currently 'knows' with GET /corrections — a clear, exportable record of every stored fact.
- Control how long knowledge lives: permanent for proved facts (Class A), months for fast-moving fields like security (Class D).
Note: in v1.0, corrections are global across all users — intentional for internal tools; per-user scoping is coming in v1.1. Part 5 shows how to upgrade models safely without breaking what's working.
Part 5 Blue-green deployment ~20 min
BLUE is in production. GREEN is the candidate running on canary traffic. When GREEN's U score exceeds BLUE by delta over at least T_min interactions, it promotes automatically.
5.1 Promotion thresholds
blue_green:
swe:
delta: 0.025 # GREEN must beat BLUE by +2.5% U
T_min: 10 # need at least 10 canary interactions
tau: 0.20 # softmax temperature for traffic split
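The gate these three numbers define can be paraphrased in a few lines of Python — names are illustrative, and the softmax split is shown for the two-arm BLUE/GREEN case:

```python
import math

def should_promote(green_u: float, blue_u: float, n_green: int,
                   delta: float = 0.025, t_min: int = 10) -> bool:
    """GREEN promotes once it beats BLUE by delta over at least T_min interactions."""
    return n_green >= t_min and (green_u - blue_u) >= delta

def green_traffic_share(green_u: float, blue_u: float, tau: float = 0.20) -> float:
    """Fraction of canary traffic routed to GREEN under a softmax split (illustrative)."""
    g, b = math.exp(green_u / tau), math.exp(blue_u / tau)
    return g / (g + b)

print(should_promote(green_u=0.71, blue_u=0.68, n_green=12))     # True: +0.03 ≥ 0.025
print(round(green_traffic_share(green_u=0.71, blue_u=0.68), 3))  # 0.537
```

Lower tau makes the split more decisive; higher tau keeps canary traffic closer to 50/50 while evidence accumulates.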
5.2 Promote and rollback
```bash
# Trigger blue-green promotion check via REST API
curl -X POST http://localhost:8000/deploy/green \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -d '{"specialist": "swe", "new_model": "qwen2.5-coder:14b"}'

# Check status
aua status --once

# Roll back to previous BLUE
aua rollback --specialist swe
aua rollback --specialist swe --yes   # skip confirmation
aua rollback --no-restart             # update config only, no restart
aua rollback --all --yes              # roll back all specialists
```
5.3 Using BlueGreenDeployment in Python
from aua import BlueGreenDeployment
from aua.config import load_config
config = load_config("aua_config.yaml")
bg = BlueGreenDeployment(config, specialist_name="swe")
bg.register_green("models/swe-green-v2/")
import asyncio
summary = asyncio.run(bg.evaluate(n_queries=10)) # async — use asyncio.run() from sync context
print(f"GREEN U mean: {summary.green_u_mean:.3f}")
print(f"BLUE U mean: {summary.blue_u_mean:.3f}")
if bg.should_promote(summary):
bg.promote()
print("GREEN promoted to BLUE")
5.4 The promotions log
Every promotion is recorded atomically to .aua/state/promotions.jsonl with a UUID, timestamp, and U scores. File-locked to prevent concurrent corruption.
What you can build with this
- Upgrade your AI models the way you upgrade software — safely, with a rollback button and a promotion gate.
- Test a new model on real traffic before it touches production: deploy as GREEN, evaluate, promote only if U score delta passes the threshold.
- Revert a bad upgrade in one command — no redeployment, no config archaeology.
- Set different promotion thresholds per specialist — conservative for customer-facing models, aggressive for internal experimental ones.
Part 6 makes the whole system easier to operate — config changes that take effect in seconds without restarting anything.
Part 6 Config system ~15 min
6.1 Config commands
```bash
aua config validate          # strict schema check — catches typos, dupe ports, bad ranges
aua config expand            # full resolved config with all defaults filled in (secrets redacted)
aua presets list             # list all built-in presets
aua presets inspect coding   # full preset config
```
6.2 Hot reload — no restart needed
Not all config changes take effect without restarting. The rule:
| Hot-reloadable (aua config reload) | Requires restart (aua serve) |
|---|---|
| Routing thresholds | Model path or name |
| Utility weights | Specialist port |
| Promotion delta / T_min / tau | Backend (vllm ↔ ollama) |
| Logging level | GPU assignment |
| CORS origins | mTLS certificate paths |
| Rate limits | New specialist added/removed |
| Arbiter thresholds | |
```bash
aua config reload                       # sends SIGHUP to the running router
kill -HUP $(cat .aua/pids/router.pid)   # same effect
```
6.3 Config versioning and migration
```bash
# Config versioning: the aua.version field tracks schema compatibility.
# v1.0 does not provide automatic migration — update config manually if upgrading.
aua config validate   # validates your config against the current schema
aua config expand     # shows full resolved config with all defaults applied
```
What you can build with this
- Change routing thresholds, utility weights, and CORS settings on a live system — no downtime, no restarts.
- Version your config in git and use aua config validate as a pre-commit hook so schema errors never reach production.
- See exactly what the running system is doing with aua config expand — no surprises from implicit defaults.
- A config-driven system your whole team can modify safely, not one that lives in a single engineer's head.
Part 7 shows how every mistake the system makes in production automatically becomes training material for the next version.
Part 7 Correction loop & DPO export ~15 min
Every contradiction the Arbiter resolves produces a DPO training pair. The correction loop accumulates these pairs and exports them for fine-tuning your specialists.
7.1 Export corrections via CLI
```bash
# Export all verified corrections as JSONL
aua corrections export --format jsonl

# Export as preference pairs for DPO training
aua dpo export --format preference-pairs

# With redaction (remove PII from prompts)
aua dpo export --format preference-pairs --redact
```
7.2 DPO pair format
v1.0 DPO pair status. In v1.0, DPO pairs are generated from corrections with a confirmed chosen answer. The rejected side is populated when the Arbiter identifies a clearly wrong response; for corrections injected manually (e.g. via POST /corrections), the rejected field is empty and must be filled before training. Case 4 (inconclusive arbiter outcomes) never produces a pair. Full chosen+rejected pair generation is a v1.1 item.
{
"query": "bubble_sort_complexity",
"chosen": "Bubble sort is O(n²) average case.",
"rejected": "Bubble sort is O(n) average case.",
"field": "software_engineering",
"utility_chosen": 0.72,
"utility_rejected": 0.41,
"correction_ids": ["corr_abc123"],
"trace_id": "01HXYZ..."
}
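Given the v1.0 caveat above, it's worth filtering out incomplete pairs before handing the file to a trainer. A small sketch, assuming the JSONL schema shown:

```python
import json

def load_trainable_pairs(path: str) -> list[dict]:
    """Keep only pairs where both 'chosen' and 'rejected' are populated."""
    pairs = []
    with open(path) as f:
        for line in f:
            pair = json.loads(line)
            if pair.get("chosen") and pair.get("rejected"):
                pairs.append(pair)  # complete pair — safe to train on
            # pairs with an empty 'rejected' (manual corrections) need filling first
    return pairs

pairs = load_trainable_pairs("dpo_pairs/calibration.jsonl")
print(f"{len(pairs)} complete pairs ready for DPO training")
```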
7.3 Using CorrectionLoop in Python
import asyncio
from aua import CorrectionLoop
from aua.config import load_config
config = load_config("aua_config.yaml")
loop = CorrectionLoop(config, router_url="http://localhost:8000")
async def main():
pairs = await loop.collect_pairs(min_confidence=0.8)
print(f"Collected {len(pairs)} pairs")
summary = loop.export_pairs(pairs, output_dir="dpo_pairs")
print(f"Exported to: {summary.output_path}")
asyncio.run(main())
7.4 Using field penalty weights in training
from aua import FIELD_CONFIGS
for pair in pairs:
cfg = FIELD_CONFIGS.get(pair.field, FIELD_CONFIGS["general"])
loss_weight = cfg.penalty_multiplier # 2× for SWE, 10× for surgery
# pass loss_weight to your DPO trainer's per-sample weight
What you can build with this
- Every mistake your AI makes in production automatically becomes training data for the next version — without manual labelling.
- A fine-tuning dataset built from real traffic: the things your actual users asked, and what the right answer was.
- Domain-filtered DPO pairs so coding corrections train the coding specialist and don't pollute math training data.
- A closed loop: production error → correction stored → DPO pair exported → model fine-tuned → mistake doesn't recur.
Part 8 adds a quality gate — catch regressions automatically before a new model ever reaches users.
Part 8 Eval harness ~20 min
The eval harness routes YAML test datasets through the live framework, scores outputs with the utility function, detects regressions, and produces structured JSON reports. It's the gate for blue-green promotion and CI.
8.1 Built-in smoke datasets
```bash
ls evals/
coding_smoke.yaml      math_smoke.yaml     routing_smoke.yaml
correction_smoke.yaml  arbiter_smoke.yaml  safety_smoke.yaml

# Run the coding smoke suite
aua eval run --dataset evals/coding_smoke.yaml --config aua_config.yaml

# View the report
aua eval report .aua/evals/latest.json

# Compare blue vs green
aua eval compare --baseline blue --candidate green
```
8.2 Dataset format
Property checks run against the response text. Supported check types in v1.0:
| Property key | Value | What it checks |
|---|---|---|
| contains | string | Case-insensitive substring match |
| contains_any | [string, ...] | At least one substring present |
| not_contains | string | Substring must NOT appear |
| min_length | int | Response character count ≥ N |
| expected_domain | string | Routing domain must equal this |
| expected_domain_any | [string, ...] | Routing domain must be one of these |
Regex, LLM-judge, and custom Python validators are not supported in v1.0.
```yaml
name: coding_smoke
field: software_engineering
cases:
  - id: binary_search
    prompt: "Implement binary search in Python. State time complexity."
    expected_properties:
      - contains: "O(log n)"
      - contains: "def binary_search"
  - id: bubble_sort_complexity
    prompt: "What is the average-case time complexity of bubble sort?"
    expected_properties:
      - contains: "O(n²)"
```
8.3 Eval report
aua eval report .aua/evals/latest.json
Eval run: coding_smoke 2026-05-11T14:30:22Z
Cases: 8 total · 7 passed · 1 failed
U mean: 0.638 (baseline: 0.601) ▲ +6.2%
Regressions: 0
FAILED:
merge_sort_stability — expected "stable" in response
U score: 0.41 (threshold: 0.45)
8.4 CI integration
- name: Run AUA eval
run: |
aua eval run \
--dataset evals/coding_smoke.yaml \
--config aua_config.yaml \
--json > .aua/evals/ci_result.json
# check exit code: 0 = pass, 1 = failure
What you can build with this
- A quality gate that catches AI regressions the same way unit tests catch code bugs — automatically, on every model change.
- Promote new models with a number, not a feeling: aua eval compare gives you a quantitative diff between baseline and candidate.
- Custom eval datasets for your domain — not generic benchmarks, but the exact questions and quality criteria that matter to your users.
- CI integration so any model change that causes a quality drop fails the pipeline before it touches anyone.
Part 9 gives you a full UI to demo all of this — a private ChatGPT-like product backed entirely by your own models.
Part 9 Chat UI ~15 min
AUA ships a Next.js 14 Chat UI at apps/aua_chat/. It requires Node.js 18+ and runs as a separate process from the AUA router.
9.0 Prerequisites
```bash
node --version   # must be 18+
npm --version

# Install Node.js if missing:
brew install node   # macOS
# or download from https://nodejs.org
```
9.1 Starting the full stack
Package user vs. repo contributor. aua init does not scaffold a Chat UI — the UI lives in the AUA source repo. Package users launch it through the CLI; repo contributors can run the Next.js dev server directly.
Package user — CLI launch (recommended)
Open two terminals:
```bash
# terminal 1 — router + specialists
cd my-aua-project
aua serve --tier macbook   # Mac / Apple Silicon + Ollama
# aua serve                # Linux / RTX 4090 + vLLM
```

```bash
# terminal 2 — Chat UI
aua ui   # starts on http://localhost:3001

# Or combined:
aua serve --tier macbook --with-ui
```
Repo contributor — Next.js dev server
If you have cloned the source repo and want to edit the UI:
```bash
cd Adaptive-Utility-Agent/apps/aua_chat
npm install   # first run only
npm run dev   # starts on http://localhost:3001
```
Open http://localhost:3001 — sign in with admin / aua-admin.
Local development credentials only. The default admin / aua-admin credentials are for local use. Change them via the AUA_UI_ADMIN_PASSWORD environment variable before exposing the UI beyond localhost. In production, disable the dev login and use token-based auth instead.
Note on aua serve --with-ui. This flag attempts to start the Chat UI automatically in the background. It works when npm is on your system PATH (standard Linux/Docker installs). On macOS with nvm or homebrew, node may not be on the PATH that background processes see, causing the UI to silently fail to start. If you see no Chat UI after --with-ui, use the two-terminal approach above — it always works. The UI log is at .aua/logs/ui.log if you want to diagnose the background start.
9.2 Three-zone layout
| Zone | Contents |
|---|---|
| Left — Session sidebar | All sessions, search, new session |
| Center — Chat window | Messages, streaming responses, send bar |
| Right — Framework Debugger | Routing decision, utility breakdown, arbiter output, latency, cost, trace link |
9.3 AUA Controls drawer
Click AUA Controls (left edge of the screen) to open the configuration drawer. Change routing thresholds, utility weights, arbiter policy, corrections, blue-green status, and observability settings — all without restarting. Uses aua config reload under the hood.
9.4 Chat Session API
```bash
# Create a session
curl -X POST http://localhost:8000/sessions \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-coding-session"}'

# Post a message (streaming)
curl -X POST http://localhost:8000/sessions/{id}/stream \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain quicksort"}'

# List all sessions
curl http://localhost:8000/sessions \
  -H "Authorization: Bearer $AUA_TOKEN"
```
9.5 SSE streaming event types
| Event | When fired |
|---|---|
| route | Routing decision made — field, mode, specialists |
| specialist_start | Specialist call begins |
| chunk | Each token streamed from specialist |
| specialist_done | U score, latency for this specialist |
| arbiter_done | Verdict case, corrections stored |
| done | Full response + metadata |
| error | AUA_* error code + trace ID |
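A minimal streaming client for these events — this assumes the standard SSE wire format (event:/data: lines) and uses httpx; the chunk payload shape is an assumption, so adjust to what you actually observe:

```python
import json
import httpx

def stream_session(session_id: str, query: str, token: str | None = None) -> None:
    """Print tokens from the /sessions/{id}/stream endpoint as they arrive."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"http://localhost:8000/sessions/{session_id}/stream"
    with httpx.stream("POST", url, headers=headers,
                      json={"query": query}, timeout=None) as r:
        event = None
        for line in r.iter_lines():
            if line.startswith("event:"):
                event = line.split(":", 1)[1].strip()
            elif line.startswith("data:") and event == "chunk":
                print(json.loads(line.split(":", 1)[1]), end="", flush=True)
            elif line.startswith("data:") and event == "done":
                print()  # full response + metadata arrives in this event

stream_session("qs-demo", "Explain quicksort")
```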
Framework Debugger tip: Every query in the UI shows the full routing trace — which specialist was called, intermediate U scores, whether the Arbiter fired, and a link to the OTEL trace in Jaeger or Tempo if observability is configured.
What you can build with this
- A complete, private AI product — a chat interface backed entirely by your own models, no data leaving your environment.
- A way to show stakeholders every routing decision in plain language: which specialist answered, what score it got, whether the Arbiter stepped in.
- Adjust routing and config from the UI — no terminal, no restarts.
- Everything from Parts 1–9 in a single interface: routing, corrections, blue-green status, and U scores, all visible at once.
Part 10 is where the framework starts shaping itself to your definition of good output — and your definition becomes a curriculum.
Part 10 Policies & Assertions — Design your AI over time ~25 min
This is the most powerful section of the tutorial. By the end, you'll understand how to teach the framework what "good output" means — and how it uses that definition to block bad responses in real-time, track model reputation over sessions, and automatically identify gold-standard training data for fine-tuning.
The core idea. Instead of writing a long system prompt and hoping the model follows it, you write a Policy — a versioned, portable definition of what good output looks like. The framework enforces it in real-time, tracks adherence over every session, and eventually makes the defined behavior permanent through fine-tuning. Your policy becomes the model's curriculum.
10.1 The three assertion levels
Every assertion has a level that determines what happens when it fires:
| Level | What it does | Effect on U score |
|---|---|---|
| BLOCKING | Fails → error injected back into prompt → specialist retried up to max_retries (default 3). User never sees a response that violates this. | U penalty if all retries exhausted |
| SOFT | Fails → logged to assertion_events; response passes through. Use for guardrails you want to track without enforcing. | No U change — logged only |
| INFO | Always passes. When the condition fires (returns a message), adds +bonus to the Efficacy (E) score. Use for positive/incentive assertions. | E_final = min(1.0, E_base + bonus) |
10.2 Writing your first assertion
from aua.guard import assertion, AssertionLevel
# ── Guardrail: block syntax errors from ever reaching the user ─────────────
@assertion(name="PythonSyntaxCheck", level=AssertionLevel.BLOCKING)
def validate_python_syntax(output: str, context: dict) -> tuple[bool, str | None]:
"""Blocks output if any Python code block contains syntax errors."""
import ast, re
blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
if not blocks:
return True, None # no code block — pass through
for block in blocks:
try:
ast.parse(block)
except SyntaxError as e:
return False, f"Syntax error at line {e.lineno}: {e.msg}"
return True, None
# ── Guardrail: soft-flag refusals without blocking ─────────────────────────
@assertion(name="NoAIisms", level=AssertionLevel.SOFT)
def no_ai_isms(output: str, context: dict) -> tuple[bool, str | None]:
"""Soft-flags common 'AI-isms' like 'as an AI language model'."""
phrases = ["as an ai", "as a language model", "i cannot help with"]
found = next((p for p in phrases if p in output.lower()), None)
if found:
return False, f"AI-ism detected: '{found}'"
return True, None
10.3 Positive assertions — rewarding gold-standard behaviour
Negative assertions block bad output. Positive assertions reward exceptional output — and this is what feeds the fine-tuning pipeline. Sessions where positive assertions fire get the highest U scores and are automatically selected as "chosen" in your DPO export.
@assertion(name="AnalogyBonus", level=AssertionLevel.INFO, bonus=0.10)
def reward_analogy(output: str, context: dict) -> tuple[bool, str | None]:
"""Rewards responses that use analogies to explain concepts."""
phrases = ["like a", "similar to", "imagine a", "think of it as", "just like"]
if any(p in output.lower() for p in phrases):
return True, "Positive: analogy used for clarity"
return True, None # neutral — no bonus if condition not met
@assertion(name="SocraticEnding", level=AssertionLevel.INFO, bonus=0.08)
def reward_question_ending(output: str, context: dict) -> tuple[bool, str | None]:
"""Rewards responses that end with an engaging question."""
if output.strip().endswith("?"):
return True, "Positive: Socratic engagement"
return True, None
@assertion(name="PythonSyntaxBonus", level=AssertionLevel.INFO, bonus=0.12)
def reward_clean_code(output: str, context: dict) -> tuple[bool, str | None]:
"""Rewards syntactically clean Python with a bonus (stack with syntax check)."""
import ast, re
blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
if blocks:
try:
for b in blocks:
ast.parse(b)
return True, "Positive: clean executable Python"
except SyntaxError:
pass
return True, None
Option B bonus cap. Each INFO assertion contributes its declared bonus independently. The sum is capped by max_total_bonus on the Policy (default 0.30), then hard-capped at 0.50. A session where all three INFO assertions above fire adds up to 0.30 to E — a meaningful signal that this session is gold-standard.
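In code, the cap rule amounts to two clamps — a sketch of the arithmetic just described:

```python
def apply_info_bonuses(e_base: float, bonuses: list[float],
                       max_total_bonus: float = 0.30) -> float:
    """E_final = min(1.0, E_base + capped sum of INFO bonuses)."""
    total = min(sum(bonuses), max_total_bonus, 0.50)  # policy cap, then hard cap
    return min(1.0, e_base + total)

# All three INFO assertions above fire: 0.10 + 0.08 + 0.12 = 0.30 (exactly at the cap)
print(apply_info_bonuses(0.62, [0.10, 0.08, 0.12]))  # 0.92
```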
10.4 Bundling into a Policy
A Policy is a versioned bundle that groups assertions, sets retry limits, and optionally shifts utility weights when active. Think of it as a Django settings.py for your AI's behaviour.
from aua.policy import Policy
# Bundle guardrails + incentives into one named Policy
coding_policy = Policy(
name="SafeCoding",
version="1.0",
max_retries=3, # BLOCKING retries before giving up
max_total_bonus=0.30, # cap on total E bonus (Option B)
utility_overrides={
"w_k": 0.30, # slightly raise curiosity weight for this policy
}
)
# Add assertions — chaining supported
coding_policy.add(validate_python_syntax) # BLOCKING
coding_policy.add(no_ai_isms) # SOFT
coding_policy.add(reward_analogy) # INFO +0.10
coding_policy.add(reward_clean_code) # INFO +0.12
# Inspect before applying
print(coding_policy.summary())
10.5 YAML policy file (recommended for production)
name: SafeCoding
version: "1.0"
max_retries: 3
max_total_bonus: 0.30
assertions:
- import_path: mypackage.policies:validate_python_syntax
# level defaults to what's declared on the @assertion decorator
- import_path: mypackage.policies:no_ai_isms
- import_path: mypackage.policies:reward_analogy
bonus: 0.10 # override decorator default
- import_path: mypackage.policies:reward_clean_code
bonus: 0.12
utility_overrides:
w_k: 0.30
10.6 Applying a policy via CLI
# Validate schema before applying
aua policy validate policies/safe_coding.yaml
# ✓ policies/safe_coding.yaml is valid
# Preview — see what would be activated
aua policy apply policies/safe_coding.yaml --dry-run
# Activate — writes pointer to .aua/active_policy
aua policy apply policies/safe_coding.yaml
# ✓ Policy activated. Restart or hot-reload to apply.
# List all policies in policies/
aua policy list
# Test a single assertion against sample output
aua guard list
aua guard test --import-path mypackage.policies:validate_python_syntax
aua guard test --import-path mypackage.policies:reward_analogy \
--output "Think of it as a balanced binary tree."
10.7 The three-layer learning loop
Once a policy is active, the framework creates a feedback loop that progressively shapes model behaviour — no manual intervention required:
Layer 1 — Immediate (milliseconds). BLOCKING assertions fire on every response. If PythonSyntaxCheck fails, the error is injected back into the prompt and the specialist retries. The user only ever sees syntactically valid code.
Layer 2 — Session-by-session. Every assertion result is stored in assertion_events with a timestamp. Specialists that consistently fail assertions accumulate lower mean U scores. Lower U scores mean they don't meet the blue-green promotion delta threshold — a model that can't follow your policy doesn't advance to BLUE.
Layer 3 — Calibration (on demand). Run aua calibrate --layer 3 to export sessions where all INFO assertions fired and no BLOCKING assertion exhausted retries. These are your gold-standard sessions — ready as DPO "chosen" examples for fine-tuning. After fine-tuning, the defined behaviours are baked into the model weights, and the assertions become less necessary over time.
What you can build with this
- Bad output blocked before users ever see it — your guardrails run on every response, automatically.
- A system that rewards the behaviours you want: every session that meets your gold standard is automatically flagged as training data.
- Domain-specific personalities: strict and cautious for legal queries, curious and expressive for creative ones — all from one YAML file.
- The start of a feedback loop: every failure you define an assertion for is a failure that gets corrected, tracked, and eventually eliminated.
Part 11 shows how to close the loop — take those gold-standard sessions and turn them into the next version of your model.
Part 11 Calibration cycles ~15 min
The aua calibrate command surfaces the three feedback loops as explicit, triggerable operations. You choose when to run each one — the framework handles the analysis.
11.1 Layer 1 — Measure current performance
```bash
# Run the eval harness — same as `aua eval run` but surfaced as a calibration step
aua calibrate --layer 1 --dataset evals/coding_smoke.yaml

# Use the default dataset if it exists
aua calibrate --layer 1
```
11.2 Layer 2 — Routing weight analysis
Layer 2 reads assertion event history and shows which domains are healthy vs. degrading — the signal that tells you which specialists need attention.
```bash
aua calibrate --layer 2

# Example output:
# ┌──────────────────────────┬─────────┬───────────┬───────────┬──────────────┐
# │ Domain                   │ Queries │ Pass Rate │ Avg Bonus │ Signal       │
# ├──────────────────────────┼─────────┼───────────┼───────────┼──────────────┤
# │ software_engineering     │ 312     │ 91.3%     │ +0.087    │ ↑ Strong     │
# │ mathematics              │ 148     │ 83.1%     │ +0.041    │ → Stable     │
# │ general                  │ 44      │ 56.2%     │ —         │ ↓ Weak       │
# └──────────────────────────┴─────────┴───────────┴───────────┴──────────────┘
#
# Stagnation signal: same assertions failing week over week
# → Check: is the assertion too strict? Is the model too small?

aua calibrate --layer 2 --dry-run   # preview only
```
11.3 Layer 3 — Export gold-standard DPO pairs
This is the calibration cycle that closes the loop. The framework identifies your best sessions — where the model followed your policy perfectly — and exports them as DPO training pairs.
```bash
# See what would be exported without writing files
aua calibrate --layer 3 --dry-run

# Example dry-run output:
#   Gold-standard sessions: 47
#   Failed sessions: 12
#   Exportable pairs: 12
#   --dry-run: would export 12 DPO pairs → dpo_pairs/calibration.jsonl

# Export when ready
aua calibrate --layer 3 --output dpo_pairs/may_calibration.jsonl

# Force export even if below min-pairs threshold
aua calibrate --layer 3 --force --output dpo_pairs/early_export.jsonl

# Fine-tune your specialist with the exported pairs:
#   Axolotl: axolotl train configs/dpo.yaml --data dpo_pairs/may_calibration.jsonl
#   TRL:     trl dpo --dataset dpo_pairs/may_calibration.jsonl
# Then deploy as GREEN: curl -X POST http://localhost:8000/deploy/green
```
What Layer 3 does (and doesn't do). aua calibrate --layer 3 identifies gold-standard sessions and exports DPO pairs in the format your fine-tuning framework expects. It does not fine-tune models automatically — that step runs via Axolotl, TRL, or LLaMA-Factory using the exported JSONL. After fine-tuning, deploy the new model as a GREEN candidate and let blue-green handle promotion.
What you can build with this
- A model that gets measurably better over time — not by accident, but because you've defined what better means and built a pipeline that teaches it.
- Training data you didn't have to label: the framework identified which sessions were gold-standard based on your policy.
- A clear picture of which domains are healthy and which specialists need attention — before users notice.
- The complete loop: define what good looks like → run queries → identify the best sessions → export training pairs → fine-tune → repeat.
Part 12 gives you the visibility layer — so you can actually see the improvement happening over time.
Part 12 Logs & metrics over time ~15 min
The assertion events store gives you a time-series view of how your policy is performing. These commands let you answer "is my AI actually getting better at following my policy?"
12.1 Viewing assertion events
```bash
# All recent assertion events
aua logs assertions

# Filter to failures only — the assertions that need attention
aua logs assertions --filter passed=false

# Filter by assertion name
aua logs assertions --assertion PythonSyntaxCheck --tail 20

# Filter by domain
aua logs assertions --filter domain=software_engineering

# Export for offline analysis
aua logs assertions --json > my_assertions.json
```
12.2 Viewing session history
```bash
# Recent sessions with U scores
aua logs sessions

# Export sessions to JSON
aua logs export --table audit_log --output sessions.json
```
12.3 Comparing metrics over time
This is the "is it working?" command. It compares the current window against the prior window of the same length and shows whether the key signals are moving in the right direction.
```bash
# Compare last 30 days vs prior 30 days
aua metrics --compare 30d

# Example output (after a few weeks with an active policy):
# ┌─────────────────────────────┬──────────┬──────────┬──────────────────┐
# │ Metric                      │ Prior    │ Current  │ Trend            │
# ├─────────────────────────────┼──────────┼──────────┼──────────────────┤
# │ Mean U score                │ 0.6213   │ 0.6891   │ ↑ +0.0678        │
# │ Assertion fail rate         │ 0.2341   │ 0.1102   │ ↓ -0.1239        │ ← good
# │ Retry rate (BLOCKING)       │ 0.1820   │ 0.0890   │ ↓ -0.0930        │ ← good
# │ Avg E bonus (INFO)          │ 0.0120   │ 0.0654   │ ↑ +0.0534        │
# │ Total queries               │ 312      │ 481      │ ↑ +169           │
# └─────────────────────────────┴──────────┴──────────┴──────────────────┘
# Success signal: mean_u_score ↑, assertion_fail_rate ↓, retry_rate ↓
# Stagnation signal: same assertions failing week over week

# Focus on a single metric
aua metrics --compare 7d --metric assertion_fail_rate

# Date range
aua metrics --compare 2025-04-01:2025-05-01

# JSON output for charting
aua metrics --compare 30d --json
```
What success looks like. Mean U score trending up. Assertion fail rate trending down. Retry rate (BLOCKING) falling — meaning the model is learning to get it right on the first try. After a fine-tuning cycle, you may see a step-change drop in fail rate as the trained behaviour is baked into the weights.
What stagnation looks like. The same assertions failing week over week. This means either the assertion is too strict for the model's capability, or the model isn't receiving enough signal to learn. Check: is max_retries too low? Is the policy active on enough queries to accumulate data?
12.4 The full policy workflow in practice
Putting it all together — this is the cycle for designing and refining your AI over time:
# 1. Week 1-4: run queries with policy active, accumulate assertion events
# 2. Check Layer 2 health at any point
aua calibrate --layer 2
# 3. Review failures — add/refine assertions if the same things keep failing
aua logs assertions --filter passed=false --tail 50
# 4. End of month: export gold-standard sessions
aua calibrate --layer 3 --dry-run # preview
aua calibrate --layer 3 # export to dpo_pairs/calibration.jsonl
# 5. Fine-tune your specialist on the exported pairs (external step)
# trl dpo --dataset dpo_pairs/calibration.jsonl
# 6. Deploy the fine-tuned model as GREEN
curl -X POST http://localhost:8000/deploy/green \
  -H "Content-Type: application/json" \
  -d '{"specialist": "swe", "new_model": "qwen2.5-coder:14b-finetuned"}'
# 7. Blue-green evaluates and promotes if U score delta passes threshold
aua status --once # watch the promotion
# 8. Compare metrics to confirm improvement
aua metrics --compare 30d
# Repeat — each cycle, the model gets better at following your policy.
# After a few cycles, the assertions become less necessary because the
# defined behaviours are baked into the model weights.
What you can build with this
- Full visibility into whether your AI is improving — assertion fail rate trending down, U score trending up, in whatever monitoring stack you use.
- One trace_id that links a specific response to its log line, its Prometheus metrics, and its distributed trace — the full story of what happened.
- Alerts before users notice: U score drops, assertion failure spikes, latency regressions — all triggerable from the same metrics.
- A system you can hand to an ops team: ELK, Splunk, Grafana, Loki — whatever they already use, with working configs and the right fields already in every log line.
The how-to guides cover everything you need for production: plugins, security, Docker, and full observability setup.
How-to guides
Task-oriented guides for specific things you need to accomplish.
How-to 13 Write a plugin ~25 min
AUA defines 8 Protocol interfaces. Implement any of them to replace the corresponding framework layer — without editing AUA source. Register via import_path in aua_config.yaml.
13.1 The 8 Plugin Protocol interfaces
| Interface | What it replaces |
|---|---|
| FieldClassifierPlugin | Field classification for incoming queries |
| UtilityScorerPlugin | Utility function U = f(response, field) |
| ArbiterPolicyPlugin | Contradiction arbitration verdict logic |
| PromotionPolicyPlugin | Promotion decision logic — should GREEN promote? |
| CorrectionStorePlugin | Storage backend for assertions and corrections |
| ModelBackendPlugin | Custom LLM backend (not vLLM or Ollama) |
| HookPlugin | Lifecycle event handler (see How-to 14) |
| AUAMiddleware | Before/after pipeline for every request (see How-to 14) |
13.2 Example — custom utility scorer
from aua.plugins.interfaces import UtilityScorerPlugin
class RiskWeightedUtilityScorer(UtilityScorerPlugin):
"""Weights efficacy down when confidence is below threshold."""
def __init__(self, config: dict):
self.risk_threshold = config.get("risk_threshold", 0.80)
def score(
self,
response: str,
field: str,
efficacy: float,
confidence: float,
curiosity: float,
weights: dict,
) -> float:
if confidence < self.risk_threshold:
weights = {**weights, "efficacy": weights["efficacy"] * 0.6}
return (
weights["efficacy"] * efficacy
+ weights["confidence"] * confidence
+ weights["curiosity"] * curiosity
)
plugins:
utility_scorer:
import_path: plugins.risk_scorer:RiskWeightedUtilityScorer
config:
risk_threshold: 0.85
```bash
aua extensions test \
  --kind utility_scorer \
  --import-path plugins.risk_scorer:RiskWeightedUtilityScorer

✓ Interface satisfied: UtilityScorerPlugin
✓ score() signature valid
✓ Test vector passed (U=0.612)
```
13.3 Example — custom model backend
from aua.plugins.interfaces import ModelBackendPlugin
from typing import AsyncIterator
import httpx
class GatewayBackend(ModelBackendPlugin):
def __init__(self, config: dict):
self.base_url = config["base_url"]
self.api_key = config["api_key"]
async def complete(self, request: dict) -> dict:
async with httpx.AsyncClient() as client:
r = await client.post(
f"{self.base_url}/v1/chat/completions",
headers={"Authorization": f"Bearer {self.api_key}"},
json=request,
)
return r.json()
async def stream(self, request: dict) -> AsyncIterator[str]:
# yield SSE chunks
...
async def health(self) -> dict:
return {"status": "ok"}
backends:
my_gateway:
plugin: plugins.gateway_backend:GatewayBackend
base_url: https://api.my-gateway.internal
api_key_secret: MY_GATEWAY_KEY # resolved from secrets provider
13.4 Listing and inspecting registered plugins
aua extensions list
aua extensions inspect utility_scorer
aua extensions test --kind utility_scorer --import-path mypackage.myplugin:MyPlugin
# To pick up plugin changes: restart aua serve
How-to 14 Hooks & middleware ~20 min
14.1 The 11 hook points
| Hook | Fires when |
|---|---|
| pre_query | Before field classification |
| post_query | After response assembled |
| pre_route | Before routing decision |
| post_route | After routing decision, before specialist calls |
| pre_specialist_call | Before each specialist API call |
| post_specialist_call | After each specialist returns |
| pre_arbiter | Before arbiter receives inputs |
| post_arbiter | After arbiter verdict issued |
| on_correction | When a correction is stored |
| on_promotion | When GREEN promotes to BLUE |
| on_rollback | When a rollback completes |
14.2 Writing a hook
from aua.plugins.interfaces import HookPlugin
class AuditHook(HookPlugin):
hook_name = "post_arbiter"
error_policy = "fail_open" # don't block response if hook fails
timeout_seconds = 2.0
async def __call__(self, event: dict) -> dict:
# event contains: session_id, trace_id, field, verdict, utility_scores, ...
if event.get("verdict") == "case_4":
await self.alert_slack(event) # your logic here
return event # always return event (can mutate)
async def alert_slack(self, event: dict):
...
hooks:
- import_path: plugins.audit_hook:AuditHook
order: 10 # lower number runs first
14.3 Middleware — before/after every request
from aua.plugins.interfaces import AUAMiddleware
class PIIRedactionMiddleware(AUAMiddleware):
async def before_query(self, request: dict) -> dict:
request["prompt"] = self._redact(request["prompt"])
return request
async def after_response(self, response: dict) -> dict:
# optionally mutate response
return response
def _redact(self, text: str) -> str:
# your PII redaction logic
return text
```yaml
middleware:
  - plugins.pii_middleware:PIIRedactionMiddleware
  - plugins.tenant_policy:TenantPolicyMiddleware
```
Error policy: fail_open — hook failure is logged but does not block the response. fail_closed — hook failure returns an error to the client. Configure per hook in YAML. Middleware failures are always fail_open by default.
How-to 15 Security ~25 min
15.1 Bearer tokens and scopes
AUA uses HMAC-SHA256 bearer tokens with 15 fine-grained scopes. Create tokens with the CLI:
```bash
# Create a query-only token expiring in 30 days
aua token create --scope aua:query --expires 30d

# Create an admin token
aua token create --scope aua:admin --expires 7d

# List all tokens
aua token list

# Revoke a token
aua token revoke <token-id>
```
| Scope | Grants access to |
|---|---|
| aua:query | POST /query, POST /sessions/{id}/messages |
| aua:stream | POST /sessions/{id}/stream |
| aua:status | GET /status, GET /version, GET /health |
| aua:config:read | GET /config (secrets redacted) |
| aua:config:write | POST /config/reload |
| aua:corrections:write | POST /corrections |
| aua:deploy | POST /deploy/green |
| aua:rollback | POST /deploy/rollback |
| aua:extensions:write | POST /extensions, POST /extensions/reload |
| aua:admin | All scopes |
export AUA_TOKEN="aua_tk_..."
curl -X POST http://localhost:8000/query \
-H "Authorization: Bearer $AUA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "Explain quicksort", "session_id": "s1"}'
15.2 mTLS — encrypted internal communication
```bash
# Generate dev certs (self-signed, router + specialists + arbiter)
aua certs generate

# Inspect cert details
aua certs inspect

# Rotate certs (hot-reloaded, no restart needed)
```
security:
mtls:
enabled: true
cert_dir: .aua/certs
auto_generate: true # false in production — bring your own CA
15.3 Secrets management
AUA never stores plaintext secrets in config. Instead, config references a secret name, and the secrets provider resolves it at startup:
```yaml
secrets:
  provider: env   # "env" | "vault" | "aws_sm" | "gcp_sm"

specialists:
  - name: swe
    api_key_secret: SWE_API_KEY   # reads env var SWE_API_KEY
```
secrets:
provider: vault
vault_addr: https://vault.internal:8200
vault_token_secret: VAULT_TOKEN # token from env
vault_path_prefix: secret/aua
15.4 Encryption at rest
Correction payloads, assertions, DPO pairs, token metadata, and sensitive audit fields are encrypted at rest with AES-256-GCM:
security:
encryption:
enabled: true
    key_secret: AUA_ENCRYPTION_KEY   # 64-char hex key — see §17.3 for generation
15.5 Audit log
The audit log is append-only with a tamper-evident SHA-256 hash chain. Every security-relevant event is recorded:
```bash
# View recent audit events (written to .aua/audit.log)
tail -f .aua/audit.log

# Export corrections (machine-readable audit trail)
aua corrections export --format jsonl
```
Production checklist: aua doctor --strict emits a loud warning when cors_origins is *, the host is 0.0.0.0, and auth is disabled. Use the Team Server or Enterprise deployment profile to enforce auth + mTLS requirements.
How-to 16 Observability ~25 min
AUA emits three observability streams out of the box: structured JSON logs (every query, assertion, and error), Prometheus metrics (18 gauges/counters/histograms), and optional OpenTelemetry distributed traces. All three are designed to ship directly to ELK, Splunk, Grafana, or any OTEL-compatible backend — no code changes required.
16.1 Structured JSON logging
Every log line the framework emits is a single-line JSON object. Every line automatically includes the current request's session_id, trace_id, and request_id — so a Kibana or Splunk search on a session ID returns the complete picture of everything that happened in that request.
```yaml
logging:
  level: INFO      # DEBUG | INFO | WARNING | ERROR
  format: json     # "json" (default) | "text" (human-readable dev mode)
  output: stdout   # "stdout" | "stderr" | "/var/log/aua/router.log"
```
{"ts":1747000000.12,"level":"INFO","logger":"aua.router","msg":"single→software_engineering U=0.731","session_id":"s_abc123","trace_id":"01HX...","field":"software_engineering","routing_mode":"single","latency_ms":312.4,"utility_score":0.731,"confidence":0.823}
{"ts":1747000000.43,"level":"INFO","logger":"aua.router","msg":"Query routed","session_id":"s_abc123","trace_id":"01HX...","domain":"software_engineering","u_score":0.731,"latency_ms":315.1}
Fields included in every structured log line:
| Field | Description |
|---|---|
| ts | Unix timestamp (float) |
| level | DEBUG / INFO / WARNING / ERROR |
| logger | Module name (aua.router, aua.arbiter, aua.auth, ...) |
| session_id | Chat session identifier — auto-injected from request context |
| trace_id | W3C-compatible trace ID — links to OTEL spans if enabled |
| request_id | Per-request unique ID |
| field | Routed domain (software_engineering, mathematics, ...) |
| specialist | Specialist name that handled the query |
| routing_mode | single / fanout / arbiter |
| utility_score | Final U score for this response |
| confidence | Kalman-filtered confidence estimate |
| latency_ms | End-to-end latency in milliseconds |
| error_code | HTTP status on errors |
| verdict | Arbiter verdict case (case_1–case_4) when the arbiter fires |
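Because every line is single-line JSON with these fields at the top level, ad-hoc analysis needs no parser config. For example, pulling one session's full story out of a log file — a sketch, assuming the log path from your config:

```python
import json

def session_story(log_path: str, session_id: str) -> list[dict]:
    """Return every structured log event for one session, in order."""
    events = []
    with open(log_path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip any non-JSON noise in the file
            if rec.get("session_id") == session_id:
                events.append(rec)
    return events

for e in session_story("/var/log/aua/router.log", "s_abc123"):
    print(e["ts"], e["logger"], e["msg"])
```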
16.2 Shipping logs to ELK (Elasticsearch / Kibana)
AUA's JSON output is Filebeat-native. No parsing config needed — all fields are already top-level JSON keys that become indexed Elasticsearch fields automatically.
logging:
format: json
output: /var/log/aua/router.log # Filebeat monitors this path
filebeat.inputs:
- type: log
paths: ["/var/log/aua/router.log"]
json.keys_under_root: true # promote JSON fields to top-level
json.add_error_key: true
processors:
- timestamp:
field: ts
layouts: ["UNIX"]
target_field: "@timestamp"
output.elasticsearch:
hosts: ["https://your-elastic:9200"]
index: "aua-logs-%{+yyyy.MM.dd}"
api_key: "your-api-key"
```text
# All failed assertions in the last 24h
logger: "aua.router" AND level: "WARNING" AND msg: "assertion"

# Low U-score sessions (worth reviewing)
utility_score < 0.4

# All events for a specific session
session_id: "s_abc123"

# High latency queries
latency_ms > 5000

# Authentication failures
logger: "aua.auth" AND level: "WARNING"

# Arbiter fired
routing_mode: "arbiter" AND verdict: *
```
Logstash pipeline alternative. If you're using Logstash instead of Filebeat, pipe aua serve stdout directly: aua serve 2>&1 | logstash -f aua.conf. In the pipeline filter: json { source => "message" } then date { match => ["ts", "UNIX"] }. The JSON structure needs no grok patterns.
16.3 Shipping logs to Splunk
Two options depending on your Splunk setup:
Option A — Universal Forwarder (file-based)
```ini
# inputs.conf
[monitor:///var/log/aua/router.log]
index = aua
sourcetype = aua_json
```
```ini
# props.conf
[aua_json]
KV_MODE = json
TIME_FORMAT = %s%3N
TIME_PREFIX = "ts":
MAX_TIMESTAMP_LOOKAHEAD = 20
```
Option B — HTTP Event Collector (HEC, no file needed)
pip install splunk-handler
from splunk_handler import SplunkHandler
import logging
logging.getLogger("aua").addHandler(
SplunkHandler(
host="splunk.yourcompany.com",
port=8088,
token="your-hec-token",
index="aua",
sourcetype="aua_json",
)
)
```text
# Failed assertions over time
index=aua sourcetype=aua_json logger="aua.router" "assertion"
| timechart count by assertion_name

# U-score trend per domain
index=aua sourcetype=aua_json utility_score=*
| timechart avg(utility_score) by field

# P95 latency by routing mode
index=aua sourcetype=aua_json latency_ms=*
| stats perc95(latency_ms) by routing_mode
```
16.4 Prometheus metrics
curl http://localhost:8000/metrics | head -30
| Metric | Type | What it measures |
|---|---|---|
aua_queries_total | Counter | Total queries by field, routing mode, status |
aua_query_latency_seconds | Histogram | Latency (p50/p95/p99) |
aua_utility_score | Gauge | Last U score per domain |
aua_contradiction_rate | Gauge | Contradiction rate per domain |
aua_routing_field_distribution | Counter | Query distribution across fields |
aua_specialist_confidence | Gauge | Confidence per specialist |
aua_correction_count | Counter | Corrections accumulated |
aua_arbiter_verdict_distribution | Counter | Case 1/2/3/4 breakdown |
| aua_dpo_pairs_accumulated | Gauge | Total DPO pairs in store |
| aua_token_requests_total | Counter | Token-gated requests by scope |
| aua_hook_failures_total | Counter | Hook execution failures |
| aua_plugin_execution_seconds | Histogram | Plugin latency |
| aua_specialist_vram_utilization | Gauge | GPU VRAM % per specialist |
| aua_cost_gpu_hours_total | Counter | Cumulative GPU hours per specialist |
| aua_cost_usd_total | Counter | Cumulative USD cost per specialist |
| aua_assertion_results_total | Counter | Assertion pass/fail by name, level, domain |
| aua_assertion_retries_total | Counter | BLOCKING assertion retry count |
| aua_assertion_bonus_applied | Histogram | E-score bonus applied by INFO assertions |
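If you want a quick failure ratio without standing up Prometheus, the exposition format parses client-side. A sketch using the official prometheus_client parser; the result label with pass/fail values is an assumption inferred from the table above, so check your actual exposition output:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Compute an overall assertion fail rate straight from /metrics.
text = requests.get("http://localhost:8000/metrics", timeout=5).text
counts = {"pass": 0.0, "fail": 0.0}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name.startswith("aua_assertion_results"):
            result = sample.labels.get("result")  # assumed label name
            if result in counts:
                counts[result] += sample.value
total = sum(counts.values())
if total:
    print(f"assertion fail rate: {counts['fail'] / total:.1%}")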
16.5 Cost tracking
curl http://localhost:8000/metrics/cost | python3 -m json.tool
{
"swe": {"queries": 42, "gpu_hours": 0.012, "cost_usd": 0.0083},
"math": {"queries": 18, "gpu_hours": 0.005, "cost_usd": 0.0034},
"total_cost_usd": 0.0117
}
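This endpoint makes a cheap budget guard for cron or CI. A sketch, assuming auth is still disabled (with auth enabled, add a bearer token carrying the aua:status scope); the $5 budget is an illustrative threshold, not a framework default:
import sys
import requests

BUDGET_USD = 5.00  # illustrative threshold

cost = requests.get("http://localhost:8000/metrics/cost", timeout=5).json()
total = cost["total_cost_usd"]
if total > BUDGET_USD:
    sys.exit(f"AUA spend ${total:.4f} exceeds budget ${BUDGET_USD:.2f}")
print(f"AUA spend so far: ${total:.4f}")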
16.6 Grafana dashboard
docker compose --profile obs up
# Grafana at http://localhost:3000 (admin / aua-admin)
# Dashboard pre-loaded: 20 panels covering query volume, latency p50/p95/p99,
# routing distribution, U score trends, contradiction rate, arbiter verdicts,
# specialist health, VRAM usage, blue-green split, assertion fail rate,
# DPO pairs accumulated, auth failures, cost per specialist
16.7 OpenTelemetry — distributed traces
Optional. Sends full request traces to Jaeger, Tempo, Elastic APM, Splunk Observability, or any OTLP-compatible backend. Each trace covers the complete request path: router → classifier → routing decision → specialist calls → utility scoring → arbiter → hooks → policy assertions → response.
pip install "adaptive-utility-agent[otel]"
observability:
otel:
enabled: true
endpoint: http://localhost:4317 # OTLP gRPC collector
service_name: aua-router
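The same three keys pointed at a vendor backend, here Splunk Observability Cloud with its auth token header: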
observability:
otel:
enabled: true
endpoint: https://ingest.us1.signalfx.com:443
service_name: aua-router
headers:
X-SF-Token: "your-splunk-o11y-token"
Log + trace correlation. The trace_id in every JSON log line is W3C-compatible. When OTEL is enabled, clicking a log line in Kibana or Splunk and following its trace_id jumps directly to the corresponding distributed trace in Jaeger or Elastic APM — showing the exact specialist calls, latencies, and assertion checks for that request.
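The same correlation works from a terminal. A sketch that pulls every trace_id for one session out of the structured log and prints a Jaeger deep link; the session id is the quickstart's, and the Jaeger host and UI port are assumptions for a local all-in-one setup:
import json

SESSION = "qs-demo"
JAEGER_UI = "http://localhost:16686"  # assumed local Jaeger all-in-one

# Print a trace deep link for every request in the session.
with open("/var/log/aua/router.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("session_id") == SESSION and event.get("trace_id"):
            print(f"{JAEGER_UI}/trace/{event['trace_id']}")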
16.8 Structured logging in Docker / Kubernetes
services:
aua-router:
logging:
driver: json-file # Docker captures stdout as JSON
labels:
logging: "aua"
alloy:
image: grafana/alloy
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers:ro
# Alloy → Loki → Grafana: full log + metric correlation
# fluent-bit ConfigMap snippet
[INPUT]
Name tail
Path /var/log/containers/aua-*.log
Parser json
Tag aua.*
[OUTPUT]
Name es
Match aua.*
Host elasticsearch.logging.svc
Index aua-logs
Type _doc
How-to 17 Docker deployment ~20 min
17.1 Docker Compose profiles
All examples use the modern docker compose (V2) command. If your system only has the legacy binary, replace with docker-compose.
# CPU / Ollama
docker compose up

# GPU / vLLM (requires NVIDIA runtime)
docker compose --profile gpu up

# + Prometheus and Grafana
docker compose --profile obs up

# Full local stack (Ollama + observability)
docker compose --profile ollama --profile obs up

# GPU (Linux + NVIDIA) — uses separate compose file
docker compose -f docker-compose.gpu.yml up
17.2 Deployment profiles
| Profile | Auth | mTLS | State | Use for |
|---|---|---|---|---|
| Local Developer | Optional | No | SQLite | localhost only |
| Single GPU Workstation | Recommended | No | SQLite | One-machine GPU server |
| Team Server | Required | Required | Postgres | Shared team deployment |
| Enterprise | IAM + scopes | Required | Postgres | Custom backends, strict audit |
17.3 Environment configuration
Generate your encryption key before deploying — it must be a 32-byte value encoded as 64 hex characters. Run either command once and store the output:
# Option 1 — Python (no extra dependencies)
python3 -c "import os; print(os.urandom(32).hex())"

# Option 2 — OpenSSL
openssl rand -hex 32

# Either prints a 64-character hex string, e.g.:
# a3f2c1e8b7d4509261af3e2c84b19d07f6a5c3e1b8294d6072f1e3a5c8b2d490
Keep this value secret and never commit it to version control. Rotate it by generating a new key, re-encrypting state, and restarting. Encryption uses AES-256-GCM; the key is loaded at startup from the named environment variable.
AUA_ENCRYPTION_KEY=<64-char hex string from above>
AUA_ADMIN_TOKEN=aua_tk_...
SWE_API_KEY=...
POSTGRES_URL=postgresql://aua:password@db:5432/aua_state
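A pre-deploy sanity check is cheap insurance: if the key does not decode to exactly 32 bytes, AES-256-GCM setup fails at startup. A minimal sketch:
import os
import sys

# Fail fast if AUA_ENCRYPTION_KEY is missing or malformed.
key_hex = os.environ.get("AUA_ENCRYPTION_KEY", "")
try:
    key = bytes.fromhex(key_hex)
except ValueError:
    sys.exit("AUA_ENCRYPTION_KEY is not valid hex")
if len(key) != 32:
    sys.exit(f"AUA_ENCRYPTION_KEY decodes to {len(key)} bytes; AES-256-GCM needs 32")
print("encryption key OK")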
security:
mtls: {enabled: true, cert_dir: /certs, auto_generate: false}
encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}
cors_origins: ["https://your-domain.com"]
state:
backend: postgres
url_secret: POSTGRES_URL
audit:
enabled: true
hash_chain: true
Reference
Reference CLI command groups
aua init
aua init <name> [--preset <name>] [--tier <name>] [--force]
aua serve
aua serve [--dry-run] [--tier <name>] [--reuse-running] [--with-ui] [--config <path>]
aua doctor
aua doctor [--strict] [--json] [--check-certs]
aua config
aua config validate [--config <path>]
aua config expand [--json]
aua config reload
aua models / fields / presets / defaults
aua models list | inspect <name>
aua fields list | inspect <name>
aua presets list | inspect <name>
aua defaults show [<category>]
aua token
aua token create --scope <scope> [--expires <duration>] [--name <label>]
aua token list [--json]
aua token revoke <token-id>
aua token inspect <token-id>
aua certs
aua certs generate [--ca-cert <path>] [--ca-key <path>]
aua certs inspect
aua eval
aua eval run --dataset <path> [--config <path>] [--json]
aua eval report <results.json>
aua eval compare --baseline <blue.json> --candidate <green.json>
aua corrections / dpo
aua corrections export --format jsonl [--redact]
aua dpo export --format preference-pairs [--redact]
aua extensions
aua extensions list
aua extensions inspect <name>
aua extensions test --kind <type> --import-path <path>
aua status / rollback
aua status [--once] [--json] [--url <url>] [--refresh <seconds>]
aua rollback [--specialist <name>] [--all] [--yes] [--no-restart]
aua guard
aua guard list [--json]
aua guard test --import-path <module:function> [--output <text>] [--domain <name>]
aua policy
aua policy list
aua policy validate <path.yaml>
aua policy apply <path.yaml> [--dry-run]
aua calibrate
aua calibrate --layer <1|2|3> [--force] [--dry-run] [--config <path>]
[--dataset <path>] # layer 1 only
[--output <path.jsonl>] # layer 3 only
[--min-pairs <N>] # layer 3 only (default: 10)
aua logs
aua logs sessions [--limit <N>] [--domain <name>] [--json]
aua logs assertions [--filter <key=value>] [--assertion <name>] [--tail <N>] [--json]
aua logs export [--output <path>] [--table <table>] [--limit <N>]
aua metrics
aua metrics --compare <window> # 7d, 30d, or YYYY-MM-DD:YYYY-MM-DD
[--metric <name>] # u_score | assertion_fail_rate | retry_rate
[--json]
Reference REST API endpoints
| Method | Endpoint | Scope | Description |
|---|---|---|---|
| GET | /health | — | Health check. Returns 200 when router is up. |
| GET | /version | — | Framework version, build info. |
| GET | /config | aua:config:read | Current config (secrets redacted). |
| POST | /config/reload | aua:config:write | Hot-reload config (SIGHUP equivalent). |
| GET | /status | aua:status | Live U scores, specialist health, routing stats. |
| POST | /query | aua:query | Route a query. Returns response + metadata. |
| POST | /query/stream | aua:stream | Streaming SSE query response. |
| GET | /corrections | aua:corrections:read | List accumulated corrections. |
| POST | /corrections | aua:corrections:write | Inject a manual correction. |
| POST | /deploy/green | aua:deploy | Register a GREEN candidate model. |
| POST | /deploy/rollback | aua:rollback | Rollback to previous BLUE. |
| GET | /metrics | — | Prometheus metrics endpoint (see 16.4). |
| GET | /metrics/cost | aua:status | GPU hours and cost per specialist + total. |
| GET | /sessions | aua:query | List all chat sessions. |
| POST | /sessions | aua:query | Create a new chat session. |
| GET | /sessions/{id} | aua:query | Get session metadata. |
| DELETE | /sessions/{id} | aua:query | Delete a session. |
| POST | /sessions/{id}/messages | aua:query | Send a message to a session. |
| POST | /sessions/{id}/stream | aua:stream | Send a streaming message to a session. |
| GET | /extensions | extensions:read | List registered plugins and hooks. |
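A minimal session round-trip against the table above. The Bearer header is optional until auth is enabled, and the field names below (the session "id" key, the "message" payload) are assumptions rather than a pinned schema, so check your deployment's actual responses:
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer aua_tk_..."}  # optional while auth is disabled

# Create a session, send one message, then delete the session.
session = requests.post(f"{BASE}/sessions", headers=HEADERS, json={}).json()
session_id = session["id"]  # assumed response key

reply = requests.post(
    f"{BASE}/sessions/{session_id}/messages",
    headers=HEADERS,
    json={"message": "Write binary search in Python"},  # assumed payload shape
).json()
print(reply)

requests.delete(f"{BASE}/sessions/{session_id}", headers=HEADERS)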
Standard error response
{
"error": "AUA_SPECIALIST_TIMEOUT", // stable AUA_* error code
"message": "Specialist swe timed out after 30s",
"trace_id": "01HXYZ...",
"details": {"specialist": "swe", "timeout_seconds": 30}
}
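Because the error code is stable, clients can branch on it instead of parsing message text. A sketch of a retry wrapper keyed on AUA_SPECIALIST_TIMEOUT; the retry count and backoff are illustrative policy, not framework behaviour:
import time
import requests

def query_with_retry(payload: dict, retries: int = 2) -> dict:
    """POST /query, retrying only on the stable timeout error code."""
    for attempt in range(retries + 1):
        resp = requests.post("http://localhost:8000/query", json=payload, timeout=60)
        if resp.ok:
            return resp.json()
        body = resp.json()
        if body.get("error") != "AUA_SPECIALIST_TIMEOUT" or attempt == retries:
            raise RuntimeError(
                f"{body.get('error')}: {body.get('message')} (trace_id={body.get('trace_id')})"
            )
        time.sleep(2 ** attempt)  # simple exponential backoff

print(query_with_retry({"query": "Write binary search in Python", "session_id": "qs-demo"}))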
linkedin.com/in/praneethtota · Code: GPL-3.0 · Docs: CC BY 4.0