Tutorial · v1.0.0

Your LLM makes the same mistake twice.
AUA makes sure it doesn't make it three times.

Most frameworks give you a model call. AUA gives you a control layer around that call — routing, scoring, correction, and policy enforcement that runs on every query and gets smarter over time.

Here's the problem this framework exists to solve. You deploy an LLM. It gives a wrong answer on Tuesday. You notice on Thursday. You add it to the system prompt on Friday. Next Tuesday — different user, same wrong answer. The prompt didn't stick, or the context window dropped it, or a slightly different phrasing triggered a different path. The error lives.

AUA closes that loop without waiting for a new model release:

Right now — every query

Routes to the right specialist. Scores the response with a utility function. Injects prior verified corrections into the context so past mistakes don't repeat. Enforces your policy — blocking bad output and retrying before the user ever sees it.

Over time — across sessions

Accumulates what the model consistently gets wrong. Tracks which sessions followed your policy perfectly. Exports those gold-standard sessions as DPO training pairs — ready to fine-tune the model so the corrections become permanent.

You define — what good means

Write a Policy. Say what must never appear (BLOCKING). Say what you want to see rewarded (INFO, with an E-score bonus). The framework enforces it on every call, tracks adherence over every session, and uses your policy as a curriculum for the next fine-tuning cycle.

It's designed like Django: you get a working system in five commands, and you can customise routing thresholds, utility weights, arbiter behaviour, correction stores, model backends, hooks, middleware, and deployment policy without touching framework internals. The quickstart below takes about five minutes. Parts 10–12 show how to teach the framework what good output looks like and watch it improve over time.


5-minute quickstart

Mac / Apple Silicon prerequisites. The macbook tier uses Ollama to serve models locally. Install and start it before running aua serve:

brew install ollama
ollama serve &                     # start in background
ollama pull qwen2.5-coder:7b       # ~4 GB — main coding specialist
ollama pull qwen2.5:7b             # ~4 GB — math specialist
ollama pull qwen2.5:3b             # ~2 GB — arbiter
ollama list                        # confirm all three are present

aua doctor will detect a missing Ollama and tell you exactly what to install. aua serve does not install Ollama automatically.

Five commands. A live routing endpoint. No GPU required to start.

bash

# 1. Install
pip install adaptive-utility-agent

# 2. Scaffold — Mac/CPU uses Ollama; swap --tier for GPU
aua init my-aua-project --preset coding --tier macbook
cd my-aua-project

# 3. Validate setup
aua doctor

# 4. Start
aua serve

# 5. First query (new terminal)
# Auth is disabled by default — the Authorization header is optional until you enable it
curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Write binary search in Python", "session_id": "qs-demo"}' \
  | python3 -m json.tool

What is now running: A multi-specialist LLM router with utility scoring, contradiction detection, assertions store, rate limiting, structured logging, and a Prometheus metrics endpoint — all from one command. Read on to understand each piece.

Part 1 Install & scaffold ~8 min

1.1 Python version

Python 3.10, 3.11, or 3.12 is required. If you use pyenv, run pyenv local 3.11.10 before installing; otherwise aua may not appear on your PATH.

bash

# Runtime only (Ollama / CPU)
pip install adaptive-utility-agent

# GPU backend (Linux + CUDA required)
pip install "adaptive-utility-agent[vllm]"

# Dev tools — tests, linting, type checks
pip install "adaptive-utility-agent[dev]"

# Verify
aua --version
aua, version 1.0.0

1.2 Scaffold a project

Pick the tier that matches your hardware. Pick the preset that matches your domain. Together they set models, field weights, routing thresholds, and observability defaults.

Tier	Hardware	Backend	Notes
`macbook`	Mac M-series / Intel	Ollama	Best starting point
`single-4090`	1× RTX 4090 24 GB	vLLM AWQ	Production-grade
`quad-4090`	4× RTX 4090	vLLM AWQ	One GPU per specialist
`a100-cluster`	1× A100 80 GB	vLLM fp16	Highest accuracy

Preset	Fields configured	Use for
`coding`	software_engineering	Code generation, dev tools
`math`	mathematics	Proofs, computation
`research`	general, mathematics	Research assistance
`medical-safe`	medicine (c_min=0.95)	Medical Q&A with abstention
`legal-safe`	law (c_min=0.85)	Legal Q&A with abstention
`generalist`	software_engineering, mathematics, general	Multi-domain assistant

bash

aua init my-aua-project --preset coding --tier macbook
cd my-aua-project

# See what init created
aua config expand

1.3 Validate and start

Auth behavior. By default, auth is disabled — the Authorization header is optional and all endpoints are open. This is fine for local development. To enable auth:

aua token create --scope aua:admin --expires 30d
export AUA_TOKEN="aua_tk_..."     # then include in curl: -H "Authorization: Bearer $AUA_TOKEN"

Examples throughout this tutorial show Authorization: Bearer $AUA_TOKEN. On a local dev install without auth enabled, you can omit that header entirely.

bash

aua doctor                 # PASS / FAIL / WARN per check, with fixes
aua doctor --strict        # warnings as failures — use in CI
aua doctor --json          # machine-readable output

aua serve                  # start specialists + router + arbiter
aua serve --with-ui        # also start Chat UI at :3001 (see the note in Part 9.1)
aua serve --dry-run        # print commands without executing

What you can build with this

A working multi-model AI system running locally in under ten minutes — one command to scaffold, one to serve.
A project that's ready to customise: config, eval folder, and .gitignore all in place.
A pre-flight check (aua doctor) you can drop into CI to catch config problems before they reach production.

Part 2 shows how to wire in any model — from a frontier API to a 1.5B model on a laptop — and tell the framework what you want it to optimize for.

Part 1 done. You have a running AUA router. Part 2 explains specialists and fields.

Part 2 Models & fields ~12 min

2.1 What's in aua_config.yaml

aua_config.yaml — macbook tier, coding preset

aua:
  version: "0.5"   # version field generated by aua init — do not edit manually
  mode: local
  backend: ollama

specialists:
  - name: swe
    model: qwen-coder-7b-awq    # registry alias → full model ID
    port: 11434
    field: software_engineering
    gpu: 0

arbiter:
  model: qwen2.5:3b
  port: 11434

router:
  port: 8000
  single_domain_threshold: 0.75
  fanout_threshold: 0.30

security:
  cors_origins: ["http://localhost:3001"]  # Chat UI port; Grafana is on :3000

state:
  backend: sqlite
  path: .aua/state/aua.db

2.2 Model registry — inspect aliases

bash

aua models list
NAME                   PROVIDER   BACKEND   VRAM
qwen-coder-7b-awq      ollama     ollama    ~6 GB
qwen-math-7b-awq       ollama     ollama    ~6 GB
qwen-14b-awq           vllm       vllm      ~12 GB
llama3-8b              ollama     ollama    ~6 GB
...

aua models inspect qwen-coder-7b-awq

2.3 Field registry — weights and thresholds

bash

aua fields list
aua fields inspect software_engineering

Field	w_e	w_c	w_k	c_min	Penalty
`surgery`	0.20	0.70	0.10	0.95	10×
`aviation`	0.20	0.70	0.10	0.95	10×
`law`	0.30	0.60	0.10	0.85	5×
`mathematics`	0.50	0.40	0.10	0.75	3×
`software_engineering`	0.55	0.35	0.10	0.70	2×
`creative_writing`	0.80	0.05	0.15	0.05	1×

What you can build with this

Swap any model in or out without changing application code — frontier API, 7B local, or a tiny 1.5B model for fast low-stakes queries.
Tell the framework what matters for your domain: accuracy (w_c), output quality (w_e), or exploration (w_k).
Set how fast domain knowledge decays — fast for security practices, slow for physics principles — so the system stays calibrated over time.
Turn off exploration entirely for safety-critical domains so routing is always consistent and predictable.

Part 3 shows how the routing decision actually gets made — and gives you the knobs to control how aggressively the system compares specialists.

Part 2 done. You understand specialists and fields. Part 3 shows how routing decisions are made.

Part 3 Routing & utility ~15 min

3.1 The routing pipeline

Every query follows this path: middleware → session lookup → correction retrieval → field classifier → routing decision → specialist calls → utility scoring → arbiter (if needed) → hooks → response.

Mode	Trigger	What happens
single	One field above `single_domain_threshold`	One specialist call, utility scored
fanout	Two+ fields above `fanout_threshold`	All qualifying specialists called; best U wins
arbiter	Fanout returned contradictory answers	Arbiter resolves; correction stored; both models updated

3.2 The utility function

Every candidate response gets a single utility score before it reaches the user. The score combines three things:

How useful the answer appears — does it correctly address the query for this domain?
How consistent it is with prior verified knowledge — does it contradict things the system already knows?
Whether exploring this area is valuable — is this a domain where the system has low confidence and should weight novelty?

In practice, you rarely touch the formula directly. The defaults work well for most domains. You tune it when you want stricter answer quality (raise w_e), more caution (raise w_c), or more exploration (raise w_k).

The formal expression:

U = w_e(f)·E + w_c(f)·C + w_k(f)·K

E (Efficacy) — Mann-Whitney dominance probability over prior outputs [0, 1]
C (Confidence) — Kalman-filtered internal consistency, penalized per contradiction [0, 1]
K (Curiosity) — UCB exploration bonus for novel domains [capped at 50% of U]
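
To make the formula concrete, here is a minimal sketch that evaluates U for a given field. The weights are copied from the field registry table in Part 2; the `FieldWeights` class and `utility` function are hypothetical names, and the way the 50% cap is applied (curiosity may contribute at most half of the final U) is one plausible reading of the bracketed note, not a confirmed implementation detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldWeights:
    w_e: float  # weight on Efficacy
    w_c: float  # weight on Confidence
    w_k: float  # weight on Curiosity


# Values taken from the field registry table in Part 2.
WEIGHTS = {
    "software_engineering": FieldWeights(0.55, 0.35, 0.10),
    "mathematics": FieldWeights(0.50, 0.40, 0.10),
    "surgery": FieldWeights(0.20, 0.70, 0.10),
}


def utility(field: str, e: float, c: float, k: float) -> float:
    """U = w_e*E + w_c*C + w_k*K, with the curiosity term capped."""
    w = WEIGHTS[field]
    base = w.w_e * e + w.w_c * c
    # Assumed cap: curiosity <= base means curiosity is at most 50% of U.
    curiosity = min(w.w_k * k, base)
    return base + curiosity
```

For example, `utility("software_engineering", 0.7, 0.8, 0.2)` evaluates 0.385 + 0.28 + 0.02, which is about 0.685. The same inputs scored under `surgery` weight confidence far more heavily.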

3.3 A full query response

bash

curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -d '{
    "query": "Write binary search in Python. State the time complexity.",
    "session_id": "demo-session"
  }' | python3 -m json.tool

Response

{
  "session_id": "demo-session",
  "trace_id": "01HXYZ...",
  "request_id": "req_abc123",
  "routing_mode": "single",
  "primary_field": "software_engineering",
  "response": "...",
  "u_score": 0.641,
  "confidence": 0.76,
  "contradictions_detected": 0,
  "corrections_injected": 1,
  "latency_ms": 287.4,
  "cost_estimate_usd": 0.00012
}
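
If you consume this response from code rather than from curl, a small parser is usually enough. The sketch below reads only fields shown in the sample response above; the `summarize` helper and its `min_u` alert threshold are hypothetical and not part of the AUA API.

```python
import json


def summarize(raw: str, min_u: float = 0.5) -> dict:
    """Extract the routing and scoring fields from a /query response body."""
    body = json.loads(raw)
    return {
        "field": body["primary_field"],
        "mode": body["routing_mode"],
        "u_score": body["u_score"],
        # min_u is an illustrative threshold chosen for this sketch.
        "low_utility": body["u_score"] < min_u,
        "had_contradictions": body["contradictions_detected"] > 0,
        "corrections_injected": body["corrections_injected"],
    }
```

Logging this summary per request gives you a simple data set for the routing analytics mentioned later in this part.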

3.4 Live status and U scores

bash

aua status                 # auto-refreshing terminal UI
aua status --once          # single snapshot
aua status --json          # machine-readable

curl http://localhost:8000/status | python3 -m json.tool

What you can build with this

A system that routes each question to the right specialist, scores every answer with a real number, and shows you exactly why — not vibes.
Tune how aggressively the system compares multiple specialists vs. committing fast to a single one.
Force a specific specialist for known query types — useful for dedicated deployments where you already know the domain.
Read U scores in API responses to build your own routing analytics — low scores on a domain are an early signal the specialist needs attention.

Part 4 adds persistent memory: the framework learns from its mistakes and stops making the same error twice.

Part 3 done. You understand routing and utility scoring. Part 4 covers the Arbiter.

Part 4 Arbiter & corrections ~15 min

When fanout routing produces contradictory responses, the Arbiter runs four checks, issues a verdict, injects correction signals, and stores the verified claim in the assertions store.

4.1 The four arbitration checks

Check	Weight	What it detects
Logical	0.30	Output contradicts its own premises
Mathematical	0.40	Complexity or numerical claims provably wrong
Cross-session	0.20	Contradicts a prior verified assertion
Empirical	0.10	External ground truth check (v2.0)

4.2 The four verdict cases

Case	Meaning	Action
Case 1	A correct, B wrong	Correct B, reinforce A, store claim
Case 2	B correct, A wrong	Correct A, reinforce B, store claim
Case 3	Both wrong	Correct both + open curiosity gap bonus
Case 4	Inconclusive	Flag for external escalation, hedge response
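
The four cases reduce to a small truth table. The sketch below is illustrative only: `Case` and `classify` are hypothetical names, `None` stands for "the checks could not decide", and treating "both correct" as inconclusive is an assumption, since two correct answers would not normally count as a contradiction.

```python
from enum import Enum
from typing import Optional


class Case(Enum):
    CASE_1 = "case_1"  # A correct, B wrong
    CASE_2 = "case_2"  # B correct, A wrong
    CASE_3 = "case_3"  # both wrong
    CASE_4 = "case_4"  # inconclusive


def classify(a_correct: Optional[bool], b_correct: Optional[bool]) -> Case:
    """Map per-output correctness to a verdict case (None = undecided)."""
    if a_correct is None or b_correct is None:
        return Case.CASE_4
    if a_correct and not b_correct:
        return Case.CASE_1
    if b_correct and not a_correct:
        return Case.CASE_2
    if not a_correct and not b_correct:
        return Case.CASE_3
    # Assumption: both correct means no real contradiction, so do not pick a side.
    return Case.CASE_4
```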

4.3 Using the Arbiter directly

Python

# ArbiterAgent runs automatically inside aua serve.
# Use directly in Python for testing or custom workflows:
from aua import ArbiterAgent, AssertionsStore

store = AssertionsStore()
arbiter = ArbiterAgent(store)

verdict = arbiter.arbitrate(  # sync — call directly
    subject="bubble_sort_complexity",
    domain="software_engineering",
    output_A="Bubble sort is O(n) average case.",
    output_B="Bubble sort is O(n²) average case.",
    field_penalty_multiplier=2.0,
)

print(verdict.case.value)        # "case_2" (B correct, A wrong)
print(verdict.verified_claim)    # "Bubble sort is O(n²) average case."
print(verdict.external_response) # safe to return to user

4.4 The assertions store — decay classes

Class	Decay	Used for
A	Never	Mathematical proofs, algorithm complexity
B	10 years	Classical physics, structural engineering
C	3 years	Medicine, law, architecture
D	6 months	Security CVEs, clinical guidelines, ML benchmarks
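
One way to picture the decay classes is as a lifetime lookup. This sketch assumes a hard cutoff after the listed period and approximates years as 365 days and six months as 182 days; the real store may decay confidence gradually instead. `LIFETIMES` and `is_expired` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Lifetimes from the table above; None means the assertion never decays.
LIFETIMES: Dict[str, Optional[timedelta]] = {
    "A": None,
    "B": timedelta(days=365 * 10),
    "C": timedelta(days=365 * 3),
    "D": timedelta(days=182),
}


def is_expired(decay_class: str, stored_at: datetime, now: datetime) -> bool:
    """Return True once an assertion has outlived its class lifetime."""
    lifetime = LIFETIMES[decay_class]
    if lifetime is None:
        return False
    return now - stored_at > lifetime
```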

4.5 Manual corrections via REST

bash

curl -X POST http://localhost:8000/corrections \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "subject": "heapsort_complexity",
    "domain": "software_engineering",
    "claim": "Heapsort is O(n log n) worst-case, O(1) extra space.",
    "confidence": 0.99
  }'

Corrections are global in v1.0. A correction stored via POST /corrections is a verified fact about the world — it is injected into every future query on that subject, across all users and sessions. For internal tools, dedicated agents, and single-tenant deployments this is the intended behaviour. For multi-tenant products where users need isolated correction contexts, per-user scoping is a v1.1 roadmap item.
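
The same request can be built from Python with the standard library. This is a sketch under the assumptions already visible in the curl example (endpoint path, JSON field names, bearer token header); `build_correction_request` is a hypothetical helper, and the range check on `confidence` is an assumption rather than documented server behaviour.

```python
import json
import urllib.request
from typing import Optional


def build_correction_request(
    subject: str,
    domain: str,
    claim: str,
    confidence: float,
    token: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST /corrections request."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence is assumed to lie in [0, 1]")
    payload = {
        "subject": subject,
        "domain": domain,
        "claim": claim,
        "confidence": confidence,
    }
    headers = {"Content-Type": "application/json"}
    if token:  # auth is optional on a default local install
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url}/corrections",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Send it with `urllib.request.urlopen(request)` once the router is running. Because the correction is global, it is worth reviewing the claim text before posting.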

What you can build with this

A system that catches its own contradictions, stores what it learns, and injects that knowledge into every future query on that subject.
A verified fact store for your domain — inject something once and every user benefits from it on every future query.
Audit what the system currently 'knows' with GET /corrections — a clear, exportable record of every stored fact.
Control how long knowledge lives: permanent for proved facts (Class A), months for fast-moving fields like security (Class D).

Note: in v1.0, corrections are global across all users. This is intentional for internal tools; per-user scoping is planned for v1.1. Part 5 shows how to upgrade models safely without breaking what's working.

Part 4 done. You understand the Arbiter and corrections. Part 5 shows how U scores gate model promotion.

Part 5 Blue-green deployment ~20 min

BLUE is in production. GREEN is the candidate running on canary traffic. When GREEN's mean U score exceeds BLUE's by at least delta over at least T_min interactions, GREEN is promoted automatically.

5.1 Promotion thresholds

aua_config.yaml

blue_green:
  swe:
    delta: 0.025    # GREEN must beat BLUE by at least 0.025 U (2.5 points on the 0-1 scale)
    T_min: 10       # need at least 10 canary interactions
    tau: 0.20       # softmax temperature for traffic split
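
The three parameters can be read as a gate plus a traffic split. The sketch below is an illustration, not the framework's code: both function names are hypothetical, the gate uses "greater than or equal to delta", and the split assumes a two-way softmax over the mean U scores with temperature tau, which the config comment suggests but does not define.

```python
import math


def should_promote(
    green_u_mean: float,
    blue_u_mean: float,
    n_interactions: int,
    delta: float = 0.025,
    t_min: int = 10,
) -> bool:
    """Promotion gate: enough canary traffic and a large enough U lead."""
    enough_data = n_interactions >= t_min
    clear_lead = (green_u_mean - blue_u_mean) >= delta
    return enough_data and clear_lead


def green_traffic_share(
    green_u_mean: float, blue_u_mean: float, tau: float = 0.20
) -> float:
    """Assumed two-way softmax over mean U; a lower tau sharpens the split."""
    # softmax over two values equals a logistic function of their difference
    return 1.0 / (1.0 + math.exp((blue_u_mean - green_u_mean) / tau))
```

With equal scores the split is even. As GREEN pulls ahead, its share of traffic grows smoothly rather than jumping to 100%.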

5.2 Promote and rollback

bash

# Trigger blue-green promotion check via REST API
curl -X POST http://localhost:8000/deploy/green \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -d '{"specialist": "swe", "new_model": "qwen2.5-coder:14b"}'

# Check status
aua status --once

# Roll back to previous BLUE
aua rollback --specialist swe
aua rollback --specialist swe --yes     # skip confirmation
aua rollback --no-restart               # update config only, no restart
aua rollback --all --yes                # roll back all specialists

5.3 Using BlueGreenDeployment in Python

Python

import asyncio

from aua import BlueGreenDeployment
from aua.config import load_config

config = load_config("aua_config.yaml")
bg = BlueGreenDeployment(config, specialist_name="swe")

bg.register_green("models/swe-green-v2/")

# evaluate() is async; use asyncio.run() when calling from a sync context
summary = asyncio.run(bg.evaluate(n_queries=10))
print(f"GREEN U mean: {summary.green_u_mean:.3f}")
print(f"BLUE  U mean: {summary.blue_u_mean:.3f}")

if bg.should_promote(summary):
    bg.promote()
    print("GREEN promoted to BLUE")

5.4 The promotions log

Every promotion is recorded atomically to .aua/state/promotions.jsonl with a UUID, timestamp, and U scores. File-locked to prevent concurrent corruption.
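
A locked append to a JSONL file can be sketched with the standard library. This is an illustration of the technique, not AUA's implementation: `record_promotion` and the record's field names are hypothetical, and `fcntl.flock` is POSIX-only, so the sketch will not run on Windows.

```python
import fcntl
import json
import os
import uuid
from datetime import datetime, timezone


def record_promotion(path: str, specialist: str, blue_u: float, green_u: float) -> dict:
    """Append one promotion record to a JSONL log under an exclusive lock."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "specialist": specialist,
        "blue_u_mean": blue_u,
        "green_u_mean": green_u,
    }
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        # Block until no other process holds the lock on this file.
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make sure the line reaches disk
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return record
```

Writing one complete line per record keeps the log readable with any JSONL tool even if a process dies between promotions.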

What you can build with this

Upgrade your AI models the way you upgrade software — safely, with a rollback button and a promotion gate.
Test a new model on real traffic before it touches production: deploy as GREEN, evaluate, promote only if U score delta passes the threshold.
Revert a bad upgrade in one command — no redeployment, no config archaeology.
Set different promotion thresholds per specialist — conservative for customer-facing models, aggressive for internal experimental ones.

Part 6 makes the whole system easier to operate — config changes that take effect in seconds without restarting anything.

Part 5 done. Part 6 covers the config system — hot reload, validation, and config commands.

Part 6 Config system ~15 min

6.1 Config commands

bash

aua config validate           # strict schema check — catches typos, dupe ports, bad ranges
aua config expand             # full resolved config with all defaults filled in (secrets redacted)
aua presets list              # list all built-in presets
aua presets inspect coding    # full preset config

6.2 Hot reload — no restart needed

Not all config changes take effect without restarting. The rule:

Hot-reloadable (aua config reload)	Requires restart (aua serve)
Routing thresholds	Model path or name
Utility weights	Specialist port
Promotion delta / T_min / tau	Backend (vllm ↔ ollama)
Logging level	GPU assignment
CORS origins	mTLS certificate paths
Rate limits	New specialist added/removed
Arbiter thresholds	

bash

aua config reload             # sends SIGHUP to running router
kill -HUP $(cat .aua/pids/router.pid)  # same effect
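
Before applying a config change, it can help to know whether a reload is enough. The sketch below compares two flattened configs and classifies the change. It is a hypothetical helper, not an AUA command: the dotted key names are derived from the sample aua_config.yaml in Part 2, and only restart-requiring settings visible in that sample are covered (mTLS paths are omitted because their key names are not shown).

```python
from typing import Dict, List, Tuple

# Assumption: anything under these dotted prefixes needs a full restart,
# which covers model, port, GPU, backend, and adding/removing specialists.
RESTART_PREFIXES = ("aua.backend", "specialists.")


def plan_change(old: Dict[str, object], new: Dict[str, object]) -> Tuple[str, List[str]]:
    """Return ("none" | "reload" | "restart", sorted list of changed keys)."""
    changed = sorted(
        key for key in set(old) | set(new) if old.get(key) != new.get(key)
    )
    if not changed:
        return "none", []
    if any(key.startswith(RESTART_PREFIXES) for key in changed):
        return "restart", changed
    return "reload", changed
```

A "reload" result corresponds to running aua config reload; a "restart" result means stopping and rerunning aua serve.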

6.3 Config versioning and migration

bash

# Config versioning: the aua.version field tracks schema compatibility.
# v1.0 does not provide automatic migration — update config manually if upgrading.
aua config validate   # validates your config against the current schema
aua config expand     # shows full resolved config with all defaults applied

What you can build with this

Change routing thresholds, utility weights, and CORS settings on a live system — no downtime, no restarts.
Version your config in git and use aua config validate as a pre-commit hook so schema errors never reach production.
See exactly what the running system is doing with aua config expand — no surprises from implicit defaults.
A config-driven system your whole team can modify safely, not one that lives in a single engineer's head.

Part 7 shows how every mistake the system makes in production automatically becomes training material for the next version.

Part 6 done. Part 7 covers the correction loop and DPO pair export.

Part 7 Correction loop & DPO export ~15 min

Every contradiction the Arbiter resolves produces a DPO training pair. The correction loop accumulates these pairs and exports them for fine-tuning your specialists.

7.1 Export corrections via CLI

bash

# Export all verified corrections as JSONL
aua corrections export --format jsonl

# Export as preference pairs for DPO training
aua dpo export --format preference-pairs

# With redaction (remove PII from prompts)
aua dpo export --format preference-pairs --redact

7.2 DPO pair format

v1.0 DPO pair status. In v1.0, DPO pairs are generated from corrections with a confirmed chosen answer. The rejected side is populated when the Arbiter identifies a clearly wrong response; for corrections injected manually (e.g. via POST /corrections), the rejected field is empty and must be filled before training. Case 4 (inconclusive arbiter outcomes) never produces a pair. Full chosen+rejected pair generation is a v1.1 item.

dpo_pairs_*.jsonl

{
  "query": "bubble_sort_complexity",
  "chosen": "Bubble sort is O(n²) average case.",
  "rejected": "Bubble sort is O(n) average case.",
  "field": "software_engineering",
  "utility_chosen": 0.72,
  "utility_rejected": 0.41,
  "correction_ids": ["corr_abc123"],
  "trace_id": "01HXYZ..."
}
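
Because manually injected corrections export with an empty rejected field, it is worth separating trainable pairs from incomplete ones before training. A minimal sketch, assuming one JSON object per line with the field names shown above; `split_trainable` is a hypothetical helper.

```python
import json
from typing import Iterable, List, Tuple


def split_trainable(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split JSONL pairs into (ready for DPO, missing chosen or rejected)."""
    ready: List[dict] = []
    incomplete: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        pair = json.loads(line)
        if pair.get("chosen") and pair.get("rejected"):
            ready.append(pair)
        else:
            incomplete.append(pair)
    return ready, incomplete
```

Pass an open file handle as `lines` to process an export in a single pass; the incomplete list is the set of pairs whose rejected side still needs to be filled in.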

7.3 Using CorrectionLoop in Python

Python

import asyncio
from aua import CorrectionLoop
from aua.config import load_config

config = load_config("aua_config.yaml")
loop = CorrectionLoop(config, router_url="http://localhost:8000")

async def main():
    pairs = await loop.collect_pairs(min_confidence=0.8)
    print(f"Collected {len(pairs)} pairs")
    summary = loop.export_pairs(pairs, output_dir="dpo_pairs")
    print(f"Exported to: {summary.output_path}")

asyncio.run(main())

7.4 Using field penalty weights in training

Python

from aua import FIELD_CONFIGS

# `pairs` is the list returned by loop.collect_pairs() in section 7.3
for pair in pairs:
    cfg = FIELD_CONFIGS.get(pair.field, FIELD_CONFIGS["general"])
    loss_weight = cfg.penalty_multiplier  # 2× for SWE, 10× for surgery
    # pass loss_weight to your DPO trainer's per-sample weight

What you can build with this

Every mistake your AI makes in production automatically becomes training data for the next version, without manual labelling.
A fine-tuning dataset built from real traffic: the things your actual users asked, and what the right answer was.
Domain-filtered DPO pairs so coding corrections train the coding specialist and don't pollute math training data.
A closed loop: production error → correction stored → DPO pair exported → model fine-tuned → mistake doesn't recur.

Part 8 adds a quality gate — catch regressions automatically before a new model ever reaches users.

Part 7 done. Part 8 covers the evaluation harness — automated quality measurement and regression detection.

Part 8 Eval harness ~20 min

The eval harness routes YAML test datasets through the live framework, scores outputs with the utility function, detects regressions, and produces structured JSON reports. It's the gate for blue-green promotion and CI.

8.1 Built-in smoke datasets

bash

ls evals/
coding_smoke.yaml   math_smoke.yaml    routing_smoke.yaml
correction_smoke.yaml   arbiter_smoke.yaml   safety_smoke.yaml

# Run the coding smoke suite
aua eval run --dataset evals/coding_smoke.yaml --config aua_config.yaml

# View the report
aua eval report .aua/evals/latest.json

# Compare blue vs green
aua eval compare --baseline blue --candidate green

8.2 Dataset format

Property checks run against the response text. Supported check types in v1.0:

Property key	Value	What it checks
`contains`	`string`	Case-insensitive substring match
`contains_any`	`[string, ...]`	At least one substring present
`not_contains`	`string`	Substring must NOT appear
`min_length`	`int`	Response character count ≥ N
`expected_domain`	`string`	Routing domain must equal this
`expected_domain_any`	`[string, ...]`	Routing domain must be one of these

Regex, LLM-judge, and custom Python validators are not supported in v1.0.
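
The six check types are simple enough to sketch in a few lines, which can help when reasoning about why a case passed or failed. This is not the harness's code: `check_property` is a hypothetical function, and applying case-insensitive matching to `contains_any` and `not_contains` is an assumption (the table states it only for `contains`).

```python
from typing import Any


def check_property(key: str, value: Any, response: str, domain: str) -> bool:
    """Evaluate one expected property against a response and its routed domain."""
    text = response.lower()
    if key == "contains":
        return str(value).lower() in text
    if key == "contains_any":
        return any(str(v).lower() in text for v in value)
    if key == "not_contains":
        return str(value).lower() not in text
    if key == "min_length":
        return len(response) >= int(value)
    if key == "expected_domain":
        return domain == value
    if key == "expected_domain_any":
        return domain in value
    raise ValueError(f"unsupported property key: {key}")
```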

evals/coding_smoke.yaml

name: coding_smoke
field: software_engineering
cases:
  - id: binary_search
    prompt: "Implement binary search in Python. State time complexity."
    expected_properties:
      - contains: "O(log n)"
      - contains: "def binary_search"

  - id: bubble_sort_complexity
    prompt: "What is the average-case time complexity of bubble sort?"
    expected_properties:
      - contains: "O(n²)"
8.3 Eval report

bash

aua eval report .aua/evals/latest.json

Output

Eval run: coding_smoke  2026-05-11T14:30:22Z
Cases:     8 total · 7 passed · 1 failed
U mean:    0.638 (baseline: 0.601)  ▲ +6.2%
Regressions: 0

FAILED:
  merge_sort_stability — expected "stable" in response
  U score: 0.41 (threshold: 0.45)
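
The percentage next to the U mean appears to be a relative change against the baseline, not an absolute difference in U. A quick check of the arithmetic under that reading (`relative_change_pct` is a hypothetical helper):

```python
def relative_change_pct(current: float, baseline: float) -> float:
    """Relative change of current versus baseline, in percent."""
    return (current - baseline) / baseline * 100.0


# Figures from the sample report: U mean 0.638 against a baseline of 0.601.
change = relative_change_pct(0.638, 0.601)
print(f"{change:+.1f}%")  # prints +6.2%
```

The absolute difference here is 0.037 U, so keep the two readings apart when comparing against a promotion delta expressed in U.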

8.4 CI integration

.github/workflows/eval.yml

- name: Run AUA eval
  run: |
    aua eval run \
      --dataset evals/coding_smoke.yaml \
      --config aua_config.yaml \
      --json > .aua/evals/ci_result.json
  # check exit code: 0 = pass, 1 = failure

What you can build with this

A quality gate that catches AI regressions the same way unit tests catch code bugs — automatically, on every model change.
Promote new models with a number, not a feeling: aua eval compare gives you a quantitative diff between baseline and candidate.
Custom eval datasets for your domain — not generic benchmarks, but the exact questions and quality criteria that matter to your users.
CI integration so any model change that causes a quality drop fails the pipeline before it touches anyone.

Part 9 gives you a full UI to demo all of this — a private ChatGPT-like product backed entirely by your own models.

Part 8 done. Part 9 covers the Chat UI and how to use the Framework Debugger.

Part 9 Chat UI ~15 min

AUA ships a Next.js 14 Chat UI at apps/aua_chat/. It requires Node.js 18+ and runs as a separate process from the AUA router.

9.0 Prerequisites

bash — check Node.js

node --version   # must be 18+
npm --version

# Install Node.js if missing:
brew install node          # macOS
# or download from https://nodejs.org

9.1 Starting the full stack

Package user vs. repo contributor. aua init does not scaffold a Chat UI — the UI lives in the AUA source repo. Package users launch it through the CLI; repo contributors can run the Next.js dev server directly.

Package user — CLI launch (recommended)

Open two terminals:

Terminal 1 — AUA router

cd my-aua-project
aua serve --tier macbook        # Mac / Apple Silicon + Ollama
# aua serve                     # Linux / RTX 4090 + vLLM

Terminal 2 — Chat UI

aua ui                          # starts on http://localhost:3001
# Or combined: aua serve --tier macbook --with-ui

Repo contributor — Next.js dev server

If you have cloned the source repo and want to edit the UI:

Terminal 2 — Next.js dev server (source repo only)

cd Adaptive-Utility-Agent/apps/aua_chat
npm install          # first run only
npm run dev          # starts on http://localhost:3001

Open http://localhost:3001 — sign in with admin / aua-admin.

Local development credentials only. The default admin / aua-admin credentials are for local use. Change them via the AUA_UI_ADMIN_PASSWORD environment variable before exposing the UI beyond localhost. In production, disable the dev login and use token-based auth instead.

Note on aua serve --with-ui. This flag attempts to start the Chat UI automatically in the background. It works when npm is on your system PATH (standard Linux/Docker installs). On macOS with nvm or homebrew, node may not be on the PATH that background processes see, causing the UI to silently fail to start. If you see no Chat UI after --with-ui, use the two-terminal approach above — it always works. The UI log is at .aua/logs/ui.log if you want to diagnose the background start.

9.2 Three-zone layout

Zone	Contents
Left — Session sidebar	All sessions, search, new session
Center — Chat window	Messages, streaming responses, send bar
Right — Framework Debugger	Routing decision, utility breakdown, arbiter output, latency, cost, trace link

9.3 AUA Controls drawer

Click AUA Controls (left edge of the screen) to open the configuration drawer. Change routing thresholds, utility weights, arbiter policy, corrections, blue-green status, and observability settings — all without restarting. Uses aua config reload under the hood.

9.4 Chat Session API

bash — Session API (also used by the UI)

# Create a session
curl -X POST http://localhost:8000/sessions \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-coding-session"}'

# Post a message (streaming)
curl -X POST http://localhost:8000/sessions/{id}/stream \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain quicksort"}'

# List all sessions
curl http://localhost:8000/sessions \
  -H "Authorization: Bearer $AUA_TOKEN"

9.5 SSE streaming event types

Event	When fired
`route`	Routing decision made — field, mode, specialists
`specialist_start`	Specialist call begins
`chunk`	Each token streamed from specialist
`specialist_done`	U score, latency for this specialist
`arbiter_done`	Verdict case, corrections stored
`done`	Full response + metadata
`error`	AUA_* error code + trace ID
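
A client for the streaming endpoint needs to split the stream into these events. The sketch below parses standard Server-Sent Events framing (`event:` and `data:` lines, blank line between events). It assumes each data payload is a JSON object; the payload field names in the usage example are hypothetical, since the tutorial lists event names but not their schemas.

```python
import json
from typing import Iterator, Tuple


def parse_sse(stream_text: str) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) for each event in an SSE text stream."""
    normalized = stream_text.replace("\r\n", "\n")
    for block in normalized.split("\n\n"):
        event = "message"  # SSE default when no event: line is present
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            # Assumption: payloads are JSON objects.
            yield event, json.loads("\n".join(data_lines))
```

For example, joining the payloads of every `chunk` event reconstructs the streamed answer, while `route` and `specialist_done` events can drive a debugger view like the one in the UI.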

Framework Debugger tip: Every query in the UI shows the full routing trace — which specialist was called, intermediate U scores, whether the Arbiter fired, and a link to the OTEL trace in Jaeger or Tempo if observability is configured.

What you can build with this

A complete, private AI product — a chat interface backed entirely by your own models, no data leaving your environment.
A way to show stakeholders every routing decision in plain language: which specialist answered, what score it got, whether the Arbiter stepped in.
Adjust routing and config from the UI — no terminal, no restarts.
Everything from Parts 1–9 in a single interface: routing, corrections, blue-green status, and U scores, all visible at once.

Part 10 is where the framework starts shaping itself to your definition of good output — and your definition becomes a curriculum.

Part 10 Policies & Assertions — Design your AI over time ~25 min

This is the most powerful section of the tutorial. By the end, you'll understand how to teach the framework what "good output" means, and how it uses that definition to block bad responses in real time, track model reputation over sessions, and automatically identify gold-standard training data for fine-tuning.

The core idea. Instead of writing a long system prompt and hoping the model follows it, you write a Policy: a versioned, portable definition of what good output looks like. The framework enforces it in real time, tracks adherence over every session, and eventually makes the defined behavior permanent through fine-tuning. Your policy becomes the model's curriculum.

10.1 The three assertion levels

Every assertion has a level that determines what happens when it fires:

Level	What it does	Effect on U score
`BLOCKING`	Fails → error injected back into prompt → specialist retried up to `max_retries` (default 3). User never sees a response that violates this.	U penalty if all retries exhausted
`SOFT`	Fails → logged to assertion_events, response passes through. Use for guardrails you want to track without enforcing.	No U change — logged only
`INFO`	Always passes. When condition fires (returns a message), adds `+bonus` to the Efficacy (E) score. Use for positive/incentive assertions.	`E_final = min(1.0, E_base + bonus)`
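
The BLOCKING retry loop and the INFO bonus can be sketched independently of the framework. Everything here is illustrative: `run_with_blocking` and `apply_info_bonuses` are hypothetical names, checks take only the output string, "up to max_retries" is read as retries after the first attempt, and summing several bonuses before clamping is an assumption that extends the single-bonus formula in the table.

```python
from typing import Callable, List, Optional, Tuple

Check = Callable[[str], Tuple[bool, Optional[str]]]


def run_with_blocking(
    generate: Callable[[str], str],
    prompt: str,
    blocking: List[Check],
    max_retries: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (output, attempts); output is None if every attempt failed."""
    current = prompt
    attempts = 0
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        attempts += 1
        output = generate(current)
        results = [check(output) for check in blocking]
        errors = [str(msg) for ok, msg in results if not ok]
        if not errors:
            return output, attempts
        # Feed the failure back so the next attempt can correct it.
        current = f"{prompt}\n\nPrevious attempt was rejected: {'; '.join(errors)}"
    return None, attempts


def apply_info_bonuses(e_base: float, bonuses: List[float]) -> float:
    """E_final = min(1.0, E_base + bonus), with multiple bonuses summed."""
    return min(1.0, e_base + sum(bonuses))
```

SOFT assertions would sit outside this loop: their failures are logged and the response passes through unchanged.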

10.2 Writing your first assertion

mypackage/policies.py

from aua.guard import assertion, AssertionLevel

# ── Guardrail: block syntax errors from ever reaching the user ─────────────
@assertion(name="PythonSyntaxCheck", level=AssertionLevel.BLOCKING)
def validate_python_syntax(output: str, context: dict) -> tuple[bool, str | None]:
    """Blocks output if any Python code block contains syntax errors."""
    import ast, re
    blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
    if not blocks:
        return True, None   # no code block — pass through
    for block in blocks:
        try:
            ast.parse(block)
        except SyntaxError as e:
            return False, f"Syntax error at line {e.lineno}: {e.msg}"
    return True, None

# ── Guardrail: soft-flag refusals without blocking ─────────────────────────
@assertion(name="NoAIisms", level=AssertionLevel.SOFT)
def no_ai_isms(output: str, context: dict) -> tuple[bool, str | None]:
    """Soft-flags common 'AI-isms' like 'as an AI language model'."""
    phrases = ["as an ai", "as a language model", "i cannot help with"]
    found = next((p for p in phrases if p in output.lower()), None)
    if found:
        return False, f"AI-ism detected: '{found}'"
    return True, None

10.3 Positive assertions — rewarding gold-standard behaviour

Negative assertions block bad output. Positive assertions reward exceptional output — and this is what feeds the fine-tuning pipeline. Sessions where positive assertions fire get the highest U scores and are automatically selected as "chosen" in your DPO export.

mypackage/policies.py (continued)

@assertion(name="AnalogyBonus", level=AssertionLevel.INFO, bonus=0.10)
def reward_analogy(output: str, context: dict) -> tuple[bool, str | None]:
    """Rewards responses that use analogies to explain concepts."""
    phrases = ["like a", "similar to", "imagine a", "think of it as", "just like"]
    if any(p in output.lower() for p in phrases):
        return True, "Positive: analogy used for clarity"
    return True, None   # neutral — no bonus if condition not met

@assertion(name="SocraticEnding", level=AssertionLevel.INFO, bonus=0.08)
def reward_question_ending(output: str, context: dict) -> tuple[bool, str | None]:
    """Rewards responses that end with an engaging question."""
    if output.strip().endswith("?"):
        return True, "Positive: Socratic engagement"
    return True, None

@assertion(name="PythonSyntaxBonus", level=AssertionLevel.INFO, bonus=0.12)
def reward_clean_code(output: str, context: dict) -> tuple[bool, str | None]:
    """Rewards syntactically clean Python with a bonus (stack with syntax check)."""
    import ast, re
    blocks = re.findall(r"```python(.*?)```", output, re.DOTALL)
    if blocks:
        try:
            for b in blocks:
                ast.parse(b)
            return True, "Positive: clean executable Python"
        except SyntaxError:
            pass
    return True, None

Option B bonus cap. Each INFO assertion contributes its declared bonus independently. The sum is capped by max_total_bonus on the Policy (default 0.30), then hard-capped at 0.50. In a session where all three INFO assertions above fire, the bonuses sum to 0.10 + 0.08 + 0.12 = 0.30, exactly the default cap, so E rises by 0.30: a meaningful signal that the session is gold-standard.
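
The cap arithmetic can be checked by hand. The sketch below combines the two caps described above with the E_final = min(1.0, E_base + bonus) rule from the table in 10.1. The function name final_efficacy and the sample E_base values are made up for illustration; only the cap values come from the text.

```python
def final_efficacy(e_base: float, fired_bonuses: list[float],
                   max_total_bonus: float = 0.30, hard_cap: float = 0.50) -> float:
    """Sum the bonuses that fired, apply both caps, then clamp E to 1.0."""
    total = min(sum(fired_bonuses), max_total_bonus, hard_cap)
    return min(1.0, e_base + total)

# All three INFO assertions fire: 0.10 + 0.08 + 0.12 hits the default cap.
all_three = final_efficacy(0.60, [0.10, 0.08, 0.12])

# Only the analogy and clean-code bonuses fire: 0.22, below the cap.
two_only = final_efficacy(0.60, [0.10, 0.12])

# A policy that asks for max_total_bonus=0.80 is still limited by the hard cap.
loose_policy = final_efficacy(0.40, [0.30, 0.30], max_total_bonus=0.80)

# E can never exceed 1.0, whatever the bonus.
saturated = final_efficacy(0.90, [0.10, 0.12])

print(round(all_three, 2), round(two_only, 2), round(loose_policy, 2), saturated)
```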

10.4 Bundling into a Policy

A Policy is a versioned bundle that groups assertions, sets retry limits, and optionally shifts utility weights when active. Think of it as a Django settings.py for your AI's behaviour.

Python

from aua.policy import Policy

# Bundle guardrails + incentives into one named Policy
coding_policy = Policy(
    name="SafeCoding",
    version="1.0",
    max_retries=3,          # BLOCKING retries before giving up
    max_total_bonus=0.30,   # cap on total E bonus (Option B)
    utility_overrides={
        "w_k": 0.30,        # slightly raise curiosity weight for this policy
    }
)

# Add assertions — chaining supported
coding_policy.add(validate_python_syntax)   # BLOCKING
coding_policy.add(no_ai_isms)               # SOFT
coding_policy.add(reward_analogy)           # INFO +0.10
coding_policy.add(reward_clean_code)        # INFO +0.12

# Inspect before applying
print(coding_policy.summary())

10.5 YAML policy file (recommended for production)

policies/safe_coding.yaml

name: SafeCoding
version: "1.0"
max_retries: 3
max_total_bonus: 0.30
assertions:
  - import_path: mypackage.policies:validate_python_syntax
    # level defaults to what's declared on the @assertion decorator
  - import_path: mypackage.policies:no_ai_isms
  - import_path: mypackage.policies:reward_analogy
    bonus: 0.10          # override decorator default
  - import_path: mypackage.policies:reward_clean_code
    bonus: 0.12
utility_overrides:
  w_k: 0.30

10.6 Applying a policy via CLI

bash

# Validate schema before applying
aua policy validate policies/safe_coding.yaml
# ✓ policies/safe_coding.yaml is valid

# Preview — see what would be activated
aua policy apply policies/safe_coding.yaml --dry-run

# Activate — writes pointer to .aua/active_policy
aua policy apply policies/safe_coding.yaml
# ✓ Policy activated. Restart or hot-reload to apply.

# List all policies in policies/
aua policy list

# List assertions, then test a single one against sample output
aua guard list
aua guard test --import-path mypackage.policies:validate_python_syntax
aua guard test --import-path mypackage.policies:reward_analogy \
    --output "Think of it as a balanced binary tree."

10.7 The three-layer learning loop

Once a policy is active, the framework creates a feedback loop that progressively shapes model behaviour — no manual intervention required:

Layer 1 — Immediate (milliseconds). BLOCKING assertions fire on every response. If PythonSyntaxCheck fails, the error is injected back into the prompt and the specialist retries. The user only ever sees syntactically valid code.

Layer 2 — Session-by-session. Every assertion result is stored in assertion_events with a timestamp. Specialists that consistently fail assertions accumulate lower mean U scores. Lower U scores mean they don't meet the blue-green promotion delta threshold — a model that can't follow your policy doesn't advance to BLUE.

Layer 3 — Calibration (on demand). Run aua calibrate --layer 3 to export sessions where all INFO assertions fired and no BLOCKING assertion exhausted retries. These are your gold-standard sessions — ready as DPO "chosen" examples for fine-tuning. After fine-tuning, the defined behaviours are baked into the model weights, and the assertions become less necessary over time.

What you can build with this

Bad output blocked before users ever see it — your guardrails run on every response, automatically.
A system that rewards the behaviours you want: every session that meets your gold standard is automatically flagged as training data.
Domain-specific personalities: strict and cautious for legal queries, curious and expressive for creative ones — all from one YAML file.
The start of a feedback loop: every failure you define an assertion for is a failure that gets corrected, tracked, and eventually eliminated.

Part 11 shows how to close the loop — take those gold-standard sessions and turn them into the next version of your model.

Part 10 done. Part 11 covers triggering calibration cycles to export training data and analyse routing weight health.

Part 11 Calibration cycles ~15 min

The aua calibrate command surfaces the three feedback loops as explicit, triggerable operations. You choose when to run each one — the framework handles the analysis.

11.1 Layer 1 — Measure current performance

bash

# Run the eval harness — same as `aua eval run` but surfaced as a calibration step
aua calibrate --layer 1 --dataset evals/coding_smoke.yaml

# Use the default dataset if it exists
aua calibrate --layer 1

11.2 Layer 2 — Routing weight analysis

Layer 2 reads assertion event history and shows which domains are healthy vs. degrading — the signal that tells you which specialists need attention.

bash

aua calibrate --layer 2

# Example output:
# ┌──────────────────────────┬─────────┬───────────┬───────────┬──────────────┐
# │ Domain                   │ Queries │ Pass Rate │ Avg Bonus │ Signal       │
# ├──────────────────────────┼─────────┼───────────┼───────────┼──────────────┤
# │ software_engineering     │     312 │    91.3%  │  +0.087   │ ↑ Strong     │
# │ mathematics              │     148 │    83.1%  │  +0.041   │ → Stable     │
# │ general                  │      44 │    56.2%  │     —     │ ↓ Weak       │
# └──────────────────────────┴─────────┴───────────┴───────────┴──────────────┘
# Stagnation signal: same assertions failing week over week
# → Check: is the assertion too strict? Is the model too small?

aua calibrate --layer 2 --dry-run  # preview only

11.3 Layer 3 — Export gold-standard DPO pairs

This is the calibration cycle that closes the loop. The framework identifies your best sessions — where the model followed your policy perfectly — and exports them as DPO training pairs.

bash

# See what would be exported without writing files
aua calibrate --layer 3 --dry-run

# Example dry-run output:
# Gold-standard sessions:   47
# Failed sessions:          12
# Exportable pairs:         12
# --dry-run: would export 12 DPO pairs → dpo_pairs/calibration.jsonl

# Export when ready
aua calibrate --layer 3 --output dpo_pairs/may_calibration.jsonl

# Force export even if below min-pairs threshold
aua calibrate --layer 3 --force --output dpo_pairs/early_export.jsonl

# Fine-tune your specialist with the exported pairs:
# Axolotl:  axolotl train configs/dpo.yaml --data dpo_pairs/may_calibration.jsonl
# TRL:      trl dpo --dataset dpo_pairs/may_calibration.jsonl
# Then deploy as GREEN: curl -X POST http://localhost:8000/deploy/green

What Layer 3 does (and doesn't do). aua calibrate --layer 3 identifies gold-standard sessions and exports DPO pairs in the format your fine-tuning framework expects. It does not fine-tune models automatically — that step runs via Axolotl, TRL, or LLaMA-Factory using the exported JSONL. After fine-tuning, deploy the new model as a GREEN candidate and let blue-green handle promotion.
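
For orientation, DPO trainers generally consume one JSON object per line with a prompt, a preferred ("chosen") response, and a dispreferred ("rejected") response. The sketch below shows one way gold-standard and failed sessions might be paired and written as JSONL. The session dictionaries, the pair-by-identical-prompt rule, and the exact field names are assumptions for illustration, so inspect the file AUA actually exports before pointing a trainer at it.

```python
import io
import json

def build_dpo_pairs(gold: list[dict], failed: list[dict]) -> list[dict]:
    """Pair each failed session with a gold-standard session for the same prompt."""
    gold_by_prompt = {s["prompt"]: s["response"] for s in gold}
    pairs = []
    for session in failed:
        chosen = gold_by_prompt.get(session["prompt"])
        if chosen is not None:                 # skip failures with no gold match
            pairs.append({
                "prompt": session["prompt"],
                "chosen": chosen,
                "rejected": session["response"],
            })
    return pairs

def write_jsonl(pairs: list[dict], fh) -> None:
    """Write one JSON object per line."""
    for pair in pairs:
        fh.write(json.dumps(pair) + "\n")

# Hypothetical sessions, invented for this example.
gold = [{"prompt": "Reverse a list in Python", "response": "Use items[::-1]."}]
failed = [
    {"prompt": "Reverse a list in Python", "response": "As an AI, I cannot help with that."},
    {"prompt": "Sort a dict by value", "response": "def sort(:"},
]

pairs = build_dpo_pairs(gold, failed)
buffer = io.StringIO()                         # stands in for a real file
write_jsonl(pairs, buffer)
lines = buffer.getvalue().splitlines()
print(len(lines), "pair(s) written")
```

Under this pairing rule the number of exportable pairs can never exceed the number of failed sessions, which matches the shape of the dry-run output above (47 gold, 12 failed, 12 pairs).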

What you can build with this

A model that gets measurably better over time — not by accident, but because you've defined what better means and built a pipeline that teaches it.
Training data you didn't have to label: the framework identified which sessions were gold-standard based on your policy.
A clear picture of which domains are healthy and which specialists need attention — before users notice.
The complete loop: define what good looks like → run queries → identify the best sessions → export training pairs → fine-tune → repeat.

Part 12 gives you the visibility layer — so you can actually see the improvement happening over time.

Part 11 done. Part 12 covers querying session and assertion logs and comparing metrics over time.

Part 12 Logs & metrics over time ~15 min

The assertion events store gives you a time-series view of how your policy is performing. These commands let you answer "is my AI actually getting better at following my policy?"

12.1 Viewing assertion events

bash

# All recent assertion events
aua logs assertions

# Filter to failures only — the assertions that need attention
aua logs assertions --filter passed=false

# Filter by assertion name
aua logs assertions --assertion PythonSyntaxCheck --tail 20

# Filter by domain
aua logs assertions --filter domain=software_engineering

# Export for offline analysis
aua logs assertions --json > my_assertions.json

12.2 Viewing session history

bash

# Recent sessions with U scores
aua logs sessions

# Export sessions to JSON
aua logs export --table audit_log --output sessions.json

12.3 Comparing metrics over time

This is the "is it working?" command. It compares the current window against the prior window of the same length and shows whether the key signals are moving in the right direction.

bash

# Compare last 30 days vs prior 30 days
aua metrics --compare 30d

# Example output (after a few weeks with an active policy):
# ┌─────────────────────────────┬──────────┬──────────┬──────────────────┐
# │ Metric                      │ Prior    │ Current  │ Trend            │
# ├─────────────────────────────┼──────────┼──────────┼──────────────────┤
# │ Mean U score                │  0.6213  │  0.6891  │ ↑ +0.0678        │
# │ Assertion fail rate         │  0.2341  │  0.1102  │ ↓ -0.1239        │ ← good
# │ Retry rate (BLOCKING)       │  0.1820  │  0.0890  │ ↓ -0.0930        │ ← good
# │ Avg E bonus (INFO)          │  0.0120  │  0.0654  │ ↑ +0.0534        │
# │ Total queries               │      312 │      481 │ ↑ +169           │
# └─────────────────────────────┴──────────┴──────────┴──────────────────┘
# Success signal: mean_u_score ↑, assertion_fail_rate ↓, retry_rate ↓
# Stagnation signal: same assertions failing week over week

# Focus on a single metric
aua metrics --compare 7d --metric assertion_fail_rate

# Date range
aua metrics --compare 2025-04-01:2025-05-01

# JSON output for charting
aua metrics --compare 30d --json

What success looks like. Mean U score trending up. Assertion fail rate trending down. Retry rate (BLOCKING) falling — meaning the model is learning to get it right on the first try. After a fine-tuning cycle, you may see a step-change drop in fail rate as the trained behaviour is baked into the weights.

What stagnation looks like. The same assertions failing week over week. This means either the assertion is too strict for the model's capability, or the model isn't receiving enough signal to learn. Check: is max_retries too low? Is the policy active on enough queries to accumulate data?
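
If you export assertion events for offline analysis, the same window-over-window comparison is easy to reproduce. The sketch below assumes each event is a dictionary with a timezone-aware ts datetime and a boolean passed field. That schema, and the function names, are assumptions for illustration; the real export may use different field names or Unix timestamps.

```python
from datetime import datetime, timedelta, timezone

def fail_rate(events, start, end):
    """Fraction of events in [start, end) that failed, or None if the window is empty."""
    window = [e for e in events if start <= e["ts"] < end]
    if not window:
        return None
    return sum(1 for e in window if not e["passed"]) / len(window)

def compare_windows(events, now, days):
    """Return (prior, current, delta) fail rates for two back-to-back windows."""
    span = timedelta(days=days)
    current = fail_rate(events, now - span, now)
    prior = fail_rate(events, now - 2 * span, now - span)
    delta = None if current is None or prior is None else current - prior
    return prior, current, delta

def day(month, d, passed):
    return {"ts": datetime(2025, month, d, tzinfo=timezone.utc), "passed": passed}

# Invented events: two of four fail in the prior window, one of four in the current.
events = [
    day(3, 10, False), day(3, 15, True), day(3, 20, False), day(3, 25, True),
    day(4, 5, True), day(4, 10, True), day(4, 15, False), day(4, 20, True),
]

now = datetime(2025, 5, 1, tzinfo=timezone.utc)
prior, current, delta = compare_windows(events, now, days=30)
print(prior, current, delta)
```

A negative delta means the fail rate dropped relative to the prior window, which is the direction you want.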

12.4 The full policy workflow in practice

Putting it all together — this is the cycle for designing and refining your AI over time:

bash — Monthly calibration cycle

# 1. Week 1-4: run queries with policy active, accumulate assertion events

# 2. Check Layer 2 health at any point
aua calibrate --layer 2

# 3. Review failures — add/refine assertions if the same things keep failing
aua logs assertions --filter passed=false --tail 50

# 4. End of month: export gold-standard sessions
aua calibrate --layer 3 --dry-run      # preview
aua calibrate --layer 3                # export to dpo_pairs/calibration.jsonl

# 5. Fine-tune your specialist on the exported pairs (external step)
# trl dpo --dataset dpo_pairs/calibration.jsonl

# 6. Deploy the fine-tuned model as GREEN
curl -X POST http://localhost:8000/deploy/green \
  -d '{"specialist": "swe", "new_model": "qwen2.5-coder:14b-finetuned"}'

# 7. Blue-green evaluates and promotes if U score delta passes threshold
aua status --once    # check promotion status

# 8. Compare metrics to confirm improvement
aua metrics --compare 30d

# Repeat — each cycle, the model gets better at following your policy.
# After a few cycles, the assertions become less necessary because the
# defined behaviours are baked into the model weights.

What you can build with this

Full visibility into whether your AI is improving — assertion fail rate trending down, U score trending up, in whatever monitoring stack you use.
One trace_id that links a specific response to its log line, its Prometheus metrics, and its distributed trace — the full story of what happened.
Alerts before users notice: U score drops, assertion failure spikes, latency regressions — all triggerable from the same metrics.
A system you can hand to an ops team: ELK, Splunk, Grafana, Loki — whatever they already use, with working configs and the right fields already in every log line.

The how-to guides cover everything you need for production: plugins, security, Docker, and full observability setup.

Part 12 done. How-to guides follow — plugins, hooks, security, observability, and Docker deployment.

How-to guides

Task-oriented guides for specific things you need to accomplish.

How-to 13 Write a plugin ~25 min

AUA defines 8 Protocol interfaces. Implement any of them to replace the corresponding framework layer — without editing AUA source. Register via import_path in aua_config.yaml.

13.1 The 8 Plugin Protocol interfaces

Interface	What it replaces
`FieldClassifierPlugin`	Field classification for incoming queries
`UtilityScorerPlugin`	Utility function U = f(response, field)
`ArbiterPolicyPlugin`	Contradiction arbitration verdict logic
`PromotionPolicyPlugin`	Decision logic for whether GREEN promotes to BLUE
`CorrectionStorePlugin`	Storage backend for assertions and corrections
`ModelBackendPlugin`	Custom LLM backend (not vLLM or Ollama)
`HookPlugin`	Lifecycle event handler (see Part 14)
`AUAMiddleware`	Before/after pipeline for every request (see Part 14)

13.2 Example — custom utility scorer

plugins/risk_scorer.py

from aua.plugins.interfaces import UtilityScorerPlugin

class RiskWeightedUtilityScorer(UtilityScorerPlugin):
    """Weights efficacy down when confidence is below threshold."""

    def __init__(self, config: dict):
        self.risk_threshold = config.get("risk_threshold", 0.80)

    def score(
        self,
        response: str,
        field: str,
        efficacy: float,
        confidence: float,
        curiosity: float,
        weights: dict,
    ) -> float:
        if confidence < self.risk_threshold:
            weights = {**weights, "efficacy": weights["efficacy"] * 0.6}
        return (
            weights["efficacy"] * efficacy
            + weights["confidence"] * confidence
            + weights["curiosity"] * curiosity
        )

aua_config.yaml — register the plugin

plugins:
  utility_scorer:
    import_path: plugins.risk_scorer:RiskWeightedUtilityScorer
    config:
      risk_threshold: 0.85

bash — test before registering

aua extensions test \
  --kind utility_scorer \
  --import-path plugins.risk_scorer:RiskWeightedUtilityScorer

# ✓ Interface satisfied: UtilityScorerPlugin
# ✓ score() signature valid
# ✓ Test vector passed (U=0.612)
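
To see what the scorer in plugins/risk_scorer.py does to a score, you can run the same arithmetic by hand. The sketch below copies the formula into a plain function so it runs without AUA installed. The weight values and inputs are invented for illustration and are not AUA defaults.

```python
def risk_weighted_score(efficacy: float, confidence: float, curiosity: float,
                        weights: dict, risk_threshold: float = 0.80) -> float:
    """Same arithmetic as RiskWeightedUtilityScorer.score(), without the plugin base class."""
    if confidence < risk_threshold:
        # Below the threshold, efficacy counts for 60% of its usual weight.
        weights = {**weights, "efficacy": weights["efficacy"] * 0.6}
    return (
        weights["efficacy"] * efficacy
        + weights["confidence"] * confidence
        + weights["curiosity"] * curiosity
    )

# Illustrative weights only.
weights = {"efficacy": 0.5, "confidence": 0.3, "curiosity": 0.2}

# Confidence 0.9 is above the threshold: 0.5*0.8 + 0.3*0.9 + 0.2*0.5
confident = risk_weighted_score(0.8, 0.9, 0.5, weights)

# Confidence 0.7 is below it: efficacy weight drops to 0.3, so 0.3*0.8 + 0.3*0.7 + 0.2*0.5
uncertain = risk_weighted_score(0.8, 0.7, 0.5, weights)

print(round(confident, 2), round(uncertain, 2))
```

Note that the reduced weights are not renormalised, so a low-confidence response is penalised twice: once through the lower confidence term and once through the discounted efficacy term. If you want scores to stay on the same scale, rescale the weights after the discount.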

13.3 Example — custom model backend

plugins/gateway_backend.py

from aua.plugins.interfaces import ModelBackendPlugin
from typing import AsyncIterator
import httpx

class GatewayBackend(ModelBackendPlugin):
    def __init__(self, config: dict):
        self.base_url = config["base_url"]
        self.api_key = config["api_key"]

    async def complete(self, request: dict) -> dict:
        async with httpx.AsyncClient() as client:
            r = await client.post(
                f"{self.base_url}/v1/chat/completions",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json=request,
            )
            r.raise_for_status()  # surface HTTP errors instead of returning an error body
            return r.json()

    async def stream(self, request: dict) -> AsyncIterator[str]:
        # yield SSE chunks
        ...

    async def health(self) -> dict:
        return {"status": "ok"}

aua_config.yaml

backends:
  my_gateway:
    plugin: plugins.gateway_backend:GatewayBackend
    base_url: https://api.my-gateway.internal
    api_key_secret: MY_GATEWAY_KEY  # resolved from secrets provider

13.4 Listing and inspecting registered plugins

bash

aua extensions list
aua extensions inspect utility_scorer
aua extensions test --kind utility_scorer --import-path mypackage.myplugin:MyPlugin
# To pick up plugin changes: restart aua serve

Part 13 done. Part 14 covers hooks and middleware — intercepting the query lifecycle.

How-to 14 Hooks & middleware ~20 min

14.1 The 11 hook points

Hook	Fires when
`pre_query`	Before field classification
`post_query`	After response assembled
`pre_route`	Before routing decision
`post_route`	After routing decision, before specialist calls
`pre_specialist_call`	Before each specialist API call
`post_specialist_call`	After each specialist returns
`pre_arbiter`	Before arbiter receives inputs
`post_arbiter`	After arbiter verdict issued
`on_correction`	When a correction is stored
`on_promotion`	When GREEN promotes to BLUE
`on_rollback`	When a rollback completes

14.2 Writing a hook

plugins/audit_hook.py

from aua.plugins.interfaces import HookPlugin

class AuditHook(HookPlugin):
    hook_name = "post_arbiter"
    error_policy = "fail_open"  # don't block response if hook fails
    timeout_seconds = 2.0

    async def __call__(self, event: dict) -> dict:
        # event contains: session_id, trace_id, field, verdict, utility_scores, ...
        if event.get("verdict") == "case_4":
            await self.alert_slack(event)  # your logic here
        return event  # always return event (can mutate)

    async def alert_slack(self, event: dict):
        ...

aua_config.yaml

hooks:
  - import_path: plugins.audit_hook:AuditHook
    order: 10  # lower number runs first

14.3 Middleware — before/after every request

plugins/pii_middleware.py

from aua.plugins.interfaces import AUAMiddleware

class PIIRedactionMiddleware(AUAMiddleware):
    async def before_query(self, request: dict) -> dict:
        request["prompt"] = self._redact(request["prompt"])
        return request

    async def after_response(self, response: dict) -> dict:
        # optionally mutate response
        return response

    def _redact(self, text: str) -> str:
        # your PII redaction logic
        return text

aua_config.yaml

middleware:
  - plugins.pii_middleware:PIIRedactionMiddleware
  - plugins.tenant_policy:TenantPolicyMiddleware

Error policy. With fail_open, a hook failure is logged but does not block the response. With fail_closed, a hook failure returns an error to the client. Configure the policy per hook in YAML. Middleware failures are fail_open by default.
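
The difference between the two policies, combined with a hook timeout, can be sketched with plain asyncio. The run_hook wrapper below is hypothetical and is not how AUA invokes hooks internally; it only illustrates the behaviour described above.

```python
import asyncio

async def run_hook(hook, event: dict, error_policy: str = "fail_open",
                   timeout_seconds: float = 2.0) -> dict:
    """Run one hook with a timeout and apply the chosen error policy."""
    try:
        return await asyncio.wait_for(hook(event), timeout=timeout_seconds)
    except Exception as exc:               # includes asyncio.TimeoutError
        if error_policy == "fail_closed":
            raise RuntimeError(f"hook failed: {exc!r}") from exc
        print(f"hook failed, continuing: {exc!r}")   # fail_open: log and move on
        return event                       # pass the event through unchanged

async def broken_hook(event: dict) -> dict:
    raise ValueError("boom")

async def slow_hook(event: dict) -> dict:
    await asyncio.sleep(1.0)               # longer than the timeout used below
    return event

async def demo():
    event = {"verdict": "case_4"}
    opened = await run_hook(broken_hook, event, "fail_open")
    timed_out = await run_hook(slow_hook, event, "fail_open", timeout_seconds=0.01)
    try:
        await run_hook(broken_hook, event, "fail_closed")
        closed_raised = False
    except RuntimeError:
        closed_raised = True
    return opened, timed_out, closed_raised

opened, timed_out, closed_raised = asyncio.run(demo())
print(opened, timed_out, closed_raised)
```

With fail_open the event comes back unchanged after both the exception and the timeout. With fail_closed the failure propagates, which is where a server would turn it into an error response.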

Part 14 done. Part 15 covers production security — tokens, mTLS, secrets, and audit log.

How-to 15 Security ~25 min

15.1 Bearer tokens and scopes

AUA uses HMAC-SHA256 bearer tokens with 15 fine-grained scopes. Create tokens with the CLI:

bash

# Create a query-only token expiring in 30 days
aua token create --scope aua:query --expires 30d

# Create an admin token
aua token create --scope aua:admin --expires 7d

# List all tokens
aua token list

# Revoke a token
aua token revoke <token-id>

Scope	Grants access to
`aua:query`	POST /query, POST /sessions/{id}/messages
`aua:stream`	POST /sessions/{id}/stream
`aua:status`	GET /status, GET /version, GET /health
`aua:config:read`	GET /config (secrets redacted)
`aua:config:write`	POST /config/reload
`aua:corrections:write`	POST /corrections
`aua:deploy`	POST /deploy/green
`aua:rollback`	POST /deploy/rollback
`aua:extensions:write`	POST /extensions, POST /extensions/reload
`aua:admin`	All scopes

bash — use a token

export AUA_TOKEN="aua_tk_..."

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain quicksort", "session_id": "s1"}'
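
The token layout itself is not described here. As a general illustration of how an HMAC-SHA256 bearer token can carry scopes and an expiry, the sketch below signs a small JSON payload and verifies it with a constant-time comparison. The payload fields, the body.signature layout, and the function names are assumptions and do not reflect AUA's actual token format.

```python
import base64
import hashlib
import hmac
import json
import time

def sign_token(secret: bytes, scopes: list[str], expires_at: int) -> str:
    """Encode the claims and append an HMAC-SHA256 signature over them."""
    payload = json.dumps({"scopes": scopes, "exp": expires_at}, sort_keys=True).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    signature = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"

def verify_token(secret: bytes, token: str, required_scope: str, now=None) -> bool:
    """Check signature, expiry, and scope, in that order."""
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return False                                   # no separator: malformed
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):   # constant-time comparison
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < (time.time() if now is None else now):
        return False
    # aua:admin grants all scopes, as in the scope table above.
    return required_scope in claims["scopes"] or "aua:admin" in claims["scopes"]

SECRET = b"dev-only-secret"                            # never hard-code a real secret
token = sign_token(SECRET, ["aua:query"], expires_at=2_000_000_000)
tampered = token[:-1] + ("0" if token[-1] != "0" else "1")

ok = verify_token(SECRET, token, "aua:query", now=1_700_000_000)
wrong_scope = verify_token(SECRET, token, "aua:deploy", now=1_700_000_000)
expired = verify_token(SECRET, token, "aua:query", now=2_000_000_001)
forged = verify_token(SECRET, tampered, "aua:query", now=1_700_000_000)
print(ok, wrong_scope, expired, forged)
```

The signature is verified before the payload is decoded, so a tampered token is rejected without its contents ever being parsed.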

15.2 mTLS — encrypted internal communication

bash

# Generate dev certs (self-signed, router + specialists + arbiter)
aua certs generate

# Inspect cert details
aua certs inspect

# Rotate certs (hot-reloaded, no restart needed)

aua_config.yaml

security:
  mtls:
    enabled: true
    cert_dir: .aua/certs
    auto_generate: true    # false in production — bring your own CA

15.3 Secrets management

AUA never stores plaintext secrets in config. Instead, config references a secret name, and the secrets provider resolves it at startup:

aua_config.yaml

secrets:
  provider: env            # "env" | "vault" | "aws_sm" | "gcp_sm"

specialists:
  - name: swe
    api_key_secret: SWE_API_KEY    # reads env var SWE_API_KEY
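
With the env provider, resolution amounts to looking up the referenced name in the process environment at startup. The sketch below shows that idea for a single config mapping. The function name, the rule that every key ending in _secret is a reference, and exposing the resolved value under the key with the suffix stripped are all assumptions for illustration.

```python
import os

def resolve_secrets(config: dict, environ=None) -> dict:
    """Replace every '<name>_secret' reference with the value it points to."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for key, value in config.items():
        if key.endswith("_secret"):
            if value not in environ:
                # Fail at startup rather than on the first request.
                raise KeyError(f"secret {value!r} referenced by {key!r} is not set")
            resolved[key[: -len("_secret")]] = environ[value]
        else:
            resolved[key] = value
    return resolved

# A fake environment keeps the example independent of your shell.
fake_env = {"SWE_API_KEY": "sk-example-not-real"}
specialist = {"name": "swe", "api_key_secret": "SWE_API_KEY"}

resolved = resolve_secrets(specialist, environ=fake_env)
print(sorted(resolved))
```

Failing fast on a missing variable is the useful property here: a typo in a secret name shows up when the server starts, not as an authentication error hours later.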

aua_config.yaml — Vault provider

secrets:
  provider: vault
  vault_addr: https://vault.internal:8200
  vault_token_secret: VAULT_TOKEN   # token from env
  vault_path_prefix: secret/aua

15.4 Encryption at rest

Correction payloads, assertions, DPO pairs, token metadata, and sensitive audit fields are encrypted at rest with AES-256-GCM:

aua_config.yaml

security:
  encryption:
    enabled: true
    key_secret: AUA_ENCRYPTION_KEY   # 64-char hex key; see §17.3 for generation

15.5 Audit log

The audit log is append-only with a tamper-evident SHA-256 hash chain. Every security-relevant event is recorded:

bash

# View recent audit events (written to .aua/audit.log)
tail -f .aua/audit.log

# Export corrections (machine-readable audit trail)
aua corrections export --format jsonl
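
A tamper-evident hash chain works by having each entry's hash cover the previous entry's hash, so changing any earlier record invalidates everything after it. The sketch below demonstrates the general technique. The entry layout, the all-zero genesis value, and the function names are assumptions and are not the on-disk format of .aua/audit.log.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry

def entry_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

def append_entry(chain: list, event: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev_hash,
                  "hash": entry_hash(prev_hash, event)})

def verify_chain(chain: list) -> bool:
    """Recompute every hash from the start; any edit breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

# Invented events for the example.
chain: list = []
append_entry(chain, {"action": "token.create", "scope": "aua:query"})
append_entry(chain, {"action": "deploy.green", "specialist": "swe"})
append_entry(chain, {"action": "token.revoke", "token_id": "t1"})

intact = verify_chain(chain)
chain[1]["event"]["specialist"] = "math"     # simulate tampering with a past entry
after_tamper = verify_chain(chain)
print(intact, after_tamper)
```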

Production checklist: aua doctor --strict emits a loud warning when all three of the following hold: cors_origins is *, the host is 0.0.0.0, and auth is disabled. Use the Team Server or Enterprise deployment profile to enforce auth and mTLS requirements.

Part 15 done. Part 16 covers observability — Prometheus, Grafana, OTEL, and structured logging.

How-to 16 Observability ~25 min

AUA emits three observability streams out of the box: structured JSON logs (every query, assertion, and error), Prometheus metrics (18 gauges/counters/histograms), and optional OpenTelemetry distributed traces. All three are designed to ship directly to ELK, Splunk, Grafana, or any OTEL-compatible backend — no code changes required.

16.1 Structured JSON logging

Every log line the framework emits is a single-line JSON object. Each line automatically includes the current request's session_id, trace_id, and request_id, so a Kibana or Splunk search on a session ID returns everything that happened in that session.

aua_config.yaml

logging:
  level: INFO          # DEBUG | INFO | WARNING | ERROR
  format: json         # "json" (default) | "text" (human-readable dev mode)
  output: stdout       # "stdout" | "stderr" | "/var/log/aua/router.log"

Example log output — one query

{"ts":1747000000.12,"level":"INFO","logger":"aua.router","msg":"single→software_engineering  U=0.731","session_id":"s_abc123","trace_id":"01HX...","field":"software_engineering","routing_mode":"single","latency_ms":312.4,"utility_score":0.731,"confidence":0.823}
{"ts":1747000000.43,"level":"INFO","logger":"aua.router","msg":"Query routed","session_id":"s_abc123","trace_id":"01HX...","domain":"software_engineering","u_score":0.731,"latency_ms":315.1}

Fields that can appear in a structured log line (context fields such as error_code and verdict are present only when they apply):

Field	Description
`ts`	Unix timestamp (float)
`level`	DEBUG / INFO / WARNING / ERROR
`logger`	Module name (aua.router, aua.arbiter, aua.auth, ...)
`session_id`	Chat session identifier — auto-injected from request context
`trace_id`	W3C-compatible trace ID — links to OTEL spans if enabled
`request_id`	Per-request unique ID
`field`	Routed domain (software_engineering, mathematics, ...)
`specialist`	Specialist name that handled the query
`routing_mode`	single / fanout / arbiter
`utility_score`	Final U score for this response
`confidence`	Kalman-filtered confidence estimate
`latency_ms`	End-to-end latency in milliseconds
`error_code`	HTTP status on errors
`verdict`	Arbiter verdict case (1/2/3/4) when arbiter fires
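
Because every line is a standalone JSON object, you can also filter logs locally before any log shipper is set up. The sketch below parses a few lines and applies the kind of filters shown later as Kibana queries. The sample lines are invented for this example and use only fields from the table above.

```python
import json

# Invented sample lines; the third one simulates noise from another process.
LOG_LINES = [
    '{"ts": 1747000000.12, "level": "INFO", "logger": "aua.router", "session_id": "s_abc123", "field": "software_engineering", "utility_score": 0.731, "latency_ms": 312.4}',
    '{"ts": 1747000003.90, "level": "INFO", "logger": "aua.router", "session_id": "s_def456", "field": "general", "utility_score": 0.352, "latency_ms": 5210.0}',
    'not json: a stray line from another process',
    '{"ts": 1747000007.05, "level": "WARNING", "logger": "aua.auth", "session_id": "s_abc123", "msg": "token rejected"}',
]

def parse_lines(lines):
    """Parse JSON log lines, skipping anything that is not a JSON object."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            events.append(event)
    return events

events = parse_lines(LOG_LINES)

# All events for one session.
by_session = [e for e in events if e.get("session_id") == "s_abc123"]

# Low U-score responses; lines without a score are ignored.
low_utility = [e for e in events if e.get("utility_score", 1.0) < 0.4]

# High-latency queries.
slow = [e for e in events if e.get("latency_ms", 0.0) > 5000]

print(len(events), len(by_session), len(low_utility), len(slow))
```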

16.2 Shipping logs to ELK (Elasticsearch / Kibana)

AUA's JSON output is Filebeat-native. No parsing config needed — all fields are already top-level JSON keys that become indexed Elasticsearch fields automatically.

Step 1 — write logs to file

logging:
  format: json
  output: /var/log/aua/router.log    # Filebeat monitors this path

Step 2 — filebeat.yml

filebeat.inputs:
  - type: log
    paths: ["/var/log/aua/router.log"]
    json.keys_under_root: true     # promote JSON fields to top-level
    json.add_error_key: true

processors:
  - timestamp:
      field: ts
      layouts: ["UNIX"]
      target_field: "@timestamp"

output.elasticsearch:
  hosts: ["https://your-elastic:9200"]
  index: "aua-logs-%{+yyyy.MM.dd}"
  api_key: "your-api-key"

Step 3 — useful Kibana queries

# All failed assertions in the last 24h
logger: "aua.router" AND level: "WARNING" AND msg: "assertion"

# Low U-score sessions (worth reviewing)
utility_score < 0.4

# All events for a specific session
session_id: "s_abc123"

# High latency queries
latency_ms > 5000

# Authentication failures
logger: "aua.auth" AND level: "WARNING"

# Arbiter fired
routing_mode: "arbiter" AND verdict: *

Logstash pipeline alternative. If you're using Logstash instead of Filebeat, pipe aua serve stdout directly: aua serve 2>&1 | logstash -f aua.conf. In the pipeline filter: json { source => "message" } then date { match => ["ts", "UNIX"] }. The JSON structure needs no grok patterns.

16.3 Shipping logs to Splunk

Two options depending on your Splunk setup:

Option A — Universal Forwarder (file-based)

inputs.conf

[monitor:///var/log/aua/router.log]
index = aua
sourcetype = aua_json

props.conf

[aua_json]
KV_MODE = json
TIME_FORMAT = %s.%3N
TIME_PREFIX = "ts":
MAX_TIMESTAMP_LOOKAHEAD = 20

Option B — HTTP Event Collector (HEC, no file needed)

bash — install Splunk handler

pip install splunk-handler

Python — add to your startup script or hook

from splunk_handler import SplunkHandler
import logging

logging.getLogger("aua").addHandler(
    SplunkHandler(
        host="splunk.yourcompany.com",
        port=8088,
        token="your-hec-token",
        index="aua",
        sourcetype="aua_json",
    )
)

Useful Splunk searches

# Failed assertions over time
index=aua sourcetype=aua_json logger="aua.router" "assertion"
| timechart count by assertion_name

# U-score trend per domain
index=aua sourcetype=aua_json utility_score=*
| timechart avg(utility_score) by field

# P95 latency by routing mode
index=aua sourcetype=aua_json latency_ms=*
| stats perc95(latency_ms) by routing_mode

16.4 Prometheus metrics

bash

curl http://localhost:8000/metrics | head -30

Metric	Type	What it measures
`aua_queries_total`	Counter	Total queries by field, routing mode, status
`aua_query_latency_seconds`	Histogram	Latency (p50/p95/p99)
`aua_utility_score`	Gauge	Last U score per domain
`aua_contradiction_rate`	Gauge	Contradiction rate per domain
`aua_routing_field_distribution`	Counter	Query distribution across fields
`aua_specialist_confidence`	Gauge	Confidence per specialist
`aua_correction_count`	Counter	Corrections accumulated
`aua_arbiter_verdict_distribution`	Counter	Case 1/2/3/4 breakdown
`aua_dpo_pairs_accumulated`	Gauge	Total DPO pairs in store
`aua_token_requests_total`	Counter	Token-gated requests by scope
`aua_hook_failures_total`	Counter	Hook execution failures
`aua_plugin_execution_seconds`	Histogram	Plugin latency
`aua_specialist_vram_utilization`	Gauge	GPU VRAM % per specialist
`aua_cost_gpu_hours_total`	Counter	Cumulative GPU hours per specialist
`aua_cost_usd_total`	Counter	Cumulative USD cost per specialist
`aua_assertion_results_total`	Counter	Assertion pass/fail by name, level, domain
`aua_assertion_retries_total`	Counter	BLOCKING assertion retry count
`aua_assertion_bonus_applied`	Histogram	E-score bonus applied by INFO assertions

16.5 Cost tracking

bash

curl http://localhost:8000/metrics/cost | python3 -m json.tool

Response

{
  "swe":  {"queries": 42, "gpu_hours": 0.012, "cost_usd": 0.0083},
  "math": {"queries": 18, "gpu_hours": 0.005, "cost_usd": 0.0034},
  "total_cost_usd": 0.0117
}

16.6 Grafana dashboard

bash — start with observability profile

docker compose --profile obs up

# Grafana at http://localhost:3000 (admin / aua-admin)
# Dashboard pre-loaded: 20 panels covering query volume, latency p50/p95/p99,
# routing distribution, U score trends, contradiction rate, arbiter verdicts,
# specialist health, VRAM usage, blue-green split, assertion fail rate,
# DPO pairs accumulated, auth failures, cost per specialist

16.7 OpenTelemetry — distributed traces

Optional. Sends full request traces to Jaeger, Tempo, Elastic APM, Splunk Observability, or any OTLP-compatible backend. Each trace covers the complete request path: router → classifier → routing decision → specialist calls → utility scoring → arbiter → hooks → policy assertions → response.

bash

pip install "adaptive-utility-agent[otel]"

aua_config.yaml — OTEL to Jaeger/Tempo

observability:
  otel:
    enabled: true
    endpoint: http://localhost:4317   # OTLP gRPC collector
    service_name: aua-router

aua_config.yaml — OTEL to Splunk Observability Cloud

observability:
  otel:
    enabled: true
    endpoint: https://ingest.us1.signalfx.com:443
    service_name: aua-router
    headers:
      X-SF-Token: "your-splunk-o11y-token"

Log + trace correlation. The trace_id in every JSON log line is W3C-compatible. When OTEL is enabled, clicking a log line in Kibana or Splunk and following its trace_id jumps directly to the corresponding distributed trace in Jaeger or Elastic APM — showing the exact specialist calls, latencies, and assertion checks for that request.

16.8 Structured logging in Docker / Kubernetes

docker-compose.yml — ship stdout to Loki via Grafana Alloy

services:
  aua-router:
    logging:
      driver: json-file    # Docker captures stdout as JSON
    labels:
      logging: "aua"

  alloy:
    image: grafana/alloy
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    # Alloy → Loki → Grafana: full log + metric correlation

Kubernetes — fluent-bit to Elasticsearch

# fluent-bit ConfigMap snippet
[INPUT]
    Name              tail
    Path              /var/log/containers/aua-*.log
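    # The json parser matches Docker's json-file format. Nodes running containerd
    # or CRI-O write CRI-format lines and need a CRI parser instead.
    Parser            json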
    Tag               aua.*

[OUTPUT]
    Name              es
    Match             aua.*
    Host              elasticsearch.logging.svc
    Index             aua-logs
    Type              _doc

Part 16 done. Part 17 covers Docker deployment profiles.

How-to 17 Docker deployment ~20 min

17.1 Docker Compose profiles

All examples use the modern docker compose (V2) command. If your system only has the legacy binary, replace docker compose with docker-compose in each command.

bash

# CPU / Ollama
docker compose up

# GPU / vLLM (requires NVIDIA runtime)
docker compose --profile gpu up

# + Prometheus and Grafana
docker compose --profile obs up

# Full local stack (Ollama + observability)
docker compose --profile ollama --profile obs up

# GPU (Linux + NVIDIA) — uses separate compose file
docker compose -f docker-compose.gpu.yml up

17.2 Deployment profiles

Profile	Auth	mTLS	State	Use for
Local Developer	Optional	No	SQLite	localhost only
Single GPU Workstation	Recommended	No	SQLite	One-machine GPU server
Team Server	Required	Required	Postgres	Shared team deployment
Enterprise	IAM + scopes	Required	Postgres	Custom backends, strict audit

17.3 Environment configuration

Generate your encryption key before deploying — it must be a 32-byte value encoded as 64 hex characters. Run either command once and store the output:

bash — generate AUA_ENCRYPTION_KEY

# Option 1 — Python (no extra dependencies)
python3 -c "import os; print(os.urandom(32).hex())"

# Option 2 — OpenSSL
openssl rand -hex 32

# Either prints a 64-character hex string, e.g.:
# a3f2c1e8b7d4509261af3e2c84b19d07f6a5c3e1b8294d6072f1e3a5c8b2d490

Keep this value secret and never commit it to version control. Rotate it by generating a new key, re-encrypting state, and restarting. Encryption uses AES-256-GCM; the key is loaded at startup from the named environment variable.
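Before the first deploy it can help to check that the stored value really decodes to 32 bytes. The following is a minimal pre-flight sketch, assuming the key lives in the AUA_ENCRYPTION_KEY environment variable as described above. The function name is hypothetical, and this is not AUA's own loader.

```python
import os


def load_encryption_key(env_var: str = "AUA_ENCRYPTION_KEY") -> bytes:
    """Read a hex-encoded key from the environment and return its 32 raw bytes."""
    value = os.environ.get(env_var, "").strip()
    if len(value) != 64:
        raise ValueError(f"{env_var} must be 64 hex characters, got {len(value)}")
    try:
        key = bytes.fromhex(value)
    except ValueError as exc:
        raise ValueError(f"{env_var} contains non-hex characters") from exc
    if len(key) != 32:
        # bytes.fromhex tolerates embedded whitespace, so re-check the decoded length.
        raise ValueError(f"{env_var} must decode to 32 bytes, got {len(key)}")
    return key


if __name__ == "__main__":
    try:
        load_encryption_key()
        print("key OK: 32 bytes")
    except ValueError as err:
        print(f"key check failed: {err}")
```

Running the check in CI or in a container entrypoint surfaces a truncated or mistyped key before the service starts.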

.env

AUA_ENCRYPTION_KEY=<64-char hex string from above>
AUA_ADMIN_TOKEN=aua_tk_...
SWE_API_KEY=...
POSTGRES_URL=postgresql://aua:password@db:5432/aua_state

aua_config.yaml — Team Server profile

security:
  mtls: {enabled: true, cert_dir: /certs, auto_generate: false}
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}
  cors_origins: ["https://your-domain.com"]

state:
  backend: postgres
  url_secret: POSTGRES_URL

audit:
  enabled: true
  hash_chain: true
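This section does not describe what hash_chain does internally. As general background only, and not a description of AUA's audit log format, a hash chain stores in each entry a hash that covers the previous entry's hash, so editing or deleting an earlier entry invalidates every later one. All names below are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous hash together with a canonical encoding of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order; any edited or missing entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"action": "token.create", "actor": "admin"})
append_entry(log, {"action": "config.reload", "actor": "admin"})
print(verify_chain(log))
```

Changing any stored event after the fact makes verify_chain return False, which is the property an audit trail relies on.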

Reference

Reference CLI command groups

aua init

bash

aua init <name> [--preset <name>] [--tier <name>] [--force]

aua serve

bash

aua serve [--dry-run] [--tier <name>] [--reuse-running] [--with-ui] [--config <path>]

aua doctor

bash

aua doctor [--strict] [--json] [--check-certs]

aua config

bash

aua config validate [--config <path>]
aua config expand [--json]
aua config reload

aua models / fields / presets / defaults

bash

aua models list | inspect <name>
aua fields list | inspect <name>
aua presets list | inspect <name>
aua defaults show [<category>]

aua token

bash

aua token create --scope <scope> [--expires <duration>] [--name <label>]
aua token list [--json]
aua token revoke <token-id>
aua token inspect <token-id>

aua certs

bash

aua certs generate [--ca-cert <path>] [--ca-key <path>]
aua certs inspect

aua eval

bash

aua eval run --dataset <path> [--config <path>] [--json]
aua eval report <results.json>
aua eval compare --baseline <blue.json> --candidate <green.json>

aua corrections / dpo

bash

aua corrections export --format jsonl [--redact]
aua dpo export --format preference-pairs [--redact]

aua extensions

bash

aua extensions list
aua extensions inspect <name>
aua extensions test --kind <type> --import-path <path>
# Validate: aua extensions test --kind <type> --import-path <path>

aua status / rollback

bash

aua status [--once] [--json] [--url <url>] [--refresh <seconds>]
aua rollback [--specialist <name>] [--all] [--yes] [--no-restart]

aua guard

bash

aua guard list [--json]
aua guard test --import-path <module:function> [--output <text>] [--domain <name>]

aua policy

bash

aua policy list
aua policy validate <path.yaml>
aua policy apply <path.yaml> [--dry-run]

aua calibrate

bash

aua calibrate --layer <1|2|3> [--force] [--dry-run] [--config <path>]
              [--dataset <path>]          # layer 1 only
              [--output <path.jsonl>]     # layer 3 only
              [--min-pairs <N>]           # layer 3 only (default: 10)

aua logs

bash

aua logs sessions [--limit <N>] [--domain <name>] [--json]
aua logs assertions [--filter <key=value>] [--assertion <name>] [--tail <N>] [--json]
aua logs export [--output <path>] [--table <table>] [--limit <N>]

aua metrics

bash

aua metrics --compare <window>   # 7d, 30d, or YYYY-MM-DD:YYYY-MM-DD
            [--metric <name>]   # u_score | assertion_fail_rate | retry_rate
            [--json]

Reference REST API endpoints

Method	Endpoint	Scope	Description
GET	`/health`	—	Health check. Returns 200 when router is up.
GET	`/version`	—	Framework version, build info.
GET	`/config`	`aua:config:read`	Current config (secrets redacted).
POST	`/config/reload`	`aua:config:write`	Hot-reload config (SIGHUP equivalent).
GET	`/status`	`aua:status`	Live U scores, specialist health, routing stats.
POST	`/query`	`aua:query`	Route a query. Returns response + metadata.
POST	`/query/stream`	`aua:stream`	Streaming SSE query response.
GET	`/corrections`	`aua:corrections:read`	List accumulated corrections.
POST	`/corrections`	`aua:corrections:write`	Inject a manual correction.
POST	`/deploy/green`	`aua:deploy`	Register a GREEN candidate model.
POST	`/deploy/rollback`	`aua:rollback`	Rollback to previous BLUE.
GET	`/metrics`	—	Prometheus metrics endpoint (16 metrics).
GET	`/metrics/cost`	`aua:status`	GPU hours and cost per specialist + total.
GET	`/sessions`	`aua:query`	List all chat sessions.
POST	`/sessions`	`aua:query`	Create a new chat session.
GET	`/sessions/{id}`	`aua:query`	Get session metadata.
DELETE	`/sessions/{id}`	`aua:query`	Delete a session.
POST	`/sessions/{id}/messages`	`aua:query`	Send a message to a session.
POST	`/sessions/{id}/stream`	`aua:stream`	Send a streaming message to a session.
GET	`/extensions`	`extensions:read`	List registered plugins and hooks.
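When deciding which scope to pass to aua token create, the table can be used as a lookup. The sketch below copies a few rows into a dictionary and reports which scope a token is missing for a given call. The helper name is hypothetical, and it assumes scopes are compared by exact string match, which this reference does not state.

```python
from typing import Optional

# (method, path) -> required scope, copied from the table above.
# None means the table lists no scope for that endpoint.
REQUIRED_SCOPE = {
    ("GET", "/health"): None,
    ("GET", "/version"): None,
    ("GET", "/config"): "aua:config:read",
    ("POST", "/config/reload"): "aua:config:write",
    ("GET", "/status"): "aua:status",
    ("POST", "/query"): "aua:query",
    ("POST", "/query/stream"): "aua:stream",
    ("POST", "/deploy/rollback"): "aua:rollback",
}


def missing_scope(method: str, path: str, token_scopes: set) -> Optional[str]:
    """Return the scope the token lacks for this call, or None if the call is covered."""
    required = REQUIRED_SCOPE[(method.upper(), path)]
    if required is None or required in token_scopes:
        return None
    return required


# A token created with only aua:query cannot use the streaming endpoint.
print(missing_scope("POST", "/query/stream", {"aua:query"}))
```

A client can run this check before sending a request and fail early with a clear message instead of relying on the server's rejection.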

Standard error response

The error field carries a stable AUA_* error code.

JSON

{
  "error": "AUA_SPECIALIST_TIMEOUT",
  "message": "Specialist swe timed out after 30s",
  "trace_id": "01HXYZ...",
  "details": {"specialist": "swe", "timeout_seconds": 30}
}
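On the client side, the four fields above map naturally onto an exception type. This is a minimal sketch rather than an official client: the class and function names are hypothetical, and it assumes error and message are always present while trace_id and details may be absent.

```python
import json


class AuaError(Exception):
    """Client-side wrapper for the standard error response."""

    def __init__(self, code, message, trace_id=None, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code            # stable AUA_* error code
        self.message = message
        self.trace_id = trace_id    # use this to find the matching logs and trace
        self.details = details or {}


def parse_error(body: str) -> AuaError:
    """Build an AuaError from a JSON error body."""
    data = json.loads(body)
    return AuaError(
        code=data["error"],
        message=data["message"],
        trace_id=data.get("trace_id"),
        details=data.get("details"),
    )


# Same payload as the example above; the trace_id is truncated in the source and kept as-is.
body = json.dumps({
    "error": "AUA_SPECIALIST_TIMEOUT",
    "message": "Specialist swe timed out after 30s",
    "trace_id": "01HXYZ...",
    "details": {"specialist": "swe", "timeout_seconds": 30},
})

err = parse_error(body)
if err.code == "AUA_SPECIALIST_TIMEOUT":
    print(f"retry later (specialist={err.details.get('specialist')}, trace={err.trace_id})")
```

Branching on the code rather than on the message text keeps client logic stable if the wording of a message changes.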

Praneeth Tota · Ph.D. Computer Science (Algorithmic Game Theory) · Illinois Institute of Technology
linkedin.com/in/praneethtota · Code: GPL-3.0 · Docs: CC BY 4.0


