AUA is a batteries-included Python framework for routing queries to domain-expert language models, scoring outputs with a utility function, resolving contradictions through a formal arbiter, and feeding verified corrections back into training — all without editing framework internals. Think Django, but for adaptive multi-model LLM systems.
v1.0 ships everything needed to run a multi-specialist LLM system in production — from local Ollama on a MacBook to a quad-GPU cluster with mTLS, encrypted state, and Grafana dashboards.
aua init, aua serve, aua doctor, aua config, aua models, aua fields, aua presets, aua token, aua eval, aua extensions — 40+ subcommands with JSON output, strict mode, and correct exit codes.
FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin, ModelBackendPlugin, HookPlugin, AUAMiddleware — all with type hints, docstrings, and example implementations.
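To give a feel for the plugin surface, here is a minimal sketch of a custom utility scorer. The base-class name comes from the list above, but the method names, signatures, and the stand-in base class are assumptions for illustration — consult the shipped docstrings and example implementations for the real API:

```python
# Hypothetical sketch of a utility-scorer plugin. The real AUA
# base-class API may differ in method names and signatures.

class UtilityScorerPlugin:
    """Stand-in for the framework-provided base class (assumed shape)."""
    def score(self, field: str, response: str) -> float:
        raise NotImplementedError

class LengthPenaltyScorer(UtilityScorerPlugin):
    """Toy scorer: penalize responses that exceed a character budget."""
    def __init__(self, budget: int = 2000):
        self.budget = budget

    def score(self, field: str, response: str) -> float:
        overflow = max(0, len(response) - self.budget)
        return max(0.0, 1.0 - overflow / self.budget)

scorer = LengthPenaltyScorer(budget=100)
print(round(scorer.score("software_engineering", "x" * 150), 2))  # 0.5
```

The point is the shape: a plugin is an ordinary class implementing one small interface, registered with the framework rather than patched into it.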
Prometheus metrics endpoint (GET /metrics, 16 metrics), pre-built Grafana dashboard (20 panels), OpenTelemetry instrumentation, a Datadog OTEL preset, structured JSON logging with session/trace IDs, and 8 Prometheus alert rules.
aua eval run, aua eval report, aua eval compare --baseline blue --candidate green. 6 built-in smoke datasets (coding, math, routing, correction, arbiter, safety). JSON output for CI. DPO pair export: aua dpo export --format preference-pairs.
aua serve --with-ui.
docker-compose up starts the full stack. Four profiles: cpu (Ollama), gpu (vLLM), observability (Prometheus + Grafana), secure. Four hardware tier templates: macbook, single-4090, quad-4090, a100-cluster — each with annotated YAML.
A built-in model catalog (qwen-coder-7b-awq, llama3-8b, …), field registry (software_engineering, mathematics, medicine, …), and presets (coding, math, research, medical-safe, legal-safe, …). One-line init: aua init --preset coding --tier single-4090.
The utility function U = w_e(f)·E + w_c(f)·C + w_k(f)·K governs everything: how strongly each error is penalized at training time, when corrections are injected into the system prompt, whether a new deployment passes acceptance, and whether the system should act or abstain. The same formula across every domain — only the field weights change.
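In code, the scoring rule is just a dot product of field weights with the three component scores. A minimal sketch, assuming E, C, and K are each normalized to [0, 1]; the weight values below are illustrative, not the shipped field registry:

```python
# Illustrative sketch of U = w_e(f)*E + w_c(f)*C + w_k(f)*K.
# Field weights here are made-up examples, not AUA's registry values.

FIELD_WEIGHTS = {
    "software_engineering": {"w_e": 0.6, "w_c": 0.3, "w_k": 0.1},
    "creative_writing":     {"w_e": 0.2, "w_c": 0.3, "w_k": 0.5},
}

def utility(field: str, E: float, C: float, K: float) -> float:
    """Weighted sum of the three component scores, each in [0, 1]."""
    w = FIELD_WEIGHTS[field]
    return w["w_e"] * E + w["w_c"] * C + w["w_k"] * K

print(round(utility("software_engineering", E=1.0, C=0.9, K=0.5), 2))  # 0.92
```

Only the per-field weight table changes between domains; the function itself is shared by training, prompt injection, acceptance gating, and the act/abstain decision.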
The additive weighted structure is not a design choice — it is the unique functional form required by five behavioral axioms, proved from first principles via Debreu's representation theorem (Theorem B.1, Appendix B). Field weights are derived from professional licensing standards, not arbitrary parameters. See §4–5 of the whitepaper.
Contradiction detected → corrective assertion stored → injected into system prompt. Immediate, no weight change. Persists across sessions via the assertions store.
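The runtime path above can be sketched as a small store-and-inject loop. The class and method names here are hypothetical, and the shipped assertions store is persistent rather than in-memory:

```python
# Toy sketch of the runtime correction path: store a corrective
# assertion, then prepend stored assertions to the system prompt.
# Names are hypothetical, not AUA's actual API.

class AssertionsStore:
    def __init__(self):
        self._assertions: list[str] = []

    def add(self, assertion: str) -> None:
        """Store a verified correction (deduplicated)."""
        if assertion not in self._assertions:
            self._assertions.append(assertion)

    def inject(self, system_prompt: str) -> str:
        """Append stored corrections to the system prompt."""
        if not self._assertions:
            return system_prompt
        block = "\n".join(f"- {a}" for a in self._assertions)
        return f"{system_prompt}\n\nVerified corrections:\n{block}"

store = AssertionsStore()
store.add("math.sqrt expects a non-negative argument.")
print(store.inject("You are a coding specialist."))
```

Because the store outlives the session, the correction takes effect immediately and keeps applying until Tier 2 bakes it into weights.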
Utility-scored pairs → field-penalty-weighted DPO fine-tuning → LoRA adapter → benchmark gate → deploy. Weight-level, permanent, generalizing. Surgical error trained 10× harder than creative writing.
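The field-penalty weighting can be sketched as scaling each preference pair's contribution to the DPO loss by its field's penalty multiplier. The multipliers, the scalar margin, and the simplified sigmoid loss below are all illustrative; real training uses log-probability margins from the policy and reference models:

```python
# Illustrative sketch of field-penalty-weighted DPO: each pair's
# loss term is scaled by its field's penalty multiplier. The
# multipliers and scalar margin are made-up simplifications.
import math

FIELD_PENALTY = {"surgery": 10.0, "creative_writing": 1.0}  # "10x harder"

def dpo_loss(margin: float, field: str, beta: float = 0.1) -> float:
    """-w(f) * log(sigmoid(beta * margin)) for one preference pair."""
    w = FIELD_PENALTY[field]
    return -w * math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same preference margin, very different training pressure per field:
print(dpo_loss(2.0, "surgery"))
print(dpo_loss(2.0, "creative_writing"))
```

The same gradient signal thus pushes ten times harder on a surgical error than on a stylistic one, which is the sense in which the field weights carry through from scoring to fine-tuning.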
Accumulated adapters distilled into new base fine-tune. Full regression suite. New starting point for the next correction cycle. Wrapper-level learning baked into base weights.
Full theory in the whitepaper. Implementation walkthrough in the tutorial. Framework architecture spec in the Architecture doc.
Each domain page is a standalone read for a specific practitioner audience — no whitepaper prerequisite. Every page now includes a Build it with AUA v1.0 section with a working aua_config.yaml and CLI commands for that domain.
The framework's strongest domain: correctness signals are automatic and unambiguous. Tests pass or fail. The entire RTX 4090 experiment ran here — 69.6% error reduction, +43.3pp VCG gain, 502 DPO pairs. Configure with aua init --preset coding.
AUA is the inference router. 6× cheaper per token for specialist domain queries on consumer hardware. Use hardware tier templates and GET /metrics/cost to track GPU hours and cost per query in real time.
Safety weights shift automatically by context (w_s=0.90 in school zones). Blue-green with T≥246 for safety-critical fields. The append-only audit log and hash chain provide the runtime evidence that safety certification requires.
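Context-dependent weight shifting reduces to selecting a weight profile from runtime context. A toy sketch — only the school-zone value w_s = 0.90 comes from the text; the other numbers and the context keys are illustrative assumptions:

```python
# Toy sketch of context-dependent safety weighting. The school-zone
# value w_s = 0.90 comes from the docs; other values are illustrative.

def safety_weight(context: dict) -> float:
    """Select w_s from the driving context."""
    if context.get("school_zone"):
        return 0.90
    if context.get("highway"):
        return 0.60   # illustrative
    return 0.70       # illustrative default

print(safety_weight({"school_zone": True}))  # 0.9
```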
Audience: Waymo · Cruise · Aurora · AV stack engineers
Industrial robots, cobots, AMRs, drones, surgical systems. AUA's utility log + assertions store + blue-green deployment log constitute the runtime evidence that living assurance documents require — not just design-time proofs.
Audience: Robotics, safety-case engineering, autonomy
The audit trail is no longer optional — 51 state pricing bills since Jan 2025. AUA adds per-decision utility logs, a correction loop, and a structured audit chain on top of any pricing model. Configure with the pricing field preset.
Audience: Pricing platforms, marketplace optimisation, revenue management
The same formula — with the stability weight shifting from 0.40 to 0.80 under demand surge — produces correct dispatch without separate rule sets. C_min=0.95 halts automated dispatch when forecasts are unreliable. Full DER and VPP coverage.
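The dispatch gate described for the energy domain can be sketched in a few lines. The 0.40/0.80 stability weights and C_min = 0.95 come from the text; the surge-detection threshold is an illustrative assumption:

```python
# Sketch of surge-dependent stability weighting and the C_min gate.
# The values 0.40, 0.80, and C_min = 0.95 come from the text; the
# 1.5x surge cutoff is an assumed illustration.

C_MIN = 0.95

def stability_weight(demand_ratio: float) -> float:
    """Shift the stability weight from 0.40 to 0.80 under demand surge."""
    return 0.80 if demand_ratio > 1.5 else 0.40

def may_dispatch(forecast_confidence: float) -> bool:
    """Halt automated dispatch when forecast confidence falls below C_min."""
    return forecast_confidence >= C_MIN

print(stability_weight(2.0), may_dispatch(0.90))  # 0.8 False
```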
Audience: Grid software, DERMS / VPP, smart-home EMS
The only domain where curiosity (w_k=0.15) is the primary value driver. Platform signal — saves, downloads, shares — becomes a structured quality measure that feeds the correction loop. Content Efficacy × Discoverability as a geometric mean.
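The Content Efficacy × Discoverability combination as a geometric mean is direct to sketch; the component scores below are illustrative:

```python
# Sketch: Content Efficacy and Discoverability combined as a
# geometric mean, so a zero in either component zeroes the score.
import math

def creative_quality(efficacy: float, discoverability: float) -> float:
    return math.sqrt(efficacy * discoverability)

print(round(creative_quality(0.9, 0.4), 3))  # 0.6
```

The geometric mean is the natural choice here: unlike an arithmetic mean, great content that nobody can find (or findable content with no efficacy) scores near zero rather than averaging out.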
Audience: Generative media, creative tooling, content platforms
Updated for v1.0: documents the actual production control plane that shipped — not aspirational design. Covers the request lifecycle end-to-end, the plugin and hook system, all four deployment profiles, state store backends (SQLite → Postgres), observability wiring, failure modes, and scaling discipline.
Audience: Staff / Senior SWE · AI infrastructure teams · LLM systems builders
Structured like Django docs: Quick Start in 5 commands, progressive Tutorial covering routing through blue-green deployment, How-To guides for plugins, security, and observability, and a CLI + API Reference. Every section runs real commands against a live framework.
The whitepaper is the complete theoretical and empirical reference — now split into focused documents for easier navigation. All original anchors are preserved.
Standard Python install. No GPU required to start — use the macbook tier with Ollama, or the single-4090 tier if you have a GPU.
Runtime only (Ollama / CPU). GPU serving and dev extras are optional.
# Runtime
pip install adaptive-utility-agent

# With GPU backend (Linux + CUDA)
pip install "adaptive-utility-agent[vllm]"

# With dev tools (tests, linting)
pip install "adaptive-utility-agent[dev]"
Creates aua_config.yaml for your hardware. Presets configure field weights, models, and thresholds.
# Mac / CPU — uses Ollama
aua init my-project --preset coding \
  --tier macbook

# RTX 4090 — vLLM AWQ
aua init my-project --preset coding \
  --tier single-4090

cd my-project
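For orientation, a generated aua_config.yaml might look roughly like this. This is an illustrative sketch only — the keys and values are assumptions; consult the file aua init generates for the real schema:

```yaml
# Illustrative sketch only — the generated schema may differ.
preset: coding
tier: single-4090
models:
  - name: qwen-coder-7b-awq
    backend: vllm
fields:
  software_engineering:
    weights: {w_e: 0.6, w_c: 0.3, w_k: 0.1}   # illustrative values
```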
Every check shows PASS / FAIL / WARN with fix instructions. Exit 0 = all good.
aua doctor
✓ Config valid
✓ Models reachable
✓ Ports available
✓ Certs generated
✓ State store ready

# Strict mode (warnings as failures)
aua doctor --strict

# Machine-readable
aua doctor --json
Starts specialists, router, and arbiter. Add --with-ui for the Chat UI at localhost:3000.
aua serve
✓ specialists started (2)
✓ router started :8000
✓ arbiter started :8001

# With Chat UI (Next.js)
aua serve --with-ui

# Dry run — print cmds only
aua serve --dry-run
First query:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a binary search in Python", "session_id": "demo"}'

The request routes to the coding specialist, the response is scored with the utility function, and the reply includes the utility score, field, and trace ID.
Explore the simulation first: The original correction loop simulation still works with nothing beyond three Python packages.

git clone https://github.com/praneethtota/Adaptive-Utility-Agent
cd Adaptive-Utility-Agent/agent
pip install numpy scipy matplotlib
python3 simulate.py

The 3-cycle simulation reproduces the 69.6% error reduction and the +0.1160 utility improvement.