Adaptive Utility Agents · v1.0.0
Praneeth Tota · May 2026 · v1.0.0 · Shipped

The production framework for
self-correcting multi-specialist AI

AUA is a batteries-included Python framework for routing queries to domain-expert language models, scoring outputs with a utility function, resolving contradictions through a formal arbiter, and feeding verified corrections back into training — all without editing framework internals. Think of it as Django for adaptive multi-model LLM systems.

Measured · Simulation
69.6%
Reduction in repeated errors vs. uncalibrated baseline (cycles 2–5)
Measured · RTX 4090
+43.3pp
VCG arbitration gain over no-routing — Qwen2.5-7B AWQ (p=0.0003, d=1.02)
Analytical
2–6×
Lower cost per token vs. monolithic frontier LLMs on consumer GPU hardware
v1.0 Framework
132
Tests passing across 10 CLI command groups and 20 REST API endpoints
v1.0 Framework
8
Plugin Protocol interfaces — extend any layer without touching framework code
Praneeth Tota · Ph.D. Computer Science · Algorithmic Game Theory · Illinois Institute of Technology · linkedin.com/in/praneethtota · praneethtota.github.io
Evidence scope:
Simulation: 69.6% error reduction · U↔quality correlation (r=0.461, p<10⁻⁴⁰)
RTX 4090: VCG +43.3pp routing gain · 502 DPO pairs · full blue-green pipeline (p=0.0003, d=1.02)
Analytical: 2–6× cost advantage on consumer hardware
Shipped v1.0: 132 tests · 20 endpoints · 8 plugin interfaces · 16 Prometheus metrics

A complete framework, not a prototype

v1.0 ships everything needed to run a multi-specialist LLM system in production — from local Ollama on a MacBook to a quad-GPU cluster with mTLS, encrypted state, and Grafana dashboards.

CLI — 10 command groups
aua init, aua serve, aua doctor, aua config, aua models, aua fields, aua presets, aua token, aua eval, aua extensions — 40+ subcommands with JSON output, strict mode, and correct exit codes.
40+ subcommands · JSON output on every command
🔌
Plugin system — 8 Protocol interfaces
Replace any layer without editing AUA source. FieldClassifierPlugin, UtilityScorerPlugin, ArbiterPolicyPlugin, PromotionPolicyPlugin, CorrectionStorePlugin, ModelBackendPlugin, HookPlugin, AUAMiddleware — all with type hints, docstrings, and example implementations.
8 Protocol interfaces · 11 hook points · ordered middleware pipeline
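The exact Protocol signatures ship with the framework and are not reproduced here; as a sketch of how a structural `Protocol` plugin slot works in Python, assuming a hypothetical single-method `UtilityScorerPlugin` shape:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class UtilityScorerPlugin(Protocol):
    """Assumed shape of the scorer slot; the real AUA signature may differ."""
    def score(self, response: str, field: str) -> float: ...


class LengthPenaltyScorer:
    """Toy scorer: mildly penalizes very long responses. Illustrative only."""
    def score(self, response: str, field: str) -> float:
        penalty = min(len(response) / 10_000, 0.5)
        return 1.0 - penalty


scorer = LengthPenaltyScorer()
# Structural typing: no inheritance needed, the method shape is the contract.
print(isinstance(scorer, UtilityScorerPlugin))  # True
```

Because the interfaces are Protocols, any class with the right method shape plugs in; you never subclass framework code.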
🔒
Production security
Bearer token auth (HMAC-SHA256) across 14 fine-grained scopes. mTLS between router, specialists, and arbiter. Secrets management via env vars, Vault, AWS SM, or GCP SM. AES-256-GCM encryption at rest. Append-only hash-chain audit log.
14 auth scopes · AES-256-GCM at rest · mTLS internal comms
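The token wire format is not specified here; a minimal standard-library sketch of HMAC-SHA256 bearer-token signing and verification, where the payload encoding and secret handling are assumptions:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # illustrative; load from env/Vault in practice


def sign_token(payload):
    """Encode a payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig


def verify_token(token):
    """Return the payload if the signature checks out, else None."""
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body))


tok = sign_token({"scopes": ["eval:run", "models:read"]})
print(verify_token(tok))                                      # payload round-trips
print(verify_token(tok[:-1] + ("0" if tok[-1] != "0" else "1")))  # None: tampered
```

`hmac.compare_digest` gives constant-time comparison, which matters for any bearer-token check.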
📊
Observability stack
Prometheus metrics endpoint (GET /metrics, 16 metrics), pre-built Grafana dashboard (20 panels), OpenTelemetry instrumentation, Datadog OTEL preset, structured JSON logging with session/trace IDs, 8 Prometheus alert rules.
16 Prometheus metrics · 20 Grafana panels · W3C trace propagation
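As an illustration of structured JSON logging with session/trace IDs, a minimal stdlib formatter; the field names below are assumptions, not AUA's actual log schema:

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line; field names are illustrative."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "session_id": getattr(record, "session_id", None),
            "trace_id": getattr(record, "trace_id", None),
        })


logger = logging.getLogger("aua.demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Values passed via `extra` land on the record and into the JSON line.
logger.info("query routed", extra={"session_id": "demo",
                                   "trace_id": uuid.uuid4().hex})
```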
🧪
Evaluation harness
aua eval run, aua eval report, aua eval compare --baseline blue --candidate green. 6 built-in smoke datasets (coding, math, routing, correction, arbiter, safety). JSON output for CI. DPO pair export: aua dpo export --format preference-pairs.
6 smoke datasets · regression detection · DPO export
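The preference-pairs export schema is not documented here; a sketch of one plausible shape, grouping utility-scored responses by prompt so the top scorer becomes `chosen` (all names and records are hypothetical):

```python
import json

# Hypothetical utility-scored outputs for one prompt.
scored = [
    {"prompt": "Reverse a list in Python", "response": "xs[::-1]", "utility": 0.91},
    {"prompt": "Reverse a list in Python", "response": "loop and append", "utility": 0.42},
]


def to_preference_pairs(records):
    """Pair each lower-utility response against the best one per prompt."""
    by_prompt = {}
    for r in records:
        by_prompt.setdefault(r["prompt"], []).append(r)
    pairs = []
    for prompt, rs in by_prompt.items():
        rs.sort(key=lambda r: r["utility"], reverse=True)
        for rejected in rs[1:]:
            pairs.append({"prompt": prompt,
                          "chosen": rs[0]["response"],
                          "rejected": rejected["response"]})
    return pairs


for pair in to_preference_pairs(scored):
    print(json.dumps(pair))  # one JSONL record per pair, ready for a DPO trainer
```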
💬
Chat UI + Session API
Next.js 14 reference app with three-zone layout: session sidebar, chat window, Framework Debugger. AUA Controls drawer exposes all routing parameters live. 5 Chat Session API endpoints, persistent sessions in SQLite. Start with aua serve --with-ui.
5 session endpoints · Framework Debugger · AUA Controls drawer
🐳
Docker + hardware tiers
docker-compose up starts the full stack. Four profiles: cpu (Ollama), gpu (vLLM), observability (Prometheus + Grafana), secure. Four hardware tier templates: macbook, single-4090, quad-4090, a100-cluster — each with annotated YAML.
4 Docker profiles · 4 hardware tiers · hot reload via SIGHUP
📋
Registries + presets
Built-in model registry with aliases (qwen-coder-7b-awq, llama3-8b, …), field registry (software_engineering, mathematics, medicine, …), and presets (coding, math, research, medical-safe, legal-safe, …). One-line init: aua init --preset coding --tier single-4090.
Built-in aliases · aua models list · aua presets inspect

Utility as a control law, not a monitor

The utility function U = w_e(f)·E + w_c(f)·C + w_k(f)·K governs everything: how strongly each error is penalized at training time, when corrections are injected into the system prompt, whether a new deployment passes acceptance, and whether the system should act or abstain. The same formula across every domain — only the field weights change.

U = w_e(f) · E + w_c(f) · C + w_k(f) · K

E — Efficacy: performance relative to a human baseline, in [0, 1]
C — Confidence: internal consistency, penalized by contradictions, in [0, 1]
K — Curiosity: exploration bonus for novel high-upside domains (capped at 50%)
f — field: surgery / software / creative / AV / energy / ...

Decision rule: act if C ≥ C_min(f) AND E ≥ E_min(f); abstain otherwise → escalate to human or safe state.

Field-specific penalty multipliers route into DPO training:
Surgery contradiction → 10× training weight
Software engineering → 2× training weight
Creative writing → 1× training weight
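The formula and decision rule fit in a few lines of Python. The weights and thresholds below are illustrative placeholders, not the licensing-derived values from the whitepaper:

```python
# Illustrative per-field weights and gates; real values come from §4-5.
FIELD_WEIGHTS = {
    "surgery":  {"w_e": 0.45, "w_c": 0.50, "w_k": 0.05, "C_min": 0.95, "E_min": 0.90},
    "software": {"w_e": 0.60, "w_c": 0.30, "w_k": 0.10, "C_min": 0.70, "E_min": 0.60},
}


def utility(field, E, C, K):
    """U = w_e(f)*E + w_c(f)*C + w_k(f)*K -- same formula for every field."""
    w = FIELD_WEIGHTS[field]
    return w["w_e"] * E + w["w_c"] * C + w["w_k"] * K


def decide(field, E, C):
    """Act only when both confidence and efficacy clear the field's gates."""
    w = FIELD_WEIGHTS[field]
    return "act" if C >= w["C_min"] and E >= w["E_min"] else "abstain"


print(round(utility("software", E=0.8, C=0.9, K=0.3), 3))  # 0.78
print(decide("surgery", E=0.92, C=0.90))   # abstain: C below surgical C_min
print(decide("software", E=0.92, C=0.90))  # act
```

The same E, C pair that acts in software abstains in surgery: only the field weights and gates differ, never the formula.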

The additive weighted structure is not a design choice — it is the unique functional form required by five behavioral axioms, proved from first principles via Debreu's representation theorem (Theorem B.1, Appendix B). Field weights are derived from professional licensing standards, not arbitrary parameters. See §4–5 of the whitepaper.

Layer 1 · Milliseconds

Session correction

Contradiction detected → corrective assertion stored → injected into system prompt. Immediate, no weight change. Persists across sessions via the assertions store.
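A toy sketch of the Layer 1 mechanic, with hypothetical names (`AssertionStore`, `build_system_prompt`) standing in for the framework's actual assertions store:

```python
class AssertionStore:
    """In-memory stand-in for the persistent assertions store."""
    def __init__(self):
        self._by_session = {}

    def add(self, session_id, assertion):
        self._by_session.setdefault(session_id, []).append(assertion)

    def for_session(self, session_id):
        return self._by_session.get(session_id, [])


def build_system_prompt(base, store, session_id):
    """Inject stored corrections into the system prompt -- no weight change."""
    corrections = store.for_session(session_id)
    if not corrections:
        return base
    return (base + "\n\nVerified corrections (treat as ground truth):\n"
            + "\n".join(f"- {c}" for c in corrections))


store = AssertionStore()
store.add("demo", "list.sort() returns None; use sorted() for a new list")
print(build_system_prompt("You are a coding specialist.", store, "demo"))
```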

Layer 2 · Hours

Calibration cycle

Utility-scored pairs → field-penalty-weighted DPO fine-tuning → LoRA adapter → benchmark gate → deploy. Weight-level, permanent, generalizing. Surgical error trained 10× harder than creative writing.
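A sketch of how the field penalty multipliers could scale a standard DPO loss; the multipliers come from the text, while the loss wiring itself is an assumption about the training setup:

```python
import math

# Penalty multipliers from the text: surgery 10x, software 2x, creative 1x.
FIELD_PENALTY = {"surgery": 10.0, "software_engineering": 2.0, "creative_writing": 1.0}


def weighted_dpo_loss(logp_chosen, logp_rejected, field, beta=0.1):
    """Standard DPO loss, -log(sigmoid(beta * margin)), scaled per field."""
    margin = beta * (logp_chosen - logp_rejected)
    base_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return FIELD_PENALTY[field] * base_loss


pair = dict(logp_chosen=-1.0, logp_rejected=-3.0)
print(weighted_dpo_loss(**pair, field="creative_writing"))
print(weighted_dpo_loss(**pair, field="surgery"))  # same pair, 10x the gradient weight
```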

Layer 3 · Monthly

Release integration

Accumulated adapters distilled into new base fine-tune. Full regression suite. New starting point for the next correction cycle. Wrapper-level learning baked into base weights.

Full theory in the whitepaper. Implementation walkthrough in the tutorial. Framework architecture spec in the Architecture doc.

Built for the problems that actually matter

Each domain page is a standalone read for a specific practitioner audience — no whitepaper prerequisite. Every page now includes a Build it with AUA v1.0 section with a working aua_config.yaml and CLI commands for that domain.

MVP domain · Fully runnable

Software Engineering

The framework's strongest domain: correctness signals are automatic and unambiguous. Tests pass or fail. The headline results were all produced here — 69.6% error reduction in simulation, and +43.3pp VCG gain with 502 DPO pairs in the RTX 4090 experiment. Configure with aua init --preset coding.

Coding agents, dev-tools, backend / infra
Fleet economics

AI Data Centers

AUA is the inference router. Up to 6× cheaper per token for specialist domain queries on consumer hardware. Use hardware tier templates and GET /metrics/cost to track GPU hours and cost per query in real time.

Inference infra, GPU cloud, ML platforms
Safety · certification

Self-Driving Vehicles

Safety weights shift automatically by context (w_s=0.90 in school zones). Blue-green with T≥246 for safety-critical fields. The append-only audit log and hash chain provide the runtime evidence that safety certification requires.

Waymo · Cruise · Aurora · AV stack engineers
ISO 10218:2025 · IEC 61508

Autonomous Systems

Industrial robots, cobots, AMRs, drones, surgical systems. AUA's utility log + assertions store + blue-green deployment log constitute the runtime evidence that living assurance documents require — not just design-time proofs.

Robotics, safety-case engineering, autonomy
Full auditability

Dynamic Pricing

The audit trail is no longer optional — 51 state pricing bills since Jan 2025. AUA adds per-decision utility logs, correction loop, and a structured audit chain on top of any pricing model. Configure with the pricing field preset.

Pricing platforms, marketplace optimization, revenue management
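The audit chain described above can be sketched as an append-only hash chain: each entry commits to its predecessor, so tampering with any earlier decision is detectable (the entry schema here is illustrative, not AUA's):

```python
import hashlib
import json


def append_entry(chain, record):
    """Append a record whose hash covers the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    chain.append({"prev": prev, "record": record, "hash": digest})


def verify_chain(chain):
    """Recompute every link; any edit anywhere breaks verification."""
    prev = "0" * 64
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        if entry["prev"] != prev:
            return False
        if entry["hash"] != hashlib.sha256((prev + body).encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True


chain = []
append_entry(chain, {"decision": "price_update", "utility": 0.82})
append_entry(chain, {"decision": "price_update", "utility": 0.79})
print(verify_chain(chain))            # True
chain[0]["record"]["utility"] = 0.99  # tamper with an earlier entry
print(verify_chain(chain))            # False
```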
Grid · DER · Smart home

Energy Systems

The same formula — stability weight shifting from 0.40 to 0.80 under demand surge — produces correct dispatch without separate rule sets. C_min=0.95 halts automated dispatch when forecasts are unreliable. Full DER and VPP coverage.

Grid software, DERMS / VPP, smart-home EMS
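A toy version of the dispatch rule described above; the 0.40 → 0.80 stability-weight shift and the C_min = 0.95 gate come from the text, while the names and the surge threshold are illustrative:

```python
def stability_weight(demand_ratio):
    """Shift stability weight from 0.40 to 0.80 under demand surge.
    The 1.2 surge threshold is an assumed value."""
    return 0.80 if demand_ratio > 1.2 else 0.40


def dispatch(forecast_confidence, demand_ratio):
    """Halt automated dispatch whenever forecast confidence drops below C_min."""
    if forecast_confidence < 0.95:
        return "halt: escalate to operator"
    w_s = stability_weight(demand_ratio)
    return f"dispatch with stability weight {w_s:.2f}"


print(dispatch(0.97, demand_ratio=1.5))  # surge: stability weight 0.80
print(dispatch(0.90, demand_ratio=1.5))  # unreliable forecast: halt
```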
Curiosity-led · Platform signal

Creative Systems

The only domain where curiosity (w_k=0.15) is the primary value driver. Platform signal — saves, downloads, shares — becomes a structured quality measure that feeds the correction loop. Content Efficacy × Discoverability as geometric mean.

Generative media, creative tooling, content platforms
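The geometric-mean composition mentioned above, sketched directly; a zero in either factor zeroes the overall score, which is the point of using a geometric rather than arithmetic mean:

```python
import math


def creative_quality(content_efficacy, discoverability):
    """Geometric mean of efficacy and discoverability, both in [0, 1]."""
    return math.sqrt(content_efficacy * discoverability)


print(round(creative_quality(0.8, 0.5), 3))  # 0.632
print(creative_quality(0.9, 0.0))            # 0.0 -- undiscovered work scores zero
```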

Staff-level systems view of the complete stack

Productionizing the Adaptive Utility Agent

Updated for v1.0: documents the actual production control plane that shipped — not aspirational design. Covers the request lifecycle end-to-end, the plugin and hook system, all four deployment profiles, state store backends (SQLite → Postgres), observability wiring, failure modes, and scaling discipline.

Request lifecycle · Plugin system · Deployment profiles · State store backends · Security wiring · Observability · Failure modes

Audience: Staff / Senior SWE · AI infrastructure teams · LLM systems builders

Quick start to advanced hooks — no prerequisites.

AUA Framework Tutorial

Structured like Django docs: Quick Start in 5 commands, progressive Tutorial covering routing through blue-green deployment, How-To guides for plugins, security, and observability, and a CLI + API Reference. Every section runs real commands against a live framework.

aua init + serve · Config & presets · Routing & utility · Arbiter + correction · Blue-green deploy · Plugin interfaces · Hooks & middleware · Security · Observability · Eval harness · Chat UI · CLI reference

Theory, proofs, and empirical results

The whitepaper is the complete theoretical and empirical reference — now split into focused documents for easier navigation. All original anchors are preserved.

Theory

Architecture

Empirical

Scope & Roadmap

Running in under 5 minutes

Standard Python install. No GPU required to start — use the macbook tier with Ollama, or the single-4090 tier if you have a GPU.

Step 1 · Install

pip install

Runtime only (Ollama / CPU). GPU serving and dev extras are optional.

# Runtime
pip install adaptive-utility-agent

# With GPU backend (Linux + CUDA)
pip install "adaptive-utility-agent[vllm]"

# With dev tools (tests, linting)
pip install "adaptive-utility-agent[dev]"
Step 2 · Scaffold

aua init

Creates aua_config.yaml for your hardware. Presets configure field weights, models, and thresholds.

# Mac / CPU — uses Ollama
aua init my-project --preset coding \
  --tier macbook

# RTX 4090 — vLLM AWQ
aua init my-project --preset coding \
  --tier single-4090

cd my-project
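For orientation, a hypothetical fragment of what aua init might write; the keys and values here are illustrative only, and the generated aua_config.yaml is the source of truth:

```yaml
# Illustrative fragment -- the real schema comes from `aua init`.
preset: coding
tier: single-4090
fields:
  software_engineering:
    weights: {w_e: 0.60, w_c: 0.30, w_k: 0.10}   # placeholder values
models:
  - alias: qwen-coder-7b-awq   # alias from the built-in model registry
    backend: vllm
```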
Step 3 · Validate

aua doctor

Every check shows PASS / FAIL / WARN with fix instructions. Exit 0 = all good.

aua doctor
✓ Config valid
✓ Models reachable
✓ Ports available
✓ Certs generated
✓ State store ready

# Strict mode (warnings as failures)
aua doctor --strict

# Machine-readable
aua doctor --json
Step 4 · Run

aua serve

Starts specialists, router, and arbiter. Add --with-ui for the Chat UI at localhost:3000.

aua serve
✓ specialists started (2)
✓ router started :8000
✓ arbiter started :8001

# With Chat UI (Next.js)
aua serve --with-ui

# Dry run — print cmds only
aua serve --dry-run

First query: curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"prompt": "Write a binary search in Python", "session_id": "demo"}' — routes to the coding specialist, scores the response with the utility function, and returns utility score, field, and trace ID.

Explore the simulation first: The original correction loop simulation still works without any setup. git clone https://github.com/praneethtota/Adaptive-Utility-Agent && cd Adaptive-Utility-Agent/agent && pip install numpy scipy matplotlib && python3 simulate.py — 3-cycle simulation showing 69.6% error reduction and the +0.1160 utility improvement.