AUA v1.0 is the production control plane. This document covers how it works end-to-end: the request lifecycle, plugin and hook system, all four deployment profiles, state store backends (SQLite → Postgres), security wiring, observability pipeline, failure modes, and scaling discipline.
The framework described in this document is fully implemented in AUA v1.0.0:
pip install adaptive-utility-agent && aua init --preset coding --tier macbook && aua serve
Modern LLM systems are fundamentally stateless at runtime. A wrong answer produced today is often produced again tomorrow, and the system typically has no durable mechanism for learning from that failure until the next model release. That is the operational gap this architecture is designed to close.
The Adaptive Utility Agent can be productionized as a control plane around LLM inference. Instead of treating the model as the entire system, the model becomes one component inside a larger closed-loop architecture that evaluates outputs, corrects failures, stores useful lessons, and biases future behavior accordingly.
The design goal is not to replace the model. The design goal is to surround the model with enough structure that repeated failures become detectable, correctable, and less likely to recur.
A standard LLM application pipeline is usually little more than prompt construction followed by inference and response return. That simplicity is attractive, but it also means there is no persistent operational memory and no meaningful notion of control.
Prompt → LLM → Output
The production version of the Adaptive Utility Agent inserts multiple stages around that core inference call:
Prompt → LLM → Evaluate → Correct → Learn → Output
- **Evaluate.** Scores generated outputs against correctness, consistency, policy, and domain-specific constraints.
- **Correct.** Performs targeted retries, rewriting, constraint injection, or heavier secondary reasoning passes.
- **Learn.** Stores failure cases, successful corrections, and reusable patterns that can influence future requests.
- **Utility.** Balances correctness gains against latency, cost, and confidence to decide whether the loop should continue.
┌────────────────────┐
│ Client Request │
└─────────┬──────────┘
│
┌────────▼────────┐
│ Orchestrator │
│ (API Layer) │
└────────┬────────┘
│
┌────────▼────────┐
│ LLM Inference │
│ (Base Model) │
└────────┬────────┘
│
┌────────────▼────────────┐
│ Evaluation Engine │
│ - Consistency checks │
│ - Constraint validation │
│ - Failure detection │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Correction Engine │
│ - Regeneration │
│ - Constraint injection │
│ - Output rewriting │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Memory / Learning Store│
│ - Failure logs │
│ - Correction patterns │
│ - Context embeddings │
└────────────┬────────────┘
│
┌────────▼────────┐
│ Final Output │
└─────────────────┘
The key architectural move is separating inference from control. The model generates candidates; the surrounding system decides what should be trusted, corrected, persisted, or returned.
The orchestrator owns the full request lifecycle. It is responsible for request admission, trace propagation, latency budget enforcement, retry boundaries, and routing decisions between lightweight and heavyweight correction paths.
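The lifecycle responsibilities above can be sketched as a small context object that the orchestrator threads through every stage. This is a minimal illustration; the class and field names are invented for this sketch and are not part of any AUA API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    """Per-request state carried through admission, inference, and correction."""
    trace_id: str
    latency_budget_s: float               # hard wall-clock budget for the request
    max_passes: int = 3                   # retry boundary for the correction loop
    started_at: float = field(default_factory=time.monotonic)
    passes_used: int = 0

    def remaining_budget_s(self) -> float:
        return self.latency_budget_s - (time.monotonic() - self.started_at)

    def may_continue(self) -> bool:
        """Admission check before each additional correction pass."""
        return self.passes_used < self.max_passes and self.remaining_budget_s() > 0

ctx = RequestContext(trace_id="req-001", latency_budget_s=2.0)
assert ctx.may_continue()            # fresh request: budget and passes remain
ctx.passes_used = ctx.max_passes
assert not ctx.may_continue()        # retry boundary reached
```

Keeping budget enforcement in one object makes the retry boundary auditable: every routing decision between the lightweight and heavyweight correction paths can log the context it saw.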
The base model remains stateless and unchanged. This is an explicit design constraint. The point of the architecture is to improve behavior without requiring fine-tuning every time the system discovers a recurring class of mistakes.
Practical implication: the same control plane can sit in front of different models and be upgraded independently of the underlying model provider.
The evaluation engine is where the system turns “this answer feels wrong” into machine-actionable signals. The exact checks are domain dependent, but the contract is stable: inspect the candidate output and emit structured failure metadata.
{
"is_valid": false,
"failure_type": "logical_inconsistency",
"confidence": 0.82,
"details": {
"rule_id": "consistency.cross_step.v1",
"severity": 0.74
}
}
Examples of evaluation strategies include the checks named in the diagram above: cross-step consistency checks, schema and constraint validation, and rule-based failure detection, alongside domain-specific validators such as tests for generated code.
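A toy evaluator shows the contract in action: inspect a candidate output and emit the structured failure metadata shown in the JSON above. The rule, the `rule_id`, and the confidence values are invented for illustration.

```python
import re

def evaluate_numeric_consistency(answer: str) -> dict:
    """Toy evaluator: flags an answer whose stated sum disagrees with its
    own operands. Emits the structured failure contract from the text;
    rule_id and severity values are illustrative."""
    m = re.search(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", answer)
    if m:
        a, b, total = (int(g) for g in m.groups())
        if a + b != total:
            return {
                "is_valid": False,
                "failure_type": "numerical_error",
                "confidence": 0.95,
                "details": {"rule_id": "arith.sum.v1", "severity": 0.9},
            }
    return {"is_valid": True, "failure_type": None, "confidence": 0.95, "details": {}}

assert evaluate_numeric_consistency("2 + 2 = 5")["failure_type"] == "numerical_error"
assert evaluate_numeric_consistency("2 + 2 = 4")["is_valid"] is True
```

Real evaluators are far richer, but the stable output shape is what lets the correction engine and memory store stay decoupled from any individual check.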
The correction engine turns evaluation output into intervention. In production, this should be tiered rather than monolithic. Cheap fixes should run first; expensive fixes should be reserved for requests whose expected utility gain is high enough to justify the added cost.
- **Constraint injection.** Augment the next pass with specific rules or forbidden patterns derived from the evaluator.
- **Targeted regeneration.** Retry only the failing span or step rather than regenerating the full answer from scratch.
- **Deterministic rewriting.** Use deterministic transforms when the failure is mechanical and well understood.
- **Escalation.** Invoke a stronger model or heavier reasoning path only when lower-cost options fail.
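The tiered dispatch can be sketched as a cheapest-first policy. Strategy names, relative costs, and the failure-type routing here are invented for illustration, not AUA constants.

```python
def pick_correction(failure_type: str, budget: float) -> str:
    """Return the cheapest applicable correction strategy within budget.
    Tiers are ordered cheapest-first; costs and mappings are illustrative."""
    tiers = [
        ("deterministic_rewrite", 0.05, {"formatting_error"}),
        ("constraint_injection",  0.20, {"policy_violation", "logical_inconsistency"}),
        ("span_retry",            0.40, {"numerical_error", "logical_inconsistency"}),
        ("escalation_model",      1.00, None),   # None = handles anything
    ]
    for name, cost, handles in tiers:
        if cost <= budget and (handles is None or failure_type in handles):
            return name
    return "abstain"

assert pick_correction("formatting_error", 1.0) == "deterministic_rewrite"
assert pick_correction("numerical_error", 0.5) == "span_retry"
assert pick_correction("numerical_error", 0.1) == "abstain"   # nothing affordable
```

The important property is the ordering guarantee: the escalation model is only reachable when every cheaper tier is inapplicable or over budget.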
This is the system’s adaptive substrate. The memory layer stores failure traces, successful corrections, and high-value context embeddings that can be retrieved to steer future generations away from known failure modes.
{
"failure_type": "numerical_error",
"input_context": "...",
"incorrect_output": "...",
"corrected_output": "...",
"embedding": [...],
"tenant_id": "acme-prod",
"created_at": "2026-04-13T12:00:00Z"
}
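Retrieval over records like the one above can be sketched as a tenant-scoped nearest-neighbour lookup. The linear scan and the threshold values are illustrative; a vector DB replaces the scan at scale, and the record fields mirror the schema above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_lessons(store, tenant_id, query_embedding, k=2, min_sim=0.5):
    """Return the k most similar correction records for one tenant.
    Tenant scoping on the read path prevents cross-customer contamination."""
    scoped = [r for r in store if r["tenant_id"] == tenant_id]
    scored = [(cosine(r["embedding"], query_embedding), r) for r in scoped]
    scored.sort(key=lambda sr: sr[0], reverse=True)
    return [r for sim, r in scored[:k] if sim >= min_sim]

store = [
    {"tenant_id": "acme-prod", "failure_type": "numerical_error", "embedding": [1.0, 0.0]},
    {"tenant_id": "acme-prod", "failure_type": "policy_violation", "embedding": [0.0, 1.0]},
    {"tenant_id": "other",     "failure_type": "numerical_error", "embedding": [1.0, 0.0]},
]
hits = retrieve_lessons(store, "acme-prod", [0.9, 0.1])
assert [h["failure_type"] for h in hits] == ["numerical_error"]
```

The `min_sim` cutoff matters operationally: retrieving weakly related lessons is a direct path to the memory-poisoning failure mode discussed later.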
The production system should be understood as a closed loop, not a stateless request-response chain:
That gives the system the three properties most LLM deployments lack today: failures become detectable, corrections become systematic rather than ad hoc, and lessons persist across sessions instead of being rediscovered.
Important boundary: this is not “learning” in the same sense as gradient updates during model training. It is operational adaptation through retrieval, constraints, routing, and memory-informed control.
The utility function is the policy surface that decides whether additional correction is worth the cost. Without this layer, systems tend to either stop too early and return bad answers, or over-correct themselves into latency and cost collapse.
U = w1 * correctness + w2 * consistency - w3 * latency - w4 * cost + w5 * confidence
This is a production-oriented adaptation — weights are configurable per tenant and task class, and latency/cost appear explicitly as production objectives. The canonical derivation, formal axioms, and field-specific weight values are in §4 of the full whitepaper.
In production, weights should be configurable by tenant, task class, and domain. A legal assistant, a coding agent, and a creative writing system should not all optimize the same objective.
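A minimal sketch of that configurable utility surface, assuming simple scalar signals in [0, 1]. The weight values below are invented for illustration; they are not calibrated figures from the whitepaper.

```python
from dataclasses import dataclass

@dataclass
class UtilityWeights:
    """Per-tenant / per-task-class weights for the U formula above."""
    w_correct: float
    w_consist: float
    w_latency: float
    w_cost: float
    w_conf: float

def utility(s: dict, w: UtilityWeights) -> float:
    """U = w1*correctness + w2*consistency - w3*latency - w4*cost + w5*confidence."""
    return (w.w_correct * s["correctness"]
            + w.w_consist * s["consistency"]
            - w.w_latency * s["latency"]
            - w.w_cost * s["cost"]
            + w.w_conf * s["confidence"])

# A legal assistant weighs correctness heavily; an interactive chat product
# weighs latency. Same signals, different policy. Weights are illustrative.
legal = UtilityWeights(0.6, 0.2, 0.05, 0.05, 0.1)
chat = UtilityWeights(0.3, 0.1, 0.4, 0.1, 0.1)
signals = {"correctness": 0.9, "consistency": 0.8,
           "latency": 0.5, "cost": 0.2, "confidence": 0.7}
assert utility(signals, legal) > utility(signals, chat)
```

The same signals produce different continue/stop decisions under different weight profiles, which is exactly the point of making the policy surface explicit.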
| Signal | Why it matters | Typical source |
|---|---|---|
| Correctness | Primary reliability objective | Evaluators, tests, validators |
| Consistency | Prevents internal contradictions | Cross-step or cross-turn checks |
| Latency | Bounds user-facing delay | Request timers |
| Cost | Prevents runaway correction loops | Token and model spend accounting |
| Confidence | Supports abstention and routing | Scorers, historical outcomes |
The easiest way to build a fragile version of this system is to throw every artifact into a single blob store and hope retrieval works. A better production design separates fast-path, structured, and semantic state.
- **Relational store (Postgres).** Structured logs, evaluation records, correction metadata, per-tenant policies, and audit tables.
- **Vector store.** Semantic retrieval over prior failures, corrections, and domain context for memory-augmented prompting.
- **Hot cache (Redis).** Recent session context, active traces, routing hints, and short-lived correction state.
- **Object storage.** Longer-term artifact retention for raw traces, evaluator evidence, and offline analysis datasets.
Useful persisted objects include failure traces, correction records and their outcomes, evaluator verdicts with version identifiers, per-tenant policy and weight configurations, and tenant-scoped context embeddings.
A system like this can easily become opaque unless observability is a first-class requirement. Every correction step should be traceable, attributable, and measurable.
Recommended trace fields: request and trace IDs, tenant ID, correction pass count, evaluator verdicts and versions, correction strategies applied, per-pass latency and token spend, and the utility score that terminated the loop.
| Dimension | Benefit | Cost / risk |
|---|---|---|
| Reliability | Higher correctness and fewer repeated mistakes | More moving parts and more operational complexity |
| Latency | Multi-pass correction can rescue bad responses | Each pass adds user-visible delay |
| Cost | Heavier reasoning available when needed | Runaway inference spend if not bounded |
| Memory | System improves behavior across sessions | Stale, low-quality, or poisoned memory can degrade outputs |
If evaluators incorrectly flag good outputs as failures, the correction loop will waste cost and may degrade answer quality. This argues for confidence thresholds, evaluator versioning, and offline calibration against labeled datasets.
Some outputs become worse after intervention, especially when the system over-indexes on one constraint and destroys fluency or usefulness. Mitigation: tiered correction, candidate comparison, and hard retry caps.
If bad corrections enter the memory store and are later retrieved as guidance, the system can institutionalize the wrong behavior. Mitigation: write-path validation, trust scores, expiry policies, and periodic pruning.
If every correction step produces new evaluator findings, the system can spiral into latency or cost collapse. Mitigation: utility-based stop conditions, maximum pass counts, and explicit fallback modes.
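The stop conditions listed across these failure modes compose into one gate. A minimal sketch, assuming a scalar expected-gain estimate; all thresholds are illustrative defaults, not AUA constants.

```python
def should_continue(expected_gain: float, pass_cost: float, passes_used: int,
                    max_passes: int, budget_left: float,
                    min_net_gain: float = 0.02) -> bool:
    """Utility-based stop condition for the correction loop: another pass
    runs only if it respects the retry cap, fits the remaining budget, and
    is expected to pay for itself."""
    if passes_used >= max_passes:
        return False                      # hard retry cap
    if pass_cost > budget_left:
        return False                      # cost guard against runaway loops
    return expected_gain - pass_cost >= min_net_gain

assert should_continue(0.10, 0.03, passes_used=1, max_passes=3, budget_left=0.5)
assert not should_continue(0.10, 0.03, passes_used=3, max_passes=3, budget_left=0.5)
assert not should_continue(0.02, 0.03, passes_used=1, max_passes=3, budget_left=0.5)
```

Any one of the three checks failing ends the loop, which is what keeps a noisy evaluator from spiraling into latency or cost collapse.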
Staff-level reality: the hard part is not inventing the loop. The hard part is keeping the loop from becoming operationally pathological under real traffic, noisy evaluators, and cost constraints.
A pragmatic implementation can be built with standard backend infrastructure rather than exotic custom systems.
Client
│
├── FastAPI / Python control plane
│ ├── Request orchestration
│ ├── Evaluator dispatch
│ ├── Correction policy engine
│ └── Metrics / tracing hooks
│
├── LLM provider APIs
│ ├── Primary model
│ └── Escalation model
│
├── Postgres
├── Redis
├── Vector DB
└── Async queue (optional)
├── Celery
└── Kafka for higher-scale decoupling
This design is intentionally modular. Each evaluator can be implemented as an independent service or library. Each correction strategy can be swapped out or A/B tested. Each memory retrieval policy can evolve without changing the base model.
The architecture is straightforward at small scale and substantially more interesting at real production volume. The main scaling questions are not just throughput, but which requests deserve expensive correction and how to isolate tenants, memory, and policy state safely.
Keep memory, policies, and evaluation traces scoped by tenant to avoid cross-customer contamination.
Run cheap evaluators and fixes first; reserve heavy passes for high-value or high-risk requests.
Group evaluator calls and asynchronous analysis tasks where latency budgets allow it.
For interactive products, decide whether corrections happen before first token, during streaming, or after a draft is emitted.
At larger scale, a mature system would likely introduce dedicated evaluator services, asynchronous correction queues (Celery, or Kafka for higher-scale decoupling), and sharded, tenant-scoped memory and policy stores.
The monolithic control plane described above is the right starting point — deployable today with standard infrastructure. But it has a structural ceiling that becomes a real problem at scale: every calibration update touches the entire model's weight space. Fixing a recurring error in software engineering advice can subtly degrade medical reasoning through weight interference. Rollback means rolling back everything.
The Micro-Expert Architecture is the production evolution of this design — the same utility-governed control plane, decomposed into a graph of independently deployable domain submodels. Architecturally it is microservices applied to model inference: each specialist has its own weights, its own calibration cycle, and its own utility tracker. Updating surgery weights cannot affect software engineering weights. There are no shared parameters to interfere.
Router (Raft HA cluster — 150–300ms failover)
|
|-- field classifier + query decomposer
|-- probabilistic fan-out to 1-3 specialists per query
+-- response merger + conflict detection
|
|-- Surgery submodel -- own weights, own calibration
|-- Software Eng submodel -- own weights, own calibration
|-- Law submodel -- own weights, own calibration
+-- ...N domain submodels
|
Arbiter Agent
|
|-- 4-check contradiction resolution
| (logical, mathematical, cross-session, empirical)
|-- VCG arbitration -- dominant strategy truthfulness
+-- verified corrections --> DPO signal per submodel
Updating one submodel only modifies its own weights. Rolling back a bad surgery update is swapping one model file — not re-running the full training pipeline. Blast radius of any update is bounded to one domain.
A surgery model update in progress does not block a software engineering update. Each submodel runs its own blue-green cycle independently. Updates are decoupled at the infrastructure level.
Only the relevant 1–3 submodels activate per query. A graph of 20 domain submodels costs roughly the same per query as a single-domain model — inference cost scales with query complexity, not graph size.
If a submodel is mid-update, the router falls back to a parent or sibling domain model. The router maintains a fallback graph — no single point of failure at the specialist level.
Multi-domain queries — "write Python code to analyze patient drug dosage data" — require both software engineering and medicine submodels simultaneously. The router handles this through query decomposition and parallel fan-out:
1. Field classifier returns a distribution:
{"software_engineering": 0.65, "medicine": 0.35}
2. Router decomposes the query:
CS subquery: "Write Python data analysis code"
Med subquery: "What are clinical constraints on dosage data?"
3. Fan out: send subqueries to respective submodels in parallel
4. Merge: combine responses, apply weighted confidence from
field distribution (medicine response gets 35% weight)
5. Arbitration: if CS and Medicine models contradict each other
--> Arbiter Agent --> 4-check resolution --> DPO signal
The routing failure risk. Misrouted queries are not merely suboptimal. The routing experiment quantifies this: mismatched routing actively worsens correctness by 17.5% and dramatically degrades confidence calibration (Brier 0.292 vs. 0.160); the system becomes confidently wrong. Mitigations: probabilistic fan-out, entropy-based fallback, VCG arbitration, and the confidence gate, which catches most out-of-domain answers before they are returned. Full regime analysis is in §10.4.1 of the whitepaper.
This is the most operationally consequential property of the architecture: the decision of when to deploy an update, how fast to roll it out, and when to roll back is not made by a human operator on a schedule. It is made by the utility function itself — continuously, field-calibrated, with no manual intervention required under normal conditions.
The same U score that governs correction injection and DPO training weight also governs the entire deployment lifecycle. There is no separate deployment system — deployment is a direct consequence of sustained utility deviation.
Each submodel monitors its own utility score over a sliding window. A deployment cycle triggers automatically when deviation from baseline is both significant and sustained:
Trigger when ALL of:
|U_current - U_baseline| > delta(field) [significant deviation]
deviation sustained for >= T interactions [not noise]
held-out benchmark available [can evaluate candidate]
Both delta and T are derived directly from the field penalty multiplier that already governs DPO training — the same sensitivity that makes a surgical contradiction weighted 10x harder at training time also makes the surgical submodel 50x harder to trigger for deployment than a creative writing submodel:
delta(field) = 0.05 / penalty_multiplier(field)

  Surgery:           delta = 0.005   (penalty = 10x) — very sensitive
  Software Eng:      delta = 0.025   (penalty = 2x)  — moderate
  Creative Writing:  delta = 0.050   (penalty = 1x)  — relaxed

T(field) = (z_alpha * sigma / delta)^2   [power analysis, alpha = 0.05, sigma ~ 0.04]

  Surgery:           T >= 246 interactions — high confidence required
  Software Eng:      T >= 10 interactions  — responsive
  Creative Writing:  T >= 2 interactions   — fast
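With the stated constants (z_alpha ≈ 1.96 for alpha = 0.05, sigma ≈ 0.04), the per-field thresholds can be reproduced directly. Rounding T to the nearest integer is an assumption made here because it matches the published values.

```python
import math

def deployment_thresholds(penalty_multiplier: float,
                          z_alpha: float = 1.96, sigma: float = 0.04):
    """delta = 0.05 / penalty_multiplier; T = (z_alpha * sigma / delta)^2.
    Nearest-integer rounding of T is an assumption that reproduces the
    published per-field values."""
    delta = 0.05 / penalty_multiplier
    t = round((z_alpha * sigma / delta) ** 2)
    return delta, t

# Surgery (10x), Software Eng (2x), Creative Writing (1x)
for penalty, expected_t in [(10, 246), (2, 10), (1, 2)]:
    _, t = deployment_thresholds(penalty)
    assert t == expected_t
```

Note the quadratic coupling: halving delta quadruples T, so high-stakes fields pay for their sensitivity with much longer observation windows.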
Why field-calibration matters here. A wrong surgical update that serves 246 interactions before rollback causes real harm. A wrong creative writing update that serves 2 interactions before rollback is nearly harmless. The penalty multiplier that controls training severity also controls deployment sensitivity — the two are in the correct relationship by construction, not by tuning a separate parameter.
Once triggered, deployment proceeds through five phases — each governed by the utility score, with no manual gates required:
Phase 0 — Detection (ongoing):
U_monitor watches sliding window
If |U_current - U_baseline| > delta for T consecutive interactions:
trigger Phase 1
Phase 1 — Training (offline):
DPO calibration on accumulated (preferred, rejected) pairs
Weighted by field_penalty_multiplier + replay buffer
Produces candidate GREEN model
BLUE model continues serving 100% of traffic — no disruption
Phase 2 — Canary (5% green / 95% blue):
Router sends 5% of traffic to GREEN, 95% to BLUE
Minimum N_min = T(field) x 2 interactions before any shift
Both models log U scores per interaction
Phase 3 — Gradual shift (utility-weighted routing):
Every evaluation window, traffic split recomputed:
traffic_green = exp(U_green/tau) / (exp(U_green/tau) + exp(U_blue/tau))
tau(field): small tau = fast promotion, large tau = conservative
Surgery: tau = 0.05 promotes only when clearly better
Software Eng: tau = 0.20 moderate
Creative: tau = 0.50 fast, promotes on small differences
Enforced: clip(traffic_green, 0.05, 0.95)
As U_green > U_blue, traffic shifts automatically.
No manual intervention required.
Phase 4 — Promotion (traffic_green >= 1 - delta(field)):
traffic_green --> 1.0
BLUE enters cooldown — not retired, instant rollback still available
Cooldown = T(field) interactions
Phase 5 — Retirement:
If GREEN holds through cooldown without regression:
BLUE retired, weights freed
U_baseline = U_green (new baseline for next cycle)
delta recalibrated from observed variance
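The Phase 3 routing rule above is a two-way softmax with a field temperature. A minimal sketch using the formula as written, including the [0.05, 0.95] clip:

```python
import math

def traffic_split(u_green: float, u_blue: float, tau: float) -> float:
    """Phase 3 utility-weighted routing: softmax over U scores with field
    temperature tau, clipped to [0.05, 0.95] as enforced above."""
    g = math.exp(u_green / tau)
    b = math.exp(u_blue / tau)
    return min(0.95, max(0.05, g / (g + b)))

# Same utility gap, three field temperatures: a small tau promotes fast on a
# clear win, a large tau stays conservative.
gap = (0.78, 0.75)
assert traffic_split(*gap, tau=0.05) > traffic_split(*gap, tau=0.20) > traffic_split(*gap, tau=0.50)
assert traffic_split(0.5, 0.5, tau=0.20) == 0.5   # no evidence, even split
```

The clip keeps both models observable: GREEN never drops below the 5% canary floor, and BLUE always retains enough traffic to detect a late regression.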
Rollback is also fully automated — it does not require an on-call engineer to notice a problem:
Rollback triggers (any phase):
U_green < U_blue - epsilon for M consecutive interactions
OR contradiction_rate_green > contradiction_rate_blue x 1.5
OR benchmark regression > field_tolerance
Rollback action:
traffic_blue --> 1.0 instantly
GREEN flagged
Failure DPO pairs added to replay buffer with negative weight
Next candidate trained against this failure pattern explicitly
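The any-of trigger logic can be sketched as a single predicate. This covers two of the three triggers above (the benchmark-regression check is omitted for brevity); epsilon and M values are illustrative defaults.

```python
def should_rollback(u_green: list, u_blue: list, m: int = 20,
                    epsilon: float = 0.01,
                    contr_green: float = 0.0, contr_blue: float = 0.0) -> bool:
    """Fire if GREEN's U score trails BLUE's by more than epsilon for M
    consecutive interactions, or if GREEN's contradiction rate exceeds
    BLUE's by more than 1.5x."""
    sustained_deficit = (len(u_green) >= m and
                         all(g < b - epsilon
                             for g, b in zip(u_green[-m:], u_blue[-m:])))
    contradiction_spike = contr_blue > 0 and contr_green > contr_blue * 1.5
    return sustained_deficit or contradiction_spike

assert should_rollback([0.50] * 20, [0.60] * 20)                  # sustained U deficit
assert not should_rollback([0.60] * 20, [0.60] * 20)              # no deficit
assert should_rollback([], [], contr_green=0.4, contr_blue=0.2)   # contradiction spike
```

Requiring M consecutive deficits rather than a single bad interaction is what separates a genuine regression from evaluator noise, mirroring the sustained-deviation logic of the deployment trigger.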
Negative-weight failure DPO pairs mean each bad deployment makes the next candidate better — the system learns from failed deployments as well as from bad answers.
Traffic split across a typical successful promotion cycle:
Traffic %
100 BLUE ----------------+
+--+
+--+
+-----------+
5 GREEN-----------------+ |
+--+ |
+--+ |
0 +---------+-----> time
| Canary | Gradual shift | Promotion
| (5/95) | (utility-driven) | (100% green)
The team's role is not to make deployment decisions — it is to set field parameters (delta, T, tau) once and monitor autonomous operation. The utility function makes every deployment and rollback decision. The audit trail logs every trigger event, traffic split, and rollback with the U scores that drove each decision.
Under normal operation, no engineer needs to approve or execute a deployment. Engineers are involved when rollback triggers fire — the exception path, not the normal path.
High-stakes domains deploy conservatively by construction. A surgical submodel requires 246 canary interactions before any traffic shift. A creative writing submodel can promote in 4.
Every update has a logged trigger, a logged promotion trajectory, and a logged rollback path. More auditable than any monolithic release cycle — the U scores that drove each decision are retrievable.
Rolled-back candidates contribute negative-weight DPO pairs to the replay buffer. The next training run is explicitly conditioned against the failure pattern that caused the rollback.
The architectural payoff. In the monolithic control plane (§§1–12), the utility function governs inference behavior. In the Micro-Expert Architecture with utility-driven deployment, the utility function governs the entire system lifecycle — from the first correction injection through to the decision to retire a model version. One governing signal, operating at every timescale, from milliseconds to months.
Full specification — including delta derivations, T power analysis, tau field calibration, and the mathematical relationship to VCG arbitration — is in §10.7 of the full whitepaper.
The architecture described in this document was validated end-to-end on a single NVIDIA RTX 4090 (24 GB VRAM) via RunPod cloud. Three Qwen2.5 AWQ specialists — SWE 7B, Math 7B, and a 3B Arbiter — ran concurrently within a single GPU at 90.4% VRAM utilization (22,206 / 24,564 MiB), demonstrating that the three-specialist graph described in Section 13 is achievable on consumer hardware today.
Regime 2 fingerprint — calibration failure, not accuracy collapse. The mismatched-routing arm did not fail through low accuracy alone — its Brier score (0.279) was nearly identical to the no-routing baseline (0.280), but mean confidence was 0.750 versus ~0.60 for correctly-routed arms. The wrong model answered confidently and could not be caught by the C_min abstention gate. This is the Regime 2 failure mode confirmed on real hardware.
The full blue-green pipeline from Section 14 ran end-to-end: LoRA training (AWQ dequantize → fp16, ~12 min on RTX 4090), canary deployment (5% traffic), gradual shift (softmax routing), and promotion gate evaluation. The GREEN adapter was correctly rejected — U_delta +0.0011 fell below the 0.025 threshold. Adversarial evaluation showed the adapter overfit to training contradictions rather than generalizing resistance, producing U_delta −0.097 on the adversarial subset.
This demonstrates the promotion gate working as a safety mechanism: a degraded model adapter does not reach production. The rollback path described in Section 14.3 would have triggered automatically on this result.
A five-query cross-domain battery was run with SWE and Math specialists called in parallel on each query. The contradiction detector flagged a logical error in the SWE specialist's gradient descent implementation (severity 0.9): the assert statement compared two slices of the same predictions array at iteration 100 instead of comparing loss before and after training. The code failed its own stated test case — a genuine internal contradiction caught by the logical check. Evidence chain committed to the repository: agent/results/arbitration_evidence.json.
Full results are in Appendix A.1 of the full whitepaper.