Systems Architecture · v1.0

Productionizing the Adaptive Utility Agent

AUA v1.0 is the production control plane. This document covers how it works end-to-end: the request lifecycle, plugin and hook system, all four deployment profiles, state store backends (SQLite → Postgres), security wiring, observability pipeline, failure modes, and scaling discipline.

Audience: Staff / Senior SWE · AI infrastructure teams · Recruiters & hiring managers · LLM systems builders

1. Executive summary

v1.0 Shipped — May 2026

The framework described in this document is fully implemented in AUA v1.0.0:

✓ 132 tests passing
✓ 10 CLI command groups (40+ subcommands)
✓ 20 REST API endpoints
✓ 8 Plugin Protocol interfaces
✓ Bearer token auth (14 scopes)
✓ mTLS + AES-256-GCM at rest
✓ Prometheus (16 metrics) + Grafana (20 panels)
✓ OpenTelemetry instrumentation
✓ SQLite (dev) and Postgres (prod) state store
✓ Eval harness + 6 smoke datasets
✓ Chat UI (Next.js 14) + Session API
✓ Docker + 4 hardware tier templates

pip install adaptive-utility-agent && aua init --preset coding --tier macbook && aua serve

Modern LLM systems are fundamentally stateless at runtime. A wrong answer produced today is often produced again tomorrow, and the system typically has no durable mechanism for learning from that failure until the next model release. That is the operational gap this architecture is designed to close.

The Adaptive Utility Agent can be productionized as a control plane around LLM inference. Instead of treating the model as the entire system, the model becomes one component inside a larger closed-loop architecture that evaluates outputs, corrects failures, stores useful lessons, and biases future behavior accordingly.

Primary system role   Control Plane          Sits around model inference rather than inside the model
Adaptation path       Runtime                Improves behavior without waiting for retraining
Core loop             Evaluate → Correct     Adds feedback to otherwise stateless inference
Main cost             Latency + complexity   Better reliability is not free

The design goal is not to replace the model. The design goal is to surround the model with enough structure that repeated failures become detectable, correctable, and less likely to recur.

2. System overview

A standard LLM application pipeline is usually little more than prompt construction followed by inference and response return. That simplicity is attractive, but it also means there is no persistent operational memory and no meaningful notion of control.

Prompt → LLM → Output

The production version of the Adaptive Utility Agent inserts multiple stages around that core inference call:

Prompt → LLM → Evaluate → Correct → Learn → Output

Evaluation layer

Scores generated outputs against correctness, consistency, policy, and domain-specific constraints.

Correction layer

Performs targeted retries, rewriting, constraint injection, or heavier secondary reasoning passes.

Memory layer

Stores failure cases, successful corrections, and reusable patterns that can influence future requests.

Utility layer

Balances correctness gains against latency, cost, and confidence to decide whether the loop should continue.

3. High-level architecture

                    ┌────────────────────┐
                    │   Client Request   │
                    └─────────┬──────────┘
                              │
                     ┌────────▼────────┐
                     │  Orchestrator   │
                     │  (API Layer)    │
                     └────────┬────────┘
                              │
                     ┌────────▼────────┐
                     │ LLM Inference   │
                     │  (Base Model)   │
                     └────────┬────────┘
                              │
                 ┌────────────▼────────────┐
                 │   Evaluation Engine     │
                 │ - Consistency checks    │
                 │ - Constraint validation │
                 │ - Failure detection     │
                 └────────────┬────────────┘
                              │
                 ┌────────────▼────────────┐
                 │   Correction Engine     │
                 │ - Regeneration          │
                 │ - Constraint injection  │
                 │ - Output rewriting      │
                 └────────────┬────────────┘
                              │
                 ┌────────────▼────────────┐
                 │  Memory / Learning Store│
                 │ - Failure logs          │
                 │ - Correction patterns   │
                 │ - Context embeddings    │
                 └────────────┬────────────┘
                              │
                     ┌────────▼────────┐
                     │  Final Output   │
                     └─────────────────┘

The key architectural move is separating inference from control. The model generates candidates; the surrounding system decides what should be trusted, corrected, persisted, or returned.

4. Core components

4.1 Orchestrator (control plane API)

The orchestrator owns the full request lifecycle. It is responsible for request admission, trace propagation, latency budget enforcement, retry boundaries, and routing decisions between lightweight and heavyweight correction paths.

4.2 LLM inference layer

The base model remains stateless and unchanged. This is an explicit design constraint. The point of the architecture is to improve behavior without requiring fine-tuning every time the system discovers a recurring class of mistakes.

Practical implication: the same control plane can sit in front of different models and be upgraded independently of the underlying model provider.

4.3 Evaluation engine

The evaluation engine is where the system turns “this answer feels wrong” into machine-actionable signals. The exact checks are domain dependent, but the contract is stable: inspect the candidate output and emit structured failure metadata.

{
  "is_valid": false,
  "failure_type": "logical_inconsistency",
  "confidence": 0.82,
  "details": {
    "rule_id": "consistency.cross_step.v1",
    "severity": 0.74
  }
}
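
A minimal sketch of that contract in Python. The class and method names here are illustrative, not the shipped plugin interfaces; the point is the shape of the data an evaluator emits.

from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class EvaluationResult:
    # Mirrors the JSON contract above: a verdict plus structured failure metadata.
    is_valid: bool
    failure_type: str | None = None
    confidence: float = 1.0
    details: dict[str, Any] = field(default_factory=dict)


class Evaluator(Protocol):
    """Anything that inspects a candidate output and emits structured failure metadata."""

    def evaluate(self, prompt: str, candidate: str) -> EvaluationResult: ...


class MaxLengthCheck:
    """Illustrative evaluator: flags outputs that violate a hard length constraint."""

    def __init__(self, max_chars: int = 8000):
        self.max_chars = max_chars

    def evaluate(self, prompt: str, candidate: str) -> EvaluationResult:
        if len(candidate) <= self.max_chars:
            return EvaluationResult(is_valid=True)
        return EvaluationResult(
            is_valid=False,
            failure_type="constraint_violation",
            confidence=1.0,
            details={"rule_id": "length.max_chars.v1", "severity": 0.3},
        )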

Examples of evaluation strategies include constraint and schema validation, cross-step and cross-turn consistency checks, policy checks, and domain-specific validators such as executable tests.

4.4 Correction engine

The correction engine turns evaluation output into intervention. In production, this should be tiered rather than monolithic. Cheap fixes should run first; expensive fixes should be reserved for requests whose expected utility gain is high enough to justify the added cost.

Constraint injection

Augment the next pass with specific rules or forbidden patterns derived from the evaluator.

Targeted regeneration

Retry only the failing span or step rather than regenerating the full answer from scratch.

Rule-based rewriting

Use deterministic transforms when the failure is mechanical and well understood.

Escalation path

Invoke a stronger model or heavier reasoning path only when lower-cost options fail.
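
A sketch of the tiering logic. The strategy interface (a cost_tier attribute and an apply method) and the retry budget are illustrative assumptions rather than the shipped policy engine; the idea is simply that corrections run cheapest-first and stop as soon as the evaluator passes or the budget is exhausted.

def run_corrections(prompt, candidate, evaluator, strategies, max_passes=3):
    """Apply correction strategies in cost order; stop on success or budget exhaustion."""
    result = evaluator.evaluate(prompt, candidate)
    passes = 0
    for strategy in sorted(strategies, key=lambda s: s.cost_tier):
        if result.is_valid or passes >= max_passes:
            break
        candidate = strategy.apply(prompt, candidate, result)   # e.g. constraint injection
        result = evaluator.evaluate(prompt, candidate)
        passes += 1
    return candidate, result, passes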

4.5 Memory / learning layer

This is the system’s adaptive substrate. The memory layer stores failure traces, successful corrections, and high-value context embeddings that can be retrieved to steer future generations away from known failure modes.

{
  "failure_type": "numerical_error",
  "input_context": "...",
  "incorrect_output": "...",
  "corrected_output": "...",
  "embedding": [...],
  "tenant_id": "acme-prod",
  "created_at": "2026-04-13T12:00:00Z"
}
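
A sketch of the read path for that record. The store.search call and the trust_score field are stand-ins for whatever vector store and metadata schema is actually deployed; the trust filter reflects the memory-poisoning mitigation discussed in section 10.3.

def retrieve_guidance(store, query_embedding, tenant_id, k=3, min_trust=0.7):
    """Fetch prior failure/correction records to steer the next generation."""
    hits = store.search(
        embedding=query_embedding,
        filter={"tenant_id": tenant_id},   # tenant-scoped memory (section 12)
        top_k=k,
    )
    # Drop low-trust records so poisoned memory is not retrieved as guidance.
    return [h for h in hits if h.metadata.get("trust_score", 0.0) >= min_trust]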

5. Adaptive control loop

The production system should be understood as a closed loop, not a stateless request-response chain:

  1. Generate a candidate response
  2. Evaluate it against domain and system constraints
  3. If it fails, select the least-cost corrective action
  4. Persist the failure and successful correction signal
  5. Use stored context to bias future generations
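
A minimal sketch of one turn through this loop. The component names are illustrative wiring, not the shipped interfaces; the numbered comments map to the steps above.

def adaptive_turn(request, model, evaluator, corrector, memory, utility, max_passes=3):
    context = memory.retrieve(request)                        # 5. bias with stored lessons
    candidate = model.generate(request, context)              # 1. generate a candidate
    result = evaluator.evaluate(request, candidate)           # 2. evaluate it
    passes = 0
    while not result.is_valid and passes < max_passes:
        if not utility.worth_another_pass(request, result):   # stop condition (section 6)
            break
        candidate = corrector.apply(request, candidate, result)   # 3. least-cost correction
        result = evaluator.evaluate(request, candidate)
        passes += 1
    memory.record(request, candidate, result)                 # 4. persist the outcome
    return candidate, result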

This loop gives the system the three properties most LLM deployments lack today: failures become detectable at runtime, they can be corrected before reaching the user, and the lessons persist so the same mistakes become less likely to recur.

Important boundary: this is not “learning” in the same sense as gradient updates during model training. It is operational adaptation through retrieval, constraints, routing, and memory-informed control.

6. Utility function design

The utility function is the policy surface that decides whether additional correction is worth the cost. Without this layer, systems tend to either stop too early and return bad answers, or over-correct themselves into latency and cost collapse.

U = w1 * correctness + w2 * consistency - w3 * latency - w4 * cost + w5 * confidence

This is a production-oriented adaptation — weights are configurable per tenant and task class, and latency/cost appear explicitly as production objectives. The canonical derivation, formal axioms, and field-specific weight values are in §4 of the full whitepaper.

In production, weights should be configurable by tenant, task class, and domain. A legal assistant, a coding agent, and a creative writing system should not all optimize the same objective.

Signal        Why it matters                        Typical source
Correctness   Primary reliability objective         Evaluators, tests, validators
Consistency   Prevents internal contradictions      Cross-step or cross-turn checks
Latency       Bounds user-facing delay              Request timers
Cost          Prevents runaway correction loops     Token and model spend accounting
Confidence    Supports abstention and routing       Scorers, historical outcomes
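
A sketch of how those weights might be applied in code. The numeric weights below are placeholders for illustration; the canonical field-specific values come from the whitepaper, and in production they would be loaded per tenant and task class rather than hard-coded.

from dataclasses import dataclass


@dataclass
class UtilityWeights:
    correctness: float = 1.0
    consistency: float = 0.5
    latency: float = 0.2      # penalty weight
    cost: float = 0.2         # penalty weight
    confidence: float = 0.3


def utility(signals: dict, w: UtilityWeights) -> float:
    """U = w1*correctness + w2*consistency - w3*latency - w4*cost + w5*confidence."""
    return (
        w.correctness * signals["correctness"]
        + w.consistency * signals["consistency"]
        - w.latency * signals["latency"]
        - w.cost * signals["cost"]
        + w.confidence * signals["confidence"]
    )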

7. Data & state management

The easiest way to build a fragile version of this system is to throw every artifact into a single blob store and hope retrieval works. A better production design separates fast-path, structured, and semantic state.

Postgres

Structured logs, evaluation records, correction metadata, per-tenant policies, and audit tables.

Vector store

Semantic retrieval over prior failures, corrections, and domain context for memory-augmented prompting.

Redis

Hot cache for recent session context, active traces, routing hints, and short-lived correction state.

Object storage

Longer-term artifact retention for raw traces, evaluator evidence, and offline analysis datasets.

Useful persisted objects include failure traces, evaluation records, correction metadata and successful correction patterns, per-tenant policy versions, and the context embeddings used for memory-augmented retrieval.
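
A sketch of how a single request's artifacts might fan out across those stores. All four clients are stand-ins, and the table names, key format, and TTL are illustrative rather than the shipped schema.

def persist_interaction(pg, redis, vectors, blobs, trace):
    pg.insert("evaluation_records", trace.evaluation)           # structured, queryable
    pg.insert("correction_metadata", trace.corrections)         # audit + per-tenant policy joins
    redis.setex(f"session:{trace.session_id}", 900, trace.hot_context)  # fast path, 15-minute TTL
    if trace.correction_accepted:
        vectors.add(trace.embedding_id, trace.embedding, trace.summary)  # semantic retrieval
    blobs.put(f"traces/{trace.request_id}.json", trace.raw)     # long-term artifact retention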

8. Observability & metrics

A system like this can easily become opaque unless observability is a first-class requirement. Every correction step should be traceable, attributable, and measurable.

Failure detection rate    Detected / total       How often evaluators catch bad outputs
Correction success rate   Improved / corrected   Whether intervention actually helped
Repeated error rate       Down over time         The metric most aligned with the framework thesis
Latency overhead          p50 / p95              How expensive the control loop is in practice

Recommended trace fields include a request ID, tenant ID, evaluator versions and verdicts, the correction strategies attempted, pass count, per-stage latency, and the final utility score.
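
A sketch of the metrics side using prometheus_client. The metric names and label sets below are illustrative; the shipped exporter defines its own set of 16 metrics.

from prometheus_client import Counter, Histogram

FAILURES_DETECTED = Counter(
    "aua_failures_detected_total",
    "Outputs flagged by evaluators",
    ["tenant", "failure_type"],
)
CORRECTIONS = Counter(
    "aua_corrections_total",
    "Correction attempts by outcome",
    ["tenant", "strategy", "outcome"],
)
LOOP_LATENCY = Histogram(
    "aua_control_loop_seconds",
    "Latency added by the evaluate/correct loop",
    ["tenant"],
)


def record_pass(tenant, failure_type, strategy, improved, seconds):
    FAILURES_DETECTED.labels(tenant, failure_type).inc()
    CORRECTIONS.labels(tenant, strategy, "improved" if improved else "unchanged").inc()
    LOOP_LATENCY.labels(tenant).observe(seconds)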

9. Tradeoffs

Dimension     Benefit                                          Cost / risk
Reliability   Higher correctness and fewer repeated mistakes   More moving parts and more operational complexity
Latency       Multi-pass correction can rescue bad responses   Each pass adds user-visible delay
Cost          Heavier reasoning available when needed          Runaway inference spend if not bounded
Memory        System improves behavior across sessions         Stale, low-quality, or poisoned memory can degrade outputs

10. Failure modes & mitigations

10.1 False positives in evaluation

If evaluators incorrectly flag good outputs as failures, the correction loop will waste cost and may degrade answer quality. This argues for confidence thresholds, evaluator versioning, and offline calibration against labeled datasets.

10.2 Over-correction

Some outputs become worse after intervention, especially when the system over-indexes on one constraint and destroys fluency or usefulness. Mitigation: tiered correction, candidate comparison, and hard retry caps.

10.3 Memory poisoning

If bad corrections enter the memory store and are later retrieved as guidance, the system can institutionalize the wrong behavior. Mitigation: write-path validation, trust scores, expiry policies, and periodic pruning.

10.4 Cascading loops

If every correction step produces new evaluator findings, the system can spiral into latency or cost collapse. Mitigation: utility-based stop conditions, maximum pass counts, and explicit fallback modes.

Staff-level reality: the hard part is not inventing the loop. The hard part is keeping the loop from becoming operationally pathological under real traffic, noisy evaluators, and cost constraints.

11. Implementation approach

A pragmatic implementation can be built with standard backend infrastructure rather than exotic custom systems.

Client
  │
  ├── FastAPI / Python control plane
  │     ├── Request orchestration
  │     ├── Evaluator dispatch
  │     ├── Correction policy engine
  │     └── Metrics / tracing hooks
  │
  ├── LLM provider APIs
  │     ├── Primary model
  │     └── Escalation model
  │
  ├── Postgres
  ├── Redis
  ├── Vector DB
  └── Async queue (optional)
        ├── Celery
        └── Kafka for higher-scale decoupling

This design is intentionally modular. Each evaluator can be implemented as an independent service or library. Each correction strategy can be swapped out or A/B tested. Each memory retrieval policy can evolve without changing the base model.
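
As a sketch of the outermost surface, a minimal FastAPI endpoint is shown below. This is not the shipped 20-endpoint API; the request schema, the route path, and the run_control_loop stub are illustrative placeholders for the wiring described above.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    tenant_id: str
    prompt: str
    task_class: str = "default"


async def run_control_loop(req: GenerateRequest) -> dict:
    # Placeholder: evaluate -> correct -> learn as in section 5, wired to the
    # model client, evaluators, correction policy, and memory layer at startup.
    return {"output": req.prompt, "passes": 0, "utility": 0.0}


@app.post("/v1/generate")
async def generate(req: GenerateRequest) -> dict:
    return await run_control_loop(req)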

12. Scaling considerations

The architecture is straightforward at small scale and substantially more interesting at real production volume. The main scaling questions are not just throughput, but which requests deserve expensive correction and how to isolate tenants, memory, and policy state safely.

Multi-tenant isolation

Keep memory, policies, and evaluation traces scoped by tenant to avoid cross-customer contamination.

Tiered correction

Run cheap evaluators and fixes first; reserve heavy passes for high-value or high-risk requests.

Batching opportunities

Group evaluator calls and asynchronous analysis tasks where latency budgets allow it.

Streaming support

For interactive products, decide whether corrections happen before first token, during streaming, or after a draft is emitted.

At larger scale, a mature system would likely introduce dedicated evaluator pools, asynchronous correction and analysis queues, sharded per-tenant memory, and request prioritization so that expensive correction passes are reserved for the traffic that justifies them.

13. Micro-Expert Architecture

The monolithic control plane described above is the right starting point — deployable today with standard infrastructure. But it has a structural ceiling that becomes a real problem at scale: every calibration update touches the entire model's weight space. Fixing a recurring error in software engineering advice can subtly degrade medical reasoning through weight interference. Rollback means rolling back everything.

The Micro-Expert Architecture is the production evolution of this design — the same utility-governed control plane, decomposed into a graph of independently deployable domain submodels. Architecturally it is microservices applied to model inference: each specialist has its own weights, its own calibration cycle, and its own utility tracker. Updating surgery weights cannot affect software engineering weights. There are no shared parameters to interfere.

Router (Raft HA cluster — 150–300ms failover)
    |
    |-- field classifier + query decomposer
    |-- probabilistic fan-out to 1-3 specialists per query
    +-- response merger + conflict detection
         |
         |-- Surgery submodel      -- own weights, own calibration
         |-- Software Eng submodel -- own weights, own calibration
         |-- Law submodel          -- own weights, own calibration
         +-- ...N domain submodels
              |
         Arbiter Agent
              |
         |-- 4-check contradiction resolution
         |   (logical, mathematical, cross-session, empirical)
         |-- VCG arbitration -- dominant strategy truthfulness
         +-- verified corrections --> DPO signal per submodel

13.1 Why this resolves the monolithic ceiling

No catastrophic forgetting

Updating one submodel only modifies its own weights. Rolling back a bad surgery update is swapping one model file — not re-running the full training pipeline. Blast radius of any update is bounded to one domain.

Independent deployability

A surgery model update in progress does not block a software engineering update. Each submodel runs its own blue-green cycle independently. Updates are decoupled at the infrastructure level.

Cost proportional to query

Only the relevant 1–3 submodels activate per query. A graph of 20 domain submodels costs roughly the same per query as a single-domain model — inference cost scales with query complexity, not graph size.

Graceful degradation

If a submodel is mid-update, the router falls back to a parent or sibling domain model. The router maintains a fallback graph — no single point of failure at the specialist level.

13.2 Cross-domain query handling

Multi-domain queries — "write Python code to analyze patient drug dosage data" — require both software engineering and medicine submodels simultaneously. The router handles this through query decomposition and parallel fan-out:

1. Field classifier returns a distribution:
      {"software_engineering": 0.65, "medicine": 0.35}

2. Router decomposes the query:
      CS subquery:  "Write Python data analysis code"
      Med subquery: "What are clinical constraints on dosage data?"

3. Fan out: send subqueries to respective submodels in parallel

4. Merge: combine responses, apply weighted confidence from
          field distribution (medicine response gets 35% weight)

5. Arbitration: if CS and Medicine models contradict each other
          --> Arbiter Agent --> 4-check resolution --> DPO signal

The routing failure risk. Misrouted queries are not merely suboptimal — the routing experiment quantifies this: mismatched routing actively worsens correctness by 17.5% and dramatically degrades confidence calibration (Brier 0.292 vs 0.160). The system becomes confidently wrong. Mitigations: probabilistic fan-out, entropy-based fallback, VCG arbitration, and the confidence gate which catches most out-of-domain answers before they are returned. Full regime analysis in §10.4.1 of the whitepaper.

14. Utility-driven automated deployment

This is the most operationally consequential property of the architecture: the decision of when to deploy an update, how fast to roll it out, and when to roll back is not made by a human operator on a schedule. It is made by the utility function itself — continuously, field-calibrated, with no manual intervention required under normal conditions.

The same U score that governs correction injection and DPO training weight also governs the entire deployment lifecycle. There is no separate deployment system — deployment is a direct consequence of sustained utility deviation.

14.1 The trigger: utility deviation as the deployment signal

Each submodel monitors its own utility score over a sliding window. A deployment cycle triggers automatically when deviation from baseline is both significant and sustained:

Trigger when ALL of:
    |U_current - U_baseline| > delta(field)   [significant deviation]
    deviation sustained for >= T interactions  [not noise]
    held-out benchmark available               [can evaluate candidate]

Both delta and T are derived directly from the field penalty multiplier that already governs DPO training — the same sensitivity that makes a surgical contradiction weighted 10x harder at training time also makes the surgical submodel 50x harder to trigger for deployment than a creative writing submodel:

delta(field) = 0.05 / penalty_multiplier(field)

Surgery:          delta = 0.005   penalty = 10x  — very sensitive
Software Eng:     delta = 0.025   penalty =  2x  — moderate
Creative Writing: delta = 0.050   penalty =  1x  — relaxed

T(field) = (z_alpha * sigma / delta)^2    [power analysis, alpha=0.05, sigma~0.04]

Surgery:          T >= 246 interactions   — high confidence required
Software Eng:     T >=  10 interactions   — responsive
Creative Writing: T >=   2 interactions   — fast
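
The same derivation in code. The z and sigma values follow the figures quoted above; the ceiling rounding is an assumption, and the creative writing row above rounds to 2 rather than 3.

import math

Z_ALPHA = 1.96    # two-sided alpha = 0.05
SIGMA = 0.04      # observed utility standard deviation, as quoted above


def deployment_thresholds(penalty_multiplier: float) -> tuple[float, int]:
    delta = 0.05 / penalty_multiplier
    T = math.ceil((Z_ALPHA * SIGMA / delta) ** 2)
    return delta, T

# deployment_thresholds(10) -> (0.005, 246)   surgery
# deployment_thresholds(2)  -> (0.025, 10)    software engineering
# deployment_thresholds(1)  -> (0.05, 3)      creative writing (table above shows 2)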

Why field-calibration matters here. A wrong surgical update that serves 246 interactions before rollback causes real harm. A wrong creative writing update that serves 2 interactions before rollback is nearly harmless. The penalty multiplier that controls training severity also controls deployment sensitivity — the two are in the correct relationship by construction, not by tuning a separate parameter.

14.2 The five-phase lifecycle

Once triggered, deployment proceeds through five phases — each governed by the utility score, with no manual gates required:

Phase 0 — Detection (ongoing):
    U_monitor watches sliding window
    If |U_current - U_baseline| > delta for T consecutive interactions:
        trigger Phase 1

Phase 1 — Training (offline):
    DPO calibration on accumulated (preferred, rejected) pairs
    Weighted by field_penalty_multiplier + replay buffer
    Produces candidate GREEN model
    BLUE model continues serving 100% of traffic — no disruption

Phase 2 — Canary (5% green / 95% blue):
    Router sends 5% of traffic to GREEN, 95% to BLUE
    Minimum N_min = T(field) x 2 interactions before any shift
    Both models log U scores per interaction

Phase 3 — Gradual shift (utility-weighted routing):
    Every evaluation window, traffic split recomputed:

        traffic_green = exp(U_green/tau) / (exp(U_green/tau) + exp(U_blue/tau))

        tau(field): small tau = fast promotion, large tau = conservative
            Surgery:      tau = 0.50   promotes only when clearly better
            Software Eng: tau = 0.20   moderate
            Creative:     tau = 0.05   fast, promotes on small differences

        Enforced: clip(traffic_green, 0.05, 0.95)

    As U_green > U_blue, traffic shifts automatically.
    No manual intervention required.

Phase 4 — Promotion (traffic_green >= 1 - delta(field)):
    traffic_green --> 1.0
    BLUE enters cooldown — not retired, instant rollback still available
    Cooldown = T(field) interactions

Phase 5 — Retirement:
    If GREEN holds through cooldown without regression:
        BLUE retired, weights freed
        U_baseline = U_green   (new baseline for next cycle)
        delta recalibrated from observed variance

14.3 Automated rollback

Rollback is also fully automated — it does not require an on-call engineer to notice a problem:

Rollback triggers (any phase):
    U_green < U_blue - epsilon  for M consecutive interactions
    OR  contradiction_rate_green > contradiction_rate_blue x 1.5
    OR  benchmark regression > field_tolerance

Rollback action:
    traffic_blue --> 1.0 instantly
    GREEN flagged
    Failure DPO pairs added to replay buffer with negative weight
    Next candidate trained against this failure pattern explicitly

Negative-weight failure DPO pairs mean each bad deployment makes the next candidate better — the system learns from failed deployments as well as from bad answers.
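
A sketch of how the trigger evaluation might look in code. The epsilon and M values are illustrative placeholders; like delta and T, the real thresholds would be field-calibrated.

def rollback_triggered(window, field_tolerance, epsilon=0.01, m=5):
    """window: recent per-interaction records (newest last) for the current canary/shift phase."""
    def rate(key):
        return sum(r[key] for r in window) / max(len(window), 1)

    recent = window[-m:]
    sustained_worse = len(recent) == m and all(
        r["u_green"] < r["u_blue"] - epsilon for r in recent
    )
    contradiction_spike = rate("contradiction_green") > rate("contradiction_blue") * 1.5
    benchmark_regression = window[-1]["benchmark_delta"] < -field_tolerance
    return sustained_worse or contradiction_spike or benchmark_regression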

14.4 What this looks like operationally

Traffic split across a typical successful promotion cycle:

Traffic %
  100 | BLUE ─────────╲                      ╱───────── GREEN
      |                ╲                    ╱
      |                 ╲                  ╱
      |                  ╲                ╱
    5 | GREEN ────────────╲──────────────╱
    0 |────────────────────╲────────────╱────────────────────▶ time
         |   Canary     |     Gradual shift      |    Promotion
         |   (5/95)     |   (utility-driven)     |   (100% green)

The team's role is not to make deployment decisions — it is to set field parameters (delta, T, tau) once and monitor autonomous operation. The utility function makes every deployment and rollback decision. The audit trail logs every trigger event, traffic split, and rollback with the U scores that drove each decision.

No deployment toil

Under normal operation, no engineer needs to approve or execute a deployment. Engineers are involved when rollback triggers fire — the exception path, not the normal path.

Field-calibrated risk

High-stakes domains deploy conservatively by construction. A surgical submodel requires 246 canary interactions before any traffic shift. A creative writing submodel can promote in 4.

Full auditability

Every update has a logged trigger, a logged promotion trajectory, and a logged rollback path. More auditable than any monolithic release cycle — the U scores that drove each decision are retrievable.

Failure improves next attempt

Rolled-back candidates contribute negative-weight DPO pairs to the replay buffer. The next training run is explicitly conditioned against the failure pattern that caused the rollback.

The architectural payoff. In the monolithic control plane (§§1–12), the utility function governs inference behavior. In the Micro-Expert Architecture with utility-driven deployment, the utility function governs the entire system lifecycle — from the first correction injection through to the decision to retire a model version. One governing signal, operating at every timescale, from milliseconds to months.

Full specification — including delta derivations, T power analysis, tau field calibration, and the mathematical relationship to VCG arbitration — is in §10.7 of the full whitepaper.

15. Physical Validation — RunPod RTX 4090 (May 2026)

The architecture described in this document was validated end-to-end on a single NVIDIA RTX 4090 (24 GB VRAM) via RunPod cloud. Three Qwen2.5 AWQ specialists — SWE 7B, Math 7B, and a 3B Arbiter — ran concurrently within a single GPU at 90.4% VRAM utilization (22,206 / 24,564 MiB), demonstrating that the three-specialist graph described in Section 13 is achievable on consumer hardware today.

VCG routing gain
+43.3pp
vs no-routing baseline (p=0.0003, d=1.02, n=30 per arm)
VRAM utilization
90.4%
3 concurrent AWQ specialists on single RTX 4090
DPO pairs auto-accumulated
502
No human annotation. 251 SWE + 251 Math, mean weight 2.50
Total experiment cost
$22
1.63 GPU hours at $0.69/hr (RTX 4090)

Regime 2 fingerprint — calibration failure, not accuracy collapse. The mismatched-routing arm did not fail through low accuracy alone — its Brier score (0.279) was nearly identical to the no-routing baseline (0.280), but mean confidence was 0.750 versus ~0.60 for correctly-routed arms. The wrong model answered confidently and could not be caught by the C_min abstention gate. This is the Regime 2 failure mode confirmed on real hardware.

15.1 Blue-Green Pipeline — What Was Measured

The full blue-green pipeline from Section 14 ran end-to-end: LoRA training (AWQ dequantize → fp16, ~12 min on RTX 4090), canary deployment (5% traffic), gradual shift (softmax routing), and promotion gate evaluation. The GREEN adapter was correctly rejected — U_delta +0.0011 fell below the 0.025 threshold. Adversarial evaluation showed the adapter overfit to training contradictions rather than generalizing resistance, producing U_delta −0.097 on the adversarial subset.

This demonstrates the promotion gate working as a safety mechanism: a degraded model adapter does not reach production. The rollback path described in Section 14.3 would have triggered automatically on this result.

15.2 Cross-Domain Arbitration — What Was Measured

A five-query cross-domain battery was run with SWE and Math specialists called in parallel on each query. The contradiction detector flagged a logical error in the SWE specialist's gradient descent implementation (severity 0.9): the assert statement compared two slices of the same predictions array at iteration 100 instead of comparing loss before and after training. The code failed its own stated test case — a genuine internal contradiction caught by the logical check. Evidence chain committed to the repository: agent/results/arbitration_evidence.json.

Full results are in Appendix A.1 of the full whitepaper.