How the Adaptive Utility Agent framework turns heterogeneous GPU fleets into tiered, self-routing inference infrastructure — reducing cost per useful query and unlocking revenue from stranded lower-tier hardware.
AI datacenters and GPU cloud operators today face a structural tension that manifests in fleet economics. Frontier models — the ones that deliver the broadest capability — require the highest tier of hardware to run at commercial throughput. Mid-tier and older GPU inventory (A40s, A100 40GB, consumer-adjacent parts) cannot run those models at useful latency, so they default to smaller commodity workloads that underutilize their memory and compute.
This creates three compounding problems: frontier capacity is consumed by queries that do not need it, lower-tier inventory sits underutilized or effectively stranded, and cost per useful query rises across the fleet.
The framework addresses this by reframing the dispatch question. Instead of asking whether a smaller model can substitute for a frontier model on all tasks (to which the answer is clearly no), it asks: which specific query classes can be served at high quality at each hardware tier? Efficacy becomes a per-domain, per-hardware-class signal rather than a global model-level judgment, and routing follows that signal.
The framework's central architectural contribution for this domain is the Micro-Expert model: a graph of independently deployable domain submodels, each a specialist in its field, coordinated by a shared router and utility layer but isolated at the weight level. This is the model-graph analogue of what microservice architectures did for backend infrastructure.
The analogy is exact and intentional. A monolithic backend couples all functionality in shared code and shared state — a change to the billing service can break the notification service. The Micro-Expert model couples all domain knowledge in shared weights — a calibration run on the surgery domain can degrade CS knowledge through weight interference. The solution in both cases is the same: decompose into independently deployable units with well-defined interfaces.
                ┌─────────────────────┐
                │    Router / Hub     │
                │  (field classifier  │
                │   + query parser    │
                │  + context merger)  │
                └──────────┬──────────┘
                           │
     ┌─────────────────────┼─────────────────────┐
     │                     │                     │
┌────▼────┐           ┌────▼────┐           ┌────▼────┐
│Medicine │           │   CS    │           │   Law   │
│  model  │           │  model  │           │  model  │
└─────────┘           └────┬────┘           └─────────┘
                 ┌─────────┼─────────┐
              ┌──▼──┐   ┌──▼──┐   ┌──▼──┐
              │ ML  │   │Algo │   │Prog │
              └─────┘   └─────┘   └─────┘
Each node is a separately deployed model with its own weights, calibration cycle, and utility tracker. The interface between nodes is a structured protocol that the wrapper layer already produces — no new interface design required:
Request: { query, context, field, confidence_floor, session_id }
Response: { answer, confidence, assertions[], uncertainty_flags, U_score }
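The same shapes can be written down as plain data structures. The following is a minimal sketch in Python, assuming JSON transport between nodes; the field names mirror the Request/Response shapes above, while the types and the example domain string are illustrative assumptions, not the framework's actual implementation.

```python
# Sketch of the inter-node protocol shown above as dataclasses.
# Field names mirror the Request/Response shapes; types are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NodeRequest:
    query: str               # raw user query text
    context: str             # merged session context supplied by the hub
    field: str               # routed domain, e.g. "cs.algorithms" (example value)
    confidence_floor: float  # minimum confidence at which the node may answer
    session_id: str          # ties the exchange to an audit trail

@dataclass
class NodeResponse:
    answer: str
    confidence: float              # node's self-reported confidence in [0, 1]
    assertions: List[str]          # checkable claims extracted from the answer
    uncertainty_flags: List[str]   # flags the arbiter can act on
    U_score: Optional[float]       # utility score from the node's own tracker
```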
Why this matters for operators specifically: In a monolithic deployment, updating one domain risks regressing every other domain. Rolling back means reverting the entire model. In the Micro-Expert model, updating the coding specialist only modifies coding weights; rolling back requires swapping one model file. The blast radius of any update is bounded to one submodel. Each submodel has its own blue-green deployment cycle, and they are fully decoupled from each other. See §10 of the full whitepaper for the full architecture specification.
Catastrophic forgetting — actually solved. The standard continual learning approach (replay buffers + careful DPO weighting) slows forgetting but cannot eliminate it when weights are shared. The Micro-Expert model physically eliminates shared weights for domain-specific knowledge. Cross-domain knowledge lives exclusively in the router and a thin shared embedding layer. Domain-specific updates are contained. This is the reason the architecture is structured as a graph rather than as a larger monolith with mixture-of-experts routing — MoE routing is a compute optimization; the Micro-Expert model is an update isolation mechanism.
The granularity of the model graph is not fixed; it is relative to the hardware it runs on. This is a deliberate design property, not a constraint. The core principle that drives it: every submodel must fit comfortably on a single GPU of the tier it is meant to serve on.
Decomposition depth therefore scales directly with GPU memory. Each hardware tier naturally determines the appropriate graph shape:
| Hardware | VRAM | TDP | Approx cost (2025) | Submodel fit | Graph shape |
|---|---|---|---|---|---|
| H100 SXM5 | 80 GB | 700 W | ~$30,000–35,000 | ~70B params per GPU | Shallow graph, few large nodes |
| A100 80GB | 80 GB | 400 W | ~$10,000–15,000 | ~70B params per GPU | Shallow-to-medium depth |
| A100 40GB / A40 | 40 GB | 300 W | ~$4,000–7,000 | ~35B params per GPU | Medium depth |
| RTX 4090 | 24 GB | 450 W | ~$1,600–2,000 | ~20B params per GPU | Deeper graph, fine-grained specialists |
| RTX 4080-class | 16 GB | 320 W | ~$700–900 | ~14B params per GPU | Deep graph |
| Edge / older (8–16 GB) | 8–16 GB | 75–150 W | <$500 | ~7B params per GPU | Very deep, fine-grained specialist nodes |
Hardware specs and pricing derived from published Lambda Labs, CoreWeave, RunPod, and Vast.ai data (2025). See §10.9.1 in the full whitepaper for the derivation.
Graph branching heuristics. The graph branches recursively until one of two stopping conditions is met. The hardware bound: stop when a single submodel fits comfortably on one GPU. The statistical bound: stop branching when the within-domain contradiction rate drops below a threshold — typically <2% — meaning the submodel is internally consistent enough that further specialization adds noise rather than signal. In practice: branch until the hardware bound is reached for active domains, and use the statistical bound to decide which domains warrant a branch at all. This prevents over-decomposition of low-volume or already-accurate domains.
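A sketch of the two stopping conditions as a single branching check. The 2% contradiction threshold and the "fit on one GPU" bound come from the heuristic above; the VRAM-to-parameter mapping (~0.875B parameters per GB, roughly int8-class weights plus headroom, consistent with the fit column in the table), the `DomainStats` shape, and the `min_volume` cutoff are illustrative assumptions.

```python
# Sketch of the graph-branching heuristic described above.
from dataclasses import dataclass

CONTRADICTION_THRESHOLD = 0.02   # statistical bound: <2% within-domain contradictions

@dataclass
class DomainStats:
    params_required_b: float    # billions of parameters the domain submodel needs
    contradiction_rate: float   # within-domain contradiction rate, 0..1
    query_volume: int           # recent query volume for this domain

def fits_on_gpu(params_b: float, vram_gb: float) -> bool:
    """Hardware bound: does the submodel fit comfortably on one GPU of this tier?"""
    return params_b <= 0.875 * vram_gb   # e.g. 80 GB -> ~70B, 24 GB -> ~20B

def should_branch(stats: DomainStats, vram_gb: float, min_volume: int = 1000) -> bool:
    """Branch a domain into finer specialists only while both bounds say to."""
    if stats.contradiction_rate < CONTRADICTION_THRESHOLD:
        return False   # already internally consistent; further splitting adds noise
    if fits_on_gpu(stats.params_required_b, vram_gb):
        return False   # hardware bound reached: one GPU serves it comfortably
    if stats.query_volume < min_volume:
        return False   # low-volume domain: not worth the extra nodes
    return True
```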
Practical implication for mixed fleets. Providers such as CoreWeave and other GPU clouds often hold inventory across hardware generations: H100s, A100s, A40s, L40S-class parts, and consumer-adjacent hardware. Without routing, lower-tier inventory risks being stranded. A specialist-graph architecture gives those GPUs a high-value role. H100s serve the general frontier fallback and genuinely cross-domain queries. A100s and A40s serve mid-tier specialists for high-volume, well-defined domains. Consumer-class hardware serves deep specialists on narrow, high-confidence query types. The fleet becomes heterogeneous by design rather than by accident.
The following is derived entirely from published hardware specifications and cloud provider pricing, with no original measurement. The key comparison is cost per 1,000 output tokens across hardware tiers, and the derivation follows directly from published specs:
H100 SXM5 (enterprise):
- VRAM: 80 GB
- Memory bandwidth: 3.35 TB/s
- Cloud cost: ~$2.50–3.50/hr (Lambda Labs, CoreWeave, 2025)
- Llama 3.1 70B: ~140 GB fp16 → requires 2× H100, or 1× H100 at 4-bit quant
- Inference throughput: ~2,000 tokens/sec (batch=1, 200-token output)

RTX 4090 (consumer):
- VRAM: 24 GB
- Memory bandwidth: 1.01 TB/s
- Cloud cost: ~$0.35–0.50/hr (RunPod, Vast.ai, 2025)
- Mistral/Llama 7B: ~14 GB fp16 → fits in 1× 4090 with headroom
- Inference throughput: ~800 tokens/sec (batch=1)

Cost per 1,000 output tokens:
- 70B on 2× H100: (2 × $3.00/hr) ÷ (2,000 tok/sec × 3,600) × 1,000 = $0.00083
- 7B on 1× 4090: ($0.40/hr) ÷ (800 tok/sec × 3,600) × 1,000 = $0.00014
- Single-specialist query: 7B/4090 is ~6× cheaper per token.
- 3 specialists activated (fan-out): 3 × $0.00014 = $0.00042, still ~2× cheaper than the frontier path.
- Cost inversion point: ~6 activated specialists simultaneously.
This is an analytical result from public data. Latency measurement on physical hardware is primary future work (Phase 8, §12 in the full whitepaper).
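The arithmetic above can be reproduced directly. A minimal sketch, using the same hourly rates and throughputs quoted in the derivation (all inputs are the published figures above, not new measurements):

```python
# Reproduces the cost-per-1,000-output-tokens arithmetic above.
def cost_per_1k_tokens(hourly_rate_usd: float, tokens_per_sec: float, num_gpus: int = 1) -> float:
    """Cost of generating 1,000 output tokens at a given hourly rate and throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return (num_gpus * hourly_rate_usd) / tokens_per_hour * 1000

frontier = cost_per_1k_tokens(3.00, 2000, num_gpus=2)   # 70B on 2x H100   -> ~$0.00083
specialist = cost_per_1k_tokens(0.40, 800)              # 7B on 1x RTX 4090 -> ~$0.00014

print(f"70B / 2x H100: ${frontier:.5f} per 1k tokens")
print(f"7B  / 1x 4090: ${specialist:.5f} per 1k tokens")
print(f"fan-out break-even: ~{frontier / specialist:.1f} activated specialists")
```

Running this reproduces the ~6× per-token gap and the ~6-specialist cost inversion point stated above.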
For infrastructure operators, the practically useful frame is not abstract model size but revenue per watt and fleet utilization. The question is: can a lower-tier GPU serve a specialist workload well enough that the operator earns more per useful query and per watt consumed?
Illustrative only. Values from the analytical cost model in §10.9.4 of the full whitepaper. The "stranded inventory" row represents H100-class hardware serving low-value undifferentiated workloads at low utilization — the failure mode the framework addresses.
The second lever is LoRA multi-tenancy (covered in detail below). A base specialist can remain resident while domain- or customer-specific adapters are rotated in, allowing several customers' specialist workloads to share one logical serving tier. Under export restrictions, supply constraints, or simply uneven access to H100-class hardware, this flexibility is strategic rather than cosmetic.
The claim that a specialist-graph architecture improves output quality — not just reduces cost — rests on an empirical question: how much does the routing and arbitration layer contribute to correctness, independent of model size? A controlled four-arm experiment was run using the production agent codebase to answer this directly. See §10.9.4 of the full whitepaper for the full experimental setup and citations.
All four arms used identical task plans (200 tasks, 25 problem types, 11 algorithm families). The quality model for each arm was parameterized from published domain benchmarks.
| Arm | Description | Correctness | Δ vs baseline | Brier score | p-value |
|---|---|---|---|---|---|
| A | No routing — single generic system prompt, no specialization | 59.0% | — | 0.1596 | — |
| B | Matched routing (oracle) — query always sent to correct domain specialist | 71.5% | +12.5% | 0.1062 | 0.0086 |
| C | Mismatched routing (Regime 2) — query sent to wrong domain specialist | 41.5% | −17.5% | 0.2919 | 0.0004 |
| D | VCG arbitration — probabilistic fan-out, VCG selects among specialists | 69.5% | +10.5% | 0.1097 | 0.0285 |
Quality model sources: DeepSeek Coder 7B vs GPT-3.5 on HumanEval; WizardMath 7B vs Llama 2 70B on MATH; Med-PaLM 2 vs GPT-4 on MedQA. Mismatch penalty from Raval et al. (2026). Routing accuracy from Xu et al. (2024). Full citations in §10.9.4.
The practical implication: the value of the routing layer depends entirely on routing correctly, which in the framework means routing on predicted confidence before the specialist attempts a query, not just on predicted domain classification. The confidence threshold is what prevents Regime 2.
In a monolithic deployment, a model update is a fleet event: the entire serving infrastructure is involved. In the Micro-Expert model, each submodel has its own independent blue-green deployment cycle, and those cycles are fully decoupled. A coding specialist being updated does not affect the medical specialist serving traffic simultaneously.
A submodel monitors its own utility score over a sliding window. A deployment cycle triggers when deviation from baseline is both significant and sustained:
Trigger when ALL of:
|U_current - U_baseline| > δ(field) [significant deviation]
deviation sustained for ≥ T interactions [not transient noise]
held-out benchmark available [can evaluate candidate]
δ(field) = base_δ / penalty_multiplier(field), with base_δ = 0.05

Software Engineering:  δ = 0.025   T ≥ 10 interactions (responsive)
General/data domains:  δ = 0.040   T ≥ 4 interactions (moderate)
Creative/low-stakes:   δ = 0.050   T ≥ 2 interactions (fast)
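A sketch of the trigger check under those per-field values. The base δ and the per-field δ/T pairs follow the spec above; the sliding-window bookkeeping, the interpretation of "sustained" as every one of the last T interactions exceeding δ, and the penalty multipliers are illustrative assumptions.

```python
# Sketch of the deployment-trigger check specified above.
BASE_DELTA = 0.05

FIELD_POLICY = {
    # field: (penalty_multiplier, min_sustained_interactions T)
    "software_engineering": (2.0, 10),   # delta = 0.025
    "general":              (1.25, 4),   # delta = 0.040
    "creative":             (1.0, 2),    # delta = 0.050
}

def delta_for(field: str) -> float:
    multiplier, _ = FIELD_POLICY[field]
    return BASE_DELTA / multiplier

def should_trigger_deployment(field: str,
                              u_window: list,
                              u_baseline: float,
                              benchmark_available: bool) -> bool:
    """Trigger a blue-green cycle only when all three conditions hold."""
    _, min_sustained = FIELD_POLICY[field]
    threshold = delta_for(field)
    recent = u_window[-min_sustained:]
    sustained = (len(recent) >= min_sustained and
                 all(abs(u - u_baseline) > threshold for u in recent))
    return sustained and benchmark_available
```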
Rollback is automatic and instant. Triggers on utility regression, contradiction rate increase (>1.5× BLUE's rate), or benchmark regression beyond field tolerance. Traffic reverts to BLUE immediately; GREEN failure cases are added to the replay buffer as negative-weight DPO pairs for the next candidate.
Every submodel update in this protocol has a logged trigger (which utility deviation, over which window), a logged promotion trajectory (traffic split over time), and a clear rollback path. For GPU cloud operators offering SLAs and for customers who need model behavior change audit trails, this is significantly more auditable than a monolithic model release, which has a single release event with aggregate changes that cannot be attributed to individual domain corrections. See §10.7–10.8 in the full whitepaper for the full specification.
Each submodel in the graph is a base specialist that can carry one or more LoRA adapters. This enables a serving architecture that reduces cold-start overhead and improves utilization across customers who share a domain but differ in application context.
Illustrative. Actual utilization depends on inter-customer request timing and adapter swap latency. The key point is that shared base specialist + customer-specific adapters reduces the per-customer GPU allocation required to maintain quality SLAs, improving overall fleet utilization.
Under hardware supply constraints or export restrictions on frontier-class GPUs, this multi-tenancy model is particularly important. An operator who cannot expand H100 inventory can still grow customer count on specialist tiers by improving utilization through adapter sharing. The constraint is that shared customers' request patterns must not perfectly overlap (if all customers send requests simultaneously, no utilization gain is possible); in practice, enterprise workload patterns are staggered enough for meaningful sharing.
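A sketch of the adapter-rotation idea: one resident base specialist with a small LRU pool of customer adapters. All class and method names here are hypothetical; a real deployment would use the adapter-loading API of its serving stack (for example a PEFT- or vLLM-based server).

```python
# Sketch of LoRA multi-tenancy: one resident base specialist per GPU, with
# customer-specific adapters rotated in on demand. Names are illustrative.
from collections import OrderedDict

class AdapterPool:
    def __init__(self, base_model, max_resident: int = 4):
        self.base_model = base_model       # shared domain specialist, stays loaded
        self.max_resident = max_resident   # adapters kept attached at any one time
        self._resident = OrderedDict()     # customer_id -> adapter handle, LRU order

    def get(self, customer_id: str):
        """Return the customer's adapter, loading it and evicting the least
        recently used one if the pool is full."""
        if customer_id in self._resident:
            self._resident.move_to_end(customer_id)
            return self._resident[customer_id]
        if len(self._resident) >= self.max_resident:
            self._resident.popitem(last=False)       # evict least recently used
        adapter = self._load_adapter(customer_id)    # swap-in latency is paid here
        self._resident[customer_id] = adapter
        return adapter

    def _load_adapter(self, customer_id: str):
        # Placeholder: fetch the customer's LoRA weights from storage and attach
        # them to self.base_model through the serving stack's adapter API.
        raise NotImplementedError
```

The utilization gain comes from `max_resident` customers sharing one base model allocation instead of each holding a dedicated GPU, at the price of occasional adapter swap latency.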
The framework's utility function governs routing and escalation decisions at every level of the graph. For datacenter operators, the key terms map directly to fleet economics:
U = w_e(f)·E + w_c(f)·C + w_k(f)·K

- E (Efficacy): did the specialist produce a verified, useful output?
- C (Confidence): internal consistency, penalized by detected contradictions
- K (Curiosity): exploration bonus for high-upside unexplored domains (minor weight in serving)
- f: the field, which determines weights and minimum competence thresholds
Efficacy (E). For datacenter workloads this means validated code, accurate technical answers, and well-formed structured outputs. Failed outputs requiring escalation count against efficacy and trigger escalation logs. Strong efficacy on a specialist tier means that tier earns its cost differential.
Confidence (C). This is the most economically important term for routing. Low confidence on a query class is a pre-inference signal to route to a higher tier, avoiding wasted GPU cycles on likely-to-fail attempts. This is where fleet economics improve: routing on predicted confidence before the compute is spent.
Operators can configure field-specific penalty multipliers that reflect the cost of an error in that domain. A wrong answer in a high-stakes professional domain (legal, medical, financial) gets penalized more harshly at the DPO training step than a wrong answer in a low-stakes creative domain. This is how the system calibrates specialist quality to domain risk.
The base utility function does not include an explicit cost term — that is an operator-layer addition. The framework architecture supports adding a hardware-tier cost weight to the routing decision, so that the router explicitly values a correct A40-class answer over an equivalent H100-class answer by the cost differential.
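A sketch of that operator-layer addition: the base utility follows the framework's U = w_e·E + w_c·C + w_k·K, and a scaled hardware-tier cost penalty is subtracted at routing time. The tier costs reuse the per-1k-token figures derived earlier; the weights and the `lambda_cost` knob are illustrative assumptions.

```python
# Sketch of an operator-layer cost term added to the routing decision.
TIER_COST_PER_1K_TOKENS = {
    "consumer_4090": 0.00014,   # from the cost derivation above
    "a100_mid":      0.00040,   # illustrative mid-tier figure
    "h100_frontier": 0.00083,   # from the cost derivation above
}

def base_utility(e: float, c: float, k: float, weights: dict) -> float:
    """Framework utility: U = w_e*E + w_c*C + w_k*K."""
    return weights["w_e"] * e + weights["w_c"] * c + weights["w_k"] * k

def routing_score(predicted_u: float, tier: str, lambda_cost: float = 50.0) -> float:
    """Predicted utility minus a scaled cost penalty, so an equally correct
    answer from a cheaper tier wins the route."""
    return predicted_u - lambda_cost * TIER_COST_PER_1K_TOKENS[tier]
```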
Confidence thresholds take on a specific economic meaning in fleet serving. Setting them too low causes expensive under-routing — cheap specialists attempt queries they will fail, consuming GPU time and producing escalations anyway. Setting them too high causes over-routing — cheap specialists never serve queries they could have handled well, and margin on lower-tier hardware collapses.
| Hardware tier | Recommended confidence floor | Escalation target | Economic rationale |
|---|---|---|---|
| Consumer / A40-class | κ ≥ 0.80 for specialist domain queries | A100 / mid-tier pool | Preserve cheap tier margin; escalate at first uncertainty |
| A100 / mid-tier | κ ≥ 0.70 for broad domain queries | H100-class frontier pool | Mid-tier earns its margin on queries specialists couldn't serve |
| H100-class / frontier | κ ≥ 0.55 (broad model fallback) | Human review / refusal | Frontier tier absorbs hard queries; abstention is acceptable |
Illustrative starting values. Calibrate thresholds by measuring escalation rates per query class under real traffic, then adjust to hit target escalation budgets.
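A sketch of the confidence-gated escalation the table describes: serve on the cheapest tier whose predicted confidence clears its floor, otherwise escalate. The κ floors match the illustrative starting values above; the escalation chain, tier names, and function shape are assumptions.

```python
# Sketch of confidence-gated routing across hardware tiers.
ESCALATION_CHAIN = [
    # (tier name, confidence floor, escalation target)
    ("consumer_a40",  0.80, "a100_mid"),
    ("a100_mid",      0.70, "h100_frontier"),
    ("h100_frontier", 0.55, "human_review"),
]

def route(predicted_confidence: dict) -> str:
    """Serve on the cheapest tier whose predicted confidence clears its floor;
    otherwise keep escalating, ending in human review / refusal."""
    for tier, floor, _target in ESCALATION_CHAIN:
        if predicted_confidence.get(tier, 0.0) >= floor:
            return tier
    return "human_review"

# Example: a query class the A40 specialist is shaky on but the A100 tier handles.
print(route({"consumer_a40": 0.62, "a100_mid": 0.81, "h100_frontier": 0.93}))  # -> a100_mid
```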
A minimum deployable configuration does not require building all specialists at once. A staged approach focuses on the highest-volume, most-validatable query class first and builds the routing and assertions infrastructure that all future specialists will share.
AUA v1.0 is an LLM inference router. Use hardware tier templates and GET /metrics/cost to track GPU hours and cost per specialist in real time.
pip install adaptive-utility-agent
aua init my-ai-data-centers-agent --preset generalist --tier macbook
cd my-ai-data-centers-agent
aua doctor
# aua_config.yaml
specialists:
  - name: general
    model: qwen-coder-7b-awq
    port: 11434
    field: general
safety:
  abstention_enabled: false
  require_arbiter_for_high_risk: true
  min_confidence_for_direct_answer: 0.60
security:
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}
audit:
  enabled: true
  hash_chain: true
Generate your encryption key: python3 -c "import os; print(os.urandom(32).hex())" or openssl rand -hex 32 — 64-char hex string. See Tutorial §12.4 for key management.
aua serve
curl -X POST http://localhost:8000/query \
-H "Authorization: Bearer $AUA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"prompt": "...", "session_id": "demo"}'
| AUA v1.0 provides | You bring |
|---|---|
| Multi-specialist routing + utility scoring | Domain-specific specialist models |
| Arbiter + contradiction detection | Domain-specific quality criteria |
| Correction loop + DPO pair export | Fine-tuning infrastructure (TRL, Axolotl, …) |
| Blue-green deployment + rollback | Evaluation datasets for your domain |
| Append-only audit log with hash chain | Datacenter orchestration (Slurm, k8s, …) |
| Prometheus + Grafana + OTEL | Your monitoring infrastructure |
Full instructions: AUA Tutorial · Framework v1.0 · GitHub ↗