Whitepaper · Part 1 of 7 · Abstract, Introduction, Applications, Related Work

Adaptive Utility Agents: A Framework for Self-Optimizing AI Systems

Praneeth Tota  ·  Illinois Institute of Technology  ·  v1.0.0


Abstract

The central failure mode of deployed language models is error repetition: a system that produces a wrong answer today will produce the same wrong answer tomorrow, and every day until a new version ships. This paper proposes a framework for building AI agents that actively work against this failure — learning from detected errors and adjusting behavior between model releases, without waiting for the next retraining cycle. The utility function — composed of Efficacy (performance relative to a human baseline), Confidence (internal consistency penalized by detected contradictions), and Curiosity (exploration bonus for high-upside unexplored domains) — is field-weighted and bounded by minimum competence thresholds derived directly from existing societal licensing standards, making the bounds principled rather than arbitrary.

Critically, the utility function is not a passive monitoring metric. It is the active loss weighting mechanism for a three-layer continual learning architecture that corrects contradictions and improves model behavior between releases — without waiting for a full retraining cycle. Contradiction corrections are weighted by field-specific penalty multipliers in Direct Preference Optimization (DPO) training, so a surgical error is penalized an order of magnitude more harshly than a creative writing mistake at the weight-update level.

The paper introduces four additional architectural contributions. First, a Personality System with field-bounded trait weights that evolve with accumulated utility history, subject to three-layer drift safeguards. Second, an entity trust and reputation system based on verifiable domain credentials and interaction history, governing both how external inputs are weighted and when external escalation is permitted. Third, a three-layer continual learning architecture operating at per-session (behavioral), calibration-cycle (weight-level), and release (distillation) timescales. Fourth, a distributed model graph architecture that decomposes the monolithic model into independently deployable domain submodels communicating over structured APIs — physically resolving the catastrophic forgetting problem by eliminating shared weights across domains, and enabling hardware-adaptive deployment depth matched to available GPU memory.

Cross-domain contradiction resolution is handled by a dedicated Arbiter Agent that runs structured evidence checks — logical, mathematical, cross-session, and empirical — and routes verified corrections to the relevant submodels as DPO training signal. Submodels whose domains were arbitrated receive the internal evidence chain for DPO integration; nothing is disclosed externally. When all internal checks fail, a controlled external escalation queries verified domain experts through obfuscated, partialized queries consistent with the system's minimum-disclosure principle.

Submodel updates follow a utility-deviation-triggered blue-green deployment protocol with statistically grounded detection thresholds and softmax traffic routing, ensuring system stability throughout. The protocol is validated through simulation across 3 calibration cycles on 8 LeetCode-style problems, demonstrating a +0.1160 average utility improvement and contradiction reduction from 1 to 0.

Submodel deployment depth is hardware-adaptive: shallow graphs of large submodels on high-VRAM GPUs, or deep graphs of small specialist submodels on consumer hardware. A controlled four-arm routing experiment validates a key component of this claim empirically on real hardware. On a single NVIDIA RTX 4090 (24 GB) running three concurrent Qwen2.5 7B AWQ specialists, VCG arbitration routing delivers a +43.3 percentage-point correctness gain over an unrouted generic baseline (p = 0.0003, Cohen's d = 1.02, n=30 per arm). Mismatched routing is actively harmful, appearing not as an accuracy collapse but as a calibration failure: overconfident wrong answers (mean confidence 0.750 vs. ~0.60 for correctly routed arms) with a Brier score no better than no-routing (0.279 vs. 0.280). All three 7B AWQ specialists fit concurrently on the single GPU at 90.4% VRAM utilization. Combined with published benchmarks showing that domain-fine-tuned 7B specialists match general 70B models on domain tasks, and an analytical cost model showing 2–6× lower cost per query on consumer hardware, this constitutes a complete empirical and analytical argument that frontier-quality domain inference does not require frontier hardware.

The arbitration mechanism's incentive structure is formally addressed through a game-theoretic treatment integrated as §10.6. Treating domain submodels as players in a cooperative game with the Arbiter as the external social planner, three theorems are proved. Theorem S1 establishes that truthful utility reporting is a weakly dominant strategy for every submodel under the Vickrey-Clarke-Groves (VCG) mechanism — eliminating the need for hand-specified confidence check weights by replacing them with the submodels' reported value functions. Theorem S2 shows that dominant-strategy truthfulness implies social optimality: the Arbiter selects the claim maximising the sum of submodel utilities, with Price of Anarchy exactly 1. Theorem S3 establishes individual rationality: no submodel prefers abstention to participation. Clarke pivot transfers applied as DPO penalty weight adjustments constitute a continuous, self-correcting calibration signal that supersedes both the hand-specified check weights and the periodic expert-sampling audit described in the engineering approximation. The VCG mechanism is the target architecture for the Phase 6 Arbiter; the current hand-specified mechanism is retained as the deployable approximation.


1. Introduction

Deployed AI systems today are static artifacts. They are optimized at training time and released — whereupon their behavior is frozen. They cannot learn from errors between releases. When a model produces a hallucination, it will produce the same hallucination tomorrow, and in six months when the next version ships. This is not a failure of scale or capability. It is a structural absence: there is no feedback loop between detected errors and model behavior in the space between model versions.

This paper addresses that absence directly. The goal is online learning and error non-repetition: an agent that detects its own errors, corrects its behavior in response, and does not repeat those errors — continuously, between model releases, without waiting for a new training cycle.

The governing mechanism is a utility function that serves as a control law over the agent's behavior at every timescale:

U = w_e(f)·E + w_c(f)·C + w_k(f)·K

E — Efficacy:    performance relative to a human baseline
C — Confidence:  internal consistency, penalized by detected contradictions
K — Curiosity:   exploration bonus for high-upside uncertain domains
f — field, determining weights and minimum competence thresholds

The utility function is not a monitoring metric. It determines how strongly each detected error is penalized at training time (a surgical contradiction is weighted 10× harder than a creative writing mistake), when behavioral corrections are injected into the system prompt, and whether a new model deployment passes acceptance. It governs learning at every timescale the architecture operates on.
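To make the control law concrete, the following Python sketch computes U for a field and scales a DPO correction weight by the field's penalty multiplier. The field profiles and numeric values are illustrative assumptions, not the calibrated bounds derived later; only the 10:1 surgery-to-creative-writing penalty ratio comes from the text above.

from dataclasses import dataclass

@dataclass
class FieldProfile:
    w_e: float      # efficacy weight
    w_c: float      # confidence weight
    w_k: float      # curiosity weight
    c_min: float    # minimum competence threshold (licensing-derived)
    penalty: float  # field-specific DPO penalty multiplier

# Hypothetical field profiles for illustration only.
FIELDS = {
    "surgery":          FieldProfile(0.50, 0.45, 0.05, c_min=0.95, penalty=10.0),
    "creative_writing": FieldProfile(0.40, 0.20, 0.40, c_min=0.30, penalty=1.0),
}

def utility(field: str, E: float, C: float, K: float) -> float | None:
    """U = w_e(f)*E + w_c(f)*C + w_k(f)*K, gated by the field's C_min."""
    f = FIELDS[field]
    if C < f.c_min:
        return None  # below the competence floor: abstain or escalate
    return f.w_e * E + f.w_c * C + f.w_k * K

def dpo_correction_weight(field: str, base: float = 1.0) -> float:
    """A contradiction correction in surgery is penalized 10x one in
    creative writing at the weight-update level."""
    return base * FIELDS[field].penalty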

The monolithic constraint and the personality system. The correct long-term solution to error repetition is an architecture where individual domain models can be retrained and redeployed in isolation — the Micro-Expert Architecture described in §10. But until that infrastructure is operational, the system runs on a monolithic base model. In the monolithic setting, weight-level correction is constrained by retraining latency (weeks to months per cycle) and global weight interference (correcting one domain can degrade another). The Personality System (§6) is the framework's response to this constraint: a behavioral wrapper that biases the agent toward safer operating regimes between calibration cycles, reducing the harm of repeated hallucinations without modifying the underlying weights. It is explicitly interim infrastructure — designed to be superseded once the Micro-Expert Architecture makes fast, isolated domain retraining possible.

Claim hierarchy. The primary contribution of this paper is a framework for reducing error repetition between model releases through utility-governed continual correction. The personality system, trust layer, micro-expert architecture, VCG arbitration, and hardware-adaptive deployment are important parts of the broader framework, but they should be read as supporting mechanisms or future-state extensions around that core claim rather than as equally validated components today.

The paper makes seven contributions:

1. Utility function as governing control law. The utility function $U = w_e E + w_c C + w_k K$ is formally derived — not merely asserted. Appendix B proves the additive structure is uniquely necessary from five behavioral axioms, the efficacy sigmoid is given a Mann-Whitney-style dominance interpretation, the EMA update is Kalman-optimal, confidence converges geometrically in expectation, and the personality evolution rules are Lyapunov stable. The weight vector is grounded in field-specific error costs derived from professional liability standards.

2. Three-layer continual learning architecture. Per-session behavioral injection (immediate, no weight change), calibration-cycle DPO fine-tuning (hours, field-penalty-weighted), and release-level distillation (monthly). Each layer corrects what the previous cannot reach. The utility score determines correction weights throughout.

3. Personality system as interim behavioral wrapper. A log-linear tilt of the base LLM's generation distribution, parameterized by field-bounded trait scores that evolve with utility history (a minimal sketch follows this list). In the monolithic setting, this is the primary mechanism for reducing repeated hallucinations between calibration cycles. The wrapper is reset on new model release and not instantiated in the Micro-Expert Architecture, where fast domain retraining makes it unnecessary.

4. Entity trust and reputation system. Each interacting entity is scored on domain expertise (bootstrapped from verifiable credentials on day one) and behavioral trust (accumulated through interaction). Scores gate how external inputs are weighted and whether an entity qualifies for controlled external escalation queries.

5. Arbiter Agent. A dedicated contradiction resolution agent running structured evidence checks (logical, mathematical, cross-session, empirical) across conflicting outputs. Verified corrections are routed as DPO signal internally. When the Arbiter cannot resolve a conflict, controlled external escalation queries verified domain experts through obfuscated, partialized queries — revealing only what is needed to elicit an answer.

6. Micro-Expert Architecture and the consumer hardware argument. The monolithic model is decomposed into independently deployable domain submodels — analogous to microservices architecture applied to model inference. Catastrophic forgetting is resolved architecturally. Submodels are versioned and updated independently via utility-deviation-triggered blue-green deployment. Graph depth is hardware-adaptive: shallow graphs on high-VRAM hardware or deep graphs of small specialist submodels on consumer hardware at lower cost per query. A controlled four-arm routing experiment on a single RTX 4090 (§10.9, Appendix A.1) measures a +43.3 pp correctness gain for VCG arbitration over no-routing (p = 0.0003, d = 1.02) and confirms the Regime 2 failure mode as a calibration failure rather than an accuracy collapse. Three concurrent 7B AWQ specialists were demonstrated on a single 24 GB consumer GPU (90.4% VRAM). Combined with published specialist benchmark results and an analytical cost model, this supports the claim that domain-specific professional inference on consumer hardware is competitive with frontier-scale enterprise deployments on those domains.

7. Game-theoretic arbitration via the VCG mechanism. The Arbiter Agent's incentive structure is formally addressed by treating domain submodels as players in a cooperative game. Under the Vickrey-Clarke-Groves mechanism, truthful reporting of value functions is the dominant strategy for every submodel (Theorem S1), the Arbiter selects the social optimum with Price of Anarchy exactly 1 (Theorem S2), and individual rationality is satisfied (Theorem S3). Clarke pivot transfers applied as DPO penalty weight adjustments replace both the hand-specified confidence check weights and the periodic expert-sampling calibration pipeline, constituting a continuous self-correcting signal. The VCG mechanism is the theoretical ideal toward which Phase 6 Arbiter implementation converges; §10.6 develops all definitions, proofs, and practical implementation details.
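The log-linear tilt in contribution 3 admits a compact statement. The sketch below adds trait-weighted feature scores to the base model's token logits, so the softmax of the tilted logits is the reweighted generation distribution; the feature design and trait parameterization are illustrative assumptions, not the construction of §6.

import numpy as np

def tilted_logits(base_logits: np.ndarray,    # shape (vocab,)
                  trait_features: np.ndarray,  # shape (vocab, n_traits), phi_t(y)
                  trait_weights: np.ndarray    # shape (n_traits,), field-bounded
                  ) -> np.ndarray:
    # log p'(y|x) = log p(y|x) + sum_t w_t * phi_t(y) + const,
    # i.e. p'(y|x) is proportional to p(y|x) * exp(w . phi(y)).
    return base_logits + trait_features @ trait_weights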

Readers primarily interested in practical applications may jump directly to §2 for a domain-by-domain overview of where the framework applies, before returning to the theoretical development in §§4–8.

Current evidence scope. The framework is now validated through both simulation experiments (Appendix A.2) and a physical hardware deployment (Appendix A.1). The hardware experiment — three Qwen2.5 7B AWQ specialists on a single NVIDIA RTX 4090 — produced the following primary measured results: VCG arbitration routing +43.3 percentage points over no-routing (p = 0.0003, Cohen's d = 1.02, n=30 per arm); 502 paired DPO entries accumulated automatically from the contradiction detection pipeline; the blue-green deployment pipeline ran end-to-end (canary + gradual shift + softmax routing); and cross-domain arbitration detected a logical contradiction in the SWE specialist's gradient descent implementation (severity 0.9 — code fails its own assert). The simulation data in Appendix A.2 remains the primary support for the wrapper/control-layer thesis (69.6% error reduction, Brier calibration improvement, U↔correctness correlation). The micro-expert architecture is now empirically grounded. The VCG mechanism and consumer-hardware cost arguments are supported by the measured routing experiment combined with analytical argument.

The simulation evidence comprises two controlled experiments reported in Appendix A.2. The primary experiment runs a 500-task, five-cycle two-arm comparison: a full agent pipeline against an uncalibrated baseline receiving identical tasks with no correction injection. Across 25 problem types spanning 11 algorithm families, the agent reduces repeated errors by 69.6% (14 vs. 46 repeated errors in cycles 2–5). Confidence calibration improves by 14.3% on the Brier score (0.2226 vs. 0.2597), with the gap widening to 29.5% by cycle 5 as contradiction penalties progressively sharpen the confidence signal. Utility U correlates significantly with ground-truth correctness in both arms (Pearson $r = 0.461$, $p < 10^{-40}$), validating U as a meaningful quality signal rather than a passive monitor. A separate ten-cycle stability run confirms that contradiction rate falls from 22% to 6%, personality traits converge within field bounds consistent with Theorem B.7, and Brier score reaches 0.049 by cycle 7. Full task-level records, DPO pair logs, and all simulation figures are available in extended_results.json in the repository.

The rest of the paper is organized as follows. Section 2 motivates the framework with application domains. Section 3 surveys related work. Sections 4–8 develop the theoretical framework: utility function, field bounds, personality system, trust system, and continual learning pipeline. Section 9 describes the monolithic wrapper architecture. Section 10 describes the Distributed Model Graph / Micro-Expert Architecture, including the Arbiter Agent (§10.5), game-theoretic VCG arbitration (§10.6), blue-green deployment (§10.7), system properties (§10.8), hardware-adaptive decomposition (§10.9), and router high availability (§10.10). Section 11 describes the code generation MVP and simulation results. Sections 12–13 present the roadmap and open questions. Appendix A contains the full simulation data. Appendix B contains all mathematical proofs. Appendix C contains worked application examples.

2. Applications and Motivation

Before developing the theoretical framework, this section illustrates where and how utility-governed agents apply in practice. Each of the seven domains below is a natural fit for the architecture: decisions under competing objectives, real-time uncertainty, and the need to improve from experience without waiting for a full system retrain. Two of the cases are deliberately industry-facing: AI data center operators, where the question is whether routing and specialist decomposition improve revenue per watt and fleet utilisation, and self-driving / AV platforms, where the question is whether modularity, abstention, and auditability improve safety certification and regulatory acceptance. Full worked examples with numerical derivations are in Appendix C.

2.1 Autonomous Vehicles

Domain deep-dive: Self-Driving Vehicles · Autonomous Systems

A self-driving vehicle balances safety, journey efficiency, and passenger comfort simultaneously — objectives that conflict on every manoeuvre decision. The utility function weights shift automatically by context: safety dominates in school zones ($w_s=0.90$), efficiency rises in emergency transport ($w_e=0.40$). When sensor fusion uncertainty drives confidence below the field minimum ($C_{\min}=0.85$), the vehicle abstains from the manoeuvre rather than proceeding at reduced reliability. Over time, the assertions store accumulates route-specific priors — high merge conflict at certain junctions, recurring pedestrian patterns — and future decisions use these automatically without manual reconfiguration. For battery-constrained deployment, the Micro-Expert Architecture is particularly compelling: three domain-specialist models (perception, motion planning, traffic rules) running on Jetson-class embedded hardware consume approximately 110W total system power versus 700W for a single datacenter GPU — and a monolithic frontier model cannot fit in a vehicle's compute envelope at all.

There is also a regulatory and certification angle. A modular architecture narrows the revalidation scope when one component changes: updating traffic rules for a new city need not force a full re-certification of perception and planning. The same architecture also produces an auditable decision trail — utility values, confidence gates, arbitration results, and escalation events — which maps naturally to the interpretability pressure now building around autonomous systems. In practice this means the framework is relevant not only to real-time driving quality, but also to shadow-mode validation, incident review, and the broader challenge of explaining AV behaviour to regulators and safety reviewers.

2.2 Drone Delivery

Domain deep-dive: Autonomous Systems

A delivery drone must weigh delivery speed against energy efficiency and airspace safety — where safety risk is a real-time function of wind conditions, battery state, and traffic. Standard routing favours on-time delivery; an approaching storm shifts the safety weight from $w_s=0.50$ to $w_s=0.80$, automatically selecting a longer but safer low-altitude route without a pre-written storm rule. When environmental uncertainty exceeds a threshold, the drone aborts and returns to base rather than committing to a route it cannot safely evaluate. At 10–25W for a Jetson Orin NX running a 7B specialist, the entire compute system fits within the drone's power budget — an H100 at 700W does not.

2.3 Smart Home Energy Management

Domain deep-dive: Energy Systems

A home energy agent balances occupant comfort, electricity cost, and carbon footprint across appliances, HVAC, and EV charging. The optimal decision changes constantly: during a peak pricing event, the cost weight rises from $w_k=0.40$ to $w_k=0.65$, shifting appliance scheduling to off-peak automatically. When the occupant signals an explicit preference, the system defers — not by overriding the utility function, but by activating a comfort-override weight profile where $w_c=0.75$. Cross-session learning accumulates usage patterns (carbon-low periods, occupant schedules) and applies them to future decisions without retraining.

2.4 Energy Grid Load Balancing

Domain deep-dive: Energy Systems

A grid management agent must balance stability, generation cost, and renewable utilisation in seconds, with errors that cascade across interconnected infrastructure. Under normal load, demand response with battery storage is preferred over gas peaker plants — cleaner and cheaper, with an acceptable stability trade-off. Under a sudden demand surge, the stability weight rises to $w_\sigma=0.80$ and the decision flips: the gas peaker becomes optimal because grid stability now dominates. The $C_{\min}=0.95$ gate under surge conditions ensures the agent escalates to a human operator when demand forecasts are unreliable, rather than committing a large generation decision under uncertainty. The same formula produces the correct outcome in both contexts — no separate rule sets required.

2.5 Dynamic Pricing

Domain deep-dive: Dynamic Pricing

A platform pricing agent balances revenue, customer satisfaction, and market share — objectives that pull in different directions across contexts. Standard conditions favour moderate pricing with loyalty incentives over aggressive surge pricing. Under genuine supply constraints ($w_r=0.65$), surge pricing becomes optimal. Under a competitive threat, market share weight rises to $w_m=0.40$ and pricing shifts to defend position. A flat "never surge" policy is suboptimal in crunch conditions; a flat "always surge" policy destroys customer relationships in normal conditions. The utility framework selects correctly in each context through a single formula with context-specific field parameters, without pre-programmed rules for each scenario.

2.6 AI Data Centers and GPU Cloud Operators

Domain deep-dive: AI Data Centers

For AI data center operators the most interesting question is not whether the framework produces abstractly “better intelligence,” but whether it changes the economics of serving inference. A routed graph of smaller specialist models shifts the optimisation target from raw frontier capability to revenue per watt, fleet utilisation, and cost per useful domain query. In a mixed fleet containing H100s, A100s, A40s, and consumer-class GPUs, specialist workloads can be mapped to cheaper hardware tiers while reserving scarce top-end capacity for broad or latency-sensitive tasks. That creates a product story for heterogeneous inventory rather than forcing every customer onto the same expensive model tier.

This is especially relevant for operators such as Crusoe, CoreWeave, Lambda, and smaller GPU cloud providers. Crusoe’s constraint is often available watts on stranded or variable energy; if the same domain-specific output quality can be achieved at materially lower wattage per useful query, the framework aligns directly with its operating thesis. CoreWeave-style mixed fleets benefit for a different reason: routing allows older or lower-tier GPUs to remain commercially valuable by serving specialist models that fit comfortably in their memory envelope. This turns routing quality into a margin lever, not just a model-quality detail.

2.7 Self-Driving Platform and Regulatory Operations

Domain deep-dive: Self-Driving Vehicles

For AV companies the strongest argument is not hardware cost, but independent updateability, auditable behaviour, and principled abstention. Current end-to-end neural systems make component-level certification and post-incident explanation difficult because perception, planning, and traffic-rule reasoning are tightly entangled. The Micro-Expert Architecture changes that by allowing one domain specialist to be updated and revalidated without perturbing the others, while the utility log provides a reproducible explanation of why a given manoeuvre was accepted, rejected, or escalated.

This connects naturally to current operating practice. Shadow mode testing is effectively a blue-green deployment protocol at vehicle scale: a new submodel can run silently beside the production model, its utility compared against the live path before promotion. The confidence gate also matches how regulators increasingly expect safety-critical autonomy systems to behave: when competence falls below a threshold, the system should not improvise — it should abstain, slow, or escalate. That makes the framework relevant not just to driving performance, but also to safety case construction, incident review, and regulatory acceptance.


3. Related Work

3.1 Utility Functions and Resource Allocation Games

The theoretical foundations of our utility function draw directly from the game-theoretic literature on resource allocation mechanisms. Johari and Tsitsiklis (2004) established the canonical proportional allocation game, proving the existence and uniqueness of Nash equilibrium and bounding the Price of Anarchy at 4/3 for concave utility functions — the same class of utility functions used in our confidence and efficacy formulations. Their subsequent work on scalar-parameterized mechanisms showed that the proportional allocation mechanism achieves the best Price of Anarchy among mechanisms using a single market-clearing price, providing theoretical justification for our single-score utility formulation rather than a vector-valued alternative.

Kelly (1997) introduced the proportional fairness framework for network resource allocation, establishing that bid-proportional allocation maximizes a natural social welfare function. Our curiosity term — allocating exploration effort proportional to potential gain weighted by confidence gap — is structurally analogous to Kelly's proportional bidding: agents allocate where marginal return is highest.

Koutsoupias and Papadimitriou (1999) introduced the Price of Anarchy as a measure of efficiency loss from selfish behavior, which directly motivates our confidence penalty mechanism: detected contradictions are evidence that the model has been operating in a locally selfish manner (optimizing fluency over correctness), and the penalty restores the social optimum by penalizing this deviation. Roughgarden and Tardos (2002) established that the same 4/3 bound applies to selfish routing games with affine-linear congestion functions, linking the network efficiency literature to general mechanism design.

Key departure: Existing utility function literature treats agents as fixed rational actors. Our framework treats the agent itself as the subject of the utility function — the agent learns to improve its own utility through calibration, not merely to maximize given a fixed utility. This closes the loop between mechanism design theory and machine learning.


3.2 Distributed Model Architectures: From Microservices to Mixture of Experts

Microservices architecture provides the software engineering precedent for our physically decomposed model graph. The microservices pattern — decomposing a monolithic application into independently deployable services communicating over well-defined interfaces — has been extensively validated at scale by companies including Netflix, Amazon, and Uber. Key properties established in the microservices literature directly motivate our design: independent deployability eliminates coordinated release cycles; bounded blast radius limits the impact of any single component failure; and service mesh patterns (Istio, Linkerd) demonstrate that weighted traffic routing between service versions is operationally mature technology.

Humble and Farley (2010) in Continuous Delivery formalized the practices of blue-green and canary deployment that underpin our submodel update mechanism, establishing the statistical and operational foundations for progressive traffic shifting between service versions.

Mixture of Experts (MoE) is the closest existing architectural precedent within the ML literature. Shazeer et al. (2017) introduced the Sparsely-Gated Mixture-of-Experts layer, achieving efficiency through sparse activation where only k of N expert subnetworks process each input, with a trainable gating network routing tokens to the most appropriate experts. Fedus et al. (2021) in the Switch Transformer simplified MoE routing to k=1 (a single expert per token), achieving up to 7x pre-training speedup with the same computational budget while scaling to trillion-parameter models. Production models including Mixtral 8×7B and (reportedly) GPT-4 deploy MoE architecture.

Key departure from MoE: Our architecture differs from MoE in a critical dimension: MoE experts share weights, are trained jointly, are deployed monolithically, and cannot be updated independently. Our graph consists of physically separate models with independent weight files, independent training pipelines, and independent blue-green deployment cycles. Updating one domain submodel does not affect any other. This is not a difference of degree — it is a difference of kind. MoE solves the compute efficiency problem within a monolithic training paradigm; we solve the independent deployability and catastrophic forgetting problems by abandoning the monolithic paradigm entirely.

Domain-specialized routing without shared weights has been explored in the Branch-Train-Merge literature (Li et al., 2022) and MoErging approaches, where independently fine-tuned models are composed at inference time. Our architecture extends this direction with utility-function-governed update triggers and hardware-adaptive depth.


3.3 Blue-Green Deployment with Statistically Grounded Thresholds

Blue-green deployment is a mature practice in software engineering, described by Humble and Farley (2010) and subsequently adopted at scale by major technology organizations. The pattern involves maintaining parallel production environments and shifting traffic between them, with automated rollback policies based on performance metrics including error rates, latency, and business KPIs. Canary releases extend this by routing only a fraction of traffic to the new version initially, expanding progressively as confidence grows.

Recent empirical work confirms that canary deployments significantly reduce failure rates compared to direct blue-green switches for systems with complex state, with statistical tests on failure rates validating the significance of gradual traffic migration.

Key departure: Existing blue-green and canary deployment literature triggers traffic shifts based on infrastructure metrics (error rates, latency, uptime). Our system triggers updates based on utility deviation — a domain-specific measure of knowledge quality — and uses power analysis on observed utility score variance to derive theoretically grounded minimum observation windows T(field) before any deployment decision. The trigger threshold δ(field) is derived from field-specific penalty multipliers bootstrapped from societal licensing standards, not from arbitrary SLA targets. This grounds deployment decisions in epistemic quality rather than operational health.
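A minimal sketch of that power-analysis step, assuming a one-sided z-test on mean utility: given the observed utility standard deviation and the field trigger threshold δ(field), it returns the minimum number of observations T(field) required before any deployment decision.

from math import ceil
from statistics import NormalDist

def min_observation_window(sigma: float, delta: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Smallest T such that a one-sided z-test detects a utility deviation
    of size delta with the requested significance and power."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # significance quantile
    z_b = NormalDist().inv_cdf(power)      # power quantile
    return ceil(((z_a + z_b) * sigma / delta) ** 2)

# Noisier utility signals or tighter thresholds demand longer windows:
# min_observation_window(sigma=0.10, delta=0.05) -> 25 observations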

The softmax traffic routing formula — where traffic split is a continuous function of comparative utility scores with a field-calibrated temperature parameter — has no direct precedent in the deployment literature. It makes the routing self-regulating: no manual traffic adjustment is required, and the promotion rate naturally slows as U_green approaches U_blue.
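Under those definitions the routing rule is a two-way softmax, sketched here with illustrative temperature values:

import math

def traffic_split(u_blue: float, u_green: float, tau: float) -> tuple[float, float]:
    """Fraction of traffic sent to the blue (incumbent) and green (candidate)
    submodel versions. Lower tau shifts traffic more aggressively per unit
    of utility gap; tau is calibrated per field."""
    m = max(u_blue, u_green) / tau           # subtract max for numerical stability
    eb = math.exp(u_blue / tau - m)
    eg = math.exp(u_green / tau - m)
    return eb / (eb + eg), eg / (eb + eg)

print(traffic_split(0.70, 0.78, tau=0.05))  # clear green lead: ~(0.17, 0.83)
print(traffic_split(0.70, 0.71, tau=0.05))  # near parity: ~(0.45, 0.55)

As U_green approaches U_blue the split approaches 50/50, which is the self-regulating slowdown described above.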


3.4 Arbiter Agents and Structured Contradiction Detection

The problem of resolving conflicts between multiple agents or models has antecedents in multi-agent systems and distributed consensus literature. Paxos (Lamport, 1998) and Raft (Ongaro & Ousterhout, 2014) provide consensus mechanisms for distributed systems, but these assume a binary correct/incorrect answer and require a majority of participants to agree. Our Arbiter Agent addresses a more complex problem: two models may both be wrong in different ways, and the arbiter must determine ground truth independently rather than through consensus.

The structured contradiction detection pipeline — logical, mathematical, cross-session, empirical checks in order of cost — draws from formal verification literature. SAT solvers and theorem provers (Lean, Coq) provide the mathematical check layer; their correctness guarantees are exact rather than probabilistic, which is why mathematical contradictions carry the highest confidence weight (0.40) in the arbiter confidence formula.

Self-consistency checking in LLMs (Wang et al., 2022) explores using multiple model samples to identify contradictions within a single model's outputs. Our approach extends this to contradictions between physically separate domain models, adding the cross-session assertions store as a persistent memory of verified facts that both checks and updates across interactions.

The arbiter confidence weighting formula — combining logical (0.30), mathematical (0.40), cross-session (0.20), and empirical (0.10) check results — is grounded in the relative reliability of each check type. Mathematical verification is definitive; empirical checks via external sources introduce the possibility of source error and are therefore weighted lowest despite being most grounded in external reality.
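In code, the aggregation reduces to a weighted average; renormalizing over the subset of checks that actually ran is an assumption of this sketch rather than a rule stated above.

# Check weights from the arbiter confidence formula.
CHECK_WEIGHTS = {"logical": 0.30, "mathematical": 0.40,
                 "cross_session": 0.20, "empirical": 0.10}

def arbiter_confidence(results: dict[str, float]) -> float:
    """results maps each executed check to a score in [0, 1]; checks that
    were skipped (e.g. for cost) are simply absent."""
    total = sum(CHECK_WEIGHTS[name] for name in results)
    return sum(CHECK_WEIGHTS[name] * score
               for name, score in results.items()) / total

# All four checks pass strongly except a weak empirical source:
# arbiter_confidence({"logical": 1.0, "mathematical": 1.0,
#                     "cross_session": 1.0, "empirical": 0.4}) -> 0.94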


3.5 External Escalation Gated by Trust: Zero Trust and Minimum Disclosure Principles

The external escalation protocol in §10.5 is grounded in two bodies of security literature.

Zero Trust architecture (Kindervag, 2010; NIST SP 800-207, 2020) establishes the principle that no entity — internal or external, human or machine — is trusted by default. The principle of least privilege, foundational to Zero Trust, states that every program and privileged user should operate using the least amount of privilege necessary to complete the job, first articulated by Saltzer (1974). Our external escalation gating applies this principle to information disclosure: no internal state is shared with any external entity by default. Trust must be earned through demonstrated domain expertise and interaction history before any escalation query is issued.

NIST SP 800-207 defines zero trust as minimizing uncertainty in enforcing accurate, least-privilege per-request access decisions in the face of a network viewed as compromised. Our system treats every external entity as potentially compromised or adversarial by default; the dual trust threshold (domain expertise above median AND entity trust above a field-specific floor) operationalizes this assumption.
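The dual threshold reduces to a simple conjunction, sketched below with illustrative argument shapes:

import statistics

def qualifies_for_escalation(expertise: float, trust: float,
                             peer_expertise: list[float],
                             field_trust_floor: float) -> bool:
    """An external entity receives an escalation query only if its domain
    expertise exceeds the median of known entities in the domain AND its
    accumulated behavioral trust clears the field-specific floor."""
    return (expertise > statistics.median(peer_expertise)
            and trust > field_trust_floor)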

Minimum disclosure / data minimization principles from privacy engineering (Cavoukian, 2009; GDPR Article 5) provide the foundation for query obfuscation. The legal principle of data minimization — collect and share only what is strictly necessary — is extended here to system state disclosure: reveal only what is necessary to elicit the needed external judgment, and nothing more. The query transformation example (internal conflict → clean professional question) implements data minimization at the system interface boundary.

Key contribution: Existing Zero Trust literature addresses access control to systems and data. We extend the framework to govern information disclosure from an AI system to human experts, including not only what is shared but how it is framed, with explicit obfuscation of system context to prevent an external expert from deducing internal architecture, failure modes, or competitive capabilities.


3.6 Hardware-Adaptive Decomposition: Interconnect Bandwidth and Inference Cost

The hardware-adaptive decomposition argument in §10.9 is grounded in established literature on GPU interconnect performance and its impact on distributed inference.

NVLink 4.0 (H100 GPUs) provides 900 GB/s of bidirectional bandwidth per GPU, compared to PCIe 5.0 x16's 128 GB/s — a 7× difference that directly determines the practical efficiency of any computation that must cross a GPU boundary. NVIDIA's production measurements show that transferring 20 GB of tensor parallelism synchronization data for Llama 3.1 70B takes 150 ms over PCIe point-to-point versus 22 ms over NVSwitch at full NVLink bandwidth — a 6.8× latency difference that scales with every inference request.

This bandwidth hierarchy has direct implications for optimal decomposition depth. Tensor parallelism performs better within a single node connected via NVLink, while pipeline parallelism is better suited for setups spanning multiple nodes; inference latency is far more sensitive to communication overhead than training throughput. Our hardware-adaptive branching heuristic — stop branching when submodels fit within a single GPU's VRAM without sharding — is a direct corollary of these measured performance characteristics.

The cost inversion argument can be stated formally. Let B_intra = intra-GPU memory bandwidth (3.35 TB/s on H100), B_inter = inter-GPU bandwidth (900 GB/s NVLink or 128 GB/s PCIe), and D = data transferred per inference step. For a graph of depth d with k active nodes per query:

Latency_shallow(d=1) ∝ D / B_intra
Latency_deep(d=k)    ∝ k × (D_sub / B_intra) + (k-1) × (D_msg / B_inter)

Where D_sub = D/k (smaller submodel, smaller activations)
      D_msg = inter-node message size (much smaller than D)

When D_msg << D_sub and B_inter >> D_msg/latency_target:
    Latency_deep ≈ Latency_shallow

At equal latency, Cost_deep < Cost_shallow because:
    Consumer GPU cost/hr << Enterprise GPU cost/hr
    D_sub fits in consumer VRAM → no high-VRAM GPU required
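Plugging in the bandwidth figures cited in this section makes the comparison concrete. D matches the 20 GB transfer example above; D_MSG is an illustrative assumption, and the sketch reuses B_intra for both graph shapes, as the formula does.

# Latency comparison under the cost model above (values in bytes or bytes/s).
B_INTRA = 3.35e12        # H100 HBM bandwidth
B_INTER_PCIE = 128e9     # PCIe 5.0 x16
D = 20e9                 # data moved per inference step, shallow graph
D_MSG = 50e6             # inter-node structured-API message size (assumed)

def latency_shallow() -> float:
    return D / B_INTRA

def latency_deep(k: int, b_inter: float = B_INTER_PCIE) -> float:
    d_sub = D / k        # each submodel moves a k-th of the data
    return k * (d_sub / B_INTRA) + (k - 1) * (D_MSG / b_inter)

print(f"shallow:  {latency_shallow() * 1e3:.2f} ms")   # 5.97 ms
print(f"deep k=3: {latency_deep(3) * 1e3:.2f} ms")     # 6.75 ms

Compute latencies stay comparable while the deep graph runs on hardware with a much lower cost per hour, which is exactly the Cost_deep < Cost_shallow inversion stated above.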

NVLink delivers 5× the energy efficiency of PCIe Gen 5 at 1.3 picojoules per bit, which means that for workloads where inter-GPU communication is unavoidable, NVLink-connected consumer GPU clusters can be more cost-effective per useful computation than PCIe-connected enterprise GPU clusters of equivalent total VRAM.

The practical implication — that a deep graph of small models on consumer hardware may achieve equivalent quality at lower cost than a shallow graph on expensive hardware — inverts the standard assumption that larger models require larger hardware budgets. This is a testable empirical claim that the Phase 6 roadmap item (§12) is designed to validate.


3.7 VCG Mechanism Design and Cooperative Games

The VCG arbitration mechanism developed in §10.6 draws from the foundational mechanism design literature. Vickrey (1961) introduced the second-price sealed-bid auction, establishing the principle that transferring surplus to agents proportional to their contribution to others' welfare makes truthful bidding dominant. Clarke (1971) and Groves (1973) independently generalised this to arbitrary public-good allocation problems, showing that the Clarke pivot transfer — measuring how much an agent's participation improves the welfare of all others — makes truthful reporting dominant in the entire Groves family of mechanisms.

The Hurwicz-Walker impossibility result establishes that no mechanism can simultaneously achieve allocative efficiency, incentive compatibility, and budget balance. The VCG mechanism makes the canonical trade-off: sacrifice budget balance to obtain the other two. In the DPO weight adjustment context (§10.6.2), budget imbalance means total penalty adjustments across submodels in an arbitration round need not sum to zero — which is acceptable because there is no conservation law on training weights.
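A minimal sketch of the selection and transfer computation, assuming each submodel reports a value for every candidate claim (the structured-prompting elicitation discussed in §10.6.7):

def vcg_arbitrate(claims: list[str],
                  reports: dict[str, dict[str, float]]) -> tuple[str, dict[str, float]]:
    """reports[submodel][claim] is that submodel's reported value for the
    claim being verified. Returns the socially optimal claim (Theorem S2)
    and each submodel's Clarke pivot transfer, applied downstream as a DPO
    penalty weight adjustment."""
    def welfare(claim: str, exclude: str | None = None) -> float:
        return sum(v[claim] for name, v in reports.items() if name != exclude)

    chosen = max(claims, key=welfare)   # maximize total reported utility
    transfers = {}
    for name in reports:
        # Externality this submodel imposes on the others: their best
        # achievable welfare without it, minus their welfare at the
        # actually chosen claim.
        alt = max(claims, key=lambda c: welfare(c, exclude=name))
        transfers[name] = welfare(alt, exclude=name) - welfare(chosen, exclude=name)
    return chosen, transfers

Because the transfers need not sum to zero, this sketch exhibits exactly the budget imbalance traded away above; on training weights, no conservation law forces them to.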

Nash (1950) and Harsanyi and Selten (1972) provide the cooperative bargaining foundations for the disagreement point construction in §10.6.1: the disagreement payoff is the utility each submodel achieves if no claim is verified and the system abstains, which is near zero by design.

The connection between mechanism design and the Johari-Tsitsiklis (2004) Price of Anarchy literature (§3.1) is direct. Without a mechanism designer, selfish submodels in an arbitration game may achieve POA as bad as 4/3 — the same bound that applies to the proportional allocation game. The VCG mechanism achieves POA = 1 because the Arbiter's external position allows it to observe all reported utilities and implement the social optimum directly. The gap from 4/3 to 1 is the precise value the Arbiter's external position provides.

Key departure from standard mechanism design: Classical VCG assumes rational agents reporting private values. Domain submodels are not rational agents; they are language models generating token sequences. §10.6.7 addresses this departure directly: value function elicitation is implemented via structured prompting, and the dominant-strategy guarantee applies to the elicitation protocol under the assumption that submodels produce calibrated confidence estimates. The limitations of this assumption — and the conditions under which it breaks down — are an open question noted in §13.

