The framework's MVP domain — and the strongest early domain for any builder. Correctness signals are abundant, automatic, and unambiguous. Tests pass or fail. Type checkers flag errors. Complexity analysis is deterministic. The entire simulation runs here today.
The framework's corrective loop depends on detecting errors. Error detection depends on having a ground truth signal that is unambiguous, automatable, and available without human raters. Software engineering is the only domain that provides all three simultaneously:
A solution that fails its own test cases is demonstrably wrong. No human rater required. The test suite is the ground truth, and it runs in seconds. Correctness is binary and automatable.
mypy, pyright, and the TypeScript compiler provide static analysis that catches type inconsistencies, undefined behavior, and API misuse automatically. Each run is a free confidence signal.
AST-based nested loop counting detects claimed-vs-actual complexity contradictions without any external tool. A solution claiming O(n) with two nested loops is mathematically contradicted. The detector runs in milliseconds.
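One way to implement such a counter is a short recursive walk over the parsed AST. A minimal sketch, assuming only the standard library (`max_loop_depth` is an illustrative name, not the repo's actual API):

```python
import ast

def max_loop_depth(source: str) -> int:
    """Return the deepest level of nested for/while loops in the code."""
    tree = ast.parse(source)

    def depth(node: ast.AST, current: int = 0) -> int:
        deepest = current
        for child in ast.iter_child_nodes(node):
            # Entering a loop body raises the nesting level by one
            next_level = current + 1 if isinstance(child, (ast.For, ast.While)) else current
            deepest = max(deepest, depth(child, next_level))
        return deepest

    return depth(tree)

code = """
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(len(nums)):
            if nums[i] + nums[j] == target and i != j:
                return (i, j)
"""
print(max_loop_depth(code))  # → 2, which contradicts a claimed O(n)
```

A depth of 2 or more alongside a claimed O(n) is the mathematical contradiction the detector flags.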
LeetCode solutions, Upwork rates, GitHub commit history — human baseline performance on code tasks is observable and benchmarkable. The efficacy comparison has a concrete reference point.
From §11.1 of the full whitepaper: "Code generation is the ideal MVP domain: correctness is binary and automatable, contradictions are formally detectable, human baseline cost is measurable, existing tooling handles scoring, and no human raters are needed — ground truth is free."
The software engineering field config reflects the domain's characteristics — correctness matters most, confidence is meaningful but not safety-critical, and curiosity (exploring novel approaches) has value:
# From agent/config.py
"software_engineering": FieldConfig(
    w_efficacy=0.55,         # correctness vs human baseline dominates
    w_confidence=0.35,       # internal consistency matters — wrong-but-confident is costly
    w_curiosity=0.10,        # exploration value for novel problem types
    c_min=0.70,              # must be 70%+ confident to act (not abstain)
    e_min=0.65,              # must exceed 65% of human baseline performance
    penalty_multiplier=2.0,  # SWE mistakes are costly but recoverable
)
Compare the penalty multiplier across domains: surgery=10×, law=5×, software engineering=2×, creative writing=1×. A wrong answer in surgery is penalized 10× as hard as a wrong answer in creative writing at the DPO training level. Software engineering sits in the middle — errors are real (shipped bugs, security vulnerabilities, incorrect algorithms) but not irreversible in the way surgical errors are. The 2× multiplier means the model trains moderately harder against software contradictions than against low-stakes domain errors.
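To make the weighting concrete, here is the effective confidence penalty a single severity-0.7 contradiction would produce in each domain under the `confidence_penalty` formula used in the signal stack (a sketch; the multiplier values are taken from the comparison above):

```python
# Effective penalty for one severity-0.7 contradiction, per domain,
# using confidence_penalty = min(severity * 0.1 * multiplier, 0.5)
MULTIPLIERS = {
    "surgery": 10.0,
    "law": 5.0,
    "software_engineering": 2.0,
    "creative_writing": 1.0,
}

for domain, m in MULTIPLIERS.items():
    penalty = min(0.7 * 0.1 * m, 0.5)
    print(f"{domain:22s} penalty = {penalty:.2f}")
```

Surgery hits the 0.5 cap, law lands at 0.35, software engineering at 0.14, and creative writing at 0.07 — the same error costs seven times more confidence in surgery than in creative writing.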
The simulation and live harness both use a multi-layer signal stack to produce the confidence and efficacy values fed into the utility function:
Receive coding problem
│
├── [Syntax check] ast.parse() → SyntaxError → confidence penalty 0.8
│
├── [Test execution] subprocess sandbox → test pass/fail → test_pass_rate
│ "Code fails its own stated test case" → severity 0.9
│
├── [Complexity check] AST nested loop count vs. claimed complexity
│ O(n) claimed + 2 nested loops → severity 0.7
│ (This is what catches the two_sum contradiction)
│
├── [Cross-session] Check assertions store for similar problems
│ "Prior solution contradicts current approach" → severity 0.5
│
└── → ContradictionResult(
contradictions=[...],
confidence_penalty = min(Σseverities × 0.1 × penalty_multiplier, 0.5)
)
Confidence update:
penalized_signal = test_pass_rate × (1 - effective_penalty)
C_new = 0.8 × C_prior + 0.2 × penalized_signal
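The update above can be written out directly. A minimal sketch with hypothetical inputs (`update_confidence` is an illustrative name, not the repo's function):

```python
def update_confidence(c_prior: float, test_pass_rate: float, effective_penalty: float) -> float:
    """EMA confidence update: blend the prior with the penalized test signal."""
    penalized_signal = test_pass_rate * (1 - effective_penalty)
    return 0.8 * c_prior + 0.2 * penalized_signal

# Illustrative: prior 0.5, all tests pass, 14% contradiction penalty
c_new = update_confidence(0.5, 1.0, 0.14)
print(round(c_new, 3))  # → 0.572
```

The 0.8/0.2 split means a single cycle moves confidence only modestly — it takes a run of clean, contradiction-free cycles to earn the 0.70+ confidence needed to act without abstaining.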
The most powerful signal is the complexity contradiction check — it is formally deterministic (AST analysis is not heuristic), free to run on every output, and catches the most common class of correctness claim errors in code generation. See §6 of the Tutorial for the full implementation walkthrough.
The simulation seeds a deliberate complexity contradiction in the first cycle to demonstrate the full detection-correction path. This is a real example from simulate.py that you can run today:
# The seeded contradiction — cycle 1, two_sum problem
# A nested-loop solution that claims O(n) complexity
def two_sum(nums, target):
    # Time complexity: O(n) <-- CLAIMED
    result = []
    for i in range(len(nums)):
        for j in range(len(nums)):  # ← NESTED LOOP: actually O(n²)
            if nums[i] + nums[j] == target and i != j:
                result.append((i, j))
    return result
Time complexity: O(n) <-- STATED IN RESPONSE
The contradiction detector:

1. Parses the solution with ast.parse() and counts nested for loops.
2. Checks the claim: "o(n)" in claimed.lower() and nested_loop_count >= 2 → True.
3. Emits Contradiction(type="mathematical", description="Claimed O(n) but code has 2 nested loop levels", severity=0.7).
4. Applies the penalty: 0.7 × 0.1 × 2.0 (SWE multiplier) = 0.14 — a 14% confidence reduction.

The corrected solution in cycle 2 uses a hash map (genuine O(n)) and passes. The contradiction disappears. By cycle 3, domain confidence has risen enough to route to medium-difficulty problems.
Terminal output from python3 simulate.py:
── Cycle 1 ─────────────────────────────────────────────────────────────
Domain confidence: 0.500 → routing to 'easy' problems
two_sum   U=0.4617  E_ema=0.5012  C=0.5000  ⚠ CONTRADICTION
...
Cycle 1 summary: avg U=0.5128, contradictions=1

── Cycle 2 ─────────────────────────────────────────────────────────────
Domain confidence: 0.612 → routing to 'easy' problems
two_sum   U=0.5669  E_ema=0.5543  C=0.7234
...
Cycle 2 summary: avg U=0.5921, contradictions=0

── Cycle 3 ─────────────────────────────────────────────────────────────
Domain confidence: 0.819 → routing to 'medium' problems
...
Cycle 3 summary: avg U=0.6288 (+0.1160 from cycle 1), contradictions=0
The extended simulation (500-task two-arm study, 5 cycles, 25 problem types, 11 algorithm families) is the primary empirical validation of the framework. Full data in agent/extended_output/extended_results.json. Full analysis in Appendix A of the full whitepaper.
| Cycle | Avg U | Avg E_ema | Avg C | Contradictions | Agent Brier | Baseline Brier |
|---|---|---|---|---|---|---|
| 1 | 0.5128 | 0.5124 | 0.6005 | 1 | 0.3279 | 0.3502 |
| 2 | 0.5921 | 0.5542 | 0.8128 | 0 | 0.2177 | 0.2520 |
| 3 | 0.6288 | 0.5740 | 0.8940 | 0 | 0.2464 | 0.2860 |
Extended simulation: 5-cycle two-arm comparison. Full data: extended_results.json. See Appendix A for full statistical analysis.
The 10-cycle stability run confirms: contradiction rate falls from 22% to 6% (73% reduction), and Brier score reaches 0.049 by cycle 7 — well-calibrated confidence in the software engineering domain.
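For reference, the Brier score used throughout these tables is simply the mean squared gap between stated confidence and binary correctness. A minimal sketch:

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence and actual correctness (0/1).

    Lower is better: 0.0 is perfect calibration, 0.25 is the score of
    always answering 0.5 ("no information").
    """
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# High confidence on correct answers, low on the wrong one → small score
print(brier_score([0.9, 0.8, 0.2], [1, 1, 0]))  # ≈ 0.03 — well calibrated
```

Against this yardstick, the 0.049 reached by cycle 7 indicates the agent's stated confidence tracks its actual correctness closely.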
A separate four-arm routing quality study measured the contribution of correct domain routing to code quality, independent of model size. The experiment used 200 tasks, 25 problem types, 11 algorithm families. All four arms used the same task plans; quality parameters were derived from published domain benchmarks. Full data: routing_results.json.
| Arm | Description | Correctness | Δ vs baseline | Brier | p-value |
|---|---|---|---|---|---|
| A | No routing — single generic prompt | 59.0% | — | 0.1596 | — |
| B | Matched routing (oracle) — correct specialist always | 71.5% | +12.5% | 0.1062 | 0.009 |
| C | Mismatched routing — wrong specialist | 41.5% | −17.5% | 0.2919 | <0.001 |
| D | VCG arbitration — probabilistic fan-out | 69.5% | +10.5% | 0.1097 | 0.029 |
Run with: cd agent && python3 routing_experiment.py. Full data in routing_output/. See §10.9.4 for the full analysis.
Three findings matter for coding agent builders: correct routing alone is worth +12.5 points of correctness over an unrouted baseline; mismatched routing is worse than no routing at all (−17.5 points); and VCG arbitration recovers most of the oracle's gain (+10.5 points) without needing to know the correct specialist in advance.
As the system accumulates correct solutions and domain confidence rises, it automatically routes to harder problems. From agent/utility_scorer.py:
DIFFICULTY_THRESHOLDS = {
    "hard": 0.85,    # route to Hard when domain C > 0.85
    "medium": 0.70,  # route to Medium when domain C > 0.70
    # below 0.70 → Easy problems
}

def _recommended_difficulty(self, confidence: float) -> str:
    if confidence >= DIFFICULTY_THRESHOLDS["hard"]:
        return "hard"
    if confidence >= DIFFICULTY_THRESHOLDS["medium"]:
        return "medium"
    return "easy"
This is visible in the simulation output: cycle 1 routes to easy problems (confidence 0.500), cycle 3 routes to medium problems (confidence 0.819). The progression is automatic — the system challenges itself as it improves, resetting the novelty counter and re-engaging the curiosity dynamics that drive exploration of new problem types.
For coding agent builders, this produces a useful operational property: the agent self-calibrates the difficulty of problems it attempts based on its demonstrated competence. An agent that has proven it can reliably solve easy-to-medium problems on a codebase will escalate to harder refactoring tasks. One that has not yet demonstrated reliability stays on simpler tasks until it has built the confidence track record.
From §11.2 of the full whitepaper — the complete SWE feedback loop as it runs in production:
The live loop is implemented in harness.py.

In Phase 2, the specialist decomposition splits the software engineering domain into sub-specialists: an algorithm correctness specialist, an API usage specialist, a system design specialist, and a security/vulnerability specialist. Each has its own corrections store and calibration cycle. The Arbiter resolves conflicts when the algorithm specialist and the security specialist disagree on an implementation approach — routing the question through the 4-check evidence pipeline rather than silently picking one output.
The framework's contradiction detector integrates naturally with existing dev toolchains — pytest, mypy, ruff, bandit, complexity analyzers. Any tool that produces a pass/fail or score output on a code snippet can be wired into the contradiction detection or efficacy scoring pipeline. The framework does not require replacing existing toolchains; it wraps them as signal sources.
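A sketch of what such a wrapper could look like: any tool that signals success through its exit code becomes a signal source (`tool_signal` is a hypothetical helper, not part of the framework):

```python
import subprocess
import sys

def tool_signal(cmd: list[str]) -> float:
    """Map a pass/fail dev tool's exit code to a binary signal.

    Works for pytest, mypy, ruff, bandit, or anything else that
    exits 0 on success — each run becomes a free signal.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return 1.0 if result.returncode == 0 else 0.0

# Illustrative: a clean interpreter exit scores 1.0, a failure scores 0.0
print(tool_signal([sys.executable, "-c", "pass"]))                 # 1.0
print(tool_signal([sys.executable, "-c", "raise SystemExit(1)"]))  # 0.0

# Hypothetical wiring into the efficacy/contradiction pipeline:
# signals = {"types": tool_signal(["mypy", "solution.py"]),
#            "lint":  tool_signal(["ruff", "check", "solution.py"])}
```

Tools that emit scores rather than exit codes (coverage percentages, complexity metrics) can feed the efficacy side of the pipeline the same way, with the score normalized to [0, 1].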
For infrastructure code — Kubernetes configs, Terraform plans, CI/CD pipelines — the contradiction detection approach extends: a config that claims to implement zero-downtime deployment but has no health check endpoint is a logical contradiction; a Terraform plan that claims to create a VPC with the specified CIDR but uses an overlapping range is a mathematical contradiction. The AST-based analysis generalizes to any structured artifact with a declared intent and a verifiable implementation.
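The CIDR-overlap case, for instance, is checkable with the standard library alone. A sketch (`cidr_contradiction` is a hypothetical helper, not part of the framework):

```python
import ipaddress

def cidr_contradiction(claimed_cidr: str, existing_cidrs: list[str]) -> bool:
    """True if a CIDR claimed to be non-overlapping actually overlaps
    an existing range — a mathematical contradiction in the plan."""
    new = ipaddress.ip_network(claimed_cidr)
    return any(new.overlaps(ipaddress.ip_network(c)) for c in existing_cidrs)

print(cidr_contradiction("10.0.0.0/16", ["10.0.128.0/17"]))  # True — ranges overlap
print(cidr_contradiction("10.1.0.0/16", ["10.0.0.0/16"]))    # False — disjoint
```

The same pattern applies to the zero-downtime example: parse the config, check whether a health check endpoint is declared, and flag a logical contradiction if the claim and the structure disagree.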
Security engineering warrants a higher penalty multiplier (closer to law=5× than standard SWE=2×) and a tighter C_min. A code generation agent that produces SQL injection vulnerabilities while claiming sanitized inputs is producing a contradiction with security consequences that are more severe than an inefficient algorithm. The field config can be tuned per subdomain without changing the correction loop architecture.
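A hypothetical per-subdomain override might reuse the FieldConfig shape from agent/config.py shown above (all values here are illustrative, not from the repo):

```python
# Hypothetical subdomain override — same FieldConfig shape as
# "software_engineering" above; values are illustrative
"security_engineering": FieldConfig(
    w_efficacy=0.50,
    w_confidence=0.45,       # wrong-but-confident is the dangerous failure mode
    w_curiosity=0.05,
    c_min=0.85,              # tighter than standard SWE's 0.70
    e_min=0.70,
    penalty_multiplier=5.0,  # closer to law than to standard SWE
)
```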
# Clone the repo
git clone https://github.com/praneethtota/Adaptive-Utility-Agent.git
cd Adaptive-Utility-Agent/agent
pip install numpy scipy matplotlib

# Run the 3-cycle simulation — no API key, no GPU
python3 simulate.py
# Expected: contradiction detected cycle 1, resolved cycle 2,
#           +0.1160 utility improvement across 3 cycles

# Run the extended 500-task two-arm study
python3 simulate_extended.py
# Expected: 69.6% repeated error reduction over 5 cycles

# Run the four-arm routing quality experiment
python3 routing_experiment.py
# Expected: Arm D (VCG) +10.5% vs Arm A, Arm C (mismatched) -17.5%
# Outputs: routing_output/routing_results.json + 4 PNG figures

# Live harness — requires ANTHROPIC_API_KEY
export ANTHROPIC_API_KEY=sk-ant-...
python3 harness.py
# Runs LeetCode problems through full pipeline with live Claude responses
Success criteria. From §11.3 of the full whitepaper: U improves across a dataset of 1,000+ problems over multiple calibration cycles; confidence scores calibrate against the actual correctness rate (Brier score < 0.15); the contradiction rate decreases measurably across cycles; and LoRA adapter deployment does not regress benchmark performance by more than 2%. The first two are already demonstrated in simulation. The third is validated on the 10-cycle stability run. The fourth requires GPU hardware to validate — Phase 6 of the roadmap.
AUA v1.0 provides the full routing, correction, and evaluation stack for code-generation agents. Configure with aua init --preset coding. The routing experiment that validated the framework ran exactly this domain on a single RTX 4090.
pip install adaptive-utility-agent
aua init my-software-engineering-agent --preset coding --tier macbook
cd my-software-engineering-agent
aua doctor
# aua_config.yaml
specialists:
  - name: software_engineering
    model: qwen-coder-7b-awq
    port: 11434
    field: software_engineering
safety:
  abstention_enabled: false
  require_arbiter_for_high_risk: true
  min_confidence_for_direct_answer: 0.70
security:
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}
audit:
  enabled: true
  hash_chain: true
Generate your encryption key with python3 -c "import os; print(os.urandom(32).hex())" or openssl rand -hex 32 — either prints the required 64-character hex string. See Tutorial §12.4 for key management.
aua serve
curl -X POST http://localhost:8000/query \
-H "Authorization: Bearer $AUA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"prompt": "...", "session_id": "demo"}'
| AUA v1.0 provides | You bring |
|---|---|
| Multi-specialist routing + utility scoring | Domain-specific specialist models |
| Arbiter + contradiction detection | Domain-specific quality criteria |
| Correction loop + DPO pair export | Fine-tuning infrastructure (TRL, Axolotl, …) |
| Blue-green deployment + rollback | Evaluation datasets for your domain |
| Append-only audit log with hash chain | Your test runner and CI pipeline |
| Prometheus + Grafana + OTEL | Your monitoring infrastructure |
Full instructions: AUA Tutorial · Framework v1.0 · GitHub ↗