Domain Deep-Dive · v1.0

Software Engineering

The framework's MVP domain — and the strongest early domain for any builder. Correctness signals are abundant, automatic, and unambiguous. Tests pass or fail. Type checkers flag errors. Complexity analysis is deterministic. The entire simulation runs here today.

Coding agent builders · Dev-tools teams · Backend & infra engineers · ML platform teams

1. Why software engineering is the framework's strongest first domain

The framework's corrective loop depends on detecting errors. Error detection depends on having a ground truth signal that is unambiguous, automatable, and available without human raters. Software engineering is the only domain that provides all three simultaneously:

Tests pass or fail

A solution that fails its own test cases is demonstrably wrong. No human rater required. The test suite is the ground truth, and it runs in seconds. Correctness is binary and automatable.

Type checkers flag errors

mypy, pyright, and TypeScript provide static analysis that catches type inconsistencies, undefined behavior, and API misuse automatically. Each run is a free confidence signal.

Complexity analysis is deterministic

AST-based nested loop counting detects claimed-vs-actual complexity contradictions without any external tool. A solution claiming O(n) with two nested loops is mathematically contradicted. The detector runs in milliseconds.

Human baseline cost is measurable

LeetCode solutions, Upwork rates, GitHub commit history — human baseline performance on code tasks is observable and benchmarkable. The efficacy comparison has a concrete reference point.

From §11.1 of the full whitepaper: "Code generation is the ideal MVP domain: correctness is binary and automatable, contradictions are formally detectable, human baseline cost is measurable, existing tooling handles scoring, and no human raters are needed — ground truth is free."

2. Utility function configuration for SWE

The software engineering field config reflects the domain's characteristics — correctness matters most, confidence is meaningful but not safety-critical, and curiosity (exploring novel approaches) has value:

# From agent/config.py
"software_engineering": FieldConfig(
    w_efficacy=0.55,      # correctness vs human baseline dominates
    w_confidence=0.35,    # internal consistency matters — wrong-but-confident is costly
    w_curiosity=0.10,     # exploration value for novel problem types
    c_min=0.70,           # must be 70%+ confident to act (not abstain)
    e_min=0.65,           # must exceed 65% of human baseline performance
    penalty_multiplier=2.0  # SWE mistakes are costly but recoverable
)

Compare the penalty multiplier across domains: surgery=10×, law=5×, software engineering=2×, creative writing=1×. A wrong answer in surgery is penalized 10× as hard as a wrong answer in creative writing at the DPO training level. Software engineering sits in the middle — errors are real (shipped bugs, security vulnerabilities, incorrect algorithms) but not irreversible in the way surgical errors are. The 2× multiplier means the model trains moderately harder against software contradictions than against low-stakes domain errors.

Efficacy weight (w_e): 0.55 (highest of the three terms — correctness first)
C_min (confidence floor): 0.70 (below this → abstain and escalate)
Penalty multiplier: 2.0× (DPO training weight for SWE contradictions)
Difficulty routing (hard): 0.85 (route to Hard problems when C_domain > 0.85)
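
As a concrete illustration of how these weights combine, the sketch below applies the SWE field values to the utility formula U = w_e·E_ema + w_c·C + w_k·K_effective used in the feedback loop of section 8. The FieldConfig fields and the 0.70 abstention floor come from the config above; the function names and the example inputs are illustrative assumptions, not the repo's implementation.

# Illustrative sketch (not the code in agent/config.py)
from dataclasses import dataclass

@dataclass
class FieldConfig:
    w_efficacy: float
    w_confidence: float
    w_curiosity: float
    c_min: float
    e_min: float
    penalty_multiplier: float

SWE = FieldConfig(0.55, 0.35, 0.10, 0.70, 0.65, 2.0)

def utility(cfg, e_ema, c, k):
    # U = w_e·E_ema + w_c·C + w_k·K_effective (see section 8, step 7)
    return cfg.w_efficacy * e_ema + cfg.w_confidence * c + cfg.w_curiosity * k

def should_abstain(cfg, c):
    # Below the confidence floor the agent abstains and escalates
    return c < cfg.c_min

print(round(utility(SWE, 0.50, 0.50, 0.10), 4))  # 0.46: low-confidence, cycle-1 territory
print(should_abstain(SWE, 0.50))                 # True: 0.50 is below c_min of 0.70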

3. The correctness signal stack

The simulation and live harness both use a multi-layer signal stack to produce the confidence and efficacy values fed into the utility function:

Receive coding problem
    │
    ├── [Syntax check]     ast.parse() → SyntaxError → confidence penalty 0.8
    │
    ├── [Test execution]   subprocess sandbox → test pass/fail → test_pass_rate
    │                      "Code fails its own stated test case" → severity 0.9
    │
    ├── [Complexity check] AST nested loop count vs. claimed complexity
    │                      O(n) claimed + 2 nested loops → severity 0.7
    │                      (This is what catches the two_sum contradiction)
    │
    ├── [Cross-session]    Check assertions store for similar problems
    │                      "Prior solution contradicts current approach" → severity 0.5
    │
    └── → ContradictionResult(
             contradictions=[...],
             confidence_penalty = min(Σseverities × 0.1 × penalty_multiplier, 0.5)
         )

Confidence update:
    penalized_signal = test_pass_rate × (1 - effective_penalty)
    C_new = 0.8 × C_prior + 0.2 × penalized_signal

The most powerful signal is the complexity contradiction check — it is formally deterministic (AST analysis is not heuristic), free to run on every output, and catches the most common class of correctness claim errors in code generation. See §6 of the Tutorial for the full implementation walkthrough.
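
The penalty aggregation and confidence update above read almost directly as code. The sketch below follows the formulas in the diagram (severity × 0.1 × penalty_multiplier, capped at 0.5, then an 80/20 EMA); the function names and example values are illustrative, not the simulation's exact internals.

# Illustrative sketch of the confidence update shown above
def effective_penalty(severities, penalty_multiplier=2.0):
    # confidence_penalty = min(sum(severities) * 0.1 * penalty_multiplier, 0.5)
    return min(sum(severities) * 0.1 * penalty_multiplier, 0.5)

def update_confidence(c_prior, test_pass_rate, severities, penalty_multiplier=2.0):
    penalized_signal = test_pass_rate * (1 - effective_penalty(severities, penalty_multiplier))
    return 0.8 * c_prior + 0.2 * penalized_signal

# One complexity contradiction (severity 0.7), all tests passing, prior C = 0.5
print(round(update_confidence(0.5, 1.0, [0.7]), 4))  # 0.572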

4. The seeded contradiction: what detection looks like in practice

The simulation seeds a deliberate complexity contradiction in the first cycle to demonstrate the full detection-correction path. This is a real example from simulate.py that you can run today:

# The seeded contradiction — cycle 1, two_sum problem
# A nested-loop solution that claims O(n) complexity

def two_sum(nums, target):
    # Time complexity: O(n)  <-- CLAIMED
    result = []
    for i in range(len(nums)):
        for j in range(len(nums)):  # ← NESTED LOOP: actually O(n²)
            if nums[i] + nums[j] == target and i != j:
                result.append((i, j))
    return result

Time complexity: O(n)   <-- STATED IN RESPONSE

The contradiction detector:

  1. Parses the code with ast.parse()
  2. Counts nested loop depth: finds 2 nested for loops
  3. Compares against the claimed complexity "O(n)": the check "o(n)" in claimed.lower() and nested_loop_count >= 2 evaluates to True
  4. Appends: Contradiction(type="mathematical", description="Claimed O(n) but code has 2 nested loop levels", severity=0.7)
  5. Confidence penalty: 0.7 × 0.1 × 2.0 (SWE multiplier) = 0.14 — 14% confidence reduction

The corrected solution in cycle 2 uses a hash map (genuine O(n)) and passes. The contradiction disappears. By cycle 3, domain confidence has risen enough to route to medium-difficulty problems.
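
The AST check in steps 1–3 is small enough to sketch in full. The helper below is an illustrative reimplementation, not the code in simulate.py: it counts the deepest for/while nesting and flags a claim of O(n) whenever two or more nested loop levels are found.

# Illustrative reimplementation of the complexity contradiction check
import ast

def max_nested_loop_depth(source):
    tree = ast.parse(source)                      # step 1: parse

    def depth(node, current=0):
        best = current
        for child in ast.iter_child_nodes(node):
            bump = 1 if isinstance(child, (ast.For, ast.While)) else 0
            best = max(best, depth(child, current + bump))
        return best

    return depth(tree)                            # step 2: count nesting

def claims_linear_but_nested(source, claimed):
    # step 3: "o(n)" claimed while the code has >= 2 nested loop levels
    return "o(n)" in claimed.lower() and max_nested_loop_depth(source) >= 2

nested = """
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(len(nums)):
            if nums[i] + nums[j] == target and i != j:
                return (i, j)
"""
print(claims_linear_but_nested(nested, "O(n)"))   # True: the severity-0.7 contradiction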

Terminal output from python3 simulate.py:

── Cycle 1 ─────────────────────────────────────────────────────────────
   Domain confidence: 0.500 → routing to 'easy' problems
   two_sum              U=0.4617 E_ema=0.5012 C=0.5000 ⚠ CONTRADICTION
   ...
   Cycle 1 summary: avg U=0.5128, contradictions=1

── Cycle 2 ─────────────────────────────────────────────────────────────
   Domain confidence: 0.612 → routing to 'easy' problems
   two_sum              U=0.5669 E_ema=0.5543 C=0.7234
   ...
   Cycle 2 summary: avg U=0.5921, contradictions=0

── Cycle 3 ─────────────────────────────────────────────────────────────
   Domain confidence: 0.819 → routing to 'medium' problems
   ...
   Cycle 3 summary: avg U=0.6288 (+0.1160 from cycle 1), contradictions=0

5. Simulation results: what the framework achieves today

The extended simulation (500-task two-arm study, 5 cycles, 25 problem types, 11 algorithm families) is the primary empirical validation of the framework. Full data in agent/extended_output/extended_results.json. Full analysis in Appendix A of the full whitepaper.

Repeated error reduction: 69.6% (14 vs 46 repeated errors, cycles 2–5)
Average utility improvement: +0.1160 (cycle 1 → cycle 3, 0.5128 → 0.6288)
Brier score improvement: 14.3% (0.2226 vs 0.2597 overall; 29.5% by cycle 5)
U ↔ correctness correlation: r = 0.461 (p < 10⁻⁴⁰ — U is a real quality signal)
Cycle   Avg U    Avg E_ema   Avg C    Contradictions   Agent Brier   Baseline Brier
1       0.5128   0.5124      0.6005   1                0.3279        0.3502
2       0.5921   0.5542      0.8128   0                0.2177        0.2520
3       0.6288   0.5740      0.8940   0                0.2464        0.2860

Extended simulation: 5-cycle two-arm comparison. Full data: extended_results.json. See Appendix A for full statistical analysis.

The 10-cycle stability run confirms: contradiction rate falls from 22% to 6% (73% reduction), and Brier score reaches 0.049 by cycle 7 — well-calibrated confidence in the software engineering domain.
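
For reference, the Brier score used throughout these results is the mean squared gap between stated confidence and the 0/1 correctness outcome (lower is better). The numbers below are illustrative, not taken from the simulation runs.

# Brier score: mean squared gap between confidence and 0/1 outcome
def brier(confidences, outcomes):
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# A well-calibrated agent is confident when right and hesitant when wrong
print(round(brier([0.9, 0.8, 0.3], [1, 1, 0]), 3))  # 0.047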

6. Routing experiment: correctness gains from specialist routing

A separate four-arm routing quality study measured the contribution of correct domain routing to code quality, independent of model size. The experiment used 200 tasks, 25 problem types, 11 algorithm families. All four arms used the same task plans; quality parameters were derived from published domain benchmarks. Full data: routing_results.json.

Arm   Description                                             Correctness   Δ vs baseline   Brier    p-value
A     No routing — single generic prompt                      59.0%         —               0.1596   —
B     Matched routing (oracle) — correct specialist always    71.5%         +12.5%          0.1062   0.009
C     Mismatched routing — wrong specialist                   41.5%         −17.5%          0.2919   <0.001
D     VCG arbitration — probabilistic fan-out                 69.5%         +10.5%          0.1097   0.029

Run with: cd agent && python3 routing_experiment.py. Full data in routing_output/. See §10.9.4 for the full analysis.

Three findings matter for coding agent builders:

  1. Correct routing contributes +12.5% correctness through prompt specialization alone — before any weight-level fine-tuning. A domain-specialized system prompt that frames the coding problem in the specialist's context improves output quality measurably.
  2. Mismatched routing (Arm C, −17.5%) is actively worse than no routing. A coding specialist asked to solve a problem outside its domain produces confidently wrong answers (Brier 0.2919 vs. 0.1596). If you build routing, you must confidence-gate it — routing on domain classification alone without confidence gating is the Regime 2 failure mode.
  3. VCG arbitration captures 84% of the oracle gain (+10.5%) with a p=0.029 result. The gap to the oracle (2.0pp) is not statistically significant (p=0.66). Practical routing with VCG arbitration is essentially as good as perfect routing on these tasks.
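
Finding 2 implies a simple guard: hand the task to the classified specialist only when domain confidence clears a gate, otherwise fall back to the generic prompt. The sketch below is one illustrative shape for that guard; the 0.70 gate mirrors the SWE c_min and is an assumption, not a value from routing_experiment.py.

# Illustrative confidence-gated routing guard
def route(domain, domain_confidence, gate=0.70):
    if domain_confidence >= gate:
        return f"specialist:{domain}"   # matched-routing behaviour (Arms B/D)
    return "generic"                    # avoid the Arm C failure mode

print(route("software_engineering", 0.91))  # specialist:software_engineering
print(route("software_engineering", 0.42))  # generic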

7. Dynamic difficulty routing as confidence rises

As the system accumulates correct solutions and domain confidence rises, it automatically routes to harder problems. From agent/utility_scorer.py:

DIFFICULTY_THRESHOLDS = {
    "hard":   0.85,   # route to Hard when domain C > 0.85
    "medium": 0.70,   # route to Medium when domain C > 0.70
    # below 0.70 → Easy problems
}

def _recommended_difficulty(self, confidence: float) -> str:
    if confidence >= DIFFICULTY_THRESHOLDS["hard"]:   return "hard"
    if confidence >= DIFFICULTY_THRESHOLDS["medium"]: return "medium"
    return "easy"

This is visible in the simulation output: cycle 1 routes to easy problems (confidence 0.500), cycle 3 routes to medium problems (confidence 0.819). The progression is automatic — the system challenges itself as it improves, resetting the novelty counter and re-engaging the curiosity dynamics that drive exploration of new problem types.

For coding agent builders, this produces a useful operational property: the agent self-calibrates the difficulty of problems it attempts based on its demonstrated competence. An agent that has proven it can reliably solve easy-to-medium problems on a codebase will escalate to harder refactoring tasks. One that has not yet demonstrated reliability stays on simpler tasks until it has built the confidence track record.

8. The 10-step MVP feedback loop

From §11.2 of the full whitepaper — the complete SWE feedback loop as it runs in production:

1. Receive coding problem — problem statement, constraints, language.
2. Field classifier → "software_engineering" — load field-specific weights (w_e=0.55, c_min=0.70, penalty=2.0).
3. Query assertions store — inject relevant prior corrections into system prompt. Prior: "This problem type requires hash map, not nested loops. O(n²) solutions flagged."
4. Build system prompt — domain context, confidence minimum (70%), active corrections, personality wrapper (if active).
5. Call LLM → get solution — in simulation mode, synthetic solution used. In live mode, Claude via harness.py.
6. Automated scoring — test pass/fail → confidence signal; AST complexity check → contradiction check; problem novelty → curiosity signal; human benchmark → efficacy signal.
7. Score U = w_e·E_ema + w_c·C + w_k·K_effective — utility score computed and stored.
8. Store DPO pair if contradiction detected — (rejected=wrong solution, weight=2.0) accumulated for calibration. Behavioral correction added to active corrections list.
9. Update assertions store — store verified correction with problem context, timestamp, decay class (algorithm_correctness = Class A — never expires).
10. Every N interactions → calibration run — DPO pairs (preferred, rejected) weighted by 2.0×, mixed with replay buffer, LoRA fine-tuning, benchmark gate → deploy adapter.
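
Steps 8 and 9 boil down to two small records. The sketch below shows one plausible shape for them; the field names and values are illustrative assumptions, not the repo's schema.

# Illustrative records for steps 8–9 (field names are assumptions)
import json, time

dpo_pair = {
    "prompt": "two_sum: return indices of two numbers summing to target",
    "chosen": "hash-map solution, genuine O(n)",
    "rejected": "nested-loop solution claiming O(n)",
    "weight": 2.0,                       # SWE penalty_multiplier
}

assertion = {
    "domain": "software_engineering",
    "problem_type": "two_sum",
    "correction": "This problem type requires a hash map, not nested loops.",
    "decay_class": "A",                  # algorithm_correctness: never expires
    "timestamp": time.time(),
}

# Both append to the stores consumed by the calibration run in step 10
print(json.dumps(dpo_pair, indent=2))
print(json.dumps(assertion, indent=2))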

9. Extensions: multi-agent, dev-tools, infra engineering

Multi-agent coding systems

In Phase 2, the specialist decomposition splits the software engineering domain into sub-specialists: algorithm correctness specialist, API usage specialist, system design specialist, security/vulnerability specialist. Each has its own corrections store and calibration cycle. The Arbiter resolves conflicts when the algorithm specialist and the security specialist disagree on an implementation approach — routing the question through the 4-check evidence pipeline rather than silently picking one output.

Dev-tools integration

The framework's contradiction detector integrates naturally with existing dev toolchains — pytest, mypy, ruff, bandit, complexity analyzers. Any tool that produces a pass/fail or score output on a code snippet can be wired into the contradiction detection or efficacy scoring pipeline. The framework does not require replacing existing toolchains; it wraps them as signal sources.
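
A minimal way to wire such tools in, assuming they are installed and on PATH, is to treat each exit code as a pass/fail signal. The wrapper below is an illustrative sketch, not the framework's adapter API.

# Illustrative sketch: existing tools as pass/fail signal sources
import subprocess

def tool_signal(cmd):
    # Exit code 0 → 1.0, anything else → 0.0
    result = subprocess.run(cmd, capture_output=True, text=True)
    return 1.0 if result.returncode == 0 else 0.0

signals = {
    "tests": tool_signal(["pytest", "-q", "tests/"]),
    "types": tool_signal(["mypy", "src/"]),
    "lint":  tool_signal(["ruff", "check", "src/"]),
}
efficacy_signal = sum(signals.values()) / len(signals)   # feed into efficacy scoring
print(signals, efficacy_signal)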

Infra and backend engineering

For infrastructure code — Kubernetes configs, Terraform plans, CI/CD pipelines — the contradiction detection approach extends: a config that claims to implement zero-downtime deployment but has no health check endpoint is a logical contradiction; a Terraform plan that claims to create a VPC with the specified CIDR but uses an overlapping range is a mathematical contradiction. The AST-based analysis generalizes to any structured artifact with a declared intent and a verifiable implementation.
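
As a toy example of that generalization, assuming PyYAML is available, a checker could compare a manifest's declared intent against the presence of a readiness probe. Everything below is illustrative and is not part of the repo.

# Toy intent-vs-implementation check for a Kubernetes manifest (illustrative)
import yaml

manifest = yaml.safe_load("""
metadata:
  annotations:
    intent: zero-downtime deployment
spec:
  template:
    spec:
      containers:
        - name: api
          image: example/api:1.2.3
""")

claims_zero_downtime = "zero-downtime" in manifest["metadata"]["annotations"].get("intent", "")
containers = manifest["spec"]["template"]["spec"]["containers"]
has_readiness_probe = any("readinessProbe" in c for c in containers)

if claims_zero_downtime and not has_readiness_probe:
    print("Logical contradiction: zero-downtime claimed, no readinessProbe defined")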

Security-critical subdomains

Security engineering warrants a higher penalty multiplier (closer to law=5× than standard SWE=2×) and a tighter C_min. A code generation agent that produces SQL injection vulnerabilities while claiming sanitized inputs is producing a contradiction with security consequences that are more severe than an inefficient algorithm. The field config can be tuned per subdomain without changing the correction loop architecture.

10. Running it today

# Clone the repo
git clone https://github.com/praneethtota/Adaptive-Utility-Agent.git
cd Adaptive-Utility-Agent/agent
pip install numpy scipy matplotlib

# Run the 3-cycle simulation — no API key, no GPU
python3 simulate.py
# Expected: contradiction detected cycle 1, resolved cycle 2,
# +0.1160 utility improvement across 3 cycles

# Run the extended 500-task two-arm study
python3 simulate_extended.py
# Expected: 69.6% repeated error reduction over 5 cycles

# Run the four-arm routing quality experiment
python3 routing_experiment.py
# Expected: Arm D (VCG) +10.5% vs Arm A, Arm C (mismatched) -17.5%
# Outputs: routing_output/routing_results.json + 4 PNG figures

# Live harness — requires ANTHROPIC_API_KEY
export ANTHROPIC_API_KEY=sk-ant-...
python3 harness.py
# Runs LeetCode problems through full pipeline with live Claude responses

Success criteria. From §11.3 of the full whitepaper: U improves across a dataset of 1,000+ problems across multiple calibration cycles; confidence scores calibrate with actual correctness rate (Brier score < 0.15); contradiction rate decreases measurably across cycles; LoRA adapter deployment does not regress benchmark by more than 2%. The first two are already demonstrated in simulation. The third is validated on the 10-cycle stability run. The fourth requires GPU hardware to validate — Phase 6 of the roadmap.


Build it with AUA v1.0

Configure this domain today

AUA v1.0 provides the full routing, correction, and evaluation stack for code-generation agents. Configure with aua init --preset coding. The routing experiment that validated the framework ran exactly this domain on a single RTX 4090.

1. Install

pip install adaptive-utility-agent

2. Scaffold for this domain

aua init my-software-engineering-agent --preset coding --tier macbook
cd my-software-engineering-agent
aua doctor

3. Key config for this domain

# aua_config.yaml
specialists:
  - name: software_engineering
    model: qwen-coder-7b-awq
    port: 11434
    field: software_engineering

safety:
  abstention_enabled: false
  require_arbiter_for_high_risk: true
  min_confidence_for_direct_answer: 0.70

security:
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}

audit:
  enabled: true
  hash_chain: true

Generate your encryption key: python3 -c "import os; print(os.urandom(32).hex())" or openssl rand -hex 32 — 64-char hex string. See Tutorial §12.4 for key management.

4. Start and query

aua serve

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "...", "session_id": "demo"}'

5. What AUA handles vs. what you bring

AUA v1.0 provides                            You bring
Multi-specialist routing + utility scoring   Domain-specific specialist models
Arbiter + contradiction detection            Domain-specific quality criteria
Correction loop + DPO pair export            Fine-tuning infrastructure (TRL, Axolotl, …)
Blue-green deployment + rollback             Evaluation datasets for your domain
Append-only audit log with hash chain        Your test runner and CI pipeline
Prometheus + Grafana + OTEL                  Your monitoring infrastructure

Full instructions: AUA Tutorial · Framework v1.0 · GitHub ↗