Praneeth Tota · Illinois Institute of Technology · v1.0.0
The agent's utility at any point in time is:
U = Σ_{tasks} [ w_e(f) · E(task) + w_c(f) · C(task) + w_k(f) · K(task) ]
Subject to:
C(task) ≥ C_min(f)
E(task) ≥ E_min(f)
w_k · K ≤ 0.5 × U_total [curiosity cap — see §4.4]
Where:
- E(task) — Efficacy: how well the agent performs relative to human baseline
- C(task) — Confidence: internal consistency score, penalized by contradictions
- K(task) — Curiosity: exploration bonus for low-confidence domains with high upside
- f — field/domain, which determines weights and minimum bounds
- w_e, w_c, w_k — field-specific weights summing to 1
The additive weighted structure of U is not a convenience — it is the unique functional form satisfying separability, monotonicity, continuity, field invariance, and linear scaling invariance. Formal derivation via Debreu's theorem and the Cauchy functional equation: Appendix B, Theorem B.1.
Constrained maximization and the optimization geometry.
$U$ is the objective being maximized — through DPO calibration, behavioral correction, and deployment gating — subject to the three constraints above. The full decision rule is:
$$\text{act} = \begin{cases} \arg\max\; U & \text{if } C \geq C_{\min}(f) \;\text{and}\; E \geq E_{\min}(f) \\ \text{abstain / escalate} & \text{otherwise} \end{cases}$$

The constraints are not soft penalties — they are hard gates. An agent below $C_{\min}(f)$ does not produce a lower-utility answer; it produces no answer and escalates. In surgery or aviation, a confidently wrong answer is more dangerous than an acknowledged abstention.
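A minimal sketch of the gate in Python — `FieldConfig` and `decide` are illustrative names; the surgery bounds come from the field parameter table later in this section:

```python
from dataclasses import dataclass

@dataclass
class FieldConfig:
    w_e: float
    w_c: float
    w_k: float
    c_min: float
    e_min: float

def decide(E: float, C: float, K_effective: float, cfg: FieldConfig):
    """Hard-gated decision rule: act only when both competence gates pass."""
    if C >= cfg.c_min and E >= cfg.e_min:
        U = cfg.w_e * E + cfg.w_c * C + cfg.w_k * K_effective
        return ("act", U)
    # Below either bound: no lower-utility answer, only abstention/escalation.
    return ("abstain_escalate", None)

# Surgery: C = 0.88 is below C_min = 0.95, so the agent abstains.
surgery = FieldConfig(w_e=0.20, w_c=0.70, w_k=0.10, c_min=0.95, e_min=0.90)
print(decide(E=0.92, C=0.88, K_effective=0.05, cfg=surgery))
```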
The curiosity cap creates a useful joint incentive structure. Because $K$ is bounded by $(w_e E + w_c C)/w_k$, improving $E$ and $C$ simultaneously loosens the cap — a more competent, consistent agent is permitted more exploration. The optimization landscape therefore rewards joint improvement across all three components rather than trading one off against another. An agent that games curiosity by keeping $C$ low to inflate $K$ finds that the cap tightens as $C$ falls, automatically reducing the curiosity term it was trying to exploit. The cap is self-enforcing: no external monitor is needed to prevent gaming.
In the unconstrained objective, the gradient of $U$ is positive in all three component directions, so genuine improvement in efficacy, confidence, or curiosity raises utility. Under the hard competence constraints and curiosity cap, feasible optimization still encourages joint improvement, but the full constrained landscape should be understood as structured by those gates rather than as a globally unconstrained surface.
Efficacy measures output quality relative to the current human baseline for that task:
E(task) = quality(agent_output) / cost(human_equivalent)
Cost of human equivalent includes: time, dollar cost, error rate, and reproducibility. In STEM fields this is directly measurable. In creative fields, a two-component model is used (see below).
STEM efficacy (code generation MVP):
- Test pass rate against automated test suites
- Code quality metrics (complexity, type safety, static analysis)
- Time and cost vs. human developer benchmark (Upwork rates, LeetCode community solutions)
Creative efficacy — two-component model:
In STEM fields, efficacy and skill are the same thing — write correct code, done. In creative fields they are separable and both matter independently:
Creative_Efficacy = Content_Efficacy × Discoverability_Efficacy
Content_Efficacy = conversion rate (engagement given views)
"can the work hold attention when shown?"
Discoverability_Efficacy = impressions, search ranking, recommendation rate
"can it find an audience at all?"
Marketing and platform discoverability are not noise to control for — they are part of the creative skill, exactly as they are for every successful human creator. If the agent cannot crack discoverability, that is a genuine efficacy gap.
The measurement approach: float AI creative work on existing public platforms (SoundCloud, iStockPhoto, Unsplash, Medium, YouTube, Behance) under realistic author identities. Platform engagement signals are weighted by intent strength:
purchase / download   1.0   (strongest — real economic behavior)
save / bookmark       0.8
share / repost        0.7
like / upvote         0.5
comment               0.4
view / listen         0.1   (weakest — could be accidental)
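A minimal sketch of intent-weighted engagement scoring under these weights. The per-view normalization and the event names are assumptions; the spec fixes only the weight ordering:

```python
# Intent-strength weights from the table above.
INTENT_WEIGHTS = {
    "purchase": 1.0, "download": 1.0,
    "save": 0.8, "bookmark": 0.8,
    "share": 0.7, "repost": 0.7,
    "like": 0.5, "upvote": 0.5,
    "comment": 0.4,
    "view": 0.1, "listen": 0.1,
}

def engagement_score(events: dict[str, int]) -> float:
    """Intent-weighted engagement per view (a conversion-style ratio)."""
    views = max(events.get("view", 0) + events.get("listen", 0), 1)
    weighted = sum(INTENT_WEIGHTS.get(kind, 0.0) * count
                   for kind, count in events.items())
    return weighted / views

# 1,000 views, 40 likes, 12 downloads, 5 shares -> 0.1355
print(engagement_score({"view": 1000, "like": 40, "download": 12, "share": 5}))
```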
Stock licensing sites (iStockPhoto, Unsplash) provide the cleanest signal — a download represents real economic intent with no algorithmic amplification distortion and built-in category taxonomy for like-for-like comparison. The human creator baseline is self-updating: as human creative output on these platforms evolves, so does the benchmark, with no periodic re-calibration needed.
This formulation also unlocks cross-field efficacy comparison for the first time. If music efficacy is 0.60 and software engineering efficacy is 0.70, both are on the same [0,1] scale and directly comparable, because both use the same sigmoid normalization against their respective baselines.
The sigmoid form E(r) = r/(1+r) is not arbitrary — under the log-logistic performance model it equals the Mann-Whitney probability that agent output dominates human baseline output. Formal derivation: Appendix B, Proposition B.3.
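As a sketch, the ratio definition from above composes with this normalization in two lines:

```python
def efficacy(agent_quality: float, human_cost: float) -> float:
    """E = r / (1 + r), r = quality(agent_output) / cost(human_equivalent).

    r = 1 (parity with the human baseline) maps to E = 0.5;
    r -> infinity maps to E -> 1, keeping every field on [0, 1).
    """
    r = agent_quality / human_cost
    return r / (1.0 + r)

print(efficacy(agent_quality=1.4, human_cost=1.0))  # ~0.583, above parity
```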
The agent develops creative capability through a natural curriculum:
Stage 1 → Generate work that converts well (content quality)
Stage 2 → Learn to title, tag, describe effectively (discoverability)
Stage 3 → Build cross-platform presence (network effects, retention)
Confidence is a per-domain score that increases when knowledge is internally consistent and decreases when contradictions are detected:
C(domain) updated via EMA:
C_new = (1 - α) · C_prior + α · (test_pass_rate · (1 - penalty))
penalty = contradiction_penalty × field_penalty_multiplier
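A minimal sketch of the update. Clamping the combined penalty at 1 is an assumption (the spec does not state what happens when the multiplier pushes it past 1):

```python
def update_confidence(c_prior: float, test_pass_rate: float,
                      contradiction_penalty: float,
                      field_penalty_multiplier: float,
                      alpha: float = 0.2) -> float:
    """C_new = (1 - alpha) * C_prior + alpha * (test_pass_rate * (1 - penalty))."""
    penalty = min(contradiction_penalty * field_penalty_multiplier, 1.0)
    signal = test_pass_rate * (1.0 - penalty)
    return (1.0 - alpha) * c_prior + alpha * signal

# The 10x surgery multiplier turns a mild 0.05 contradiction penalty into
# 0.5, dragging confidence from 0.80 down to ~0.735 in a single update.
print(update_confidence(0.80, 0.95, 0.05, 10.0))
```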
The wave analogy: knowledge items that reinforce each other are like constructive interference — they increase signal strength. Contradictions are destructive interference. The goal is a knowledge state where all waves reinforce.
The EMA update with α = 0.2 is the Kalman-optimal estimator of latent domain confidence when process noise is 5% of observation noise — a well-founded choice for incremental calibration. Under stationary signals, confidence converges geometrically to the field-specific steady state C = s̄(1 − λμ(f)) with half-life ≈ 3 interactions. Formal derivations: Appendix B, Theorem B.4 (Kalman optimality) and Appendix B, Theorem B.5 (convergence and recovery).
Contradiction types (in order of detectability):
1. Logical — output contradicts its own stated premises
2. Mathematical — claimed complexity or result is provably wrong
3. Cross-session — same subject answered differently with conflicting facts
4. Empirical — output contradicts verifiable external ground truth
Cross-session contradiction detection does not require a knowledge graph. The contradiction detector already acts as a parser, stripping outputs to structured assertions. These are persisted in a "meeting minutes" store — only structured facts, not raw text:
{
  session_id, timestamp, domain,
  assertions: [
    { type: "complexity", subject: "sorting", value: "O(n log n)" },
    { type: "best_practice", subject: "db autocommit", value: "avoid" },
    { type: "data_structure", subject: "two_sum", value: "hashmap" }
  ]
}
At the start of each session, relevant prior assertions are retrieved via embedding similarity (handling synonyms like "merge sort" vs "sorting algorithm") and injected as context. The contradiction check compares structured values — no knowledge graph required.
Parser (built) → extracts structured assertions
↓
Key-value store → persists by subject + domain
↓
Embedding similarity → synonym matching at lookup
↓
Contradiction check (built) → compares structured values
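A sketch of the structured comparison. `similarity` stands in for the embedding-based subject matcher, and the 0.8 match threshold is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    type: str      # e.g. "complexity"
    subject: str   # e.g. "sorting"
    value: str     # e.g. "O(n log n)"

def find_contradictions(new: Assertion, prior: list[Assertion],
                        similarity) -> list[Assertion]:
    """Return persisted assertions whose values conflict with the new one."""
    return [old for old in prior
            if old.type == new.type
            and similarity(old.subject, new.subject) > 0.8
            and old.value != new.value]

same = lambda a, b: 1.0 if a == b else 0.0  # stand-in for embedding similarity
prior = [Assertion("complexity", "sorting", "O(n log n)")]
print(find_contradictions(Assertion("complexity", "sorting", "O(n^2)"),
                          prior, same))  # -> the conflicting prior claim
```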
Without a curiosity term, the agent converges to maximizing utility in narrow high-confidence domains — the opposite of growth. The curiosity term creates pull toward domains where the agent is weak but upside is high.
Growing curiosity function:
K_raw(task, t) = potential_ceiling
                 × (1 - C(task))
                 × growth(t, field)
growth(t, field) = 1 + α(field) × log(1 + interactions_without_novelty)
The counter `interactions_without_novelty` increments on familiar problems and resets to zero on genuinely novel ones. The growth rate α is field-specific:
α → near zero for surgery, aviation (don't get bored into novel procedures)
α → high for research, creative (exploration is the job)
50% curiosity cap — preventing utility gaming:
Without a cap, the agent could learn to pursue novelty for its own sake, attempting tasks outside its competence to generate a curiosity score rather than genuine utility. The cap prevents this:
K_effective = min(K_raw, (w_e · E + w_c · C) / w_k)
U = w_e · E + w_c · C + w_k · K_effective
Derived from the constraint that K cannot exceed 50% of total U:
w_k · K ≤ 0.5 × (w_e · E + w_c · C + w_k · K)
→ K ≤ (w_e · E + w_c · C) / w_k
This constraint is self-scaling: when E and C are high, the cap is loose and curiosity can push hard. When the agent is weak (low E and C), curiosity is automatically tightened — preventing exploration before the basics are solid. K can never be the dominant term.
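A sketch putting the growth term and the cap together (`math.log1p` computes log(1 + n)):

```python
import math

def curiosity_effective(potential_ceiling: float, C: float, E: float,
                        interactions_without_novelty: int,
                        alpha_field: float,
                        w_e: float, w_c: float, w_k: float) -> float:
    """K_effective = min(K_raw, (w_e*E + w_c*C) / w_k)."""
    growth = 1.0 + alpha_field * math.log1p(interactions_without_novelty)
    k_raw = potential_ceiling * (1.0 - C) * growth
    cap = (w_e * E + w_c * C) / w_k   # the 50% constraint, rearranged
    return min(k_raw, cap)

# A weak agent (E = C = 0.2) after 50 stale interactions: K_raw ~ 2.37,
# but the cap binds at 0.8 — curiosity cannot dominate until E and C rise.
print(curiosity_effective(1.0, C=0.2, E=0.2, interactions_without_novelty=50,
                          alpha_field=0.5, w_e=0.5, w_c=0.3, w_k=0.2))
```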
The curiosity term is UCB-inspired: structurally analogous to the Upper Confidence Bound exploration bonus, with uncertainty-driven, concave-in-familiarity growth. The 50% cap is proved to enforce exploitation dominance — curiosity contributes at most half of total utility at all times. Formal derivation: Appendix B, Proposition B.6.
Minimum competence thresholds need not be invented. Society has already done this work. Medical licensing, aviation certification, bar passage requirements, and engineering standards encode hard-won judgments about minimum acceptable performance. We map these directly to confidence and efficacy bounds.
The weight vector w(f) = (w_e, w_c, w_k) is set proportional to the cost of each error type in field f: w_i(f) ∝ c_i(f), normalized to sum to 1. This aligns the gradient of U with domain-specific risk profiles. Formal grounding: Appendix B, §B.2.
Field                 w_e    w_c    w_k    C_min   E_min   Penalty
──────────────────────────────────────────────────────────────────
Surgery               0.20   0.70   0.10   0.95    0.90    10×
Aviation autopilot    0.20   0.70   0.10   0.95    0.90    10×
Law                   0.30   0.60   0.10   0.85    0.80    5×
Structural Eng.       0.40   0.50   0.10   0.80    0.75    4×
Software Eng.         0.55   0.35   0.10   0.70    0.65    2×
STEM Research         0.50   0.30   0.20   0.65    0.60    2×
Education             0.50   0.30   0.20   0.60    0.55    1.5×
Art / Music           0.80   0.10   0.10   0.10    0.20    1×
Creative Writing      0.80   0.05   0.15   0.05    0.15    1×
The field classifier determines which weights and bounds apply. Three failure modes must be handled:
Failure mode 1 — False collapse to a single field:
"Write a Python script to analyze patient drug dosage data"
→ naive: {"software_engineering": 0.95, "medicine": 0.05} ← wrong
→ correct: {"software_engineering": 0.65, "medicine": 0.35}
Resolution: high-stakes floor — any high-stakes field (surgery, aviation, law) with meaningful presence is floored at 0.15 minimum probability. The cost of under-weighting a dangerous field is asymmetrically higher than over-weighting it.
Failure mode 2 — Field drift mid-conversation:
A conversation starting as software engineering may drift into medicine across turns. Single-turn classification misses this.
Resolution: sliding window EMA over conversation turn history (α = 0.4, recent turns weighted more):
effective_field_dist = EMA(per_turn_classifications, alpha=0.4)

Bounds tighten naturally as a conversation drifts into higher-stakes territory.
Failure mode 3 — Genuine ambiguity:
Resolution: entropy-based conservative fallback — when distribution entropy is high, bounds shift toward the most conservative present field proportional to entropy. High entropy means more caution, not averaging toward the middle:
if entropy_ratio > 0.7:
    c_min → lerp(c_min_blended, c_min_most_conservative, entropy_ratio)
Full classification pipeline:
Per-turn classifier
↓
High-stakes floor enforcement
↓
Sliding window EMA over conversation history
↓
Entropy check → conservative bound shift if ambiguous
↓
Blended FieldConfig with hardened bounds
When field classification is ambiguous across multiple fields, bounds are blended by probability weight — naturally making the agent more conservative under uncertainty:
C_min_effective = Σ_f P(field=f) × C_min(f)
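A sketch combining all three resolutions into one hardened bound. The C_min values follow the field table (with "medicine" mapped to the surgery-tier bound as an assumption), and the 0.05 presence threshold for the floor is also an assumption:

```python
import math

C_MIN = {"software_engineering": 0.70, "medicine": 0.95}
HIGH_STAKES = {"surgery", "aviation", "law", "medicine"}

def effective_c_min(dist: dict[str, float]) -> float:
    """Blend C_min by field probability, then apply the high-stakes floor
    and the entropy-proportional shift toward the most conservative field."""
    # Failure mode 1: floor any meaningfully present high-stakes field at 0.15.
    dist = {f: max(p, 0.15) if f in HIGH_STAKES and p > 0.05 else p
            for f, p in dist.items()}
    total = sum(dist.values())
    dist = {f: p / total for f, p in dist.items()}

    blended = sum(p * C_MIN[f] for f, p in dist.items())
    most_conservative = max(C_MIN[f] for f in dist)

    # Failure mode 3: high entropy shifts the bound toward the strictest field.
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    entropy_ratio = entropy / math.log(len(dist)) if len(dist) > 1 else 0.0
    if entropy_ratio > 0.7:
        return blended + entropy_ratio * (most_conservative - blended)
    return blended

print(effective_c_min({"software_engineering": 0.65, "medicine": 0.35}))
```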
Each personality trait is represented as a score with associated advantages and disadvantages. The agent selects an active trait weighting based on the situation:
Traits: [curiosity, caution, assertiveness, creativity,
         analytical_rigor, empathy, conciseness]
active_weight = softmax(trait_scores × situational_relevance)
A medical query activates high caution and analytical rigor. A creative brainstorm activates curiosity and creativity. The trait weighting is injected into the system prompt.
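A sketch of the activation step (the score and relevance vectors are illustrative; only the softmax form is fixed by the spec):

```python
import numpy as np

TRAITS = ["curiosity", "caution", "assertiveness", "creativity",
          "analytical_rigor", "empathy", "conciseness"]

def active_weights(trait_scores: np.ndarray,
                   situational_relevance: np.ndarray) -> np.ndarray:
    """softmax(trait_scores * situational_relevance), numerically stable."""
    logits = trait_scores * situational_relevance
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

scores    = np.array([0.5, 0.8, 0.4, 0.3, 0.9, 0.6, 0.5])
relevance = np.array([0.2, 1.0, 0.3, 0.1, 1.0, 0.5, 0.4])  # medical query
print(dict(zip(TRAITS, active_weights(scores, relevance).round(3))))
```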
Purpose — interim behavioral control. In the monolithic setting, weight-level correction is constrained by retraining latency and global weight interference. The personality evolution layer is not a substitute for weight-level learning — it is an interim behavioral control mechanism that biases the agent toward safer operating regimes between calibration cycles, reducing repeated hallucinations until the Micro-Expert Architecture enables fast, isolated domain retraining.
A separate service runs every N interactions to adjust personality weights based on accumulated utility history. Three layered safeguards prevent runaway drift:
Layer 1 — Field-specific trait bounds (hard floor and ceiling):
Trait              Surgery         Software Eng    Creative
─────────────────────────────────────────────────────────────
caution            [0.70, 0.95]    [0.30, 0.70]    [0.10, 0.40]
curiosity          [0.10, 0.20]    [0.30, 0.80]    [0.60, 0.95]
assertiveness      [0.20, 0.40]    [0.40, 0.80]    [0.50, 0.90]
analytical_rigor   [0.70, 0.95]    [0.50, 0.85]    [0.10, 0.50]
creativity         [0.10, 0.20]    [0.30, 0.70]    [0.70, 0.95]
The floor prevents complete suppression of any trait. The ceiling prevents pathological dominance. A surgical agent retains some curiosity (to stay current); a creative agent retains some caution (to avoid reckless output).
Layer 2 — Drift rate cap (max delta per evolution cycle):
max_delta = 0.05 (general fields)
= 0.02 (high-stakes: surgery, aviation)
A single bad run of contradictions cannot spike caution to its ceiling in one step. Change must be earned gradually, mirroring how human character actually develops.
Layer 3 — Mean reversion (soft pull toward field baseline):
Δ_adjusted = Δ_raw - β × (current_score - neutral_score(trait, field))
With β = 0.01, this creates a gentle pull back toward the field's natural personality baseline between cycles — mirroring how human temperament tends to revert after stress.
The three-layer evolution rules produce bounded, stable dynamics: the trait vector remains in the field-specific feasible set B at all times, converges geometrically to the neutral point s when drift is absent (rate (1−β)² ≈ 0.980 per cycle, half-life ≈ 34 cycles), and is confined to B under persistent bounded drift. The mean reversion term is a regulariser; the projection Π_B is the primary stability mechanism. Formal derivation: Appendix B, Theorem B.7.
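A sketch of one evolution cycle applying the three layers in sequence; the neutral score and the example delta are illustrative:

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def evolve_trait(score: float, delta_raw: float,
                 lo: float, hi: float, neutral: float,
                 max_delta: float = 0.05, beta: float = 0.01) -> float:
    """Layer 3 (mean reversion), Layer 2 (drift cap), Layer 1 (projection)."""
    delta = delta_raw - beta * (score - neutral)   # soft pull to baseline
    delta = clamp(delta, -max_delta, max_delta)    # change must be earned
    return clamp(score + delta, lo, hi)            # field-specific bounds

# Surgery caution with the high-stakes 0.02 cap: a raw spike of +0.30
# moves the trait by at most 0.02 in one cycle.
print(evolve_trait(0.80, delta_raw=0.30, lo=0.70, hi=0.95,
                   neutral=0.82, max_delta=0.02))  # -> 0.82
```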
Evolution logic:

if utility_trend declining AND contradiction_rate > 0.2:
    increase(analytical_rigor), decrease(assertiveness)
if utility_trend improving AND avg_utility > 0.6:
    increase(curiosity), increase(creativity)
if contradiction_rate > 0.4:
    strong increase(caution), strong decrease(assertiveness)
The personality system operates as a wrapper around the generation process — not as a modification to the utility function. The utility function $U(E, C, K; f)$ is unchanged and all theorems in Appendix B apply exactly as stated. Personality shapes what the model generates; utility evaluates what was generated. These are cleanly separated.
Let $\mathcal{X}$ be the space of inputs and $\Omega$ the space of outputs. The base LLM defines:
$$p_{\text{base}}(\omega \mid x), \quad x \in \mathcal{X},\; \omega \in \Omega$$
For each utility-coupling trait $j$, define a trait scoring function $\phi_j : \Omega \to \mathbb{R}$:
| Trait $j$ | $\phi_j(\omega)$ measures | High $s_j$ effect |
|---|---|---|
| caution | Hedging density — appropriate uncertainty expressions | More hedging, lower assertion confidence |
| assertiveness | Directness — definitive claims without qualification | More direct, fewer qualifications |
| curiosity | Exploration — alternative approaches considered | Wider search over solution space |
| analytical_rigor | Reasoning depth — intermediate steps shown | More explicit reasoning chains |
| creativity | Novelty — solutions diverging from prior examples | Higher variance in solution approach |
The combined trait influence given personality state $s_t$:
$$\phi(s_t,\, \omega) = \sum_{j} (s_{t,j} - s^*_j) \cdot \phi_j(\omega)$$
Only deviations from the field-neutral point $s^*$ produce an effect. The personality-wrapped distribution is the log-linear perturbation of the base:
$$\log p_{\text{eff}}(\omega \mid x,\, s_t) = \log p_{\text{base}}(\omega \mid x) + \lambda \cdot \phi(s_t, \omega) - \log Z(x, s_t)$$
where $\lambda > 0$ is the wrapper strength and $Z(x, s_t)$ is the normalizing constant. This is a standard exponential family tilt — the same family as RLHF reward shaping and DPO.
In practice, $p_{\text{eff}}$ is approximated via system prompt injection. Each trait has a linguistic encoding $\delta_j$:
δ_caution = "Express appropriate uncertainty. Do not assert claims
you cannot verify. Prefer 'I am not certain' over
confident statements when confidence is below threshold."
δ_assertiveness = "State conclusions directly. Avoid unnecessary hedging
on verified facts."
δ_curiosity = "Consider alternative approaches before committing.
Note when a problem may have multiple valid solutions."
δ_analytical_rigor = "Show reasoning steps explicitly. State assumptions
before conclusions."
δ_creativity = "Prefer novel approaches where viable. Do not default
to the most common solution if a better one exists."
The system prompt is:
$$\text{SystemPrompt}(f, s_t) = \text{BasePrompt}(f) \;\oplus\; \bigoplus_{j : |s_{t,j} - s^*_j| > \tau} \text{scale}(s_{t,j} - s^*_j) \cdot \delta_j$$
where $\tau = 0.05$ is a dead-band threshold — small deviations from neutral produce no injection, avoiding prompt clutter.
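A sketch of the prompt constructor. The neutral values and the "Strongly:" prefix (standing in for the scale(·) factor) are assumptions:

```python
NEUTRAL = {"caution": 0.50, "assertiveness": 0.50, "curiosity": 0.50,
           "analytical_rigor": 0.50, "creativity": 0.50}

ENCODINGS = {
    "caution": "Express appropriate uncertainty. Do not assert claims "
               "you cannot verify.",
    "assertiveness": "State conclusions directly. Avoid unnecessary "
                     "hedging on verified facts.",
    # ... remaining delta_j encodings as listed above
}

def build_system_prompt(base_prompt: str, s_t: dict[str, float],
                        tau: float = 0.05) -> str:
    """Inject encodings only for traits deviating from neutral by > tau."""
    parts = [base_prompt]
    for trait, score in s_t.items():
        dev = score - NEUTRAL[trait]
        if abs(dev) > tau and trait in ENCODINGS:   # dead-band filter
            emphasis = "Strongly: " if abs(dev) > 0.25 else ""
            parts.append(emphasis + ENCODINGS[trait])
    return "\n".join(parts)

# caution deviates by +0.30 (injected, emphasized); the rest sit in the
# dead band and produce no injection — no prompt clutter.
print(build_system_prompt("You are a precise assistant.",
                          {"caution": 0.80, "assertiveness": 0.48,
                           "curiosity": 0.50, "analytical_rigor": 0.50,
                           "creativity": 0.50}))
```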
W1 — Neutral wrapper is the identity. At $s_t = s^*$: $\phi(s^*, \omega) = 0$ for all $\omega$, so $p_{\text{eff}}(\omega \mid x, s^*) = p_{\text{base}}(\omega \mid x)$. Zero effect at the neutral point.
W2 — Bounded divergence. The KL divergence between wrapped and base distributions is bounded:
$$D_{\text{KL}}(p_{\text{eff}} \,\|\, p_{\text{base}}) \leq \frac{\lambda^2}{2} \cdot \|s_t - s^*\|^2 \cdot \sum_j \text{Var}_{p_{\text{base}}}[\phi_j(\omega)]$$
Since $s_t \in B$ always (Theorem B.7), $\|s_t - s^*\|$ is bounded by $\text{diam}(B)$, so the KL divergence is bounded. The wrapper cannot produce arbitrarily different outputs from the base model.
W3 — Support preservation. The exponential tilt preserves the support of $p_{\text{base}}$: outputs that were impossible remain impossible; possible outputs remain possible. The wrapper cannot hallucinate new capabilities or suppress correct answers entirely.
W4 — Utility function unchanged. $U(E, C, K; f)$ scores $\omega$ directly, without seeing $s_t$, $p_{\text{eff}}$, or any wrapper parameter. All theorems in Appendix B apply exactly as stated.
$$s_t \xrightarrow{\;\mathcal{P}(s_t)\;} p_{\text{eff}}(\omega \mid x, s_t) \xrightarrow{\;\text{sample}\;} \omega_t \xrightarrow{\;U(E,C,K;\,f)\;} U_t \xrightarrow{\;\text{evolution}\;} s_{t+1}$$
Utility $U$ is evaluated at the output $\omega_t$ drawn from the wrapped distribution. Utility history accumulates and drives the evolution rule (Evolution Logic above), which updates $s_{t+1}$. The utility function itself is at no point modified.
Architecture         Wrapper status        Mechanism
──────────────────────────────────────────────────────────────
Monolithic           Active: P(s_t)        Biases generation between
                                           calibration cycles
New model release    Reset: s_t ← s*       P(s*) = identity (W1)
Micro-Expert         Not instantiated      Fast domain retraining
                                           makes wrapper unnecessary
Reset semantics are clean: because the wrapper is external to model weights, resetting requires only $s_t = s^*$. In the Micro-Expert Architecture (§11), the wrapper is never activated — each submodel is retrained quickly enough that behavioral compensation between cycles is unnecessary.
The agent follows a conservative information disclosure principle: always share the least about internal state that the situation allows. Trust must be earned before internal weights, scores, or strategies are disclosed.
Each entity the agent interacts with is assigned a trust score, updated based on behavior:
trust_score(entity) = f(accuracy_of_their_inputs,
                        consistency_of_their_behavior,
                        alignment_with_verified_facts)
Strategy: lenient tit-for-tat — begin cooperatively, mirror behavior, forgive occasional defection. One of the most robust strategies in iterated game theory.
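A sketch of the update rule; all rates are illustrative, since the spec fixes only the strategy (cooperative start, mirroring, forgiveness), not the constants:

```python
def update_trust(trust: float, cooperated: bool,
                 reward: float = 0.05, penalty: float = 0.10,
                 forgiveness: float = 0.02) -> float:
    """Lenient tit-for-tat: mirror behavior, but soften each defection."""
    if cooperated:
        trust += reward
    else:
        trust += forgiveness - penalty   # defection hurts, but is forgivable
    return max(0.0, min(1.0, trust))

trust = 0.5   # cooperative neutral for a new entity (see cold start below)
for coop in [True, True, False, True]:   # one defection, largely forgiven
    trust = update_trust(trust, coop)
print(round(trust, 2))  # -> 0.57
```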
Subset scores are maintained for different dimensions (domain expertise, trustworthiness, intent alignment) so a high domain-knowledge but low-trust entity is handled differently from a low domain-knowledge but high-trust one. Domain expertise is measured based on verifiable credentials — professional experience, educational qualifications, and field-specific certifications — rather than self-reported claims.
The cold start problem — new entities having no interaction history — is resolved differently for the two trust dimensions:
Domain expertise is available from day one. A board-certified surgeon, a PhD in structural engineering, or a licensed attorney can provide verifiable credentials before the first interaction. The system bootstraps domain expertise from these credentials immediately, without requiring interaction history. The credential verification pipeline maps qualifications to domain expertise scores using the same framework as §5.1 — field-specific certification standards define what "above median expertise" means.
Behavioral trust starts at a cooperative neutral (not zero, not maximum) consistent with the lenient tit-for-tat strategy. A new entity is trusted enough to interact normally but not trusted enough to qualify for external escalation. Escalation eligibility requires both domain expertise above median (available from day one via credentials) AND behavioral trust above threshold — so a new expert with excellent credentials can qualify for escalation once behavioral trust accumulates through consistent interaction history.
This two-dimensional gating means cold start affects only the behavioral trust dimension, and that dimension resolves naturally through interaction. Domain expertise — the harder dimension to fake — is grounded from the first contact.
Rather than a population-level detection layer, Sybil resistance is achieved through reputational accountability. When the system consults an external domain expert under the escalation protocol (§10.5), the expert is explicitly informed that their input may be cited when the system offers information to other users. The exact framing:
"Your response may be used to inform answers provided to others in this domain.
Your name and credentials may be associated with this input in our records."
This creates reputational skin in the game. A legitimate professor, clinician, or engineer would not provide factually incorrect information under conditions where their name is associated with it — the professional and reputational consequences are too significant. The same accountability that governs expert testimony in legal proceedings, peer review in academia, and professional opinion in engineering applies here.
A Sybil adversary operating through fake identities cannot manufacture verifiable professional credentials, and cannot absorb reputational harm across fake accounts. The accountability mechanism therefore filters Sybil attacks at the domain expertise gate — a fake account with no real credentials does not reach the escalation threshold in the first place, and a real expert with real credentials has no incentive to provide false information under attribution.
This does not eliminate all adversarial risk, but it substantially raises the cost of adversarial behavior by tying it to real-world professional identity.
Builder walkthrough: Tutorial §9 — Calibration Pipeline · Tutorial §8 — agent.run()
This is the mechanism by which the utility function actively drives improvement between model releases. The key distinction:
Behavioral correction — change what the agent says without weight updates
Fast, cheap, session-scoped
Does not generalize across topics
Knowledge correction — change what the model actually knows
Slower, requires compute, permanent
Generalizes: fixing one contradiction
reduces similar contradictions elsewhere
Both are necessary. The architecture operates across three timescales.
When a contradiction is detected, a corrective assertion is immediately generated and injected into the system prompt for the remainder of the session:
Standard system prompt:
"You are a precise assistant in the software_engineering domain..."
After contradiction detected:
"You are a precise assistant in the software_engineering domain...
ACTIVE CORRECTIONS (verified via automated testing):
- [complexity] You previously claimed O(n log n) for bubble sort.
Tests confirmed it is O(n²). Do not repeat this.
- [best_practice] You previously recommended autocommit for SQLite.
This caused data inconsistency in tests. Always use explicit commits."
Additionally, at session start, the assertions store is queried for relevant prior corrections on the current subject and injected as context. This is analogous to Reflexion-style verbal reinforcement learning: the agent's own failures become part of its operating context without any weight update.
Cost: ~100ms, one database query per session. Effect: immediate, session-scoped.
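A sketch of the injection step; the correction record shape mirrors the assertions store:

```python
def inject_corrections(base_prompt: str, corrections: list[dict]) -> str:
    """Append verified corrective assertions to the session system prompt."""
    if not corrections:
        return base_prompt
    lines = ["", "ACTIVE CORRECTIONS (verified via automated testing):"]
    lines += [f"- [{c['type']}] {c['text']}" for c in corrections]
    return base_prompt + "\n".join(lines)

print(inject_corrections(
    "You are a precise assistant in the software_engineering domain...",
    [{"type": "complexity",
      "text": "You previously claimed O(n log n) for bubble sort. "
              "Tests confirmed it is O(n²). Do not repeat this."}]))
```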
This is where genuine learning occurs. The utility scorer already generates exactly the data format required for Direct Preference Optimization (DPO):
Every scored interaction produces:
task: "sort a list of integers efficiently"
field: "software_engineering"
response_A: [U=0.82, passes all tests] ← preferred
response_B: [U=0.41, fails complexity check] ← rejected
The utility function as a loss weighting mechanism:
This is the key novelty. The field penalty multiplier is applied directly as a training loss weight — not just as a logging label:
training_weight = field_penalty_multiplier(field)
DPO loss for surgery contradiction → 10× weight
DPO loss for creative writing mistake → 1× weight
The model is trained harder on the errors that matter more. This is what makes the utility function an active learning signal rather than a passive monitor.
Calibration pipeline:
1. Collect all interactions since last calibration run
2. Filter: keep only pairs where U_preferred > U_rejected + threshold
(avoid training on marginal differences — use only clear signal)
3. Weight each pair by field_penalty_multiplier(field)
4. Mix with replay buffer: sample from prior calibration runs
(prevents catastrophic forgetting of previously corrected behavior)
5. Run LoRA fine-tuning on weighted (preferred, rejected) pairs
6. Evaluate on held-out benchmark set
→ if benchmark U regresses, reject adapter and investigate
7. Deploy updated LoRA adapter if benchmark passes
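A sketch of steps 2–3. The multiplier values echo the field table, but the 0.15 margin threshold and the record shape are assumptions:

```python
PENALTY_MULTIPLIER = {"surgery": 10.0, "law": 5.0,
                      "software_engineering": 2.0, "creative_writing": 1.0}

def build_dpo_batch(interactions: list[dict], threshold: float = 0.15):
    """Pipeline steps 2-3: keep clear-signal pairs, weight by field penalty."""
    batch = []
    for it in interactions:
        if it["u_preferred"] - it["u_rejected"] <= threshold:
            continue   # step 2: drop marginal differences
        batch.append({"prompt": it["task"],
                      "chosen": it["preferred"],
                      "rejected": it["rejected"],
                      "weight": PENALTY_MULTIPLIER.get(it["field"], 1.0)})
    return batch

pairs = build_dpo_batch([{"task": "sort a list of integers efficiently",
                          "field": "software_engineering",
                          "preferred": "response_A", "rejected": "response_B",
                          "u_preferred": 0.82, "u_rejected": 0.41}])
print(pairs[0]["weight"])  # -> 2.0, the software engineering multiplier
```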
Catastrophic forgetting mitigation:
Naive fine-tuning on today's corrections erases what was fixed yesterday. The replay buffer mixes prior calibration pairs into every new training run:
calibration_batch = {
    new_corrections:  today's DPO pairs            [weight = 1.0]
    replay_sample:    from prior calibration runs  [weight = 0.5]
    golden_examples:  held-out benchmark tasks     [weight = 0.7]
}
The golden examples are the agent's own best prior responses on a fixed benchmark set — keeping the model anchored to known-good behavior while allowing correction of known-bad behavior.
Cost: GPU time, approximately 20–60 minutes per run. Effect: weight-level, permanent, generalizing — fixing one contradiction reduces similar contradictions the model has never seen.
- Accumulated LoRA adapters merged or distilled into a new base fine-tune
- Full evaluation suite run across all fields and benchmarks
- Regression testing against the prior release
This is the point where wrapper-level learning gets baked into the base model, creating a new starting point for the next cycle of calibration.
REAL-TIME (ms):
Contradiction detected
↓
Corrective assertion → injected into system prompt
Assertions store updated
Effect: behavioral, session-scoped
CALIBRATION (hours):
Collect (task, high-U, low-U) pairs
↓
Weight by field_penalty_multiplier
↓
Mix with replay buffer
↓
LoRA DPO fine-tuning
↓
Benchmark evaluation → deploy if passes
Effect: weight-level, permanent, generalizing
RELEASE (monthly):
Merge accumulated adapters
↓
Full regression suite
↓
New base model checkpoint
Effect: baked into base model weights
The utility function governs all three layers: it determines what gets corrected (via confidence penalties), how strongly it gets corrected (via field penalty multipliers), and whether a correction is accepted (via benchmark evaluation before deployment).
┌─────────────────────────────────────────────────────┐
│ WRAPPER LAYER │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Field │ │ Utility │ │Personality │ │
│ │ Classifier │ │ Evaluator │ │ Manager │ │
│ └──────┬──────┘ └──────┬───────┘ └─────┬──────┘ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Prompt Constructor │ │
│ │ (constraints, active │ │
│ │ corrections, traits)│ │
│ └──────────┬────────────┘ │
└─────────────────────────┼───────────────────────────┘
▼
┌───────────────────────┐
│ FRONTIER MODEL │
│ + LoRA Adapter(s) │
└──────────┬────────────┘
▼
┌─────────────────────────────────────────────────────┐
│ SCORING LAYER │
│ │
│ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │Contradiction │ │ Efficacy │ │ Confidence │ │
│ │ Detector │ │ Measurer │ │ Updater │ │
│ └──────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ U = w_e·E + w_c·C + w_k·K_effective │
│ subject to field-specific bounds │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Assertions Store (persistent) │ │
│ │ Structured facts, confidence scores, │ │
│ │ utility history, DPO training pairs │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ CALIBRATION SERVICE │
│ Runs several times daily │
│ Consumes DPO pairs weighted by penalty_multiplier │
│ Produces updated LoRA adapter │
│ Validated against held-out benchmark before deploy │
└─────────────────────────────────────────────────────┘