Design Document

How AUA-Veritas works

A technical design reference covering the full system architecture, the AUA framework components it implements, and the empirical validation behind key design decisions.

System overview

AUA-Veritas is a macOS desktop application that wraps the Adaptive Utility Agent (AUA) framework in a consumer-friendly chat interface. It runs multiple large language models simultaneously, applies mechanism design to select the best answer, and maintains a persistent correction memory that improves routing over time.

The system has four distinct layers, each adding correctness on top of the previous:

1. Memory layer

Stored corrections are scored and injected into the system prompt before every query. Relevance scoring uses 8 factors including semantic similarity, recency, and token cost.

2. Competition layer (VCG)

All enabled models answer simultaneously. A domain-decomposed VCG welfare function selects the winner based on domain-specific historical win rates. Models are incentivised to report their true domain competencies.

3. Verification layer (peer review)

When accuracy mode is High or Max, a designated reviewer model independently checks the winner's answer. Correct answers reward the model; incorrect answers trigger a correction storage event.

4. Compliance layer

A compliance monitor checks every response against active instructions. Violations trigger escalating bold reinforcement in the next system prompt. The domain tree grows automatically from model self-reports.


Technology stack

Component | Technology | Rationale
Desktop shell | Electron 30 (arm64) | Native macOS integration, system tray, Keychain access via keytar
Frontend | React 18 + TypeScript + Vite | Type safety, hot reload in dev, minimal bundle in prod
Backend | FastAPI + uvicorn (Python 3.11) | Async-native, easy to add endpoints, Pydantic validation
Database | SQLite 3 (WAL mode) | Zero-config, file-portable, WAL for concurrent reader/writer
Encryption | AES-128 Fernet (cryptography) | Message content encrypted at rest; keys in Keychain
ML inference | spaCy + custom trigger model | Offline correction detection; no cloud ML dependency
Distribution | PyInstaller + electron-builder | Single DMG; no Python or Node required on user's machine

The backend binary is built with PyInstaller in --onedir mode (not --onefile). This eliminates the 8–15 second cold-start extraction penalty of single-file builds — files are read directly from disk at the binary's path.


Data model

All state is stored in a single SQLite file at ~/Library/Application Support/AUA-Veritas/veritas.db. Key tables:

Table | Purpose
conversations | Chat session metadata (title, timestamps)
messages | Encrypted message content, model attribution, confidence labels
model_runs | One row per model per query: welfare score, vcg_winner flag, domain, latency
corrections | Stored corrections and preferences with type, scope, domain, decay class
audit_log | Score events for Look Under the Hood graphs
domain_nodes | Dynamic domain tree: node_id, parent_id, depth, aliases (JSON)
domain_candidates | Candidate domain strings awaiting promotion review
model_context_prompts | Per-model recovery prompts, generated_at, corrections_hash for staleness
context_backups | Per-model context summaries for conversation continuity
local_settings | User preferences, correction thresholds, user_token for bug reports

VCG mechanism design

AUA-Veritas implements the Vickrey–Clarke–Groves (VCG) mechanism from the AUA framework. In a standard VCG auction, a social planner selects the allocation that maximises total welfare. Here, "welfare" is the expected correctness of the answer a model will produce for this query in this domain.

The welfare formula

Domain-decomposed VCG welfare (Phase 11.4)

W_i(q) = Σ_j p(j | q) · u_i(j)

where:
  p(j | q) = probability that query q belongs to domain j
  u_i(j) = effective utility of model i in domain j
  j ∈ {code, mathematics, science, legal, medical, finance, writing, analysis, history, general, ...}
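
As a worked illustration of the welfare sum, here is a minimal Python sketch; it is not the project's implementation. The function names `welfare` and `select_winner`, the dict layout, and the sample numbers are assumptions made for this example. Domains missing from a model's utility table fall back to the 0.5 neutral prior this document describes.

```python
from typing import Dict, Tuple


def welfare(p_domain: Dict[str, float], utility: Dict[str, float],
            prior: float = 0.5) -> float:
    """W_i(q) = sum over j of p(j|q) * u_i(j).

    Domains with no utility entry use the neutral prior (assumption).
    """
    return sum(p * utility.get(domain, prior) for domain, p in p_domain.items())


def select_winner(p_domain: Dict[str, float],
                  utilities: Dict[str, Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Return the model with the highest welfare, plus every model's score."""
    scores = {model: welfare(p_domain, u) for model, u in utilities.items()}
    return max(scores, key=scores.get), scores


# Illustrative numbers only.
p = {"code": 0.75, "mathematics": 0.25}
utilities = {
    "model_a": {"code": 0.8, "mathematics": 0.4},
    "model_b": {"code": 0.6, "mathematics": 0.8},
}
winner, scores = select_winner(p, utilities)
```

With these illustrative numbers, model_a scores 0.75 · 0.8 + 0.25 · 0.4 = 0.70 and model_b scores 0.75 · 0.6 + 0.25 · 0.8 = 0.65, so model_a is selected.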

The domain probabilities p(j|q) are derived from model self-reports — each model appends a DOMAINS: tag to its response listing the 1–3 domains it considers most relevant. The router strips the tag, aggregates across all models (equal weight split per model's list), and normalises to produce a probability distribution.
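
The aggregation described above can be sketched as follows. This is a hedged illustration: the helper names `split_domains_tag` and `domain_distribution` are hypothetical, the tag is assumed to sit on the final line of the response, and the fallback to `general` when no model reports a domain is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_domains_tag(response: str) -> Tuple[str, List[str]]:
    """Strip a trailing 'DOMAINS: a, b' line; return (clean_text, domains)."""
    lines = response.rstrip().splitlines()
    if lines and lines[-1].strip().upper().startswith("DOMAINS:"):
        raw = lines[-1].split(":", 1)[1]
        domains = [d.strip().lower() for d in raw.split(",") if d.strip()]
        return "\n".join(lines[:-1]).rstrip(), domains[:3]  # at most 3 domains
    return response, []


def domain_distribution(reports: Dict[str, List[str]]) -> Dict[str, float]:
    """Each model contributes one unit of weight, split equally over its list."""
    totals: Dict[str, float] = defaultdict(float)
    for domains in reports.values():
        for domain in domains:
            totals[domain] += 1.0 / len(domains)
    mass = sum(totals.values())
    if mass == 0:
        return {"general": 1.0}  # assumption: no reports means "general"
    return {domain: weight / mass for domain, weight in totals.items()}


text, tags = split_domains_tag("Add a B-tree index.\nDOMAINS: code, databases")
dist = domain_distribution({"model_a": ["code", "mathematics"], "model_b": ["code"]})
```

Here model_a splits its weight 0.5/0.5 and model_b puts 1.0 on code, giving p(code | q) = 1.5 / 2 = 0.75 and p(mathematics | q) = 0.25.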

Why VCG is the right mechanism

🎯 Three AUA-proven properties

  • Dominant strategy truthfulness — Models maximise their expected score by reporting their genuine domain competencies, not by inflating claims. A model that misreports its domain gets lower welfare scores over time as its actual win rate diverges from its claimed domain strength.
  • Price of Anarchy = 1 — The VCG allocation is socially optimal. No other allocation rule produces higher expected total welfare across all models.
  • Individual rationality — Every model participating in the competition is at least as well off as not participating (they receive a welfare score ≥ 0).
Empirical validation: The AUA framework's routing correctness gain of +10.5% (p=0.029) was measured over a single-model baseline using the same VCG mechanism implemented here.

Utility function — effective_u

The domain-specific effective utility u_i(j) is a volume-normalized win rate. Raw win rates are unreliable with few observations — a model with 1 win from 1 query has the same win rate as one with 20 wins from 20 queries, but very different reliability.

Effective utility with volume normalization (Phase 11.6)

u_i(j) = α · win_rate(i, j) + (1 − α) · 0.5

α = min(1.0, n_ij / N_cliff) where N_cliff = 20

win_rate(i, j) = wins in domain j / total runs in domain j
n_ij = total runs of model i in domain j
0.5 = neutral bootstrap prior

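At n=0, the utility is 0.5: neutral, with no advantage to established models. At n=20, the observed win rate fully dominates. Between these values, α grows linearly with n, so the observed win rate is blended with the prior in proportion to the evidence collected, a form of shrinkage toward the prior. For example, a model with 1 win from 1 query gets 0.05 · 1.0 + 0.95 · 0.5 = 0.525, while a model with 20 wins from 20 queries gets 1.0.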
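
The formula translates directly into a few lines of Python. This is a minimal sketch; the function name and signature are chosen for illustration and are not taken from the codebase.

```python
def effective_u(wins: int, runs: int, n_cliff: int = 20, prior: float = 0.5) -> float:
    """Volume-normalized win rate: blend the observed rate with a neutral prior."""
    if runs <= 0:
        return prior                      # no data: neutral bootstrap prior
    alpha = min(1.0, runs / n_cliff)      # trust in the observed rate
    return alpha * (wins / runs) + (1.0 - alpha) * prior


one_of_one = effective_u(1, 1)            # barely moves off the prior
twenty_of_twenty = effective_u(20, 20)    # observed rate fully dominates
```

The two calls show why the raw win rate (1.0 in both cases) is not used directly: the single-observation model stays close to 0.5, while the 20-observation model earns its full observed rate.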

Hierarchical utility (Phase 12)

For queries classified to deep domain nodes (e.g. molecular_biology → biology → science), the effective utility walks up the tree to find the deepest level with sufficient data:

n_cliff(d) = max(5, 20 − 2d)

VCG uses the deepest node with n_ij ≥ n_cliff(depth)
Falls back toward L0 root until data is sufficient

Deeper nodes need fewer observations because they represent rarer, more specific queries. At depth 6, for example, n_cliff = max(5, 20 − 12) = 8, so a model that has answered 8 questions in a node that deep (say, a narrow protein-folding sub-domain) has demonstrated enough signal to use that node's effective_u rather than falling back to a shallower level such as biology or science.
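
A minimal sketch of the walk-up rule follows. The tree representation (plain `parents`, `depths`, and `runs` dicts) and the function name `hierarchical_node` are assumptions for illustration; only the n_cliff formula comes from this document.

```python
from typing import Dict, Optional


def n_cliff(depth: int) -> int:
    """Observation threshold at a given tree depth (L0 root has depth 0)."""
    return max(5, 20 - 2 * depth)


def hierarchical_node(node: str,
                      parents: Dict[str, Optional[str]],
                      depths: Dict[str, int],
                      runs: Dict[str, int]) -> str:
    """Walk from `node` toward the root; return the deepest node with enough data.

    If no level has enough data, the L0 root is returned (assumption).
    """
    current = node
    while True:
        if runs.get(current, 0) >= n_cliff(depths[current]):
            return current
        parent = parents.get(current)
        if parent is None:
            return current
        current = parent


parents = {"molecular_biology": "biology", "biology": "science", "science": None}
depths = {"science": 0, "biology": 1, "molecular_biology": 2}
runs = {"molecular_biology": 10, "biology": 18, "science": 40}  # illustrative counts
chosen = hierarchical_node("molecular_biology", parents, depths, runs)
```

In this example molecular_biology (depth 2) needs 16 runs and has only 10, so the lookup falls back to biology, which meets its threshold of 18.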


Dynamic domain ontology

The domain taxonomy is a rooted tree that grows naturally from model self-reports. Ten L0 root domains are fixed anchors. Everything below is dynamic.

Three-tier normalization

When a model reports a domain string (e.g. "molecular biology"), the system resolves it through two stages; the second stage sorts the match into one of three similarity tiers:

Stage 1: Alias map lookup (O(1))
  "mol biology" → alias of "molecular_biology" → match, done

Stage 2: Edit-distance similarity (fallback)
  "biomolecular science" → similarity 0.78 vs "molecular_biology"
  ≥ 0.80 → add as alias, return node
  0.55–0.79 → add to candidate queue, return closest node
  < 0.55 → candidate queue + fall back to general
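
The two stages can be sketched as below. The document only says "edit-distance similarity", so the exact metric here (1 minus Levenshtein distance divided by the longer length) is an assumption, as are the helper names and the `canon` normalisation step. The 0.80 and 0.55 thresholds come from the text above.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] (assumed metric)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def canon(raw: str) -> str:
    return "_".join(raw.lower().split())


def resolve_domain(raw: str, aliases: Dict[str, str], nodes: List[str],
                   candidates: List[str]) -> str:
    key = canon(raw)
    if key in aliases:                       # Stage 1: O(1) alias lookup
        return aliases[key]
    best = max(nodes, key=lambda n: similarity(key, n))   # Stage 2
    score = similarity(key, best)
    if score >= 0.80:
        aliases[key] = best                  # learn the alias
        return best
    candidates.append(key)                   # queue for promotion review
    return best if score >= 0.55 else "general"


aliases = {"mol_biology": "molecular_biology"}
nodes = ["molecular_biology", "general"]
queue: List[str] = []
r1 = resolve_domain("mol biology", aliases, nodes, queue)
r2 = resolve_domain("molecular biolog", aliases, nodes, queue)
r3 = resolve_domain("quantum chromodynamics", aliases, nodes, queue)
```

The first call resolves through the alias map, the second is a near-miss that is learned as a new alias, and the third is too dissimilar to any node, so it is queued as a candidate and resolved to general.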

Candidate promotion

Strings in the candidate queue are evaluated by a background job every 5 minutes. A candidate is promoted to a full node when ALL of:

📋 Promotion criteria

  • query_count ≥ 5 — noise gate: one-off strings are not promoted
  • model_count ≥ 2 — not a single-model quirk
  • Divergence test passes — the primary criterion: mean |u_candidate − u_parent| > δ(depth)
  • Not already covered — re-run alias lookup with latest index
Branch-relative divergence threshold:
δ(d) = 0.10 + 0.05 × d

d=0: δ=0.10 (L0 splits easily — broad domains)
d=3: δ=0.25 (mid-tree needs stronger evidence)
d=6: δ=0.40 (deep nodes need very strong divergence)

The divergence threshold is depth-relative because a 15% performance gap at the root level indicates a meaningfully distinct domain, while the same gap at depth 6 is likely sampling noise in a very specific sub-domain.
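
The promotion gate can be expressed as a single predicate. This is a sketch under stated assumptions: the `Candidate` dataclass and its field names are hypothetical, and the depth passed to δ is taken to be the depth of the branch being split, which the text implies but does not state outright.

```python
from dataclasses import dataclass


def delta(depth: int) -> float:
    """Branch-relative divergence threshold."""
    return 0.10 + 0.05 * depth


@dataclass
class Candidate:
    query_count: int            # how many queries produced this string
    model_count: int            # how many distinct models reported it
    mean_abs_divergence: float  # mean |u_candidate - u_parent|
    depth: int                  # depth of the branch being split (assumption)
    already_covered: bool       # alias lookup now resolves it


def should_promote(c: Candidate) -> bool:
    """All four criteria must hold."""
    return (c.query_count >= 5
            and c.model_count >= 2
            and c.mean_abs_divergence > delta(c.depth)
            and not c.already_covered)


root_split = Candidate(6, 2, 0.15, depth=0, already_covered=False)
deep_split = Candidate(6, 2, 0.15, depth=6, already_covered=False)
```

The same 0.15 gap clears the threshold at the root (δ = 0.10) but not at depth 6 (δ = 0.40), which is the behaviour the paragraph above describes.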


Compliance monitor

Models periodically stop following system prompt instructions — most commonly the DOMAINS: tag under context window pressure. The compliance monitor detects this and escalates automatically.

Rule categories

Rule ID | Check | Type
domain_tag | DOMAINS: tag present and non-empty in last 3 lines | Structural (free)
no_scoring_leak | Model doesn't mention welfare/vcg/scoring context | Structural (free)
reviewer_no_leak | Reviewer doesn't reveal it is reviewing | Structural (free)

Streak-based escalation

Each rule maintains a consecutive-failure counter per model. When a counter reaches the threshold (default 2), the next system prompt highlights the violated instruction in bold:

**YOU HAVE STOPPED INCLUDING DOMAIN TAGS. RESUME IMMEDIATELY:**
**At the end of your response, on a new line, write exactly:**
**DOMAINS: <comma-separated domains>**
**REASON THIS WAS TRIGGERED: Missing from last 2 consecutive responses.**

The streak resets to 0 immediately when the rule passes. The system self-heals without user intervention.
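
A minimal sketch of the domain_tag check and the streak counter is shown below. The class and function names (`StreakTracker`, `has_domain_tag`) are hypothetical stand-ins; the document's own functions are named check_structural() and update_streaks(), and their actual signatures are not shown in the source.

```python
from collections import defaultdict
from typing import Dict, Tuple


def has_domain_tag(response: str) -> bool:
    """domain_tag rule: a non-empty DOMAINS: tag within the last 3 lines."""
    for line in response.rstrip().splitlines()[-3:]:
        head, sep, tail = line.strip().partition(":")
        if sep and head.upper() == "DOMAINS" and tail.strip():
            return True
    return False


class StreakTracker:
    """Consecutive-failure counter per (model, rule)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.streaks: Dict[Tuple[str, str], int] = defaultdict(int)

    def update(self, model: str, rule_id: str, passed: bool) -> bool:
        """Record one result; return True when reinforcement should be injected."""
        key = (model, rule_id)
        self.streaks[key] = 0 if passed else self.streaks[key] + 1
        return self.streaks[key] >= self.threshold


tracker = StreakTracker()
first = tracker.update("model_a", "domain_tag", has_domain_tag("An answer."))
second = tracker.update("model_a", "domain_tag", has_domain_tag("Another answer."))
healed = tracker.update("model_a", "domain_tag", has_domain_tag("Fixed.\nDOMAINS: code"))
```

The first miss does nothing, the second consecutive miss triggers reinforcement, and a single passing response resets the streak.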


Keyword indexing and real-time search

Every message is keyword-extracted and stored in a local SQLite table. Searches across months of conversation history complete in microseconds because keypress search never touches the DB at query time. Everything runs from co-maintained in-memory structures.

In-memory structures

In-memory index (loaded at startup, updated by async worker)
_KW_INDEX: keyword → {conv_id, ...}   (inverted index, conversation-level)
_KW_SORTED: sorted list of all keywords   (prefix search via bisect)

Both structures live in Python memory. Keypress search is pure Python set operations — no I/O, no SQLite, no disk. Memory footprint at scale:

Usage | Unique keywords | Index memory
Small (1k messages) | ~15k | ~4 MB
Medium (5k messages) | ~60k | ~17 MB
Large (20k messages) | ~200k | ~57 MB

Message-level scroll-to (clicking a result) uses a targeted DB query on (conversation_id, keyword) — both indexed, called once on click not on keypress. The memory vs single-DB-lookup tradeoff is deliberate: keeping message-level sets in memory would add 3–4× overhead (measured: 364MB at 20k messages) for a feature used far less frequently than search-as-you-type.

Inverted index — how it works

When the keyword "postgres" is extracted from a message, it is added to _KW_INDEX["postgres"] as a set containing that conversation's ID. If "postgres" appears in 47 conversations, _KW_INDEX["postgres"] is a set of 47 conv_ids.

A search for "postgres index" becomes:

# AND semantics: all words must appear
sets = [_KW_INDEX["postgres"], _KW_INDEX["index"]]
result = set.intersection(*sets)
# → conv_ids where BOTH "postgres" AND "index" appear
# Time: O(|smallest set|), typically microseconds

Prefix search via bisect

_KW_SORTED is a sorted list of all unique keywords maintained in parallel with the inverted index. Prefix search (typing "post" matching "postgres", "posting", "post-hoc") uses Python's bisect module:

import bisect

prefix = "post"
lo = bisect.bisect_left(_KW_SORTED, prefix)
# Upper bound: bump the last character of the prefix ("post" → "posu")
hi = bisect.bisect_left(_KW_SORTED, prefix[:-1] + chr(ord(prefix[-1]) + 1))
# Slice _KW_SORTED[lo:hi] contains all keywords that start with "post"
# Time: O(log n) to find the range + O(k) to collect matching conv_ids

This makes real-time sidebar search work correctly — each keystroke triggers a prefix search that returns results before the user finishes typing, with no perceptible latency.
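
Putting the inverted index and the sorted keyword list together, a keypress search could look like the sketch below. The function names and the sample index are illustrative assumptions. Unlike the "next character" upper bound described above, this variant uses bisect only for the lower bound and then scans forward while keywords still start with the prefix, which sidesteps the edge case where the last character has no successor.

```python
import bisect
from typing import Dict, List, Set


def prefix_search(prefix: str, kw_sorted: List[str],
                  kw_index: Dict[str, Set[str]]) -> Set[str]:
    """Union of conv_ids over every keyword that starts with `prefix`."""
    if not prefix:
        return set()
    lo = bisect.bisect_left(kw_sorted, prefix)   # O(log n) to find the start
    result: Set[str] = set()
    for keyword in kw_sorted[lo:]:
        if not keyword.startswith(prefix):
            break                                # sorted order: no more matches
        result |= kw_index.get(keyword, set())
    return result


def search(query: str, kw_sorted: List[str],
           kw_index: Dict[str, Set[str]]) -> Set[str]:
    """AND across words; each word matches by prefix."""
    sets = [prefix_search(w, kw_sorted, kw_index) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()


kw_index = {
    "index": {"c1", "c2"},
    "post-hoc": {"c3"},
    "postgres": {"c1", "c3"},
    "posting": {"c2"},
    "python": {"c4"},
}
kw_sorted = sorted(kw_index)
hits = search("postg index", kw_sorted, kw_index)
```

Typing "postg index" narrows the result to the one conversation that contains both a "postg…" keyword and an "index…" keyword.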

Message-level index for scroll-to

_KW_MSG_INDEX maps each keyword to a nested structure: keyword → conv_id → {msg_ids}. When you click a search result, the frontend scrolls to the exact message that matched — not just the conversation. The best match is the earliest message containing all query words:

def _find_best_match_message(conv_id, words):
    # For each word, get the set of message_ids in this conversation
    sets = [_KW_MSG_INDEX.get(w, {}).get(conv_id, set()) for w in words]
    if not sets:
        return None  # empty query: nothing to scroll to
    # Intersect: all words must appear in the same message
    matching = set.intersection(*sets)
    if not matching:
        return None  # no single message contains every word
    # Return the earliest (lowest timestamp) message
    return min(matching, key=lambda mid: _KW_MSG_TS[mid])

Extraction pipeline (async, zero response latency)

Extraction runs in a background asyncio worker — never on the response path:

1. Request handler      → _kw_enqueue(msg_id, conv_id, role, text)   ← O(1)
2. Response returned    → user sees answer immediately
3. Background worker    ← drains queue every 50ms or 20 messages
4. Thread pool executor ← spaCy tokenisation + POS tagging
5. Single DB txn        ← executemany INSERT into message_keywords
6. Index update         ← add to the in-memory structures
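
Step 3 of the pipeline (flush every 50 ms or every 20 messages, whichever comes first) can be sketched with an asyncio queue. This is an illustrative pattern only: `drain_batches` and `handle_batch` are hypothetical names, and the spaCy and SQLite work of steps 4 and 5 is replaced by a stand-in handler.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def drain_batches(queue: "asyncio.Queue[Any]",
                        handle_batch: Callable[[List[Any]], Awaitable[None]],
                        max_batch: int = 20,
                        max_wait: float = 0.05) -> None:
    """Collect up to max_batch items, or whatever arrives in max_wait seconds."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]              # sleep until there is work
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        await handle_batch(batch)                # e.g. extract + executemany


async def demo() -> List[int]:
    queue: "asyncio.Queue[int]" = asyncio.Queue()
    sizes: List[int] = []

    async def record(batch: List[int]) -> None:
        sizes.append(len(batch))

    worker = asyncio.create_task(drain_batches(queue, record))
    for i in range(45):
        queue.put_nowait(i)                      # O(1) enqueue on the request path
    await asyncio.sleep(0.3)
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return sizes


batch_sizes = asyncio.run(demo())
```

With 45 queued messages the worker flushes two full batches of 20 and then, after the 50 ms window closes, the remaining 5.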

spaCy POS-tags each message (capped at 2000 chars) and keeps nouns, proper nouns, and non-stopword tokens. Code tokens (camelCase, snake_case, dotted paths) are identified by pattern and given a 2× weight boost so architectural terms surface reliably in long prose responses.
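
The code-token boost can be illustrated with a few regular expressions. The patterns below are assumptions about what "identified by pattern" might look like; only the three token families (camelCase, snake_case, dotted paths) and the 2× weight come from the text.

```python
import re

# Hypothetical patterns for the three token families named above.
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$|^(?:[A-Z][a-z0-9]+){2,}$")
SNAKE = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+$")
DOTTED = re.compile(r"^[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+$")


def is_code_token(token: str) -> bool:
    """True for camelCase / PascalCase, snake_case, or dotted.path tokens."""
    return any(p.match(token) for p in (CAMEL, SNAKE, DOTTED))


def keyword_weight(token: str, base: float = 1.0) -> float:
    """Code tokens get a 2x boost over ordinary keywords."""
    return base * 2.0 if is_code_token(token) else base


weights = {t: keyword_weight(t) for t in
           ["camelCase", "snake_case", "os.path.join", "postgres"]}
```

Ordinary nouns such as "postgres" keep the base weight, while identifiers that look like code are doubled.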

Schema

CREATE TABLE message_keywords (
    keyword         TEXT NOT NULL,
    message_id      TEXT NOT NULL,
    conversation_id TEXT NOT NULL,
    role            TEXT NOT NULL,  -- 'user' | 'assistant' | 'attachment'
    created_at      REAL NOT NULL
);
CREATE INDEX idx_kw_keyword ON message_keywords(keyword);
CREATE INDEX idx_kw_conv ON message_keywords(conversation_id);

The SQLite table is the persistent store — the in-memory index is rebuilt from it at startup. The DB index on keyword makes the startup load fast even with millions of keyword rows. After startup, the DB is only written to (never read) during normal search operations.

Search complexity summary:
Exact keyword match: O(1) hash lookup → O(|set|) collect conv_ids
Prefix match (partial typing): O(log n) bisect → O(k·|set|) union
Multi-word AND: O(|smallest set|) set intersection
Scroll-to-message (on click): 1 indexed DB query, not on hot path
DB calls at keypress: zero.

Context recovery

AUA-Veritas is designed so that resuming a conversation after months requires no user action. Three independent mechanisms layer together to restore full context automatically.

Layer 1 — Correction injection (always active)

This is the primary mechanism. Before every query — including the first message in a resumed conversation — all stored corrections are scored and the relevant ones injected into the system prompt. Because corrections are global and persistent, a preference you expressed six months ago in a different conversation applies to today's query if it's semantically relevant.

Why this works across sessions:
corrections table has no TTL; rows are never deleted
injection scorer is driven by semantic relevance, so an old correction can still score high (recency is only one of the 8 factors)
"Always use TypeScript" from 8 months ago scores high for any JS/TS query
decay_class=A corrections (explicit instructions) never decay

Layer 2 — Model recovery prompts

Each model generates and stores its own recovery prompt — a self-written system message that summarises the active corrections, preferred domains, and its own reliability trajectory. The model writes this in its own style, incorporating the full set of active corrections it would need to behave correctly.

Staleness detection:

Stale if ANY of:
  prompt does not exist (first run)
  age > 86400s (24 hours)
  sha256(active_correction_ids) ≠ stored corrections_hash

On stale detection: background job regenerates the prompt silently
On manual "Generate / View prompt" click: immediate regeneration
Fresh prompts served from DB: zero latency on resume
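
The three staleness conditions can be sketched as one function. How the correction IDs are serialised before hashing is not specified in this document, so the sorted, comma-joined form below is an assumption, as are the function names. The 16-character truncation follows the digest length this document mentions in its Context prompts section.

```python
import hashlib
import time
from typing import Iterable, Optional

MAX_AGE_SECONDS = 86400  # 24 hours


def corrections_hash(correction_ids: Iterable[str]) -> str:
    """Order-independent fingerprint of the active correction set."""
    joined = ",".join(sorted(correction_ids))        # assumed serialisation
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def is_stale(generated_at: Optional[float], stored_hash: Optional[str],
             active_ids: Iterable[str], now: Optional[float] = None) -> bool:
    """True if the prompt is missing, older than 24h, or the corrections changed."""
    if generated_at is None or stored_hash is None:
        return True                                  # first run
    now = time.time() if now is None else now
    if now - generated_at > MAX_AGE_SECONDS:
        return True
    return corrections_hash(active_ids) != stored_hash


stored = corrections_hash(["c2", "c1"])
fresh = is_stale(1000.0, stored, ["c1", "c2"], now=1000.0 + 3600)
changed = is_stale(1000.0, stored, ["c1", "c2", "c3"], now=1000.0 + 3600)
```

Sorting the IDs before hashing means the fingerprint changes only when the set of active corrections changes, not when their retrieval order does.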

Layer 3 — Context window backups (structured handoff notes)

For long-running conversations, each model periodically writes a structured handoff note — not a raw summary, but a purpose-built document for resuming in a new window. When context pressure is detected, this note is injected before the system context so the model picks up immediately.

The backup prompt forces the model to capture five things explicitly:

## GOAL: what the user is trying to accomplish and why
## DECISIONS MADE: each significant decision with the reason (decided X because Y, rejected Z)
## CURRENT STATUS: ✅ completed / 🔄 in progress (exact file/function) / ❓ unresolved
## ACTIVE FILE: exact file path, function name, what needs to happen next
## RESUME INSTRUCTION: one sentence telling the new window how to pick up

The backup is injected with a framing header that tells the model it is resuming a conversation and should use the note silently without mentioning it to the user.

Backup frequency | Setting | Best for
Auto | Default | Most users (triggers on context pressure only)
Every 15 min | Settings → Context backup | Very long sessions, complex projects
Hourly | Settings → Context backup | Background research sessions
Manual | Settings → Context backup | Power users who want explicit control
The user experience: Open a conversation from three months ago. Type a question. Send. The AI responds with full knowledge of your previous decisions, preferences, and context — without you having said anything yet. This is the result of all three layers firing before your message is even processed.

Memory architecture

The memory system stores corrections in the corrections table with 12 fields including type, scope, domain, confidence, decay class, and corrective instruction. Corrections are never deleted — they may be superseded (scope set to "superseded") but the audit trail is preserved.

Correction types

Type | Example | Scope
factual_correction | "The answer is 53, not 57" | Conversation
persistent_instruction | "Always use metric units" | Global
preference_rule | "Prefer Postgres over MySQL" | Global or domain
model_preference | "User preferred Claude on a disagreement" | Global
domain_rule | "In legal contexts, cite specific statute sections" | Domain

Decay classes

Corrections have a decay class (A, B, or C) that determines how quickly they lose relevance in the scoring function. Class A corrections (explicit user instructions) never decay. Class C corrections (inferred from implicit signals) decay over 30 days without reinforcement.


Correction detection pipeline

Implicit corrections (natural language replies like "No, that's wrong — it's 53") go through a three-stage pipeline before storage:

L1: Rule-based trigger detection

Fast pattern matching: negation words ("no", "actually", "wrong"), explicit prefixes ("correction:", "I know that"), question-to-assertion shift. Filters 80% of non-corrections at zero cost.

L2: Semantic similarity check

A fine-tuned spaCy model computes semantic similarity between the candidate correction and the previous AI response. Score above threshold (default 0.45) → likely correction. Below → likely a new question.

L3: Validation (Plausible / Strict)

A cheap judge model checks if the correction could plausibly be true. In Strict mode, a full cross-check is run. Confirmed corrections are stored; rejected ones are discarded with a note.
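
Stage L1 above lends itself to a small regex sketch. The patterns are assumptions built from the example trigger words in the text; the actual rule set, including the question-to-assertion check, is not shown in this document and is not reproduced here.

```python
import re

# Hypothetical patterns derived from the examples in the text.
LEADING_NEGATION = re.compile(r"^\s*(no|actually)\b", re.IGNORECASE)
WRONG_ANYWHERE = re.compile(r"\bwrong\b", re.IGNORECASE)
EXPLICIT_PREFIX = re.compile(r"^\s*(correction:|i know that\b)", re.IGNORECASE)


def looks_like_correction(user_message: str) -> bool:
    """Cheap first-pass filter; survivors go on to the L2 similarity check."""
    return bool(LEADING_NEGATION.search(user_message)
                or WRONG_ANYWHERE.search(user_message)
                or EXPLICIT_PREFIX.search(user_message))


samples = {
    "No, that's wrong, it's 53": looks_like_correction("No, that's wrong, it's 53"),
    "Correction: use metric units": looks_like_correction("Correction: use metric units"),
    "What is the capital of France?": looks_like_correction("What is the capital of France?"),
}
```

A filter like this is deliberately permissive: false positives are cheap because the later stages can still reject them, while a miss here would drop a genuine correction before any model sees it.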

Empirical result: The correction system achieves a 69.6% reduction in repeated errors — validated across 200 simulated queries with injected correction events.

Context injection scoring

Before every query, IncludeUtilityScorer scores all stored corrections and selects which to inject. The scoring function is a weighted combination of 8 factors:

Injection score

score(c, q) = w₁·relevance + w₂·failure_prevention + w₃·importance
            + w₄·recency + w₅·confidence + w₆·pinned
            − w₇·staleness − w₈·token_cost

Threshold: 0.30 (configurable 0.20–0.80 in settings)

Corrections below the threshold are not injected — this prevents irrelevant memories from consuming context window. Pinned corrections always pass regardless of score.

The relevance score uses the same semantic similarity model as the correction detector, comparing the stored correction's canonical query against the current question.
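
The scoring function can be sketched as a weighted sum. The weight values below are placeholders invented for this example (the document does not publish w₁ through w₈), and the `Factors` dataclass is a hypothetical container; only the eight factor names, their signs, the 0.30 default threshold, and the pinned override come from the text.

```python
from dataclasses import dataclass

# Placeholder weights: NOT the values used by IncludeUtilityScorer.
WEIGHTS = {
    "relevance": 0.35, "failure_prevention": 0.15, "importance": 0.15,
    "recency": 0.10, "confidence": 0.10, "pinned": 0.15,
    "staleness": 0.10, "token_cost": 0.05,
}


@dataclass
class Factors:
    relevance: float
    failure_prevention: float
    importance: float
    recency: float
    confidence: float
    pinned: bool
    staleness: float
    token_cost: float


def injection_score(f: Factors, w: dict = WEIGHTS) -> float:
    """Six positive terms minus the staleness and token-cost penalties."""
    positive = (w["relevance"] * f.relevance
                + w["failure_prevention"] * f.failure_prevention
                + w["importance"] * f.importance
                + w["recency"] * f.recency
                + w["confidence"] * f.confidence
                + w["pinned"] * float(f.pinned))
    penalty = w["staleness"] * f.staleness + w["token_cost"] * f.token_cost
    return positive - penalty


def should_inject(f: Factors, threshold: float = 0.30) -> bool:
    """Pinned corrections always pass; others must reach the threshold."""
    return f.pinned or injection_score(f) >= threshold


relevant = Factors(0.9, 0.5, 0.5, 0.2, 0.8, False, 0.1, 0.2)
irrelevant = Factors(0.1, 0.1, 0.1, 0.1, 0.1, False, 0.5, 0.5)
pinned = Factors(0.1, 0.1, 0.1, 0.1, 0.1, True, 0.5, 0.5)
```

Under these placeholder weights the relevant correction scores 0.545 and is injected, the irrelevant one scores about 0.01 and is skipped, and the pinned one passes regardless of its score.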


Full query flow

A single query through AUA-Veritas traverses this path:

1. Trigger detection        Implicit correction check on user message
2. Memory injection         Score + select corrections for system prompt
3. Domain classification    Field classifier → p(j|q) distribution
4. Backend calls            All enabled models called via asyncio.gather
5. Domain tag parsing       Strip DOMAINS: tag, resolve via tree, build consensus p(j|q)
6. Compliance check         check_structural() → update_streaks() per model
7. VCG selection            W_i = Σ p(j|q)·effective_u_hierarchical(i,j,tree)
8. Score update             vcg_winner=1 written only if models AGREE or user picks
9. Peer review (High/Max)   Reviewer checks winner's answer
10. Correction storage      Any detected errors → corrections table
11. Context prompt check    Compliance monitor → system prompt reinforcement if needed
12. Response display        Winner's answer shown; disagreements surfaced explicitly
VCG winner flag: On genuine disagreements, vcg_winner=1 is withheld from all models until the user picks. This prevents the VCG election winner from accumulating effective_u gains when the user actually preferred a different model.
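
The withholding rule in step 8 can be sketched as a small function. The function name and arguments are hypothetical; in particular, giving the flag to the VCG winner when models agree is an assumption, since the text only says the flag is written in that case.

```python
from typing import Dict, List, Optional


def assign_winner_flags(models: List[str], vcg_winner: str, models_agree: bool,
                        user_pick: Optional[str] = None) -> Dict[str, int]:
    """Return the vcg_winner flag (0 or 1) to write for each model's run."""
    if models_agree:
        chosen: Optional[str] = vcg_winner   # assumption: agreement credits the VCG winner
    else:
        chosen = user_pick                   # None until the user picks
    return {m: int(chosen is not None and m == chosen) for m in models}


models = ["model_a", "model_b"]
agreed = assign_winner_flags(models, "model_a", models_agree=True)
pending = assign_winner_flags(models, "model_a", models_agree=False)
picked = assign_winner_flags(models, "model_a", models_agree=False, user_pick="model_b")
```

In the disagreement case no row is flagged until the user chooses, and the flag then follows the user's choice rather than the VCG election.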

Peer review

In High and Max accuracy modes, a designated reviewer model independently evaluates the winner's answer. The reviewer receives the original query and the winner's response — without knowing which model produced it or what the VCG scores were.

Reviewer isolation

The reviewer is deliberately isolated from the scoring system (reviewer_no_leak compliance rule). This prevents the reviewer from being influenced by knowledge of the competition outcome.

Review verdicts

Verdict | Action
correct | Winner rewarded. If models agreed, all models get +1.
incorrect | Correction stored. Winner penalised in effective_u. Answer shown with callout.
partially_correct | Correction stored for the incorrect portion. Answer shown with caveat.

Context prompts

Each model generates its own recovery prompt — a system message incorporating all active corrections, the reliability trajectory, and top domains. The prompt is written by the model itself (it knows its own style best) and saved to model_context_prompts.

Staleness detection

Staleness conditions (any one triggers):
  prompt does not exist
  age > 86400s (24 hours)
  sha256(correction_ids) ≠ stored corrections_hash

The corrections hash is a fast-path check: comparing a 16-character hex digest (a truncated SHA-256; the full hex digest is 64 characters) against a stored value avoids a full timestamp scan of the corrections table on every staleness check. Only stale models are called for regeneration; fresh models return their saved text immediately.

A background asyncio job runs every 15 minutes to regenerate stale prompts silently. The manual Generate / View prompt button is a fallback for the window between staleness and the next job run.


Empirical validation

The AUA framework underlying Veritas has been validated across three dimensions:

Metric | Result | Method
Routing correctness gain | +10.5% (p=0.029) | Controlled experiment vs single-model baseline
Repeated error reduction | 69.6% | 200 simulated queries with injected correction events
Utility–correctness correlation | r=0.461 (p<10⁻⁴⁰) | Pearson correlation across routing simulation dataset

The utility–correctness correlation of r=0.461 (p<10⁻⁴⁰) confirms that the VCG welfare score is a meaningful predictor of answer correctness — not just a post-hoc rationalization. Models with higher effective_u in a domain genuinely produce more correct answers in that domain.

Full methodology: See the AUA Framework whitepaper for simulation code, experiment design, and full statistical analysis.