How AUA-Veritas works
A technical design reference covering the full system architecture, the AUA framework components it implements, and the empirical validation behind key design decisions.
System overview
AUA-Veritas is a macOS desktop application that wraps the Adaptive Utility Agent (AUA) framework in a consumer-friendly chat interface. It runs multiple large language models simultaneously, applies mechanism design to select the best answer, and maintains a persistent correction memory that improves routing over time.
The system has four distinct layers, each adding correctness on top of the previous:
Memory layer
Stored corrections are scored and injected into the system prompt before every query. Relevance scoring uses 8 factors including semantic similarity, recency, and token cost.
Competition layer (VCG)
All enabled models answer simultaneously. A domain-decomposed VCG welfare function selects the winner based on domain-specific historical win rates. Models are incentivised to report their true domain competencies.
Verification layer (peer review)
When accuracy mode is High or Max, a designated reviewer model independently checks the winner's answer. Correct answers reward the model; incorrect answers trigger a correction storage event.
Compliance layer
A compliance monitor checks every response against active instructions. Violations trigger escalating bold reinforcement in the next system prompt. The domain tree grows automatically from model self-reports.
Technology stack
| Component | Technology | Rationale |
|---|---|---|
| Desktop shell | Electron 30 (arm64) | Native macOS integration, system tray, Keychain access via keytar |
| Frontend | React 18 + TypeScript + Vite | Type safety, hot reload in dev, minimal bundle in prod |
| Backend | FastAPI + uvicorn (Python 3.11) | Async-native, easy to add endpoints, Pydantic validation |
| Database | SQLite 3 (WAL mode) | Zero-config, file-portable, WAL for concurrent reader/writer |
| Encryption | AES-128 Fernet (cryptography) | Message content encrypted at rest; keys in Keychain |
| ML inference | spaCy + custom trigger model | Offline correction detection; no cloud ML dependency |
| Distribution | PyInstaller + electron-builder | Single DMG; no Python or Node required on user's machine |
The backend binary is built with PyInstaller in --onedir mode (not --onefile). This eliminates the 8–15 second cold-start extraction penalty of single-file builds — files are read directly from disk at the binary's path.
Data model
All state is stored in a single SQLite file at ~/Library/Application Support/AUA-Veritas/veritas.db. Key tables:
| Table | Purpose |
|---|---|
| conversations | Chat session metadata (title, timestamps) |
| messages | Encrypted message content, model attribution, confidence labels |
| model_runs | One row per model per query: welfare score, vcg_winner flag, domain, latency |
| corrections | Stored corrections and preferences with type, scope, domain, decay class |
| audit_log | Score events for Look Under the Hood graphs |
| domain_nodes | Dynamic domain tree: node_id, parent_id, depth, aliases (JSON) |
| domain_candidates | Candidate domain strings awaiting promotion review |
| model_context_prompts | Per-model recovery prompts, generated_at, corrections_hash for staleness |
| context_backups | Per-model context summaries for conversation continuity |
| local_settings | User preferences, correction thresholds, user_token for bug reports |
VCG mechanism design
AUA-Veritas implements the Vickrey–Clarke–Groves (VCG) mechanism from the AUA framework. In a standard VCG auction, a social planner selects the allocation that maximises total welfare. Here, "welfare" is the expected correctness of the answer a model will produce for this query in this domain.
The welfare formula
W_i(q) = Σ_j p(j | q) · u_i(j)
where:
W_i(q) = welfare score of model i for query q
p(j | q) = probability that query q belongs to domain j
u_i(j) = effective utility of model i in domain j
j ∈ {code, mathematics, science, legal, medical, finance, writing, analysis, history, general, ...}
The domain probabilities p(j|q) are derived from model self-reports — each model appends a DOMAINS: tag to its response listing the 1–3 domains it considers most relevant. The router strips the tag, aggregates across all models (equal weight split per model's list), and normalises to produce a probability distribution.
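The aggregation described above can be sketched as follows. This is a minimal illustration, not the router's actual code: the function names `parse_domains_tag` and `aggregate_domain_probs` are hypothetical, and the comma-separated tag format is an assumption.

```python
from collections import defaultdict


def parse_domains_tag(response: str) -> tuple[str, list[str]]:
    """Strip a trailing 'DOMAINS:' line and return (body, domains).

    Assumes a comma-separated tag within the last three lines of the response.
    """
    lines = response.rstrip().splitlines()
    start = max(len(lines) - 3, 0)
    for idx in range(len(lines) - 1, start - 1, -1):
        stripped = lines[idx].strip()
        if stripped.upper().startswith("DOMAINS:"):
            raw = stripped[len("DOMAINS:"):]
            domains = [d.strip().lower() for d in raw.split(",") if d.strip()]
            body = "\n".join(lines[:idx] + lines[idx + 1:]).rstrip()
            return body, domains[:3]  # keep at most three domains
    return response.rstrip(), []


def aggregate_domain_probs(reports: dict[str, list[str]]) -> dict[str, float]:
    """Give each model one unit of weight, split equally over its listed
    domains, then normalise the totals into a probability distribution."""
    totals: dict[str, float] = defaultdict(float)
    for domains in reports.values():
        if not domains:
            continue  # a model with no tag contributes nothing
        share = 1.0 / len(domains)
        for domain in domains:
            totals[domain] += share
    z = sum(totals.values())
    return {d: w / z for d, w in totals.items()} if z else {}
```

With two models reporting `["code"]` and `["code", "analysis"]`, the first contributes 1.0 to code and the second 0.5 to each of its domains, so the normalised distribution is 0.75 for code and 0.25 for analysis.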
Why VCG is the right mechanism
🎯 Three AUA-proven properties
- Dominant strategy truthfulness — Models maximise their expected score by reporting their genuine domain competencies, not by inflating claims. A model that misreports its domain gets lower welfare scores over time as its actual win rate diverges from its claimed domain strength.
- Price of Anarchy = 1 — The VCG allocation is socially optimal. No other allocation rule produces higher expected total welfare across all models.
- Individual rationality — Every model participating in the competition is at least as well off as not participating (they receive a welfare score ≥ 0).
Utility function — effective_u
The domain-specific effective utility u_i(j) is a volume-normalized win rate. Raw win rates are unreliable with few observations — a model with 1 win from 1 query has the same win rate as one with 20 wins from 20 queries, but very different reliability.
effective_u(i, j) = α · win_rate(i, j) + (1 − α) · 0.5
α = min(1.0, n_ij / N_cliff), where N_cliff = 20 and n_ij is the number of runs of model i in domain j
win_rate(i, j) = wins in domain j / total runs in domain j
0.5 = neutral bootstrap prior
At n=0, the utility is 0.5: neutral, with no advantage to established models. At n=20, the observed win rate fully dominates. Between these values, the utility is a linear blend of the two, so each additional observation shifts weight from the prior to the observed win rate. This is a simple form of shrinkage toward the prior.
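A minimal sketch of the volume-normalized utility and the resulting welfare comparison. It assumes the linear blend `α · win_rate + (1 − α) · 0.5` implied by the endpoints described above; the function names and the `(wins, runs)` data layout are hypothetical.

```python
N_CLIFF = 20   # observations at which the win rate fully dominates
PRIOR = 0.5    # neutral bootstrap prior


def effective_u(wins: int, runs: int, n_cliff: int = N_CLIFF) -> float:
    """Blend the observed win rate with the prior, weighted by data volume."""
    alpha = min(1.0, runs / n_cliff)
    win_rate = wins / runs if runs else 0.0
    return alpha * win_rate + (1.0 - alpha) * PRIOR


def welfare(domain_probs: dict[str, float],
            stats: dict[str, tuple[int, int]]) -> float:
    """Expected utility of one model: sum over domains of p(j|q) * u_i(j).

    `stats` maps domain -> (wins, runs); unseen domains fall back to the prior.
    """
    return sum(p * effective_u(*stats.get(domain, (0, 0)))
               for domain, p in domain_probs.items())


def pick_winner(domain_probs: dict[str, float],
                model_stats: dict[str, dict[str, tuple[int, int]]]) -> str:
    """Return the model with the highest welfare score for this query."""
    return max(model_stats, key=lambda m: welfare(domain_probs, model_stats[m]))
```

Under this blend, a model with 1 win from 1 query scores 0.525 while a model with 20 wins from 20 queries scores 1.0, which is the reliability gap the raw win rate hides.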
Hierarchical utility (Phase 12)
For queries classified to deep domain nodes (e.g. molecular_biology → biology → science), the effective utility walks up the tree to find the deepest level with sufficient data:
VCG uses the deepest node with n_ij ≥ n_cliff(depth)
Falls back toward the L0 root until data is sufficient
Deeper nodes need fewer observations because they represent rarer, more specific queries. A model that has answered 8 protein-folding questions has demonstrated enough signal to use that node's effective_u rather than falling back to the biology or science level.
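The walk up the tree might look like the sketch below. The `n_cliff(depth)` schedule used here is purely hypothetical (the source states only that deeper nodes need fewer observations), as are the function names and the dictionary-based tree representation.

```python
from typing import Optional


def n_cliff(depth: int) -> int:
    """Hypothetical schedule: 20 observations at the root, fewer when deeper."""
    return max(4, 20 - 4 * depth)


def resolve_node(node: str,
                 parents: dict[str, Optional[str]],
                 depths: dict[str, int],
                 runs: dict[str, int]) -> str:
    """Return the deepest node on the path to the root with enough data.

    Falls back to the L0 root when no node on the path meets its threshold.
    """
    current = node
    while True:
        parent = parents.get(current)
        if parent is None or runs.get(current, 0) >= n_cliff(depths[current]):
            return current
        current = parent
```

The node returned here is the one whose effective_u would be used in the welfare sum for this model.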
Dynamic domain ontology
The domain taxonomy is a rooted tree that grows naturally from model self-reports. Ten L0 root domains are fixed anchors. Everything below is dynamic.
Three-tier normalization
When a model reports a domain string (e.g. "molecular biology"), the system resolves it through three tiers:
Candidate promotion
Strings in the candidate queue are evaluated by a background job every 5 minutes. A candidate is promoted to a full node when all of the following hold:
📋 Promotion criteria
- query_count ≥ 5 — noise gate: one-off strings are not promoted
- model_count ≥ 2 — not a single-model quirk
- Divergence test passes — the primary criterion: mean |u_candidate − u_parent| > δ(depth)
- Not already covered — re-run alias lookup with latest index
d=0: δ=0.10 (L0 splits easily — broad domains)
d=3: δ=0.25 (mid-tree needs stronger evidence)
d=6: δ=0.40 (deep nodes need very strong divergence)
The divergence threshold is depth-relative because a 0.15 gap in effective utility at the root level (above δ=0.10) indicates a meaningfully distinct domain, while the same gap at depth 6 (below δ=0.40) is likely sampling noise in a very specific sub-domain.
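The promotion criteria can be sketched as a single predicate. The three listed thresholds lie on the line δ(d) = 0.10 + 0.05·d, so this sketch interpolates linearly; that interpolation, the `Candidate` dataclass, and taking the mean across models are all assumptions rather than confirmed details.

```python
from dataclasses import dataclass
from statistics import mean


def delta(depth: int) -> float:
    """Assumed linear fit through the listed points (d=0, 3, 6)."""
    return 0.10 + 0.05 * depth


@dataclass
class Candidate:
    query_count: int
    model_count: int
    depth: int
    u_candidate: dict[str, float]   # per-model utility on the candidate string
    u_parent: dict[str, float]      # per-model utility on the parent node
    already_covered: bool           # result of re-running the alias lookup


def should_promote(c: Candidate) -> bool:
    """Apply the noise gates first, then the divergence test."""
    if c.query_count < 5 or c.model_count < 2 or c.already_covered:
        return False
    shared = c.u_candidate.keys() & c.u_parent.keys()
    if not shared:
        return False
    divergence = mean(abs(c.u_candidate[m] - c.u_parent[m]) for m in shared)
    return divergence > delta(c.depth)
```

With a mean gap of 0.15, a candidate at depth 0 passes while an otherwise identical candidate at depth 6 does not.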
Compliance monitor
Models periodically stop following system prompt instructions — most commonly the DOMAINS: tag under context window pressure. The compliance monitor detects this and escalates automatically.
Rule categories
| Rule ID | Check | Type |
|---|---|---|
| domain_tag | DOMAINS: tag present and non-empty in last 3 lines | Structural (free) |
| no_scoring_leak | Model doesn't mention welfare/vcg/scoring context | Structural (free) |
| reviewer_no_leak | Reviewer doesn't reveal it is reviewing | Structural (free) |
Streak-based escalation
Each rule maintains a consecutive-failure counter per model. When a counter reaches the threshold (default 2), the next system prompt highlights the violated instruction in bold:
The streak resets to 0 immediately when the rule passes. The system self-heals without user intervention.
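A minimal sketch of the streak counter and the domain_tag check. The class name, method names, and the `**…**` bold markup are hypothetical; only the threshold of 2, the per-model per-rule counter, and the reset-on-pass behaviour come from the description above.

```python
from collections import defaultdict


def has_domain_tag(response: str) -> bool:
    """domain_tag rule: a non-empty DOMAINS: tag within the last 3 lines."""
    for line in response.rstrip().splitlines()[-3:]:
        stripped = line.strip()
        if stripped.upper().startswith("DOMAINS:") and stripped[len("DOMAINS:"):].strip():
            return True
    return False


class ComplianceStreaks:
    """Tracks consecutive failures per (model, rule) pair."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._streaks: dict[tuple[str, str], int] = defaultdict(int)

    def record(self, model: str, rule_id: str, passed: bool) -> None:
        key = (model, rule_id)
        # A pass resets the streak immediately; a failure extends it.
        self._streaks[key] = 0 if passed else self._streaks[key] + 1

    def render_instruction(self, model: str, rule_id: str, text: str) -> str:
        """Return the instruction, bolded once the streak hits the threshold."""
        if self._streaks[(model, rule_id)] >= self.threshold:
            return f"**{text}**"
        return text
```

Because the counter is consulted only when the next system prompt is built, escalation and recovery both happen without any user-visible step.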
Keyword indexing and real-time search
Every message is keyword-extracted and stored in a local SQLite table. Searches across months of conversation history complete in microseconds because search never touches the DB at query time. Everything runs from two co-maintained in-memory structures.
Two in-memory structures
_KW_INDEX: dict mapping each keyword to a set of conv_ids (exact match via hash lookup)
_KW_SORTED: sorted list of all keywords (prefix search via bisect)
Both structures live in Python memory. Keypress search is pure Python set operations: no I/O, no SQLite, no disk. Memory footprint at scale:
| Usage | Unique keywords | Index memory |
|---|---|---|
| Small (1k messages) | ~15k | ~4 MB |
| Medium (5k messages) | ~60k | ~17 MB |
| Large (20k messages) | ~200k | ~57 MB |
Message-level scroll-to (clicking a result) uses a targeted DB query on (conversation_id, keyword). Both columns are indexed, and the query runs once per click, not per keypress. The memory vs single-DB-lookup tradeoff is deliberate: keeping message-level sets in memory would add 3–4× overhead (measured: 364 MB at 20k messages) for a feature used far less frequently than search-as-you-type.
Inverted index — how it works
When the keyword "postgres" is extracted from a message, it is added to _KW_INDEX["postgres"] as a set containing that conversation's ID. If "postgres" appears in 47 conversations, _KW_INDEX["postgres"] is a set of 47 conv_ids.
A search for "postgres index" becomes:
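A minimal sketch of that lookup as a set intersection over the inverted index. The function name `search_all` and the sample data are hypothetical; the index is passed in as a plain dict of keyword to set of conv_ids.

```python
def search_all(query: str, index: dict[str, set[int]]) -> set[int]:
    """Return conv_ids that contain every word of the query (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the smallest posting set so each intersection stays cheap.
    postings = sorted((index.get(w, set()) for w in words), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # no conversation can match any more
    return result


sample_index = {
    "postgres": {1, 2, 5, 9},
    "index": {2, 3, 9},
}
matches = search_all("postgres index", sample_index)  # conversations 2 and 9
```

Sorting by set size first is what keeps the multi-word case proportional to the smallest set rather than the largest.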
Prefix search via bisect
_KW_SORTED is a sorted list of all unique keywords maintained in parallel with the inverted index. Prefix search (typing "post" matching "postgres", "posting", "post-hoc") uses Python's bisect module:
This makes real-time sidebar search work correctly — each keystroke triggers a prefix search that returns results before the user finishes typing, with no perceptible latency.
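One way to implement the prefix lookup with `bisect` is sketched below. The function names are hypothetical, and the upper bound relies on appending the highest Unicode code point to the prefix, which assumes no keyword contains that character.

```python
import bisect


def prefix_matches(sorted_keywords: list[str], prefix: str) -> list[str]:
    """Return every keyword that starts with `prefix` using two binary searches."""
    lo = bisect.bisect_left(sorted_keywords, prefix)
    # Any string starting with `prefix` sorts before prefix + U+10FFFF.
    hi = bisect.bisect_left(sorted_keywords, prefix + "\U0010ffff")
    return sorted_keywords[lo:hi]


def prefix_search(prefix: str,
                  sorted_keywords: list[str],
                  index: dict[str, set[int]]) -> set[int]:
    """Union the conv_id sets of all keywords matching the prefix."""
    result: set[int] = set()
    for keyword in prefix_matches(sorted_keywords, prefix):
        result |= index.get(keyword, set())
    return result


def add_keyword(sorted_keywords: list[str], keyword: str) -> None:
    """Insert a new keyword while keeping the list sorted and duplicate-free."""
    pos = bisect.bisect_left(sorted_keywords, keyword)
    if pos == len(sorted_keywords) or sorted_keywords[pos] != keyword:
        sorted_keywords.insert(pos, keyword)
```

The two binary searches find the slice boundaries in O(log n); the cost of the union then depends only on how many keywords share the prefix.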
Message-level index for scroll-to
_KW_MSG_INDEX maps each keyword to a nested structure: keyword → conv_id → {msg_ids}. When you click a search result, the frontend scrolls to the exact message that matched — not just the conversation. The best match is the earliest message containing all query words:
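A sketch of the best-match rule over the nested keyword → conv_id → {msg_ids} mapping, however that mapping is obtained. It assumes message IDs increase with time, so the earliest message is the minimum ID; the function name is hypothetical.

```python
from typing import Optional


def best_message(msg_index: dict[str, dict[int, set[int]]],
                 conv_id: int,
                 words: list[str]) -> Optional[int]:
    """Return the earliest message in `conv_id` containing all query words."""
    if not words:
        return None
    per_word = [msg_index.get(w, {}).get(conv_id, set()) for w in words]
    common = set.intersection(*per_word)
    # Assumes msg_ids are assigned in chronological order.
    return min(common) if common else None
```

If no single message contains every query word, the function returns `None`, and a caller could fall back to opening the conversation without scrolling.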
Extraction pipeline (async, zero response latency)
Extraction runs in a background asyncio worker — never on the response path:
spaCy POS-tags each message (capped at 2000 chars) and keeps nouns, proper nouns, and non-stopword tokens. Code tokens (camelCase, snake_case, dotted paths) are identified by pattern and given a 2× weight boost so architectural terms surface reliably in long prose responses.
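The background worker and the code-token boost can be sketched as follows. To stay self-contained, this sketch replaces spaCy POS tagging with plain whitespace tokenisation (so no stopword or part-of-speech filtering happens here); the regular expressions, function names, and queue payload shape are assumptions.

```python
import asyncio
import re
from collections import Counter

# Hypothetical patterns for the three code-token shapes named above.
CAMEL = re.compile(r"[a-z]+(?:[A-Z][a-z0-9]*)+")
SNAKE = re.compile(r"[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+")
DOTTED = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+")


def is_code_token(token: str) -> bool:
    return any(p.fullmatch(token) for p in (CAMEL, SNAKE, DOTTED))


def extract_keywords(text: str, max_chars: int = 2000) -> dict[str, float]:
    """Weight each token, giving code-like tokens a 2x boost."""
    weights: Counter = Counter()
    for raw in text[:max_chars].split():
        token = raw.strip(".,;:!?()[]{}\"'`")
        if len(token) < 3:
            continue
        weights[token.lower()] += 2.0 if is_code_token(token) else 1.0
    return dict(weights)


async def extraction_worker(queue: asyncio.Queue, store: dict) -> None:
    """Drain (msg_id, text) items off the response path, forever."""
    while True:
        msg_id, text = await queue.get()
        try:
            store[msg_id] = extract_keywords(text)
        finally:
            queue.task_done()


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    store: dict = {}
    worker = asyncio.create_task(extraction_worker(queue, store))
    queue.put_nowait((1, "Add a partial index via create_index"))  # returns immediately
    await queue.join()
    worker.cancel()
    return store
```

The response path only calls `put_nowait`, which is why extraction adds no latency to the reply.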
Schema
The SQLite table is the persistent store — the in-memory index is rebuilt from it at startup. The DB index on keyword makes the startup load fast even with millions of keyword rows. After startup, the DB is only written to (never read) during normal search operations.
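A hypothetical schema consistent with the description above, together with the startup rebuild and the on-click lookup. The table name `message_keywords`, its columns, and the index name are assumptions for illustration, not the application's actual schema.

```python
import sqlite3
from collections import defaultdict

# Hypothetical table layout; the composite primary key leads with keyword,
# and the secondary index serves the (conversation_id, keyword) lookup.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message_keywords (
    keyword         TEXT    NOT NULL,
    conversation_id INTEGER NOT NULL,
    message_id      INTEGER NOT NULL,
    weight          REAL    NOT NULL DEFAULT 1.0,
    PRIMARY KEY (keyword, conversation_id, message_id)
);
CREATE INDEX IF NOT EXISTS idx_kw_conv
    ON message_keywords (conversation_id, keyword);
"""


def rebuild_index(conn: sqlite3.Connection) -> tuple[dict[str, set[int]], list[str]]:
    """Load the inverted index and the sorted keyword list at startup."""
    index: dict[str, set[int]] = defaultdict(set)
    rows = conn.execute(
        "SELECT DISTINCT keyword, conversation_id FROM message_keywords"
    )
    for keyword, conv_id in rows:
        index[keyword].add(conv_id)
    return dict(index), sorted(index)


def messages_for(conn: sqlite3.Connection, conv_id: int, keyword: str) -> list[int]:
    """On-click lookup: message IDs for one keyword in one conversation."""
    rows = conn.execute(
        "SELECT message_id FROM message_keywords "
        "WHERE conversation_id = ? AND keyword = ? ORDER BY message_id",
        (conv_id, keyword),
    )
    return [row[0] for row in rows]
```

`rebuild_index` is the only read the search path needs from the DB; `messages_for` corresponds to the single indexed query issued when a result is clicked.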
Exact keyword match: O(1) hash lookup → O(|set|) collect conv_ids
Prefix match (partial typing): O(log n) bisect → O(k·|set|) union
Multi-word AND: O(|smallest set|) set intersection
Scroll-to-message (on click): 1 indexed DB query, not on hot path
DB calls at keypress: zero.
Context recovery
AUA-Veritas is designed so that resuming a conversation after months requires no user action. Three independent mechanisms layer together to restore full context automatically.
Layer 1 — Correction injection (always active)
This is the primary mechanism. Before every query — including the first message in a resumed conversation — all stored corrections are scored and the relevant ones injected into the system prompt. Because corrections are global and persistent, a preference you expressed six months ago in a different conversation applies to today's query if it's semantically relevant.
The injection scorer ranks by semantic similarity to the current query, so age alone does not exclude a correction
"Always use TypeScript" from 8 months ago scores high for any JS/TS query
decay_class=A corrections (explicit instructions) never decay
Layer 2 — Model recovery prompts
Each model generates and stores its own recovery prompt — a self-written system message that summarises the active corrections, preferred domains, and its own reliability trajectory. The model writes this in its own style, incorporating the full set of active corrections it would need to behave correctly.
Staleness detection is described in the Context prompts section below.
Layer 3 — Context window backups (structured handoff notes)
For long-running conversations, each model periodically writes a structured handoff note — not a raw summary, but a purpose-built document for resuming in a new window. When context pressure is detected, this note is injected before the system context so the model picks up immediately.
The backup prompt forces the model to capture five things explicitly:
The backup is injected with a framing header that tells the model it is resuming a conversation and should use the note silently without mentioning it to the user.
| Backup frequency | Setting | Best for |
|---|---|---|
| Auto | Default | Most users — triggers on context pressure only |
| Every 15 min | Settings → Context backup | Very long sessions, complex projects |
| Hourly | Settings → Context backup | Background research sessions |
| Manual | Settings → Context backup | Power users who want explicit control |
Memory architecture
The memory system stores corrections in the corrections table with 12 fields including type, scope, domain, confidence, decay class, and corrective instruction. Corrections are never deleted — they may be superseded (scope set to "superseded") but the audit trail is preserved.
Correction types
| Type | Example | Scope |
|---|---|---|
| factual_correction | "The answer is 53, not 57" | Conversation |
| persistent_instruction | "Always use metric units" | Global |
| preference_rule | "Prefer Postgres over MySQL" | Global or domain |
| model_preference | "User preferred Claude on a disagreement" | Global |
| domain_rule | "In legal contexts, cite specific statute sections" | Domain |
Decay classes
Corrections have a decay class (A, B, or C) that determines how quickly they lose relevance in the scoring function. Class A corrections (explicit user instructions) never decay. Class C corrections (inferred from implicit signals) decay over 30 days without reinforcement.
Correction detection pipeline
Implicit corrections (natural language replies like "No, that's wrong — it's 53") go through a three-stage pipeline before storage:
Rule-based trigger detection
Fast pattern matching: negation words ("no", "actually", "wrong"), explicit prefixes ("correction:", "I know that"), question-to-assertion shift. Filters 80% of non-corrections at zero cost.
Semantic similarity check
A fine-tuned spaCy model computes semantic similarity between the candidate correction and the previous AI response. Score above threshold (default 0.45) → likely correction. Below → likely a new question.
Validation (Plausible / Strict)
A cheap judge model checks if the correction could plausibly be true. In Strict mode, a full cross-check is run. Confirmed corrections are stored; rejected ones are discarded with a note.
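The first two stages can be sketched as a cheap gate followed by a similarity threshold. The regular expression, function names, and return labels are hypothetical; the question-to-assertion shift check is omitted, and the similarity model is passed in as a callable rather than implemented.

```python
import re
from typing import Callable

# Trigger words and prefixes taken from the examples above.
NEGATION = re.compile(r"\b(no|actually|wrong)\b", re.IGNORECASE)
PREFIXES = ("correction:", "i know that")


def looks_like_correction(reply: str) -> bool:
    """Stage 1: rule-based trigger detection, no model call."""
    text = reply.strip().lower()
    return text.startswith(PREFIXES) or bool(NEGATION.search(text))


def classify_reply(reply: str,
                   previous_answer: str,
                   similarity: Callable[[str, str], float],
                   threshold: float = 0.45) -> str:
    """Stages 1 and 2: return a label for the validation stage to act on."""
    if not looks_like_correction(reply):
        return "not_correction"
    if similarity(reply, previous_answer) > threshold:
        return "candidate_correction"   # goes on to validation
    return "new_question"
```

Only replies labelled `candidate_correction` would reach the judge model, so the expensive stage runs on a small fraction of messages.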
Context injection scoring
Before every query, IncludeUtilityScorer scores all stored corrections and selects which to inject. The scoring function is a weighted combination of 8 factors:
+ w₄·recency + w₅·confidence + w₆·pinned
− w₇·staleness − w₈·token_cost
Threshold: 0.30 (configurable 0.20–0.80 in settings)
Corrections below the threshold are not injected, which prevents irrelevant memories from consuming the context window. Pinned corrections always pass regardless of score.
The relevance score uses the same semantic similarity model as the correction detector, comparing the stored correction's canonical query against the current question.
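The selection step can be sketched independently of how the score is computed. This sketch takes precomputed scores as input; the `ScoredCorrection` dataclass, the function names, and the ordering of the output are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScoredCorrection:
    text: str
    score: float          # output of the weighted scoring function
    pinned: bool = False


def clamp_threshold(value: float) -> float:
    """Keep the user-configured threshold inside the 0.20-0.80 range."""
    return min(0.80, max(0.20, value))


def select_for_injection(corrections: list[ScoredCorrection],
                         threshold: float = 0.30) -> list[ScoredCorrection]:
    """Keep pinned corrections unconditionally and others at or above threshold."""
    limit = clamp_threshold(threshold)
    chosen = [c for c in corrections if c.pinned or c.score >= limit]
    # Assumed ordering: pinned first, then highest score first.
    return sorted(chosen, key=lambda c: (not c.pinned, -c.score))
```

Raising the threshold trades recall for context budget: fewer corrections are injected, but each one is more likely to matter for the current query.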
Full query flow
A single query through AUA-Veritas traverses this path:
Peer review
In High and Max accuracy modes, a designated reviewer model independently evaluates the winner's answer. The reviewer receives the original query and the winner's response — without knowing which model produced it or what the VCG scores were.
Reviewer isolation
The reviewer is deliberately isolated from the scoring system (reviewer_no_leak compliance rule). This prevents the reviewer from being influenced by knowledge of the competition outcome.
Review verdicts
| Verdict | Action |
|---|---|
| correct | Winner rewarded. If models agreed, all models get +1. |
| incorrect | Correction stored. Winner penalised in effective_u. Answer shown with callout. |
| partially_correct | Correction stored for the incorrect portion. Answer shown with caveat. |
Context prompts
Each model generates its own recovery prompt — a system message incorporating all active corrections, the reliability trajectory, and top domains. The prompt is written by the model itself (it knows its own style best) and saved to model_context_prompts.
Staleness detection
A prompt is treated as stale when either condition holds:
age > 86400 s (24 hours)
sha256(correction_ids) ≠ stored corrections_hash
The corrections hash is a fast-path check: comparing a 16-character digest (a truncated SHA-256 hex string; the full hex digest is 64 characters) against a stored value avoids a full timestamp scan of the corrections table on every staleness check. Only stale models are called for regeneration; fresh models return their saved text immediately.
A background asyncio job runs every 15 minutes to regenerate stale prompts silently. The manual Generate / View prompt button is a fallback for the window between staleness and the next job run.
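A minimal sketch of the staleness check. It assumes the hash covers the sorted, comma-joined correction IDs, that the digest is truncated to 16 hex characters, and that the two conditions are combined with OR; the function names are hypothetical.

```python
import hashlib
import time
from typing import Iterable, Optional

MAX_AGE_SECONDS = 86_400  # 24 hours


def corrections_hash(correction_ids: Iterable[int]) -> str:
    """Order-independent fingerprint of the active correction set."""
    joined = ",".join(str(i) for i in sorted(correction_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def is_stale(generated_at: float,
             stored_hash: str,
             correction_ids: Iterable[int],
             now: Optional[float] = None) -> bool:
    """Stale if the prompt is older than 24 hours or the corrections changed."""
    current = time.time() if now is None else now
    too_old = (current - generated_at) > MAX_AGE_SECONDS
    changed = corrections_hash(correction_ids) != stored_hash
    return too_old or changed
```

Sorting the IDs before hashing means the fingerprint changes only when the set of corrections changes, not when they are read back in a different order.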
Empirical validation
The AUA framework underlying Veritas has been validated across three dimensions:
| Metric | Result | Method |
|---|---|---|
| Routing correctness gain | +10.5% (p=0.029) | Controlled experiment vs single-model baseline |
| Repeated error reduction | 69.6% | 200 simulated queries with injected correction events |
| Utility–correctness correlation | r=0.461 (p<10⁻⁴⁰) | Pearson correlation across routing simulation dataset |
The utility–correctness correlation of r=0.461 (p<10⁻⁴⁰) indicates a moderate, statistically significant relationship: the VCG welfare score carries real predictive signal about answer correctness and is not just a post-hoc rationalization. Models with higher effective_u in a domain tend to produce more correct answers in that domain.