Tutorial

Learn how to use AUA-Veritas — from installation to interpreting the full decision analytics in Look Under the Hood.

Two ways to use AUA-Veritas

Most people start with one and discover the other. Both are always running.

🧠 As a context manager

  • Every conversation keyword-indexed and searchable
  • Leave for months, come back — AI resumes instantly
  • Corrections and preferences persist forever, across all chats
  • Works even if you only use one model

⚖️ As a multi-model verifier

  • Multiple models compete on every query
  • Disagreements surfaced explicitly — you pick the better answer
  • Independent peer review on High/Max accuracy
  • Game-theoretically optimal model selection via VCG

If you only need one thing: Use Balanced mode with a single model (GPT-4o or Claude Sonnet). The context management, keyword search, and correction memory all work exactly the same. You can add more models later when it matters.


Resuming a conversation after months

This is where the context management pays off. You don't need to re-explain your project every time you come back to it.

What happens automatically

1

Corrections and preferences are injected

Every stored correction is scored against your new message. "Always use TypeScript", "prefer metric units", "our API uses kebab-case" — anything you've taught the AI is injected into the context automatically, even if you taught it months ago in a different conversation.

2

Model recovery prompts fire

Each model generates and stores a recovery prompt — a self-written system message that summarises your active corrections, preferred domains, and reliability record. When you return to a conversation, stale recovery prompts are regenerated silently in the background before your message is sent.

3

Context backups carry over session history

At configurable intervals (auto, 15 min, hourly), each model writes a compressed context summary. These summaries are injected when the conversation window would otherwise lose history — you never hit a wall where the AI "forgets" the first half of a long project.

What you need to do

Almost nothing. Click on a conversation from the sidebar, type your follow-up question, and send. The AI picks up from where you left off. The only time you need to re-explain something is if you deliberately want to change direction.

📋 Tips for long-running projects

  • Use Projects to group related conversations — corrections scoped to a project only apply within it
  • Pin important corrections in the Memory tab so they always inject regardless of relevance score
  • Use Generate / View prompt ↗ to manually trigger a recovery prompt refresh after a major project change
  • Search for old decisions before asking a new question — saves tokens and keeps reasoning consistent

Installation

AUA-Veritas runs natively on macOS with Apple Silicon (M-series). No Python or Node installation needed — everything is bundled.

1

Download the DMG

Go to the GitHub Releases page and download AUA-Veritas-0.1.0-arm64.dmg.

2

Mount and install

Double-click the DMG to mount it. Drag AUA-Veritas to your Applications folder.

3

Bypass Gatekeeper on first launch

The app is unsigned. On first launch, right-click the app, choose Open, then click Open in the confirmation dialog. You only need to do this once.

System requirements: macOS 12+ · Apple Silicon (M1/M2/M3/M4) · ~250 MB disk space

Adding API keys

Veritas calls frontier model APIs directly — you pay the providers at cost, no markup. You need at least one key, but more means better competition.

📋 Supported providers

  • OpenAI — GPT-4o, GPT-4o-mini. Get at platform.openai.com/api-keys
  • Anthropic — Claude Sonnet 4, Claude Haiku. Get at console.anthropic.com
  • Google — Gemini 1.5 Pro, Gemini 2.0 Flash. Get at aistudio.google.com
  • Groq — Llama 3.3 70B (free tier available). Get at console.groq.com

1

Open Settings

Click the ⚙️ gear icon in the top-right of the sidebar, or use Cmd+,.

2

Enter keys in the API Keys section

Keys are stored in macOS Keychain — encrypted at the OS level, never written to disk in plain text.

3

Enable the models you want

Use the checkboxes in the left sidebar to enable or disable models per-conversation. More models = better VCG elections, but higher cost.

Recommendation: Enable GPT-4o + Claude Sonnet + one Groq model as a baseline. This gives you OpenAI's breadth, Anthropic's reasoning, and a free fast model for the VCG competition.

Your first conversation

Type your question in the input box at the bottom and press Enter. Veritas handles everything else.

AUA-Veritas
You typed:
"What's the best way to index a Postgres table for a LIKE query?"
Veritas response (GPT-4o selected · 2.7s):
For prefix LIKE queries (e.g. name LIKE 'Jo%'), a standard B-tree index works well. For arbitrary substring searches (e.g. LIKE '%smith%'), use a GIN index with pg_trgm extension...
⚡ Memory injected: 1 correction · Domain: code/databases · Winner welfare: 0.68

🔍 What happened behind the scenes

  • Veritas scored your stored corrections against the question — any relevant memories were injected into the system prompt
  • All enabled models were called simultaneously with a competitive evaluation prompt
  • Each model self-reported its domain using a DOMAINS: tag (stripped before display)
  • The VCG welfare formula selected GPT-4o as the winner in this session's database domain
  • The answer was displayed and the model run was recorded for future routing
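
The DOMAINS: tag step in the list above could be handled roughly as follows. This is a minimal sketch, not the app's actual parser: the tutorial does not specify the tag format, so the code assumes the tag sits on its own line as a comma-separated list, and `split_domains_tag` is a hypothetical name.

```python
import re

# Assumed tag format: a standalone line such as "DOMAINS: code/databases, code/sql".
_DOMAINS_LINE = re.compile(r"^[ \t]*DOMAINS:[ \t]*(.*)$", re.MULTILINE)


def split_domains_tag(raw_response: str) -> tuple[str, list[str]]:
    """Return (text to display, self-reported domains) for one model response."""
    match = _DOMAINS_LINE.search(raw_response)
    if match is None:
        return raw_response.strip(), []
    domains = [d.strip() for d in match.group(1).split(",") if d.strip()]
    # Cut the tag line out so the user never sees it.
    display = (raw_response[:match.start()] + raw_response[match.end():]).strip()
    return display, domains


text, domains = split_domains_tag(
    "Use a GIN index with pg_trgm.\nDOMAINS: code/databases"
)
print(text)     # Use a GIN index with pg_trgm.
print(domains)  # ['code/databases']
```

Returning the domains separately from the display text keeps the routing data available for the model-run record while the answer stays clean.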

Accuracy modes

Choose how much verification overhead you want. Fast for quick questions, Max for anything critical.

Mode     | Models called       | Peer review               | Best for
---------|---------------------|---------------------------|---------------------------------------------
Fast     | 1 (VCG winner only) | No                        | Quick questions, low-stakes queries
Balanced | All enabled         | On disagreement only      | Most everyday use, good accuracy/cost ratio
High     | All enabled         | Always                    | Technical questions, important decisions
Max      | All enabled         | Always + correction check | High-stakes queries: medical, legal, financial

Tip: Use Balanced as your default. Switch to High when the answer matters and you want independent verification across models. Save Max for decisions you'd regret getting wrong.
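
The four modes above can be read as a small policy table. The sketch below transcribes the table into Python to make the differences explicit; the class, field, and function names are hypothetical and do not come from the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModePolicy:
    call_all_models: bool   # False means only the current VCG winner is called
    peer_review: str        # "never", "on_disagreement", or "always"
    correction_check: bool  # extra correction check (Max only)


# Values transcribed from the accuracy-mode table; names are illustrative.
MODES = {
    "fast": ModePolicy(False, "never", False),
    "balanced": ModePolicy(True, "on_disagreement", False),
    "high": ModePolicy(True, "always", False),
    "max": ModePolicy(True, "always", True),
}


def models_to_call(mode: str, enabled: list[str], vcg_winner: str) -> list[str]:
    """Fast calls one model; every other mode calls all enabled models."""
    return list(enabled) if MODES[mode].call_all_models else [vcg_winner]


def needs_peer_review(mode: str, models_disagree: bool) -> bool:
    review = MODES[mode].peer_review
    return review == "always" or (review == "on_disagreement" and models_disagree)


print(models_to_call("fast", ["gpt-4o", "claude-sonnet"], "gpt-4o"))  # ['gpt-4o']
print(needs_peer_review("balanced", models_disagree=True))            # True
```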

Memory & the correction system

This is what makes Veritas different. Every correction you make is stored permanently and injected into future queries where it's relevant.

How memory injection works

Before every query, Veritas scores all stored corrections against the current question using 8 factors:

📊 Scoring factors

  • Relevance — semantic similarity between the correction and the current query
  • Failure prevention value — how bad is it if this correction is missed?
  • Importance — correction priority (pinned corrections always score higher)
  • Recency — newer corrections are slightly preferred
  • Confidence — how certain was the correction at time of storage?
  • Staleness — corrections older than decay threshold score lower
  • Token cost — very long corrections are penalised to avoid context bloat
  • Pinned status — pinned corrections always pass the threshold

Only corrections scoring above 0.30 are injected; pinned corrections are the exception and are always injected. This prevents irrelevant memories from cluttering the context window.
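
To make the threshold concrete, here is one way the scoring could be combined. The tutorial names the factors and the 0.30 cut-off but not the formula, so the weights, the penalty treatment of staleness and token cost, and all identifiers below are assumptions for illustration only.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.30  # from the tutorial: only scores above this are injected

# Hypothetical weights; the tutorial does not say how the factors combine.
WEIGHTS = {
    "relevance": 0.35,
    "failure_prevention": 0.20,
    "importance": 0.15,
    "recency": 0.10,
    "confidence": 0.20,
}
PENALTIES = {"staleness": 0.15, "token_cost": 0.10}


@dataclass
class Correction:
    text: str
    factors: dict = field(default_factory=dict)  # each factor normalised to 0..1
    pinned: bool = False


def score(c: Correction) -> float:
    gain = sum(w * c.factors.get(name, 0.0) for name, w in WEIGHTS.items())
    loss = sum(w * c.factors.get(name, 0.0) for name, w in PENALTIES.items())
    return max(0.0, gain - loss)


def select_for_injection(corrections: list) -> list:
    # Pinned corrections bypass the threshold, as the factor list states.
    return [c for c in corrections if c.pinned or score(c) > THRESHOLD]
```

With these illustrative weights, a highly relevant correction clears the threshold easily, an unrelated one is dropped, and a pinned one is kept regardless of its score.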

Memory types stored: Factual corrections ("The answer is 53, not 57"), persistent instructions ("Always use metric units"), domain rules ("Prefer Postgres over MySQL"), and model preferences ("I prefer Claude's style on writing tasks").

Making corrections

Three ways to teach Veritas something new.

A

Implicit correction (just reply naturally)

Say "No, it's 53" or "Actually metric units please" — Veritas detects the correction intent using semantic similarity and asks you to confirm. Once confirmed, it's stored permanently.

B

Explicit correction prefix

Start your message with "correction:" to skip the detection step. For example: "correction: always use TypeScript, never plain JavaScript".

C

"I know that…" prefix

Use this for persistent instructions. For example: "I know that we use kebab-case for all our CSS class names". This is stored as a global rule and injected whenever CSS is relevant.

Correction sensitivity: You can adjust how aggressively implicit corrections are detected in Look Under the Hood → Memory tab → Correction intelligence settings. Lower threshold = more prompts, higher = fewer.
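
The two explicit prefixes from options B and C can be recognised with plain string checks before any semantic detection runs. This is a sketch under assumptions: the function name and return labels are hypothetical, and case-insensitive matching is a guess rather than documented behaviour.

```python
from typing import Optional


def _strip_prefix(text: str, prefix: str) -> Optional[str]:
    """Return the text after `prefix` (case-insensitive), or None if absent."""
    if text[:len(prefix)].lower() == prefix:
        return text[len(prefix):].strip()
    return None


def classify_message(message: str) -> tuple[str, str]:
    """Return (kind, payload) for a user message.

    kind is "explicit_correction", "persistent_instruction", or
    "needs_detection" (handed to the implicit detector, not shown here).
    """
    stripped = message.strip()
    payload = _strip_prefix(stripped, "correction:")
    if payload is not None:
        return "explicit_correction", payload
    payload = _strip_prefix(stripped, "i know that")
    if payload is not None:
        return "persistent_instruction", payload
    return "needs_detection", stripped


print(classify_message("correction: always use TypeScript"))
# ('explicit_correction', 'always use TypeScript')
```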

Model disagreements

When models give meaningfully different answers, Veritas surfaces the disagreement explicitly rather than silently picking one.

⚠ Models disagree on this answer
GPT-4o: "Use a singleton pattern here"
Claude Sonnet: "A singleton would be an antipattern — use dependency injection"
Pick the answer you prefer:
[ GPT-4o ]   [ Claude Sonnet ✓ ]

📌 What happens when you pick

  • Your preference is recorded as a model preference correction
  • The chosen model gets a VCG win recorded in that domain — its effective_u rises
  • Over time, the model you prefer on disagreements gets routed to more often in that domain
  • No point is awarded to any model until you pick — disagreements don't inflate winners
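
The bookkeeping described above might look like the following sketch. The tutorial does not say how effective_u is calculated, so the smoothed win rate used here, the decision to count a contest for every candidate, and all names are assumptions chosen so that a model with no history starts at the neutral value 0.5.

```python
from collections import defaultdict

# (model, domain) -> [wins, contests]; empty at start.
record = defaultdict(lambda: [0, 0])


def effective_u(model: str, domain: str) -> float:
    wins, contests = record[(model, domain)]
    # Laplace smoothing (an assumption): no history gives the neutral value 0.5.
    return (wins + 1) / (contests + 2)


def resolve_disagreement(candidates: list, chosen: str, domain: str) -> None:
    """Record the outcome only once the user has picked a winner."""
    if chosen not in candidates:
        raise ValueError("chosen model must be one of the candidates")
    for model in candidates:
        record[(model, domain)][1] += 1
    record[(chosen, domain)][0] += 1
```

Because nothing is written until `resolve_disagreement` is called, an unresolved disagreement leaves every model's score unchanged, which matches the last bullet above.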

Look Under the Hood — Overview tab

Click the 📊 chart icon in the top-right of the Quality panel to open the full analytics dashboard. It has 5 tabs.

The Overview tab shows high-level health metrics for your session:

📈 What you'll see

  • Total queries — how many questions asked this session
  • Agreement rate — % of queries where all models agreed (high = consistent, reliable answers)
  • Peer review rate — % of queries that triggered independent verification
  • Active corrections — how many stored memories are actively being injected
  • Session cost — total API spend, broken down by provider
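
For reference, the Overview figures are simple aggregates over the session's queries. The sketch below shows the arithmetic; the record layout and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    models_agreed: bool
    peer_reviewed: bool
    cost_by_provider: dict  # USD spent on this query, keyed by provider


def overview(queries: list) -> dict:
    total = len(queries)
    cost = {}
    for q in queries:
        for provider, usd in q.cost_by_provider.items():
            cost[provider] = cost.get(provider, 0.0) + usd
    return {
        "total_queries": total,
        "agreement_rate": sum(q.models_agreed for q in queries) / total if total else 0.0,
        "peer_review_rate": sum(q.peer_reviewed for q in queries) / total if total else 0.0,
        "session_cost": cost,
    }
```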

Look Under the Hood — Models tab

Per-model performance breakdown — who's winning, and why.

⚖️ What the scores mean

  • Welfare score — the VCG score this model got this session (higher = more wins)
  • Win rate — fraction of queries where this model was selected as the VCG winner
  • Confidence — average confidence score from peer review (80% = models agree it's correct)
  • Latency — average response time; it affects whether this model is used for queries in Fast mode
  • Domain strengths — which domains this model has been winning in across your history

What to look for: If one model consistently has a much higher win rate than others, consider whether it's genuinely better or just getting easier queries assigned to it by the routing. Check the Domains tab to see if the routing is domain-specific.

Look Under the Hood — Decisions tab

The most informative tab — a full trace of every decision made on each query.

Click any query in the list to expand its decision chain:

1 Correction check → 1 correction injected (metric units rule)
2 Memory retrieval → 3 relevant memories found, 2 above threshold
3 Models called → GPT-4o · Claude Sonnet · Gemini (3 models)
4 VCG selection → GPT-4o selected (W = 0.73)
    Claude Sonnet: W = 0.68
    Gemini: W = 0.61
5 Peer review → Correct (all models agree)
6 Confidence label → High (80%)

🔎 Reading the VCG scores

  • Scores are computed as W_i = Σ_j p(j|q) · u_i(j): for model i, the sum over domains j of its win rate in that domain, u_i(j), weighted by p(j|q), the weight of domain j for query q
  • A score near 0.5 means the model hasn't built up much history in this domain yet — neutral prior
  • A score above 0.7 means the model has a strong track record in this query's domain
  • Large gaps between models indicate strong domain specialization in your usage
  • If all scores are near 0.5, you haven't sent enough queries in this domain for differentiation yet
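
A short worked example of the welfare formula may help when reading these numbers. The code below is a sketch of the scoring step only; the domain weights and win rates are made-up inputs, and the function names are hypothetical. It uses 0.5 for any domain a model has no history in, matching the neutral prior described above.

```python
NEUTRAL_PRIOR = 0.5  # win rate assumed for a domain with no history


def welfare(domain_probs: dict, win_rates: dict) -> float:
    """W_i = sum over domains j of p(j|q) * u_i(j) for one model i."""
    return sum(p * win_rates.get(domain, NEUTRAL_PRIOR)
               for domain, p in domain_probs.items())


def select_winner(domain_probs: dict, models: dict) -> tuple:
    """Return (winning model name, all scores)."""
    scores = {name: welfare(domain_probs, u) for name, u in models.items()}
    return max(scores, key=scores.get), scores


# Illustrative inputs: the query is mostly about databases.
p = {"code/databases": 0.75, "code": 0.25}
models = {
    "gpt-4o": {"code/databases": 0.8, "code": 0.6},
    "claude-sonnet": {"code/databases": 0.7},  # no history in "code"
    "new-model": {},                           # no history at all
}
winner, scores = select_winner(p, models)
print(winner)  # gpt-4o
```

Here gpt-4o scores about 0.75, claude-sonnet about 0.65, and the model with no history lands exactly on the 0.5 prior.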

Look Under the Hood — Memory tab

View and manage all stored corrections and preferences. Two sub-views within one tab.

Memory view

All stored corrections with their type, domain, scope (global / project / conversation), and creation date. Filter by type using the pill buttons at the top.
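
The three scopes determine which corrections are even considered for a given conversation. A minimal sketch of that filter, with a hypothetical data layout and names, could look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredCorrection:
    text: str
    scope: str                      # "global", "project", or "conversation"
    scope_id: Optional[str] = None  # project or conversation id; None for global


def applicable(corrections: list, project_id: Optional[str], conversation_id: str) -> list:
    """Keep corrections whose scope covers the current conversation."""
    kept = []
    for c in corrections:
        if c.scope == "global":
            kept.append(c)
        elif c.scope == "project" and project_id is not None and c.scope_id == project_id:
            kept.append(c)
        elif c.scope == "conversation" and c.scope_id == conversation_id:
            kept.append(c)
    return kept
```

Filtering by scope first means a project-scoped rule never leaks into an unrelated project, whatever its relevance score would have been.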

Corrections view

Correction intelligence settings — two controls:

⚡ Correction intelligence settings

  • Implicit sensitivity slider (0.20 – 0.80) — how aggressively to detect corrections in your replies. Default 0.45. Lower = more prompts (catches more corrections, more false positives). Higher = fewer prompts (misses more, fewer interruptions).
  • Validation mode — Plausible (default): a cheap model checks if the correction could be true before storing it. Strict: full cross-check, slower but rejects more noise.
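
The slider behaves like a threshold on the detector's score. The sketch below shows that relationship using the range and default quoted above; the function names are hypothetical, and comparing with "greater than or equal" is an assumption.

```python
SENSITIVITY_MIN, SENSITIVITY_MAX, SENSITIVITY_DEFAULT = 0.20, 0.80, 0.45


def clamp_sensitivity(value: float) -> float:
    """Keep the setting inside the slider's documented range."""
    return min(SENSITIVITY_MAX, max(SENSITIVITY_MIN, value))


def should_prompt(correction_score: float, sensitivity: float = SENSITIVITY_DEFAULT) -> bool:
    """Ask the user to confirm when the detector's score reaches the threshold."""
    return correction_score >= clamp_sensitivity(sensitivity)


print(should_prompt(0.50))        # True  (default threshold 0.45)
print(should_prompt(0.50, 0.80))  # False (stricter setting, fewer prompts)
```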

Look Under the Hood — Domains tab

See the live domain taxonomy that Veritas has learned from your query patterns.

Tree view

The 10 L0 root domains are always present (code, mathematics, science, legal, medical, finance, writing, analysis, history, general). As you use the app, sub-domains grow beneath them when models consistently report a specific sub-domain and performance diverges from the parent.

Click any node to expand its children. The query count badge shows how many queries have been routed to that domain across your session history.

Candidates view

Domain strings that models have reported but haven't yet been promoted to full nodes. Each candidate shows:

📊 Candidate fields

  • raw_string — the exact string a model returned (e.g. "constitutional law")
  • nearest_node — the closest existing tree node (e.g. "legal")
  • similarity — edit-distance similarity to the nearest node (0–1)
  • query count — how many times this string has been reported (needs ≥5 to be considered for promotion)
  • model count — how many distinct models have reported it (needs ≥2)

Promotion: A candidate becomes a full node when it has ≥5 queries from ≥2 models AND the performance divergence between it and its parent node exceeds the branch-relative threshold. This happens automatically in a background job every 5 minutes.
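
The promotion rule can be expressed as a single predicate. In this sketch the two count limits come from the text, while the way divergence is measured (absolute difference in win rate) and the threshold value passed in are assumptions, since the tutorial does not define them. All names are hypothetical.

```python
from dataclasses import dataclass

MIN_QUERIES = 5  # from the tutorial
MIN_MODELS = 2   # from the tutorial


@dataclass
class Candidate:
    raw_string: str
    nearest_node: str
    query_count: int
    model_count: int
    win_rate: float         # observed performance on this candidate's queries
    parent_win_rate: float  # performance of the nearest (parent) node


def should_promote(c: Candidate, divergence_threshold: float) -> bool:
    """True when all three promotion conditions hold."""
    divergence = abs(c.win_rate - c.parent_win_rate)
    return (c.query_count >= MIN_QUERIES
            and c.model_count >= MIN_MODELS
            and divergence > divergence_threshold)


candidate = Candidate("constitutional law", "legal", 6, 2, 0.80, 0.60)
print(should_promote(candidate, divergence_threshold=0.10))  # True
```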

Context prompts

Each model writes its own recovery prompt — a system message it would send itself to restore full context if the conversation restarts.

Click Generate / View prompt ↗ at the bottom of the Memory panel (right sidebar) to open the modal. The system calls each stale model and asks it to write a prompt incorporating your current corrections, preferences, and top domains — then saves the result.

🔄 When context prompts are auto-sent

  • A model drops a rule (compliance monitor detects streak ≥ 2)
  • You start a new chat window after a long gap
  • Context window pressure is detected mid-conversation

Prompts are checked for staleness every 15 minutes in the background. A prompt is stale if it's older than 24 hours or a new correction has been added since it was generated. Only stale models are regenerated — fresh ones show their saved text immediately.
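
The staleness rule above translates directly into a timestamp comparison. This is a sketch with hypothetical names; it assumes each saved prompt carries the time it was generated and that the time of the most recent correction is known.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_AGE = timedelta(hours=24)  # from the tutorial


def is_stale(generated_at: datetime,
             last_correction_at: Optional[datetime],
             now: datetime) -> bool:
    """Stale if older than 24 hours or if a correction was added afterwards."""
    if now - generated_at > MAX_AGE:
        return True
    return last_correction_at is not None and last_correction_at > generated_at


def stale_models(prompts: dict, last_correction_at: Optional[datetime],
                 now: datetime) -> list:
    """Names of models whose saved prompt needs regenerating."""
    return [model for model, generated_at in prompts.items()
            if is_stale(generated_at, last_correction_at, now)]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and lets a background job reuse one timestamp for the whole pass.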


Local models (Ollama)

Run Ollama models locally for free, private inference. Local models take part in VCG elections the same way frontier models do.

1

Install Ollama

Download Ollama from ollama.com, then run "ollama pull llama3.2" (or pull any other model you prefer).

2

Enable in Settings

Toggle Local models in Settings. Veritas will auto-discover models running on the default Ollama port (11434).
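
If you want to check what a discovery step would find, you can ask Ollama yourself. The sketch below calls Ollama's documented GET /api/tags endpoint, which lists the models installed locally; whether Veritas uses this particular endpoint is an assumption, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # default Ollama port, as noted above


def parse_model_names(payload: dict) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def discover_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> list:
    """Return locally installed Ollama models, or [] if Ollama is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return parse_model_names(json.load(response))
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers malformed JSON.
        return []
```

An empty list usually means the Ollama service is not running or no model has been pulled yet.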

3

Local mode is exclusive

You can run frontier models OR local models — not both simultaneously. Use the toggle to switch. Local models are ideal for sensitive queries that shouldn't leave your machine.


Bug reporting

Found something broken? Report it in one click — no account, no email required.

Click the 🐛 Report a bug button at the bottom of the left sidebar. A modal opens with a comment field and opt-in checkboxes for including your last 5 messages (for context) and your email (for follow-up).

Privacy: Reports go to a private GitHub repo accessible only to the developer. An anonymous 8-character machine hash is included for correlation — no name, location, or personal data unless you explicitly opt in.
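
An 8-character correlation hash of this kind is typically a truncated one-way digest. The sketch below shows the general idea; the choice of SHA-256, the input identifier, and the function name are assumptions, as the tutorial does not describe how the hash is derived.

```python
import hashlib
import uuid


def machine_hash(machine_id: str) -> str:
    """Return the first 8 hex characters of SHA-256 over a machine identifier."""
    return hashlib.sha256(machine_id.encode("utf-8")).hexdigest()[:8]


# uuid.getnode() returns a hardware-address-based integer; using it as the
# identifier is an assumption for illustration only.
print(machine_hash(str(uuid.getnode())))
```

The same machine always yields the same 8 characters, so reports can be grouped, while the truncated digest cannot be reversed to recover the identifier.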