Tutorial
Learn how to use AUA-Veritas — from installation to interpreting the full decision analytics in Look Under the Hood.
Two ways to use AUA-Veritas
Most people start with one and discover the other. Both are always running.
🧠 As a context manager
- Every conversation keyword-indexed and searchable
- Leave for months, come back — AI resumes instantly
- Corrections and preferences persist forever, across all chats
- Works even if you only use one model
⚖️ As a multi-model verifier
- Multiple models compete on every query
- Disagreements surfaced explicitly — you pick the better answer
- Independent peer review on High/Max accuracy
- Game-theoretically optimal model selection via VCG
Keyword search across all chats
Keywords are extracted from every message you send and receive, then indexed locally. Search finds matches across months of conversations in milliseconds.
How to search
Use the search bar in the left sidebar. Type any keyword or phrase — the sidebar filters to matching conversations instantly.
🔧 How search works under the hood
- Every message is keyword-extracted by a background worker after it's sent — zero impact on response speed
- Keywords are stored in a local message_keywords SQLite table and simultaneously added to three in-memory structures: an inverted index (keyword → conversations), a message-level index (keyword → conversation → messages), and a sorted keyword list for prefix matching
- Search never reads the DB — it runs entirely in memory via set intersection (AND semantics: all words must appear) and bisect-based prefix lookup. Latency is typically under 1ms regardless of history size
- Clicking a result scrolls to the exact message that matched, not just the conversation
- Old conversations are backfilled into the index on first launch — one-time background operation
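The three in-memory structures and the AND-plus-prefix lookup described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the class name `KeywordIndex` and its method names are hypothetical, and the SQLite write and background worker are left out.

```python
from bisect import bisect_left
from collections import defaultdict


class KeywordIndex:
    """Toy version of the three in-memory structures (names are hypothetical)."""

    def __init__(self):
        self.conversations = defaultdict(set)                  # keyword -> conversation ids
        self.messages = defaultdict(lambda: defaultdict(set))  # keyword -> conversation -> message ids
        self.sorted_keywords = []                              # sorted list for prefix matching

    def add(self, conversation_id, message_id, keywords):
        for kw in keywords:
            kw = kw.lower()
            if kw not in self.conversations:
                # Keep the keyword list sorted so bisect can find prefixes.
                self.sorted_keywords.insert(bisect_left(self.sorted_keywords, kw), kw)
            self.conversations[kw].add(conversation_id)
            self.messages[kw][conversation_id].add(message_id)

    def _expand_prefix(self, prefix):
        # All keywords starting with `prefix` sit in one contiguous run.
        start = bisect_left(self.sorted_keywords, prefix)
        matches = []
        for kw in self.sorted_keywords[start:]:
            if not kw.startswith(prefix):
                break
            matches.append(kw)
        return matches

    def search(self, query):
        # AND semantics: intersect the conversation sets of every query word.
        result = None
        for word in query.lower().split():
            convs = set()
            for kw in self._expand_prefix(word):
                convs |= self.conversations[kw]
            result = convs if result is None else result & convs
        return result or set()
```

Once a conversation matches, `messages[keyword][conversation_id]` gives the message ids needed to scroll to the exact hit.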
Resuming a conversation after months
This is where the context management pays off. You don't need to re-explain your project every time you come back to it.
What happens automatically
Corrections and preferences are injected
Every stored correction is scored against your new message. "Always use TypeScript", "prefer metric units", "our API uses kebab-case" — anything you've taught the AI is injected into the context automatically, even if you taught it months ago in a different conversation.
Model recovery prompts fire
Each model generates and stores a recovery prompt — a self-written system message that summarises your active corrections, preferred domains, and reliability record. When you return to a conversation, stale recovery prompts are regenerated silently in the background before your message is sent.
Context backups carry over session history
At configurable intervals (auto, 15 min, hourly), each model writes a compressed context summary. These summaries are injected when the conversation window would otherwise lose history — you never hit a wall where the AI "forgets" the first half of a long project.
What you need to do
📋 Tips for long-running projects
- Use Projects to group related conversations — corrections scoped to a project only apply within it
- Pin important corrections in the Memory tab so they always inject regardless of relevance score
- Use Generate / View prompt ↗ to manually trigger a recovery prompt refresh after a major project change
- Search for old decisions before asking a new question — saves tokens and keeps reasoning consistent
Installation
AUA-Veritas runs natively on macOS with Apple Silicon (M-series). No Python or Node installation needed — everything is bundled.
Download the DMG
Go to the GitHub Releases page and download AUA-Veritas-0.1.0-arm64.dmg.
Mount and install
Double-click the DMG to mount it. Drag AUA-Veritas to your Applications folder.
Bypass Gatekeeper on first launch
The app is unsigned. On first launch: right-click → Open → Open in the dialog. You only need to do this once.
Adding API keys
Veritas calls frontier model APIs directly — you pay the providers at cost, no markup. You need at least one key, but more means better competition.
📋 Supported providers
- OpenAI — GPT-4o, GPT-4o-mini. Get at platform.openai.com/api-keys
- Anthropic — Claude Sonnet 4, Claude Haiku. Get at console.anthropic.com
- Google — Gemini 1.5 Pro, Gemini 2.0 Flash. Get at aistudio.google.com
- Groq — Llama 3.3 70B (free tier available). Get at console.groq.com
Open Settings
Click the ⚙️ gear icon in the top-right of the sidebar, or use Cmd+,.
Enter keys in the API Keys section
Keys are stored in macOS Keychain — encrypted at the OS level, never written to disk in plain text.
Enable the models you want
Use the checkboxes in the left sidebar to enable or disable models per-conversation. More models = better VCG elections, but higher cost.
Your first conversation
Just type in the input at the bottom and press Enter. Veritas handles everything else.
🔍 What happened behind the scenes
- Veritas scored your stored corrections against the question — any relevant memories were injected into the system prompt
- All enabled models were called simultaneously with a competitive evaluation prompt
- Each model self-reported its domain using a DOMAINS: tag (stripped before display)
- The VCG welfare formula selected a winner for the query's domain (in this example session, GPT-4o in the database domain)
- The answer was displayed and the model run was recorded for future routing
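As an illustration of the tag-stripping step, here is a small sketch. The exact tag layout is not documented here, so this assumes a single line of the form `DOMAINS: a, b` somewhere in the model's reply; the function name `split_domains` is hypothetical.

```python
import re

# Assumed format: one line such as "DOMAINS: code, databases" in the reply.
DOMAINS_LINE = re.compile(r"^DOMAINS:[ \t]*(.*)$", re.MULTILINE)


def split_domains(raw_reply):
    """Return (text to display, list of self-reported domains)."""
    match = DOMAINS_LINE.search(raw_reply)
    if match is None:
        return raw_reply.strip(), []
    domains = [d.strip().lower() for d in match.group(1).split(",") if d.strip()]
    # Remove the tag line before the answer is shown to the user.
    visible = (raw_reply[:match.start()] + raw_reply[match.end():]).strip()
    return visible, domains
```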
Accuracy modes
Choose how much verification overhead you want. Fast for quick questions, Max for anything critical.
| Mode | Models called | Peer review | Best for |
|---|---|---|---|
| Fast | 1 (VCG winner only) | No | Quick questions, low-stakes queries |
| Balanced | All enabled | On disagreement only | Most everyday use — good accuracy/cost ratio |
| High | All enabled | Always | Technical questions, important decisions |
| Max | All enabled | Always + correction check | High-stakes queries: medical, legal, financial |
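The table can also be read as a small lookup structure. The sketch below only restates the table; the names `AccuracyMode`, `MODES`, and `needs_peer_review` are hypothetical and not part of the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccuracyMode:
    calls_all_models: bool   # False means only the VCG winner is called
    peer_review: str         # "never", "on_disagreement", or "always"
    correction_check: bool


MODES = {
    "fast": AccuracyMode(False, "never", False),
    "balanced": AccuracyMode(True, "on_disagreement", False),
    "high": AccuracyMode(True, "always", False),
    "max": AccuracyMode(True, "always", True),
}


def needs_peer_review(mode_name, models_disagree):
    """Decide whether a query in the given mode triggers peer review."""
    mode = MODES[mode_name]
    if mode.peer_review == "always":
        return True
    return mode.peer_review == "on_disagreement" and models_disagree
```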
Memory & the correction system
This is what makes Veritas different. Every correction you make is stored permanently and injected into future queries where it's relevant.
How memory injection works
Before every query, Veritas scores all stored corrections against the current question using 8 factors:
📊 Scoring factors
- Relevance — semantic similarity between the correction and the current query
- Failure prevention value — how bad is it if this correction is missed?
- Importance — correction priority (pinned corrections always score higher)
- Recency — newer corrections are slightly preferred
- Confidence — how certain was the correction at time of storage?
- Staleness — corrections older than decay threshold score lower
- Token cost — very long corrections are penalised to avoid context bloat
- Pinned status — pinned corrections always pass the threshold
Only corrections scoring above 0.30 are injected. This prevents irrelevant memories from cluttering the context window.
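To make the threshold concrete, here is a rough sketch of how such a score could be combined. Only the factor names, the 0.30 threshold, and the pinned bypass come from the text above; the weights, the split into positive factors and penalties, and all names in the code are assumptions for illustration.

```python
from dataclasses import dataclass

THRESHOLD = 0.30  # injection threshold stated in the text

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "relevance": 0.40,
    "failure_prevention": 0.20,
    "importance": 0.15,
    "recency": 0.10,
    "confidence": 0.15,
}
PENALTIES = {"staleness": 0.20, "token_cost": 0.10}


@dataclass
class Correction:
    text: str
    factors: dict        # each factor assumed pre-normalised to 0..1
    pinned: bool = False


def score(correction):
    """Weighted sum of positive factors minus weighted penalties, floored at 0."""
    positive = sum(w * correction.factors.get(name, 0.0) for name, w in WEIGHTS.items())
    penalty = sum(w * correction.factors.get(name, 0.0) for name, w in PENALTIES.items())
    return max(0.0, positive - penalty)


def select_for_injection(corrections):
    """Pinned corrections always pass; the rest must score above the threshold."""
    return [c for c in corrections if c.pinned or score(c) > THRESHOLD]
```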
Making corrections
Three ways to teach Veritas something new.
Implicit correction (just reply naturally)
Say "No, it's 53" or "Actually metric units please" — Veritas detects the correction intent using semantic similarity and asks you to confirm. Once confirmed, it's stored permanently.
Explicit correction prefix
Start your message with "correction:" to skip the detection step, for example "correction: always use TypeScript, never plain JavaScript".
"I know that…" prefix
For persistent instructions, for example "I know that we use kebab-case for all our CSS class names". Stored as a global rule, injected whenever CSS is relevant.
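The two prefixes can be told apart with a simple check before any detection runs. The sketch below assumes case-insensitive prefix matching and uses hypothetical labels; the semantic-similarity detection for implicit corrections is not shown.

```python
def classify_message(message):
    """Return (kind, payload) for a user message.

    kind is "explicit", "global_rule", or "needs_detection" (hypothetical labels).
    """
    text = message.strip()
    lowered = text.lower()
    if lowered.startswith("correction:"):
        return "explicit", text[len("correction:"):].strip()
    if lowered.startswith("i know that"):
        return "global_rule", text[len("i know that"):].strip()
    # Anything else goes through implicit detection and user confirmation.
    return "needs_detection", text
```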
Model disagreements
When models give meaningfully different answers, Veritas surfaces the disagreement explicitly rather than silently picking one.
📌 What happens when you pick
- Your preference is recorded as a model preference correction
- The chosen model gets a VCG win recorded in that domain — its effective_u rises
- Over time, the model you prefer on disagreements gets routed to more often in that domain
- No point is awarded to any model until you pick — disagreements don't inflate winners
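One way to picture the bookkeeping is sketched below. How effective_u is actually computed is not stated here, so this assumes a Laplace-smoothed win rate, which conveniently starts at a neutral 0.5 with no history; `WinLedger` and its methods are hypothetical names.

```python
from collections import defaultdict


class WinLedger:
    """Tracks per-(model, domain) wins; nothing is recorded until the user picks."""

    def __init__(self):
        self.wins = defaultdict(int)    # (model, domain) -> wins
        self.trials = defaultdict(int)  # (model, domain) -> resolved contests

    def resolve_disagreement(self, domain, contenders, picked):
        # Called only once the user has chosen an answer.
        for model in contenders:
            self.trials[(model, domain)] += 1
        self.wins[(picked, domain)] += 1

    def effective_u(self, model, domain):
        # Assumed formula: smoothed win rate, 0.5 when there is no history.
        key = (model, domain)
        return (self.wins.get(key, 0) + 1) / (self.trials.get(key, 0) + 2)
```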
Look Under the Hood — Overview tab
Click the 📊 chart icon in the top-right of the Quality panel to open the full analytics dashboard. It has 5 tabs.
The Overview tab shows high-level health metrics for your session:
📈 What you'll see
- Total queries — how many questions asked this session
- Agreement rate — % of queries where all models agreed (high = consistent, reliable answers)
- Peer review rate — % of queries that triggered independent verification
- Active corrections — how many stored memories are actively being injected
- Session cost — total API spend, broken down by provider
Look Under the Hood — Models tab
Per-model performance breakdown — who's winning, and why.
⚖️ What the scores mean
- Welfare score — the VCG score this model got this session (higher = more wins)
- Win rate — fraction of queries where this model was selected as the VCG winner
- Confidence — average confidence score from peer review (80% = models agree it's correct)
- Latency — average response time — affects whether fast queries use this model in Fast mode
- Domain strengths — which domains this model has been winning in across your history
Look Under the Hood — Decisions tab
The most informative tab — a full trace of every decision made on each query.
Click any query in the list to expand its decision chain:
🔎 Reading the VCG scores
- Scores are computed as W_i = Σ_j p(j|q) · u_i(j): a sum over domains j, weighting model i's win rate u_i(j) in each domain by the probability p(j|q) that query q belongs to that domain
- A score near 0.5 means the model hasn't built up much history in this domain yet — neutral prior
- A score above 0.7 means the model has a strong track record in this query's domain
- Large gaps between models indicate strong domain specialization in your usage
- If all scores are near 0.5, you haven't sent enough queries in this domain for differentiation yet
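A small worked example of the welfare formula may help when reading these numbers. The model names and values are made up, and the 0.5 fallback for domains without history is taken from the neutral prior described above; this covers only the scoring step, not any other part of the selection mechanism.

```python
def welfare(domain_probs, utilities):
    """W_i = sum over domains j of p(j|q) * u_i(j).

    domain_probs: {domain: p(j|q)}; utilities: {domain: u_i(j)} for one model.
    Domains the model has no history in fall back to the neutral 0.5 prior.
    """
    return sum(p * utilities.get(domain, 0.5) for domain, p in domain_probs.items())


def pick_winner(domain_probs, utilities_by_model):
    """Score every model and return (best model, all scores)."""
    scores = {m: welfare(domain_probs, u) for m, u in utilities_by_model.items()}
    return max(scores, key=scores.get), scores


# Made-up numbers: a query that is mostly "code" with some "general".
probs = {"code": 0.75, "general": 0.25}
history = {
    "model_a": {"code": 0.8, "general": 0.4},  # 0.75*0.8 + 0.25*0.4 = 0.70
    "model_b": {"code": 0.6},                  # 0.75*0.6 + 0.25*0.5 = 0.575
}
winner, scores = pick_winner(probs, history)
```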
Look Under the Hood — Memory tab
View and manage all stored corrections and preferences. Two sub-views within one tab.
Memory view
All stored corrections with their type, domain, scope (global / project / conversation), and creation date. Filter by type using the pill buttons at the top.
Corrections view
Correction intelligence settings — two controls:
⚡ Correction intelligence settings
- Implicit sensitivity slider (0.20 – 0.80) — how aggressively to detect corrections in your replies. Default 0.45. Lower = more prompts (catches more corrections, more false positives). Higher = fewer prompts (misses more, fewer interruptions).
- Validation mode — Plausible (default): a cheap model checks if the correction could be true before storing it. Strict: full cross-check, slower but rejects more noise.
Look Under the Hood — Domains tab
See the live domain taxonomy that Veritas has learned from your query patterns.
Tree view
The 10 L0 root domains are always present (code, mathematics, science, legal, medical, finance, writing, analysis, history, general). As you use the app, sub-domains grow beneath them when models consistently report a specific sub-domain and performance diverges from the parent.
Click any node to expand its children. The query count badge shows how many queries have been routed to that domain across your session history.
Candidates view
Domain strings that models have reported but haven't yet been promoted to full nodes. Each candidate shows:
📊 Candidate fields
- raw_string — the exact string a model returned (e.g. "constitutional law")
- nearest_node — the closest existing tree node (e.g. "legal")
- similarity — edit-distance similarity to the nearest node (0–1)
- query count — how many times this string has been reported (needs ≥5 to be considered for promotion)
- model count — how many distinct models have reported it (needs ≥2)
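The candidate fields and the two promotion minimums can be sketched as below. The thresholds come from the list above; the normalisation of edit distance into a 0 to 1 similarity (one minus distance over the longer length) is an assumption, as are all names in the code. Meeting the minimums only makes a candidate eligible for consideration.

```python
from dataclasses import dataclass

MIN_QUERY_COUNT = 5  # from the list above
MIN_MODEL_COUNT = 2  # from the list above


def edit_distance(a, b):
    """Classic Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for nothing shared."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


@dataclass
class Candidate:
    raw_string: str
    nearest_node: str
    similarity: float
    query_count: int
    model_count: int


def eligible_for_promotion(candidate):
    return (candidate.query_count >= MIN_QUERY_COUNT
            and candidate.model_count >= MIN_MODEL_COUNT)
```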
Context prompts
Each model writes its own recovery prompt — a system message it would send itself to restore full context if the conversation restarts.
Click Generate / View prompt ↗ at the bottom of the Memory panel (right sidebar) to open the modal. The system calls each model whose prompt is stale and asks it to write a prompt incorporating your current corrections, preferences, and top domains, then saves the result.
🔄 When context prompts are auto-sent
- A model drops a rule (compliance monitor detects streak ≥ 2)
- You start a new chat window after a long gap
- Context window pressure is detected mid-conversation
Prompts are checked for staleness every 15 minutes in the background. A prompt is stale if it's older than 24 hours or a new correction has been added since it was generated. Only stale prompts are regenerated; fresh ones show their saved text immediately.
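The staleness rule reduces to two comparisons. A minimal sketch, assuming timestamps are available for when the prompt was generated and when the most recent correction was added (the function and parameter names are hypothetical):

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=24)  # age limit stated in the text


def is_stale(generated_at, last_correction_at, now):
    """A prompt is stale if it is too old or predates the newest correction."""
    if now - generated_at > MAX_AGE:
        return True
    return last_correction_at is not None and last_correction_at > generated_at
```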
Local models (Ollama)
Run Ollama models locally for free, private inference. Local models compete in VCG elections the same way frontier models do.
Install Ollama
Download from ollama.com and run ollama pull llama3.2 or any model you prefer.
Enable in Settings
Toggle Local models in Settings. Veritas will auto-discover models running on the default Ollama port (11434).
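If you want to check what Ollama exposes before enabling the toggle, its HTTP API lists locally available models at `GET /api/tags`. The sketch below queries that endpoint on the default port; how Veritas itself performs discovery is not documented here, so treat this as an independent check rather than the app's mechanism. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default Ollama port


def parse_model_names(payload):
    """Extract model names from an /api/tags JSON response (str or bytes)."""
    data = json.loads(payload)
    return [model["name"] for model in data.get("models", [])]


def list_local_models(url=OLLAMA_TAGS_URL, timeout=2.0):
    """Ask a running Ollama server which models are available locally."""
    with urlopen(url, timeout=timeout) as response:
        return parse_model_names(response.read())
```

Calling `list_local_models()` raises a `URLError` if no Ollama server is listening, which is itself a quick way to confirm whether the service is up.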
Local mode is exclusive
You can run frontier models OR local models — not both simultaneously. Use the toggle to switch. Local models are ideal for sensitive queries that shouldn't leave your machine.
Bug reporting
Found something broken? Report it in one click — no account, no email required.
Click the 🐛 Report a bug button at the bottom of the left sidebar. A modal opens with a comment field and opt-in checkboxes for including your last 5 messages (for context) and your email (for follow-up).