Domain Deep-Dive · v1.0

Recommendation Engines &
Personalization Platforms

Recommendation systems optimize for engagement signals. The framework adds what they're missing: a correction loop that learns from explicit rejections, a confidence gate that handles cold-start honestly, and a single utility function that makes multi-objective tradeoffs explicit, tunable, and auditable.

For: Recommendation platform engineers · Personalization teams · E-commerce ML teams · Content platform engineers

1. What existing recommendation systems are missing

Netflix, Spotify, and YouTube have sophisticated recommendation systems — multimodal content embeddings, LLM-augmented architectures, unified search and recommendation models, and domain-specific LoRA adapters. These approaches have produced meaningful production gains: M3CSR (Kuaishou) showed +3.4% clicks; 360Brew (LinkedIn) supports 30+ ranking tasks matching specialised models; UniCoRn (Netflix) shows 10% improvement in recommendations. The framework documented here doesn't compete with these on recommendation quality. That is not what it does.

What it does is add a control layer that sits on top of whatever base recommendation model a platform runs — including any of the architectures above. The control layer provides three properties that none of the current production approaches, including the most sophisticated ones, have built in.

No persistent correction loop for explicit rejections

A user who clicks "not interested" on a category repeatedly still gets recommended that category — because engagement models and preference models don't share a persistent correction store. The signal is logged, but it decays quickly and rarely propagates across sessions. The same pattern resurfaces next week, next month, at the next login. There is no mechanism to say "this preference signal was verified and should not decay."

No confidence gate on sparse signals

Recommendation models produce a ranked list regardless of how confident they actually are in those rankings. A new user, a new item, or a session with thin behavioural signals gets the same confident output as a well-calibrated recommendation for a power user. Cold-start is still one of the hardest unsolved problems in production recommendation precisely because there's no principled mechanism to say "I don't know enough — here's a safe fallback."

Multi-objective tradeoffs resolved heuristically

Most production recommendation systems combine relevance, diversity, novelty, and business objectives (margin, sponsored content, inventory clearance) through hand-tuned heuristics and post-processing steps. The weights are implicit, undocumented, and difficult to audit. When a regulator or a user asks "why did you recommend this?" the honest answer is usually "a weighted combination of signals we can't fully decompose."

Why this matters now. The EU Digital Services Act (DSA) — in force for very large online platforms since August 2023 — requires algorithmic transparency for recommendation systems, including the ability to explain why content was recommended and to offer users a non-profiling-based alternative. A system that cannot decompose a recommendation into its component signals and weights cannot satisfy those requirements. Separate EU AI Act obligations may apply where recommendation systems affect employment, education, or access to services.

2. The utility function for recommendations

The recommendation utility function uses the same three-term formula as every other domain in the framework — proved as the unique functional form satisfying five behavioral axioms via Debreu's representation theorem (Theorem B.1, Appendix B). The weights and their domain-specific meaning shift, but the structure is unchanged:

U = w_e(f) · E  +  w_c(f) · C  +  w_k(f) · K

  E — Efficacy:   how well the recommendation serves the platform's
                  definition of a good outcome in this context.
                  Crucially, what counts as "effective" is domain-defined:
                  in discovery mode, E measures novelty-weighted engagement;
                  in fulfillment mode, E measures relevance to known intent;
                  in a business campaign, E measures margin-weighted conversion.
                  Business objectives are not a separate term — they are part
                  of what the platform defines as efficacious recommendation.

  C — Confidence: internal consistency of the recommendation signal.
                  Penalised by sparse data, contradictory signals, cold-start.
                  Below C_min(f), the system falls back to safe defaults.

  K — Curiosity:  exploration bonus for novel items the user hasn't seen.
                  Governs diversity and serendipity within the E constraint.

f — recommendation context (discovery, fulfillment, new user, campaign...)

This framing is consistent with how the framework handles domain variation elsewhere. Surgery defines E to include protocol adherence and patient safety compliance — those aren't separate terms, they're what "effective" means in that context. The same logic applies here: a recommendation platform that values margin-weighted conversion defines E accordingly, with the weight vector making that definition explicit and auditable rather than implicit in a black-box loss function.

Decision rule:
  recommend      if C ≥ C_min(f)   [confidence above threshold]
  safe fallback  otherwise          [popular/proven items for this user segment]
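As a minimal sketch (illustrative names and values, not the shipped API), the utility computation and decision rule above can be expressed as:

```python
from dataclasses import dataclass

@dataclass
class Weights:
    w_e: float  # efficacy weight for context f
    w_c: float  # confidence weight
    w_k: float  # curiosity weight

def utility(w: Weights, E: float, C: float, K: float) -> float:
    """U = w_e·E + w_c·C + w_k·K for one candidate item."""
    return w.w_e * E + w.w_c * C + w.w_k * K

def decide(C: float, c_min: float) -> str:
    """Confidence gate: personalise only when C clears the context threshold."""
    return "recommend" if C >= c_min else "safe_fallback"

# Discovery-mode weights from the profiles in §3 (illustrative)
discovery = Weights(w_e=0.30, w_c=0.25, w_k=0.35)
u = utility(discovery, E=0.8, C=0.7, K=0.9)   # 0.24 + 0.175 + 0.315 = 0.73
action = decide(C=0.7, c_min=0.65)            # gate passes at this confidence
```

The C_min comparison is deliberately separate from the utility score: a high-U item with sub-threshold confidence still falls back, which is the property no ranking-only system provides.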

Three properties distinguish this from existing multi-objective approaches: the weights shift with detected context (§3), explicit rejections persist through a correction loop (§4), and sparse signals trigger a confidence gate instead of a confident ranking (§5).

3. Context-adaptive weight shifting

The optimal recommendation objective changes completely depending on what the user is trying to do and where they are in their lifecycle. The framework handles this with weight profiles — not separate recommendation systems per mode, but a single formula whose weights shift with the detected context. The values below are illustrative profiles; field-specific calibration is required for each deployment.

Mode A · Discovery / exploration
Efficacy w_e = 0.30 · Confidence w_c = 0.25 · Curiosity w_k = 0.35

E is defined as novelty-weighted engagement in this mode. Curiosity dominates — the user wants to be surprised. Confidence is moderate; the system tolerates lower certainty in exchange for serendipity.

Mode B · Fulfillment / known intent
Efficacy w_e = 0.70 · Confidence w_c = 0.20 · Curiosity w_k = 0.10

E is defined as relevance to known intent. User has a clear goal — similar to a search query. Efficacy dominates. Curiosity is suppressed; novelty is a distraction in this mode.

Mode C · New user / cold start
Efficacy w_e = 0.20 · Confidence w_c = 0.40 · Curiosity w_k = 0.20

Below C_min — confidence gate fires. System falls back to segment baseline. High curiosity weight maximises information gathering about preferences. E is low because the system cannot yet define what "effective" means for this user.

Mode D · Business campaign / inventory
Efficacy w_e = 0.65 · Confidence w_c = 0.25 · Curiosity w_k = 0.10

E is redefined to include margin-weighted conversion — business objectives shift what "effective" means in this mode, not how the formula works. The weight vector makes this redefinition explicit and auditable. Curiosity is suppressed; novelty is irrelevant to campaign goals.

One formula, all modes. The formula doesn't change between modes — only the weights and the definition of E change. Business campaign mode doesn't add a new term; it redefines what "effective" means in E. This is the same mechanism that makes surgery and creative writing use the same formula with radically different weights and E definitions. The weight vector makes the platform's objective explicit and auditable rather than baking it into a black-box loss function.
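The four profiles above can be held as plain data and selected per session — one formula, one table of weights. A minimal sketch (values are the illustrative profiles from this section; calibrate per deployment):

```python
# Illustrative weight profiles from §3 — not shipped defaults
PROFILES = {
    "discovery":   {"w_e": 0.30, "w_c": 0.25, "w_k": 0.35},
    "fulfillment": {"w_e": 0.70, "w_c": 0.20, "w_k": 0.10},
    "cold_start":  {"w_e": 0.20, "w_c": 0.40, "w_k": 0.20},
    "campaign":    {"w_e": 0.65, "w_c": 0.25, "w_k": 0.10},
}

def score(mode: str, E: float, C: float, K: float) -> float:
    """Same three-term formula in every mode; only the weights shift."""
    w = PROFILES[mode]
    return w["w_e"] * E + w["w_c"] * C + w["w_k"] * K

# Identical signals, different objectives: fulfillment rewards
# relevance (E), discovery rewards novelty (K)
score("fulfillment", E=0.9, C=0.6, K=0.2)
score("discovery",   E=0.9, C=0.6, K=0.2)
```

Because the profiles are data rather than model architecture, the active weight vector can be logged per decision — which is what makes the §6 audit record possible.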

4. The correction loop applied to preference signals

This is the framework's most distinctive contribution to recommendation. Existing systems log explicit rejections but don't propagate them systematically across sessions and don't use them to update model weights between retraining cycles. The correction loop changes both.

Signal decay classification

Not all preference signals should persist equally. The assertions store applies field-specific decay classes to recommendation signals:

Signal type                  Decay class   Persistence (video recommendation)
──────────────────────────────────────────────────────────────────────────────
Explicit rejection            Class A       Never decays
  ("not interested", thumbs down, hide)
  → stored as verified negative preference
  → injected into all future sessions for this user

Strong positive (purchase,    Class B       Slow (τ = 2yr)
  save to watchlist, rewatch)               [framework default: τ = 10yr]

Engagement (completion        Class C       Moderate (τ = 6mo)
  ≥80%, rating given)                       [framework default: τ = 3yr]

Weak signal (click, start     Class D       Fast (τ = 2wk)
  but abandon <20%)                         [framework default: τ = 6mo]
  → decays quickly; accidental clicks and
    abandoned pilots are noisier here
    than equivalent signals in other domains

Decay constants for video recommendation are set shorter than framework defaults because preference signals in this domain have higher temporal specificity than medical or engineering facts. A user's genre preferences shift meaningfully within 6–12 months as tastes evolve, household composition changes, and the content catalogue refreshes. Class A (explicit rejection) is unchanged — a deliberate "not interested" signal is as durable here as anywhere. Class D decays within 2 weeks rather than the default 6 months because accidental clicks and abandoned pilots on a video platform are meaningfully noisier than equivalent weak signals in other domains.
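One way to sketch the decay classes is exponential decay with per-class time constants, with Class A pinned at full weight. The constants below are the video-recommendation values from the table; the function name is illustrative:

```python
import math

# τ in days; None = never decays (Class A)
DECAY_TAU_DAYS = {"A": None, "B": 730, "C": 182, "D": 14}

def signal_weight(decay_class: str, age_days: float) -> float:
    """Weight of a stored preference signal after age_days of decay."""
    tau = DECAY_TAU_DAYS[decay_class]
    if tau is None:
        return 1.0                      # explicit rejection: never decays
    return math.exp(-age_days / tau)

signal_weight("A", 365)   # 1.0 — a year-old "not interested" still counts fully
signal_weight("D", 28)    # ≈ 0.14 — a month-old stray click is nearly gone
```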

How a rejection becomes a correction

1. User clicks "not interested" on Action genre recommendations
   for the 4th time this month

2. Contradiction detected:
   "Repeated explicit rejection of [Action] category
    contradicts model's continued recommendation of this category"

3. Correction stored (Class A — never decays):
   "User [id] has verified negative preference for [Action].
    Exclude from personalised recommendations permanently."

4. Correction injected into every future recommendation session:
   → Action items scored to zero in utility function
   → No re-ranking required — excluded at retrieval stage

5. DPO pair accumulated:
   preferred: [recommendations without Action]
   rejected:  [recommendations with Action]
   weight:    high (explicit rejection, repeated)

6. Next calibration cycle:
   → DPO pairs weighted into recommendation model training
   → Model learns not to retrieve Action for this user segment
   → Correction no longer needs injection — baked into weights

The key difference from existing approaches. Current systems apply rejection signals as re-ranking penalties that decay over time. The framework applies them as permanent corrections at the retrieval stage for verified preferences, and as weighted training signal at calibration time. The same category does not resurface next month. The model learns the preference, not just the penalty.

5. Confidence-gated cold start

Cold-start has more sophisticated solutions in the literature than what the confidence gate provides. Multimodal content embeddings — YouTube's Semantic IDs, Kuaishou's M3CSR — address cold-start by learning item representations from visual, textual, and audio content, so a new item with no interaction history still gets a meaningful embedding. These are genuinely stronger solutions to the cold-start quality problem than a confidence-gated fallback to segment baselines.

The confidence gate is not a quality solution to cold-start — it is an honesty solution. The problem it addresses is not "how do we recommend better for new users" but "how do we avoid confidently recommending things we have no basis for recommending." A platform that labels a segment-average recommendation as "Recommended for you" when it has zero user signal is making a small misrepresentation that users notice. The confidence gate makes the label accurate.

New user — session 1:
  Signal count: 0
  C = 0.10   [illustrative — far below C_min]
  
  Action: safe fallback
  → Recommend from high-confidence segment baseline
     (most-saved items in user's declared interest category)
  → Present as "Popular in [category]" not "Recommended for you"
  → Maximise diversity to gather preference signal quickly

New user — session 3 (5 interactions logged):
  Signal count: 5 (2 saves, 3 completions)
  C = 0.58   [illustrative — approaching C_min]
  
  Action: partial personalisation
  → Blend segment baseline (60%) with early personalisation (40%)
  → Still diverse — learning phase continues

New user — session 7 (20 interactions, 3 explicit saves):
  C = 0.72   [illustrative — above C_min]
  
  Action: full personalisation
  → Utility function governs recommendations
  → "Recommended for you" label is now honest

C_min for new users requires calibration against your platform's
own data — the point below which personalised recommendations
show measurably lower conversion than segment baselines.

Complementary to content-based cold start, not a replacement. A platform running Semantic IDs or M3CSR for content-based cold-start representation can still benefit from the confidence gate: even with a content embedding, a new user's preferences are unknown. The gate prevents the system from claiming personalisation before the user's own signals have been observed — regardless of how good the item embeddings are.
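The three sessions above can be sketched as a confidence-driven policy. The C values and the 60/40 blend mirror the illustration; the half-C_min boundary for pure baseline is an assumption of this sketch, and c_min must be calibrated per platform:

```python
def cold_start_policy(C: float, c_min: float = 0.65):
    """Return (label, baseline_share) for a user at confidence C.
    baseline_share is the fraction of recommendations drawn from
    the segment baseline rather than personalisation."""
    if C < 0.5 * c_min:
        return ("Popular in [category]", 1.0)   # pure segment baseline
    if C < c_min:
        return ("Popular in [category]", 0.6)   # partial personalisation
    return ("Recommended for you", 0.0)         # gate passed: honest label

cold_start_policy(0.10)   # session 1 → ('Popular in [category]', 1.0)
cold_start_policy(0.58)   # session 3 → ('Popular in [category]', 0.6)
cold_start_policy(0.72)   # session 7 → ('Recommended for you', 0.0)
```

The label change is the point: the user-facing claim tracks the actual evidential state, not the model's willingness to rank.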

6. Auditability and regulatory compliance

The framework produces a structured log for every recommendation decision — not just what was recommended, but why, with what confidence, and under which weight profile. This is the audit trail that regulators increasingly require and that current production recommendation systems cannot produce at the per-decision level.

Every recommendation session produces a log record:
{
  "user_id": "u_8821",
  "timestamp": "2025-11-14T14:47:23Z",
  "mode": "discovery",
  "weights_active": {
    "w_e": 0.30,
    "w_c": 0.25,
    "w_k": 0.35,
    "mode": "discovery"
  },
  "confidence": 0.78,
  "confidence_gate": "passed",
  "top_recommendations": ["item_A", "item_B", "item_C"],
  "active_corrections": [
    "User has verified negative preference for [Action] genre — excluded"
  ],
  "utility_scores": {
    "item_A": 0.731,
    "item_B": 0.698,
    "item_C": 0.672
  }
}

When a regulator or user asks "why was this recommended?", the answer is in a structured, retrievable log — not reconstructed from a black-box model. When a user asks "why do I keep seeing this category?", the corrections store shows exactly what signals have been accumulated and why the category is or isn't being suppressed.
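A minimal sketch of the tamper-evident property the hash-chained audit log relies on: each record's hash covers the previous record's hash plus its own serialised body, so editing any earlier record breaks every later hash. Field names follow the example above; the exact chaining scheme here is illustrative:

```python
import hashlib, json

def append_record(log: list, record: dict) -> None:
    """Append a record whose hash covers the previous record's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    record = dict(record, prev_hash=prev,
                  hash=hashlib.sha256((prev + payload).encode()).hexdigest())
    log.append(record)

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edit to an earlier record is detected."""
    prev = "0" * 64
    for rec in log:
        body = {k: v for k, v in rec.items() if k not in ("hash", "prev_hash")}
        payload = json.dumps(body, sort_keys=True)
        if rec["prev_hash"] != prev:
            return False
        if rec["hash"] != hashlib.sha256((prev + payload).encode()).hexdigest():
            return False
        prev = rec["hash"]
    return True

log = []
append_record(log, {"user_id": "u_8821", "mode": "discovery", "confidence": 0.78})
append_record(log, {"user_id": "u_8821", "mode": "fulfillment", "confidence": 0.81})
verify_chain(log)                  # intact chain verifies
log[0]["confidence"] = 0.99
verify_chain(log)                  # tampering detected: verification fails
```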

7. Comparison: existing approaches vs. the framework

Legend: CF = collaborative filtering · DL = deep-learning recommendation · ~ = partial support

Explicit rejection persists
  CF: ~ incorporated but decays; varies by system
  DL: re-ranking penalty only; no persistent store
  Framework: Class A — never decays

Confidence gate on sparse signals
  CF: always produces a ranking
  DL: always produces a ranking
  Framework: C_min → safe fallback

Multi-objective weights explicit
  CF: ~ heuristic blending
  DL: ~ single loss function
  Framework: documented weight vector per mode

Per-decision auditability
  CF: none
  DL: none
  Framework: structured log per session

Correction propagates to model weights
  CF: only at retraining
  DL: only at retraining
  Framework: DPO pairs → calibration cycle

Context-adaptive objective
  CF: fixed similarity metric
  DL: ~ context features only
  Framework: weight profile per mode

Cold-start handled honestly
  CF: ~ popularity fallback; content embeddings in advanced systems
  DL: ~ multimodal content embeddings (stronger on quality)
  Framework: C_min gate — honest labelling, not a quality solution

Framework properties from §8 — Correction Loop and §5 — Field Bounds.

8. MVP shape

A practical MVP for a recommendation platform doesn't replace the existing recommendation system. It begins as a correction and confidence layer on top of existing rankings:

  1. Stand up the assertions store for explicit rejections — intercept all "not interested," hide, and thumbs-down signals. Store as Class A corrections. Measure how many times the same rejected category resurfaces in the next 30 days without corrections vs. with. This is the baseline metric.
  2. Set C_min for new users — identify the signal count threshold below which your current recommendations have measurably lower conversion. That's your C_min calibration point. Below it, present segment baselines as "Popular in [category]" rather than "Recommended for you."
  3. Shadow-mode weight profiles — run the discovery/fulfillment/cold-start weight profiles in shadow mode alongside the existing system. Log U scores per recommendation. Measure which profile produces better engagement on a per-session basis before any traffic is shifted.
  4. First calibration cycle — after 4–6 weeks of accumulated correction signals, run the first DPO calibration. Measure whether rejection recurrence drops and whether Brier score improves. This is the Phase 1 validation gate.
  5. Promote to advisory recommendations — surface the framework's recommendations alongside the existing system's for human review or A/B testing. The framework builds a track record before it controls production traffic.
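Step 1's baseline metric — how often a rejected category resurfaces within 30 days — can be sketched as follows. The event shape and names are illustrative:

```python
from datetime import datetime, timedelta

def rejection_recurrence(events, window_days=30):
    """MVP step-1 baseline: count recommendations of a category within
    window_days after the user's first explicit rejection of it."""
    first_rejection = {}
    count = 0
    for ts, kind, category in sorted(events):
        if kind == "rejected":
            first_rejection.setdefault(category, ts)
        elif kind == "recommended" and category in first_rejection:
            delta = ts - first_rejection[category]
            if timedelta(0) < delta <= timedelta(days=window_days):
                count += 1
    return count

t0 = datetime(2025, 11, 1)
events = [
    (t0, "rejected", "Action"),
    (t0 + timedelta(days=5), "recommended", "Action"),   # resurfaces: counted
    (t0 + timedelta(days=40), "recommended", "Action"),  # outside the window
    (t0 + timedelta(days=3), "recommended", "Drama"),    # never rejected
]
rejection_recurrence(events)   # 1
```

Run this against event logs with corrections disabled and enabled; the delta between the two counts is the step-1 validation signal.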

Build it with AUA v1.0


AUA v1.0 is the control layer this page describes. The confidence gate, correction loop, persistent assertions store, per-decision audit log, and mode-adaptive weight profiles are all shipped and runnable today — on top of your existing recommendation system via the REST API.

Fully demonstrable. Unlike physical-actuation domains (self-driving, autonomous systems), every component described on this page is runnable today. AUA provides the confidence gate, correction loop, and audit layer. Your existing rec system keeps ranking; AUA adds the control and correctability on top.

1. Install

pip install adaptive-utility-agent

2. Scaffold for this domain

aua init my-recommendation-platform --preset generalist --tier macbook
cd my-recommendation-platform
aua doctor

3. Key config for this domain

Configure field weights to match your four recommendation modes. The generalist preset is the right starting point — recommendation is a multi-mode domain where each context gets its own weight profile rather than a fixed field.

# aua_config.yaml
specialists:
  - name: discovery_rec
    model: qwen2.5-7b-awq          # your recommendation LLM
    port: 11434
    field: generalist
    metadata:
      rec_mode: discovery           # novelty-weighted, high curiosity
      w_e: 0.30
      w_c: 0.25
      w_k: 0.35

  - name: fulfillment_rec
    model: qwen2.5-7b-awq
    port: 11435
    field: generalist
    metadata:
      rec_mode: fulfillment         # known-intent, efficacy-dominant
      w_e: 0.70
      w_c: 0.20
      w_k: 0.10

safety:
  abstention_enabled: true          # enables C_min confidence gate
  min_confidence_for_direct_answer: 0.65   # tune to your platform data
  abstention_message: "Using segment baseline — not enough signal yet"

audit:
  enabled: true
  hash_chain: true                  # tamper-evident log for DSA compliance
  log_path: ./audit/recommendations.log

assertions:
  decay_class_overrides:
    explicit_rejection: A           # never decays
    strong_positive: B              # τ = 2yr
    engagement: C                   # τ = 6mo
    weak_signal: D                  # τ = 2wk (shorter than default)

security:
  auth:
    enabled: true
    tokens: [{name: rec-api, scopes: [query, read_metrics]}]
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}

Generate your encryption key: python3 -c "import os; print(os.urandom(32).hex())" or openssl rand -hex 32 — 64-char hex string. See Tutorial §12.4 for key management.

4. Start and query

aua serve

# Standard recommendation query — AUA routes to correct mode and applies C_min gate
curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "recommend items for user u_8821",
    "session_id": "u_8821",
    "metadata": {"rec_mode": "discovery", "signal_count": 23}
  }'

# Log an explicit rejection — stored as Class A, never decays
curl -X POST http://localhost:8000/assert \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "u_8821",
    "subject": "genre:action",
    "assertion": "verified_negative_preference",
    "decay_class": "A",
    "source": "explicit_rejection"
  }'

# Export accumulated DPO pairs for calibration cycle
curl http://localhost:8000/export/dpo \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -o rec_dpo_pairs.jsonl

5. Validate with aua eval

# Run against your recommendation eval dataset
aua eval run --dataset rec_smoke.yaml --output results/

# Check confidence gate is firing correctly for sparse-signal users
aua eval report --metric abstention_rate --filter signal_count_lt_5

# Shadow-mode comparison: AUA utility scores vs existing system scores
aua eval compare results/baseline.json results/aua_discovery.json

6. What AUA handles vs. what you bring

AUA v1.0 provides:
  - Confidence gate (C_min) → safe fallback trigger
  - Persistent assertions store — Class A rejections never decay
  - Mode-adaptive weight profiles (discovery / fulfillment / campaign)
  - Correction loop: rejection → DPO pair → calibration signal
  - Per-decision audit log with hash chain (DSA-ready)
  - REST API — integrates with any existing rec stack
  - Prometheus + Grafana + OTEL — abstention rate, correction count, U score distribution

You bring:
  - Existing recommendation model (CF, DL, LLM-augmented)
  - User signal ingestion pipeline (clicks, saves, skips)
  - Item catalogue and embeddings
  - Segment baselines for cold-start fallback
  - Fine-tuning infrastructure for calibration cycles
  - A/B testing framework and traffic splitting
  - Platform-specific conversion and engagement metrics

Full instructions: AUA Tutorial · Framework v1.0 · GitHub ↗