How the Adaptive Utility Agent framework closes the feedback loop between creative output and platform engagement signal — and why the curiosity term, suppressed in every other domain, finally takes center stage here.
Generative AI tools for creative work — image generators, music models, writing assistants, video synthesis — share a structural problem with every other deployed AI system: there is no feedback loop between what gets generated today and what the model produces tomorrow. A model that consistently produces images in a style that doesn't resonate with an audience will continue producing them until someone retrains the model from scratch.
For creative platforms this failure mode is particularly costly for two reasons: creative quality cannot be reduced to a single ground truth that could drive retraining, and without a feedback loop the model keeps reproducing styles the audience has already stopped responding to. The framework addresses both: the two-component creative efficacy model turns platform signal into a structured quality measure, and the correction loop feeds that measure back into the model continuously between releases.
The creative domain has a fundamentally different utility profile from every other domain in the framework. Compare it to surgery:
From agent/config.py:

```python
"creative_writing": FieldConfig(
    w_efficacy=0.80,        # platform resonance is the primary signal
    w_confidence=0.05,      # internal consistency barely matters — this is art
    w_curiosity=0.15,       # exploration and novelty are core value drivers
    c_min=0.05,             # almost no floor — allowed to take creative risks
    e_min=0.15,             # minimal efficacy floor
    penalty_multiplier=1.0  # creative mistakes are recoverable; no harsh penalty
),
"art": FieldConfig(
    w_efficacy=0.80,
    w_confidence=0.10,
    w_curiosity=0.10,
    c_min=0.10,
    e_min=0.20,
    penalty_multiplier=1.0
),
```
In surgery, the curiosity weight is suppressed to zero (alpha=0.00) — an exploring surgeon is a dangerous surgeon. In creative domains, it rises to 0.15, and the cap that limits curiosity to 50% of total utility (which prevents curiosity from dominating in safety-critical fields) is practically non-binding at such a low confidence minimum. The system is explicitly designed to explore, take creative risks, and produce unexpected outputs — and to learn from what resonates.
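To make the weighting concrete, here is a minimal sketch of how the three terms might combine under the creative_writing profile. It assumes the utility is a plain weighted sum with the 50% curiosity cap applied as a simple clamp; the framework's actual scorer may differ in detail:

```python
# Minimal sketch: assumes a plain weighted sum; the framework's real scorer may differ.
from dataclasses import dataclass

@dataclass
class FieldConfig:
    w_efficacy: float
    w_confidence: float
    w_curiosity: float
    c_min: float
    e_min: float
    penalty_multiplier: float

CREATIVE_WRITING = FieldConfig(0.80, 0.05, 0.15, 0.05, 0.15, 1.0)

def utility(cfg: FieldConfig, efficacy: float, confidence: float, curiosity: float) -> float:
    """Weighted sum of the three terms, with the curiosity contribution capped at
    50% of total utility (i.e. it can never exceed the other terms combined)."""
    base = cfg.w_efficacy * efficacy + cfg.w_confidence * confidence
    curiosity_term = min(cfg.w_curiosity * curiosity, base)
    return base + curiosity_term

# A low-confidence but novel output that resonates still scores well in this field:
print(round(utility(CREATIVE_WRITING, efficacy=0.70, confidence=0.10, curiosity=0.90), 3))  # 0.7
```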
The curiosity term finally has a home. In every other domain documented here — AV, industrial robots, grid management, pricing — the curiosity term is a minor background signal, kept small by high confidence minimums and penalty multipliers. In creative domains, with c_min=0.05 and penalty=1.0, curiosity is the primary driver of differentiated output. A system that only produces what it is already confident is good is not a creative system — it is an interpolation engine. The curiosity term is what enables the system to generate genuinely novel work rather than averaging existing styles.
The core engineering challenge in creative domains is measuring quality. Unlike software engineering (tests pass or fail) or surgery (outcome is medically verifiable), creative quality is contextual, audience-dependent, and cannot be reduced to a single ground truth. The framework's response, implemented in agent/creative_efficacy.py, is a two-component model that mirrors how human creators actually measure their own success:
Creative_Efficacy = sqrt(Content_Efficacy × Discoverability_Efficacy)
- Content_Efficacy — weighted engagement per view. "Can the work hold attention when shown?" Measures saves, likes, shares, purchases ÷ views.
- Discoverability_Efficacy — raw reach relative to a human creator baseline. "Can it find an audience at all?" Measures total impressions, streams, plays vs. baseline.

The two components are combined via the geometric mean, so both must be strong. A viral but low-quality work (high reach, low engagement rate) scores low; a high-quality but undiscovered work (high engagement rate, no reach) also scores low. The score rewards genuine resonance at scale.
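A minimal sketch of the combination step (the function name is illustrative; the real implementation lives in agent/creative_efficacy.py):

```python
import math

def creative_efficacy(content_efficacy: float, discoverability_efficacy: float) -> float:
    """Geometric mean of the two components; a weak component drags the score
    down much harder than an arithmetic mean would."""
    return math.sqrt(content_efficacy * discoverability_efficacy)

print(round(creative_efficacy(0.85, 0.15), 2))  # 0.36: resonance without reach
print(round(creative_efficacy(0.25, 0.85), 2))  # 0.46: reach without resonance
print(round(creative_efficacy(0.72, 0.70), 2))  # 0.71: balanced, above baseline
```

These are the same three cases worked through in the examples further down the page.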
This is the same efficacy scale used in STEM fields — 0.5 means matching human creator average in the category, >0.5 means outperforming the baseline, <0.5 means underperforming. The consistency across domains is intentional: it allows the same utility function formula to govern both a surgical AI and a music generation system, with only the weights and the efficacy measurement method changing.
The two-component model is only as good as the signals feeding it. From the actual implementation:
```python
# Signal weights by intent strength — from creative_efficacy.py
SIGNAL_WEIGHTS = {
    "purchase": 1.0,     # strongest — real economic behavior
    "download": 1.0,
    "save": 0.8,         # strong intent signal
    "bookmark": 0.8,
    "share": 0.7,        # audience endorsement
    "repost": 0.7,
    "like": 0.5,         # moderate — easy to give, less meaningful
    "upvote": 0.5,
    "comment": 0.4,      # engagement signal, noisy
    "view": 0.1,         # weak — could be accidental
    "listen": 0.1,
    "impression": 0.05,  # nearly no signal
}

# Platform → recommended signals
PLATFORM_SIGNALS = {
    "soundcloud": ["listens", "likes", "reposts", "downloads"],
    "spotify": ["streams", "saves", "playlist_adds"],
    "pinterest": ["saves", "clicks", "reposts"],
    "istockphoto": ["downloads", "purchases"],
    "medium": ["reads", "claps", "saves"],
    "youtube": ["views", "likes", "shares", "saves"],
    "behance": ["views", "appreciations", "saves"],
}
```
Minimum observations before scoring. The model requires 50 observations (views/listens) before a content efficacy score is considered meaningful (MIN_OBSERVATIONS = 50). Below this threshold, the system returns sufficient_data=False and withholds the score to prevent spurious corrections based on single-digit engagement counts. This is the creative domain equivalent of the confidence gate — the system does not act on insufficient signal.
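A hedged sketch of how the weighted signals, the baseline normalization, and the observation gate could fit together. The function shape, the normalization form, and the baseline value are assumptions for illustration, not the exact creative_efficacy.py API:

```python
# Illustrative sketch, not the exact creative_efficacy.py API. The normalization
# form and the baseline value below are assumptions.
MIN_OBSERVATIONS = 50

SIGNAL_WEIGHTS = {"purchase": 1.0, "download": 1.0, "save": 0.8, "bookmark": 0.8,
                  "share": 0.7, "like": 0.5, "comment": 0.4, "view": 0.1}

def content_efficacy(signals: dict, baseline_rate: float):
    """Weighted engagement per view, scaled so that matching the human creator
    baseline rate maps to 0.5. Returns (score, sufficient_data)."""
    views = signals.get("view", 0)
    if views < MIN_OBSERVATIONS:
        return None, False                      # withhold the score on sparse data
    weighted = sum(SIGNAL_WEIGHTS.get(name, 0.0) * count
                   for name, count in signals.items() if name != "view")
    rate = weighted / views
    return rate / (rate + baseline_rate), True  # exactly 0.5 at the baseline rate

# The balanced example below, scored against an assumed category baseline of 0.035:
score, ok = content_efficacy({"view": 5_000, "like": 400, "save": 200, "download": 80}, 0.035)
print(ok, round(score, 2))  # True 0.72
```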
Example A — High content efficacy, low discoverability:
Views: 400 Likes: 300 Saves: 200 Downloads: 80
Content efficacy: ~0.85 (extremely high engagement rate)
Discoverability: ~0.15 (almost no one saw it)
Combined: sqrt(0.85 × 0.15) ≈ 0.36 ← low overall score
Example B — Low content efficacy, high discoverability:
Views: 50,000 Likes: 200 Saves: 50 Downloads: 10
Content efficacy: ~0.25 (very low engagement per view)
Discoverability: ~0.85 (strong reach)
Combined: sqrt(0.25 × 0.85) ≈ 0.46 ← below baseline
Example C — Balanced:
Views: 5,000 Likes: 400 Saves: 200 Downloads: 80
Content efficacy: ~0.72 (strong engagement rate)
Discoverability: ~0.70 (good reach)
Combined: sqrt(0.72 × 0.70) ≈ 0.71 ← above baseline
The geometric mean penalizes imbalance. A model that generates content optimizing for virality at the expense of actual engagement quality will score no better than one that generates high-quality-but-invisible work. Both components must improve together.
Creative AI systems face a fundamental tension: too much consistency produces derivative, predictable output; too much novelty produces work that doesn't connect with any audience. The framework handles this with operating modes that shift the weight on the curiosity term:
Mode A: Novelty exploration. The system is rewarded for producing unfamiliar outputs that still resonate. Used for ideation, style development, and breaking established patterns. Curiosity weight elevated.
Mode B: Default resonance. Platform resonance dominates; exploration is still present but secondary. This is the default creative_writing profile from config.py.
Mode C: Brand voice consistency. The confidence weight rises and the system is penalized for outputs that deviate from established style; exploration is suppressed. Used for brand compliance and style guide adherence.
Mode D: Maximum resonance optimization. Platform signal almost entirely governs output. Used when the objective is explicit: maximize saves, downloads, or conversions for a specific content type and platform.
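One way to express these modes is as weight overrides on the base field profile. The specific numbers below are illustrative assumptions, not presets shipped with the framework; only Mode B matches the actual creative_writing configuration, and the others show the direction each mode shifts the weights:

```python
# Illustrative weight overrides per mode; only mode B reflects the real config.py values.
CREATIVE_MODES = {
    "mode_a_novelty_exploration": {"w_efficacy": 0.65, "w_confidence": 0.05, "w_curiosity": 0.30},
    "mode_b_default_resonance":   {"w_efficacy": 0.80, "w_confidence": 0.05, "w_curiosity": 0.15},
    "mode_c_brand_voice":         {"w_efficacy": 0.55, "w_confidence": 0.40, "w_curiosity": 0.05},
    "mode_d_max_resonance":       {"w_efficacy": 0.90, "w_confidence": 0.05, "w_curiosity": 0.05},
}
```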
Music generation. Platform signals: Spotify saves and playlist adds (weight 0.8), SoundCloud reposts (0.7), downloads (1.0). Content efficacy is measured against the human creator baseline in the same genre and release period. The correction loop feeds back which melodic or production patterns resonated and which didn't.
Image generation. Platform signals: iStockPhoto downloads/purchases (1.0), Pinterest saves (0.8), Behance appreciations (0.5). Discoverability efficacy measures whether the work surfaces in search results and recommendations — tagging, titling, and category alignment are part of the creative skill.
Writing assistance. Platform signals: Medium saves (0.8), read completion rate (0.4), comments (0.4). Content efficacy measures conversion from view to completion — a piece that holds attention to the end scores higher than one that gets opened and abandoned.
Video synthesis. Platform signals: YouTube saves (0.8), shares (0.7), like/view ratio (0.5). Discoverability efficacy measures click-through rate from thumbnails and titles — the creative framing of content, not just its substance, is part of the quality measure.
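For any of these platforms, the scored signal set comes straight from the PLATFORM_SIGNALS map shown earlier. A quick illustration with made-up YouTube counts:

```python
# PLATFORM_SIGNALS entry as defined in creative_efficacy.py (shown earlier); counts are made up.
PLATFORM_SIGNALS = {"youtube": ["views", "likes", "shares", "saves"]}

observed = {"views": 18_000, "likes": 950, "shares": 120, "saves": 400, "dislikes": 30}
recommended = PLATFORM_SIGNALS["youtube"]
# Score only the signals the platform profile recommends; ignore the rest.
scored = {name: count for name, count in observed.items() if name in recommended}
print(scored)  # {'views': 18000, 'likes': 950, 'shares': 120, 'saves': 400}
```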
Brand mode (Mode C above) uses the confidence weight to enforce style consistency. The assertions store accumulates brand voice guidelines as high-confidence Class C assertions; deviations from brand voice are treated as contradictions and corrected exactly as factual errors are in STEM domains.
Personal creator tools help individual creators improve their own work over time. A personal assertions store accumulates what resonates with the creator's specific audience, and the correction loop is personalized: rather than averaging across all users, it learns from each creator's own signal.
The correction loop in creative domains works differently from high-stakes domains in three important ways:
1. The abstention gate rarely fires. With c_min=0.05, the system almost never triggers the abstention gate. In creative domains, "I'm not confident this will resonate" is not a reason to withhold output — it is an invitation to explore. The correction loop is the safety mechanism: if a low-confidence output doesn't resonate (low platform signal), the correction prevents that pattern from repeating; if it resonates unexpectedly, the positive signal updates the efficacy EMA upward and the system learns that the pattern works.
2. Signal accumulates slowly. Creative signal takes time to accumulate — a newly uploaded track needs time to gather saves and reposts. The MIN_OBSERVATIONS = 50 threshold prevents premature corrections on sparse data, and calibration cycles for creative domains should be longer than for STEM domains: weekly rather than several times daily, allowing engagement signal to accumulate before the next DPO calibration run.
3. Corrections are scoped by category. Ambient electronic music that underperforms relative to ambient electronic baselines should not contaminate the model's behavior on hip-hop production. The two-component efficacy model stores baselines per platform:category key, so a work is only compared to its category peers, and correction loop updates are scoped by category rather than applied globally (see the sketch below).
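A minimal sketch of that per-category scoping, assuming baselines are plain per-key averages. The keys follow the platform:category convention described above; the rate values are illustrative:

```python
# Minimal sketch of category-scoped baselines; keys follow the "platform:category"
# convention, and the rate values here are illustrative.
baselines: dict = {
    "spotify:ambient_electronic": 0.031,   # average weighted engagement per stream
    "spotify:hiphop": 0.054,
    "pinterest:interior_design": 0.042,
}

def baseline_for(platform: str, category: str):
    """A work is only ever compared against its own platform:category peers,
    so an underperforming ambient release never shifts the hip-hop baseline."""
    return baselines.get(f"{platform}:{category}")

print(baseline_for("spotify", "ambient_electronic"))  # 0.031
print(baseline_for("spotify", "hiphop"))               # 0.054
```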
Phase 5 of the framework roadmap is dedicated to creative fields — platform signal collection pipeline and two-component efficacy measurement. See §12 of the full whitepaper.
Seed category baselines with CreativeEfficacyTracker.add_baseline(); aim for at least 20–30 works to build a meaningful average. AUA v1.0 supports curiosity-weighted fields out of the box: configure platform signal scoring via a custom UtilityScorerPlugin, and let the correction loop learn from saves, downloads, and shares.
```bash
pip install adaptive-utility-agent
aua init my-creative-systems-agent --preset generalist --tier macbook
cd my-creative-systems-agent
aua doctor
```
```yaml
# aua_config.yaml
specialists:
  - name: creative_writing
    model: qwen-coder-7b-awq
    port: 11434
    field: creative_writing

safety:
  abstention_enabled: false
  require_arbiter_for_high_risk: true
  min_confidence_for_direct_answer: 0.05

security:
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}

audit:
  enabled: true
  hash_chain: true
```
Generate your encryption key with `python3 -c "import os; print(os.urandom(32).hex())"` or `openssl rand -hex 32` — a 64-character hex string. See Tutorial §12.4 for key management.
```bash
aua serve

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "...", "session_id": "demo"}'
```
| AUA v1.0 provides | You bring |
|---|---|
| Multi-specialist routing + utility scoring | Domain-specific specialist models |
| Arbiter + contradiction detection | Domain-specific quality criteria |
| Correction loop + DPO pair export | Fine-tuning infrastructure (TRL, Axolotl, …) |
| Blue-green deployment + rollback | Evaluation datasets for your domain |
| Append-only audit log with hash chain | Platform signal ingestion (saves, downloads, …) |
| Prometheus + Grafana + OTEL | Your monitoring infrastructure |
Full instructions: AUA Tutorial · Framework v1.0 · GitHub ↗