Domain Deep-Dive · v1.0

Creative Systems

How the Adaptive Utility Agent framework closes the feedback loop between creative output and platform engagement signal — and why the curiosity term, suppressed in every other domain, finally takes center stage here.

For: generative media builders · creative tooling teams · content platforms · brand & voice systems

1. The missing feedback loop in creative AI

Generative AI tools for creative work — image generators, music models, writing assistants, video synthesis — share a structural problem with every other deployed AI system: there is no feedback loop between what gets generated today and what the model produces tomorrow. A model that consistently produces images in a style that doesn't resonate with an audience will continue producing them until someone retrains the model from scratch.

For creative platforms this failure mode is particularly costly for two reasons: creative quality has no single ground truth to test against, and the signal that does exist (audience response on the platform) never reaches the model that produced the work.

The framework addresses both: the two-component creative efficacy model turns platform signal into a structured quality measure, and the correction loop feeds that measure back into the model continuously between releases.

2. The utility function in creative domains

The creative domain has a fundamentally different utility profile from every other domain in the framework. Compare it to surgery:

Domain             Weight              Value   Rationale
Surgery            Confidence (w_c)    0.70    Confidence dominates; wrong-but-confident is dangerous
Creative writing   Confidence (w_c)    0.05    Confidence is nearly irrelevant; consistency ≠ quality
Creative writing   Curiosity (w_k)     0.15    Exploration is the primary creative value driver
Art                Efficacy (w_e)      0.80    Platform resonance is the dominant quality signal

From agent/config.py:

"creative_writing": FieldConfig(
    w_efficacy=0.80,    # platform resonance is the primary signal
    w_confidence=0.05,  # internal consistency barely matters — this is art
    w_curiosity=0.15,   # exploration and novelty are core value drivers
    c_min=0.05,         # almost no floor — allowed to take creative risks
    e_min=0.15,         # minimal efficacy floor
    penalty_multiplier=1.0  # creative mistakes are recoverable; no harsh penalty
),

"art": FieldConfig(
    w_efficacy=0.80,
    w_confidence=0.10,
    w_curiosity=0.10,
    c_min=0.10,
    e_min=0.20,
    penalty_multiplier=1.0
)

In surgery, the curiosity weight is suppressed to zero (w_curiosity=0.00) — an exploring surgeon is a dangerous surgeon. In creative domains it rises to 0.15, and the cap that limits curiosity to 50% of total utility (there to prevent exploration from dominating in safety-critical fields) becomes practically non-binding at the low confidence minimum. The system is explicitly designed to explore, take creative risks, and produce unexpected outputs — and to learn from what resonates.

The curiosity term finally has a home. In every other domain documented here — AV, industrial robots, grid management, pricing — the curiosity term is a minor background signal, kept small by high confidence minimums and penalty multipliers. In creative domains, with c_min=0.05 and penalty=1.0, curiosity is the primary driver of differentiated output. A system that only produces what it is already confident is good is not a creative system — it is an interpolation engine. The curiosity term is what enables the framework to generate genuinely novel work rather than averaging existing styles.
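To make the weight profile concrete, the sketch below combines efficacy, confidence, and curiosity scores into a single utility value using the creative_writing weights from config.py. It is a minimal sketch under assumptions: the plain weighted sum, the floor check, and the way the 50% curiosity cap is applied here are stand-ins for the framework's actual utility formula.

# Minimal sketch: assumes a plain weighted sum with floor checks and the
# 50% curiosity cap; the framework's actual utility formula may differ.
def creative_utility(efficacy, confidence, curiosity,
                     w_e=0.80, w_c=0.05, w_k=0.15,
                     c_min=0.05, e_min=0.15):
    # Floors: below these minimums the output earns no utility at all.
    if confidence < c_min or efficacy < e_min:
        return 0.0
    base = w_e * efficacy + w_c * confidence
    # Cap: curiosity may contribute at most 50% of total utility,
    # i.e. never more than the efficacy + confidence contribution.
    curiosity_term = min(w_k * curiosity, base)
    return base + curiosity_term

# A novel, low-confidence output that resonates still scores well:
print(creative_utility(efficacy=0.7, confidence=0.2, curiosity=0.9))  # ≈ 0.71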

3. The two-component creative efficacy model

The core engineering challenge in creative domains is measuring quality. Unlike software engineering (tests pass or fail) or surgery (outcome is medically verifiable), creative quality is contextual, audience-dependent, and cannot be reduced to a single ground truth. The framework's response, implemented in agent/creative_efficacy.py, is a two-component model that mirrors how human creators actually measure their own success:

Creative_Efficacy = sqrt(Content_Efficacy × Discoverability_Efficacy)

Content_Efficacy      — weighted engagement per view
                        "can the work hold attention when shown?"
                        Measures: saves, likes, shares, purchases ÷ views

Discoverability_Efficacy — raw reach relative to human creator baseline
                        "can it find an audience at all?"
                        Measures: total impressions, streams, plays vs. baseline

Combined: geometric mean — both components must be strong.
A viral but low-quality work (high reach, low engagement rate) scores low.
A high-quality but undiscovered work (high engagement rate, no reach) also scores low.
The score rewards genuine resonance at scale.

This is the same efficacy scale used in STEM fields — 0.5 means matching human creator average in the category, >0.5 means outperforming the baseline, <0.5 means underperforming. The consistency across domains is intentional: it allows the same utility function formula to govern both a surgical AI and a music generation system, with only the weights and the efficacy measurement method changing.
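A minimal sketch of the combination step, assuming both component scores are already normalized to this 0-to-1 scale; the helper name is illustrative, not the creative_efficacy.py API.

import math

def combined_creative_efficacy(content_efficacy: float,
                               discoverability_efficacy: float) -> float:
    # Geometric mean: both components must be strong for the score to be high.
    # Illustrative helper; inputs assumed already on the 0-1 scale
    # where 0.5 = human creator baseline.
    return math.sqrt(content_efficacy * discoverability_efficacy)

# Viral-but-shallow and brilliant-but-invisible both land below baseline:
print(combined_creative_efficacy(0.25, 0.85))  # ≈ 0.46
print(combined_creative_efficacy(0.85, 0.15))  # ≈ 0.36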

4. Platform signal collection and weighting

The two-component model is only as good as the signals feeding it. From the actual implementation:

# Signal weights by intent strength — from creative_efficacy.py
SIGNAL_WEIGHTS = {
    "purchase":    1.0,   # strongest — real economic behavior
    "download":    1.0,
    "save":        0.8,   # strong intent signal
    "bookmark":    0.8,
    "share":       0.7,   # audience endorsement
    "repost":      0.7,
    "like":        0.5,   # moderate — easy to give, less meaningful
    "upvote":      0.5,
    "comment":     0.4,   # engagement signal, noisy
    "view":        0.1,   # weak — could be accidental
    "listen":      0.1,
    "impression":  0.05,  # nearly no signal
}

# Platform → recommended signals
PLATFORM_SIGNALS = {
    "soundcloud":  ["listens", "likes", "reposts", "downloads"],
    "spotify":     ["streams", "saves", "playlist_adds"],
    "pinterest":   ["saves", "clicks", "reposts"],
    "istockphoto": ["downloads", "purchases"],
    "medium":      ["reads", "claps", "saves"],
    "youtube":     ["views", "likes", "shares", "saves"],
    "behance":     ["views", "appreciations", "saves"],
}

Minimum observations before scoring. The model requires 50 observations (views/listens) before a content efficacy score is considered meaningful (MIN_OBSERVATIONS = 50). Below this threshold, the system returns sufficient_data=False and withholds the score to prevent spurious corrections based on single-digit engagement counts. This is the creative domain equivalent of the confidence gate — the system does not act on insufficient signal.
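The sketch below shows one way the weighted signals and the observation gate could fit together. Only SIGNAL_WEIGHTS and MIN_OBSERVATIONS come from the excerpt above; the per-view weighting formula and the normalization against a category baseline are assumptions, not the creative_efficacy.py implementation.

# SIGNAL_WEIGHTS (subset) and MIN_OBSERVATIONS as quoted above; the scoring
# and normalization below are illustrative assumptions.
SIGNAL_WEIGHTS = {"purchase": 1.0, "download": 1.0, "save": 0.8, "share": 0.7,
                  "like": 0.5, "comment": 0.4, "view": 0.1, "impression": 0.05}
MIN_OBSERVATIONS = 50

def content_efficacy(signals: dict, baseline_rate: float) -> dict:
    """Weighted engagement per view, gated on minimum observations."""
    views = signals.get("view", 0) + signals.get("listen", 0)
    if views < MIN_OBSERVATIONS:
        # Not enough signal yet; withhold the score entirely.
        return {"sufficient_data": False, "score": None}
    weighted = sum(SIGNAL_WEIGHTS.get(name, 0.0) * count
                   for name, count in signals.items()
                   if name not in ("view", "listen", "impression"))
    rate = weighted / views
    # Illustrative normalization: matching the human baseline rate maps to 0.5.
    score = min(1.0, 0.5 * rate / baseline_rate) if baseline_rate > 0 else 0.0
    return {"sufficient_data": True, "score": score}

# Usage (all figures illustrative, including the 6% baseline engagement rate):
print(content_efficacy({"view": 5000, "like": 400, "save": 200, "download": 80},
                       baseline_rate=0.06))   # {'sufficient_data': True, 'score': ~0.73}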

Content efficacy vs. discoverability: why both matter

Example A — High content efficacy, low discoverability:
    Views: 80        Likes: 60    Saves: 40    Downloads: 16
    Content efficacy: ~0.85   (extremely high engagement rate)
    Discoverability: ~0.15    (almost no one saw it)
    Combined: sqrt(0.85 × 0.15) ≈ 0.36   ← low overall score

Example B — Low content efficacy, high discoverability:
    Views: 50,000    Likes: 200   Saves: 50    Downloads: 10
    Content efficacy: ~0.25   (very low engagement per view)
    Discoverability: ~0.85    (strong reach)
    Combined: sqrt(0.25 × 0.85) ≈ 0.46   ← below baseline

Example C — Balanced:
    Views: 5,000     Likes: 400   Saves: 200   Downloads: 80
    Content efficacy: ~0.72   (strong engagement rate)
    Discoverability: ~0.70    (good reach)
    Combined: sqrt(0.72 × 0.70) ≈ 0.71   ← above baseline

The geometric mean penalizes imbalance. A model that generates content optimizing for virality at the expense of actual engagement quality will score no better than one that generates high-quality-but-invisible work. Both components must improve together.

5. Operating modes: exploration vs. refinement

Creative AI systems face a fundamental tension: too much consistency produces derivative, predictable output; too much novelty produces work that doesn't connect with any audience. The framework handles this with operating modes that shift the weight on the curiosity term:

Mode A — Exploration / ideation
    w_e = 0.60    w_c = 0.05    w_k = 0.35
    Novelty exploration mode. The system is rewarded for producing unfamiliar outputs that still resonate. Used for ideation, style development, and breaking established patterns. Curiosity weight elevated.

Mode B — Standard creative production
    w_e = 0.80    w_c = 0.05    w_k = 0.15
    Platform resonance dominates. Exploration is still present but secondary. Default creative_writing profile from config.py.

Mode C — Brand voice / consistency
    w_e = 0.65    w_c = 0.30    w_k = 0.05
    Brand voice consistency mode. The confidence weight rises — the system is penalized for outputs that deviate from established style. Exploration is suppressed. Used for brand compliance and style guide adherence.

Mode D — Platform optimization
    w_e = 0.90    w_c = 0.05    w_k = 0.05
    Maximum resonance optimization. Platform signal almost entirely governs output. Used when the objective is explicit: maximize saves, downloads, or conversions for a specific content type and platform.
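One way to express the four modes in code is as FieldConfig variants of the creative_writing profile from §2. This is a sketch: the mode names, the import path, and the reuse of creative_writing's floors and penalty across all four profiles are assumptions, not framework defaults.

from agent.config import FieldConfig   # import path assumed from the config.py excerpt above

# Illustrative mode profiles: only the three weights per mode come from the
# modes above; floors and penalty are carried over from creative_writing.
CREATIVE_MODES = {
    "exploration":           FieldConfig(w_efficacy=0.60, w_confidence=0.05, w_curiosity=0.35,
                                         c_min=0.05, e_min=0.15, penalty_multiplier=1.0),
    "standard_production":   FieldConfig(w_efficacy=0.80, w_confidence=0.05, w_curiosity=0.15,
                                         c_min=0.05, e_min=0.15, penalty_multiplier=1.0),
    "brand_voice":           FieldConfig(w_efficacy=0.65, w_confidence=0.30, w_curiosity=0.05,
                                         c_min=0.05, e_min=0.15, penalty_multiplier=1.0),
    "platform_optimization": FieldConfig(w_efficacy=0.90, w_confidence=0.05, w_curiosity=0.05,
                                         c_min=0.05, e_min=0.15, penalty_multiplier=1.0),
}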

6. Applications: generative media, brand voice, content platforms

🎵

Music generation platforms

Platform signals: Spotify saves and playlist adds (weight 0.8), SoundCloud reposts (0.7), downloads (1.0). Content efficacy measured against human creator baseline in the same genre and release period. Correction loop feeds back which melodic or production patterns resonated and which didn't.

🎨

Image & visual generation

Platform signals: iStockPhoto downloads/purchases (1.0), Pinterest saves (0.8), Behance appreciations (0.5). Discoverability efficacy measures whether the work surfaces in search results and recommendations — tagging, titling, and category alignment are part of the creative skill.

✍️

Long-form writing & editorial

Platform signals: Medium saves (0.8), reads completion rate (0.4), comments (0.4). Content efficacy measures conversion from view to completion — a piece that holds attention to the end scores higher than one that gets opened and abandoned.

🎬

Video & short-form content

Platform signals: YouTube saves (0.8), shares (0.7), like/view ratio (0.5). Discoverability efficacy measures click-through rate from thumbnails and titles — the creative framing of content, not just its substance, is part of the quality measure.

🏷️

Brand voice systems

Brand mode (Mode C above) uses confidence weight to enforce style consistency. The assertions store accumulates brand voice guidelines as high-confidence Class C assertions. Deviations from brand voice are treated as contradictions and corrected exactly as factual errors are in STEM domains.

📱

Creator economy tooling

Tools that help individual creators improve their own work over time. Personal assertions store accumulates what resonates with the creator's specific audience. The correction loop is personalized — not averaging across all users but learning from each creator's own signal.

7. The correction loop for creative feedback

The correction loop in creative domains works differently from high-stakes domains in three important ways:

No hard abstention — creative risk is allowed

With c_min=0.05, the system almost never triggers the abstention gate. In creative domains, "I'm not confident this will resonate" is not a reason to withhold output — it is an invitation to explore. The correction loop is the safety mechanism: if the low-confidence output doesn't resonate (low platform signal), the correction prevents that pattern from repeating. If it does resonate unexpectedly, that positive signal updates the efficacy EMA upward and the system learns the pattern works.
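A minimal sketch of that EMA update, assuming a simple exponential moving average; the smoothing factor and update rule are illustrative, not the framework's implementation.

def update_efficacy_ema(ema: float, observed: float, alpha: float = 0.2) -> float:
    # Standard EMA: the latest observed efficacy nudges the running estimate.
    # alpha = 0.2 is an illustrative smoothing factor.
    return alpha * observed + (1 - alpha) * ema

ema = 0.50                              # sitting at the human baseline
ema = update_efficacy_ema(ema, 0.72)    # a risky output resonated  -> 0.544
ema = update_efficacy_ema(ema, 0.30)    # the next one fell flat    -> ~0.495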

Platform signal lag requires patience

Creative signal takes time to accumulate — a newly uploaded track needs time to gather saves and reposts. The MIN_OBSERVATIONS = 50 threshold prevents premature corrections on sparse data. Calibration cycles for creative domains should be longer than for STEM domains: weekly rather than several-times-daily, allowing engagement signal to accumulate before the next DPO calibration run.

Category-specific baselines prevent cross-contamination

Ambient electronic music that underperforms relative to ambient electronic baselines should not contaminate the model's behavior on hip-hop production. The two-component efficacy model stores baselines per platform:category key — a work is only compared to its category peers. The correction loop updates are scoped by category, not applied globally.
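A sketch of that scoping, assuming baselines live in a flat mapping keyed by "platform:category"; the data structure is illustrative, consistent with the description above but not the creative_efficacy.py internals.

from collections import defaultdict

# Baselines keyed by "platform:category": a work is only compared to its
# category peers, and correction updates are scoped the same way.
baselines: dict[str, list[float]] = defaultdict(list)

def add_baseline_observation(platform: str, category: str, rate: float) -> None:
    baselines[f"{platform}:{category}"].append(rate)

def category_baseline(platform: str, category: str) -> float | None:
    rates = baselines.get(f"{platform}:{category}")
    return sum(rates) / len(rates) if rates else None

add_baseline_observation("spotify", "ambient", 0.031)
add_baseline_observation("spotify", "ambient", 0.027)
add_baseline_observation("spotify", "hip_hop", 0.054)   # never touches the ambient baseline
print(category_baseline("spotify", "ambient"))          # 0.029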

8. MVP shape

Phase 5 of the framework roadmap is dedicated to creative fields — platform signal collection pipeline and two-component efficacy measurement. See §12 of the full whitepaper.

  1. Pick one platform and one category — Spotify ambient music, or Pinterest product photography. The category-specific baseline is what makes the efficacy score meaningful. Start narrow enough to build a robust human creator baseline.
  2. Collect human creator baseline signals — register human creator engagement signals for the category using CreativeEfficacyTracker.add_baseline() (see the sketch after this list). Aim for at least 20–30 works to build a meaningful average.
  3. Deploy in production and collect 50+ observations per work — the minimum observation threshold before any correction is applied. In early deployment, focus on data collection, not correction.
  4. Run first calibration cycle on content efficacy signal — identify works where content efficacy is consistently below baseline for the same category. These are the first correction targets.
  5. Extend to discoverability efficacy — after content quality corrections are stable, add the discoverability component. This requires more platform instrumentation (impressions, search rank) but produces the combined score that distinguishes genuinely resonant work from well-distributed mediocrity.
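Step 2 might look roughly like the following. Only the class name and add_baseline() come from the steps above; the import path, constructor arguments, and method signature are assumptions.

# Hypothetical usage of step 2: import path, constructor, and add_baseline()
# signature are assumed; only the class and method names appear in the steps above.
from agent.creative_efficacy import CreativeEfficacyTracker

tracker = CreativeEfficacyTracker(platform="spotify", category="ambient")

# 20-30 human creator works from your own data source build the category baseline.
human_ambient_releases = [
    {"streams": 12_000, "saves": 310, "playlist_adds": 95},
    {"streams": 8_400,  "saves": 190, "playlist_adds": 60},
    # ...
]
for work in human_ambient_releases:
    tracker.add_baseline(signals=work)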

Build it with AUA v1.0

Configure this domain today

AUA v1.0 supports curiosity-weighted fields out of the box. Configure platform signal scoring via a custom UtilityScorerPlugin, and let the correction loop learn from saves, downloads, and shares.

1. Install

pip install adaptive-utility-agent

2. Scaffold for this domain

aua init my-creative-systems-agent --preset generalist --tier macbook
cd my-creative-systems-agent
aua doctor

3. Key config for this domain

# aua_config.yaml
specialists:
  - name: creative_writing
    model: qwen-coder-7b-awq
    port: 11434
    field: creative_writing

safety:
  abstention_enabled: false
  require_arbiter_for_high_risk: true
  min_confidence_for_direct_answer: 0.05

security:
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}

audit:
  enabled: true
  hash_chain: true

Generate your encryption key with python3 -c "import os; print(os.urandom(32).hex())" or openssl rand -hex 32; either command produces the required 64-character hex string. See Tutorial §12.4 for key management.

4. Start and query

aua serve

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "...", "session_id": "demo"}'

5. What AUA handles vs. what you bring

AUA v1.0 provides                              You bring
Multi-specialist routing + utility scoring     Domain-specific specialist models
Arbiter + contradiction detection              Domain-specific quality criteria
Correction loop + DPO pair export              Fine-tuning infrastructure (TRL, Axolotl, …)
Blue-green deployment + rollback               Evaluation datasets for your domain
Append-only audit log with hash chain          Platform signal ingestion (saves, downloads, …)
Prometheus + Grafana + OTEL                    Your monitoring infrastructure

Full instructions: AUA Tutorial · Framework v1.0 · GitHub ↗