Domain Deep-Dive · v1.0

Self-Driving & AV Systems

How the Adaptive Utility Agent framework gives AV decision stacks dynamic weight shifting across scenarios, principled abstention under uncertainty, independently updatable specialists, and an auditable decision trail that regulators can actually read.

Waymo · Cruise · Aurora
Autonomy stack engineers · Safety-case authors · Regulatory & certification teams · Embedded ML & systems architects

Why end-to-end monoliths fail AV certification

Current end-to-end neural AV systems bundle perception, motion planning, and traffic-rule reasoning into shared weights trained jointly. This is attractive for raw performance — joint training lets the system discover cross-component representations — but it creates three structural problems that compound as fleets scale and geographies expand: every update, however narrow, carries the revalidation burden of the entire shared-weight system; decisions cannot be attributed to individual components, so audit trails must be reconstructed after the fact; and errors observed in the field can only be corrected through full retraining cycles.

The framework addresses all three directly: specialist isolation narrows the update surface, the utility log produces attribution-grade audit trails, and the continual correction loop reduces repeated errors between releases. For AV companies, the strongest argument is not hardware cost but independent updateability, auditable behaviour, and principled abstention. See §2.7 of the full whitepaper for the AV framing in context.

Context-adaptive utility weights: how and when they shift

The central mechanism that makes the framework useful for real-world driving scenarios is the field-weighted utility function. Instead of hard-coded rules for each scenario type, a single formula governs all decisions — but its weights shift automatically with context, producing different priorities in different situations without a separate rule set for each one.

U = w_e(f)·E + w_c(f)·C + w_k(f)·K

E — Efficacy:    performance relative to expected safe behavior in this scenario class
C — Confidence:  internal consistency, penalized by detected contradictions
K — Curiosity:   exploration bonus (near-zero in operational AV; applies in simulation/testing)
f — context field, determining weights and minimum competence thresholds

The key property: the same formula produces different decisions across contexts because the weights, not the logic, change. No pre-written rules are needed for "school zone behavior" versus "emergency transport behavior" — only the weight vector is different, and that difference flows through to every downstream decision.
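As a minimal illustration — using the field-weight values from the §5.1 table later on this page, with invented candidate scores — the same two candidate actions rank differently under two weight profiles, with no change to the formula or logic:

# Minimal sketch of the field-weighted utility U = w_e(f)·E + w_c(f)·C + w_k(f)·K.
# Weights come from the §5.1 field table below; the candidate scores are illustrative.

FIELD_WEIGHTS = {
    # field: (w_e, w_c, w_k)
    "aviation_autopilot": (0.20, 0.70, 0.10),
    "creative_writing":   (0.40, 0.20, 0.40),
}

def utility(field: str, E: float, C: float, K: float) -> float:
    w_e, w_c, w_k = FIELD_WEIGHTS[field]
    return w_e * E + w_c * C + w_k * K

# Candidate 'a' is high-efficacy but low-confidence; 'b' is the cautious option.
candidates = {"a": (0.90, 0.60, 0.10), "b": (0.70, 0.95, 0.05)}

for field in FIELD_WEIGHTS:
    best = max(candidates, key=lambda name: utility(field, *candidates[name]))
    print(field, "->", best)
# aviation_autopilot -> b   (confidence-weighted field prefers the cautious option)
# creative_writing   -> a   (exploration-friendly field prefers the bold option)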

Weight profiles across AV operating scenarios

The following scenarios are worked examples from §2.1 of the full whitepaper, with context-specific weight derivations. Note that these profiles weight driving objectives — safety (w_s), efficiency (w_e), comfort (w_c) — at the field level; despite the overlapping subscripts, they are distinct from the utility-term weights w_e(f), w_c(f), w_k(f) in the formula above.

Scenario A — School zone / pedestrian-dense area
Safety (w_s) = 0.90 · Efficiency (w_e) = 0.07 · Comfort (w_c) = 0.03

Safety dominates at 0.90. Speed, comfort, and journey time become near-irrelevant. Any manoeuvre with marginal safety uncertainty is rejected in favor of a conservative alternative.

Scenario B — Emergency transport (ambulance routing)
Safety (w_s) = 0.56 · Efficiency (w_e) = 0.40 · Comfort (w_c) = 0.04

Efficiency rises substantially (w_e = 0.40). Safety retains priority but no longer excludes aggressive routing. Manoeuvres that would be rejected in a school zone become acceptable under this profile.

Scenario C — Highway cruise (open ODD, clear conditions)
Safety (w_s) = 0.55 · Efficiency (w_e) = 0.30 · Comfort (w_c) = 0.15

Balanced profile for well-characterized, lower-risk driving conditions. Comfort has meaningful weight because the perceived risk of smooth lane changes and optimal speed selection is low.

Scenario D — Sensor degradation / adverse weather
Safety (w_s) = 0.85 · Efficiency (w_e) = 0.10 · Comfort (w_c) = 0.05

When sensor fusion uncertainty rises — fog, heavy rain, lidar degradation — the weight profile automatically shifts toward safety-conservatism and the confidence minimum tightens. Abstention triggers earlier in degraded sensor conditions.

The key engineering advantage. None of the scenario profiles above require a hand-written rule. The weight vector is stored as a field parameter set, and the field classifier activates the appropriate profile when context is detected. Adding a new scenario type — construction zones, tunnel transitions, bad actor pedestrian behavior — requires adding a new weight profile and a classifier signal, not a new decision rule. The formula remains the same.
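A sketch of that pattern follows, with the profile values taken from Scenarios A–D above. The registry shape and the construction-zone entry are illustrative assumptions, not the framework's actual API:

# Hypothetical weight-profile registry. Each profile is data, not logic:
# adding a scenario type means adding an entry, not writing a new decision rule.

SCENARIO_PROFILES = {
    # field: {w_s: safety, w_e: efficiency, w_c: comfort}
    "school_zone":         {"w_s": 0.90, "w_e": 0.07, "w_c": 0.03},
    "emergency_transport": {"w_s": 0.56, "w_e": 0.40, "w_c": 0.04},
    "highway_cruise":      {"w_s": 0.55, "w_e": 0.30, "w_c": 0.15},
    "sensor_degradation":  {"w_s": 0.85, "w_e": 0.10, "w_c": 0.05},
}

# New scenario type: one new profile plus a classifier signal — nothing else changes.
SCENARIO_PROFILES["construction_zone"] = {"w_s": 0.80, "w_e": 0.15, "w_c": 0.05}  # illustrative values

def active_profile(field: str) -> dict:
    """The field classifier supplies `field`; the profile flows to every downstream decision."""
    return SCENARIO_PROFILES[field]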

Comparable field weights table from the whitepaper

For context, the full whitepaper defines field-weight profiles across all safety-critical domains. Aviation autopilot is the most comparable AV-adjacent entry:

Field                  w_e    w_c    w_k    C_min   E_min   Penalty multiplier
Aviation autopilot     0.20   0.70   0.10   0.95    0.90    10×
Surgery                0.15   0.75   0.10   0.95    0.90    10×
Software engineering   0.50   0.40   0.10   0.70    0.60    —
Creative writing       0.40   0.20   0.40   0.30    0.20    —

AV operating scenarios map closest to aviation autopilot: confidence-weighted, high penalty multiplier, tight minimum competence thresholds. Source: §5.1 of the full whitepaper.

Confidence gates and the abstention decision rule

The confidence gate is the most operationally important single mechanism in the framework for AV deployment. It is a hard gate — not a soft penalty — that enforces a formal abstention when the system's confidence falls below the field-specific minimum. The decision rule is:

act = argmax U     if C ≥ C_min(f)  AND  E ≥ E_min(f)
act = abstain      otherwise   → trigger escalation chain

The key distinction: below C_min, the system does not produce a lower-quality answer. It produces no answer and escalates. In an AV context, a confidently wrong answer is more dangerous than an acknowledged abstention. The gate is what makes that safety property formally verifiable — it is not a soft preference for caution, it is a hard decision boundary with a defined behavior on both sides.
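A minimal sketch of the gate, assuming per-field threshold lookups (the function name, field key, and E_min value here are illustrative, not the framework's API):

# Hard abstention gate: below threshold the system produces no answer at all.

C_MIN = {"av_operational": 0.85}   # field-specific confidence floors (illustrative)
E_MIN = {"av_operational": 0.80}   # illustrative; the whitepaper derives E_min per field

def decide(candidates, field):
    """candidates: list of (action, U, C, E) tuples scored by the utility function."""
    best = max(candidates, key=lambda c: c[1])            # argmax U
    action, U, C, E = best
    if C >= C_MIN[field] and E >= E_MIN[field]:
        return ("act", action)
    return ("abstain", "escalation_chain")                # no degraded answer is emitted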

For the AV domain, the recommended baseline confidence minimum is C_min = 0.85, derived from the aviation autopilot profile and then tightened or relaxed per decision class:

Decision class                              C_min   Below-threshold behavior
Route planning / navigation                 0.65    Request alternative route; escalate to remote assist if no alternative
Traffic rule interpretation                 0.80    Apply conservative interpretation; log for human review
Arbitration output (cross-subsystem)        0.75    Defer to most conservative specialist output; log contradiction event
Immediate obstacle / collision avoidance    0.90    Trigger safe stop; mandatory human escalation
Sensor degradation / adverse conditions     0.85    Reduce speed; abstain from any manoeuvre above minimum-risk condition

Illustrative thresholds. Must be validated against the operator's specific ODD and safety case before operational deployment. See §5.1–5.2 in the full whitepaper for the threshold derivation methodology.

Why this matters for regulators. The confidence gate is the mechanism that allows a certifiable fallback to be defined and tested. The NHTSA framework and ISO 26262 both expect safety-critical systems to degrade gracefully under uncertainty — to slow, stop, or transfer control rather than continuing to operate at reduced competence. The confidence gate produces that behavior formally and testably, which is what a safety case requires.

Specialist decomposition for the AV decision stack

The Micro-Expert model decomposes the AV reasoning stack into independently deployable domain submodels, each with its own weights, calibration cycle, and utility tracker. The decomposition maps naturally onto the functional architecture that AV teams already reason about:

👁️ Perception specialist

Object detection, classification, and tracking. Produces confidence-annotated scene representations consumed by planning and rules. Highest update cadence — edge cases from field encounters improve this specialist first. Validation via comparison against labeled ground-truth sensor recordings.

🗺️ Motion planning specialist

Trajectory generation and path selection given perception output. Updated independently when planning improvements don't require perception retraining. It carries its own confidence signal for trajectory feasibility — a plan with low feasibility confidence triggers the abstention gate before the manoeuvre executes.

📋 Traffic rules / policy specialist

Jurisdiction-specific traffic law, right-of-way logic, and regulatory constraints. Most updateable specialist when entering new geographies — the narrowest revalidation scope of any component. Adding rules for a new city updates this specialist only; perception and planning are unaffected.

⚖️ Arbiter Agent

Resolves contradictions between the above specialists. Applies the defined arbitration policy, logs resolution events with both specialist confidence values, and triggers escalation when the policy cannot resolve. The most conservative safe output wins under any unresolved contradiction.

The inter-specialist interface is a structured protocol — the same data structure the wrapper layer already produces, requiring no new design:

Request:  { query, context, field, confidence_floor, session_id }
Response: { answer, confidence, assertions[], uncertainty_flags, U_score }

Each specialist's response includes its confidence value and any uncertainty flags raised during generation. The Arbiter consumes these structured responses, not raw text outputs. This is what enables formal contradiction detection — contradictions are detected between structured claims with associated confidence values, not between free-form text strings.
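In Python terms, the protocol might look like the following dataclasses — a sketch of the structure shown above, not the framework's actual types:

# Sketch of the inter-specialist protocol as typed structures (names mirror the
# Request/Response fields above; the class definitions themselves are assumptions).

from dataclasses import dataclass, field as dc_field

@dataclass
class SpecialistRequest:
    query: str
    context: dict
    field: str                  # context field, e.g. "school_zone"
    confidence_floor: float     # the C_min the caller requires
    session_id: str

@dataclass
class Assertion:
    claim: str                  # structured, typed claim — not free-form text
    confidence: float

@dataclass
class SpecialistResponse:
    answer: str
    confidence: float
    assertions: list[Assertion] = dc_field(default_factory=list)
    uncertainty_flags: list[str] = dc_field(default_factory=list)
    U_score: float = 0.0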

The Arbiter: structured contradiction resolution between specialists

When two or more specialists produce contradictory outputs — for example, the perception specialist reports a pedestrian in the vehicle's path while the motion planning specialist generates a proceed decision — the Arbiter runs a structured resolution process rather than silently averaging or arbitrarily overriding. See §10.5 of the full whitepaper for the full Arbiter specification.

Contradiction resolution order

1. Logical check          — does one output contradict the other on formal grounds?
                            Cost: O(1) for formal domains. Available for traffic rules,
                            physics constraints, geometry.

2. Mathematical check     — is one output numerically inconsistent?
                            (e.g., claimed trajectory radius vs actual geometry)

3. Cross-session check    — does the assertions store contain a prior correction
                            for this scenario class? Apply it.

4. Empirical check        — compare against simulation-validated scenario library.
                            Costlier but definitive for known scenario classes.

→ If resolution found:    apply most conservative safe output; log correction event;
                          feed verified correction to relevant specialist via DPO pipeline.

→ If no resolution:       trigger escalation chain (see below).
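A sketch of that resolution loop follows. The check implementations, the `most_conservative` selector, and the `escalate` fallback are hypothetical caller-supplied pieces; only the ordering and outcomes follow the listing above:

# Checks run cheapest-first; the first check that resolves wins.
CHECK_ORDER = ["logical", "mathematical", "cross_session", "empirical"]

def arbitrate(conflict, checks, most_conservative, escalate):
    """checks: dict mapping check name -> callable returning candidates or None."""
    for name in CHECK_ORDER:
        candidates = checks[name](conflict)
        if candidates is not None:
            outcome = most_conservative(candidates)      # conservative safe output wins
            event = {"resolved_by": name, "outcome": outcome,
                     "dpo_correction_queued": True}      # fed to the specialist's next calibration
            return outcome, event                        # caller logs the correction event
    return escalate(conflict), {"resolved_by": None}     # unresolved -> escalation chain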

The VCG arbitration mechanism

The Arbiter's incentive structure is formally addressed through a game-theoretic treatment in the whitepaper. Treating domain specialists as players in a cooperative game with the Arbiter as the external social planner, three theorems are proved. Under the Vickrey-Clarke-Groves (VCG) mechanism, truthful reporting of confidence values is the weakly dominant strategy for every specialist (Theorem S1) — eliminating the incentive for a specialist to overstate its confidence to "win" a contradiction. The Arbiter selects the social optimum with Price of Anarchy exactly 1 (Theorem S2), and no specialist prefers to abstain from the arbitration process (Theorem S3). See §10.6 of the full whitepaper for the full proofs.

The practical implication for AV teams: the arbitration outcome is not determined by whichever specialist is loudest or most recently calibrated — it is determined by the formal mechanism that elicits honest confidence reporting and selects the outcome that maximizes the joint utility of all specialists. This is the property that makes the arbitration result defensible in a post-incident review: the Arbiter's decision follows a formally verifiable process, not a heuristic.
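For intuition, the selection step can be sketched as textbook VCG with Clarke pivot payments — a generic rendering of the mechanism class, not the whitepaper's exact specification:

# Generic VCG outcome selection with Clarke pivot payments.
# valuations[specialist][outcome] = that specialist's reported value for the outcome.

def vcg_select(valuations: dict[str, dict[str, float]]):
    outcomes = {o for v in valuations.values() for o in v}
    welfare = lambda outcome, exclude=None: sum(
        v.get(outcome, 0.0) for s, v in valuations.items() if s != exclude
    )
    best = max(outcomes, key=welfare)                     # social optimum (PoA = 1)
    payments = {
        s: max(welfare(o, exclude=s) for o in outcomes)   # others' best welfare without s
           - welfare(best, exclude=s)                     # minus others' welfare at the chosen outcome
        for s in valuations
    }
    return best, payments  # under this rule, truthful reporting is weakly dominant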

What gets logged on every contradiction event. Which specialists were in conflict, their confidence values at the time of conflict, which evidence check resolved the conflict (logical / mathematical / cross-session / empirical), the resolution outcome, the DPO correction queued for each affected specialist, and the full session context. This is the artifact that makes post-incident review tractable — not a reconstructed narrative, but a structured log of the decision process as it happened.
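As a concrete shape, a contradiction event record might serialize as follows (field names are assumptions, not the framework's actual schema):

# Illustrative contradiction-event record, covering the fields enumerated above.
contradiction_event = {
    "specialists_in_conflict": ["perception", "motion_planning"],
    "confidences_at_conflict": {"perception": 0.91, "motion_planning": 0.88},
    "resolved_by": "logical",          # logical | mathematical | cross_session | empirical
    "resolution_outcome": "yield_to_pedestrian",
    "dpo_corrections_queued": ["motion_planning"],
    "session_context": {"session_id": "demo", "field": "school_zone"},
}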

Escalation chain: intra-stack → human → minimum-risk condition

The framework defines a three-stage escalation chain for the AV context. Each stage has a defined trigger, a defined behavior, and a defined packet that flows to the next stage.

⚖️ Stage 1 — Intra-stack Arbiter

Triggered when two or more specialists produce contradictory outputs. The Arbiter runs structured evidence checks and applies the arbitration policy. If resolved: the most conservative safe output is selected, the contradiction is logged, and corrections are queued for DPO calibration. If unresolved after all four checks: advance to Stage 2.

↓ unresolved contradiction or confidence below C_min
👤 Stage 2 — Remote human operator

Vehicle transitions to a safe reduced-speed state. Escalation packet sent to remote operator including: which specialists were active, their confidence values, the contradiction record, current sensor state, and position. The full packet is the key artifact for both real-time assist and post-incident review. If remote operator not available within latency budget: advance to Stage 3.

↓ no operator available or escalation latency unsafe
🛑 Stage 3 — Minimum-risk condition (MRC)

Controlled stop at the safest available location within the ODD. This is not a failure mode to be minimized — it is the designed behavior when the system reaches the boundary of its competence. The confidence gate exists precisely so that this boundary is formally testable and reproducible, not emergent from opaque model internals.
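In code, the chain reduces to a small, testable decision procedure. This is a sketch: the `arbiter`, `vehicle`, and `remote_ops` collaborators and the latency budget value are hypothetical, while the stage ordering and packet contents follow the stages above:

# Three-stage escalation chain; every stage has a defined trigger and behavior.

def escalate(conflict, arbiter, vehicle, remote_ops, latency_budget_ms=500):
    # Stage 1: intra-stack Arbiter runs the four structured evidence checks.
    resolution = arbiter.resolve(conflict)             # None if all checks fail
    if resolution is not None:
        return resolution                              # most conservative safe output

    # Stage 2: remote human operator, under a hard latency budget (value illustrative).
    vehicle.enter_reduced_speed_state()
    packet = {                                         # the Stage 2 escalation packet
        "active_specialists": conflict.specialists,
        "confidences": conflict.confidences,
        "contradiction_record": conflict.record,
        "sensor_state": vehicle.sensor_state(),
        "position": vehicle.position(),
    }
    decision = remote_ops.request(packet, timeout_ms=latency_budget_ms)
    if decision is not None:
        return decision

    # Stage 3: minimum-risk condition — the designed boundary-of-competence behavior.
    return vehicle.execute_minimum_risk_condition()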

Edge hardware deployment: Jetson-class specialists and pipeline parallelism

For AV deployment, the hardware-adaptive decomposition argument is stronger than for datacenter deployment — not a cost argument, but a physical feasibility argument. A monolithic frontier model is not merely more expensive than a Micro-Expert graph on vehicle hardware; it cannot be deployed within a vehicle's power envelope at all.

H100 SXM5 (datacenter)        700 W     Exceeds the entire vehicle compute budget
RTX 4090 (workstation)        450 W     Still impractical for the vehicle thermal envelope
3× Jetson Orin NX (vehicle)   ~75 W     Total system: perception + planning + rules
Jetson AGX Orin               15–60 W   32 GB unified memory, ~$900

Hardware          VRAM            TDP       Approx cost (2025)   Role in AV graph
Jetson AGX Orin   32 GB unified   15–60 W   ~$900                Perception specialist or hub node
Jetson Orin NX    16 GB unified   10–25 W   ~$500                Planning or traffic rules specialist
H100 SXM5         80 GB           700 W     ~$30,000–35,000      Datacenter only — physically impossible in vehicle
RTX 4090          24 GB           450 W     ~$1,600–2,000        Edge server / test vehicle only

Hardware specs and pricing from NVIDIA and published sources (2025). See §10.9.6 of the full whitepaper for the full edge deployment analysis and Jetson-specific worked examples.

Pipeline parallelism: specialists run concurrently, not sequentially

A critical point that is often missed: in the Micro-Expert deployment on vehicle hardware, the domain specialists do not run one after another — they run in a pipeline. Total added latency is bounded by the slowest pipeline stage, not by the sum of all specialists' inference times.

Standard pipeline (sequential — naive assumption):
    Perception  → Planning  → Traffic Rules  → Decision
    [50ms]         [40ms]       [20ms]           [5ms]
    Total: ~115ms

Micro-Expert pipeline (parallel — actual architecture):
    Perception        [50ms] ──────────────────────────────┐
    Planning     ←──────────── [40ms, from perception] ────┤→ Arbiter → Decision
    Traffic Rules     [20ms running in parallel] ──────────┘
    Total: ~55ms + arbitration overhead ≈ 60–65ms

This is the same architecture that production AV stacks — Tesla FSD, Waymo's neural network pipeline — already use for their perception and planning modules. The Micro-Expert model applies the same parallelism at the model-graph level, where each parallel element is a domain specialist rather than a fixed computational stage. The result is that a three-specialist deployment on Jetson hardware adds less end-to-end latency than the sequential assumption suggests, and remains within real-time control requirements for most AV decision loops (typically <100ms for non-emergency decisions).
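The latency arithmetic can be stated in a few lines — a toy model of the figures above, assuming the pipelined path lets each stage overlap the next perception pass:

# Toy latency model for the specialist pipeline. Stage times from the diagram above.

STAGE_MS = {"perception": 50, "planning": 40, "traffic_rules": 20}
DECISION_MS = 5

sequential_ms = sum(STAGE_MS.values()) + DECISION_MS   # 115 ms — the naive assumption
pipelined_ms = max(STAGE_MS.values()) + DECISION_MS    # 55 ms — slowest stage bounds the tick
# Add arbitration overhead (~5–10 ms) to the pipelined figure: ≈ 60–65 ms end to end,
# within typical <100 ms real-time requirements for non-emergency decisions.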

Independent updateability and narrower certification scope

The most important certification argument for the specialist decomposition is what it does to the revalidation scope of individual updates. In a monolithic system, every change — even a traffic rule update for a new geography — carries the full validation burden of the entire system. In the specialist decomposition, updating the traffic rules specialist does not automatically force revalidation of perception or planning, provided the interfaces between specialists remain stable.

Revalidation scope: monolithic vs. specialist decomposition — illustrative
Update type                 Monolith revalidation   Specialist revalidation
New traffic rules (city)    100%                    ~18%
Perception improvement      100%                    ~35%
Planning bias correction    100%                    ~30%

Illustrative. Actual revalidation scope depends on interface stability and the degree to which the update touches shared embedding layers. The traffic rules case is the strongest — it has the narrowest interface and the most contained update surface.

Mapping to ISO 26262 and ASIL decomposition

The framework's specialist architecture is compatible with ISO 26262 ASIL decomposition reasoning: safety goals are allocated to components, and component-level validation can satisfy the overall system safety case if the decomposition is sound.

The framework is not a replacement for the full ISO 26262 process. But it is compatible with that reasoning pattern in a way that a monolithic model is not, because it produces the modular structure and the behavioral records that the standard's evidence requirements depend on. See §12 Phase 9 for the planned safety-critical deployment validation work.

Shadow mode as blue-green deployment at vehicle scale

Shadow mode testing — already standard practice for AV validation — maps directly onto the framework's blue-green deployment protocol. A new submodel can run silently beside the production model, its utility compared against the live path before promotion. The blue-green trigger condition for AV specialists:

Trigger when ALL of:
    |U_current - U_baseline| > δ(field)     [significant deviation]
    deviation sustained for ≥ T interactions  [not transient noise]
    held-out scenario library available        [can evaluate candidate]

δ(aviation/AV)  ≈ 0.005–0.010   [very sensitive — small changes matter]
T(aviation/AV)  ≥ 246 interactions  [high confidence window required before promotion]

The high T value for safety-critical fields means that a specialist must prove sustained improvement across a statistically significant sample before any traffic is shifted from the production model. This is more conservative than standard software blue-green deployment and is calibrated to the error cost of the domain. See §10.7 of the full whitepaper for the full deployment lifecycle.
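A sketch of the trigger check itself, using the δ and T values from the listing above (the δ value is the midpoint of the stated band; the function shape is an assumption, not the framework's API):

# Blue-green promotion trigger for safety-critical fields.

DELTA = {"aviation_av": 0.0075}   # δ(field): midpoint of the 0.005–0.010 band (illustrative)
T_MIN = {"aviation_av": 246}      # minimum sustained-deviation window, in interactions

def should_trigger(u_current, u_baseline, sustained_for, field, holdout_available):
    significant = abs(u_current - u_baseline) > DELTA[field]
    sustained = sustained_for >= T_MIN[field]
    return significant and sustained and holdout_available   # ALL three must hold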

Fleet-level correction propagation from edge encounters

One of the structural problems with deployed AV systems is that an edge case encountered by one vehicle takes months to improve behavior across the fleet — because the correction path goes through a full retraining cycle. The framework provides a faster, narrower-scope correction path through the assertions store and DPO calibration pipeline.

When an edge case is encountered and the escalation packet is reviewed:

  1. The reviewed case is logged as a structured correction in the assertions store — not as raw text, but as a typed, confidence-annotated assertion with the scenario context and the verified correct behavior.
  2. The correction is routed to the relevant specialist only. A rare pedestrian behavior updates the perception specialist's assertions store; a new traffic law in a new jurisdiction updates the rules specialist. The other specialists are unaffected.
  3. At the next calibration cycle, the correction becomes a DPO training pair — (incorrect behavior, reviewed correct behavior) — weighted by the field penalty multiplier. For AV decision classes with a high penalty multiplier, this correction is trained proportionally harder than corrections in lower-stakes domains.
  4. The updated specialist is promoted via blue-green protocol — canary at 5% of fleet, gradual expansion as utility improves, full promotion only after the statistical confidence window is satisfied.
  5. Future encounters with similar scenarios benefit immediately from the assertions store injection — even before the DPO calibration cycle completes — because the correction is injected into the specialist's context at session start.
What this means for fleet scale. An edge case encountered on Monday in one vehicle can propagate a behavioral correction to similar scenarios across the fleet by the following calibration cycle — without a full retraining event, and without any other specialist's behavior being affected. The correction is scoped, traceable, and auditable: the assertions store records when the correction was added, from which escalation event, and which calibration cycle propagated it to production.
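A sketch of the correction path from one reviewed escalation to a DPO training pair — record fields, identifiers, and the function shape are illustrative, not the framework's schema:

# From a reviewed escalation event to a weighted DPO pair for one specialist.

correction = {
    "scenario_class": "rare_pedestrian_behavior",
    "source_escalation": "evt-2025-0314-07",        # illustrative event identifier
    "target_specialist": "perception",               # routed to this specialist only
    "rejected_behavior": "classified_as_static_object",
    "preferred_behavior": "classified_as_pedestrian_yield",
    "confidence": 0.97,                              # reviewer-verified
}

def to_dpo_pair(c, penalty_multiplier):
    # High-penalty fields train their corrections proportionally harder.
    return {
        "prompt": c["scenario_class"],
        "rejected": c["rejected_behavior"],
        "chosen": c["preferred_behavior"],
        "weight": penalty_multiplier,                # e.g. 10× for AV decision classes
    }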

MVP shape: shadow-mode validation before any live control

A practical MVP for an AV team adopting this framework does not start with any live vehicle control role. It starts in shadow mode, building a logged track record against the production system before the framework influences any real decisions. This follows the same validation logic as standard AV disengagement analysis, but with structured utility and confidence metrics rather than binary disengagement events.

1. Deploy in shadow mode — no influence on vehicle behavior. The framework runs silently alongside the production stack, consuming the same sensor and decision inputs, generating utility values, confidence scores, and contradiction events, but not influencing any vehicle behavior. The goal is data collection, not performance.

2. Build the assertions store from shadow logs. Identify which scenario classes produce the lowest-confidence outputs or the highest contradiction-event rates. These are the highest-priority calibration targets and the scenarios where the framework would have been most likely to escalate in a live role.

3. Validate the Arbiter policy against labeled scenario recordings. Run the Arbiter against a library of labeled scenario recordings where the correct outcome is known. Measure the rate at which the Arbiter selects the correct outcome versus the incorrect one, and the rate at which it escalates when the correct outcome is not available to it. Tune the arbitration policy before any live role.

4. Set conservative confidence thresholds — expect high shadow escalation rates. Start with C_min = 0.90 for all decision classes and log how often the shadow system would have escalated under that threshold. This is the calibration baseline. Relax thresholds only after the assertions store has built a track record on the real traffic distribution.

5. Run the first calibration cycle on shadow-collected DPO pairs. Use the shadow logs to build (preferred, rejected) pairs for each specialist — the scenarios where the shadow system's confidence was high and correct, versus the scenarios where it was low or wrong. Run the first DPO calibration cycle and measure whether confidence calibration improves on the held-out scenario library.

6. Promote to limited operational control only after shadow blue-green validation. The shadow system (BLUE) runs alongside the new candidate specialist graph (GREEN) until GREEN's utility profile matches or exceeds BLUE's across all decision classes above the safety threshold, over the T ≥ 246 interaction window. Only after this window closes does any live control role begin — and then at the lowest-risk decision class only (route planning, not collision avoidance).
Phase 9 of the roadmap covers this exactly. The framework's planned Phase 9 work is shadow-mode evaluation, auditable logs, and abstention testing in autonomy-style settings — validating that the framework improves both performance and certifiability under regulatory constraints. See §12 of the full whitepaper for the full roadmap.

Build it with AUA v1.0

Configure this domain today

AUA v1.0 handles the arbitration, correction, and audit layer. Configure safety-critical fields with high C_min thresholds and the append-only audit chain.

Integration boundary: AUA handles the arbitration, routing, correction, and audit layer. Physical integration (vehicle control, SCADA, robot actuators, …) connects to AUA via the REST API — AUA does not control hardware directly.

1. Install

pip install adaptive-utility-agent

2. Scaffold for this domain

aua init my-self-driving-vehicles-agent --preset medical-safe --tier macbook
cd my-self-driving-vehicles-agent
aua doctor

3. Key config for this domain

# aua_config.yaml
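# 'aviation' below is the closest built-in field profile to AV operation (see §5.1);
# replace it with an operator-validated AV field definition before any live role.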
specialists:
  - name: aviation
    model: qwen-coder-7b-awq
    port: 11434
    field: aviation

safety:
  abstention_enabled: true
  require_arbiter_for_high_risk: true
  min_confidence_for_direct_answer: 0.95

security:
  encryption: {enabled: true, key_secret: AUA_ENCRYPTION_KEY}

audit:
  enabled: true
  hash_chain: true

Generate your encryption key: python3 -c "import os; print(os.urandom(32).hex())" or openssl rand -hex 32 — 64-char hex string. See Tutorial §12.4 for key management.

4. Start and query

aua serve

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $AUA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "...", "session_id": "demo"}'

5. What AUA handles vs. what you bring

AUA v1.0 provides                            You bring
Multi-specialist routing + utility scoring   Domain-specific specialist models
Arbiter + contradiction detection            Domain-specific quality criteria
Correction loop + DPO pair export            Fine-tuning infrastructure (TRL, Axolotl, …)
Blue-green deployment + rollback             Evaluation datasets for your domain
Append-only audit log with hash chain        Vehicle stack integration (ROS2, CAN bus, …)
Prometheus + Grafana + OTEL                  Your monitoring infrastructure

Full instructions: AUA Tutorial · Framework v1.0 · GitHub