Substrate · Empirical findings · May 2026

Empirical findings

A domain-agnostic dual-graph representation of human–LLM dialogue, and a minimal orchestrator that demonstrably acts on it.

Two studies on the Substrate representation. Study A tests whether the schema is genuinely domain-agnostic across 200 WildChat conversations spanning 7 conversation types. Study B closes the loop by adding a minimal orchestrator that consumes Substrate's divergence channel to intervene on the next model turn, and measures whether intervention reliably honors user constraints.

  • Conversations: 200 (WildChat-1M · 7 domains)
  • Schema-valid extractions: 200 (100% · 0 quarantined)
  • Intervention pairs: 51 (blinded A/B)
  • Judge wins intervention: 78.9% (n=38 decisive · p ≈ 0.0004)

TL;DR

The schema generalises. Tested across 200 real ChatGPT conversations spanning seven wildly different domains (code, writing, planning, learning, advice, creative, factual Q&A), every type of node and every divergence the schema names shows up in nearly every domain. It's not built for one kind of conversation.

And an orchestrator built on it works. When Substrate flags a user constraint the model ignored, a 30-line intervention that surfaces that constraint back into the next system prompt is preferred by an independent blinded judge roughly 4 out of 5 times. The cleanest effect: when the model has added something the user never asked for, the intervention wins every single decisive comparison.

Numbers, matrices, and limits are in the sections below.


01 / The research question

Two questions, one artefact.

Existing structured representations of human–AI dialogue tend to be case-by-case: slot-value schemas tuned to a single task (flight booking, weather lookup), KG annotations restricted to a closed knowledge domain. As a recent framing of the open question puts it: case-by-case task representations do exist in the literature; the challenge would be a general, domain-agnostic representation. Whether such a representation can capture the shape of open-ended GenAI workflows that range over coding, writing, planning, design, and advice is what these studies set out to test.

Substrate proposes a dual-graph schema: a typed graph of the user's evolving problem formulation, a parallel graph of the model's inferred formulation, plus a divergence channel cataloguing the grounding acts and silent disagreements between them. The schema is intentionally process-primitive (goals, constraints, decisions, assumptions) rather than domain-primitive. The same vocabulary should apply whether the dyad is refactoring code or drafting marketing copy.
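To make the shape of the schema concrete, here is a minimal sketch of the three parts described above as Python dataclasses. It is not Substrate's actual schema: the class and field names are hypothetical, edges between nodes are omitted, and the enum values are inferred from the node-type and divergence-type labels reported in Study A (several identifiers, such as `clarification_requested`, are assumed spellings).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeType(Enum):
    # Process primitives, as listed in the Study A node-type matrix.
    GOAL = "goal"
    SUBGOAL = "subgoal"
    CONSTRAINT = "constraint"
    ENTITY = "entity"
    ASSUMPTION = "assumption"
    DECISION = "decision"
    EXAMPLE = "example"
    OPEN_QUESTION = "open_question"


class DivergenceType(Enum):
    # Grounding acts.
    CLARIFICATION_REQUESTED = "clarification_requested"
    CLARIFICATION_RESPONSE = "clarification_response"
    ACCEPTANCE = "acceptance"
    IMPLICIT_ACCEPTANCE = "implicit_acceptance"
    REPAIR = "repair"
    REFORMULATION = "reformulation"
    # Divergences proper.
    UNILATERAL_ADDITION = "unilateral_addition"
    IGNORED_CONSTRAINT = "ignored_constraint"
    SCOPE_DRIFT = "scope_drift"
    CONTRADICTED_ASSUMPTION = "contradicted_assumption"
    PREMATURE_COMMITMENT = "premature_commitment"


@dataclass
class Node:
    id: str
    type: NodeType
    text: str
    turn: int  # turn at which the node was introduced (hypothetical field)


@dataclass
class DivergenceEvent:
    type: DivergenceType
    turn: int
    severity: str  # "HIGH", "MEDIUM" or "LOW"
    user_node_id: Optional[str] = None   # node in the user graph, if any
    model_node_id: Optional[str] = None  # node in the model graph, if any


@dataclass
class SubstrateExtraction:
    user_graph: List[Node] = field(default_factory=list)    # user's formulation
    model_graph: List[Node] = field(default_factory=list)   # model's inferred formulation
    divergences: List[DivergenceEvent] = field(default_factory=list)


extraction = SubstrateExtraction()
extraction.user_graph.append(Node("u1", NodeType.CONSTRAINT, "Keep it under 200 words", turn=1))
extraction.divergences.append(
    DivergenceEvent(DivergenceType.IGNORED_CONSTRAINT, turn=2, severity="HIGH", user_node_id="u1")
)
```

Nothing in these types mentions code, prose, or plans, which is the sense in which the vocabulary is process-primitive rather than domain-primitive.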

These two studies test the two halves of the contribution:

  • Study A: Representation. Across radically different domains, does the same schema produce structurally consistent extractions, or does it implicitly favour one conversation type?
  • Study B: Action. If a minimal orchestrator consumes Substrate's flagged divergences and surfaces the cited user constraint into the next system prompt, does the next model turn measurably better honor that constraint?

02 / Study A: domain coverage

Is the schema domain-agnostic?

200 multi-turn conversations sampled from allenai/WildChat-1M (real ChatGPT users in the wild, ≥4 and ≤8 turns), classified by Claude Haiku into seven domains, extracted by Claude Sonnet using Substrate's production extraction prompt and schema. Output: 1,010 turns processed, 3,658 graph nodes, 993 divergence events. All 200 extractions schema-valid; 0 quarantined.

Per-domain spread

| Domain | n | Mean turns | Nodes/turn | Confidence |
|---|---|---|---|---|
| Code | 41 | 4.8 | 3.21 | 0.92 |
| Writing | 32 | 4.9 | 3.82 | 0.93 |
| Planning | 7 | 5.4 | 4.50 | 0.93 |
| Learning | 23 | 5.3 | 3.48 | 0.96 |
| Advice | 20 | 5.1 | 2.56 | 0.89 |
| Creative | 42 | 5.2 | 4.38 | 0.94 |
| Factual Q&A | 35 | 5.1 | 3.49 | 0.93 |

Node-type coverage: 55/56 cells of the node-type × domain matrix non-empty (98.2%). The single empty cell (planning × example) is in the n=7 domain. Sampling noise, not a schema failure.

Divergence-type coverage: 11/11 divergence types appear in ≥4 of 7 domains. Every type appears in 6-of-7 to 7-of-7 domains. Only one type-domain combination empty (contradicted_assumption × planning).

Node types × domain (proportion of nodes per domain)

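Each cell is the share of that domain's nodes that have the given type, so each column sums to 100% within its domain. In the original heatmap, darker = higher proportion. No diagonal structure is expected. What matters is that every cell (except planning × example) is populated at all.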

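| Node type | Code | Writing | Planning | Learning | Advice | Creative | Factual Q&A |
|---|---|---|---|---|---|---|---|
| Goal | 17% | 20% | 16% | 20% | 25% | 14% | 15% |
| Subgoal | 9% | 9% | 30% | 14% | 8% | 5% | 9% |
| Constraint | 14% | 12% | 11% | 9% | 11% | 15% | 9% |
| Entity | 13% | 15% | 12% | 20% | 10% | 21% | 26% |
| Assumption | 22% | 19% | 18% | 16% | 23% | 19% | 19% |
| Decision | 20% | 23% | 12% | 18% | 15% | 22% | 16% |
| Example | 3% | 2% | · | 2% | 5% | 5% | 2% |
| Open question | 3% | 0% | 1% | 1% | 4% | 0% | 2% |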
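The 55/56 coverage figure can be reproduced from the matrix as published. The sketch below uses the rounded percentages, with `None` standing in for the one cell shown as "·". It assumes that a displayed 0% is a small non-zero share that rounds down, not an empty cell, which is the reading the 55/56 count implies. The dictionary layout and the `coverage` helper are illustrative, not part of Substrate.

```python
DOMAINS = ["code", "writing", "planning", "learning", "advice", "creative", "factual_qa"]

# Rounded column percentages from the matrix; None marks the single empty cell.
NODE_SHARE = {
    "goal":          [17, 20, 16, 20, 25, 14, 15],
    "subgoal":       [9, 9, 30, 14, 8, 5, 9],
    "constraint":    [14, 12, 11, 9, 11, 15, 9],
    "entity":        [13, 15, 12, 20, 10, 21, 26],
    "assumption":    [22, 19, 18, 16, 23, 19, 19],
    "decision":      [20, 23, 12, 18, 15, 22, 16],
    "example":       [3, 2, None, 2, 5, 5, 2],
    "open_question": [3, 0, 1, 1, 4, 0, 2],
}


def coverage(matrix):
    """Return (non-empty cells, total cells) for a type x domain matrix."""
    cells = [value for row in matrix.values() for value in row]
    filled = sum(value is not None for value in cells)
    return filled, len(cells)


filled, total = coverage(NODE_SHARE)
empty_cells = [
    (node_type, DOMAINS[i])
    for node_type, row in NODE_SHARE.items()
    for i, value in enumerate(row)
    if value is None
]

print(f"{filled}/{total} = {filled / total:.1%}")  # 55/56 = 98.2%
print(empty_cells)  # [('example', 'planning')]
```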

Divergence types × domain (per-conversation rate)

Each cell is the mean number of events of that type per conversation in that domain. Top rows are grounding acts (clarification, acceptance, repair, reformulation); bottom rows are divergences proper.

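| Divergence type | Code | Writing | Planning | Learning | Advice | Creative | Factual Q&A |
|---|---|---|---|---|---|---|---|
| Clarification requested | 0.20 | 0.09 | 0.43 | 0.22 | 0.45 | 0.24 | 0.23 |
| Clarification response | 0.15 | 0.06 | 0.14 | 0.13 | 0.25 | 0.12 | 0.09 |
| Acceptance | 0.51 | 0.53 | 0.57 | 0.39 | 0.30 | 0.43 | 0.29 |
| Implicit acceptance | 0.71 | 1.06 | 0.86 | 1.48 | 0.75 | 1.10 | 1.09 |
| Repair | 0.46 | 0.31 | 0.57 | 0.17 | 0.20 | 0.31 | 0.51 |
| Reformulation | 0.17 | 0.22 | 0.29 | 0.22 | 0.10 | 0.17 | 0.20 |
| Unilateral addition | 1.32 | 1.66 | 1.71 | 1.43 | 1.30 | 2.12 | 1.23 |
| Ignored constraint | 0.22 | 0.47 | 0.57 | 0.22 | 0.30 | 0.64 | 0.23 |
| Scope drift | 0.29 | 0.22 | 0.57 | 0.35 | 0.25 | 0.36 | 0.54 |
| Contradicted assumption | 0.15 | 0.13 | · | 0.04 | 0.15 | 0.19 | 0.14 |
| Premature commitment | 0.41 | 0.16 | 0.14 | 0.22 | 0.20 | 0.17 | 0.17 |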

What the matrices say

The substantive finding isn't coverage on its own. It's that the schema produces a recognisable signature per domain. The same vocabulary captures different domains breaking in characteristic ways:

  • subgoal is 30.4% of planning nodes vs 5.1% in creative. Planning conversations decompose, creative ones don't.
  • unilateral_addition peaks in creative at 2.12/conversation (roleplay, fanfic, brainstorming, where the model improvises) vs 1.23 in factual_qa.
  • entity is 26.4% in factual_qa (named things, concrete artefacts) vs 9.6% in advice.
  • premature_commitment peaks in code at 0.41/conversation. Models commit to solutions before understanding the codebase.
  • repair is highest in planning (0.57) and factual_qa (0.51). Users push back hardest on actionable plans and factual claims.
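The per-domain peaks quoted in this list can be read off the divergence-rate matrix mechanically. The following is a small sketch over the published per-conversation rates, with `None` for the one empty cell. The helper names `peak` and `domains_covered` are illustrative.

```python
DOMAINS = ["code", "writing", "planning", "learning", "advice", "creative", "factual_qa"]

# Mean events per conversation, copied from the divergence types x domain matrix.
RATES = {
    "clarification_requested": [0.20, 0.09, 0.43, 0.22, 0.45, 0.24, 0.23],
    "clarification_response":  [0.15, 0.06, 0.14, 0.13, 0.25, 0.12, 0.09],
    "acceptance":              [0.51, 0.53, 0.57, 0.39, 0.30, 0.43, 0.29],
    "implicit_acceptance":     [0.71, 1.06, 0.86, 1.48, 0.75, 1.10, 1.09],
    "repair":                  [0.46, 0.31, 0.57, 0.17, 0.20, 0.31, 0.51],
    "reformulation":           [0.17, 0.22, 0.29, 0.22, 0.10, 0.17, 0.20],
    "unilateral_addition":     [1.32, 1.66, 1.71, 1.43, 1.30, 2.12, 1.23],
    "ignored_constraint":      [0.22, 0.47, 0.57, 0.22, 0.30, 0.64, 0.23],
    "scope_drift":             [0.29, 0.22, 0.57, 0.35, 0.25, 0.36, 0.54],
    "contradicted_assumption": [0.15, 0.13, None, 0.04, 0.15, 0.19, 0.14],
    "premature_commitment":    [0.41, 0.16, 0.14, 0.22, 0.20, 0.17, 0.17],
}


def peak(divergence_type):
    """Return (domain, rate) for the domain where this type is most frequent."""
    pairs = [(d, r) for d, r in zip(DOMAINS, RATES[divergence_type]) if r is not None]
    return max(pairs, key=lambda pair: pair[1])


def domains_covered(divergence_type):
    """Count the domains in which this type appears at all."""
    return sum(r is not None for r in RATES[divergence_type])


for name in ("unilateral_addition", "premature_commitment", "repair"):
    print(name, peak(name))

print(min(domains_covered(name) for name in RATES))  # 6
```

The last line recovers the coverage statement: every divergence type appears in at least 6 of 7 domains.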

The schema is general, but it isn't flat. Domain differences come through as prevalence shifts within a stable vocabulary. That's what a domain-agnostic representation that actually works should look like.

Study A: limits

Planning is under-represented (n=7). WildChat's natural distribution skews to creative / code / writing prompts; even after a top-up sampling pass to 200 total, only 7 conversations classified as planning. Per-domain rates for planning are noisier; coverage claims for planning rest on ~38 turns total. The other six domains are well-supported (20–42 conversations each).

Extractor model is Sonnet, not the API-tier model the live app uses. All extractions produced by Claude Sonnet via the Claude Code SDK; post-hoc Zod validation with a repair-pass and one retry per conversation. API-tier validation against claude-opus-4-7 is deferred to a follow-up phase.
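For readers unfamiliar with this kind of validation loop, here is one way it could be arranged. This is a sketch under assumptions, not the project's pipeline: the real validator is a Zod schema in the TypeScript codebase, whereas this version takes plain Python callables, and the exact ordering (validate, repair, then one full retry before quarantining) is a guess at what "a repair-pass and one retry" means. All function names are hypothetical.

```python
def extract_with_validation(conversation, extract, validate, repair, max_retries=1):
    """Return a schema-valid extraction, or None to signal quarantine.

    extract(conversation) -> raw extraction
    validate(extraction)  -> True if the extraction matches the schema
    repair(extraction)    -> a best-effort corrected copy
    """
    for _attempt in range(1 + max_retries):
        raw = extract(conversation)
        if validate(raw):
            return raw
        repaired = repair(raw)  # cheap fix-up before paying for a new extraction
        if validate(repaired):
            return repaired
    return None  # still invalid after the retry: quarantine


# Toy usage: the first extraction is unrepairable, the retry succeeds.
attempts = {"count": 0}


def flaky_extract(conversation):
    attempts["count"] += 1
    return {"nodes": ["goal"]} if attempts["count"] == 2 else {}


def has_nodes(extraction):
    return bool(extraction.get("nodes"))


def no_op_repair(extraction):
    return dict(extraction)


result = extract_with_validation("hi", flaky_extract, has_nodes, no_op_repair)
quarantined = extract_with_validation("hi", lambda c: {}, has_nodes, no_op_repair)
```

Under this arrangement, "0 quarantined" means no conversation reached the final `return None`.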

No human inter-rater baseline. Out of scope for this study. Domain classification is itself a model judgement (Haiku); a 10-sample audit was 9/10 clearly correct.

WildChat sampling bias. Real ChatGPT users, English-only filter, ≤8 turn cap, SFW-only. Cross-domain claims generalise to this population only. The corpus also contains noticeable roleplay/fanfic, inflating the creative domain.


03 / Study B: orchestration intervention

Is the representation actionable?

From Study A's outputs we selected 51 intervention candidates: HIGH/MEDIUM-severity divergences of three repairable types (ignored_constraint, premature_commitment, unilateral_addition), one per conversation, balanced across types. For each: a baseline arm (Sonnet generates the next assistant turn with no intervention) and an intervention arm (same input, but a 30-line orchestrator note surfaces the cited user constraint into the system prompt). Both arms re-extracted; both arms judged blind A/B by an independent Claude Opus instance against the cited constraint.
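The two arms differ only in the system prompt. As an illustration of how small such an orchestrator can be, here is a sketch that builds both prompts. The note wording, the dictionary keys, and the function names are all hypothetical. The source specifies only that the note surfaces the cited user constraint and asks the model to acknowledge it.

```python
REPAIRABLE_TYPES = {"ignored_constraint", "premature_commitment", "unilateral_addition"}
ACTIONABLE_SEVERITIES = {"HIGH", "MEDIUM"}


def orchestrator_note(divergence):
    """Return a system-prompt note for a flagged divergence, or None to stay silent."""
    if divergence["type"] not in REPAIRABLE_TYPES:
        return None
    if divergence["severity"] not in ACTIONABLE_SEVERITIES:
        return None
    # Surface the constraint only; no prescriptive instructions about content.
    return (
        "Orchestrator note: earlier in this conversation the user stated the "
        f'following constraint: "{divergence["constraint"]}". '
        "Acknowledge it and make sure your next reply respects it."
    )


def build_system_prompt(base_prompt, divergence=None):
    """Baseline arm: divergence=None. Intervention arm: pass the flagged divergence."""
    note = orchestrator_note(divergence) if divergence is not None else None
    return base_prompt if note is None else f"{base_prompt}\n\n{note}"


flagged = {
    "type": "ignored_constraint",
    "severity": "HIGH",
    "constraint": "No external libraries",
}
baseline_prompt = build_system_prompt("You are a helpful assistant.")
intervention_prompt = build_system_prompt("You are a helpful assistant.", flagged)
```

Keeping the note free of content instructions matches the study's stated design: any effect should come from surfacing the constraint, not from steering the answer.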

LLM-as-judge (Claude Opus, blinded A/B, random order)

78.9% intervention wins

  • Decisive judgements: 38
  • Intervention wins: 30
  • Baseline wins: 8
  • Ties: 13
  • Binomial test z: 3.57
  • p-value: ≈ 0.0004

Highly significant. Judge picks intervention in roughly 4-out-of-5 decisive comparisons.

Substrate-as-judge (per-turn divergence count, intervention − baseline)

-0.078 mean Δ divergences/pair (negative = intervention helps)

  • Baseline mean: 0.67
  • Intervention mean: 0.59
  • Paired t: -0.66
  • p-value: ≈ 0.507
  • Cohen's d: -0.09
  • n: 51

Direction correct, magnitude small, under-powered at n=51. The metric counts fresh divergences, not whether the model acknowledged prior constraints.
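The headline statistics can be re-derived from the counts above. The sketch below assumes the reported z comes from a normal-approximation sign test on the 38 decisive judgements (ties dropped) and that Cohen's d for the paired design is t/√n. Both are standard choices, but the source does not say which formulas were used. An exact two-sided binomial p-value is included for comparison.

```python
from math import comb, erfc, sqrt

wins, losses = 30, 8
n = wins + losses  # decisive judgements; the 13 ties are excluded

# Normal approximation to the binomial under H0: P(intervention wins) = 0.5.
z = (wins - n / 2) / sqrt(n * 0.25)
p_normal = erfc(abs(z) / sqrt(2))  # two-sided

# Exact two-sided binomial p-value (doubling the upper tail).
upper_tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2**n
p_exact = min(1.0, 2 * upper_tail)

# Cohen's d for paired data, recovered from the reported t statistic.
t_stat, n_pairs = -0.66, 51
cohens_d = t_stat / sqrt(n_pairs)

print(f"win rate = {wins / n:.1%}")  # win rate = 78.9%
print(f"z = {z:.2f}")                # z = 3.57
print(f"p (normal) = {p_normal:.5f}, p (exact) = {p_exact:.5f}")
print(f"d = {cohens_d:.2f}")         # d = -0.09
```

Both p-values are of the order of 0.0004, so the conclusion does not depend on which variant of the test is used.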

What the metric split actually means

The two judges don't disagree. They measure different things, and the split is the most interesting finding:

  • The Opus judge asks: which reply more directly honors the user's stated constraint? It rewards explicit acknowledgement and content adjustment.
  • Substrate's divergence counter asks: did a new divergence fire in the next turn? It scores whether the model committed fresh errors, not whether the model addressed an earlier one.

The intervention does what it was designed to do: it gets the model to acknowledge and adjust. Substrate's narrow per-turn divergence count doesn't directly reward that. The two metrics are complementary, not competing. A natural follow-up would extend Substrate's evaluator with a "constraint repair" signal that closes this gap.

By divergence type

The sharpest finding: unilateral_addition is a 13–0–5 intervention sweep with a real divergence-rate drop.

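| Divergence type | n | Mean Δ | Int wins | Base wins | Ties |
|---|---|---|---|---|---|
| ignored_constraint | 18 | 0.000 | 10 | 4 | 4 |
| unilateral_addition | 18 | -0.444 | 13 | 0 | 5 |
| premature_commitment | 15 | +0.267 | 7 | 4 | 4 |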
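As a consistency check, the per-type rows should pool back to the headline figures: 51 pairs, a 30/8/13 judge split, and a mean Δ of about -0.078. The sketch below does that arithmetic from the table. The dictionary layout is illustrative.

```python
BY_TYPE = {
    "ignored_constraint":   {"n": 18, "mean_delta": 0.000,  "int_wins": 10, "base_wins": 4, "ties": 4},
    "unilateral_addition":  {"n": 18, "mean_delta": -0.444, "int_wins": 13, "base_wins": 0, "ties": 5},
    "premature_commitment": {"n": 15, "mean_delta": 0.267,  "int_wins": 7,  "base_wins": 4, "ties": 4},
}

total_pairs = sum(row["n"] for row in BY_TYPE.values())
int_wins = sum(row["int_wins"] for row in BY_TYPE.values())
base_wins = sum(row["base_wins"] for row in BY_TYPE.values())
ties = sum(row["ties"] for row in BY_TYPE.values())

# Pooled mean delta is the n-weighted average of the per-type means.
pooled_delta = sum(row["n"] * row["mean_delta"] for row in BY_TYPE.values()) / total_pairs

print(total_pairs, int_wins, base_wins, ties)  # 51 30 8 13
print(round(pooled_delta, 3))                  # -0.078
```

The pooled Δ is small because the drop for unilateral_addition is partly offset by the rise for premature_commitment, which is why the per-type view is more informative than the overall mean.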

By domain

Writing, learning, creative show the strongest intervention effects. Code is weakest, likely because code conversations have multiple competing concerns and a single-constraint nudge doesn't capture full intent.

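| Domain | n | Mean Δ | Int wins | Base wins | Ties |
|---|---|---|---|---|---|
| Code | 16 | +0.250 | 7 | 5 | 4 |
| Factual Q&A | 11 | -0.091 | 7 | 0 | 4 |
| Creative | 9 | -0.333 | 5 | 2 | 2 |
| Writing | 5 | -0.400 | 4 | 0 | 1 |
| Learning | 5 | -0.400 | 4 | 0 | 1 |
| Advice | 4 | -0.250 | 2 | 1 | 1 |
| Planning | 1 | +1.000 | 1 | 0 | 0 |

Study B: limits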

Same-family extractor / replayer / judge. Replay (Sonnet) and re-extraction (Sonnet) are the same family as the primary judge (Opus). Self-preference bias is mitigated by blinded A/B with deterministic random order. Cross-family validation via OpenAI gpt-4o-mini was deferred by user choice; the headline result is not yet validated outside the Claude family.

Under-powered for small effects. 51 paired observations (vs. plan target of 100), sufficient to detect medium effects (d ≥ 0.4), which is what unilateral_addition shows. Smaller per-type effects sit below the detection floor.
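The d ≥ 0.4 detection floor is consistent with a textbook power calculation. The sketch below uses the normal approximation for a two-sided paired test, d ≈ (z₁₋α/₂ + z_power)/√n. The choice of α = 0.05 and 80% power is an assumption, since the source does not state the parameters behind its floor.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n_pairs, alpha=0.05, power=0.80):
    """Smallest paired-design effect size detectable at the given alpha and power.

    Normal approximation; a t-based calculation gives a slightly larger value.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) / sqrt(n_pairs)


print(round(min_detectable_d(51), 2))   # 0.39
print(round(min_detectable_d(100), 2))  # 0.28
```

Under these assumptions, reaching the plan target of 100 pairs would lower the floor from roughly 0.39 to roughly 0.28, still well above the overall observed effect of |d| ≈ 0.09.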

Deliberately minimal orchestrator. The intervention surfaces the cited user constraint and asks the model to acknowledge it. No prescriptive instructions about content. A heavier-handed orchestrator might move the needle further; we didn't test that. The point of the study is the minimal nudge.

Sample selection bias. Only HIGH/MEDIUM severity divergences of three repairable types. Behaviour on LOW severity or on scope_drift / contradicted_assumption is untested.


04 / What the two studies say together

A domain-agnostic representation that an orchestrator can use.

| Study | Question | Answer |
|---|---|---|
| Study A | Is the representation domain-agnostic? | Yes. 55 of 56 node-type × domain cells populated across 7 domains. All 11 divergence types covered ≥6/7 domains. Domain differences are prevalence shifts within a stable vocabulary, not type-system breakdowns. |
| Study B | Is the representation actionable? | Yes, in user-perceptible terms. A 30-line orchestrator surfacing flagged constraints is preferred by an independent blinded judge in 78.9% of decisive comparisons (p ≈ 0.0004). Largest effect on unilateral_addition (13–0–5 sweep, Δ = −0.444/pair). |

Together these answer the bar this project was set against: a general, domain-agnostic representation (Study A) that is operationally useful (Study B). Not "a tool that displays graphs". A representation that captures real conversational structure across very different tasks and that a minimal orchestrator can act on to measurably improve next-turn constraint-honoring behaviour.

The one-sentence contribution

Across 200 WildChat conversations spanning 7 domains, the Substrate dual-graph schema produces structurally consistent extractions (55 of 56 node-type × domain cells populated; all 11 divergence types covered ≥6/7 domains), and a 30-line orchestrator that surfaces flagged user constraints into the next system prompt is preferred by an independent judge in 78.9% of decisive comparisons (n=38, p ≈ 0.0004), with the largest reduction in fresh unilateral_addition divergences (13–0–5 intervention sweep, Δ = −0.444/pair).


05 / What these studies do NOT prove

Honest limits, explicitly.

  • Not validated against API-tier models. All extractions and replays produced by Claude Sonnet through the Claude Code SDK. Behaviour on claude-opus-4-7 or gpt-5.5 may differ; follow-up phase pending.
  • Not validated by humans. No inter-rater reliability study. The taxonomy reproduces under LLM coding; whether human annotators agree at comparable rates is open.
  • Not cross-family for the judge. Sonnet extractor + Sonnet replayer + Opus judge are all Claude. Self-preference bias mitigated by blinding but not eliminated.
  • Not tested on heavy intervention. The orchestrator is deliberately minimal (constraint surfacing only). A more prescriptive variant might move the needle further, or might bias the judge result through over-steering.
  • Not all divergence types covered in Study B. Only HIGH/MEDIUM severity, only three repairable types (ignored_constraint, premature_commitment, unilateral_addition). scope_drift and contradicted_assumption were excluded for clarity.
  • Not a generalisation beyond WildChat. The sample reflects English-speaking ChatGPT users in the wild circa 2023–24. Generalisation to other platforms, languages, or more deliberately structured workflows is untested.

06 / What's next

Three concrete follow-ups.

01
API-tier validation

Re-extract 10–20 conversations through claude-opus-4-7 via the live /api/extract endpoint; compare node counts, divergence counts, type agreement against the Sonnet extractions. Cheap (~$10), defends the methodology section.

02
Constraint-repair metric

Extend Substrate's evaluator with a 'did the model acknowledge an earlier constraint' signal that closes the gap between LLM-judge and turn-level divergence count. Would let Substrate itself measure intervention effects.

03
Live orchestrator in the app

Promote the 30-line orchestrator into src/lib/, gate behind a chat-mode toggle. Users opt into Substrate intervening when a HIGH-severity divergence is flagged. Real-world telemetry beats benchmark.


Theoretical lineage

  • Clark & Brennan (1991), Grounding in Communication. The conceptual root: dialogue as joint construction of common ground.
  • Shaikh et al. (NAACL 2024, ACL 2025), Grounding Gaps in Language Model Generations; Navigating Rifts in Human–LLM Grounding. Source of the grounding-act taxonomy in the divergence channel.
  • Subramonyam et al. (CHI 2024), Bridging the Gulf of Envisioning. Motivation for externalising the user's evolving formulation.
  • Schneider et al. (SIGDIAL 2024), BridgeKG. Closest prior art on KG-mediated dialogue annotation; Substrate generalises from closed-schema dialogues to open GenAI workflows.
  • Laban, Hayashi, Zhou & Neville (ICLR 2026), Lost in Multi-Turn. Empirical motivation: LLMs lose ~39% on average on multi-turn vs single-turn because conversational state isn't maintained externally. Substrate is the artefact that lets it be.
  • Zhao et al. (ICLR 2024 Spotlight), WildChat-1M. Source corpus for Study A & B.
  • Panickssery, Bowman & Feng (2024), LLM Evaluators Recognize and Favor Their Own Generations. Motivation for blinded A/B + cross-family judge design in Study B.

Substrate is a research prototype.