neural / attribution

The model says one thing.
It computes another.

Chain-of-thought is a narrative, not a transcript. Attribution graphs reveal that models run parallel computation paths while generating sequential explanations. The gap between stated and actual reasoning is measurable — and significant. Select a task. Hover nodes. Watch what connects and what doesn't.

interactive attribution graph

select task · hover to highlight CoT↔computation links · drag threshold to filter weak attributions

divergence

73%

How much the stated chain-of-thought differs from actual computation in the arithmetic task. The narrative is a post-hoc rationalization, not a transcript.
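One way to operationalize a divergence number like this: measure the share of attribution mass that lands on features the CoT never cites. A minimal sketch with made-up feature names and scores (everything below is illustrative, not measured data):

```python
def divergence(attributions, cited):
    """Fraction of total attribution mass on features the CoT never mentions.

    attributions: dict of feature -> attribution score (illustrative values)
    cited: set of features the chain-of-thought explicitly narrates
    """
    total = sum(abs(v) for v in attributions.values())
    uncited = sum(abs(v) for f, v in attributions.items() if f not in cited)
    return uncited / total if total else 0.0

# hypothetical attributions for an arithmetic task
attr = {"lookup_table": 0.45, "magnitude_heuristic": 0.28,
        "carry_circuit": 0.17, "digit_add_step": 0.10}
cited = {"digit_add_step", "carry_circuit"}  # the steps the CoT narrates
print(round(divergence(attr, cited), 2))  # → 0.73
```

Here 73% of the attribution mass flows through features the narrative never mentions.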

parallel paths

2–3

Models run approximate (heuristic) and precise (circuit) computations simultaneously. CoT narrates a single sequential path that matches neither.
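A toy illustration of the two paths racing. Real models implement these as feature circuits, not Python functions; the names and logic here are purely hypothetical:

```python
def heuristic_path(a, b):
    """Coarse magnitude estimate: rounds each operand to the nearest ten."""
    return round(a / 10) * 10 + round(b / 10) * 10

def circuit_path(a, b):
    """Precise digitwise computation."""
    return a + b

# both paths run in parallel on the same input;
# the CoT narrates a single sequential derivation that matches neither
print(heuristic_path(36, 59), circuit_path(36, 59))  # → 100 95
```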

answer lead

~5 layers

In multi-hop tasks, the final answer features activate roughly 5 layers before the CoT 'derives' it. The reasoning backfills a conclusion already reached.
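The "answer lead" can be framed as the gap between the layer where the answer feature first crosses an activation threshold and the layer where the CoT-derivation feature does. A sketch over synthetic per-layer activation profiles (all numbers invented for illustration):

```python
def answer_lead(answer_act, cot_act, threshold=0.5):
    """How many layers the answer feature leads the CoT 'derivation'.

    answer_act / cot_act: per-layer activation strengths (illustrative).
    Returns None if either feature never crosses the threshold.
    """
    def first_cross(acts):
        return next((i for i, a in enumerate(acts) if a >= threshold), None)
    a, c = first_cross(answer_act), first_cross(cot_act)
    return c - a if a is not None and c is not None else None

# synthetic profiles across 12 layers: the answer fires at layer 3,
# the CoT only 'derives' it at layer 8
answer = [0.0, 0.1, 0.2, 0.6, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
cot    = [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9, 0.9]
print(answer_lead(answer, cot))  # → 5
```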

circuit finding

NP-hard

A proof presented at ICLR 2025 showed that fully decomposing model computation into interpretable circuits is computationally intractable. Every insight we can extract is hard-won.

the mechanism

The narrative gap

When you ask a model to show its work, it generates a plausible-sounding explanation. But attribution analysis reveals the actual computation doesn't follow those steps. The model runs parallel paths — approximate heuristics racing against precise circuits — then narrates a sequential story after the answer is already determined.

Why it matters

If chain-of-thought were a faithful trace of computation, we could trust it for safety-critical reasoning. It isn't. The gap between stated and actual computation means we need mechanistic tools — attribution graphs, circuit analysis, feature decomposition — not just verbal explanations from the model about what it's doing.

The sycophancy signal

Attribution reveals parallel streams competing for output: factual recall vs user-agreement bias. The model 'knows' the right answer with high confidence but a separate stream pushes toward agreeing with the user. This tension is invisible in the CoT, which reads as careful deliberation.
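The competition can be sketched as two streams contributing logits to the same candidate answers, with the output taken as the argmax of their sum. Stream names and scores below are hypothetical, chosen only to show how a confident factual stream can still lose:

```python
def decode(streams):
    """Sum per-candidate contributions from parallel streams, pick the argmax."""
    totals = {}
    for stream in streams.values():
        for answer, score in stream.items():
            totals[answer] = totals.get(answer, 0.0) + score
    return max(totals, key=totals.get)

factual = {"correct": 2.1, "user_claim": 0.4}    # the model 'knows' the answer
agreement = {"correct": 0.0, "user_claim": 2.5}  # bias toward agreeing

print(decode({"factual": factual}))                      # → correct
print(decode({"factual": factual, "agree": agreement}))  # → user_claim
```

The factual stream alone gives the right answer; adding the agreement stream flips the output, and nothing in the CoT registers the conflict.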

Limits of legibility

Circuit finding is NP-hard (ICLR 2025). We can't fully decompose what a model computes into interpretable pieces. But partial attribution — tracing the strongest signals — reveals enough to see when the story and the computation diverge. The visualization above simulates these partial traces.
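Partial tracing of this kind is just thresholding: keep only edges whose attribution weight clears a cutoff, the same operation the threshold slider above performs. A sketch over a hypothetical attribution graph:

```python
def prune(edges, tau):
    """Keep only edges whose absolute attribution weight clears tau.

    edges: list of (src, dst, weight) tuples (illustrative graph).
    Full decomposition is intractable; thresholding yields a partial trace.
    """
    return [(s, d, w) for s, d, w in edges if abs(w) >= tau]

graph = [("input", "heuristic", 0.9), ("input", "circuit", 0.7),
         ("heuristic", "answer", 0.8), ("circuit", "answer", 0.6),
         ("input", "noise_feat", 0.05)]
print(prune(graph, 0.5))  # keeps the four strong edges, drops the noise edge
```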

the implication

We can't trust models to explain their own reasoning. Not because they're lying — because the narrative and the computation are different objects. Chain-of-thought is generated by the same forward pass that produces the answer, but the causal path from input to output runs through circuits the CoT doesn't describe. Mechanistic interpretability — building tools that trace actual attribution, not stated attribution — is the path to verifiable AI reasoning.