neural / attribution

Interpretability is touchable.
Hover a token. Watch the story rearrange.

Chain-of-thought is a narrative, not a transcript. Attribution graphs reveal the parallel streams a model actually runs underneath the stated reasoning. Three bands stack: input tokens, the chain-of-thought story, and the streams the computation actually flows through. Hover a token to trace where its signal lands. Click a pivot token to remove it — watch which streams collapse and which keep narrating anyway. That's the legibility gap.

interactive attribution graph

Hover the input, CoT, or stream bands. Click a pivot token to ablate it. Drag the threshold to filter weak edges.

divergence

73%

How much the stated chain-of-thought differs from actual computation in the arithmetic task. The narrative is a post-hoc rationalization, not a transcript.

parallel paths

2–3

Models run approximate (heuristic) and precise (circuit) computations simultaneously. CoT narrates a single sequential path that matches neither.

answer lead

~5 layers

In multi-hop tasks, the final-answer features activate roughly 5 layers before the CoT 'derives' the answer. The reasoning backfills a conclusion already reached.

circuit finding

NP-hard

Work presented at ICLR 2025 showed that circuit finding is NP-hard: fully decomposing model computation into interpretable circuits is intractable in the worst case. Every insight we can extract is hard-won.

the mechanism

The narrative gap

When you ask a model to show its work, it generates a plausible-sounding explanation. But attribution analysis reveals the actual computation doesn't follow those steps. The model runs parallel paths — approximate heuristics racing against precise circuits — then narrates a sequential story after the answer is already determined.

Why ablation matters

Hovering shows correlation. Ablation shows causation. When you click a pivot token to remove it (an input ablation, in the spirit of the causal tracing in Meng et al. 2022), streams that depended on it collapse, while streams that routed around it persist. The narrative keeps narrating regardless, which is exactly why we need attribution graphs and not just chain-of-thought transcripts to verify reasoning.
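A minimal sketch of that collapse-or-persist check, assuming a toy graph in which each input token sends a weighted edge into each stream. The token names, stream names, weights, and the `floor` cutoff are all illustrative assumptions, not values read from a real model.

```python
# Toy attribution graph. Every name and weight is illustrative, not measured.
# EDGES[token][stream] = how much signal the token sends into that stream.
EDGES = {
    "36": {"approximate": 0.5, "precise": 0.1},
    "+":  {"approximate": 0.1, "precise": 0.8},
    "59": {"approximate": 0.5, "precise": 0.1},
}


def stream_strength(edges, ablated=frozenset()):
    """Sum incoming edge weights per stream, skipping ablated tokens."""
    totals = {}
    for token, targets in edges.items():
        for stream, weight in targets.items():
            totals.setdefault(stream, 0.0)
            if token not in ablated:
                totals[stream] += weight
    return totals


def collapsed_streams(edges, token, floor=0.5):
    """Streams that drop below `floor` of their baseline once `token` is removed."""
    before = stream_strength(edges)
    after = stream_strength(edges, ablated={token})
    return sorted(s for s in before if after[s] < floor * before[s])


if __name__ == "__main__":
    for token in EDGES:
        print(token, "->", collapsed_streams(EDGES, token))
```

In this toy graph the operator is the pivot: removing it starves the precise stream, while removing either operand leaves both streams above the cutoff. A real tool would recompute the graph after each ablation rather than subtract fixed weights.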

The sycophancy signal

Attribution reveals parallel streams competing for output: factual recall vs user-agreement bias. The model 'knows' the right answer with high confidence but a separate stream pushes toward agreeing with the user. Remove the belief marker ("I think") and the agreement stream starves — divergence drops, but the CoT structure stays the same. Bias lives in the streams, not the script.
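The competition can be sketched as two logits feeding one softmax. This is a toy model under loud assumptions: the `agreement_logit` function, its `boost` value, and the fixed `FACTUAL_LOGIT` are hypothetical stand-ins for streams that a real attribution graph would measure.

```python
import math


def answer_probs(factual_logit, agreement_logit_value):
    """Softmax over two candidate answers: the factual one and the user's."""
    m = max(factual_logit, agreement_logit_value)
    ef = math.exp(factual_logit - m)
    ea = math.exp(agreement_logit_value - m)
    return {"factual": ef / (ef + ea), "user": ea / (ef + ea)}


def agreement_logit(prompt, marker="I think", boost=3.0, base=0.0):
    """Hypothetical agreement stream: it fires only when the belief marker is present."""
    return base + (boost if marker in prompt else 0.0)


# Illustrative value: the recall stream is equally confident in both prompts.
FACTUAL_LOGIT = 2.0

with_marker = answer_probs(
    FACTUAL_LOGIT, agreement_logit("I think the answer is 12. Is that right?")
)
without_marker = answer_probs(
    FACTUAL_LOGIT, agreement_logit("The answer is 12. Is that right?")
)

if __name__ == "__main__":
    print("with marker:   ", with_marker)
    print("without marker:", without_marker)
```

The factual logit never changes between the two prompts. Only the agreement stream does, which is the sense in which the bias lives in the streams and not in the script.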

Limits of legibility

Circuit finding is NP-hard (ICLR 2025), so in general we can't expect to fully decompose what a model computes into interpretable pieces. But partial attribution, which traces the strongest signals and watches them rewire under ablation, reveals enough to see when the story and the computation diverge. The visualization above simulates these partial traces against published findings.
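Partial attribution in this sense can be sketched as threshold filtering: keep only the strongest edges and report how much of the total attribution they account for. The edge list, the `threshold` value, and the `coverage` measure below are assumptions chosen for illustration.

```python
def strongest_edges(edges, threshold):
    """Keep edges whose absolute weight is at least `threshold`, strongest first."""
    kept = [(src, dst, w) for (src, dst, w) in edges if abs(w) >= threshold]
    return sorted(kept, key=lambda e: abs(e[2]), reverse=True)


def coverage(edges, kept):
    """Fraction of total absolute attribution that the kept edges account for."""
    total = sum(abs(w) for _, _, w in edges)
    return sum(abs(w) for _, _, w in kept) / total if total else 0.0


# Illustrative (source token, target stream, weight) triples, not measured values.
EDGE_LIST = [
    ("36", "approximate", 0.5),
    ("59", "approximate", 0.5),
    ("+", "precise", 0.8),
    ("+", "approximate", 0.1),
    ("36", "precise", 0.1),
    ("59", "precise", -0.05),
]

if __name__ == "__main__":
    kept = strongest_edges(EDGE_LIST, threshold=0.3)
    print(kept)
    print(f"coverage: {coverage(EDGE_LIST, kept):.2f}")
```

Raising the threshold gives a sparser, more readable graph at the cost of coverage, which is the trade the threshold slider exposes.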

the implication

We can't trust models to explain their own reasoning. Not because they're lying, but because the narrative and the computation are different objects. Chain-of-thought is generated by the same network that produces the answer, one token at a time, but the causal path from input to output runs through circuits the CoT doesn't describe. Mechanistic interpretability, meaning tools that trace actual attribution, ablate inputs, and watch downstream features rewire, is the path to verifiable AI reasoning. Touch the input. Watch the story rearrange. That's the whole research program.
