neural / diffusion
The answer arrives before the reasoning finishes.
That changes everything about how we think about thinking.
Autoregressive models think left to right. Each token waits for every previous one. Diffusion language models think holistically — all tokens refine in parallel, with structural tokens stabilizing first and answer tokens crystallizing from context. At 50% of refinement steps, 97% of tokens are already correct.
interactive — real pixel diffusion
drag the slider — watch a real image dissolve into gaussian noise and reconstruct · the math is exact: x_t = sqrt(ab_t)*x_0 + sqrt(1-ab_t)*eps, where ab_t is alpha_bar from a cosine schedule · toggle "show denoised" to see approximate reconstruction at each noise level · upload your own image to diffuse
interactive — thinking out of order
drag the slider or press think — watch how diffusion resolves tokens by confidence, not position · structure locks first, reasoning next, answers last · compare with sequential AR generation side by side
When you solve 17 x 24, you don't think left-to-right either. You recognize the structure ("multiplication"), decompose the problem ("17 x 20 + 17 x 4"), compute intermediate results, then assemble the answer. The structural scaffolding comes first. The answer comes last.
Diffusion models exhibit exactly this pattern. Structural tokens (equals signs, conjunctions, formatting) resolve in the first few steps. Reasoning tokens (operands, logical connectives) come next. Answer tokens — the actual output the user cares about — stabilize last, built on the scaffold of everything else. The model "knows" its answer before it finishes refining, because the reasoning substrate is already in place.
Mercury 2
1,009
tok/s
First reasoning dLLM. AIME 91.1. $0.25/M input.
Accuracy at 50%
97%
correct
Most tokens stable halfway through refinement.
CDLM speedup
14.5x
latency reduction
Consistency training compresses denoising steps.
LLaDA 2.0
100B
MoE params
First open-source dLLM at scale. 94.51 HumanEval.
Forget next-token prediction. dLLMs corrupt text with noise, then train a model to reverse the corruption. The entire sequence refines simultaneously.
Forward Process — Masking
Start with clean text. At each timestep t, independently mask each token with probability t. At t=1, every token is [MASK]. Discrete masking on token IDs — no Gaussian blur, no continuous vectors.
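The forward process fits in a few lines of NumPy. A minimal sketch: the token ids and the reserved MASK_ID are hypothetical, but the masking rule (each token flipped to [MASK] independently with probability t) is exactly the one described above.

```python
import numpy as np

MASK_ID = 0  # hypothetical reserved mask token id

def forward_mask(tokens, t, rng):
    """Mask each token independently with probability t, for t in [0, 1]."""
    tokens = np.asarray(tokens)
    keep = rng.random(tokens.shape) >= t  # True where the token survives
    return np.where(keep, tokens, MASK_ID)

rng = np.random.default_rng(0)
x0 = np.array([101, 7, 42, 9, 55, 18])
print(forward_mask(x0, 0.0, rng))  # t=0: clean text, nothing masked
print(forward_mask(x0, 1.0, rng))  # t=1: every token is MASK_ID
```

Note there is no Gaussian anywhere: the corruption operates directly on discrete token ids.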
Bidirectional Prediction
A vanilla Transformer without a causal mask. Given a partially masked sequence, predict every masked position simultaneously. Token 47 sees token 48. This is fundamentally different from left-to-right models.
Iterative Refinement
Start fully masked, the discrete analogue of pure noise. Each step: predict all masked tokens in parallel, re-mask low-confidence ones, keep high-confidence ones. ~14 steps to refine a 512-token sequence. The text doesn't appear left-to-right — it appears everywhere at once.
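The predict/keep/re-mask loop can be sketched with a toy predictor standing in for the bidirectional transformer. MASK_ID, toy_predictor, and the linear re-masking schedule are illustrative assumptions, not the production recipe.

```python
import numpy as np

MASK_ID = -1  # placeholder id for [MASK] in this toy

def toy_predictor(x, target, rng):
    """Stand-in for the bidirectional transformer: returns a guess plus a
    confidence score for every position (here the guesses are just the target)."""
    return target.copy(), rng.random(len(x))

def refine(length, target, steps=14, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.full(length, MASK_ID)                     # start fully masked
    for step in range(steps, 0, -1):
        guesses, conf = toy_predictor(x, target, rng)
        x = np.where(x == MASK_ID, guesses, x)       # fill every masked slot in parallel
        n_remask = int(length * (step - 1) / steps)  # fewer re-masks as steps run down
        if n_remask:
            x[np.argsort(conf)[:n_remask]] = MASK_ID  # re-mask the lowest-confidence slots
    return x

print(refine(8, np.arange(8)))  # fully resolved after `steps` parallel passes
```

The step count is fixed, so the loop costs the same whether the sequence is 8 tokens or 512.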
interactive — denoising process
step through diffusion denoising on custom text · compare AR vs diffusion side by side · race models at real-world speeds
interactive — langevin dynamics
real langevin sampling on a mixture of 5 gaussians · arrows show the exact score function grad_x log p(x) at each grid point · particles follow this field plus noise · toggle annealed mode to watch sigma decrease from 0.15 to 0.001
interactive — annealed vs unannealed
the key insight from Song & Ermon 2019: annealing the noise level lets particles cross energy barriers between modes · without annealing, particles collapse to the nearest peak and miss low-weight modes entirely · same start positions, same noise, different sigma schedule · watch the mode coverage diverge
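The annealing idea can be sketched in one dimension with pure NumPy. A toy sketch, assuming a two-mode mixture with weights 0.8/0.2, a data variance of 0.1, and an ad-hoc geometric sigma schedule; the exact mixture score plays the role of a trained score network.

```python
import numpy as np

def score(x, mus, ws, var):
    """Exact score grad_x log p(x) of a 1-D Gaussian mixture with shared variance."""
    d = x[:, None] - mus[None, :]                  # (n_particles, n_modes)
    resp = ws * np.exp(-0.5 * d**2 / var)          # unnormalized responsibilities
    den = np.maximum(resp.sum(1), 1e-300)          # guard against underflow far from all modes
    return (resp * (-d / var)).sum(1) / den

def annealed_langevin(mus, ws, rng, n=2000):
    x = rng.normal(0, 3, n)                        # particles start spread out
    for sig in np.geomspace(2.0, 0.05, 10):        # anneal noise from high to low
        eps = 0.1 * sig**2                         # step size shrinks with the noise level
        var = 0.1 + sig**2                         # data variance smoothed by current noise
        for _ in range(50):
            x = x + eps * score(x, mus, ws, var) + np.sqrt(2 * eps) * rng.normal(size=n)
    return x

rng = np.random.default_rng(0)
mus, ws = np.array([-4.0, 4.0]), np.array([0.2, 0.8])
mus, ws = np.array([-4.0, 4.0]), np.array([0.8, 0.2])  # low-weight mode on the right
x = annealed_langevin(mus, ws, rng)
print(f"fraction in low-weight mode: {(x > 0).mean():.2f}")
```

At the large early sigmas the smoothed landscape is nearly unimodal, so particles cross between modes freely; the small late sigmas only sharpen whatever allocation the early levels produced. Drop the annealing (run only the final sigma) and each particle just falls into whichever peak is nearest.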
interactive — ddpm sampling
real DDPM math with cosine noise schedule · forward: x_t = sqrt(ab_t)*x_0 + sqrt(1-ab_t)*eps · reverse: x_(t-1) uses the exact reverse mean and variance formulas from Ho et al. 2020 · watch alpha_bar and beta change at each timestep
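The schedule and both directions of the caption's math fit in a short NumPy sketch. Assumptions: the Nichol & Dhariwal cosine schedule with s=0.008, and an oracle eps in place of a trained noise predictor.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule (Nichol & Dhariwal): alpha_bar falls from 1 at t=0 to ~0 at t=T."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def forward(x0, t, ab, rng):
    """q(x_t | x_0): x_t = sqrt(ab_t)*x_0 + sqrt(1-ab_t)*eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(ab[t]) * x0 + np.sqrt(1 - ab[t]) * eps, eps

def reverse_step(x_t, t, eps_hat, ab, rng):
    """Ho et al. 2020 reverse mean, with beta_t recovered from the alpha_bar ratio."""
    a_t = ab[t] / ab[t - 1]                       # per-step alpha_t
    beta_t = 1 - a_t
    mean = (x_t - beta_t / np.sqrt(1 - ab[t]) * eps_hat) / np.sqrt(a_t)
    noise = np.sqrt(beta_t) * rng.normal(size=x_t.shape) if t > 1 else 0.0
    return mean + noise

T = 1000
ab = cosine_alpha_bar(T)
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))                      # a toy "image"
x_t, eps = forward(x0, T // 2, ab, rng)
x_prev = reverse_step(x_t, T // 2, eps, ab, rng)
print(round(float(ab[0]), 6), float(ab[-1]) < 1e-6)  # 1.0 True
```

The forward jump to any t is a single closed-form expression; only the reverse direction has to walk the timesteps one by one.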
interactive — ddim vs ddpm
same starting noise, same trained model, different reverse process · drag eta from 0 (fully deterministic DDIM) to 1 (recovers DDPM) · enable step-skipping to see why deterministic sampling can skip timesteps but stochastic can't · Song et al. 2020
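The eta knob is one formula. A sketch of the DDIM update, using an oracle eps in place of a trained model; the oracle case also shows exactly why eta=0 can skip timesteps: the jump from any noise level to any other is exact.

```python
import numpy as np

def ddim_step(x_t, ab_t, ab_prev, eps_hat, eta, rng):
    """Song et al. 2020 update: eta=0 is deterministic DDIM, eta=1 recovers DDPM."""
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)  # predicted clean sample
    sigma = eta * np.sqrt((1 - ab_prev) / (1 - ab_t)) * np.sqrt(1 - ab_t / ab_prev)
    dir_xt = np.sqrt(1 - ab_prev - sigma**2) * eps_hat            # direction pointing to x_t
    return np.sqrt(ab_prev) * x0_hat + dir_xt + sigma * rng.normal(size=x_t.shape)

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=5), rng.normal(size=5)
ab_t, ab_prev = 0.3, 0.9                  # a jump across many timesteps at once
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(x_t, ab_t, ab_prev, eps, eta=0.0, rng=rng)
# with oracle eps and eta=0 the big jump lands exactly on the right noise level
print(np.allclose(x_prev, np.sqrt(ab_prev) * x0 + np.sqrt(1 - ab_prev) * eps))  # True
```

With eta > 0, sigma is nonzero and fresh noise is injected at every level, so the levels can no longer be collapsed into one jump.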
interactive — consistency distillation
the teacher traces 50-step ODE trajectories from noise to data · the student learns that every point on the same trajectory should map to the same clean output · 1 step = 50x speedup with measurable quality loss · 4 steps nearly matches teacher · toggle consistency probe to see the self-consistency property in action · Song et al. 2023
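The self-consistency property has a compact expression as a loss. A toy sketch: the linear teacher flow x(t) = x0*(1+t) and both student functions are invented for illustration, and the target branch is shown without the EMA/stop-gradient machinery used in practice (Song et al. 2023).

```python
import numpy as np

def consistency_loss(f, x_a, t_a, x_b, t_b):
    """The student f should give the same output at any two points that lie on
    one teacher ODE trajectory. In practice the target branch is an EMA copy
    of the student with gradients stopped."""
    return np.mean((f(x_a, t_a) - f(x_b, t_b)) ** 2)

# Toy teacher flow (an assumption for illustration): trajectories x(t) = x0*(1+t),
# so the ideal consistency function f(x, t) = x/(1+t) maps every trajectory
# point straight back to the clean sample x0 in a single step.
rng = np.random.default_rng(0)
x0 = rng.normal(size=100)
t_a, t_b = 0.2, 0.9
x_a, x_b = x0 * (1 + t_a), x0 * (1 + t_b)

ideal = lambda x, t: x / (1 + t)
naive = lambda x, t: x                    # ignores the trajectory structure

print(consistency_loss(ideal, x_a, t_a, x_b, t_b))  # ~0: self-consistent
print(consistency_loss(naive, x_a, t_a, x_b, t_b))  # large: penalized
```

Once the student is self-consistent, any noise level maps to the clean output in one evaluation, which is where the one-step speedup comes from.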
interactive — masked diffusion (LLaDA)
real DistilBERT inference (68MB ONNX, runs in-browser via WASM) · type any sentence · forward process masks all tokens · reverse process uses real fill-mask predictions · LLaDA remasking: sort by model confidence, keep top, remask bottom (s/t) fraction · green dot = real model prediction · red dot = remasked · compare with AR side-by-side
autoregressive
O(n) — 512 tokens = 512 forward passes.
diffusion
O(T) — ~14 steps regardless of length.
autoregressive
None. Bad token at position 12 propagates forever.
diffusion
Built-in. Low-confidence tokens get re-masked and re-predicted each step.
autoregressive
Proven at scale. GPT-4, Claude, Gemini.
diffusion
Mercury 2 matches Claude Haiku class. LLaDA 2.0 (100B) competitive with Qwen3-30B. Gap closing fast.
autoregressive
Unidirectional. The reversal curse is a direct consequence.
diffusion
Bidirectional. Every token sees every other. Reversal curse substantially weakened.
AR and diffusion are endpoints on a spectrum. The most interesting work is in the middle — architectures that choose when to be sequential and when to be parallel.
Block Diffusion
Kuleshov Lab (ICLR 2025)
mechanism
Generates blocks autoregressively; within each block, uses diffusion. Block size L' is the interpolation knob — L'=1 is pure AR, L'=n is pure diffusion.
insight
AR and diffusion are endpoints on a continuum. Every point in between is a valid architecture.
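The interpolation knob can be made concrete as an attention mask: bidirectional within each block, causal across blocks. A sketch assuming 1 means "may attend"; the real architecture also handles KV caching and per-block denoising, which this omits.

```python
import numpy as np

def block_diffusion_mask(n, block):
    """Attention mask interpolating AR and diffusion: block=1 reduces to a
    standard causal mask, block=n to a fully bidirectional one."""
    blk = np.arange(n) // block                       # block index of each position
    return (blk[None, :] <= blk[:, None]).astype(int)  # attend to own and earlier blocks

n = 6
print(block_diffusion_mask(n, 1))   # lower-triangular: pure AR
print(block_diffusion_mask(n, n))   # all ones: pure diffusion
print(block_diffusion_mask(n, 2))   # in between: 2x2 bidirectional blocks, causal across
```

Every intermediate block size gives a valid architecture, which is the continuum claim in code.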
TiDAR
NVIDIA Research
mechanism
Single model, hybrid attention. Causal prefix (AR-cached) + diffusion draft block in one forward pass. Fills idle GPU slots with speculative tokens.
insight
4.7-5.9x faster than pure AR. 8 tokens per forward pass instead of 1. No separate draft model.
CDLM
Together AI
mechanism
Post-training acceleration via consistency loss + distillation. Compresses denoising steps after the fact. Works on any masked diffusion model.
insight
14.5x latency reduction on coding tasks. You don't have to choose few-step vs many-step at architecture time.
For six years, the transformer story had one plot: predict the next token, scale the parameters. In 2025, the plot forked. Mercury proved diffusion could be commercial. Gemini Diffusion hit 1,479 tok/s. LLaDA 2.0 scaled to 100B and open-sourced it. Block Diffusion proved the paradigms are endpoints on a continuum.
But the deepest insight isn't about speed or scale. It's about cognition. These models don't think in order. They think in confidence — resolving what they're sure of first, refining what they're uncertain about, arriving at answers through a process that looks less like writing and more like understanding.