The answer arrives before the reasoning finishes.
That changes everything about how we think about thinking.

Autoregressive models think left to right. Each token waits for every previous one. Diffusion language models think holistically — all tokens refine in parallel, with structural tokens stabilizing first and answer tokens crystallizing from context. At 50% of refinement steps, 97% of tokens are already correct.

drag the slider — watch a real image dissolve into gaussian noise and reconstruct · the math is exact: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε with a cosine schedule · toggle "show denoised" to see the approximate reconstruction at each noise level · upload your own image to diffuse
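What the slider computes is a single closed-form expression, not a step-by-step simulation. Here is a minimal NumPy sketch of that forward jump, assuming the cosine schedule from Nichol & Dhariwal 2021 (function names are ours):

```python
import numpy as np

def alpha_bar(t, s=0.008):
    """Cosine schedule (Nichol & Dhariwal 2021): fraction of signal surviving at time t in [0, 1]."""
    def f(u):
        return np.cos((u + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)  # normalized so alpha_bar(0) = 1 (clean) and alpha_bar(1) = 0 (pure noise)

def forward_diffuse(x0, t, rng=np.random.default_rng()):
    """Closed-form forward jump to noise level t: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    ab = alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
```

Because ᾱ_t collapses the whole forward chain into one expression, any noise level is one line of arithmetic away from the clean image.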

drag the slider or press think — watch how diffusion resolves tokens by confidence, not position · structure locks first, reasoning next, answers last · compare with sequential AR generation side by side

When you solve 17 × 24, you don't think left to right either. You recognize the structure ("multiplication"), decompose the problem (17 × 20 + 17 × 4), compute intermediate results (340 and 68), then assemble the answer (408). The structural scaffolding comes first. The answer comes last.

Diffusion models exhibit exactly this pattern. Structural tokens (equals signs, conjunctions, formatting) resolve in the first few steps. Reasoning tokens (operands, logical connectives) come next. Answer tokens — the actual output the user cares about — stabilize last, built on the scaffold of everything else. The model "knows" its answer before it finishes refining, because the reasoning substrate is already in place.

Mercury 2 · 1,009 tok/s · First reasoning dLLM. AIME 91.1. $0.25/M input tokens.

Accuracy at 50% · 97% correct · Most tokens stable halfway through refinement.

CDLM speedup · 14.5x latency reduction · Consistency training compresses denoising steps.

LLaDA 2.0 · 100B MoE params · First open-source dLLM at scale. 94.51 HumanEval.

Forget next-token prediction. dLLMs corrupt text with noise, then train a model to reverse the corruption. The entire sequence refines simultaneously.

01 · Forward Process — Masking

Start with clean text. At each timestep t ∈ [0, 1], independently mask each token with probability t. At t=1, every token is [MASK]. The corruption is discrete masking on token IDs — no Gaussian blur, no continuous vectors.
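A sketch of that corruption in PyTorch (MASK_ID is a placeholder; use whatever id the tokenizer assigns to [MASK]):

```python
import torch

MASK_ID = 103  # placeholder: substitute the tokenizer's real [MASK] id

def forward_mask(token_ids: torch.Tensor, t: float) -> torch.Tensor:
    """Corrupt a clean sequence: each token independently becomes [MASK] with probability t."""
    noisy = token_ids.clone()
    noisy[torch.rand(token_ids.shape) < t] = MASK_ID  # t=0: untouched, t=1: all [MASK]
    return noisy
```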

02 · Bidirectional Prediction

A vanilla Transformer without a causal mask. Given a partially masked sequence, predict every masked position simultaneously. Token 47 sees token 48. This is fundamentally different from left-to-right models.
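In code, the architectural difference is one missing mask. A toy PyTorch sketch (sizes are illustrative, and positional encodings are omitted for brevity):

```python
import torch.nn as nn

class BidirectionalDenoiser(nn.Module):
    """A vanilla Transformer encoder: no causal mask, so token 47 attends to token 48."""
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # positional encodings omitted for brevity
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # mask=None: full attention
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, noisy_ids):
        """(batch, seq_len) token ids -> (batch, seq_len, vocab) logits, all positions at once."""
        return self.head(self.encoder(self.embed(noisy_ids)))
```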

03 · Iterative Refinement

Start from pure noise. Each step: predict all masked tokens in parallel, re-mask low-confidence ones, keep high-confidence ones. ~14 steps to refine a 512-token sequence. The text doesn't appear left-to-right — it appears everywhere at once.
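Putting steps 01–03 together, the whole sampler fits in a dozen lines. A hedged sketch, assuming a model like the BidirectionalDenoiser above and the LLaDA-style rule of remasking the lowest-confidence s/t fraction (described in the demo caption further down):

```python
import torch

@torch.no_grad()
def refine(model, seq_len, steps=14, mask_id=103):
    """Sample by iterative refinement: start all-[MASK], commit confident tokens, remask the rest."""
    ids = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        t = 1.0 - step / steps                        # current noise level
        s = 1.0 - (step + 1) / steps                  # target (lower) noise level
        still_masked = ids == mask_id
        conf, pred = model(ids).softmax(-1).max(-1)   # one parallel forward pass
        ids[still_masked] = pred[still_masked]        # tentatively fill every masked slot
        n_remask = int(s / t * still_masked.sum())    # LLaDA rule: remask bottom s/t fraction
        if n_remask > 0:
            conf[~still_masked] = float("inf")        # never disturb already-committed tokens
            worst = conf.flatten().topk(n_remask, largest=False).indices
            ids.view(-1)[worst] = mask_id
    return ids
```

At the final step s = 0, so nothing is remasked and every position is committed.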

step through diffusion denoising on custom text · compare AR vs diffusion side by side · race models at real-world speeds

real langevin sampling on a mixture of 5 gaussians · arrows show the exact score function ∇_x log p(x) at each grid point · particles follow this field plus noise · toggle annealed mode to watch σ decrease from 0.15 to 0.001

the key insight from Song & Ermon 2019: annealing the noise level lets particles cross energy barriers between modes · without annealing, particles collapse to the nearest peak and miss low-weight modes entirely · same start positions, same noise, different σ schedule · watch the mode coverage diverge
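The score field in these demos can be exact because a Gaussian mixture has a closed-form ∇_x log p(x). A NumPy sketch of annealed Langevin dynamics, with illustrative means, weights, and component variance; the σ schedule matches the demo and the ε ∝ σ² step-size rule follows Song & Ermon 2019:

```python
import numpy as np

rng = np.random.default_rng(0)
mus = rng.uniform(-4, 4, size=(5, 2))              # 5 mixture means on the plane (illustrative)
weights = np.array([0.35, 0.3, 0.2, 0.1, 0.05])    # deliberately unequal mode weights

def score(x, sigma):
    """Exact grad_x log p_sigma(x) for a Gaussian mixture smoothed by noise level sigma."""
    var = 0.3**2 + sigma**2                                  # component variance + annealing noise
    d2 = ((x[:, None, :] - mus[None]) ** 2).sum(-1)          # squared distance to each mean
    logr = np.log(weights) - d2 / (2 * var)
    r = np.exp(logr - logr.max(1, keepdims=True))
    r /= r.sum(1, keepdims=True)                             # posterior responsibilities
    return (r[..., None] * (mus[None] - x[:, None, :])).sum(1) / var

x = rng.uniform(-5, 5, size=(500, 2))              # particles start uniform over the plane
for sigma in np.geomspace(0.15, 0.001, 10):        # annealing schedule from the demo
    eps = 0.05 * sigma**2                          # step size shrinks with sigma (Song & Ermon 2019)
    for _ in range(50):                            # Langevin steps at this noise level
        x += 0.5 * eps * score(x, sigma) + np.sqrt(eps) * rng.standard_normal(x.shape)
```

Dropping the outer loop and fixing sigma reproduces the non-annealed failure mode: particles settle into whichever peak is nearest and the low-weight modes go unvisited.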

real DDPM math with cosine noise schedule · forward: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε · reverse: x_{t−1} uses the exact posterior mean and variance from Ho et al. 2020 · watch ᾱ_t and β_t change at each timestep
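A NumPy sketch of those exact formulas; eps_hat stands in for the network's noise prediction, and the schedule constants are the usual cosine-schedule choices:

```python
import numpy as np

T = 1000
grid = np.arange(T + 1) / T
abar = np.cos((grid + 0.008) / 1.008 * np.pi / 2) ** 2
abar /= abar[0]                                          # cosine alpha_bar: abar[0] = 1 (clean)
beta = np.clip(1 - abar[1:] / abar[:-1], 1e-8, 0.999)    # beta_t = 1 - abar_t / abar_{t-1}
alpha = 1 - beta

def reverse_step(x_t, eps_hat, t, rng=np.random.default_rng()):
    """One exact DDPM reverse step x_t -> x_{t-1} (Ho et al. 2020), t in 1..T."""
    mean = (x_t - beta[t - 1] / np.sqrt(1 - abar[t]) * eps_hat) / np.sqrt(alpha[t - 1])
    if t == 1:
        return mean                                      # final step: return the mean, add no noise
    var = (1 - abar[t - 1]) / (1 - abar[t]) * beta[t - 1]  # posterior variance (beta tilde)
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)
```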

same starting noise, same trained model, different reverse process · drag η from 0 (fully deterministic DDIM) to 1 (recovers DDPM) · enable step-skipping to see why deterministic sampling can skip timesteps but stochastic sampling can't · Song et al. 2020
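One update rule covers the entire η slider. A sketch of the DDIM step from Song et al. 2020 (argument names are ours); note that abar_prev may belong to any earlier timestep, not just the adjacent one, which is exactly what makes step-skipping legal on the deterministic path:

```python
import numpy as np

def ddim_step(x_t, eps_hat, abar_t, abar_prev, eta=0.0, rng=np.random.default_rng()):
    """One DDIM step (Song et al. 2020). eta=0: fully deterministic; eta=1: recovers DDPM."""
    x0_hat = (x_t - np.sqrt(1 - abar_t) * eps_hat) / np.sqrt(abar_t)   # predicted clean sample
    sigma = eta * np.sqrt((1 - abar_prev) / (1 - abar_t)) * np.sqrt(1 - abar_t / abar_prev)
    dir_xt = np.sqrt(1 - abar_prev - sigma**2) * eps_hat               # direction pointing back to x_t
    return np.sqrt(abar_prev) * x0_hat + dir_xt + sigma * rng.standard_normal(x_t.shape)
```

With eta=0 the noise term vanishes, so the trajectory is a fixed function of the starting noise and can jump between distant timesteps; any eta > 0 injects fresh randomness at every step, which is why the stochastic sampler can't skip.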

the teacher traces 50-step ODE trajectories from noise to data · the student learns that every point on the same trajectory should map to the same clean output · 1 step = 50x speedup with measurable quality loss · 4 steps nearly matches teacher · toggle consistency probe to see the self-consistency property in action · Song et al. 2023
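A heavily simplified sketch of one consistency-distillation training step in PyTorch; student, ema_student, and teacher_ode_step are placeholder callables, and the forward process is reduced to x_t = x_0 + t·ε for brevity (the full recipe in Song et al. 2023 includes boundary conditions and specific parameterizations omitted here):

```python
import torch

def consistency_distillation_loss(student, ema_student, teacher_ode_step, x0, t, t_prev, noise):
    """One CD step, simplified: two points on the same teacher ODE trajectory
    must map to the same clean output (the self-consistency property)."""
    x_t = x0 + t * noise                                # noisy point at time t (simplified forward)
    with torch.no_grad():
        x_prev = teacher_ode_step(x_t, t, t_prev)       # one teacher solver step along the trajectory
        target = ema_student(x_prev, t_prev)            # EMA student's output at the earlier point
    return ((student(x_t, t) - target) ** 2).mean()     # pull the two outputs together
```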

real DistilBERT inference (68MB ONNX, runs in-browser via WASM) · type any sentence · forward process masks all tokens · reverse process uses real fill-mask predictions · LLaDA remasking: sort by model confidence, keep top, remask bottom (s/t) fraction · green dot = real model prediction · red dot = remasked · compare with AR side-by-side

Latency (dLLM edge)

autoregressive: O(n) — 512 tokens = 512 forward passes.
diffusion: O(T) — ~14 steps regardless of length.

Error correction (dLLM edge)

autoregressive: None. A bad token at position 12 propagates forever.
diffusion: Built-in. Low-confidence tokens get re-masked and re-predicted each step.

Quality (AR edge)

autoregressive: Proven at scale. GPT-4, Claude, Gemini.
diffusion: Mercury 2 matches the Claude Haiku class. LLaDA 2.0 (100B) is competitive with Qwen3-30B. The gap is closing fast.

Context (dLLM edge)

autoregressive: Unidirectional. The reversal curse is a direct consequence.
diffusion: Bidirectional. Every token sees every other. The reversal curse is substantially weakened.

AR and diffusion are endpoints on a spectrum. The most interesting work is in the middle — architectures that choose when to be sequential and when to be parallel.

Block Diffusion · Kuleshov Lab (ICLR 2025)

mechanism: Generates blocks autoregressively; within each block, uses diffusion (see the sketch after this card). Block size L' is the interpolation knob — L'=1 is pure AR, L'=n is pure diffusion.

insight: AR and diffusion are endpoints on a continuum. Every point in between is a valid architecture.
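A hedged sketch of that decoding loop, reusing the masked-diffusion refinement from earlier; the model signature and the linear commit schedule are our assumptions, not the paper's exact recipe:

```python
import torch

@torch.no_grad()
def block_diffusion_generate(model, n_blocks, block_len, steps=8, mask_id=103):
    """Blocks are decoded left to right (the AR axis); each block is filled in by
    masked-diffusion refinement conditioned on everything already committed."""
    prefix = torch.empty((1, 0), dtype=torch.long)
    for _ in range(n_blocks):
        block = torch.full((1, block_len), mask_id, dtype=torch.long)
        for step in range(steps):                           # parallel refinement inside the block
            logits = model(torch.cat([prefix, block], dim=1))[:, -block_len:]
            conf, pred = logits.softmax(-1).max(-1)
            n_keep = block_len * (step + 1) // steps        # commit more tokens each step
            block = torch.full_like(block, mask_id)
            top = conf.flatten().topk(n_keep).indices       # keep only the most confident
            block.view(-1)[top] = pred.view(-1)[top]
        prefix = torch.cat([prefix, block], dim=1)
    return prefix
```

Setting block_len=1 collapses the inner loop to ordinary next-token decoding; setting it to the full sequence length recovers the pure diffusion sampler, which is the continuum the paper describes.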

TiDAR · NVIDIA Research

mechanism: Single model, hybrid attention. Causal prefix (AR-cached) + diffusion draft block in one forward pass. Fills idle GPU slots with speculative tokens.

insight: 4.7-5.9x faster than pure AR. 8 tokens per forward pass instead of 1. No separate draft model.

CDLM · Together AI

mechanism: Post-training acceleration via consistency loss + distillation. Compresses denoising steps after the fact. Works on any masked diffusion model.

insight: 14.5x latency reduction on coding tasks. You don't have to choose few-step vs many-step at architecture time.

For six years, the transformer story had one plot: predict the next token, scale the parameters. In 2025, the plot forked. Mercury proved diffusion could be commercial. Gemini Diffusion hit 1,479 tok/s. LLaDA 2.0 scaled to 100B parameters and went open source. Block Diffusion proved the paradigms are endpoints on a continuum.

But the deepest insight isn't about speed or scale. It's about cognition. These models don't think in order. They think in confidence — resolving what they're sure of first, refining what they're uncertain about, arriving at answers through a process that looks less like writing and more like understanding.