
128× in four months.

Diffusion LLMs went from ~64 denoising steps to 1 step between Nov 2024 and Mar 2025. The compression curve is not a marketing chart: it is four papers, each replacing the previous bottleneck.

the trajectory

denoising step compression (log scale): 128× in 4 months

Nov 2024: ~64 steps
Jan 2025: ~4 steps (14.5×)
Feb 2025: 1 step (64×)
Mar 2025: 1 step (128×)

Cumulative speedup climbs from 14.5× to 64× to 128×: a 14.5× jump, then roughly 4.4×, then 2× over each predecessor. At this rate, the denoising overhead that made dLLMs uncompetitive is being compressed toward zero. The quality-speed tradeoff that defined the paradigm comparison is dissolving.
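The jump between consecutive milestones can be checked from the cumulative speedups on the chart. This is a minimal sketch. It assumes the Nov 2024 baseline counts as 1× (the chart gives no multiplier for it), and the names `MILESTONES` and `successive_ratios` are illustrative only.

```python
# Cumulative speedups as quoted on the chart.
# Assumption: the Nov 2024 baseline is treated as 1x.
MILESTONES = [
    ("Nov 2024", 1.0),
    ("Jan 2025", 14.5),
    ("Feb 2025", 64.0),
    ("Mar 2025", 128.0),
]


def successive_ratios(points):
    """Return (label, ratio) pairs: each milestone's speedup over its predecessor."""
    return [
        (later_label, later_speedup / earlier_speedup)
        for (_, earlier_speedup), (later_label, later_speedup) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    for label, ratio in successive_ratios(MILESTONES):
        print(f"{label}: {ratio:.1f}x over the previous milestone")
```

The ratios are computed on speedup, not on step count, so they describe the latency an end user sees rather than the number of denoising passes.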

why the curve is exponential, not linear

Each milestone attacks denoising from a different angle. CDLM (Together AI, Jan 2025) keeps the architecture and post-trains a student with a consistency-distillation loss; quality stays, and latency drops 14.5× on code. IDLM (Li et al., Feb 2025) replaces iterative denoising with a learned implicit prior that lets a single forward pass land near the data manifold. FS-DFM (Chen et al., Mar 2025) straightens the discrete flow with optimal transport, so one Euler step produces samples that previously needed 64 steps.

These aren't variations of the same trick. Distillation, implicit priors, and flow-straightening are independent levers. Stacking them is plausible, and that is what makes the curve look structural rather than coincidental.
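A toy continuous example shows why a straightened flow matters for step count. This is not FS-DFM itself, which works on discrete tokens; it is a one-dimensional sketch under assumed velocity fields. Both fields below carry a point from `X0` to `X1` over unit time, but only the constant-velocity (straight) one is integrated exactly by a single Euler step. All names here are hypothetical.

```python
import math

X0, X1 = 0.0, 1.0  # toy start and target points


def euler_integrate(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x += dt * velocity(x, t)
    return x


def straight_velocity(x, t):
    # Constant velocity: the path from X0 to X1 is a straight line in time.
    return X1 - X0


def curved_velocity(x, t):
    # Same endpoints, but the speed varies with t; the exact integral over
    # [0, 1] is still X1 - X0, because the sine term integrates to 1.
    return (X1 - X0) * (math.pi / 2) * math.sin(math.pi * t / 2)


if __name__ == "__main__":
    for name, field in [("straight", straight_velocity), ("curved", curved_velocity)]:
        for steps in (1, 4, 64):
            error = abs(euler_integrate(field, X0, steps) - X1)
            print(f"{name:8s} steps={steps:2d} error={error:.4f}")
```

In this toy, the curved field needs many steps to approach the target, while the straight field arrives in one. That is the intuition behind straightening the flow before cutting the step budget.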

capability, not just speed

The other axis: Dream 7B (HKU NLP) hits 81% on 9×9 Sudoku where an autoregressive 7B baseline scores 21%. Not because diffusion is "smarter" — because bidirectional attention lets the model reason about constraints across the full grid simultaneously. AR models commit cell-by-cell; diffusion can revise. On ARC-AGI and Countdown, the same effect appears in smaller form.

In the planning stats, the gap between AR and diffusion stops looking like noise. Constraint-satisfaction tasks reward the architecture that can re-mask its own draft.

honest limits

What this page doesn't show. One-step samples from FS-DFM retain ~95% quality at the benchmark level; they are not token-for-token equivalent to 64-step samples. Acceleration is a Pareto move, not a free lunch. And the speed numbers are inference-only; training cost for distillation/flow-straightening is real and unamortized in the bar chart. The 128× is what an end user feels, not what a GPU cluster paid.
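One way to reason about the unamortized training cost is a break-even count: how many requests the accelerated model must serve before its inference savings repay the extra training. This is a rough sketch. It assumes per-request cost falls in direct proportion to the speedup, and the function name and the figures in the usage lines are hypothetical, not taken from any of the papers.

```python
import math


def break_even_requests(training_cost, baseline_cost_per_request, speedup):
    """Requests needed before inference savings cover the extra training cost.

    Assumption: accelerated cost per request = baseline cost / speedup.
    All costs must be in the same (arbitrary) unit.
    """
    if speedup <= 1:
        raise ValueError("speedup must exceed 1 for any saving to exist")
    accelerated_cost = baseline_cost_per_request / speedup
    saving_per_request = baseline_cost_per_request - accelerated_cost
    return math.ceil(training_cost / saving_per_request)


if __name__ == "__main__":
    # Hypothetical figures purely for illustration.
    requests = break_even_requests(
        training_cost=1000.0,
        baseline_cost_per_request=0.01,
        speedup=128,
    )
    print(f"break-even after {requests} requests")
```

At large speedups the saving per request approaches the full baseline cost, so the break-even point is set almost entirely by the ratio of training cost to baseline inference cost.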

sources

Sahoo et al., "Simple and Effective Masked Diffusion Language Models" (NeurIPS 2024) — the MDLM baseline.
Together AI, "Consistency Diffusion Language Models" (Jan 2025) — CDLM, 14.5×.
Li et al., "Implicit Discrete Language Models" (Feb 2025) — IDLM, 64×.
Chen et al., "Flow-Straight Discrete Flow Matching" (Mar 2025) — FS-DFM, 128×.
Ye et al., "Dream 7B" (HKU NLP, 2025) — Sudoku 81% vs AR baseline 21%.
