Every LLM you have used in production generates text the same way: one token at a time, left to right, each token depending on everything that came before it. That is autoregressive decoding, and it has a hard constraint baked in. The sequential nature is not an implementation detail you can optimize away - it is the mathematical structure of the model.
Diffusion LLMs take a different path. They generate an entire sequence in parallel across multiple denoising steps, rather than one token per step. The practical result is lower latency on long outputs. The catch is that the tradeoffs are subtle enough that most coverage of this topic either oversells the speed claims or undersells the real limitations.
Here is what is actually happening.
The Autoregressive Bottleneck
To understand why diffusion exists, you need to feel the pain point of autoregressive generation.
An autoregressive transformer generates a 1000-token response by running 1000 sequential forward passes. You can batch many users together, and the KV cache means you are not recomputing attention for previous tokens on each step. But you are still bottlenecked by the 1000 sequential decode steps. You cannot generate token 500 until you have generated tokens 1 through 499.
For a typical 7B model serving 1000-token outputs:
- Time to first token (TTFT): ~50ms (prompt processing, parallelizable)
- Time per output token: ~5-15ms (depends on hardware, model size, batch)
- Total latency for 1000 tokens: 5-15 seconds
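The total is just a two-term sum, which a few lines of Python make concrete. This is a minimal sketch: the function name and default values are illustrative assumptions, and it counts every output token as one decode step after the prefill, which slightly overcounts because the first token arrives with the TTFT.

```python
def total_latency_ms(output_tokens, ttft_ms=50.0, ms_per_token=10.0):
    """Prompt processing time plus one sequential decode step per output token."""
    return ttft_ms + output_tokens * ms_per_token


# The two ends of the per-token range quoted above, for a 1000-token output
low = total_latency_ms(1000, ms_per_token=5.0)
high = total_latency_ms(1000, ms_per_token=15.0)
print(f"{low / 1000:.1f}s to {high / 1000:.1f}s")
```

The decode term dominates: at 1000 tokens the TTFT contributes well under 1% of the total, which is why optimizing prompt processing alone does little for long outputs.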
Speculative decoding helps by drafting multiple tokens with a small model and verifying with the large model. Flash Attention and optimized CUDA kernels help. But you are fighting against a fundamental sequential dependency. You cannot generate token N+1 until the model produces token N.
This matters most for two cases: very long outputs where users wait many seconds, and high-throughput bulk generation where you want to push tokens per second as high as possible.
What Diffusion Models Do Instead
Image diffusion (Stable Diffusion, DALL-E) starts with random noise and runs a denoising network repeatedly to converge on an image. Each denoising step refines the entire image simultaneously.
Text diffusion adapts this idea, but text is discrete - you cannot add Gaussian noise to token IDs the way you add it to pixel values. The main approach used in production is masked diffusion:
- Forward process: randomly mask tokens in the sequence (replace with [MASK])
- Reverse process: a trained model predicts all masked tokens simultaneously
- Iterate: keep the most confident predictions, re-mask the low-confidence ones, and run the model again
The key insight is that the reverse process runs in parallel across all positions. A single forward pass produces predictions for every masked position at once, not one token at a time.
Autoregressive generation (1000 tokens):
Step 1: [The] [EMPTY] [EMPTY] ... [EMPTY]
Step 2: [The] [cat] [EMPTY] ... [EMPTY]
Step 3: [The] [cat] [sat] ... [EMPTY]
...
Step 1000: [The] [cat] [sat] ... [fence]
= 1000 sequential forward passes
Masked diffusion generation (1000 tokens, 40 steps):
Step 1: [MASK] [MASK] [MASK] ... [MASK] (all masked)
Step 10: [The] [MASK] [sat] ... [MASK] (some unmasked)
Step 20: [The] [cat] [sat] ... [MASK] (more unmasked)
Step 40: [The] [cat] [sat] ... [fence] (fully denoised)
= 40 forward passes, each over the full sequence
Fewer total forward passes. Each pass is over the full sequence length rather than just the current position. The tradeoff is that each step is more expensive than a single autoregressive decode step - but you do far fewer of them.
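The decoding loop can be sketched in plain Python. Everything here is a toy under stated assumptions: `toy_denoiser` is a hypothetical stand-in for a trained model (it peeks at the target and returns random confidences), the schedule that unmasks an equal share of the remaining positions per step is an assumption, and re-masking of already committed tokens is left out for brevity. The point is only to show that the number of forward passes is set by the step count, not by the sequence length.

```python
import random

MASK = "[MASK]"


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a trained model.

    Returns {position: (predicted_token, confidence)} for every masked position.
    It cheats by reading the target; a real model would predict from context.
    """
    return {i: (target[i], rng.random()) for i, t in enumerate(tokens) if t == MASK}


def masked_diffusion_decode(target, num_steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * len(target)  # start fully masked
    passes = 0
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, target, rng)  # one pass covers all positions
        passes += 1
        # Assumed schedule: commit an equal share of what is still masked
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens, passes


target = "the cat sat on the old wooden fence by noon".split()
tokens, passes = masked_diffusion_decode(target, num_steps=4)
print(passes, "forward passes for", len(target), "tokens")
```

An autoregressive loop over the same target would need one pass per token. Here the pass count stays at `num_steps` however long the target gets, at the price of each pass touching every position.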
Architecture Diagram
Autoregressive Transformer
--------------------------
Input: [T1][T2][T3]...[Tn] --> generate Tn+1
|
causal attention
(each token attends
only to left context)
|
next token distribution
--> sample --> repeat
Masked Diffusion Model
----------------------
Input: [T1][M][M][T4][M]...[Tn] (M = masked)
|
bidirectional attention
(each position attends
to all positions)
|
predictions for all M positions simultaneously
--> partially unmask --> repeat ~20-50 steps
Bidirectional attention is worth flagging: diffusion models can see the entire (partial) sequence at every step. This is why they naturally handle infilling and editing tasks that trip up autoregressive models.
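The difference between the two diagrams comes down to the attention mask. As a small illustration (the helper names are made up for this sketch, and real implementations build these masks as tensors inside the attention kernel), here are both masks for a four-token sequence:

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


print("causal:")
print(show(causal_mask(4)))
print("bidirectional:")
print(show(bidirectional_mask(4)))
```

The causal mask is lower-triangular, so position 1 never sees positions 2 and 3. The bidirectional mask has no such restriction, which is what lets a masked position use context on both sides.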
Mercury: The First Production Diffusion LLM
Inception Labs released Mercury in early 2025, and it remains the most significant commercial deployment of a text diffusion model to date. Their benchmarks claim 5-10x faster generation than comparable autoregressive models at similar quality levels.
The Mercury architecture is built on masked discrete diffusion. The headline numbers are real in the sense that they reflect throughput improvements under specific conditions - primarily long-sequence generation where the ratio of sequence length to diffusion steps is favorable.
What the benchmarks do not always surface:
- Quality on reasoning tasks lags behind frontier autoregressive models by a noticeable margin
- The speed advantage is largest for outputs of 500+ tokens; for short responses the overhead of multiple full-sequence forward passes shrinks the gain
- Token streaming to users is more complex - you cannot stream individual tokens naturally because you are generating in parallel
Where the Speed Math Works Out
The crossover point between autoregressive and diffusion latency depends on output length and the number of diffusion steps required.
# Rough latency model (simplified, GPU-dependent)
def autoregressive_latency(output_tokens, ms_per_token=10):
    return output_tokens * ms_per_token  # strictly sequential

def diffusion_latency(output_tokens, diffusion_steps=40, ms_per_step_base=20):
    # Each step processes the full sequence, so cost scales with sequence length
    ms_per_step = ms_per_step_base * (output_tokens / 512)  # normalize to 512 tokens
    return diffusion_steps * ms_per_step

# Example: 1000-token output
ar = autoregressive_latency(1000)   # 10,000ms = 10s
diff = diffusion_latency(1000)      # 40 * ~39ms = ~1,562ms
# Example: 100-token output
ar = autoregressive_latency(100)    # 1,000ms
diff = diffusion_latency(100)       # 40 * ~4ms = ~156ms -- same ~6.4x ratio, smaller absolute gap
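Note that this toy model scales both costs linearly with output length, so it predicts a constant ratio and no crossover at all. In practice each denoising step carries a fixed cost that does not shrink with sequence length (every forward pass still reads the full set of weights), and that fixed cost is what erodes the advantage on short outputs. The crossover is real, but the exact numbers depend heavily on batch size, hardware, and how well each denoising step parallelizes across GPUs.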
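One way to make the short-output penalty explicit is to give each denoising step a fixed cost on top of the per-token cost. The sketch below is an illustrative variant of the latency model above. The constants (`fixed_ms_per_step`, `ms_per_token_per_step`) are assumptions picked to produce a readable example, not measurements of any real system.

```python
def autoregressive_latency_ms(output_tokens, ms_per_token=10.0):
    return output_tokens * ms_per_token


def diffusion_latency_ms(output_tokens, steps=40,
                         fixed_ms_per_step=40.0, ms_per_token_per_step=0.04):
    # Each step pays a fixed cost plus a cost proportional to sequence length
    return steps * (fixed_ms_per_step + ms_per_token_per_step * output_tokens)


def crossover_tokens(ms_per_token=10.0, steps=40,
                     fixed_ms_per_step=40.0, ms_per_token_per_step=0.04):
    """Output length at which both models predict the same latency.

    Returns None when diffusion's marginal cost per token is not lower,
    in which case it never catches up under this model.
    """
    marginal_gap = ms_per_token - steps * ms_per_token_per_step
    if marginal_gap <= 0:
        return None
    return steps * fixed_ms_per_step / marginal_gap


for n in (100, 1000):
    print(n, autoregressive_latency_ms(n), diffusion_latency_ms(n))
print("crossover near", round(crossover_tokens()), "tokens")
```

With these assumed constants, autoregressive decoding wins at 100 tokens and diffusion wins at 1000. Changing the fixed cost or the step count moves the crossover, which is why a benchmark on your own hardware matters more than any headline multiplier.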
Comparison: Where Each Approach Wins
| Dimension | Autoregressive | Diffusion |
|---|---|---|
| Short responses (< 200 tokens) | Faster total latency | Overhead from multi-step often negates gain |
| Long responses (> 500 tokens) | Latency grows linearly | Clear speed advantage |
| Streaming to user | Natural, token-by-token | Awkward - requires faking it |
| Reasoning / math | Strong (chains of thought work well) | Currently weaker |
| Infilling / editing | Clunky - requires prompt engineering | Natural fit |
| Complex instruction following | Frontier models excel | Not yet competitive at top tier |
| Bulk throughput (no latency pressure) | Can batch well | Very strong |
| Quality at frontier capability | GPT-4o, Claude 4 | Not yet equivalent |
The reasoning gap is the most significant limitation. Autoregressive models generate chains of thought token by token - each reasoning step builds on the previous one. Diffusion models generate in parallel across positions, which makes it harder to build up multi-step reasoning chains naturally. Mercury-class models benchmark well on coding and summarization but fall behind on math and complex logic.
The Infilling Advantage Is Underrated
The case where diffusion is genuinely better right now - not just faster but qualitatively better - is infilling and constrained editing.
Suppose you have a document and want to rewrite the third paragraph while keeping surrounding context intact. With an autoregressive model you have to engineer a prompt that includes the surrounding context and hope the model respects the constraints. A causal model attends only to tokens on its left, so the text after the gap is visible only if you copy it into the prompt ahead of the fill, and nothing in the decoding process forces the generated text to join cleanly with what follows.
A diffusion model runs bidirectional attention over the entire sequence at every step. It sees both the text before and after the gap simultaneously when deciding what to put in it. This produces better constrained generation without the awkward prompt engineering.
# Autoregressive infilling (awkward)
before_context = "Intro paragraph."    # placeholder text
after_context = "Closing paragraph."   # placeholder text
prompt = f"""
{before_context}
[FILL IN THIS SECTION]
{after_context}
Write the fill-in section only:
"""
# The model can read after_context in the prompt, but nothing forces the
# generated fill to connect cleanly to it

# Diffusion infilling (natural)
# Pass the entire document with the target section masked
# Model sees both surrounding contexts on every denoising step
This matters for code generation (fill in a function body), document editing, and template-based generation. Any task where you know what comes before and after but not what goes in the middle.
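To make the diffusion side concrete, here is a minimal sketch of how an infilling input could be assembled. The helper names, the `frozen` flag list, and the hard-coded `predictions` dictionary are all hypothetical. A real system would get predictions from the model and would also have to choose the gap length up front, as this sketch does.

```python
MASK = "[MASK]"


def build_infill_input(before, after, gap_len):
    """Surround a run of masks with fixed context; mark context as frozen."""
    tokens = list(before) + [MASK] * gap_len + list(after)
    frozen = [True] * len(before) + [False] * gap_len + [True] * len(after)
    return tokens, frozen


def apply_predictions(tokens, frozen, predictions):
    """Write predictions only into positions that are not frozen."""
    return [
        tok if keep else predictions.get(i, tok)
        for i, (tok, keep) in enumerate(zip(tokens, frozen))
    ]


before = ["def", "add", "(", "a", ",", "b", ")", ":"]
after = ["print", "(", "add", "(", "1", ",", "2", ")", ")"]
tokens, frozen = build_infill_input(before, after, gap_len=4)

# Hypothetical model output, including one attempt to overwrite the context
predictions = {0: "class", 8: "return", 9: "a", 10: "+", 11: "b"}
filled = apply_predictions(tokens, frozen, predictions)
print(" ".join(filled))
```

The frozen flags are what guarantee the surrounding text survives untouched: the prediction for position 0 is simply ignored, while the four masked positions are filled.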
What You Cannot Do with Current Diffusion Models
Chain-of-thought reasoning. Asking a diffusion model to “think step by step” does not work as well because the generation process does not naturally produce a left-to-right reasoning trace. Research is ongoing here (adding autoregressive-style thinking phases to diffusion models) but it is not solved.
Token streaming. Users expect to see text appear progressively. With autoregressive models this is natural - emit each token as it generates. With diffusion you are generating in bursts across denoising steps. You can fake streaming by emitting tokens as they become confident enough across steps, but it looks different and some tokens get revised in later steps, which creates a jarring experience.
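One simple way to approximate streaming is to emit only the longest fully unmasked prefix after each denoising step. The sketch below assumes you can observe the token list after every step (the `snapshots` argument is hypothetical) and that committed tokens are never revised. With a sampler that re-masks tokens, the second assumption fails, and you would need to hold text back or accept visible corrections.

```python
MASK = "[MASK]"


def stable_prefix_stream(snapshots):
    """Yield newly completed prefix chunks, one per denoising step at most."""
    emitted = 0
    for tokens in snapshots:
        end = emitted
        # Advance past every contiguous unmasked token after what was sent
        while end < len(tokens) and tokens[end] != MASK:
            end += 1
        if end > emitted:
            yield tokens[emitted:end]
            emitted = end


snapshots = [
    [MASK, MASK, MASK],
    ["The", MASK, "sat"],    # "sat" is ready but blocked behind a mask
    ["The", "cat", "sat"],
]
for chunk in stable_prefix_stream(snapshots):
    print(" ".join(chunk))
```

Text arrives in uneven bursts rather than token by token: nothing after the first step, one word after the second, two after the third. That unevenness is the difference users notice.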
Plug into existing inference infrastructure. vLLM, TGI, and TensorRT-LLM are all optimized for autoregressive generation. Running diffusion models at production scale requires custom serving infrastructure. This is an engineering tax that matters for teams without the resources to build it.
Does This Matter for Real Applications in 2026?
Honestly, for most applications: not yet, but soon.
The use cases where diffusion LLMs are production-ready today are narrow: bulk document generation where latency does not matter but throughput does, and infilling/editing tasks where bidirectional context is valuable.
For the mainstream cases - chat interfaces, coding assistants, agent loops, RAG pipelines - autoregressive models are still the right choice. The reasoning quality gap is real, the streaming problem is real, and the ecosystem support gap is real.
The trajectory is what makes this worth understanding now. Speculative decoding and continuous batching improved autoregressive throughput significantly in 2024-2025, so the baseline that diffusion has to beat keeps moving. If diffusion models close the quality gap on reasoning, the calculus changes, and that work is active.
The practical comparison for a team making architecture decisions today:
| Use case | Recommendation |
|---|---|
| Chat / conversational AI | Autoregressive (GPT-4o, Claude, Gemini) |
| Bulk document generation, no streaming | Mercury or diffusion worth evaluating |
| Code completion in IDE | Autoregressive - streaming is required |
| Document infilling / editing | Diffusion has a genuine edge |
| Reasoning, math, complex logic | Autoregressive is clearly stronger |
| High-throughput summarization pipeline | Diffusion worth benchmarking |
Bottom Line
Diffusion LLMs are a real architectural alternative, not vaporware. The speed gains on long outputs are genuine. The quality ceiling for reasoning tasks is a real limitation that the field has not solved. The infilling advantage is underrated and practically useful today.
If you are building anything where generation latency on long outputs is the bottleneck, and streaming is not a hard requirement, Mercury and similar models are worth a benchmark. For everything else - conversational interfaces, coding assistants, agent systems that need reliable reasoning - autoregressive models remain the better choice in 2026. Keep an eye on this space. The first diffusion model that closes the reasoning gap while maintaining the throughput advantage will be a significant shift.