Speculative Decoding for LLM Serving
A small draft model proposes tokens; the main model verifies them in a single pass. How speculative decoding cuts latency 2–3x with zero quality loss, and when it backfires at high concurrency.
LLM inference latency comes from one brutal fact: generation is autoregressive. You predict token N, then N+1, then N+2, serially. You cannot parallelize across tokens within a single request.
Speculative decoding breaks this constraint. A small “draft” model proposes K future tokens; the main model verifies all K in a single forward pass. Most proposed tokens are accepted; a few are corrected. Net result: 2–3x lower latency for the same quality.
It’s one of the most important inference optimizations of the last two years. This post covers what it is, when it helps, and how to deploy it.
In normal LLM generation, every output token requires one full forward pass through the model, conditioned on everything generated so far.
For a 70B model, each forward pass is ~40ms. Generating 100 tokens takes 4 seconds. The GPU isn’t the bottleneck — memory bandwidth is (weights cross the memory bus every step).
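A back-of-the-envelope calculation shows why bandwidth, not compute, sets the floor on decode latency. The numbers below are illustrative (FP16 weights, 4 GPUs, ~3.35 TB/s HBM each), not a benchmark:

```python
def min_step_ms(params_b: float, bytes_per_param: int,
                num_gpus: int, hbm_tb_s: float) -> float:
    """Lower bound on one decode step for a memory-bandwidth-bound model.

    Every step, each GPU must stream its shard of the weights from HBM.
    Illustrative model: tensor parallelism splits the weight read evenly.
    """
    weight_gb = params_b * bytes_per_param        # total weight bytes, in GB
    per_gpu_gb = weight_gb / num_gpus             # each GPU reads its shard
    return per_gpu_gb / (hbm_tb_s * 1000) * 1000  # GB / (GB/s) -> seconds -> ms

# 70B params, FP16 (2 bytes), 4x GPUs at ~3.35 TB/s HBM each.
t = min_step_ms(params_b=70, bytes_per_param=2, num_gpus=4, hbm_tb_s=3.35)
print(f"lower bound per decode step: {t:.1f} ms")  # ~10 ms; real systems add
                                                   # KV reads, attention, comms
```

Real measured steps (~40ms for a 70B model) sit well above this floor because of KV-cache reads, kernel launches, and communication, but the shape of the problem is the same: the weights cross the memory bus on every serial step.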
Speculative decoding observes: the bottleneck isn’t compute, it’s the serial dependency. If we had multiple token candidates ready, we could verify them in parallel.
Concretely: the draft model cheaply generates K candidate tokens one at a time, then the main model verifies all K in a single batched forward pass and accepts the longest prefix that matches its own predictions.
The main model runs once for up to 5 tokens. If the draft’s predictions are mostly right (they often are — most tokens are easy), you get 3–5x the tokens per main-model forward pass.
Let K = number of tokens proposed per iteration, α = average acceptance rate per position.
Expected draft tokens accepted per iteration: sum over i from 1 to K of α^i = α(1 - α^K) / (1 - α). (The shorthand (1 - α^K) / (1 - α) drops the leading α and only holds for α close to 1.) Each iteration also emits one token from the verification pass itself — the correction at the first rejection, or a bonus token after a full accept — so expected tokens per iteration is 1 + α(1 - α^K) / (1 - α).
For typical α = 0.7, K = 5: ~1.94 accepted draft tokens, ~2.9 tokens per main-model pass.
For α = 0.8, K = 8: ~3.33 accepted draft tokens, ~4.3 tokens per main-model pass.
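The geometric-series math above can be checked in a few lines (a hypothetical helper, not from any serving framework):

```python
def expected_new_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per speculative iteration.

    Draft position i is accepted only if all earlier positions were,
    giving sum_{i=1..K} alpha^i accepted draft tokens. The main model's
    verification pass always contributes one more token: the correction
    at the first rejection, or the bonus token after a full accept.
    """
    accepted = sum(alpha**i for i in range(1, k + 1))
    return 1 + accepted

print(round(expected_new_tokens(0.7, 5), 2))  # ~2.94 tokens per main-model pass
print(round(expected_new_tokens(0.8, 8), 2))  # ~4.33
```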
In practice, speedup is between 1.5x and 3x on real workloads. The win depends heavily on the draft model quality and the specific content being generated.
The key insight — and why this is not “lossy generation” — is that the main model has the final say. It only accepts the draft’s tokens that match its own distribution. The output distribution is provably identical to sampling from the main model alone (and with greedy decoding, the exact token sequence is identical).
No quality loss. This isn’t a tradeoff. It’s a pure speedup.
The only thing that changes is latency. Throughput per GPU can go up or down depending on the batching interaction (more below).
The draft model needs two properties: it must be dramatically cheaper to run than the main model, and it must agree with the main model often enough that most proposals are accepted (in practice this also means a compatible tokenizer and vocabulary).
Typical combinations that work well:
| Main model | Draft model | Acceptance rate |
|---|---|---|
| Llama-3-70B | Llama-3-8B | ~65–75% |
| Llama-3-70B | Llama-3.2-1B | ~60–70% |
| Llama-3.1-405B | Llama-3.1-70B | ~70–80% |
| Mistral Large 2 | Mistral-7B | ~60–70% |
The smaller the draft, the faster it generates proposals but the lower the acceptance rate. The sweet spot depends on your workload.
Medusa and EAGLE are alternative approaches — they add extra “heads” to the main model that propose tokens without needing a separate model. Tighter integration, but requires training. vLLM supports both.
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2
vLLM manages the draft model automatically. It runs on the same GPU(s) by default.
TensorRT-LLM supports speculative decoding via the Medusa and EAGLE modes. Engine build requires extra flags:
trtllm-build --speculative_decoding_mode medusa \
--max_draft_len 5
More complex than vLLM; more tunable.
TGI supports a limited set of draft models. Check their docs for the current supported list.
Here’s where it gets interesting.
In single-request mode: speculative decoding is a pure latency win. 2–3x faster, no downside.
In high-concurrency mode: speculative decoding can hurt throughput. At large batch sizes the GPU is already compute-bound, so every rejected draft token is wasted FLOPs, and the draft model’s own forward passes compete with the batch for the same compute.
Measured on our benchmarks (Llama-3.1-70B, 4x H100):
| Concurrency | Baseline TPOT | Speculative TPOT | Latency speedup |
|---|---|---|---|
| 1 | 44ms | 18ms | 2.4x |
| 4 | 46ms | 22ms | 2.1x |
| 16 | 52ms | 38ms | 1.4x |
| 64 | 72ms | 75ms | 0.96x |
| 128 | 120ms | 145ms | 0.83x |
At low concurrency, huge wins. At high concurrency, breaks even or loses.
Practically: speculative decoding is a latency optimization, not a throughput one. Use it when latency matters more than tokens-per-dollar. Examples: interactive chat, coding assistants, voice applications.
Some systems (including recent vLLM versions) support dynamic speculation — turn it off when batches are full, on when they’re empty. Gives you best-of-both: low latency at low load, high throughput at peak.
This is increasingly the default configuration we deploy. The gateway-level signal is usually queue depth or GPU utilization.
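The gateway policy can be as simple as a threshold check (a hypothetical sketch; the signal names and thresholds are ours, not a vLLM API):

```python
def speculation_enabled(queue_depth: int, gpu_util: float,
                        max_queue: int = 8, max_util: float = 0.85) -> bool:
    """Enable speculation only when the system has headroom.

    At low load the GPU is bandwidth-bound and speculation is nearly free;
    once batches fill up, rejected draft tokens burn compute the batch
    needs. Thresholds are illustrative and should be tuned against your
    own concurrency-vs-TPOT curve.
    """
    return queue_depth < max_queue and gpu_util < max_util

print(speculation_enabled(queue_depth=2, gpu_util=0.40))   # True: headroom
print(speculation_enabled(queue_depth=40, gpu_util=0.95))  # False: saturated
```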
Things that help acceptance:
- A draft model from the same family, trained on similar data to the main model
- Greedy or low-temperature sampling
- Predictable content: code, boilerplate, structured or repetitive text

Things that hurt acceptance:
- High-temperature or deliberately creative generation
- Domains and languages the draft model saw little of
- Fine-tuning the main model without updating the draft (distribution drift)
If your acceptance rate is below 50%, speculative decoding is probably a net negative. Our threshold for deploying: ≥60% acceptance sustained on eval workload.
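Acceptance tracking is just two counters. A minimal monitoring sketch (hypothetical class, not a framework API; the 60% threshold matches the deployment bar above):

```python
class AcceptanceTracker:
    """Rolling acceptance rate: accepted draft tokens / proposed draft tokens."""

    def __init__(self, deploy_threshold: float = 0.60):
        self.proposed = 0
        self.accepted = 0
        self.deploy_threshold = deploy_threshold

    def record(self, proposed: int, accepted: int) -> None:
        self.proposed += proposed
        self.accepted += accepted

    @property
    def rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

    def worth_running(self) -> bool:
        # Below ~50% acceptance, speculation tends to be a net loss;
        # require sustained acceptance above the threshold to deploy.
        return self.rate >= self.deploy_threshold

tracker = AcceptanceTracker()
tracker.record(proposed=5, accepted=4)
tracker.record(proposed=5, accepted=3)
print(f"{tracker.rate:.0%}, deploy={tracker.worth_running()}")  # 70%, deploy=True
```

In production you would export `rate` as a gauge and alert on sustained drops, e.g. after a main-model update that the draft was not retrained for.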
Speculative decoding isn’t the only latency optimization: quantization (e.g., FP8), prefix caching, and continuous batching each attack a different bottleneck.
These compose. We regularly run FP8 + speculative decoding + prefix caching + continuous batching together. Each contributes independently.
1. Draft model quality matters. A poorly-matched draft can actually slow you down. Benchmark.
2. The draft model needs its own GPU memory. Factor it into sizing. For Llama-3-70B + 1B draft, expect ~3 GB extra per replica.
3. Acceptance monitoring. Add acceptance rate as a metric. When it drops (e.g., after a model update), you’ll want to know immediately.
4. Structured output interaction. JSON mode and tool calling work but acceptance rates can drop for structured tokens. Test your specific setup.
5. Latency variance increases. Best case: 3x faster. Worst case: same as baseline. P50 improves, but the P99-to-P50 spread widens. UX-sensitive apps may care.
Speculative decoding is the single most effective latency optimization for interactive LLM workloads on modern hardware. Turn it on for:
- Interactive chat and coding assistants
- Voice applications and anything else where time-to-first-response is the product
- Low-to-moderate concurrency deployments

Turn it off for:
- High-concurrency, throughput-oriented serving (batch jobs, offline generation)
- Workloads where measured acceptance stays below ~50–60%
vLLM makes it trivial to enable. Test on your workload. Measure acceptance and latency. Keep it in your production bag of tricks.
Exploring speculative decoding for your workload? Reach out — we’ll benchmark it with your actual traffic in a day.