
Speculative Decoding for Production LLMs

Balys Kriksciunas


LLM inference latency comes from one brutal fact: generation is autoregressive. You predict token N, then N+1, then N+2, serially. You cannot parallelize across tokens within a single request.

Speculative decoding breaks this constraint. A small “draft” model proposes K future tokens; the main model verifies all K in a single forward pass. Most proposed tokens are accepted; a few are corrected. Net result: 2–3x lower latency for the same quality.

It’s one of the most important inference optimizations of the last two years. This post covers what it is, when it helps, and how to deploy it.


The Intuition

In normal LLM generation:

  1. Model sees prompt, generates token 1 (one forward pass)
  2. Model sees prompt + token 1, generates token 2 (one forward pass)
  3. Each new token requires one full forward pass

For a 70B model, each forward pass is ~40ms. Generating 100 tokens takes 4 seconds. The GPU isn’t the bottleneck — memory bandwidth is (weights cross the memory bus every step).

Speculative decoding observes: the bottleneck isn’t compute, it’s the serial dependency. If we had multiple token candidates ready, we could verify them in parallel.

Concretely:

  1. A small draft model (e.g., a 1B model) generates 5 candidate tokens fast
  2. The main model does a single forward pass with all 5 proposed tokens
  3. For each proposed position, compare the main model’s distribution with the draft’s
  4. Accept as many tokens as “agree”; reject the first disagreement and proceed

The main model runs once to verify all 5 proposals. If the draft's predictions are mostly right (they often are, since most tokens are easy), each main-model forward pass yields several tokens, up to K+1 counting the correction or bonus token, instead of just one.
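To make the loop concrete, here is a minimal sketch of one draft-then-verify iteration using greedy acceptance (a draft token is kept when it matches the main model's argmax). The main_model and draft_model objects are assumed to be HuggingFace-style causal LMs that return logits; production engines implement this inside the scheduler rather than as user code, and use the probabilistic acceptance rule shown later instead of strict argmax matching.

```python
import torch

def speculative_step(main_model, draft_model, input_ids, k=5):
    """One draft-then-verify iteration (greedy variant, for clarity).

    Assumes HF-style causal LMs returning logits of shape
    [batch, seq, vocab]. Illustration only, not a serving engine.
    """
    prompt_len = input_ids.shape[1]

    # 1. Draft model proposes k tokens autoregressively (cheap passes).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, prompt_len:]                  # the k proposals

    # 2. Main model scores prompt + all k proposals in ONE forward pass.
    main_logits = main_model(draft_ids).logits
    # Main model's greedy prediction at each proposed position.
    main_preds = main_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)

    # 3. Accept the longest prefix where draft and main model agree,
    #    then append the main model's own token at the first mismatch.
    accepted = []
    for i in range(k):
        if proposed[0, i] == main_preds[0, i]:
            accepted.append(proposed[0, i])
        else:
            accepted.append(main_preds[0, i])             # correction token
            break
    else:
        # All k accepted: bonus token from the main model's last position.
        accepted.append(main_logits[:, -1, :].argmax(dim=-1)[0])

    return torch.cat([input_ids, torch.stack(accepted).unsqueeze(0)], dim=-1)
```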


The Math

Let K = number of tokens proposed per iteration, α = average acceptance rate per position.

Expected tokens generated per main-model forward pass, assuming each draft position is accepted independently with probability α: E[tokens] = (1 − α^(K+1)) / (1 − α). The K+1 appears because each iteration emits the accepted draft tokens plus one token from the main model itself (a correction on rejection, or a bonus token when all K are accepted).

For typical α = 0.7, K = 5: (1 − 0.7^6) / 0.3 ≈ 2.9 tokens per main-model pass.

For α = 0.8, K = 8: (1 − 0.8^9) / 0.2 ≈ 4.3 tokens per main-model pass.
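The same arithmetic as a quick sketch, with the draft model's cost folded in as a rough ratio; the 5% per-draft-pass cost is an illustrative assumption, not a measurement:

```python
def expected_tokens_per_iteration(alpha: float, k: int) -> float:
    """Expected tokens per main-model forward pass, assuming each draft
    position is accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha: float, k: int, draft_cost_ratio: float = 0.05) -> float:
    """Rough latency speedup, charging each of the k draft passes
    draft_cost_ratio of one main-model pass (an assumption for illustration)."""
    tokens = expected_tokens_per_iteration(alpha, k)
    return tokens / (1 + k * draft_cost_ratio)

print(expected_tokens_per_iteration(0.7, 5))   # ~2.94
print(expected_tokens_per_iteration(0.8, 8))   # ~4.33
print(expected_speedup(0.7, 5))                # ~2.35
```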

In practice, speedup is between 1.5x and 3x on real workloads. The win depends heavily on the draft model quality and the specific content being generated.


Why Quality Is Preserved

The key insight, and the reason this is not “lossy generation”, is that the main model has the final say. Draft tokens are accepted or rejected using a rule based on the ratio between the main model's probability and the draft's, so the output distribution is mathematically identical to what the main model would have produced alone.

No quality loss. This isn’t a tradeoff. It’s a pure speedup.
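For sampled (non-greedy) decoding, the lossless guarantee comes from the rejection rule used in the speculative sampling literature: accept the draft token with probability min(1, p/q), and on rejection resample from the normalized residual distribution. A minimal sketch of that rule, where p and q are the main and draft models' probability vectors at a single position:

```python
import torch

def accept_or_correct(p: torch.Tensor, q: torch.Tensor, x: int):
    """Rejection rule that preserves the main model's output distribution.

    p: main-model probabilities over the vocab at this position
    q: draft-model probabilities at the same position
    x: token id proposed by the draft (sampled from q, so q[x] > 0)

    Accept x with probability min(1, p[x] / q[x]); otherwise resample
    from the normalized residual max(p - q, 0). Averaged over all cases,
    the emitted token is distributed exactly according to p.
    """
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x, True                                  # draft token accepted
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, 1).item(), False  # corrected token
```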

The only thing that changes is latency. Throughput per GPU can go up or down depending on the batching interaction (more below).


Picking a Draft Model

The draft model needs two properties:

  1. Fast — much faster than the main model, or the speedup doesn’t materialize
  2. Agreeable — predicts the same tokens the main model would, as often as possible

Typical combinations that work well:

Main model          Draft model        Acceptance rate
Llama-3-70B         Llama-3-8B         ~65–75%
Llama-3-70B         Llama-3.2-1B       ~60–70%
Llama-3.1-405B      Llama-3.1-70B      ~70–80%
Mistral Large 2     Mistral-7B         ~60–70%

The smaller the draft, the faster it generates proposals but the lower the acceptance rate. The sweet spot depends on your workload.

Medusa and EAGLE are alternative approaches — they add extra “heads” to the main model that propose tokens without needing a separate model. Tighter integration, but requires training. vLLM supports both.


Production Deployment

In vLLM

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 2

vLLM manages the draft model automatically. It runs on the same GPU(s) by default.
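Once the server is up, clients use the normal OpenAI-compatible endpoint; speculative decoding is invisible to them apart from the latency. A quick smoke test (the base_url and model name mirror the launch command above):

```python
from openai import OpenAI

# Speculative decoding is transparent to clients: same API, same outputs,
# lower time per output token.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```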

In TensorRT-LLM

TensorRT-LLM supports speculative decoding via the Medusa and EAGLE modes. Engine build requires extra flags:

trtllm-build --speculative_decoding_mode medusa \
             --max_draft_len 5

More complex than vLLM; more tunable.

In TGI

TGI supports a limited set of draft models. Check their docs for the current supported list.


The Batching Interaction

Here’s where it gets interesting.

In single-request mode: speculative decoding is a pure latency win. 2–3x faster, no downside.

In high-concurrency mode: speculative decoding can hurt throughput. Why? With large batches the main model's forward passes are compute-bound rather than memory-bound, so verifying K extra positions per sequence costs real FLOPs, every rejected token is wasted work, and the draft model competes for the same GPU.

Measured on our benchmarks (Llama-3.1-70B, 4x H100):

Concurrency    Baseline TPOT    Speculative TPOT    Latency speedup
1              44ms             18ms                2.4x
4              46ms             22ms                2.1x
16             52ms             38ms                1.4x
64             72ms             75ms                0.96x
128            120ms            145ms               0.83x

At low concurrency, huge wins. At high concurrency, breaks even or loses.

Practically: speculative decoding is a latency optimization, not a throughput one. Use it when latency matters more than tokens-per-dollar. Examples: interactive chat, coding assistants, voice applications.


Dynamic Speculative Decoding

Some systems (including recent vLLM versions) support dynamic speculation: turn speculation off when batches are full and back on when load is light. Gives you best-of-both: low latency at low load, high throughput at peak.

This is increasingly the default configuration we deploy. The gateway-level signal is usually queue depth or GPU utilization.
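A minimal sketch of what the gateway-side toggle can look like; the pool names, threshold, and get_queue_depth() helper are hypothetical placeholders, not part of any particular serving stack:

```python
# Hypothetical gateway-side routing: send traffic to a speculative-decoding
# replica pool while load is light, and fall back to the plain pool when the
# queue builds up. All names and the threshold are illustrative.
SPEC_POOL = "llama70b-spec"      # replicas launched with --speculative-model
PLAIN_POOL = "llama70b-plain"    # replicas without speculation
QUEUE_DEPTH_THRESHOLD = 8        # tune against your own latency/throughput curve

def pick_pool(get_queue_depth) -> str:
    """Route to the speculative pool only when its replicas are lightly loaded."""
    if get_queue_depth(SPEC_POOL) < QUEUE_DEPTH_THRESHOLD:
        return SPEC_POOL
    return PLAIN_POOL
```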


What Affects Acceptance Rate

Things that help acceptance:

  - A draft from the same model family, sharing the tokenizer and training data
  - Greedy or low-temperature decoding
  - Predictable content: code, boilerplate, extraction, summarization

Things that hurt acceptance:

  - High temperature and aggressive sampling settings
  - Creative, open-ended generation
  - Content far from the draft model's training distribution

If your acceptance rate is below 50%, speculative decoding is probably a net negative. Our threshold for deploying: ≥60% acceptance sustained on eval workload.
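Before committing to a draft model, it's worth estimating agreement offline. One rough proxy, assuming both checkpoints load with Hugging Face transformers and share a tokenizer: teacher-force both models over eval texts and count how often their greedy next-token choices match. This ignores the sampling-based acceptance rule, so treat it as a sanity check rather than the exact production acceptance rate:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Offline proxy for acceptance rate: greedy top-1 agreement between draft
# and main model over eval texts. Model names are examples; both models
# must share a tokenizer for the comparison to be meaningful.
MAIN = "meta-llama/Llama-3.1-70B-Instruct"
DRAFT = "meta-llama/Llama-3.2-1B-Instruct"

tok = AutoTokenizer.from_pretrained(DRAFT)
main = AutoModelForCausalLM.from_pretrained(MAIN, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(DRAFT, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def agreement_rate(texts):
    agree, total = 0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids.to(main.device)
        main_pred = main(ids).logits[:, :-1].argmax(-1)     # next-token prediction per position
        draft_pred = draft(ids.to(draft.device)).logits[:, :-1].argmax(-1).to(main_pred.device)
        agree += (main_pred == draft_pred).sum().item()
        total += main_pred.numel()
    return agree / total

print(agreement_rate(["def quicksort(arr):", "The capital of France is"]))
```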


Alternatives And Complements

Speculative decoding isn’t the only latency optimization:

  - Quantization (e.g., FP8): shrinks weights, which speeds up the memory-bound decode step
  - Prefix caching: avoids recomputing shared prompt prefixes
  - Continuous batching: keeps the GPU saturated across concurrent requests

These compose. We regularly run FP8 + speculative decoding + prefix caching + continuous batching together. Each contributes independently.
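As a rough illustration of how they stack in one vLLM launch (flag names follow recent vLLM versions and may vary with yours; continuous batching is on by default):

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --enable-prefix-caching \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 2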


When Not To Deploy It

  - Throughput-bound, high-concurrency serving where tokens-per-dollar matters more than latency
  - Offline or batch workloads (evals, bulk generation) that already saturate the GPU
  - Workloads whose measured acceptance rate stays below ~50–60%

Operational Notes

1. Draft model quality matters. A poorly-matched draft can actually slow you down. Benchmark.

2. The draft model needs its own GPU memory. Factor it into sizing. For Llama-3-70B + a 1B draft, expect ~3 GB extra per replica (roughly 1.2B params × 2 bytes in BF16 ≈ 2.5 GB, plus its KV cache).

3. Acceptance monitoring. Add acceptance rate as a metric. When it drops (e.g., after a model update), you’ll want to know immediately.

4. Structured output interaction. JSON mode and tool calling work but acceptance rates can drop for structured tokens. Test your specific setup.

5. Latency variance increases. Best case: 3x faster. Worst case: same as baseline. P50 gets better, but P99 - P50 widens. UX-sensitive apps may care.


Summary

Speculative decoding is the single most effective latency optimization for interactive LLM workloads on modern hardware. Turn it on for:

  - Interactive chat, coding assistants, and voice applications
  - Latency-sensitive endpoints running at low-to-moderate concurrency

Turn it off for:

  - Throughput-bound, high-concurrency serving where tokens-per-dollar dominates
  - Workloads whose measured acceptance rate stays below your threshold

vLLM makes it trivial to enable. Test on your workload. Measure acceptance and latency. Keep it in your production bag of tricks.


Exploring speculative decoding for your workload? Reach out — we’ll benchmark it with your actual traffic in a day.
