Speculative Decoding for LLM Serving
A small draft model proposes tokens; the main model verifies them in a single pass. How speculative decoding cuts latency 2–3x with zero quality loss, and when it backfires at high concurrency.
LLM inference latency comes from one brutal fact: generation is autoregressive. You predict token N, then N+1, then N+2, serially. You cannot parallelize across tokens within a single request.
Speculative decoding breaks this constraint. A small “draft” model proposes K future tokens; the main model verifies all K in a single forward pass. Most proposed tokens are accepted; a few are corrected. Net result: 2–3x lower latency for the same quality.
It’s one of the most important inference optimizations of the last two years. This post covers what it is, when it helps, and how to deploy it.
In normal LLM generation, every output token requires one full forward pass through the model, conditioned on everything generated so far.
For a 70B model, each forward pass is ~40ms. Generating 100 tokens takes 4 seconds. The GPU isn’t the bottleneck — memory bandwidth is (weights cross the memory bus every step).
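A back-of-the-envelope calculation shows why bandwidth, not compute, sets the floor on decode latency. The numbers below are illustrative (FP16 weights, 4 GPUs, ~3.35 TB/s HBM each), not a benchmark:

```python
def min_step_ms(params_b: float, bytes_per_param: int,
                num_gpus: int, hbm_tb_s: float) -> float:
    """Lower bound on one decode step for a memory-bandwidth-bound model.

    Every step, each GPU must stream its shard of the weights from HBM.
    Illustrative model: tensor parallelism splits the weight read evenly.
    """
    weight_gb = params_b * bytes_per_param        # total weight bytes, in GB
    per_gpu_gb = weight_gb / num_gpus             # each GPU reads its shard
    return per_gpu_gb / (hbm_tb_s * 1000) * 1000  # GB / (GB/s) -> seconds -> ms

# 70B params, FP16 (2 bytes), 4x GPUs at ~3.35 TB/s HBM each.
t = min_step_ms(params_b=70, bytes_per_param=2, num_gpus=4, hbm_tb_s=3.35)
print(f"lower bound per decode step: {t:.1f} ms")  # ~10 ms; real systems add
                                                   # KV reads, attention, comms
```

Real measured steps (~40ms for a 70B model) sit well above this floor because of KV-cache reads, kernel launches, and communication, but the shape of the problem is the same: the weights cross the memory bus on every serial step.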
Speculative decoding observes: the bottleneck isn’t compute, it’s the serial dependency. If we had multiple token candidates ready, we could verify them in parallel.
Concretely: the draft model cheaply generates K candidate tokens one at a time, then the main model verifies all K in a single batched forward pass and accepts the longest prefix that matches its own predictions.
The main model runs once for up to 5 tokens. If the draft’s predictions are mostly right (they often are — most tokens are easy), you get 3–5x the tokens per main-model forward pass.
Let K = number of tokens proposed per iteration, α = average acceptance rate per position.
Expected draft tokens accepted per iteration: sum over i from 1 to K of α^i = α(1 - α^K) / (1 - α). (The shorthand (1 - α^K) / (1 - α) drops the leading α and only holds for α close to 1.) Each iteration also emits one token from the verification pass itself — the correction at the first rejection, or a bonus token after a full accept — so expected tokens per iteration is 1 + α(1 - α^K) / (1 - α).
For typical α = 0.7, K = 5: ~1.94 accepted draft tokens, ~2.9 tokens per main-model pass.
For α = 0.8, K = 8: ~3.33 accepted draft tokens, ~4.3 tokens per main-model pass.
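The geometric-series math above can be checked in a few lines (a hypothetical helper, not from any serving framework):

```python
def expected_new_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per speculative iteration.

    Draft position i is accepted only if all earlier positions were,
    giving sum_{i=1..K} alpha^i accepted draft tokens. The main model's
    verification pass always contributes one more token: the correction
    at the first rejection, or the bonus token after a full accept.
    """
    accepted = sum(alpha**i for i in range(1, k + 1))
    return 1 + accepted

print(round(expected_new_tokens(0.7, 5), 2))  # ~2.94 tokens per main-model pass
print(round(expected_new_tokens(0.8, 8), 2))  # ~4.33
```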
In practice, speedup is between 1.5x and 3x on real workloads. The win depends heavily on the draft model quality and the specific content being generated.
The key insight — and why this is not “lossy generation” — is that the main model has the final say. It only accepts the draft’s tokens that match its own distribution. The output distribution is provably identical to sampling from the main model alone (and with greedy decoding, the exact token sequence is identical).
No quality loss. This isn’t a tradeoff. It’s a pure speedup.
The only thing that changes is latency. Throughput per GPU can go up or down depending on the batching interaction (more below).
The draft model needs two properties: it must be dramatically cheaper to run than the main model, and it must agree with the main model often enough that most proposals are accepted (in practice this also means a compatible tokenizer and vocabulary).
Typical combinations that work well:
| Main model | Draft model | Acceptance rate |
|---|---|---|
| Llama-3-70B | Llama-3-8B | ~65–75% |
| Llama-3-70B | Llama-3.2-1B | ~60–70% |
| Llama-3.1-405B | Llama-3.1-70B | ~70–80% |
| Mistral Large 2 | Mistral-7B | ~60–70% |
The smaller the draft, the faster it generates proposals but the lower the acceptance rate. The sweet spot depends on your workload.
Medusa and EAGLE are alternative approaches — they add extra “heads” to the main model that propose tokens without needing a separate model. Tighter integration, but requires training. vLLM supports both.
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2
vLLM manages the draft model automatically. It runs on the same GPU(s) by default.
TensorRT-LLM supports speculative decoding via the Medusa and EAGLE modes. Engine build requires extra flags:
trtllm-build --speculative_decoding_mode medusa \
--max_draft_len 5
More complex than vLLM; more tunable.
TGI supports a limited set of draft models. Check their docs for the current supported list.
Here’s where it gets interesting.
In single-request mode: speculative decoding is a pure latency win. 2–3x faster, no downside.
In high-concurrency mode: speculative decoding can hurt throughput. At large batch sizes the GPU is already compute-bound, so every rejected draft token is wasted FLOPs, and the draft model’s own forward passes compete with the batch for the same compute.
Measured on our benchmarks (Llama-3.1-70B, 4x H100):
| Concurrency | Baseline TPOT | Speculative TPOT | Latency speedup |
|---|---|---|---|
| 1 | 44ms | 18ms | 2.4x |
| 4 | 46ms | 22ms | 2.1x |
| 16 | 52ms | 38ms | 1.4x |
| 64 | 72ms | 75ms | 0.96x |
| 128 | 120ms | 145ms | 0.83x |
At low concurrency, huge wins. At high concurrency, breaks even or loses.
Practically: speculative decoding is a latency optimization, not a throughput one. Use it when latency matters more than tokens-per-dollar. Examples: interactive chat, coding assistants, voice applications.
Some systems (including recent vLLM versions) support dynamic speculation — turn it off when batches are full, on when they’re empty. Gives you best-of-both: low latency at low load, high throughput at peak.
This is increasingly the default configuration we deploy. The gateway-level signal is usually queue depth or GPU utilization.
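The gateway policy can be as simple as a threshold check (a hypothetical sketch; the signal names and thresholds are ours, not a vLLM API):

```python
def speculation_enabled(queue_depth: int, gpu_util: float,
                        max_queue: int = 8, max_util: float = 0.85) -> bool:
    """Enable speculation only when the system has headroom.

    At low load the GPU is bandwidth-bound and speculation is nearly free;
    once batches fill up, rejected draft tokens burn compute the batch
    needs. Thresholds are illustrative and should be tuned against your
    own concurrency-vs-TPOT curve.
    """
    return queue_depth < max_queue and gpu_util < max_util

print(speculation_enabled(queue_depth=2, gpu_util=0.40))   # True: headroom
print(speculation_enabled(queue_depth=40, gpu_util=0.95))  # False: saturated
```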
Things that help acceptance:
- A draft model from the same family, trained on similar data to the main model
- Greedy or low-temperature sampling
- Predictable content: code, boilerplate, structured or repetitive text

Things that hurt acceptance:
- High-temperature or deliberately creative generation
- Domains and languages the draft model saw little of
- Fine-tuning the main model without updating the draft (distribution drift)
If your acceptance rate is below 50%, speculative decoding is probably a net negative. Our threshold for deploying: ≥60% acceptance sustained on eval workload.
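Acceptance tracking is just two counters. A minimal monitoring sketch (hypothetical class, not a framework API; the 60% threshold matches the deployment bar above):

```python
class AcceptanceTracker:
    """Rolling acceptance rate: accepted draft tokens / proposed draft tokens."""

    def __init__(self, deploy_threshold: float = 0.60):
        self.proposed = 0
        self.accepted = 0
        self.deploy_threshold = deploy_threshold

    def record(self, proposed: int, accepted: int) -> None:
        self.proposed += proposed
        self.accepted += accepted

    @property
    def rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

    def worth_running(self) -> bool:
        # Below ~50% acceptance, speculation tends to be a net loss;
        # require sustained acceptance above the threshold to deploy.
        return self.rate >= self.deploy_threshold

tracker = AcceptanceTracker()
tracker.record(proposed=5, accepted=4)
tracker.record(proposed=5, accepted=3)
print(f"{tracker.rate:.0%}, deploy={tracker.worth_running()}")  # 70%, deploy=True
```

In production you would export `rate` as a gauge and alert on sustained drops, e.g. after a main-model update that the draft was not retrained for.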
Speculative decoding isn’t the only latency optimization: quantization (e.g., FP8), prefix caching, and continuous batching each attack a different bottleneck.
These compose. We regularly run FP8 + speculative decoding + prefix caching + continuous batching together. Each contributes independently.
1. Draft model quality matters. A poorly-matched draft can actually slow you down. Benchmark.
2. The draft model needs its own GPU memory. Factor it into sizing. For Llama-3-70B + 1B draft, expect ~3 GB extra per replica.
3. Acceptance monitoring. Add acceptance rate as a metric. When it drops (e.g., after a model update), you’ll want to know immediately.
4. Structured output interaction. JSON mode and tool calling work but acceptance rates can drop for structured tokens. Test your specific setup.
5. Latency variance increases. Best case: 3x faster. Worst case: same as baseline. P50 improves, but the P99-to-P50 spread widens. UX-sensitive apps may care.
Speculative decoding is the single most effective latency optimization for interactive LLM workloads on modern hardware. Turn it on for:
- Interactive chat and coding assistants
- Voice applications and anything else where time-to-first-response is the product
- Low-to-moderate concurrency deployments

Turn it off for:
- High-concurrency, throughput-oriented serving (batch jobs, offline generation)
- Workloads where measured acceptance stays below ~50–60%
vLLM makes it trivial to enable. Test on your workload. Measure acceptance and latency. Keep it in your production bag of tricks.
Exploring speculative decoding for your workload? Reach out — we’ll benchmark it with your actual traffic in a day.