KV Cache Optimization Techniques for LLM Serving
The KV cache dominates memory and cost in LLM serving. Paging, compressing, offloading, and sharing it can let you serve 2–4x more concurrent requests.
Ask an inference engineer what limits concurrency on their GPU and the answer is almost always “KV cache.” Model weights are fixed. Activations are transient. The KV cache — the stored attention keys and values for every token in every active request — grows linearly with concurrent requests and context length, and eats the remaining GPU memory.
If you can fit more KV cache, you can serve more concurrent requests. That’s the whole game. This post surveys the techniques that make that possible in 2025.
The Baseline: How Much KV Cache Do You Actually Need?
For a transformer model:
KV cache per token = 2 (K+V) × num_layers × num_kv_heads × head_dim × bytes_per_element
For Llama-3-70B:
- 80 layers, 8 KV heads (GQA), 128 head_dim
- FP16: 2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KiB per token
For 8K context and 128 concurrent requests: 8,192 × 320 KiB × 128 = 320 GiB (about 344 GB). That’s more than any single GPU holds.
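The arithmetic above is easy to get wrong by a factor of two or eight, so here is a minimal sketch of it as code. The function names are made up for this post, and the head count is taken to be the number of KV heads, as in the Llama-3-70B numbers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V, for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_bytes_total(per_token: int, context_len: int, num_requests: int) -> int:
    """Worst case: every request fills its whole context window."""
    return per_token * context_len * num_requests


GIB = 1024 ** 3

# Llama-3-70B shape from the text: 80 layers, 8 KV heads, head_dim 128, FP16.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
total = kv_bytes_total(per_token, context_len=8192, num_requests=128)

print(per_token)       # 327680 bytes, i.e. 320 KiB
print(total / GIB)     # 320.0 (GiB)
```

Swap in your own model's layer count, KV head count, and head dimension to get a first-order budget before touching any server flags.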
Strategies to fit more in the same memory:
- Reduce per-token size (quantization, GQA/MQA)
- Reduce fragmentation (PagedAttention)
- Reduce duplication (prefix sharing)
- Move some off-GPU (offloading)
- Evict or compress cold blocks
Let’s cover each.
1. Architectural Reductions: GQA and MQA
Multi-Head Attention (MHA) — every head has its own K and V. Max quality, max cache.
Grouped-Query Attention (GQA) — K and V are shared across groups of query heads. 4–8x smaller KV cache vs MHA, minor quality loss.
Multi-Query Attention (MQA) — single K and V shared across all query heads. Smallest cache, biggest quality hit.
Llama-3-70B uses GQA with 8 KV heads for 64 query heads — an 8x KV cache reduction over pure MHA. Mistral, Qwen, and most modern models also use GQA.
If you’re training a new model from scratch, use GQA. If you’re serving an existing one, its architecture is fixed — check the config to know what you’re dealing with.
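Checking the config can be a few lines of code. The sketch below assumes the key names used by Llama-style Hugging Face `config.json` files (`num_attention_heads`, `num_key_value_heads`); other model families may name these fields differently, and `kv_sharing` is a hypothetical helper, not a library function.

```python
def kv_sharing(config: dict) -> tuple[str, int]:
    """Classify the attention layout and return (kind, query heads per KV head)."""
    q_heads = config["num_attention_heads"]
    # Assumption: a config without num_key_value_heads is plain MHA.
    kv_heads = config.get("num_key_value_heads", q_heads)
    if kv_heads == q_heads:
        kind = "MHA"
    elif kv_heads == 1:
        kind = "MQA"
    else:
        kind = "GQA"
    return kind, q_heads // kv_heads


llama3_70b = {"num_attention_heads": 64, "num_key_value_heads": 8}
print(kv_sharing(llama3_70b))   # ('GQA', 8)
```

The second value is the KV cache reduction relative to an MHA model with the same number of query heads.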
2. KV Cache Quantization
Storing KV cache in lower precision cuts its size proportionally.
FP8 KV cache:
- 2x smaller than FP16
- Negligible quality loss on most workloads
- Supported in vLLM, TGI, TRT-LLM
vllm serve ... --kv-cache-dtype fp8
INT4 KV cache:
- 4x smaller than FP16
- Small but measurable quality loss
- Supported by some servers (for example SGLang’s KV cache quantization) and by custom implementations
FP4 KV cache:
- 4x smaller than FP16 (the same ratio as INT4, before scale-factor overhead), with native FP4 support on B200 hardware
- Early 2025, quality on long contexts still being validated
For production on H100: FP8 KV cache is essentially free savings. Turn it on.
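To make the size-versus-precision trade concrete, here is a toy sketch of symmetric 4-bit quantization over one small group of values. It is an illustration only: it is not how vLLM, SGLang, or any other server implements FP8 or INT4 KV cache, and real implementations choose scales per head, channel, or group and use hardware number formats.

```python
def quantize_int4(values: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization of one group with a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0   # map [-max_abs, max_abs] onto [-7, 7]
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate values from 4-bit codes and the group scale."""
    return [c * scale for c in codes]


group = [0.12, -0.80, 0.33, 0.05, -0.41, 0.77, -0.02, 0.60]
codes, scale = quantize_int4(group)
restored = dequantize(codes, scale)

# Rounding error per value is at most half a quantization step.
worst_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(worst_error, 4))
```

Each value now needs 4 bits plus a share of one scale per group, which is where the roughly 4x saving over FP16 comes from. The error bound grows with the largest value in the group, which is why outliers matter so much for low-bit KV cache.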
3. PagedAttention: Eliminating Fragmentation
Covered in depth in PagedAttention Explained. The short version: traditional allocators reserve worst-case contiguous blocks per request, wasting 40–60% of KV memory. PagedAttention uses fixed-size blocks (typically 16 tokens each) that any request can use.
In practice: KV memory utilization goes from ~50% (traditional) to ~92% (PagedAttention). Nearly 2x the concurrent requests from the same hardware.
Every modern inference server supports this. If yours doesn’t, switch.
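For intuition, the bookkeeping behind paging can be sketched in a few dozen lines. This is a toy allocator with made-up names, not vLLM's block manager: it only tracks which fixed-size blocks belong to which request and takes a new block when the last one is full.

```python
class PagedKVAllocator:
    """Toy block allocator: each request owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:   # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must evict or preempt")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(17):
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]))   # 2 blocks for 17 tokens
```

The only waste is the unused tail of each request's last block, at most `block_size - 1` tokens, instead of a whole worst-case reservation.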
4. Prefix Caching: Sharing KV Across Requests
If many requests share a prefix (system prompt, retrieved documents, few-shot examples), the KV cache for that prefix can be computed once and pointed to by multiple requests.
Gains are workload-dependent but often large:
| Workload | Prefix share | Speedup |
|---|---|---|
| RAG with 2K system prompt | 100% of requests | 1.5–2x |
| Coding assistant with docs | ~90% | 1.4x |
| Open chat | Variable | 1.1–1.3x |
| Agent with tool schemas | 100% | 1.6–2x |
Enable on vLLM:
vllm serve ... --enable-prefix-caching
Combined with PagedAttention, this is essentially free for most production workloads.
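One common way to find shared prefixes at block granularity is to give each block a key that depends on its own tokens and on everything before it. The sketch below shows that idea with hypothetical names; it is not taken from any server's source, and the block size of 16 is just the typical value mentioned earlier.

```python
import hashlib

BLOCK_SIZE = 16


def block_keys(token_ids: list[int]) -> list[str]:
    """One key per full block; each key also covers every block before it."""
    keys, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        keys.append(parent)
    return keys


class PrefixCache:
    def __init__(self) -> None:
        self.cached: set[str] = set()

    def admit(self, token_ids: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        keys = block_keys(token_ids)
        hits = 0
        for key in keys:
            if key not in self.cached:
                break
            hits += 1
        self.cached.update(keys)
        return hits * BLOCK_SIZE


system_prompt = list(range(40))   # stand-in token ids for a shared 40-token prompt
cache = PrefixCache()
first = cache.admit(system_prompt + [900, 901, 902])
second = cache.admit(system_prompt + [700, 701, 702, 703])
print(first, second)              # 0 32
```

Only full blocks are shared here, so the 40-token prompt yields 32 reused tokens. That rounding is one reason prefix lengths that align with the block size cache slightly better.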
5. SGLang’s RadixAttention
SGLang extends prefix caching with a radix tree over all seen prefixes. Unlike linear prefix caching, RadixAttention handles prefixes that branch: system prompt → few-shot example A → user query, and system prompt → few-shot example B → user query share cache up to the branch point.
For agent workloads with heavy branching (planning trees, tool-calling loops), this can add another 1.3–2x on top of linear prefix caching.
If your workload has heavy prefix branching, consider SGLang.
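The branching behaviour is easiest to see with a small prefix tree. The sketch below is a plain token-level trie with made-up names, not SGLang's implementation; a real radix tree would merge single-child chains into one edge and attach KV block references and eviction metadata to the nodes.

```python
class PrefixTree:
    """Token-level trie; a real radix tree would merge single-child chains."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, token_ids: list[int]) -> None:
        """Record a sequence whose KV cache is now available."""
        node = self.root
        for token in token_ids:
            node = node.setdefault(token, {})

    def match_len(self, token_ids: list[int]) -> int:
        """Length of the longest cached prefix of token_ids."""
        node, matched = self.root, 0
        for token in token_ids:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


SYSTEM, SHOT_A, SHOT_B = [1, 2, 3], [10, 11], [20, 21]
tree = PrefixTree()
tree.insert(SYSTEM + SHOT_A + [100])
tree.insert(SYSTEM + SHOT_B + [200])

print(tree.match_len(SYSTEM + SHOT_A + [101]))   # 5: system prompt plus example A
print(tree.match_len(SYSTEM + [30]))             # 3: system prompt only
```

Both branches stay cached at once, so a new query reuses whichever branch it follows, up to the point where it diverges.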
6. KV Cache Offloading
When GPU memory fills, older KV blocks can be paged to CPU memory or even NVMe, fetched back when needed.
Offload to CPU:
- 10–50x slower than GPU memory, but far bigger (hundreds of GB)
- vLLM supports CPU offload; SGLang and TGI have partial support
- Best for workloads with long idle periods between turns (chat apps)
Offload to disk (NVMe):
- Another 10x slower than CPU memory
- Used for very long context or multi-day conversations
- Research-stage; emerging in production systems
Offloading trades latency for capacity. Works best when the “hot” working set fits in GPU while the full history can live elsewhere.
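A two-tier store captures the idea. This is a minimal sketch with hypothetical names, using plain dictionaries to stand in for GPU and CPU memory; real systems move blocks asynchronously and overlap transfers with compute.

```python
from collections import OrderedDict


class TieredKVStore:
    """GPU tier kept in LRU order; overflow spills to a CPU tier instead of being dropped."""

    def __init__(self, gpu_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block_id -> data, least recently used first
        self.cpu = {}              # block_id -> data

    def put(self, block_id: str, data: bytes) -> None:
        self.cpu.pop(block_id, None)
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_data = self.gpu.popitem(last=False)   # least recently used
            self.cpu[cold_id] = cold_data

    def get(self, block_id: str) -> bytes:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.cpu[block_id]   # KeyError if the block was never stored
        self.put(block_id, data)    # promote back to the GPU tier (the slow path)
        return data


store = TieredKVStore(gpu_capacity=2)
for name in ("turn-1", "turn-2", "turn-3"):
    store.put(name, name.encode())
print(sorted(store.cpu))            # ['turn-1']
```

Fetching `turn-1` again promotes it and pushes the coldest GPU block down a tier, which is the latency-for-capacity trade in miniature.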
7. Eviction Policies
When memory fills and you don’t want to offload, you evict. Options:
- LRU: evict the least-recently-used request
- Preempt and recompute: drop a request’s KV cache; re-prefill when it resumes
- Priority-based: keep premium users’ requests, evict free-tier first
vLLM’s --preemption-mode recompute is the default in recent versions. It drops blocks and recomputes when the request resumes. Simpler and usually faster than swap-to-CPU.
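The first and third options combine naturally: evict by priority first, then by recency. A minimal sketch follows; the `RunningRequest` fields are illustrative and do not mirror any server's scheduler.

```python
from dataclasses import dataclass


@dataclass
class RunningRequest:
    request_id: str
    priority: int       # higher = more important (e.g. premium tier)
    last_used: float    # timestamp of the most recent decode step


def pick_victim(running: list[RunningRequest]) -> RunningRequest:
    """Lowest priority goes first; ties are broken by least recent use."""
    return min(running, key=lambda r: (r.priority, r.last_used))


running = [
    RunningRequest("premium-1", priority=2, last_used=10.0),
    RunningRequest("free-1", priority=1, last_used=42.0),
    RunningRequest("free-2", priority=1, last_used=17.0),
]
victim = pick_victim(running)
print(victim.request_id)   # free-2
```

Note that `premium-1` is the least recently used request overall but survives, because priority is compared before recency.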
8. Sliding-Window and Chunked Attention
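Some models (Mistral, Gemma, Qwen) were trained with sliding-window attention. Layers that use it attend only to the last N tokens, so their KV cache is bounded regardless of context length. Models that interleave sliding-window and full-attention layers get that bound only on the sliding-window layers.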
At serving time, you can also chunk attention artificially, at the cost of some quality for very long contexts. Useful when you need to fit 128K+ context in limited memory.
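The bound is simple to state in code. The sketch below uses illustrative names and models one sliding-window layer as a fixed-length queue; it says nothing about how a given server lays out the memory.

```python
from collections import deque


def cached_tokens(context_len: int, window: int) -> int:
    """Tokens a sliding-window layer keeps, however long the context grows."""
    return min(context_len, window)


class SlidingKV:
    """Per-layer KV store that silently drops entries older than the window."""

    def __init__(self, window: int) -> None:
        self.entries = deque(maxlen=window)

    def append(self, position: int, kv: tuple) -> None:
        self.entries.append((position, kv))


layer = SlidingKV(window=4)
for pos in range(10):
    layer.append(pos, ("k", "v"))
positions = [p for p, _ in layer.entries]

print(positions)                      # [6, 7, 8, 9]
print(cached_tokens(131072, 4096))    # 4096
```

With a 4,096-token window, a 128K-token context costs the same KV memory in that layer as a 4K one.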
9. Disaggregated Prefill and Decode
Prefill (processing the prompt) and decode (generating tokens) have different resource profiles. Separating them onto different node pools lets each be optimized:
- Prefill nodes: compute-bound, smaller KV needed
- Decode nodes: bandwidth-bound, larger KV cache
KV cache is transferred from prefill to decode when prompt processing finishes. See Disaggregated Inference.
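The handoff is not free, and a quick estimate shows why the interconnect matters. The sketch below reuses the 320 KiB per token figure for Llama-3-70B at FP16; the 50 GB/s link speed is a placeholder assumption, not a measurement, so substitute your own.

```python
def kv_transfer_seconds(prompt_tokens: int, bytes_per_token: int,
                        link_bytes_per_second: float) -> float:
    """Time to ship one prompt's KV cache from a prefill node to a decode node."""
    return prompt_tokens * bytes_per_token / link_bytes_per_second


BYTES_PER_TOKEN = 327_680    # Llama-3-70B, FP16, from the per-token formula
ASSUMED_LINK = 50e9          # hypothetical 50 GB/s link; substitute your own

payload_gib = 4096 * BYTES_PER_TOKEN / 1024**3
seconds = kv_transfer_seconds(4096, BYTES_PER_TOKEN, ASSUMED_LINK)

print(payload_gib)           # 1.25 (GiB for a 4K-token prompt)
print(round(seconds, 4))     # 0.0268
```

The estimate ignores protocol overhead and any overlap with compute, so treat it as a floor on transfer time, not a prediction.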
10. Compression
Research techniques compress KV cache beyond quantization:
- H2O: keeps “heavy hitter” tokens, evicts others
- StreamingLLM: keeps only the start tokens and a recent window
- KIVI: 2-bit KV quantization with calibrated grouping
- CacheGen: compresses at the block level with context-aware policies
These are mostly 2024–2025 research. Some are in vLLM or SGLang experimental branches. Worth tracking; not yet table stakes.
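As an example of how simple some of these policies are at their core, here is a sketch of the "start tokens plus recent window" selection that StreamingLLM describes. The function name and default sizes are assumptions for illustration, and the real method also adjusts positional encoding, which is omitted here.

```python
def kept_positions(seq_len: int, num_sink: int = 4, window: int = 1024) -> list[int]:
    """Indices retained by a sink-plus-recent-window policy."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))          # nothing to evict yet
    sinks = list(range(num_sink))            # the first few tokens are always kept
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


kept = kept_positions(seq_len=10, num_sink=2, window=3)
print(kept)   # [0, 1, 7, 8, 9]
```

The cache size is capped at `num_sink + window` entries per layer no matter how long generation runs, at the cost of forgetting everything in between.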
Tuning KV Cache For Your Workload
Three knobs in any good inference server:
--gpu-memory-utilization: the fraction of each GPU’s memory the server may use in total. Weights and activation workspace come out of that budget first, and the KV cache gets what is left. Push to 0.92–0.95 on dedicated nodes.
--max-num-seqs: max concurrent active requests. Higher = more concurrency, less KV per request, potential for eviction. Tune based on acceptable eviction rate.
--max-model-len: max context length you’ll allow. Smaller = more slots in the same KV budget. Set to actual production max, not theoretical.
Example for production Llama-3.1-70B on 2x H100 (TP=2):
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.94 \
--max-num-seqs 256 \
--max-model-len 8192 \
--enable-prefix-caching
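That configuration gives us ~250 concurrent active requests under an 8K context cap, sustained on 2x H100s. That figure depends on typical requests being much shorter than the cap: 250 full 8K contexts at FP8 would need about 312 GiB of KV cache by the per-token formula. Without FP8 KV cache, it would be ~130.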
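A rough budget check for a configuration like this one can be done on paper or in a few lines. The sketch below is back-of-the-envelope only: the weight size (about one byte per parameter at FP8) and especially the activation reserve are assumptions, and the server's own startup log is the authoritative source for the real KV block count.

```python
def kv_token_budget(total_gpu_bytes: float, gpu_memory_utilization: float,
                    weight_bytes: float, activation_reserve_bytes: float,
                    kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights and activation headroom."""
    usable = total_gpu_bytes * gpu_memory_utilization
    kv_bytes = usable - weight_bytes - activation_reserve_bytes
    return max(0, int(kv_bytes // kv_bytes_per_token))


GB = 1e9
budget = kv_token_budget(
    total_gpu_bytes=2 * 80 * GB,        # two 80 GB H100s
    gpu_memory_utilization=0.94,
    weight_bytes=70 * GB,               # assumption: ~1 byte per parameter at FP8
    activation_reserve_bytes=6 * GB,    # assumption: placeholder, measure on your stack
    kv_bytes_per_token=163_840,         # 320 KiB per token halved by FP8 KV cache
)
full_length_slots = budget // 8192      # requests that could each hold a full 8K context

print(budget, full_length_slots)
```

Under these assumptions the cache holds a few hundred thousand tokens in total. Concurrency well above `full_length_slots` is therefore only sustainable when most requests are far shorter than `--max-model-len`, which is why that knob should track your real traffic.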
Measuring Success
Three metrics to watch:
- KV cache utilization: how full your KV cache is on average. vLLM exports this. >85% means you’re pushing hard; >95% sustained means you’re eviction-thrashing.
- Preemption rate: how often active requests get evicted. >1% means your concurrency cap is too high or you need more memory.
- TTFT vs queue depth correlation: if TTFT spikes when queue depth rises, you’re capacity-bound on KV, not GPU compute.
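The first two thresholds are easy to turn into an alerting rule. The sketch below takes plain fractions as input and leaves it to you to wire in your server's metrics; the function name and message strings are made up for this post.

```python
def kv_health(avg_utilization: float, preemption_rate: float) -> list[str]:
    """Flag the warning signs named above; inputs are fractions in [0, 1]."""
    warnings = []
    if avg_utilization > 0.95:
        warnings.append("sustained >95% utilization: likely eviction thrashing")
    elif avg_utilization > 0.85:
        warnings.append(">85% utilization: running close to the limit")
    if preemption_rate > 0.01:
        warnings.append(">1% preemptions: lower the concurrency cap or add memory")
    return warnings


print(kv_health(avg_utilization=0.97, preemption_rate=0.02))
print(kv_health(avg_utilization=0.60, preemption_rate=0.0))   # []
```

Feed it averages over a window of at least several minutes, since short utilization spikes are normal under bursty traffic.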
The Path Forward
KV cache management is still evolving. 2025–2026 directions:
- Cross-request KV sharing beyond prefix: mid-sequence sharing when multiple requests converge on similar generations.
- Persistent KV across sessions: caching a user’s conversation KV in external storage.
- Cluster-wide KV pools: shared blocks across inference nodes, enabled by fast interconnects.
- Learned cache policies: RL-trained eviction.
For 2025: quantize (FP8), page (PagedAttention), share (prefix caching), and tune. Those four cover the large majority of the available gains.
Further Reading
- PagedAttention Explained: How vLLM Achieves 24x Throughput
- FP8 and Quantization: Serving LLMs at Half the Cost
- Disaggregated Inference: Prefill, Decode, and the New Serving Topology