Ask an inference engineer what limits concurrency on their GPU and the answer is almost always “KV cache.” Model weights are fixed. Activations are transient. The KV cache — the stored attention keys and values for every token in every active request — grows linearly with concurrent requests and context length, and eats the remaining GPU memory.
If you can fit more KV cache, you can serve more concurrent requests. That’s the whole game. This post surveys the techniques that make that possible in 2025.
For a transformer model:
KV cache per token = 2 (K and V) × num_layers × num_kv_heads × head_dim × bytes_per_element
(num_kv_heads equals num_heads for plain multi-head attention; it's smaller for GQA/MQA, covered below.)
For Llama-3-70B (80 layers, 8 KV heads under GQA, head_dim 128, FP16 cache):
KV cache per token = 2 × 80 × 8 × 128 × 2 bytes = 320 KB
For 8K context and 128 concurrent requests: 8,192 × 320 KB × 128 ≈ 320 GiB. That's more than any single GPU.
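The arithmetic is worth sanity-checking in a few lines of Python (shape values from the public Llama-3-70B config; `kv_bytes_per_token` is just the formula above, not a library function):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # 2x for the K and V tensors cached at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 (2 bytes)
per_token = kv_bytes_per_token(80, 8, 128, 2)
print(per_token // 1024, "KB per token")      # 320 KB

# 8K context, 128 concurrent requests
total = per_token * 8192 * 128
print(total / 2**30, "GiB total")             # 320.0 GiB
```

A single H100 has 80 GB; even an 8-GPU node spends most of its memory here, which is why everything below is about shrinking or better packing this number.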
Strategies to fit more in the same memory:
- Shrink per-token cost: attention architecture (GQA/MQA), KV cache quantization
- Waste less of what you have: PagedAttention, prefix caching, RadixAttention
- Spill beyond the GPU: CPU/NVMe offloading, eviction and recompute
- Bound or restructure the cache: sliding-window attention, prefill/decode disaggregation, KV compression research
- Tune the serving knobs
Let’s cover each.
Multi-Head Attention (MHA) — every head has its own K and V. Max quality, max cache.
Grouped-Query Attention (GQA) — K and V are shared across groups of query heads. 4–8x smaller KV cache vs MHA, minor quality loss.
Multi-Query Attention (MQA) — single K and V shared across all query heads. Smallest cache, biggest quality hit.
Llama-3-70B uses GQA with 8 KV heads for 64 query heads — an 8x KV cache reduction over pure MHA. Mistral, Qwen, and most modern models also use GQA.
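The three variants differ only in how many KV heads get cached, so the size gap falls straight out of the per-token formula (a sketch; the helper is illustrative, not a library API):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Llama-3-70B-like shape: 80 layers, 64 query heads, head_dim 128, FP16
mha = kv_bytes_per_token(80, num_kv_heads=64, head_dim=128)  # every head cached
gqa = kv_bytes_per_token(80, num_kv_heads=8,  head_dim=128)  # 8 shared KV heads
mqa = kv_bytes_per_token(80, num_kv_heads=1,  head_dim=128)  # one shared KV head

print(mha // 1024, gqa // 1024, mqa // 1024)  # 2560 320 40 (KB per token)
```

GQA's 8x reduction over MHA is exactly the 64-query-heads-to-8-KV-heads ratio; quality, not memory, is what limits going all the way to MQA.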
If you’re training a new model from scratch, use GQA. If you’re serving an existing one, its architecture is fixed — check the config to know what you’re dealing with.
Storing KV cache in lower precision cuts its size proportionally.
FP8 KV cache: halves cache size vs FP16 with negligible quality loss on hardware with native FP8 support (H100 and newer).
vllm serve ... --kv-cache-dtype fp8
INT4 KV cache: 4x smaller than FP16, but the quality hit is measurable and usually requires calibration; support varies by framework.
FP4 KV cache: emerging alongside Blackwell-generation hardware support; still early, evaluate carefully before production use.
For production on H100: FP8 KV cache is essentially free savings. Turn it on.
Covered in depth in PagedAttention Explained. The short version: traditional allocators reserve worst-case contiguous blocks per request, wasting 40–60% of KV memory. PagedAttention uses fixed-size blocks (typically 16 tokens each) that any request can use.
In practice: KV memory utilization goes from ~50% (traditional) to ~92% (PagedAttention). Nearly 2x the concurrent requests from the same hardware.
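A toy sketch of the mechanism (class and method names are illustrative, not vLLM's internals): a free list of fixed-size physical blocks, plus a per-request block table mapping logical block index to physical block ID. Blocks are claimed one at a time as tokens arrive, so nothing is reserved for the worst case.

```python
BLOCK_TOKENS = 16  # tokens per KV block, as above

class BlockPool:
    """Free-list allocator over fixed-size KV blocks (illustrative)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV pool exhausted: evict or offload")
        return self.free.pop(0)

    def release(self, blocks):
        self.free.extend(blocks)

class Request:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []   # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grab a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_TOKENS == 0:
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = BlockPool(num_blocks=4)
req = Request(pool)
for _ in range(20):         # 20 tokens -> ceil(20/16) = 2 blocks allocated,
    req.append_token()      # not a worst-case contiguous reservation
print(req.block_table)      # [0, 1]
```

The attention kernel then gathers K/V through the block table, which is the part PagedAttention's custom kernels make fast; the allocator itself is the easy half.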
Every modern inference server supports this. If yours doesn’t, switch.
If many requests share a prefix (system prompt, retrieved documents, few-shot examples), the KV cache for that prefix can be computed once and pointed to by multiple requests.
Gains are workload-dependent but often large:
| Workload | Prefix share | Speedup |
|---|---|---|
| RAG with 2K system prompt | 100% of requests | 1.5–2x |
| Coding assistant with docs | ~90% | 1.4x |
| Open chat | Variable | 1.1–1.3x |
| Agent with tool schemas | 100% | 1.6–2x |
Enable on vLLM:
vllm serve ... --enable-prefix-caching
Combined with PagedAttention, this is essentially free for most production workloads.
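One common implementation strategy (a sketch, not vLLM's exact code) keys each full KV block by a hash of all token IDs up to and including that block, so two requests with identical prefixes resolve to the same physical blocks:

```python
BLOCK_TOKENS = 4  # small for illustration; servers typically use 16

def prefix_block_keys(token_ids):
    """One key per full block; each key covers the whole prefix so far,
    so a block is only shareable when everything before it matches too."""
    return [hash(tuple(token_ids[:end]))
            for end in range(BLOCK_TOKENS, len(token_ids) + 1, BLOCK_TOKENS)]

cache = {}  # block key -> physical block id (illustrative)

def lookup(token_ids):
    hits = 0
    for key in prefix_block_keys(token_ids):
        if key in cache:
            hits += 1
        else:
            cache[key] = len(cache)  # "allocate" and fill a new block
    return hits

system_prompt = list(range(8))                    # shared 8-token prefix
lookup(system_prompt + [100, 101, 102, 103])      # first request: all misses
hits = lookup(system_prompt + [200, 201, 202, 203])
print(hits)  # 2: both system-prompt blocks reused; only the tail differs
```

This is why the wins in the table scale with prefix length: every reused block is prefill compute and KV memory you never spend again.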
SGLang's RadixAttention extends prefix caching with a radix tree over all seen prefixes. Unlike linear prefix caching, it handles prefixes that branch: system prompt → few-shot example A → user query, and system prompt → few-shot example B → user query share cache up to the branch point.
For agent workloads with heavy branching (planning trees, tool-calling loops), this can add another 1.3–2x on top of linear prefix caching.
If your workload has heavy prefix branching, consider SGLang.
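The branching behavior is easiest to see with a minimal trie over token IDs (a deliberately simplified sketch of the idea; a real radix tree compresses runs of single-child nodes and attaches KV blocks to edges):

```python
class RadixNode:
    """Minimal trie node keyed by token ID (illustrative)."""
    def __init__(self):
        self.children = {}   # token id -> RadixNode

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())

def match_prefix(root, tokens):
    """Length of the longest cached prefix of this token sequence."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = RadixNode()
sys_p, few_shot_a, few_shot_b = [1, 2, 3], [10, 11], [20, 21]
insert(root, sys_p + few_shot_a)                # branch A cached first
print(match_prefix(root, sys_p + few_shot_b))   # 3: shared up to branch point
```

A linear prefix cache keyed on whole prefixes would treat branch B as a miss past the system prompt only if it happened to store that exact boundary; the tree gets the longest shared span for free, which is what pays off in agent loops.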
When GPU memory fills, older KV blocks can be paged to CPU memory or even NVMe, fetched back when needed.
Offload to CPU: most servers expose a pinned-host-memory swap pool (in vLLM, --swap-space, sized in GiB per GPU).
Offload to disk (NVMe): typically handled by an external KV cache layer such as LMCache, which tiers blocks across GPU, CPU, and disk.
Offloading trades latency for capacity. Works best when the “hot” working set fits in GPU while the full history can live elsewhere.
When memory fills and you don't want to offload, you evict. Two main options: drop a request's blocks and recompute them when it resumes, or swap the blocks out to CPU and copy them back.
vLLM’s --preemption-mode recompute is the default in recent versions. It drops blocks and recomputes when the request resumes. Simpler and usually faster than swap-to-CPU.
Some models (Mistral, Gemma, Qwen) were trained with sliding-window attention — they only attend to the last N tokens, so KV cache is bounded regardless of context length.
At serving time, you can also chunk attention artificially, at the cost of some quality for very long contexts. Useful when you need to fit 128K+ context in limited memory.
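The bounded-cache property is just a ring buffer over the last N tokens. A sketch (illustrative class; real kernels keep the window in place and rotate indices rather than moving data):

```python
from collections import deque

class SlidingKVCache:
    """KV cache bounded by the attention window (illustrative)."""
    def __init__(self, window):
        # Oldest entries fall off the front once the window is full,
        # so memory use never exceeds `window` tokens per layer.
        self.kv = deque(maxlen=window)

    def append(self, k, v):
        self.kv.append((k, v))

cache = SlidingKVCache(window=4)
for i in range(10):
    cache.append(f"k{i}", f"v{i}")
print(len(cache.kv))   # 4 - bounded regardless of sequence length
print(cache.kv[0])     # ('k6', 'v6') - only the last 4 tokens remain
```

With a real window like Mistral's 4,096 tokens, a 128K-token conversation costs the same KV memory as a 4K one; the trade is that attention can no longer see past the window.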
Prefill (processing the prompt) and decode (generating tokens) have different resource profiles: prefill is compute-bound, decode is memory-bandwidth-bound. Separating them onto different node pools lets each be optimized for its bottleneck.
KV cache is transferred from prefill to decode when prompt processing finishes. See Disaggregated Inference.
Research techniques compress KV cache beyond quantization: importance-based token eviction (e.g., H2O), attention-sink streaming (StreamingLLM), low-rank latent projections (e.g., DeepSeek's MLA), and cross-layer KV sharing.
These are mostly 2024–2025 research. Some are in vLLM or SGLang experimental branches. Worth tracking; not yet table stakes.
Three knobs in any good inference server:
--gpu-memory-utilization: what fraction of GPU memory the server can use for KV cache (weights take the rest). Push to 0.92–0.95 on dedicated nodes.
--max-num-seqs: max concurrent active requests. Higher = more concurrency, less KV per request, potential for eviction. Tune based on acceptable eviction rate.
--max-model-len: max context length you’ll allow. Smaller = more slots in the same KV budget. Set to actual production max, not theoretical.
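How the knobs interact is back-of-envelope arithmetic (a sketch; real servers also reserve memory for activations and CUDA graphs, and the function below is illustrative):

```python
def max_full_context_seqs(gpu_mem_gib, util, weights_gib,
                          kv_bytes_per_token, max_model_len):
    """Worst case: how many sequences at FULL context fit in the KV budget."""
    kv_budget_gib = gpu_mem_gib * util - weights_gib
    per_seq_gib = kv_bytes_per_token * max_model_len / 2**30
    return int(kv_budget_gib / per_seq_gib)

# 2x H100 (160 GiB), ~70 GiB FP8 weights, FP8 KV (160 KB/token), 8K max len
print(max_full_context_seqs(160, 0.94, 70,
                            kv_bytes_per_token=160 * 1024,
                            max_model_len=8192))  # 64
```

Sustained concurrency runs well above this worst case because at any instant most sequences occupy far fewer than max_model_len tokens; that gap is why --max-num-seqs can be set several times higher than the full-context figure.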
Example for production Llama-3-70B on H100 TP=2:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.94 \
--max-num-seqs 256 \
--max-model-len 8192 \
--enable-prefix-caching
That configuration gives us ~250 concurrent active requests with ~8K context, sustained on 2x H100s. Without FP8 KV cache, it would be ~130.
Three metrics to watch: KV cache utilization (how full the paged pool runs), preemption/eviction rate (requests losing their cache under pressure), and prefix cache hit rate (how much shared-prefix work is actually reused).
KV cache management is still evolving. 2025–2026 directions: KV compression techniques moving from research into mainstream servers, prefill/decode disaggregation with fast KV transfer, and multi-tier KV stores spanning GPU, CPU, and NVMe.
For 2025: quantize (FP8), page (PagedAttention), share (prefix caching), and tune. That covers 95% of the gains.
Running into KV cache pressure in production? We can help — profiling and tuning in under a week.