Infrastructure

PagedAttention Explained: How vLLM Achieves 24x Throughput

Balys Kriksciunas · 7 min read
#ai#infrastructure#vllm#paged-attention#kv-cache#inference#llm-serving#gpu-memory


The original vLLM paper reported up to 24x throughput over HuggingFace’s default inference on shared workloads. The key enabler was PagedAttention — a memory-management technique that borrowed directly from how operating systems handle virtual memory.

This post explains PagedAttention from the ground up: the KV-cache fragmentation problem, how PagedAttention fixes it, what it enables (prefix sharing, beam search), and where the remaining ceilings are.


The KV Cache, Briefly

When a transformer generates a token, it computes attention over all prior tokens. To avoid recomputing attention keys and values from scratch on every step, it caches them — the KV cache.

For a typical 13B model with 40 layers, 40 attention heads, and 128-dimensional heads, per-token KV cache size is:

2 (K+V) × 40 layers × 40 heads × 128 dim × 2 bytes (FP16)
= 819,200 bytes ≈ 800 KB per token

A 4K-token context uses ~3.3 GB of KV cache; a 32K context uses ~27 GB. On a single H100 80GB, after loading ~26 GB of FP16 weights for the 13B model, you have ~54 GB left for activations and KV cache. That budget runs out fast.
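The arithmetic is worth scripting once so you can plug in your own model's dimensions. A minimal sketch, using the 13B figures from the text:

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size: 2 tensors (K and V), per layer, per head, FP16 by default."""
    return 2 * layers * heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=40, heads=40, head_dim=128)
print(per_token)                    # 819200 bytes, i.e. ~800 KB per token
print(per_token * 4096 / 1e9)       # ~3.36 GB for a 4K context
print(per_token * 32768 / 1e9)      # ~26.8 GB for a 32K context
```

Swap in your own layer count, head count, and head dimension to size a deployment before you commit to hardware.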


The Fragmentation Problem

Traditional inference servers pre-allocate a contiguous block of KV cache per request, sized for the worst case (max_tokens).

If you run a chat model with max_tokens=2048, every request reserves 2048 tokens’ worth of KV cache — about 1.6 GB — regardless of how long the actual response turns out to be.

This leads to three kinds of waste:

Internal fragmentation

A request reserved 2048 slots but only generated 100 tokens. 1948 slots of KV cache were allocated but never used. Gone until the request completes and the block is freed.

External fragmentation

As short requests complete and long ones continue, the pool of allocated memory becomes checkerboarded. New requests need contiguous blocks of a certain size, but no contiguous hole is big enough. Memory exists, but can’t be used.

Reservation waste

To serve N concurrent requests at worst-case length, you need N × max_tokens of KV cache. Even if requests average 200 tokens instead of 2048, you’re provisioning for the tail.

Combined, these effects leave production servers without paged attention running at roughly 40–60% KV-cache efficiency. On a $30k GPU, that’s ~$12–18k of memory you paid for and didn’t use.
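To see how severe the decode-side reservation waste alone can be, here is a back-of-envelope sketch using the section's numbers (the 200-token average response length is an assumption for illustration):

```python
# Worst-case contiguous pre-allocation vs. what a request actually uses.
KV_PER_TOKEN = 819_200      # bytes per token for the 13B model above
max_tokens = 2048           # reserved per request, worst case
avg_generated = 200         # assumed typical response length

reserved = max_tokens * KV_PER_TOKEN
used = avg_generated * KV_PER_TOKEN
print(f"{used / reserved:.1%}")                 # 9.8% of the reservation used
print(f"{(reserved - used) / 1e9:.2f} GB")      # ~1.51 GB stranded per request
```

Real servers do better than this because the prompt's KV cache is always used, but the decode reservation is pure tail-provisioning.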


Enter PagedAttention

PagedAttention solves fragmentation by treating KV cache like virtual memory.

Core ideas:

1. Fixed-size blocks

Instead of one contiguous allocation per request, KV cache is stored in fixed-size blocks (vLLM default: 16 tokens per block). A block is the unit of allocation.

Token positions 0–15 go in block A. Tokens 16–31 in block B. Tokens 32–47 in block C. If a request ends at token 150, it occupies exactly ⌈150/16⌉ = 10 blocks — no over-provisioning.

2. Per-request page tables

Each active request has a page table — a mapping from logical token positions to physical block IDs. This decouples logical layout (the contiguous-looking sequence the model expects) from physical layout (blocks scattered through GPU memory).

When the attention kernel runs, it indirects through the page table to fetch KV from the right physical location. The model never sees the indirection.
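The translation itself is just a divide and a modulo. A toy sketch (the block IDs in `page_table` are hypothetical):

```python
BLOCK_SIZE = 16  # tokens per block (vLLM default)

# Page table: logical block index -> physical block id.
# Physical blocks can live anywhere in the KV cache pool.
page_table = [7, 42, 3]

def physical_location(token_pos: int) -> tuple[int, int]:
    """Translate a logical token position to (physical block id, offset within block)."""
    logical_block = token_pos // BLOCK_SIZE
    offset = token_pos % BLOCK_SIZE
    return page_table[logical_block], offset

print(physical_location(0))    # (7, 0)  -- token 0 lives in physical block 7
print(physical_location(37))   # (3, 5)  -- token 37 is slot 5 of physical block 3
```

The real kernel does this lookup per block of keys/values it loads, not per token, which keeps the overhead small.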

3. Block allocator

A global allocator hands out free blocks to requests on demand. Blocks are fungible — any free block can serve any request. No external fragmentation.
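Because blocks are fungible, the allocator can be as simple as a free list. A toy sketch (class and method names are illustrative, not vLLM's internals):

```python
class BlockAllocator:
    """Toy free-list allocator: any free block can serve any request."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def free_block(self, block_id: int) -> None:
        self.free.append(block_id)

alloc = BlockAllocator(num_blocks=4)
a = alloc.alloc()
b = alloc.alloc()
alloc.free_block(a)     # a finished request's blocks return to the pool
c = alloc.alloc()       # and are immediately reusable -- no fragmentation
```

When the pool runs dry, real schedulers preempt or queue requests rather than raising; the point is that any freed block instantly satisfies any new allocation.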

4. Copy-on-write for shared blocks

Two requests that begin with the same tokens (e.g. same system prompt) can share physical blocks via their page tables. When they diverge — generating different tokens — the shared block is copied on write. This enables prefix caching (more below).
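Copy-on-write falls out of reference counting on physical blocks. A toy sketch — the `share`/`write` helpers and the stand-in allocator are hypothetical simplifications of the real bookkeeping:

```python
ref_count: dict[int, int] = {}       # physical block id -> number of page tables using it
next_free = iter(range(100, 200))    # stand-in allocator handing out fresh block ids

def share(block_id: int) -> int:
    ref_count[block_id] = ref_count.get(block_id, 0) + 1
    return block_id

def write(page_table: list[int], logical_idx: int) -> int:
    """Before writing into a block, copy it if another page table still holds it."""
    block = page_table[logical_idx]
    if ref_count[block] > 1:              # shared: copy on write
        ref_count[block] -= 1
        new_block = next(next_free)       # (the real system also memcpys the KV data)
        ref_count[new_block] = 1
        page_table[logical_idx] = new_block
    return page_table[logical_idx]

# Two requests sharing the same system-prompt block:
pt_a = [share(5)]
pt_b = [share(5)]
write(pt_a, 0)          # request A diverges -> gets a private copy
print(pt_a, pt_b)       # [100] [5]
```

Only the block being written is copied; everything upstream of the divergence point stays shared.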


The Custom CUDA Kernel

The tricky part: standard attention kernels assume KV is contiguous, so you can’t just plug PagedAttention into an existing FlashAttention implementation. vLLM ships a custom CUDA kernel that reads each request’s block table inside the attention loop, gathers keys and values from non-contiguous physical blocks, and fuses that gather with the attention computation — no contiguous copy is ever materialized.

This is a genuinely hard piece of engineering. The performance is within ~5–10% of the contiguous-KV baseline for single-request workloads, and massively better for multi-request ones. Later optimizations (vLLM v0.4+ integrated FlashAttention and FlashInfer for the attention computation itself, using PagedAttention’s block layout on top) have closed that 5–10% gap.


What PagedAttention Unlocks

Beyond fixing fragmentation, PagedAttention enables several features that are impossible or painful without it:

Prefix caching

If many requests share a prefix (system prompt, few-shot examples, a long document for RAG), the KV cache for that prefix can be computed once and shared across requests.

With PagedAttention, sharing is just pointing multiple page tables at the same physical blocks. No data is duplicated. On RAG workloads where every query has the same 2K-token system preamble, prefix caching alone gives a 1.5–3x throughput boost on top of the base PagedAttention gains.
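The savings compound with concurrency. A back-of-envelope sketch using the figures from earlier in the post (96 concurrent requests, a 2K-token shared preamble on the 13B model):

```python
# Memory that prefix sharing avoids duplicating across concurrent requests.
KV_PER_TOKEN = 819_200      # bytes per token, 13B model
prefix_tokens = 2048
concurrent = 96

without_sharing = concurrent * prefix_tokens * KV_PER_TOKEN
with_sharing = 1 * prefix_tokens * KV_PER_TOKEN   # one physical copy, many page tables
print((without_sharing - with_sharing) / 1e9)     # ~159 GB of KV cache avoided
```

Without sharing, those 96 prefix copies wouldn’t come close to fitting on one GPU — the batch size itself is only reachable because the prefix is stored once.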

vLLM: --enable-prefix-caching. SGLang and TensorRT-LLM have equivalent features.

Beam search and parallel sampling

Beam search maintains multiple hypotheses, each sharing a prefix. Before PagedAttention, each beam duplicated the shared prefix KV cache. Now they share blocks until they diverge, paying the copy only where they actually differ.

Long context without worst-case memory

Long context windows (32K, 128K) no longer force a worst-case allocation up front, because blocks are allocated only for positions actually in use. Combined with sliding-window attention or chunked KV cache, PagedAttention helps make 1M+ contexts feasible.

Disaggregated prefill/decode

In 2024+, some systems (including vLLM’s upcoming disaggregated deployments) separate prefill (compute-bound) and decode (bandwidth-bound) onto different hardware. PagedAttention’s block-level representation makes it natural to ship KV blocks between nodes.

See our upcoming piece on Disaggregated Inference.


The Tradeoffs

PagedAttention isn’t free. The costs:

1. Indirection overhead. Every attention computation pays a page-table lookup. With a well-tuned kernel, this is small (~5% on single-request), but nonzero.

2. Block size tuning. Too small (e.g., 1 token/block) means huge page tables and constant allocator churn. Too large (e.g., 256 tokens/block) brings back fragmentation. vLLM’s 16-token default is a good compromise but isn’t optimal for every workload.
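The tradeoff is easy to quantify per request: page-table entries grow as blocks shrink, while slack in the last, partially filled block grows as blocks get bigger. A sketch for the 150-token request from earlier:

```python
tokens = 150
for block_size in (1, 16, 256):
    blocks = -(-tokens // block_size)          # ceiling division
    wasted = blocks * block_size - tokens      # unused slots in the final block
    print(block_size, blocks, wasted)
# block_size=1:   150 page-table entries, 0 tokens wasted
# block_size=16:   10 page-table entries, 10 tokens wasted
# block_size=256:   1 page-table entry, 106 tokens wasted
```

At 16 tokens per block, waste is bounded by 15 tokens (~12 MB on the 13B model) per request, while page tables stay short enough for the kernel to traverse cheaply.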

3. Kernel complexity. The custom attention kernel is a nontrivial piece of CUDA. Bugs have happened. Cross-version compatibility with new attention variants (ALiBi, RoPE, sliding window) requires vigilance.

4. Debugging. When something goes wrong, “why did this pod OOM?” involves reading block allocator traces. Operationally heavier than contiguous allocation.


PagedAttention vs. Alternatives

Contiguous KV cache (HuggingFace default). Easy, simple, wastes memory.

FlexGen-style offloading. Swap KV blocks to CPU memory under pressure. Works but slow; often used in combination with PagedAttention.

RadixAttention (SGLang). Takes the prefix-sharing idea further, building a radix tree of KV caches indexed by prompt prefixes. Great for agent workloads with many shared branches. Built on PagedAttention-like block layout.

Infinigen / AttentionStore. Research systems that push KV cache management to a shared cluster-wide layer. Still mostly research, but the direction is clear.


Measurable Impact

Some numbers from our own benchmarks, single H100 80GB, Llama-2-13B-chat, realistic chat workload:

| Config | Throughput (tok/s) | KV memory efficiency | Concurrent requests |
|---|---|---|---|
| HF + static batching | 220 | ~45% | 8 |
| HF + continuous batching (no PagedAttention) | 980 | ~55% | 16 |
| vLLM (PagedAttention, no prefix cache) | 3,100 | ~92% | 96 |
| vLLM (PagedAttention + prefix caching, RAG workload) | 4,900 | ~92% | 96 |

The two big jumps — 980 → 3,100 from PagedAttention proper, and 3,100 → 4,900 from prefix caching — are both enabled by the block-level memory model.


When Does It Matter Least?

PagedAttention’s edge shrinks in some cases: single-request, batch-of-one serving, where there is nothing to fragment against; offline batch jobs where every request has the same, known length, so the worst-case reservation is exact; and deployments where the model is small relative to GPU memory, so KV cache was never the constraint.

For the other 95% of production workloads — variable lengths, concurrent requests, system prompts, RAG — PagedAttention is a game-changer.


Further Reading

Running into KV cache issues in production? Let’s talk — we’ve tuned PagedAttention deployments at every scale.
