KV Cache Optimization Techniques for LLM Serving

Balys Kriksciunas · 7 min read
#ai #infrastructure #kv-cache #inference #vllm #memory #llm-serving

Ask an inference engineer what limits concurrency on their GPU and the answer is almost always “KV cache.” Model weights are fixed. Activations are transient. The KV cache — the stored attention keys and values for every token in every active request — grows linearly with concurrent requests and context length, and eats the remaining GPU memory.

If you can fit more KV cache, you can serve more concurrent requests. That’s the whole game. This post surveys the techniques that make that possible in 2025.


The Baseline: How Much KV Cache Do You Actually Need?

For a transformer model:

KV cache per token = 2 (K+V) × num_layers × num_heads × head_dim × bytes_per_element

For Llama-3-70B (80 layers, 8 KV heads after GQA, head_dim 128, FP16 = 2 bytes per element):

2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 320 KiB per token

For 8K context and 128 concurrent requests: 8,192 × 320 KiB × 128 ≈ 320 GiB. That’s more than any single GPU.
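The arithmetic above is worth having as a reusable function. A minimal sketch, using Llama-3-70B’s published shape (80 layers, 8 KV heads under GQA, head_dim 128, FP16 storage):

```python
# Back-of-envelope KV cache sizing. Figures below are Llama-3-70B's
# shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 = 2 bytes.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Leading factor of 2 covers both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(80, 8, 128, 2)
total = per_token * 8192 * 128          # 8K context, 128 concurrent requests

print(per_token // 1024, "KiB per token")   # 320 KiB per token
print(total // 2**30, "GiB total")          # 320 GiB total
```

Swap in your own model’s config values to size a deployment before provisioning hardware.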

Strategies to fit more in the same memory:

  1. Reduce per-token size (quantization, GQA/MQA)
  2. Reduce fragmentation (PagedAttention)
  3. Reduce duplication (prefix sharing)
  4. Move some off-GPU (offloading)
  5. Evict or compress cold blocks

Let’s cover each.


1. Architectural Reductions: GQA and MQA

Multi-Head Attention (MHA) — every head has its own K and V. Max quality, max cache.

Grouped-Query Attention (GQA) — K and V are shared across groups of query heads. 4–8x smaller KV cache vs MHA, minor quality loss.

Multi-Query Attention (MQA) — single K and V shared across all query heads. Smallest cache, biggest quality hit.

Llama-3-70B uses GQA with 8 KV heads for 64 query heads — an 8x KV cache reduction over pure MHA. Mistral, Qwen, and most modern models also use GQA.

If you’re training a new model from scratch, use GQA. If you’re serving an existing one, its architecture is fixed — check the config to know what you’re dealing with.
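Checking the config takes one lookup. A sketch using the Hugging Face config field names (`num_attention_heads`, `num_key_value_heads`); the inline dict stands in for a parsed `config.json`:

```python
# Classify a model's attention variant from its HF-style config.
# The dict below stands in for json.load(open("config.json")).

config = {"num_attention_heads": 64, "num_key_value_heads": 8}  # Llama-3-70B

def attention_variant(cfg):
    q = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", q)   # absent field implies MHA
    if kv == 1:
        return "MQA"
    if kv < q:
        return f"GQA ({q // kv}x KV cache reduction vs MHA)"
    return "MHA"

print(attention_variant(config))   # GQA (8x KV cache reduction vs MHA)
```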


2. KV Cache Quantization

Storing KV cache in lower precision cuts its size proportionally.

FP8 KV cache — halves KV memory vs FP16, with accuracy loss that is negligible on most workloads. In vLLM:

vllm serve ... --kv-cache-dtype fp8

INT4 KV cache — a 4x reduction vs FP16, but it typically needs finer-grained scaling and costs measurable accuracy; support varies by serving stack.

FP4 KV cache — emerging alongside Blackwell-generation hardware; still early, so evaluate accuracy carefully before relying on it.

For production on H100: FP8 KV cache is essentially free savings. Turn it on.
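To see why low-precision KV storage is cheap in accuracy terms, it helps to look at the round trip. This is a toy int8-with-scale sketch, not real FP8 (e4m3/e5m2 keeps a floating exponent per element), but the store-scale-dequantize flow is the same idea:

```python
# Per-tensor scaled 8-bit storage, a stand-in for FP8 KV cache.
# Real FP8 keeps per-element exponents; this sketch only shows
# the quantize -> store -> dequantize round trip and its error.

def quantize(values, qmax=127):
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

kv = [0.12, -1.73, 0.55, 2.01]           # a few K/V entries
q, s = quantize(kv)
restored = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(kv, restored))
print(f"max abs error: {err:.4f}")        # small relative to the value range
```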


3. PagedAttention: Eliminating Fragmentation

Covered in depth in PagedAttention Explained. The short version: traditional allocators reserve worst-case contiguous blocks per request, wasting 40–60% of KV memory. PagedAttention uses fixed-size blocks (typically 16 tokens each) that any request can use.

In practice: KV memory utilization goes from ~50% (traditional) to ~92% (PagedAttention). Nearly 2x the concurrent requests from the same hardware.

Every modern inference server supports this. If yours doesn’t, switch.
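The mechanism is easiest to see as a tiny allocator: fixed-size blocks in a shared free pool, plus a per-request block table mapping logical positions to physical blocks. A sketch (block size 16 matches the typical default mentioned above; the class and names are illustrative, not vLLM’s internals):

```python
# Minimal paged KV allocator: blocks come from one shared pool,
# and each request holds a block table (logical -> physical ids).

BLOCK = 16  # tokens per block

class PagedAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> tokens stored so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % BLOCK == 0:                  # current block full, grab a new one
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        self.free.extend(self.tables.pop(req, []))   # blocks return to the pool
        self.lengths.pop(req, None)

alloc = PagedAllocator(num_blocks=8)
for _ in range(40):                          # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req-A")
print(len(alloc.tables["req-A"]), "blocks used")   # 3 blocks used
alloc.release("req-A")
print(len(alloc.free), "blocks free")              # 8 blocks free
```

The waste is bounded at one partially-filled block per request, instead of a worst-case contiguous reservation.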


4. Prefix Caching: Sharing KV Across Requests

If many requests share a prefix (system prompt, retrieved documents, few-shot examples), the KV cache for that prefix can be computed once and pointed to by multiple requests.

Gains are workload-dependent but often large:

| Workload | Prefix share | Speedup |
|---|---|---|
| RAG with 2K system prompt | 100% of requests | 1.5–2x |
| Coding assistant with docs | ~90% | 1.4x |
| Open chat | Variable | 1.1–1.3x |
| Agent with tool schemas | 100% | 1.6–2x |

Enable on vLLM:

vllm serve ... --enable-prefix-caching

Combined with PagedAttention, this is essentially free for most production workloads.
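The core trick is keying KV blocks by the tokens they cover. A sketch, keying each full block by a hash of the tokens up to and including it (vLLM’s automatic prefix caching hashes per-block token ids in a broadly similar way; the function names here are made up):

```python
# Prefix-cache sketch: full blocks are keyed by the token prefix they
# cover, so two prompts with the same prefix map to the same blocks.

BLOCK = 16
cache = {}          # (block index, covered-token tuple) -> physical block id
next_block = 0

def blocks_for(prompt_tokens):
    """Return (block ids, reused count) for a prompt."""
    global next_block
    ids, reused = [], 0
    full = len(prompt_tokens) - len(prompt_tokens) % BLOCK
    for i in range(0, full, BLOCK):
        key = (i, tuple(prompt_tokens[:i + BLOCK]))
        if key in cache:
            reused += 1
        else:
            cache[key] = next_block
            next_block += 1
        ids.append(cache[key])
    return ids, reused

system = list(range(32))                        # 32-token shared system prompt
a, reused_a = blocks_for(system + [101, 102])   # first request computes blocks
b, reused_b = blocks_for(system + [201, 202])   # second reuses both prefix blocks
print(reused_a, reused_b)                        # 0 2
```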


5. SGLang’s RadixAttention

SGLang extends prefix caching with a radix tree over all seen prefixes. Unlike linear prefix caching, RadixAttention handles prefixes that branch: system prompt → few-shot example A → user query, and system prompt → few-shot example B → user query share cache up to the branch point.

For agent workloads with heavy branching (planning trees, tool-calling loops), this can add another 1.3–2x on top of linear prefix caching.

If your workload has heavy prefix branching, consider SGLang.
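The sharing behavior under branching can be shown with a plain token trie (RadixAttention proper compresses token runs into radix-tree edges, but the reuse pattern is the same): each node owns the KV for one token, so two branches pay only for their post-divergence tokens.

```python
# Trie-based sharing sketch: inserting a second branch that shares a
# prefix materializes KV only for the tokens after the branch point.

class Node:
    def __init__(self):
        self.children = {}

root = Node()
total_nodes = 0     # one node == one token's worth of cached KV

def insert(tokens):
    """Insert a sequence; return how many new KV entries were created."""
    global total_nodes
    node, new = root, 0
    for t in tokens:
        if t not in node.children:
            node.children[t] = Node()
            total_nodes += 1
            new += 1
        node = node.children[t]
    return new

sys_prompt = [1, 2, 3, 4]
insert(sys_prompt + [10, 11])          # system prompt -> few-shot A -> query
added = insert(sys_prompt + [20, 21])  # branch B shares the 4-token prefix
print(added)                            # 2: only the post-branch tokens are new
```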


6. KV Cache Offloading

When GPU memory fills, older KV blocks can be paged to CPU memory or even NVMe, fetched back when needed.

Offload to CPU: vLLM reserves host memory for swapped-out blocks via --swap-space (GiB per GPU).

Offload to disk (NVMe): usually handled by an external KV layer such as LMCache, which tiers blocks across GPU, CPU, and local disk.

Offloading trades latency for capacity. Works best when the “hot” working set fits in GPU while the full history can live elsewhere.
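The tiering logic is essentially an LRU between a bounded hot tier and an unbounded cold one. A toy sketch (tier names, class, and sizes are all illustrative):

```python
# Tiered KV sketch: hot blocks in a bounded "GPU" LRU dict; under
# pressure the least-recently-used block is demoted to a "CPU" tier,
# and promoted back on access.

from collections import OrderedDict

class TieredKV:
    def __init__(self, gpu_blocks):
        self.gpu = OrderedDict()   # block id -> payload, in LRU order
        self.cpu = {}
        self.gpu_blocks = gpu_blocks

    def put(self, block_id, payload):
        self.gpu[block_id] = payload
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_blocks:
            cold, data = self.gpu.popitem(last=False)   # demote LRU block
            self.cpu[cold] = data

    def get(self, block_id):
        if block_id in self.cpu:
            self.put(block_id, self.cpu.pop(block_id))  # promote on access
        else:
            self.gpu.move_to_end(block_id)
        return self.gpu[block_id]

kv = TieredKV(gpu_blocks=2)
for i in range(3):
    kv.put(i, f"kv-block-{i}")
print(sorted(kv.cpu))    # [0]: the oldest block was demoted
kv.get(0)                 # touching it promotes it, demoting block 1
print(sorted(kv.cpu))    # [1]
```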


7. Eviction Policies

When memory fills and you don’t want to offload, you evict. The two options are swapping evicted blocks to CPU memory, or dropping them and recomputing the prefill when the request resumes.

vLLM’s --preemption-mode recompute is the default in recent versions. It drops blocks and recomputes when the request resumes. Simpler and usually faster than swap-to-CPU.
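A toy preemption loop shows the recompute-style policy: when the pool runs dry, a victim request is dropped, its blocks are freed, and it is queued for a fresh prefill later. (Names and pool sizes here are made up; this is the shape of the policy, not vLLM’s scheduler.)

```python
# Preempt-by-recompute sketch: evict a victim to free blocks, and
# remember it so its KV can be recomputed when capacity returns.

free_blocks = 4
active = {}        # request id -> blocks held
waiting = []       # preempted requests, to be recomputed later

def allocate(req, blocks_needed):
    global free_blocks
    while free_blocks < blocks_needed and active:
        victim, held = active.popitem()     # evict most recently admitted
        free_blocks += held
        waiting.append(victim)              # its KV will be recomputed
    if free_blocks < blocks_needed:
        raise MemoryError("request larger than the whole pool")
    active[req] = blocks_needed
    free_blocks -= blocks_needed

allocate("A", 2)
allocate("B", 2)
allocate("C", 2)                 # pool is empty, so B gets preempted
print(sorted(active), waiting)   # ['A', 'C'] ['B']
```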


8. Sliding-Window and Chunked Attention

Some models (Mistral, Gemma, Qwen) were trained with sliding-window attention — they only attend to the last N tokens, so KV cache is bounded regardless of context length.

At serving time, you can also chunk attention artificially, at the cost of some quality for very long contexts. Useful when you need to fit 128K+ context in limited memory.
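The bounded-cache property falls out of a ring buffer: once generation passes the window size, each new token’s K/V overwrites the oldest slot. A sketch (class and names illustrative):

```python
# Sliding-window KV sketch: a ring buffer keeps only the last N
# tokens' K/V, so cache size stays flat however long generation runs.

class WindowKV:
    def __init__(self, window):
        self.window = window
        self.buf = [None] * window
        self.count = 0

    def append(self, kv_entry):
        self.buf[self.count % self.window] = kv_entry   # overwrite oldest
        self.count += 1

    def visible(self):
        """K/V entries the next token may attend to, oldest first."""
        if self.count <= self.window:
            return self.buf[:self.count]
        start = self.count % self.window
        return self.buf[start:] + self.buf[:start]

kv = WindowKV(window=4)
for t in range(10):
    kv.append(t)
print(kv.visible())    # [6, 7, 8, 9]: still 4 entries after 10 tokens
```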


9. Disaggregated Prefill and Decode

Prefill (processing the prompt) and decode (generating tokens) have different resource profiles — prefill is compute-bound, decode is memory-bandwidth-bound — so separating them onto different node pools lets each be optimized independently:

KV cache is transferred from prefill to decode when prompt processing finishes. See Disaggregated Inference.


10. Compression

Research techniques compress KV cache beyond quantization: importance-based token dropping (e.g., H2O, SnapKV), low-rank latent projections of K and V (the idea DeepSeek’s MLA bakes into the architecture itself), and cross-layer KV sharing.

These are mostly 2024–2025 research. Some are in vLLM or SGLang experimental branches. Worth tracking; not yet table stakes.


Tuning KV Cache For Your Workload

Three knobs in any good inference server:

--gpu-memory-utilization: what fraction of total GPU memory the server may claim; whatever remains after weights and activations becomes KV cache. Push to 0.92–0.95 on dedicated nodes.

--max-num-seqs: max concurrent active requests. Higher = more concurrency, less KV per request, potential for eviction. Tune based on acceptable eviction rate.

--max-model-len: max context length you’ll allow. Smaller = more slots in the same KV budget. Set to actual production max, not theoretical.
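The three knobs combine into a worst-case capacity estimate. A sketch with illustrative numbers (2x H100 = 160 GiB, ~70 GiB of FP8 weights for a 70B model, ~160 KiB/token of FP8 KV cache; all assumptions, not measurements):

```python
# Worst-case concurrency: KV budget divided by the cost of one
# sequence at full --max-model-len. All inputs are illustrative.

def max_seqs(gpu_mem_gib, util, weight_gib, kv_kib_per_token, max_model_len):
    kv_budget = gpu_mem_gib * util - weight_gib         # GiB left for KV
    per_seq = max_model_len * kv_kib_per_token / 2**20  # GiB per full-length seq
    return int(kv_budget / per_seq)

print(max_seqs(160, 0.94, 70, 160, 8192))   # 64 full-length sequences
```

Real concurrency runs well above this floor, because typical sequences are far shorter than --max-model-len and prefix caching shares blocks across requests.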

Example for production Llama-3-70B on H100 TP=2:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.94 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --enable-prefix-caching

That configuration gives us ~250 concurrent active requests with ~8K context, sustained on 2x H100s. Without FP8 KV cache, it would be ~130.


Measuring Success

Three metrics to watch:

  1. KV cache utilization: how full your KV cache is on average. vLLM exports this. >85% means you’re pushing hard; >95% sustained means you’re eviction-thrashing.
  2. Preemption rate: how often active requests get evicted. >1% means your concurrency cap is too high or you need more memory.
  3. TTFT vs queue depth correlation: if TTFT spikes when queue depth rises, you’re capacity-bound on KV, not GPU compute.

The Path Forward

KV cache management is still evolving, and the research directions keep multiplying. For 2025, though, the playbook is stable: quantize (FP8), page (PagedAttention), share (prefix caching), and tune. That covers 95% of the gains.


Running into KV cache pressure in production? We can help — profiling and tuning in under a week.
