Disaggregated Inference: Serving Prefill and Decode on Separate GPU Pools
Splitting prefill and decode into separate worker fleets delivers 30–50% more throughput on the same hardware. How it works, when it's worth it, and what it takes to run in production.
Walk into a well-tuned LLM inference stack in 2024 and you’d see one vLLM replica per GPU doing everything: prefill (processing the prompt) and decode (generating tokens) all interleaved. It worked. It was “good enough.”
Walk in today and you increasingly see two distinct fleets: prefill workers and decode workers, communicating over a fast network. This is disaggregated inference, and on realistic production workloads it delivers 30–50% throughput improvements on the same hardware.
Prefill and decode have fundamentally different resource profiles.
Prefill — processing the user’s prompt. Runs one forward pass over the entire input. Parallelizes well across the input sequence. Bounded by FLOPS (compute).
Decode — generating tokens one at a time. Runs one forward pass per output token, each small. Poorly parallelizable (autoregressive). Bounded by memory bandwidth (reading weights and KV cache every step).
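The difference shows up in a back-of-envelope roofline calculation. A sketch, where the model size, precision, and H100 figures are illustrative assumptions rather than measurements:

```python
# Roofline sketch: why prefill is compute-bound and decode is
# bandwidth-bound. Illustrative numbers: 70B params, fp16 weights.
PARAMS = 70e9
BYTES_PER_PARAM = 2            # fp16
FLOPS_PER_TOKEN = 2 * PARAMS   # ~2 FLOPs per parameter per token

def arithmetic_intensity(tokens_per_pass: float) -> float:
    """FLOPs per byte of weights read for one forward pass."""
    flops = FLOPS_PER_TOKEN * tokens_per_pass
    bytes_read = PARAMS * BYTES_PER_PARAM  # weights read once per pass
    return flops / bytes_read

prefill = arithmetic_intensity(2048)  # whole prompt in one pass
decode = arithmetic_intensity(1)      # one token per pass (batch = 1)

# H100 SXM: ~989 TFLOPS dense bf16, ~3.35 TB/s HBM bandwidth.
crossover = 989e12 / 3.35e12  # ~295 FLOPs/byte

print(f"prefill: {prefill:.0f} FLOPs/byte (> {crossover:.0f}: compute-bound)")
print(f"decode:  {decode:.0f} FLOPs/byte (< {crossover:.0f}: bandwidth-bound)")
```

Batching raises decode's effective intensity, but at realistic batch sizes it stays far below the crossover, which is why memory bandwidth sets the decode ceiling.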
When you run them on the same GPU interleaved (the traditional approach), each phase degrades the other: a long prefill stalls every in-flight decode, spiking inter-token latency, while decode steps leave the compute units that prefill needs sitting idle.
Disaggregated inference runs prefill on one type of GPU (or pool) optimized for compute, decode on another optimized for bandwidth, and ships KV cache between them.
Three conditions had to be right for this to be practical:
- Interconnects fast enough to move multi-gigabyte KV caches between workers (NVLink, InfiniBand).
- Inference servers with first-class support for KV cache transfer (vLLM, SGLang, TensorRT-LLM).
- Routers and schedulers capable of coordinating two pools instead of one.
All three hit production readiness in 2025. In 2026, disaggregation moved from experimental to standard for large deployments.
```
            [ Request ]
                 │
                 ▼
        [ Router / Scheduler ]
                 │
          ┌──────┴──────┐
          ▼             ▼
  [ Prefill pool ]  [ Decode pool ]
   (large GPUs,     (bandwidth-
    compute-         optimized,
    optimized)       ~2x decode per $)
          │             │
          └─ KV cache ──┘
             transfer
      (NVLink / InfiniBand)
```
A request arrives:
1. The router assigns it to a prefill worker, which runs a single forward pass over the prompt and materializes the KV cache.
2. The KV cache is shipped to a decode worker over the fast interconnect.
3. The decode worker generates tokens autoregressively and streams them back to the client.
Key design choices:
- Pool sizing: the prefill-to-decode GPU ratio for your traffic mix.
- Transfer path: which link carries the KV cache, and whether it moves in bulk or is streamed.
- Routing: how the scheduler assigns requests to workers in each pool.
A realistic workload (Llama-3.1-70B, 2K average prompt, 300 average response):
Non-disaggregated (monolithic vLLM, 8 H100):
Disaggregated (2 H100 prefill + 6 H100 decode):
Better throughput and better latency, same hardware budget.
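To make the handoff cost concrete, here is a rough sizing of the KV cache that must move from prefill to decode for this workload. The model shape (80 layers, 8 GQA KV heads, head dim 128) matches Llama-3.1-70B's public config; the link speeds are illustrative assumptions:

```python
# Rough KV-cache sizing for a Llama-3.1-70B, 2K-token prompt.
# Shape from the public config: 80 layers, 8 KV heads (GQA),
# head dim 128, fp16 cache.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2

def kv_bytes(tokens: int) -> int:
    # 2 = one K and one V tensor per layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * tokens

prompt_kv = kv_bytes(2048)
print(f"KV per token: {kv_bytes(1) / 1024:.0f} KiB")
print(f"2K-prompt KV to transfer: {prompt_kv / 1e9:.2f} GB")

# Transfer time at two illustrative link speeds:
for name, bps in [("NVLink ~450 GB/s", 450e9), ("400G IB ~50 GB/s", 50e9)]:
    print(f"  over {name}: {prompt_kv / bps * 1e3:.1f} ms")
```

About 0.67 GB per 2K-token request: a millisecond or two over NVLink, an order of magnitude more over the network, which is why the transfer path is a first-class design choice.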
The win is larger when:
- Prompts are long relative to outputs, so prefill interference would otherwise dominate inter-token latency.
- Traffic is high and sustained, keeping both pools busy.
- Prompt lengths vary widely, which makes interleaved batching hardest to tune.

The win is smaller when:
- Prompts are short and outputs long, leaving little prefill work to isolate.
- Utilization is low, so the fixed cost of two pools and the transfer path isn't amortized.
Disaggregation opens up using different GPUs for different roles:
A production cluster might run H100 for prefill (where compute matters) and MI300X for decode (where memory bandwidth matters and MI300X’s 5.3 TB/s outpaces H100’s 3.35 TB/s).
Cost-per-token drops substantially with right-sized hardware.
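A back-of-envelope sketch of why bandwidth sets the decode ceiling: every decoded token re-reads the weights, so single-stream tokens/s is bounded by bandwidth divided by weight bytes. The fp16 weight size is an illustrative assumption, and batching amortizes this bound across concurrent requests:

```python
# Bandwidth-bound ceiling on single-stream decode speed:
# tokens/s <= memory bandwidth / bytes of weights read per token.
WEIGHT_BYTES = 70e9 * 2  # 70B params, fp16 (illustrative)

for gpu, bw in [("H100", 3.35e12), ("MI300X", 5.3e12)]:
    print(f"{gpu}: <= {bw / WEIGHT_BYTES:.0f} tokens/s per stream")
```

The ~1.6x bandwidth advantage translates directly into a ~1.6x higher decode ceiling per GPU, which is the whole case for putting decode on the bandwidth-optimized part.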
vLLM 0.7+ supports disaggregated serving. The feature is still maturing and is mostly deployed by larger shops.
SGLang supports disaggregated serving with similar patterns. Strong for agent workloads where prompts often change but decode is long.
NVIDIA’s production TensorRT-LLM supports disaggregation natively. Best performance, most engineering investment to deploy.
Research and open-source projects (DistServe from Peking University) pioneered the architecture. Some teams still use research implementations; most migrate to vLLM or TensorRT-LLM once their features catch up.
Worth it:
- Sustained, high-volume serving on 8+ GPUs where throughput per dollar dominates.
- Workloads with long or highly variable prompts.

Not worth it:
- Small deployments (1–4 GPUs), where the operational complexity outweighs the gain.
- Spiky or low-utilization traffic that can't keep two pools busy.
A rough threshold: if your monthly GPU bill is under $50k, disaggregation is probably premature optimization.
Prefill and decode pools scale independently. Your autoscaler needs to understand:
If decode can’t keep up, prefill backs off. If prefill can’t keep up, decode workers sit idle. Needs smarter coordination than standard HPA.
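That coordination can be sketched as a rule that scales on the downstream bottleneck rather than on per-pool utilization. Everything here (type names, thresholds) is a hypothetical illustration, not any autoscaler's real API:

```python
# Hypothetical coordination rule for the two pools: scale on the
# downstream bottleneck, not per-pool GPU utilization.
from dataclasses import dataclass

@dataclass
class PoolStats:
    prefill_queue_s: float  # est. wait before a prompt starts prefill
    decode_kv_util: float   # fraction of decode-pool KV memory in use

def scaling_decision(s: PoolStats) -> str:
    # Decode memory is the hard ceiling: admitting more prefill work
    # when decode can't hold the KV just moves the queue downstream.
    if s.decode_kv_util > 0.90:
        return "scale decode (and throttle prefill admission)"
    if s.prefill_queue_s > 1.0:
        return "scale prefill"
    return "hold"

print(scaling_decision(PoolStats(prefill_queue_s=2.5, decode_kv_util=0.95)))
```

The key property: a backed-up prefill queue alone does not justify more prefill workers if decode memory is the actual bottleneck.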
When a request arrives, which prefill worker gets it? Which decode worker gets the result?
Policies:
- Least-loaded: send each prompt to the prefill worker with the shortest queue.
- Prefix affinity: route prompts that share a prefix to the same prefill worker so cached KV can be reused.
- Session affinity: pin a multi-turn session to the decode worker that already holds its KV cache.
For mixed workload (long and short prompts), consider two prefill pools — a “fast” one for short prompts and a “bulk” one for long prompts.
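The two-pool split can be sketched as a length-based router; the threshold and pool names are illustrative:

```python
# Length-based routing: short prompts go to a latency-optimized "fast"
# prefill pool, long prompts to a "bulk" pool so they don't
# head-of-line-block the short ones. Threshold is illustrative.
def choose_prefill_pool(prompt_tokens: int, threshold: int = 1024) -> str:
    return "fast" if prompt_tokens <= threshold else "bulk"

assert choose_prefill_pool(200) == "fast"   # chat turn
assert choose_prefill_pool(8000) == "bulk"  # document ingestion
```

In practice you would combine this with a least-loaded pick inside the chosen pool.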
Transfer is on the critical path. Optimizations:
- Layer-wise streaming: ship layer i's KV while prefill is still computing layer i+1.
- In-flight compression (e.g. FP8 KV) to cut the bytes moved.
- GPU-direct transfer (NVLink, RDMA) to avoid a round trip through host memory.
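One such optimization, layer-wise streaming, overlaps the transfer of each layer's KV with the compute of the next layer, so only the last layer's transfer stays exposed on the critical path. A timing sketch with illustrative per-layer costs:

```python
# Timing model for the prefill -> decode handoff. Per-layer compute
# and transfer costs are illustrative, not measurements.
def handoff_latency_ms(layers: int, compute_ms: float, xfer_ms: float,
                       streamed: bool) -> float:
    if not streamed:
        # Bulk transfer: all compute, then all transfer.
        return layers * compute_ms + layers * xfer_ms
    # Streamed: layer i's transfer overlaps layer i+1's compute;
    # only the final layer's transfer is left exposed.
    per_layer = max(compute_ms, xfer_ms)
    return compute_ms + (layers - 1) * per_layer + xfer_ms

print(handoff_latency_ms(80, 1.0, 0.5, streamed=False))  # 120.0 ms
print(handoff_latency_ms(80, 1.0, 0.5, streamed=True))   # 80.5 ms
```

When transfer is cheaper than compute per layer, streaming hides it almost entirely; when transfer dominates, the link itself becomes the bottleneck and compression starts to matter.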
1. Cold start latency. A new decode worker has to receive KV cache before it can start. Warming strategies matter.
2. KV cache size spikes. Very long prompts produce huge KV caches. Can overwhelm transfer bandwidth.
3. Fault recovery. If a decode worker dies mid-generation, where does the request recover to? Need checkpointing or quick re-prefill.
4. Observability complexity. A request’s trace spans multiple GPUs and potentially multiple pods. Must thread trace IDs carefully.
5. Debugging. Subtle issues (off-by-one errors in KV layouts) don’t show up in small tests; they manifest only under load.
Patterns emerging on top of disaggregation:
KV cache persisted to a fast distributed store (Mooncake, LMCache). Any decode worker can pull any KV cache. Sessions can pause and resume cleanly.
Hot KV in GPU memory, warm KV in CPU memory, cold KV on NVMe or object storage. Agent-style workloads with long sessions benefit substantially.
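The tiering can be sketched as a lookup that walks hot to cold and promotes on hit. The tier names and dict-backed stores are illustrative stand-ins for GPU memory, CPU memory, and NVMe:

```python
# Sketch of hot/warm/cold KV tiering: look up a session's KV in GPU
# memory first, then CPU RAM, then NVMe, promoting to hot on hit.
class TieredKVStore:
    def __init__(self):
        self.tiers = {"gpu": {}, "cpu": {}, "nvme": {}}
        self.order = ["gpu", "cpu", "nvme"]  # hot -> cold

    def put(self, session_id: str, kv, tier: str = "gpu"):
        self.tiers[tier][session_id] = kv

    def get(self, session_id: str):
        for tier in self.order:
            if session_id in self.tiers[tier]:
                kv = self.tiers[tier].pop(session_id)
                self.tiers["gpu"][session_id] = kv  # promote on hit
                return tier, kv
        return None, None  # cold miss: must re-prefill

store = TieredKVStore()
store.put("sess-1", b"kv-bytes", tier="nvme")
tier, _ = store.get("sess-1")
print(tier)  # found cold, now promoted to the "gpu" tier
```

A real implementation would add eviction (demoting cold sessions down the tiers under memory pressure), which is the half of the problem this sketch omits.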
Not just prefill vs decode — also separating:
Experimental but promising for very large-scale serving.
Disaggregated inference is the inference pattern that won 2025. It delivers 30–50% throughput improvements on realistic workloads by matching hardware to work type.
If you’re running 8+ GPUs sustained, evaluate it. If you’re running 1–4, stick with monolithic serving — the complexity isn’t worth the gain at that scale.
Your inference server probably already supports it (vLLM, SGLang, TensorRT-LLM). Deployment is the real work: the scheduler, the pool sizing, the observability, the fault recovery. Budget real engineering time.
Evaluating disaggregated inference for your deployment? Reach out — we can run a sizing exercise against your actual workload.