We've deployed both at scale. Here's what the benchmarks actually show, where RadixAttention beats PagedAttention, and which engine to pick for your workload.
We’ve deployed vLLM and SGLang across dozens of production environments — from single-GPU startups serving 50k LLM calls/day to multi-node fleets running agent workloads that generate millions of tokens daily. When a team asks us which inference engine to use, the answer is no longer obvious.
Two years ago, the recommendation was easy: use vLLM. It solved KV cache fragmentation with PagedAttention, had the broadest model support, and shipped faster than anyone. Then Hugging Face put TGI into maintenance mode in December 2025, and SGLang emerged as its primary alternative — backed by a reported $400M spinout as RadixArk, led by Accel.
The question shifted from “which engine?” to “which engine for this workload?”
This post compares vLLM and SGLang on the metrics that matter in production: throughput, latency, prefix caching behavior, structured output performance, hardware support, and operational overhead. We end with a decision framework.
The fundamental difference isn’t performance. It’s architecture:
vLLM is an inference server. Its job: load a model, accept HTTP requests, return responses — as fast and efficiently as possible. PagedAttention handles KV cache by breaking it into fixed-size blocks (like OS pages), eliminating internal fragmentation. Continuous batching schedules requests at the token level so fast-generating requests don’t block slow finishers.
SGLang is an inference engine with a programming model. It exposes the same OpenAI-compatible endpoints, but its runtime understands structured generation workflows — multi-step reasoning, tool calls, conditional branching — and optimizes for them. RadixAttention replaces PagedAttention’s block-level cache with a token-level radix tree that automatically discovers and reuses shared prefixes across requests.
In practice: vLLM asks “how many tokens per second can I generate?” SGLang asks “how many tokens can I avoid regenerating?”
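To make the contrast concrete, here is a toy sketch of the block-table idea behind PagedAttention. It is illustrative only, under the assumption of a 16-token block size, and does not mirror vLLM's internal data structures:

```python
# Illustrative toy in the spirit of PagedAttention; not vLLM's internal data structures.
BLOCK_SIZE = 16  # tokens per KV block, analogous to an OS page size

class BlockTable:
    def __init__(self, free_block_ids):
        self.free = list(free_block_ids)  # pool of physical block ids anywhere in GPU memory
        self.blocks = []                  # logical block order -> physical block id

    def slot_for(self, logical_pos):
        # Allocate a new physical block only when a block boundary is crossed,
        # so a request never holds more memory than it has actually used.
        if logical_pos % BLOCK_SIZE == 0:
            self.blocks.append(self.free.pop())
        return self.blocks[logical_pos // BLOCK_SIZE], logical_pos % BLOCK_SIZE

table = BlockTable(free_block_ids=range(1024))
for pos in range(40):                 # a 40-token sequence spans 3 blocks
    block_id, offset = table.slot_for(pos)
print(table.blocks)                   # physical blocks need not be contiguous or ordered
```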
We ran side-by-side tests on H100 80GB across model sizes. Results align with published benchmarks from independent sources.
| Model | vLLM (tok/s) | SGLang (tok/s) | Delta |
|---|---|---|---|
| Llama 3.1 8B | ~12,500 | ~16,215 | +29% SGLang |
| Llama 3.3 70B (FP8) | ~1,850 | ~1,920 | +4% SGLang |
At 50 concurrent requests, SGLang leads on smaller models. On 70B+ models, the gap narrows to single digits. At 100+ concurrent requests, vLLM’s tail TTFT (time-to-first-token) begins to lag — SGLang’s p95 TTFT stays tighter under heavy load. Source: Spheron H100 benchmarks, March 2026.
Cold start times are similar: roughly 60 seconds for both engines on Llama 70B. No compilation step. That’s a meaningful advantage over TensorRT-LLM, which requires a 25–40 minute engine compilation per model version on H100.
Bottom line: If raw tokens/second at scale on large models is your metric, the engines are close. On smaller models, SGLang pulls ahead meaningfully.
This is where the engines diverge most sharply.
vLLM caches KV blocks at page granularity. If two requests share a prefix that aligns to block boundaries, vLLM reuses the cached blocks, but a partially matched block is recomputed in full. For a shared system prompt of 3,047 tokens and a block size of 16 tokens, the last 7 tokens of that prompt get recomputed on every request.
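The arithmetic behind that example, as a quick sketch:

```python
# Block-aligned prefix reuse: only complete blocks are served from the cache.
prompt_tokens = 3047   # shared system prompt length
block_size = 16        # vLLM KV block size in this example

full_blocks = prompt_tokens // block_size            # 190 blocks hit the cache
cached_tokens = full_blocks * block_size             # 3,040 tokens reused
recomputed_tokens = prompt_tokens - cached_tokens    # 7 tokens recomputed per request
print(full_blocks, cached_tokens, recomputed_tokens)  # 190 3040 7
```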
Prefix caching in vLLM is opt-in via --enable-prefix-caching. When enabled, cache hit rates depend heavily on request similarity patterns.
SGLang builds a radix tree from token sequences. Every node stores a KV cache for the path from root to that node. When a new request arrives, SGLang walks the tree, finds the longest matching prefix, caches the full match at the token level, and only computes the delta.
This is not incremental — it’s structural. The radix tree lives in GPU memory and persists across requests. There is no block alignment penalty.
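To make the mechanism concrete, here is a minimal toy sketch of token-level prefix matching. It is purely illustrative and not SGLang's implementation, which uses a compressed radix tree holding real KV tensors in GPU memory:

```python
# Toy token-level prefix tree (a plain trie rather than a compressed radix tree).
class Node:
    def __init__(self):
        self.children = {}  # token id -> Node; the real tree stores KV tensors per path

def insert(root, tokens):
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, Node())

def longest_prefix(root, tokens):
    node, matched = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        matched += 1
    return matched  # tokens whose KV cache can be reused; only the rest is computed

root = Node()
insert(root, [101, 7, 7, 9, 42])             # a previously served request
print(longest_prefix(root, [101, 7, 7, 3]))  # -> 3: reuse 3 tokens, compute 1 new token
```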
Where this matters in production:
We have measured 3–5x improvement in effective prefill latency on workloads with >60% prefix reuse when switching from vLLM to SGLang. On workloads where every request is unique (e.g., creative generation, translation), the advantage disappears.
Function calling and constrained JSON output are now baseline requirements. Both engines support grammar-constrained decoding, but they implement it differently.
vLLM uses guided decoding on the CPU side — it applies a grammar mask during sampling to prevent the model from generating invalid tokens. This creates a CPU-side bottleneck at high batch sizes. At batch sizes of 8+, throughput degradation is noticeable.
SGLang overlaps grammar mask generation with the GPU forward pass. The mask is computed on a parallel CPU thread and applied during GPU decoding, hiding most of the CPU overhead. Throughput impact is minimal even at batch sizes of 32+.
If your production workload enforces JSON schemas on every response (and it should), SGLang handles it with roughly half the latency penalty at scale.
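Under the hood, both approaches come down to masking disallowed tokens before sampling; the difference described above is when that mask is computed relative to the GPU forward pass. A minimal illustrative sketch of the masking step (not either engine's code; the vocabulary size and allowed-token list are made up):

```python
import torch

def constrained_sample(logits: torch.Tensor, allowed_token_ids: list[int]) -> int:
    # Mask every token the grammar disallows at this step, then sample as usual.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(32_000)    # one decode step over a hypothetical 32k vocabulary
allowed = [11, 198, 705]        # e.g. the tokens that keep the JSON output valid
next_token = constrained_sample(logits, allowed)
assert next_token in allowed
```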
| Dimension | vLLM | SGLang |
|---|---|---|
| NVIDIA GPUs | A100, H100, H200, B200, L40S, RTX 4090 | A100, H100, H200, B200 |
| AMD GPUs | MI300X (ROCm 6.x–7.x) | MI300X (ROCm) |
| Intel/XPU | Yes (IPEX) | No |
| AWS Inferentia/Trainium | Yes | No |
| Google TPU | Experimental | No |
| Hugging Face model compatibility | Virtually all | Most (growing rapidly) |
| Quantization (AWQ, GPTQ, FP8) | Full support | Full support |
| Multi-LoRA serving | Yes | Yes (more efficient scheduling) |
| Speculative decoding | Draft model support | Medusa, EAGLE support |
| Disaggregated prefill/decode | Production-ready | Production-ready |
vLLM wins on hardware breadth. If you are deploying on Intel GPUs, AWS Trainium, or Google TPUs, vLLM is currently the only option. SGLang focuses on NVIDIA and AMD and covers them deeply.
Model support is the other differentiator: vLLM supports virtually every Hugging Face model within days of release. SGLang is sometimes a version behind on obscure architectures, though its DeepSeek support has been best-in-class.
Both engines now support disaggregated serving — running prefill on compute-optimized workers and decode on bandwidth-optimized workers, with KV cache shipped over the network. We covered the architecture in depth in our disaggregated inference guide.
The production story here is worth noting: SGLang published production disaggregation results earlier and has a cleaner API for the prefill-decode handoff (via cross-node KV cache transfer). vLLM’s implementation is more configurable but also more complex to tune.
For teams running disaggregated serving at scale, we recommend evaluating SGLang first, especially if your prefill workers serve prefix-heavy workloads, since the RadixAttention advantage compounds across the prefill fleet.
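Conceptually, the split looks like the toy sketch below. Every class and method name here is a hypothetical stand-in, not either engine's API; both engines implement the KV transfer with their own cross-node transports.

```python
# Toy, self-contained illustration of the prefill/decode split; names are hypothetical.
class PrefillWorker:
    def prefill(self, prompt_tokens):
        # Compute-bound phase: one pass over the whole prompt produces the KV cache.
        return [f"kv({tok})" for tok in prompt_tokens]

class DecodeWorker:
    def decode(self, kv_cache, max_new_tokens):
        # Memory-bound phase: generate one token per step, reading the shipped cache.
        generated = []
        for step in range(max_new_tokens):
            generated.append(f"tok{step}")          # stand-in for a sampled token
            kv_cache.append(f"kv(tok{step})")       # the cache grows as decode proceeds
        return generated

prompt = ["The", "capital", "of", "France", "is"]
kv = PrefillWorker().prefill(prompt)    # runs on a compute-optimized node
# ... in a real deployment the KV cache is shipped over the network here ...
print(DecodeWorker().decode(kv, max_new_tokens=3))
```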
Both engines ship as Python packages with one-line server starts:
vLLM:
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --enable-prefix-caching \
  --max-model-len 8192
```
SGLang:
```bash
pip install sglang

# RadixAttention prefix caching is enabled by default; no flag needed.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 8 \
  --context-length 8192
```
Both expose an OpenAI-compatible REST API at /v1/chat/completions and /v1/completions. Client code does not need to change when swapping engines. Kubernetes deployment patterns are identical — both work as standard container images.
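For example, the same OpenAI client works against either server; only the base URL changes. The model name matches the launch commands above; vLLM listens on port 8000 by default, SGLang on 30000:

```python
from openai import OpenAI

# Point at whichever engine is serving; the request shape is identical.
client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM default port; SGLang defaults to 30000
    api_key="not-needed-for-a-local-server",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```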
SGLang offers a Python SDK for defining structured generation workflows:
```python
import sglang as sgl

@sgl.function
def research_agent(s, topic: str):
    s += "You are a research assistant. " + topic
    s += "Step 1: List key questions about this topic.\n"
    s += sgl.gen("questions", max_tokens=500, stop=["Step 2"])
    s += "\nStep 2: For each question, provide a detailed answer.\n"
    s += sgl.gen("answers", max_tokens=2000)

result = research_agent.run(
    topic="RadixAttention performance",
    backend=sgl.RuntimeEndpoint("http://localhost:30000"),
)
```
This multi-step generation pattern — where one generation output feeds the next — lives inside the engine. In vLLM, you would manage the same flow in application code with multiple sequential API calls. For agent systems, this reduces application-layer complexity.
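For comparison, a sketch of the equivalent flow against the plain OpenAI-compatible API (either engine), reusing the client object from the earlier example; here the application carries the orchestration:

```python
def research_agent_via_api(client, model, topic):
    # Step 1: ask for the key questions.
    questions = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"You are a research assistant. {topic}\n"
                       "Step 1: List key questions about this topic.",
        }],
        max_tokens=500,
    ).choices[0].message.content

    # Step 2: feed step 1's output back in and ask for the answers.
    answers = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Questions:\n{questions}\n\n"
                       "Step 2: For each question, provide a detailed answer.",
        }],
        max_tokens=2000,
    ).choices[0].message.content
    return questions, answers
```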
We recommend vLLM as the default inference engine when:

- You need the broadest hardware support: Intel XPU, AWS Inferentia/Trainium, and Google TPUs are vLLM-only today.
- You serve many different models or update models frequently and want support for new Hugging Face releases within days.
- Your traffic is mostly unique prompts with little prefix reuse (creative generation, translation).
- You want the simplest operational story while you are still establishing the workload profile.

We recommend SGLang when:

- Your requests share long prefixes: common system prompts, repeated RAG documents, multi-turn agent context.
- You enforce JSON schemas or tool-call grammars on most responses at meaningful batch sizes.
- You run agent workloads with multi-step generation, where the programming model keeps orchestration inside the engine.
- You plan disaggregated prefill/decode serving with prefix-heavy prefill workers.
| Your workload | Recommendation |
|---|---|
| Chat API with shared system prompts | SGLang |
| RAG pipeline serving repeated documents | SGLang |
| Multi-agent orchestration with tool calls | SGLang |
| Single model, fixed in production, maximum throughput needed | TensorRT-LLM (not covered here) |
| Serving many different models, frequent updates | vLLM |
| Exotic hardware (TPUs, Inferentia) | vLLM |
| Stateless, unique-prompt workloads | vLLM |
| General-purpose API with moderate concurrency | Either — start with vLLM for simpler ops |
The infrastructure landscape has consolidated since we last compared inference servers in our vLLM vs TGI vs Triton benchmarks. TGI is gone. Triton+TensorRT-LLM owns the raw-throughput ceiling for teams willing to maintain compilation pipelines. vLLM and SGLang are the two that matter for the vast majority of production teams in 2026.
Our practice: we default to vLLM for new deployments until we establish the workload profile. Once we confirm prefix reuse patterns, structured output load, and agent workflow complexity, we reassess. For teams building agent-heavy stacks from scratch in 2026, we increasingly start with SGLang.
vLLM and SGLang are both production-grade inference engines. vLLM wins on breadth, community, and operational simplicity. SGLang wins on prefix caching efficiency, structured output performance, and agent workflow alignment. The $400M RadixArk spinout signals where institutional money is betting on this race.
The right answer depends on what you are serving. Match the engine to the workload pattern — not to the benchmark that looks best on a spec sheet.