Infrastructure

vLLM: The Open-Source Inference Engine Changing LLM Serving

Balys Kriksciunas 7 min read
#ai#infrastructure#inference#vllm#llm-serving#paged-attention#continuous-batching#gpu

vLLM: The Open-Source Inference Engine Changing LLM Serving

Before vLLM, self-hosting a large language model meant picking between HuggingFace’s generate() (simple, slow) and NVIDIA Triton + TensorRT-LLM (fast, complex, NVIDIA-only). vLLM shipped in mid-2023 from a Berkeley Sky Computing Lab team, open-sourced its PagedAttention implementation, and within a year became the default inference stack for most self-hosted LLM workloads.

This article explains what vLLM is, why it’s fast, what it’s good for, and the operational details you need to know before shipping it.


The Problem vLLM Solves

Naive LLM serving wastes GPU memory and compute in two specific ways.

1. KV cache fragmentation. When a model generates tokens, it caches the attention keys and values (KV cache) for every position. Traditional implementations pre-allocate a contiguous block per request, sized for the worst-case max length. If you serve a 4K-context model, every request reserves 4K-worth of KV even if it only generates 100 tokens. Memory utilization typically sits below 50%.

2. Batching at the wrong granularity. Static batching waits to assemble a batch, runs it start-to-finish, then assembles the next. The longest request in the batch pins everyone else. GPU utilization also sits below 50%.

Together these mean a single request workload uses the GPU well, but a multi-request production workload uses maybe 20–40% of the theoretical throughput.

vLLM fixes both.


PagedAttention: Virtual Memory for the KV Cache

The headline innovation is PagedAttention, which treats the KV cache like an OS virtual memory subsystem.

Instead of contiguous per-request blocks, KV cache is stored in fixed-size blocks (default 16 tokens). A per-request page table maps logical token positions to physical blocks. New tokens allocate blocks on demand; finished requests free them.

The consequences:

Memory utilization in production vLLM deployments routinely hits 90%+ of available HBM. That’s directly convertible into more concurrent requests, longer contexts, or bigger batches.

For the architectural details, see our PagedAttention Explained.


Continuous Batching: The Other Half of the Win

The second big idea in vLLM is continuous batching (sometimes called “in-flight batching” or “iteration-level scheduling”).

Instead of batching at the request level, vLLM batches at the token-generation-step level. At each forward pass, the scheduler looks at all in-flight requests, decides which can make progress, and batches those. A request that finishes mid-batch is replaced immediately with a waiting request.

This means:

Orca, a paper by the Seoul National University team, introduced the technique in 2022. vLLM was the first widely-adopted open-source implementation, and it’s now standard across TGI, TensorRT-LLM, and others.

See our Continuous Batching explainer for the full mechanics.


Benchmarks: Is the Hype Justified?

In our own testing on a single H100 80GB, serving Llama-2-13B-chat:

SetupThroughput (tokens/s)P50 latencyP99 latency
HF generate(), batch 13628ms/tok32ms/tok
HF generate(), naive batching (8)22036ms/tok190ms/tok
vLLM, default settings3,10038ms/tok95ms/tok
vLLM w/ prefix caching (2K shared system prompt)4,60035ms/tok88ms/tok

That’s ~14x vs single-stream HF, ~20x vs naive batching. Gaps are larger at higher concurrency — at 64+ concurrent requests, vLLM is 30x+ because naive implementations fall over entirely.

vLLM vs other production servers:

For most teams, vLLM is the right default. You switch away only if you have a specific reason: structured output coverage (SGLang), maximum performance-at-any-cost (TensorRT-LLM), tight HF integration (TGI).

We cover the full comparison in vLLM vs TGI vs Triton Benchmarks.


What You Actually Get

vLLM ships as a Python package and a server. The two usage patterns:

Offline batched inference:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, params)

OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.92

The server exposes /v1/completions, /v1/chat/completions, and /v1/embeddings endpoints compatible with OpenAI’s API. That means you can point an existing OpenAI client at a vLLM endpoint and it “just works.”

Supported features as of 2024:


Operational Notes From Production

Things we’ve learned running vLLM in production:

1. GPU memory utilization is a real knob. Default is 0.9 (90%). Push to 0.92–0.95 if you’re not running other workloads on the GPU. Below 0.85 you’re leaving throughput on the table.

2. max-model-len matters more than you think. vLLM pre-computes KV cache block capacity based on max length. Setting it way above your actual usage wastes memory. Set it to the largest context you actually use.

3. Warm it up. First request latency is ~10–30 seconds for weight loading and CUDA graph capture. Ship a warmup script in your container entrypoint.

4. Prefix caching is a free 1.5–2x for RAG/agent workloads. Enable --enable-prefix-caching — the only reason not to is if you’re running truly unique prompts per request.

5. Tensor parallelism has overhead. TP=2 doesn’t give you 2x throughput; it gives you 1.4–1.7x. Use it when you need the memory (a model that doesn’t fit on one GPU), not for throughput scaling.

6. The Python GIL is a real bottleneck for pre/post-processing. vLLM 0.5+ pushes more logic into C++; still worth minimizing Python work in your wrapping service.

7. Health checks lie. vLLM’s /health endpoint returns 200 even when the model is wedged. Pair it with a synthetic-request probe that actually generates a token.


When Not to Use vLLM


Deployment Patterns

Three common patterns we see with clients:

Pattern 1: Single-model, single-node. One vLLM server, behind a LiteLLM gateway, serving one model. Simple, works up to ~1M tokens/day.

Pattern 2: Multi-LoRA, single base model. One vLLM server loaded with a base model (e.g. Llama-3-8B) and dozens of LoRA adapters. You serve many “models” from one GPU fleet. Cost-effective for per-customer fine-tunes.

Pattern 3: Sharded multi-node. For 70B+ models, tensor-parallel across 4–8 GPUs, multiple replicas behind a scheduler. This is where you start caring about Ray Serve or Kubernetes orchestration — see Ray Serve vs Kubernetes for Model Serving.


Further Reading

Running vLLM in production and want help tuning it? Reach out — we’ve profiled vLLM deployments from a single H100 to multi-cluster fleets.

← Back to Blog