Infrastructure

Continuous Batching for LLMs: Why It Matters

Balys Kriksciunas · 7 min read

#ai #infrastructure #inference #batching #vllm #tgi #llm-serving #throughput


Ask a cloud bill-payer what would most dramatically reduce their inference costs and most will say “a cheaper GPU” or “a smaller model.” The actual answer, for most self-hosted LLM deployments, is continuous batching. Switching from static to continuous batching on the same hardware regularly delivers 2–5x throughput improvements.

This article explains what continuous batching is, why it works, what it breaks, and what you need to know to deploy it.


The Problem: LLM Requests Don’t Batch Like Other ML

In classical ML inference, batching is easy: collect N inputs, run one forward pass, return N outputs. All inputs finish at the same time. This works because classical models produce fixed-size output.

LLMs are autoregressive. Each request generates a variable number of tokens, one at a time, until it hits a stop token or max length. If you batch 8 requests together and one generates 800 tokens while another generates 10, the second request waits behind the first for 790 unnecessary iterations.

With static batching (sometimes called “request-level batching”), the entire batch waits until the longest request finishes. This is catastrophic for GPU utilization:

Measured GPU utilization on typical chat workloads with static batching: 30–50%. Half your expensive GPU sits idle.
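A back-of-the-envelope sketch of where a number like that comes from (the output lengths below are illustrative, not measured):

```python
def static_batch_utilization(output_lens):
    """Fraction of per-request decode slots doing useful work when the
    whole batch must iterate until the longest request finishes."""
    steps = max(output_lens)                    # batch runs this many iterations
    useful = sum(output_lens)                   # tokens actually needed
    return useful / (steps * len(output_lens))  # slots = steps x batch size

# One 800-token request pinned to three shorter ones:
print(static_batch_utilization([800, 10, 120, 60]))  # → 0.309375, i.e. ~31% useful work
```

The more skewed the output lengths, the worse the number gets; a single long request can drag a whole batch below 10% useful work.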


The Insight: Batch at the Iteration Level

Continuous batching (also called in-flight batching or iteration-level scheduling) reframes the problem.

Instead of batching at the request level, batch at the token-generation step. At each forward pass:

  1. Look at all in-flight requests.
  2. For each, determine if it needs to generate another token (still active) or has finished.
  3. Build a batch of the active requests.
  4. Run one forward pass.
  5. Finished requests leave the batch immediately; their slots are available.
  6. Waiting requests can join the batch on the next iteration.

The critical property: a request that finishes doesn’t wait for anyone. Its slot frees, and a waiting request takes its place.
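The loop above can be sketched as a toy simulation (pure Python, illustrative only; a real server schedules actual forward passes and KV-cache slots, not counters):

```python
from collections import deque

def simulate(requests, max_batch=4):
    """Iteration-level scheduling, simulated.
    `requests` maps request id -> number of tokens it will generate.
    Returns the iteration at which each request finishes."""
    waiting = deque(requests)   # request ids not yet admitted
    active = {}                 # id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit waiting requests into any free slots (step 6 above).
        while waiting and len(active) < max_batch:
            rid = waiting.popleft()
            active[rid] = requests[rid]
        # One "forward pass": every active request emits one token (steps 3-4).
        step += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:        # finished: slot frees immediately (step 5)
                finished_at[rid] = step
                del active[rid]
    return finished_at

lens = {"req1": 10, "req2": 2, "req3": 8, "req4": 13, "req5": 10, "req6": 8}
print(simulate(lens, max_batch=4))
# req2 finishes at step 2 and req5 takes its slot at step 3; all six requests
# complete in 16 steps. Static batching (two batches of 4 and 2) would need
# max(10, 2, 8, 13) + max(10, 8) = 23 steps.
```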

Visually:

Static batching timeline:
req1 [==========] (done)     waiting...
req2 [==] (done)             waiting...
req3 [========] (done)       waiting...
req4 [=============] (done)  waiting...
                              ^ all finish together, then next batch starts

Continuous batching timeline:
req1 [==========]
req2 [==] (done, req5 joins) [==========]
req3 [========] (done, req6 joins) [========]
req4 [=============]

The total time is roughly max(individual latencies), not sum. Throughput tracks the steady-state rate of token generation, not the worst-case request.


Why It Works

Three properties of LLM inference align to make this fast:

1. Each iteration is a fixed-cost forward pass. Whether there are 4 or 16 active requests, the forward pass cost is roughly constant. The GPU parallelizes across the batch dimension. So packing more active requests into each iteration is nearly free.

2. KV cache is position-independent. The attention operation doesn’t care where in a sequence a token is. Mixing requests at different generation positions in the same batch works, as long as each gets its own KV cache.

3. GPUs are wide. A single H100 has 132 SMs and can process thousands of attention heads in parallel. For small batches, it’s underutilized; for batches of 32–128 active requests, it fills up nicely.


Implementation Details That Matter

Attention masking and variable lengths. Requests at different positions need careful masking to avoid cross-contamination. This is where PagedAttention (in vLLM) or equivalent techniques earn their keep — they make it cheap to mix requests at any positions in one batch.

Prefill vs decode phases. Prefill (processing the initial prompt) is compute-bound. Decode (generating tokens) is memory-bandwidth-bound. Mixing them naively hurts both. Modern servers use chunked prefill — breaking a long prompt prefill into chunks that co-batch with decode steps — to smooth out utilization.
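As an illustration (a sketch, assuming a recent vLLM; the model name and token budget are placeholders to adapt):

```shell
# Chunked prefill: long prompts are split into pieces that co-batch
# with decode steps, capped by the per-iteration token budget.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096
```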

Scheduling policy. First-come-first-served is the simplest policy. Priority, deadline, and fairness policies add complexity but help with multi-tenant or premium/basic tiers.

Preemption. What happens when a new request arrives but KV cache memory is full? Two common choices: evict a running request and recompute its KV cache from scratch when it is rescheduled (recomputation), or swap its cache blocks out to CPU memory and copy them back later (swapping). Both hurt the preempted request's latency; frequent preemption is a sign you have admitted too many requests.

Max concurrent requests. You need to cap this. If you admit too many, every active request gets a tiny slice of KV cache and throughput tanks. Production servers expose max_num_seqs or similar.


Real Benchmarks

Same hardware (single A100 80GB), same model (Llama-2-13B-chat), same workload (16K sampled chat turns from LMSys-Chat-1M, variable input/output lengths):

| Batching strategy | Throughput (tokens/s) | P50 latency | P99 latency |
|---|---|---|---|
| No batching (single-stream) | 52 | 28 ms | 38 ms |
| Static batching (batch=32) | 680 | 62 ms | 2,400 ms |
| Continuous batching (vLLM defaults) | 2,950 | 45 ms | 180 ms |

Continuous batching delivers 4.3x the throughput and a 13x better P99. Why do both improve at once? Because static batching's tail latency is dominated by head-of-line blocking, which continuous batching eliminates.


The Downsides and Gotchas

1. Latency variance. Each iteration takes a slightly different amount of time depending on the batch's composition, so streamed tokens arrive at an uneven cadence. It is still typically smoother than static batching, but not perfectly smooth. Most users don't notice; some real-time apps do.

2. Memory pressure increases. You’re running more concurrent requests. Each needs KV cache. Pair continuous batching with efficient KV cache management (PagedAttention) or you’ll run out of memory fast.

3. Debugging is harder. A single forward pass now serves many requests. Tracing and logging need to carefully track which token belongs to which request.

4. Structured output + streaming. Constrained decoding systems (guided decoding, Outlines-style grammars) add per-token overhead that can interact poorly with high concurrency. Benchmark before assuming.

5. Prefix caching amplification. Continuous batching works even better with prefix caching — if many requests share a system prompt, the shared KV cache serves them all. Turn this on.


Servers That Implement It

As of 2024, every production-grade LLM inference server supports continuous batching: vLLM, Hugging Face Text Generation Inference (TGI), and NVIDIA TensorRT-LLM (which calls it in-flight batching) all ship it by default.

If you are running a 2023-or-earlier version of anything, check. Static batching isn't always explicit in the docs, but it's easy to test: fire 16 concurrent requests, one with max_tokens=2000 and the rest with max_tokens=50. If the short ones come back while the long one is still generating, continuous batching is active. If everything finishes together only after the long request completes, it's not.
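A minimal probe along these lines (a sketch, assuming an OpenAI-compatible /v1/completions endpoint; the URL and model name are placeholders):

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # placeholder: your server's endpoint

def make_payload(max_tokens):
    # "your-model" is a placeholder for whatever the server is actually serving.
    return {"model": "your-model", "prompt": "Write a story.", "max_tokens": max_tokens}

def timed_request(payload):
    """POST one completion request and return its wall-clock latency in seconds."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start

def probe():
    # One long request plus 15 short ones, fired concurrently.
    payloads = [make_payload(2000)] + [make_payload(50)] * 15
    with ThreadPoolExecutor(max_workers=16) as pool:
        latencies = list(pool.map(timed_request, payloads))
    long_latency, short_latencies = latencies[0], latencies[1:]
    if max(short_latencies) < long_latency * 0.5:
        print("Short requests returned early: continuous batching looks active.")
    else:
        print("Everything finished together: likely static batching.")
```

Call probe() while the server is up; with continuous batching, the 15 short requests should return long before the 2000-token one.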


Tuning Continuous Batching for Your Workload

Three knobs that matter:

1. max_num_seqs (concurrent active requests). Start at 256 for H100, 128 for A100. Too low underuses GPU; too high runs out of KV cache. Tune while watching P99 latency.

2. gpu_memory_utilization. Push it to 0.92–0.95 for pure inference nodes. More memory = more KV blocks = more concurrent requests.

3. max_num_batched_tokens (per-iteration token budget). Caps total tokens processed per forward pass, balancing prefill cost vs decode co-batching. Default usually fine; tune if you have very long prompts.
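Pulled together as a vLLM invocation (a sketch; the model name and values are placeholders to tune against your own workload):

```shell
# Starting point for a single A100; watch P99 latency while adjusting.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.92 \
  --max-num-batched-tokens 8192
```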


The Bottom Line

Continuous batching is a free 2–5x throughput gain on the exact same hardware. If your inference server doesn't support it, adopting it is the single biggest optimization you can make. If it does, tune it for your workload shape; the default settings are not always optimal.

For a deep dive on the memory management side, see PagedAttention Explained. For the broader serving architecture, see vLLM: The Open-Source Inference Engine.


Further Reading

Running an inference fleet and suspect you’re leaving throughput on the table? Get in touch — we can profile your setup in a day.
