Infrastructure

vLLM vs TGI vs Triton: LLM Inference Server Benchmarks

Balys Kriksciunas 6 min read
#ai #infrastructure #vllm #tgi #triton #tensorrt-llm #benchmark #inference


If you’re self-hosting an LLM in 2025, three production-grade inference servers dominate: vLLM, Hugging Face TGI, and NVIDIA Triton + TensorRT-LLM. All three support continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs. The differences show up in the margins — throughput under load, latency tails, operational burden, and ecosystem fit.

This post reports the benchmarks we ran across all three on identical hardware, plus a decision framework for choosing between them.


The Contenders

vLLM — Open-source, Python/CUDA. Reference implementation for PagedAttention. Broadest community, fastest feature velocity. Our default recommendation for most teams.

TGI (Text Generation Inference) — Hugging Face’s server, Rust/Python. Tight integration with HF Hub. Slightly more opinionated on configuration. Battle-tested at HF’s own Inference Endpoints.

NVIDIA Triton + TensorRT-LLM — Enterprise-grade serving platform from NVIDIA. TensorRT-LLM optimizes model engines; Triton serves them. Highest raw throughput if you invest in the engine build pipeline.

Honorable mentions, covered briefly at the end: SGLang, LMDeploy, LightLLM, MLC-LLM.


Benchmark Setup

Fair benchmarking is hard. We tried to minimize the usual pitfalls.

Hardware: Single-node 8x H100 80GB (via CoreWeave), also tested on 2x A100 80GB and 1x L40S 48GB.

Models:

- Llama-3.1-8B
- Llama-3.1-70B

Workload: 10k sampled messages from LMSys-Chat-1M, filtered to realistic production length distribution (avg 320 input / 180 output tokens, P99 at 2500 / 1200).

Concurrency: ramped from 1 to 256 simultaneous requests.

What we measured:

- Throughput (output tokens per second)
- TTFT (time to first token), P50 and P99
- ITL (inter-token latency), P50 and P99
- End-to-end request latency, P99
- Failure rate under load

Every server was tuned with sensible production settings — not default, but not cherry-picked either. We used each project’s own recommendations as a starting point.
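Concretely, TTFT and ITL fall out of per-token arrival timestamps recorded by the load generator. A minimal sketch of that computation (the helper names are ours, not from any of the servers):

```python
# Sketch: deriving TTFT/ITL and percentiles from token arrival timestamps.
# Timestamps are in seconds from a monotonic clock.
from statistics import quantiles

def request_metrics(start_time, token_times):
    """TTFT and inter-token latencies for one streamed response."""
    ttft = token_times[0] - start_time
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itl

def pct(samples, p):
    """Approximate p-th percentile via statistics.quantiles (99 cut points)."""
    return quantiles(sorted(samples), n=100)[p - 1]

# Example: a request that started at t=0.0 and streamed 5 tokens.
ttft, itl = request_metrics(0.0, [0.062, 0.080, 0.098, 0.116, 0.134])
# ttft is 0.062 s (62 ms); each inter-token gap is ~0.018 s (18 ms)
```

Aggregating `pct(all_ttfts, 99)` and `pct(all_itls, 50)` across the run yields the table columns below.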


Llama-3.1-8B Results (1x H100 80GB)

At concurrency 64 (sweet spot for single H100):

| Server | Throughput (tok/s) | TTFT P50 | TTFT P99 | ITL P50 | End-to-end P99 |
|---|---|---|---|---|---|
| vLLM 0.6.x | 4,840 | 62ms | 310ms | 18ms | 7.2s |
| TGI 2.3.x | 4,590 | 58ms | 265ms | 19ms | 7.9s |
| Triton + TRT-LLM | 5,210 | 48ms | 220ms | 16ms | 6.8s |

TensorRT-LLM wins on throughput by ~8% over vLLM and cuts P50 TTFT by roughly 20%. The gap is real but modest for small models.


Llama-3.1-70B Results (4x H100 80GB, TP=4)

At concurrency 128:

| Server | Throughput (tok/s) | TTFT P50 | TTFT P99 | ITL P50 | End-to-end P99 |
|---|---|---|---|---|---|
| vLLM 0.6.x | 6,420 | 180ms | 1.2s | 42ms | 15.1s |
| TGI 2.3.x | 6,180 | 190ms | 1.3s | 44ms | 16.4s |
| Triton + TRT-LLM | 8,290 | 145ms | 920ms | 36ms | 12.6s |

Triton pulls ahead more clearly at 70B. ~30% throughput advantage, noticeable TTFT improvement. TensorRT-LLM’s kernel optimizations matter more at larger scale.


Prefix Caching Impact (RAG Workload, 2K Shared System Prompt)

Real-world RAG-style workloads often have a large shared system prompt. With prefix caching enabled:

| Server | Throughput (tok/s) | Throughput vs baseline |
|---|---|---|
| vLLM + prefix caching | 9,110 | +42% |
| TGI + prefix caching | 8,640 | +39% |
| Triton (with KV reuse) | 10,420 | +26% |

Prefix caching is a bigger multiplier for vLLM and TGI than for Triton, because Triton’s baseline is already higher. All three benefit significantly.
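The mechanics are easy to sanity-check with arithmetic. A sketch assuming this post's workload shape (2K-token shared prefix, ~320 tokens of per-request input):

```python
# Back-of-envelope for the prefix-caching gain. Numbers below mirror this
# post's RAG workload; the function itself is generic.
def prefill_tokens_saved(shared_prefix, per_request_input, n_requests):
    """Fraction of prefill tokens skipped once the shared prefix's KV cache
    is reused across requests (the first request still pays full cost)."""
    total = n_requests * (shared_prefix + per_request_input)
    cached = shared_prefix + n_requests * per_request_input  # prefix prefilled once
    return 1 - cached / total

frac = prefill_tokens_saved(2000, 320, 1000)
# With 1,000 requests, ~86% of prefill tokens are skipped -- but decode work
# is untouched, which is why end-to-end throughput gains (+26% to +42%) are
# smaller than the raw prefill savings.
```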


Feature Comparison

| Feature | vLLM | TGI | Triton + TRT-LLM |
|---|---|---|---|
| Continuous batching | ✅ | ✅ | ✅ |
| Tensor parallelism | ✅ | ✅ | ✅ |
| Pipeline parallelism | Partial | Partial | ✅ |
| FP8 | ✅ | ✅ | ✅ |
| AWQ, GPTQ | ✅ | ✅ | ✅ |
| Prefix caching | ✅ | ✅ | ✅ (KV reuse) |
| Speculative decoding | ✅ | ✅ | ✅ |
| Structured output | ✅ (Outlines/guided) | ✅ | Basic |
| Multi-LoRA serving | ✅ | ✅ | Partial |
| Chunked prefill | ✅ | ✅ | ✅ |
| Disaggregated serving | Partial | No | ✅ |
| OpenAI-compatible API | ✅ | ✅ | ✅ (via front-end) |
| Multi-model per server | Partial | No | ✅ |
| AMD GPU support | ✅ | ✅ | No |
| Engine build required | No | No | Yes |

The key asymmetry: TensorRT-LLM requires an ahead-of-time engine build for each (model, hardware, config) combination. This is hours of CI work per model. vLLM and TGI load weights directly.
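To make the build-matrix cost concrete, here is a hedged sizing sketch; the model list, GPU types, build configs, and per-build time are illustrative assumptions, not measurements from this post:

```python
# Rough sizing of a TensorRT-LLM engine-build matrix: one ahead-of-time
# build per (model, GPU type, build config) tuple. All inputs are assumed.
from itertools import product

models = ["llama-3.1-8b", "llama-3.1-70b"]
gpus = ["h100", "a100", "l40s"]
configs = ["bf16-tp1", "fp8-tp1", "bf16-tp4"]  # hypothetical build configs

builds = list(product(models, gpus, configs))
hours_per_build = 1.5  # assumed average CI time per engine build
print(f"{len(builds)} engine builds, ~{len(builds) * hours_per_build:.0f} CI hours")
```

Even this small matrix yields 18 builds, and every TensorRT-LLM upgrade or hardware change triggers rebuilds.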


Operational Burden

vLLM:

- `pip install` and point at Hugging Face weights; no build step, a fresh deployment takes ~20 minutes
- Single process with a built-in OpenAI-compatible server
- Fast release cadence: pin versions and re-benchmark on upgrade

TGI:

- Ships as an official Docker image; launch with a model ID from the HF Hub
- More opinionated configuration, so fewer knobs to mis-set
- Defaults battle-tested on HF's own Inference Endpoints

Triton + TensorRT-LLM:

- Ahead-of-time engine build per (model, hardware, config) combination, maintained as a CI pipeline
- Engines must be rebuilt on TensorRT-LLM upgrades and hardware changes
- The most mature production tooling (metrics, model management) once it's running

The operational gap is real. A fresh vLLM deployment takes 20 minutes. A fresh TensorRT-LLM + Triton deployment, including engine build pipeline, takes a week of engineer time the first time.


Latency Behavior Under Load

Throughput is one view; latency tails are another. At concurrency 256 on Llama-3.1-70B:

| Server | ITL P50 | ITL P99 | Failures |
|---|---|---|---|
| vLLM | 48ms | 210ms | 0.3% (mostly timeouts) |
| TGI | 52ms | 260ms | 0.4% |
| Triton | 40ms | 160ms | 0.1% |

Triton has tighter tails, which matters for latency-sensitive user experiences. vLLM and TGI are comparable.


The Decision Framework

Pick vLLM if:

- You want the best default: near-top performance, the broadest feature set, and the fastest-moving community
- You can't justify an engine-build pipeline and want to load weights directly
- You need multi-LoRA serving, guided/structured output, or AMD GPUs

Pick TGI if:

- You're already invested in the Hugging Face ecosystem (Hub, Inference Endpoints)
- You value opinionated, battle-tested defaults over maximum configurability

Pick Triton + TensorRT-LLM if:

- You need the highest throughput and tightest latency tails, especially for 70B-class models
- You have the engineering capacity to own an engine-build pipeline
- You run NVIDIA hardware exclusively or need multi-model serving from one server

Our split in the field: ~70% vLLM, ~15% TGI, ~15% Triton. Triton’s share rises with customer size.


The Up-and-Comers

Worth your attention but not yet mainstream:

- SGLang: RadixAttention for aggressive KV/prefix reuse; strongest at structured output and multi-call programs
- LMDeploy: TurboMind engine from the InternLM team, with strong quantized-inference performance
- LightLLM: lightweight, pure-Python server with fine-grained scheduling
- MLC-LLM: TVM-based compilation targeting many backends, including non-NVIDIA and on-device

SGLang specifically is worth evaluating if your workload involves complex branching, many tool calls per request, or structured output. It’s the fastest server for those patterns.


Known Caveats

1. Benchmark numbers age fast. vLLM, TGI, and TRT-LLM all release every 2–4 weeks. The gap between them at any given moment shifts. Rerun if you care.

2. Workload shape matters. Long prompts + short outputs favor different servers than short prompts + long outputs. Don’t trust a single benchmark.

3. FP8 quality. All three have FP8 support, but the exact quantization routines differ subtly. Always validate on your own eval set before deploying FP8 in production.

4. Costs. TRT-LLM’s engineering overhead is real. On small fleets, the simpler server saves engineering time that dwarfs perf wins.
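One way to reason about that trade-off is a break-even calculation. A sketch with loudly assumed inputs (the GPU price, engineer rate, and hours are our assumptions, not figures from this post):

```python
# Hedged break-even sketch for caveat 4: when does a throughput advantage
# repay one-off engineering overhead? All dollar figures are assumptions.
def breakeven_gpu_hours(speedup, gpu_cost_per_hr, eng_hours, eng_rate_per_hr):
    """GPU-hours of serving needed before compute saved by `speedup`
    (fractional, e.g. 0.30 for +30%) exceeds the one-off engineering cost."""
    # A (1 + speedup)x faster server needs speedup/(1 + speedup) fewer GPU-hours
    # for the same token volume.
    saved_per_gpu_hour = gpu_cost_per_hr * (speedup / (1 + speedup))
    return (eng_hours * eng_rate_per_hr) / saved_per_gpu_hour

# Assumed: +30% throughput (the 70B result above), $4/hr per H100,
# 40 engineer-hours for the first engine pipeline at $150/hr.
hours = breakeven_gpu_hours(0.30, 4.0, 40, 150)  # ~6,500 GPU-hours
```

Under those assumptions, a small fleet takes months to break even; a large fleet crosses over in days, which matches why Triton's share rises with customer size.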


Further Reading

Benchmarking inference servers for your workload? Reach out — we’ll help you compare apples to apples.
