vLLM and SGLang Are Converging — and That Changes the Inference Stack

Balys Kriksciunas · Sat May 23 2026 · 9 min read

#ai #infrastructure #inference #vllm #sglang #flashinfer #llm-serving #ecosystem

3D render of two GPU server racks merging through glowing fiber-optic data streams — orange vLLM and teal SGLang sides converging into a white-gold fusion center with translucent geometric code fragments floating above

Both engines now share NVIDIA's FlashInfer kernels and expose identical OpenAI-compatible APIs. Meanwhile, SGLang spun out as RadixArk with $100M in seed funding, and vLLM hit 2M weekly installs. The inference layer is consolidating faster than anyone expected — here's what that means for teams building on top of it.

This time last year, picking an inference engine meant picking a side. The vLLM-versus-SGLang debate had the energy of a framework war — blog posts with side-by-side benchmarks, sprawling GitHub issue threads, and production teams forced to choose. The engines had different memory management strategies, different schedulers, different attention kernels, and different community cultures. Choosing one locked you into a stack.

That era is ending. Quietly, rapidly, and largely below the radar of the AI discourse that’s been consumed by agent architectures and reasoning models, the two dominant open-source inference engines are converging. They’re converging on kernels, on APIs, and on the operational primitives that matter to production teams. At the same time, they’re diverging sharply on governance — and that divergence may matter more for the next five years than any benchmark delta ever did.

If you’re serving models in production, or planning to, the convergence of vLLM and SGLang isn’t a curiosity. It’s a rewrite of the assumptions your infrastructure decisions rest on.

The convergence that’s already happened

Let’s start with what’s concrete. As of May 2026, vLLM and SGLang share:

1. The same attention kernels

In April 2026, NVIDIA released its most optimized inference kernels — the kind that previously only shipped inside TensorRT-LLM — through FlashInfer, an open-source kernel library originally developed by a PhD student. Both vLLM and SGLang now consume these kernels directly. This isn’t a compatibility shim; it’s a shared kernel substrate.

The implication is stark. When both engines run identical FlashInfer kernels under the hood, the remaining performance differences come from scheduling and orchestration overhead, not kernel quality. As one production engineer noted on LinkedIn: “Recent benchmarks show SGLang maintains a throughput advantage over vLLM even when both use identical FlashInfer kernels. This reveals the bottleneck isn’t the attention kernel itself — it’s the engine’s internal orchestration overhead.”

The kernel war is over. The scheduler war is what’s left.

2. OpenAI-compatible API surfaces

Both engines expose identical /v1/chat/completions and /v1/completions endpoints. This has been true for a while, but the compatibility has deepened to the point where switching engines requires changing nothing but the Docker run command. Your application code, your prompt templates, your streaming logic — none of it needs to change.

For teams that run both engines in different environments (SGLang for prefix-heavy agent workloads, vLLM for heterogeneous model serving), this API convergence means the operational boundary between engines is now a deployment flag, not an architectural decision. That’s a big deal.

3. Shared model support and hardware targets

Both engines support 60+ model families across NVIDIA, AMD, and increasingly Google TPU hardware. The model support gap — which was once a legitimate reason to pick one over the other — has narrowed to near-parity. When a new model drops (DeepSeek V3, Gemma 4, Llama 4), both engines ship support within days. vLLM’s v0.19.0 release shipped with 200 contributors and immediate support for new model families. SGLang matched pace.

The divergence that matters more

If the technical layer is consolidating, the organizational layer is fragmenting in a way that will shape the next five years.

SGLang: the RadixArk spinout

In January 2026, the SGLang project spun out of UC Berkeley’s Sky Computing Lab as RadixArk, a commercial startup valued at approximately $400 million in an Accel-led round. By May 2026, RadixArk officially launched with $100 million in seed funding, naming Google, Microsoft, NVIDIA, Oracle, AMD, LinkedIn, and xAI as production users.

This is not a typical open-source corporate steward situation. RadixArk is led by Ying Sheng — an SGLang core contributor who previously worked at xAI — and has Intel CEO Lip-Bu Tan as an angel investor. The company has an explicit commercial roadmap: the Miles framework for reinforcement learning training loops, enterprise support contracts, and a managed cloud offering. SGLang remains Apache 2.0 licensed, but the project’s velocity and strategic direction now sit inside a venture-backed company with a $400M valuation to justify.

The community dynamics are already shifting. SGLang averages 3–5 day response times on issues (compared to vLLM’s 12 hours to 3 days), and its total contributor count is less than half of vLLM’s, according to Ant Group’s analysis. A leaner, faster-moving team — but one that’s now structurally accountable to investors.

vLLM: distributed governance, 2M installs per week

vLLM took the opposite path. It remains a community-governed project under the Linux Foundation, with no single corporate backer. Its weekly installs doubled to approximately 2 million in March 2026, driven by community meetups, the v0.19.0 release, and adoption for MLPerf Inference v6.0. Nearly 2,000 contributors have submitted PRs, and the project averages 10+ new issues daily.

vLLM’s governance model means no single entity can steer the roadmap. That’s simultaneously its greatest strength and its greatest limitation. It keeps the project neutral — crucial for cloud providers and enterprises that don’t want dependency on a vendor — but it also means decisions about speculative decoding, disaggregated serving, and structured output can take longer to land.

What the divergence means for you

If you’re a startup with 5 engineers and a single model in production, both engines will serve you fine. The API compatibility means you can switch later.

If you’re an enterprise with heterogeneous workloads — some prefix-heavy agent pipelines, some one-shot inference, some batch — the divergence starts to bite. vLLM’s broader model support and faster community response times make it the safer default for diverse workloads. SGLang’s RadixAttention gives it a 29% throughput advantage over fully optimized vLLM for prefix-sharing workloads (multi-turn chat, RAG, agent orchestration), and its structured output engine is 3–10× faster than alternatives. If your workload is prefix-heavy, the performance gap is real and worth the organizational risk of betting on a venture-backed startup.

But here’s the thing: the convergence trend suggests that gap will narrow, not widen. As both engines converge on shared kernels, the scheduler optimizations that give SGLang its edge become replicable. vLLM’s community has 4× the contributor bandwidth. The question isn’t whether vLLM can catch up on prefix-aware scheduling — it’s how fast.

The third player nobody talks about

While the vLLM-SGLang duopoly grabs attention, Modular’s MAX engine is quietly compiling Mojo into CUDA kernels that outperform both on dense models at high concurrency. Fish Audio’s benchmarks show MAX delivering 16% higher throughput than vLLM on L40 GPUs with a p99 TTFT of 13.1ms vs 23.6ms.

MAX isn’t open source in the same way — Mojo’s compiler stack is source-available, not Apache 2.0 — and its model support is narrower. But for teams that standardize on a small set of dense models and want maximum GPU efficiency, it’s a legitimate third option. And unlike vLLM (governed by committee) or SGLang (governed by a startup), MAX is governed by Modular — a well-funded company with a clear technical vision and no split between open-source purity and commercial reality.

We’re not saying MAX will unseat the duopoly. We’re saying that a year ago, the inference engine conversation was “vLLM or SGLang?” Today, it’s “vLLM, SGLang, or MAX?” — and the answer increasingly depends on your workload pattern, not generic benchmarks.

Why this convergence changes how you should think about inference

The practical upshot of all this: the inference engine layer is becoming a commodity faster than anyone predicted. When the kernels are shared, the APIs are identical, and the performance deltas are in the 10–20% range on specific workloads, the engine you pick matters less than the infrastructure around it.

What does still matter:

Disaggregated serving architecture. Whether you use vLLM or SGLang, separating prefill and decode phases onto different GPU types is yielding 30–50% throughput wins in production. This is an architectural decision that sits above the engine layer and delivers more impact than switching engines.

GPU procurement and placement. The engine doesn’t matter if your GPUs are in the wrong region, on the wrong interconnect, or running at 40% utilization because your autoscaler is misconfigured. Our GPU FinOps analysis found that most teams lose more money to poor scheduling than they’d save by switching inference engines.

Continuous batching configuration. Both engines support continuous batching, but the tuning parameters — max batch size, queue delay, prefill chunking — are workload-dependent and matter more than engine choice. Get these wrong and you’ll leave 30%+ throughput on the table regardless of which engine you run. We covered the fundamentals in our continuous batching deep dive.

The agent infrastructure layer above inference. This is the real shift. Inference engines serve tokens. Agent frameworks (LangGraph, OpenAI Agents SDK, Claude SDK) consume those tokens and make decisions. The infrastructure that sits between them — state management, checkpointing, tool execution sandboxes — is where the hard engineering problems live in 2026. The inference engine is becoming a fungible backend to that layer. That’s the convergence story in one sentence.

What we’re watching

Three signals will tell us where this goes over the next 6–12 months:

vLLM’s prefix-aware scheduler. If vLLM ships a competitive RadixAttention equivalent within the next two releases, SGLang’s primary architectural advantage evaporates. The vLLM v0.21 roadmap suggests this is in active development.
RadixArk’s enterprise pricing. Once RadixArk launches its managed cloud offering and support contracts, we’ll see whether the SGLang community stays open or bifurcates into open-core and paid tiers. The Apache 2.0 license provides a floor, but managed services and enterprise features will test the community’s tolerance for a venture-backed steward.
MAX’s model coverage expansion. If Modular ships support for the top 20 model families within 2026, the duopoly becomes a triopoly. If it stays limited to dense decoder-only architectures, MAX remains a specialist tool.

The inference layer is consolidating. The kernels are shared, the APIs are identical, and the engine you pick matters less every quarter. That’s good news for teams building on top of it — and uncomfortable news for anyone whose infrastructure strategy is defined by engine loyalty. The real differentiation in 2026 lives one layer up.

← back to blog

Split-screen comparison of vLLM blue memory block visualization and SGLang orange radix tree data structure on dark background with GPU silhouettes

Comparisons

vLLM vs SGLang: Inference Engine Comparison 2026

We've deployed both at scale. Here's what the benchmarks actually show, where RadixAttention beats PagedAttention, and which engine to pick for your workload.

Apr 30, 2026