The Six Layers of Modern AI Infrastructure
Running AI workloads in production isn't the same as running stateless microservices. A guided tour of the six layers of the modern AI stack — compute, serving, orchestration, data, observability, and the LLM gateway — and the tradeoffs inside each.
Every week a team tells me their “AI project” is stuck. Usually it’s not the model. The model works fine in a notebook. What’s stuck is the stack around it — the GPU scheduling, the vector store, the inference server, the evals, the cost accounting, the fallback paths when a provider returns 503. Those pieces are “AI infrastructure,” and they are where projects die.
This guide walks through the six layers of a modern AI stack, the tradeoffs inside each, and the patterns that keep production systems alive. If you’re planning a platform investment, treat this as your mental map.
Traditional web infrastructure evolved around deterministic request/response: a load balancer, a stateless service, a database, a cache. Modern AI systems inherit all of that, then add constraints the old stack was not built for: GPU scheduling, nondeterministic outputs, streaming token responses, retrieval pipelines, per-token cost accounting, and fallback paths for flaky upstream providers.
The rest of this article breaks the stack into six layers you actually deploy — not a slideware hierarchy, but the real components shipping in 2024.
At the bottom is compute. For most teams this means NVIDIA GPUs on a cloud — A100 (still ubiquitous), H100 (the 2024 workhorse), and the upcoming B200. If you are an AWS shop you may touch Trainium or Inferentia. If you are a Google shop, TPUs.
The critical decisions:
Training vs. inference hardware. Training is memory- and bandwidth-bound; H100 with NVLink dominates. Inference is increasingly throughput-bound; a fleet of L40S or even L4 can beat one H100 on cost per token for a 7–13B model.
Memory ceiling. A single H100 has 80GB; that holds a 70B model at FP8 with short context, or a 13B at FP16 with headroom. Beyond that you need tensor parallelism across GPUs, which raises networking requirements.
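The arithmetic behind these ceilings is worth internalizing. A back-of-envelope sketch (weights only; the KV cache and activations come on top, which is why "short context" matters for the 70B case):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Back-of-envelope weight footprint: parameter count times bytes per
    parameter. FP16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes."""
    return params_billions * bytes_per_param

print(weight_memory_gb(70, 1.0))  # 70.0 GB at FP8 -- barely fits H100's 80 GB
print(weight_memory_gb(13, 2.0))  # 26.0 GB at FP16 -- plenty of headroom
```

Anything past the 80 GB line forces tensor parallelism, which is where the interconnect question below starts to bite.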
Interconnect. NVLink is 900 GB/s inside a node. InfiniBand is 400 Gbps across nodes. Ethernet with RoCE is catching up but nontrivial to tune. Multi-node training cares about this; single-node inference does not.
Sourcing. You can rent on-demand from the hyperscalers, reserve capacity via CoreWeave or Lambda, or commit long-term. In 2024, on-demand H100 is $2.50–$4/hr depending on region and season; reserved can be half that.
If you’re early, rent. If you’re running a fleet bigger than ~64 H100s continuously, do the math on reserved or bare-metal.
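"Do the math" is genuinely a one-liner. A sketch using the prices above (the rates are illustrative midpoints, not quotes):

```python
HOURS_PER_MONTH = 730  # 24 * 365 / 12

def monthly_fleet_cost(n_gpus: int, rate_per_hour: float) -> float:
    """Cost of running a fleet continuously (24/7) at a given hourly rate."""
    return n_gpus * rate_per_hour * HOURS_PER_MONTH

on_demand = monthly_fleet_cost(64, 3.00)  # mid-range of the $2.50-$4/hr band
reserved = monthly_fleet_cost(64, 1.50)   # "reserved can be half that"
print(f"on-demand ${on_demand:,.0f}/mo vs reserved ${reserved:,.0f}/mo")
# a 64-GPU fleet saves roughly $70k/month on reserved capacity
```

The utilization assumption is the whole game: at 30% utilization, on-demand wins again, which is why the advice above starts with "if you're early, rent."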
Above raw compute is the inference server — the process that loads model weights, batches requests, runs the forward pass, and streams tokens.
The three production-grade options in 2024:
- vLLM: open source, built on PagedAttention and continuous batching, dramatically higher throughput than naive HuggingFace generate(). See our PagedAttention deep-dive.
- TGI (Text Generation Inference): HuggingFace's server, the easiest on-ramp if your models already live on the Hub.
- TensorRT-LLM: NVIDIA's compiled-engine approach, peak performance on NVIDIA hardware at the cost of build complexity.

Small and specialized servers have their place: Ollama and LM Studio for local dev, TGI or RunPod endpoints for quick experiments, vLLM for most production self-hosting.
Picking a server early matters. Each makes different tradeoffs around concurrency, token streaming, structured output (JSON mode), tool calling, and observability hooks.
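Continuous batching is the throughput lever these servers share, and it is worth understanding even if you never implement it. A toy scheduler (pure Python, not any server's real implementation) showing the idea: requests join the running batch the moment a slot frees, instead of waiting for the whole batch to drain as in static batching:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_needed: int
    generated: int = 0

def continuous_batching(requests, max_batch: int = 4):
    """Each step: admit waiting requests into free slots, generate one token
    for every running request, and retire finished requests immediately."""
    queue = deque(requests)
    running, completed, steps = [], [], 0
    while queue or running:
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1
        for r in running:
            r.generated += 1
        completed += [r.rid for r in running if r.generated >= r.tokens_needed]
        running = [r for r in running if r.generated < r.tokens_needed]
    return steps, completed

reqs = [Request(i, n) for i, n in enumerate([2, 4, 1, 3, 2])]
steps, done = continuous_batching(reqs, max_batch=2)
print(steps, done)  # 6 steps; static batches of the same load would take 9
```

The short requests (1 and 2 tokens) slot in beside the long ones instead of queueing behind them — that is the entire trick, and it compounds at real batch sizes.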
Layer 3 is where the agent lives. It decides what to do with a user request, which tools to call, how to remember prior turns, when to hand off to a human.
In 2024 the mature options are LangGraph (LangChain's graph-based agent framework), LlamaIndex for retrieval-heavy applications, and a long tail of lighter agent toolkits.
These frameworks are thin — the value is in the glue code, the retries, the fallbacks, and the evals. A good pattern: start with a framework, rewrite the hot paths once you understand them.
The MCP (Model Context Protocol) is the important 2024 development here. It standardizes how tools and data sources are exposed to any LLM. We expect MCP to replace most ad-hoc tool registries within a year. Our Claude Code MCP guide has the current state of the art.
Retrieval-augmented generation (RAG) made the vector database a standard part of the stack. The four leaders in 2024 are Pinecone (managed, most mature), Qdrant (open-source, Rust, fast), Weaviate (rich hybrid search), and Milvus (Zilliz-backed, Kubernetes-native, built for billion-scale deployments).
The 2024 surprise: pgvector caught up for many workloads. If you are already running Postgres, a single CREATE EXTENSION vector; gives you 80% of what you need up to tens of millions of vectors. See pgvector at Scale.
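The whole pgvector path really is that short. A minimal sketch (table name, dimension, and index parameters are illustrative; `<->` is pgvector's L2-distance operator):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)   -- must match your embedding model's dimension
);

-- approximate-nearest-neighbor index; tune lists for your corpus size
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops)
    WITH (lists = 100);

-- k-nearest-neighbor query by L2 distance ($1 = the query embedding)
SELECT id, content
FROM documents
ORDER BY embedding <-> $1
LIMIT 10;
```

One table, one index, one operator — and your vectors live next to the relational data they describe, inside the same transactions and backups.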
Beyond the vector store, a real RAG stack needs a chunking pipeline, an embedding model (and a plan for re-embedding when you change it), index refresh jobs, a reranking step, and retrieval evals.
The data layer is where most RAG systems rot silently. Bad chunking or stale indexes produce technically-correct answers that subtly diverge from ground truth. Evals here are non-negotiable.
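Chunking is usually where the rot starts, and the baseline is trivial to write — which is exactly why it gets no review. A sliding-window sketch (sizes are illustrative; production pipelines chunk on tokens or document structure rather than whitespace):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50):
    """Sliding window over whitespace-split words, with overlap so facts
    that straddle a boundary appear intact in at least one chunk."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_text("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", size=4, overlap=1)
print(chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Every choice here (window size, overlap, split unit) silently shapes retrieval quality months later — which is the argument for putting evals on this layer from day one.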
You cannot debug what you cannot see, and LLMs are black boxes by default. Production teams run at least three observability surfaces: request-level traces (OpenTelemetry piped into Langfuse or Phoenix), infrastructure metrics and dashboards (Datadog or similar), and eval results tracked over time.
The mistake almost every team makes: treating observability as something to bolt on after launch. Build the tracing layer first, then iterate on the product. You will move faster.
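"Tracing first" can start absurdly small. A homegrown sketch of the idea (real stacks emit OpenTelemetry spans to Langfuse, Phoenix, or Datadog; the decorator and span fields here are illustrative):

```python
import functools
import time
import uuid

TRACES = []  # stand-in for a real span exporter

def traced(span_name: str):
    """Record a span around every call: name, id, duration, error flag."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"id": uuid.uuid4().hex, "name": span_name,
                    "start": time.time(), "error": False}
            try:
                return fn(*args, **kwargs)
            except Exception:
                span["error"] = True
                raise
            finally:
                span["duration_s"] = time.time() - span["start"]
                TRACES.append(span)
        return wrapper
    return decorator

@traced("llm.generate")
def call_model(prompt: str) -> str:
    return f"echo: {prompt}"   # placeholder for the real model call

call_model("hello")
print(TRACES[0]["name"])  # llm.generate
```

Twenty lines of decorator today becomes the backbone you swap for OTel later; the product code never changes.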
Between your orchestration layer and the outside world — OpenAI, Anthropic, Google, self-hosted models — sits the gateway. In 2024 this is typically an open-source proxy like LiteLLM, a hosted router like OpenRouter, or a thin in-house service.
The gateway handles model routing, retries and provider failover (those 503 fallback paths from the introduction), rate limiting, API-key management, and per-request cost accounting.
Without a gateway, every service calls OpenAI directly and every service owns retry logic. That does not scale.
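The core of what a gateway centralizes fits on a page. A sketch of retry-with-fallback (the provider callables and error type are illustrative; real gateways add rate limits, key management, and cost metering on top):

```python
import time

class ProviderUnavailable(Exception):
    """Stand-in for a transient failure, e.g. an HTTP 503."""

def call_with_fallback(providers, prompt, retries=3, base_backoff_s=0.01):
    """Try providers in order; retry transient errors with exponential
    backoff before falling through to the next provider."""
    last_error = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(prompt)
            except ProviderUnavailable as e:
                last_error = e
                time.sleep(base_backoff_s * (2 ** attempt))
    raise last_error

# toy providers: the primary is down, the fallback answers
def primary(prompt):
    raise ProviderUnavailable("503")

def fallback(prompt):
    return f"fallback says: {prompt}"

print(call_with_fallback([primary, fallback], "hi"))  # fallback says: hi
```

Owning this logic in one place means one set of retry budgets, one failover policy, and one line item on the cost dashboard — instead of a slightly different copy in every service.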
A mid-sized AI-first product in 2024 typically looks like this:
[ Clients / Apps ]
│
▼
[ API Gateway ] ← auth, rate limits, observability
│
▼
[ Orchestration Layer (LangGraph / custom) ]
│ │ │
▼ ▼ ▼
[ LLM Gateway ] [ Vector DB ] [ Tools / MCP ]
│ │
▼ ▼
[ Inference (vLLM) ] [ Retrieval cache ]
│
▼
[ GPU Fleet (K8s / Ray) ]
│
▼
[ Observability (OTel → Langfuse / Phoenix / Datadog) ]
Small teams collapse this (e.g., skip the gateway, use hosted LLMs only). Large teams fragment it further (separate fine-tuning, eval, and labeling infrastructure). The shape stays recognizable.
If you’re shipping your first AI feature, focus on Layers 3 and 5 first. Pick a framework, wire tracing, and use a hosted model. You can skip Layers 1, 2, and 6 entirely.
If you’re scaling past $10k/month in inference spend, Layer 6 (the gateway) and Layer 2 (self-hosted serving of small models) start paying back. This is usually where a platform team emerges.
If you’re committed to self-hosting frontier-size models, Layer 1 (compute) and Layer 2 (serving) become a full-time concern. Expect a dedicated infra team, reserved capacity contracts, and a real SRE rotation.
Across every size: invest in observability and evals early, and treat AI infrastructure as software. The teams that do this quietly ship reliable products. The teams that don’t spend a year debugging intermittent hallucinations in their weekly deploy.
Looking at your AI stack and want a second opinion? Talk to our engineers about an architecture review.