The AI Infrastructure Stack Explained (2024)
A grounded tour of the six layers that make modern AI systems work — from GPUs and inference servers to vector databases, orchestration, and observability — with the tradeoffs that matter in production.
Kubernetes works fine for stateless HTTP services. It works surprisingly well for batch jobs. It works with some effort for stateful databases. It works, but with many opinions and pitfalls, for GPU-backed AI workloads.
If you’re putting AI inference or training on Kubernetes — and most platform teams eventually do — this primer covers the pieces that are different from normal K8s workloads and the patterns that keep your cluster healthy.
Do you even need Kubernetes for this? Fair question. For a single GPU running a single model, docker run on a VM is simpler. The case for K8s emerges when multiple teams share a GPU fleet, many models need to be deployed and scaled independently, and workloads churn — inference that autoscales, experiments that should bin-pack into spare capacity.
The counter-case — when K8s is wrong — is usually “we have one training job, it runs for two weeks, we have one team.” Use Slurm or just a VM. We’ll cover Slurm in the age of Kubernetes separately.
Any K8s cluster running GPU workloads needs three things installed beyond stock K8s:
The GPU Operator is NVIDIA’s opinionated bundle: driver, container toolkit, device plugin, DCGM exporter, MIG manager. Install it via Helm:
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```
For most clouds (GKE, EKS, AKS, etc.), managed GPU node pools ship with the driver and device plugin pre-installed. Check before double-installing.
The device plugin is the bridge that exposes GPUs to the kubelet as a schedulable resource (nvidia.com/gpu). The GPU Operator installs it automatically. Once it’s running, your pods can request GPUs:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
GPU utilization, memory usage, temperature, and SM occupancy are not visible via standard K8s metrics. The DCGM exporter exposes them as Prometheus metrics. If you don’t have this, you are flying blind.
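A few example queries against the default dcgm-exporter metric names give a feel for what becomes visible:

```promql
# Average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

# Framebuffer memory used, in MiB
DCGM_FI_DEV_FB_USED

# GPU temperature in °C — worth alerting on sustained values near throttle limits
DCGM_FI_DEV_GPU_TEMP
```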
Default K8s scheduling is built for fungible pods on fungible nodes. GPUs are neither fungible (an H100 is not an A100) nor cheap to wait for.
Three things you need to configure:
Every GPU node should be labeled with hardware specifics:
```yaml
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  topology.kubernetes.io/zone: us-east-1a
```
And tainted so non-GPU workloads don’t squat on them:
```yaml
taints:
- key: nvidia.com/gpu
  value: "true"
  effect: NoSchedule
```
Then GPU pods tolerate the taint. Without this, your $30k/GPU nodes will happily run nginx sidecars.
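The matching toleration on the workload side is a few lines in the pod spec — a minimal sketch:

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```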
Don’t stack two replicas of the same inference service on one node. If that node dies, you have zero capacity. Use pod anti-affinity:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: inference-server
        topologyKey: kubernetes.io/hostname
```
GPU capacity is expensive and scarce. Your batch training job should not evict your production inference pod. Define priority classes:
```yaml
# High priority - inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: inference}
value: 1000000
globalDefault: false
---
# Low priority - batch / experiments
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: batch}
value: 100
globalDefault: false
```
This also lets you use preemption — batch jobs get killed to make room for inference surges.
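Wiring a workload to a class is one line in the pod spec. A hedged sketch for an inference pod template (the image name is a placeholder):

```yaml
spec:
  priorityClassName: inference   # can preempt "batch" pods under GPU pressure
  containers:
  - name: server
    image: my-inference-image:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```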
A full H100 is wildly overprovisioned for a small model at low QPS. Three ways to share:
MIG (Multi-Instance GPU). Hardware-partitioned. On an H100 you can split one GPU into up to seven isolated instances, each with dedicated memory and SMs. Strong isolation, predictable performance. The downside: repartitioning requires the GPU to be idle — the MIG manager drains workloads to apply a new layout — so you lose flexibility.
Configure via GPU Operator:
```yaml
migManager:
  enabled: true
  config:
    default: "all-1g.10gb"  # 7 instances per H100
```
Each pod requests e.g. nvidia.com/mig-1g.10gb: 1.
MPS (Multi-Process Service). Software-level sharing. Multiple processes submit CUDA work to a single MPS server, which time-multiplexes it onto the GPU. Lower isolation, higher flexibility.
Time-slicing. The purest software approach. The device plugin lets N pods each request nvidia.com/gpu: 1 and they take turns on the hardware, with no memory isolation between them. Good for dev/test; don’t run production inference this way.
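If time-slicing fits your dev cluster, the device plugin takes a sharing config — a sketch assuming the standard time-slicing config format:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs
```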
Rule of thumb: MIG for multi-tenant production, MPS for cooperating workloads you trust within one team, time-slicing for dev/test only.
Model weights are big. A 70B model is ~140GB. Loading it at pod-start time means pulling 140GB from somewhere, every scale-up, for every replica.
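The arithmetic behind that number is worth internalizing, since it drives every decision in this section. A minimal sketch (dense models, ignoring optimizer state and quantization overhead):

```python
def weights_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate checkpoint size in GB for a dense model."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# 70B parameters at fp16/bf16 (2 bytes per parameter)
print(weights_gb(70, 2))  # 140.0

# The same model quantized to int4 (0.5 bytes) is ~35 GB
print(weights_gb(70, 0.5))  # 35.0
```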
Three patterns:
Bake weights into the image. Simple, reproducible, but images balloon to 100GB+. Registry pulls become the bottleneck. Works for small models; doesn’t for frontier-size.
Download at startup. Weights live in S3/GCS and the pod pulls them on start. Works, but startup is slow and you repeat the transfer on every scale-up.
Node-local cache. Mount a hostPath or a read-only PVC that caches weights across pod restarts. The first pod on a node pays the download cost; subsequent pods mmap the cached files. This is the best pattern for production at scale.
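One way to wire the node-local cache, assuming a hypothetical /var/lib/models directory on each GPU node:

```yaml
volumes:
- name: model-cache
  hostPath:
    path: /var/lib/models       # hypothetical node-local cache directory
    type: DirectoryOrCreate
containers:
- name: inference-server
  volumeMounts:
  - name: model-cache
    mountPath: /models
    readOnly: true              # replicas only read; a warmer job populates it
```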
Some teams layer FluxCD with pre-pulled images, or run a DaemonSet that pre-warms the cache, to keep hot weights resident. For training checkpoints, a parallel filesystem (Weka, Lustre, FSx) is often worth the spend.
The Horizontal Pod Autoscaler scales on CPU and memory. Neither is a good signal for GPU workloads.
What you actually want: signals from the serving path — request queue depth, in-flight requests per replica, or GPU utilization from DCGM.
Options: feed custom metrics to the HPA through the Prometheus Adapter, or use KEDA to scale on queue length and other event sources.
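A sketch of an autoscaling/v2 HPA scaling on a custom per-pod metric; the metric name here is hypothetical and assumes something like the Prometheus Adapter is serving it:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth   # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "8"             # target queued requests per replica
```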
And remember: GPU nodes do not scale instantly. A cold node takes 2–5 minutes to come up with drivers loaded. You need a buffer of warm capacity or your SLOs take hits during scale events.
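One common way to hold that warm buffer is a low-priority "balloon" deployment: placeholder pods that occupy a GPU node until a real workload preempts them. A sketch, reusing the low-priority batch class defined earlier:

```yaml
# Balloon deployment: holds one warm GPU node; preempted instantly
# when a higher-priority inference pod needs to schedule.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-warm-buffer
spec:
  replicas: 1
  selector:
    matchLabels: {app: gpu-warm-buffer}
  template:
    metadata:
      labels: {app: gpu-warm-buffer}
    spec:
      priorityClassName: batch
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          limits:
            nvidia.com/gpu: 1
```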
Single-node inference doesn’t care about network. Multi-node training absolutely does.
If you’re doing distributed training on K8s, you need: a high-bandwidth interconnect (InfiniBand/RoCE, or a cloud equivalent like AWS EFA), the matching RDMA/SR-IOV device plugins so pods can actually reach those NICs, and NCCL configured for the real network topology.
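At the pod level much of this surfaces as NCCL environment variables. A hedged sketch — the interface and HCA names are placeholders for your environment:

```yaml
env:
- name: NCCL_SOCKET_IFNAME   # interface for NCCL bootstrap (placeholder)
  value: eth0
- name: NCCL_IB_HCA          # which InfiniBand HCAs to use (placeholder)
  value: mlx5
- name: NCCL_DEBUG
  value: INFO                # log transport selection to verify RDMA is used
```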
If you are not doing distributed training, ignore this section. If you are, this section is worth a full guide of its own.
Stock kube-scheduler works fine for single-GPU inference pods. For batch/training, it’s not ideal — it doesn’t understand gang scheduling (all-or-nothing for a multi-pod job) or fair share.
We cover these in depth in GPU Scheduling on Kubernetes.
The patterns above come from a dozen platform teams running GPU K8s in production.
Setting up Kubernetes for AI workloads and want a second pair of eyes? Get in touch — we’ve stood up GPU K8s clusters from 4 nodes to 400.