Infrastructure

Kubernetes for GPU Workloads: A Primer

Balys Kriksciunas 7 min read
#ai#infrastructure#kubernetes#gpu#mig#scheduler#nvidia-gpu-operator#devops

Kubernetes for GPU Workloads: A Primer

Kubernetes works fine for stateless HTTP services. It works surprisingly well for batch jobs. It works with some effort for stateful databases. It works, but with many opinions and pitfalls, for GPU-backed AI workloads.

If you’re putting AI inference or training on Kubernetes — and most platform teams eventually do — this primer covers the pieces that are different from normal K8s workloads and the patterns that keep your cluster healthy.


Why Use Kubernetes for AI at All?

Fair question. For a single GPU running a single model, docker run on a VM is simpler. The case for K8s emerges when:

The counter-case — when K8s is wrong — is usually “we have one training job, it runs for two weeks, we have one team.” Use Slurm or just a VM. We’ll cover Slurm in the age of Kubernetes separately.


The Three Non-Negotiable Components

Any K8s cluster running GPU workloads needs three things installed beyond stock K8s:

1. NVIDIA GPU Operator (or GPU Driver DaemonSet)

The GPU Operator is NVIDIA’s opinionated bundle: driver, container toolkit, device plugin, DCGM exporter, MIG manager. Install it via Helm:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator

For most clouds (GKE, EKS, AKS, etc.), managed GPU node pools ship with the driver and device plugin pre-installed. Check before double-installing.

2. Device Plugin

This is the bridge that exposes GPUs to the kubelet as a schedulable resource (nvidia.com/gpu). The GPU Operator installs it automatically. Once it’s running, your pods can request GPUs:

resources:
  limits:
    nvidia.com/gpu: 1

3. DCGM Exporter + Monitoring

GPU utilization, memory usage, temperature, and SM occupancy are not visible via standard K8s metrics. The DCGM exporter exposes them as Prometheus metrics. If you don’t have this, you are flying blind.


Scheduling: The Part That Bites

Default K8s scheduling is built for fungible pods on fungible nodes. GPUs are neither fungible (an H100 is not an A100) nor cheap to wait for.

Three things you need to configure:

Node labels and taints

Every GPU node should be labeled with hardware specifics:

nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  topology.kubernetes.io/zone: us-east-1a

And tainted so non-GPU workloads don’t squat on them:

taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

Then GPU pods tolerate the taint. Without this, your $30k/GPU nodes will happily run nginx sidecars.

Pod anti-affinity for replicas

Don’t stack two replicas of the same inference service on one node. If that node dies, you have zero capacity. Use pod anti-affinity:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: inference-server
          topologyKey: kubernetes.io/hostname

Priority classes

GPU capacity is expensive and scarce. Your batch training job should not evict your production inference pod. Define priority classes:

# High priority - inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: inference}
value: 1000000
globalDefault: false

---
# Low priority - batch / experiments
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: batch}
value: 100
globalDefault: false

This also lets you use preemption — batch jobs get killed to make room for inference surges.


Fractional GPUs: MIG, MPS, and Time-Slicing

A full H100 is wildly overprovisioned for a small model at low QPS. Three ways to share:

MIG (Multi-Instance GPU)

Hardware-partitioned. On H100 you can split one GPU into up to seven isolated instances, each with dedicated memory and SMs. Strong isolation, predictable performance. The downside: partitions are fixed at node boot, so you lose flexibility.

Configure via GPU Operator:

migManager:
  enabled: true
  config:
    default: "all-1g.10gb"  # 7 instances per H100

Each pod requests e.g. nvidia.com/mig-1g.10gb: 1.

MPS (Multi-Process Service)

Software-level sharing. Multiple processes submit CUDA work to a single MPS server, which time-multiplexes onto the GPU. Lower isolation, higher flexibility.

Time-slicing

Purest software approach. The device plugin lets N pods each request nvidia.com/gpu: 1 and they take turns. Good for dev/test; don’t run production inference this way.

Rule of thumb:


Storage: The Underrated Concern

Model weights are big. A 70B model is ~140GB. Loading it at pod-start time means pulling 140GB from somewhere, every scale-up, for every replica.

Three patterns:

1. Build weights into the image

Simple, reproducible, but images balloon to 100GB+. Registry pulls become the bottleneck. Works for small models; doesn’t for frontier-size.

2. Pull from object storage at startup

Weights in S3/GCS, pod downloads on start. Works, but startup is slow and you repeat the transfer every scale-up.

3. Shared cache on node or network

Mount a hostPath or a read-only PVC that caches weights across pod restarts. First pod on a node pays the cost; subsequent pods mmap. This is the best pattern for production at scale.

Some teams use Fluxcd with pre-pulled images or a warmed cluster DaemonSet to keep hot weights resident. For training checkpoints, a parallel filesystem (Weka, Lustre, FSx) is often worth the spend.


Autoscaling: HPA Alone Isn’t Enough

The Horizontal Pod Autoscaler scales on CPU and memory. Neither is a good signal for GPU workloads.

What you actually want:

Options:

And remember: GPU nodes do not scale instantly. A cold node takes 2–5 minutes to come up with drivers loaded. You need a buffer of warm capacity or your SLOs take hits during scale events.


Network: The Silent Killer of Distributed Training

Single-node inference doesn’t care about network. Multi-node training absolutely does.

If you’re doing distributed training on K8s, you need:

If you are not doing distributed training, ignore this section. If you are, this section is worth a full guide of its own.


The Schedulers: Kube-scheduler, Volcano, KAI, Slinky

Stock kube-scheduler works fine for single-GPU inference pods. For batch/training, it’s not ideal — it doesn’t understand gang scheduling (all-or-nothing for a multi-pod job) or fair share.

We cover these in depth in GPU Scheduling on Kubernetes.


The Patterns I See in Healthy Clusters

From a dozen platform teams running GPU K8s in production:

  1. Separate node pools for inference and batch. Different priority classes, different autoscaling. Never let batch compete with inference.
  2. MIG for shared inference, whole GPU for training. Matches the workload shape.
  3. Dedicated SRE rotation. GPU nodes have new failure modes (XID errors, thermal events, ECC failures). You need someone watching.
  4. Aggressive DCGM alerting. Alert on ECC errors, GPU throttling, fan failure. These precede node failure by hours or days.
  5. Chaos engineering. Kill a GPU node every Friday. Your system should handle it.
  6. Cost telemetry per namespace. GPU bills are opaque. Attribute spend back to teams or you’ll never control it.

Further Reading

Setting up Kubernetes for AI workloads and want a second pair of eyes? Get in touch — we’ve stood up GPU K8s clusters from 4 nodes to 400.

← Back to Blog