Kubernetes for GPU Workloads: A Primer

Balys Kriksciunas · Sat Sep 28 2024 · 7 min read

#ai #infrastructure #kubernetes #gpu #mig #scheduler #nvidia-gpu-operator #devops

Running AI workloads on Kubernetes isn't the same as running stateless microservices. A primer on GPU operators, device plugins, node affinity, MIG, and the patterns that keep clusters healthy.

Kubernetes for GPU Workloads: A Primer

Kubernetes works fine for stateless HTTP services. It works surprisingly well for batch jobs. It works with some effort for stateful databases. It works, but with many opinions and pitfalls, for GPU-backed AI workloads.

If you’re putting AI inference or training on Kubernetes — and most platform teams eventually do — this primer covers the pieces that are different from normal K8s workloads and the patterns that keep your cluster healthy.

Why Use Kubernetes for AI at All?

Fair question. For a single GPU running a single model, docker run on a VM is simpler. The case for K8s emerges when:

You have many models, many teams, or many tenants
You need multi-zone or multi-region resilience
You want to share expensive GPU fleets across workloads
You need rolling upgrades without dropping requests
You’re already a Kubernetes shop and consistency matters

The counter-case — when K8s is wrong — is usually “we have one training job, it runs for two weeks, we have one team.” Use Slurm or just a VM. We’ll cover Slurm in the age of Kubernetes separately.

The Three Non-Negotiable Components

Any K8s cluster running GPU workloads needs three things installed beyond stock K8s:

1. NVIDIA GPU Operator (or GPU Driver DaemonSet)

The GPU Operator is NVIDIA’s opinionated bundle: driver, container toolkit, device plugin, DCGM exporter, MIG manager. Install it via Helm:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator

For most clouds (GKE, EKS, AKS, etc.), managed GPU node pools ship with the driver and device plugin pre-installed. Check before double-installing.

2. Device Plugin

This is the bridge that exposes GPUs to the kubelet as a schedulable resource (nvidia.com/gpu). The GPU Operator installs it automatically. Once it’s running, your pods can request GPUs:

resources:
  limits:
    nvidia.com/gpu: 1

3. DCGM Exporter + Monitoring

GPU utilization, memory usage, temperature, and SM occupancy are not visible via standard K8s metrics. The DCGM exporter exposes them as Prometheus metrics. If you don’t have this, you are flying blind.

Scheduling: The Part That Bites

Default K8s scheduling is built for fungible pods on fungible nodes. GPUs are neither fungible (an H100 is not an A100) nor cheap to wait for.

Three things you need to configure:

Node labels and taints

Every GPU node should be labeled with hardware specifics:

nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  topology.kubernetes.io/zone: us-east-1a

And tainted so non-GPU workloads don’t squat on them:

taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

Then GPU pods tolerate the taint. Without this, your $30k/GPU nodes will happily run nginx sidecars.

Pod anti-affinity for replicas

Don’t stack two replicas of the same inference service on one node. If that node dies, you have zero capacity. Use pod anti-affinity:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: inference-server
          topologyKey: kubernetes.io/hostname

Priority classes

GPU capacity is expensive and scarce. Your batch training job should not evict your production inference pod. Define priority classes:

# High priority - inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: inference}
value: 1000000
globalDefault: false

---
# Low priority - batch / experiments
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: batch}
value: 100
globalDefault: false

This also lets you use preemption — batch jobs get killed to make room for inference surges.

Fractional GPUs: MIG, MPS, and Time-Slicing

A full H100 is wildly overprovisioned for a small model at low QPS. Three ways to share:

MIG (Multi-Instance GPU)

Hardware-partitioned. On H100 you can split one GPU into up to seven isolated instances, each with dedicated memory and SMs. Strong isolation, predictable performance. The downside: partitions are fixed at node boot, so you lose flexibility.

Configure via GPU Operator:

migManager:
  enabled: true
  config:
    default: "all-1g.10gb"  # 7 instances per H100

Each pod requests e.g. nvidia.com/mig-1g.10gb: 1.

MPS (Multi-Process Service)

Software-level sharing. Multiple processes submit CUDA work to a single MPS server, which time-multiplexes onto the GPU. Lower isolation, higher flexibility.

Time-slicing

Purest software approach. The device plugin lets N pods each request nvidia.com/gpu: 1 and they take turns. Good for dev/test; don’t run production inference this way.

Rule of thumb:

Production multi-tenant inference → MIG
Dev environments, Jupyter notebooks → time-slicing or MPS
Single high-throughput job → whole GPU, no sharing

Storage: The Underrated Concern

Model weights are big. A 70B model is ~140GB. Loading it at pod-start time means pulling 140GB from somewhere, every scale-up, for every replica.

Three patterns:

1. Build weights into the image

Simple, reproducible, but images balloon to 100GB+. Registry pulls become the bottleneck. Works for small models; doesn’t for frontier-size.

2. Pull from object storage at startup

Weights in S3/GCS, pod downloads on start. Works, but startup is slow and you repeat the transfer every scale-up.

3. Shared cache on node or network

Mount a hostPath or a read-only PVC that caches weights across pod restarts. First pod on a node pays the cost; subsequent pods mmap. This is the best pattern for production at scale.

Some teams use Fluxcd with pre-pulled images or a warmed cluster DaemonSet to keep hot weights resident. For training checkpoints, a parallel filesystem (Weka, Lustre, FSx) is often worth the spend.

Autoscaling: HPA Alone Isn’t Enough

The Horizontal Pod Autoscaler scales on CPU and memory. Neither is a good signal for GPU workloads.

What you actually want:

Scale on queue depth — requests waiting to be served
Scale on GPU utilization — if SM occupancy drops, scale down
Scale on latency SLO — if P95 drifts, scale up

Options:

KEDA — exposes arbitrary metrics (from Prometheus, Redis queues, Kafka) to HPA. Standard choice.
Custom metrics adapter — for DCGM-based metrics.
Knative Serving — if you’re doing request-based autoscaling with scale-to-zero.

And remember: GPU nodes do not scale instantly. A cold node takes 2–5 minutes to come up with drivers loaded. You need a buffer of warm capacity or your SLOs take hits during scale events.

Network: The Silent Killer of Distributed Training

Single-node inference doesn’t care about network. Multi-node training absolutely does.

If you’re doing distributed training on K8s, you need:

RDMA — RoCE or InfiniBand. SR-IOV or the NVIDIA Network Operator to expose devices to pods.
Topology-aware scheduling — Ray, Kubeflow’s MPI Operator, or Volcano can schedule the entire training job on a tightly-coupled subset of nodes.
Jumbo frames and tuned kernel params — off by default, essential for RDMA perf.
Dedicated network namespaces — don’t share the RDMA fabric with your cluster’s pod network.

If you are not doing distributed training, ignore this section. If you are, this section is worth a full guide of its own.

The Schedulers: Kube-scheduler, Volcano, KAI, Slinky

Stock kube-scheduler works fine for single-GPU inference pods. For batch/training, it’s not ideal — it doesn’t understand gang scheduling (all-or-nothing for a multi-pod job) or fair share.

Volcano — batch scheduler from the CNCF. Gang scheduling, fair queuing, priorities.
KAI Scheduler — NVIDIA’s scheduler (formerly Run:ai). Opinionated toward GPU fairness and quotas.
Kueue — upstream K8s sub-project. Lighter-weight, increasingly standard.
Slinky (Slurm-in-K8s) — emerging, bridges HPC and cloud-native.

We cover these in depth in GPU Scheduling on Kubernetes.

The Patterns I See in Healthy Clusters

From a dozen platform teams running GPU K8s in production:

Separate node pools for inference and batch. Different priority classes, different autoscaling. Never let batch compete with inference.
MIG for shared inference, whole GPU for training. Matches the workload shape.
Dedicated SRE rotation. GPU nodes have new failure modes (XID errors, thermal events, ECC failures). You need someone watching.
Aggressive DCGM alerting. Alert on ECC errors, GPU throttling, fan failure. These precede node failure by hours or days.
Chaos engineering. Kill a GPU node every Friday. Your system should handle it.
Cost telemetry per namespace. GPU bills are opaque. Attribute spend back to teams or you’ll never control it.

Kubernetes for GPU Workloads: A Primer

Kubernetes for GPU Workloads: A Primer

Why Use Kubernetes for AI at All?

The Three Non-Negotiable Components

1. NVIDIA GPU Operator (or GPU Driver DaemonSet)

2. Device Plugin

3. DCGM Exporter + Monitoring

Scheduling: The Part That Bites

Node labels and taints

Pod anti-affinity for replicas

Priority classes

Fractional GPUs: MIG, MPS, and Time-Slicing

MIG (Multi-Instance GPU)

MPS (Multi-Process Service)

Time-slicing

Storage: The Underrated Concern

1. Build weights into the image

2. Pull from object storage at startup

3. Shared cache on node or network

Autoscaling: HPA Alone Isn’t Enough

Network: The Silent Killer of Distributed Training

The Schedulers: Kube-scheduler, Volcano, KAI, Slinky

The Patterns I See in Healthy Clusters

Further Reading

Related Posts

The AI Infrastructure Stack Explained (2024)

vLLM vs SGLang: Inference Engine Comparison 2026

State of AI Infrastructure 2026: Mid-Year Reality Check

Kubernetes for GPU Workloads: A Primer

Kubernetes for GPU Workloads: A Primer

Why Use Kubernetes for AI at All?

The Three Non-Negotiable Components

1. NVIDIA GPU Operator (or GPU Driver DaemonSet)

2. Device Plugin

3. DCGM Exporter + Monitoring

Scheduling: The Part That Bites

Node labels and taints

Pod anti-affinity for replicas

Priority classes

Fractional GPUs: MIG, MPS, and Time-Slicing

MIG (Multi-Instance GPU)

MPS (Multi-Process Service)

Time-slicing

Storage: The Underrated Concern

1. Build weights into the image

2. Pull from object storage at startup

3. Shared cache on node or network

Autoscaling: HPA Alone Isn’t Enough

Network: The Silent Killer of Distributed Training

The Schedulers: Kube-scheduler, Volcano, KAI, Slinky

The Patterns I See in Healthy Clusters

Further Reading

Related Posts

The AI Infrastructure Stack Explained (2024)

vLLM vs SGLang: Inference Engine Comparison 2026

State of AI Infrastructure 2026: Mid-Year Reality Check

Don't miss out on AI insights