Ray Serve vs Kubernetes for Model Serving
Every ML platform team eventually asks: “Should we serve models on Ray Serve or plain Kubernetes?” Both can run your inference workload. Both can scale, both can serve multiple models, and both can orchestrate pipelines. They make different tradeoffs on ergonomics, control, and scope.
This post walks through what each is good at, what breaks in each, and the pattern we most often recommend (spoiler: it’s a mix).
What Each Actually Is
Ray Serve
Ray Serve is a serving library inside the Ray framework. You write Python classes decorated with @serve.deployment, compose them into a graph, and Ray handles:
- Process lifecycle and autoscaling
- Request batching
- Cross-node routing and model composition
- Python-native integration (no HTTP between pipeline steps)
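Request batching deserves a closer look, because it is the feature teams most often rebuild by hand. A toy, framework-free sketch of the idea behind Ray Serve's @serve.batch — buffer individual requests briefly, run the model once per batch, fan results back out (all names here are illustrative, not Ray's API):

```python
import asyncio

class MicroBatcher:
    """Toy dynamic request batcher: callers see per-request calls,
    the model sees batches."""

    def __init__(self, model_fn, max_batch_size=8, wait_s=0.005):
        self.model_fn = model_fn          # called once per batch
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s              # max time a request waits for peers
        self.queue = []                   # list of (item, future)
        self.flush_task = None

    async def __call__(self, item):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((item, fut))
        if len(self.queue) >= self.max_batch_size:
            self._flush()                 # batch is full: run it now
        elif self.flush_task is None:
            # first request in a new batch: start the wait timer
            self.flush_task = asyncio.get_running_loop().call_later(
                self.wait_s, self._flush)
        return await fut

    def _flush(self):
        if self.flush_task is not None:
            self.flush_task.cancel()
            self.flush_task = None
        if not self.queue:
            return
        batch, self.queue = self.queue, []
        items = [i for i, _ in batch]
        results = self.model_fn(items)    # one model call for the whole batch
        for (_, fut), r in zip(batch, results):
            fut.set_result(r)

async def main():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    out = await asyncio.gather(*(batcher(i) for i in range(10)))
    print(out)  # 10 individual requests, served in at most two model calls

asyncio.run(main())
```

Ray Serve's real implementation adds backpressure, streaming, and per-replica routing on top of this idea, but the core loop is the same.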
Ray Serve runs on top of a Ray cluster, which itself often runs on Kubernetes (via KubeRay). So the question isn’t “Ray or Kubernetes” — it’s “Ray Serve on Kubernetes, or Kubernetes-native serving?”
Kubernetes-native serving
This is what you get with a Deployment (or KServe, or Knative Serving) plus Horizontal Pod Autoscaler plus an Ingress. Your inference server (vLLM, TGI, Triton) runs in a pod. Another pod runs a different model. Routing happens via Ingress or service mesh.
Lowest abstraction level. Maximum control. All responsibility is yours.
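For concreteness, a minimal sketch of that shape — one vLLM pod behind a Service (image tag, model name, and resource sizes are illustrative, not a recommendation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm
spec:
  replicas: 2
  selector:
    matchLabels: {app: llama-vllm}
  template:
    metadata:
      labels: {app: llama-vllm}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # illustrative tag; pin in production
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
          resources:
            limits: {nvidia.com/gpu: 1}
---
apiVersion: v1
kind: Service
metadata:
  name: llama-vllm
spec:
  selector: {app: llama-vllm}
  ports:
    - port: 8000        # vLLM's default OpenAI-compatible port
      targetPort: 8000
```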
The Ergonomic Delta
The fastest way to feel the difference is to write the same thing in each.
Ray Serve: multi-step RAG pipeline
```python
from ray import serve
from ray.serve.handle import DeploymentHandle
from sentence_transformers import SentenceTransformer

@serve.deployment
class Embedder:
    def __init__(self, model="BAAI/bge-large"):
        self.model = SentenceTransformer(model)

    def embed(self, texts):
        return self.model.encode(texts)

@serve.deployment
class Retriever:
    def __init__(self, embedder: DeploymentHandle, vector_db):
        self.embedder = embedder
        self.db = vector_db

    async def retrieve(self, query):
        embedding = await self.embedder.embed.remote([query])
        return self.db.search(embedding[0], k=5)

@serve.deployment
class Generator:
    def __init__(self, model="..."):
        self.llm = vLLM(model)  # placeholder for your vLLM engine wrapper

    def generate(self, query, context):
        # prompt() builds the RAG prompt from query + retrieved context
        return self.llm.generate(prompt(query, context))

@serve.deployment
class RAGApp:
    def __init__(self, retriever: DeploymentHandle, generator: DeploymentHandle):
        self.retriever = retriever
        self.generator = generator

    async def __call__(self, request):
        query = request.query_params["q"]
        docs = await self.retriever.retrieve.remote(query)
        return await self.generator.generate.remote(query, docs)

app = RAGApp.bind(
    Retriever.bind(Embedder.bind(), vector_db),  # vector_db: your vector store client
    Generator.bind(),
)
serve.run(app)
```
One file, composable. Autoscaling per component. In-process Python calls between components (no HTTP overhead).
Kubernetes-native: same RAG pipeline
Four separate services (embedder, retriever, generator, API). Four Deployments, four Services, four HPAs, and an Ingress. Communication between pods happens over HTTP or gRPC. Each service needs its own Docker image and its own CI pipeline.
It works, but it is roughly 10x the YAML. The upside is that each part is independently observable and deployable; that tradeoff is real.
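Each of those four services also carries its own scaling policy. One of the four HPAs, as a sketch (names and targets illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: embedder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embedder       # the embedder Deployment, defined elsewhere
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Multiply by four, and add the Services, Deployments, and Ingress on top.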
Where Ray Serve Wins
1. Multi-step ML pipelines. When you have embedder + retriever + reranker + generator all orchestrated, Ray Serve’s composition model is genuinely better than stitching together HTTP services.
2. Autoscaling granularity. Each @serve.deployment scales independently. Your generator might be GPU-bound and at 10 replicas; your embedder at 2 replicas; your HTTP frontend at 20. Kubernetes HPA does this too but requires 3 separate Deployments and 3 separate scaling policies.
3. GPU sharing within a Ray cluster. Ray’s scheduler understands fractional GPUs natively. “This deployment needs 0.25 GPU” works out of the box. Kubernetes has MIG and time-slicing but it’s more manual.
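Both of those knobs live in Serve's per-deployment config. A sketch of a serveConfigV2 fragment showing independent autoscaling plus a fractional GPU (field names follow Ray Serve's config schema, but the exact autoscaling target field has shifted across Ray versions, so check yours):

```yaml
deployments:
  - name: Generator                # GPU-bound, scales wide
    autoscaling_config:
      min_replicas: 2
      max_replicas: 10
      target_ongoing_requests: 4   # scale on in-flight requests per replica
    ray_actor_options: {num_gpus: 1}
  - name: Embedder                 # small model, shares a GPU
    autoscaling_config:
      min_replicas: 1
      max_replicas: 4
    ray_actor_options: {num_gpus: 0.25}  # fractional GPU, scheduled by Ray
```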
4. Python-first development. Your ML engineers write Python, not YAML. Feels like a native ML framework.
5. Spot instance support. Ray has good handling of spot preemption (replaces nodes, reschedules actors). Doable in K8s, but Ray’s out-of-the-box support is smoother.
6. Distributed model execution. Huge model split across multiple nodes via Ray’s object store? Ray Serve handles the coordination. In K8s, you end up writing custom tensor-parallel coordination.
Where Kubernetes Wins
1. Operational maturity. Kubernetes has been in production for a decade. Every ops team knows it. Every observability stack integrates. Every security tool supports it.
2. Non-ML workloads. Your auth service, your database proxy, your cron jobs — all already on Kubernetes. Adding a Ray layer just for inference means two orchestration systems.
3. Debugging. kubectl logs works. kubectl exec works. kubectl describe pod works. Ray has its own tooling (Ray Dashboard, Ray Workflows logs), but it’s smaller and more opinionated.
4. Heterogeneous workloads on one cluster. Kubernetes natively runs inference pods next to training jobs next to HTTP services, all scheduled on the same GPU nodes. Ray can do it, but you’re pushing its model.
5. Enterprise controls. Admission controllers, policies, network segmentation, Pod Security Standards. Ray has less of this built in.
6. When your serving stack is simpler. One vLLM pod per model type, standard HTTP interface — Kubernetes is straightforwardly simpler.
The Tradeoffs in One Table
| Dimension | Ray Serve | Plain Kubernetes |
|---|---|---|
| Multi-step pipelines | Excellent | Painful |
| Single-model serving | Overkill | Natural fit |
| Debugging | Ray-specific tooling | Familiar |
| Team skill requirement | Python + Ray | Kubernetes |
| Autoscaling granularity | Per-component | Per-Deployment |
| GPU fractional sharing | Native | MIG / time-slicing |
| Ecosystem breadth | ML-focused | Entire cloud-native |
| Operational maturity | Growing | Massive |
| Best fit | ML platform team | General platform team |
What We Usually Recommend
For a small team (< 10 engineers, ~5 models in production): Just Kubernetes. One vLLM Deployment per model, one Ingress, an LLM gateway in front. Don’t add Ray.
For a dedicated ML platform team with complex pipelines: Ray Serve on Kubernetes (via KubeRay). You get the pipeline ergonomics without giving up Kubernetes for the rest of the stack.
For an enterprise platform team serving many models to many tenants: Kubernetes with KServe or a custom controller. KServe handles the model-serving patterns; Kubernetes handles everything else. Ray only where pipelines are genuinely complex.
Specific anti-pattern: All-Ray, no Kubernetes. You end up rebuilding network policies, RBAC, and observability that K8s gives you for free.
Installing Ray Serve on Kubernetes: KubeRay
If you’re going Ray Serve, use KubeRay:
```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm
helm install kuberay-operator kuberay/kuberay-operator
```
Then define a RayService:
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rag-app
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: rag
        import_path: rag_app:app
        deployments:
          - name: Generator
            num_replicas: 2
            ray_actor_options: {num_gpus: 1}
  rayClusterConfig:
    rayVersion: '2.20.0'
    headGroupSpec: {...}        # head pod template elided
    workerGroupSpecs: [{...}]   # worker group specs elided
```
KubeRay handles Ray cluster lifecycle inside your K8s. Upgrades become rolling by default; resource accounting flows back to K8s node-level metrics.
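Once applied, the RayService is an ordinary custom resource, so the usual kubectl workflow applies (resource name from the example above; label key per KubeRay's conventions):

```shell
kubectl apply -f rag-app.yaml        # assuming the RayService above is saved here
kubectl get rayservice rag-app       # overall service status
kubectl describe rayservice rag-app  # events, health thresholds, serve app status
kubectl get pods -l ray.io/cluster   # head and worker pods created by KubeRay
```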
Alternatives Worth Knowing
KServe (formerly KFServing): Kubernetes-native model serving controller. Adds features like canary rollouts, auto-scaling to zero, transformers. Good middle ground between raw K8s and Ray Serve.
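A KServe InferenceService, for flavor — one custom resource replaces the Deployment/Service/HPA trio (storage URI and model format are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-models/sentiment   # illustrative bucket path
```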
Knative Serving: General-purpose serverless on K8s. Works for ML too; scale-to-zero is the headline feature. Cold start latency is the cost.
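Scale-to-zero in Knative is a pair of annotations on the revision template (annotation keys per Knative's autoscaling docs; image is illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: embedder
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale to zero
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
        - image: ghcr.io/example/embedder:latest  # illustrative image
```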
BentoML / Bento Cloud: Opinionated ML serving framework. Nice dev experience; vendor-managed cloud if you want it.
Seldon Core: Enterprise-focused ML serving on K8s. Strong for regulated industries.
AWS SageMaker / GCP Vertex Endpoints / Azure ML endpoints: Managed options. Less control, less work.
For most teams in 2025 the choice is between plain K8s (+ KServe for ML-specific features) and KubeRay with Ray Serve. The hyperscaler managed options have their place but lock you in.
Operational Notes
1. Cluster boundaries matter. Putting a Ray cluster’s head node and workers in the same K8s namespace is fine for dev. In production, give Ray its own namespace, network policies, and resource quotas.
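A minimal version of that isolation — a dedicated namespace with a hard GPU quota so Ray cannot starve the rest of the cluster (numbers illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ray-serving
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ray-gpu-quota
  namespace: ray-serving
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # cap total GPUs Ray workers can claim
    limits.memory: 512Gi
```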
2. Health checks are subtler. Ray Serve deployments have their own readiness semantics separate from K8s pod readiness. KubeRay bridges this; verify it’s working in your setup.
3. Upgrades. Ray’s zero-downtime upgrades are doable but require careful KubeRay config. Test it in staging.
4. Observability. Ray’s Prometheus metrics are Ray-shaped; you’ll add extra labels to join with your K8s-shaped metrics. Not hard; needs a plan.
5. Cost accounting. If your FinOps tooling attributes by K8s labels, Ray deployments show up as the underlying K8s Deployments. Make sure Ray’s labels propagate the way you want.
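Label propagation happens in the worker pod template inside the RayCluster spec — anything your cost tooling keys on goes there (label keys illustrative):

```yaml
workerGroupSpecs:
  - groupName: gpu-workers
    template:
      metadata:
        labels:
          team: ml-platform       # illustrative cost-attribution labels
          cost-center: inference
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.20.0-gpu
```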
The Short Version
Start with Kubernetes. Add Ray Serve when you have:
- Multi-step pipelines with fine-grained per-component scaling
- A team that does Python-first ML development
- A real need for fractional GPU sharing
- Complex model composition
Skip Ray Serve if:
- Your serving is simple (one or a few models, HTTP front)
- Your team is Kubernetes-native, not Ray-native
- You want to minimize the number of orchestration systems
Most mature ML platforms end up with Ray Serve for the ML orchestration layer and Kubernetes for everything else.
Talk to Us
Deciding between Ray Serve and Kubernetes for a new platform? Let’s talk — we’ll help scope based on your team and workload.