Ray Serve vs Kubernetes for Model Serving
Every ML platform team eventually asks: “Should we serve models on Ray Serve or plain Kubernetes?” Both can run your inference workload. Both can scale, both can serve multiple models, and both can orchestrate pipelines. They make different tradeoffs on ergonomics, control, and scope.
This post walks through what each is good at, what breaks in each, and the pattern we most often recommend (spoiler: it’s a mix).
What Each Actually Is
Ray Serve
Ray Serve is a serving library inside the Ray framework. You write Python classes decorated with @serve.deployment, compose them into a graph, and Ray handles:
- Process lifecycle and autoscaling
- Request batching
- Cross-node routing and model composition
- Python-native integration (no HTTP between pipeline steps)
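Request batching deserves a closer look, because it is the feature teams most often rebuild by hand. A toy, framework-free sketch of the idea behind Ray Serve's @serve.batch — buffer individual requests briefly, run the model once per batch, fan results back out (all names here are illustrative, not Ray's API):

```python
import asyncio

class MicroBatcher:
    """Toy dynamic request batcher: callers see per-request calls,
    the model sees batches."""

    def __init__(self, model_fn, max_batch_size=8, wait_s=0.005):
        self.model_fn = model_fn          # called once per batch
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s              # max time a request waits for peers
        self.queue = []                   # list of (item, future)
        self.flush_task = None

    async def __call__(self, item):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((item, fut))
        if len(self.queue) >= self.max_batch_size:
            self._flush()                 # batch is full: run it now
        elif self.flush_task is None:
            # first request in a new batch: start the wait timer
            self.flush_task = asyncio.get_running_loop().call_later(
                self.wait_s, self._flush)
        return await fut

    def _flush(self):
        if self.flush_task is not None:
            self.flush_task.cancel()
            self.flush_task = None
        if not self.queue:
            return
        batch, self.queue = self.queue, []
        items = [i for i, _ in batch]
        results = self.model_fn(items)    # one model call for the whole batch
        for (_, fut), r in zip(batch, results):
            fut.set_result(r)

async def main():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    out = await asyncio.gather(*(batcher(i) for i in range(10)))
    print(out)  # 10 individual requests, served in at most two model calls

asyncio.run(main())
```

Ray Serve's real implementation adds backpressure, streaming, and per-replica routing on top of this idea, but the core loop is the same.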
Ray Serve runs on top of a Ray cluster, which itself often runs on Kubernetes (via KubeRay). So the question isn’t “Ray or Kubernetes” — it’s “Ray Serve on Kubernetes, or Kubernetes-native serving?”
Kubernetes-native serving
This is what you get with a Deployment (or KServe, or Knative Serving) plus Horizontal Pod Autoscaler plus an Ingress. Your inference server (vLLM, TGI, Triton) runs in a pod. Another pod runs a different model. Routing happens via Ingress or service mesh.
Lowest abstraction level. Maximum control. All responsibility is yours.
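For concreteness, a minimal sketch of that shape — one vLLM pod behind a Service (image tag, model name, and resource sizes are illustrative, not a recommendation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-vllm
spec:
  replicas: 2
  selector:
    matchLabels: {app: llama-vllm}
  template:
    metadata:
      labels: {app: llama-vllm}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # illustrative tag; pin in production
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
          resources:
            limits: {nvidia.com/gpu: 1}
---
apiVersion: v1
kind: Service
metadata:
  name: llama-vllm
spec:
  selector: {app: llama-vllm}
  ports:
    - port: 8000        # vLLM's default OpenAI-compatible port
      targetPort: 8000
```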
The Ergonomic Delta
The fastest way to feel the difference is to write the same thing in each.
Ray Serve: multi-step RAG pipeline
```python
from ray import serve
from ray.serve.handle import DeploymentHandle
from sentence_transformers import SentenceTransformer

@serve.deployment
class Embedder:
    def __init__(self, model="BAAI/bge-large"):
        self.model = SentenceTransformer(model)

    def embed(self, texts):
        return self.model.encode(texts)

@serve.deployment
class Retriever:
    def __init__(self, embedder: DeploymentHandle, vector_db):
        self.embedder = embedder
        self.db = vector_db

    async def retrieve(self, query):
        embedding = await self.embedder.embed.remote([query])
        return self.db.search(embedding[0], k=5)

@serve.deployment
class Generator:
    def __init__(self, model="..."):
        self.llm = vLLM(model)  # placeholder for your vLLM engine wrapper

    def generate(self, query, context):
        # prompt() builds the RAG prompt from query + retrieved context
        return self.llm.generate(prompt(query, context))

@serve.deployment
class RAGApp:
    def __init__(self, retriever: DeploymentHandle, generator: DeploymentHandle):
        self.retriever = retriever
        self.generator = generator

    async def __call__(self, request):
        query = request.query_params["q"]
        docs = await self.retriever.retrieve.remote(query)
        return await self.generator.generate.remote(query, docs)

app = RAGApp.bind(
    Retriever.bind(Embedder.bind(), vector_db),  # vector_db: your vector store client
    Generator.bind(),
)
serve.run(app)
```
One file, composable. Autoscaling per component. In-process Python calls between components (no HTTP overhead).
Kubernetes-native: same RAG pipeline
Four separate services (embedder, retriever, generator, API). Four Deployments, four Services, four HPAs, and an Ingress. Communication between pods happens over HTTP or gRPC. Each service needs its own Docker image and its own CI pipeline.
It works, but it is roughly 10x the YAML. The upside is that each part is independently observable and deployable; that tradeoff is real.
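Each of those four services also carries its own scaling policy. One of the four HPAs, as a sketch (names and targets illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: embedder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embedder       # the embedder Deployment, defined elsewhere
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Multiply by four, and add the Services, Deployments, and Ingress on top.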
Where Ray Serve Wins
1. Multi-step ML pipelines. When you have embedder + retriever + reranker + generator all orchestrated, Ray Serve’s composition model is genuinely better than stitching together HTTP services.
2. Autoscaling granularity. Each @serve.deployment scales independently. Your generator might be GPU-bound and at 10 replicas; your embedder at 2 replicas; your HTTP frontend at 20. Kubernetes HPA does this too but requires 3 separate Deployments and 3 separate scaling policies.
3. GPU sharing within a Ray cluster. Ray’s scheduler understands fractional GPUs natively. “This deployment needs 0.25 GPU” works out of the box. Kubernetes has MIG and time-slicing but it’s more manual.
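Both of those knobs live in Serve's per-deployment config. A sketch of a serveConfigV2 fragment showing independent autoscaling plus a fractional GPU (field names follow Ray Serve's config schema, but the exact autoscaling target field has shifted across Ray versions, so check yours):

```yaml
deployments:
  - name: Generator                # GPU-bound, scales wide
    autoscaling_config:
      min_replicas: 2
      max_replicas: 10
      target_ongoing_requests: 4   # scale on in-flight requests per replica
    ray_actor_options: {num_gpus: 1}
  - name: Embedder                 # small model, shares a GPU
    autoscaling_config:
      min_replicas: 1
      max_replicas: 4
    ray_actor_options: {num_gpus: 0.25}  # fractional GPU, scheduled by Ray
```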
4. Python-first development. Your ML engineers write Python, not YAML. Feels like a native ML framework.
5. Spot instance support. Ray has good handling of spot preemption (replaces nodes, reschedules actors). Doable in K8s, but Ray’s out-of-the-box support is smoother.
6. Distributed model execution. Huge model split across multiple nodes via Ray’s object store? Ray Serve handles the coordination. In K8s, you end up writing custom tensor-parallel coordination.
Where Kubernetes Wins
1. Operational maturity. Kubernetes has been in production for a decade. Every ops team knows it. Every observability stack integrates. Every security tool supports it.
2. Non-ML workloads. Your auth service, your database proxy, your cron jobs — all already on Kubernetes. Adding a Ray layer just for inference means two orchestration systems.
3. Debugging. kubectl logs works. kubectl exec works. kubectl describe pod works. Ray has its own tooling (Ray Dashboard, Ray Workflows logs), but it’s smaller and more opinionated.
4. Heterogeneous workloads on one cluster. Kubernetes natively runs inference pods next to training jobs next to HTTP services, all scheduled on the same GPU nodes. Ray can do it, but you’re pushing its model.
5. Enterprise controls. Admission controllers, policies, network segmentation, Pod Security Standards. Ray has less of this built in.
6. When your serving stack is simpler. One vLLM pod per model type, standard HTTP interface — Kubernetes is straightforwardly simpler.
The Tradeoffs in One Table
| Dimension | Ray Serve | Plain Kubernetes |
|---|---|---|
| Multi-step pipelines | Excellent | Painful |
| Single-model serving | Overkill | Natural fit |
| Debugging | Ray-specific tooling | Familiar |
| Team skill requirement | Python + Ray | Kubernetes |
| Autoscaling granularity | Per-component | Per-Deployment |
| GPU fractional sharing | Native | MIG / time-slicing |
| Ecosystem breadth | ML-focused | Entire cloud-native |
| Operational maturity | Growing | Massive |
| Best fit | ML platform team | General platform team |
What We Usually Recommend
For a small team (< 10 engineers, ~5 models in production): Just Kubernetes. One vLLM Deployment per model, one Ingress, an LLM gateway in front. Don’t add Ray.
For a dedicated ML platform team with complex pipelines: Ray Serve on Kubernetes (via KubeRay). You get the pipeline ergonomics without giving up Kubernetes for the rest of the stack.
For an enterprise platform team serving many models to many tenants: Kubernetes with KServe or a custom controller. KServe handles the model-serving patterns; Kubernetes handles everything else. Ray only where pipelines are genuinely complex.
Specific anti-pattern: All-Ray, no Kubernetes. You end up rebuilding network policies, RBAC, and observability that K8s gives you for free.
Installing Ray Serve on Kubernetes: KubeRay
If you’re going Ray Serve, use KubeRay:
```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm
helm install kuberay-operator kuberay/kuberay-operator
```
Then define a RayService:
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rag-app
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: rag
        import_path: rag_app:app
        deployments:
          - name: Generator
            num_replicas: 2
            ray_actor_options: {num_gpus: 1}
  rayClusterConfig:
    rayVersion: '2.20.0'
    headGroupSpec: {...}        # head pod template elided
    workerGroupSpecs: [{...}]   # worker group specs elided
```
KubeRay handles Ray cluster lifecycle inside your K8s. Upgrades become rolling by default; resource accounting flows back to K8s node-level metrics.
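Once applied, the RayService is an ordinary custom resource, so the usual kubectl workflow applies (resource name from the example above; label key per KubeRay's conventions):

```shell
kubectl apply -f rag-app.yaml        # assuming the RayService above is saved here
kubectl get rayservice rag-app       # overall service status
kubectl describe rayservice rag-app  # events, health thresholds, serve app status
kubectl get pods -l ray.io/cluster   # head and worker pods created by KubeRay
```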
Alternatives Worth Knowing
KServe (formerly KFServing): Kubernetes-native model serving controller. Adds features like canary rollouts, auto-scaling to zero, transformers. Good middle ground between raw K8s and Ray Serve.
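A KServe InferenceService, for flavor — one custom resource replaces the Deployment/Service/HPA trio (storage URI and model format are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-models/sentiment   # illustrative bucket path
```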
Knative Serving: General-purpose serverless on K8s. Works for ML too; scale-to-zero is the headline feature. Cold start latency is the cost.
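Scale-to-zero in Knative is a pair of annotations on the revision template (annotation keys per Knative's autoscaling docs; image is illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: embedder
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale to zero
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
        - image: ghcr.io/example/embedder:latest  # illustrative image
```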
BentoML / Bento Cloud: Opinionated ML serving framework. Nice dev experience; vendor-managed cloud if you want it.
Seldon Core: Enterprise-focused ML serving on K8s. Strong for regulated industries.
AWS SageMaker / GCP Vertex Endpoints / Azure ML endpoints: Managed options. Less control, less work.
For most teams in 2025 the choice is between plain K8s (+ KServe for ML-specific features) and KubeRay with Ray Serve. The hyperscaler managed options have their place but lock you in.
Operational Notes
1. Cluster boundaries matter. Putting a Ray cluster’s head node and workers in the same K8s namespace is fine for dev. In production, give Ray its own namespace, network policies, and resource quotas.
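A minimal version of that isolation — a dedicated namespace with a hard GPU quota so Ray cannot starve the rest of the cluster (numbers illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ray-serving
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ray-gpu-quota
  namespace: ray-serving
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # cap total GPUs Ray workers can claim
    limits.memory: 512Gi
```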
2. Health checks are subtler. Ray Serve deployments have their own readiness semantics separate from K8s pod readiness. KubeRay bridges this; verify it’s working in your setup.
3. Upgrades. Ray’s zero-downtime upgrades are doable but require careful KubeRay config. Test it in staging.
4. Observability. Ray’s Prometheus metrics are Ray-shaped; you’ll add extra labels to join with your K8s-shaped metrics. Not hard; needs a plan.
5. Cost accounting. If your FinOps tooling attributes by K8s labels, Ray deployments show up as the underlying K8s Deployments. Make sure Ray’s labels propagate the way you want.
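Label propagation happens in the worker pod template inside the RayCluster spec — anything your cost tooling keys on goes there (label keys illustrative):

```yaml
workerGroupSpecs:
  - groupName: gpu-workers
    template:
      metadata:
        labels:
          team: ml-platform       # illustrative cost-attribution labels
          cost-center: inference
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.20.0-gpu
```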
The Short Version
Start with Kubernetes. Add Ray Serve when you have:
- Multi-step pipelines with fine-grained per-component scaling
- A team that does Python-first ML development
- A real need for fractional GPU sharing
- Complex model composition
Skip Ray Serve if:
- Your serving is simple (one or a few models, HTTP front)
- Your team is Kubernetes-native, not Ray-native
- You want to minimize the number of orchestration systems
Most mature ML platforms end up with Ray Serve for the ML orchestration layer and Kubernetes for everything else.
Talk to Us
Deciding between Ray Serve and Kubernetes for a new platform? Let’s talk — we’ll help scope based on your team and workload.