Infrastructure

Building an AI Platform Team: Roles, Tools, and Rituals

Balys Kriksciunas 8 min read
#ai#infrastructure#platform-engineering#team#hiring#org#ml-ops

Building an AI Platform Team: Roles, Tools, and Rituals

At some point, every company with serious AI ambitions realizes the same thing: product teams shouldn’t be choosing their own vector database, writing their own retry logic, or re-implementing observability. That’s when the AI platform team is born.

This post covers what we’ve learned about building, staffing, and operating these teams — what they own, how they’re different from MLOps teams, when to start one, and the rituals that make them effective instead of bureaucratic.


What An AI Platform Team Actually Owns

The explicit surface area:

The implicit:


How It’s Different From MLOps

MLOps is about training, deploying, and monitoring traditional ML models — classification, regression, recommendation. The discipline evolved around Kubeflow, MLflow, feature stores, and model registries.

AI platform (LLM-focused) is different in three structural ways:

  1. Less training, more serving. Most teams use foundation models. Training pipelines are smaller; inference and retrieval are dominant.
  2. More runtime orchestration. Agents, tool use, RAG — the complexity is runtime, not training.
  3. New observability primitives. Traces, evals, prompts — not the feature/model/prediction stack MLOps built for.

A good MLOps team can grow into an AI platform team. It’s not automatic — different tools, different patterns, different failure modes.

For shops doing both (traditional ML + LLM systems), many consolidate under one “AI platform” umbrella with sub-teams.


When To Start The Team

Before you need it, the platform work is spread across product teams. Each team builds its own gateway, its own tracing, its own eval. Duplication.

Signs it’s time:

Most companies hit these signals around 30–100 engineers with meaningful AI investment.

Don’t wait past these signs. Every month you delay, duplication grows. Later, consolidation is painful.


The First Hire

Before a team: a platform lead. Someone who will:

The right first hire has:

Avoid:

This is a staff+ IC or a tech lead role. Compensate accordingly.


The First Five Hires

Beyond the lead, the initial team composition that works:

  1. Inference infrastructure engineer — GPU, Kubernetes, vLLM/TGI. Owns the serving stack.
  2. Developer experience engineer — SDKs, templates, internal docs. Owns enablement.
  3. Observability / evals engineer — tracing, eval harness, dashboards. Owns quality signal.
  4. ML / model specialist — model selection, fine-tuning, evaluation judgment. Bridge to product teams on quality.
  5. Site reliability engineer (AI-specific) — on-call, incident response, capacity planning. Focus on operational excellence.

Beyond this core, you grow in the direction of the bottleneck. Many AI platform teams never grow past 10–15 people even at 500-engineer companies — the leverage is high.


The Platform Products

An AI platform team ships products internally. Some common ones:

The Internal LLM API

A single OpenAI-compatible endpoint that all services call. Routing, auth, rate limiting, observability, cost attribution. Built on LiteLLM / Portkey or custom.

The Eval Harness

A framework where teams define test cases, golden answers, and quality metrics. CI-integrated. Team owns the harness; product teams own their test sets.

The Agent Runtime

A managed environment for running agents: workflow engine, tool registry, state store, observability. Product teams bring agent logic; platform team runs it.

The Vector DB Service

A managed vector DB (or many) accessible via a unified API. Handles multi-tenancy, quotas, cost attribution. Product teams don’t need to know which vector DB.

The Dev Playground

A low-friction environment for prototyping. Notebook-friendly, safe credentials, cost caps.

The Safety Toolkit

Libraries for PII redaction, prompt-injection detection, output filtering. Used across services.

Not all of these on day one. Build in order of what unblocks product teams most.


Rituals That Keep It Effective

Platform teams without rituals devolve into infrastructure fortresses. Rituals we see work:

Weekly office hours

Product teams bring questions. Platform team answers. Sometimes open problems get co-designed.

Monthly architecture review

Product teams propose new AI features; platform team reviews for feasibility, cost, and alignment with the platform.

Quarterly platform roadmap review

Platform team presents: what shipped, what’s next, why. Product teams give feedback. Adjustments made.

Incident retros

When AI production breaks, the retro includes platform team. Pattern emerges over time; roadmap adjusts.

Internal blog / changelog

Platform team writes about what they built, why it matters, how to use it. Reduces “how do I…” tickets.

Shared Slack channel

One channel where product teams ask and platform team answers. Search-indexed institutional memory.

Teams that do these rituals ship 2–3x more platform value than teams that don’t. The time spent on rituals pays back in reduced ad-hoc support load.


Metrics Worth Tracking

Platform teams are easy to measure badly. A few metrics that actually mean something:

Avoid:


Build vs Buy

The platform team’s most consequential decisions. Rough rules:

Always buy (or open-source adopt):

Sometimes build:

Rarely build:

The fastest way to sink a platform team is to have them re-implement open-source tooling. Customize, don’t recreate.


Common Failure Modes

1. Ivory tower. Platform team builds beautiful systems nobody uses. Fix: embed platform engineers with product teams part-time.

2. Too much governance. Every change goes through platform review. Product teams go around. Fix: opinionated defaults, easy golden paths, governance only on high-risk decisions.

3. Under-resourced. 1 person trying to support 15 product teams. Things break; frustration mounts. Fix: headcount parity grows with adoption.

4. Wrong leader. A lead without production LLM experience can’t make good architectural calls. Fix: hire right for the role.

5. Fragmented ownership. Different teams own different pieces of the platform with no integration. Fix: single throat to choke for the whole surface.


Evolution Over Time

A typical platform team evolution:

Year 1: “Stop the bleeding.” Consolidate gateway, tracing, core serving. Small team, hands-on.

Year 2: “Enable acceleration.” DX focus. SDKs, templates, docs. Product teams ship AI features 3x faster.

Year 3: “Optimize and govern.” FinOps, quality evals, compliance. Platform team becomes cost and risk owner.

Year 4+: “Strategic differentiation.” Custom models, specialized infra, multi-region. Platform is the moat.


The Short Version

Good AI platform teams compound their value. Three years in, they’re the reason your company can ship AI features faster and cheaper than competitors. Starting well matters more than starting fast.


Further Reading

Building or scaling an AI platform team? Reach out — we help with platform strategy, hiring, and technical direction.

← Back to Blog