
Agent Infrastructure: What's Different from LLM Serving

Balys Kriksciunas · 8 min read
#ai#infrastructure#agents#orchestration#mcp#langgraph#production


Most teams build their first agent on top of an LLM inference stack and expect it to work like a chatbot. Then they hit weird production problems: requests that run for minutes; tool calls that fail silently; agents that generate 200 LLM calls per user request; observability stacks that can’t make sense of the traces.

Agent infrastructure is a different animal, not "LLM serving plus some extra." Its concurrency model, failure modes, observability needs, and cost patterns are all distinct. This post maps what's actually different and what production agent platforms look like in 2026.


The Core Difference: Stateful Long-Running Work

An LLM call is a request/response: prompt in, completion out, done in seconds. Resources are returned when the HTTP request completes.

An agent is a long-running stateful task: it makes many LLM calls, invokes tools, persists state between steps, can wait on external events or human input, and may run for minutes, hours, or days.

This shift — from stateless call to stateful task — changes almost every piece of the supporting infrastructure.


The Five Concrete Differences

1. Concurrency model

LLM serving: threads or async contexts per request, all in one process. Lifetime in seconds.

Agents: durable workflows orchestrated by a workflow engine. State persisted between steps. Can survive process restarts. Can wait on external events.

Implementation options include Temporal, Inngest, LangGraph's checkpointer, and DBOS (compared below).

Our default: Temporal for non-trivial agent systems, LangGraph's checkpointer for simpler ones. Don't hand-roll this.
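The core durable-execution idea can be sketched in plain Python: persist state after every step so a restarted process resumes from the last checkpoint instead of re-running completed work. A minimal illustration, with a dict standing in for the Postgres-backed store a real engine would use (all names here are hypothetical; Temporal adds retries, signals, and timers on top):

```python
def run_workflow(workflow_id, steps, initial_state, store):
    """Run steps in order, checkpointing after each one so a crash can resume."""
    start = store.get(workflow_id, {"step": 0, "state": initial_state})
    state, i = start["state"], start["step"]
    for step in steps[i:]:                            # skip already-completed steps
        state = step(state)                           # execute one agent step
        i += 1
        store[workflow_id] = {"step": i, "state": state}  # durable checkpoint
    return state

# Two steps; pretend the process crashed after completing the first one.
steps = [lambda s: {**s, "plan": "draft"}, lambda s: {**s, "done": True}]
store = {"wf-1": {"step": 1, "state": {"plan": "draft"}}}  # checkpoint written before the crash
result = run_workflow("wf-1", steps, {}, store)            # resumes at step 2, not step 1
```

The point of the sketch: the second run never re-executes step 1, which is exactly the replay-safety property you are paying a workflow engine for.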

2. LLM call amplification

A single user request to an agent typically produces 5–50 LLM calls internally. A bad loop can produce 500.

The implications: provider rate limits get hit at a fraction of your user request rate, latency compounds across sequential calls, and a single runaway loop can multiply the cost of a request a hundredfold.

Budget per-user-session LLM call caps aggressively. Alert on outliers.
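A per-session call cap is a small amount of code. A minimal sketch (class and names are illustrative, not from any framework): check the budget before every LLM call and fail loudly when it trips.

```python
class CallBudgetExceeded(RuntimeError):
    pass

class SessionBudget:
    """Hard cap on LLM calls per user session; charge before every call."""
    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self) -> None:
        self.calls += 1
        if self.calls > self.max_calls:
            raise CallBudgetExceeded(f"session exceeded {self.max_calls} LLM calls")

budget = SessionBudget(max_calls=3)
for _ in range(3):
    budget.charge()          # within budget: fine
try:
    budget.charge()          # fourth call trips the cap
    tripped = False
except CallBudgetExceeded:
    tripped = True
```

Failing loudly matters: a capped agent should surface a clear error to the orchestrator, not silently degrade.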

3. Observability shape

LLM tracing: one span per API call, maybe with retries.

Agent tracing: a trace tree that can branch, merge, parallelize, pause for human input, resume days later. Complete traces can have thousands of spans.

Tools that handle this well: Langfuse, LangSmith, Arize Phoenix, and Datadog APM with careful OTel config. See Tracing LLM Applications with OpenTelemetry.

What you want to visualize is the full trajectory per user request: the tree of LLM calls and tool invocations, token usage and latency per step, and where the agent branched, paused for input, or resumed.

Standard HTTP-service observability doesn't capture this well. Invest in a proper LLM observability stack.
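To make the "trace tree" shape concrete, here is a toy model of nested spans with token usage rolled up from leaves to root. Tools like Langfuse build and visualize this for you; the sketch only illustrates the data shape (all span names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    tokens: int = 0
    children: list = field(default_factory=list)

    def total_tokens(self) -> int:
        # Roll token usage up the trace tree, leaves to root.
        return self.tokens + sum(c.total_tokens() for c in self.children)

# One user request fanning out into planning, a tool call, and a final answer.
trace = Span("user-request", children=[
    Span("plan", tokens=800),
    Span("tool: search", children=[Span("summarize-results", tokens=1200)]),
    Span("final-answer", tokens=500),
])
total = trace.total_tokens()
```

Per-trace rollups like this are what let you answer "which step of which trajectory burned the tokens," which a flat per-request metric cannot.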

4. Failure modes

LLM serving failures: 429, 500, timeout, bad output. Easy to classify.

Agent failures include silent tool-call failures, infinite or runaway loops, workflows stalled waiting on events that never arrive, and tasks that finish with a plausible but wrong trajectory.

You need agent-specific guardrails: per-session step and cost caps, wall-clock timeouts, loop detection, and alerts on trajectory outliers.
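Loop detection, one of those guardrails, can start as simple as counting repeated identical tool calls within a trajectory. A sketch (threshold and tool names are illustrative):

```python
from collections import Counter

def is_looping(tool_calls, threshold=3):
    """Flag a trajectory where the same (tool, args) pair repeats suspiciously often."""
    counts = Counter((name, args) for name, args in tool_calls)
    return any(n >= threshold for n in counts.values())

# The same search with the same arguments three times in a row: likely stuck.
stuck = [("search", "weather london"), ("search", "weather london"), ("search", "weather london")]
healthy = [("search", "weather london"), ("search", "weather paris")]
looping = is_looping(stuck)
```

Production versions get fancier (sliding windows, semantic similarity of arguments), but even this naive counter catches the most expensive failure mode.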

5. Cost patterns

LLM serving: cost = tokens × price, predictable per call.

Agent serving: cost = (LLM calls per task) × (tokens per call) × price. Two sources of variance. Agent cost per user-request can vary 100x.

Attribution becomes harder: an individual LLM call's cost means little on its own, because a single user task fans out into dozens of calls across models and retries.

Finance wants "$X per completed user task." You need to tag every call with its trajectory and track full trajectory cost.
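Trajectory-level attribution reduces to tagging every LLM call with a trajectory ID and aggregating. A sketch with made-up model names and prices:

```python
from collections import defaultdict

# Illustrative per-1k-token prices; real pricing is per provider and model.
PRICE_PER_1K = {"small-model": 0.001, "big-model": 0.01}

def cost_per_trajectory(call_log):
    """Aggregate per-call token costs into cost per completed user task."""
    totals = defaultdict(float)
    for rec in call_log:
        totals[rec["trajectory_id"]] += rec["tokens"] / 1000 * PRICE_PER_1K[rec["model"]]
    return dict(totals)

log = [
    {"trajectory_id": "task-1", "model": "big-model", "tokens": 3000},
    {"trajectory_id": "task-1", "model": "small-model", "tokens": 10000},
    {"trajectory_id": "task-2", "model": "big-model", "tokens": 1000},
]
costs = cost_per_trajectory(log)
```

The trajectory ID has to be propagated into every LLM gateway call (e.g. as request metadata); without it, this aggregation is impossible after the fact.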


Reference Architecture

What a production-grade agent infrastructure looks like:

[ Users / API ]
       │
       ▼
[ API gateway ]
       │
       ▼
[ Agent orchestrator (Temporal / LangGraph / Inngest) ]
      │            │              │
      ▼            ▼              ▼
[ Workers ]  [ State store ]  [ Event bus      ]
    │        [ (Postgres)  ]  [ (Kafka, Redis) ]
    │
    ├── [ LLM gateway (LiteLLM / Portkey) ]
    │       └── (OpenAI, Anthropic, Google, self-hosted)
    │
    ├── [ Tool registry (MCP) ]
    │       ├── Internal tools
    │       ├── External APIs
    │       └── Sandboxed code execution
    │
    ├── [ Context store ]
    │       ├── Vector DB
    │       ├── Graph DB
    │       └── Working memory cache
    │
    └── [ Observability ]
            ├── Langfuse (LLM traces)
            ├── Datadog (operational metrics)
            └── Eval pipeline
The orchestrator is the main new component vs a plain LLM stack. Everything else is an evolution of LLM infrastructure.


Workflow Engine Choice

Temporal

Most comprehensive. Handles durable state, retries, human-in-the-loop, signals, long-running tasks. Polyglot (Go, Java, Python, TypeScript, .NET SDKs).

Operating it: either self-host (Kubernetes + Postgres + Elastic) or use Temporal Cloud.

Our default for non-trivial agent systems.

Inngest

Simpler than Temporal. Function-based. Good DX for Node/TS teams. Managed SaaS with self-host option.

Great for teams wanting less operational burden.

LangGraph (+ checkpointer)

Built into the agent framework itself. State persists via a checkpointer (Postgres, Redis, etc.). Not a full workflow engine — no retries with backoff, no schedules, no signals — but good enough for many cases.

Best for when the whole stack is LangGraph-based and workflow needs are moderate.

DBOS

Durable workflow on top of Postgres. Uses Postgres transactions for exactly-once guarantees. Lightweight. Emerging.

Custom

Rolling your own is where we’ve seen the most pain. Don’t.


Tool Registry: MCP Won

The Model Context Protocol (MCP) is now the default standard for agent tool registries. Every major framework (LangGraph, AutoGen, CrewAI, Anthropic’s Agent SDK, OpenAI Agents) speaks MCP.

A production MCP setup typically means a central registry of MCP servers covering internal tools, external APIs, and sandboxed code execution, fronted by per-agent allowlists.

Tools should follow the principle of least privilege: not every agent gets every tool. See Securing RAG Pipelines: Prompt Injection via Data for the security implications.
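Least-privilege tool exposure can be as simple as filtering the registry through a per-agent allowlist before handing tools to the model. A sketch (agent and tool names are hypothetical):

```python
# Per-agent allowlists: which tools each agent may ever see.
TOOL_ALLOWLISTS = {
    "support-agent": {"search_docs", "create_ticket"},
    "billing-agent": {"lookup_invoice"},
}

def resolve_tools(agent_name, registry):
    """Expose only the tools this agent is allowed to call (least privilege)."""
    allowed = TOOL_ALLOWLISTS.get(agent_name, set())
    return {name: fn for name, fn in registry.items() if name in allowed}

registry = {"search_docs": print, "create_ticket": print, "drop_database": print}
tools = resolve_tools("support-agent", registry)   # drop_database never reaches the model
```

The important property: the dangerous tool is absent from the agent's view entirely, rather than present but "discouraged" in the prompt.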


Context Store

Agent memory is more than a vector DB. Production agents work with several layers of state: semantic memory in a vector DB, entity relationships in a graph DB, short-lived working memory in a cache, and the conversation and task history itself.

The context store is the combination of all these. See Context Engineering.
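One way to picture that combination is a facade that assembles all the layers into a single context payload per LLM call. A toy sketch, with plain dicts standing in for the real vector, graph, and cache backends (all names are illustrative):

```python
class ContextStore:
    """Facade over the different memory layers an agent reads from."""
    def __init__(self, vector_db, graph_db, working_memory):
        self.vector_db = vector_db            # semantic recall
        self.graph_db = graph_db              # entity relationships
        self.working_memory = working_memory  # current-task scratch space

    def build_context(self, query):
        # Combine all layers into one payload for the next LLM call.
        return {
            "semantic": self.vector_db.get(query, []),
            "graph": self.graph_db.get(query, []),
            "working": dict(self.working_memory),
        }

store = ContextStore(
    vector_db={"refund policy": ["Refunds within 30 days."]},
    graph_db={"refund policy": [("policy", "owned_by", "billing")]},
    working_memory={"current_step": "draft-reply"},
)
ctx = store.build_context("refund policy")
```

The facade is where context-engineering decisions live: how much of each layer to include, and in what order, per call.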


Human-In-the-Loop

Production agents often need human approval for high-stakes actions. The infrastructure pieces: a durable pause in the workflow (a signal or event the engine waits on), a notification channel, an approval UI, and resumption logic once the human responds.

Underrated: the approval UI is where your whole system’s user experience lives. Invest in it.
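The pause/approve/resume cycle looks roughly like this. An in-memory dict stands in for the durable store, and all names are illustrative; a workflow engine like Temporal would persist the pause and deliver the approval as a signal:

```python
import uuid

PENDING = {}  # approval_id -> paused task state (a DB table in production)

def request_approval(task_state):
    """Pause: persist the task and hand back an id for the approval UI."""
    approval_id = str(uuid.uuid4())
    PENDING[approval_id] = task_state
    return approval_id

def resolve_approval(approval_id, approved):
    """Resume: pop the paused task and continue or cancel it."""
    state = PENDING.pop(approval_id)
    state["status"] = "resumed" if approved else "cancelled"
    return state

aid = request_approval({"action": "issue_refund", "amount": 120})
resumed = resolve_approval(aid, approved=True)   # human clicked "approve"
```

Note that nothing about the agent runs between the two calls; the task is parked durably, possibly for days, which is exactly why this belongs in the workflow layer and not in request-scoped memory.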


Autoscaling Agents

Not the same as autoscaling inference servers.

Inference servers: scale on request rate or queue depth. Stateless, lightweight horizontal scaling.

Agent workers: handle long-running tasks. Scaling up is easy; scaling down requires draining — waiting for in-flight tasks to complete before terminating.

Patterns: drain gracefully on scale-down, scale on task-queue depth rather than request rate, and keep long-running workers separate from the stateless API tier.
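Graceful drain in miniature: stop accepting new tasks, finish what is already queued, then exit. A single-threaded sketch for clarity (real workers drain concurrently; all names are illustrative):

```python
import queue

class DrainingWorker:
    """Worker that stops taking new tasks but finishes in-flight ones."""
    def __init__(self):
        self.tasks = queue.Queue()
        self.draining = False
        self.completed = []

    def submit(self, task) -> bool:
        if self.draining:
            return False            # refused: the load balancer routes elsewhere
        self.tasks.put(task)
        return True

    def drain(self):
        """Refuse new work, then run everything already queued."""
        self.draining = True
        while not self.tasks.empty():
            self.completed.append(self.tasks.get())

w = DrainingWorker()
w.submit("task-a")
w.submit("task-b")
w.drain()
rejected = w.submit("task-c")    # arrives mid-drain: refused, not dropped
```

The refusal path is the part teams forget: a draining worker must signal "not me" so the scheduler re-routes, rather than silently swallowing work.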


Evals for Agents

Standard LLM evals (correctness on a benchmark) don't fully capture agent behavior. You also care about task completion rate, trajectory efficiency (steps and cost per completed task), tool-call correctness, and whether guardrails held.

See Model Evals in Production: Regression Testing Prompts for the eval pipeline side.

For agents, we also run “trajectory replay” evals — run the agent against known scenarios, compare the full trajectory to a gold-standard path.
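A trajectory-replay comparison can start as simple as sequence similarity between the agent's step names and the gold path; `difflib` from the stdlib is enough for a first pass (step names and threshold are illustrative):

```python
from difflib import SequenceMatcher

def trajectory_score(actual, gold):
    """Similarity between an agent's step sequence and the gold-standard path (0.0-1.0)."""
    return SequenceMatcher(None, actual, gold).ratio()

gold = ["plan", "search", "summarize", "answer"]
good = ["plan", "search", "summarize", "answer"]
bad  = ["plan", "search", "search", "search", "answer"]  # looped on search
```

A score threshold per scenario (say, flag anything under 0.8) turns this into a regression gate; richer versions also compare tool arguments and intermediate outputs.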


The Bill

A realistic production agent system costs more than a chat app: beyond raw LLM spend, you are paying for the orchestrator, state store, event bus, observability stack, and eval pipeline.

For an internal-tool agent serving a few hundred users: $2,000–$8,000/month. For a consumer agent at scale: $50k+/month minimum. For an enterprise SaaS platform serving many tenants: $200k+/month.

Budget accordingly. And invest in AI FinOps from day one — agent costs will surprise you.


The Short Version

Agents differ from LLM serving in concurrency model, call amplification, observability shape, failure modes, and cost patterns. Most teams building their first real agent system underestimate this for a quarter. Then they rebuild. Plan for the shape early.


Further Reading

Building production agent infrastructure? Reach out — we’ve built agent platforms from MVP to multi-tenant scale.
