Multi-Agent Orchestration Infrastructure: Lessons from Production
Multi-agent systems promise to divide complex work across specialized agents that coordinate to solve problems. In 2023, demos looked great. In 2024, production deployments mostly looked cursed. In 2025–2026, a handful of patterns emerged that actually work — and a lot of patterns that don’t.
This post collects the lessons learned from deploying multi-agent systems across a dozen production contexts: what works, what burns, and what we’d rewrite if we started over.
The Patterns That Actually Work
1. Supervisor + Specialists
One “supervisor” agent decomposes tasks and routes subtasks to specialist agents. Specialists execute and return results. Supervisor integrates.
Simple, debuggable, effective. Most “multi-agent” production systems we see are actually this pattern.
Tools: LangGraph’s supervisor pattern, CrewAI’s hierarchical mode, custom orchestrators.
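In code, the shape is roughly this — a minimal Python sketch where the specialist functions and the fixed plan are stand-ins for real LLM calls (a production supervisor would decompose and integrate via the model):

```python
# Supervisor + specialists sketch. Specialist functions and the fixed plan
# are illustrative stand-ins for real LLM calls.

def research_specialist(subtask: str) -> str:
    return f"findings for: {subtask}"

def writing_specialist(subtask: str) -> str:
    return f"draft for: {subtask}"

SPECIALISTS = {
    "research": research_specialist,
    "write": writing_specialist,
}

def supervisor(task: str) -> str:
    # Decompose: in production this is an LLM call; here, a fixed plan.
    plan = [("research", task), ("write", task)]
    results = []
    for role, subtask in plan:
        # Route each subtask to the matching specialist and collect results.
        results.append(SPECIALISTS[role](subtask))
    # Integrate: concatenate specialist outputs (another LLM call in production).
    return " | ".join(results)

print(supervisor("compare vector databases"))
```

The control flow is entirely in the supervisor — which is exactly what makes this pattern debuggable.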
2. Pipeline (sequential specialists)
Task flows through a fixed sequence of agents: researcher → writer → editor. Each agent has a clear contract.
Predictable cost, easy to eval each step, low latency overhead.
When it works: tasks that naturally decompose into linear steps. Research, content pipelines, data processing.
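The contract can be sketched as a list of stages, string in and string out, with placeholder functions standing in for LLM-backed agents:

```python
# Pipeline sketch: researcher -> writer -> editor, each with a clear
# contract (string in, string out). Stage functions are placeholders.

from typing import Callable

def researcher(task: str) -> str:
    return f"notes({task})"

def writer(notes: str) -> str:
    return f"draft({notes})"

def editor(draft: str) -> str:
    return f"final({draft})"

PIPELINE: list[Callable[[str], str]] = [researcher, writer, editor]

def run_pipeline(task: str) -> str:
    state = task
    for stage in PIPELINE:
        # Each stage's output is the next stage's input — easy to eval per step.
        state = stage(state)
    return state

print(run_pipeline("topic"))
```

Because each stage is a pure function of the previous output, you can eval any stage in isolation by replaying stored inputs.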
3. Swarm (parallel specialists with shared state)
Multiple agents work on the same task simultaneously, coordinating via shared state or message bus.
Expensive (cost scales with the number of agents, each making its own LLM calls), harder to debug, but genuinely better on tasks where independent perspectives help.
When it works: complex research, code review (multiple reviewers with different perspectives), adversarial setups (one agent produces, another critiques).
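A sketch of the fan-out, using threads and stub reviewers in place of real LLM calls:

```python
# Swarm sketch: N reviewers run in parallel on the same input and the
# caller merges their perspectives. Reviewers stand in for LLM calls.

from concurrent.futures import ThreadPoolExecutor

def make_reviewer(perspective: str):
    def review(code: str) -> str:
        return f"{perspective}: reviewed {code}"
    return review

def swarm_review(code: str, perspectives: list[str]) -> list[str]:
    reviewers = [make_reviewer(p) for p in perspectives]
    # Run reviewers concurrently; each sees the same shared input.
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        return list(pool.map(lambda r: r(code), reviewers))

results = swarm_review("auth.py", ["security", "performance", "style"])
print(results)
```

The merge step (here: just returning the list) is where production systems spend their supervisor-model budget.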
4. Negotiator (two-agent)
Two agents negotiate until they agree. E.g., proposer + critic, buyer + seller.
The smallest possible “multi-agent” pattern. Captures much of the value of multi-agent setups without the cost explosion of larger swarms.
The Patterns That Don’t Work (But Look Good)
Fully-emergent crews
“Five agents with different roles just figure it out.” In practice: they spin forever, hand work back and forth, generate garbage, or silently coalesce on one agent doing everything.
Lesson: explicit control flow beats emergent coordination 9 times out of 10.
Peer-to-peer equal agents
“No supervisor, peer agents coordinate.” Communication overhead dominates. Tasks take 10x longer than they should.
Lesson: add a supervisor. Even a thin one.
Unbounded tool loops
“Let the agent call tools until it’s done.” In practice: 200 LLM calls, $40 in tokens, agent gets confused at turn 40 and loops.
Lesson: hard budgets. Max turns, max tokens, max calls. Always.
One model does everything
Using the strongest / most expensive model for every agent in the system. Costs stack.
Lesson: route. Supervisor on GPT-4o; specialists on cheaper models. Worker agents on 8B or 70B self-hosted.
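Routing can be as simple as a role-to-model map with a cheap fallback; the model names below are illustrative, not a recommendation:

```python
# Model-routing sketch: map agent roles to model tiers instead of running
# everything on the most expensive model. Model names are illustrative.

ROLE_MODEL = {
    "supervisor": "gpt-4o",        # strongest model for planning/integration
    "researcher": "gpt-4o-mini",   # cheaper hosted model for specialists
    "worker": "llama-3.1-8b",      # self-hosted for high-volume workers
}

def model_for(role: str) -> str:
    # Unknown roles fall back to the cheapest tier.
    return ROLE_MODEL.get(role, "llama-3.1-8b")

print(model_for("supervisor"))  # gpt-4o
```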
The Infrastructure Shape
What’s actually underneath a production multi-agent system:
[ User request ]
        │
        ▼
[ API gateway + auth ]
        │
        ▼
[ Orchestrator workflow ]
  (Temporal / LangGraph)
        │
  ┌─────────────────┼─────────────────┐
  ▼                 ▼                 ▼
[ Supervisor ]   [ Agent pool ]   [ Tool registry ]
  LLM call       (worker pods     (MCP-enabled)
                  running agents)
  │                 │
  ▼                 ▼
[ Shared state ] [ LLM gateway ]
  (Redis / PG)   (LiteLLM / Portkey)
  │                 │
  ▼                 ▼
[ Context store ] [ Providers + self-hosted ]
  (memory stack)
  │
  ▼
[ Observability ]
  (Langfuse / OTel)
Key components beyond single-agent:
- Agent pool — workers capable of running any specialist agent. Scaled independently.
- Shared state — how agents communicate. Redis for fast ephemeral; Postgres for durable.
- Tool registry — shared across agents (MCP).
- Observability — must handle multi-span traces; trace IDs propagate across agent calls.
Coordination Primitives
How do agents talk?
Shared memory
Agents read and write a common state object. Supervisor reads what specialists wrote; specialists see what others wrote.
Simplest. Works. Debug via state snapshots.
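A minimal sketch of the shared-state object, with a snapshot taken on every write so runs are replayable (the agent names are illustrative):

```python
# Shared-memory coordination sketch: agents communicate through one state
# object, and snapshots of that object double as the debug trail.

import copy

class SharedState:
    def __init__(self):
        self.data: dict[str, str] = {}
        self.snapshots: list[dict[str, str]] = []

    def write(self, agent: str, key: str, value: str) -> None:
        self.data[key] = value
        # Snapshot after every write so coordination is replayable.
        self.snapshots.append(copy.deepcopy(self.data))

state = SharedState()
state.write("researcher", "findings", "three candidate libraries")
state.write("writer", "draft", "summary of the findings")
print(state.data["draft"], len(state.snapshots))
```

In production the dict becomes a Redis hash or a Postgres row, but the read-your-peers'-writes semantics stay the same.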
Message passing
Agents send explicit messages via a queue (Kafka, Redis streams). Each agent subscribes to relevant messages.
More flexible. Handles high concurrency. More complex.
Direct calls
Supervisor literally invokes a specialist function/method. Specialist returns. Functional composition.
Cleanest for supervisor-specialist patterns. Breaks for async / long-running agents.
RPC
Agents are separate services. Communicate via HTTP/gRPC.
For multi-tenant platforms where agents need to be deployed independently. Higher operational burden.
We default to shared memory + direct calls for most multi-agent systems. Message passing and RPC for platforms with independent agent lifecycles.
Cost Control
Multi-agent systems amplify costs. A task that would use 10 LLM calls with one agent easily uses 100 with five agents.
Controls we always deploy:
- Per-task budget cap. Max $X per user task. Hit it → escalate to human.
- Per-agent turn cap. Each agent can call LLM N times max per task.
- Total-turn cap. Entire multi-agent task limited to M total turns.
- Tool-call budget. Tools cost money (external APIs, compute). Cap.
- Loop detection. Same state 3 times in a row = loop; escalate.
Without these, one bug produces a $10k bill overnight. With them, bugs get caught by a 429-style error.
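The caps above compose into a single guard checked on every turn; the thresholds here are illustrative:

```python
# Budget-guard sketch combining the caps above: dollars, total turns, and
# simple repeated-state loop detection. Thresholds are illustrative.

class BudgetExceeded(Exception):
    pass

class BudgetGuard:
    def __init__(self, max_cost: float, max_turns: int, loop_window: int = 3):
        self.max_cost, self.max_turns = max_cost, max_turns
        self.loop_window = loop_window
        self.cost = 0.0
        self.turns = 0
        self.recent_states: list[str] = []

    def charge(self, cost: float, state_fingerprint: str) -> None:
        self.cost += cost
        self.turns += 1
        if self.cost > self.max_cost:
            raise BudgetExceeded(f"cost cap hit: ${self.cost:.2f}")
        if self.turns > self.max_turns:
            raise BudgetExceeded(f"turn cap hit: {self.turns}")
        # Same state fingerprint N times in a row counts as a loop.
        self.recent_states.append(state_fingerprint)
        tail = self.recent_states[-self.loop_window:]
        if len(tail) == self.loop_window and len(set(tail)) == 1:
            raise BudgetExceeded("loop detected")

guard = BudgetGuard(max_cost=5.0, max_turns=50)
guard.charge(0.02, "state-a")
guard.charge(0.02, "state-b")
print(guard.turns, round(guard.cost, 2))
```

The fingerprint can be as crude as a hash of the shared state — crude loop detection still beats a $10k bill.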
Observability for Multi-Agent
Tracing is harder than in the single-agent case:
- One user task → many agent invocations → many LLM calls → many tool calls
- Agents call each other; tree structure has branches
- Async patterns mean spans can be concurrent
- Re-entrant agents (same agent called twice in a task) need distinguishable spans
We tag every span with:
- user_task_id — the top-level task
- agent_id — which agent this span belongs to
- agent_role — “supervisor”, “researcher”, etc.
- parent_agent_id — for calls between agents
Langfuse and Phoenix both visualize multi-agent traces well. Datadog does it with careful OTel semantic convention use.
Failure Handling
Single-agent failures: retry, fail gracefully, log.
Multi-agent failures compound:
- Specialist fails → supervisor either retries (infinite loop risk) or bails (task fails)
- Two specialists disagree → deadlock if no tiebreaker
- Shared state gets corrupted → all agents see bad data
- Partial failure (3 of 5 specialists succeed) → supervisor needs a policy
Patterns:
- Fail fast, retry whole task. Simplest. Works for cheap tasks.
- Checkpointed restart. Workflow engine checkpoints state; on failure, restart from the last known-good step.
- Specialist retries with backoff. Supervisor retries individual specialists before failing the whole task.
- Human escalation. After N failed specialist attempts, escalate to human.
Default: Temporal workflow with checkpoint per agent step + specialist-level retries + task-level human escalation after failure budget.
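The specialist-retry-then-escalate piece can be sketched like this (the delays and attempt counts are illustrative; Temporal gives you this as declarative retry policy):

```python
# Failure-handling sketch: per-specialist retries with exponential backoff,
# then human escalation once the failure budget is spent.

import time

class EscalateToHuman(Exception):
    pass

def call_with_retries(specialist, task: str, max_attempts: int = 3,
                      base_delay: float = 0.01) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return specialist(task)
        except Exception:
            if attempt == max_attempts:
                # Failure budget spent — hand the task to a human.
                raise EscalateToHuman(f"{max_attempts} attempts failed: {task}")
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# A specialist that fails twice, then succeeds — the retry path in action.
calls = {"n": 0}
def flaky_specialist(task: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"done: {task}"

print(call_with_retries(flaky_specialist, "summarize report"))
```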
When Is Multi-Agent Actually Worth It?
Honest answer: most “multi-agent” use cases we see would work as well with a single well-structured agent.
Multi-agent earns its keep when:
- Specialization is real. One agent with deep domain knowledge does meaningfully better than generalists. E.g., a “SQL agent” that knows schema.
- Parallelism helps. Task genuinely parallelizes. E.g., analyzing 10 documents simultaneously.
- Independent perspectives add value. Adversarial critique, red-team, eval-by-panel.
- Tool isolation. One agent has credentials to call one system; another agent doesn’t; so you split by privilege.
If none of these apply, single-agent is simpler and cheaper.
Framework Choice
Current state of the major options:
LangGraph
Flexible graph-based orchestration. Can express most multi-agent patterns. Moderate learning curve. Mature in 2026.
Best general-purpose choice.
CrewAI
Opinionated role-based multi-agent. Sacrifices flexibility for simplicity. Strong for supervisor + pipeline patterns.
Best “get started fast” choice.
AutoGen
Microsoft’s research-grade framework. Rich conversational patterns. More experimental than LangGraph.
Best if you’re Azure-committed or doing research.
Swarm (OpenAI) / Agent SDK
Lightweight OpenAI-first framework. Minimalist; assumes OpenAI models.
Best for OpenAI-only stacks.
Semantic Kernel
Microsoft .NET-flavored. Good for .NET organizations.
Anthropic Agent SDK
New in 2025. Claude-native. Strong tool use and MCP integration.
Best for Claude-based systems.
Our default: LangGraph for production, CrewAI for quick prototypes.
The Short Version
- Supervisor + specialists is the most reliable pattern
- Budget aggressively (turns, tokens, dollars) or multi-agent will surprise your bill
- Observability must span across agents, not just per-LLM-call
- Temporal-style workflow engines are genuinely necessary for production
- Most “multi-agent” use cases work as single agents if structured well
- MCP for shared tool registries; LangGraph or CrewAI for orchestration
Multi-agent systems are powerful. They’re also the area of AI infrastructure where hype has outrun reality the most. Use the pattern when it earns its keep; use a single agent otherwise.
Building a multi-agent system? Let’s talk — we can help scope whether multi-agent is actually warranted, and if so, architect it.