Multi-Agent Orchestration Infrastructure: Lessons from Production
Multi-agent systems promise to divide complex work across specialized agents that coordinate to solve problems. In 2023, demos looked great. In 2024, production deployments mostly looked cursed. In 2025–2026, a handful of patterns emerged that actually work — and a lot of patterns that don’t.
This post collects the lessons learned from deploying multi-agent systems across a dozen production contexts: what works, what burns, and what we’d rewrite if we started over.
The Patterns That Actually Work
1. Supervisor + Specialists
One “supervisor” agent decomposes tasks and routes subtasks to specialist agents. Specialists execute and return results. Supervisor integrates.
Simple, debuggable, effective. Most “multi-agent” production systems we see are actually this pattern.
Tools: LangGraph’s supervisor pattern, CrewAI’s hierarchical mode, custom orchestrators.
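In code, the shape is roughly this — a minimal Python sketch where the specialist functions and the fixed plan are stand-ins for real LLM calls (a production supervisor would decompose and integrate via the model):

```python
# Supervisor + specialists sketch. Specialist functions and the fixed plan
# are illustrative stand-ins for real LLM calls.

def research_specialist(subtask: str) -> str:
    return f"findings for: {subtask}"

def writing_specialist(subtask: str) -> str:
    return f"draft for: {subtask}"

SPECIALISTS = {
    "research": research_specialist,
    "write": writing_specialist,
}

def supervisor(task: str) -> str:
    # Decompose: in production this is an LLM call; here, a fixed plan.
    plan = [("research", task), ("write", task)]
    results = []
    for role, subtask in plan:
        # Route each subtask to the matching specialist and collect results.
        results.append(SPECIALISTS[role](subtask))
    # Integrate: concatenate specialist outputs (another LLM call in production).
    return " | ".join(results)

print(supervisor("compare vector databases"))
```

The control flow is entirely in the supervisor — which is exactly what makes this pattern debuggable.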
2. Pipeline (sequential specialists)
Task flows through a fixed sequence of agents: researcher → writer → editor. Each agent has a clear contract.
Predictable cost, easy to eval each step, low latency overhead.
When it works: tasks that naturally decompose into linear steps. Research, content pipelines, data processing.
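The contract can be sketched as a list of stages, string in and string out, with placeholder functions standing in for LLM-backed agents:

```python
# Pipeline sketch: researcher -> writer -> editor, each with a clear
# contract (string in, string out). Stage functions are placeholders.

from typing import Callable

def researcher(task: str) -> str:
    return f"notes({task})"

def writer(notes: str) -> str:
    return f"draft({notes})"

def editor(draft: str) -> str:
    return f"final({draft})"

PIPELINE: list[Callable[[str], str]] = [researcher, writer, editor]

def run_pipeline(task: str) -> str:
    state = task
    for stage in PIPELINE:
        # Each stage's output is the next stage's input — easy to eval per step.
        state = stage(state)
    return state

print(run_pipeline("topic"))
```

Because each stage is a pure function of the previous output, you can eval any stage in isolation by replaying stored inputs.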
3. Swarm (parallel specialists with shared state)
Multiple agents work on the same task simultaneously, coordinating via shared state or message bus.
Expensive (cost scales with the number of agents, each making its own LLM calls), harder to debug, but genuinely better on tasks where independent perspectives help.
When it works: complex research, code review (multiple reviewers with different perspectives), adversarial setups (one agent produces, another critiques).
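A sketch of the fan-out, using threads and stub reviewers in place of real LLM calls:

```python
# Swarm sketch: N reviewers run in parallel on the same input and the
# caller merges their perspectives. Reviewers stand in for LLM calls.

from concurrent.futures import ThreadPoolExecutor

def make_reviewer(perspective: str):
    def review(code: str) -> str:
        return f"{perspective}: reviewed {code}"
    return review

def swarm_review(code: str, perspectives: list[str]) -> list[str]:
    reviewers = [make_reviewer(p) for p in perspectives]
    # Run reviewers concurrently; each sees the same shared input.
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        return list(pool.map(lambda r: r(code), reviewers))

results = swarm_review("auth.py", ["security", "performance", "style"])
print(results)
```

The merge step (here: just returning the list) is where production systems spend their supervisor-model budget.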
4. Negotiator (two-agent)
Two agents negotiate until they agree. E.g., proposer + critic, buyer + seller.
The smallest possible “multi-agent” pattern. Captures much of the value of multi-agent setups without the cost explosion of larger swarms.
The Patterns That Don’t Work (But Look Good)
Fully-emergent crews
“Five agents with different roles just figure it out.” In practice: they spin forever, hand work back and forth, generate garbage, or silently coalesce on one agent doing everything.
Lesson: explicit control flow beats emergent coordination 9 times out of 10.
Peer-to-peer equal agents
“No supervisor, peer agents coordinate.” Communication overhead dominates. Tasks take 10x longer than they should.
Lesson: add a supervisor. Even a thin one.
Unbounded tool loops
“Let the agent call tools until it’s done.” In practice: 200 LLM calls, $40 in tokens, agent gets confused at turn 40 and loops.
Lesson: hard budgets. Max turns, max tokens, max calls. Always.
One model does everything
Using the strongest / most expensive model for every agent in the system. Costs stack.
Lesson: route. Supervisor on GPT-4o; specialists on cheaper models. Worker agents on 8B or 70B self-hosted.
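Routing can be as simple as a role-to-model map with a cheap fallback; the model names below are illustrative, not a recommendation:

```python
# Model-routing sketch: map agent roles to model tiers instead of running
# everything on the most expensive model. Model names are illustrative.

ROLE_MODEL = {
    "supervisor": "gpt-4o",        # strongest model for planning/integration
    "researcher": "gpt-4o-mini",   # cheaper hosted model for specialists
    "worker": "llama-3.1-8b",      # self-hosted for high-volume workers
}

def model_for(role: str) -> str:
    # Unknown roles fall back to the cheapest tier.
    return ROLE_MODEL.get(role, "llama-3.1-8b")

print(model_for("supervisor"))  # gpt-4o
```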
The Infrastructure Shape
What’s actually underneath a production multi-agent system:
[ User request ]
        │
        ▼
[ API gateway + auth ]
        │
        ▼
[ Orchestrator workflow ]
  (Temporal / LangGraph)
        │
  ┌─────────────────┼─────────────────┐
  ▼                 ▼                 ▼
[ Supervisor ]   [ Agent pool ]   [ Tool registry ]
  LLM call       (worker pods     (MCP-enabled)
                  running agents)
  │                 │
  ▼                 ▼
[ Shared state ] [ LLM gateway ]
  (Redis / PG)   (LiteLLM / Portkey)
  │                 │
  ▼                 ▼
[ Context store ] [ Providers + self-hosted ]
  (memory stack)
  │
  ▼
[ Observability ]
  (Langfuse / OTel)
Key components beyond single-agent:
- Agent pool — workers capable of running any specialist agent. Scaled independently.
- Shared state — how agents communicate. Redis for fast ephemeral; Postgres for durable.
- Tool registry — shared across agents (MCP).
- Observability — must handle multi-span traces; trace IDs propagate across agent calls.
Coordination Primitives
How do agents talk?
Shared memory
Agents read and write a common state object. Supervisor reads what specialists wrote; specialists see what others wrote.
Simplest. Works. Debug via state snapshots.
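A minimal sketch of the shared-state object, with a snapshot taken on every write so runs are replayable (the agent names are illustrative):

```python
# Shared-memory coordination sketch: agents communicate through one state
# object, and snapshots of that object double as the debug trail.

import copy

class SharedState:
    def __init__(self):
        self.data: dict[str, str] = {}
        self.snapshots: list[dict[str, str]] = []

    def write(self, agent: str, key: str, value: str) -> None:
        self.data[key] = value
        # Snapshot after every write so coordination is replayable.
        self.snapshots.append(copy.deepcopy(self.data))

state = SharedState()
state.write("researcher", "findings", "three candidate libraries")
state.write("writer", "draft", "summary of the findings")
print(state.data["draft"], len(state.snapshots))
```

In production the dict becomes a Redis hash or a Postgres row, but the read-your-peers'-writes semantics stay the same.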
Message passing
Agents send explicit messages via a queue (Kafka, Redis streams). Each agent subscribes to relevant messages.
More flexible. Handles high concurrency. More complex.
Direct calls
Supervisor literally invokes a specialist function/method. Specialist returns. Functional composition.
Cleanest for supervisor-specialist patterns. Breaks for async / long-running agents.
RPC
Agents are separate services. Communicate via HTTP/gRPC.
For multi-tenant platforms where agents need to be deployed independently. Higher operational burden.
We default to shared memory + direct calls for most multi-agent systems. Message passing and RPC for platforms with independent agent lifecycles.
Cost Control
Multi-agent systems amplify costs. A task that would use 10 LLM calls with one agent easily uses 100 with five agents.
Controls we always deploy:
- Per-task budget cap. Max $X per user task. Hit it → escalate to human.
- Per-agent turn cap. Each agent can call LLM N times max per task.
- Total-turn cap. Entire multi-agent task limited to M total turns.
- Tool-call budget. Tools cost money (external APIs, compute). Cap.
- Loop detection. Same state 3 times in a row = loop; escalate.
Without these, one bug produces a $10k bill overnight. With them, bugs get caught by a 429-style error.
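The caps above compose into a single guard checked on every turn; the thresholds here are illustrative:

```python
# Budget-guard sketch combining the caps above: dollars, total turns, and
# simple repeated-state loop detection. Thresholds are illustrative.

class BudgetExceeded(Exception):
    pass

class BudgetGuard:
    def __init__(self, max_cost: float, max_turns: int, loop_window: int = 3):
        self.max_cost, self.max_turns = max_cost, max_turns
        self.loop_window = loop_window
        self.cost = 0.0
        self.turns = 0
        self.recent_states: list[str] = []

    def charge(self, cost: float, state_fingerprint: str) -> None:
        self.cost += cost
        self.turns += 1
        if self.cost > self.max_cost:
            raise BudgetExceeded(f"cost cap hit: ${self.cost:.2f}")
        if self.turns > self.max_turns:
            raise BudgetExceeded(f"turn cap hit: {self.turns}")
        # Same state fingerprint N times in a row counts as a loop.
        self.recent_states.append(state_fingerprint)
        tail = self.recent_states[-self.loop_window:]
        if len(tail) == self.loop_window and len(set(tail)) == 1:
            raise BudgetExceeded("loop detected")

guard = BudgetGuard(max_cost=5.0, max_turns=50)
guard.charge(0.02, "state-a")
guard.charge(0.02, "state-b")
print(guard.turns, round(guard.cost, 2))
```

The fingerprint can be as crude as a hash of the shared state — crude loop detection still beats a $10k bill.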
Observability for Multi-Agent
Tracing is harder than in the single-agent case:
- One user task → many agent invocations → many LLM calls → many tool calls
- Agents call each other; tree structure has branches
- Async patterns mean spans can be concurrent
- Re-entrant agents (same agent called twice in a task) need distinguishable spans
We tag every span with:
- user_task_id — the top-level task
- agent_id — which agent this span belongs to
- agent_role — “supervisor”, “researcher”, etc.
- parent_agent_id — for calls between agents
Langfuse and Phoenix both visualize multi-agent traces well. Datadog does it with careful OTel semantic convention use.
Failure Handling
Single-agent failures: retry, fail gracefully, log.
Multi-agent failures compound:
- Specialist fails → supervisor either retries (infinite loop risk) or bails (task fails)
- Two specialists disagree → deadlock if no tiebreaker
- Shared state gets corrupted → all agents see bad data
- Partial failure (3 of 5 specialists succeed) → supervisor needs a policy
Patterns:
- Fail fast, retry whole task. Simplest. Works for cheap tasks.
- Checkpointed restart. Workflow engine checkpoints state; on failure, restart from the last known-good step.
- Specialist retries with backoff. Supervisor retries individual specialists before failing the whole task.
- Human escalation. After N failed specialist attempts, escalate to human.
Default: Temporal workflow with checkpoint per agent step + specialist-level retries + task-level human escalation after failure budget.
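The specialist-retry-then-escalate piece can be sketched like this (the delays and attempt counts are illustrative; Temporal gives you this as declarative retry policy):

```python
# Failure-handling sketch: per-specialist retries with exponential backoff,
# then human escalation once the failure budget is spent.

import time

class EscalateToHuman(Exception):
    pass

def call_with_retries(specialist, task: str, max_attempts: int = 3,
                      base_delay: float = 0.01) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return specialist(task)
        except Exception:
            if attempt == max_attempts:
                # Failure budget spent — hand the task to a human.
                raise EscalateToHuman(f"{max_attempts} attempts failed: {task}")
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# A specialist that fails twice, then succeeds — the retry path in action.
calls = {"n": 0}
def flaky_specialist(task: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"done: {task}"

print(call_with_retries(flaky_specialist, "summarize report"))
```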
When Is Multi-Agent Actually Worth It?
Honest answer: most “multi-agent” use cases we see would work as well with a single well-structured agent.
Multi-agent earns its keep when:
- Specialization is real. One agent with deep domain knowledge does meaningfully better than generalists. E.g., a “SQL agent” that knows schema.
- Parallelism helps. Task genuinely parallelizes. E.g., analyzing 10 documents simultaneously.
- Independent perspectives add value. Adversarial critique, red-team, eval-by-panel.
- Tool isolation. One agent has credentials to call one system; another agent doesn’t; so you split by privilege.
If none of these apply, single-agent is simpler and cheaper.
Framework Choice
Current state of the major options:
LangGraph
Flexible graph-based orchestration. Can express most multi-agent patterns. Moderate learning curve. Mature in 2026.
Best general-purpose choice.
CrewAI
Opinionated role-based multi-agent. Sacrifices flexibility for simplicity. Strong for supervisor + pipeline patterns.
Best “get started fast” choice.
AutoGen
Microsoft’s research-grade framework. Rich conversational patterns. More experimental than LangGraph.
Best if you’re Azure-committed or doing research.
Swarm (OpenAI) / Agent SDK
Lightweight OpenAI-first framework. Minimalist; assumes OpenAI models.
Best for OpenAI-only stacks.
Semantic Kernel
Microsoft .NET-flavored. Good for .NET organizations.
Anthropic Agent SDK
New in 2025. Claude-native. Strong tool use and MCP integration.
Best for Claude-based systems.
Our default: LangGraph for production, CrewAI for quick prototypes.
The Short Version
- Supervisor + specialists is the most reliable pattern
- Budget aggressively (turns, tokens, dollars) or multi-agent will surprise your bill
- Observability must span across agents, not just per-LLM-call
- Temporal-style workflow engines are genuinely necessary for production
- Most “multi-agent” use cases work as single agents if structured well
- MCP for shared tool registries; LangGraph or CrewAI for orchestration
Multi-agent systems are powerful. They’re also the area of AI infrastructure where hype has outrun reality the most. Use the pattern when it earns its keep; use a single agent otherwise.
Building a multi-agent system? Let’s talk — we can help scope whether multi-agent is actually warranted, and if so, architect it.