
Multi-Agent Orchestration Infrastructure: Lessons from Production

Balys Kriksciunas 7 min read
#ai#infrastructure#multi-agent#orchestration#crewai#autogen#langgraph#mcp


Multi-agent systems promise to divide complex work across specialized agents that coordinate to solve problems. In 2023, demos looked great. In 2024, production deployments mostly looked cursed. In 2025–2026, a handful of patterns emerged that actually work — and a lot of patterns that don’t.

This post distills the lessons learned from deploying multi-agent systems across a dozen production contexts: what works, what burns money, and what we’d rewrite if we started over.


The Patterns That Actually Work

1. Supervisor + Specialists

One “supervisor” agent decomposes tasks and routes subtasks to specialist agents. Specialists execute and return results. Supervisor integrates.

Simple, debuggable, effective. Most “multi-agent” production systems we see are actually this pattern.

Tools: LangGraph’s supervisor pattern, CrewAI’s hierarchical mode, custom orchestrators.
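A minimal sketch of the pattern in plain Python, no framework. `call_llm` and the specialist roles are hypothetical stand-ins for your model client and real agents:

```python
# Supervisor + specialists, sketched. The supervisor decomposes the task,
# routes subtasks to specialists, and integrates the results.

def call_llm(prompt: str) -> str:
    # Placeholder: in production this wraps your LLM gateway.
    return f"result for: {prompt}"

SPECIALISTS = {
    "research": lambda task: call_llm(f"Research: {task}"),
    "write":    lambda task: call_llm(f"Write: {task}"),
}

def supervisor(task: str) -> str:
    # Decompose: a fixed plan here; real supervisors use an LLM call to plan.
    plan = [("research", task), ("write", task)]
    # Route each subtask to its specialist and collect results.
    results = [SPECIALISTS[role](subtask) for role, subtask in plan]
    # Integrate: here a trivial join; real supervisors synthesize via the LLM.
    return " | ".join(results)
```

The routing table being a plain dict is the point: most production "multi-agent" systems are this simple underneath.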

2. Pipeline (sequential specialists)

Task flows through a fixed sequence of agents: researcher → writer → editor. Each agent has a clear contract.

Predictable cost, easy to eval each step, low latency overhead.

When it works: tasks that naturally decompose into linear steps. Research, content pipelines, data processing.
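The pipeline pattern is just function composition. A sketch, with stage bodies stubbed in place of real LLM calls:

```python
# Pipeline: researcher -> writer -> editor. Each stage has a clear contract
# (str in, str out), which is what makes per-step evals easy.
from functools import reduce

def researcher(task: str) -> str:
    return f"notes({task})"      # stub: gather material

def writer(notes: str) -> str:
    return f"draft({notes})"     # stub: produce a draft

def editor(draft: str) -> str:
    return f"final({draft})"     # stub: polish the draft

PIPELINE = [researcher, writer, editor]

def run_pipeline(task: str) -> str:
    # Feed each stage's output into the next stage.
    return reduce(lambda state, stage: stage(state), PIPELINE, task)
```

Because each stage is a pure function of the previous stage's output, you can snapshot and eval any step in isolation.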

3. Swarm (parallel specialists with shared state)

Multiple agents work on the same task simultaneously, coordinating via shared state or message bus.

Expensive (every agent burns its own LLM calls, so cost scales with swarm size), harder to debug, but genuinely better on tasks where independent perspectives help.

When it works: complex research, code review (multiple reviewers with different perspectives), adversarial setups (one agent produces, another critiques).
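A sketch of the shared-state variant, with reviewer bodies stubbed in place of LLM calls. The lock around the shared list is the important part:

```python
# Swarm: N reviewers run concurrently and append findings to shared state.
from concurrent.futures import ThreadPoolExecutor
import threading

def swarm_review(code: str, perspectives: list[str]) -> list[str]:
    findings: list[str] = []
    lock = threading.Lock()

    def reviewer(perspective: str) -> None:
        finding = f"{perspective}: reviewed {code}"  # stub for an LLM call
        with lock:                                   # shared state needs a lock
            findings.append(finding)

    # All reviewers run concurrently; list() surfaces any worker exceptions,
    # and the with-block waits for completion before we return.
    with ThreadPoolExecutor() as pool:
        list(pool.map(reviewer, perspectives))
    return findings
```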

4. Negotiator (two-agent)

Two agents negotiate until they agree. E.g., proposer + critic, buyer + seller.

Smallest possible “multi-agent” pattern. Captures much of the value of multi-agent setups without the cost explosion of larger swarms.


The Patterns That Don’t Work (But Look Good)

Fully-emergent crews

“Five agents with different roles just figure it out.” In practice: they spin forever, hand work back and forth, generate garbage, or silently coalesce on one agent doing everything.

Lesson: explicit control flow beats emergent coordination 9 times out of 10.

Peer-to-peer equal agents

“No supervisor, peer agents coordinate.” Communication overhead dominates. Tasks take 10x longer than they should.

Lesson: add a supervisor. Even a thin one.

Unbounded tool chaining

“Let the agent call tools until it’s done.” In practice: 200 LLM calls, $40 in tokens, agent gets confused at turn 40 and loops.

Lesson: hard budgets. Max turns, max tokens, max calls. Always.
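What a hard-budget wrapper looks like in practice. `agent_step` is a stub for one LLM-plus-tools turn, and the cap values are illustrative defaults:

```python
# Budget-capped agent loop: every turn draws down hard caps on turns and
# spend, so a confused agent fails fast instead of looping at $40/task.

class BudgetExceeded(Exception):
    pass

def agent_step(state: str) -> tuple[str, float, bool]:
    # Stub: returns (new_state, cost_usd, done). Real version calls the LLM
    # and tools, and reports actual token cost.
    return state + ".", 0.01, len(state) > 5

def run_with_budget(state: str, max_turns: int = 30,
                    max_cost: float = 2.0) -> str:
    spent = 0.0
    for _ in range(max_turns):
        state, cost, done = agent_step(state)
        spent += cost
        if spent > max_cost:
            raise BudgetExceeded(f"spent ${spent:.2f}")  # escalate, don't loop
        if done:
            return state
    raise BudgetExceeded(f"hit {max_turns} turns")
```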

One model does everything

Using the strongest / most expensive model for every agent in the system. Costs stack.

Lesson: route. Supervisor on GPT-4o; specialists on cheaper models. Worker agents on 8B or 70B self-hosted.
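Routing can be as simple as a role-to-model table. The model names below are illustrative, not a recommendation:

```python
# Role-based model routing: map each agent role to a model tier instead of
# sending every call to the most expensive model.

MODEL_ROUTES = {
    "supervisor": "gpt-4o",           # planning quality matters most here
    "researcher": "self-hosted-8b",   # high-volume, cheap
    "writer":     "mid-tier-70b",
}

def model_for(role: str, default: str = "mid-tier-70b") -> str:
    # Unknown roles fall back to the mid tier rather than the top tier.
    return MODEL_ROUTES.get(role, default)
```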


The Infrastructure Shape

What’s actually underneath a production multi-agent system:

                 [ User request ]
                        │
                        ▼
              [ API gateway + auth ]
                        │
                        ▼
            [ Orchestrator workflow ]
               (Temporal / LangGraph)
                        │
       ┌────────────────┼─────────────────┐
       ▼                ▼                 ▼
[ Supervisor ]   [ Agent pool ]   [ Tool registry ]
   LLM call      (worker pods     (MCP-enabled)
                  running agents)
       │                 │
       ▼                 ▼
[ Shared state ]  [ LLM gateway ]
 (Redis / PG)     (LiteLLM / Portkey)
       │                 │
       ▼                 ▼
[ Context store ] [ Providers + self-hosted ]
 (memory stack)

               [ Observability ]
               (Langfuse / OTel)

Key components beyond the single-agent stack, covered in the sections below: coordination primitives, cost control, observability, and failure handling.


Coordination Primitives

How do agents talk?

Shared memory

Agents read and write a common state object. Supervisor reads what specialists wrote; specialists see what others wrote.

Simplest. Works. Debug via state snapshots.
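A sketch of the shared-state object, with a snapshot taken after every write so you can replay exactly what each agent saw:

```python
# Shared-memory coordination: agents read and write one state object, and
# per-write snapshots are the debugging surface. All names are illustrative.
import copy

class SharedState:
    def __init__(self) -> None:
        self._state: dict = {}
        self.snapshots: list[dict] = []

    def write(self, agent: str, key: str, value) -> None:
        self._state.setdefault(agent, {})[key] = value
        # Deep-copy snapshot after every write: a replayable debug trail.
        self.snapshots.append(copy.deepcopy(self._state))

    def read(self, agent: str) -> dict:
        # Any agent can read what any other agent wrote.
        return self._state.get(agent, {})
```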

Message passing

Agents send explicit messages via a queue (Kafka, Redis streams). Each agent subscribes to relevant messages.

More flexible. Handles high concurrency. More complex.

Direct calls

Supervisor literally invokes a specialist function/method. Specialist returns. Functional composition.

Cleanest for supervisor-specialist patterns. Breaks for async / long-running agents.

RPC

Agents are separate services. Communicate via HTTP/gRPC.

For multi-tenant platforms where agents need to be deployed independently. Higher operational burden.

We default to shared memory + direct calls for most multi-agent systems. Message passing and RPC for platforms with independent agent lifecycles.


Cost Control

Multi-agent systems amplify costs. A task that would use 10 LLM calls with one agent easily uses 100 with five agents.

Controls we always deploy:

  1. Per-task budget cap. Max $X per user task. Hit it → escalate to human.
  2. Per-agent turn cap. Each agent can call LLM N times max per task.
  3. Total-turn cap. Entire multi-agent task limited to M total turns.
  4. Tool-call budget. Tools cost money (external APIs, compute). Cap.
  5. Loop detection. Same state 3 times in a row = loop; escalate.

Without these, one bug produces a $10k bill overnight. With them, bugs get caught by a 429-style error.
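Loop detection (control 5) is only a few lines: fingerprint the agent’s state each turn and flag when the same fingerprint repeats. A sketch:

```python
# Loop detection: hash the agent state every turn; the same hash N times
# in a row means the agent is stuck and should escalate.
from collections import deque
import hashlib

def state_fingerprint(state: str) -> str:
    return hashlib.sha256(state.encode()).hexdigest()

def is_looping(recent: deque, state: str, repeats: int = 3) -> bool:
    recent.append(state_fingerprint(state))
    if len(recent) < repeats:
        return False
    tail = list(recent)[-repeats:]
    return len(set(tail)) == 1   # same fingerprint `repeats` times in a row
```

The caller keeps one `deque` per task and checks `is_looping` after every agent turn.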


Observability for Multi-Agent

Tracing is harder than in single-agent systems: one user task fans out into dozens of spans across agents, and you need to reconstruct which agent did what, in what order, and at what cost.

We tag every span with the task ID, the agent name, the turn number, and the model used, so traces can be grouped per task and filtered per agent.

Langfuse and Phoenix both visualize multi-agent traces well. Datadog does it with careful OTel semantic convention use.


Failure Handling

Single-agent failures: retry, fail gracefully, log.

Multi-agent failures compound: a failed specialist can poison shared state, stall the supervisor, and burn the budget of every downstream agent.

Patterns that help: checkpoint after every agent step, retry at the specialist level, and escalate to a human once a failure budget is exhausted.

Default: Temporal workflow with checkpoint per agent step + specialist-level retries + task-level human escalation after failure budget.
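The shape of that default, minus Temporal, in plain Python. Step functions, retry counts, and the failure budget are illustrative; Temporal gives you the checkpointing and retries as workflow features:

```python
# Checkpoint per step + per-specialist retries + task-level escalation
# once the failure budget is exhausted.

def run_task(steps, state, max_retries=2, failure_budget=3):
    checkpoints = [state]          # resume point after each completed step
    failures = 0
    for step in steps:
        for _ in range(max_retries + 1):
            try:
                state = step(state)
                checkpoints.append(state)
                break              # step succeeded, move on
            except Exception:
                failures += 1
                if failures >= failure_budget:
                    # Budget exhausted: hand the last good state to a human.
                    return ("escalate_to_human", checkpoints[-1])
        else:
            # Retries exhausted for this step without hitting the budget.
            return ("escalate_to_human", checkpoints[-1])
    return ("done", state)
```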


When Is Multi-Agent Actually Worth It?

Honest answer: most “multi-agent” use cases we see would work as well with a single well-structured agent.

Multi-agent earns its keep when:

  1. Specialization is real. One agent with deep domain knowledge does meaningfully better than generalists. E.g., a “SQL agent” that knows schema.
  2. Parallelism helps. Task genuinely parallelizes. E.g., analyzing 10 documents simultaneously.
  3. Independent perspectives add value. Adversarial critique, red-team, eval-by-panel.
  4. Tool isolation. One agent has credentials to call one system; another agent doesn’t; so you split by privilege.

If none of these apply, single-agent is simpler and cheaper.


Framework Choice

Current state of the major options:

LangGraph

Flexible graph-based orchestration. Can express most multi-agent patterns. Moderate learning curve. Mature in 2026.

Best general-purpose choice.

CrewAI

Opinionated role-based multi-agent. Sacrifices flexibility for simplicity. Strong for supervisor + pipeline patterns.

Best “get started fast” choice.

AutoGen

Microsoft’s research-grade framework. Rich conversational patterns. More experimental than LangGraph.

Best if you’re Azure-committed or doing research.

Swarm (OpenAI) / Agent SDK

Lightweight OpenAI-first framework. Minimalist; assumes OpenAI models.

Best for OpenAI-only stacks.

Semantic Kernel

Microsoft .NET-flavored. Good for .NET organizations.

Anthropic Agent SDK

New in 2025. Claude-native. Strong tool use and MCP integration.

Best for Claude-based systems.

Our default: LangGraph for production, CrewAI for quick prototypes.


The Short Version

Multi-agent systems are powerful. They’re also the area of AI infrastructure where hype has outrun reality the most. Use the pattern when it earns its keep; use a single agent otherwise.



Building a multi-agent system? Let’s talk — we can help scope whether multi-agent is actually warranted, and if so, architect it.
