Multi-agent system design patterns that survive production

Multi-agent systems burn roughly 15 times more tokens than chat. 40% of pilots fail within six months. Five design patterns decide whether your AI agent team ships or stalls.

May 10, 2026 · 7 min read

A multi-agent system is an AI architecture where two or more LLM agents coordinate to solve a task that a single agent handles poorly, by splitting roles, running in parallel, or passing state through an explicit graph. The pattern works when the problem is decomposable, the value of the answer covers the token bill, and the team has a way to debug failures across agents. It breaks when those three conditions are not met.

Most teams find this out the expensive way. Anthropic's own measurement is that multi-agent systems consume roughly 15 times more tokens than a chat interaction, while delivering a 90.2% improvement on internal research evaluations. That ratio only makes sense for high-value tasks where the answer is worth the spend. For a chatbot answering refund requests, it is malpractice.

Why naive multi-agent setups fail in production

The Berkeley Sky Computing Lab released MAST, a Multi-Agent Systems Failure Taxonomy, after analyzing 1,600+ traces across seven popular open-source frameworks. They found 14 distinct failure modes grouped into three categories: system design issues, inter-agent misalignment, and task verification gaps. The headline number: ChatDev, one of the most cited frameworks in academic literature, scored 33.33% correctness on its own ProgramDev benchmark.

Industry data points the same way. Roughly 40% of multi-agent pilots fail within six months of production deployment, usually because the team picked an orchestration shape that did not match the problem and only discovered it after the bill arrived.

The naive failure mode is consistent. A team reads a blog post about agent collaboration, chooses CrewAI or LangGraph because it was top of the search results, wraps four roles around the same model, and calls it a multi-agent system. The agents then spend tokens arguing with each other, repeat each other's work, lose state when the conversation grows past the context window, and fail silently because no one wired observability into the graph. Each individual fix looks small. The cumulative cost is the project.

Why "just add more agents" makes it worse

The instinct, when an agent struggles, is to spawn helpers. This is wrong by default. Each new agent multiplies coordination cost, expands the surface for inter-agent misalignment (MAST category 2), and adds another candidate for the orchestrator to mismanage. Every paper and every production retrospective converges on the same advice: start with a single agent, add specialization only when a measurable bottleneck demands it, and never split a task that one agent could complete in one pass.

The other instinct is to debate. Multi-agent debate looks elegant on paper and produces better answers on certain reasoning benchmarks, but in production it burns tokens on every additional turn, often without converging. Reserve it for tasks where the marginal accuracy gain is auditable and the latency budget is generous.

The five patterns that survive production

1. Orchestrator-Worker

One lead agent decomposes a query into subtasks, dispatches them to specialized worker agents, and assembles the result. This is the shape Anthropic's Research feature uses, with Claude Opus 4 as orchestrator and Claude Sonnet 4 as the workers, each with its own context window and tool set. The pattern is correct when subtasks are genuinely independent (search subgraph A while subgraph B runs), when each worker has a clearly bounded objective, and when the orchestrator can describe the task in a structured prompt that names the output format.

Use it for: research synthesis, parallel data gathering, code generation across modules. Avoid for: chat, refund flows, anything sequential by nature. The orchestrator delegates tools through a standard interface, which is one reason an MCP server usually pays for itself the moment more than one agent shares the same tools.
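A minimal sketch of the shape, assuming a hypothetical `call_model` stand-in for the real LLM API (in practice, Claude Opus as orchestrator and Sonnet workers via the Anthropic Messages API):

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(role: str, prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned string here so the
    # sketch runs offline. Swap in an actual API client in production.
    return f"[{role}] answer for: {prompt}"

def orchestrate(query: str) -> str:
    # 1. The orchestrator decomposes the query into bounded subtasks,
    #    each with a named output format.
    subtasks = [f"{query} :: {aspect}" for aspect in ("funding", "product", "hiring")]
    # 2. Workers run in parallel, each with its own context and objective.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: call_model("worker", t), subtasks))
    # 3. The orchestrator assembles the worker artifacts into one answer.
    return call_model("orchestrator", "synthesize: " + " | ".join(results))
```

The key property is that workers never talk to each other; all coordination flows through the orchestrator, which keeps the trace debuggable.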

2. Sequential Pipeline

Agents run in a fixed order. Output of step N is the input to step N+1. The shape is boring, predictable, and accounts for most production agent workloads that ship. Pipelines make sense when the task has a natural progression (extract, transform, validate, write) and when each step's output schema is stable enough to type-check at the boundary.

The trap is using a pipeline when the task is actually a graph. If step 3 sometimes needs to revisit step 1, you do not have a pipeline. You have a state machine, which is what LangGraph models well.
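A pipeline sketch with the boundary checks described above: each step asserts its input schema, so drift fails loudly at the seam instead of silently three steps downstream. Step bodies are illustrative stand-ins for LLM or tool calls.

```python
def extract(doc: str) -> dict:
    return {"text": doc.strip()}

def transform(payload: dict) -> dict:
    assert "text" in payload, "schema break between extract and transform"
    return {"summary": payload["text"].upper()}

def validate(payload: dict) -> dict:
    assert "summary" in payload, "schema break between transform and validate"
    payload["ok"] = len(payload["summary"]) > 0
    return payload

def run_pipeline(doc: str) -> dict:
    # Fixed order: output of step N is the input to step N+1, nothing loops back.
    result = doc
    for step in (extract, transform, validate):
        result = step(result)
    return result
```

If you find yourself wanting `validate` to send work back to `extract`, that is the signal you have a state machine, not a pipeline.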

3. Parallel fan-out and fan-in

An orchestrator dispatches the same task variation across N workers, then aggregates. This pattern is what gives multi-agent systems their measured speed advantage. It only pays off when the work is genuinely parallel, when the aggregator step is non-trivial (otherwise you are paying N times for one answer), and when the token budget allows. The Anthropic 90.2% performance number lives here.
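A fan-out/fan-in sketch using `asyncio.gather`. The `worker` here is a hypothetical stand-in for an LLM call with a search tool; the point is the shape, where the aggregation step dedupes and merges rather than just concatenating:

```python
import asyncio

async def worker(task: str, variant: int) -> set[str]:
    await asyncio.sleep(0)  # stands in for real network / tool latency
    # Each worker returns overlapping findings, simulating parallel research.
    return {f"finding-{variant}", "shared-finding"}

async def fan_out(task: str, n: int = 3) -> set[str]:
    # Fan out: N workers run concurrently on variations of the same task.
    results = await asyncio.gather(*(worker(task, i) for i in range(n)))
    # Fan in: the aggregator must add value (dedupe, rank, reconcile);
    # otherwise you are paying N times for one answer.
    return set().union(*results)
```
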

4. Choreography over orchestration (when failures are tolerable)

Choreographed systems let agents declare capabilities and a router pulls them in as needed. CrewAI defaults to this shape. Choreography survives the failure of one agent better than orchestration does, because there is no single coordinator to crash. The trade-off is debuggability: when something goes wrong in a choreographed run, the trace looks like a phone tree, and you spend hours reconstructing intent.

Pick choreography for resilience-critical workloads where a 10% degraded answer is better than a missing one. Pick orchestration for everything where you need to explain to a customer why an answer is wrong.
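A choreography sketch under loose assumptions: agents self-register capabilities in a shared registry, and the router degrades rather than crashes when no agent matches. The registry and decorator here are illustrative, not any framework's real API.

```python
AGENT_REGISTRY: dict = {}

def register(capability: str):
    # Agents declare what they can do; no central coordinator owns the plan.
    def wrap(fn):
        AGENT_REGISTRY[capability] = fn
        return fn
    return wrap

@register("summarize")
def summarizer(task: str) -> str:
    return f"summary of {task}"

def route(capability: str, task: str) -> str:
    agent = AGENT_REGISTRY.get(capability)
    if agent is None:
        # Resilience over correctness: degrade instead of crashing the run.
        return f"DEGRADED: no agent registered for '{capability}'"
    return agent(task)
```

The flip side is visible in the trace: `route` tells you an agent answered, but not why the plan called that capability, which is the phone-tree debugging problem.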

5. Model tiering

Use a cheap model (Haiku, Sonnet) for triage, classification, and routing. Reserve the expensive model (Opus, GPT-4-class) for actual reasoning and final synthesis. Production teams report 40 to 60% cost reductions from this single decision, which is usually enough to make the 15-times multi-agent token multiplier defensible. Tiering is not technically a multi-agent pattern in the orchestration sense, but no production multi-agent system survives without it.
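A tiering sketch: a cheap triage step routes each query, and the expensive model only runs when the triage flags real reasoning. The keyword heuristic stands in for what would be a cheap-model classification call; model names are illustrative.

```python
CHEAP, EXPENSIVE = "claude-haiku", "claude-opus"

def triage(query: str) -> str:
    # Stand-in for a cheap-model classification call. In production this
    # would itself be an LLM call with a strict label-only output format.
    heavy = any(k in query.lower() for k in ("why", "compare", "design"))
    return EXPENSIVE if heavy else CHEAP

def answer(query: str) -> tuple:
    model = triage(query)
    # Only the final synthesis pays the expensive-model rate.
    return model, f"[{model}] {query}"
```
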

What this looks like in a real engagement

A recent setup we built replaced a single-prompt research agent with an orchestrator-worker shape for competitive intelligence on private companies. The single-prompt version made one Claude call, returned a paragraph, and missed roughly half of the relevant signal. The multi-agent version uses one orchestrator (Sonnet 4) that decomposes a company query into four parallel research tracks (funding, product, leadership, hiring). Each track is a worker (Sonnet 4) with its own search tool and a strict output schema. The orchestrator stitches the four artifacts into a brief.

The before and after: research depth roughly tripled, latency went from 6 seconds to 22 seconds, token cost rose 11 times. For a sales team that used to spend 40 minutes per company in manual research, the trade was easy. For a consumer chatbot, it would have been ruinous. The pattern only works when the maths work.
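The break-even arithmetic behind that judgment can be made explicit. Dollar figures below are illustrative assumptions, not numbers from the engagement:

```python
def break_even(single_call_cost: float, multiplier: float,
               minutes_saved: float, hourly_rate: float) -> bool:
    # Multi-agent run is worth it when its token bill undercuts the
    # human time it replaces.
    agent_cost = single_call_cost * multiplier
    human_cost = (minutes_saved / 60) * hourly_rate
    return agent_cost < human_cost

# e.g. a $0.05 single call becoming ~$0.55 at 11x, against 40 minutes of
# analyst time at $90/hr (~$60): the trade is easy. For a consumer chatbot
# saving seconds, the same multiplier is ruinous.
```
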

Pick the framework after you pick the pattern

Most teams reverse this. They pick LangGraph or CrewAI first, then bend the problem to match. The honest order is: name the pattern (orchestrator-worker, pipeline, fan-out, choreography, tiering), then choose the framework that expresses the pattern with the least ceremony.

For graph-shaped state machines with explicit cycles, branching, and production observability requirements, LangGraph remains the most battle-tested choice in 2026. For role-based crews with linear flow and a one-day prototype budget, CrewAI's Flows mode is now production-acceptable. AutoGen is effectively in maintenance mode while Microsoft consolidates around its broader Agent Framework, so reach for it only if conversational debate is genuinely your primitive.

The pattern that survives production is the one that matches the problem. The framework is downstream of that decision, never upstream.
