
12 mistakes we see teams make building their first multi-agent ops system

Multi-agent LLM systems fail in production at rates between 41 and 86.7 percent. Most of the failures trace back to twelve specific decisions teams make in the first month. Here is what they look like and how to undo them.

May 23, 2026 · 7 min read

A multi-agent ops system is a workflow where two or more LLM agents share state and split work toward a single business outcome, usually under an orchestrator that holds the plan and assigns tasks to specialised subagents. The pattern looks clean on a whiteboard. In production, it breaks in predictable ways. Recent failure analyses put the production failure rate of multi-agent LLM systems between 41 and 86.7 percent, and roughly 79 percent of those failures trace back to bad specifications and broken coordination, not model quality (Augment Code, MAST taxonomy).

TL;DR. Most teams ship the orchestrator before they ship the eval harness, give subagents shared writable state, skip the verifier, and discover the token bill on a Monday morning. The fix is boring: define ownership per resource, run a verifier that is not the orchestrator, cap retries and tokens, and start with two agents before you scale to ten.

1. Treating the orchestrator like a chatbot

The orchestrator is not a chat assistant with extra tools. It is a planner that has to record its plan, track what each subagent owns, and survive token exhaustion mid-run. Teams that skip the planning step write an orchestrator prompt that reads like a friendly system message, then watch agents diverge on the first long task. Anthropic's research system has the lead researcher agent write its plan to memory before dispatching subagents, precisely so the run survives a context reset (Anthropic Engineering).
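
A minimal sketch of the write-the-plan-first idea, assuming a JSON file stands in for the memory store. `Plan`, `save_plan`, `load_plan`, and `remaining_tasks` are hypothetical names for illustration, not part of any Anthropic API.

```python
import json
import tempfile
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class Plan:
    goal: str
    # Maps each task name to the subagent that owns it.
    assignments: dict[str, str] = field(default_factory=dict)
    # Tasks already finished, so a restarted run can skip them.
    done: list[str] = field(default_factory=list)


def save_plan(plan: Plan, path: Path) -> None:
    """Persist the plan before any subagent is dispatched."""
    path.write_text(json.dumps(asdict(plan)))


def load_plan(path: Path) -> Plan:
    """Rebuild the plan after a context reset or a crash."""
    return Plan(**json.loads(path.read_text()))


def remaining_tasks(plan: Plan) -> list[str]:
    """Tasks that still need a subagent run, in assignment order."""
    return [task for task in plan.assignments if task not in plan.done]
```

The point of the design is that the plan lives outside the model's context window, so the orchestrator can reload it and carry on from `remaining_tasks` instead of re-planning from scratch.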

2. Letting subagents share writable state

If two agents can write to the same database row, the same file, or the same Linear ticket, you have a concurrency bug waiting for a deadline. The rule that survives contact with production is straightforward: every resource has one owner. Agents that need to coordinate do it through a message bus or a task queue, not through a shared mutable record. Specification and design issues account for 41.8 percent of multi-agent failures in the MAST taxonomy, and ownership ambiguity is the largest sub-category.
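
One way to make the one-owner rule enforceable rather than aspirational is a small registry that every write path consults. This is a sketch under stated assumptions: `OwnershipRegistry` and its methods are hypothetical, and a real system would back the registry with a durable store rather than an in-memory dict.

```python
class OwnershipError(Exception):
    """Raised when an agent touches a resource it does not own."""


class OwnershipRegistry:
    def __init__(self) -> None:
        # resource identifier -> name of the single owning agent
        self._owners: dict[str, str] = {}

    def claim(self, resource: str, agent: str) -> None:
        """Register `agent` as owner; fail if someone else already owns it."""
        current = self._owners.setdefault(resource, agent)
        if current != agent:
            raise OwnershipError(f"{resource} is owned by {current}, not {agent}")

    def assert_can_write(self, resource: str, agent: str) -> None:
        """Call this before every write to a shared system."""
        owner = self._owners.get(resource)
        if owner != agent:
            raise OwnershipError(f"{agent} may not write to {resource} (owner: {owner})")
```

An agent that fails `assert_can_write` should send a message to the owner through the queue instead of writing directly.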

3. Skipping the verifier, or making the orchestrator the verifier

Roughly one in four multi-agent failures happens because the system did not check its own work properly, and incorrect verification is the single most common failure mode at 9.1 percent. The verifier needs to be a separate agent with a separate prompt and, ideally, a separate model class. An orchestrator that grades its own plan will accept its own bad output. We have seen teams add a verifier only after the third post-mortem.

4. Running synchronous when the work parallelises

The default Claude Code subagent topology is synchronous. The lead waits for each batch of subagents to finish before dispatching the next. Anthropic flagged this as a real bottleneck and noted that async would unlock more parallelism at the cost of harder error handling. If your subagents read from independent sources and never need to compare notes mid-task, the async cost is worth paying. If they do, stay synchronous and reduce the fan-out instead.
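
The harder error handling is easy to underestimate. The sketch below shows one way to fan out independent subagents with bounded concurrency while keeping failures separate from results. It assumes `run_subagent` is a stand-in for a real model call; the function names and the `"broken"` source are illustrative only.

```python
import asyncio


async def run_subagent(source: str) -> str:
    """Placeholder for a real subagent call against one independent source."""
    await asyncio.sleep(0.01)
    if source == "broken":
        raise RuntimeError(f"could not read {source}")
    return f"summary of {source}"


async def fan_out(sources: list[str], max_concurrency: int = 3):
    """Run subagents in parallel; return (successes, failures) keyed by source."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(source: str) -> str:
        # The semaphore caps how many subagents run at once.
        async with semaphore:
            return await run_subagent(source)

    # return_exceptions=True keeps one failure from cancelling the batch.
    results = await asyncio.gather(
        *(guarded(s) for s in sources), return_exceptions=True
    )
    ok = {s: r for s, r in zip(sources, results) if not isinstance(r, Exception)}
    failed = {s: repr(r) for s, r in zip(sources, results) if isinstance(r, Exception)}
    return ok, failed
```

Lowering `max_concurrency` is the same lever as reducing the fan-out: it trades parallelism for a smaller blast radius when something goes wrong.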

5. No per-task token budget

Token accumulation in a naive agent loop is quadratic, not linear. A 20-step loop where each step generates 1,000 tokens produces around 210,000 cumulative input tokens, because the full history is re-serialised on every call (Augment Code). The fix is a pre-execution budget check and automatic context compaction before the window fills. Without it, a single overzealous task on a Sunday night can spend more than a senior engineer earns in a week.
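
The 210,000 figure follows from a simple sum. Assuming call k re-reads k chunks of roughly equal size (the initial prompt plus every earlier output), the total input is tokens_per_step × (1 + 2 + ... + n) = tokens_per_step × n(n + 1) / 2. The sketch below reproduces that arithmetic and adds a hypothetical pre-execution check; the function names and the equal-chunk assumption are illustrative.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when the full history is re-sent on every call."""
    # Call k carries k chunks of history, so the total is a triangular sum.
    return sum(k * tokens_per_step for k in range(1, steps + 1))


def fits_budget(steps: int, tokens_per_step: int, budget: int) -> bool:
    """Pre-execution check: refuse to start a loop projected to exceed budget."""
    return cumulative_input_tokens(steps, tokens_per_step) <= budget


print(cumulative_input_tokens(20, 1_000))  # 210000
print(cumulative_input_tokens(40, 1_000))  # 820000
```

Doubling the step count from 20 to 40 roughly quadruples the input tokens, which is what quadratic growth means in practice.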

6. No retry ceiling

The most common production incident in agent systems is not a wrong answer. It is an agent that retries, and retries, and retries. Each retry is a full provider call, and when every retry re-sends the accumulated history, the context can double per step. A 4,000-token initial context can reach 128,000 tokens by step 5, at which point the per-step cost has gone up 32x. By step 30 the loop has spent more than a competent engineer's monthly salary (TrueFoundry). Cap retries at three, log the failure, and exit. Better to escalate to a human than to bleed.
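
A retry ceiling can be a dozen lines. The sketch below reads "cap retries at three" as one initial call plus at most three retries, which is an interpretation rather than something the sources specify. `call_with_retry_ceiling` and `RetriesExhausted` are hypothetical names.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("agent.retry")


class RetriesExhausted(Exception):
    """The ceiling was hit; the caller should escalate to a human."""


def call_with_retry_ceiling(call: Callable[[], T], max_retries: int = 3) -> T:
    attempts = 1 + max_retries  # the first call is not a retry
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            last_error = exc
            # Log every failure so the incident is visible before the exit.
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
    # Exit loudly instead of looping; the original error is kept as the cause.
    raise RetriesExhausted(f"gave up after {attempts} attempts") from last_error
```

Catching `RetriesExhausted` at the orchestrator level is the natural place to hand the task to a human queue.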

7. Forcing multi-agent on tightly coupled work

Anthropic was direct on this point. Multi-agent systems excel at problems that split into parallel strands of research and are less effective for tightly interdependent tasks such as coding. If your task graph has every node reading the previous node's output, you do not have a multi-agent problem, you have a long single-agent run with checkpoints. The cost gap is real: a multi-agent setup consumes around fifteen times more tokens than a single chat for the same outcome.

8. Letting agents discover schemas at runtime

Data contamination is the most common external failure mode in multi-agent systems. The root cause is usually the same: an agent calls an API or queries a table without a strict schema, gets back a shape it does not understand, and either fabricates fields or writes garbage downstream. Hand each agent its schema as part of the prompt, validate inbound payloads with Zod or Pydantic before reasoning over them, and refuse to act on inputs that do not parse.
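
The parse-or-refuse step can be sketched with the standard library alone. In practice a Pydantic model or a Zod schema would replace the hand-written checks; the stdlib version is used here only to keep the example dependency-free. The `Invoice` shape and its field names are made up for illustration.

```python
from dataclasses import dataclass


class SchemaError(ValueError):
    """The payload does not match the schema; the agent must not act on it."""


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    amount_cents: int
    currency: str


# Field name -> exact type the agent was told to expect.
EXPECTED = {"invoice_id": str, "amount_cents": int, "currency": str}


def parse_invoice(payload: object) -> Invoice:
    if not isinstance(payload, dict):
        raise SchemaError("payload is not an object")
    missing = sorted(EXPECTED.keys() - payload.keys())
    extra = sorted(payload.keys() - EXPECTED.keys())
    if missing or extra:
        raise SchemaError(f"missing={missing} unexpected={extra}")
    for name, expected_type in EXPECTED.items():
        # Exact type match: no silent coercion of "12" to 12 or True to 1.
        if type(payload[name]) is not expected_type:
            raise SchemaError(f"{name} must be {expected_type.__name__}")
    return Invoice(**payload)
```

Rejecting unexpected fields is a deliberate choice: an agent that quietly ignores extra data today will quietly misread a changed API tomorrow.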

9. Logging final outputs only

If your traces capture only the final answer, you cannot debug a multi-agent failure. You need the orchestrator's plan, each subagent's prompt and response, every tool call, the token count per call, and the verifier's verdict. Minor changes in agentic systems cascade into large behavioural changes, and post-incident debugging without step-level traces is guesswork. OpenTelemetry with span attributes for agent name and task ID is the cheapest setup that works.
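
To make the list of fields concrete, here is a dependency-free sketch of a step-level trace record. With OpenTelemetry the same fields would typically travel as span attributes; this version only shows what to capture. `StepTrace`, `TraceLog`, and the `kind` values are hypothetical.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class StepTrace:
    task_id: str
    agent: str
    kind: str  # e.g. "plan", "subagent_call", "tool_call", "verdict"
    prompt: str
    response: str
    tokens: int
    timestamp: float = field(default_factory=time.time)


class TraceLog:
    def __init__(self) -> None:
        self.steps: list[StepTrace] = []

    def record(self, step: StepTrace) -> None:
        self.steps.append(step)

    def for_task(self, task_id: str) -> list[StepTrace]:
        """Every step of one task, in the order it happened."""
        return [s for s in self.steps if s.task_id == task_id]

    def to_jsonl(self) -> str:
        """One JSON object per line, ready for a log pipeline."""
        return "\n".join(json.dumps(asdict(s)) for s in self.steps)
```

Replaying `for_task` in order is usually enough to see which step first went off plan.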

10. Skipping the eval harness before production

Roughly 88 percent of AI agent projects never reach production, and the modal reason is not model capability. It is that the team had no offline evaluation set, no regression suite, and no way to know whether a prompt change made things better or worse. Build a fixture set of twenty representative tasks before you ship. Run the whole multi-agent loop against them on every prompt or model change. Without this, every release is a coin flip.
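
A regression harness does not need a framework to start. The sketch below assumes the whole multi-agent loop can be called as a single function from task text to output text, and that each fixture carries its own pass/fail check. `Fixture`, `run_suite`, and `regressions` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    name: str
    task: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_suite(system: Callable[[str], str], fixtures: list[Fixture]) -> dict[str, bool]:
    """Run every fixture through the system; a crash counts as a failure."""
    results: dict[str, bool] = {}
    for fixture in fixtures:
        try:
            results[fixture.name] = bool(fixture.check(system(fixture.task)))
        except Exception:
            results[fixture.name] = False
    return results


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Fixtures that passed before the change and fail after it."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )
```

Running `regressions` against the last released results on every prompt or model change turns the coin flip into a yes-or-no gate.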

11. Picking the wrong topology before you have telemetry

The 2026 framework landscape now distinguishes three dominant orchestration models: graph-based (LangGraph), role-based (CrewAI), and swarm-style handoffs (OpenAI Agents SDK, Anthropic Agent Teams). Each one optimises a different failure mode. Graphs make state explicit but rigid. Roles make collaboration legible but expensive. Swarms minimise overhead but make state hard to recover. Pick the topology that matches your actual failure pattern, which you only know after a week of telemetry.

12. Hiding cost inside a wrapper

The last mistake is organisational. The team building the agent sees model latency and quality. Finance sees a $40,000 invoice on the first of the month. Without per-agent cost attribution, no one can answer the next obvious question, which is whether the agent is paying back its own bill. Tag every provider call with an agent ID, a task ID, and a tenant ID, and push the spend to the same dashboard as your other unit economics. An agent that costs more than it saves is a feature waiting to be turned off.
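
Per-agent attribution is mostly a matter of tagging and summing. The sketch below assumes each provider call is logged with the three IDs and its token counts. The prices are placeholders, not any provider's actual rates, and `CallRecord`, `cost_usd`, and `spend_by` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_M = 3.0
OUTPUT_PRICE_PER_M = 15.0


@dataclass(frozen=True)
class CallRecord:
    agent_id: str
    task_id: str
    tenant_id: str
    input_tokens: int
    output_tokens: int


def cost_usd(record: CallRecord) -> float:
    return (
        record.input_tokens * INPUT_PRICE_PER_M
        + record.output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def spend_by(records: list[CallRecord], key: str) -> dict[str, float]:
    """Total spend grouped by 'agent_id', 'task_id', or 'tenant_id'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += cost_usd(record)
    return dict(totals)
```

Grouping the same records by `tenant_id` gives finance the per-customer view without a second logging path.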

The pattern under the twelve

Read the list again and one shape repeats. Multi-agent systems fail when the team treats them like bigger chatbots instead of distributed systems with a token meter. The competent version of this work looks closer to a small queue-based microservice architecture than to a Claude conversation, with budgets, ownership boundaries, verifiers, retries, eval suites, and traces. The model is one component. The other twelve are the system.

Photo by Shubham Dhage on Unsplash

Frequently asked questions

How many agents should a first multi-agent system have?
Two: an orchestrator that holds the plan and a worker that executes one task at a time. Add a verifier as soon as the system makes a non-trivial decision. Adding more agents before you have telemetry and an eval harness multiplies failure surface without lifting outcome quality. Anthropic's published research recommends scaling agent count only after the smaller system is observable and budgeted.
What is the difference between Claude Code subagents and Agent Teams?
Subagents run inside a single Claude Code session and report back to the main agent. They share the same context budget and cannot communicate with each other directly. Agent Teams, shipped in early 2026, let multiple Claude Code sessions act as one team with a shared task list. Each teammate runs in its own context window and can address other teammates by name. Teams suit longer-running parallel work. Subagents suit a single focused task with light fan-out.
How do I budget tokens for a multi-agent system?
Set three layers: a per-call token cap at the SDK level so no single request can blow up, a per-task cap held by the orchestrator that covers all subagent calls for a single user task, and a per-tenant daily cap enforced at the gateway. Log spend with agent ID, task ID, and tenant ID. If any layer trips, escalate to a human queue rather than retrying silently. Without all three, a single runaway loop can cost more than a senior engineer's monthly salary by the time anyone notices.
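
As a rough illustration, the three caps can be expressed as one guard that is consulted before each provider call. This is a simplification: in the layout described above the per-call cap lives in the SDK and the tenant cap at the gateway, and the daily reset is not modelled here. `BudgetGuard` and `BudgetExceeded` are hypothetical names.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    def __init__(self, layer: str) -> None:
        super().__init__(f"{layer} budget exceeded")
        self.layer = layer  # "call", "task", or "tenant"


@dataclass
class BudgetGuard:
    per_call_cap: int
    per_task_cap: int
    per_tenant_daily_cap: int
    task_spend: dict[str, int] = field(default_factory=dict)
    tenant_spend: dict[str, int] = field(default_factory=dict)

    def charge(self, task_id: str, tenant_id: str, tokens: int) -> None:
        """Check all three layers, then record the spend; raise on the first trip."""
        if tokens > self.per_call_cap:
            raise BudgetExceeded("call")
        if self.task_spend.get(task_id, 0) + tokens > self.per_task_cap:
            raise BudgetExceeded("task")
        if self.tenant_spend.get(tenant_id, 0) + tokens > self.per_tenant_daily_cap:
            raise BudgetExceeded("tenant")
        # Nothing is recorded unless every layer passed.
        self.task_spend[task_id] = self.task_spend.get(task_id, 0) + tokens
        self.tenant_spend[tenant_id] = self.tenant_spend.get(tenant_id, 0) + tokens
```

The `layer` attribute tells the human queue which cap tripped, which is usually the first thing the on-call person wants to know.
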
Should I use LangGraph, CrewAI, or Anthropic Agent Teams?
Pick based on the failure you most want to prevent. LangGraph forces explicit state transitions and shines when you need step-level resumability and auditing. CrewAI structures collaboration around named roles and is the fastest path to a working prototype. Anthropic Agent Teams keeps you inside Claude Code with native handoff and short-circuits the integration cost if your work already lives in that environment. None is universally better. Start with the one that matches your dominant constraint, run two weeks of telemetry, and switch if the data justifies it.
