Multi-Agent Systems - When Splitting the Work Actually Helps

The instinct when a task is complex is to throw more agents at it. Spin up a researcher, a writer, a critic, a planner - a whole crew. It feels like good engineering. It is usually not.

Most multi-agent systems in production are slower, more expensive, and less reliable than a single well-designed agent. The coordination overhead is real: more LLM calls, more context to manage, more failure points, and latency that multiplies rather than shrinks. The only reason to use multiple agents is if the task structure genuinely requires it. Most tasks do not.

Here is how to tell the difference, and when multi-agent design pays off.

The Single Agent Baseline

Before splitting work, understand what one agent can actually handle.

A single agent with a good tool set and a clear system prompt handles remarkably complex tasks. Modern frontier models have large context windows, strong instruction-following, and can chain dozens of tool calls without losing the thread. The problems people assume require multiple agents - long tasks, diverse subtasks, iterative refinement - often resolve with better prompt design and tool structure.

A single agent breaks down in three specific situations:

1. Context window saturation. Long-running tasks accumulate tool call results, intermediate reasoning, and error history. After enough steps, the agent loses track of its original goal. This is not a prompt engineering problem - it is a capacity problem. You are running out of space.

2. Genuinely parallel workloads. Some tasks have independent subtasks that do not need to wait for each other. Analyzing 20 documents sequentially in one agent is slower than processing them in parallel across multiple agents.

3. Conflicting objectives within one prompt. An agent asked to both generate content and critique it will anchor on whatever it generated first. The “critic” role gets corrupted by the prior “generator” role.

These are the real cases for multi-agent design. Everything else is usually a sign that the single agent needs better tooling or a tighter prompt.
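One rough way to notice context saturation before it bites is to track how much of the window the running transcript occupies. The sketch below is illustrative only: the 4-characters-per-token ratio, the 200K-token window, and the 0.7 threshold are all assumptions, and a real tokenizer count would be more accurate.

```python
# Rough context-budget check. The chars-per-token ratio, window size, and
# threshold are illustrative assumptions, not measured values.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW_TOKENS = 200_000


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return len(text) // CHARS_PER_TOKEN


def context_usage(messages: list[str], window: int = CONTEXT_WINDOW_TOKENS) -> float:
    """Return the estimated fraction of the context window already used."""
    used = sum(estimate_tokens(m) for m in messages)
    return used / window


def should_split(messages: list[str], threshold: float = 0.7) -> bool:
    """Flag a run as a candidate for summarizing or handing off to a fresh agent."""
    return context_usage(messages) >= threshold
```

If a single-agent run regularly crosses the threshold, that is some evidence the task belongs in the first category above rather than needing a tighter prompt.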

Three Patterns Worth Knowing

Pattern 1: Orchestrator-Worker

One orchestrator LLM receives the task, decomposes it, fans work out to specialized workers, and synthesizes the results.

User Request
     |
     v
[Orchestrator] - plans and decomposes
     |
     |-- Subtask A --> [Worker A] --> Result A
     |-- Subtask B --> [Worker B] --> Result B
     |-- Subtask C --> [Worker C] --> Result C
     |
     v
[Orchestrator] - synthesizes results
     |
     v
Final Output

The orchestrator does not do the actual work. Its job is decomposition and synthesis. Workers are focused and stateless - they receive a specific task and return a structured result.

import anthropic
import json
from concurrent.futures import ThreadPoolExecutor

client = anthropic.Anthropic()

ORCHESTRATOR_SYSTEM = """You are a task orchestrator. Given a user request,
decompose it into independent subtasks that can be executed in parallel.
Return a JSON array of subtasks, each with: id, description, context."""

WORKER_SYSTEM = """You are a focused worker agent. Complete the given task
and return a structured result. Be precise and concise."""

def orchestrate(user_request: str) -> str:
    # Step 1: Decompose
    plan_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=ORCHESTRATOR_SYSTEM,
        messages=[{"role": "user", "content": user_request}]
    )
    subtasks = json.loads(plan_response.content[0].text)

    # Step 2: Execute workers in parallel
    def run_worker(subtask: dict) -> dict:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",  # cheaper model for workers
            max_tokens=512,
            system=WORKER_SYSTEM,
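            # Pass the context field the orchestrator was asked to produce, not just the description
            messages=[{"role": "user", "content": f"{subtask['description']}\n\nContext: {subtask.get('context', '')}"}]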
        )
        return {"id": subtask["id"], "result": result.content[0].text}

    with ThreadPoolExecutor(max_workers=5) as executor:
        worker_results = list(executor.map(run_worker, subtasks))

    # Step 3: Synthesize
    synthesis_prompt = f"""Original request: {user_request}

Worker results:
{json.dumps(worker_results, indent=2)}

Synthesize these results into a coherent final answer."""

    final = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    return final.content[0].text

This pattern works when subtasks are genuinely independent. If Worker B needs the output of Worker A, you have a sequential pipeline, not a parallel fan-out. Forcing parallel execution in that case just introduces bugs.

Where it breaks: The orchestrator makes assumptions about how to decompose a task. If the decomposition is wrong, every worker produces irrelevant output. There is no feedback loop between workers and the orchestrator during execution - only at synthesis time. A bad plan produces bad results that look coherent.
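A cheap partial defense is to check the plan's structure before any worker runs. The sketch below assumes the orchestrator returns the JSON array requested in ORCHESTRATOR_SYSTEM (objects with id, description, context); the function name parse_plan and the max_subtasks cap are hypothetical. It catches malformed plans only. It cannot tell whether a well-formed decomposition is the right one.

```python
import json

# Keys the orchestrator prompt asks for in each subtask.
REQUIRED_KEYS = {"id", "description", "context"}


def parse_plan(raw_text: str, max_subtasks: int = 10) -> list[dict]:
    """Parse and sanity-check an orchestrator plan before fanning out.

    Raises ValueError when the text is not a non-empty JSON array of
    subtask objects with unique ids and all required keys.
    """
    try:
        plan = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc

    if not isinstance(plan, list) or not plan:
        raise ValueError("plan must be a non-empty JSON array")
    if len(plan) > max_subtasks:
        raise ValueError(f"plan has {len(plan)} subtasks, cap is {max_subtasks}")

    seen_ids = set()
    for subtask in plan:
        if not isinstance(subtask, dict):
            raise ValueError("each subtask must be a JSON object")
        missing = REQUIRED_KEYS - subtask.keys()
        if missing:
            raise ValueError(f"subtask missing keys: {sorted(missing)}")
        subtask_id = str(subtask["id"])
        if subtask_id in seen_ids:
            raise ValueError(f"duplicate subtask id: {subtask_id}")
        seen_ids.add(subtask_id)
    return plan
```

On a ValueError, one option is to re-prompt the orchestrator once with the error message, then fall back to a single-agent pass.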

Pattern 2: Sequential Pipeline

A chain of specialized agents where each stage produces output that feeds the next.

[Researcher] --> [Analyst] --> [Writer] --> [Editor]
    raw data      insights     draft       final doc

This is simpler than orchestrator-worker and easier to debug. Each agent is focused, the data flow is explicit, and you can inspect every intermediate output.

def research_pipeline(topic: str) -> str:
    # Stage 1: Research
    research = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=2048,
        system="You are a researcher. Find and summarize key facts about the topic.",
        messages=[{"role": "user", "content": f"Research: {topic}"}]
    ).content[0].text

    # Stage 2: Analysis
    analysis = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are an analyst. Identify the most important insights from research.",
        messages=[{"role": "user", "content": f"Research findings:\n{research}\n\nWhat are the key insights?"}]
    ).content[0].text

    # Stage 3: Write
    draft = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are a technical writer. Write a clear, concise report.",
        messages=[{"role": "user", "content": f"Topic: {topic}\n\nInsights: {analysis}\n\nWrite the report."}]
    ).content[0].text

    # The Editor stage from the diagram is omitted here for brevity.
    return draft

The advantage over a single agent trying to do all three is clarity of separation: the researcher does not optimize for writing, the analyst does not get distracted by raw data formatting, the writer does not second-guess the research. Each agent has one job.

The disadvantage is latency. Three sequential calls mean roughly 3x the round-trip time of one. And errors compound: if the researcher returns garbage, the analyst produces slightly cleaner garbage, and the writer produces a well-formatted garbage report.

Use this when: The task naturally decomposes into phases where later phases depend on earlier output. Content pipelines, code review workflows, data transformation chains.

Pattern 3: Debate

Two agents independently produce answers or evaluations, then a judge reconciles the disagreement.

User Question
     |
     +-> [Agent A] --> Opinion A
     |                        \
     |                         --> [Judge] --> Final Answer
     +-> [Agent B] --> Opinion B

The debate pattern is specifically for situations where generating and evaluating in the same agent creates anchoring bias. A single agent that writes code and then reviews it will be gentle with its own work. Two independent agents evaluating the same code tend to find more issues.

def debate_evaluate(code: str, criteria: str) -> dict:
    def evaluate(perspective: str) -> str:
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=f"You are a code reviewer focusing on {perspective}. Be critical and specific.",
            messages=[{"role": "user", "content": f"Review this code:\n\n```\n{code}\n```\n\nCriteria: {criteria}"}]
        ).content[0].text

    # Run two independent evaluations in parallel
    with ThreadPoolExecutor(max_workers=2) as executor:
        future_security = executor.submit(evaluate, "security and correctness")
        future_perf = executor.submit(evaluate, "performance and maintainability")
        review_a = future_security.result()
        review_b = future_perf.result()

    # Judge reconciles
    verdict = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a senior engineer. Given two code reviews, synthesize a final verdict.",
        messages=[{"role": "user", "content": f"Review A:\n{review_a}\n\nReview B:\n{review_b}\n\nFinal assessment:"}]
    ).content[0].text

    return {"review_a": review_a, "review_b": review_b, "verdict": verdict}

The debate pattern costs 2-3x the tokens of a single evaluation. It is only worth it when the cost of a missed issue exceeds the cost of the extra API calls - security reviews, production deployment gates, high-stakes decisions.

Do not use debate for tasks where there is a clear right answer and a single agent can find it. You are not gaining quality, just adding cost and latency.

The Real Coordination Cost

Every multi-agent system pays a coordination tax that single-agent benchmarks never show.

Cost Component	Orchestrator-Worker	Sequential Pipeline	Debate
Extra LLM calls	Orchestrator + N workers + synthesizer	N agents in chain	2 agents + judge
Token overhead	Orchestrator prompt repeated per call	Context passed between stages	Full context replicated per agent
Latency (sequential)	Decompose + synthesize	Multiplied by stages	Judge waits for both
Latency (parallel)	Max(worker times) + overhead	Not parallelizable	Max(agent times) + judge
Failure surface	Any worker failure can corrupt synthesis	Single stage failure halts chain	Judge must handle disagreement

Here is what this looks like in practice. A task that takes 1 LLM call with one agent might take 4-6 calls with orchestrator-worker (decompose, 3 workers, synthesize). At $3/million input tokens on Sonnet, a 50K token task costs $0.15 in input as a single call. If every call in the multi-agent version carries the full 50K context, the same task costs $0.60-$0.90 in input alone, before output tokens. At scale, this is not a rounding error.
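The arithmetic above can be reproduced in a few lines. This is a sketch under simplifying assumptions: every call re-sends the full input, output tokens are ignored, and all calls are priced at the same rate (workers on a cheaper model would lower the total). The function names are illustrative.

```python
def input_cost(input_tokens: int, price_per_million: float) -> float:
    """Input-token cost in dollars for a single call."""
    return input_tokens * price_per_million / 1_000_000


def multi_agent_input_cost(input_tokens: int, n_calls: int, price_per_million: float) -> float:
    """Input cost when each of n_calls carries the full input (assumed worst case)."""
    return n_calls * input_cost(input_tokens, price_per_million)


if __name__ == "__main__":
    single = input_cost(50_000, 3.0)
    low = multi_agent_input_cost(50_000, 4, 3.0)
    high = multi_agent_input_cost(50_000, 6, 3.0)
    print(f"single call: ${single:.2f}, multi-agent: ${low:.2f}-${high:.2f}")
```

Passing each worker only its slice of the context is the main lever for pulling the multi-agent figure back toward the single-call cost.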

The latency picture is better for parallel execution but only if you can actually parallelize. If workers have dependencies, you are paying the coordination overhead without getting the speed benefit.

Decision Framework

Before reaching for multiple agents, answer these questions:

Does the task have genuinely independent subtasks? If yes, orchestrator-worker can reduce wall-clock time. If no, you are adding overhead for no gain.

Does the task span more than one context window? If you are processing 10 documents that together exceed 200K tokens, parallelizing across workers is necessary, not optional. If the task fits in one context, keep it there.

Is anchoring bias a real risk? If you need an objective evaluation of output that an agent just produced, debate or a separate critic agent is justified. If the task is purely generative, a second agent adds no quality.

Can you afford the latency? Sequential pipelines add latency at every stage. If you need a response in under 5 seconds, three sequential 2-second agent calls will miss that target.
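The latency question is simple enough to check on paper before building anything. The sketch below assumes per-call latencies are known or estimated in advance. A pipeline pays the sum of its stages, while a fan-out pays for decomposition, the slowest worker, and synthesis. The function names are illustrative.

```python
def pipeline_latency(stage_seconds: list[float]) -> float:
    """Sequential stages run back to back, so their latencies add up."""
    return sum(stage_seconds)


def fanout_latency(decompose_s: float, worker_seconds: list[float], synthesize_s: float) -> float:
    """Parallel workers cost as much as the slowest one, plus plan and synthesis."""
    return decompose_s + max(worker_seconds, default=0.0) + synthesize_s


def fits_budget(latency_s: float, budget_s: float) -> bool:
    """True when the estimated latency is within the response-time budget."""
    return latency_s <= budget_s
```

With three 2-second stages, pipeline_latency gives 6 seconds, which fails a 5-second budget. A fan-out over the same three calls only helps if the decompose and synthesize steps together leave room for the slowest worker.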

Situation	Recommendation
Task fits in one context window	Single agent
Tasks are independent and can parallelize	Orchestrator-Worker
Output depends on prior output	Sequential Pipeline
Need unbiased evaluation of generated output	Debate
Task is exploratory and direction unknown	Single agent with ReAct
Processing many documents at once	Orchestrator-Worker
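The table can be read as a small decision function. The sketch below is one possible encoding, not the only one: the field names are hypothetical, and the precedence between overlapping rows (for example, a task that both fits in one context and has independent subtasks) is an assumption made here to keep the function deterministic.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Answers to the framework questions for one task (hypothetical field names)."""
    fits_in_one_context: bool
    has_independent_subtasks: bool
    stages_depend_on_prior_output: bool
    needs_unbiased_evaluation: bool
    is_exploratory: bool = False


def recommend(profile: TaskProfile) -> str:
    """Map a task profile to a pattern; the rule order is an assumed precedence."""
    if profile.needs_unbiased_evaluation:
        return "debate"
    if profile.is_exploratory:
        return "single agent with ReAct"
    if profile.has_independent_subtasks:
        return "orchestrator-worker"
    if profile.fits_in_one_context:
        return "single agent"
    if profile.stages_depend_on_prior_output:
        return "sequential pipeline"
    return "single agent"
```

The default at the bottom is deliberate: when none of the structural conditions hold, the recommendation falls back to a single agent.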

A Production Orchestration Design

Here is an architecture for a code review system that handles repositories too large for a single context window:

                    Pull Request
                         |
                         v
              [File Change Splitter]
              (deterministic code, not LLM)
                         |
          +--------------+---------------+
          |              |               |
          v              v               v
    [Worker: auth/]  [Worker: api/]  [Worker: db/]
    security focus   correctness     data safety
          |              |               |
          +--------------+---------------+
                         |
                         v
                [Aggregator + Dedup]
                (removes duplicate findings)
                         |
                         v
                  [Summary Agent]
                  (ranks issues by severity)
                         |
                         v
                  PR Comment Output

Key design decisions in this architecture:

Splitting is deterministic. The file splitter is regular code, not an LLM. It groups changed files by directory. This removes one LLM call and makes the split predictable.
Workers are cheap models. Each file-group review uses a smaller model (Haiku instead of Sonnet). Parallelization across directories plus cheaper models makes this cost-competitive with one expensive single-agent pass.
Dedup before synthesis. Multiple workers will independently flag the same issue. A deterministic dedup step (hash the issue description, keep unique) is cheaper and more reliable than asking an LLM to deduplicate.
The summary agent ranks, it does not re-review. It receives structured findings and adds severity ranking. It does not re-read the code. This keeps the summary call fast and cheap.
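The two deterministic steps might look like the sketch below. The finding schema (file, line, description) and the function names are hypothetical. Including file and line in the hash, and normalizing case and whitespace first, are assumptions added here, because two workers rarely phrase the same issue identically and exact hashing only removes near-verbatim duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_top_dir(changed_files: list[str]) -> dict[str, list[str]]:
    """Group changed file paths by their top-level directory."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Files at the repository root get their own group.
        key = parts[0] if len(parts) > 1 else "(root)"
        groups[key].append(path)
    return dict(groups)


def finding_key(finding: dict) -> str:
    """Hash a finding on file, line, and a normalized description."""
    description = " ".join(finding["description"].lower().split())
    normalized = "|".join([finding["file"], str(finding["line"]), description])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_findings(findings: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct finding, preserving order."""
    seen: set[str] = set()
    unique = []
    for finding in findings:
        key = finding_key(finding)
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```

Each group returned by split_by_top_dir becomes one worker's input, and dedup_findings runs over the concatenated worker outputs before the summary call.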

This system handles pull requests with 50+ changed files across multiple packages without hitting context limits and returns results in under 30 seconds by running workers in parallel.

What Breaks in Practice

Shared mutable state. If workers write to the same database, cache, or file, you need distributed locking. Most agent frameworks do not handle this. Design workers to be stateless where possible.

Error propagation. In a sequential pipeline, a hallucination in stage 2 gets laundered by stage 3. The final output looks clean but is wrong. Add structured output validation between stages.
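A validation gate between stages could be as small as the sketch below. It assumes each stage is prompted to return a JSON object. The field names in the usage note and the StageValidationError class are hypothetical. A gate like this catches malformed or incomplete output, not a factually wrong claim that happens to be well-formed.

```python
import json


class StageValidationError(ValueError):
    """Raised when a stage's output does not match the expected shape."""


def validate_stage_output(raw: str, required_fields: dict[str, type]) -> dict:
    """Parse a stage's JSON output and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise StageValidationError(f"output is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise StageValidationError("output must be a JSON object")

    for name, expected_type in required_fields.items():
        if name not in data:
            raise StageValidationError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise StageValidationError(
                f"field {name!r} should be {expected_type.__name__}, "
                f"got {type(data[name]).__name__}"
            )
    return data
```

For example, a research stage might be required to return {"facts": list, "sources": list}, so the analyst never runs on a response that contains no sources at all.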

Synthesis failures. An orchestrator that receives five worker results might produce a synthesis that contradicts some of them. The synthesis step is itself an LLM call and subject to all the usual failure modes. Do not treat the synthesis as ground truth.

Runaway parallelism. Spinning up 50 workers for 50 documents will hit API rate limits. Add a concurrency cap (5-10 concurrent workers is usually the sweet spot) and a queue.

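import asyncio

async def run_workers_bounded(subtasks: list, max_concurrent: int = 5):
    # Cap the number of in-flight worker calls to stay under API rate limits
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(subtask):
        async with semaphore:
            # call_worker is assumed to be an async function defined elsewhere
            return await call_worker(subtask)

    # gather returns results in the same order as subtasks
    return await asyncio.gather(*[run_one(t) for t in subtasks])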
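A bounded runner still fails as a whole if one worker raises. A variant that isolates failures could look like the sketch below. It passes return_exceptions=True to asyncio.gather so that one bad worker does not discard the other results. The names run_workers_tolerant and fake_worker are hypothetical, and fake_worker stands in for a real model call.

```python
import asyncio


async def run_workers_tolerant(subtasks: list, worker, max_concurrent: int = 5):
    """Run `worker` over subtasks with a concurrency cap, isolating failures.

    Returns (results, failures), both keyed by the subtask's index.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(subtask):
        async with semaphore:
            return await worker(subtask)

    # return_exceptions=True yields raised exceptions as values instead of
    # propagating the first one and abandoning the rest.
    outcomes = await asyncio.gather(
        *(run_one(t) for t in subtasks), return_exceptions=True
    )

    results, failures = {}, {}
    for index, outcome in enumerate(outcomes):
        if isinstance(outcome, Exception):
            failures[index] = outcome
        else:
            results[index] = outcome
    return results, failures


async def fake_worker(subtask: str) -> str:
    """Stand-in for a real model call; fails on one input to show isolation."""
    await asyncio.sleep(0)
    if subtask == "bad":
        raise RuntimeError("simulated worker failure")
    return subtask.upper()


if __name__ == "__main__":
    ok, failed = asyncio.run(run_workers_tolerant(["a", "bad", "c"], fake_worker, 2))
    print(ok, failed)
```

The caller can then decide whether to retry the failed indices or synthesize from partial results, while telling the synthesis step explicitly which subtasks are missing.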

The Honest Assessment

Multi-agent systems are not a quality multiplier. They are a structural solution to structural problems: context limits, parallelizable workloads, and anchoring bias. If the task does not have one of those problems, adding agents adds overhead without payoff.

The orchestrator-worker pattern is the most practical for real tasks. It handles scale and parallelism and is easier to debug than debate because data flow is one-directional. Start here if you need multi-agent at all.

Debate is overused. It looks rigorous but usually produces one opinion dressed as two. Use it specifically when you have a bias problem in evaluation tasks.

Sequential pipelines are predictable and simple but multiply latency and error risk. Measure whether the quality improvement from specialization justifies the cost before shipping one to production.

If you are uncertain, build the single-agent version first. Put it in production. Measure where it breaks. The places where it actually breaks are the places where multi-agent design will help. The places where you expected it to break usually turn out to be fine.
