Anthropic reveals how its multi-agent research system achieves 90% better performance

Anthropic has published detailed insights into how it built Claude’s research capabilities, revealing that its multi-agent system outperforms a single-agent approach by 90.2% on the company’s internal research evaluations. The post was written by Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford from Anthropic.

The research feature allows Claude to search across the web, Google Workspace, and other integrations to handle complex tasks. Unlike traditional systems that follow predetermined steps, the multi-agent approach uses multiple AI agents working in parallel to tackle open-ended research problems.

The system employs an orchestrator-worker pattern where a lead agent coordinates the process and delegates tasks to specialized subagents. When users submit queries, the lead agent analyzes the request, develops a strategy, and spawns subagents to explore different aspects simultaneously. Each subagent operates independently with its own context window before returning findings to the lead agent.
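A minimal sketch of that orchestrator-worker loop might look like the following. This is an illustration of the pattern rather than Anthropic’s implementation: call_model() is a hypothetical stand-in for a real LLM API call, and the model names and prompt wording are assumptions.

```python
# Minimal sketch of an orchestrator-worker loop, not Anthropic's implementation.
# call_model() is a hypothetical stand-in for a real LLM API call.
from concurrent.futures import ThreadPoolExecutor


def call_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM call (e.g. a messages API request)."""
    return f"[{model}] response to: {prompt[:60]}"


def run_subagent(task: str) -> str:
    # Each subagent works from its own prompt; in a real system it would also
    # have its own context window and tool access.
    return call_model("worker-model", f"Research this sub-question and report findings:\n{task}")


def research(query: str) -> str:
    # 1. The lead agent analyzes the query and plans independent sub-tasks.
    plan = call_model("lead-model", f"Break this research query into independent sub-tasks:\n{query}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Subagents explore the sub-tasks in parallel, each independently.
    with ThreadPoolExecutor(max_workers=max(len(subtasks), 1)) as pool:
        findings = list(pool.map(run_subagent, subtasks))

    # 3. The lead agent synthesizes the findings into a final answer.
    return call_model("lead-model", f"Synthesize an answer to '{query}' from:\n" + "\n".join(findings))


if __name__ == "__main__":
    print(research("How do multi-agent research systems work?"))
```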

Anthropic’s internal evaluations demonstrate that multi-agent systems excel particularly at breadth-first queries that require pursuing multiple independent directions at once. The system uses Claude Opus 4 as the lead agent with Claude Sonnet 4 subagents. In one example, when asked to identify all the board members of the companies in the S&P 500’s Information Technology sector, the multi-agent system succeeded by decomposing the work into tasks for subagents, while the single-agent system failed, searching slowly and sequentially.

Token usage emerged as the primary performance factor, explaining 80% of variance in Anthropic’s BrowseComp evaluation. The company found that agents typically consume four times more tokens than chat interactions, while multi-agent systems use 15 times more tokens than regular chats. This increased resource consumption limits viability to high-value tasks where performance gains justify the costs.
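As a rough illustration of what those multipliers mean for cost, the back-of-the-envelope calculation below uses the reported 4x and 15x figures; the baseline token count and per-token price are illustrative assumptions, not Anthropic’s numbers.

```python
# Back-of-the-envelope cost comparison using the reported multipliers.
# The baseline token count and blended price are illustrative assumptions.
baseline_chat_tokens = 4_000        # assumed tokens in a typical chat exchange
price_per_million_tokens = 10.0     # assumed blended $/1M tokens

multipliers = {"chat": 1, "single agent": 4, "multi-agent": 15}

for mode, factor in multipliers.items():
    tokens = baseline_chat_tokens * factor
    cost = tokens / 1_000_000 * price_per_million_tokens
    print(f"{mode:>12}: ~{tokens:,} tokens, about ${cost:.2f}")
```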

The development process revealed critical lessons about prompt engineering for multi-agent coordination. Early versions made errors like spawning 50 subagents for simple queries or endlessly searching for nonexistent sources. Anthropic addressed these issues by teaching the orchestrator proper delegation techniques and scaling effort to query complexity.

The company established specific guidelines for different task types. Simple fact-finding requires one agent with three to ten tool calls, while complex research might use more than ten subagents with clearly divided responsibilities. Parallel tool calling proved transformative, cutting research time by up to 90% for complex queries.
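The sketch below illustrates both ideas under stated assumptions: the effort budgets mirror the guidelines quoted above, while the complexity labels and the fetch_url() tool are invented for the example, and parallel tool calls are simulated with asyncio rather than a real agent framework.

```python
# Sketch of effort scaling and parallel tool calls. The budgets mirror the
# guidelines above; the complexity labels and fetch_url() tool are illustrative.
import asyncio


def plan_effort(complexity: str) -> dict:
    budgets = {
        # one agent, three to ten tool calls
        "simple fact-finding": {"subagents": 1, "tool_calls_each": (3, 10)},
        # "more than ten subagents with clearly divided responsibilities";
        # the per-subagent tool-call budget is left unspecified here
        "complex research": {"subagents": 10, "tool_calls_each": None},
    }
    return budgets.get(complexity, budgets["simple fact-finding"])


async def fetch_url(url: str) -> str:
    """Placeholder tool; a real agent would call a search or fetch tool here."""
    await asyncio.sleep(0.1)  # simulated network latency
    return f"contents of {url}"


async def gather_sources(urls: list[str]) -> list[str]:
    # Issuing tool calls concurrently rather than one at a time is what cuts
    # research time for queries that touch many sources.
    return await asyncio.gather(*(fetch_url(u) for u in urls))


if __name__ == "__main__":
    print(plan_effort("simple fact-finding"))
    print(asyncio.run(gather_sources(["https://example.com/a", "https://example.com/b"])))
```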

Evaluation presented unique challenges since multi-agent systems can take different valid paths to reach identical goals. Anthropic developed flexible evaluation methods focusing on outcomes rather than prescribed steps. The company used LLM judges to evaluate factual accuracy, citation quality, completeness, source quality, and tool efficiency.
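A minimal sketch of such an LLM-judge rubric is shown below. The rubric dimensions are the ones listed above, but the prompt wording, the 0-to-1 scale, and judge_model() are assumptions standing in for whatever judge prompt and model Anthropic actually uses.

```python
# Sketch of an LLM-as-judge rubric over the dimensions listed above.
# judge_model(), the prompt wording, and the 0-1 scale are assumptions.
import json

RUBRIC = ["factual accuracy", "citation quality", "completeness",
          "source quality", "tool efficiency"]


def judge_model(prompt: str) -> str:
    """Placeholder for an LLM call that returns a JSON object of scores."""
    return json.dumps({dim: 1.0 for dim in RUBRIC})


def grade(query: str, answer: str) -> dict[str, float]:
    prompt = (
        "Grade the research answer on each dimension from 0 to 1.\n"
        f"Dimensions: {', '.join(RUBRIC)}\n"
        f"Query: {query}\nAnswer: {answer}\n"
        "Return a JSON object mapping each dimension to a score."
    )
    return json.loads(judge_model(prompt))


if __name__ == "__main__":
    print(grade("example research query", "example answer with citations"))
```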

Production deployment required addressing stateful execution and error handling across long-running processes. Anthropic implemented rainbow deployments to avoid disrupting running agents and built systems that can resume from failure points rather than restarting entirely.
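The resume-from-checkpoint idea can be sketched as follows; the JSON checkpoint file and the run_subagent() stub are illustrative choices, not Anthropic’s production design.

```python
# Sketch of resumable execution: completed sub-task results are checkpointed,
# so a restarted run skips finished work instead of starting from scratch.
# The file format and run_subagent() stub are illustrative, not Anthropic's design.
import json
from pathlib import Path


def run_subagent(task: str) -> str:
    return f"findings for: {task}"


def run_with_checkpoints(tasks: list[str], path: Path = Path("checkpoint.json")) -> dict[str, str]:
    done: dict[str, str] = json.loads(path.read_text()) if path.exists() else {}
    for task in tasks:
        if task in done:                       # finished before a crash or restart
            continue
        done[task] = run_subagent(task)
        path.write_text(json.dumps(done))      # persist progress after every step
    return done


if __name__ == "__main__":
    print(run_with_checkpoints(["survey sources", "compare findings", "draft summary"]))
```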

Current limitations include bottlenecks from synchronous execution: the lead agent must wait for its subagents to complete before proceeding. Anthropic identified asynchronous execution as a future improvement that would enable additional parallelism, but acknowledged the added complexity in coordination and error handling.
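The difference can be sketched as follows, with subagent work simulated by random sleeps; the function names and timing are illustrative only.

```python
# Sketch contrasting the current synchronous pattern (the lead agent blocks on a
# whole wave of subagents) with an asynchronous variant that can act on results
# as they arrive. Subagent work is simulated with sleeps; names are illustrative.
import asyncio
import random


async def subagent(task: str) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.5))  # simulated research time
    return f"findings for {task}"


async def synchronous_wave(tasks: list[str]) -> list[str]:
    # Bottleneck: nothing proceeds until the slowest subagent has finished.
    return await asyncio.gather(*(subagent(t) for t in tasks))


async def asynchronous_steering(tasks: list[str]) -> list[str]:
    results = []
    for future in asyncio.as_completed([subagent(t) for t in tasks]):
        results.append(await future)
        # Here the lead agent could already spawn follow-up subagents instead of
        # waiting for the wave to finish; that extra parallelism is what makes
        # coordination and error handling harder.
    return results


if __name__ == "__main__":
    tasks = ["task A", "task B", "task C"]
    print(asyncio.run(synchronous_wave(tasks)))
    print(asyncio.run(asynchronous_steering(tasks)))
```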

Users report significant benefits from the research system, including finding business opportunities, navigating healthcare options, and resolving technical issues. Common use cases include developing software systems, optimizing professional content, creating business strategies, and conducting academic research.
