Which Agent Causes Task Failures and When?
On Automated Failure Attribution of LLM Multi-Agent Systems

Shaokun Zhang1    Ming Yin2    Jieyu Zhang3    Jiale Liu1    Zhiguang Han4    Jingyang Zhang2    Beibin Li5    Chi Wang6    Huazheng Wang7    Yiran Chen2    Qingyun Wu1
1Penn State    2Duke    3UW    4NTU    5xAI    6Google DeepMind    7Oregon State
Automated Failure Attribution — A rigorous definition and evaluation framework for identifying and analyzing failures in multi-agent systems.
Who&When Benchmark — The first benchmark built from failure logs across 127 multi-agent systems, featuring fine-grained annotations that link failures to specific agents and execution steps.
Three Attribution Methods — A systematic comparison of All-at-Once, Binary Search, and Step-by-Step strategies, highlighting their trade-offs in accuracy and efficiency.
Challenging Open Problem — Even state-of-the-art models achieve limited performance, underscoring substantial opportunities for future research.
Interactive Demo
The interactive demo presents five example failure cases: web navigation, math reasoning, code generation, data analysis, and information retrieval. For each case, it shows the task, the full agent trajectory, and the attribution result: the failure-responsible agent, the decisive error step, the reason behind it, and the ground-truth answer.
Tool Demo Video
1 Install
Terminal
pip install ag2tracer
2 Set API Key
Terminal
export OPENAI_API_KEY="your-openai-key"
# Optional: other providers
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
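Optionally, you can fail fast if no provider key is configured before running attribution. Below is a minimal sketch using only the standard library; this check is a convenience and not part of ag2tracer itself:

Python
import os

# Abort early if no provider key is set (optional convenience check,
# not part of ag2tracer).
keys = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")
if not any(os.environ.get(k) for k in keys):
    raise RuntimeError("Set at least one provider API key before running attribution.")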
3 Prepare Your Trace

Your trace JSON must contain a "history" or "conversation" array:

JSON
{
  "history": [
    {"role": "assistant", "name": "Excel_Expert", "content": "Let me analyze..."},
    {"role": "user", "name": "Computer_terminal", "content": "exitcode: 0 ..."},
    {"role": "user", "name": "DataVerification_Expert", "content": "TERMINATE"}
  ],
  "question": "optional problem description",
  "ground_truth": "optional expected answer"
}
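If your logs live in another format, here is a minimal sketch of assembling a compatible trace file in Python. The field names follow the schema above; `messages` stands in for whatever conversation log you already have:

Python
import json

# Hypothetical list of messages captured from a multi-agent run.
messages = [
    {"role": "assistant", "name": "Excel_Expert", "content": "Let me analyze..."},
    {"role": "user", "name": "Computer_terminal", "content": "exitcode: 0 ..."},
]

trace = {
    "history": messages,                          # required: "history" or "conversation"
    "question": "optional problem description",   # optional
    "ground_truth": "optional expected answer",   # optional
}

with open("trace.json", "w") as f:
    json.dump(trace, f, indent=2)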
4a Use the Python API
Python
import json
from ag2tracer import all_at_once, step_by_step, binary_search

with open("trace.json") as f:
    trace = json.load(f)

# Choose a method
result = all_at_once(trace, model="gpt-4o")
# result = step_by_step(trace, model="gpt-4o")
# result = binary_search(trace, model="gpt-4o")

print(f"Agent: {result.agent_name}")
print(f"Step:  {result.step_number}")
print(f"Reason: {result.reason}")
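If you have many failure traces, a small batch loop works well. The sketch below reuses only the `all_at_once` call and result fields shown above; the `traces/` directory layout and the blame tally are assumptions for illustration:

Python
import json
from collections import Counter
from pathlib import Path

from ag2tracer import all_at_once

trace_dir = Path("traces")   # hypothetical directory of failure traces
blame = Counter()

for path in sorted(trace_dir.glob("*.json")):
    with open(path) as f:
        trace = json.load(f)
    result = all_at_once(trace, model="gpt-4o")
    blame[result.agent_name] += 1
    print(f"{path.name}: {result.agent_name} @ step {result.step_number}")

# Which agents are blamed most often across the batch?
for agent, count in blame.most_common():
    print(f"{agent}: {count} failures")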
4b Or Launch the Web UI
Terminal
# Launch the web interface
ag2tracer launch

# Or run in background
ag2tracer launch --daemon

# Management commands
ag2tracer status    # check if server is running
ag2tracer logs -f   # follow logs
ag2tracer stop      # stop the background server

Then open http://127.0.0.1:8500 to upload traces and run attribution interactively.

5 Attribution Methods
Method | Description | LLM Calls
All-at-Once | Feeds the entire conversation to the LLM in a single prompt | 1
Step-by-Step | Evaluates each step sequentially and stops at the first detected error | Up to N
Binary Search | Recursively narrows down the error location by halving the log | O(log N)
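To make the Binary Search row concrete, here is a minimal sketch of the idea (illustrative, not ag2tracer's internal implementation). `judge_half` stands in for an LLM call that decides whether the decisive error lies in the first half of the current segment:

Python
def binary_search_attribution(history, judge_half):
    """Locate the decisive error step by repeatedly halving the log.

    `judge_half(history, lo, mid, hi)` is a stand-in for an LLM call that
    returns True if the decisive error lies in history[lo:mid], else False.
    Requires O(log N) judge calls for a log of N steps.
    """
    lo, hi = 0, len(history)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge_half(history, lo, mid, hi):
            hi = mid   # error is in the first half of the segment
        else:
            lo = mid   # error is in the second half of the segment
    step = lo
    return history[step].get("name"), step   # (failure-responsible agent, error step index)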

Performance comparison of failure attribution methods across different settings (GPT-4o as the backbone).

Method | Metric | With Ground Truth (Algorithm-Generated) | With Ground Truth (Hand-Crafted) | Without Ground Truth (Algorithm-Generated) | Without Ground Truth (Hand-Crafted)
Random | Agent-Level Accuracy | 29.10 | 12.00 | 29.10 | 12.00
Random | Step-Level Accuracy | 19.06 | 4.16 | 19.06 | 4.16
All-at-Once | Agent-Level Accuracy | 54.33 | 55.17 | 51.12 | 53.44
All-at-Once | Step-Level Accuracy | 12.50 | 5.26 | 13.53 | 3.51
Step-by-Step | Agent-Level Accuracy | 35.20 | 34.48 | 26.02 | 32.75
Step-by-Step | Step-Level Accuracy | 25.51 | 7.02 | 15.31 | 8.77
Binary Search | Agent-Level Accuracy | 44.13 | 51.72 | 30.11 | 36.21
Binary Search | Step-Level Accuracy | 23.98 | 6.90 | 16.59 | 6.90
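For reference, a minimal sketch of how the two metrics in this table can be computed from predictions and ground-truth annotations. The dictionary keys "agent" and "step" are illustrative and not the Who&When file format:

Python
def attribution_accuracy(predictions, annotations):
    """Fraction of examples where the predicted agent / step matches the annotation.

    `predictions` and `annotations` are parallel lists of dicts with the
    illustrative keys "agent" and "step".
    """
    n = len(annotations)
    agent_hits = sum(p["agent"] == a["agent"] for p, a in zip(predictions, annotations))
    step_hits = sum(p["step"] == a["step"] for p, a in zip(predictions, annotations))
    return {"agent_level": agent_hits / n, "step_level": step_hits / n}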

Overview

Failure attribution in LLM multi-agent systems, that is, identifying the agent and the step responsible for a task failure, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area.

Motivation

Modern LLM multi-agent systems often fail in subtle ways: the final answer is wrong, but it is unclear which agent caused the failure or which step first went wrong. In practice, developers must manually inspect long multi-agent execution logs to identify the root cause, a process that is slow and error-prone and that requires significant domain expertise.

This manual failure attribution step has become a major bottleneck in agent system development. As agent teams grow larger and interactions become longer, simply knowing that a system failed is no longer actionable without knowing who failed and when.

We argue that failure attribution should be treated as a first-class research problem. This work introduces automated failure attribution: using LLMs to automatically identify the failure-responsible agent and the decisive error step directly from execution logs.

Failure Attribution Demo

Method

Method Overview

We introduce Who&When, the first benchmark for automated failure attribution in LLM multi-agent systems. The dataset contains failure logs from 127 multi-agent systems, each annotated with (i) the failure-responsible agent and (ii) the decisive error step, i.e., the earliest mistake whose correction would turn the failure into a success. On top of this benchmark, we study three representative failure attribution strategies that expose fundamental trade-offs:

  • All-at-Once: Analyzes the entire failure log in a single pass.
  • Binary Search: Iteratively narrows down the search space by splitting logs in half.
  • Step-by-Step: Evaluates each step sequentially to detect errors (a minimal sketch follows this list).
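As a companion to the list above, here is a minimal sketch of the Step-by-Step strategy. `judge_step` stands in for an LLM call that inspects the conversation up to a given step; it is illustrative rather than the prompt used in the paper:

Python
def step_by_step_attribution(history, judge_step):
    """Walk the log sequentially and stop at the first step judged erroneous.

    `judge_step(history, i)` is a stand-in for an LLM call that sees the
    conversation up to and including step i and returns True if step i
    contains the decisive error. Worst case: N judge calls for N steps.
    """
    for i, message in enumerate(history):
        if judge_step(history, i):
            return message.get("name"), i   # (failure-responsible agent, error step index)
    return None, None                       # no decisive error found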

Results

Main Experimental Results

Our experiments reveal that automated failure attribution is substantially harder than standard evaluation tasks. Even the best-performing approach achieves only 53.5% accuracy in identifying the failure-responsible agent and 14.2% accuracy in pinpointing the exact failure step.

Model Ablation Study

Notably, stronger reasoning models do not solve the problem. State-of-the-art reasoning models such as OpenAI o1 and DeepSeek R1 fail to achieve practical step-level accuracy, and in some settings perform worse than simpler baselines. These results indicate that failure attribution is not merely a matter of stronger reasoning or larger models. Instead, it exposes fundamental limitations in how LLMs retrieve, localize, and reason about errors within long, multi-agent interaction histories.

BibTeX


@inproceedings{zhang2025which,
  title={Which Agent Causes Task Failures and When? On Automated Failure Attribution of {LLM} Multi-Agent Systems},
  author={Shaokun Zhang and Ming Yin and Jieyu Zhang and Jiale Liu and Zhiguang Han and Jingyang Zhang and Beibin Li and Chi Wang and Huazheng Wang and Yiran Chen and Qingyun Wu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=GazlTYxZss}
}