Which Agent Causes Task Failures and When?
On Automated Failure Attribution of LLM Multi-Agent Systems

Shaokun Zhang1    Ming Yin2    Jieyu Zhang3    Jiale Liu1    Zhiguang Han4    Jingyang Zhang2    Beibin Li5    Chi Wang6    Huazheng Wang7    Yiran Chen2    Qingyun Wu1
1Penn State    2Duke    3UW    4NTU    5xAI    6Google DeepMind    7Oregon State
Automated Failure Attribution — A rigorous definition and evaluation framework for identifying and analyzing failures in multi-agent systems.
Who&When Benchmark — The first benchmark built from failure logs across 127 multi-agent systems, featuring fine-grained annotations that link failures to specific agents and execution steps.
Three Attribution Methods — A systematic comparison of All-at-Once, Binary Search, and Step-by-Step strategies, highlighting their trade-offs in accuracy and efficiency.
Challenging Open Problem — Even state-of-the-art models achieve limited performance, underscoring substantial opportunities for future research.
Interactive Demo
The interactive demo presents five example failure cases: web navigation, math reasoning, code generation, data analysis, and information retrieval. For each case, it shows the task, the full agent trajectory, and the attribution result: the failure-responsible agent, the decisive error step, the reason behind it, and the ground-truth answer.
Tool Demo Video
1 Install
Terminal
pip install ag2tracer
2 Set API Key
Terminal
export OPENAI_API_KEY="your-openai-key"
# Optional: other providers
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
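Optionally, you can fail fast if no provider key is configured before running attribution. Below is a minimal sketch using only the standard library; this check is a convenience and not part of ag2tracer itself:

Python
import os

# Abort early if no provider key is set (optional convenience check,
# not part of ag2tracer).
keys = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")
if not any(os.environ.get(k) for k in keys):
    raise RuntimeError("Set at least one provider API key before running attribution.")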
3 Prepare Your Trace

Your trace JSON must contain a "history" or "conversation" array:

JSON
{
  "history": [
    {"role": "assistant", "name": "Excel_Expert", "content": "Let me analyze..."},
    {"role": "user", "name": "Computer_terminal", "content": "exitcode: 0 ..."},
    {"role": "user", "name": "DataVerification_Expert", "content": "TERMINATE"}
  ],
  "question": "optional problem description",
  "ground_truth": "optional expected answer"
}
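If your logs live in another format, here is a minimal sketch of assembling a compatible trace file in Python. The field names follow the schema above; `messages` stands in for whatever conversation log you already have:

Python
import json

# Hypothetical list of messages captured from a multi-agent run.
messages = [
    {"role": "assistant", "name": "Excel_Expert", "content": "Let me analyze..."},
    {"role": "user", "name": "Computer_terminal", "content": "exitcode: 0 ..."},
]

trace = {
    "history": messages,                          # required: "history" or "conversation"
    "question": "optional problem description",   # optional
    "ground_truth": "optional expected answer",   # optional
}

with open("trace.json", "w") as f:
    json.dump(trace, f, indent=2)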
4a Use the Python API
Python
import json
from ag2tracer import all_at_once, step_by_step, binary_search

with open("trace.json") as f:
    trace = json.load(f)

# Choose a method
result = all_at_once(trace, model="gpt-4o")
# result = step_by_step(trace, model="gpt-4o")
# result = binary_search(trace, model="gpt-4o")

print(f"Agent: {result.agent_name}")
print(f"Step:  {result.step_number}")
print(f"Reason: {result.reason}")
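If you have many failure traces, a small batch loop works well. The sketch below reuses only the `all_at_once` call and result fields shown above; the `traces/` directory layout and the blame tally are assumptions for illustration:

Python
import json
from collections import Counter
from pathlib import Path

from ag2tracer import all_at_once

trace_dir = Path("traces")   # hypothetical directory of failure traces
blame = Counter()

for path in sorted(trace_dir.glob("*.json")):
    with open(path) as f:
        trace = json.load(f)
    result = all_at_once(trace, model="gpt-4o")
    blame[result.agent_name] += 1
    print(f"{path.name}: {result.agent_name} @ step {result.step_number}")

# Which agents are blamed most often across the batch?
for agent, count in blame.most_common():
    print(f"{agent}: {count} failures")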
4b Or Launch the Web UI
Terminal
# Launch the web interface
ag2tracer launch

# Or run in background
ag2tracer launch --daemon

# Management commands
ag2tracer status    # check if server is running
ag2tracer logs -f   # follow logs
ag2tracer stop      # stop the background server

Then open http://127.0.0.1:8500 to upload traces and run attribution interactively.

5 Attribution Methods
Method | Description | LLM Calls
All-at-Once | Feeds the entire conversation to the LLM in a single prompt | 1
Step-by-Step | Evaluates each step sequentially and stops at the first detected error | Up to N
Binary Search | Recursively narrows down the error location by halving the log | O(log N)
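To make the Binary Search row concrete, here is a minimal sketch of the idea (illustrative, not ag2tracer's internal implementation). `judge_half` stands in for an LLM call that decides whether the decisive error lies in the first half of the current segment:

Python
def binary_search_attribution(history, judge_half):
    """Locate the decisive error step by repeatedly halving the log.

    `judge_half(history, lo, mid, hi)` is a stand-in for an LLM call that
    returns True if the decisive error lies in history[lo:mid], else False.
    Requires O(log N) judge calls for a log of N steps.
    """
    lo, hi = 0, len(history)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge_half(history, lo, mid, hi):
            hi = mid   # error is in the first half of the segment
        else:
            lo = mid   # error is in the second half of the segment
    step = lo
    return history[step].get("name"), step   # (failure-responsible agent, error step index)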

Performance comparison of failure attribution methods across different settings (GPT-4o as the backbone).

Method | Metric | With Ground Truth (Algorithm-Generated) | With Ground Truth (Hand-Crafted) | Without Ground Truth (Algorithm-Generated) | Without Ground Truth (Hand-Crafted)
Random | Agent-Level Accuracy | 29.10 | 12.00 | 29.10 | 12.00
Random | Step-Level Accuracy | 19.06 | 4.16 | 19.06 | 4.16
All-at-Once | Agent-Level Accuracy | 54.33 | 55.17 | 51.12 | 53.44
All-at-Once | Step-Level Accuracy | 12.50 | 5.26 | 13.53 | 3.51
Step-by-Step | Agent-Level Accuracy | 35.20 | 34.48 | 26.02 | 32.75
Step-by-Step | Step-Level Accuracy | 25.51 | 7.02 | 15.31 | 8.77
Binary Search | Agent-Level Accuracy | 44.13 | 51.72 | 30.11 | 36.21
Binary Search | Step-Level Accuracy | 23.98 | 6.90 | 16.59 | 6.90
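For reference, a minimal sketch of how the two metrics in this table can be computed from predictions and ground-truth annotations. The dictionary keys "agent" and "step" are illustrative and not the Who&When file format:

Python
def attribution_accuracy(predictions, annotations):
    """Fraction of examples where the predicted agent / step matches the annotation.

    `predictions` and `annotations` are parallel lists of dicts with the
    illustrative keys "agent" and "step".
    """
    n = len(annotations)
    agent_hits = sum(p["agent"] == a["agent"] for p, a in zip(predictions, annotations))
    step_hits = sum(p["step"] == a["step"] for p, a in zip(predictions, annotations))
    return {"agent_level": agent_hits / n, "step_level": step_hits / n}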

Overview

Failure attribution in LLM multi-agent systems, that is, identifying the agent and the step responsible for a task failure, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area.

Motivation

Modern LLM multi-agent systems often fail in subtle ways: the final answer is wrong, but it is unclear which agent caused the failure or which step first went wrong. In practice, developers must manually inspect long multi-agent execution logs to identify the root cause, a process that is slow and error-prone and that requires significant domain expertise.

This manual failure attribution step has become a major bottleneck in agent system development. As agent teams grow larger and interactions become longer, simply knowing that a system failed is no longer actionable without knowing who failed and when.

We argue that failure attribution should be treated as a first-class research problem. This work introduces automated failure attribution: using LLMs to automatically identify the failure-responsible agent and the decisive error step directly from execution logs.

Failure Attribution Demo

Method

Method Overview

We introduce Who&When, the first benchmark for automated failure attribution in LLM multi-agent systems. The dataset contains failure logs from 127 multi-agent systems, each annotated with (i) the failure-responsible agent and (ii) the decisive error step, i.e., the earliest mistake whose correction would turn the failure into a success. On top of this benchmark, we study three representative failure attribution strategies that expose fundamental trade-offs:

  • All-at-Once: Analyzes the entire failure log in a single pass.
  • Binary Search: Iteratively narrows down the search space by splitting logs in half.
  • Step-by-Step: Evaluates each step sequentially to detect errors (a minimal sketch follows this list).
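As a companion to the list above, here is a minimal sketch of the Step-by-Step strategy. `judge_step` stands in for an LLM call that inspects the conversation up to a given step; it is illustrative rather than the prompt used in the paper:

Python
def step_by_step_attribution(history, judge_step):
    """Walk the log sequentially and stop at the first step judged erroneous.

    `judge_step(history, i)` is a stand-in for an LLM call that sees the
    conversation up to and including step i and returns True if step i
    contains the decisive error. Worst case: N judge calls for N steps.
    """
    for i, message in enumerate(history):
        if judge_step(history, i):
            return message.get("name"), i   # (failure-responsible agent, error step index)
    return None, None                       # no decisive error found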

Results

Main Experimental Results

Our experiments reveal that automated failure attribution is substantially harder than standard evaluation tasks. Even the best-performing approach achieves only 53.5% accuracy in identifying the failure-responsible agent and 14.2% accuracy in pinpointing the exact failure step.

Model Ablation Study

Notably, stronger reasoning models do not solve the problem. State-of-the-art reasoning models such as OpenAI o1 and DeepSeek R1 fail to achieve practical step-level accuracy, and in some settings perform worse than simpler baselines. These results indicate that failure attribution is not merely a matter of stronger reasoning or larger models. Instead, it exposes fundamental limitations in how LLMs retrieve, localize, and reason about errors within long, multi-agent interaction histories.

BibTeX


@inproceedings{zhang2025which,
  title={Which Agent Causes Task Failures and When? On Automated Failure Attribution of {LLM} Multi-Agent Systems},
  author={Shaokun Zhang and Ming Yin and Jieyu Zhang and Jiale Liu and Zhiguang Han and Jingyang Zhang and Beibin Li and Chi Wang and Huazheng Wang and Yiran Chen and Qingyun Wu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=GazlTYxZss}
}