Abstract
The Amazing Agent Race benchmark introduces DAG-based puzzles to evaluate LLM agents' navigation and tool-use capabilities beyond traditional linear benchmarks, revealing that navigation errors dominate performance issues.
Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
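To make the fork-merge structure and the three metrics concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Leg` and `Trace` schemas, the page and step names, and the metric definitions (pit-stop visit rate as the fraction of required pages visited, roadblock completion rate as the fraction of required tool steps completed, finish-line accuracy as exact match on the final answer). The benchmark's actual data format and formulas may differ.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class Leg:
    """One puzzle. All field names are illustrative, not the AAR schema."""
    deps: dict[str, set[str]]   # step -> steps that must finish first
    required_pages: set[str]    # pages the agent is expected to visit
    required_tools: set[str]    # tool steps the agent is expected to complete
    answer: str


@dataclass
class Trace:
    """What an agent did on one leg (also illustrative)."""
    visited_pages: set[str]
    completed_tools: set[str]
    answer: str


def is_linear_chain(deps: dict[str, set[str]]) -> bool:
    """True if no step has more than one parent or more than one child."""
    children = {step: 0 for step in deps}
    for parents in deps.values():
        if len(parents) > 1:
            return False  # a merge point, so not a simple chain
        for parent in parents:
            children[parent] += 1
    return all(count <= 1 for count in children.values())


def score(leg: Leg, trace: Trace) -> dict[str, float]:
    """Assumed metric definitions; the paper's exact formulas may differ."""
    pvr = len(leg.required_pages & trace.visited_pages) / len(leg.required_pages)
    rcr = len(leg.required_tools & trace.completed_tools) / len(leg.required_tools)
    fa = float(trace.answer.strip() == leg.answer.strip())
    return {
        "finish_line": fa,
        "pit_stop_visit_rate": pvr,
        "roadblock_completion_rate": rcr,
    }


# A diamond (fork-merge) leg versus a plain three-step chain.
diamond = {"start": set(), "left": {"start"}, "right": {"start"}, "merge": {"left", "right"}}
chain = {"a": set(), "b": {"a"}, "c": {"b"}}

# static_order() raises CycleError on a cycle, so this doubles as a DAG check.
order = list(TopologicalSorter(diamond).static_order())

# An agent that ran every tool step but reached only one of two required pages.
leg = Leg(diamond, {"Page A", "Page B"}, {"left", "right"}, "42")
trace = Trace({"Page A"}, {"left", "right"}, "41")
result = score(leg, trace)
```

In this toy trace the tool steps all complete while half the required pages are missed and the final answer is wrong, which is the kind of split the separate metrics are meant to expose.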
Community
Can frontier LLM agents navigate Wikipedia, call tools, and compute answers across multi-step scavenger-hunt puzzles?
Key finding: agents are strong tool users but terrible navigators, with the best reaching only 37% overall accuracy. Here's why they fail:
- Navigation is the bottleneck, not tool use.
Navigation errors (visiting the wrong pages) occur in 27-52% of trials. Tool errors? Under 17%.
Agents that fail search 56% MORE than agents that succeed. They spiral on wrong pages instead of finding the right one.
- We found 4 types of navigation failures:
- Wrong pages entirely (PVR=0, tools on wrong data)
- Navigation drift (starts right, loses thread on long trails)
- Compensatory tool use (wrong pages, right tools: 47% of nav failures!)
- Search spirals (51 searches, 4 page fetches, never converges)
- Compositional DAG structure breaks navigation, not tool use.
Moving from linear chains to diamond fork-merge patterns drops page-visit rates by 13-18pp. Tool completion rates? Unchanged.
- A 120B reasoning model scored 3%, worse than random guessing (10%).
Extended thinking burns the entire time budget on one turn. Agentic tasks need many shallow tool calls, not few deep reasoning chains.
- Claude Code matches Codex CLI (37.2% vs 37.1%) using 6× fewer tokens.
Token efficiency and task performance are decoupled.
The takeaway for agent builders: invest in better information retrieval. Finding the right context to act on is the hard part.
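The four navigation failure types listed above could be triaged with a rule of thumb like the following sketch. The function name, the order of the checks, and both thresholds (`spiral_ratio`, `tools_ok`) are made up for illustration and are not taken from the paper; it assumes the trial has already been marked as a navigation failure.

```python
def label_navigation_failure(pvr, rcr, searches, fetches,
                             spiral_ratio=5.0, tools_ok=0.8):
    """Assign one coarse label to a failed trial.

    pvr: page-visit rate in [0, 1]; rcr: tool (roadblock) completion rate in [0, 1].
    searches / fetches: counts of search calls and page fetches in the trial.
    Thresholds are hypothetical, chosen only to make the example concrete.
    """
    # Many searches per fetched page: the agent never converges on a page.
    if searches >= spiral_ratio * max(fetches, 1):
        return "search spiral"
    # No required page was ever visited.
    if pvr == 0.0:
        return "wrong pages entirely"
    # Some pages missed, yet the tool steps were mostly completed.
    if rcr >= tools_ok:
        return "compensatory tool use"
    # Started on the right pages, then lost the thread.
    return "navigation drift"


examples = [
    label_navigation_failure(pvr=0.25, rcr=0.5, searches=51, fetches=4),
    label_navigation_failure(pvr=0.0, rcr=0.9, searches=3, fetches=5),
    label_navigation_failure(pvr=0.5, rcr=1.0, searches=4, fetches=6),
    label_navigation_failure(pvr=0.5, rcr=0.3, searches=4, fetches=6),
]
```

The first example reuses the 51-searches, 4-fetches case from the list; the labels are not mutually exclusive in practice, so the ordering of the checks is itself a design choice.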