How does your agent handle failure? Does it read the error message, reason about the cause, and try a different approach? Or does it blindly retry the same thing 11 times, wasting tokens and time?
Feeds directly into Q-Protocol's Failure Recovery dimension (highest weight at 20%)
Does the agent read and use information from error messages to guide its recovery? Tests 404s with redirect hints, column name suggestions, rate limit headers, and more.
Does the agent try different approaches after failure? Tests JSON→CSV format switching, port scanning, library alternatives, algorithm pivots, and sync→async transitions.
Does the agent get stuck in retry loops? Tests permanently down services, full disks, revoked credentials, circular dependencies, and quota limits.
Can the agent partially complete a task when full completion is impossible? Tests batch processing with corrupted files, multi-API reports with one source down, and partial deployments.
Every agent encounters errors. The difference between a good agent and a bad agent isn't whether they fail; it's how they recover. Error Recovery Efficiency measures: tokens wasted on failed retries, time to successful recovery, error message utilization rate (did the agent actually READ stderr?), and strategy diversity (did it try something different or repeat the same failed approach?).
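The metrics above can be sketched as a summary over an agent's attempt trace. This is a minimal illustration, not the benchmark's actual implementation: the `Attempt` fields and metric names are assumptions chosen to mirror the dimensions listed in the text.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    action: str        # what the agent tried (hypothetical trace field)
    tokens: int        # tokens spent on this attempt
    read_stderr: bool  # did the agent reference the error text?
    succeeded: bool

def recovery_metrics(attempts: list[Attempt]) -> dict:
    """Summarize recovery behavior from a trace of attempts."""
    failed = [a for a in attempts if not a.succeeded]
    return {
        # tokens burned on attempts that did not succeed
        "wasted_tokens": sum(a.tokens for a in failed),
        # fraction of failures where the agent actually read the error
        "error_utilization": (
            sum(a.read_stderr for a in failed) / len(failed) if failed else 1.0
        ),
        # distinct actions among failures: 1.0 means every retry tried something new
        "strategy_diversity": (
            len({a.action for a in failed}) / len(failed) if failed else 1.0
        ),
    }
```

An agent that retries `GET /v1` twice before switching to `/v2` would score 0.5 on strategy diversity under this sketch.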
A retry storm is when an agent gets stuck in a loop, retrying the exact same failed action 5+ times with no strategy change. Common triggers: permanently down services, full disks, revoked credentials, circular dependencies. A good agent detects these after 2-3 attempts and escalates to the user. A bad agent burns hundreds of tokens retrying the impossible.
When an API returns "404: Resource moved to /api/v2/users", does the agent actually read that message and call the new endpoint, or does it retry the old one? When PostgreSQL says 'column "user_name" does not exist. Did you mean "username"?', does the agent fix the typo? Error messages contain the solution; agents that ignore them waste time and money.
When JSON parsing fails on what's actually a CSV file, a good agent recognizes the format mismatch and switches to CSV parsing. A bad agent retries json.loads() with minor tweaks. Strategy diversity measures whether the agent tries meaningfully different approaches β not just parameter tweaks on the same failed method.
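The format switch described above can be sketched as a parse-with-fallback: catch the JSON failure and pivot to a genuinely different strategy rather than tweaking the same call. The function name is hypothetical.

```python
import csv
import io
import json

def parse_records(text: str):
    """Try JSON first; on failure, switch strategy to CSV instead of retrying JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Strategy switch: the payload may actually be CSV with a header row
        return list(csv.DictReader(io.StringIO(text)))
```

The point is not the specific formats but the shape: the `except` branch does something meaningfully different from the `try` branch.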
When processing 10 files and file #4 is corrupted, a good agent processes the other 9 and reports file #4 as failed. A bad agent aborts the entire batch. Graceful degradation measures partial completion rate, failure communication (did it explain what went wrong?), and user actionability (can you do something with the partial result?).
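Graceful degradation in batch processing usually comes down to a try/except inside the loop instead of around it. A minimal sketch, with a hypothetical `parse` callable standing in for the per-file work:

```python
def process_batch(files: dict[str, str], parse) -> tuple[dict, dict]:
    """Process every file; collect failures instead of aborting the whole batch."""
    results, failures = {}, {}
    for name, content in files.items():
        try:
            results[name] = parse(content)
        except Exception as exc:
            # Record what failed and why, so the partial result is actionable
            failures[name] = str(exc)
    return results, failures
```

Returning both dicts is what makes the output actionable: the user gets the 9 good results *and* a per-file explanation of what went wrong with the rest.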
Each test is scored on 4 category-specific dimensions (0-100 each), averaged into a composite score. Scoring uses an LLM-as-judge (GLM-5) with a strict rubric and calibration examples, falling back to keyword matching when the judge is unavailable. The judge is calibrated to produce a minimum 30-point gap between agents that read error messages and agents that blindly retry.
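The composite described above is an unweighted mean of the four dimension scores. A trivial sketch of that aggregation, with the dimension names invented for illustration:

```python
def composite_score(dimensions: dict[str, float]) -> float:
    """Average four 0-100 dimension scores into one composite (shape assumed from the text)."""
    if len(dimensions) != 4:
        raise ValueError("expected exactly 4 category-specific dimensions")
    return sum(dimensions.values()) / len(dimensions)
```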