How does your agent handle failure? Does it read the error message, reason about the cause, and try a different approach? Or does it blindly retry the same thing 11 times, wasting tokens and time?
Feeds directly into Q-Protocol's Failure Recovery dimension (highest weight at 20%)
Does the agent read and use information from error messages to guide its recovery? Tests 404s with redirect hints, column name suggestions, rate limit headers, and more.
Does the agent try different approaches after failure? Tests JSON→CSV format switching, port scanning, library alternatives, algorithm pivots, and sync→async transitions.
Does the agent get stuck in retry loops? Tests permanently down services, full disks, revoked credentials, circular dependencies, and quota limits.
Can the agent partially complete a task when full completion is impossible? Tests batch processing with corrupted files, multi-API reports with one source down, and partial deployments.
Every agent encounters errors. The difference between a good agent and a bad agent isn't whether they fail; it's how they recover. Error Recovery Efficiency measures: tokens wasted on failed retries, time to successful recovery, error message utilization rate (did the agent actually READ stderr?), and strategy diversity (did it try something different or repeat the same failed approach?).
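The metrics above can be sketched as a summary over an agent's attempt trace. This is a minimal illustration, not the benchmark's actual implementation: the `Attempt` fields and metric names are assumptions chosen to mirror the dimensions listed in the text.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    action: str        # what the agent tried (hypothetical trace field)
    tokens: int        # tokens spent on this attempt
    read_stderr: bool  # did the agent reference the error text?
    succeeded: bool

def recovery_metrics(attempts: list[Attempt]) -> dict:
    """Summarize recovery behavior from a trace of attempts."""
    failed = [a for a in attempts if not a.succeeded]
    return {
        # tokens burned on attempts that did not succeed
        "wasted_tokens": sum(a.tokens for a in failed),
        # fraction of failures where the agent actually read the error
        "error_utilization": (
            sum(a.read_stderr for a in failed) / len(failed) if failed else 1.0
        ),
        # distinct actions among failures: 1.0 means every retry tried something new
        "strategy_diversity": (
            len({a.action for a in failed}) / len(failed) if failed else 1.0
        ),
    }
```

An agent that retries `GET /v1` twice before switching to `/v2` would score 0.5 on strategy diversity under this sketch.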
A retry storm is when an agent gets stuck in a loop, retrying the exact same failed action 5+ times with no strategy change. Common triggers: permanently down services, full disks, revoked credentials, circular dependencies. A good agent detects these after 2-3 attempts and escalates to the user. A bad agent burns hundreds of tokens retrying the impossible.
When an API returns "404: Resource moved to /api/v2/users", does the agent actually read that message and call the new endpoint, or does it retry the old one? When PostgreSQL says 'column "user_name" does not exist. Did you mean "username"?', does the agent fix the typo? Error messages contain the solution; agents that ignore them waste time and money.
When JSON parsing fails on what's actually a CSV file, a good agent recognizes the format mismatch and switches to CSV parsing. A bad agent retries json.loads() with minor tweaks. Strategy diversity measures whether the agent tries meaningfully different approaches β not just parameter tweaks on the same failed method.
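The format switch described above can be sketched as a parse-with-fallback: catch the JSON failure and pivot to a genuinely different strategy rather than tweaking the same call. The function name is hypothetical.

```python
import csv
import io
import json

def parse_records(text: str):
    """Try JSON first; on failure, switch strategy to CSV instead of retrying JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Strategy switch: the payload may actually be CSV with a header row
        return list(csv.DictReader(io.StringIO(text)))
```

The point is not the specific formats but the shape: the `except` branch does something meaningfully different from the `try` branch.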
When processing 10 files and file #4 is corrupted, a good agent processes the other 9 and reports file #4 as failed. A bad agent aborts the entire batch. Graceful degradation measures partial completion rate, failure communication (did it explain what went wrong?), and user actionability (can you do something with the partial result?).
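Graceful degradation in batch processing usually comes down to a try/except inside the loop instead of around it. A minimal sketch, with a hypothetical `parse` callable standing in for the per-file work:

```python
def process_batch(files: dict[str, str], parse) -> tuple[dict, dict]:
    """Process every file; collect failures instead of aborting the whole batch."""
    results, failures = {}, {}
    for name, content in files.items():
        try:
            results[name] = parse(content)
        except Exception as exc:
            # Record what failed and why, so the partial result is actionable
            failures[name] = str(exc)
    return results, failures
```

Returning both dicts is what makes the output actionable: the user gets the 9 good results *and* a per-file explanation of what went wrong with the rest.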
Each test is scored on 4 category-specific dimensions (0-100 each), averaged into a composite score. Scoring uses an LLM-as-judge (GLM-5) with a strict rubric and calibration examples, falling back to keyword matching when the judge is unavailable. The judge is calibrated to produce a minimum 30-point gap between agents that read error messages and agents that blindly retry.
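The composite described above is an unweighted mean of the four dimension scores. A trivial sketch of that aggregation, with the dimension names invented for illustration:

```python
def composite_score(dimensions: dict[str, float]) -> float:
    """Average four 0-100 dimension scores into one composite (shape assumed from the text)."""
    if len(dimensions) != 4:
        raise ValueError("expected exactly 4 category-specific dimensions")
    return sum(dimensions.values()) / len(dimensions)
```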