πŸ”„ Error Recovery Efficiency

How your agent handles failure. Does it read the error message, reason about the cause, and try a different approach? Or does it blindly retry the same thing 11 times, wasting tokens and time?

40 Total Tests Β· 4 Categories Β· 20% Q-Protocol Weight
The difference between a good agent and a bad agent isn't whether they fail β€” it's whether they waste 8x the tokens retrying the same mistake. An agent that fails once, reads the error, adjusts, and succeeds on attempt #2 is dramatically more efficient than one that retries identically 8 times. Both eventually succeed. One costs 8x more. TAB measures the difference.

4 Categories

Feeds directly into Q-Protocol's Failure Recovery dimension (highest weight at 20%)

Resilience

πŸ“– Error Message Utilization

Does the agent read and use information from error messages to guide its recovery? Tests 404s with redirect hints, column name suggestions, rate limit headers, and more.

10 tests Β· Scoring: error_message_read, fix_accuracy, retry_efficiency, token_waste
Resilience

πŸ”€ Strategy Diversity

Does the agent try different approaches after failure? Tests JSON→CSV format switching, port scanning, library alternatives, algorithm pivots, and sync→async transitions.

10 tests Β· Scoring: strategy_count, first_pivot_speed, approach_quality, eventual_success
Resilience

πŸŒ€ Retry Storm Detection

Does the agent get stuck in retry loops? Tests permanently down services, full disks, revoked credentials, circular dependencies, and quota limits.

10 tests Β· Scoring: max_identical_retries, storm_detected, escalation_quality, token_waste_ratio
Resilience

πŸ›‘οΈ Graceful Degradation

Can the agent partially complete a task when full completion is impossible? Tests batch processing with corrupted files, multi-API reports with one source down, and partial deployments.

10 tests Β· Scoring: partial_completion_rate, failure_communication, data_preservation, user_actionability


What is Error Recovery Efficiency?

Every agent encounters errors. The difference between a good agent and a bad agent isn't whether they fail β€” it's how they recover. Error Recovery Efficiency measures: tokens wasted on failed retries, time to successful recovery, error message utilization rate (did the agent actually READ stderr?), and strategy diversity (did it try something different or repeat the same failed approach?).

Retry Storms

A retry storm is when an agent gets stuck in a loop, retrying the exact same failed action 5+ times with no strategy change. Common triggers: permanently down services, full disks, revoked credentials, circular dependencies. A good agent detects these after 2-3 attempts and escalates to the user. A bad agent burns hundreds of tokens retrying the impossible.
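The detection logic described above can be sketched in a few lines β€” a minimal example assuming the agent logs each attempt as a (tool, arguments) pair; the function and threshold names are illustrative, not part of TAB itself:

```python
from collections import Counter

STORM_THRESHOLD = 3  # escalate after this many identical attempts

def detect_retry_storm(action_log, threshold=STORM_THRESHOLD):
    """Return True if any identical action was attempted `threshold`+ times."""
    counts = Counter(action_log)
    return any(n >= threshold for n in counts.values())

# An agent that retries the same request four times with no strategy change:
log = [("http_get", "https://api.example.com/v1/users")] * 4
print(detect_retry_storm(log))  # True
```

A real harness would also check that the attempts are consecutive and unchanged, but even this crude counter catches the "retry the impossible" pattern the tests probe for.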

Error Utilization

When an API returns "404: Resource moved to /api/v2/users", does the agent actually read that message and call the new endpoint, or does it retry the old one? When PostgreSQL says 'column "user_name" does not exist. Did you mean "username"?', does the agent fix the typo? Error messages contain the solution β€” agents that ignore them waste time and money.
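What "reading the error" looks like in practice can be sketched as follows β€” a hypothetical helper that pulls the suggested identifier out of a PostgreSQL-style hint instead of retrying the failing query (the regex and function name are assumptions, not TAB internals):

```python
import re

def extract_suggestion(error_message):
    """Pull the 'Did you mean "X"?' suggestion out of an error, if present."""
    match = re.search(r'Did you mean "([^"]+)"\?', error_message)
    return match.group(1) if match else None

err = 'column "user_name" does not exist. Did you mean "username"?'
print(extract_suggestion(err))  # username
```

An agent that applies this suggestion succeeds on attempt #2; one that ignores it re-issues the same broken query.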

Strategy Diversity

When JSON parsing fails on what's actually a CSV file, a good agent recognizes the format mismatch and switches to CSV parsing. A bad agent retries json.loads() with minor tweaks. Strategy diversity measures whether the agent tries meaningfully different approaches β€” not just parameter tweaks on the same failed method.
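The JSON→CSV pivot above can be sketched like this β€” an illustrative fallback, not TAB's actual implementation:

```python
import csv
import io
import json

def parse_records(text):
    """Try JSON first; on failure, pivot to CSV rather than retrying JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Strategy change: different parser, not a tweak to the failed one.
        return list(csv.DictReader(io.StringIO(text)))

data = "name,role\nada,engineer"
print(parse_records(data))  # [{'name': 'ada', 'role': 'engineer'}]
```

The key property being scored is that the second attempt uses a meaningfully different approach, not `json.loads()` with massaged input.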

Graceful Degradation

When processing 10 files and file #4 is corrupted, a good agent processes the other 9 and reports file #4 as failed. A bad agent aborts the entire batch. Graceful degradation measures partial completion rate, failure communication (did it explain what went wrong?), and user actionability (can you do something with the partial result?).
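The batch scenario above amounts to the pattern sketched here β€” process everything you can, collect failures with their reasons, and report both (`process` is a stand-in for real per-file work):

```python
def process_batch(files, process):
    """Process each file independently; never let one failure abort the batch."""
    done, failed = [], []
    for name in files:
        try:
            done.append((name, process(name)))
        except Exception as exc:
            failed.append((name, str(exc)))  # preserve the reason for the user
    return done, failed

def process(name):
    if name == "file4":
        raise ValueError("corrupted header")
    return name.upper()

done, failed = process_batch([f"file{i}" for i in range(1, 11)], process)
print(len(done), failed)  # 9 [('file4', 'corrupted header')]
```

Returning the failure list alongside the results is what makes the partial output actionable: the user knows exactly which file to fix and why.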

How Scoring Works

Each test is scored on 4 category-specific dimensions (0-100 each), averaged into a composite score. Scoring uses LLM-as-judge (GLM-5) with a strict rubric and calibration examples, with keyword fallback when the judge is unavailable. The judge is calibrated to produce a minimum 30-point gap between agents that read error messages vs agents that blindly retry.

  • β‰₯ 75 β€” Efficient Recovery: Agent reads errors, pivots strategy, recovers quickly
  • 50-74 β€” Partial Recovery: Agent sometimes reads errors but still wastes retries
  • < 50 β€” Poor Recovery: Agent blindly retries, enters retry storms, aborts on partial failures
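The composite-and-tier step described above can be sketched as follows; the dimension names follow the Error Message Utilization category and the thresholds come from the tier list, but the variable names are illustrative:

```python
def composite(scores):
    """Average the four 0-100 dimension scores into one composite."""
    return sum(scores.values()) / len(scores)

def tier(score):
    """Map a composite score to the rubric's recovery tiers."""
    if score >= 75:
        return "Efficient Recovery"
    if score >= 50:
        return "Partial Recovery"
    return "Poor Recovery"

run = {"error_message_read": 90, "fix_accuracy": 80,
       "retry_efficiency": 70, "token_waste": 60}
s = composite(run)
print(s, tier(s))  # 75.0 Efficient Recovery
```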