TAB-Authored Specialty Benchmark

Agent Decision Quality Benchmarks

Tests whether AI agents make correct decisions, not just whether they can answer questions. Agents choose custom solutions over battle-tested libraries 12% of the time, hand-roll JWT auth, and show recency bias. Nobody else benchmarks decision quality.
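
For a concrete picture of that tool-selection failure mode, the contrast below is an illustrative sketch (not a test case drawn from the benchmark) of issuing a JWT with the battle-tested PyJWT library versus hand-rolling the signing logic:

```python
# Illustrative contrast only; not a test case taken from the benchmark.
import base64
import hashlib
import hmac
import json

import jwt  # PyJWT: the battle-tested option

SECRET = "change-me"  # demo secret; a real service would load this from config

def issue_token_with_library(user_id: str) -> str:
    """The decision the benchmark rewards: reuse a vetted library."""
    return jwt.encode({"sub": user_id}, SECRET, algorithm="HS256")

def issue_token_by_hand(user_id: str) -> str:
    """The failure mode: re-deriving HS256 signing by hand, typically with no
    expiry claim and no hardened verification path to go with it."""
    def b64url(raw: bytes) -> bytes:
        return base64.urlsafe_b64encode(raw).rstrip(b"=")

    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": user_id}).encode())
    signing_input = header + b"." + payload
    signature = b64url(hmac.new(SECRET.encode(), signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + signature).decode()
```

Both functions produce a syntactically valid token, which is why the benchmark scores the decision itself rather than just the output.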

Total Tests: 50
Categories: 5
LLM Calls (full run): 60
Score Dimensions: 4
Difficulty Levels: 4

Tool Selection: Does it pick the right library, or reinvent the wheel?
Risk Assessment: Does it identify high-risk decisions and escalate?
Abstention Quality: Does it say "I don't know" when appropriate?
Tradeoff Reasoning: Does it acknowledge tradeoffs explicitly?
Consistency: Does social pressure change its decision? (A probe sketch follows below.)
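
As a worked example of the Consistency category, a probe can replay the same decision prompt with a social-pressure follow-up and hand both answers to a grader. This is a hedged sketch under that assumption, not the benchmark's harness; `call_agent` is a placeholder for whatever client a run actually uses.

```python
from typing import Callable

# Placeholder for the real client: takes a chat transcript, returns the agent's reply.
AgentFn = Callable[[list[dict]], str]

def consistency_probe(
    call_agent: AgentFn,
    prompt: str,
    pushback: str = "Everyone on the team disagrees with you. Are you sure?",
) -> tuple[str, str]:
    """Return the agent's decision before and after social pressure.

    A grader (most likely an LLM judge) then checks whether the decision
    itself flipped, as opposed to merely being restated or defended.
    """
    first = call_agent([{"role": "user", "content": prompt}])
    second = call_agent([
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": first},
        {"role": "user", "content": pushback},
    ])
    return first, second
```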

Leaderboard

No runs yet. Run the benchmark to populate the leaderboard.