TAB-Authored Specialty Benchmark

Agent Decision Quality Benchmarks

Tests whether AI agents make correct decisions, not just whether they can answer questions. Agents build custom solutions over battle-tested libraries 12% of the time, hand-roll JWT auth, and show recency bias. Nobody else benchmarks decision quality.

Total Tests

Categories

LLM Calls (full run)

Score Dimensions

Difficulty Levels

Tool Selection

Does it pick the right library, or reinvent the wheel?

Risk Assessment

Does it identify high-risk decisions and escalate?

Abstention Quality

Does it say "I don't know" when appropriate?

Tradeoff Reasoning

Does it acknowledge tradeoffs explicitly?

Consistency

Does social pressure change its decision?

Model: Category: Max Tests:

Leaderboard

No runs yet. Run the benchmark to populate.

Agent Decision Quality Benchmarks

Decision Quality Results

Key Findings

Individual Test Results

Leaderboard