Tests whether AI agents make correct decisions, not just whether they can answer questions. Agents build custom solutions over battle-tested libraries 12% of the time, hand-roll JWT auth, and show recency bias. Nobody else benchmarks decision quality.
Does it pick the right library, or reinvent the wheel?
Does it identify high-risk decisions and escalate?
Does it say "I don't know" when appropriate?
Does it acknowledge tradeoffs explicitly?
Does social pressure change its decision?