Task Presentation
Each test presents a realistic professional task: write code, analyze data, draft communications, execute DevOps tasks, or provide strategic reasoning.
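To make this concrete, here is one way a single task might be represented. The schema below is an illustrative sketch; the field names are assumptions, not the benchmark's actual format.

```python
# Hypothetical task specification. Field names are illustrative
# assumptions, not the benchmark's actual schema.
task = {
    "id": "devops-042",            # hypothetical identifier
    "category": "devops",          # code | communication | data_analysis
                                   # | devops | strategy
    "prompt": (
        "Write a deployment script that rolls a new build out to staging, "
        "verifies health checks, and rolls back on failure."
    ),
    "automated_check": "script exits 0 and staging reports healthy",
    "human_review_question": "Would a DevOps engineer approve this?",
}
```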
The Benchmark That Benchmarks Benchmarks
A 2026 METR study found that 50% of code that "passed" automated benchmarks was rejected by human reviewers. This meta-benchmark measures a single question: would a human actually accept your agent's work?
Traditional benchmarks ask "did the agent produce the right output?" Human Rejection Rate asks "would a human accept this output in a real work context?"
An agent can produce technically correct code that's unreadable, technically accurate text that's unusable, and technically valid SQL that's unmaintainable. Automated benchmarks say PASS. Humans say REJECT.
A Human Rejection Rate of 0% means everything the agent gets technically correct is also acceptable to a human. A rate of 50% means half of the agent's "correct" work would be thrown away and rewritten, matching the METR finding.
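Concretely, the metric is the share of outputs that pass automated checks but are then rejected by a human reviewer. Here is a minimal sketch of the arithmetic, assuming a simple list of per-task results (the function and field names are illustrative, not the benchmark's actual API):

```python
def human_rejection_rate(results):
    """Fraction of automated-check passes that a human reviewer rejected.

    Assumes `results` is a list of dicts with boolean fields
    `passed_automated` and `accepted_by_human`; an illustrative shape,
    not the benchmark's actual data format.
    """
    passing = [r for r in results if r["passed_automated"]]
    if not passing:
        return 0.0
    rejected = sum(1 for r in passing if not r["accepted_by_human"])
    return rejected / len(passing)

# Four outputs pass automated checks; humans reject two of them,
# so HRR = 2/4 = 0.5, the rate the METR study reported.
results = [
    {"passed_automated": True,  "accepted_by_human": True},
    {"passed_automated": True,  "accepted_by_human": False},
    {"passed_automated": True,  "accepted_by_human": True},
    {"passed_automated": True,  "accepted_by_human": False},
    {"passed_automated": False, "accepted_by_human": False},  # not counted
]
print(human_rejection_rate(results))  # 0.5
```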
Evaluation Dimensions

- Code: Would a senior developer accept this in a code review? Tests variable naming, decomposition, documentation, security, and best practices.
- Communication: Would a human editor publish this? Tests tone, audience awareness, clarity, actionability, and professional polish.
- Data analysis: Would an analyst present these findings? Tests insight depth, statistical rigor, caveats, and actionable recommendations.
- DevOps: Would a DevOps engineer approve this? Tests safety practices, logging, error handling, rollback, and documentation.
- Strategic reasoning: Would a decision-maker trust this? Tests nuance, alternatives, uncertainty acknowledgment, and decision framework quality.
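One way to make these review questions operational is a per-dimension rubric that pairs each question with its criteria. The structure below is an assumed sketch, not the benchmark's actual schema, and the all-criteria-must-hold acceptance rule is a simplification; real reviewers may weigh criteria.

```python
# Illustrative rubric structure: an assumed shape, not the benchmark's
# actual schema. Each dimension pairs its review question with criteria.
RUBRIC = {
    "code": {
        "question": "Would a senior developer accept this in a code review?",
        "criteria": ["variable naming", "decomposition", "documentation",
                     "security", "best practices"],
    },
    "devops": {
        "question": "Would a DevOps engineer approve this?",
        "criteria": ["safety practices", "logging", "error handling",
                     "rollback", "documentation"],
    },
    # communication, data_analysis, and strategy follow the same shape.
}

def reviewer_accepts(dimension: str, scores: dict[str, bool]) -> bool:
    """All-or-nothing acceptance: every criterion for the dimension
    must hold. Strict conjunction is an assumption of this sketch."""
    return all(scores.get(c, False) for c in RUBRIC[dimension]["criteria"])
```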