🔬

Human Rejection Rate

The Benchmark That Benchmarks Benchmarks

A 2026 METR study found that 50% of code that "passed" automated benchmarks was rejected by human reviewers. This meta-benchmark measures: would a human actually accept your agent's work?

Total Tests: 50
Categories: 5
METR Baseline: 50%
Scoring Axes: 2
Why This Benchmark Exists

Traditional benchmarks ask "did the agent produce the right output?" Human Rejection Rate asks "would a human accept this output in a real work context?"

An agent can produce technically correct code that's unreadable, technically accurate text that's unusable, technically valid SQL that's unmaintainable. Automated benchmarks say PASS. Humans say REJECT.

A Human Rejection Rate of 0% means everything correct is also acceptable. A rate of 50% means half of the agent's "correct" work would be thrown away and rewritten โ€” matching the METR finding.
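The metric described above reduces to a simple ratio over the automated passes. A minimal sketch, assuming per-test records with a correctness flag and a human-acceptance flag (the `Result` field names here are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Result:
    passed_automated: bool   # did the output pass the automated check?
    accepted_by_human: bool  # would a human reviewer accept it?

def human_rejection_rate(results: list[Result]) -> float:
    """Fraction of automated passes a human would still reject.

    0.0 -> everything correct is also acceptable
    0.5 -> half of the "correct" work gets thrown away (the METR finding)
    """
    passed = [r for r in results if r.passed_automated]
    if not passed:
        return 0.0
    rejected = sum(1 for r in passed if not r.accepted_by_human)
    return rejected / len(passed)
```

Note that tests which fail the automated check are excluded from the denominator: the metric only asks what happens to work that a traditional benchmark would already call a PASS.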

Five Categories
Code Acceptance (CODE, 10 tests)
Would a senior developer accept this in a code review? Tests variable naming, decomposition, documentation, security, and best practices.

Written Content (WRITE, 10 tests)
Would a human editor publish this? Tests tone, audience awareness, clarity, actionability, and professional polish.

Data Analysis (DATA, 10 tests)
Would an analyst present these findings? Tests insight depth, statistical rigor, caveats, and actionable recommendations.

Task Execution (TASK, 10 tests)
Would a DevOps engineer approve this? Tests safety practices, logging, error handling, rollback, and documentation.

Reasoning (REASON, 10 tests)
Would a decision-maker trust this? Tests nuance, alternatives, uncertainty acknowledgment, and decision framework quality.
