Task Presentation
Each test presents a realistic professional task: write code, analyze data, draft communications, execute DevOps tasks, or provide strategic reasoning.
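To make this concrete, here is one way a single task might be represented. The schema below is an illustrative sketch; the field names are assumptions, not the benchmark's actual format.

```python
# Hypothetical task specification. Field names are illustrative
# assumptions, not the benchmark's actual schema.
task = {
    "id": "devops-042",            # hypothetical identifier
    "category": "devops",          # code | communication | data_analysis
                                   # | devops | strategy
    "prompt": (
        "Write a deployment script that rolls a new build out to staging, "
        "verifies health checks, and rolls back on failure."
    ),
    "automated_check": "script exits 0 and staging reports healthy",
    "human_review_question": "Would a DevOps engineer approve this?",
}
```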
The Benchmark That Benchmarks Benchmarks
A 2026 METR study found that 50% of code that "passed" automated benchmarks was rejected by human reviewers. This meta-benchmark measures a single question: would a human actually accept your agent's work?
Traditional benchmarks ask "did the agent produce the right output?" Human Rejection Rate asks "would a human accept this output in a real work context?"
An agent can produce technically correct code that's unreadable, technically accurate text that's unusable, and technically valid SQL that's unmaintainable. Automated benchmarks say PASS. Humans say REJECT.
A Human Rejection Rate of 0% means everything the agent gets technically correct is also acceptable to a human. A rate of 50% means half of the agent's "correct" work would be thrown away and rewritten, matching the METR finding.
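Concretely, the metric is the share of outputs that pass automated checks but are then rejected by a human reviewer. Here is a minimal sketch of the arithmetic, assuming a simple list of per-task results (the function and field names are illustrative, not the benchmark's actual API):

```python
def human_rejection_rate(results):
    """Fraction of automated-check passes that a human reviewer rejected.

    Assumes `results` is a list of dicts with boolean fields
    `passed_automated` and `accepted_by_human`; an illustrative shape,
    not the benchmark's actual data format.
    """
    passing = [r for r in results if r["passed_automated"]]
    if not passing:
        return 0.0
    rejected = sum(1 for r in passing if not r["accepted_by_human"])
    return rejected / len(passing)

# Four outputs pass automated checks; humans reject two of them,
# so HRR = 2/4 = 0.5, the rate the METR study reported.
results = [
    {"passed_automated": True,  "accepted_by_human": True},
    {"passed_automated": True,  "accepted_by_human": False},
    {"passed_automated": True,  "accepted_by_human": True},
    {"passed_automated": True,  "accepted_by_human": False},
    {"passed_automated": False, "accepted_by_human": False},  # not counted
]
print(human_rejection_rate(results))  # 0.5
```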
Evaluation Dimensions

- Code: Would a senior developer accept this in a code review? Tests variable naming, decomposition, documentation, security, and best practices.
- Communication: Would a human editor publish this? Tests tone, audience awareness, clarity, actionability, and professional polish.
- Data analysis: Would an analyst present these findings? Tests insight depth, statistical rigor, caveats, and actionable recommendations.
- DevOps: Would a DevOps engineer approve this? Tests safety practices, logging, error handling, rollback, and documentation.
- Strategic reasoning: Would a decision-maker trust this? Tests nuance, alternatives, uncertainty acknowledgment, and decision framework quality.
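One way to make these review questions operational is a per-dimension rubric that pairs each question with its criteria. The structure below is an assumed sketch, not the benchmark's actual schema, and the all-criteria-must-hold acceptance rule is a simplification; real reviewers may weigh criteria.

```python
# Illustrative rubric structure: an assumed shape, not the benchmark's
# actual schema. Each dimension pairs its review question with criteria.
RUBRIC = {
    "code": {
        "question": "Would a senior developer accept this in a code review?",
        "criteria": ["variable naming", "decomposition", "documentation",
                     "security", "best practices"],
    },
    "devops": {
        "question": "Would a DevOps engineer approve this?",
        "criteria": ["safety practices", "logging", "error handling",
                     "rollback", "documentation"],
    },
    # communication, data_analysis, and strategy follow the same shape.
}

def reviewer_accepts(dimension: str, scores: dict[str, bool]) -> bool:
    """All-or-nothing acceptance: every criterion for the dimension
    must hold. Strict conjunction is an assumption of this sketch."""
    return all(scores.get(c, False) for c in RUBRIC[dimension]["criteria"])
```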