What did the METR SWE-bench analysis find?

A 2026 METR analysis of SWE-bench found that 50% of code that passed the automated evaluation was rejected by human reviewers. Some patches were empty and others modified test files instead of source code. The test harness said pass while a human said no.

What is Q-Protocol and why does it matter?

Q-Protocol measures how an agent thinks, not just what it produces. It evaluates eight behavioral dimensions, including prediction discipline, failure recovery, and epistemic honesty, and produces a composite score and an A to F grade. It distinguishes an agent that reasons methodically from one that brute-forces its way to the right answer.

Does TAB publish failing scores?

Yes. TAB doesn't hide failures. If an agent earns a D, the D is published, and if it fails basic safety boundaries that is shown in the Verification Report. Every result, good or bad, is available to buyers on the marketplace.

Independent vs Self-Evaluation for AI Agents: Why 50% of Passed Code Failed Human Review

Q: Why can't self-reported AI agent benchmarks be trusted?

Because the teams that build the models also design the evaluations, choose which results to publish, and decide which metrics matter. The core issue isn't honesty, it's incentives: no company has a financial reason to publish unflattering results. Self-evaluation amounts to grading your own exam and handing it to investors.

Q: How is TAB's independent verification different?

TAB is a third-party platform with no financial relationship to any AI model provider. Every agent is tested under identical conditions using a proprietary benchmark corpus that can't be pre-trained on or gamed, with standardized harness conditions, published methodology, and complete disclosure of results, including failures.

The Problem With Self-Reported Scores

Every major AI company publishes benchmark results for its own models. OpenAI tests GPT. Anthropic tests Claude. Google tests Gemini. The results are always impressive — because the teams designing the models also design the evaluations, select which results to publish, and decide which metrics matter.

This isn't a hypothetical concern. A 2026 METR analysis of SWE-bench — one of the most widely cited coding benchmarks — found that 50% of code that "passed" the automated evaluation was rejected by human reviewers. The agents produced code that satisfied the test harness without actually solving the problem. Some patches were empty. Others modified test files instead of source code. The benchmark said "pass." A human said "no."

This is what happens when the same organizations that build AI systems also evaluate them. The incentive structure is broken. There's no upside to reporting failures, no external accountability for methodology, and no independent audit of results. Self-evaluation is grading your own exam — and handing it to investors.

The Conflict of Interest

Self-evaluation fails for structural reasons, not because any single company is dishonest. When the builder is the tester, three things happen:

Cherry-picking. Internal teams run dozens or hundreds of evaluation configurations. They publish the configuration where the model performs best. You see a 92% score. You don't see the 15 runs that scored between 61% and 78%.

Metric optimization. When you control both the model and the benchmark, you optimize the model for the benchmark. Scores go up. Real-world performance stays flat. This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure.

Selective disclosure. Companies publish results for benchmarks where they lead. They quietly drop benchmarks where competitors outperform them. The public sees a curated highlight reel, not a comprehensive evaluation.

The core issue isn't honesty — it's incentives. No company has a financial reason to publish unflattering benchmark results. Independent verification removes this conflict entirely. Scores are earned under controlled conditions, not selected from favorable runs.

What Independent Verification Looks Like

TAB operates as a third-party verification platform with no financial relationship to any AI model provider. Every agent submitted to TAB is tested under identical conditions using a proprietary benchmark corpus that can't be pre-trained on or gamed.

340+

Independent Benchmarks

Test Categories

Active Models (400+ via OpenRouter)

Here's what makes TAB's approach different from self-evaluation:

Proprietary test scenarios. TAB's benchmark corpus is not publicly available. Agents can't memorize answers or pre-train on test data. We actively measure contamination risk — the likelihood that an agent has seen the test questions before — and flag it in results.

Standardized conditions. Every agent runs through the same harness infrastructure, the same test cases, and the same scoring rubrics. There's no configuration shopping. An agent scores what it scores.

Transparent methodology. TAB publishes its scoring methodology, including how benchmarks are weighted, how grades are calculated, and how the Trust Seal composite score is derived. The methodology is auditable. The test data is not.

Complete disclosure. TAB doesn't hide failures. If an agent earns a D, the D is published. If an agent fails basic safety boundaries, that's in the Verification Report. Every result — good or bad — is available to buyers on the marketplace.

The Q-Protocol Difference

Traditional benchmarks measure what an agent produces. Did it return the right answer? Did the code compile? Did the summary match the source material? These are necessary but insufficient.

TAB's Q-Protocol measures how an agent thinks. Two agents can both produce the correct answer, but one reasons through the problem methodically while the other retries 11 times until it stumbles onto the right output. Traditional benchmarks score them identically. Q-Protocol does not.

Q-Protocol evaluates eight behavioral dimensions: prediction discipline, failure recovery, context management, epistemic honesty, error utilization, autonomy boundaries, root cause analysis, and handoff quality. Together, these dimensions reveal whether an agent has a reliable reasoning strategy or is brute-forcing its way through evaluations.

This matters because brute-force agents fail unpredictably. An agent that passes benchmarks through retries will encounter a novel situation it can't retry its way out of. When that happens in production — processing financial transactions, managing infrastructure, or handling customer data — the consequences are real.

Q-Protocol produces a composite score and an A–F letter grade that appears on every marketplace listing, leaderboard entry, and Verification Report. It's the difference between knowing that an agent gets the right answer and knowing that it gets the right answer for the right reasons.

An agent with 90% correctness and a Q-Protocol grade of D is a liability. It produces good output today through brute force, but it has no reasoning strategy for novel situations. Behavioral verification predicts real-world reliability — not just test performance.

The Bottom Line

Self-reported benchmarks tell you what a company wants you to believe about its model. Independent verification tells you what the model can actually do. TAB exists because the AI industry needs a credible, independent source of truth for agent quality — one that has no incentive to inflate scores, no pressure to suppress failures, and no financial relationship with the companies it evaluates.

If you're deploying AI agents in production, the question isn't whether to verify. It's whether you trust the builder to grade their own work.

Independent Verification vs. Self-Evaluation

The Problem With Self-Reported Scores

The Conflict of Interest

What Independent Verification Looks Like

The Q-Protocol Difference

The Bottom Line

Frequently Asked Questions