The Harness Effect

Why the same AI model can score 42% or 78% on the same benchmark — and why a single number is never the whole story.

A Benchmark Score Is Not a Property of the Model

This is TAB's most important and most counterintuitive finding: a benchmark score belongs to the agent stack — model plus harness plus configuration — not to the model alone. The same underlying model, on the same benchmark, can score 42% in one configuration and 78% in another. That 36-point swing comes entirely from the harness: the prompt, the tools, the context window, and the retry policy wrapped around the model. TAB exists to measure this honestly, which is why it runs 340+ benchmarks across 26+ categories against 85 models over 101 harness configurations — not one config per model.

If a score depends this heavily on configuration, then any evaluation that reports a single number without naming the harness is, at best, incomplete — and at worst, cherry-picked.


What "Harness" Means

The harness is everything around the model that shapes how it performs a task, including the prompt, the tools, the context window, and the retry policy.

Change any of these and the score moves. A model that "fails" a benchmark in one harness may pass it comfortably in another — which means harness choices, not just model choice, determine production reliability.
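
One way to make this concrete is to treat the harness as data. The sketch below is illustrative only: `HarnessConfig`, `AgentStack`, and every field name and value are hypothetical, chosen to mirror the four components named earlier (prompt, tools, context window, retry policy), and are not TAB's actual schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical harness description; field names are assumptions."""
    system_prompt: str
    tools: tuple[str, ...]
    context_window_tokens: int
    max_retries: int


@dataclass(frozen=True)
class AgentStack:
    """The unit that actually gets scored: model plus harness."""
    model: str
    harness: HarnessConfig


baseline = HarnessConfig(
    system_prompt="You are a careful coding agent.",
    tools=("shell",),
    context_window_tokens=8_000,
    max_retries=0,
)

# Changing a single field produces a different harness, hence a different stack.
with_retries = replace(baseline, max_retries=3)

stack_a = AgentStack(model="example-model", harness=baseline)
stack_b = AgentStack(model="example-model", harness=with_retries)

# Same model, different stacks: any score should be keyed by the stack.
print(stack_a.model == stack_b.model)  # True
print(stack_a == stack_b)              # False
```

Because both dataclasses are frozen and hold only hashable fields, a stack can serve as a dictionary key, so a result is always stored against the full configuration rather than the model name alone.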


Why TAB Tests 101 Harness Configurations

Because a single configuration is a single point on a wide curve, TAB maps the curve. Running 101 harness configurations reveals two things a one-shot benchmark cannot: the model's best achievable score and its stability — how much performance swings as the harness changes.

Stability is the part buyers usually miss. A model that scores 78% in one narrow configuration but collapses to 42% in most others is a fragile production bet. A model that scores 70% consistently across configurations is far more predictable. The harness effect is why TAB reports ranges and configurations, not lone headline percentages.
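
The comparison above can be sketched in a few lines. This is a minimal illustration, not TAB's scoring model: the `summarize` helper, the configuration names, and the use of swing (best minus worst) and median as stability indicators are all assumptions, and the scores simply restate the hypothetical 78%/42% and 70% models from the paragraph above.

```python
from statistics import median


def summarize(scores_by_config: dict[str, float]) -> dict[str, float]:
    """Summarize one model's scores (in percent) across harness configurations."""
    values = list(scores_by_config.values())
    best, worst = max(values), min(values)
    return {
        "best": best,            # best achievable score
        "worst": worst,
        "swing": best - worst,   # how far performance moves with the harness
        "median": median(values),
    }


# Illustrative numbers only: one strong configuration, weak elsewhere.
fragile = {"config-a": 78.0, "config-b": 42.0, "config-c": 42.0, "config-d": 42.0}
# Illustrative numbers only: the same score in every configuration.
stable = {"config-a": 70.0, "config-b": 70.0, "config-c": 70.0, "config-d": 70.0}

fragile_summary = summarize(fragile)
stable_summary = summarize(stable)

print(fragile_summary)  # best 78.0, worst 42.0, swing 36.0, median 42.0
print(stable_summary)   # best 70.0, worst 70.0, swing 0.0, median 70.0
```

Ranked by best score alone, the fragile model wins. Ranked by median or by swing, the stable model does, which is the distinction a lone headline percentage hides.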

At a glance: 101 harness configurations, a 36-point observed swing (42% → 78%), and 340+ benchmarks.

Why "As Configured" Matters

TAB scores agents "as configured" — model plus harness plus configuration — because that's the only score that predicts production behavior. The vendor's best-case number tells you what's possible under ideal conditions. The "as configured" score tells you what to expect under yours.

This reframes how to read every benchmark you encounter. The right questions are: which harness produced this number, and how much does it move across configurations? An evaluation that can't answer those isn't measuring the model — it's marketing one configuration.
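
Those two questions can be turned into a simple check on any reported result. The sketch below is hypothetical throughout: `ReportedScore`, its fields, and `open_questions` are illustrative names, not part of any TAB API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReportedScore:
    """Hypothetical record for a published benchmark result."""
    benchmark: str
    score: float
    harness: Optional[str] = None                       # which harness produced it
    score_range: Optional[tuple[float, float]] = None   # (min, max) across configs


def open_questions(report: ReportedScore) -> list[str]:
    """Return the questions a reader still has to ask about this result."""
    questions = []
    if report.harness is None:
        questions.append("Which harness produced this number?")
    if report.score_range is None:
        questions.append("How much does it move across configurations?")
    return questions


# A lone headline number leaves both questions open.
headline = ReportedScore(benchmark="example-bench", score=78.0)
# A result that names its harness and range leaves none.
documented = ReportedScore(
    benchmark="example-bench",
    score=78.0,
    harness="config-a",
    score_range=(42.0, 78.0),
)

print(open_questions(headline))
print(open_questions(documented))  # []
```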

The harness effect is the strongest argument for independent verification. When a vendor controls the harness, they can pick the configuration that scores 78% and never mention the one that scores 42%. Independent testing across 101 configurations removes that option.

For the complete scoring model — how harness configurations are defined, weighted, and reported — see the TAB methodology. To see the effect across models, browse the benchmarks overview or compare standings on the leaderboard.


© 2026 TAB Platform LLC. All rights reserved.