The Harness Effect: Why AI Benchmark Results Vary by Configuration

A Benchmark Score Is Not a Property of the Model

This is TAB's most important and most counterintuitive finding: a benchmark score belongs to the agent stack — model plus harness plus configuration — not to the model alone. The same underlying model, on the same benchmark, can score 42% in one configuration and 78% in another. That 36-point swing comes entirely from the harness: the prompt, the tools, the context window, and the retry policy wrapped around the model. TAB exists to measure this honestly, which is why it runs 340+ benchmarks across 26+ categories against 88 models over 101 harness configurations — not one config per model.

If a score depends this heavily on configuration, then any evaluation that reports a single number without naming the harness is, at best, incomplete — and at worst, cherry-picked.


What "Harness" Means

The harness is everything around the model that shapes how it performs a task:

Prompt structure. System instructions, formatting, and how the task is framed.
Tool access. Which tools are available, how they're described, and how results are returned.
Context management. How much history is retained, summarized, or truncated.
Control flow. Retry policies, planning steps, and how failures are handled.

Change any of these and the score moves. A model that "fails" a benchmark in one harness may pass it comfortably in another — which means harness choices, not just model choice, determine production reliability.
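To make the four dimensions concrete, here is a minimal sketch of how a harness configuration might be recorded alongside a model. The class names, field names, and default values are all hypothetical and chosen for illustration; they are not TAB's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of the four harness dimensions described above."""
    system_prompt: str                 # prompt structure
    tools: tuple[str, ...] = ()        # tool access
    max_context_tokens: int = 8000     # context management (assumed default)
    max_retries: int = 0               # control flow


@dataclass(frozen=True)
class AgentStack:
    """The unit a score belongs to: a model plus the harness around it."""
    model: str
    harness: HarnessConfig


def changed_dimensions(a: HarnessConfig, b: HarnessConfig) -> list[str]:
    """Return the names of the harness fields that differ between a and b."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Two illustrative harnesses around the same (placeholder) model name.
bare = HarnessConfig(system_prompt="Solve the task.")
tooled = HarnessConfig(
    system_prompt="Solve the task.",
    tools=("shell", "file_search"),
    max_retries=2,
)

print(changed_dimensions(bare, tooled))  # ['tools', 'max_retries']
print(AgentStack("model-x", bare) == AgentStack("model-x", tooled))  # False
```

Treating the model and harness as one frozen record means two runs of the same model compare as different stacks whenever any harness field differs, which matches the idea that a score is only meaningful when the harness is named.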

Why TAB Tests 101 Harness Configurations

Because a single configuration is a single point on a wide curve, TAB maps the curve. Running 101 harness configurations reveals two things a one-shot benchmark cannot: the model's best achievable score and its stability — how much performance swings as the harness changes.

Stability is the part buyers usually miss. A model that scores 78% in one narrow configuration but collapses to 42% in most others is a fragile production bet. A model that scores 70% consistently across configurations is far more predictable. The harness effect is why TAB reports ranges and configurations, not lone headline percentages.
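The difference between a best score and a stable score can be sketched with a few summary statistics. The score lists below are illustrative numbers built from the figures in the text, not TAB data, and the `summarize` helper and its choice of statistics are assumptions rather than TAB's reporting method.

```python
from statistics import mean, pstdev


def summarize(scores: list[float]) -> dict[str, float]:
    """Summarize one model's scores (in percent) across harness configurations."""
    return {
        "best": max(scores),
        "worst": min(scores),
        "swing": max(scores) - min(scores),  # range, in percentage points
        "mean": mean(scores),
        "stdev": pstdev(scores),             # population standard deviation
    }


# Illustrative numbers only.
fragile = [78, 42, 42, 42, 42]  # one strong configuration, weak elsewhere
steady = [70, 70, 70, 70, 70]   # the same score in every configuration

for name, scores in (("fragile", fragile), ("steady", steady)):
    s = summarize(scores)
    print(f"{name}: best={s['best']} swing={s['swing']} mean={s['mean']:.1f}")
# fragile: best=78 swing=36 mean=49.2
# steady: best=70 swing=0 mean=70.0
```

Ranked by best score alone, the fragile model wins. Ranked by mean or by swing, the steady model wins, which is the case for reporting a range rather than a single headline number.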

101 harness configurations | 36 pts observed swing (42% → 78%) | 340+ benchmarks

Why "As Configured" Matters

TAB scores agents "as configured" — model plus harness plus configuration — because that's the only score that predicts production behavior. The vendor's best-case number tells you what's possible under ideal conditions. The "as configured" score tells you what to expect under yours.

This reframes how to read every benchmark you encounter. The right questions are: which harness produced this number, and how much does it move across configurations? An evaluation that can't answer those isn't measuring the model — it's marketing one configuration.

The harness effect is the strongest argument for independent verification. When a vendor controls the harness, they can pick the configuration that scores 78% and never mention the one that scores 42%. Independent testing across 101 configurations removes that option.

For the complete scoring model — how harness configurations are defined, weighted, and reported — see the TAB methodology. To see the effect across models, browse the benchmarks overview or compare standings on the leaderboard.
