Why Independent Comparison Matters
When every lab grades its own model, "Model A beats Model B" usually means "Model A's team built a benchmark Model A wins." There is no shared yardstick. TAB provides one: an independent platform that runs 85 models through the same 340+ benchmarks across 26+ categories and 101 harness configurations, with every open-ended answer graded by a neutral GLM-5 judge. Because the conditions are identical for every model, a comparison on TAB measures the models — not the marketing.
This matters for anyone choosing a model for production. A vendor chart tells you how a model does on the vendor's favorite test. An independent comparison tells you how it does against its competitors on the same test, scored by the same judge.
Same Harness, Fair Fight
A model's score depends heavily on the harness around it: the prompt, the tools, the context window, the retry policy. If two models are compared under different harnesses, the comparison is meaningless. TAB fixes the harness and varies only the model, then repeats across all 101 configurations to map the full range of each model's scores.
This produces two things a vendor chart never does: a head-to-head score under identical conditions, and a stability profile showing how much each model's performance swings as the configuration changes. A model that scores well only in one narrow configuration is a riskier production bet than one that scores consistently across many.
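As a rough illustration of those two outputs, the sketch below ranks models under one fixed configuration and then summarizes how each model's score moves across configurations. The model names, configuration names, and scores are invented for the example, and the functions `head_to_head` and `stability_profile` are hypothetical helpers, not part of TAB.

```python
from statistics import mean, pstdev

# Hypothetical data: SCORES[model][config] = percent correct.
# None of these numbers come from TAB; they only illustrate the idea.
SCORES = {
    "model_a": {"cfg_1": 70.0, "cfg_2": 72.0, "cfg_3": 68.0},
    "model_b": {"cfg_1": 90.0, "cfg_2": 50.0, "cfg_3": 55.0},
}

def head_to_head(scores, config):
    """Rank models by score under one fixed harness configuration."""
    rows = [(model, by_cfg[config]) for model, by_cfg in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

def stability_profile(scores):
    """Summarize each model's score spread across all configurations."""
    profile = {}
    for model, by_cfg in scores.items():
        vals = list(by_cfg.values())
        profile[model] = {
            "mean": mean(vals),
            "min": min(vals),
            "max": max(vals),
            "swing": max(vals) - min(vals),  # best minus worst, in points
            "stdev": pstdev(vals),
        }
    return profile

if __name__ == "__main__":
    print(head_to_head(SCORES, "cfg_1"))
    for model, stats in stability_profile(SCORES).items():
        print(model, stats)
```

In this made-up data, model_b wins the head-to-head under cfg_1, but its 40-point swing makes it the less predictable choice next to model_a's 4-point swing.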
The Harness Effect: 42% vs. 78%
The harness matters more than a single headline score suggests. In TAB's testing, the same model scored 42% under one harness configuration and 78% under another: a 36-point swing with no change to the model itself. A vendor benchmark that reports a single number is, at best, reporting one point on this curve, and usually the most favorable one.
Independent comparison surfaces the whole curve. It's why a responsible model comparison reports a range and a configuration, not a lone headline percentage.
A single benchmark number is a snapshot, not a measurement. The same model can look excellent or mediocre depending on its harness. Comparing models honestly requires holding the harness constant and reporting the distribution — which is exactly what independent testing does.
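One way to report a range and a configuration instead of a lone percentage is sketched below. The 42% and 78% figures are the ones quoted above; the model name, the configuration names, and the `summarize` helper are hypothetical and only show the shape of such a report.

```python
def summarize(model, results):
    """Format a score range with the configurations that produced it.

    results maps a configuration name to a score in percent.
    """
    worst_cfg = min(results, key=results.get)
    best_cfg = max(results, key=results.get)
    low, high = results[worst_cfg], results[best_cfg]
    return (
        f"{model}: {low:.0f}% to {high:.0f}% "
        f"({high - low:.0f}-point swing; "
        f"low under {worst_cfg}, high under {best_cfg})"
    )

if __name__ == "__main__":
    # Configuration names are placeholders, not real TAB configurations.
    print(summarize("model_x", {"harness_a": 42.0, "harness_b": 78.0}))
```

A report in this form keeps the best-case number attached to the configuration that produced it, so it cannot be read as the model's score in general.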
The dedicated harness effect page has the full breakdown, and the methodology explains how TAB derives comparable scores.
See the Comparison
Live, independent standings across models are published on the TAB leaderboard, updated as new runs complete. To compare specific behaviors — hallucination, sycophancy, tool use, security — browse the full catalog on the benchmarks overview and run any benchmark against the models you're evaluating.