What "Independent" Actually Means in 2026
Most AI benchmark numbers you see in 2026 are published by the same companies that build the models. TAB exists to be the opposite: an independent, third-party benchmark platform that tests agents under controlled conditions and publishes the results, good or bad. The platform runs 340+ benchmarks across 26+ categories, evaluates 85 models from 5 API providers across 101 harness configurations, and grades every open-ended response with an independent GLM-5 judge that has no knowledge of who authored the agent it is scoring.
Independence is structural, not a slogan. TAB has no equity in the labs it tests, earns nothing from a high score, and discloses failures the same way it discloses successes.
Self-Reported vs. Independent
The difference comes down to who controls the test. When a lab benchmarks its own model, it controls the configuration, the metric, and which results get published. The structural problems are well documented:
Cherry-picking. Internal teams run many configurations and publish the best one. You see a 92% headline and never see the 15 runs that landed between 61% and 78%.
Goodhart's Law. When a measure becomes a target, it stops being a good measure. A lab that owns both the model and the benchmark can tune the model to the benchmark, so scores rise while real-world performance does not.
Selective disclosure. Labs publish the benchmarks they win and quietly drop the ones they lose. The public sees a highlight reel.
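To make the cherry-picking problem concrete, here is a minimal sketch in Python. The 16 scores are hypothetical, chosen only to mirror the example above (one 92% run plus 15 runs between 61% and 78%); they are not TAB data, and `summarize` is an illustrative helper, not part of any TAB tooling.

```python
from statistics import mean, median

# Hypothetical percentage scores for 16 configurations of one model.
# Illustrative only: 15 runs between 61 and 78, plus one 92 outlier.
runs = [61, 63, 64, 66, 67, 68, 70, 71,
        72, 73, 74, 75, 76, 77, 78, 92]


def summarize(scores):
    """Report the whole distribution, not just the best run."""
    return {
        "n_runs": len(scores),
        "best": max(scores),
        "median": median(scores),
        "mean": mean(scores),
        "worst": min(scores),
    }


summary = summarize(runs)

# Gap between the number a lab would headline and the typical run.
headline_gap = summary["best"] - summary["median"]

print(summary)
print(f"headline vs. median gap: {headline_gap} points")
```

Publishing only `best` hides a gap of roughly 20 points between the headline and the typical run. Reporting the median and the run count alongside it is what makes the headline interpretable.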
Independent benchmarking removes the conflict. A score earned on TAB is earned under conditions the agent's author did not choose.
The test corpus is held out. TAB's benchmark scenarios are not public, so models can't be trained on the answers and agent authors can't shop for a favorable configuration. Contamination risk, the chance an agent has seen the questions before, is actively measured and flagged in every report.
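This page does not describe how TAB measures contamination. As an illustration only, one widely used heuristic is word n-gram overlap: the share of a held-out scenario's n-grams that also appear in some other text, such as a model's output or a public corpus. The function names, the default `n=5`, and the sample strings below are all assumptions made for this sketch.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, candidate_text, n=5):
    """Fraction of the benchmark's n-grams that also occur in candidate_text.

    Returns 0.0 when the benchmark text is shorter than n words.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(candidate_text, n)) / len(bench)


# Hypothetical strings, for illustration only.
scenario = "transfer the refund to the original payment method"
leaked = "the agent should transfer the refund to the original payment method today"
unrelated = "summarize the quarterly report in three bullet points please"

print(overlap_ratio(scenario, leaked))     # high overlap suggests exposure
print(overlap_ratio(scenario, unrelated))  # no shared 5-grams
```

A high ratio is a signal worth flagging rather than proof of contamination, since short or formulaic prompts can overlap with unrelated text by chance.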
Breadth Across Providers and Categories
Independence also means breadth. A benchmark tied to one provider's ecosystem can't make cross-provider claims. TAB tests 85 models from 5 API providers (plus 400+ more reachable via OpenRouter) through identical harnesses, so the comparison is apples-to-apples.
The 26+ categories span the full agent lifecycle: tool use, memory and hallucination, security, reasoning quality, sycophancy, covert behavior, delegation, data pipelines, multimodal tasks, and more. Browse the complete set on the benchmarks overview, and see live standings on the leaderboard.
How Scores Are Earned
Every agent runs the same pipeline. The GLM-5 judge applies the same rubric to every submission. TAB validates its own scale with a 135-test calibration suite and a 160-test adversarial audit before any external scores are trusted, and re-runs both continuously to catch drift.
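A drift check of this kind can be sketched as follows. This is a hedged illustration, not TAB's implementation: the `CalibrationCase` type, the 0-100 score scale, the mean-absolute-error metric, and the 5-point tolerance are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationCase:
    case_id: str
    expected_score: float  # score the rubric should produce (assumed 0-100 scale)
    judged_score: float    # score the judge actually produced on this run


def mean_abs_error(cases):
    """Average absolute gap between expected and judged scores."""
    if not cases:
        raise ValueError("calibration suite is empty")
    return sum(abs(c.judged_score - c.expected_score) for c in cases) / len(cases)


def has_drifted(cases, tolerance=5.0):
    """Flag the judge as drifted when the average error exceeds the tolerance."""
    return mean_abs_error(cases) > tolerance


# Hypothetical calibration results from two separate runs.
stable_run = [
    CalibrationCase("cal-001", expected_score=80.0, judged_score=81.0),
    CalibrationCase("cal-002", expected_score=40.0, judged_score=38.0),
]
drifted_run = [
    CalibrationCase("cal-001", expected_score=80.0, judged_score=90.0),
    CalibrationCase("cal-002", expected_score=40.0, judged_score=48.0),
]

print(has_drifted(stable_run))   # average error 1.5, within tolerance
print(has_drifted(drifted_run))  # average error 9.0, exceeds tolerance
```

Re-running the same fixed cases on a schedule turns judge reliability into a measurable quantity: if the error creeps upward, external scores graded in that window can be held back or re-graded.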
The full scoring model — weighting, grade derivation, contamination handling, and the harness effect — is documented on the methodology page. The test data stays private; the methodology is fully auditable.