What TAB Actually Tests

340+ benchmarks. 26 categories. 30 specialty pages. 1,330+ specialty tests. Independent verification across 66 active models (400+ available via OpenRouter) from 20+ providers.

Not Vibes. Verified.
Five Real Findings from Independent AI Agent Testing
95 tests. 6 frontier models. No model scored above 70% on honesty. GPT-5.4 defers to fake authority 64% of the time.

How TAB Scoring Works

TAB runs agents against real test cases. Scores are not self-reported, not cherry-picked, and not based on demos. Every agent is tested under the same conditions with the same harnesses, and every result is independently verifiable.

Scores reflect actual performance. Grades include D's — TAB does not inflate scores. If an agent fails, it fails publicly. Every methodology is transparent and published.

Platinum: 90+
Gold: 75–89
Silver: 60–74
Bronze: 45–59
Unrated: <45
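
The tier bands above can be read as a simple threshold lookup. The following is a minimal sketch, not TAB's actual scoring code: the function name `tier_for` and the `TIERS` table are hypothetical, and treating each lower bound as inclusive for fractional scores (so 89.5 is Gold) is an assumption the page does not confirm.

```python
# Lower bounds copied from the tier bands above, highest first.
# Assumption: bounds are inclusive and fractional scores are allowed.
TIERS = [
    (90, "Platinum"),
    (75, "Gold"),
    (60, "Silver"),
    (45, "Bronze"),
]


def tier_for(score: float) -> str:
    """Return the tier name for a score on a 0-100 scale."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, name in TIERS:
        if score >= lower_bound:
            return name
    # Anything below the Bronze threshold is left unrated.
    return "Unrated"
```

Under these assumptions, `tier_for(78)` returns `"Gold"` and `tier_for(42)` returns `"Unrated"`.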

26 Benchmark Categories

Every category targets a specific failure mode. Here's what each one tests.

Reasoning & Logic

Does the agent think through problems correctly or jump to wrong conclusions?

Tool Use & API Calls

Does the agent call the right tools with the right parameters?

Instruction Following

Does the agent do what it's told, exactly as instructed?

Safety & Alignment

Does the agent refuse harmful requests and stay within boundaries?

Hallucination Detection

Does the agent make up facts, citations, or data that don't exist?

Code Generation

Does the agent write functional, secure, production-quality code?

Mathematical Reasoning

Can the agent solve math problems accurately without errors?

Context Retention

Does the agent remember what was said earlier in a conversation?

Sycophancy Detection

Does the agent change its answers under social pressure or to please the user?

Calibration & Uncertainty

Does the agent know what it doesn't know, or does it fake confidence?

Multi-Step Planning

Can the agent break down complex tasks and execute them in sequence?

Web Navigation

Can the agent browse the web accurately and extract the right information?

Document Understanding

Can the agent read and correctly interpret long documents?

Vision & Multimodal

Can the agent accurately read charts, images, and visual data?

Citation Accuracy

Does the agent cite real sources, or fabricate references?

Adversarial Robustness

Does the agent hold up against prompt injection and manipulation attempts?

Delegation & Orchestration

Can the agent correctly coordinate with other agents in a pipeline?

Autonomy Boundaries

Does the agent know when to stop and ask a human vs act on its own?

Memory Hallucination

Does the agent correctly remember and update information, or corrupt it?

Decision Under Pressure

Does the agent make consistent decisions or cave when challenged?

Authority Sycophancy

Does the agent defer to fake credentials and false authority?

Conflict Resolution

Does the agent handle contradictory instructions correctly?

Gaming Detection

Does the agent try to game benchmarks or manipulate its own scores?

Contamination Resistance

Are benchmark results clean, or has the agent memorized the answers?

MCP Compliance

Does the agent correctly implement Model Context Protocol standards?

Transparency

Does the agent disclose what it is, how it works, and what it can't do?

Security Screening

Does the agent resist prompt injection, toxic output, and basic attack vectors?

Collaboration

Can the agent work effectively alongside humans and other systems?


26 Deep-Dive Specialty Benchmarks

Each specialty page runs a focused suite of tests targeting one critical dimension of agent behavior.


The Harness System

TAB uses 100+ verified harnesses — the scaffolding that connects an agent to a benchmark. A harness defines how the test is run, what tools are available, and how responses are scored. The same model can score 42% with one harness and 78% with another. TAB benchmarks both the agent and the harness.
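
To make the idea concrete, here is a toy sketch of a harness as a data structure: it bundles the tools exposed to the agent with a scoring rule, and running the same agent under two harnesses gives different results. Every name here (`Harness`, `BenchmarkCase`, `toy_agent`, `exact_match`) is hypothetical and for illustration only; TAB's real harness format is not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical agent interface: (prompt, available tool names) -> response text.
Agent = Callable[[str, Sequence[str]], str]


@dataclass(frozen=True)
class BenchmarkCase:
    prompt: str
    expected: str


@dataclass(frozen=True)
class Harness:
    name: str
    tools: tuple[str, ...]               # which tools the agent may use
    scorer: Callable[[str, str], bool]   # (response, expected) -> passed?

    def run(self, agent: Agent, cases: Sequence[BenchmarkCase]) -> float:
        """Run every case through the agent and return the pass rate in percent."""
        if not cases:
            raise ValueError("at least one test case is required")
        passed = sum(
            self.scorer(agent(case.prompt, self.tools), case.expected)
            for case in cases
        )
        return 100.0 * passed / len(cases)


def exact_match(response: str, expected: str) -> bool:
    return response.strip() == expected


def toy_agent(prompt: str, tools: Sequence[str]) -> str:
    # Stub agent: gives a clean answer only when a calculator tool is offered.
    if "calculator" in tools:
        return "4"
    return "I think the answer is 4."


cases = [BenchmarkCase(prompt="What is 2 + 2?", expected="4")]
no_tools = Harness(name="no-tools", tools=(), scorer=exact_match)
with_tools = Harness(name="with-tools", tools=("calculator",), scorer=exact_match)

print(no_tools.run(toy_agent, cases))    # 0.0
print(with_tools.run(toy_agent, cases))  # 100.0
```

The agent is identical in both runs; only the harness changes, which is why a score is only meaningful alongside the harness that produced it.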

62 Natively Configured Models, 5 Providers

TAB supports benchmarking across Anthropic, OpenAI, Google Gemini, xAI Grok, and OpenRouter. 80 models (66 active, 400+ available via OpenRouter) mapped across Core, Pro, Premium, and Ultra tiers.


Frequently Asked Questions About AI Agent Benchmarks

How many benchmarks does TAB have?
TAB has 340+ benchmarks across 26+ categories, including 30 specialty test suites. Categories include security, reasoning, code generation, tool use, context engineering, sycophancy detection, memory hallucination, covert behavior detection, and more.
Can I benchmark my own agent on TAB?
Yes. Register at tabverified.ai, build or import your agent, and select benchmarks. Every agent gets one free security screening (15 tests). Additional benchmarks use pay-as-you-go credits starting at $0.03 per test case.
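
As a rough budgeting aid, the pay-as-you-go figure above can be turned into a lower-bound estimate. This is a sketch under stated assumptions: `minimum_cost` is a hypothetical helper, and because $0.03 is quoted as a starting price, the real cost of a given benchmark may be higher.

```python
from decimal import Decimal

# "Starting at" price per test case quoted above; treat it as a floor, not a quote.
MIN_PRICE_PER_CASE = Decimal("0.03")


def minimum_cost(paid_test_cases: int) -> Decimal:
    """Lower-bound cost in USD for a number of paid test cases."""
    if paid_test_cases < 0:
        raise ValueError("paid_test_cases must be non-negative")
    return MIN_PRICE_PER_CASE * paid_test_cases
```

For example, `minimum_cost(200)` gives `Decimal("6.00")`, so 200 paid test cases cost at least $6.00.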
What makes TAB different from SWE-bench or other leaderboards?
TAB is independent (no financial relationship with any model provider), tests agents as configured (model + harness + config), includes 40 contamination detection tests, and publishes all grades, including failing ones. OpenAI stopped reporting SWE-bench Verified results over contamination concerns; TAB built in contamination resistance from day one.
What is the Q-Protocol behavioral score?
Q-Protocol measures HOW an agent thinks across 9 dimensions: Prediction Discipline, Failure Recovery, Context Discipline, Epistemic Honesty, Error Utilization, Autonomy Boundaries, Root Cause Analysis, Handoff Quality, and Reasoning Faithfulness. It runs automatically on every benchmark.
How does TAB handle harness configurations?
TAB has 101 harness configurations (21 agent-level + 80 platform-level). The same model scores 42% with one harness and 78% with another. TAB measures the complete stack, not just the model, because the harness is often the determining factor in agent performance.