AI Agent Benchmarks — 340+ Tests Across 26 Categories

How TAB Scoring Works

TAB runs agents against real test cases. Scores are not self-reported, not cherry-picked, and not based on demos. Every agent is tested under the same conditions with the same harnesses, and every result is independently verifiable.

Scores reflect actual performance. TAB does not inflate scores: grades include D's, and an agent that fails does so publicly. Every methodology is published.

Platinum: 90+
Gold: 75–89
Silver: 60–74
Bronze: 45–59
Unrated: <45
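The tier cutoffs above can be sketched as a small lookup. This is only an illustration, not TAB's implementation: the function name `tier_for`, the 0-100 scale, and the choice to treat each cutoff as an inclusive lower bound (so that a fractional score such as 89.5 falls in Gold) are assumptions.

```python
# Tier cutoffs copied from the list above. Treating each one as an inclusive
# lower bound is an assumption; the source only gives whole-number ranges.
TIERS = [
    (90, "Platinum"),
    (75, "Gold"),
    (60, "Silver"),
    (45, "Bronze"),
]


def tier_for(score: float) -> str:
    """Return the tier name for a score on an assumed 0-100 scale."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, name in TIERS:
        if score >= lower_bound:
            return name
    return "Unrated"


if __name__ == "__main__":
    for s in (92, 75, 74, 45, 44.9):
        print(s, tier_for(s))
```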

26 Benchmark Categories

Every category targets a specific failure mode. Here's what each one tests.

Reasoning & Logic

Does the agent think through problems correctly or jump to wrong conclusions?

Tool Use & API Calls

Does the agent call the right tools with the right parameters?

Instruction Following

Does the agent do what it's told, exactly as instructed?

Safety & Alignment

Does the agent refuse harmful requests and stay within boundaries?

Hallucination Detection

Does the agent make up facts, citations, or data that don't exist?

Code Generation

Does the agent write functional, secure, production-quality code?

Mathematical Reasoning

Can the agent solve math problems accurately?

Context Retention

Does the agent remember what was said earlier in a conversation?

Sycophancy Detection

Does the agent change its answers under social pressure or to please the user?

Calibration & Uncertainty

Does the agent know what it doesn't know, or does it fake confidence?

Multi-Step Planning

Can the agent break down complex tasks and execute them in sequence?

Web Navigation

Can the agent browse the web accurately and extract the right information?

Document Understanding

Can the agent read and correctly interpret long documents?

Vision & Multimodal

Can the agent accurately read charts, images, and visual data?

Citation Accuracy

Does the agent cite real sources, or fabricate references?

Adversarial Robustness

Does the agent hold up against prompt injection and manipulation attempts?

Delegation & Orchestration

Can the agent correctly coordinate with other agents in a pipeline?

Autonomy Boundaries

Does the agent know when to stop and ask a human rather than act on its own?

Memory Hallucination

Does the agent correctly remember and update information, or corrupt it?

Decision Under Pressure

Does the agent make consistent decisions or cave when challenged?

Authority Sycophancy

Does the agent defer to fake credentials and false authority?

Conflict Resolution

Does the agent handle contradictory instructions correctly?

Gaming Detection

Does the agent try to game benchmarks or manipulate its own scores?

Contamination Resistance

Are benchmark results clean, or has the agent memorized the answers?

MCP Compliance

Does the agent correctly implement Model Context Protocol standards?

Transparency

Does the agent disclose what it is, how it works, and what it can't do?

Security Screening

Does the agent resist prompt injection, toxic output, and basic attack vectors?

Collaboration

Can the agent work effectively alongside humans and other systems?

26 Deep-Dive Specialty Benchmarks

Each specialty page runs a focused suite of tests targeting one critical dimension of agent behavior.

Sycophancy: 95 tests across 10 dimensions of people-pleasing behavior

Hallucination: Factual accuracy and fabrication detection

Adversarial Robustness: Prompt injection and manipulation resistance

Explainability: Does the agent explain its reasoning clearly?

Vision & Multimodal: Chart and image interpretation accuracy

Code Security: Vulnerability detection in agent-generated code

Mathematical Reasoning: Numerical accuracy across problem types

Citation Validator: Real vs fabricated reference detection

Collaboration: Multi-agent and human-agent coordination

Context Engineering: Long-context retention and retrieval

Web Agent: Live web navigation and extraction accuracy

Decision Quality: Consistency and correctness under pressure

Delegation Chain: Multi-agent pipeline verification

Contamination Resistance: Benchmark integrity and clean scoring

HaluMem: Operation-level memory hallucination detection (80 tests)

Error Recovery: Error message utilization, strategy diversity, retry storms, graceful degradation (40 tests)

Context Compaction: Factual retention, instruction persistence, contradiction detection, priority preservation after compression (50 tests)

RL Safety Drift: Safety under optimization pressure, covering shortcut resistance, guardrail erosion, reward hacking, and value alignment persistence (50 tests)

Data Source Provenance: Model supply chain verification, covering identity disclosure, training data transparency, geographic jurisdiction, and adversarial provenance resistance (50 tests)

The Harness System

TAB uses 100+ verified harnesses. A harness is the scaffolding that connects an agent to a benchmark: it defines how the test is run, what tools are available, and how responses are scored. The same model can score 42% with one harness and 78% with another, so TAB benchmarks both the agent and the harness.
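To make the harness effect concrete, here is a minimal sketch that groups runs by model and reports the gap between the best and worst harness. The `Run` record, the `harness_spread` function, and the placeholder model and harness names are hypothetical and not part of TAB; only the 42 and 78 figures come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run of a full stack (hypothetical record shape)."""
    model: str
    harness: str
    score: float  # percent, on an assumed 0-100 scale


def harness_spread(runs: list[Run]) -> dict[str, float]:
    """For each model, return its highest score minus its lowest across harnesses."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run.score)
    return {model: max(scores) - min(scores) for model, scores in by_model.items()}


if __name__ == "__main__":
    # Scores are the ones quoted in the text; the names are placeholders.
    runs = [
        Run(model="model-a", harness="harness-1", score=42.0),
        Run(model="model-a", harness="harness-2", score=78.0),
    ]
    print(harness_spread(runs))  # {'model-a': 36.0}
```

A spread of 36 points for a single model is why a score is only meaningful when the harness is reported alongside it.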

Frequently Asked Questions About AI Agent Benchmarks

How many benchmarks does TAB have?

TAB has 340+ benchmarks across 26+ categories, including 30 specialty test suites. Categories include security, reasoning, code generation, tool use, context engineering, sycophancy detection, memory hallucination, covert behavior detection, and more.

Can I benchmark my own agent on TAB?

Yes. Register at tabverified.ai, build or import your agent, and select benchmarks. Every agent gets one free security screening (15 tests). Additional benchmarks use pay-as-you-go credits starting at $0.03 per test case.
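As a rough budgeting aid, the pricing above can be turned into a lower-bound estimate. This is a sketch under stated assumptions: `minimum_cost` is a hypothetical helper, and because the text says credits start at $0.03 per test case, the result is a floor rather than a quote. The free 15-test security screening is excluded from the count.

```python
from decimal import Decimal

# "Starting at" price from the text, so real costs may be higher.
MIN_RATE_PER_CASE = Decimal("0.03")


def minimum_cost(paid_test_cases: int) -> Decimal:
    """Lower bound in USD for test cases beyond the free security screening."""
    if paid_test_cases < 0:
        raise ValueError("paid_test_cases must be non-negative")
    return MIN_RATE_PER_CASE * paid_test_cases


if __name__ == "__main__":
    print(minimum_cost(500))  # 15.00
```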

What makes TAB different from SWE-bench or other leaderboards?

TAB is independent (no financial relationship with any model provider), tests agents as configured (model + harness + config), includes 40 contamination detection tests, and publishes all grades including failing ones. SWE-bench was killed by OpenAI due to contamination — TAB built contamination resistance from day one.

What is the Q-Protocol behavioral score?

Q-Protocol measures HOW an agent thinks across 9 dimensions: Prediction Discipline, Failure Recovery, Context Discipline, Epistemic Honesty, Error Utilization, Autonomy Boundaries, Root Cause Analysis, Handoff Quality, and Reasoning Faithfulness. It runs automatically on every benchmark.
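For anyone consuming these results in code, the nine dimensions can be held as a fixed tuple and used to check that a report is complete. This is an illustrative sketch only: the report shape (a mapping from dimension name to score) and the `missing_dimensions` helper are assumptions, and the source does not say how the dimensions are scored or combined, so no aggregation is shown.

```python
# Dimension names copied from the answer above.
Q_PROTOCOL_DIMENSIONS = (
    "Prediction Discipline",
    "Failure Recovery",
    "Context Discipline",
    "Epistemic Honesty",
    "Error Utilization",
    "Autonomy Boundaries",
    "Root Cause Analysis",
    "Handoff Quality",
    "Reasoning Faithfulness",
)


def missing_dimensions(report: dict[str, float]) -> list[str]:
    """Return the dimensions absent from a report, in the order listed above."""
    return [name for name in Q_PROTOCOL_DIMENSIONS if name not in report]


if __name__ == "__main__":
    # Hypothetical partial report; the score values are placeholders.
    partial = {"Prediction Discipline": 0.0, "Failure Recovery": 0.0}
    print(missing_dimensions(partial))
```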

How does TAB handle harness configurations?

TAB has 101 harness configurations (21 agent-level + 80 platform-level). The same model scores 42% with one harness and 78% with another. TAB measures the complete stack, not just the model, because the harness is often the determining factor in agent performance.

62 Natively Configured Models, 5 Providers
