What TAB Actually Tests

286 benchmarks. 26 categories. 15 specialty pages. 1,330+ specialty tests. Independent verification across 52 AI models from 5 providers.

Not Vibes. Verified.

How TAB Scoring Works

TAB runs agents against real test cases. Scores are not self-reported, not cherry-picked, and not based on demos. Every agent is tested under the same conditions with the same harnesses, and every result is independently verifiable.

Scores reflect actual performance. Grades include D's — TAB does not inflate scores. If an agent fails, it fails publicly. Every methodology is transparent and published.

Platinum: 90+
Gold: 75–89
Silver: 60–74
Bronze: 45–59
Unrated: <45
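As a rough sketch, the cutoffs above map a numeric score to a tier with a simple threshold check. The function below is illustrative only and not part of TAB's published tooling.

    def tier_for_score(score: float) -> str:
        # Map a 0-100 TAB score to its tier label using the published cutoffs.
        if score >= 90:
            return "Platinum"
        if score >= 75:
            return "Gold"
        if score >= 60:
            return "Silver"
        if score >= 45:
            return "Bronze"
        return "Unrated"

    # A score of 78 lands in the 75-89 band, so it earns Gold.
    assert tier_for_score(78) == "Gold"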

26 Benchmark Categories

Every category targets a specific failure mode. Here's what each one tests.

Reasoning & Logic

Does the agent think through problems correctly or jump to wrong conclusions?

Tool Use & API Calls

Does the agent call the right tools with the right parameters?

Instruction Following

Does the agent do what it's told, exactly as instructed?

Safety & Alignment

Does the agent refuse harmful requests and stay within boundaries?

Hallucination Detection

Does the agent make up facts, citations, or data that don't exist?

Code Generation

Does the agent write functional, secure, production-quality code?

Mathematical Reasoning

Can the agent solve math problems accurately without errors?

Context Retention

Does the agent remember what was said earlier in a conversation?

Sycophancy Detection

Does the agent change its answers under social pressure or to please the user?

Calibration & Uncertainty

Does the agent know what it doesn't know, or does it fake confidence?

Multi-Step Planning

Can the agent break down complex tasks and execute them in sequence?

Web Navigation

Can the agent browse the web accurately and extract the right information?

Document Understanding

Can the agent read and correctly interpret long documents?

Vision & Multimodal

Can the agent accurately read charts, images, and visual data?

Citation Accuracy

Does the agent cite real sources, or fabricate references?

Adversarial Robustness

Does the agent hold up against prompt injection and manipulation attempts?

Delegation & Orchestration

Can the agent correctly coordinate with other agents in a pipeline?

Autonomy Boundaries

Does the agent know when to stop and ask a human versus when to act on its own?

Memory Hallucination

Does the agent correctly remember and update information, or corrupt it?

Decision Under Pressure

Does the agent make consistent decisions or cave when challenged?

Authority Sycophancy

Does the agent defer to fake credentials and false authority?

Conflict Resolution

Does the agent handle contradictory instructions correctly?

Gaming Detection

Does the agent try to game benchmarks or manipulate its own scores?

Contamination Resistance

Are benchmark results clean, or has the agent memorized the answers?

MCP Compliance

Does the agent correctly implement Model Context Protocol standards?

Transparency

Does the agent disclose what it is, how it works, and what it can't do?

Security Screening

Does the agent resist prompt injection, toxic output, and basic attack vectors?

Collaboration

Can the agent work effectively alongside humans and other systems?


18 Deep-Dive Specialty Benchmarks

Each specialty page runs a focused suite of tests targeting one critical dimension of agent behavior.


The Harness System

TAB uses 102+ verified harnesses — the scaffolding that connects an agent to a benchmark. A harness defines how the test is run, what tools are available, and how responses are scored. The same model can score 42% with one harness and 78% with another. TAB benchmarks both the agent and the harness.
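To make the harness idea concrete, here is a minimal sketch in Python of what such scaffolding could look like. The class and field names are assumptions for illustration, not TAB's actual interfaces.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TestCase:
        prompt: str        # the task given to the agent
        expected: str      # reference answer or scoring key
        tools: list[str]   # tool names the harness exposes for this case

    @dataclass
    class Harness:
        # Scaffolding that connects an agent to a benchmark: it defines how the
        # test is run, which tools are available, and how responses are scored.
        name: str
        run_agent: Callable[[str, list[str]], str]  # (prompt, tools) -> response
        score: Callable[[str, str], float]          # (response, expected) -> 0.0 to 1.0

        def evaluate(self, cases: list[TestCase]) -> float:
            # Percentage of available points earned across all cases.
            total = sum(self.score(self.run_agent(c.prompt, c.tools), c.expected)
                        for c in cases)
            return 100.0 * total / len(cases)

Because run_agent and score belong to the harness rather than the model, swapping harnesses can move the same model's score substantially, which is why TAB reports results for the agent and the harness together.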

58 Models, 5 Providers

TAB supports benchmarking across Anthropic, OpenAI, Google Gemini, xAI Grok, and OpenRouter. 58 models mapped across Core, Pro, Premium, and Ultra tiers.
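One way to picture that mapping is a lookup from model identifier to provider and tier, as in the hypothetical sketch below; the model names and tier assignments are placeholders, not TAB's actual mapping.

    # Hypothetical mapping; identifiers and tier assignments are placeholders only.
    MODEL_TIERS: dict[str, tuple[str, str]] = {
        "example-model-small":    ("Anthropic", "Core"),
        "example-model-medium":   ("OpenAI", "Pro"),
        "example-model-large":    ("Google Gemini", "Premium"),
        "example-model-frontier": ("xAI Grok", "Ultra"),
    }

    def models_in_tier(tier: str) -> list[str]:
        # List the model identifiers mapped to a given tier.
        return [m for m, (_provider, t) in MODEL_TIERS.items() if t == tier]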
