286 benchmarks. 26 categories. 15 specialty pages. 1,330+ specialty tests. Independent verification across 52 AI models from 5 providers.
TAB runs agents against real test cases. Scores are not self-reported, not cherry-picked, and not based on demos. Every agent is tested under the same conditions with the same harnesses, and every result is independently verifiable.
Scores reflect actual performance. Grades include Ds; TAB does not inflate scores. If an agent fails, it fails publicly. The methodology behind every benchmark is transparent and published.
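To make "same conditions, same harness" concrete, here is a minimal sketch of what a fixed-conditions harness loop can look like. Everything in it, the TestCase fields, the run_suite function, and the toy grading rule, is a hypothetical illustration, not TAB's actual implementation.

```python
# Minimal sketch of a fixed-conditions benchmark harness. The TestCase
# fields, run_suite, and the toy grading rule are hypothetical illustrations,
# not TAB's actual code.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TestCase:
    case_id: str
    prompt: str     # identical prompt shown to every agent
    expected: str   # reference answer used for grading

def grade(output: str, expected: str) -> bool:
    """Toy pass/fail rule; a real harness would use a richer rubric."""
    return expected.strip().lower() in output.strip().lower()

def run_suite(agent: Callable[[str], str], cases: list[TestCase]) -> dict:
    """Run one agent over a fixed suite and keep a verifiable per-case log."""
    log = []
    for case in cases:
        output = agent(case.prompt)
        log.append({"case_id": case.case_id, "passed": grade(output, case.expected)})
    passed = sum(entry["passed"] for entry in log)
    return {"score": passed / len(cases), "cases": log}
```

Because every agent sees the identical suite and the per-case log is retained, anyone can re-run the cases and confirm a published score.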
Every category targets a specific failure mode. Here's what each one tests.
Does the agent think through problems correctly or jump to wrong conclusions?
Does the agent call the right tools with the right parameters?
Does the agent do what it's told, exactly as instructed?
Does the agent refuse harmful requests and stay within boundaries?
Does the agent fabricate facts, citations, or data?
Does the agent write functional, secure, production-quality code?
Can the agent solve math problems accurately without errors?
Does the agent remember what was said earlier in a conversation?
Does the agent change its answers under social pressure or to please the user?
Does the agent know what it doesn't know, or does it fake confidence?
Can the agent break down complex tasks and execute them in sequence?
Can the agent browse the web accurately and extract the right information?
Can the agent read and correctly interpret long documents?
Can the agent accurately read charts, images, and visual data?
Does the agent cite real sources, or fabricate references?
Does the agent hold up against prompt injection and manipulation attempts?
Can the agent correctly coordinate with other agents in a pipeline?
Does the agent know when to stop and ask a human versus acting on its own?
Does the agent correctly remember and update information, or corrupt it?
Does the agent make consistent decisions or cave when challenged?
Does the agent defer to fake credentials and false authority?
Does the agent handle contradictory instructions correctly?
Does the agent try to game benchmarks or manipulate its own scores?
Are benchmark results clean, or has the agent memorized the answers?
Does the agent correctly implement Model Context Protocol standards? (A minimal example check is sketched after this list.)
Does the agent disclose what it is, how it works, and what it can't do?
Does the agent resist prompt injection, toxic output, and basic attack vectors?
Can the agent work effectively alongside humans and other systems?
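For the Model Context Protocol item above, a conformance check typically inspects MCP's JSON-RPC messages. The sketch below shows one hypothetical check: build an initialize request and verify that the reply carries the fields the protocol expects. The helper names, the protocol revision string, and the client details are illustrative assumptions, not TAB's actual test code.

```python
# Hypothetical sketch of one MCP conformance probe: does an "initialize"
# reply carry the fields the protocol expects? Not TAB's actual test code.
import json

def make_initialize_request(request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 initialize request in MCP's message format."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",  # assumed protocol revision
            "capabilities": {},
            "clientInfo": {"name": "tab-checker", "version": "0.1"},  # hypothetical client
        },
    })

def check_initialize_reply(raw: str) -> list[str]:
    """Return the fields missing from the reply; an empty list looks conformant."""
    reply = json.loads(raw)
    problems = []
    if reply.get("jsonrpc") != "2.0":
        problems.append("missing or wrong jsonrpc version")
    result = reply.get("result", {})
    for field in ("protocolVersion", "capabilities", "serverInfo"):
        if field not in result:
            problems.append(f"result is missing {field}")
    return problems
```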
Each specialty page runs a focused suite of tests targeting one critical dimension of agent behavior.
TAB supports benchmarking across Anthropic, OpenAI, Google Gemini, xAI Grok, and OpenRouter. 58 models mapped across Core, Pro, Premium, and Ultra tiers.
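As a rough illustration of how a provider-and-tier model map can be expressed in configuration, here is a hypothetical sketch. The provider keys and tier names come from the line above; every model identifier is a placeholder, not TAB's actual mapping.

```python
# Hypothetical sketch of a provider/tier model map. Provider keys and tier
# names come from the text above; the model identifiers are placeholders,
# not TAB's actual mapping.
MODEL_MAP = {
    "anthropic":  {"core": ["claude-model-a"], "pro": ["claude-model-b"]},
    "openai":     {"premium": ["gpt-model-a"]},
    "gemini":     {"core": ["gemini-model-a"]},
    "grok":       {"ultra": ["grok-model-a"]},
    "openrouter": {"pro": ["routed-model-a"]},
}

def models_in_tier(tier: str) -> list[str]:
    """Collect every mapped model identifier assigned to the given tier."""
    return [
        model
        for tiers in MODEL_MAP.values()
        for model in tiers.get(tier, [])
    ]
```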