What TAB Actually Tests
340+ benchmarks. 26 categories. 20 specialty pages. 1,330+ specialty tests. Independent verification across 88 total models (48 active, 40 deprecated; 400+ available via OpenRouter) from 20+ providers.
How TAB Scoring Works
TAB runs agents against real test cases. Scores are not self-reported, not cherry-picked, and not based on demos. Every agent is tested under the same conditions with the same harnesses, and every result is independently verifiable.
Scores reflect actual performance. Grades include D's — TAB does not inflate scores. If an agent fails, it fails publicly. Every methodology is transparent and published.
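The "same conditions, same harnesses" idea can be illustrated with a minimal sketch. This is not TAB's actual harness: `BenchCase`, `run_suite`, `score_all`, and the exact-match pass rule are all hypothetical names and assumptions chosen for illustration. The point it shows is only that every agent receives the identical list of test cases and is scored by the same function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; TAB's real harness interface
# is not described in this document.
Agent = Callable[[str], str]


@dataclass(frozen=True)
class BenchCase:
    prompt: str
    expected: str


def run_suite(agent: Agent, cases: List[BenchCase]) -> float:
    """Return the fraction of cases the agent answers correctly (exact match)."""
    if not cases:
        raise ValueError("suite must contain at least one test case")
    passed = sum(1 for case in cases if agent(case.prompt).strip() == case.expected)
    return passed / len(cases)


def score_all(agents: Dict[str, Agent], cases: List[BenchCase]) -> Dict[str, float]:
    """Score every agent on the identical list of cases with the same scorer."""
    return {name: run_suite(agent, cases) for name, agent in agents.items()}


# Toy agents standing in for real models (assumption: an agent maps a prompt to text).
example_cases = [BenchCase("2 + 2", "4"), BenchCase("capital of France", "Paris")]
example_agents: Dict[str, Agent] = {
    "always_four": lambda prompt: "4",
    "lookup": lambda prompt: {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, ""),
}
print(score_all(example_agents, example_cases))
# prints {'always_four': 0.5, 'lookup': 1.0}
```

A failing agent shows up in the output with its low score rather than being filtered out, which mirrors the "if an agent fails, it fails publicly" principle.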
26 Benchmark Categories
Every category targets a specific failure mode. Here's what each one tests.
Reasoning & Logic
Does the agent think through problems correctly or jump to wrong conclusions?
Tool Use & API Calls
Does the agent call the right tools with the right parameters?
Instruction Following
Does the agent do what it's told, exactly as instructed?
Safety & Alignment
Does the agent refuse harmful requests and stay within boundaries?
Hallucination Detection
Does the agent make up facts, citations, or data that don't exist?
Code Generation
Does the agent write functional, secure, production-quality code?
Mathematical Reasoning
Can the agent solve math problems accurately without errors?
Context Retention
Does the agent remember what was said earlier in a conversation?
Sycophancy Detection
Does the agent change its answers under social pressure or to please the user?
Calibration & Uncertainty
Does the agent know what it doesn't know, or does it fake confidence?
Multi-Step Planning
Can the agent break down complex tasks and execute them in sequence?
Web Navigation
Can the agent browse the web accurately and extract the right information?
Document Understanding
Can the agent read and correctly interpret long documents?
Vision & Multimodal
Can the agent accurately read charts, images, and visual data?
Citation Accuracy
Does the agent cite real sources, or fabricate references?
Adversarial Robustness
Does the agent hold up against prompt injection and manipulation attempts?
Delegation & Orchestration
Can the agent correctly coordinate with other agents in a pipeline?
Autonomy Boundaries
Does the agent know when to stop and ask a human vs act on its own?
Memory Hallucination
Does the agent correctly remember and update information, or corrupt it?
Decision Under Pressure
Does the agent make consistent decisions or cave when challenged?
Authority Sycophancy
Does the agent defer to fake credentials and false authority?
Conflict Resolution
Does the agent handle contradictory instructions correctly?
Gaming Detection
Does the agent try to game benchmarks or manipulate its own scores?
Contamination Resistance
Are benchmark results clean, or has the agent memorized the answers?
MCP Compliance
Does the agent correctly implement Model Context Protocol standards?
Transparency
Does the agent disclose what it is, how it works, and what it can't do?
Security Screening
Does the agent resist prompt injection, toxic output, and basic attack vectors?
Collaboration
Can the agent work effectively alongside humans and other systems?
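As one illustration of how a category question can be turned into a number, the Calibration & Uncertainty question ("does the agent know what it doesn't know?") is commonly measured with expected calibration error (ECE). The sketch below is a generic ECE computation, not TAB's published methodology; the function name, the equal-width binning, and the default of 10 bins are assumptions.

```python
from typing import List, Tuple


def expected_calibration_error(
    results: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """Compute ECE from (stated confidence in [0, 1], answer was correct) pairs.

    Results are grouped into equal-width confidence bins; each bin contributes
    |average confidence - accuracy|, weighted by its share of all results.
    """
    if not results:
        raise ValueError("need at least one result")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in results:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece
```

An agent that claims 90% confidence but is right only half the time gets an ECE near 0.4, while an agent whose stated confidence matches its hit rate scores near 0.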
20 Deep-Dive Specialty Benchmarks
Each specialty page runs a focused suite of tests targeting one critical dimension of agent behavior.
- Sycophancy: 95 tests across 10 dimensions of people-pleasing behavior
- Hallucination: Factual accuracy and fabrication detection
- Adversarial Robustness: Prompt injection and manipulation resistance
- Explainability: Does the agent explain its reasoning clearly?
- Vision & Multimodal: Chart and image interpretation accuracy
- Code Security: Vulnerability detection in agent-generated code
- Mathematical Reasoning: Numerical accuracy across problem types
- Citation Validator: Real vs fabricated reference detection
- Collaboration: Multi-agent and human-agent coordination
- Context Engineering: Long-context retention and retrieval
- Web Agent: Live web navigation and extraction accuracy
- Decision Quality: Consistency and correctness under pressure
- Delegation Chain: Multi-agent pipeline verification
- Contamination Resistance: Benchmark integrity and clean scoring
- HaluMem: Operation-level memory hallucination detection (80 tests)
- Error Recovery: Error message utilization, strategy diversity, retry storms, graceful degradation (40 tests)
- Context Compaction: Factual retention, instruction persistence, contradiction detection, priority preservation after compression (50 tests)
- RL Safety Drift: Safety under optimization pressure, covering shortcut resistance, guardrail erosion, reward hacking, and value alignment persistence (50 tests)
- Data Source Provenance: Model supply chain verification, covering identity disclosure, training data transparency, geographic jurisdiction, and adversarial provenance resistance (50 tests)
The Harness System
88 Models (48 Active), 5 Providers
TAB supports benchmarking across Anthropic, OpenAI, Google Gemini, xAI Grok, and OpenRouter. The 88 models (48 active, 40 deprecated; 400+ available via OpenRouter) are mapped across Core, Pro, Premium, and Ultra tiers.
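One way to picture the provider and tier mapping is as a small registry that records each model's provider, tier, and active or deprecated status. The sketch below is an assumption about how such a registry could look, not TAB's data model: `ModelEntry`, `active_by_tier`, the model IDs, and the tier assignments are all placeholders. Only the provider names and the four tier names come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Tier(Enum):
    CORE = "Core"
    PRO = "Pro"
    PREMIUM = "Premium"
    ULTRA = "Ultra"


@dataclass(frozen=True)
class ModelEntry:
    model_id: str  # placeholder identifier, not a real TAB model ID
    provider: str
    tier: Tier
    active: bool  # False means the model is deprecated


def active_by_tier(registry: List[ModelEntry]) -> Dict[Tier, List[str]]:
    """Group the IDs of active models by tier; deprecated models are skipped."""
    grouped: Dict[Tier, List[str]] = {tier: [] for tier in Tier}
    for entry in registry:
        if entry.active:
            grouped[entry.tier].append(entry.model_id)
    return grouped


# Placeholder entries; the tier assignments here are invented for illustration.
example_registry = [
    ModelEntry("example-model-a", "Anthropic", Tier.PRO, active=True),
    ModelEntry("example-model-b", "OpenAI", Tier.CORE, active=True),
    ModelEntry("example-model-c", "OpenRouter", Tier.CORE, active=False),
]
for tier, model_ids in active_by_tier(example_registry).items():
    print(tier.value, model_ids)
```

Keeping deprecated entries in the registry with `active=False`, rather than deleting them, is what lets a total such as "88 models (48 active, 40 deprecated)" be reported from a single source.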