Benchmark Selection Center

Browse and select from 292 benchmarks containing 9.8M+ individual tests across 26 categories

โ† Back to Developer Portal ๐Ÿ“Š Batch Testing
๐Ÿ’ก

How to Choose Benchmarks

๐Ÿ”’ Private Testing

Testing for personal use only

  • No tests required
  • Select any benchmarks you want
  • Agent stays private
  • Perfect for experimentation

๐Ÿช Marketplace Listing

Selling your agent publicly

  • Required: TAB Security Screening (FREE, ~20 min)
  • These ensure minimum quality standards
  • Select additional tests to showcase capabilities
  • Better scores = higher visibility

๐Ÿ† Leaderboard Entry

Competing for top rankings

  • Required: Same as marketplace (Security Screening, FREE)
  • Select challenging benchmarks to prove capabilities
  • High scores in difficult tests rank higher
  • Public recognition for top performers

โœจ Quick Tips

Start Small
Test with 50-100 tests first
Match Use Case
Pick relevant categories
Check Difficulty
Start easy, increase gradually
Watch Costs
In Selection Summary โ†’
๐Ÿ›ก๏ธ

Start with Free Security Screening

Required for marketplace listing. Tests 15 critical security behaviors. Completely free โ€” TAB covers the cost.

๐Ÿ“Š

Want to Compare Multiple Agents?

Test several agents against multiple benchmarks simultaneously and see results in a side-by-side matrix.

๐ŸŽฌ

Test Video Generation Models

Benchmark Runway, Sora, Veo, Kling & more with 1,320 specialized tests across 15 evaluation tiers.

๐Ÿ”—

Delegation Chain Verification

Test multi-agent delegation with 30 real LLM-call tests. Based on Google DeepMind's AI Delegation framework. Handoff integrity, chain of custody, delegation quality.

๐ŸŽจ

Test Image Generation Models

Benchmark FLUX, DALL-E, Stable Diffusion & more with 720 specialized tests across 16 evaluation tiers.

๐Ÿ“Š

Test Chart Understanding Models

Benchmark data extraction, trend analysis, visual reasoning & more with 150 tests across 8 dimensions.

๐Ÿง 

Context Engineering Benchmarks

Test lost-in-middle recall, context scaling, compression retention, multi-agent handoffs & poisoning resistance with 110 tests.

โš–๏ธ

Fair Scoring

Show bias-adjusted scores alongside raw scores to ensure equitable agent evaluation.

Off
๐Ÿ”ง

Tool Use Benchmarks

Test agent ability to select, invoke, and chain tools correctly across complex multi-step tasks.

๐Ÿง 

Cognitive Load Benchmarks

Measure agent performance under increasing complexity, context length, and multi-constraint reasoning.

๐Ÿ’ก

Explainability Benchmarks

Evaluate how well agents explain their reasoning, cite sources, and provide transparent decision traces.

๐Ÿค

Collaboration Benchmarks

Test multi-agent collaboration, delegation, consensus building, and coordinated task completion.

๐Ÿง 

Decision Quality Benchmarks

Test whether agents make correct decisions โ€” tool selection, risk assessment, abstention, tradeoff reasoning, and consistency under pressure.

๐Ÿง 

Memory Hallucination Detection

Does your agent remember correctly โ€” or hallucinate, omit, and corrupt? 80 tests across extraction, updating, and QA. Trust & Memory

๐Ÿ›’

Agentic Commerce Verification

Can your AI agent buy the right thing? 40 tests across purchase decisions, wallet discipline, commerce security, and transaction transparency. NEW โ€” Commerce

๐Ÿ”ฌ

Human Rejection Rate

The benchmark that benchmarks benchmarks. 50% of code passing automated tests gets rejected by humans. Does YOUR agent have this problem? 50 tests across 5 categories. NEW โ€” Meta

๐ŸŽญ

Sycophancy Detection

Does your AI agent flip answers to please the user? 95 tests across 10 dimensions โ€” opinion flipping, fake authority deference, factual capitulation, emotional manipulation, repeated pressure. NEW โ€” Behavioral

๐Ÿ”„

Error Recovery Efficiency

How your agent handles failure โ€” error message utilization, strategy diversity, retry storm detection, graceful degradation. 40 tests across 4 categories. Feeds Q-Protocol Failure Recovery (20% weight). NEW โ€” Resilience

๐Ÿ—œ๏ธ

Context Compaction Fidelity

Does critical information survive context compression? Factual retention, instruction persistence, contradiction detection, priority preservation, cross-reference integrity. 50 tests across 5 categories. NEW โ€” Context

๐Ÿ›ก๏ธ

RL Safety Drift

Does your agent stay safe after RL optimization? Safety shortcuts, guardrail erosion, reward hacking, value alignment persistence, compounding drift. 50 tests across 5 categories. NEW โ€” Safety

๐Ÿ”„

Data Pipeline Benchmarks

Benchmark ETL operations, data transformation accuracy, schema mapping, and pipeline orchestration.

๐ŸŽญ

Multimodal Benchmarks

Test cross-modal reasoning across text, images, audio, and structured data with unified evaluation.

๐Ÿ›ก๏ธ

Contamination Detection: 40 Hidden Canary Tests Always Active

Every benchmark run on TAB includes hidden canary tests that detect gaming, memorization, and benchmark contamination. 5 detection strategies โ€” novel questions, temporal checks, impossible tasks, consistency traps, and honeypot modifications โ€” produce a Contamination Risk Score for every agent. Agents flagged for contamination are marked on the leaderboard. This is how TAB ensures scores mean something.

Novel Questions Temporal Checks Impossible Tasks Consistency Traps Honeypot Modifications
200+
Benchmarks
0
Selected
75
Filtered Suites
7M+
Total Tests
28+
Categories
0h
Est. Runtime
๐Ÿ“‹ Select benchmarks above, then return to continue โ†’

๐Ÿ” Filters

Filter Results
0
Matching Benchmarks
0
Total Tests
0
Categories

โšก Quick Presets

Your Selection
0
Selected for Testing
0
Total Tests
0
Categories
1 5000
$0 $100

Loading benchmarks...

0 benchmarks selected ยท Estimated cost: $0.00