Benchmark Selection Center

Browse and select from 340+ benchmarks containing 9.8M+ individual tests across 26 categories

💡

How to Choose Benchmarks

🔒 Private Testing

Testing for personal use only

No tests required
Select any benchmarks you want
Agent stays private
Perfect for experimentation

🏪 Marketplace Listing

Selling your agent publicly

Recommended: TAB Security Screening (first run FREE, ~20 min)
Security score appears on your marketplace card
Select additional tests to showcase capabilities
Better scores = higher visibility

🏆 Leaderboard Entry

Competing for top rankings

Recommended: Security Screening (first run FREE)
Select challenging benchmarks to prove capabilities
High scores in difficult tests rank higher
Public recognition for top performers

✨ Quick Tips

Start Small

Test with 50-100 tests first

Match Use Case

Pick relevant categories

Check Difficulty

Start easy, increase gradually

Watch Costs

In Selection Summary →

🛡️

Run Free Security Screening

Tests 25 critical security behaviors. Your first screening per agent is free — TAB covers the cost. Your score appears on your marketplace card.

📊

Want to Compare Multiple Agents?

Test several agents against multiple benchmarks simultaneously and see results in a side-by-side matrix.

🎬

Test Video Generation Models

Benchmark Runway, Veo, Kling & more with 1,320 specialized tests across 15 evaluation tiers.

🔗

Delegation Chain Verification

Test multi-agent delegation with 30 real LLM-call tests. Based on Google DeepMind's AI Delegation framework. Handoff integrity, chain of custody, delegation quality.

🎨

Test Image Generation Models

Benchmark FLUX, DALL-E, Stable Diffusion & more with 720 specialized tests across 16 evaluation tiers.

📊

Test Chart Understanding Models

Benchmark data extraction, trend analysis, visual reasoning & more with 150 tests across 8 dimensions.

🧠

Context Engineering Benchmarks

Test lost-in-middle recall, context scaling, compression retention, multi-agent handoffs & poisoning resistance with 110 tests.

⚖️

Fair Scoring

Show bias-adjusted scores alongside raw scores to ensure equitable agent evaluation.

Off

🔧

Tool Use Benchmarks

Test agent ability to select, invoke, and chain tools correctly across complex multi-step tasks.

🧠

Cognitive Load Benchmarks

Measure agent performance under increasing complexity, context length, and multi-constraint reasoning.

💡

Explainability Benchmarks

Evaluate how well agents explain their reasoning, cite sources, and provide transparent decision traces.

🤝

Collaboration Benchmarks

Test multi-agent collaboration, delegation, consensus building, and coordinated task completion.

🧠

Decision Quality Benchmarks

Test whether agents make correct decisions — tool selection, risk assessment, abstention, tradeoff reasoning, and consistency under pressure.

🧠

Memory Hallucination Detection

Does your agent remember correctly — or hallucinate, omit, and corrupt? 80 tests across extraction, updating, and QA. Trust & Memory

⚡

Token Waste Audit

Independent efficiency benchmark that classifies where models waste tokens: preamble, verbosity, echoes, hedging, retries, dead-end reasoning, and formatting overhead. NEW — Efficiency

🛒

Agentic Commerce Verification

Can your AI agent buy the right thing? 40 tests across purchase decisions, wallet discipline, commerce security, and transaction transparency. NEW — Commerce

🔬

Human Rejection Rate

The benchmark that benchmarks benchmarks. 50% of code passing automated tests gets rejected by humans. Does YOUR agent have this problem? 50 tests across 5 categories. NEW — Meta

🛣️

RouteCheck — Model Route & Integrity Verification

Know whether the model you paid for is the behavior you received. Detects silent model substitution, safety restriction, output filtering, quantization, and load-based routing — claim ladder 0–6, three cost tiers. NEW — Integrity

🎭

Sycophancy Detection

Does your AI agent flip answers to please the user? 95 tests across 10 dimensions — opinion flipping, fake authority deference, factual capitulation, emotional manipulation, repeated pressure. NEW — Behavioral

🔄

Error Recovery Efficiency

How your agent handles failure — error message utilization, strategy diversity, retry storm detection, graceful degradation. 40 tests across 4 categories. Feeds Q-Protocol Failure Recovery (20% weight). NEW — Resilience

🗜️

Context Compaction Fidelity

Does critical information survive context compression? Factual retention, instruction persistence, contradiction detection, priority preservation, cross-reference integrity. 50 tests across 5 categories. NEW — Context

🛡️

RL Safety Drift

Does your agent stay safe after RL optimization? Safety shortcuts, guardrail erosion, reward hacking, value alignment persistence, compounding drift. 50 tests across 5 categories. NEW — Safety

🔑

Agent Auth Compliance

Does your agent handle identity, permissions, and delegation correctly? 50 tests across 5 auth categories — identity verification, scope management, lifecycle, delegation chains, autonomous vs supervised mode. NEW — Security

💉

Prompt Injection Resistance

Does your agent ignore malicious instructions hidden in user content, or obey them? Injection overrides, obfuscated payloads, role-switch jailbreaks, data theft, and system-prompt exfiltration. Run against an agent or any model directly. NEW — Security

⚖️

Fairness & Bias

Does the model treat people equally? Paired prompts that differ only by a demographic marker — gender, race, age, religion, disability, nationality, orientation — judged for changes in quality, decision, refusal, or stereotype. 44 pairs across 4 categories. NEW — Responsible AI

🧭

Accountability Behavior

Does the model behave accountably? Flags high-stakes actions for human approval, shows traceable reasoning, recognizes when it's out of scope, and documents what it did and assumed. 40 judge-scored scenarios across 4 categories. NEW — Governance

🪪

Model Identity & Disclosure

Does the model accurately represent what it is? Honest self-identification, limitation disclosure, transparency about external vs own knowledge, and consistent answers under adversarial framing. 40 judge-scored prompts across 4 categories. NEW — Transparency

🔄

Data Pipeline Benchmarks

Benchmark ETL operations, data transformation accuracy, schema mapping, and pipeline orchestration.

🎭

Multimodal Benchmarks

Test cross-modal reasoning across text, images, audio, and structured data with unified evaluation.

🛡️

Contamination Detection: 40 Hidden Canary Tests Always Active

Every benchmark run on TAB includes hidden canary tests that detect gaming, memorization, and benchmark contamination. 5 detection strategies — novel questions, temporal checks, impossible tasks, consistency traps, and honeypot modifications — produce a Contamination Risk Score for every agent. Agents flagged for contamination are marked on the leaderboard. This is how TAB ensures scores mean something.

Novel Questions Temporal Checks Impossible Tasks Consistency Traps Honeypot Modifications

📋 Select benchmarks above, then return to continue →

📝 Document Processing?

Extraction, Parsing

💻 Code Generation?

HumanEval, MBPP

🔗 API Integration?

BFCL, Interop

📑 Citation Grounding?

ALCE, AttrScore, HAGRID

🧠 Reasoning?

Mathematics, Knowledge

🌐 Web Automation?

Web, AppWorld

🔒 Security?

Security, TAB-SecureFix

💡 Pro Tip: Start with the Recommended Tests view for the security screening (first run free), then explore categories matching your agent's purpose!

🛡️ Why run the Security Screening?

TAB is the only AI agent marketplace that verifies what it sells. Your security score is displayed publicly on your marketplace card — buyers see it before purchasing.

The screening checks 25 critical security behaviors:

PII Redaction — Handles personal data safely across 5 formats
Data Exfiltration — Won't leak API keys, passwords, or credentials
Memory Privacy — Protects data stored in context
Safety Refusal — Refuses harmful requests

Your first screening per agent is free. TAB covers the cost. Re-runs to improve your score cost credits.

Selection Summary

Select benchmarks above to see your testing summary.

New here? Use the Run Free Security Screening button at the top of the page to get started!

Note: Your first TAB Security Screening per agent is completely free. TAB covers the cost because marketplace safety is our responsibility. Your security score is displayed on your marketplace card. Re-runs cost credits.

🎵 My Playlists

Loading playlists...

🔍 Filters

Filter Results

Matching Benchmarks

Total Tests

Running Security Screening...

Testing your agent against 25 security behaviors. This may take up to 60 seconds.

Do not close this page while tests are running.

0 benchmarks selected · Estimated cost: $0.00

Benchmark Selection Center

How to Choose Benchmarks

🔒 Private Testing

🏪 Marketplace Listing

🏆 Leaderboard Entry

✨ Quick Tips

Run Free Security Screening

Want to Compare Multiple Agents?

Test Video Generation Models

Delegation Chain Verification

Test Image Generation Models

Test Chart Understanding Models

Context Engineering Benchmarks

Fair Scoring

Tool Use Benchmarks

Cognitive Load Benchmarks

Explainability Benchmarks

Collaboration Benchmarks

Decision Quality Benchmarks

Memory Hallucination Detection

Token Waste Audit

Agentic Commerce Verification

Human Rejection Rate

RouteCheck — Model Route & Integrity Verification

Sycophancy Detection

Error Recovery Efficiency

Context Compaction Fidelity

RL Safety Drift

Agent Auth Compliance

Prompt Injection Resistance

Fairness & Bias

Accountability Behavior

Model Identity & Disclosure

Data Pipeline Benchmarks

Multimodal Benchmarks

Contamination Detection: 40 Hidden Canary Tests Always Active

🎯 What Should I Test For?

🛡️ Why run the Security Screening?

Selection Summary

🎵 My Playlists

🔍 Filters

⚡ Quick Presets

How many tests?

Running Security Screening...