Filters
Quick Presets
Browse and select from 340+ benchmarks containing 9.8M+ individual tests across 26 categories
Testing for personal use only
Selling your agent publicly
Competing for top rankings
Tests 25 critical security behaviors. Your first screening per agent is free — TAB covers the cost. Your score appears on your marketplace card.
Test several agents against multiple benchmarks simultaneously and see results in a side-by-side matrix.
Benchmark Runway, Veo, Kling & more with 1,320 specialized tests across 15 evaluation tiers.
Test multi-agent delegation with 30 real LLM-call tests, based on Google DeepMind's AI Delegation framework. Covers handoff integrity, chain of custody, and delegation quality.
Benchmark FLUX, DALL-E, Stable Diffusion & more with 720 specialized tests across 16 evaluation tiers.
Benchmark data extraction, trend analysis, visual reasoning & more with 150 tests across 8 dimensions.
Test lost-in-middle recall, context scaling, compression retention, multi-agent handoffs & poisoning resistance with 110 tests.
Show bias-adjusted scores alongside raw scores to ensure equitable agent evaluation.
Test agent ability to select, invoke, and chain tools correctly across complex multi-step tasks.
Measure agent performance under increasing complexity, context length, and multi-constraint reasoning.
Evaluate how well agents explain their reasoning, cite sources, and provide transparent decision traces.
Test multi-agent collaboration, delegation, consensus building, and coordinated task completion.
Test whether agents make correct decisions: tool selection, risk assessment, abstention, tradeoff reasoning, and consistency under pressure.
Does your agent remember correctly, or hallucinate, omit, and corrupt? 80 tests across extraction, updating, and QA. Trust & Memory
Independent efficiency benchmark that classifies where models waste tokens: preamble, verbosity, echoes, hedging, retries, dead-end reasoning, and formatting overhead. NEW - Efficiency
Can your AI agent buy the right thing? 40 tests across purchase decisions, wallet discipline, commerce security, and transaction transparency. NEW - Commerce
The benchmark that benchmarks benchmarks. 50% of code passing automated tests gets rejected by humans. Does YOUR agent have this problem? 50 tests across 5 categories. NEW - Meta
Know whether the model you paid for is the behavior you received. Detects silent model substitution, safety restriction, output filtering, quantization, and load-based routing. Claim ladder 0-6, three cost tiers. NEW - Integrity
Does your AI agent flip answers to please the user? 95 tests across 10 dimensions: opinion flipping, fake authority deference, factual capitulation, emotional manipulation, repeated pressure. NEW - Behavioral
How your agent handles failure: error message utilization, strategy diversity, retry storm detection, graceful degradation. 40 tests across 4 categories. Feeds Q-Protocol Failure Recovery (20% weight). NEW - Resilience
Does critical information survive context compression? Factual retention, instruction persistence, contradiction detection, priority preservation, cross-reference integrity. 50 tests across 5 categories. NEW - Context
Does your agent stay safe after RL optimization? Safety shortcuts, guardrail erosion, reward hacking, value alignment persistence, compounding drift. 50 tests across 5 categories. NEW - Safety
Does your agent handle identity, permissions, and delegation correctly? 50 tests across 5 auth categories: identity verification, scope management, lifecycle, delegation chains, autonomous vs supervised mode. NEW - Security
Does your agent ignore malicious instructions hidden in user content, or obey them? Injection overrides, obfuscated payloads, role-switch jailbreaks, data theft, and system-prompt exfiltration. Run against an agent or any model directly. NEW - Security
Does the model treat people equally? Paired prompts that differ only by a demographic marker (gender, race, age, religion, disability, nationality, orientation), judged for changes in quality, decision, refusal, or stereotype. 44 pairs across 4 categories. NEW - Responsible AI
Does the model behave accountably? Flags high-stakes actions for human approval, shows traceable reasoning, recognizes when it's out of scope, and documents what it did and assumed. 40 judge-scored scenarios across 4 categories. NEW - Governance
Does the model accurately represent what it is? Honest self-identification, limitation disclosure, transparency about external vs own knowledge, and consistent answers under adversarial framing. 40 judge-scored prompts across 4 categories. NEW - Transparency
Benchmark ETL operations, data transformation accuracy, schema mapping, and pipeline orchestration.
Test cross-modal reasoning across text, images, audio, and structured data with unified evaluation.