Filters
Quick Presets
Browse and select from 292 benchmarks containing 9.8M+ individual tests across 26 categories
Testing for personal use only
Selling your agent publicly
Competing for top rankings
Required for marketplace listing. Tests 15 critical security behaviors. Completely free: TAB covers the cost.
Test several agents against multiple benchmarks simultaneously and see results in a side-by-side matrix.
Benchmark Runway, Sora, Veo, Kling & more with 1,320 specialized tests across 15 evaluation tiers.
Test multi-agent delegation with 30 real LLM-call tests based on Google DeepMind's AI Delegation framework. Covers handoff integrity, chain of custody, and delegation quality.
Benchmark FLUX, DALL-E, Stable Diffusion & more with 720 specialized tests across 16 evaluation tiers.
Benchmark data extraction, trend analysis, visual reasoning & more with 150 tests across 8 dimensions.
Test lost-in-middle recall, context scaling, compression retention, multi-agent handoffs & poisoning resistance with 110 tests.
Show bias-adjusted scores alongside raw scores to ensure equitable agent evaluation.
Test agent ability to select, invoke, and chain tools correctly across complex multi-step tasks.
Measure agent performance under increasing complexity, context length, and multi-constraint reasoning.
Evaluate how well agents explain their reasoning, cite sources, and provide transparent decision traces.
Test multi-agent collaboration, delegation, consensus building, and coordinated task completion.
Test whether agents make correct decisions: tool selection, risk assessment, abstention, tradeoff reasoning, and consistency under pressure.
Does your agent remember correctly, or does it hallucinate, omit, and corrupt? 80 tests across extraction, updating, and QA. Trust & Memory
Can your AI agent buy the right thing? 40 tests across purchase decisions, wallet discipline, commerce security, and transaction transparency. NEW - Commerce
The benchmark that benchmarks benchmarks. 50% of code passing automated tests gets rejected by humans. Does YOUR agent have this problem? 50 tests across 5 categories. NEW - Meta
Does your AI agent flip answers to please the user? 95 tests across 10 dimensions, including opinion flipping, fake authority deference, factual capitulation, emotional manipulation, and repeated pressure. NEW - Behavioral
How your agent handles failure: error message utilization, strategy diversity, retry storm detection, graceful degradation. 40 tests across 4 categories. Feeds Q-Protocol Failure Recovery (20% weight). NEW - Resilience
Does critical information survive context compression? Factual retention, instruction persistence, contradiction detection, priority preservation, cross-reference integrity. 50 tests across 5 categories. NEW - Context
Does your agent stay safe after RL optimization? Safety shortcuts, guardrail erosion, reward hacking, value alignment persistence, compounding drift. 50 tests across 5 categories. NEW - Safety
Benchmark ETL operations, data transformation accuracy, schema mapping, and pipeline orchestration.
Test cross-modal reasoning across text, images, audio, and structured data with unified evaluation.