TAB Platform uses industry-standard benchmarks to evaluate AI agents. We gratefully acknowledge the researchers and organizations who created these benchmarks and made them available to the community.
HumanEval
Creator: OpenAI
Description: 164 hand-crafted programming challenges designed to evaluate code generation capabilities.
Usage: TAB uses HumanEval to assess agents' ability to solve programming tasks from natural language descriptions.
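HumanEval performance is conventionally reported with the pass@k metric introduced alongside the benchmark (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples is correct. Below is a minimal sketch of that unbiased estimator; the function name and the example numbers are ours, and how TAB itself aggregates or reports the metric is not specified here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper
    (Chen et al., 2021): the probability that at least one of
    k samples, drawn without replacement from n generations,
    is correct, given that c of the n passed the unit tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), expanded as a numerically
    # stable product instead of large factorials.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For instance, with n = 200 generations of which c = 30 pass, pass_at_k(200, 30, 1) evaluates to 0.15, the per-sample success rate, while pass_at_k(200, 30, 10) credits models whose correct solutions appear somewhere within a 10-sample budget.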
MBPP
Creator: Google Research
Description: 974 crowd-sourced Python programming problems covering common developer tasks.
Usage: TAB uses MBPP to evaluate agents on practical, everyday programming scenarios.
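Each MBPP record pairs a short natural-language task description with assert-based test cases (the public dataset exposes these as text and test_list fields). The sketch below shows the shape of that check; the sample task and helper name are invented here, and a production harness would sandbox the exec calls rather than run untrusted model output in-process.

```python
# Hypothetical MBPP-style record; the field names mirror the
# public dataset, but this particular task is ours.
task = {
    "text": "Write a function to find the minimum of two numbers.",
    "test_list": [
        "assert min_of_two(1, 2) == 1",
        "assert min_of_two(-5, 3) == -5",
    ],
}

def passes_mbpp_task(candidate_code: str, task: dict) -> bool:
    """Execute candidate code, then each assertion; any exception
    (including AssertionError) counts as a failure. Unsafe outside
    an isolated sandbox."""
    scope: dict = {}
    try:
        exec(candidate_code, scope)  # define the candidate function
        for test in task["test_list"]:
            exec(test, scope)        # run one assertion
    except Exception:
        return False
    return True

print(passes_mbpp_task("def min_of_two(a, b):\n    return a if a < b else b", task))  # True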
SWE-bench
Creator: Princeton University
Description: Real-world software engineering problems from GitHub repositories including Django, Flask, and scikit-learn.
Usage: TAB uses the uncontaminated split of SWE-bench Pro to evaluate agents on authentic software development and debugging tasks. SWE-bench Verified was retired by OpenAI in February 2026 after 59.4% of its test cases were found to be flawed and its problems were found to have leaked into the training data of all frontier models.
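Each SWE-bench instance pins a repository to a base commit, supplies the issue text, and records FAIL_TO_PASS tests that a correct patch must turn green. As a minimal sketch of that evaluation loop, assuming a pre-cloned repository and pytest-style test identifiers (the official harness instead runs each instance in a containerized environment with pinned dependencies):

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str,
                   model_patch: str, fail_to_pass: list[str]) -> bool:
    """Apply an agent-generated diff at the pinned commit and
    rerun the tests the patch is supposed to fix. Simplified:
    no PASS_TO_PASS regression check, no isolation."""
    # Reset the repository to the exact snapshot the issue was filed against.
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    # "git apply -" reads the unified diff from stdin.
    subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                   input=model_patch, text=True, check=True)
    # The instance resolves only if every previously failing test now passes.
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```

The real harness also reruns the instance's PASS_TO_PASS tests to catch regressions the patch may introduce, a step this sketch omits.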
All benchmarks are used in accordance with their respective licenses.
If you are a benchmark creator and would like your benchmark included in TAB, or if you have any concerns about our usage, please contact us at legal@tabverified.ai.
TAB Platform - Where rigorous benchmarking creates a verified marketplace
Benchmark attributions last updated: October 2025