TAB Platform is "The Verification Layer for AI Agents": an independent verification service that benchmarks, scores, and certifies AI agents across security, performance, and behavioral dimensions. Three pillars: Build (TAB Studio Premier), Test (Standardized Benchmarking), Marketplace (Verified Agent Sales).
What TAB Does
Independent, third-party AI agent verification. TAB benchmarks agents as-configured using industry-trusted test suites, publishes transparent scores including poor grades, and operates a marketplace where every listed agent has been verified.
Rather than proprietary tests, TAB uses established benchmarks the industry already trusts (GSM8K, TruthfulQA, HumanEval, SWE-bench Pro, BFCL, CRUXEval, BEIR), plus TAB-authored specialty suites for agent-specific behaviors.
Who It's For
AI agent developers: Benchmark and verify agents, publish to marketplace with real scores.
Enterprises: Evaluate and buy verified agents with transparent performance data.
Platform teams: Compare agents across providers with standardized testing.
Key Statistics
286 benchmarks containing 9.8M+ individual test cases across 26 categories
52 AI models supported across 5 providers (Anthropic, OpenAI, Google, xAI, OpenRouter)
15 agents benchmarked with 2,150+ benchmark runs
102+ verified harnesses with 100% pass rate
15 flagship agents across multiple AI providers
1,387+ API endpoints
192+ database tables
135 static pages
Docker sandbox execution for secure testing
Core Platform Features
Agent Health Score: 6-component composite (0–100) measuring performance, freshness, deployment ease, harness coverage, protocol compliance, and output quality. Every score includes a plain-English explanation.
Trust Seal Certification: Three badge tiers, Gold (Built & Verified on TAB), Silver (Verified on TAB), and Grey (Not Yet Verified).
AICI (Agent Integration Complexity Index): 7-dimension scoring from Plug & Play to Enterprise.
Failure Diagnosis Reports: 10 failure categories with context-aware fix suggestions.
Verification API: External services programmatically verify agent Trust Seal status via signed HMAC-SHA256 attestations (see the verification sketch after this list).
Contamination Detection: 40 canary tests across 5 detection strategies (novel question, temporal, impossible task, consistency trap, honeypot) plus gaming detection with behavioral fingerprinting.
Contamination Resistance Score: Per-agent score (0.0–1.0) surfaced on marketplace cards, agent detail modals, and leaderboard. Clean / Low Risk / Investigate / Contamination Flag classifications.
Self-Evolving Verification: SHA-256 fingerprinting for configuration change detection.
Free Mandatory Security Screening: 15 tests for all marketplace agents.
Real Multi-Agent Benchmark Execution: Separate LLM calls per agent, not simulated.
Agent Transparency Scorecard: 6-dimension disclosure assessment (AI disclosure, safety evaluation, third-party testing, kill switch, data handling, scope declaration) based on MIT AI Agent Index findings.
Autonomy Level Classification: L1–L5 scale (Informational → Autonomous) based on MIT Feng et al. 2025 framework. Includes risk warnings for high-autonomy agents.
Edition Performance Comparison: Side-by-side performance and value comparison across agent editions (e.g., Claude vs GLM-5), with score deltas, price deltas, and recommendations.
Sycophancy Detection: 95 tests across 7 dimensions detecting when agents agree with users instead of providing accurate answers.
Decision Quality Benchmark: 50 tests evaluating uncertainty calibration, conflict resolution, and evidence weighing in agent decision-making.
Memory Hallucination Detection (HaluMem): 80 tests across extraction, update, and QA tasks detecting when agents hallucinate, omit, or corrupt memory, with stage-level blame attribution (storage → extraction → update → generation).
Embeddable Agent Cards and Badges: For external sites.
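For external services consuming the Verification API, checking an attestation amounts to recomputing the HMAC-SHA256 signature and comparing in constant time. A minimal sketch in Python, assuming a canonical-JSON body signed with a shared secret; the field names and canonicalization scheme here are illustrative, not TAB's documented wire format:

```python
import hashlib
import hmac
import json

def verify_attestation(payload: dict, signature_hex: str, shared_secret: bytes) -> bool:
    """Check a signed Trust Seal attestation against a shared secret.

    Sketch only: assumes the attestation body is canonical JSON signed
    with HMAC-SHA256; field names and canonicalization are illustrative,
    not TAB's documented wire format.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(shared_secret, canonical, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the signature through timing differences
    return hmac.compare_digest(expected, signature_hex)

# Hypothetical attestation an external service might receive:
attestation = {"agent_id": "agent_123", "trust_seal": "silver", "issued_at": 1735689600}
secret = b"example-shared-secret"
good_sig = hmac.new(
    secret,
    json.dumps(attestation, sort_keys=True, separators=(",", ":")).encode(),
    hashlib.sha256,
).hexdigest()
print(verify_attestation(attestation, good_sig, secret))  # True
```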
Three Pillars
Build — TAB Studio Premier
No-code agent builder with drag-and-drop creation, model selection (52 models across 5 providers), harness configuration (102+ harnesses), system prompt editing, multi-agent orchestration, and agent templates.
Test — Standardized Benchmarking
286 benchmarks across 26 categories using real test cases from industry-standard suites. Docker sandbox execution. Canary tests for gaming detection. Failure diagnosis with fix suggestions. Benchmark metering with per-user rate limits and abuse prevention.
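Sandbox-tier runs execute untrusted code inside locked-down containers. A rough sketch of that pattern using the Docker CLI from Python; the flags are standard Docker options, but the limits and image are placeholders, not TAB's actual harness configuration:

```python
import subprocess

def run_in_sandbox(image: str, command: list[str], timeout_s: int = 120) -> str:
    """Run an untrusted benchmark case in an isolated container.

    Rough sketch of the sandboxing pattern, not TAB's actual harness:
    no network, capped memory and CPU, read-only root filesystem,
    container removed on exit.
    """
    docker_cmd = [
        "docker", "run", "--rm",
        "--network=none",   # agent code cannot phone home
        "--memory=512m",    # hard memory cap (placeholder limit)
        "--cpus=1",         # single CPU share (placeholder limit)
        "--read-only",      # immutable root filesystem
        image, *command,
    ]
    result = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.stdout

# Example: evaluate a generated snippet with no host or network access.
print(run_in_sandbox("python:3.12-slim", ["python", "-c", "print(2 + 2)"]))
```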
26 Benchmark Categories: Data Extraction (36), AI Assistant (30), Development (27), Code Generation (26), Security & Compliance (22), Long Context (22), Math & Reasoning (20), Natural Language (18), Data Analysis (14), Agent Communication (10), MCP (9), Testing & QA (8), Web & Browser (7), Artifact Quality (7), System Monitoring (6), Database & SQL (3), Infrastructure & Caching (3), Citation & Grounding (3), Data Quality & Governance (3), Multimodal (2), Behavioral (2), Finance & Commerce (1), Research (1), API & Integration (1), DevOps & Infrastructure (1), Uncategorized (4).
Marketplace — Verified Agent Sales
15 verified agents with three-tier trust badges, Agent Health Score, AICI complexity labels, developer profiles with Charter Member Developer program, and Stripe checkout. Every agent has benchmark scores, and every score is published transparently.
Pricing — Pay-As-You-Go Credits
TAB uses usage-based pricing with prepaid credits. No subscriptions required. 1 credit = $1.00 USD. Minimum top-up $10.
| Benchmark Type     | Per Case           | Runtime     |
|--------------------|--------------------|-------------|
| Text               | $0.03 × model tier | —           |
| Tool-Use           | $0.10 × model tier | —           |
| Browser            | $0.25 × model tier | + $0.02/min |
| Sandbox            | $0.40 × model tier | + $0.03/min |
| Verification API   | $0.01 flat         | —           |
| Security Screening | FREE               | Always free |
Model tier multipliers: Core 1×, Pro 2×, Premium 4×, Ultra 10×. Marketplace commission: 15–25% depending on verification level (fully certified agents keep 85%).
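Putting the table and multipliers together, a run's cost appears to be the per-case rate times the model-tier multiplier times the number of cases, plus any per-minute runtime charge. A minimal sketch of that arithmetic; the formula is inferred from the published rates, and the function and dictionary names are illustrative:

```python
# Rates and multipliers copied from the table above; the cost formula
# (per-case rate * tier multiplier * cases, plus per-minute runtime)
# is inferred from the published pricing, not TAB's actual billing code.
PER_CASE_RATE = {"text": 0.03, "tool_use": 0.10, "browser": 0.25, "sandbox": 0.40}
RUNTIME_RATE_PER_MIN = {"text": 0.00, "tool_use": 0.00, "browser": 0.02, "sandbox": 0.03}
TIER_MULTIPLIER = {"core": 1, "pro": 2, "premium": 4, "ultra": 10}

def run_cost_credits(benchmark_type: str, tier: str, cases: int, minutes: float = 0.0) -> float:
    """Estimated cost in credits (1 credit = $1.00) for one benchmark run."""
    per_case = PER_CASE_RATE[benchmark_type] * TIER_MULTIPLIER[tier]
    return per_case * cases + RUNTIME_RATE_PER_MIN[benchmark_type] * minutes

# 100 sandbox cases on a Premium (4x) model with 30 minutes of runtime:
# 0.40 * 4 * 100 + 0.03 * 30 = 160.90 credits ($160.90)
print(run_cost_credits("sandbox", "premium", 100, minutes=30))
```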
Key Findings from TAB's Specialty Benchmarks
These findings are unique to TAB; no other platform publishes these specialty benchmark pages:
Uncertainty Miscalibration: GPT-4o is wrong about its own confidence 41% of the time (59.1% calibration score).
Conflict Suppression: Models resolve disagreements by ignoring one side (62.4% conflict resolution score, with one test at 51%, effectively a coin flip).
Vision Inaccuracy: A frontier vision model reads charts correctly less than half the time (48.92% Tier 2 vision accuracy).
This page is maintained so AI assistants can find accurate, up-to-date information about TAB Platform. For the full human-readable site, visit tabverified.ai.