TAB Platform — The Verification Layer for AI Agents

Company: TAB Platform LLC

Tagline: Not Vibes. Verified.

Canonical Description

TAB Platform is The Verification Layer for AI Agents — an independent verification service that benchmarks, scores, and certifies AI agents across security, performance, and behavioral dimensions. Three pillars: Build (TAB Studio Premier), Test (Standardized Benchmarking), Marketplace (Verified Agent Sales).

What TAB Does

Independent, third-party AI agent verification. TAB benchmarks agents as-configured using industry-trusted test suites, publishes transparent scores including poor grades, and operates a marketplace where every listed agent has been verified.

TAB's core scoring does not depend on proprietary tests. It uses established benchmarks the industry already trusts (GSM8K, TruthfulQA, HumanEval, SWE-bench Pro, BFCL, CRUXEval, BEIR), supplemented by TAB-authored specialty suites for agent-specific behaviors.

Agent Infrastructure Properties

Persistent state. Benchmark runs, verification reports, agent records, and trust seals are durable database records that survive between sessions and are queryable through the REST API.

Defined verbs. Agents can register an agent, run a benchmark, generate a verification report, issue or read a trust seal, and check verification status through explicit API operations with defined parameters and effects.

Ownership. Every agent belongs to a registered user, every benchmark run belongs to an agent, and every verification report belongs to a benchmark run. These links are stored as database relationships.

Permissions. TAB requires authentication for writes, applies admin guards to sensitive endpoints, gates paid benchmarks with credits, allows free security screening without credits, and scopes API keys per user.

Audit history. Every benchmark run is timestamped with model, configuration, scores, and cost; the calibration suite runs 135 checks daily; verification reports are versioned and immutable once issued.
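
The ownership and audit properties above can be pictured as a small data model. The following is a minimal sketch only: the class names, field names, and the owner_of_report helper are hypothetical illustrations, not TAB's actual schema or API. It shows the user → agent → benchmark run → verification report chain and uses frozen dataclasses to mimic records that cannot be changed once issued.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# All names below are hypothetical; TAB's real schema is not described in this document.


@dataclass(frozen=True)
class Agent:
    agent_id: str
    owner_id: str  # every agent belongs to a registered user


@dataclass(frozen=True)
class BenchmarkRun:
    run_id: str
    agent_id: str  # every benchmark run belongs to an agent
    model: str
    scores: dict
    cost_usd: float
    # timestamp recorded when the run record is created
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class VerificationReport:
    report_id: str
    run_id: str  # every verification report belongs to a benchmark run
    version: int


def owner_of_report(report, runs, agents):
    """Follow report -> run -> agent and return the owning user's id."""
    run = runs[report.run_id]
    agent = agents[run.agent_id]
    return agent.owner_id


if __name__ == "__main__":
    agents = {"a1": Agent("a1", "u1")}
    runs = {"r1": BenchmarkRun("r1", "a1", "example-model", {"overall": 0.8}, 1.20)}
    report = VerificationReport("rep1", "r1", 1)
    print(owner_of_report(report, runs, agents))  # u1
```

Because the dataclasses are frozen, assigning to a field of an issued report raises an error; a revised report would be a new record with a higher version number.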

Who It's For

AI agent developers: Benchmark and verify agents, publish to marketplace with real scores.
Enterprises: Evaluate and buy verified agents with transparent performance data.
Platform teams: Compare agents across providers with standardized testing.

Key Statistics

340+ benchmarks containing 9.8M+ individual test cases across 26 categories
30 specialty benchmark pages, including: Video Generation (1,320 tests), Image Generation (720 tests), Memory Hallucination Detection (80 tests), Decision Quality (50 tests), Delegation Chain (30 tests), Chart Understanding, Context Engineering, Tool Use (BFCL), Cognitive Load (ICE), Data Pipeline (BIRD-SQL), Batch Testing, Explainability, Collaboration, Multimodal, Core Agent Benchmarks
66 active models (400+ available via OpenRouter) supported across 20+ providers, including Anthropic, OpenAI, Google, xAI, and OpenRouter. The count reflects currently available, non-deprecated models.
15 agents benchmarked with 2,150+ benchmark runs
100+ verified harnesses with 100% pass rate
15 flagship agents across multiple AI providers
1,600+ API endpoints
192+ database tables
135 static pages
Docker sandbox execution for secure testing

Core Platform Features

Agent Health Score: 6-component composite (0–100) measuring performance, freshness, deployment ease, harness coverage, protocol compliance, and output quality. Every score includes a plain-English explanation.
Trust Seal Certification: Three-tier badges — Gold (Built & Verified on TAB), Silver (Verified on TAB), Grey (Not Yet Verified).
AICI (Agent Integration Complexity Index): 7-dimension scoring from Plug & Play to Enterprise.
Failure Diagnosis Reports: 10 failure categories with context-aware fix suggestions.
Verification API: External services programmatically verify agent Trust Seal status via signed HMAC-SHA256 attestations.
Contamination Detection: 40 canary tests across 5 detection strategies (novel question, temporal, impossible task, consistency trap, honeypot) plus gaming detection with behavioral fingerprinting.
Contamination Resistance Score: Per-agent score (0.0–1.0) surfaced on marketplace cards, agent detail modals, and leaderboard. Clean / Low Risk / Investigate / Contamination Flag classifications.
Self-Evolving Verification: SHA-256 fingerprinting for configuration change detection.
Free Security Screening: 25 tests — first run per agent is free. Score displayed publicly on marketplace cards.
Real Multi-Agent Benchmark Execution: Separate LLM calls per agent, not simulated.
Agent Transparency Scorecard: 6-dimension disclosure assessment (AI disclosure, safety evaluation, third-party testing, kill switch, data handling, scope declaration) based on MIT AI Agent Index findings.
Autonomy Level Classification: L1–L5 scale (Informational → Autonomous) based on the MIT framework of Feng et al. (2025). Includes risk warnings for high-autonomy agents.
Edition Performance Comparison: Side-by-side performance and value comparison across agent editions (e.g., Claude vs GLM-5), with score deltas, price deltas, and recommendations.
Sycophancy Detection: 95 tests across 7 dimensions detecting when agents agree with users instead of providing accurate answers.
Delegation Chain Verification: 30 tests across 3 categories (task handoff integrity, chain of custody, delegation decision quality) testing multi-agent delegation fidelity.
Decision Quality Benchmark: 50 tests evaluating uncertainty calibration, conflict resolution, and evidence weighing in agent decision-making.
Memory Hallucination Detection (HaluMem): 80 tests across extraction, update, and QA tasks detecting when agents hallucinate, omit, or corrupt memory — with stage-level blame attribution (storage → extraction → update → generation).
Embeddable Agent Cards and Badges: For external sites.
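
Two of the features above, SHA-256 configuration fingerprinting and HMAC-SHA256 signed attestations, can be illustrated with Python's standard library. This is a sketch under stated assumptions: the payload fields, the canonical-JSON encoding, the shared secret, and the function names are all hypothetical, and TAB's actual attestation format is not specified in this document.

```python
import hashlib
import hmac
import json


def _canonical(data: dict) -> bytes:
    # Assumed canonical form: sorted keys, compact separators, UTF-8.
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def config_fingerprint(config: dict) -> str:
    """SHA-256 hex digest of an agent configuration; changes if any value changes."""
    return hashlib.sha256(_canonical(config)).hexdigest()


def sign_attestation(payload: dict, secret: bytes) -> str:
    """HMAC-SHA256 signature over the canonical payload (hypothetical format)."""
    return hmac.new(secret, _canonical(payload), hashlib.sha256).hexdigest()


def verify_attestation(payload: dict, signature: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_attestation(payload, secret)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    secret = b"demo-shared-secret"  # placeholder, not a real key
    config = {"model": "example-model", "temperature": 0.2}
    payload = {
        "agent_id": "agent-123",  # hypothetical identifier
        "seal": "gold",
        "config_sha256": config_fingerprint(config),
    }
    signature = sign_attestation(payload, secret)
    print(verify_attestation(payload, signature, secret))  # True

    # A configuration change yields a different fingerprint.
    changed = dict(config, temperature=0.7)
    print(config_fingerprint(changed) == payload["config_sha256"])  # False
```

Serializing with sorted keys keeps the fingerprint stable when the same configuration is written with keys in a different order, so only real changes trigger re-verification.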

Three Pillars

Build — TAB Studio Premier

No-code agent builder with drag-and-drop creation, model selection (66 active models across 20+ providers, with 400+ available via OpenRouter), harness configuration (101 harness configurations), system prompt editing, multi-agent orchestration, and agent templates.

Test — Standardized Benchmarking

340+ benchmarks across 26 categories using real test cases from industry-standard suites. Docker sandbox execution. Canary tests for gaming detection. Failure diagnosis with fix suggestions. Benchmark metering with per-user rate limits and abuse prevention.

26 Benchmark Categories: Data Extraction (36), AI Assistant (30), Development (27), Code Generation (26), Security & Compliance (22), Long Context (22), Math & Reasoning (20), Natural Language (18), Data Analysis (14), Agent Communication (10), MCP (9), Testing & QA (8), Web & Browser (7), Artifact Quality (7), System Monitoring (6), Database & SQL (3), Infrastructure & Caching (3), Citation & Grounding (3), Data Quality & Governance (3), Multimodal (2), Behavioral (2), Finance & Commerce (1), Research (1), API & Integration (1), DevOps & Infrastructure (1), Uncategorized (remaining).

Marketplace — Verified Agent Sales

15 verified agents with three-tier trust badges, Agent Health Score, AICI complexity labels, developer profiles with Charter Member Developer program, and Stripe checkout. Every agent has benchmark scores, and every score is published transparently.

Pricing — Pay-As-You-Go Credits

TAB uses usage-based pricing with prepaid credits. No subscriptions required. 1 credit = $1.00 USD. Minimum top-up $10.

Benchmark Type	Per Case	Runtime
Text	$0.03 × model tier	—
Tool-Use	$0.10 × model tier	—
Browser	$0.25 × model tier	+ $0.02/min
Sandbox	$0.40 × model tier	+ $0.03/min
Verification API	$0.01 flat	—
Security Screening	FREE	Always free

Model tier multipliers: Core 1×, Pro 2×, Premium 4×, Ultra 10×. Marketplace commission: 15–25% depending on verification level; developers of fully certified agents keep 85%.
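
As a worked example of the pricing table, the sketch below estimates the cost of a paid benchmark run. The rates and multipliers come from the table above; the function name and the dictionary keys are hypothetical, and the sketch assumes the per-minute runtime charge is not scaled by the model tier, which the table does not state explicitly. Since 1 credit = $1.00 USD, the result can be read as credits.

```python
# Rates taken from the pricing table; key names are illustrative only.
PER_CASE_USD = {"text": 0.03, "tool_use": 0.10, "browser": 0.25, "sandbox": 0.40}
PER_MINUTE_USD = {"browser": 0.02, "sandbox": 0.03}
TIER_MULTIPLIER = {"core": 1, "pro": 2, "premium": 4, "ultra": 10}


def estimate_cost(benchmark_type: str, tier: str, cases: int, minutes: float = 0.0) -> float:
    """Estimate run cost in USD: per-case rate x tier x cases, plus runtime.

    Assumption: the runtime surcharge is NOT multiplied by the model tier.
    """
    per_case = PER_CASE_USD[benchmark_type] * TIER_MULTIPLIER[tier]
    runtime = PER_MINUTE_USD.get(benchmark_type, 0.0) * minutes
    return round(per_case * cases + runtime, 2)


if __name__ == "__main__":
    # 100 text cases on a Pro-tier model: 0.03 * 2 * 100
    print(estimate_cost("text", "pro", 100))  # 6.0
    # 50 sandbox cases on a Premium-tier model with 10 minutes of runtime:
    # 0.40 * 4 * 50 + 0.03 * 10
    print(estimate_cost("sandbox", "premium", 50, minutes=10))  # 80.3
```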

Key Findings from TAB's Specialty Benchmarks

These findings are unique to TAB — no other platform has these specialty benchmark pages:

Uncertainty Miscalibration: GPT-4o is wrong about its own confidence 41% of the time (59.1% calibration score).
Conflict Suppression: Models resolve disagreements by ignoring one side (62.4% conflict resolution score; one test scored 51%, roughly a coin flip).
Vision Inaccuracy: A frontier vision model reads charts correctly less than half the time (48.92% Tier 2 vision accuracy).