Test Methodology
TAB verifies AI agents by tracing every step of their execution path, not just their final output. Across 340+ benchmarks, 88 models (48 active, 40 deprecated) from 20+ providers, and 101 harness configurations, TAB measures how an agent solves a problem: where it failed in a multi-step workflow, whether it called the right tool, whether it preserved context between steps, and whether it stayed within its authorized scope. Updated May 2026.
How TAB Verifies Multi-Step Agent Execution Paths
Multi-step agent workflow verification requires more than checking whether the agent reached the right answer. TAB traces every decision point, tool call, and reasoning step to test agent execution paths end to end, across 9.8M test cases spanning 26 categories.
Breaking Down the Execution Path
TAB scores each stage of an agent's workflow: API call accuracy, RAG retrieval quality, context preservation, tool selection, and final synthesis. A single multi-step workflow may involve 12 to 40 individual measurement points before a final score is computed.
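As a rough illustration of per-stage scoring, the sketch below groups measurement points by stage and averages them. The stage names follow the list above; the 0 to 100 scale, the equal weighting, and the function names are assumptions made for the example, not TAB's published formula.

```python
from statistics import mean

# Stage names taken from the text; identifiers are illustrative.
STAGES = (
    "api_call_accuracy",
    "rag_retrieval_quality",
    "context_preservation",
    "tool_selection",
    "final_synthesis",
)


def stage_scores(points):
    """Average the measurement points recorded for each stage.

    `points` is a list of (stage, score) pairs, scores assumed in [0, 100].
    """
    grouped = {stage: [] for stage in STAGES}
    for stage, score in points:
        grouped[stage].append(score)
    # Stages with no measurement points are omitted rather than scored as zero.
    return {stage: mean(vals) for stage, vals in grouped.items() if vals}


def workflow_score(points):
    """Hypothetical composite: unweighted mean of the per-stage averages."""
    return mean(list(stage_scores(points).values()))
```

Averaging within each stage first keeps a stage with many measurement points from dominating the composite, which is one plausible design choice among several.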
Deterministic vs. Semantic Verification
TAB combines two kinds of checks. Deterministic verification is hard format validation: does the response match the expected schema, contain the required fields, and avoid forbidden outputs? Semantic verification is LLM-as-judge evaluation: does the reasoning make sense, is the tone appropriate, and did the agent understand the task intent? The two catch different failure modes, so TAB applies both on every run. The primary judge is GLM-5 via OpenRouter, which serves as the LLM judge for multi-turn workflows across all evaluation categories.
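A minimal sketch of how the two layers might sit side by side. The deterministic layer checks required fields, types, and forbidden strings. The semantic layer is represented by a `judge` callable, a hypothetical stand-in for the LLM judge and not a real OpenRouter or GLM-5 API call. All names and the result shape are assumptions.

```python
import json


def deterministic_check(response, required_fields, forbidden_substrings):
    """Return a list of failure reasons; an empty list means the check passed."""
    if not isinstance(response, dict):
        return ["response is not a JSON object"]
    failures = []
    # required_fields maps field name -> expected Python type.
    for name, expected_type in required_fields.items():
        if name not in response:
            failures.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            failures.append(f"wrong type for field: {name}")
    # Search the serialized response so nested values are covered too.
    text = json.dumps(response, ensure_ascii=False)
    for needle in forbidden_substrings:
        if needle in text:
            failures.append(f"forbidden output: {needle}")
    return failures


def verify(response, required_fields, forbidden_substrings, judge):
    """Run both layers on every call, mirroring the 'both on every run' rule.

    `judge` is a hypothetical callable returning a semantic score.
    """
    failures = deterministic_check(response, required_fields, forbidden_substrings)
    return {
        "deterministic_pass": not failures,
        "failures": failures,
        "semantic_score": judge(response),
    }
```

Keeping the judge behind a plain callable means the deterministic layer can be tested without any model access.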
Q-Protocol: Scoring How the Agent Thinks
Q-Protocol scores eight behavioral dimensions: Prediction Discipline (does the agent commit before acting), Failure Recovery (does it adapt after errors), Context Discipline (does it stay on task), Epistemic Honesty (does it admit uncertainty), Error Utilization (does it learn within a session), Autonomy Boundaries (does it escalate rather than act unilaterally), Root Cause Analysis (does it diagnose correctly), Handoff Quality (does it summarize state cleanly).
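One way to picture a Q-Protocol result is as a profile over the eight dimensions. The sketch below assumes each dimension receives a numeric score on a common scale and reports the mean and the weakest dimension. The identifiers, the scale, and the equal weighting are illustrative assumptions, since the text does not describe how the dimensions are combined.

```python
# Dimension names from the text, written as identifiers for the example.
Q_DIMENSIONS = (
    "prediction_discipline",
    "failure_recovery",
    "context_discipline",
    "epistemic_honesty",
    "error_utilization",
    "autonomy_boundaries",
    "root_cause_analysis",
    "handoff_quality",
)


def q_profile(scores):
    """Summarize a dict of dimension -> score (hypothetical 0-100 scale)."""
    missing = [d for d in Q_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total = sum(scores[d] for d in Q_DIMENSIONS)
    return {
        "mean": total / len(Q_DIMENSIONS),
        # The lowest-scoring dimension is often the most actionable signal.
        "weakest": min(Q_DIMENSIONS, key=lambda d: scores[d]),
    }
```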
Tracing the Trajectory, Not Just the Destination
TAB's Reasoning Trace Viewer records the full execution trace including intermediate steps, tool calls, and decision points. This creates a decision audit trail that shows exactly where an agent succeeded, where it hedged, and where it failed. For teams asking how to verify what steps an AI agent took, the trace is the definitive record.
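To make the idea of a decision audit trail concrete, here is a small sketch of a trace structure that records steps in order and can answer "where did it first fail?". The `Trace` and `TraceStep` classes and their fields are hypothetical and do not reflect the Reasoning Trace Viewer's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    index: int
    kind: str            # e.g. "reasoning", "tool_call", "decision"
    detail: str
    outcome: str = "ok"  # one of "ok", "hedged", "failed"


@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, kind, detail, outcome="ok"):
        """Append a step; its index is its position in the trace."""
        step = TraceStep(len(self.steps), kind, detail, outcome)
        self.steps.append(step)
        return step

    def first_failure(self):
        """Return the earliest failed step, or None if nothing failed."""
        return next((s for s in self.steps if s.outcome == "failed"), None)

    def summary(self):
        """Count steps by outcome."""
        counts = {"ok": 0, "hedged": 0, "failed": 0}
        for s in self.steps:
            counts[s.outcome] += 1
        return counts
```

For example, recording a planning step, a hedged search call, and a failed write would make `first_failure()` return the third step, which is the point a reviewer would inspect first.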
Scoring Tiers
- Platinum: 90+ composite score
- Gold: 75 to 89
- Silver: 60 to 74
- Bronze: 45 to 59
- Unrated: below 45
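The tier boundaries above translate directly into a lookup. This sketch treats each tier's lower bound as an inclusive floor, so a fractional score such as 89.5 lands in Gold. That handling of fractional scores is an assumption, since the tiers are stated only for whole numbers.

```python
# (inclusive lower bound, tier name), highest tier first.
TIERS = (
    (90, "Platinum"),
    (75, "Gold"),
    (60, "Silver"),
    (45, "Bronze"),
)


def tier(score):
    """Map a composite score to its tier name."""
    for floor, name in TIERS:
        if score >= floor:
            return name
    return "Unrated"
```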
How We Test Agents
- Isolated sandbox environment for each test
- Deterministic test cases with known correct outputs
- Performance metrics: latency, token usage, accuracy
- Category-specific evaluation criteria
- Automatic retry on transient failures
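The retry and latency items in the list above can be sketched as a small wrapper. `TransientError`, the attempt limit, and the exponential backoff schedule are assumptions for illustration. The text does not say which failures TAB treats as transient or how many times it retries.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def run_with_retry(test_fn, max_attempts=3, backoff_s=0.5, sleep=time.sleep):
    """Run test_fn, retrying only on TransientError, and record latency."""
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        try:
            result = test_fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: surface the failure to the caller
            # Exponential backoff: backoff_s, then 2x, then 4x, ...
            sleep(backoff_s * 2 ** (attempt - 1))
            continue
        return {
            "result": result,
            "attempts": attempt,
            # Latency of the successful attempt only.
            "latency_s": time.perf_counter() - start,
        }
```

Any other exception propagates immediately, so a genuine test failure is never masked by a retry. The `sleep` parameter is injectable so the wrapper can be exercised without real delays.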
Benchmark Credits & Licenses
TAB uses industry-standard benchmarks from various sources. Full attributions and licenses are listed under Benchmark Attributions.