Test Methodology

By Rod Miller · Published March 19, 2026 · Updated June 8, 2026

TAB verifies AI agents by tracing every step of their execution path, not just their final output. Across 340+ benchmarks, 88 models (48 active, 40 deprecated) from 20+ providers, and 101 harness configurations, TAB measures how an agent solves a problem: where it failed in a multi-step workflow, whether it called the right tool, whether it preserved context between steps, and whether it stayed within its authorized scope. Updated May 2026.

How TAB Verifies Multi-Step Agent Execution Paths

Multi-step agent workflow verification requires more than checking whether the agent reached the right answer. TAB traces every decision point, tool call, and reasoning step to test agent execution paths end to end, across 9.8M test cases spanning 26 categories.

Breaking Down the Execution Path

TAB scores each stage of an agent's workflow: API call accuracy, RAG retrieval quality, context preservation, tool selection, and final synthesis. A single multi-step workflow may involve 12 to 40 individual measurement points before a final score is computed.
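As a rough illustration of stage-level scoring, the sketch below groups measurement points by the five stages named above and averages them. The stage identifiers, the 0 to 1 value range, the equal weighting, and the function names are all assumptions made for this example; TAB's actual weighting is not described here.

```python
from statistics import mean

# Stage names follow the text; the identifiers themselves are hypothetical.
STAGES = (
    "api_call_accuracy",
    "rag_retrieval_quality",
    "context_preservation",
    "tool_selection",
    "final_synthesis",
)


def stage_scores(points):
    """Average measurement points per stage.

    points is an iterable of (stage, value) pairs with value in [0, 1].
    Stages with no measurement points are left out of the result.
    """
    by_stage = {stage: [] for stage in STAGES}
    for stage, value in points:
        if stage not in by_stage:
            raise ValueError(f"unknown stage: {stage}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"value out of range for {stage}: {value}")
        by_stage[stage].append(value)
    return {stage: mean(values) for stage, values in by_stage.items() if values}


def workflow_score(points):
    """Unweighted mean of the per-stage averages, scaled to 0-100 (assumed)."""
    per_stage = stage_scores(points)
    return 100.0 * mean(list(per_stage.values()))
```

Averaging per stage first keeps a stage with many measurement points from dominating one with only a few, which is one reasonable design choice among several.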

Deterministic vs. Semantic Verification

TAB combines two kinds of checks on every run. Deterministic verification is hard format validation: does the response match the expected schema, contain required fields, and avoid forbidden outputs? Semantic verification is LLM-as-a-judge evaluation: does the reasoning make sense, is the tone appropriate, and did the agent understand the task intent? The two approaches catch different failure modes, which is why TAB applies both. The primary judge is GLM-5 via OpenRouter, which powers LLM-as-a-judge evaluation for multi-turn workflows across all evaluation categories.
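The deterministic half of this can be pictured with a small validator. This is a minimal sketch, not TAB's implementation: the function name, the shape of the required-fields mapping, and the substring-based forbidden-output test are assumptions chosen to keep the example short.

```python
import json


def deterministic_check(response, required_fields, forbidden_substrings):
    """Return a list of violations; an empty list means the response passes.

    required_fields maps field name -> expected Python type (assumed shape).
    forbidden_substrings is a list of phrases that must not appear anywhere
    in the serialized response (case-insensitive).
    """
    if not isinstance(response, dict):
        return ["response is not a JSON object"]

    violations = []
    for name, expected_type in required_fields.items():
        if name not in response:
            violations.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            violations.append(f"wrong type for field: {name}")

    # Serialize once so nested values are also searched.
    text = json.dumps(response, ensure_ascii=False).lower()
    for phrase in forbidden_substrings:
        if phrase.lower() in text:
            violations.append(f"forbidden output: {phrase}")
    return violations
```

A check like this either passes or fails with no judgment involved, which is what separates it from the semantic layer, where a judge model has to interpret intent.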

Q-Protocol: Scoring How the Agent Thinks

Q-Protocol scores 9 behavioral dimensions: Prediction Discipline (does the agent commit before acting), Failure Recovery (does it adapt after errors), Context Discipline (does it stay on task), Epistemic Honesty (does it admit uncertainty), Error Utilization (does it learn within a session), Autonomy Boundaries (does it escalate vs. act unilaterally), Root Cause Analysis (does it diagnose correctly), Handoff Quality (does it summarize state cleanly), and Reasoning Faithfulness.

Tracing the Trajectory, Not Just the Destination

TAB's Reasoning Trace Viewer records the full execution trace including intermediate steps, tool calls, and decision points. This creates a decision audit trail that shows exactly where an agent succeeded, where it hedged, and where it failed. For teams asking how to verify what steps an AI agent took, the trace is the definitive record.
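To make the idea of a decision audit trail concrete, here is one way such a trace could be represented and queried. The TraceStep record, its fields, and the helper functions are hypothetical and are not the Reasoning Trace Viewer's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    """One recorded step in an execution trace (hypothetical structure)."""
    index: int
    kind: str    # assumed values: "reasoning", "tool_call", "decision"
    detail: str  # e.g. the tool name or a short summary of the step
    ok: bool     # whether the step passed its check


def first_failure(trace):
    """Return the earliest failing step, or None if every step passed."""
    for step in trace:
        if not step.ok:
            return step
    return None


def tool_calls(trace):
    """List the tools called, in execution order."""
    return [step.detail for step in trace if step.kind == "tool_call"]
```

With a trace in this form, asking where a run went wrong becomes a lookup for the first failing step rather than a guess based on the final output.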

Scoring Tiers

• Platinum: 90+ composite score
• Gold: 75 to 89
• Silver: 60 to 74
• Bronze: 45 to 59
• Unrated: below 45
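The tier boundaries above translate directly into a lookup. This sketch assumes composite scores run from 0 to 100 and that each listed threshold is an inclusive lower bound, so a fractional score such as 89.5 falls in Gold; the list does not say how fractional scores are handled, so that part is a guess.

```python
# (minimum score, tier name), highest tier first. Thresholds come from the list above.
TIERS = (
    (90, "Platinum"),
    (75, "Gold"),
    (60, "Silver"),
    (45, "Bronze"),
)


def tier_for(score):
    """Map a composite score (assumed 0-100) to its tier name."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for minimum, name in TIERS:
        if score >= minimum:
            return name
    return "Unrated"
```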

How We Test Agents

• Isolated sandbox environment for each test
• Deterministic test cases with known correct outputs
• Performance metrics: latency, token usage, accuracy
• Category-specific evaluation criteria
• Automatic retry on transient failures
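The last item, automatic retry on transient failures, might look something like the following. The TransientError class, the attempt limit, and the exponential backoff schedule are assumptions for illustration; the list does not specify how TAB classifies or retries failures.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


def run_with_retry(test_fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run test_fn, retrying only on TransientError with exponential backoff.

    Any other exception propagates immediately, so a genuine test failure
    is never masked by a retry. The sleep parameter is injectable so the
    function can be exercised without real delays.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return test_fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying only a narrow class of errors matters for a benchmark: retrying on wrong answers would let an agent pass by luck.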

Benchmark Credits & Licenses

TAB uses industry-standard benchmarks from various sources. View full attributions and licenses:

📚 View Benchmark Attributions →

Frequently Asked Questions About TAB's Methodology

How does TAB score AI agents?

TAB scores agents using multiple layers: benchmark testing across 340+ standardized tests, Q-Protocol behavioral scoring across 9 dimensions, security screening with real attack payloads, contamination detection via 40 canary tests, and harness efficacy measurement across 101 configurations. The composite Trust Seal grade (A+ through F) reflects all layers.

What is Q-Protocol and how does it work?

Q-Protocol is TAB's proprietary behavioral scoring system that measures HOW an agent thinks, not just what it produces. It scores 9 dimensions: Prediction Discipline, Failure Recovery, Context Discipline, Epistemic Honesty, Error Utilization, Autonomy Boundaries, Root Cause Analysis, Handoff Quality, and Reasoning Faithfulness. It runs automatically on every benchmark execution.

How does TAB detect benchmark contamination?

TAB uses 40 canary tests across 5 detection strategies. These are proprietary tests designed to detect when a model has memorized benchmark answers from training data. If contamination is detected, the affected scores are flagged and the contamination risk level is reported in the Verification Report.

How does TAB verify its own benchmarks?

TAB uses a self-verification architecture with three components: a calibration suite with synthetic agents (AGENT_PERFECT should score >85%, AGENT_LIAR <30%), an adversarial audit that tests for gaming resistance, and drift detection that monitors score stability over time. TAB holds itself to the standard it applies to others.
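The calibration thresholds quoted above can be expressed as a simple check. This is a sketch under stated assumptions: the agent names and bounds come from the text, while the function name, the percentage scale, and the input format are invented for the example.

```python
# Expected ranges for the synthetic agents, as percentages (bounds from the text).
CALIBRATION_BOUNDS = {
    "AGENT_PERFECT": lambda score: score > 85.0,
    "AGENT_LIAR": lambda score: score < 30.0,
}


def calibration_failures(observed):
    """Return a message for each synthetic agent scoring outside its range.

    observed maps agent name -> score in percent (assumed format).
    An empty list means the benchmark passed calibration.
    """
    failures = []
    for agent, in_bounds in CALIBRATION_BOUNDS.items():
        if agent not in observed:
            failures.append(f"{agent}: no score recorded")
        elif not in_bounds(observed[agent]):
            failures.append(f"{agent}: score {observed[agent]} out of expected range")
    return failures
```

If a known-good agent scores low or a known-bad agent scores high, the fault lies with the benchmark rather than the agent, which is the point of running synthetic agents through it.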

What does "agents benchmarked as configured" mean?

TAB benchmarks the complete agent stack: model + harness + configuration = score. The same model can score 42% with one harness and 78% with another. TAB measures the full picture, not just the model in isolation. This is why TAB offers 101 harness configurations.

Related Guides

What Is AI Agent Verification?

The independent verification category TAB created.

The Harness Effect

Why the same model scores 42% or 78% by configuration.

Independent AI Benchmarks (2026)

How TAB differs from self-reported vendor benchmarks.

Independent Model Comparison

Comparing models on identical harness configurations.