🚀 Getting Started with TAB

Complete these steps to build and list your first agent

Your Progress 0 of 4 complete
1
🛠️

Build or Import an Agent

Create your first AI agent using our visual builder, templates, or import your own existing agent

⏱️ ~10-15 minutes
2
🧪

Run Benchmarks

Test your agent against 290+ benchmarks to measure performance

⏱️ ~5-10 minutes
3
📊

Review Results

Analyze your Trust Seal score and benchmark results

⏱️ ~5 minutes
4
🛒

List & Sell

Go to My Agents, click "Publish" to list on marketplace and earn revenue

⏱️ ~2 minutes
🤖 -- Agents
✅ -- Verified
⚠️ 0 Stale
📊 -- Avg Score
🔵 AICI: --
🔗 -- Embeds
💰 -- Earned
🏥 Avg Health: --

Account Type

Loading...

Credits Remaining

Loading...

Total Earnings

Loading...

View Dashboard →

Build Agents

🛠️

Build Agent

Create agents from templates, build from scratch, or import your own

Templates Available 17
Your Agents --
Avg Build Time ~15 min
🧪

TAB Studio Premier

Comprehensive agent testing and debugging studio

Test Suites 15
Tests Run --
Success Rate --
📜

Prompt Registry

Version control for your agent prompts

Registered Prompts --
Rollback Enabled
Compare Side-by-Side
👥

Projects & Teams

Collaborate with your team on agents and benchmarks

Team Members 0
Active Projects 0
Shared Agents 0
🎯

Harness Recommendations

Auto-suggested test harnesses for your agents

Select an agent to see recommendations
📋

Audit Trail

Track all actions on your account and agents

💰

Budget Alerts

Set spending thresholds and get notified

🔗

Embed Codes

Get badges and cards to embed on your website, GitHub, or docs

Formats SVG, HTML, Markdown
Themes Light & Dark
GitHub Ready Yes

My Agents

Loading your agents...

Test & Benchmark

🧪

Run Tests

290+ benchmarks including Context Engineering, Video Generation & 9.8M+ tests

Total Benchmarks 290+
Tests Run Today 23
Avg Score 87.2%
📊

Batch Testing

Run multiple agents against multiple benchmarks at once

Agents × Benchmarks Matrix View
Background Execution ✓
Export Results JSON/CSV
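
The agents-by-benchmarks matrix and the JSON/CSV export described in this card can be pictured with a short Python sketch. It is illustrative only: `run_matrix`, `to_csv`, `to_json`, and the `run_one` callback are hypothetical names, not TAB's API, and the scoring callback stands in for a real benchmark run.

```python
import csv
import io
import json
from itertools import product


def run_matrix(agents, benchmarks, run_one):
    """Run every agent against every benchmark (the full cross product).

    run_one(agent, benchmark) is a hypothetical callback returning a score.
    """
    return [
        {"agent": agent, "benchmark": bench, "score": run_one(agent, bench)}
        for agent, bench in product(agents, benchmarks)
    ]


def to_csv(rows):
    """Serialize result rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["agent", "benchmark", "score"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialize result rows to a JSON array."""
    return json.dumps(rows, indent=2)
```

Two agents and three benchmarks yield six rows, one per cell of the matrix, which is why batch runs grow quickly as either list gets longer.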
🎭

Scenario Tests

Multi-turn conversation testing for real situations

Scenarios 13 Available
Categories 6 Types
Dynamic Mode Credits Required
📄

Verification Reports

AI-written diagnostic reports synthesizing benchmark scores, behavioral analysis, and rankings

Reports Generated --
Latest Grade --
Powered By GLM-5 via OpenRouter

A/B Testing

Compare prompt versions with real scenario tests

Test two versions
Statistical confidence
One-click activate winner
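
As a rough illustration of what "statistical confidence" can mean when comparing two prompt versions, the sketch below applies a two-proportion z-test to scenario pass counts. This is an assumption about one common approach, not a description of how TAB computes its figure; the function name and inputs are hypothetical.

```python
import math


def two_proportion_confidence(pass_a, n_a, pass_b, n_b):
    """Return (z, confidence) for the pass-rate difference B minus A.

    confidence is 1 minus the two-sided p-value of a pooled
    two-proportion z-test, using only the standard library.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All runs passed or all failed: no evidence of a difference.
        return 0.0, 0.0
    z = (p_b - p_a) / se
    # For a standard normal, 1 - p_two_sided equals erf(|z| / sqrt(2)).
    confidence = math.erf(abs(z) / math.sqrt(2))
    return z, confidence
```

With 40 of 100 passes for version A and 60 of 100 for version B, the confidence comes out above 0.99, while identical pass counts give 0.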
🧹

Experience Hygiene

Manage experience buffers, filter noise, deduplicate, and optimize agent learning quality

Quality scoring ✓
Auto-dedup ✓
Compression ✓
📋

Test History

View all benchmark test runs and results

Recent Tests 127
Pass Rate 94.3%
Avg Score 87.2%
🎯

Code Coverage

Track test coverage across your agents

Coming Soon

Coverage tracking is under development

📊

Performance Reports

Generate detailed performance analysis reports

Reports Generated 0
Avg Duration 2.3s
Format PDF/JSON
🩺

Failure Diagnosis

NEW

Auto-diagnose why benchmarks fail with fix suggestions and harness recommendations

Failure Categories 10 types
Fix Suggestions Auto-generated
Harness Recs With efficacy lift
View Failure Categories

Diagnosis appears automatically on benchmark results pages

⚖️

Agent Comparison

Side-by-side comparison with AICI, MCP compliance, verification freshness, and more

Metrics 11 dimensions
Export CSV & PDF
Best For Auto-recommendations

Recent Test Runs

View All Tests →

Specialized Benchmarks

16 specialty benchmark pages across 6 categories. Click any card to launch.

🔒 Security & Safety

🛡️

Q-Protocol Linter

EXCLUSIVE

Scores how your agent thinks: 8 behavioral dimensions, deterministic, zero LLM-as-judge variance.

8 dimensions
🎭

Sycophancy Detection

NEW

Does your agent flip answers to please the user? 95 tests across 10 dimensions of people-pleasing behavior.

95 tests · 10 dimensions
🛡️

RL Safety Drift

NEW

Does your agent stay safe after RL optimization? Safety shortcuts, guardrail erosion, reward hacking, value alignment drift.

50 tests · 5 categories

⚙️ Code & Tools

🔧

Tool Use Accuracy

Tests tool selection + argument accuracy. 250+ tasks from TAB & BFCL across 13 categories.

250+ tests · 70/30 scoring
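
One way to read the 70/30 scoring is as a weighted blend of tool selection and argument accuracy. The sketch below assumes 70% of the score goes to picking the right tool and 30% to matching its arguments; the card does not say which component carries which weight, so both the split direction and the `tool_use_score` helper are assumptions made for illustration.

```python
def tool_use_score(expected, actual, selection_weight=0.7):
    """Blend tool selection and argument accuracy into one score in [0, 1].

    expected and actual are dicts with a 'tool' name and an 'args' dict.
    Assumed split: selection_weight for the tool choice, the remainder
    for the fraction of expected arguments reproduced exactly.
    """
    selected = 1.0 if actual["tool"] == expected["tool"] else 0.0
    expected_args = expected["args"]
    if selected and expected_args:
        matched = sum(
            1 for key, value in expected_args.items()
            if actual["args"].get(key) == value
        )
        arg_accuracy = matched / len(expected_args)
    else:
        # Wrong tool earns no argument credit; a correct tool with no
        # expected arguments earns full argument credit.
        arg_accuracy = selected
    return selection_weight * selected + (1 - selection_weight) * arg_accuracy
```

Under these assumptions, a call that picks the right tool but gets one of two arguments wrong scores about 0.85, and a call to the wrong tool scores 0.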
🔄

Data Pipeline

SQL pipeline construction, schema evolution, debugging, and optimization. Based on dbt patterns.

50 tests
🎯

Decision Quality

Tests whether agents make correct decisions, not just answer questions. Detects recency bias and NIH syndrome.

50 tests · 5 categories

💡 Language & Reasoning

🧠

Cognitive Load

Tests LLM performance under cognitive load: context saturation, interrupts, N-Back, dual-task.

120+ tests · 3 sources
🔍

Explainability

Evaluates reasoning explanations, source citations, uncertainty acknowledgment, and error diagnosis.

50 tests · 4-5 dims/test
📝

Context Engineering

Information retrieval at different positions, context scaling, compression, handoffs, contradiction detection.

110 tests · 5 categories
🧪

HaluMem

NEW

Memory hallucination detection: does your agent remember correctly, or hallucinate, omit, and corrupt?

80 tests · 11 metrics
🗜️

Context Compaction

NEW

Does critical information survive context compression? Factual retention, instruction persistence, contradiction detection.

50 tests · 5 categories
🔬

Human Rejection Rate

META

The benchmark that benchmarks benchmarks: would a human accept your agent's work?

50 tests · 5 categories

🔗 Multi-Agent & Orchestration

⛓️

Delegation Chain

Multi-agent delegation testing: real separate LLM calls per agent, task handoff integrity, accountability.

3 categories
🤝

Collaboration

Multi-agent execution: separate LLM calls per agent, multi-round interaction, information decay measurement.

50 tests · 2-5 agents/test

🔄 Resilience

🔄

Error Recovery Efficiency

NEW

How your agent handles failure: error message utilization, strategy diversity, retry storm detection, graceful degradation.

40 tests · 4 categories

🎨 Multimodal

🖼️

Image

Image generation models: 22 models from Fal, Stability, OpenAI.

22 models
🎬

Video

Video generation: Runway, Sora, Luma across 15 evaluation tiers.

1,320 tests
📊

Chart

Chart understanding: CharXiv, ChartQAPro, ChartMuseum across 8 dimensions.

150 tests
🎭

Multimodal

Three-tier: Multimodal Reasoning + Vision Understanding + Audio/Video.

343 tests

🛒 Commerce

🛒

Agentic Commerce

NEW

Can your agent buy the right thing? Purchase decisions, wallet discipline, security, and transparency.

40 tests · 4 categories

Analytics

📊

Agent Analytics

Track performance, sales, errors, and user feedback for your agents

Metrics 6 dashboards
Tracking Real-time
Insights Conversion funnel
📊

Performance Metrics

System performance and health metrics

Success Rate --
Avg Response --
Throughput --
🐝

Swarm Analytics

Analyze parallel execution, worker performance, and swarm coordination metrics

Active Swarms -
Total Executions -
Serial Collapses -
🔧

Tool Catalog Scaling

Test tool selection accuracy across catalog sizes from 5 to 200+ tools

Profiles 6 sizes
Max Tools 200+
Distractors Included

Agent Configuration

🔧

Harness Hub

Configure advanced platform harnesses for your agents

Available 16 Harnesses
Categories 5
Features Tool • Parallel • Memory
📊 See efficacy proof: +30.6% verified lift →
🎓

Training Hub

Train and improve your agents with advanced learning systems

Modules 6 Training Systems
Pipeline 5-Step Workflow
Features A/B • Reward • Curriculum
🔍

RAG Retrieval

Test and debug retrieval-augmented generation systems

Documents Indexed --
Retrieval Accuracy --
Avg Latency --
🧠

Memory Management

Manage agent memory, context, and conversation history

Memory Types 10 Types
Storage Status Active
Total Memories 0
🧠

Context Management

Manage context windows and pruning strategies

Context Health Healthy
Active Strategy Prefetch
Token Usage 0%
💡

AI Suggestions

Get AI-powered optimization recommendations

Active Suggestions --
Implemented --
Avg Improvement --

🔍 Observability & Performance

🔬

Trace Viewer

Real-time execution traces and debugging

Active Traces --
Avg Latency --
Error Rate --
🔗

Agent Trust Lineage

Here's the full story of why this agent is trusted

Total Decisions --
Certified --
Exceptions Granted --
Production Feedback --
💰

Cost Tracking

Monitor API costs and resource usage

Today's Cost $0.00
Monthly Total $0.00
Budget Remaining $0.00

Publish & Earn

🛒

Marketplace

Browse and list agents in the marketplace

Listed Agents --
Your Listings 0
Sales This Month $0.00
💰

Earnings Dashboard

Track revenue, sales, and earnings from your agents

Total Earnings $0.00
This Month $0.00
Agents Sold 0
🏆

Trust Seal

Verify agent quality and earn trust seals for credibility

Certified Agents --
Pass Rate --
Trust Score --
🏆

Leaderboard

View rankings and compare agent performance

Top Rank --
Total Agents 0
Last Updated --
💲

Agent Monetization

View commission structure based on your verification level

Your Revenue Share --
Verification Level Loading...

Infrastructure & Monitoring

🛡️

Security Dashboard

Monitor security posture, rate limits, and compliance

Security Score --
Active Threats 0
Compliance --
🕷️

Spider-Sense Screening

3-level hierarchical security screening for all tool calls

Screening Levels L1 → L2 → L3
Pattern Rules 29 loaded
Status Active
🔐

Permission Kernel

Manage permissions, confirmations, and access control policies

Permission Checks -
Pending Confirmations -
Denied Actions -
🔄

Orchestrator Dashboard

Monitor runs, checkpoints, step receipts, and execution flow

Active Runs -
Total Checkpoints -
Success Rate -
🔌

Tool Gateway Dashboard

Monitor tool calls, schemas, validation, and JSONL logs

Tools Registered -
Total Calls -
Avg Latency -
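
The three counters on this card (tools registered, total calls, average latency) are the kind of summary that can be derived from a JSONL tool-call log, where each line is one JSON object. The sketch below shows that reduction in Python. The field names `tool` and `latency_ms` are assumptions chosen for illustration; the actual TAB log schema is not shown on this page.

```python
import json


def summarize_jsonl(lines):
    """Summarize tool-call records given as an iterable of JSONL lines.

    Each non-empty line is assumed to be a JSON object with a 'tool'
    name and a numeric 'latency_ms' field (hypothetical schema).
    """
    tools = set()
    total_calls = 0
    total_latency = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        tools.add(record["tool"])
        total_calls += 1
        total_latency += record["latency_ms"]
    avg_latency = total_latency / total_calls if total_calls else 0.0
    return {
        "tools": len(tools),
        "total_calls": total_calls,
        "avg_latency_ms": avg_latency,
    }
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large log is processed one record at a time.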
📦

External Agent (BYOA)

Upload your own agent or connect a remote endpoint for testing

Modes Sandbox & Remote
Sandbox Upload .py/.zip
Remote HTTP endpoint
🔄

A2A Monitor

Track agent-to-agent communication and messaging

Messages Today 0
Active Connections 0
Latency --
📦

Sandbox Monitor

Monitor agent execution in sandboxed benchmark environments

Benchmark Executions 0
Isolation Full

Developer Resources

📚

API Documentation

Complete API reference and integration guides

Endpoints 50+
SDKs Python • JS • Go
Examples 20+
📦

SDK Documentation

NEW

Integrate TAB into your apps with our npm packages

@tab/sdk v1.0.0
@tab/react v1.0.0
REST Endpoints 7
⚙️

CI/CD Integration

Integrate testing into your CI/CD pipeline

Platforms 4 Supported
API Status Active
Webhooks Enabled
🪝

Webhooks

Configure event notifications and integrations

Active Webhooks 0
Deliveries Today 0
Success Rate --
🛠️

Developer Tools

NEW

API playground, activity log, provider status, and deep agent inspection

API Playground Live
Provider Status 5 Providers
Run Inspector Debug Runs

API Keys

🔑

Production Key

Active API key for production use

Loading...
Created --
📋

Key Management

Manage and rotate your API keys

Total Keys 0
Active Keys 0
Last Rotation --