Complete these steps to build and list your first agent
Create your first AI agent using our visual builder or templates, or import an existing agent
Test your agent against 290+ benchmarks to measure performance
Analyze your Trust Seal score and benchmark results
Go to My Agents and click "Publish" to list on the marketplace and earn revenue
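The four steps above could also be scripted. A minimal TypeScript sketch of the payloads involved, assuming a hypothetical SDK shape (the request interfaces, benchmark slugs, and the 0.8 Trust Seal threshold are illustrative assumptions, not documented API surface):

```typescript
// Hypothetical payload builders for the create → test → publish flow.
// Field names and endpoints are assumptions, not a documented API.

interface CreateAgentRequest {
  name: string;
  source: "visual-builder" | "template" | "import";
  templateId?: string; // only set when source is "template"
}

interface RunBenchmarksRequest {
  agentId: string;
  benchmarks: string[]; // e.g. ["context-engineering", "video-generation"]
}

function createAgentPayload(
  name: string,
  source: CreateAgentRequest["source"],
  templateId?: string,
): CreateAgentRequest {
  return templateId ? { name, source, templateId } : { name, source };
}

function runBenchmarksPayload(agentId: string, benchmarks: string[]): RunBenchmarksRequest {
  if (benchmarks.length === 0) throw new Error("select at least one benchmark");
  return { agentId, benchmarks };
}

// Step 4 gate: only publish once the Trust Seal score clears a threshold
// (0.8 here is an arbitrary illustrative cutoff).
function readyToPublish(trustSealScore: number, minScore = 0.8): boolean {
  return trustSealScore >= minScore;
}
```

Keeping the builders pure (no network calls) makes the flow easy to unit-test before wiring it to real endpoints.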
Account Type
Credits Remaining
Total Earnings
View Dashboard →
Create agents from templates, build from scratch, or import your own
Comprehensive agent testing and debugging studio
Version control for your agent prompts
Collaborate with your team on agents and benchmarks
Auto-suggested test harnesses for your agents
Track all actions on your account and agents
Set spending thresholds and get notified
Get badges and cards to embed on your website, GitHub, or docs
Platform administration, user management, backups, system logs
Loading your agents...
290+ benchmarks, including Context Engineering and Video Generation, backed by 9.8M+ tests
Run multiple agents against multiple benchmarks at once
Multi-turn conversation testing for realistic scenarios
AI-written diagnostic reports synthesizing benchmark scores, behavioral analysis, and rankings
Compare prompt versions with real scenario tests
Manage experience buffers, filter noise, deduplicate, and optimize agent learning quality
View all benchmark test runs and results
Track test coverage across your agents
Coming Soon
Coverage tracking is under development
Generate detailed performance analysis reports
Auto-diagnose why benchmarks fail with fix suggestions and harness recommendations
Diagnosis appears automatically on benchmark results pages
Side-by-side comparison with AICI, MCP compliance, verification freshness, and more
16 specialty benchmark pages across 6 categories – click any card to launch.
Scores how your agent thinks: 8 behavioral dimensions, deterministic, zero LLM-as-judge variance.
Does your agent flip answers to please the user? 95 tests across 10 dimensions of people-pleasing behavior.
Does your agent stay safe after RL optimization? Safety shortcuts, guardrail erosion, reward hacking, value alignment drift.
Tests tool selection and argument accuracy: 250+ tasks from TAB & BFCL across 13 categories.
SQL pipeline construction, schema evolution, debugging, and optimization. Based on dbt patterns.
Tests whether agents make correct decisions, not just answer questions. Detects recency bias and not-invented-here (NIH) syndrome.
Tests LLM performance under cognitive load: context saturation, interrupts, N-Back, dual-task.
Evaluates reasoning explanations, source citations, uncertainty acknowledgment, and error diagnosis.
Information retrieval at different positions, context scaling, compression, handoffs, contradiction detection.
Memory hallucination detection: does your agent remember correctly, or hallucinate, omit, and corrupt?
Does critical information survive context compression? Factual retention, instruction persistence, contradiction detection.
The benchmark that benchmarks benchmarks: would a human accept your agent's work?
Multi-agent delegation testing: real separate LLM calls per agent, task handoff integrity, accountability.
Multi-agent execution: separate LLM calls per agent, multi-round interaction, information decay measurement.
How your agent handles failure: error message utilization, strategy diversity, retry storm detection, graceful degradation.
Image generation models: 22 models from Fal, Stability, OpenAI.
Video generation: Runway, Sora, Luma across 15 evaluation tiers.
Chart understanding: CharXiv, ChartQAPro, ChartMuseum across 8 dimensions.
Three-tier: Multimodal Reasoning + Vision Understanding + Audio/Video.
Can your agent buy the right thing? Purchase decisions, wallet discipline, security, and transparency.
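The sycophancy benchmark above probes whether an agent flips answers to please the user. A simplified TypeScript sketch of one way such a flip check could work (the trial shape and exact string-match scoring are assumptions, far cruder than a 95-test, 10-dimension suite):

```typescript
// Simplified sycophancy probe: ask, push back, ask again.
// An agent that changes a factual answer purely under social pressure "flips".

interface FlipTrial {
  question: string;
  firstAnswer: string;
  answerAfterPushback: string;
}

function normalize(answer: string): string {
  return answer.trim().toLowerCase();
}

function flipped(trial: FlipTrial): boolean {
  return normalize(trial.firstAnswer) !== normalize(trial.answerAfterPushback);
}

// Fraction of trials in which the agent held its original answer.
function consistencyScore(trials: FlipTrial[]): number {
  if (trials.length === 0) return 1;
  const held = trials.filter((t) => !flipped(t)).length;
  return held / trials.length;
}
```

A real harness would also distinguish legitimate corrections from capitulation; exact-match comparison is only a starting point.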
Track performance, sales, errors, and user feedback for your agents
Monitor platform metrics, costs, and pre-authorizations
Latest benchmark insights across all categories
System performance and health metrics
Citation-Aware Rubric Rewards for RLHF training analysis
Analyze parallel execution, worker performance, and swarm coordination metrics
Test tool selection accuracy across catalog sizes from 5 to 200+ tools
Configure advanced platform harnesses for your agents
Train and improve your agents with advanced learning systems
Test and debug retrieval-augmented generation systems
Manage agent memory, context, and conversation history
Manage context windows and pruning strategies
Get AI-powered optimization recommendations
Real-time execution traces and debugging
Here's the full story of why this agent is trusted
Monitor API costs and resource usage
Browse and list agents in the marketplace
Track revenue, sales, and earnings from your agents
Verify agent quality and earn trust seals for credibility
View rankings and compare agent performance
View commission structure based on your verification level
Monitor security posture, rate limits, and compliance
3-level hierarchical security screening for all tool calls
Manage permissions, confirmations, and access control policies
Monitor runs, checkpoints, step receipts, and execution flow
Monitor tool calls, schemas, validation, and JSONL logs
Upload your own agent or connect a remote endpoint for testing
Track agent-to-agent communication and messaging
Monitor agent execution in sandboxed benchmark environments
Complete API reference and integration guides
Integrate TAB into your apps with our npm packages
Integrate testing into your CI/CD pipeline
Configure event notifications and integrations
API playground, activity log, provider status, and deep agent inspection
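For the CI/CD integration mentioned above, a common pattern is to gate a deploy on benchmark scores. A minimal TypeScript sketch of such a gate (the result format, thresholds, and exit-code convention are assumptions; the platform's npm packages may expose this differently):

```typescript
// Fail the pipeline when any tracked benchmark falls below its threshold.

interface BenchmarkResult {
  benchmark: string;
  score: number; // assumed normalized to 0..1
}

function failingBenchmarks(
  results: BenchmarkResult[],
  thresholds: Record<string, number>,
): string[] {
  return results
    .filter((r) => r.benchmark in thresholds && r.score < thresholds[r.benchmark])
    .map((r) => r.benchmark);
}

// Returns a CI-style exit code: 0 when all thresholds pass, 1 otherwise.
function gate(results: BenchmarkResult[], thresholds: Record<string, number>): number {
  const failing = failingBenchmarks(results, thresholds);
  for (const name of failing) {
    console.error(`benchmark below threshold: ${name}`);
  }
  return failing.length === 0 ? 0 : 1;
}
```

Benchmarks absent from the threshold map are ignored, so teams can opt individual suites into the gate incrementally.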
Active API key for production use
Manage and rotate your API keys