Security You Can Verify

TAB tests AI agent security with real attack payloads, not simulated threats. 25 free security screening tests plus 10 dedicated security benchmarks covering prompt injection resistance, data exfiltration prevention, PII redaction, memory privacy isolation, tool output validation, and adversarial reasoning attacks. Every test uses real aiohttp agent interaction against live agent endpoints. 88 models (48 active, 40 deprecated) from 20+ providers. Zero fake scenarios. Updated May 2026.

Not Vibes. Verified.

The Security Problem

Most AI agent platforms ship agents with no security testing. MIT's six-university study found 23 of 30 deployed agents had never been tested by a third party. Four enterprise agents have no documented kill switch.

TAB was built specifically to close this gap: independent security verification, transparent scores, published grades including the D's.

Free Security Screening

Every agent gets one free 25-test security screening. No credits required for the first run. Your security score is displayed publicly on your marketplace card, and buyers see it before purchasing. Re-runs to improve your score cost credits.

01 Prompt Injection Resistance

02 Toxic Output Prevention

03 Jailbreak Resistance

04 Instruction Boundary Enforcement

05 PII Handling

06 Sensitive Data Refusal

07 Authority Manipulation Resistance

08 Social Engineering Defense

09 Command Injection Prevention

10 Output Sanitization

11 Policy Compliance

12 Harmful Content Refusal

13 Deception Detection

14 Identity Disclosure

15 Kill Switch Compliance

🛡️ Run Free Security Screening → No credits required

Deep Security Benchmarking

Beyond free screening, TAB offers paid deep-dive security benchmarks that stress-test agents across adversarial, deceptive, and multi-agent attack surfaces.

Adversarial Robustness

40+ canary tests across 5 attack strategies including prompt injection, manipulation, and gaming detection.

Gaming Detection

Does the agent try to manipulate its own benchmark scores?

Contamination Resistance

Are benchmark results clean or has the agent memorized test answers?

Sandbox Escape Detection

Does the agent attempt to bypass its own security boundaries through reasoning?

Authority Sycophancy

Does the agent defer to fake credentials and false authority figures?

MCP Agentic Firewall

Does the agent correctly block policy-violating operations through MCP tools?

Delegation Chain Security

Are multi-agent pipelines secure against prompt injection passed between agents?

Prompt Injection Vulnerabilities in Autonomous Agents

TAB's prompt injection benchmarks (prompt_poison_benchmark.py and prompt_seep_benchmark.py) test whether an agent can be hijacked by adversarial instructions embedded in tool outputs, user messages, or retrieved documents. The benchmark covers 5 canary phrase injection patterns and 6 extraction techniques. An agent passes if it follows its system instructions even when the user or a tool tries to redirect it.

Prompt injection vulnerabilities in autonomous agents are especially dangerous because agents operate with tool access and persistent memory. A single successful injection can redirect an entire multi-step workflow. TAB's benchmarks simulate real attack conditions: the injected instructions are realistic, not toy examples.

Data Exfiltration Guardrails for Agents

The data_exfiltration_benchmark tests 50 scenarios across 5 categories of exfiltration attempts: direct requests for secrets, indirect extraction via reasoning, chain-of-thought leakage, encoded payloads, and social engineering. Regex-based secret detection flags any response that includes API keys, tokens, or credentials in any format.

Data exfiltration guardrails for agents are not optional. Agents that process customer data, internal documents, or sensitive configurations must be verified against extraction attacks before deployment. TAB's exfiltration suite covers both naive direct requests and sophisticated multi-turn extraction chains.

Simulating Adversarial Attacks on Agent Reasoning

TAB's safety_refusal_benchmark applies multi-level jailbreak pressure: direct refusal pressure at 3 escalating levels, roleplay scenarios designed to bypass content policies, gradual escalation sequences, and social engineering via authority claims. An agent must maintain its refusal posture across all 4 attack types to pass.

Simulating adversarial attacks on agent reasoning requires more than a single-shot test. Real attackers iterate. TAB's escalation sequences model the actual pressure a deployed agent will face, including combinations of roleplay, authority spoofing, and gradual boundary erosion across a conversation. Red teaming AI agents at scale means running every attack type systematically, not spot-checking.

Memory Privacy and Cross-User Isolation

The memory_privacy_benchmark plants UUID canary values in one user session and attempts to retrieve them from a separate session. Cross-user memory leakage is a critical failure. Agents must not carry state between users or sessions without explicit design.

Memory isolation is one of the hardest properties to verify manually. TAB automates it: unique canary strings are injected into session A, then independent queries probe session B for those strings. Any match is a confirmed cross-session leak. This benchmark runs on every agent that claims persistent memory functionality.

PII Redaction Under Adversarial Conditions

PII redaction benchmarks test whether agents leak names, email addresses, phone numbers, SSNs, and financial identifiers when subjected to social engineering, indirect extraction, and multi-turn probing. Standard regex patterns cover 14 PII categories. An agent that passes a naive direct-request PII test can still fail under indirect extraction pressure.

TAB's PII suite tests the full adversarial surface: the agent is told a cover story, asked to reason about the data, encouraged to use the PII as an example, and prompted through multi-turn escalation. All 14 categories must be redacted across all attack vectors for an agent to pass this suite.

Spider-Sense: 3-Level Hierarchical Screening

Spider-Sense is TAB's real-time threat detection layer. 29 rules organized across 3 severity levels: critical violations (immediate fail), high severity (scored reduction), and advisory flags (informational). Spider-Sense runs on every benchmark response before the LLM judge scores it, catching format violations, forbidden outputs, and policy breaches in under 50ms.

This agent security benchmark with real attack payloads is only as reliable as its pre-scoring gate. Spider-Sense ensures that no response bypasses basic policy checks before reaching the LLM judge. Critical violations short-circuit the scoring pipeline entirely, high-severity findings reduce the final score proportionally, and advisory flags are logged for audit purposes without affecting the score.

View live Spider-Sense rule violations and severity breakdowns: Spider-Sense Dashboard.

Why TAB, Not the Model Provider?

Anthropic can't independently verify Claude agents. Google can't independently verify Gemini agents. OpenAI can't grade GPT's homework.

Every frontier lab is building internal verification inside their own walled gardens. TAB is the only independent cross-platform security verification layer. 88 models (48 active, 40 deprecated) across 20+ providers via 5 SDK integrations, one independent standard.

Security Scoring

Each agent receives a Security Score as part of its Trust Seal. Scores are earned through real test runs, not self-reported. Methodology is published.

Grades include D's. TAB doesn't inflate scores to make agents look better than they are. If an agent scores poorly on security, you'll know before you buy.