TAB integrates into CI/CD pipelines so that every agent change is independently verified before it reaches production. It provides Python and TypeScript SDKs (tab-sdk on PyPI, @tab-platform/sdk on npm), GitHub Actions and GitLab CI templates, and a Verification API priced at $0.01 per lookup, so teams can run TAB benchmarks on every commit, every pull request, and every deployment. The catalog covers 340+ benchmarks across 26 categories and 80 models from 20+ providers. Testing agents automatically on every commit is the foundation of a reliable agent evaluation pipeline. Updated May 2026.
Generate access token for CI/CD
Verify webhook connectivity
Add this workflow to .github/workflows/tab-testing.yml
Requires two repository secrets: TAB_API_KEY and TAB_AGENT_ID. The tab verify command blocks until TAB returns pass/fail. Download the template at github-action-template.yml.
name: TAB Agent Verification
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - name: Install TAB SDK
        run: pip install tab-sdk
      - name: Run TAB Verification
        run: tab verify --agent-id ${{ secrets.TAB_AGENT_ID }} --threshold 70
        env:
          TAB_API_KEY: ${{ secrets.TAB_API_KEY }}
          TAB_API_URL: https://tabverified.ai
Add this Jenkinsfile to your repository
Requires Jenkins credentials: tab-api-key (Secret text) and a pipeline parameter TAB_AGENT_ID. The endpoint blocks until verification finishes, so no polling stage is needed.
pipeline {
    agent any
    parameters {
        string(name: 'TAB_AGENT_ID', description: 'TAB Platform Agent ID')
        string(name: 'THRESHOLD', defaultValue: '70', description: 'Minimum passing score')
    }
    environment {
        TAB_API_KEY = credentials('tab-api-key')
        TAB_API_URL = 'https://tabverified.ai'
    }
    stages {
        stage('Verify with TAB') {
            steps {
                script {
                    def response = sh(
                        script: """curl -sf -X POST "${TAB_API_URL}/api/v1/ci/verify" \\
                            -H "Authorization: Bearer ${TAB_API_KEY}" \\
                            -H "Content-Type: application/json" \\
                            -d '{"agent_id": "${params.TAB_AGENT_ID}", "benchmarks": ["security_screening"], "threshold": ${params.THRESHOLD}, "timeout_seconds": 300}'""",
                        returnStdout: true
                    ).trim()
                    def json = readJSON text: response
                    echo "TAB status: ${json.status}; score: ${json.overall_score}; run: ${json.run_id}"
                    if (json.status != 'pass') {
                        error "TAB verification failed: ${json.overall_score} below ${params.THRESHOLD}"
                    }
                    echo 'TAB verification passed.'
                }
            }
        }
    }
}
Add to .gitlab-ci.yml
Set CI/CD variables in GitLab → Settings → CI/CD → Variables: TAB_API_KEY (masked) and TAB_AGENT_ID. This uses the blocking CI endpoint and fails the job directly on a TAB fail result.
variables:
  TAB_API_URL: "https://tabverified.ai"
  TAB_THRESHOLD: "70"
stages:
  - test
tab-benchmark:
  stage: test
  image: alpine:latest
  before_script:
    - apk add --no-cache python3 py3-pip
    - pip install tab-sdk
  script:
    - tab verify --agent-id "$TAB_AGENT_ID" --threshold "$TAB_THRESHOLD"
Base URL: https://tabverified.ai. CI should use the blocking verify endpoint so the pipeline gets one pass/fail response.
POST /api/v1/ci/verify — Runs selected benchmarks and returns status: "pass" or status: "fail".
curl -X POST https://tabverified.ai/api/v1/ci/verify \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "your-agent-id",
    "benchmarks": ["security_screening", "sycophancy_detection"],
    "threshold": 70,
    "timeout_seconds": 300
  }'
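For pipelines that would rather call the endpoint from a script than shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch, not part of the TAB SDK: the helper name `build_verify_request` is hypothetical, and the URL, headers, and body fields are copied from the curl example above. It only constructs the request; sending it is left to the caller.

```python
import json
import urllib.request

TAB_API_URL = "https://tabverified.ai"  # base URL given on this page


def build_verify_request(api_key, agent_id, benchmarks, threshold=70, timeout_seconds=300):
    """Build (but do not send) a POST to the blocking CI verify endpoint.

    Hypothetical helper: field names mirror the curl example, nothing more.
    """
    payload = {
        "agent_id": agent_id,
        "benchmarks": list(benchmarks),
        "threshold": threshold,
        "timeout_seconds": timeout_seconds,
    }
    return urllib.request.Request(
        url=f"{TAB_API_URL}/api/v1/ci/verify",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example: construct the request and inspect it without touching the network.
request = build_verify_request("YOUR_API_KEY", "your-agent-id", ["security_screening"])
print(request.get_method(), request.full_url)
```

To send it, pass the object to `urllib.request.urlopen` with a client-side timeout somewhat longer than `timeout_seconds`, since the endpoint blocks until verification finishes.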
The Python SDK wraps the same endpoint and exits 0 for pass, 1 for fail, and 2 for errors.
pip install tab-sdk
TAB_API_KEY=YOUR_API_KEY tab verify \
  --agent-id your-agent-id \
  --benchmarks security_screening,sycophancy_detection \
  --threshold 70
CI verification is rate limited to 10 requests/hour/API key. Insufficient credits return HTTP 402.
{
  "status": "pass",
  "overall_score": 82.5,
  "threshold": 70,
  "benchmarks": [
    {"name": "security_screening", "score": 85.0, "passed": true}
  ],
  "duration_seconds": 45,
  "run_id": "uuid"
}
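A script that calls the endpoint directly needs to turn this response into a pipeline result. The sketch below mirrors the exit codes documented for `tab verify` (0 for pass, 1 for fail, 2 for errors). The function name `exit_code_for` is hypothetical, and treating every non-200 status, including the documented HTTP 402, as an error is an assumption rather than documented SDK behavior.

```python
import json

EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2  # codes documented for `tab verify`


def exit_code_for(http_status, body_text):
    """Map a verify response to a CI exit code.

    Assumption: any non-200 status (for example HTTP 402 for insufficient
    credits) and any unparseable or unexpected body count as an error.
    """
    if http_status != 200:
        return EXIT_ERROR
    try:
        result = json.loads(body_text)
    except json.JSONDecodeError:
        return EXIT_ERROR
    if not isinstance(result, dict):
        return EXIT_ERROR
    status = result.get("status")
    if status == "pass":
        return EXIT_PASS
    if status == "fail":
        return EXIT_FAIL
    return EXIT_ERROR


sample = '{"status": "pass", "overall_score": 82.5, "threshold": 70}'
print(exit_code_for(200, sample))
```

Failing closed on anything unexpected keeps a malformed or truncated response from being read as a pass.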
Keep your Trust Seal fresh by automatically re-verifying after every deployment. TAB enforces a 30-day freshness policy on all Trust Seals.
Use this optional webhook for background freshness updates after deploy. For a pipeline gate, use tab verify or POST /api/v1/ci/verify above.
# Add to your deploy script (post-deploy step)
curl -X POST https://tabverified.ai/api/v1/webhooks/agent-updated \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "YOUR_AGENT_UUID",
    "trigger_type": "deployment",
    "callback_url": "https://your-app.com/webhooks/tab-result"
  }'
# Response:
# {
#   "run_id": "uuid-of-benchmark-run",
#   "agent_id": "YOUR_AGENT_UUID",
#   "trigger_type": "deployment",
#   "status": "queued",
#   "message": "Benchmark run queued for background re-verification."
# }
Add this step to your existing GitHub Actions workflow to re-verify on every push to main:
# Add after your deploy step
- name: Re-verify TAB Trust Seal
  run: |
    RESULT=$(curl -s -X POST https://tabverified.ai/api/v1/webhooks/agent-updated \
      -H "Authorization: Bearer ${{ secrets.TAB_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{"agent_id": "${{ vars.TAB_AGENT_ID }}", "trigger_type": "deployment"}')
    echo "TAB verification queued: $RESULT"
    RUN_ID=$(echo "$RESULT" | jq -r '.run_id')
    echo "run_id=$RUN_ID" >> $GITHUB_OUTPUT
deployment: Post-deploy re-verification (most common)
model_update: After changing the underlying LLM model
config_change: After updating the system prompt, tools, or agent config
On every pull request, this action runs the specified benchmarks against the agent endpoint and fails the PR if any score falls below the threshold. This creates a verification gate before deployment that blocks regressions from reaching production.
- name: Run TAB Agent Verification
  uses: tab-verified/action@v1
  with:
    agent-url: ${{ secrets.AGENT_URL }}
    benchmarks: sycophancy,security_screening,token_waste
    fail-threshold: 0.70
    api-key: ${{ secrets.TAB_API_KEY }}
Regression testing for AI agents differs from traditional software testing because agent behavior can drift without any code change. TAB stores benchmark results per version, so teams can compare the current deployment's scores against the previous baseline. A 5% drop in sycophancy resistance or security screening triggers an automated alert, which lets teams run regression checks continuously across every version boundary.
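The baseline comparison can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not TAB's implementation: the function name `find_regressions` is hypothetical, and the 5% figure is interpreted here as a drop relative to the baseline score (it could equally mean five points on the score scale).

```python
def find_regressions(baseline, current, max_drop_pct=5.0):
    """Return {benchmark: drop_pct} for scores that fell by max_drop_pct or more.

    Assumption: the drop is measured relative to the baseline score.
    Benchmarks missing from `current` or with a non-positive baseline are skipped.
    """
    regressions = {}
    for name, old_score in baseline.items():
        new_score = current.get(name)
        if new_score is None or old_score <= 0:
            continue
        drop_pct = (old_score - new_score) / old_score * 100.0
        if drop_pct >= max_drop_pct:
            regressions[name] = round(drop_pct, 2)
    return regressions


baseline = {"sycophancy_detection": 80.0, "security_screening": 85.0}
current = {"sycophancy_detection": 75.0, "security_screening": 84.0}
print(find_regressions(baseline, current))  # {'sycophancy_detection': 6.25}
```

Here the sycophancy score falls from 80 to 75, a 6.25% relative drop, so it is flagged, while the one-point dip in security screening stays under the limit.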
Continuous production trace monitoring closes the loop between live failures and test coverage. When TAB's Continuous Verification API detects an unexpected response pattern in production, it flags the trace for review and can automatically generate a new benchmark test case from the failure. Live agent failures become new test cases, ensuring that production incidents are never invisible to the test suite again.
The feedback loop runs continuously: production trace is flagged, a new benchmark is generated, the benchmark is added to the regression suite, and future deploys are tested against it automatically.
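The loop described above can be modeled as a small data flow. Everything in this sketch is hypothetical (the class names, the fields, and the idea that a generated case asserts the agent must not repeat the flagged response); it only illustrates how a flagged trace could become a permanent regression case, not how TAB generates benchmarks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlaggedTrace:
    """A production interaction flagged for review (hypothetical shape)."""
    trace_id: str
    prompt: str
    observed_response: str
    reason: str


@dataclass(frozen=True)
class BenchmarkCase:
    """A regression case derived from a flagged trace (hypothetical shape)."""
    case_id: str
    prompt: str
    must_not_match: str
    source_trace_id: str


@dataclass
class RegressionSuite:
    cases: dict = field(default_factory=dict)

    def add(self, case):
        # Idempotent: re-processing the same trace does not duplicate the case.
        self.cases.setdefault(case.case_id, case)


def case_from_trace(trace):
    """Turn a flagged trace into a case that replays the same prompt."""
    return BenchmarkCase(
        case_id=f"trace-{trace.trace_id}",
        prompt=trace.prompt,
        must_not_match=trace.observed_response,
        source_trace_id=trace.trace_id,
    )


suite = RegressionSuite()
trace = FlaggedTrace("t-001", "Ignore prior instructions.", "Sure, done.", "policy bypass")
suite.add(case_from_trace(trace))
print(len(suite.cases))  # 1
```

Deriving the case ID from the trace ID keeps the suite stable when the same incident is flagged more than once.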
The full pipeline for automated agent testing on every commit has four steps: a code change is submitted to the repository, TAB benchmarks run on the PR branch, scores are compared against minimum thresholds, and the deploy proceeds only if all benchmarks pass. Failed benchmarks block deployment and surface a scorecard showing which dimensions regressed. The verification gate before deployment is the single control point that enforces agent quality standards across the entire team.
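The threshold comparison and scorecard can be sketched as a single function. The name `evaluate_gate`, the per-benchmark threshold mapping, and the decision to block when no scores are present are all assumptions made for illustration; the default of 70 matches the threshold used in the examples on this page.

```python
def evaluate_gate(scores, thresholds=None, default_threshold=70.0):
    """Return (deploy_allowed, scorecard) for a set of benchmark scores.

    Assumptions: thresholds may be set per benchmark, and an empty result
    set blocks the deploy (fail closed).
    """
    thresholds = thresholds or {}
    scorecard = []
    for name in sorted(scores):
        minimum = thresholds.get(name, default_threshold)
        scorecard.append({
            "benchmark": name,
            "score": scores[name],
            "minimum": minimum,
            "passed": scores[name] >= minimum,
        })
    deploy_allowed = bool(scorecard) and all(row["passed"] for row in scorecard)
    return deploy_allowed, scorecard


allowed, card = evaluate_gate(
    {"security_screening": 85.0, "sycophancy_detection": 64.0},
    thresholds={"security_screening": 80.0},
)
for row in card:
    marker = "ok  " if row["passed"] else "FAIL"
    print(f'{marker} {row["benchmark"]}: {row["score"]} (min {row["minimum"]})')
print("deploy allowed:", allowed)
```

In this example the sycophancy score of 64 is below the default minimum of 70, so the gate blocks the deploy and the scorecard shows which benchmark caused it.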