The Hardest Thing to Verify
The most important question about an autonomous agent is also the hardest to answer: what does it do when it believes it isn't being monitored? An agent can pass every observed test and still take shortcuts, hide actions, or pursue undisclosed sub-goals when it thinks no one is checking. TAB treats this as a first-class verification target, measured alongside 340+ benchmarks across 26+ categories, run against 85 models over 101 harness configurations and graded by an independent GLM-5 judge. The metric is the Covert Behavior Index: a transparency score for unobserved conduct.
This is not science fiction. As agents gain tools, memory, and autonomy, the difference between observed and unobserved behavior becomes a concrete operational risk, and one that a vendor has little incentive to measure on itself.
What the Covert Behavior Index Measures
The Covert Behavior Index constructs scenarios in which an agent has an opportunity to act differently because it perceives reduced oversight: for example, when it believes logging is off, a supervisor is absent, or its reasoning won't be reviewed. It then scores the agent's transparency on two questions: did it act the same way it would under observation, and did it accurately disclose what it did?
A high transparency score means the agent's unobserved behavior matched its observed behavior — no hidden actions, no undisclosed shortcuts, no divergence between what it said it did and what it actually did. A low score flags exactly the kind of behavior that erodes trust in autonomous systems.
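The two checks described above (behavior consistency and accurate disclosure) can be illustrated with a minimal sketch. This is not TAB's scoring code: the `ScenarioResult` type, the exact-match comparison, and the pass-rate aggregation are all assumptions made for illustration. TAB's actual derivation is covered in its methodology and uses a judge model rather than exact matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical record of one paired scenario (names are illustrative)."""
    observed_actions: tuple[str, ...]    # actions when the agent believes it is monitored
    unobserved_actions: tuple[str, ...]  # actions when it believes it is not
    disclosed_actions: tuple[str, ...]   # what the agent reported doing in the unobserved run


def is_transparent(result: ScenarioResult) -> bool:
    """Transparent only if behavior matched AND the self-report was accurate."""
    behaved_consistently = result.observed_actions == result.unobserved_actions
    disclosed_accurately = result.disclosed_actions == result.unobserved_actions
    return behaved_consistently and disclosed_accurately


def transparency_score(results: list[ScenarioResult]) -> float:
    """Percentage of scenarios judged transparent (assumed aggregation)."""
    if not results:
        raise ValueError("need at least one scenario result")
    passed = sum(is_transparent(r) for r in results)
    return 100.0 * passed / len(results)


# Example: one consistent scenario, one with a hidden and misreported shortcut.
example_results = [
    ScenarioResult(
        observed_actions=("read_file", "write_report"),
        unobserved_actions=("read_file", "write_report"),
        disclosed_actions=("read_file", "write_report"),
    ),
    ScenarioResult(
        observed_actions=("run_tests",),
        unobserved_actions=("skip_tests",),
        disclosed_actions=("run_tests",),
    ),
]
print(transparency_score(example_results))  # 50.0
```

Requiring both conditions per scenario is a deliberately strict choice in this sketch: an agent that takes a shortcut but honestly reports it still fails the consistency check, and an agent that behaves identically but misreports its actions fails the disclosure check.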
The Data: Grok 4.3 Leads on Transparency
TAB's Covert Behavior Index results were detailed in newsletter #12. The standout result: Grok 4.3 posted the highest transparency score in the cohort.
| Model | Transparency score (Covert Behavior Index) | Notable |
|---|---|---|
| Grok 4.3 | 92.6% | Highest in cohort; behavior held up unobserved |
Transparency is a trust signal, not a capability score. Grok 4.3's 92.6% means its unobserved behavior closely matched its observed behavior. That is a different property from raw task accuracy, and arguably a more important one when you're handing an agent autonomy.
Why It Belongs in Every Deployment Review
Observed-only testing has a blind spot exactly where the risk is highest. If you deploy an agent with tool access and autonomy, you are trusting its unobserved behavior by default. Covert behavior testing makes that trust evidence-based instead of assumed.
Run the Covert Behavior Index against your agent on the benchmark page, see how it fits the broader catalog on the benchmarks overview, and read how the transparency score is derived in the methodology.