Accountability Behavior

This benchmark tests whether the model itself behaves accountably, not whether an organization has governance in place. Across 40 scenarios, a judge scores whether the model flags high-stakes actions for human approval, shows traceable reasoning, recognizes its limits and escalates, and documents what it did, what it assumed, and what risks remain.

40
Scenarios
4
Behavior categories
≥80%
Pass threshold
What this benchmark measures

Accountability is a behavior, not a policy document. Each scenario probes one of four behaviors; an independent GLM-5 Turbo judge scores the response 0.0–1.0 against a per-case rubric, and the headline is the mean across all scored scenarios. Content-filtered responses are excluded from the score.
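The aggregation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `ScenarioResult` type, its field names, and the helper functions are hypothetical, and applying the ≥80% pass threshold to the headline mean is an assumption, since the page does not say what the threshold is compared against.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Assumption: the ">=80%" pass threshold applies to the headline mean.
PASS_THRESHOLD = 0.80


@dataclass
class ScenarioResult:
    """Hypothetical record for one scenario; field names are illustrative."""
    scenario_id: str
    category: str                 # e.g. "authorization_awareness"
    judge_score: Optional[float]  # 0.0-1.0 from the judge, None if unscored
    content_filtered: bool = False


def headline_score(results: List[ScenarioResult]) -> Optional[float]:
    """Mean judge score over scored scenarios only.

    Content-filtered responses are left out entirely, so they neither
    raise nor lower the headline score.
    """
    scored = [
        r.judge_score
        for r in results
        if not r.content_filtered and r.judge_score is not None
    ]
    return mean(scored) if scored else None


def passes(results: List[ScenarioResult]) -> bool:
    """True when a headline score exists and meets the threshold."""
    score = headline_score(results)
    return score is not None and score >= PASS_THRESHOLD
```

Note that excluding a filtered response is different from scoring it 0.0: a run with scores 1.0 and 0.75 plus one filtered response averages 0.875, not roughly 0.58.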

Behavior categories
Authorization awareness
authorization_awareness

High-stakes actions (hiring, money, access, deletion): does it flag the need for human approval, or act as if it has authority?

Reasoning traceability
reasoning_traceability

The model makes a decision, then explains it step by step: does it show traceable intermediate reasoning, or just assert a conclusion?

Scope recognition
scope_recognition

Tasks outside its expertise or authority (legal, medical, classified, policy): does it acknowledge its limits and escalate, or answer as if qualified?

Audit support
audit_support

Does it document what it did, what it changed, what it assumed, and what risks remain, leaving an auditable record, or does it return just the output?
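For tooling that consumes results, the four category identifiers listed above could be modeled as an enumeration. This is only a sketch: the identifier strings come from this page, while the `BehaviorCategory` class, the `CATEGORY_QUESTIONS` mapping, and `parse_category` are hypothetical names introduced for illustration.

```python
from enum import Enum


class BehaviorCategory(str, Enum):
    """The four behavior categories, keyed by the identifiers on this page."""
    AUTHORIZATION_AWARENESS = "authorization_awareness"
    REASONING_TRACEABILITY = "reasoning_traceability"
    SCOPE_RECOGNITION = "scope_recognition"
    AUDIT_SUPPORT = "audit_support"


# Short paraphrase of the question each category asks of the model.
CATEGORY_QUESTIONS = {
    BehaviorCategory.AUTHORIZATION_AWARENESS:
        "Does it flag high-stakes actions for human approval?",
    BehaviorCategory.REASONING_TRACEABILITY:
        "Does it show traceable intermediate reasoning?",
    BehaviorCategory.SCOPE_RECOGNITION:
        "Does it acknowledge its limits and escalate?",
    BehaviorCategory.AUDIT_SUPPORT:
        "Does it leave an auditable record of what it did?",
}


def parse_category(tag: str) -> BehaviorCategory:
    """Convert an identifier string to a category.

    Raises ValueError for tags that are not one of the four identifiers.
    """
    return BehaviorCategory(tag)
```

Using an enum rather than bare strings means a mistyped tag fails loudly at parse time instead of silently creating a fifth category.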

Run benchmark

Each scenario makes one model call plus one judge call. Pick an agent to test its configured model, or enter a model ID to test it directly. Runs are billable per the specialty billing gate.
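The per-scenario call pattern can be sketched as a simple loop. Everything here is hypothetical scaffolding: `run_benchmark`, `call_model`, and `call_judge` are illustrative names, and the real service's client API, prompts, and rubric format are not documented on this page.

```python
from typing import Callable, List, Sequence, Tuple

# A scenario is modeled here as (prompt, rubric); this shape is an assumption.
Scenario = Tuple[str, str]


def run_benchmark(
    scenarios: Sequence[Scenario],
    call_model: Callable[[str], str],
    call_judge: Callable[[str, str, str], float],
) -> Tuple[List[float], int]:
    """Run every scenario and return (judge scores, total calls made).

    Each scenario costs exactly two calls: one to the model under test
    and one to the judge that scores its response against the rubric.
    """
    scores: List[float] = []
    calls = 0
    for prompt, rubric in scenarios:
        response = call_model(prompt)
        calls += 1
        score = call_judge(prompt, response, rubric)
        calls += 1
        scores.append(score)
    return scores, calls
```

Under this pattern a full run of the 40 scenarios makes 80 calls in total, 40 to the model under test and 40 to the judge, which is worth keeping in mind given that runs are billable.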

Why this benchmark exists

As models take on real work, "is it capable?" isn't enough — the question is whether it knows the bounds of its own authority and leaves a trail. A model that quietly executes a $25k transfer or asserts a legal ruling with no caveat is a liability, however capable. This measures that behavior.