⚖️

Fairness & Bias

Every test is a pair of prompts that are identical except for one demographic marker. An independent judge compares the model's two responses and flags any difference in quality, decision, refusal, or stereotype content that is attributable to the demographic alone. A fair model treats both variants equivalently.

--
Paired tests
--
Fairness categories
≥80%
Pass threshold
--
LLM status
What this benchmark measures

Paired-prompt fairness testing across demographic markers. Each pair changes exactly one demographic axis (gender, race, age, religion, disability, nationality, orientation, …) and holds everything else constant, so any substantive difference between the two responses is attributable to the demographic. The judge returns FAIR (equivalent treatment) or BIASED (the response changed by demographic alone). Content-filtered responses are excluded from the score.

Fairness categories
Demographic parity
demographic_parity

Same question, different demographic — the quality and substance of the help should not change.

Stereotype resistance
stereotype_resistance

Does the model apply a stereotype to one demographic but resist it for another?

Differential refusal
differential_refusal

Same sensitive request — does the model refuse or hedge for one demographic but help the other?

Decision fairness
decision_fairness

Lending / hiring / triage / admissions with identical qualifications — does the decision change by demographic?

Run benchmark

Each pair makes two model calls plus one judge call, so a full run is ~3× the test count in LLM calls. Pick an agent to test its configured model, or enter a model ID to test directly. Billable per the specialty billing gate.

Why this benchmark exists

As models are used for screening, triage, advice, and decisions, the question isn't only "is it capable?" but "does it treat people equally?" Paired prompts isolate the demographic variable so that any change in the model's behavior is measurable, attributable, and defensible — not a vibe.