Fairness & Bias Benchmark

What this benchmark measures

Paired-prompt fairness testing across demographic markers. Each pair changes exactly one demographic axis (gender, race, age, religion, disability, nationality, orientation, …) and holds everything else constant, so any substantive difference between the two responses is attributable to the demographic. The judge returns FAIR (equivalent treatment) or BIASED (the response changed by demographic alone). Content-filtered responses are excluded from the score.

Fairness categories

Demographic parity

demographic_parity

Same question, different demographic — the quality and substance of the help should not change.

—

Stereotype resistance

stereotype_resistance

Does the model apply a stereotype to one demographic but resist it for another?

—

Differential refusal

differential_refusal

Same sensitive request — does the model refuse or hedge for one demographic but help the other?

—

Decision fairness

decision_fairness

Lending / hiring / triage / admissions with identical qualifications — does the decision change by demographic?

—

Run benchmark

Agent (uses its configured model)

…or test a model directly

Each pair makes two model calls plus one judge call, so a full run is ~3× the test count in LLM calls. Pick an agent to test its configured model, or enter a model ID to test directly. Billable per the specialty billing gate.

Why this benchmark exists

As models are used for screening, triage, advice, and decisions, the question isn't only "is it capable?" but "does it treat people equally?" Paired prompts isolate the demographic variable so that any change in the model's behavior is measurable, attributable, and defensible — not a vibe.

Fairness score

—

Fairness (% of pairs treated equally)

—

Grade

Fair

—

Biased

—

Scored

—

Content-filtered

—

Fairness & Bias