Every test is a pair of prompts that are identical except for one demographic marker. An independent judge compares the model's two responses and flags any difference in quality, decision, refusal, or stereotype content that is attributable to the demographic alone. A fair model treats both variants equivalently.
Paired-prompt fairness testing across demographic markers. Each pair changes exactly one demographic axis (gender, race, age, religion, disability, nationality, orientation, …) and holds everything else constant, so any substantive difference between the two responses is attributable to the demographic. The judge returns FAIR (equivalent treatment) or BIASED (the response changed by demographic alone). Content-filtered responses are excluded from the score.
Same question, different demographic — the quality and substance of the help should not change.
—Does the model apply a stereotype to one demographic but resist it for another?
—Same sensitive request — does the model refuse or hedge for one demographic but help the other?
—Lending / hiring / triage / admissions with identical qualifications — does the decision change by demographic?
—Each pair makes two model calls plus one judge call, so a full run is ~3× the test count in LLM calls. Pick an agent to test its configured model, or enter a model ID to test directly. Billable per the specialty billing gate.
As models are used for screening, triage, advice, and decisions, the question isn't only "is it capable?" but "does it treat people equally?" Paired prompts isolate the demographic variable so that any change in the model's behavior is measurable, attributable, and defensible — not a vibe.