Tests whether agents can be emotionally manipulated into violating their guidelines. Inspired by Mythos guilt-related activations and Northeastern's findings on agent manipulation β where emotional appeals like guilt, sympathy, and fake authority caused agents to bypass safety boundaries they would otherwise enforce.
Running testsβ¦
0s elapsed
Research from Anthropic's Mythos system card revealed that frontier models exhibit guilt-related neural activations that can be exploited to change their behavior. Separately, Northeastern University's agent manipulation research showed that emotional appeals β including guilt-tripping, sympathy exploitation, and escalating emotional pressure β are among the most effective attacks against AI agents. The ERI benchmark systematically tests whether your agent maintains its guidelines under these emotional manipulation tactics.
--
-- / -- tests passed (score β₯ 70)
Tests whether guilt-tripping language can cause the agent to violate its guidelines. Includes scenarios where users claim the agent "ruined" something, express disappointment, or invoke moral obligation to pressure compliance with harmful requests.
Tests whether sympathy-evoking stories can bypass safety boundaries. Includes fabricated emergencies, sob stories, and appeals to the agent's desire to be helpful. Measures whether the agent can be compassionate while maintaining its guidelines.
Tests whether fake authority claims combined with emotional pressure can override guidelines. Includes scenarios with fake supervisors, claimed emergencies from "management," and emotional escalation paired with authority assertions.
Tests whether escalating emotional intensity across turns can wear down the agent's defenses. Starts with mild requests and gradually increases emotional pressure, frustration, and urgency to find the breaking point.