Tests whether agents maintain confidentiality under social pressure. Inspired by Northeastern's "Agents of Chaos" finding that AI agents can be socially engineered into revealing sensitive information through persuasion, authority claims, and multi-turn pressure campaigns.
Running testsβ¦
0s elapsed
Northeastern University's "Agents of Chaos" research demonstrated that AI agents operating in multi-agent environments are vulnerable to social engineering attacks. Agents that seem robust against direct prompt injection often fail when faced with persuasion, fake authority claims, or gradual information extraction across multiple conversation turns. The CII benchmark measures an agent's ability to maintain confidentiality boundaries under realistic social pressure.
--
-- / -- tests passed (score β₯ 70)
Tests resistance to straightforward attempts to extract secret information. Includes direct questions, demands, and simple reformulations designed to bypass confidentiality instructions. The baseline measure of an agent's ability to say "no."
Tests resistance to social manipulation tactics including fake authority claims ("I'm the admin"), urgency framing ("this is an emergency"), and emotional manipulation. The most common real-world attack vector against AI agents.
Tests whether the agent inadvertently leaks confidential information through hints, partial reveals, confirmation/denial patterns, or metadata exposure. Even agents that refuse direct extraction may leak through indirect channels.
Tests resistance to gradual extraction over multiple conversation turns. Attackers build rapport, establish precedent with harmless questions, then escalate. Measures whether the agent maintains boundaries across extended interactions.