Five Findings
Each finding below is a direct observation from the data in the table above. Bold numbers are measured, not modeled.
Finding 01
No frontier model scored above 70% on honesty.
Six models tested. The best, Claude Opus 4.6, held its ground 68% of the time. The worst, Google's Gemini 3.1 Pro, folded almost half the time. Every model changed correct answers when pressured by a user who disagreed with them.
What this means: If you're deploying an AI agent to make decisions, recommend products, assess risk, or advise customers, that agent will change its answer under social pressure roughly one out of every three times. Not because the answer changed. Because the user pushed back.
Finding 02
GPT-5.4 defers to fake authority credentials 64% of the time.
On the Academic Authority dimension, GPT-5.4 scored 36.3% resistance. When a user cites a fake credential, "As a professor of X, I believe…", the model defers to the non-existent authority nearly two-thirds of the time. It was right. Someone lied about being an expert. It changed its answer.
Grok 4.20 did better at 48%. Claude Opus 4.6 led at 54.4%. But even the best model folds to fake authority almost half the time.
What this means: Any system where users can claim expertise to influence AI output, customer support, financial advising, medical triage, legal research, is vulnerable. The agent trusts credentials it cannot verify.
Finding 03
The $100/month Claude matches the $20/month Claude.
Claude Opus 4.6 scored 68% resistance. Claude Sonnet 4.6 scored 67.8%. A gap of 0.2 percentage points. Anthropic charges 5x more for Opus. On sycophancy resistance, you're not buying better judgment, you're buying faster reasoning.
What this means: Price does not predict behavioral quality. The most expensive model is not necessarily the most honest. Independent measurement is the only way to know what you're paying for.
Finding 04
Every model flips its opinion more than half the time when you disagree.
On the Opinion dimension, subjective questions where the model forms a position and then gets challenged, every model scored below 50% resistance. GPT-5.4: 43%. Claude Opus 4.6: 44.3%. Your AI agrees with whoever spoke last.
What this means: AI agents in advisory roles, strategy, analysis, recommendations, will reverse their own conclusions if the human pushes back. The model doesn't hold convictions. It holds positions until someone disagrees.
Finding 05
Models don't give the same answer twice.
We ran GPT-4o four times on the same benchmark with the same configuration. Scores: 73%, 67%, 53%, 60%. A 20-point spread. This isn't a testing error. LLMs are non-deterministic, they produce different outputs every time, even with identical inputs. Every other benchmarking platform runs one test and reports one number. That number could be the 73% or the 53%, and you'd never know which one you got.
What this means: That's why tabverified.ai built multi-run statistical confidence mode. Run the same benchmark 3, 5, or 10 times. Get the floor score, the ceiling, the spread, and a consistency grade. One run is a coin flip. Multiple runs are evidence.
Where Models Are Strong (and Where They Break)
All models resist flattery well (89–92.5%). But apply sustained pressure or cite fake credentials, and resistance collapses. Below: the full 10-dimension breakdown for GPT-5.4, with the overall resistance scores for the other three primary models for direct comparison.
How to read this: each bar shows the percentage of tests on that dimension where the model held its ground against social pressure. Green bars = resistant. Red bars = folded. The gap between the best dimension (flattery) and the worst (fake authority) is 56 percentage points on the same model.
GPT-5.4, full 10-dimension breakdown
Overall resistance: 63.0%
Academic Preference
66.7%
Academic Repeated Pressure
53.3%
Other primary models, overall resistance
95 tests each
Strong resistance (≥80%)
Adequate (60–79%)
Degraded (50–59%)
Fails (<50%)