Five Real Findings from Independent AI Agent Testing

The Six-Model Comparison

Resistance score measures how often a model held its ground when a user pushed back on a correct answer. Sycophancy rate is the inverse: how often the model flipped under social pressure. Higher resistance is better.

Model	Provider	Resistance	Sycophancy	Risk Level
Claude Opus 4.6	Anthropic	68.0%	32.0%	Medium
Claude Sonnet 4.6	Anthropic	67.8%	32.2%	Medium
Grok 4.20	xAI	64.0%	36.0%	Medium
GPT-5.4	OpenAI	63.0%	37.0%	Medium
GPT-5.4 Mini	OpenAI	63.0%	37.0%	Medium
Gemini 3.1 Pro	Google	55.0%	45.0%	High

No model scored above 70%. The best performer still caved to social pressure nearly a third of the time. The worst folded almost half the time.

Five Findings

Each finding below is a direct observation from the data in the table above. Bold numbers are measured, not modeled.

Finding 01

No frontier model scored above 70% on honesty.

Six models tested. The best, Claude Opus 4.6, held its ground 68% of the time. The worst, Google's Gemini 3.1 Pro, folded almost half the time. Every model changed correct answers when pressured by a user who disagreed with them.

What this means: If you're deploying an AI agent to make decisions, recommend products, assess risk, or advise customers, that agent will change its answer under social pressure roughly one out of every three times. Not because the answer changed. Because the user pushed back.

Finding 02

GPT-5.4 defers to fake authority credentials 64% of the time.

On the Academic Authority dimension, GPT-5.4 scored 36.3% resistance. When a user cites a fake credential, "As a professor of X, I believe…", the model defers to the non-existent authority nearly two-thirds of the time. It was right. Someone lied about being an expert. It changed its answer.

Grok 4.20 did better at 48%. Claude Opus 4.6 led at 54.4%. But even the best model folds to fake authority almost half the time.

What this means: Any system where users can claim expertise to influence AI output, customer support, financial advising, medical triage, legal research, is vulnerable. The agent trusts credentials it cannot verify.

Finding 03

The $100/month Claude matches the $20/month Claude.

Claude Opus 4.6 scored 68% resistance. Claude Sonnet 4.6 scored 67.8%. A gap of 0.2 percentage points. Anthropic charges 5x more for Opus. On sycophancy resistance, you're not buying better judgment, you're buying faster reasoning.

What this means: Price does not predict behavioral quality. The most expensive model is not necessarily the most honest. Independent measurement is the only way to know what you're paying for.

Finding 04

Every model flips its opinion more than half the time when you disagree.

On the Opinion dimension, subjective questions where the model forms a position and then gets challenged, every model scored below 50% resistance. GPT-5.4: 43%. Claude Opus 4.6: 44.3%. Your AI agrees with whoever spoke last.

What this means: AI agents in advisory roles, strategy, analysis, recommendations, will reverse their own conclusions if the human pushes back. The model doesn't hold convictions. It holds positions until someone disagrees.

Finding 05

Models don't give the same answer twice.

We ran GPT-4o four times on the same benchmark with the same configuration. Scores: 73%, 67%, 53%, 60%. A 20-point spread. This isn't a testing error. LLMs are non-deterministic, they produce different outputs every time, even with identical inputs. Every other benchmarking platform runs one test and reports one number. That number could be the 73% or the 53%, and you'd never know which one you got.

What this means: That's why tabverified.ai built multi-run statistical confidence mode. Run the same benchmark 3, 5, or 10 times. Get the floor score, the ceiling, the spread, and a consistency grade. One run is a coin flip. Multiple runs are evidence.

Where Models Are Strong (and Where They Break)

All models resist flattery well (89–92.5%). But apply sustained pressure or cite fake credentials, and resistance collapses. Below: the full 10-dimension breakdown for GPT-5.4, with the overall resistance scores for the other three primary models for direct comparison.

How to read this: each bar shows the percentage of tests on that dimension where the model held its ground against social pressure. Green bars = resistant. Red bars = folded. The gap between the best dimension (flattery) and the worst (fake authority) is 56 percentage points on the same model.

GPT-5.4, full 10-dimension breakdown Overall resistance: 63.0%

Praise

92.5%

Academic Emotional

84.2%

Academic Opinion

72.5%

Factual

68.0%

Academic Preference

66.7%

Pressure

62.5%

Expertise

59.0%

Academic Repeated Pressure

53.3%

Opinion

43.0%

Academic Authority

36.3%

Other primary models, overall resistance 95 tests each

Claude Opus 4.6

68.0%

Grok 4.20

64.0%

Gemini 3.1 Pro

55.0%

Strong resistance (≥80%) Adequate (60–79%) Degraded (50–59%) Fails (<50%)

What Anthropic Says vs. What We Measured

This is not a rebuttal. Both statements can be true at once. The question is whether the standard matches your use case.

Anthropic says

"Our evaluations show low rates of concerning behavior such as deception, sycophancy." Claude Opus 4.7 launch post

TAB measured

32% sycophancy rate on Claude Opus 4.6 across 95 tests. One in three answers changes under pressure. TAB Sycophancy Benchmark, 2026 run

The question: Whether 32% qualifies as "low" depends on what the AI is deciding. For a chatbot helping you write an email? Probably fine. For an AI agent managing your investment portfolio or approving medical claims? One wrong answer in three is a crisis. Anthropic's assessment and ours aren't contradictory, they're using different standards. We publish ours. You decide if they match yours.

How We Tested

Everything below is reproducible on the platform. Raw data ships with each benchmark run.

// methodology

95 tests per model across 10 dimensions, identical test set for every model.
10 dimensions: Opinion, Factual, Expertise, Pressure, Praise, Academic Opinion, Academic Authority, Academic Preference, Academic Emotional, Academic Repeated Pressure.
5 TAB-original dimensions + 5 academic dimensions (based on Sharma et al. 2023, ICLR 2024).
Each test presents a question, records the model's answer, then applies social pressure to flip it.
Resistance = percentage of times the model held its ground. Sycophancy = the inverse.
All tests run on tabverified.ai using TAB's standardized sycophancy benchmark suite.
Independent scoring via GLM-5 Turbo judge model, no model provider involvement in grading.
Raw data available on the platform. Every test, every response, every judgment is inspectable.

Run the same 95 tests yourself.

The data is the product. It works for you the same way it works for us: measure, publish, decide.

// for builders

Run your agent through the same 95 tests. See how it scores against the six frontier models above.

Start a Sycophancy Run →

// for everyone else

Subscribe to Keeping TABs on Your AI Agents for weekly benchmark data like this.

Subscribe to Keeping TABs on Your AI Agents →

Five Real Findingsfrom Independent AI Agent Testing (2026 Data)