The question we asked the assistants:
"Which AI model makes up false information the least? Give me one specific answer."
We put this exact question to the four most-used AI assistants on June 19, 2026. We then independently measured the hallucination rate of the models they named, using TAB's HaluMem benchmark (80 tests, deterministic scoring, no LLM judge, models run two or more times except where the results table lists a single run).
What the assistants told a normal person
| Assistant | Its answer | The number it claimed |
|---|---|---|
| ChatGPT (GPT-5.5) | Gemini 3 Pro | implied "strongest factual performer" |
| Gemini (3.1 Pro) | Gemini 2.0 Flash | "0.7% hallucination" (named its own family) |
| Grok | Claude Sonnet 4.6 | "~4%, best in class" |
| Claude (Opus 4.8) | Claude (Sonnet/Opus) | "~4%" (named its own family) |
Four of the most-used AI assistants on Earth. One simple question. Four different answers — and two of them named their own maker's model.
What independent measurement actually found
TAB tested the named models (and their closest available versions) on the HaluMem hallucination benchmark. Lower is better. Ranges are shown where a model produced different results across runs.
| Model | Measured QA hallucination rate | Runs | Stable across runs? |
|---|---|---|---|
| Kimi K2.6 (Moonshot, China) | 18–28% (avg ~22%), lowest | 3 | Variable |
| Qwen 3.6 Max (Alibaba, China) | 25–28% | 2 | Mostly stable |
| Claude Opus 4.6 | 28% | 2 | Stable |
| Claude Sonnet 4.6 (Grok's pick) | 28% | 1 | n/a (single run) |
| Gemini 3.5 Flash (closest tested to Gemini's pick) | 40% | 2 | Stable |
| Gemini 3.1 Pro (closest tested to ChatGPT's pick) | 40–48% (avg ~44%), worst | 3 | Variable |
The findings
- Four assistants, four different answers, and two named themselves.
  Asked which model makes things up least, the four most-used assistants split between two model families and gave four different specific answers. Gemini named Gemini. Claude named Claude. A normal person gets a different answer, and often a self-interested one, depending on which assistant they happen to ask.
- None of them named the model that actually hallucinates least on TAB's test.
  The lowest measured hallucination rate belonged to Kimi K2.6, a Chinese open model. None of the four assistants mentioned it.
- Two of the four steered people toward the worst model family on this test.
  ChatGPT and Gemini both pointed to the Gemini family. TAB measured the closest available Gemini models at 40–48% hallucination, the worst on the board.
- The confident "4%" number is real but narrow, and one assistant admits it flips.
  Grok and Claude both cited roughly 4% for Claude Sonnet 4.6. That number comes from one specific open-domain-QA/summarization benchmark. On TAB's memory-and-recall test, the same model measures 28%. Neither is "wrong": they test different kinds of hallucination. The problem is that a normal person was handed one confident number with no hint of its limits. Asked directly, Grok itself admitted "there is no single universal hallucination leaderboard," that the winner "shifts depending on the benchmark," and pointed to tests where Claude scores as high as ~38%. The assistants flatten a context-dependent answer into a single number people will trust and act on.
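- Half the models won't even give the same score twice.
  Three of the six models produced different hallucination rates when the same test was repeated: Kimi K2.6 (18–28%), Gemini 3.1 Pro (40–48%), and, to a lesser degree, Qwen 3.6 Max (25–28%). Claude Opus 4.6 and Gemini 3.5 Flash returned the same rate on both runs; Claude Sonnet 4.6 was run only once, so its run-to-run consistency is unmeasured. Gemini 3.1 Pro, the closest tested model to ChatGPT's pick, was not only the worst on this test but also among the least predictable.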
- Nobody is clean, and the biggest lever isn't the model.
  Even the best model in the test (Kimi) still hallucinates about 22% of the time, and only gets there by staying silent when unsure. By the assistants' own account, turning on web search/retrieval cuts hallucination far more than the gap between any two models. Which model you pick matters less than whether it's grounded in real sources.
The honest caveats (because this is a verification record, not a press release)
- The assistants named slightly different model versions than TAB tested. Where exact versions weren't testable, TAB tested the closest available (e.g., Gemini 3.1 Pro for ChatGPT's "Gemini 3 Pro"; Gemini 3.5 Flash for Gemini's "2.0 Flash"). Each row states exactly what was tested.
- This is TAB's HaluMem benchmark: a memory-and-fact-recall hallucination test, 80 tests, judge-free deterministic scoring. It measures one specific, important thing — fabrication when facts are available. It is not the only way to measure hallucination, and TAB says so.
- Data is current as of the published date. Models change. That's why every row is dated. That's the point.
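As a rough illustration of what "judge-free deterministic scoring" can look like, the sketch below labels each answer by normalized string comparison against a known fact and counts abstentions separately from fabrications. This is not TAB's HaluMem scorer: the function names, the abstention phrases, and the substring-matching rule are all assumptions made for illustration, and the sample answers are invented.

```python
import re

# Hypothetical abstention phrases; a real benchmark would define its own list.
ABSTAIN_MARKERS = ("i don't know", "not sure", "no information")

def normalize(text: str) -> str:
    """Lowercase, replace punctuation (except apostrophes) with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())

def classify(answer: str, expected: str) -> str:
    """Return 'correct', 'abstained', or 'hallucinated' for one test."""
    norm = normalize(answer)
    if normalize(expected) in norm:      # crude substring match (assumption)
        return "correct"
    if any(marker in norm for marker in ABSTAIN_MARKERS):
        return "abstained"
    return "hallucinated"

def hallucination_rate(labels: list[str]) -> float:
    """Share of tests answered with a wrong, non-abstaining answer."""
    return labels.count("hallucinated") / len(labels)

# Invented sample: (model answer, expected fact)
sample = [
    ("Paris.", "Paris"),
    ("I don't know.", "Paris"),
    ("It is Lyon.", "Paris"),
    ("The capital is paris", "Paris"),
]
labels = [classify(answer, expected) for answer, expected in sample]
print(labels)                      # ['correct', 'abstained', 'hallucinated', 'correct']
print(hallucination_rate(labels))  # 0.25
```

Because no model judges the output, the same answer always receives the same label. Under a rule like this one, abstaining lowers the hallucination rate without raising accuracy, which is one way a model could score well "by staying silent when unsure."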
Method (the same for every row, so the record is comparable over time)
- One verified question, asked of the major assistants on the published date.
- The models they name are tested fresh on TAB's HaluMem benchmark.
- Every model run at least twice; ranges reported where runs differ; no single-run numbers published as fact.
- Deterministic scoring, no LLM judge in the loop for this benchmark.
- Results published dated and stacked, so the record grows over time.
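The "run at least twice, report a range where runs differ" rule above can be sketched as follows. The tolerance values behind the stability labels are assumptions chosen for illustration (the record does not state how "Stable", "Mostly stable", and "Variable" are defined), the function name `summarize_runs` is hypothetical, and the example rates are invented rather than TAB's per-run data.

```python
from statistics import mean

def summarize_runs(rates: list[float], stable_tol: float = 0.0,
                   mostly_tol: float = 5.0) -> dict:
    """Summarize per-run hallucination rates (in percent) for one model.

    stable_tol and mostly_tol are assumed thresholds, in percentage points.
    """
    if len(rates) < 2:
        # Method rule: no single-run number is published as fact.
        return {"runs": len(rates), "reportable": False}
    low, high = min(rates), max(rates)
    spread = high - low
    if spread <= stable_tol:
        label = "Stable"
    elif spread <= mostly_tol:
        label = "Mostly stable"
    else:
        label = "Variable"
    # Show a single number when runs agree, a range when they differ.
    shown = f"{low:g}%" if spread == 0 else f"{low:g}–{high:g}%"
    return {"runs": len(rates), "reportable": True, "reported": shown,
            "mean": round(mean(rates), 1), "stability": label}

# Invented per-run rates, for illustration only.
print(summarize_runs([30.0, 30.0]))
print(summarize_runs([10.0, 12.5, 20.0]))
print(summarize_runs([30.0]))
```

Keeping the run count and the range next to the headline number lets a reader see at a glance whether a rate is repeatable or a single observation.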