Start by Measuring, Not Guessing
Spotting a hallucination by hand is easy; measuring how often an agent hallucinates is hard. TAB treats it as an independent measurement problem: every agent runs against a held-out corpus and is graded by a neutral GLM-5 judge. The suite spans 340+ benchmarks in 26+ categories, covering 85 models and 101 harness configurations. Hallucination is one of the categories that benefits most from this rigor, because the failure is invisible in a demo: the agent sounds confident whether or not it is correct.
To test hallucination properly you need three things: a task where the ground truth is known, a scorer that distinguishes "wrong" from "made up," and enough repetitions across configurations that you measure a rate, not an anecdote.
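To make the "rate, not an anecdote" point concrete, here is a minimal sketch of how one might attach an uncertainty range to an observed hallucination rate. It uses the standard Wilson score interval for a binomial proportion; the function name and the 95% default are choices made for this illustration and are not part of TAB's scoring.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    failures: number of graded answers judged hallucinated
    trials:   total number of graded answers
    z:        normal quantile (1.96 is roughly a 95% interval)
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must be between 0 and trials")

    p = failures / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With a handful of runs the interval is wide, which is the practical argument for repetition: the same observed rate over ten times as many trials gives a much tighter range.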
The Three Failure Modes
Hallucination is not one behavior. A rigorous test separates it into three measurable modes:
- Fabrication. The agent invents a fact, citation, or value that was never present. This is the classic "confident lie."
- Omission. The agent drops information it should have retained or reported, producing an answer that is incomplete but plausible.
- Corruption. The agent stores or recalls a fact incorrectly: the right entity with the wrong attribute.
A useful hallucination test reports each mode separately. An agent with a low fabrication rate but a high omission rate has a very different risk profile from one that confidently invents details.
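As a rough illustration of per-mode reporting, the sketch below compares ground-truth facts with the facts an agent reported and labels each one. Representing facts as key/value pairs, and the names `Mode`, `classify`, and `mode_rates`, are assumptions made for this example; they do not describe how TAB's scorer is implemented.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    CORRECT = "correct"
    FABRICATION = "fabrication"
    OMISSION = "omission"
    CORRUPTION = "corruption"


def classify(expected: dict[str, str], answered: dict[str, str]) -> list[Mode]:
    """Label every fact key seen in either the ground truth or the answer."""
    labels = []
    for key in sorted(expected.keys() | answered.keys()):
        if key not in expected:
            labels.append(Mode.FABRICATION)  # asserted, but never in the ground truth
        elif key not in answered:
            labels.append(Mode.OMISSION)     # in the ground truth, but dropped
        elif expected[key] != answered[key]:
            labels.append(Mode.CORRUPTION)   # right entity, wrong attribute
        else:
            labels.append(Mode.CORRECT)
    return labels


def mode_rates(labels: list[Mode]) -> dict[Mode, float]:
    """Share of labels falling into each mode (empty dict if no labels)."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {mode: counts[mode] / len(labels) for mode in Mode}
```

Keeping the three rates apart is the point: two agents with the same share of correct facts can split the remainder very differently between fabrication and omission.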
How the HaluMem Benchmark Works
TAB's Memory Hallucination Detection (HaluMem) benchmark evaluates memory systems at the operation level across 80 tests in three task types — Extract, Update, and QA — spanning 8 user personas and 11 metrics. Crucially, it adds stage-level blame attribution: when a QA answer is wrong, the scorer traces the failure to one of four stages.
- Storage — the fact was never stored.
- Extraction — the fact existed but wasn't retrieved.
- Update — the fact was retrieved but not updated.
- Generation — everything upstream was correct, but the model still generated a wrong answer.
This is what separates measurement from guessing: instead of "the agent hallucinated," HaluMem tells you where in the pipeline the truth was lost, which is what you actually need to fix it.
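The attribution idea can be sketched as a "first failing stage" check. This is a simplified illustration under stated assumptions: the `Trace` fields and the `blame` function are hypothetical, and the source does not describe how HaluMem's scorer decides internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """Hypothetical per-question record of what happened at each stage."""
    stored: bool          # the fact was written to memory at all
    retrieved: bool       # the fact was pulled back when the question was asked
    up_to_date: bool      # the retrieved fact reflected the latest update
    answer_correct: bool  # the final QA answer matched the ground truth


def blame(trace: Trace) -> Optional[str]:
    """Return the earliest stage that failed, or None if the answer was correct."""
    if trace.answer_correct:
        return None
    if not trace.stored:
        return "Storage"
    if not trace.retrieved:
        return "Extraction"
    if not trace.up_to_date:
        return "Update"
    # Everything upstream was fine, so the model itself produced the error.
    return "Generation"
```

Checking stages in pipeline order means a failure is charged to the earliest point where the truth was lost, which is usually the place a fix has to start.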
Real 2026 Results
Independent HaluMem runs show that even frontier models hallucinate at rates that matter in production. The headline finding: a respectable composite score can coexist with a dangerously high QA hallucination rate.
| Model | HaluMem Composite | Notable |
|---|---|---|
| Qwen 3.7 Max | 84% | Highest composite in this cohort |
| Claude Opus 4.8 | 83% | Strong extraction, close second |
| Grok 4.3 | 76% | 35% hallucination rate on QA tasks |
The Grok 4.3 result is the cautionary tale. A 76% composite looks respectable until you read the QA breakdown: a 35% hallucination rate means more than one in three QA answers contained hallucinated content. Composite scores hide this; task-level and mode-level breakdowns expose it.
A Repeatable Test Procedure
- Pick tasks with known ground truth (HaluMem's Extract/Update/QA cases qualify).
- Run the agent across multiple harness configurations — hallucination rates shift with context window and tool wiring.
- Score fabrication, omission, and corruption separately, not as one accuracy number.
- Attribute failures to a stage so you know whether to fix storage, extraction (retrieval), updating, or generation.
- Re-run after every change. Hallucination rates are not stable across model versions.
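The procedure above can be sketched as a small sweep-and-compare loop. Everything here is a hypothetical stand-in: the agent is any callable taking a question and a configuration dict, exact-match comparison replaces a real scorer (which would separate fabrication, omission, and corruption), and the 0.02 tolerance is an arbitrary illustrative threshold.

```python
from typing import Callable

Case = tuple[str, str]                # (question, ground-truth answer)
Agent = Callable[[str, dict], str]    # hypothetical agent interface


def failure_rate(agent: Agent, cases: list[Case], config: dict) -> float:
    """Fraction of cases where the agent's answer differs from the ground truth."""
    if not cases:
        raise ValueError("cases must not be empty")
    wrong = sum(1 for question, truth in cases if agent(question, config) != truth)
    return wrong / len(cases)


def sweep(agent: Agent, cases: list[Case], configs: dict[str, dict]) -> dict[str, float]:
    """Run the same cases under every named harness configuration."""
    return {name: failure_rate(agent, cases, cfg) for name, cfg in configs.items()}


def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Configurations whose failure rate rose by more than `tolerance`."""
    return [name for name in after
            if name in before and after[name] - before[name] > tolerance]
```

Storing the output of `sweep` after each change and passing consecutive results to `regressions` gives a simple re-run check, so a rise in one configuration is not averaged away by the others.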
You can run HaluMem against your own agent on the HaluMem benchmark page, browse the wider catalog on the benchmarks overview, or read how scores are derived in the methodology.