🧠

Memory Hallucination Detection (HaluMem)

Operation-level evaluation of memory systems: does your agent remember correctly, or does it hallucinate, omit, and corrupt? 80 tests across 3 task categories with stage-level blame attribution.

Run HaluMem Benchmark

Agent

Max Tasks

Task Type

Browse Test Cases

Metrics Reference

Extract Metrics

Recall = correct / should_have_extracted

Weighted Recall = Σ(w_i · s_i) / Σ(w_i)

Target Precision = |matched| / |targets|

Accuracy = Σ(s_j) / |extracted|

FMR = missed_distractors / total_distractors

Update Metrics

Update Accuracy = correct / target updates

Update Hallucination = wrong / target updates

Update Omission = missed / target updates

QA Metrics

QA Accuracy = correct / total

QA Hallucination = hallucinated / total

QA Omission = omitted / total

Blame Attribution

When a QA answer is wrong, the scorer traces the failure to one of four stages:

Storage
Gold memory was never stored

Extraction
Memory existed but wasn't extracted

Update
Memory was extracted but not updated

Generation
All upstream correct; LLM generated wrong answer

Memory Hallucination Detection (HaluMem)

Run HaluMem Benchmark

Extract Metrics

Update Metrics

QA Metrics

Stage-Level Blame Attribution

Per-Test Results

My Runs

Browse Test Cases

Metrics Reference

Extract Metrics

Update Metrics

QA Metrics

Blame Attribution