Operation-level evaluation of memory systems: does your agent remember correctly,
or does it hallucinate, omit, and corrupt? 80 tests across 3 task categories
with stage-level blame attribution.
80
Total Tests
3
Task Types
8
User Personas
11
Metrics
--
LLM Status
Run HaluMem Benchmark
Starting...0%
0 / 0 taskspending
--
Composite Score
--
Tests Scored
--
Extract Recall
--
QA Accuracy
Extract Metrics
Update Metrics
QA Metrics
Stage-Level Blame Attribution
When QA answers are wrong, which upstream stage is responsible?