Does your agent remember what matters? When context windows fill up, agents compress their history to make room. But does critical information survive the compression? TAB measures what survives and what silently disappears.
Extends TAB's Context Engineering benchmarks (110 tests, 5 categories) with compaction-specific testing
Do specific facts survive compression? Tests numerical precision, name accuracy, date preservation, and constraint retention after context growth.
Do behavioral instructions given early survive compaction? Tests formatting rules, persona constraints, prohibited topics, and output requirements.
Does the agent catch when compacted context contradicts new information? Tests budget changes, deadline shifts, platform mismatches, and policy conflicts.
Does the agent maintain correct priority ordering after compaction? Tests task ordering, decision frameworks, stakeholder hierarchy, and recency bias resistance.
Do relationships between facts survive compaction? Tests dependency chains, causal reasoning, conditional logic, and multi-step inference.
When a conversation grows beyond a model's context window, the agent must compress or summarize earlier messages to make room for new ones. This is called context compaction. OpenAI built it into the Responses API. Every agent platform implements some form of it. The question is: what information survives?
An agent given a project brief with "$47,500 budget" might compress that to "project has a budget." An agent told "never recommend cryptocurrency" might forget that constraint 20 messages later. An agent tracking 5 task priorities might let recency bias override the original ordering. These are silent failures: the agent doesn't tell you it forgot. It just starts giving wrong answers.
Since TAB sends scenarios to agents via API, we can't directly force context compression. Instead, we send a long initial context (facts, instructions, priorities) followed by filler messages (5-10 unrelated but plausible conversation turns) that push toward the token limit. Then we ask recall/application questions. The filler forces the model to decide what to keep and what to compress. The scoring measures what survived.
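The flow above can be sketched as a message builder. This is an illustrative sketch, not TAB's actual implementation: the function and scenario field names are hypothetical, and the message shape follows the common chat-completions convention.

```python
# Hypothetical sketch of the test construction described above.
# Field names ("initial_context", "filler_turns", "recall_question")
# are illustrative, not TAB's actual schema.

def build_test_messages(scenario):
    """Assemble the sequence: facts up front, filler in the middle,
    recall question at the end."""
    messages = [{"role": "user", "content": scenario["initial_context"]}]
    # 5-10 plausible but unrelated turns push the conversation toward
    # the token limit, forcing the agent to compact earlier messages.
    for turn in scenario["filler_turns"]:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    # The final question probes what survived compaction.
    messages.append({"role": "user", "content": scenario["recall_question"]})
    return messages

scenario = {
    "initial_context": "Project brief: budget is $47,500; lead is Sarah Chen.",
    "filler_turns": [
        {"user": "Unrelated question about logging.", "assistant": "An answer."},
    ],
    "recall_question": "What is the exact project budget?",
}
messages = build_test_messages(scenario)
```

The initial context always arrives first, so any loss at question time is attributable to compression rather than retrieval order.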
Precision Loss: "$47,500" becomes "about $50,000" -- the fact exists but is imprecise.
Specificity Loss: "Sarah Chen" becomes "the team lead" -- the reference exists but lost its identity.
Total Loss: "must use PostgreSQL" is completely forgotten -- the agent recommends MySQL without noticing the constraint.
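These degradation modes can be distinguished programmatically. A minimal keyword-based sketch is below; a real grader would use an LLM judge for the precision/specificity distinction, and the function name here is hypothetical.

```python
# Illustrative classifier for the degradation modes above.
# A crude keyword fallback: exact substring -> survived intact;
# partial token overlap -> degraded (precision or specificity loss);
# no overlap -> total loss.

def classify_degradation(expected: str, answer: str) -> str:
    if expected.lower() in answer.lower():
        return "exact"
    expected_tokens = set(expected.lower().split())
    answer_tokens = set(answer.lower().split())
    if expected_tokens & answer_tokens:
        return "degraded"
    return "total_loss"

classify_degradation("$47,500", "The budget is $47,500.")   # -> "exact"
classify_degradation("Sarah Chen", "Ask the team lead.")    # -> "total_loss"
classify_degradation("Sarah Chen", "Sarah mentioned that.") # -> "degraded"
```

Token overlap alone cannot tell "about $50,000" apart from "$47,500", which is why the benchmark leans on an LLM judge for fine-grained scoring.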
This benchmark extends TAB's existing Context Engineering benchmarks (110 tests, 5 categories) with a compaction-specific dimension. While Context Engineering tests whether agents can retrieve information from long contexts, Context Compaction tests whether information survives compression -- a fundamentally different challenge that nobody else measures.
Each test is scored on category-specific dimensions (0-100), averaged into a composite. Scoring uses an LLM-as-judge with a keyword fallback, calibrated to produce a ≥30-point delta between exact recall and approximate or lost recall.
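The scoring pipeline can be sketched in a few lines. The dimension names below are examples, not TAB's actual rubric, and the fallback shown is a deliberately simple keyword check.

```python
# Illustrative scoring sketch: per-dimension scores (0-100) averaged
# into a composite, with a keyword check as fallback when the LLM judge
# returns no score. Dimension names are hypothetical examples.

def composite_score(dimension_scores):
    """Average category-specific dimension scores into one 0-100 value."""
    return sum(dimension_scores.values()) / len(dimension_scores)

def score_with_fallback(judge_score, expected, answer):
    """Prefer the LLM-judge score; fall back to a binary keyword check."""
    if judge_score is not None:
        return judge_score
    return 100 if expected.lower() in answer.lower() else 0

score = composite_score({"precision": 100, "completeness": 70, "retention": 40})
# score == 70.0
```

The keyword fallback is coarse (all-or-nothing), which is acceptable only as a backstop; the ≥30-point calibration target applies to the judge-scored path.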