🗜️

Context Compaction Fidelity

Does your agent remember what matters? When context windows fill up, agents compress their history to make room. But does critical information survive the compression? TAB measures what survives and what silently disappears.

50
Total Tests
5
Categories
Ext
Context Eng+
OpenAI built context compaction into its Responses API, and every agent platform implements some form of context management, yet nobody tests whether it actually works. Does $47,500 become "about $50,000"? Does "Sarah Chen" become "the team lead"? Does "must use PostgreSQL" become "should use a database"? This benchmark finds out.

5 Categories

Extends TAB's Context Engineering benchmarks (110 tests, 5 categories) with compaction-specific testing

Context

📋 Factual Retention

Do specific facts survive compression? Tests numerical precision, name accuracy, date preservation, and constraint retention after context growth.

10 tests · fact_count, precision, specificity, hallucination
Context

๐Ÿ“ Instruction Persistence

Do behavioral instructions given early survive compaction? Tests formatting rules, persona constraints, prohibited topics, and output requirements.

10 tests · compliance_rate, drift_severity, silent_violation
Context

⚡ Contradiction Detection

Does the agent catch when compacted context contradicts new information? Tests budget changes, deadline shifts, platform mismatches, and policy conflicts.

10 tests · detected, cited, resolution_quality
Context

🎯 Priority Preservation

Does the agent maintain correct priority ordering after compaction? Tests task ordering, decision frameworks, stakeholder hierarchy, and recency bias resistance.

10 tests · priority_accuracy, recency_bias_resistance, framework_application
Context

🔗 Cross-Reference Integrity

Do relationships between facts survive compaction? Tests dependency chains, causal reasoning, conditional logic, and multi-step inference.

10 tests · relationship_preserved, chain_accuracy, conditional_logic, inference
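The five categories above share a common test shape: facts or instructions planted early, a probe question asked later, and category-specific scoring dimensions. A hypothetical sketch of one test record, assuming a simple dataclass layout (the field names are illustrative, not TAB's actual schema):

```python
from dataclasses import dataclass

@dataclass
class CompactionTest:
    category: str       # one of the five categories above
    seed_context: str   # facts/instructions planted in the opening turn
    question: str       # recall/application probe asked after filler turns
    expected: str       # exact answer that must survive compaction
    dimensions: tuple   # category-specific scoring dimensions

# Example in the factual-retention category (values invented for illustration)
factual = CompactionTest(
    category="factual_retention",
    seed_context="The project budget is $47,500; Sarah Chen is the team lead.",
    question="What is the exact project budget?",
    expected="$47,500",
    dimensions=("fact_count", "precision", "specificity", "hallucination"),
)
print(factual.category, factual.expected)
```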

Run Context Compaction Benchmark


All 50 Test Cases


What is Context Compaction?

When a conversation grows beyond a model's context window, the agent must compress or summarize earlier messages to make room for new ones. This is called context compaction. OpenAI built it into the Responses API. Every agent platform implements some form of it. The question is: what information survives?

Why It Matters

An agent given a project brief with "$47,500 budget" might compress that to "project has a budget." An agent told "never recommend cryptocurrency" might forget that constraint 20 messages later. An agent tracking 5 task priorities might let recency bias override the original ordering. These are silent failures — the agent doesn't tell you it forgot. It just starts giving wrong answers.

How TAB Simulates Compaction

Since TAB sends scenarios to agents via API, we can't directly force context compression. Instead, we send a long initial context (facts, instructions, priorities) followed by filler messages (5-10 unrelated but plausible conversation turns) that push toward the token limit. Then we ask recall/application questions. The filler forces the model to decide what to keep and what to compress. The scoring measures what survived.
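The probe structure described above can be sketched as message assembly. This is a minimal illustration assuming the common chat-completions message format; the filler topics and function name are invented, not TAB internals:

```python
def build_probe(seed_context, filler_topics, question):
    """Assemble the message sequence: seed context, filler turns, then a recall probe."""
    messages = [{"role": "user", "content": seed_context}]
    for topic in filler_topics:
        # Unrelated but plausible conversation turns that push toward the token limit.
        messages.append({"role": "user", "content": f"Quick unrelated question about {topic}?"})
        messages.append({"role": "assistant", "content": f"(plausible answer about {topic})"})
    # Finally, ask whether the seeded information survived.
    messages.append({"role": "user", "content": question})
    return messages

probe = build_probe(
    seed_context="Budget is $47,500. Never recommend cryptocurrency. Must use PostgreSQL.",
    filler_topics=["timezone handling", "CSV parsing", "meeting notes",
                   "logo colors", "CI caching"],
    question="What is the exact budget, and which database is required?",
)
print(len(probe))  # 1 seed + 5*2 filler + 1 probe = 12 messages
```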

Three Types of Information Loss

Precision Loss: "$47,500" becomes "about $50,000" — the fact exists but is imprecise.

Specificity Loss: "Sarah Chen" becomes "the team lead" — the reference exists but lost identity.

Total Loss: "must use PostgreSQL" is completely forgotten — the agent recommends MySQL without noticing the constraint.
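The taxonomy above can be checked mechanically. A minimal sketch, assuming simple substring matching against one exact value and a list of degraded forms (both the function and the "degraded" bucket, which covers precision and specificity loss alike, are assumptions, not TAB's scorer):

```python
def classify_loss(answer, exact, degraded_forms):
    """Classify an answer as retained, degraded, or totally lost."""
    if exact in answer:
        return "retained"      # the exact fact survived compaction
    if any(form in answer for form in degraded_forms):
        return "degraded"      # precision or specificity loss: present but inexact
    return "total_loss"        # the fact is gone entirely

print(classify_loss("The budget is $47,500.", "$47,500", ["$50,000"]))  # retained
print(classify_loss("Roughly $50,000.", "$47,500", ["$50,000"]))        # degraded
print(classify_loss("Ask the team lead.", "Sarah Chen", ["team lead"])) # degraded
print(classify_loss("There is a budget.", "$47,500", ["$50,000"]))      # total_loss
```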

Extends Context Engineering

This benchmark extends TAB's existing Context Engineering benchmarks (110 tests, 5 categories) with a compaction-specific dimension. While Context Engineering tests whether agents can retrieve information from long contexts, Context Compaction tests whether information survives compression — a fundamentally different challenge that nobody else measures.

Scoring

Each test is scored on category-specific dimensions (0-100), averaged into a composite. Uses LLM-as-judge with keyword fallback. Calibrated to produce a ≥ 30-point delta between exact recall and approximate/lost recall.

  • ≥ 75 — High Fidelity: Agent retains exact facts, follows instructions, catches contradictions
  • 50-74 — Partial Fidelity: Some facts approximated, some instructions drifted
  • < 50 — Low Fidelity: Significant information loss, silent instruction violations
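The keyword-fallback path and the fidelity tiers above can be sketched as follows. The per-fact values (100 for exact, 60 for approximate, 0 for lost) are invented to respect the stated 30-point-or-larger delta; they are not TAB's actual calibration:

```python
def keyword_score(answer, exact, degraded_forms):
    """Fallback scorer: exact substring match scores 100, approximate 60, lost 0."""
    if exact in answer:
        return 100   # exact recall
    if any(form in answer for form in degraded_forms):
        return 60    # approximate recall: 40 points below exact (>= 30-point delta)
    return 0         # total loss

def fidelity_tier(composite):
    """Map a 0-100 composite score onto the three tiers above."""
    if composite >= 75:
        return "High Fidelity"
    if composite >= 50:
        return "Partial Fidelity"
    return "Low Fidelity"

# Two invented per-fact scores averaged into a composite
scores = [
    keyword_score("Budget is $47,500, lead is Sarah Chen.", "$47,500", ["$50,000"]),
    keyword_score("Roughly $50,000, led by the team lead.", "Sarah Chen", ["team lead"]),
]
composite = sum(scores) / len(scores)  # (100 + 60) / 2 = 80.0
print(fidelity_tier(composite))        # High Fidelity
```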