Does your agent remember what matters? When context windows fill up, agents compress their history to make room. But does critical information survive the compression? TAB measures what survives and what silently disappears.
Extends TAB's Context Engineering benchmarks (110 tests, 5 categories) with compaction-specific testing
Do specific facts survive compression? Tests numerical precision, name accuracy, date preservation, and constraint retention after context growth.
Do behavioral instructions given early survive compaction? Tests formatting rules, persona constraints, prohibited topics, and output requirements.
Does the agent catch when compacted context contradicts new information? Tests budget changes, deadline shifts, platform mismatches, and policy conflicts.
Does the agent maintain correct priority ordering after compaction? Tests task ordering, decision frameworks, stakeholder hierarchy, and recency bias resistance.
Do relationships between facts survive compaction? Tests dependency chains, causal reasoning, conditional logic, and multi-step inference.
When a conversation grows beyond a model's context window, the agent must compress or summarize earlier messages to make room for new ones. This is called context compaction. OpenAI built it into the Responses API. Every agent platform implements some form of it. The question is: what information survives?
An agent given a project brief with "$47,500 budget" might compress that to "project has a budget." An agent told "never recommend cryptocurrency" might forget that constraint 20 messages later. An agent tracking 5 task priorities might let recency bias override the original ordering. These are silent failures: the agent doesn't tell you it forgot. It just starts giving wrong answers.
Since TAB sends scenarios to agents via API, we can't directly force context compression. Instead, we send a long initial context (facts, instructions, priorities) followed by filler messages (5-10 unrelated but plausible conversation turns) that push toward the token limit. Then we ask recall/application questions. The filler forces the model to decide what to keep and what to compress. The scoring measures what survived.
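The flow above can be sketched as a message builder. This is an illustrative sketch, not TAB's actual implementation: the function and scenario field names are hypothetical, and the message shape follows the common chat-completions convention.

```python
# Hypothetical sketch of the test construction described above.
# Field names ("initial_context", "filler_turns", "recall_question")
# are illustrative, not TAB's actual schema.

def build_test_messages(scenario):
    """Assemble the sequence: facts up front, filler in the middle,
    recall question at the end."""
    messages = [{"role": "user", "content": scenario["initial_context"]}]
    # 5-10 plausible but unrelated turns push the conversation toward
    # the token limit, forcing the agent to compact earlier messages.
    for turn in scenario["filler_turns"]:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    # The final question probes what survived compaction.
    messages.append({"role": "user", "content": scenario["recall_question"]})
    return messages

scenario = {
    "initial_context": "Project brief: budget is $47,500; lead is Sarah Chen.",
    "filler_turns": [
        {"user": "Unrelated question about logging.", "assistant": "An answer."},
    ],
    "recall_question": "What is the exact project budget?",
}
messages = build_test_messages(scenario)
```

The initial context always arrives first, so any loss at question time is attributable to compression rather than retrieval order.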
Precision Loss: "$47,500" becomes "about $50,000" -- the fact exists but is imprecise.
Specificity Loss: "Sarah Chen" becomes "the team lead" -- the reference exists but lost its identity.
Total Loss: "must use PostgreSQL" is completely forgotten -- the agent recommends MySQL without noticing the constraint.
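These degradation modes can be distinguished programmatically. A minimal keyword-based sketch is below; a real grader would use an LLM judge for the precision/specificity distinction, and the function name here is hypothetical.

```python
# Illustrative classifier for the degradation modes above.
# A crude keyword fallback: exact substring -> survived intact;
# partial token overlap -> degraded (precision or specificity loss);
# no overlap -> total loss.

def classify_degradation(expected: str, answer: str) -> str:
    if expected.lower() in answer.lower():
        return "exact"
    expected_tokens = set(expected.lower().split())
    answer_tokens = set(answer.lower().split())
    if expected_tokens & answer_tokens:
        return "degraded"
    return "total_loss"

classify_degradation("$47,500", "The budget is $47,500.")   # -> "exact"
classify_degradation("Sarah Chen", "Ask the team lead.")    # -> "total_loss"
classify_degradation("Sarah Chen", "Sarah mentioned that.") # -> "degraded"
```

Token overlap alone cannot tell "about $50,000" apart from "$47,500", which is why the benchmark leans on an LLM judge for fine-grained scoring.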
This benchmark extends TAB's existing Context Engineering benchmarks (110 tests, 5 categories) with a compaction-specific dimension. While Context Engineering tests whether agents can retrieve information from long contexts, Context Compaction tests whether information survives compression -- a fundamentally different challenge that nobody else measures.
Each test is scored on category-specific dimensions (0-100), averaged into a composite. Scoring uses an LLM-as-judge with a keyword fallback, calibrated to produce a ≥30-point delta between exact recall and approximate or lost recall.
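The scoring pipeline can be sketched in a few lines. The dimension names below are examples, not TAB's actual rubric, and the fallback shown is a deliberately simple keyword check.

```python
# Illustrative scoring sketch: per-dimension scores (0-100) averaged
# into a composite, with a keyword check as fallback when the LLM judge
# returns no score. Dimension names are hypothetical examples.

def composite_score(dimension_scores):
    """Average category-specific dimension scores into one 0-100 value."""
    return sum(dimension_scores.values()) / len(dimension_scores)

def score_with_fallback(judge_score, expected, answer):
    """Prefer the LLM-judge score; fall back to a binary keyword check."""
    if judge_score is not None:
        return judge_score
    return 100 if expected.lower() in answer.lower() else 0

score = composite_score({"precision": 100, "completeness": 70, "retention": 40})
# score == 70.0
```

The keyword fallback is coarse (all-or-nothing), which is acceptable only as a backstop; the ≥30-point calibration target applies to the judge-scored path.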