type: concept
status: active
timestamp: 2026-06-28
tags: [compression, tokens, headroom, rtk, caveman, mcp, performance, research, agent-tooling]

Token-compression techniques catalogue — researched 2026-06-28

Survey of context-compression tools, techniques, agent levers


How agent token budget splits per turn

Per computingforgeeks benchmark:

| Category | Typical share | Lever |
|---|---|---|
| Cached system prompt + skills + MCP manifest | 30-50% | Cut MCPs, defer tools, slim skills |
| Tool call inputs + outputs (Read/Grep/Bash) | 30-45% | The big lever: RTK, codebase-memory, token-savior |
| Extended thinking (reasoning) | 10-30% | MAX_THINKING_TOKENS=8000 cap |
| Visible output | 1-10% | Caveman, Hr output-shaper |

Auto-compaction trap: at ~93% of context window, agent summarises and restarts. Each compaction reads everything at full token rate, then pays again for the summary. Can fire 3-4× per long session at 100K+ tokens each. Compact deliberately at 60-70%, not blind at 93%.
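
The trap can be put into rough numbers. A minimal sketch, assuming a 200,000-token window (hypothetical; this note does not state a window size). The thresholds are the 60-70% and 93% figures from the paragraph above; the function names are illustrative, and only the re-read cost is counted, not the summary that is paid on top.

```python
AUTO_COMPACT_AT = 0.93       # auto-compaction threshold quoted in this note
MANUAL_BAND_START = 0.60     # lower edge of the recommended manual /compact band


def compaction_advice(used_tokens: int, window_tokens: int) -> str:
    """Classify the current context fill level against the note's thresholds."""
    fill = used_tokens / window_tokens
    if fill >= AUTO_COMPACT_AT:
        return "auto-compaction imminent"
    if fill >= MANUAL_BAND_START:
        return "compact now, with preservation hints"
    return "keep working"


def tokens_reread(window_tokens: int, fill_at_compaction: float, compactions: int) -> int:
    """Tokens re-read by compaction alone: each pass reads the whole context once."""
    return round(window_tokens * fill_at_compaction) * compactions


if __name__ == "__main__":
    window = 200_000  # assumed window size, not from the note
    print(compaction_advice(130_000, window))
    print(tokens_reread(window, AUTO_COMPACT_AT, 4))
```

Under that assumed window, four blind compactions at 93% re-read far more than four deliberate ones at 65%, before the summaries are counted.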

Tools we currently run

| Tool | Layer | Status | Source |
|---|---|---|---|
| Headroom (Hr 0.27) | Input proxy (localhost:8787) compressing chat + file context | Active. Memory + Code-Aware + Output-Shaper + --learn min-evidence=10 + budget $20/day | chopratejas/headroom |
| RTK (Rust Token Killer) | Shell-output compression (Bash hook rewrites git status → rtk git status) | Installed (see speed-stack.cmd) | rtk-ai/rtk v0.28.2 |
| Caveman | Prose-style discipline (system prompt rule) | Active as knowledge/rules/agent/caveman.md (adapted, full level) | JuliusBrussee/caveman (77K stars) |
| Ponytail | Code-gen discipline (lazy senior dev ladder) | Active as knowledge/rules/agent/ponytail.md | DietrichGebert/ponytail |
| codebase-memory MCP | Knowledge graph index (121K nodes / 214K edges over the umbrella) | Indexed; .codebase-memory/graph.db.zst 21MB, gitignored | Local |
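
The RTK row describes a Bash hook that rewrites git status to rtk git status. A minimal sketch of that kind of rewrite, assuming a simple allow-list of command names. The list and the function name are hypothetical; RTK's actual hook logic is not documented in this note.

```python
import shlex

# Hypothetical allow-list; the commands RTK's real hook rewrites are not listed here.
WRAPPED = {"git", "cargo", "npm"}


def rewrite(command: str) -> str:
    """Prefix known-noisy commands with `rtk`; leave everything else untouched."""
    parts = shlex.split(command)
    if not parts or parts[0] == "rtk" or parts[0] not in WRAPPED:
        return command
    return "rtk " + command


if __name__ == "__main__":
    print(rewrite("git status"))  # rtk git status
    print(rewrite("ls -la"))      # ls -la
```

The check for an existing rtk prefix keeps the rewrite idempotent, so a hook that fires twice does not double-wrap the command.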

Tools evaluated, ranked by independent benchmark

Numbers from computingforgeeks — Ubuntu 24.04 VM, Claude Code 2.1.116, Sonnet 4.5, sindresorhus/ky repo (52 TS files). Baseline: 284,473 tokens / $0.2666 / 18 turns.

| Rank | Tool | Saved | Mechanism | Verdict for our setup |
|---|---|---|---|---|
| 1 | Mibayy/token-savior | -43% | Symbol-navigation MCP (90+ tools): replaces Read file.ts with find_symbol Ky.timeout. Tree-sitter AST. Persistent session memory. | Try. Highest savings. Use core profile (not full: 106 tools = 11K tokens of manifest tax). Best on typed langs (TS, Go, Rust). |
| 2 | drona23/claude-token-efficient | -40% | 11 rules in a 619-byte CLAUDE.md. No code. Skip sycophantic openers, edit-not-rewrite, don’t re-read unchanged files, stop when task done. | Adopt the rule patterns. Don’t import the file (we have caveman). Cherry-pick “don’t re-read unchanged files” + “stop when task done” and add to ponytail. |
| 3 | JuliusBrussee/caveman ultra | -38% | Output-prose compression: drop articles/pleasantries/hedging. 4 levels: lite, full, ultra, wenyan-ultra. | Active (full). Consider ultra for a further 1-5% on top. Wenyan = classical Chinese, more aggressive but readability risk. |
| 4 | ooples/token-optimizer-mcp | -23% | 65-tool MCP. Brotli-compressed SQLite cache. smart_read/smart_grep/smart_glob replace native. 7-phase hook lifecycle. | Heavier than Hr. Mostly overlaps Hr (cache + compress). Skip: redundant with Hr. |
| 5 | alexgreensh/token-optimizer | -18% | Plugin + hooks bundle | Skip. Lower savings than caveman alone. |
| 6 | tirth8205/code-review-graph | -5% small / -8.2× avg / -49× monorepo | AST graph (tree-sitter). On PR review, computes minimal file set needed. | Try on the umbrella. 20 submodules = monorepo behaviour. Worth testing on a real PR. |
| 7 | rtk-ai/rtk | 0% on the test task / -60-90% on noisy shell | CLI proxy compressing terminal output. Bash hook auto-rewrites. | Installed. Shines on cargo test (-90%), npm install (-92%), git diff (-75%). Doesn’t help on agent tasks that don’t shell out heavily. |
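
To read the Saved column as absolute numbers, the percentages can be applied to the 284,473-token baseline. A small sketch using only the whole-task figures from the table; the dictionary keys are shortened tool names, and each tool is projected on its own (nothing here assumes the savings stack).

```python
BASELINE_TOKENS = 284_473  # baseline from the benchmark quoted above

# Whole-task savings (percent) as listed in the ranking table.
SAVINGS_PCT = {
    "token-savior": 43,
    "claude-token-efficient": 40,
    "caveman ultra": 38,
    "token-optimizer-mcp": 23,
    "token-optimizer": 18,
}


def projected_tokens(baseline: int, saved_pct: int) -> int:
    """Tokens left after a saving, rounded to the nearest token (integer math)."""
    return (baseline * (100 - saved_pct) + 50) // 100


if __name__ == "__main__":
    for tool, pct in SAVINGS_PCT.items():
        print(f"{tool:24} {projected_tokens(BASELINE_TOKENS, pct):>8,}")
```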

Smaller / specialty:

Built-in Claude Code levers (cost-free, often higher impact than tools)

  1. Cap reasoning: MAX_THINKING_TOKENS=8000 — saves 20-40% on simple tasks
  2. Plan mode (Shift+Tab) — explore before writing; saves wrong-path cost (often 50K tokens)
  3. /effort none|low|medium|high|max — pin per task; mechanical work = none/low
  4. /compact with preservation hints: /compact Keep: file X, decision Y, error Z — run at 60-70%, not 93%
  5. Statusline ctx % gauge — visible context-burn meter
  6. Subagent dispatch — Read/Grep fan-out in subagent’s fresh context; only summary returns. 40-70% main-thread savings
  7. Skills over CLAUDE.md — skills’ frontmatter loads (~100 tokens each); body only when activated. Move anything >2KB out of CLAUDE.md
  8. MCP tool deferral: ENABLE_TOOL_SEARCH=auto keeps MCP manifests out of the cache prefix
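
Levers 1 and 8 are plain environment variables, so they can be pinned in one place. A minimal sketch that merges them into a settings dictionary, assuming they live under an "env" key of settings.json. That placement and the helper name are assumptions; the variable names and values come from the list above.

```python
import json

# Values taken from the list above; the "env" key placement is an assumption.
LEVERS = {
    "MAX_THINKING_TOKENS": "8000",
    "ENABLE_TOOL_SEARCH": "auto",
}


def with_levers(settings: dict) -> dict:
    """Return a copy of `settings` with the two levers merged under "env"."""
    merged = dict(settings)
    merged["env"] = {**settings.get("env", {}), **LEVERS}
    return merged


if __name__ == "__main__":
    print(json.dumps(with_levers({"env": {"EXISTING": "1"}}), indent=2))
```

Existing entries under "env" are kept, and the input dictionary is not mutated, so the helper is safe to run against a loaded settings file before writing it back.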

Compression mechanics (academic survey)

| Technique | What it does | Tool example |
|---|---|---|
| Cache alignment | Stabilize prompt prefix; move dynamic content to suffix; hit provider prefix-cache (10× cheaper reads at the Anthropic prices quoted in Round 2) | Hr CacheAligner |
| Structural compression | tree-sitter AST drops comments/docstrings/unused imports | Hr CodeCompressor; RTK rtk read -l aggressive |
| Selective context (LLMLingua, Selective-Context) | Score each token/sentence importance via small model; drop low-value tokens | Hr Kompress-base (HF model) |
| Recurrent context compression | Compress old turns into summaries, keep recent verbatim | Claude Code /compact; Hr stale-read compression |
| Retrieval-augmented (RAG) | Don’t put file in context; put a retrievable handle; fetch on demand | Hr CCR (headroom_retrieve tool); codebase-memory MCP |
| Symbol navigation | Replace file reads with symbol lookups (find_class, find_method) | Token Savior; LSP-based MCPs (Serena) |
| Output shaping | Trim verbose output via system-prompt rule | Caveman, Hr --output-shaper |
| Cross-session memory | SQLite/vector store of prior turns; recall on relevance | Hr --memory; claude-mem |
| Deduplication / aggregation | Group similar items (errors by type, files by dir) | RTK; Hr SmartCrusher |
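
The last row (group errors by type) is the easiest of these to show in a few lines. A minimal sketch, assuming error lines of the form `SomeError: message`. The pattern and output format are illustrative only and do not describe how RTK or Hr SmartCrusher parse output.

```python
import re
from collections import Counter

# Illustrative pattern: "<ErrorType>: message". Real tools use their own parsers.
ERROR_RE = re.compile(r"^(?P<kind>[A-Za-z_]\w*(?:Error|Exception)): ")


def aggregate_errors(lines: list[str]) -> list[str]:
    """Collapse repeated error lines into one summary line per error type."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = ERROR_RE.match(line)
        if match:
            kind = match.group("kind")
            counts[kind] += 1
            first_seen.setdefault(kind, line)  # keep one verbatim example per type
    return [f"{kind} x{n}, e.g. {first_seen[kind]}" for kind, n in counts.most_common()]


if __name__ == "__main__":
    log = ["ValueError: bad x", "TypeError: bad y", "ValueError: bad z", "ok"]
    for row in aggregate_errors(log):
        print(row)
```

Keeping one verbatim example per type preserves an exact error message for the agent while the repeats collapse into a count.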

Anti-patterns

Cross-refs


Round 2 — extended web research (2026-06-28)

Additional findings from keenable + linkup searches across academic compression papers, vendor blogs, and 2026-vintage benchmarks. Sources cited inline.

Anthropic prompt cache mechanics (cache-alignment deep-dive)

Per The Stack Stories — 84% bill cut, Brandon Wie, dev.to — 90% cuts, and Alex Spinov:

| Fact | Implication for us |
|---|---|
| Cache reads = $0.30/M vs $3.00/M fresh = 10× cheaper | Hr’s CacheAligner is real money |
| Cache writes pay +25% surcharge over base price | Don’t put a cache marker on unstable content |
| Default TTL = 5 min (Anthropic silently cut from 1hr in March 2026) | Idle gaps ≥5 min evaporate cache. Long human pauses cost more than you think. |
| 1-hour TTL available via "cache_control": {"type": "ephemeral", "ttl": "1h"} | Worth setting only when hit rate is high enough to amortize the write |
| Cache key = byte-for-byte prefix match | Any timestamp/request-id/session-id baked into system prompt drops hit rate to ~0% |
| Processing order: tools → system → messages (oldest → newest) | Cache breakpoints must respect this order |
| /compact reuses prefix cache (same system prompt + tools + CLAUDE.md) | Only the messages portion is paid afresh. So /compact is cheaper than /clear |
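
The first two rows are enough for a back-of-envelope cost model. A minimal sketch using the prices quoted above and assuming every request after the first lands inside the TTL; the function name and the 40,000-token prefix in the usage lines are hypothetical.

```python
PRICE_FRESH = 3.00       # $ per million input tokens, from the table above
PRICE_CACHE_READ = 0.30  # $ per million cached tokens, from the table above
WRITE_SURCHARGE = 0.25   # cache writes cost base price + 25%


def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix on `calls` consecutive requests.

    Assumes every request after the first lands inside the TTL (no cache expiry).
    """
    millions = prefix_tokens / 1_000_000
    if not cached or calls == 0:
        return millions * PRICE_FRESH * calls
    write = millions * PRICE_FRESH * (1 + WRITE_SURCHARGE)
    reads = millions * PRICE_CACHE_READ * (calls - 1)
    return write + reads


if __name__ == "__main__":
    prefix = 40_000  # hypothetical prefix size
    print(f"uncached, 20 calls: ${prefix_cost(prefix, 20, cached=False):.3f}")
    print(f"cached,   20 calls: ${prefix_cost(prefix, 20, cached=True):.3f}")
    print(f"cached,    1 call:  ${prefix_cost(prefix, 1, cached=True):.3f}")
```

A single call costs more with a cache marker than without one, which is the arithmetic behind "don’t put a cache marker on unstable content": a prefix that changes on every request pays the write surcharge and never collects a read.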

LLMLingua family — Microsoft’s prompt compressor stack

Sources: LLMLingua 2026 — TokenMix, CallSphere LLMLingua guide, arxiv “Prompt Compression in the Wild”, Microsoft LLMLingua github.

| Variant | Method | Compression | Quality drop |
|---|---|---|---|
| LLMLingua | Small LM scores tokens, drops low-importance | 4-20× | ~1.5pt accuracy |
| LongLLMLingua | Adds query-aware re-ranking for long-context QA | 4× tokens at +21.4% accuracy on NQ | actually improves perf |
| LLMLingua-2 | BERT-classifier (distilled), faster | 20× | ~1.5pt |
| SCBench | KV-cache-centric benchmark | n/a | n/a (eval tool) |
| SecurityLingua | Compression-based jailbreak defense | n/a | guardrail |

Hr already uses Kompress-base (HF model) as its prose compressor — same idea as LLMLingua but trained on agentic traces. Hr is essentially “LLMLingua-on-a-proxy + cache-alignment + retrieval”.
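
The score-then-drop idea can be shown with a toy stand-in. The sketch below is not LLMLingua or Kompress-base: it swaps the small language model for unigram self-information computed from the text itself, purely to illustrate the mechanism. The function name and the keep_ratio parameter are hypothetical.

```python
import math
from collections import Counter


def drop_low_information(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the `keep_ratio` share of words with the highest self-information.

    Toy scoring: -log(p) under unigram frequencies of the text itself.
    LLMLingua-style tools score tokens with a small language model instead.
    """
    words = text.split()
    if not words:
        return ""
    counts = Counter(w.lower() for w in words)
    total = len(words)
    scores = [-math.log(counts[w.lower()] / total) for w in words]
    keep_n = max(1, round(len(words) * keep_ratio))
    # Indices of the highest-scoring words; ties resolved by earlier position.
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:keep_n])  # restore original word order
    return " ".join(words[i] for i in keep)


if __name__ == "__main__":
    print(drop_low_information("the cache the prefix the timestamp"))
```

Frequent words score low and are dropped first, while the original word order is preserved for whatever survives.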

Morphllm — 7 compression methods, benchmarked

Per morphllm.com/context-compression and morphllm.com/flashcompact. Factory.ai ran a 36,611-message benchmark on real coding sessions:

| Method | Score | Notes |
|---|---|---|
| Anchored summary (Factory.ai) | 3.70 / 5 | Best overall. Verbatim accuracy 98%, 50-70% compression |
| FlashCompact (Morph) | n/a | 3-4× longer sessions, 0% hallucination |
| ACON | 26-54% reduction | at 95%+ accuracy |
| LLMLingua | 20× max | 1.5pt accuracy drop |
| RTK | 89% avg | Measured across git/test/build commands |

Key insight: Cognition measured coding agents spend 60% of their time on search operations. Most tokens wasted in search. Search-compress-apply is the production pattern.

MCP server token tax

Per MCP context bloat fix 2026, getunblocked MCP autopsy, getunblocked GitHub MCP 42K:

| MCP server | Tool definitions token cost |
|---|---|
| GitHub MCP (91 tools) | ~42,000 tokens, biggest single offender |
| Playwright MCP | ~13,600 tokens |
| Speakeasy Dynamic Toolsets (400 tools) | ~410,000 → ~8,000 with deferral (96% savings) |
| Atlassian mcp-compressor | 70-97% savings |
| Average MCP tool | 500-2,000 tokens each |

Tool Search (ENABLE_TOOL_SEARCH=auto) defers MCP tool definitions OUT of cache prefix. Joe Njenga measured 46.9% main-thread bloat reduction with Claude Code 2.0’s tool-search subagent. Already enabled in our settings.json.

Action: audit our 25+ MCP servers. The GitHub MCP alone is ~42K tokens. If we don’t use 60% of its tools, that’s pure waste.
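
The audit reduces to one multiplication per server. A minimal sketch using the manifest sizes from the table above, assuming tool definitions within a server are roughly equal in size (a simplification); the function name is illustrative.

```python
# Manifest sizes quoted in the table above (approximate tokens).
MANIFEST_TOKENS = {
    "GitHub MCP": 42_000,
    "Playwright MCP": 13_600,
}


def unused_manifest_tokens(manifest_tokens: int, unused_fraction: float) -> int:
    """Tokens spent on tool definitions that are never called.

    Assumes tool definitions are roughly equal in size, which is a simplification.
    """
    if not 0.0 <= unused_fraction <= 1.0:
        raise ValueError("unused_fraction must be between 0 and 1")
    return round(manifest_tokens * unused_fraction)


if __name__ == "__main__":
    # The note's example: 60% of the GitHub MCP tools go unused.
    print(unused_manifest_tokens(MANIFEST_TOKENS["GitHub MCP"], 0.60))
```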

Model routing — 50-60% savings

Per Most agents don’t need Sonnet, Reducing fleet costs, Claude Code Router:

Our setup uses Sonnet 4.6 as default per memory rule. Action: define a researcher subagent pinned to Haiku for read-heavy fan-outs.

Skills — progressive disclosure (3-tier loading)

Per claudecodeguides progressive disclosure, hatchworks Claude Skills, duet.so complete guide:

| Tier | What loads | Cost | When |
|---|---|---|---|
| 1: Metadata | YAML frontmatter only (name, description) | ~100-200 tokens per skill | Always, every session |
| 2: SKILL.md body | The instructions | Variable | When model invokes the skill |
| 3: Resources | Scripts, refs, templates referenced in SKILL.md | Variable | On-demand within skill execution |

Math: A 5K-token CLAUDE.md costs 5K tokens on every API call. A 50-skill library also costs ~5K tokens total at tier 1 (skills frontmatter, ~100 tokens each), but carries far more content, because only the activated skill’s body is loaded at tier 2.
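
The same arithmetic as a sketch. The 100-token frontmatter figure is from the tier table; the 2,000-token body size is a hypothetical value chosen for illustration, and the function names are not from any tool.

```python
def always_loaded_cost(n_topics: int, body_tokens: int) -> int:
    """Everything inlined in CLAUDE.md: every body is paid on every call."""
    return n_topics * body_tokens


def skills_cost(n_skills: int, frontmatter_tokens: int, body_tokens: int, active: int = 1) -> int:
    """Tier 1 frontmatter for all skills plus tier 2 bodies for the active ones."""
    return n_skills * frontmatter_tokens + active * body_tokens


if __name__ == "__main__":
    body = 2_000  # hypothetical tokens per topic/skill body
    print(always_loaded_cost(50, body))   # all 50 bodies inlined
    print(skills_cost(50, 100, body))     # 50 frontmatters + 1 active body
```

With those assumed sizes, inlining 50 topics costs 100,000 tokens per call, while the skill layout costs 7,000 with one skill active.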

Action for us: anything in CLAUDE.md/AGENTS.md/knowledge over 2KB should be a skill instead. Specifically:

code-review-graph — specifics for monorepo PR review

Per tirth8205/code-review-graph, Starlog deep-dive, Callsphere 87% reduction:

Verdict: Skip per earlier grill. Our codebase-memory MCP does AST + graph + query already (121K nodes / 214K edges). Adding code-review-graph would be 2nd impl, cache cancellation risk.

drona23/claude-token-efficient — 40% from 11 rules

Per computingforgeeks rank 2. Their 619-byte CLAUDE.md has these patterns. Cherry-picked for our ponytail:

  1. Skip sycophantic openers
  2. Prefer Edit over Write
  3. Don’t re-read unchanged files (NEW addition to ponytail)
  4. Stop once the task is done (no “let me also…” wandering)
  5. Reuse existing helpers before writing new ones (already in ponytail)
  6. No speculative abstractions (already in ponytail)
  7. Match surrounding style (already a rule)
  8. Quote errors exactly (already in caveman)
  9. Code blocks unchanged (already in caveman)
  10. Cap explanation at 3 lines for trivial fixes
  11. No status updates (“running grep…” — already in caveman)

Action: add #3, #4, #10 to ponytail. Cheap wins.

Caveman ultra confirmed

Per JuliusBrussee/caveman README:

Final stack picture (our installed compression layers, top to bottom)

  1. Input proxy: Headroom 0.27 on :8787 — cache-align + Kompress + code-aware + output-shaper + memory + learn
  2. Knowledge index: codebase-memory MCP — 121K nodes / 214K edges, AST graph
  3. Symbol navigation: Serena MCP — LSP-based, language-server queries
  4. Shell output: RTK 0.42.4 — 48.5% savings on 329 commands measured
  5. Prose discipline: Caveman ULTRA — 40-43% output reduction
  6. Code discipline: Ponytail — 7-rung lazy ladder
  7. Cache key hygiene: No timestamps/IDs in system prompt; prefix-stable
  8. Tool deferral: ENABLE_TOOL_SEARCH=auto — MCP manifests out of cache prefix
  9. Reasoning cap: MAX_THINKING_TOKENS=8000
  10. Compaction discipline: Manual /compact Keep: … at 60-70% (not auto at 93%)
  11. Subagent dispatch: for fan-out Read/Grep (40-70% main-thread savings) — not yet using

Web sources cited

