
Claude Code latency: keep cache hot, route through Hr, balance speed/accuracy/cost

rule · latency, speed, claude-code, prompt-caching, headroom, hard-rule

Claude Code latency optimization

Goal: balance turn time, output quality, and cost. Sources: Anthropic prompt-caching docs, the April 2026 postmortem, and the 2026-06-29 settings-balance grill-me session.

The cache model (must understand to optimize)

Each turn the API caches the request prefix. New content goes at the end, so the cache reuses everything before the latest exchange. Any change inside the prefix invalidates the cache from the changed point onward, even if the later content is identical.

Three layers, ordered from stable to volatile:

| Layer | Content | When it changes |
| --- | --- | --- |
| System prompt | Core instructions, tool definitions, output style | Tool definitions change, Claude Code upgrade |
| Project context | CLAUDE.md, AGENTS.md, auto memory, rules | Session starts, /clear, /compact |
| Conversation | User messages, Claude responses, tool results | Every turn |
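
The prefix rule above can be illustrated with a small simulation. This is a minimal sketch, not the API's actual cache implementation: it models a request as a list of text blocks (the block labels are hypothetical) and counts how many leading blocks match the previous request. Only that matching run would be reusable.

```python
from typing import List


def cached_prefix_len(previous: List[str], current: List[str]) -> int:
    """Count the leading blocks that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # first difference ends the reusable prefix
        n += 1
    return n


# Hypothetical block labels, ordered stable -> volatile as in the layer table.
turn_1 = ["system-prompt", "CLAUDE.md", "user: fix the bug"]
# Normal turn: new content is appended at the end.
turn_2 = turn_1 + ["assistant: done", "user: add a test"]
# Project context edited mid-session: everything after it is the same text,
# but it no longer sits behind an identical prefix.
turn_3 = ["system-prompt", "CLAUDE.md (edited)"] + turn_2[2:]

print(cached_prefix_len(turn_1, turn_2))  # 3
print(cached_prefix_len(turn_2, turn_3))  # 1
```

Appending keeps all three earlier blocks reusable. Editing the second block leaves only the system prompt reusable, even though the conversation blocks did not change.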

The cache key ALSO includes:

Actions that destroy the cache mid-session

DON'T do these mid-task unless absolutely necessary:

Reserve /compact for natural breaks between tasks, not mid-task.

Settings.json — current pin (2026-06-29 rebalance)

~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8787",
    "ANTHROPIC_MODEL": "claude-opus-latest",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "claude-haiku-latest",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-latest",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-latest",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_USE_POWERSHELL_TOOL": "1",
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "85",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_TELEMETRY": "1",
    "ENABLE_TOOL_SEARCH": "auto"
  },
  "model": "claude-opus-latest",
  "effortLevel": "high",
  "alwaysThinkingEnabled": true,
  "showThinkingSummaries": true,
  "autoMemoryEnabled": true,
  "switchModelsOnFlag": true
}

Auth token lives in ~/.claude/settings.local.json (gitignored).
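
A quick sanity check for the pin could look like the sketch below. It is an illustration only: `check_settings` is a hypothetical helper, and treating any env key ending in `AUTH_TOKEN` or `API_KEY` as a credential is an assumption, not something this rule specifies.

```python
from typing import List
from urllib.parse import urlparse

EXPECTED_PROXY = ("127.0.0.1", 8787)  # Hr address from the pinned settings


def check_settings(settings: dict) -> List[str]:
    """Return a list of problems; an empty list means the pin looks sane."""
    problems = []
    env = settings.get("env", {})

    # The base URL must point at the local Hr listener.
    url = urlparse(env.get("ANTHROPIC_BASE_URL", ""))
    if (url.hostname, url.port) != EXPECTED_PROXY:
        problems.append("ANTHROPIC_BASE_URL does not point at 127.0.0.1:8787")

    # Assumption: credentials use keys ending in AUTH_TOKEN or API_KEY.
    # They belong in settings.local.json, not in settings.json.
    for key in env:
        if key.endswith(("AUTH_TOKEN", "API_KEY")):
            problems.append(f"{key} should live in settings.local.json")

    return problems
```

Usage: load `~/.claude/settings.json` with `json.load` and pass the resulting dict to `check_settings`.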

Why each line matters:

Skill triggers > free-form prose

User says trigger phrase → invoke the matching skill (focused prompt + tool budget). Skills run in ~30% less wall-clock time than ad-hoc prose discussion of the same topic.

| Phrase | Skill |
| --- | --- |
| "grill me", "stress-test" | grill-me |
| "review the diff", "code review" | /code-review |
| "security review", "audit" | /security-review |
| "verify it works" | /verify |
| "simplify" | /simplify |
| "deep research" | /deep-research |

Same table is in CLAUDE.md. This rule locks the behavior: when user uses a trigger phrase, prefer the skill invocation over discussing it.
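
The phrase table can be read as a lookup. The sketch below is only an illustration of that mapping: `match_skill` is a hypothetical helper, and the naive case-insensitive substring match is an assumption (it would also fire on words such as "auditorium"). It is not how Claude Code itself resolves skills.

```python
from typing import Optional

# Phrase -> skill, copied from the table above.
TRIGGERS = {
    "grill me": "grill-me",
    "stress-test": "grill-me",
    "review the diff": "/code-review",
    "code review": "/code-review",
    "security review": "/security-review",
    "audit": "/security-review",
    "verify it works": "/verify",
    "simplify": "/simplify",
    "deep research": "/deep-research",
}


def match_skill(message: str) -> Optional[str]:
    """Return the skill for the first trigger phrase found, else None."""
    text = message.lower()
    # Try longer phrases first so the most specific trigger wins.
    for phrase in sorted(TRIGGERS, key=len, reverse=True):
        if phrase in text:
            return TRIGGERS[phrase]
    return None


print(match_skill("Can you grill me on this design?"))  # grill-me
print(match_skill("What time is it?"))                  # None
```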

Headroom (Hr) — input compressor in the chain

Hr listens on localhost:8787. Compresses file reads + chat history before sending upstream. Chain: Claude Code → Hr :8787 → hai :6655 → Bedrock. Already running (verified healthy).

If Hr is down, Claude Code fails. That's intentional — single config, single chain.
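
Because a dead hop breaks the whole chain, a quick port probe helps when debugging. The sketch below is a minimal example with hypothetical helper names. It only checks that something accepts TCP connections on the two local ports named above; a successful connect does not prove the service is healthy, and Bedrock is not probed.

```python
import socket
from typing import Dict

# Local hops from the chain above: name -> port.
CHAIN = {"Hr": 8787, "hai": 6655}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable
        return False


def chain_status(host: str = "127.0.0.1") -> Dict[str, bool]:
    """Probe every local hop and report which ones are listening."""
    return {name: port_is_open(host, port) for name, port in CHAIN.items()}
```

Usage: `print(chain_status())` before starting a session. If `Hr` is `False`, Claude Code requests will fail at the first hop.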

RTK — installed and active

RTK (Rust Token Killer) v0.42.4 compresses shell-tool output (git diff, npm install, ls -R) before agent reads it. rtk gain as of 2026-06-28: 48.5% savings on 329 commands, 370.9K tokens saved.

Verify: rtk --version (should show ≥0.28.2), rtk gain (should show non-zero savings).
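
The version check is easy to get wrong with string comparison ("0.9.10" sorts after "0.28.2" as text). A minimal sketch of a numeric comparison follows. The helper names are hypothetical, and the exact output format of `rtk --version` is an assumption: the code just looks for the first `x.y.z` in whatever text it is given.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first x.y.z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(output: str, minimum: str = "0.28.2") -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(output) >= parse_version(minimum)


print(meets_minimum("rtk 0.42.4"))  # True
print(meets_minimum("rtk 0.9.10"))  # False
```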

Ponytail + Caveman — output-side compression

Both already inlined in AGENTS.md. Ponytail = "lazy senior dev" ladder (less code = less output = faster). Caveman = terse prose (drop articles/filler = less prose = faster).

Prefer /clear over /compact when task is done

/compact creates a lossy summary that accumulates sediment across cycles. /clear resets to the system prompt — same baseline every time.

| Action | When | Effect |
| --- | --- | --- |
| /clear | Task complete, moving to next | Full reset; system prompt cache preserved |
| /compact | Mid-task, need to keep some history | Summary injected; cache prefix invalidated |
| Neither | Still in smart zone (<75K tokens) | Just keep going |

Corollary: PERSIST DURABLE INFO IN knowledge/ BEFORE /clear. The clear discards everything except what's on disk. See [[knowledge/rules/agent/self-update-rule]] and [[knowledge/rules/agent/context-cliff-100k]].

Source: Matt Pocock workshop 2026-07-03 ("compacting devs like it for some reason, but I hate it").
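
The table and corollary above can be sketched as a small decision function. `next_action` is a hypothetical helper. The precedence is an assumption the rule does not spell out: task completion is checked first, then the 75K smart-zone threshold.

```python
SMART_ZONE_TOKENS = 75_000  # threshold from the table above


def next_action(task_complete: bool, context_tokens: int,
                notes_persisted: bool = False) -> str:
    """Pick /clear, /compact, or neither for the current session state."""
    if task_complete:
        # Corollary: durable info must be on disk before the reset.
        if not notes_persisted:
            return "persist to knowledge/, then /clear"
        return "/clear"
    if context_tokens < SMART_ZONE_TOKENS:
        return "keep going"  # still in the smart zone
    return "/compact"  # mid-task and past the smart zone


print(next_action(task_complete=False, context_tokens=40_000))  # keep going
print(next_action(task_complete=True, context_tokens=90_000,
                  notes_persisted=True))                        # /clear
```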

Anti-patterns