AI Workflow · part 1
Claude Code Burning Through Tokens? 8 Fixes to Make Sessions Last 10x Longer
❯ cat --toc
- You Just Started Using Claude Code. It's Amazing. Then It Slows Down.
- Preface
- Part 1: Where Do All the Tokens Go?
- The Stuff You Don't See
- The Hidden Buffer
- Why It Gets Worse Over Time
- Part 2: The Four Biggest Wastes
- Waste #1: Reading the Same File Over and Over
- Waste #2: Tools You Never Use
- Waste #3: A CLAUDE.md That's Too Long
- Waste #4: Old Conversation That Doesn't Matter Anymore
- Part 3: What You Can Do About It
- Fix 0: See Where Your Tokens Actually Go
- Fix 1: Use the Right Model for the Job
- Fix 2: /compact and /clear at the Right Time
- Fix 3: Describe the Problem, Not the File
- Fix 4: Trim Your MCP Tools
- Fix 5: Keep Your CLAUDE.md Lean
- Rules
- Architecture
- Conventions
- Fix 6: Use Subagents for Exploration
- Fix 7: Consider the 1M Context Window
- Part 4: Give Claude a Search Engine for Your Notes (Advanced)
- Part 5: When Your Notes Learn to Cross-Reference Themselves (Advanced)
- Musubi: A Knowledge Graph for Your Notes
- The Takeaway
TL;DR
Claude Code's context fills up because every message carries invisible overhead — rules, tool definitions, conversation history. The biggest waste: Claude re-reading the same files over and over. Quick fixes: /compact between tasks, /clear for fresh starts, describe the problem instead of naming files, trim unused MCP tools, keep CLAUDE.md lean, use Sonnet for daily work. Deeper fix: give Claude a search engine (QMD) + knowledge graph (Musubi) so it finds answers in 2-3 files instead of 10.
You Just Started Using Claude Code. It's Amazing. Then It Slows Down.
You install Claude Code. You ask it to fix a bug. It reads your files, understands the problem, writes a fix. Magic.
Then by your 15th question, something changes. Claude starts forgetting things you told it earlier. Responses get less precise. Eventually you see the dreaded compaction message — Claude had to summarize and compress your conversation because the context window was full.
What happened? You didn't do anything wrong. The problem is invisible, and this article makes it visible.
Preface
Imagine a desk. Every time you ask Claude a question, it doesn't just look at your question — it spreads out its entire instruction manual, every tool it might need, and transcripts of everything you've discussed so far. Your actual question sits on whatever space is left.
The desk is 200K tokens wide. It sounds huge until you realize 40% of it is covered before you even sit down.
Part 1: Where Do All the Tokens Go?
The Stuff You Don't See
Every single message you send to Claude carries invisible baggage:
Your rules (CLAUDE.md) — If you've set up a CLAUDE.md file with coding standards, project conventions, and workflow instructions, Claude re-reads the entire thing every turn. A detailed CLAUDE.md can be 3-15K tokens. That's fine — it's what makes Claude useful for your project. But it's not free.
Tool definitions — Every MCP tool (GitHub, Playwright, database connectors, etc.) adds its complete instruction manual to the conversation. One tool = 500-2000 tokens. Got 20 tools installed? That's 10-40K tokens of permanent overhead, every single turn, even if you never use most of them.
Conversation history — Everything Claude said, everything you said, every file it read, every command it ran — it's all still there, growing with every turn.
Here's what a typical session looks like in tokens:
| What | How big | When |
|---|---|---|
| System instructions | 2-8K | Always there |
| Your CLAUDE.md | 3-15K | Always there |
| Tool definitions | 5-50K | Always there |
| Your conversation so far | Grows every turn | Always there |
| Files Claude reads | 1-10K each | When it reads |
Add it up: A fresh session with a solid CLAUDE.md and 15-20 MCP tools starts at 50-80K tokens. That's 25-40% of your 200K context window gone before you ask your first question.
The Hidden Buffer
Here's something most people don't know: Claude Code reserves roughly 33K tokens as an internal buffer (source). This space is used for summarization when compaction happens and for generating responses. You can't disable it. (Note: this number may change as Claude Code updates — run /context to see your actual usable space.)
That means your 200K context window is closer to ~167K of usable space. Auto-compaction kicks in well before 100%. If you're wondering why compaction happens "earlier than expected" — this is why.
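The back-of-envelope is simple enough to sanity-check in shell arithmetic. The 50K overhead below is the low end of the 50-80K startup range from the table above (an assumption for illustration; run /context for your real numbers):

```shell
WINDOW=200000     # advertised context window
BUFFER=33000      # reserved autocompact buffer
OVERHEAD=50000    # low end of the 50-80K startup overhead (CLAUDE.md + tools + system)
echo "usable at turn 1: $(( WINDOW - BUFFER - OVERHEAD )) tokens"
# → usable at turn 1: 117000 tokens
```

With a heavier setup (80K overhead) the same math leaves you only 87K before you've typed a word.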
Why It Gets Worse Over Time
Turn 1: you have ~120K tokens of room for actual work (after overhead + buffer).
Turn 10: conversation history has grown, Claude has read a bunch of files, made some edits, run some commands. Maybe 70K left.
Turn 20: you're debugging something complex, Claude keeps re-reading files to check its work, tool results are piling up. Down to 20K. Claude starts losing track of what you discussed in turn 3.
Turn 25: compaction. Claude summarizes the conversation to free up space, but in the process, it loses details. That fix you discussed in turn 7? Gone.
This isn't a bug. It's just how context windows work. The key is knowing what fills them up — so you can control it.
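The turn-by-turn decay above can be sketched as a toy loop. The 5K-per-turn figure is an assumed average (history, file reads, tool output), not a measurement:

```shell
FREE=120000                  # room at turn 1, after overhead + buffer
for turn in $(seq 1 20); do
  FREE=$(( FREE - 5000 ))    # each turn adds ~5K of history, reads, and tool results
done
echo "after 20 turns: ${FREE} tokens free"
# → after 20 turns: 20000 tokens free
```

The curve is linear only in this toy model; in practice, debugging turns that re-read files burn far more than average, which is why the last stretch before compaction feels so abrupt.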
Part 2: The Four Biggest Wastes
Waste #1: Reading the Same File Over and Over
This is the single biggest token waste. Community reports suggest a large portion of file-read tokens are redundant — Claude reading files it already read, with barely any changes.
Watch what happens in a typical bug fix:
- Claude reads server.ts to understand the code (4,000 tokens)
- Claude reads handler.ts to find the bug (3,000 tokens)
- You ask "what about the error handling?" → Claude reads server.ts again (4,000 tokens)
- Claude makes an edit → reads handler.ts to double-check (3,000 tokens)
- Build fails → Claude reads both files again (7,000 tokens)
That's 21,000 tokens spent reading two files. The files barely changed between reads. But Claude doesn't have a "just show me what changed" option — it re-reads the entire file every time.
Waste #2: Tools You Never Use
You installed a cool MCP server with 15 tools. You use 2 of them. The other 13 sit there, doing nothing, except consuming 500-2000 tokens each on every single turn.
It's like carrying a toolbox with 30 tools to fix a leaky faucet. You need a wrench. The other 29 tools just make the toolbox heavier.
Pro tip: The gh CLI uses far fewer tokens than the GitHub MCP server for the same operations (creating PRs, checking issues, viewing diffs). If you're doing GitHub work, try gh first.
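For reference, the everyday GitHub operations look like this with the plain gh CLI (a sketch: it needs a one-time gh auth login, and the PR number is a placeholder):

```
# Everyday GitHub operations, no per-turn MCP schema cost
gh issue list --state open     # browse open issues
gh pr create --fill            # open a PR from the current branch
gh pr diff 123                 # view a PR's diff (123 is a placeholder)
gh pr checks 123               # see CI status for that PR
```

Claude Code can run these through its normal shell tool, so you get the same capability without carrying the GitHub MCP server's tool definitions on every turn.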
Waste #3: A CLAUDE.md That's Too Long
Your CLAUDE.md is injected into every single request. Every turn. Every follow-up. If your CLAUDE.md is 10,000 tokens, you're taxed 10,000 tokens on every interaction before Claude even reads your code.
Most people put too much in CLAUDE.md. Detailed coding style guides, long architecture explanations, exhaustive lists of conventions — all valid information, but it doesn't need to live in the file that gets loaded every turn.
Waste #4: Old Conversation That Doesn't Matter Anymore
By turn 15, the first 8 turns are usually irrelevant — you asked some exploratory questions, tried a wrong approach, changed direction. But those old turns are still in context, taking up space and sometimes confusing Claude with outdated information.
"Wait, didn't you say we should use approach X?" — No, that was turn 3. We abandoned that in turn 6. But Claude still sees both.
Part 3: What You Can Do About It
Fix 0: See Where Your Tokens Actually Go
Before fixing anything, measure first. Claude Code has built-in commands that show you exactly what's eating your context:
/context — Shows breakdown: system prompt, tools, memory, conversation
/cost — Shows token usage and dollar cost for this session
/memory — Shows what persistent files Claude is loading
/mcp — Shows which MCP servers and tools are active
Run /context right now. Here's what it looks like on a real session:
Estimated usage by category
⛁ System prompt: 6k tokens (0.6%)
⛁ System tools: 11k tokens (1.1%)
⛁ MCP tools: 934 tokens (0.1%) ← lazy loading keeps this tiny
⛁ Custom agents: 1.5k tokens (0.1%)
⛁ Memory files: 10.5k tokens (1.0%) ← CLAUDE.md + rules
⛁ Skills: 2.7k tokens (0.3%)
⛁ Messages: 196.6k tokens (19.7%) ← the conversation itself
⛶ Free space: 737.8k (73.8%)
⛝ Autocompact buffer: 33k tokens (3.3%)
This session is on the 1M context window, so 23% used after a long session. On the default 200K window, the same overhead would be over 100% — compaction territory. Notice the 33K autocompact buffer at the bottom — that's real, it's reserved, you can't use it.
You might discover that one MCP server is consuming 18K tokens, or that your CLAUDE.md is bigger than you thought. Fix the biggest offender first — that's worth more than all the other tips combined.
Fix 1: Use the Right Model for the Job
Not every task needs the biggest model. Claude Code lets you switch:
| Task | Best Model | Why |
|---|---|---|
| Complex refactoring, architecture | Opus | Needs deep reasoning |
| Writing code, tests, daily work | Sonnet | Fast and capable |
| Renaming, formatting, lookups | Haiku | Cheap and instant |
You can toggle with /model in Claude Code. Using Sonnet for your daily work and switching to Opus only for hard problems can cut your token costs significantly — Opus costs roughly 5x more per token than Sonnet for the same task.
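A rough back-of-envelope shows why the model choice matters. The per-million-token prices here are illustrative assumptions for the ~5x ratio, not quotes (check current pricing):

```shell
SONNET=3    # $ per million input tokens (illustrative)
OPUS=15     # ~5x Sonnet (illustrative)
TOKENS=20   # million tokens processed in a heavy month
echo "Sonnet: \$$(( SONNET * TOKENS ))   Opus: \$$(( OPUS * TOKENS ))"
# → Sonnet: $60   Opus: $300
```

Same workload, 5x the bill. Reserving Opus for the handful of genuinely hard problems keeps most of your month at the lower rate.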
Fix 2: /compact and /clear at the Right Time
Claude Code has two commands for managing context:
/compact — Summarizes your conversation and frees up space. Good when you're switching tasks but want to keep some context.
/clear — Wipes conversation history entirely. More aggressive, but perfect when the new task has nothing to do with the previous one.
A good rule: If your next task doesn't depend on the last 20 messages, use /clear. If it does, use /compact.
Use /compact when:
- You just finished a task and are about to start a different one
- You've been exploring and are ready to start implementing
- Claude starts referencing things from early in the conversation incorrectly
Use /clear when:
- You're switching to a completely different project or feature
- You're at 60%+ context and about to start something new
- You'd rather start fresh than carry stale context
Think of it like clearing your desk between tasks. /compact = organize the papers. /clear = clean slate.
Fix 3: Describe the Problem, Not the File
Instead of:
"Read server.ts"
Try:
"handleAuth seems to not handle the null return case, can you check?"
"The login button flashes once then does nothing"
You don't need to remember line numbers or even file names. Give Claude a function name, a feature description, or the symptom you're seeing — it will Grep for the right file and read just those few dozen lines.
The difference? "Read server.ts" = Claude reads all 400 lines (3-4K tokens). Describing the problem = Claude pinpoints the relevant 30 lines (300 tokens). That's 10x less, and you don't need to memorize anything.
Fix 4: Trim Your MCP Tools
Check what MCP tools are loaded in your session. If you have 20 tools but only use 5 regularly, you're burning 15-30K tokens of context every turn on tools that sit idle.
Three approaches:
- Lazy loading — Claude Code already defers many tool schemas. Check if your custom MCP servers support it.
- CLI over MCP — Tools like gh (GitHub), supabase, and vercel as CLI commands cost almost nothing compared to their MCP equivalents.
- Session-specific — Only enable MCP servers in sessions where you actually need them.
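The claude CLI itself can manage servers. A sketch of the session-trimming workflow (subcommand names assumed from current Claude Code; confirm with claude mcp --help on your version):

```
claude mcp list                 # see every configured MCP server
claude mcp remove playwright    # "playwright" is an example name — pick one from your list
```

Run the list command, compare it against what /context reports as MCP overhead, and remove anything you haven't touched in weeks. You can always add a server back for the one session that needs it.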
Fix 5: Keep Your CLAUDE.md Lean
Aim for under 2,000 tokens. Put only the essentials directly in CLAUDE.md:
- 3-5 most important rules
- Key project conventions
- File pointers to detailed docs
Everything else — detailed style guides, architecture docs, debugging playbooks — should live in separate files that Claude reads only when relevant, not on every turn.
# CLAUDE.md (lean version)
## Rules
- Use TypeScript strict mode
- Immutable patterns only
- Tests before implementation
- Never re-read a file you already read this session unless it was edited.
When debugging, search by function name or symptom, not by reading entire files.
Use subagents for exploratory research.
## Architecture
See docs/ARCHITECTURE.md for details.
## Conventions
See docs/CONVENTIONS.md for full guide.
Those last three rules — no redundant re-reads, search by symptom, subagents for research — encode three of the biggest token savings directly into Claude's behavior, so it applies them automatically without you having to remind it every turn.
Claude can always read ARCHITECTURE.md when it needs it. But it doesn't need to carry it on every single turn.
Pro tip: Ask Claude to audit its own CLAUDE.md. Just say:
"Read your CLAUDE.md. Check what's outdated or unnecessary. Suggest a trimmed version under 2000 tokens."
Claude will review the file, flag what's stale, and propose a leaner version. It's a 2-minute exercise that can save thousands of tokens per session going forward.
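To check whether you're under the 2,000-token budget without a tokenizer, a character count gets close. Roughly 4 characters per token is a rule of thumb for English text (an approximation; the sample file here is just to make the example self-contained):

```shell
# Tiny sample file so the example runs anywhere — point this at your real CLAUDE.md
printf 'Use TypeScript strict mode.\nImmutable patterns only.\n' > /tmp/CLAUDE.md
chars=$(wc -c < /tmp/CLAUDE.md)
echo "approx tokens: $(( chars / 4 ))"
# → approx tokens: 13
```

Swap in your actual CLAUDE.md path; if the estimate comes out well north of 2,000, that's your cue to start moving details into docs/.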
Fix 6: Use Subagents for Exploration
When you need Claude to research something — dig through files, search the codebase, investigate an error — use subagents. They run in a separate context window, keeping your main conversation clean.
"Use a subagent to investigate how authentication works in this codebase"
The subagent reads 20 files, explores the code, and comes back with a summary. Your main context only receives the summary (a few hundred tokens), not the 20 files (40K tokens).
Important: subagents aren't free. They still burn tokens — just in a separate context window. The benefit isn't "saving money," it's "keeping your main conversation alive longer." Think of it like opening a new browser tab to research something, then closing it and bringing back just the conclusion. The pages you browsed in that tab don't clutter the tab where you're actually writing.
Fix 7: Consider the 1M Context Window
As of 2026, Claude Opus 4.6 and Sonnet 4.6 support a 1M token context window with no pricing premium. If your sessions consistently run out of context at 200K, switching to the 1M window gives you 5x more room.
This doesn't fix the underlying waste — you're still paying for redundant reads and unused tools. But it gives you breathing room while you apply the other fixes.
Part 4: Give Claude a Search Engine for Your Notes (Advanced)
Fixes 0-7 above are enough for most people. Parts 4 and 5 are for power users who work with hundreds of notes across sessions.
Fixes 0-7 reduce waste. This fix changes the game.
The pattern without a search engine:
- You ask Claude about a bug you've seen before
- Claude doesn't remember (different session)
- Claude reads 5-10 files trying to find the answer
- 30,000 tokens later, it finds the relevant note
The pattern with a search engine:
- You ask Claude about a bug you've seen before
- Claude searches your notes (200 tokens)
- Finds the exact file (2,000 tokens to read)
- 2,200 tokens total instead of 30,000
QMD is one such tool: a local search engine for markdown files. It indexes your notes and lets Claude find answers in milliseconds instead of reading files one by one. In our index, 732 documents are searched in about 30ms.
QMD has three search modes:
| Tool | Speed | Best for |
|---|---|---|
| search | ~30ms | You know the keyword — "vLLM OOM", "Ollama keep_alive" |
| vector_search | ~2s | You know what you mean but not the exact words — "model using too much memory" |
| deep_search | ~10s | Complex queries, auto-expands into variations and reranks results |
It also has get (read a specific document) and multi_get (batch read). Claude calls these tools directly through MCP — you don't have to do anything manually.
But keyword search has a ceiling: it only finds notes that use the same words as your question. vector_search helps somewhat (it searches by meaning), but when the connection between two notes is conceptual — like "Ollama keeping models in memory" and "vLLM crashing on startup" both being about the same 128GB memory pool — you don't need search, you need a relationship map.
Part 5: When Your Notes Learn to Cross-Reference Themselves (Advanced)
This is where knowledge graphs come in. Don't let the fancy name intimidate you — the concept is simple.
A knowledge graph is a map of how your notes relate to each other.
Imagine you have 400 notes about different topics. Some of them are related, but they use different words. A note about "Ollama keeping models in memory" and a note about "vLLM crashing on startup" are deeply related — both are about the same 128GB memory pool. But a keyword search for "vLLM crash" would never find the Ollama note.
A knowledge graph pre-computes these connections. It scans all your notes, finds shared concepts, and draws lines between related documents — even when they use completely different vocabulary.
The result:
| How Claude finds information | Files it reads | Tokens spent |
|---|---|---|
| Reads everything that might be relevant | 10-20 files | ~40,000 |
| Keyword search (QMD) | 3-5 files | ~10,000 |
| Knowledge graph search | 2-3 files | ~4,000 |
Same answer. 10x fewer tokens.
The knowledge graph doesn't just save tokens — it finds notes you wouldn't have searched for. That's the real magic: connections you didn't know existed in your own notes.
Musubi: A Knowledge Graph for Your Notes
Full disclosure: Musubi is an open-source tool we built. The following is based on our own experience using it.
Musubi (Japanese for "to tie together") builds this map over your markdown notes. No AI service needed — it runs locally, reads your files, and figures out how they connect.
# Set it up (one time)
uvx --from "git+https://github.com/coolthor/musubi" musubi init
# Build the map
musubi build
# Ask: what's related to this topic?
musubi neighbors "vLLM memory issue"
# → ★ vllm-oom-startup.md (directly about this)
# → + ollama-keep-alive.md (related — same memory pool)
# → + unified-memory-conflict.md (related — same root cause)
The second and third results are notes that keyword search would never find — they don't mention "vLLM" at all. But the knowledge graph knows they're connected through the concept of shared memory.
Musubi itself uses zero LLM tokens — the graph is built with deterministic concept matching, not an AI service. The README is honest about this: they don't claim "saves 40% tokens" without data. The tool includes a built-in benchmark (musubi benchmark) so you can measure your actual savings on your own notes.
When integrated with Claude Code, Musubi runs automatically before Claude searches the web or reads files. If the answer already exists in your notes, Claude reads that instead of starting from scratch. Fewer files read, fewer tokens burned, better answers.
The Takeaway
Your context window fills up fast because of invisible overhead — not because your questions are too long.
Do these today (free, immediate):
0. Run /context and /cost to see where your tokens actually go
1. Pick the right model — Sonnet for daily work, Opus for hard problems
2. /compact when switching tasks, /clear for fresh starts
3. Describe the problem ("login button does nothing"), not the file name
4. Disable MCP tools you're not using (try gh CLI over GitHub MCP)
5. Keep CLAUDE.md lean — put details in separate files
6. Use subagents for exploratory work
Do this when ready (requires setup):
7. Give Claude a search engine for your notes (QMD)
8. Add a knowledge graph so related notes find each other (Musubi)
The principle: The best way to save tokens isn't making conversations shorter — it's making Claude's search more precise. Don't compress. Target.
Both QMD and Musubi are open source. They work with any markdown files, run locally, and don't need an AI service or cloud account.
FAQ
- Why does Claude Code run out of context so fast?
- Every message carries invisible baggage: your CLAUDE.md rules, tool definitions, and the full conversation history. A session can use 50-80K tokens before you even type a question. Plus, Claude reserves ~33K tokens as a buffer you can't use — so your real usable space is ~167K, not 200K.
- What uses the most tokens in Claude Code?
- Three things: (1) MCP tool definitions — each tool adds 500-2000 tokens every turn, (2) repeated file reads — Claude re-reads the entire file each time, not just the diff, and (3) conversation history that keeps growing. The gh CLI uses far fewer tokens than the GitHub MCP server for the same tasks.
- How do I make my Claude Code sessions last longer?
- Quick wins: use /compact when switching tasks (or /clear if the new task is unrelated), describe problems instead of naming files, disable unused MCP tools, keep CLAUDE.md under 2000 tokens, and use Sonnet or Haiku for simple tasks instead of Opus. For a deeper fix, use QMD or Musubi so Claude finds your notes instantly.
- What is a knowledge graph and how does it help with tokens?
- A knowledge graph is a map of how your notes connect to each other. Instead of Claude reading 10 files to find an answer, it checks the map and reads only 2-3. Fewer files = fewer tokens. It also surfaces related notes that keyword search would miss.