AI Workflow · part 1
Claude Code Burning Through Tokens? 8 Fixes to Make Sessions Last 10x Longer
❯ cat --toc
- You Just Started Using Claude Code. It's Amazing. Then It Slows Down.
- Preface
- Part 1: Where Do All the Tokens Go?
- The Stuff You Don't See
- The Hidden Buffer
- Why It Gets Worse Over Time
- Part 2: The Four Biggest Wastes
- Waste #1: Reading the Same File Over and Over
- Waste #2: Tools You Never Use
- Waste #3: A CLAUDE.md That's Too Long
- Waste #4: Old Conversation That Doesn't Matter Anymore
- Part 3: What You Can Do About It
- Fix 0: See Where Your Tokens Actually Go
- Fix 1: Use the Right Model for the Job
- Fix 2: /compact and /clear at the Right Time
- Fix 3: Describe the Problem, Not the File
- Fix 4: Trim Your MCP Tools
- Fix 5: Keep Your CLAUDE.md Lean
- Rules
- Architecture
- Conventions
- Fix 6: Use Subagents for Exploration
- Fix 7: Consider the 1M Context Window
- Part 4: Give Claude a Search Engine for Your Notes (Advanced)
- Part 5: When Your Notes Learn to Cross-Reference Themselves (Advanced)
- Musubi: A Knowledge Graph for Your Notes
- The Takeaway
TL;DR
Claude Code's context fills up because every message carries invisible overhead — rules, tool definitions, conversation history. The biggest waste: Claude re-reading the same files over and over. Quick fixes: /compact between tasks, /clear for fresh starts, describe the problem instead of naming files, trim unused MCP tools, keep CLAUDE.md lean, use Sonnet for daily work. Deeper fix: give Claude a search engine (QMD) + knowledge graph (Musubi) so it finds answers in 2-3 files instead of 10.
You Just Started Using Claude Code. It's Amazing. Then It Slows Down.
You install Claude Code. You ask it to fix a bug. It reads your files, understands the problem, writes a fix. Magic.
Then by your 15th question, something changes. Claude starts forgetting things you told it earlier. Responses get less precise. Eventually you see the dreaded compaction message — Claude had to summarize and compress your conversation because the context window was full.
What happened? You didn't do anything wrong. The problem is invisible, and this article makes it visible.
Preface
Imagine a desk. Every time you ask Claude a question, it doesn't just look at your question — it spreads out its entire instruction manual, every tool it might need, and transcripts of everything you've discussed so far. Your actual question sits on whatever space is left.
The desk is 200K tokens wide. It sounds huge until you realize 40% of it is covered before you even sit down.
Part 1: Where Do All the Tokens Go?
The Stuff You Don't See
Every single message you send to Claude carries invisible baggage:
Your rules (CLAUDE.md) — If you've set up a CLAUDE.md file with coding standards, project conventions, and workflow instructions, Claude re-reads the entire thing every turn. A detailed CLAUDE.md can be 3-15K tokens. That's fine — it's what makes Claude useful for your project. But it's not free.
Tool definitions — Every MCP tool (GitHub, Playwright, database connectors, etc.) adds its complete instruction manual to the conversation. One tool = 500-2000 tokens. Got 20 tools installed? That's 10-40K tokens of permanent overhead, every single turn, even if you never use most of them.
Conversation history — Everything Claude said, everything you said, every file it read, every command it ran — it's all still there, growing with every turn.
Here's what a typical session looks like in tokens:
| What | How big | When |
|---|---|---|
| System instructions | 2-8K | Always there |
| Your CLAUDE.md | 3-15K | Always there |
| Tool definitions | 5-50K | Always there |
| Your conversation so far | Grows every turn | Always there |
| Files Claude reads | 1-10K each | When it reads |
Add it up: A fresh session with a solid CLAUDE.md and 15-20 MCP tools starts at 50-80K tokens. That's 25-40% of your 200K context window gone before you ask your first question.
The Hidden Buffer
Here's something most people don't know: Claude Code reserves roughly 33K tokens as an internal buffer (source). This space is used for summarization when compaction happens and for generating responses. You can't disable it. (Note: this number may change as Claude Code updates — run /context to see your actual usable space.)
That means your 200K context window is closer to ~167K of usable space. Auto-compaction kicks in well before 100%. If you're wondering why compaction happens "earlier than expected" — this is why.
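The back-of-envelope is simple enough to sanity-check in shell arithmetic. The 50K overhead below is the low end of the 50-80K startup range from the table above (an assumption for illustration; run /context for your real numbers):

```shell
WINDOW=200000     # advertised context window
BUFFER=33000      # reserved autocompact buffer
OVERHEAD=50000    # low end of the 50-80K startup overhead (CLAUDE.md + tools + system)
echo "usable at turn 1: $(( WINDOW - BUFFER - OVERHEAD )) tokens"
# → usable at turn 1: 117000 tokens
```

With a heavier setup (80K overhead) the same math leaves you only 87K before you've typed a word.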
Why It Gets Worse Over Time
Turn 1: you have ~120K tokens of room for actual work (after overhead + buffer).
Turn 10: conversation history has grown, Claude has read a bunch of files, made some edits, run some commands. Maybe 70K left.
Turn 20: you're debugging something complex, Claude keeps re-reading files to check its work, tool results are piling up. Down to 20K. Claude starts losing track of what you discussed in turn 3.
Turn 25: compaction. Claude summarizes the conversation to free up space, but in the process, it loses details. That fix you discussed in turn 7? Gone.
This isn't a bug. It's just how context windows work. The key is knowing what fills them up — so you can control it.
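The turn-by-turn decay above can be sketched as a toy loop. The 5K-per-turn figure is an assumed average (history, file reads, tool output), not a measurement:

```shell
FREE=120000                  # room at turn 1, after overhead + buffer
for turn in $(seq 1 20); do
  FREE=$(( FREE - 5000 ))    # each turn adds ~5K of history, reads, and tool results
done
echo "after 20 turns: ${FREE} tokens free"
# → after 20 turns: 20000 tokens free
```

The curve is linear only in this toy model; in practice, debugging turns that re-read files burn far more than average, which is why the last stretch before compaction feels so abrupt.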
Part 2: The Four Biggest Wastes
Waste #1: Reading the Same File Over and Over
This is the single biggest token waste. Community reports suggest a large portion of file-read tokens are redundant — Claude reading files it already read, with barely any changes.
Watch what happens in a typical bug fix:
- Claude reads server.ts to understand the code (4,000 tokens)
- Claude reads handler.ts to find the bug (3,000 tokens)
- You ask "what about the error handling?" → Claude reads server.ts again (4,000 tokens)
- Claude makes an edit → reads handler.ts to double-check (3,000 tokens)
- Build fails → Claude reads both files again (7,000 tokens)
That's 21,000 tokens spent reading two files. The files barely changed between reads. But Claude doesn't have a "just show me what changed" option — it re-reads the entire file every time.
Waste #2: Tools You Never Use
You installed a cool MCP server with 15 tools. You use 2 of them. The other 13 sit there, doing nothing, except consuming 500-2000 tokens each on every single turn.
It's like carrying a toolbox with 30 tools to fix a leaky faucet. You need a wrench. The other 29 tools just make the toolbox heavier.
Pro tip: The gh CLI uses far fewer tokens than the GitHub MCP server for the same operations (creating PRs, checking issues, viewing diffs). If you're doing GitHub work, try gh first.
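For reference, the everyday GitHub operations look like this with the plain gh CLI (a sketch: it needs a one-time gh auth login, and the PR number is a placeholder):

```
# Everyday GitHub operations, no per-turn MCP schema cost
gh issue list --state open     # browse open issues
gh pr create --fill            # open a PR from the current branch
gh pr diff 123                 # view a PR's diff (123 is a placeholder)
gh pr checks 123               # see CI status for that PR
```

Claude Code can run these through its normal shell tool, so you get the same capability without carrying the GitHub MCP server's tool definitions on every turn.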
Waste #3: A CLAUDE.md That's Too Long
Your CLAUDE.md is injected into every single request. Every turn. Every follow-up. If your CLAUDE.md is 10,000 tokens, you're taxed 10,000 tokens on every interaction before Claude even reads your code.
Most people put too much in CLAUDE.md. Detailed coding style guides, long architecture explanations, exhaustive lists of conventions — all valid information, but it doesn't need to live in the file that gets loaded every turn.
Waste #4: Old Conversation That Doesn't Matter Anymore
By turn 15, the first 8 turns are usually irrelevant — you asked some exploratory questions, tried a wrong approach, changed direction. But those old turns are still in context, taking up space and sometimes confusing Claude with outdated information.
"Wait, didn't you say we should use approach X?" — No, that was turn 3. We abandoned that in turn 6. But Claude still sees both.
Part 3: What You Can Do About It
Fix 0: See Where Your Tokens Actually Go
Before fixing anything, measure first. Claude Code has built-in commands that show you exactly what's eating your context:
/context — Shows breakdown: system prompt, tools, memory, conversation
/cost — Shows token usage and dollar cost for this session
/memory — Shows what persistent files Claude is loading
/mcp — Shows which MCP servers and tools are active
Run /context right now. Here's what it looks like on a real session:
Estimated usage by category
⛁ System prompt: 6k tokens (0.6%)
⛁ System tools: 11k tokens (1.1%)
⛁ MCP tools: 934 tokens (0.1%) ← lazy loading keeps this tiny
⛁ Custom agents: 1.5k tokens (0.1%)
⛁ Memory files: 10.5k tokens (1.0%) ← CLAUDE.md + rules
⛁ Skills: 2.7k tokens (0.3%)
⛁ Messages: 196.6k tokens (19.7%) ← the conversation itself
⛶ Free space: 737.8k (73.8%)
⛝ Autocompact buffer: 33k tokens (3.3%)
This session is on the 1M context window, so 23% used after a long session. On the default 200K window, the same overhead would be over 100% — compaction territory. Notice the 33K autocompact buffer at the bottom — that's real, it's reserved, you can't use it.
You might discover that one MCP server is consuming 18K tokens, or that your CLAUDE.md is bigger than you thought. Fix the biggest offender first — that's worth more than all the other tips combined.
Fix 1: Use the Right Model for the Job
Not every task needs the biggest model. Claude Code lets you switch:
| Task | Best Model | Why |
|---|---|---|
| Complex refactoring, architecture | Opus | Needs deep reasoning |
| Writing code, tests, daily work | Sonnet | Fast and capable |
| Renaming, formatting, lookups | Haiku | Cheap and instant |
You can toggle with /model in Claude Code. Using Sonnet for your daily work and switching to Opus only for hard problems can cut your token costs significantly — Opus costs roughly 5x more per token than Sonnet for the same task.
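A rough back-of-envelope shows why the model choice matters. The per-million-token prices here are illustrative assumptions for the ~5x ratio, not quotes (check current pricing):

```shell
SONNET=3    # $ per million input tokens (illustrative)
OPUS=15     # ~5x Sonnet (illustrative)
TOKENS=20   # million tokens processed in a heavy month
echo "Sonnet: \$$(( SONNET * TOKENS ))   Opus: \$$(( OPUS * TOKENS ))"
# → Sonnet: $60   Opus: $300
```

Same workload, 5x the bill. Reserving Opus for the handful of genuinely hard problems keeps most of your month at the lower rate.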
Fix 2: /compact and /clear at the Right Time
Claude Code has two commands for managing context:
/compact — Summarizes your conversation and frees up space. Good when you're switching tasks but want to keep some context.
/clear — Wipes conversation history entirely. More aggressive, but perfect when the new task has nothing to do with the previous one.
A good rule: If your next task doesn't depend on the last 20 messages, use /clear. If it does, use /compact.
Use /compact when:
- You just finished a task and are about to start a different one
- You've been exploring and are ready to start implementing
- Claude starts referencing things from early in the conversation incorrectly
Use /clear when:
- You're switching to a completely different project or feature
- You're at 60%+ context and about to start something new
- You'd rather start fresh than carry stale context
Think of it like clearing your desk between tasks. /compact = organize the papers. /clear = clean slate.
Fix 3: Describe the Problem, Not the File
Instead of:
"Read server.ts"
Try:
"handleAuth seems to not handle the null return case, can you check?"
"The login button flashes once then does nothing"
You don't need to remember line numbers or even file names. Give Claude a function name, a feature description, or the symptom you're seeing — it will Grep for the right file and read just those few dozen lines.
The difference? "Read server.ts" = Claude reads all 400 lines (3-4K tokens). Describing the problem = Claude pinpoints the relevant 30 lines (300 tokens). That's 10x less, and you don't need to memorize anything.
Fix 4: Trim Your MCP Tools
Check what MCP tools are loaded in your session. If you have 20 tools but only use 5 regularly, you're burning 15-30K tokens of context every turn on tools that sit idle.
Three approaches:
- Lazy loading — Claude Code already defers many tool schemas. Check if your custom MCP servers support it.
- CLI over MCP — Tools like gh (GitHub), supabase, and vercel as CLI commands cost almost nothing compared to their MCP equivalents.
- Session-specific — Only enable MCP servers in sessions where you actually need them.
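The claude CLI itself can manage servers. A sketch of the session-trimming workflow (subcommand names assumed from current Claude Code; confirm with claude mcp --help on your version):

```
claude mcp list                 # see every configured MCP server
claude mcp remove playwright    # "playwright" is an example name — pick one from your list
```

Run the list command, compare it against what /context reports as MCP overhead, and remove anything you haven't touched in weeks. You can always add a server back for the one session that needs it.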
Fix 5: Keep Your CLAUDE.md Lean
Aim for under 2,000 tokens. Put only the essentials directly in CLAUDE.md:
- 3-5 most important rules
- Key project conventions
- File pointers to detailed docs
Everything else — detailed style guides, architecture docs, debugging playbooks — should live in separate files that Claude reads only when relevant, not on every turn.
# CLAUDE.md (lean version)
## Rules
- Use TypeScript strict mode
- Immutable patterns only
- Tests before implementation
- Never re-read a file you already read this session unless it was edited.
When debugging, search by function name or symptom, not by reading entire files.
Use subagents for exploratory research.
## Architecture
See docs/ARCHITECTURE.md for details.
## Conventions
See docs/CONVENTIONS.md for full guide.
Those last three rules — no redundant re-reads, search by symptom, subagents for research — encode three of the biggest token savings directly into Claude's behavior, so it applies them automatically without you having to remind it every turn.
Claude can always read ARCHITECTURE.md when it needs it. But it doesn't need to carry it on every single turn.
Pro tip: Ask Claude to audit its own CLAUDE.md. Just say:
"Read your CLAUDE.md. Check what's outdated or unnecessary. Suggest a trimmed version under 2000 tokens."
Claude will review the file, flag what's stale, and propose a leaner version. It's a 2-minute exercise that can save thousands of tokens per session going forward.
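To check whether you're under the 2,000-token budget without a tokenizer, a character count gets close. Roughly 4 characters per token is a rule of thumb for English text (an approximation; the sample file here is just to make the example self-contained):

```shell
# Tiny sample file so the example runs anywhere — point this at your real CLAUDE.md
printf 'Use TypeScript strict mode.\nImmutable patterns only.\n' > /tmp/CLAUDE.md
chars=$(wc -c < /tmp/CLAUDE.md)
echo "approx tokens: $(( chars / 4 ))"
# → approx tokens: 13
```

Swap in your actual CLAUDE.md path; if the estimate comes out well north of 2,000, that's your cue to start moving details into docs/.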
Fix 6: Use Subagents for Exploration
When you need Claude to research something — dig through files, search the codebase, investigate an error — use subagents. They run in a separate context window, keeping your main conversation clean.
"Use a subagent to investigate how authentication works in this codebase"
The subagent reads 20 files, explores the code, and comes back with a summary. Your main context only receives the summary (a few hundred tokens), not the 20 files (40K tokens).
Important: subagents aren't free. They still burn tokens — just in a separate context window. The benefit isn't "saving money," it's "keeping your main conversation alive longer." Think of it like opening a new browser tab to research something, then closing it and bringing back just the conclusion. The pages you browsed in that tab don't clutter the tab where you're actually writing.
Fix 7: Consider the 1M Context Window
As of 2026, Claude Opus 4.6 and Sonnet 4.6 support a 1M token context window with no pricing premium. If your sessions consistently run out of context at 200K, switching to the 1M window gives you 5x more room.
This doesn't fix the underlying waste — you're still paying for redundant reads and unused tools. But it gives you breathing room while you apply the other fixes.
Part 4: Give Claude a Search Engine for Your Notes (Advanced)
Fixes 0-7 above are enough for most people. Parts 4 and 5 are for power users who work with hundreds of notes across sessions.
Fixes 0-7 reduce waste. This fix changes the game.
The pattern without a search engine:
- You ask Claude about a bug you've seen before
- Claude doesn't remember (different session)
- Claude reads 5-10 files trying to find the answer
- 30,000 tokens later, it finds the relevant note
The pattern with a search engine:
- You ask Claude about a bug you've seen before
- Claude searches your notes (200 tokens)
- Finds the exact file (2,000 tokens to read)
- 2,200 tokens total instead of 30,000
QMD is one such tool: a local search engine for markdown files. It indexes your notes and lets Claude find answers in milliseconds instead of reading files one by one. In our index, 732 documents are searched in about 30ms.
QMD has three search modes:
| Tool | Speed | Best for |
|---|---|---|
| search | ~30ms | You know the keyword — "vLLM OOM", "Ollama keep_alive" |
| vector_search | ~2s | You know what you mean but not the exact words — "model using too much memory" |
| deep_search | ~10s | Complex queries, auto-expands into variations and reranks results |
It also has get (read a specific document) and multi_get (batch read). Claude calls these tools directly through MCP — you don't have to do anything manually.
But keyword search has a ceiling: it only finds notes that use the same words as your question. vector_search helps somewhat (it searches by meaning), but when the connection between two notes is conceptual — like "Ollama keeping models in memory" and "vLLM crashing on startup" both being about the same 128GB memory pool — you don't need search, you need a relationship map.
Part 5: When Your Notes Learn to Cross-Reference Themselves (Advanced)
This is where knowledge graphs come in. Don't let the fancy name intimidate you — the concept is simple.
A knowledge graph is a map of how your notes relate to each other.
Imagine you have 400 notes about different topics. Some of them are related, but they use different words. A note about "Ollama keeping models in memory" and a note about "vLLM crashing on startup" are deeply related — both are about the same 128GB memory pool. But a keyword search for "vLLM crash" would never find the Ollama note.
A knowledge graph pre-computes these connections. It scans all your notes, finds shared concepts, and draws lines between related documents — even when they use completely different vocabulary.
The result:
| How Claude finds information | Files it reads | Tokens spent |
|---|---|---|
| Reads everything that might be relevant | 10-20 files | ~40,000 |
| Keyword search (QMD) | 3-5 files | ~10,000 |
| Knowledge graph search | 2-3 files | ~4,000 |
Same answer. 10x fewer tokens.
The knowledge graph doesn't just save tokens — it finds notes you wouldn't have searched for. That's the real magic: connections you didn't know existed in your own notes.
Musubi: A Knowledge Graph for Your Notes
Full disclosure: Musubi is an open-source tool we built. The following is based on our own experience using it.
Musubi (Japanese for "to tie together") builds this map over your markdown notes. No AI service needed — it runs locally, reads your files, and figures out how they connect.
# Set it up (one time)
uvx --from "git+https://github.com/coolthor/musubi" musubi init
# Build the map
musubi build
# Ask: what's related to this topic?
musubi neighbors "vLLM memory issue"
# → ★ vllm-oom-startup.md (directly about this)
# → + ollama-keep-alive.md (related — same memory pool)
# → + unified-memory-conflict.md (related — same root cause)
The second and third results are notes that keyword search would never find — they don't mention "vLLM" at all. But the knowledge graph knows they're connected through the concept of shared memory.
Musubi itself uses zero LLM tokens — the graph is built with deterministic concept matching, not an AI service. The README is honest about this: they don't claim "saves 40% tokens" without data. The tool includes a built-in benchmark (musubi benchmark) so you can measure your actual savings on your own notes.
When integrated with Claude Code, Musubi runs automatically before Claude searches the web or reads files. If the answer already exists in your notes, Claude reads that instead of starting from scratch. Fewer files read, fewer tokens burned, better answers.
The Takeaway
Your context window fills up fast because of invisible overhead — not because your questions are too long.
Do these today (free, immediate):
0. Run /context and /cost to see where your tokens actually go
1. Pick the right model — Sonnet for daily work, Opus for hard problems
2. /compact when switching tasks, /clear for fresh starts
3. Describe the problem ("login button does nothing"), not the file name
4. Disable MCP tools you're not using (try gh CLI over GitHub MCP)
5. Keep CLAUDE.md lean — put details in separate files
6. Use subagents for exploratory work
Do this when ready (requires setup):
7. Give Claude a search engine for your notes (QMD)
8. Add a knowledge graph so related notes find each other (Musubi)
The principle: The best way to save tokens isn't making conversations shorter — it's making Claude's search more precise. Don't compress. Target.
Both QMD and Musubi are open source. They work with any markdown files, run locally, and don't need an AI service or cloud account.
FAQ
- Why does Claude Code run out of context so fast?
- Every message carries invisible baggage: your CLAUDE.md rules, tool definitions, and the full conversation history. A session can use 50-80K tokens before you even type a question. Plus, Claude reserves ~33K tokens as a buffer you can't use — so your real usable space is ~167K, not 200K.
- What uses the most tokens in Claude Code?
- Three things: (1) MCP tool definitions — each tool adds 500-2000 tokens every turn, (2) repeated file reads — Claude re-reads the entire file each time, not just the diff, and (3) conversation history that keeps growing. The gh CLI uses far fewer tokens than the GitHub MCP server for the same tasks.
- How do I make my Claude Code sessions last longer?
- Quick wins: use /compact when switching tasks (or /clear if the new task is unrelated), describe problems instead of naming files, disable unused MCP tools, keep CLAUDE.md under 2000 tokens, and use Sonnet or Haiku for simple tasks instead of Opus. For a deeper fix, use QMD or Musubi so Claude finds your notes instantly.
- What is a knowledge graph and how does it help with tokens?
- A knowledge graph is a map of how your notes connect to each other. Instead of Claude reading 10 files to find an answer, it checks the map and reads only 2-3. Fewer files = fewer tokens. It also surfaces related notes that keyword search would miss.