Every AI coding session has a hidden resource that most developers ignore until it breaks: the context window. It is the total number of tokens the model can hold in memory at once - the prompt, the conversation history, every file read, every tool result, every response. When it fills up, bad things happen. Not dramatic failures. Subtle ones. The kind that waste hours.
Understanding context as a finite resource - and managing it like one - is the single biggest efficiency gain available in AI-assisted development today.
## What Context Actually Costs
Every token in the context window costs money. The pricing varies by model, but the pattern is consistent: input tokens cost less than output tokens, and both add up fast during a coding session.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context limit |
|---|---|---|---|
| Claude Opus 4 | $15 | $75 | 200K |
| Claude Sonnet 4 | $3 | $15 | 200K |
| GPT-4o | $2.50 | $10 | 128K |
| Gemini 2.5 Pro | $1.25 - $2.50 | $10 - $15 | 1M |
| Claude Sonnet 4 (1M beta) | $6 | $22.50 | 1M |
A typical coding session with Claude Opus 4 that fills 150K tokens of context costs roughly $2.25 in input tokens alone - before counting outputs. A heavy development day with 10-15 sessions is $20-35. Scale that across a team and context management becomes a direct line item.
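The arithmetic behind these figures is easy to sketch. The rates below are the Opus 4 prices from the table; the token counts are illustrative:

```python
# Per-session cost estimate at Claude Opus 4 rates (per million tokens).
OPUS_INPUT_PER_M = 15.00
OPUS_OUTPUT_PER_M = 75.00

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session at Opus 4 rates."""
    return (input_tokens / 1_000_000) * OPUS_INPUT_PER_M + \
           (output_tokens / 1_000_000) * OPUS_OUTPUT_PER_M

# A session that fills 150K tokens of context, input side only:
print(round(session_cost(150_000, 0), 2))        # 2.25
# The same session with 20K output tokens on top:
print(round(session_cost(150_000, 20_000), 2))   # 3.75
```

Output tokens dominate quickly at the 5x price ratio, which is why verbose responses cost more than they look.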
But cost is not the primary concern. Quality degradation is.
## What Happens When Context Fills Up
Context window utilization directly impacts output quality. This is not theoretical - it is measurable and consistent across models.
**At 0-50% utilization:** Normal operation. The model has full access to everything in context and produces its best work. Responses are precise, follow instructions, and maintain consistency with earlier parts of the conversation.

**At 50-70% utilization:** Subtle degradation begins. The model starts to lose track of constraints mentioned early in the conversation. If project conventions were defined in the first few messages, they may be followed inconsistently. Earlier code snippets may be referenced imprecisely.

**At 70-85% utilization:** Noticeable quality drops. The model may contradict earlier decisions. Instructions from the system prompt or CLAUDE.md get partially ignored. Code generation becomes more generic - less tailored to the specific project context. This is the danger zone because the model still sounds confident.

**At 85-95% utilization:** Significant problems. Hallucinations increase. The model may "forget" files it read earlier in the session. Responses become shorter and less detailed as the model struggles to attend to the full context. Variable names and function signatures from earlier in the conversation get confused.

**Above 95%:** The session is effectively broken. The model cannot process new information meaningfully. Responses may be truncated, incoherent, or completely disconnected from the conversation. Recovery requires starting a new session.
The insidious part: the model does not warn about degradation. It continues producing output with the same confidence at 90% utilization as at 20%. The developer has to monitor context usage proactively.
## Strategy 1: @file Imports vs Pasting
The simplest context optimization is how files get into context. There are two approaches, and one is dramatically better.
**Pasting file contents into the prompt:**

```
Here is my database model:

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    email = Column(String, unique=True, nullable=False)
    ... (200 more lines)
```
This puts the entire file contents directly into the conversation history. It stays in context for the rest of the session, consuming tokens in every subsequent request even when it is no longer relevant.
**Using @file references (in tools that support it):**

```
Look at @app/models/user.py and add a `last_login` field
with a migration.
```
The tool reads the file, processes it, and the model sees the contents - but the implementation can be more efficient about how that content persists in context. In Claude Code specifically, file reads through the tool system are handled more efficiently than raw pastes.
The rule: Never paste file contents into the chat. Always use the tool’s file reading mechanism.
## Strategy 2: /compact and /clear
Claude Code provides two commands for managing context:
**/compact** - Summarizes the conversation history into a condensed form. The full history is replaced with a summary that preserves the key decisions, code written, and current state. This typically reduces context usage by 50-70%.
When to use /compact:
- Context is above 50% and the task is not finished
- Switching to a different sub-task within the same session
- After a large file read that is no longer needed
- Before starting a review of code generated earlier in the session
**/clear** - Resets the context entirely. Starts a fresh session. CLAUDE.md and rules are re-read, but conversation history is gone.
When to use /clear:
- Context is above 70% and a new task is starting
- The current conversation has gone off track
- Switching between unrelated tasks
A practical rhythm: /compact after every 2-3 completed sub-tasks. /clear when starting a genuinely new piece of work.
## Strategy 3: .claude/rules/ for Path-Specific Context
The .claude/rules/ directory allows defining context that is loaded only when relevant files are being worked on. This is more efficient than putting everything in a single CLAUDE.md file.
```
.claude/
  rules/
    api.md         # Loaded when working in api/ directory
    tests.md       # Loaded when working in tests/ directory
    migrations.md  # Loaded when working with Alembic files
    frontend.md    # Loaded when working in frontend/ directory
```
Each file contains conventions specific to that part of the codebase:
```markdown
# .claude/rules/api.md
- All endpoints use dependency injection for database sessions
- Response models are in api/schemas/, not in route files
- Use HTTPException with detail dict, not string
- Rate limiting is handled by middleware, not per-endpoint
```
Without path-specific rules, all of this goes into CLAUDE.md and loads for every session - including sessions that never touch the API layer. With rules files, context is loaded on demand.
The savings compound. A project with 500 tokens of general context and 2000 tokens of path-specific rules across 4 directories loads 500 + ~500 tokens per session instead of 2500. Over a day of sessions, that is thousands of tokens saved - and more importantly, the model is not distracted by irrelevant constraints.
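The arithmetic above, as a quick sketch (the token counts are this section's illustrative figures, not measurements):

```python
# Back-of-envelope for path-specific rules vs one monolithic CLAUDE.md.
general = 500            # tokens of general context, loaded every session
path_rules_total = 2000  # tokens of rules spread evenly across 4 directories
dirs = 4

monolithic = general + path_rules_total          # everything loads every time
with_rules = general + path_rules_total // dirs  # only the relevant file loads

print(monolithic, with_rules)   # 2500 1000
print(monolithic - with_rules)  # 1500 tokens saved per session
```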
## Strategy 4: Sub-Agents for Large Tasks
When a task requires reading many files or generating substantial code, sub-agents isolate the context cost. The main conversation delegates to a sub-agent that has its own context window, completes the task, and returns only the result.
In Claude Code, this happens naturally with the tool use pattern. But the principle applies to any AI workflow: if a task requires reading 10 files to generate one function, do that in a separate context and bring back just the function.
The practical application:
```
Use a sub-agent to:
1. Read all files in app/services/
2. Identify the common error handling pattern
3. Write a new service for payment processing that follows
   the same pattern
4. Return only the new file contents
```
The main conversation gets the output without the context cost of reading 10 service files. Those file contents consume tokens in the sub-agent’s context, not the main conversation’s.
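The delegation contract can be sketched in a few lines. `run_subagent` below is a stand-in stub, not a real Claude Code API; it only shows the shape of the pattern: all heavy reading happens inside the delegated callable, and only its return value reaches the caller:

```python
# Sketch of the sub-agent contract. In a real tool, run_subagent would
# execute the task in a fresh, isolated context window; here it is a stub.
def run_subagent(task):
    """Stand-in: pretend `task` runs in its own context, apart from ours."""
    return task()  # only the return value crosses back to the caller

def extract_service_pattern():
    # All the expensive reading happens here, inside the sub-agent's
    # context. Paths are hypothetical examples.
    files_read = ["app/services/billing.py", "app/services/users.py"]
    return f"pattern extracted from {len(files_read)} files"

result = run_subagent(extract_service_pattern)
print(result)  # only this short string lands in the main conversation
```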
## Strategy 5: Disable Unused MCP Servers
Every connected MCP server adds to the system prompt. The server’s tool definitions, descriptions, and schemas consume tokens before a single message is sent.
A typical MCP server adds 200-500 tokens to the system prompt. Five servers add 1,000-2,500 tokens. These tokens are present in every single request for the entire session.
```jsonc
// .claude/settings.json - only enable what you need
{
  "mcpServers": {
    "postgres": { "command": "...", "enabled": true },
    "github": { "command": "...", "enabled": false },
    "slack": { "command": "...", "enabled": false },
    "filesystem": { "command": "...", "enabled": false }
  }
}
```
If today’s work is database-heavy, only the PostgreSQL server needs to be active. If it is GitHub PR reviews, only the GitHub server. The overhead of unused servers is pure waste - tokens consumed for tools that will never be called.
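A quick sketch of what that idle overhead adds up to (the per-server token costs are assumptions within the 200-500 range quoted above):

```python
# Estimated system-prompt overhead of idle MCP servers (assumed figures).
servers = {"postgres": 350, "github": 450, "slack": 300, "filesystem": 250}
active = {"postgres"}  # today's work is database-heavy

idle_overhead = sum(t for name, t in servers.items() if name not in active)
print(idle_overhead)  # 1000 tokens paid on EVERY request, for tools never called
```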
## The Context Budget Framework
Treat each session like a budget:
| Context % | Action |
|---|---|
| 0-50% | Normal work. No action needed. |
| 50-60% | Consider /compact if more work remains. |
| 60-70% | /compact recommended. Quality starting to degrade. |
| 70-80% | /compact required or start a new session. |
| 80%+ | Start a new session. Output quality is unreliable. |
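The table translates directly into a small lookup, e.g. for a personal status-line or reminder script (thresholds mirror the table; adjust to taste):

```python
# The context budget table as a function: utilization percent -> action.
def budget_action(context_pct: float) -> str:
    if context_pct < 50:
        return "normal work"
    if context_pct < 60:
        return "consider /compact"
    if context_pct < 70:
        return "/compact recommended"
    if context_pct < 80:
        return "/compact or new session"
    return "start a new session"

print(budget_action(45))  # normal work
print(budget_action(75))  # /compact or new session
```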
Monitor context usage actively. In Claude Code, the context usage is visible in the interface. Build the habit of checking it the same way a driver checks the fuel gauge - not constantly, but regularly.
## The Bottom Line
Context window management is not an optimization - it is a requirement. Unmanaged context leads to degraded output quality, wasted tokens, and sessions that produce worse results the longer they run. The strategies here - efficient file loading, aggressive compaction, path-specific rules, sub-agents for large tasks, and disabling unused MCP servers - are not advanced techniques. They are basic hygiene for anyone using AI coding tools seriously.
The developers who get the most from AI are not the ones with the largest context windows. They are the ones who use the smallest effective context for each task. Precision beats capacity. Every time.