How I use Claude's 200K context window without burning my API budget
Claude's 200K context window feels like a free pass to throw everything at it. Your entire codebase, all your logs, every document — just send it all and let Claude figure it out. I tried that. My API costs tripled and responses got slower. The real skill is curating what you send. Here are the strategies I now use to get better answers with a fraction of the tokens.
Why filling the context hurts
Three things get worse as your context grows:
- Cost — input tokens are cheaper per token than output tokens, but a large context means you pay for every one of them on every request. Sending 150K tokens of irrelevant code is expensive.
- Speed — latency scales with context size. A 200K context message can take 20-30 seconds to get a first token.
- Quality — counter-intuitively, Claude can get "lost" in very large contexts. Relevant information surrounded by noise gets less attention than if it were presented cleanly.
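To make the cost point concrete, here is a back-of-the-envelope calculator. The per-million-token prices are placeholder assumptions for illustration, not current Anthropic pricing — check the pricing page for real numbers:

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.00,   # assumed $/M input tokens
    output_price_per_mtok: float = 15.00,  # assumed $/M output tokens
) -> float:
    """Estimate a single request's cost in dollars."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# A 150K-token context dwarfs the output cost of a typical answer:
naive = estimate_cost(150_000, 1_000)    # whole-codebase dump
focused = estimate_cost(5_000, 1_000)    # curated context
```

Under these assumed prices, the naive request costs over fifteen times the focused one — and the gap is almost entirely input tokens you did not need to send.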
Strategy 1: File-based context injection
Instead of sending entire files, extract only what is relevant to the question. For code questions, this usually means the function being changed, its direct dependencies, and the test file.
```python
import ast
import os
from pathlib import Path

def extract_function(file_path: str, function_name: str) -> str:
    """Extract a single function from a Python file."""
    source = Path(file_path).read_text()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            # Get source lines for this function
            lines = source.splitlines()
            start = node.lineno - 1
            end = node.end_lineno
            return "\n".join(lines[start:end])
    return ""

def build_focused_context(
    target_file: str,
    target_function: str,
    include_tests: bool = True,
) -> str:
    """Build a focused context string for a specific function."""
    parts = []

    # Add the target function
    func_code = extract_function(target_file, target_function)
    if func_code:
        parts.append(f"# File: {target_file}\n{func_code}")

    # Add relevant test file if it exists
    if include_tests:
        test_file = target_file.replace("src/", "tests/").replace(".py", "_test.py")
        if os.path.exists(test_file):
            parts.append(f"# Tests: {test_file}\n{Path(test_file).read_text()}")

    return "\n---\n".join(parts)
```
Strategy 2: Prompt caching for shared context
If you send the same context repeatedly (like your entire schema or architecture docs), use Claude's prompt caching. The first request processes and caches the static content. Subsequent requests hit the cache at ~10% of the token cost.
```python
import anthropic
from pathlib import Path

client = anthropic.Anthropic()

# Load once, reuse across many calls
SCHEMA_CONTENT = Path("schema/database.sql").read_text()

def ask_about_schema(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a database expert. Answer questions about the schema below.",
                "cache_control": {"type": "ephemeral"},  # Cache the system prompt
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Schema:\n{SCHEMA_CONTENT}",
                        "cache_control": {"type": "ephemeral"},  # Cache the schema
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )
    return response.content[0].text

# First call: processes all tokens
# Subsequent calls: schema tokens served from cache (10% cost)
answer1 = ask_about_schema("Which tables have a created_at column?")
answer2 = ask_about_schema("What indexes exist on the users table?")
```
Strategy 3: Hierarchical summarization
For large codebases, build a two-level index: a high-level summary of each file, and the full content of individual files. Send the summaries first, ask which files are relevant, then fetch full content only for those.
```python
import anthropic
import json
import re
from pathlib import Path

client = anthropic.Anthropic()

def summarize_file(file_path: str) -> str:
    """Generate a one-paragraph summary of a source file."""
    content = Path(file_path).read_text()[:5000]  # First 5K chars
    response = client.messages.create(
        model="claude-haiku-4-5",  # Use faster/cheaper model for summaries
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize in one paragraph what this file does:\n{content}",
        }]
    )
    return response.content[0].text

def find_relevant_files(question: str, file_summaries: dict[str, str]) -> list[str]:
    """Ask Claude which files are most relevant for a question."""
    summary_list = "\n".join(
        f"- {path}: {summary}"
        for path, summary in file_summaries.items()
    )
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Given this question: "{question}"

Which of these files are most relevant? Return a JSON array of file paths.

Files:
{summary_list}""",
        }]
    )
    # Extract the JSON array from the response
    # (brackets must be escaped, or the regex is a character class)
    match = re.search(r'\[.*?\]', response.content[0].text, re.DOTALL)
    if match:
        return json.loads(match.group())
    return []
```
Strategy 4: Log tail, not full log
When debugging with Claude, you almost never need the entire log file. Send a window of roughly 200 lines around the first error instead:

```shell
# Extract the relevant portion of logs:
# 50 lines before the first ERROR, then up to 200 lines total
grep -n "ERROR" app.log | head -1 | awk -F: '{print $1}' \
  | xargs -I{} awk 'NR>={}-50' app.log | head -200 \
  | claude -p "What's causing this error?"
```
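If you would rather do this windowing inside a script than at the shell, a minimal Python equivalent might look like this (the window sizes mirror the shell version; `error_window` is a hypothetical helper name):

```python
def error_window(log_text: str, before: int = 50, max_lines: int = 200) -> str:
    """Return up to max_lines of the log, starting `before` lines
    ahead of the first line containing ERROR.

    Falls back to the plain tail when no ERROR line exists.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "ERROR" in line:
            start = max(0, i - before)
            return "\n".join(lines[start:start + max_lines])
    return "\n".join(lines[-max_lines:])  # no ERROR: fall back to the tail
```

The output is what you paste (or pipe) to Claude, not the full log.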
The mental model that changed everything
I stopped thinking "what should I give Claude?" and started thinking "what is the minimum context needed to answer this specific question?" Usually the answer is: the function being changed (50-100 lines), its direct imports (50-200 lines), the relevant test (50-100 lines), and maybe a schema snippet if it touches the database. That is under 500 lines for most code questions — less than 2% of what a naive "send the whole codebase" approach would use.
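That mental model can be enforced mechanically. Here is a sketch of a budget guard under my own assumptions: parts arrive in priority order as `(label, text)` pairs, and anything that would push past the line budget gets dropped rather than truncated:

```python
def minimal_context(parts: list[tuple[str, str]], line_budget: int = 500) -> str:
    """Assemble labeled context parts in priority order,
    skipping whole parts once the line budget would be exceeded."""
    out: list[str] = []
    used = 0
    for label, text in parts:
        n = text.count("\n") + 1  # line count of this part
        if used + n > line_budget:
            continue  # skip lower-priority parts that don't fit
        out.append(f"# {label}\n{text}")
        used += n
    return "\n---\n".join(out)
```

Skipping whole parts (rather than cutting them mid-function) keeps every included snippet syntactically intact, which matters more than squeezing in a few extra lines.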
Better context curation also produces better answers. When Claude gets a focused, well-structured context, it can reason about the specific problem rather than summarizing a large input.