How I use Claude's 200K context window without burning my API budget
April 4, 2026 · Claude · 7 min read

Claude's 200K context window feels like a free pass to throw everything at it. Your entire codebase, all your logs, every document — just send it all and let Claude figure it out. I tried that. My API costs tripled and responses got slower. The real skill is curating what you send. Here are the strategies I now use to get better answers with a fraction of the tokens.

Why filling the context hurts

Three things get worse as your context grows:

  • Cost — input tokens aren't free. Sending 150K tokens of mostly irrelevant code on every request adds up fast, even though input is billed at a lower per-token rate than output.
  • Speed — latency scales with context size. A request near the 200K limit can take 20-30 seconds before the first token arrives.
  • Quality — counter-intuitively, Claude can get "lost" in very large contexts. Relevant information surrounded by noise gets less attention than if it were presented cleanly.
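To put rough numbers on the cost point, a back-of-the-envelope comparison is enough (the per-million-token price below is an illustrative assumption, not a quoted Anthropic rate):

```python
# Back-of-the-envelope input-cost comparison for a day of API usage.
# PRICE_PER_M_INPUT is an assumed illustrative rate, not a real price.
PRICE_PER_M_INPUT = 3.00  # dollars per million input tokens (assumption)

def daily_input_cost(tokens_per_request: int, requests_per_day: int) -> float:
    """Dollars spent on input tokens alone in one day."""
    total_tokens = tokens_per_request * requests_per_day
    return total_tokens / 1_000_000 * PRICE_PER_M_INPUT

naive = daily_input_cost(150_000, 200)   # whole-codebase dumps
curated = daily_input_cost(3_000, 200)   # focused, curated context
print(f"naive: ${naive:.2f}/day, curated: ${curated:.2f}/day")
```

Even with made-up prices, the ratio is what matters: a 50x difference in tokens is a 50x difference in input cost.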

Strategy 1: File-based context injection

Instead of sending entire files, extract only what is relevant to the question. For code questions, this usually means the function being changed, its direct dependencies, and the test file.

python
import ast
import os
from pathlib import Path


def extract_function(file_path: str, function_name: str) -> str:
    """Extract a single function from a Python file."""
    source = Path(file_path).read_text()
    tree = ast.parse(source)
    
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            # Get source lines for this function
            lines = source.splitlines()
            start = node.lineno - 1
            end = node.end_lineno
            return "\n".join(lines[start:end])
    
    return ""


def build_focused_context(
    target_file: str,
    target_function: str,
    include_tests: bool = True,
) -> str:
    """Build a focused context string for a specific function."""
    parts = []
    
    # Add the target function
    func_code = extract_function(target_file, target_function)
    if func_code:
        parts.append(f"# File: {target_file}\n{func_code}")
    
    # Add relevant test file if it exists
    if include_tests:
        test_file = target_file.replace("src/", "tests/").replace(".py", "_test.py")
        if os.path.exists(test_file):
            parts.append(f"# Tests: {test_file}\n{Path(test_file).read_text()}")
    
    return "\n\n---\n\n".join(parts)
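Before sending the curated context, a quick size sanity check is worth the two lines. The chars/4 rule below is a common rough heuristic for English text and code, not Claude's actual tokenizer, and the budget is an illustrative number:

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token.
    A heuristic only -- not Claude's real tokenizer."""
    return max(1, len(text) // 4)

# Stand-in for the output of build_focused_context()
context = "def add(a, b):\n    return a + b\n" * 20
print(f"~{rough_token_count(context)} tokens")
if rough_token_count(context) > 4_000:  # illustrative budget
    raise ValueError("Context too large -- trim before sending")
```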

Strategy 2: Prompt caching for shared context

If you send the same context repeatedly (like your entire schema or architecture docs), use Claude's prompt caching. The first request processes and caches the static content. Subsequent requests hit the cache at ~10% of the token cost.

python
import anthropic
from pathlib import Path

client = anthropic.Anthropic()

# Load once, reuse across many calls
SCHEMA_CONTENT = Path("schema/database.sql").read_text()

def ask_about_schema(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a database expert. Answer questions about the schema below.",
                "cache_control": {"type": "ephemeral"},  # Cache the system prompt
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Schema:\n{SCHEMA_CONTENT}",
                        "cache_control": {"type": "ephemeral"},  # Cache the schema
                    },
                    {
                        "type": "text",
                        "text": question,
                    }
                ],
            }
        ],
    )
    return response.content[0].text

# First call: processes all tokens
# Subsequent calls: schema tokens served from cache (10% cost)
answer1 = ask_about_schema("Which tables have a created_at column?")
answer2 = ask_about_schema("What indexes exist on the users table?")
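To verify the cache is actually being hit, the API response's usage object reports cached reads separately from fresh input (field names as in recent versions of the anthropic SDK; check yours). A small helper makes the savings concrete — the base rate here is an illustrative assumption, and the 10%/125% multipliers are the documented cached-read and cache-write rates:

```python
def cache_savings(cache_read: int, cache_creation: int, uncached: int,
                  base_rate: float = 3.00) -> dict:
    """Estimate input cost with prompt caching vs. without.

    base_rate: assumed dollars per million input tokens (illustrative).
    Cached reads bill at ~10% of base; cache writes at ~125%.
    """
    with_cache = (cache_read * 0.10 + cache_creation * 1.25 + uncached) \
        * base_rate / 1_000_000
    without = (cache_read + cache_creation + uncached) * base_rate / 1_000_000
    return {"with_cache": round(with_cache, 4), "without": round(without, 4)}

# After a call: usage = response.usage, then
# cache_savings(usage.cache_read_input_tokens,
#               usage.cache_creation_input_tokens,
#               usage.input_tokens)
print(cache_savings(cache_read=50_000, cache_creation=0, uncached=500))
```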

Strategy 3: Hierarchical summarization

For large codebases, build a two-level index: a high-level summary of each file, and the full content of individual files. Send the summaries first, ask which files are relevant, then fetch full content only for those.

python
import anthropic
import json
from pathlib import Path

client = anthropic.Anthropic()

def summarize_file(file_path: str) -> str:
    """Generate a one-paragraph summary of a source file."""
    content = Path(file_path).read_text()[:5000]  # First 5K chars
    
    response = client.messages.create(
        model="claude-haiku-4-5",  # Use faster/cheaper model for summaries
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize in one paragraph what this file does:\n\n{content}"
        }]
    )
    return response.content[0].text


def find_relevant_files(question: str, file_summaries: dict[str, str]) -> list[str]:
    """Ask Claude which files are most relevant for a question."""
    summary_list = "\n".join(
        f"- {path}: {summary}"
        for path, summary in file_summaries.items()
    )
    
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Given this question: "{question}"

Which of these files are most relevant? Return a JSON array of file paths.

Files:
{summary_list}"""
        }]
    )
    
    # Extract the JSON array of file paths from the response text
    import re
    match = re.search(r"\[.*?\]", response.content[0].text, re.DOTALL)
    if match:
        return json.loads(match.group())
    return []
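Tying the two levels together: a minimal driver (a hypothetical helper, not from the original post) that builds the summary index, asks for the shortlist, and ships full content only for those files. The three callables are injected so the pipeline can be exercised without API calls; in practice you'd pass summarize_file, find_relevant_files, and a small wrapper around client.messages.create:

```python
from pathlib import Path

def answer_with_index(question: str, files: list[str],
                      summarize, pick_relevant, ask) -> str:
    """Two-pass retrieval: summaries first, full content for the shortlist.

    summarize(path) -> str, pick_relevant(question, {path: summary}) -> [path],
    ask(prompt) -> str are injected (hypothetical signatures) so this sketch
    stays testable without network access.
    """
    summaries = {path: summarize(path) for path in files}
    relevant = pick_relevant(question, summaries)
    context = "\n\n---\n\n".join(
        f"# File: {path}\n{Path(path).read_text()}" for path in relevant
    )
    return ask(f"{context}\n\nQuestion: {question}")
```

Cache the summaries to disk; they only need regenerating when a file changes.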

Strategy 4: Log tail, not full log

When debugging with Claude, you almost never need the entire log file. Send a ~200-line window around the first error instead:

bash
# Extract the relevant portion of the log:
# start 50 lines before the first ERROR, cap at 200 lines
line=$(grep -n "ERROR" app.log | head -1 | cut -d: -f1)
tail -n "+$(( line > 50 ? line - 50 : 1 ))" app.log | head -200 \
  | claude -p "What's causing this error?"
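The same extraction in Python, if you are building the snippet programmatically rather than at the shell (the 50-before / 200-total windows mirror the shell version; error_window is a hypothetical helper name):

```python
def error_window(log_text: str, before: int = 50, limit: int = 200) -> str:
    """Return up to `limit` lines starting `before` lines above the first ERROR."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "ERROR" in line:
            start = max(0, i - before)
            return "\n".join(lines[start:start + limit])
    # No ERROR found: fall back to the tail of the log
    return "\n".join(lines[-limit:])
```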

The mental model that changed everything

I stopped thinking "what should I give Claude?" and started thinking "what is the minimum context needed to answer this specific question?" Usually the answer is: the function being changed (50-100 lines), its direct imports (50-200 lines), the relevant test (50-100 lines), and maybe a schema snippet if it touches the database. That is under 500 lines for most code questions — less than 2% of what a naive "send the whole codebase" approach would use.

Better context curation also produces better answers. When Claude gets a focused, well-structured context, it can reason about the specific problem rather than summarizing a large input.
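That budgeting instinct can be encoded directly. A minimal sketch (hypothetical helper, using the rough chars/4 token heuristic and an illustrative default budget) that adds context pieces in priority order and stops when the budget runs out:

```python
def assemble_context(parts: list[tuple[str, str]],
                     budget_tokens: int = 4_000) -> str:
    """Join (label, text) parts in priority order within a token budget.

    Uses the rough chars/4 heuristic; stops at the first part that would
    blow the budget, so list parts most-important-first.
    """
    chosen, used = [], 0
    for label, text in parts:
        cost = len(text) // 4
        if used + cost > budget_tokens:
            break
        chosen.append(f"# {label}\n{text}")
        used += cost
    return "\n\n".join(chosen)
```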
