The Claude API feature that cut my AI costs by 80% overnight
March 31, 2026 · Claude · 7 min read

I was building an internal code review tool powered by Claude when our monthly Anthropic bill crossed a threshold that made my engineering manager raise an eyebrow. We were sending the same 50,000-token system prompt — our coding standards doc, architecture guide, and team conventions — with every single API call. I knew this felt wasteful, but I assumed that was just how LLM APIs worked. Then I found two words in the Anthropic docs that changed everything: cache_control.


What Most Engineers Do

When you build with the Claude API, the natural pattern is straightforward: every call gets the full context. System prompt, few-shot examples, relevant documentation, then the actual user question at the end.

This works. The results are good. But if your system prompt is large — and in real production AI tools it usually is — you're paying full price for every single token on every single call, even when 95% of the content hasn't changed between requests.

For a code review tool that runs on every PR, this adds up fast. Our system prompt was around 50k tokens. At Claude Sonnet's pricing, that's roughly $0.15 per review just for input tokens. With 200 PRs per week, that's $30 a week in context overhead — before any actual reasoning happens.
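
That arithmetic is easy to sanity-check. A quick sketch, assuming Sonnet's $3-per-million input-token price and the token counts above:

```typescript
// Back-of-envelope cost for the uncached setup.
// Assumes $3 per million input tokens (Sonnet's input price).
const PRICE_PER_TOKEN = 3 / 1_000_000;

const systemPromptTokens = 50_000; // coding standards doc
const diffTokens = 1_000;          // a typical PR diff

const costPerReview = (systemPromptTokens + diffTokens) * PRICE_PER_TOKEN;
const weeklyCost = costPerReview * 200; // 200 PRs per week

console.log(costPerReview.toFixed(3)); // "0.153"
console.log(weeklyCost.toFixed(2));    // "30.60"
```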

What I Found

Anthropic's prompt caching lets you mark specific blocks of your prompt as cacheable. The first time you send a request, Claude stores those blocks server-side. On subsequent calls, if the cached content matches, Claude retrieves it from cache instead of reprocessing it — at roughly 10% of the original token cost.

The API is refreshingly simple. You add a cache_control key to any content block:

typescript
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLargeCodingStandardsDoc, // ~50,000 tokens
      cache_control: { type: 'ephemeral' }, // 👈 mark as cacheable
    },
  ],
  messages: [
    {
      role: 'user',
      content: 'Review this diff for issues:\n\n' + prDiff,
    },
  ],
});

The usage object in the response tells you exactly what happened:

typescript
// First call — cache write
{
  input_tokens: 53,                   // the PR diff
  cache_creation_input_tokens: 50234, // stored to cache
  cache_read_input_tokens: 0
}

// Subsequent calls — cache hit!
{
  input_tokens: 61,                   // new PR diff
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 50234      // retrieved from cache 🎉
}

Why It's Better

The numbers are striking. Here's what changed for our code review tool after enabling caching:

WITHOUT CACHING
┌─────────────────────────────────────────────┐
│  Every API call                             │
│  ┌──────────────────────────────────────┐   │
│  │  System prompt (50k tokens)  $0.150  │   │
│  │  + PR diff    (1k tokens)    $0.003  │   │
│  └──────────────────────────────────────┘   │
│  Total per review:        ~$0.153           │
│  200 reviews/week:        ~$30.60           │
└─────────────────────────────────────────────┘

WITH PROMPT CACHING
┌─────────────────────────────────────────────┐
│  First call (cache write, +25% cost)        │
│  ┌──────────────────────────────────────┐   │
│  │  System prompt (50k @ 125%)  $0.188  │   │
│  │  + PR diff    (1k tokens)    $0.003  │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  All subsequent calls (cache read, -90%)    │
│  ┌──────────────────────────────────────┐   │
│  │  System prompt from cache    $0.015  │   │
│  │  + PR diff    (1k tokens)    $0.003  │   │
│  └──────────────────────────────────────┘   │
│  199 remaining reviews:   ~$3.57            │
│                                             │
│  Weekly total:   $3.76   (was $30.60)       │
│  Savings:        88% reduction  ✅          │
└─────────────────────────────────────────────┘

There's a latency benefit too. Cache hits skip the tokenization and processing of the cached content, which means faster time-to-first-token on large-context calls. For our CI tool, review latency dropped from around 8 seconds to under 3 seconds on cache hits.

A few important details worth knowing:

  • Cache TTL: Cached content persists for at least 5 minutes, and the timer resets on each use. A tool that gets a cache hit at least once every five minutes stays warm indefinitely.
  • Cache write cost: The first call costs 25% more — you're paying to store the cache. Every hit after that is 90% cheaper. Break-even is at two calls.
  • Up to 4 cache breakpoints: You can mark multiple blocks in a single prompt — useful for layering a system prompt, few-shot examples, and large reference documents.
  • Minimum cacheable length: Blocks below a minimum size aren't cached (1,024 tokens on Sonnet and Opus, 2,048 on Haiku); shorter blocks are simply processed normally.
  • Identical content required: Caching is prefix-based, so even a single character change invalidates that block's cache and every cached block after it. Design your static sections to be truly static.
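
To make the breakpoint layering concrete, here's a sketch of a layered system array. The three content variables are placeholders for your own documents:

```typescript
// Each static layer gets its own breakpoint. Because caching is
// prefix-based, editing only the last layer (the few-shot examples)
// leaves the earlier cached layers intact.
const codingStandardsDoc = '...'; // placeholder
const architectureGuide = '...';  // placeholder
const fewShotExamples = '...';    // placeholder

const system = [
  { type: 'text', text: codingStandardsDoc, cache_control: { type: 'ephemeral' } },
  { type: 'text', text: architectureGuide, cache_control: { type: 'ephemeral' } },
  { type: 'text', text: fewShotExamples, cache_control: { type: 'ephemeral' } },
];

// The API allows at most 4 breakpoints per request.
const breakpoints = system.filter((block) => 'cache_control' in block).length;
console.log(breakpoints); // 3
```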

Real Use Cases

CI/CD code review: Your coding standards doc, architecture conventions, and example patterns stay cached across every PR review. Only the diff changes per call. This is the exact pattern that cut our bill by 88%.

Documentation chatbots: Cache your entire docs corpus in the system prompt. Users ask questions; you only pay for the question tokens on cache hits. If your docs are 100k tokens, you're looking at 90% savings on every query after the first.

Multi-turn conversations with large context: If your chat history references a large document loaded at the start of the conversation, cache it. Subsequent turns pay only for new messages, not the whole document again.
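
For that multi-turn pattern, cache_control can sit on a content block inside a message, not just in the system array. A sketch, with largeDocument standing in for the real content:

```typescript
// Cache the large document loaded on the first turn; later turns reuse
// it from cache and pay full price only for the new messages.
const largeDocument = '...'; // placeholder for a large reference document

const messages = [
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: largeDocument,
        cache_control: { type: 'ephemeral' }, // cached after the first call
      },
      { type: 'text', text: 'Summarize the key points.' },
    ],
  },
  // ...append the assistant's reply and follow-up turns here
];
```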

Agentic pipelines: When an agent loops over many items — analyzing 200 log entries with the same analysis instructions, for example — cache the instructions block. Per-item cost drops to roughly 10% of what you'd pay without caching.
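
The per-item arithmetic for that agentic case, sketched with assumed numbers (10k tokens of cached instructions, 500 tokens per log entry, $3/MTok input price):

```typescript
// Per-item cost with and without caching the instructions block.
// All token counts and prices here are illustrative assumptions.
const PRICE = 3 / 1_000_000;      // $ per input token
const instructionTokens = 10_000; // cached analysis instructions
const itemTokens = 500;           // one log entry

// Without caching: full instructions re-sent for every item.
const uncachedPerItem = (instructionTokens + itemTokens) * PRICE;

// With caching: cache reads cost 10% of base; item tokens stay full price.
const cachedPerItem = instructionTokens * PRICE * 0.1 + itemTokens * PRICE;

console.log(uncachedPerItem.toFixed(4)); // "0.0315"
console.log(cachedPerItem.toFixed(4));   // "0.0045"
```

With these particular numbers the per-item cost lands at about 14% of the uncached price; the larger the cached instructions are relative to each item, the closer you get to the 10% floor.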

Try It

Here's a drop-in helper that wraps your Claude calls with caching for any static content block:

typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function callClaudeWithCaching({
  staticContext,  // your large, stable content
  userMessage,    // what changes each call
  model = 'claude-sonnet-4-5',
  maxTokens = 1024,
}: {
  staticContext: string;
  userMessage: string;
  model?: string;
  maxTokens?: number;
}) {
  const response = await anthropic.messages.create({
    model,
    max_tokens: maxTokens,
    system: [
      {
        type: 'text' as const,
        text: staticContext,
        cache_control: { type: 'ephemeral' as const },
      },
    ],
    messages: [
      {
        role: 'user' as const,
        content: userMessage,
      },
    ],
  });

  const { usage } = response;
  const cacheHit = (usage.cache_read_input_tokens ?? 0) > 0;

  console.log(
    cacheHit
      ? 'Cache hit — saved ' + usage.cache_read_input_tokens + ' tokens'
      : 'Cache miss — stored ' + usage.cache_creation_input_tokens + ' tokens'
  );

  return response;
}

// Usage
const result = await callClaudeWithCaching({
  staticContext: yourCodingStandardsDoc, // stays in cache between calls
  userMessage: 'Review this diff:\n\n' + prDiff,
});

The first time you run this, you'll see "Cache miss — stored X tokens". Every subsequent call with the same staticContext will log "Cache hit" — and you'll be paying 10 cents on the dollar for those tokens.

One thing I wish I'd known earlier: prompt caching is available on Claude Haiku, Sonnet, and Opus. There's no feature flag to enable, no support ticket to file. You add the cache_control key and it works.

If you're building anything that calls Claude repeatedly with the same large context — and most production AI tools do — this is the highest-ROI change you can make to your API usage today. The break-even point is literally two API calls.
