Prompt caching is the single highest-ROI feature in the Claude API for production workloads. Used well, it cuts the cost of high-traffic endpoints by 70-90% and shaves hundreds of milliseconds off latency. Used poorly, or ignored, it leaves money on the table: a rounding error at small scale, a six-figure bill at large scale.
This article is the practitioner's guide to getting it right.
What Caching Actually Does
When you mark a portion of a prompt with cache_control, Anthropic stores the model's internal state after processing that portion. On a subsequent request that begins with the same prefix, the model resumes from that cached state instead of re-processing.
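In code, that looks like the following. This is a minimal sketch assuming the official @anthropic-ai/sdk client; longStableInstructions and userQuestion are placeholder variables:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longStableInstructions, // the stable portion you want cached
      cache_control: { type: "ephemeral" }, // marks the end of the cached prefix
    },
  ],
  messages: [{ role: "user", content: userQuestion }],
});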
The pricing implications:
- Cache write: ~25% premium over standard input pricing (one-time cost).
- Cache read: ~10% of standard input pricing.
- Cache lifetime: 5 minutes by default, refreshed on each hit.
Break-even math: if you write a cached prefix once and read it twice, you've already saved money. In production, you typically read 50-500 times per write window. The savings are not theoretical.
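The arithmetic, using the multipliers above (all in units of the standard input price for the prefix):

// Cost of serving n requests that share one cached prefix, relative to
// re-sending the prefix uncached each time.
const cachedCost = (n: number) => 1.25 + (n - 1) * 0.1; // one write + n-1 reads
const uncachedCost = (n: number) => n * 1.0;

// n = 2:   cached 1.35  vs uncached 2.00 -> ~33% saved (break-even passed)
// n = 3:   cached 1.45  vs uncached 3.00 -> ~52% saved
// n = 100: cached 11.15 vs uncached 100  -> ~89% saved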
What to Cache
Three categories of content reliably benefit from caching:
1. Stable system prompts. Long instructions, role definitions, output format specifications, examples: anything that doesn't change between requests.
const systemPrompt = [
  {
    type: "text",
    text: `You are a customer support agent for Acme Corp...
[4,000 tokens of policies, examples, edge cases, tone guidance]`,
    cache_control: { type: "ephemeral" },
  },
];
2. Retrieved context in RAG applications. When the retrieved documents are stable across multiple turns of a conversation (or across multiple users asking similar questions), cache them.
const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: retrievedDocuments, // 8K tokens of context
        cache_control: { type: "ephemeral" },
      },
      {
        type: "text",
        text: userQuestion, // varies per request
      },
    ],
  },
];
3. Tool definitions in agentic systems. If your agent has 12 tools, each with detailed JSON schemas and descriptions, that's typically 2-5K tokens that never change. Put cache_control on the last tool to cache the whole block:
const tools = [
  /* first 11 tool definitions, ~3K tokens total */
  {
    // placeholder final tool; cache_control on the *last* tool caches the
    // entire tools block (all tools + schemas) as a single cached prefix
    name: "example_last_tool",
    description: "...",
    input_schema: { type: "object", properties: {} },
    cache_control: { type: "ephemeral" },
  },
];

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  tools,
  max_tokens: 1024,
  messages: [/* ... */],
});
What Not to Cache
Caching has overhead. Don't bother for:
- Prompts under ~1,024 tokens. That's roughly the minimum cacheable size; below it the cache_control marker is silently ignored and the prompt is processed without caching (see the sketch after this list).
- Truly per-request content (the user's specific question, this turn's data). Caching it provides no benefit: you'll never get a hit.
- Content that changes faster than the 5-minute window. If your prompt prefix is recomputed every minute, you're paying the 25% write premium each time with no read benefit.
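A small guard captures the first rule. This is a sketch: estimateTokens stands in for whatever tokenizer or heuristic you use, and the floor mirrors the ~1,024-token minimum mentioned above.

// Attach cache_control only when a block is plausibly above the minimum
// cacheable size; below it the marker is ignored, so don't spend one of
// the request's breakpoints on it.
const MIN_CACHEABLE_TOKENS = 1024;

function maybeCached(text: string, estimateTokens: (s: string) => number) {
  const block: { type: "text"; text: string; cache_control?: { type: "ephemeral" } } = {
    type: "text",
    text,
  };
  if (estimateTokens(text) >= MIN_CACHEABLE_TOKENS) {
    block.cache_control = { type: "ephemeral" };
  }
  return block;
}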
Cache Key Design: The Subtle Art
Anthropic's caching is based on exact prefix match. Even a one-character difference in the cached portion causes a cache miss and a fresh write. Three implications:
1. Order your prompt for maximum reuse. Put the most-stable content first (system prompt, then static examples), variable content last (user query). The longer the stable prefix, the more reuse.
// ✅ Cacheable prefix is large
messages: [
  { role: "user", content: [
    { type: "text", text: stableContext, cache_control: {...} },
    { type: "text", text: userQuery },
  ]},
]

// ❌ Cacheable prefix is broken by varying content first
messages: [
  { role: "user", content: [
    { type: "text", text: userQuery },
    { type: "text", text: stableContext, cache_control: {...} }, // useless
  ]},
]
2. Don't put dynamic content (timestamps, IDs, "Today is...") in your system prompt. I've seen teams burn 80% of their potential caching savings on a single line: Current date: 2026-04-17 14:32:11. Put dynamic context in the user message, not the system prompt.
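A before/after sketch of that fix (stableSystemPrompt and userQuery are placeholders):

// ❌ A timestamp inside the cached block forces a miss on every request
system: [
  {
    type: "text",
    text: `${stableSystemPrompt}\nCurrent date: ${new Date().toISOString()}`,
    cache_control: { type: "ephemeral" },
  },
],

// ✅ Keep the cached block byte-stable; put the date in the user turn
system: [
  { type: "text", text: stableSystemPrompt, cache_control: { type: "ephemeral" } },
],
messages: [
  { role: "user", content: `Current date: ${new Date().toISOString()}\n\n${userQuery}` },
],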
3. Multi-turn conversations: cache up through the last assistant turn. As the conversation grows, mark the last assistant message with cache_control. The full conversation history gets cached, and only the new user turn is uncached input.
const messages = [
  { role: "user", content: "..." },
  { role: "assistant", content: "..." },
  { role: "user", content: "..." },
  {
    role: "assistant",
    content: [
      { type: "text", text: "...", cache_control: { type: "ephemeral" } },
    ],
  },
  { role: "user", content: newUserMessage },
];
Each new turn extends the cached prefix. By turn 10, you're caching ~95% of the input tokens.
Multiple Cache Breakpoints
You can specify up to 4 cache_control markers per request, creating multiple cache prefixes that can hit independently. Useful when you have:
- A stable system prompt (always cached)
- A user-tier-specific instruction set (cached per tier)
- A conversation history (cached per conversation)
{
  system: [
    { type: "text", text: globalInstructions, cache_control: {...} },
  ],
  messages: [
    { role: "user", content: [
      { type: "text", text: tierInstructions, cache_control: {...} },
      { type: "text", text: convHistory, cache_control: {...} },
      { type: "text", text: thisTurn },
    ]},
  ],
}
The system checks each breakpoint and resumes from the longest matching cached prefix. Even if the conversation history changes (a cache miss on that block), the global and tier prefixes can still hit.
Measuring Cache Effectiveness
Every response includes cache metrics in usage:
console.log({
  input_tokens: response.usage.input_tokens,
  cache_read_tokens: response.usage.cache_read_input_tokens,
  cache_write_tokens: response.usage.cache_creation_input_tokens,
  output_tokens: response.usage.output_tokens,
});
Build a dashboard that tracks cache hit rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + input_tokens); input_tokens counts only the uncached portion of the prompt, so the denominator is the true total input. A healthy production endpoint with caching configured well should sit between 70% and 95%. Below 50% means your prefix isn't stable enough or your traffic isn't dense enough to amortise the 5-minute window.
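A small helper for that metric (a sketch; it counts cache writes as misses, per the formula above):

function cacheHitRate(usage: {
  input_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  // input_tokens counts only the uncached portion, so the true total
  // input is the sum of all three buckets.
  const totalInput = usage.input_tokens + read + written;
  return totalInput === 0 ? 0 : read / totalInput;
}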
Production Architecture: Cache-First Design
The teams who get the most out of caching design for it from day one. The principles:
- Separate stable from dynamic content explicitly. Treat the cache boundary as part of the prompt's API.
- Concentrate variable content at the end. User ID, query, timestamp: last in the prompt, not sprinkled throughout.
- Warm the cache before peak traffic. A scheduled job that pings each cached prompt template just before traffic ramps, and re-pings during any lull longer than the cache lifetime, keeps the cache warm enough that the first morning users don't pay the cache_write premium (see the sketch after this list).
- Monitor cache hit rate per endpoint. If it drops below threshold, alert. A drop usually means someone changed a "stable" prompt without realising the cost implication.
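A hedged sketch of such a warmer; the prompt list, model ID, and interval are assumptions, and the point is only to re-touch each cached prefix before its TTL lapses:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const stablePrompts: string[] = [/* your cached system prompt templates */];

// Each cache hit refreshes the 5-minute TTL, so re-ping slightly inside
// that window during the ramp-up before peak traffic.
async function warmCaches() {
  for (const text of stablePrompts) {
    await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1, // minimal output; we only need the prefix processed
      system: [{ type: "text", text, cache_control: { type: "ephemeral" } }],
      messages: [{ role: "user", content: "ping" }],
    });
  }
}

setInterval(() => warmCaches().catch(console.error), 4 * 60 * 1000);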
A Real Example
A RAG-style customer support feature I worked on this quarter:
- System prompt: 6K tokens (instructions + 12 example dialogues)
- Per-conversation context: 4K tokens (account info, recent history)
- User query: ~50 tokens
Without caching: $0.024 per request × 50K requests/day = $1,200/day.
With caching (system + per-conversation context cached, ~85% hit rate after week 1):
- Cache writes: rare; absorbed by the pre-peak warm-up job and kept fresh by steady traffic
- Cache reads: 10K tokens × $0.30/M tokens = $0.003 per request × 50K = $150/day
Same feature, $1,050/day saved. ~$30K/month. The caching configuration was 4 lines of code.
The sticker price of frontier LLMs has dropped year over year, but caching is what makes them affordable at scale today. Build it in from request one.