In short
Prompt caching is a feature that stores the processed state of a repeated part of your prompt so later requests can reuse it instead of paying to process it again. When many requests share the same long prefix - a big system prompt, tool definitions, or a document you keep asking about - caching lets the model skip recomputing that prefix, which cuts both cost and latency. On the Claude API, for example, a cache read costs about a tenth of normal input price, a roughly 90 percent discount on the cached portion.
How prompt caching works
You mark a point in your prompt as a cache breakpoint. The provider stores the encoded state of everything up to that point, and the next request that begins with the exact same bytes reads from the cache rather than recomputing. The match must be exact: if even one token in the prefix differs, you get a cache miss and pay the full price. The order matters too, since the prompt is hashed as tools, then system prompt, then messages.
- Put the stable, repeated content first (tools, system prompt, long documents).
- Mark a cache breakpoint after it; the prefix up to there gets cached.
- A later request with the identical prefix reads the cache cheaply.
What it costs and saves
There is a small premium to write the cache and a large saving to read it. On the Claude API a cache write costs about 1.25x normal input for a 5 minute lifetime (or 2x for a 1 hour lifetime), while a cache read costs about 0.1x input, a roughly 90 percent saving. The cache is ephemeral with a short time-to-live that resets on each read, so a busy conversation keeps its cache warm without paying the write cost again.
When to use it
Prompt caching pays off whenever you reuse a large, stable prefix across many calls: a long system prompt, a fixed set of tool definitions, a knowledge document, or a multi-turn chat where the early context stays the same. It does not help one-off prompts that never repeat. Because the saving applies only to the unchanged prefix, structure your prompts so the constant parts come first and the variable parts come last.
