Anthropic has introduced prompt caching on its API, which remembers context between API calls and lets developers avoid resending the same prompts.
Prompt caching is available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for the largest Claude model, Opus, coming soon.
Prompt caching, described in this 2023 paper, lets users keep frequently used context available across API calls. Because the cached context is retained, users can add extensive background information without paying for it on every request. This is useful when someone wants to send a large amount of context once and then refer back to it in later requests to the model. It also lets developers and other users more finely tune the model's responses.
Anthropic says early users have seen “significant speed and cost improvements with prompt caching for a variety of use cases, from including a full knowledge base to 100-shot examples to including each turn of a conversation in the prompt.”
The company says potential use cases include reducing the cost and latency of long instructions and uploaded documents for conversational agents, speeding up code completion, providing multi-step instructions to agentic search tools, and embedding full documents in prompts.
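In practice, caching works by marking a reusable prefix of the prompt so later calls can hit the cache instead of reprocessing it. The sketch below shows roughly how that looks with Anthropic's Messages API in Python; the file name, prompt text, and exact beta header are illustrative assumptions, so the current API surface may differ from this.

```python
# Minimal sketch: mark a large, reusable system prompt as cacheable so
# subsequent requests can reuse it instead of re-sending full-price tokens.
import anthropic

client = anthropic.Anthropic()

# Hypothetical long document that many requests will refer back to.
long_reference_text = open("product_manual.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Beta feature flag at launch; may not be required once caching is GA.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You answer questions about the attached manual."},
        {
            "type": "text",
            "text": long_reference_text,
            # Everything up to and including this block becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset the device?"}],
)
print(response.content[0].text)
```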
Cached prompt pricing
One advantage of prompt caching is the lower cost per token, with Anthropic saying that using cached prompts is “significantly cheaper” than the base input token price.
For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per million tokens (MTok), while reading a cached prompt costs $0.30 per MTok. Since the base input price for Claude 3.5 Sonnet is $3/MTok, paying 25% more up front buys a 10x cheaper rate every time the cached prompt is used later.
Claude 3 Haiku users pay $0.30/MTok to write to the cache and $0.03/MTok to use cached prompts.
Prompt caching is not yet available in Claude 3 Opus, but Anthropic has already published its pricing: writing to the cache costs $18.75/MTok, while accessing a cached prompt costs $1.50/MTok.
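To make the savings concrete, here is a back-of-the-envelope comparison using the Claude 3.5 Sonnet prices quoted above; the 100,000-token prompt size and 50-request reuse count are illustrative assumptions, not figures from Anthropic.

```python
# Rough cost comparison: cached vs. uncached input tokens for Claude 3.5 Sonnet,
# assuming one shared 100K-token prefix reused across 50 requests.
BASE_INPUT = 3.00    # $/MTok, normal input tokens
CACHE_WRITE = 3.75   # $/MTok, writing a prompt into the cache
CACHE_READ = 0.30    # $/MTok, reading a cached prompt

prompt_mtok = 0.1    # 100,000-token shared prefix = 0.1 MTok
calls = 50           # requests that reuse the same prefix

uncached = calls * prompt_mtok * BASE_INPUT
cached = prompt_mtok * CACHE_WRITE + (calls - 1) * prompt_mtok * CACHE_READ

print(f"Without caching: ${uncached:.2f}")  # $15.00
print(f"With caching:    ${cached:.2f}")    # about $1.85 in this scenario
```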
However, as AI influencer Simon Willison noted on X, Anthropic's cache only lasts five minutes and is refreshed each time it is used.
Of course, this isn’t the first time Anthropic has tried to compete with other AI platforms through pricing. Before the launch of the Claude 3 family of models, Anthropic slashed its token price significantly.
It is now engaged in something of a “race to the bottom” with competitors, including Google and OpenAI, to offer cheaper options for third-party developers building on its platform.
A highly requested feature
Other platforms offer their own versions of prompt caching. Lamina, an LLM inference system, uses KV caching to reduce GPU costs. A quick look through OpenAI's developer forums or GitHub turns up questions about how to cache prompts.
Caching prompts is not the same as a large language model's memory. OpenAI's GPT-4o, for example, offers a memory feature in which the model remembers preferences or details, but it does not store the actual prompts and responses the way prompt caching does.