Anthropic has introduced prompt caching on its API, which remembers context between API calls and lets developers avoid resending the same prompts.
Prompt caching is available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for the largest Claude model, Opus, coming soon.
Prompt caching, described in this 2023 paper, lets users keep frequently used context available across API calls. Because the cached context is retained, users can add extensive background information without paying for it on every request. This is useful when someone wants to send a large amount of context once and then refer back to it in later requests to the model. It also lets developers and other users more finely tune the model's responses.
Anthropic says early users have seen “significant speed and cost improvements with prompt caching for a variety of use cases, from including a full knowledge base to 100-shot examples to including each turn of a conversation in the prompt.”
The company says potential use cases include reducing the cost and latency of long instructions and uploaded documents for conversational agents, speeding up code completion, providing multi-step instructions to agentic search tools, and embedding full documents in prompts.
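In practice, caching works by marking a reusable prefix of the prompt so later calls can hit the cache instead of reprocessing it. The sketch below shows roughly how that looks with Anthropic's Messages API in Python; the file name, prompt text, and exact beta header are illustrative assumptions, so the current API surface may differ from this.

```python
# Minimal sketch: mark a large, reusable system prompt as cacheable so
# subsequent requests can reuse it instead of re-sending full-price tokens.
import anthropic

client = anthropic.Anthropic()

# Hypothetical long document that many requests will refer back to.
long_reference_text = open("product_manual.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Beta feature flag at launch; may not be required once caching is GA.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You answer questions about the attached manual."},
        {
            "type": "text",
            "text": long_reference_text,
            # Everything up to and including this block becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset the device?"}],
)
print(response.content[0].text)
```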
Cached prompt pricing
One advantage of prompt caching is the lower cost per token, with Anthropic saying that using cached prompts is “significantly cheaper” than the base input token price.
For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per million tokens (MTok), while reading a cached prompt costs $0.30 per MTok. Since the base input price for Claude 3.5 Sonnet is $3/MTok, paying 25% more up front buys a 10x cheaper rate every time the cached prompt is used later.
Claude 3 Haiku users pay $0.30/MTok to write to the cache and $0.03/MTok to use cached prompts.
Prompt caching is not yet available in Claude 3 Opus, but Anthropic has already published its pricing: writing to the cache costs $18.75/MTok, while accessing a cached prompt costs $1.50/MTok.
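To make the savings concrete, here is a back-of-the-envelope comparison using the Claude 3.5 Sonnet prices quoted above; the 100,000-token prompt size and 50-request reuse count are illustrative assumptions, not figures from Anthropic.

```python
# Rough cost comparison: cached vs. uncached input tokens for Claude 3.5 Sonnet,
# assuming one shared 100K-token prefix reused across 50 requests.
BASE_INPUT = 3.00    # $/MTok, normal input tokens
CACHE_WRITE = 3.75   # $/MTok, writing a prompt into the cache
CACHE_READ = 0.30    # $/MTok, reading a cached prompt

prompt_mtok = 0.1    # 100,000-token shared prefix = 0.1 MTok
calls = 50           # requests that reuse the same prefix

uncached = calls * prompt_mtok * BASE_INPUT
cached = prompt_mtok * CACHE_WRITE + (calls - 1) * prompt_mtok * CACHE_READ

print(f"Without caching: ${uncached:.2f}")  # $15.00
print(f"With caching:    ${cached:.2f}")    # about $1.85 in this scenario
```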
However, as AI influencer Simon Willison noted on X, Anthropic's cache only lasts five minutes and is refreshed each time it is used.
Of course, this isn’t the first time Anthropic has tried to compete with other AI platforms through pricing. Before the launch of the Claude 3 family of models, Anthropic slashed its token price significantly.
It is now engaged in something of a “race to the bottom” with competitors, including Google and OpenAI, to offer cheaper options for third-party developers building on its platform.
A highly requested feature
Other platforms offer their own versions of prompt caching. Lamina, an LLM inference system, uses KV caching to reduce GPU costs. A quick look through OpenAI's developer forums or GitHub turns up questions about how to cache prompts.
Caching prompts is not the same as a large language model's memory. OpenAI's GPT-4o, for example, offers a memory feature in which the model remembers preferences or details, but it does not store the actual prompts and responses the way prompt caching does.