Workflow AutomationAIOperationsn8n

Prompt caching in automations: when it pays off

Prompt caching discounts only the static prefix of a prompt that is reused within the cache window, about five minutes by default. Most business automations fire less often than that and build the prompt with variable data first, so they never get a cache hit. Caching pays off on bursty or batched workloads structured static-first; for low-frequency bulk jobs the provider's Batch API is the better cost lever.

Alexey YushkinFounder, GENERAL INFORMATICS3 min read

Prompt caching discounts the part of your prompt that stays identical from one call to the next, but only if a later call reuses that exact prefix before the cache expires, which is about five minutes by default. Most business automations fail both conditions. They fire less often than once every five minutes, so the cache is cold on every run, and they build the prompt with the variable record at the top, so there is no shared prefix to cache in the first place. Caching earns its discount, up to 90 percent off the cached tokens, on bursty or batched workloads that send many similar calls inside the window, structured with the static content first. For the typical low-frequency automation it does nothing, and on Anthropic a misplaced cache breakpoint can make the call cost more than not caching at all.

The advice you will read everywhere is "turn on caching, cut your AI bill by 90 percent." That number is real at the API level. It is also irrelevant to most of the automations operators actually run, for reasons the cost-cutting guides never mention. The discount is conditional, and the conditions are exactly the ones a no-code workflow tends to break.

What prompt caching actually discounts

Caching does not discount your prompt. It discounts the longest run of tokens at the start of your prompt that is byte-for-byte identical to a request the provider has recently seen. That shared opening is the prefix: usually your system instructions, few-shot examples, and output schema. The variable part (this customer's message, this invoice, today's date) is never cached, because it is different every time. The whole mechanism rests on reuse of a stable prefix.

The numbers are current as of June 2026.

OpenAIAnthropic
How you turn it onAutomatic, no parameterExplicit cache_control breakpoint
Discount on cached tokensUp to 90 percent off on current GPT-5.x modelsCache read is 10 percent of the input price, a 90 percent discount
Cost to write the cacheNone25 percent surcharge on the cached tokens for the 5-minute cache, 100 percent for the 1-hour cache, paid once per write
Minimum prefix to cache1,024 tokens1,024 tokens (Sonnet 4.6, Opus 4.8), 4,096 (Haiku 4.5)
Default lifetimeEvicts after 5 to 10 minutes of inactivity, max about 1 hour5 minutes, refreshed free on every read
What must matchLongest identical prefix from the very startEverything up to and including the breakpoint

Two rows decide whether caching does anything for you, and neither is the discount. The lifetime row sets how often you have to call to keep the cache warm. The "what must match" row sets how you have to build the prompt so there is a shared prefix at all. Get either wrong and the 90 percent never arrives.

Why most automations never get a cache hit

Two structural facts about business automations defeat caching, and they are independent. You can fix one and still lose to the other.

The first is cadence against the cache window. Both providers evict the cache after roughly five to ten minutes of inactivity. Anthropic's default lifetime is five minutes, extended only when a read refreshes it. Now look at how often a real workflow fires. A lead-intake automation might run when a form is submitted, which could be every forty minutes. An invoice-extraction flow runs a handful of times a day. A nightly summarization job runs once. In every one of those, the gap between calls is longer than the cache lives, so each run starts from a cold cache and pays full price. Caching only stays warm under sustained traffic: a chatbot mid-conversation, a queue being drained, a loop running back to back. The low-frequency, event-triggered pattern that defines most operator automations is the worst case for it.

The second is prefix ordering, and this one is self-inflicted. The discount applies only to the identical opening of the prompt. If your prompt template interpolates the variable record first ("Lead: Jane Doe, ACME Corp, submitted 14:32. Now follow these instructions...") then the prefix is different on every single call and nothing is shared, even under heavy traffic. Both providers say the same thing in their docs: put static content at the beginning and variable content at the end. OpenAI states it as structure guidance; Anthropic enforces it through where you place the cache breakpoint. No-code AI nodes and hand-written prompt templates routinely do the opposite, dropping the record at the top because that reads naturally to a human. That single ordering choice is a guaranteed cache miss on every run.

This connects to a point we have made before: the model is usually a rounding error in an automation's monthly bill, and the platform's per-step billing dominates. Caching attacks the rounding error, not the dominant cost. Even when it works perfectly, you are shaving the smallest line on the invoice. That is worth doing when it is free, which on OpenAI it is. It is not worth contorting a workflow around.

The Anthropic trap: a breakpoint that costs more

On OpenAI, a cache miss is harmless. Caching is automatic and free, so when the prefix does not match you simply pay the normal input price, the same as if caching did not exist. There is no downside to a miss.

Anthropic is different, and this is the part almost no guide flags. To cache on Anthropic you place a cache_control breakpoint, and writing the cache costs 25 percent more than the base input price for the cached tokens (100 percent more for the one-hour cache). You pay that write premium every time the cache is cold. If your automation runs less often than the cache lives, every run is a cold write with no read to amortize it, so you pay the surcharge and never collect the discount.

Walk the math on a Sonnet-class model at $3 per million input tokens, with a 3,000-token static prefix and 200 tokens of variable lead data.

  • No caching: 3,200 tokens at $3 per million is about $0.0096 per run.
  • Warm cache read: the 3,000 static tokens at 10 percent plus 200 at full price is about $0.0015 per run, roughly 84 percent cheaper. This is the number the guides quote.
  • Cold cache write, which is what a low-frequency flow gets on every run: the 3,000 tokens at the 1.25x write rate plus 200 at full price is about $0.012 per run, about 25 percent more expensive than not caching at all.

So an operator who reads "caching saves 90 percent," switches it on for a flow that fires every forty minutes, and walks away has not saved anything. They have raised the per-run cost by a quarter, quietly, on every run, because the cache is never warm when the next one arrives. Caching on Anthropic is a bet that a read will follow your write inside five minutes. If you cannot guarantee that, do not place the breakpoint.

When caching actually pays off

The decision is one question: will another call reuse this exact static prefix within the cache window? If yes, structure the prompt static-first and cache. If no, do not bother, and on Anthropic actively leave it off.

WorkloadReuses a warm prefix?Caching verdict
Live chatbot mid-conversationYes, turns arrive seconds apartCache the system prompt and history
Draining a queue or looping a list back to backYes, calls are continuousCache the shared instructions
Many parallel calls in one burstYes, but warm the cache with one call firstCache after the first response returns
Lead intake every 30 to 60 minutesNo, cache evicts between runsSkip it (OpenAI), leave off (Anthropic)
Invoice extraction a few times a dayNo, always coldSkip it
Nightly summarization, once per runNoSkip it; batch instead

A practical note for the burst case. The cache entry only becomes available after the first response starts coming back. If you fire fifty parallel calls at once from a cold cache, all fifty write the cache and none of them read it, so you pay fifty writes. Send one call first, let it return, then fan out the rest against the now-warm prefix. That ordering is the difference between paying the write premium once and paying it fifty times.

What to do instead, and how to start

For the most common expensive case, a bulk job where no single item is urgent, caching is the wrong lever. The right one is the asynchronous Batch API, which runs the same model at half the token cost with no cadence requirement and no prefix-ordering gymnastics. You can stack caching on top of a batch if every request shares a long static prefix, but the 50 percent batch discount does the heavy lifting and arrives without any of the conditions caching imposes. Reach for batching first on overnight and backlog work, and reach for caching only when sustained real-time traffic shares a stable prefix.

If you do want the caching discount, the build is small and worth getting right. First, restructure every AI prompt in the flow so all static content (instructions, examples, schema, reference text) sits at the top and the per-call variable goes last. That one change is free, it helps on OpenAI automatically, and it is the precondition for any cache hit on either provider. Second, confirm your static prefix clears the model's minimum, 1,024 tokens for most current models, or nothing caches regardless of frequency. Third, on Anthropic, only place the breakpoint where you can reasonably expect a read within five minutes, and use the one-hour cache only if the read might lag. On a flow that fires every forty minutes, the honest answer is to leave caching off and look at batching or model choice instead.

Pick one AI step you run often and check its real cadence and its prompt order before you touch a caching setting. If it does not fire more than once every few minutes against an identical prefix, caching is not your cost lever and you should stop optimizing it. If you want a second set of eyes on where the money in your workflow automations actually goes, and which steps are worth caching, batching, or leaving alone, tell us what you are running.

Frequently Asked Questions

SOURCES & CITATIONS

  1. Prompt caching Anthropichttps://platform.claude.com/docs/en/build-with-claude/prompt-caching
  2. Prompt caching OpenAIhttps://developers.openai.com/api/docs/guides/prompt-caching
  3. Prompt Caching in the API OpenAIhttps://openai.com/index/api-prompt-caching/

About Alexey Yushkin

Alexey is the founder of GENERAL INFORMATICS LLC. He designs and ships AI and automation systems for businesses and operators across the US.

Connect on LinkedIn

Related reading

Want this kind of system in your business?

We build practical AI and automation systems for operators. Send us your current workflow and we will show you what to automate first.

Request a Workflow Review