OpenAI Prompt Caching for Production AI Apps

Design OpenAI prompt caching around stable prefixes, cached_tokens telemetry, prompt_cache_key routing, and cost math before production traffic.

Friday, July 3, 2026
Omid Saffari

OpenAI prompt caching is worth designing for when your app repeats long prefixes: make the prefix boring, measure cached_tokens on every request, and treat cache misses as a cost bug before production traffic scales.

The Production Rule: Stable Prefix First, Dynamic Tail Last

Prompt caching only pays off when repeated requests start the same way. OpenAI's current cookbook states that cache hits require an exact, repeated prefix match and work for prompts containing 1024 tokens or more, with cache hits occurring in increments of 128 tokens.

That rule changes how you structure a production AI app. Put durable content first:

  • System and developer instructions
  • Tool definitions
  • Structured output schemas
  • Static product, policy, or repository context
  • Long examples that rarely change

Put volatile content last:

  • User query
  • Current timestamp
  • Session-specific state
  • Recent retrieval results
  • Temporary feature flags

The mistake is treating caching as a billing feature that happens after architecture. It is a context architecture feature. If your app rewrites the opening instructions on every turn, injects timestamps near the top, changes tool order per request, or compacts earlier turns by editing the front of the conversation, it can turn a cacheable workload into full-price input without a model-quality signal.

For a production agent, the cacheable prefix is usually the operating contract: role, tools, schemas, safety rules, escalation policy, and stable domain context. The dynamic tail is the live work. If the prefix cannot remain stable, caching is not the lever. Use model routing policy, retrieval, or shorter context instead.

What OpenAI Actually Caches

OpenAI prompt caching reuses the model's key/value tensors for a repeated prefix, not a copy of your raw prompt text. The cookbook describes KV tensors as the intermediate representation from the model's attention layers produced during prefill, and says only the key/value tensors may be persisted in local storage.

That distinction matters for security reviews. Caching is still model infrastructure behavior, not an app-side data store. OpenAI's original prompt caching announcement says prompt caches are not shared between organizations. The same announcement says legacy in-memory caches are typically cleared after 5-10 minutes of inactivity and are always removed within one hour of the cache's last use. OpenAI's newer cookbook adds that in-memory prompt caching works automatically on all API requests, while Extended Prompt Caching increases retention to 24 hours.

The cacheable prefix can be more than text. OpenAI says the entire request prefix is cacheable: messages, images, audio, tool definitions, and structured output schemas. For agent systems, tool definitions and schemas are the big one. Reordering tools, changing schema keys, or adding a one-off tool at the top of a request can break the prefix even if the user-facing prompt looks unchanged.

Treat the prompt prefix as a deterministic build artifact:

Text
prefix =
  model contract
  tool definitions in stable order
  structured output schema
  stable policy and product context
  few-shot examples

tail =
  user request
  retrieved documents
  live state
  request metadata

The goal is not to stuff more content into every call. The goal is to make the content you already repeat cheap, observable, and boring.
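One way to make the "build artifact" idea concrete is to fingerprint the stable prefix at build time and log that fingerprint with every request. The sketch below is illustrative only: `prefix_fingerprint` and the tool names are hypothetical, and it assumes that the order of tools and schema keys in your serialized request is what matters for prefix matching, so it deliberately does not sort anything.

```python
import hashlib
import json


def prefix_fingerprint(instructions: str, tools: list[dict], schema: dict) -> str:
    """Return a short, deterministic hash of the stable prefix parts.

    Key order and list order are preserved on purpose: a reordered tool
    list or schema should produce a different fingerprint, because it is
    assumed to produce a different request prefix as well.
    """
    canonical = json.dumps(
        {"instructions": instructions, "tools": tools, "schema": schema},
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical tool definitions, for illustration only.
tools = [{"name": "lookup_order"}, {"name": "issue_refund"}]
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}

stable = prefix_fingerprint("Support agent contract v7", tools, schema)
reordered = prefix_fingerprint("Support agent contract v7", tools[::-1], schema)
```

Here `stable` and `reordered` differ even though the same two tools are present, which is the property you want from a check that runs in CI or at deploy time.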

The Cost Math That Decides Whether It Matters

Prompt caching matters when repeated input is a meaningful share of spend or latency. OpenAI says Prompt Caching can reduce time-to-first-token latency by up to 80% and input token costs by up to 90%, and that it works automatically on API requests with no additional fees.

The discount is model-dependent. OpenAI's cookbook gives these example prices per 1M tokens:

| Model | Input | Cached input | Discount shape |
| --- | --- | --- | --- |
| gpt-4o | $2.50 | $1.25 | 50% lower cached input |
| gpt-4.1 | $2.00 | $0.50 | 75% lower cached input |
| gpt-5-nano | $0.05 | $0.005 | 90% lower cached input |
| gpt-5.2 | $1.75 | $0.175 | 90% lower cached input |
| gpt-realtime audio | $32.00 | $0.40 | 98.75% lower cached input |

The current OpenAI API pricing docs also show cached input columns across the GPT-5.x family. One current row lists gpt-5.5 at $5.00 input, $0.50 cached input, and $30.00 output per 1M tokens.

Here is the decision math for an app team. Suppose a support agent sends 7000 input tokens per run, and 5000 of those tokens are a stable contract, tools, and product policy. If cache hits serve those 5000 tokens at a 90% cached-input discount, the request still pays full price for the 2000 dynamic user and retrieval tokens, but the repeated operating contract becomes a small part of the bill. If the same agent mutates its tool schema or its prefix on every turn, all 7000 input tokens can behave like fresh input.
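The arithmetic for that support agent can be sketched in a few lines. This is a worked example, not a billing tool: it uses the gpt-5.2 example prices quoted above ($1.75 input, $0.175 cached input per 1M tokens), ignores output tokens, and follows the article's simplification that all 5000 prefix tokens are served from cache. The function name `input_cost_usd` is made up for this example.

```python
def input_cost_usd(
    prompt_tokens: int,
    cached_tokens: int,
    input_price_per_m: float,
    cached_price_per_m: float,
) -> float:
    """Input-side cost of one request, in USD, given per-1M-token prices."""
    uncached_tokens = prompt_tokens - cached_tokens
    total = uncached_tokens * input_price_per_m + cached_tokens * cached_price_per_m
    return total / 1_000_000


# Example prices for gpt-5.2 from the table above (USD per 1M tokens).
INPUT_PRICE = 1.75
CACHED_PRICE = 0.175

cold = input_cost_usd(7000, 0, INPUT_PRICE, CACHED_PRICE)     # every token is fresh
warm = input_cost_usd(7000, 5000, INPUT_PRICE, CACHED_PRICE)  # stable prefix is cached
savings = 1 - warm / cold

print(f"cold: ${cold:.6f}  warm: ${warm:.6f}  savings: {savings:.1%}")
```

Under these assumptions a cold request costs about $0.01225 in input and a warm one about $0.004375, roughly 64% less. The saving is smaller than the 90% headline discount because the 2000 dynamic tokens are still billed at the full input price.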

That is why the rollout metric is not "prompt caching enabled." It is cached-token share by request class:

Text
cached_share = cached_tokens / prompt_tokens
uncached_prompt_tokens = prompt_tokens - cached_tokens

For an agent with a long stable prefix, a low cached share is a production defect. Either the prefix is changing, the traffic is being routed too broadly, the cache expired, or the request is below the 1024-token threshold.
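The metric and the failure causes above can be turned into a small triage helper. This is a sketch under a stated assumption: it models the best-case cached-token count as the stable prefix length rounded down to a multiple of 128, with nothing cacheable below 1024 tokens, which is inferred from the threshold and increment quoted earlier rather than from a documented formula. The function names and status labels are hypothetical.

```python
CACHE_MIN_TOKENS = 1024  # threshold quoted from OpenAI's cookbook
CACHE_INCREMENT = 128    # cache hits occur in 128-token increments


def max_cacheable_tokens(stable_prefix_tokens: int) -> int:
    """Best-case cached_tokens for a prefix of this length (simplified model)."""
    if stable_prefix_tokens < CACHE_MIN_TOKENS:
        return 0
    return (stable_prefix_tokens // CACHE_INCREMENT) * CACHE_INCREMENT


def cached_share(prompt_tokens: int, cached_tokens: int) -> float:
    """cached_tokens / prompt_tokens, guarding against empty prompts."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


def diagnose(cached_tokens: int, stable_prefix_tokens: int) -> str:
    """Classify one request against what its stable prefix could have cached."""
    expected = max_cacheable_tokens(stable_prefix_tokens)
    if expected == 0:
        return "not_cacheable"  # stable prefix is below the 1024-token threshold
    if cached_tokens == 0:
        return "miss"           # cold cache, expired cache, or changed prefix
    if cached_tokens < expected:
        return "partial_hit"    # something inside the prefix is changing
    return "ok"
```

For the 7000-token support request with a 5000-token stable prefix, `max_cacheable_tokens(5000)` is 4992, so a healthy warm request shows a cached share of about 0.71. A run of `miss` or `partial_hit` results on a route that should be warm is the signal to look at prefix changes, routing, and expiry.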

The Request Shape That Gets Cache Hits

The best cache layout is stable prefix, controlled tools, dynamic tail, and metadata for debugging. OpenAI's cookbook specifically calls out accidental cache busting from timestamps early in the request, and recommends moving that data to metadata where it will not affect the cache.

A production request should be shaped like this:

JSON
{
  "model": "gpt-5.2",
  "prompt_cache_key": "workspace_42_support_agent",
  "metadata": {
    "request_id": "req_2026_07_03_001",
    "tenant_id": "tenant_42",
    "prompt_prefix_version": "support-agent-prefix-v7"
  },
  "input": [
    {
      "role": "system",
      "content": "Stable agent contract, tool policy, escalation rules, and response format."
    },
    {
      "role": "developer",
      "content": "Stable product policy and durable examples."
    },
    {
      "role": "user",
      "content": "Dynamic user request and fresh retrieval context go last."
    }
  ]
}

The important part is not the literal field order in this illustrative request. The important part is ownership. The platform team owns the prefix version. Product teams can edit it, but they do it through review, evals, and rollout. Runtime code appends live data at the end.

  1. Version the prefix

    Give the stable prefix a version such as support-agent-prefix-v7. Store that version in request metadata and deployment logs. When cache hit rate drops, you need to know whether the prefix changed or traffic changed.

  2. Freeze tool order

    Keep tool definitions and structured schemas identical between calls. If you need to restrict available tools on a turn, use the platform's per-request tool choice controls instead of rebuilding the whole tools array.

  3. Move entropy out of the prefix

    Timestamps, request IDs, live customer fields, and retrieved snippets belong in metadata or the tail. A single early timestamp can make every request look new.

  4. Run an eval with cache telemetry

    Replay a representative traffic sample twice. The first pass warms the cache. The second pass should show the expected cached_tokens pattern. If answer quality changes when you stabilize the prefix, the prompt was carrying hidden dynamic behavior and needs refactoring.
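Step 3 lends itself to an automated check. The sketch below is a rough lint that scans prefix text for values that tend to change per request. The patterns are illustrative assumptions, not a complete list: ISO-style timestamps, UUIDs, and request IDs shaped like the `req_...` example used in this article.

```python
import re

# Hypothetical patterns for per-request values that should not sit in a stable prefix.
ENTROPY_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "request_id": re.compile(r"\breq_[A-Za-z0-9_]+\b"),
}


def find_prefix_entropy(prefix_text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in the prefix."""
    findings = []
    for name, pattern in ENTROPY_PATTERNS.items():
        for match in pattern.finditer(prefix_text):
            findings.append((name, match.group(0)))
    return findings


clean = find_prefix_entropy("You are a support agent. Follow the refund policy.")
dirty = find_prefix_entropy("You are a support agent. Current time: 2026-07-03T10:15:00Z.")
```

`clean` is an empty list, while `dirty` reports the embedded timestamp. Running a check like this in CI against the versioned prefix catches the most common accidental cache busters before the replay eval in step 4.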

This is where prompt caching touches OpenAI tracing telemetry. A trace that shows tool calls but not cached-token behavior is incomplete for cost and latency work.

Use prompt_cache_key Like A Shard Key

prompt_cache_key is useful when many requests share a prefix but need better routing locality. OpenAI says requests are routed to inference engines based on a hash of the first ~256 tokens, and prompt_cache_key is combined with that hash to increase routing stickiness. OpenAI also says one coding customer improved cache hit rate from 60% to 87% after using prompt_cache_key.

The key is not magic. OpenAI says inference engines can handle roughly 15 requests per minute per prefix plus prompt_cache_key combination. If a single key gets thousands of matching requests, traffic spreads across more machines and each new machine starts with a cache miss. If keys are too narrow, similar traffic never meets the same cache.

Use a key that matches how your prefix repeats:

| Workload | Good starting key | Why |
| --- | --- | --- |
| Coding agent per repository | repo_id or repo_id:agent_mode | The stable prefix is repo instructions, tools, and environment policy. |
| Customer support agent | tenant_id:agent_version | Product policy and tool set repeat inside a tenant. |
| Internal analyst agent | workspace_id:workflow_type | The stable context is workflow and data-access policy. |
| High-volume public chat | Hashed bucket plus prefix version | Keeps each prefix plus key near the routing sweet spot. |

The production mistake is using user_id everywhere because it feels safe. Per-user keys can work when each user has enough repeated traffic. They fail when users send sparse requests and the cache never warms. Conversely, one global key can fail when volume overwhelms a prefix plus key combination.

Start with the smallest key that still creates repeated traffic, then watch hit rate and request rate together.
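For the high-volume case, the "hashed bucket plus prefix version" key can be sketched as follows. This is one possible scheme, not an OpenAI recommendation: it assumes you can estimate requests per minute for the route, sizes the bucket count so each key lands near the roughly 15 requests per minute figure cited above, and uses hypothetical names throughout (`bucket_count`, `build_prompt_cache_key`, `routing_id`).

```python
import hashlib
import math

TARGET_RPM_PER_KEY = 15  # approximate per-key figure cited above


def bucket_count(expected_rpm: float) -> int:
    """Number of buckets needed so each key stays near the target rate."""
    return max(1, math.ceil(expected_rpm / TARGET_RPM_PER_KEY))


def build_prompt_cache_key(prefix_version: str, routing_id: str, expected_rpm: float) -> str:
    """Map a caller (user, session, tenant) to a stable bucket for this prefix version."""
    buckets = bucket_count(expected_rpm)
    # hashlib gives the same result in every process, unlike Python's built-in hash().
    digest = hashlib.sha256(routing_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return f"{prefix_version}:b{bucket}"


low_volume_key = build_prompt_cache_key("support-agent-prefix-v7", "user_123", expected_rpm=10)
high_volume_key = build_prompt_cache_key("support-agent-prefix-v7", "user_123", expected_rpm=300)
```

At 10 requests per minute everything shares one key (`support-agent-prefix-v7:b0`); at 300 the traffic is spread over 20 buckets. Changing the bucket count reshuffles callers across keys, so treat it as a deploy-time setting and expect a brief dip in hit rate when it changes.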

What To Log Before Production

Prompt caching must be observable per request. OpenAI says all requests display cached_tokens in usage.prompt_tokens_details for the Response or Chat object. In the Responses API the same counter is reported under usage.input_tokens_details. Log it beside normal token, latency, and model fields.

Minimum fields:

JSON
{
  "request_id": "req_2026_07_03_001",
  "model": "gpt-5.2",
  "prompt_prefix_version": "support-agent-prefix-v7",
  "prompt_cache_key_hash": "f3a9",
  "prompt_tokens": 7000,
  "cached_tokens": 5120,
  "completion_tokens": 420,
  "ttft_ms": 740,
  "route": "support_agent_answer",
  "eval_pack": "support-cache-regression-v3"
}
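A minimal sketch of how the token fields in that record might be derived from a response. It assumes the usage object has already been converted to a plain dict (for example via the SDK model's `model_dump()`), and it accepts both naming schemes: `prompt_tokens` / `prompt_tokens_details` from Chat Completions and `input_tokens` / `input_tokens_details` from the Responses API. The helper name `cache_log_fields` is hypothetical.

```python
def cache_log_fields(usage: dict) -> dict:
    """Normalize token usage from either API shape into flat log fields."""
    details = usage.get("input_tokens_details") or usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("input_tokens", usage.get("prompt_tokens", 0))
    completion_tokens = usage.get("output_tokens", usage.get("completion_tokens", 0))
    cached_tokens = details.get("cached_tokens", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached_tokens,
        "uncached_prompt_tokens": prompt_tokens - cached_tokens,
        "completion_tokens": completion_tokens,
        "cached_share": round(cached_tokens / prompt_tokens, 4) if prompt_tokens else 0.0,
    }


# Usage dicts shaped like the two API surfaces, with the numbers from the record above.
chat_usage = {
    "prompt_tokens": 7000,
    "completion_tokens": 420,
    "prompt_tokens_details": {"cached_tokens": 5120},
}
responses_usage = {
    "input_tokens": 7000,
    "output_tokens": 420,
    "input_tokens_details": {"cached_tokens": 5120},
}

chat_fields = cache_log_fields(chat_usage)
responses_fields = cache_log_fields(responses_usage)
```

Both calls produce the same flat fields, so dashboards and alerts do not need to know which endpoint served the request. Merge the result with the request ID, route, prefix version, and latency fields before writing the log line.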

Alert on the operational signals, not novelty:

  • Cached-token share drops for a route after deploy.
  • p95 time to first token rises while output length stays flat.
  • Prompt prefix version changes without an eval run.
  • Tool schema hash changes outside a release.
  • Cost per successful run rises while traffic mix is stable.

For reasoning workloads, the API surface also matters. OpenAI's cookbook says internal benchmarks show 40-80% better cache utilization on requests with the Responses API compared with Chat Completions when reasoning tokens are persisted between turns. That does not mean every old endpoint migration pays for itself. It means reasoning-heavy workflows need endpoint choice in the cache analysis, not only prompt text.

Where Prompt Caching Does Not Save You

Prompt caching reduces repeated-prefix cost and latency. It does not decide what knowledge belongs in the prompt, which model should answer, which tools are safe, or whether the output is correct.

Three boundaries matter:

  1. Caching is not RAG. If the answer depends on fresh, permissioned, or user-specific knowledge, retrieval still owns that decision. Caching may make the stable retrieval instructions cheaper, but the retrieved chunks are usually dynamic tail content.
  2. Caching is not model routing. A cheap cached prefix on the wrong model is still the wrong route. Routing decides capability, latency target, and fallback policy.
  3. Caching is not compaction. OpenAI's cookbook is clear that context engineering and prompt caching can pull against each other. Dropping, summarizing, or compacting earlier turns can break the cache because the prefix changes.

Realtime systems add another version of the same problem. OpenAI says the Realtime API currently has a 32k context window, and with 4,096 max output tokens it can include 28,224 tokens in context before truncation. If automatic truncation shifts the front of the conversation every turn, cache behavior can degrade exactly when the session gets long and expensive.

The durable production pattern is simple: cache stable contracts, retrieve volatile knowledge, route by workload, compact deliberately, and evaluate the whole loop.

Does OpenAI prompt caching require code changes?

Basic prompt caching works automatically on API requests, but production teams still change code around it. You need stable prefix construction, metadata instead of early dynamic fields, cached_tokens logging, and usually a deliberate prompt_cache_key.

How many tokens are needed for OpenAI prompt caching?

OpenAI's cookbook says cache hits work for prompts containing 1024 tokens or more, with hits occurring in 128-token increments. A 900-token prompt does not cross the cache threshold.

How do you know whether a request used the prompt cache?

Inspect usage.prompt_tokens_details.cached_tokens on the response (usage.input_tokens_details.cached_tokens in the Responses API). Track cached share by route, model, prefix version, and cache key so deploys cannot silently reset the economics.

Is OpenAI prompt caching a replacement for RAG?

No. Prompt caching optimizes repeated input prefixes. RAG decides which external documents enter the prompt, applies permissions, and keeps answers grounded in current source material.

Last Updated

Jul 3, 2026

Category: Stack