Context Engineering vs Prompt Engineering for Production Agents
Context engineering is the production control plane for agents. Learn when prompts matter, what context layers to ship, and what to log before traffic.

Prompt engineering decides what to say to the model. Context engineering decides what the model is allowed to know, remember, retrieve, and do at each step, so production agents should start with context first and tune prompts second.
The Verdict: Build Context First, Tune Prompts Second
Context engineering is the production control plane; prompt engineering is one layer inside it. A prompt can shape tone, output format, refusal rules, and task framing. It cannot reliably decide which customer record is current, which tool is safe to call, which memory should persist, which retrieved document is stale, or which approval gate must pause the run.
Anthropic defines context as the set of tokens included when sampling from a large language model, and frames context engineering as curating and maintaining the useful information inside that finite space. Elastic gives the clean production split: prompt engineering focuses on how you communicate with the model, while context engineering focuses on what information the model can access when it responds.
For a real agent, the build order is blunt:
- Define the agent's job and risk boundary.
- Build the context pipeline that supplies facts, tools, state, memory, and approvals.
- Add prompts that tell the model how to use that context.
- Log and evaluate every context layer before expanding traffic.
The Difference That Matters In Production
The production difference is failure diagnosis. If a chatbot gives a vague answer, prompt wording may be the right fix. If an agent calls the wrong API, forgets a prior approval, cites stale policy, repeats a tool call, or exposes data from the wrong tenant, the prompt is usually where the symptom appears, not where the bug lives.
Elastic's failure-mode list is the practical checklist: too little information, too much information, and distracting or conflicting information. Those are context bugs. They show up as hallucination, context overflow, slow runs, wrong retrieval, unsafe tool choice, and brittle long conversations. Anthropic adds the attention-budget reason: in a transformer every token can attend to every other token, which creates n² pairwise relationships for n tokens, so attention is spread thinner as context grows. Bigger windows help, but they do not remove the need to curate.
A support agent makes the difference concrete. The prompt can say:
You are a careful support agent. Answer from policy. Do not promise refunds unless policy allows it.
That is necessary, but not enough. The releasable version also needs:
- Current policy documents selected by product, region, and account tier.
- Account state fetched through a typed tool, not copied into a giant prompt.
- Recent ticket history summarized into durable state.
- A memory rule for user preferences that may persist.
- An approval gate before refunds, cancellations, or account changes.
- Trace output that shows which policy, tool result, and approval state the answer used.
When the agent fails, each layer gives you a different fix. Bad tone is a prompt fix. Wrong policy is retrieval or metadata. Wrong account is authorization and tool scoping. Repeated tool calls are tool descriptions, state updates, or stopping criteria. Unsafe action is approval design. Missing audit trail is tracing.
That is why context engineering should own the release architecture. Prompt engineering still matters, but it should not be used as a container for every rule the system forgot to model.
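The failure-to-layer mapping above can be written down as a small lookup, which is sometimes handy in incident review tooling. This is a minimal sketch: the symptom keys and the `layer_to_fix` helper are hypothetical names chosen for illustration, not part of any framework.

```python
# Hypothetical triage table: the symptom keys and layer labels are
# illustrative names, not part of any framework.
FAILURE_TO_LAYER = {
    "bad_tone": "prompt",
    "wrong_policy": "retrieval or metadata",
    "wrong_account": "authorization and tool scoping",
    "repeated_tool_calls": "tool descriptions, state updates, or stopping criteria",
    "unsafe_action": "approval design",
    "missing_audit_trail": "tracing",
}


def layer_to_fix(symptom: str) -> str:
    """Return the layer to inspect first for a known symptom."""
    # Unknown symptoms fall through to manual triage instead of guessing.
    return FAILURE_TO_LAYER.get(symptom, "manual triage")


if __name__ == "__main__":
    print(layer_to_fix("wrong_policy"))
```

The table is deliberately small; its value is that a reviewer has to name a layer before anyone edits the prompt.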
The Minimum Viable Context Stack
A production agent needs a small, explicit context stack. The stack should be boring enough to inspect in logs and narrow enough that an engineer can explain why every layer exists.
Stable Instructions
Stable instructions define role, scope, output contract, and refusal behavior. Anthropic recommends clear, direct system prompts at the right altitude: specific enough to guide behavior, but not a brittle if-else program hiding inside prose. Keep durable rules here. Keep volatile data out.
Good instruction:
Use only retrieved policy sections and account tool output for customer-specific answers.
Ask for human approval before changing plan, refund, deletion, or billing state.
Return the answer, cited policy IDs, proposed action, and approval requirement.
Weak instruction:
Be careful with refunds and remember all policy exceptions.
The weak version tells the agent to be careful but does not define what careful means in code, logs, or review.
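One way to make the good instruction checkable is to mirror its output contract in a typed structure and validate it in code. The sketch below is an assumption about how that could look: `AgentAnswer`, `GATED_ACTIONS`, and `contract_violations` are hypothetical names, and the action labels are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed action labels for the gated changes named in the instruction.
GATED_ACTIONS = {"change_plan", "refund", "delete", "change_billing"}


@dataclass
class AgentAnswer:
    """Output contract: answer, cited policy IDs, proposed action, approval flag."""
    answer: str
    cited_policy_ids: List[str] = field(default_factory=list)
    proposed_action: Optional[str] = None
    approval_required: bool = False


def contract_violations(result: AgentAnswer) -> List[str]:
    """List the ways a result breaks the contract; an empty list means it passes."""
    problems = []
    if not result.cited_policy_ids:
        problems.append("no policy IDs cited")
    if result.proposed_action in GATED_ACTIONS and not result.approval_required:
        problems.append("gated action proposed without approval requirement")
    return problems


if __name__ == "__main__":
    risky = AgentAnswer(answer="Refund issued.", proposed_action="refund")
    print(contract_violations(risky))
```

A check like this gives "careful" a definition that shows up in logs and review, which the weak instruction cannot do.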
Runtime State The Model Does Not See
OpenAI Agents SDK separates local context from LLM-visible context. Local context is code-side state passed through RunContextWrapper, useful for dependencies, user IDs, loggers, data fetchers, approval state, and usage tracking. The docs are explicit that this context object is not sent to the LLM.
That boundary is useful. Tenant IDs, auth scopes, database handles, and raw secrets should live in local state. The model can request an action through a tool, but the tool decides what local state permits.
from dataclasses import dataclass
@dataclass
class AgentRunContext:
    """Code-side state for one run; it is not sent to the model."""
    tenant_id: str
    user_id: str
    approval_state: str
    policy_version: str
# Reminders of the boundary between local state and model-visible context.
LOCAL_CONTEXT_RULE = "Local context belongs to code and tools."
MODEL_CONTEXT_RULE = "Model-visible context receives only safe, task-relevant results."
Model-Visible Context
OpenAI's docs list the model-visible surfaces plainly: agent instructions, run input, function tools, retrieval, and web search. Treat those as separate layers, not one giant string.
- Instructions: stable behavior and output contract.
- Run input: the user's task and safe request metadata.
- Tools: on-demand context and actions.
- Retrieval: source-grounded knowledge selected for the current step.
- Web search: live external facts when the product permits it.
The release rule is to log each layer before the model call. If the final answer is wrong, you need to know whether the wrong document was retrieved, the right document was dropped, the tool returned noisy output, or the prompt asked for the wrong shape.
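As a rough illustration of logging each layer before the model call, the sketch below builds a compact snapshot of the instructions, run input, tools, and retrieved documents. The function name, field names, and sample values are all hypothetical, and web search is omitted for brevity. It records IDs and versions rather than full document text.

```python
import hashlib
import json
from typing import Dict, List


def context_snapshot(instructions: str, run_input: str,
                     tool_names: List[str], retrieved: List[Dict]) -> Dict:
    """Summarize each model-visible layer without logging full document text."""
    return {
        # A short hash identifies the instruction version without storing it.
        "instruction_hash": hashlib.sha256(instructions.encode("utf-8")).hexdigest()[:12],
        "run_input_chars": len(run_input),
        "tools": sorted(tool_names),
        "retrieved": [{"id": d["id"], "version": d["version"]} for d in retrieved],
    }


snapshot = context_snapshot(
    instructions="Use only retrieved policy sections.",
    run_input="Can I get a refund?",
    tool_names=["get_refund_policy", "get_account_summary"],
    retrieved=[{"id": "POL-7", "version": "2024-03", "text": "..."}],
)

if __name__ == "__main__":
    # Emit this record before the model call so a bad run can be diagnosed later.
    print(json.dumps(snapshot, indent=2))
```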
Memory
Memory is governed state, not chat history with a nicer name. LangGraph separates short-term thread-scoped memory from long-term cross-session memory. Short-term memory tracks the active conversation through state and checkpoints; long-term memory stores user-specific or application-level data across sessions.
For a custom agent, this is the safe starting policy:
- Short-term memory can hold the active task, open questions, last tool outputs, and pending approvals.
- Long-term memory can hold stable preferences, validated facts, and reusable workflow notes.
- No memory write happens unless the value has a source, scope, expiry rule, and deletion path.
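The last rule in that policy can be enforced mechanically. Here is a minimal sketch, assuming a hypothetical `MemoryWrite` record; the field names and sample values are illustrative and not taken from LangGraph or any other library.

```python
from dataclasses import dataclass
from typing import List, Optional

# Every long-term write must carry these four governance fields.
REQUIRED_FIELDS = ("source", "scope", "expires_at", "deletion_path")


@dataclass
class MemoryWrite:
    key: str
    value: str
    source: Optional[str] = None         # where the value came from
    scope: Optional[str] = None          # who or what it applies to
    expires_at: Optional[str] = None     # expiry rule, for example an ISO date
    deletion_path: Optional[str] = None  # how the value can be removed


def missing_fields(write: MemoryWrite) -> List[str]:
    """Return the governance fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(write, name)]


def accept_write(write: MemoryWrite) -> bool:
    """Accept a write only when all four governance fields are present."""
    return not missing_fields(write)


if __name__ == "__main__":
    candidate = MemoryWrite(key="contact_pref", value="email", source="ticket-1842")
    print(accept_write(candidate), missing_fields(candidate))
```

Rejected candidates are worth logging too, since they show what the agent tried to remember.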
For deeper memory design, the separate DVNC.dev piece on agent memory as governed state covers what to store, what to forget, and which evals catch unsafe recall.
Tools And Retrieval
Anthropic calls out bloated, overlapping tool sets as a common failure mode. If a human engineer cannot say which tool should be used for a case, the model will not magically make that ambiguity safe.
Ship fewer tools with tighter contracts:
{
  "name": "get_refund_policy",
  "input": {
    "product_id": "string",
    "region": "string",
    "account_tier": "string"
  },
  "output": {
    "policy_id": "string",
    "policy_version": "string",
    "eligible": "boolean",
    "approval_required": "boolean",
    "summary": "string"
  }
}
This output is more useful than dumping a full policy page into the model. It carries the decision fields, source ID, version, and approval requirement the rest of the run needs.
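To show what the contract above might look like behind the schema, here is a hedged Python sketch. The in-memory `_POLICIES` table and its sample row are invented for illustration; a real tool would query a policy service, and registering it with an agent framework is left out.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class RefundPolicyResult:
    policy_id: str
    policy_version: str
    eligible: bool
    approval_required: bool
    summary: str


# Hypothetical policy table keyed by (product_id, region, account_tier).
_POLICIES = {
    ("prod-basic", "EU", "standard"): RefundPolicyResult(
        policy_id="POL-7",
        policy_version="2024-03",
        eligible=True,
        approval_required=True,
        summary="Refund allowed within 14 days of purchase.",
    ),
}


def get_refund_policy(product_id: str, region: str, account_tier: str) -> Dict:
    """Return only the decision fields, never the full policy page."""
    key = (product_id, region, account_tier)
    result = _POLICIES.get(key)
    if result is None:
        # Fail loudly so the agent cannot improvise a policy.
        raise LookupError(f"no refund policy for {key}")
    return asdict(result)


if __name__ == "__main__":
    print(get_refund_policy("prod-basic", "EU", "standard"))
```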
A Rollout Pattern For One Production Agent
The safest first release is one narrow workflow with visible context assembly. Do not start by giving the agent every document, every tool, and a prompt that says "be accurate." Start with the minimum context that can pass a review.
Pick one workflow
Choose a workflow with a clear success state, bounded inputs, and obvious unsafe actions. A refund triage agent is a better first release than a general customer support agent because policy lookup, account state, approval, and final response can be tested directly.
Define the context contract
Write the layers before writing the prompt: request metadata, retrieved policy, account tool output, short-term task state, long-term memory candidates, approval state, and final answer schema. Each layer needs an owner and a log field.
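A context contract can start as plain data that a review can read. In this sketch the layer names come from the paragraph above, while the owner labels and log field names are placeholders a team would replace with its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContextLayer:
    name: str
    owner: str      # placeholder team label
    log_field: str  # placeholder key under which the layer is logged


CONTRACT = [
    ContextLayer("request_metadata", "platform", "request_meta"),
    ContextLayer("retrieved_policy", "knowledge", "retrieved_sources"),
    ContextLayer("account_tool_output", "billing", "tool_calls"),
    ContextLayer("short_term_task_state", "agent", "task_state"),
    ContextLayer("long_term_memory_candidates", "agent", "memory_events"),
    ContextLayer("approval_state", "support_ops", "approval_state"),
    ContextLayer("final_answer_schema", "agent", "final_answer"),
]


def incomplete_layers(contract: List[ContextLayer]) -> List[str]:
    """Return the names of layers that lack an owner or a log field."""
    return [layer.name for layer in contract if not layer.owner or not layer.log_field]


if __name__ == "__main__":
    print(incomplete_layers(CONTRACT))
```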
Separate local state from model-visible state
Keep tenant IDs, auth scopes, raw account records, secrets, and dependency handles in local runtime context. Pass only safe summaries and tool outputs to the model. If the model needs more, it asks through a typed tool.
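The following sketch illustrates that split in framework-neutral Python. `LocalContext`, the `accounts:read` scope string, and the `_ACCOUNTS` store are all hypothetical. The point is that the tenant comes from code-side state, never from model-supplied arguments, and only a safe summary is returned.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class LocalContext:
    """Code-side state; the model never sees these values."""
    tenant_id: str
    user_id: str
    scopes: FrozenSet[str]


# Hypothetical raw records keyed by (tenant_id, account_id).
_ACCOUNTS = {
    ("tenant-a", "acct-9"): {"plan": "pro", "status": "active", "billing_token": "tok_secret"},
}


def get_account_summary(ctx: LocalContext, account_id: str) -> Dict:
    """Tool body: the model supplies account_id, local state decides access."""
    if "accounts:read" not in ctx.scopes:
        return {"error": "not_permitted"}
    # The tenant comes from local context, so a model cannot ask across tenants.
    record = _ACCOUNTS.get((ctx.tenant_id, account_id))
    if record is None:
        return {"error": "not_found"}
    # Return a safe summary only; raw fields such as billing_token stay local.
    return {"plan": record["plan"], "status": record["status"]}


if __name__ == "__main__":
    ctx = LocalContext("tenant-a", "user-1", frozenset({"accounts:read"}))
    print(get_account_summary(ctx, "acct-9"))
```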
Add retrieval with source IDs
Retrieve fewer documents and carry their IDs, versions, timestamps, and permission scopes into the answer path. Context engineering is not "more RAG." It is deciding which retrieved facts deserve to enter the run at all.
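A simple admission filter makes that decision explicit. This is a sketch under assumptions: the `PolicyDoc` fields, the 365-day freshness window, and the limit of two documents are illustrative defaults, not recommendations from any cited source.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List


@dataclass(frozen=True)
class PolicyDoc:
    doc_id: str
    version: str
    effective: date
    scopes: FrozenSet[str]  # permission scopes allowed to read the document
    score: float            # ranking score from the retriever


def select_sources(docs: List[PolicyDoc], user_scope: str, today: date,
                   max_age_days: int = 365, limit: int = 2) -> List[PolicyDoc]:
    """Admit only in-scope, fresh documents, best-ranked first."""
    admitted = [
        d for d in docs
        if user_scope in d.scopes
        and 0 <= (today - d.effective).days <= max_age_days
    ]
    return sorted(admitted, key=lambda d: d.score, reverse=True)[:limit]


if __name__ == "__main__":
    candidates = [
        PolicyDoc("POL-1", "v3", date(2024, 1, 10), frozenset({"eu"}), 0.90),
        PolicyDoc("POL-2", "v1", date(2020, 1, 1), frozenset({"eu"}), 0.95),  # stale
        PolicyDoc("POL-3", "v2", date(2024, 2, 1), frozenset({"us"}), 0.99),  # wrong scope
    ]
    chosen = select_sources(candidates, "eu", today=date(2024, 6, 1))
    print([d.doc_id for d in chosen])
```

Note that the highest-scoring candidates are the ones rejected here, which is the case a pure similarity ranking would get wrong.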
Put risky actions behind approval
Refunds, cancellations, data deletion, billing changes, outbound messages, and admin actions should produce proposed actions, not direct execution, until the workflow has enough evidence to lower the gate.
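The gate can be sketched as a function that turns risky requests into pending proposals. The action kinds and status strings below are hypothetical labels, and the execution step is stubbed out; a real system would call its side-effecting code where the comment indicates.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed labels for the risky actions listed above.
RISKY_KINDS = {"refund", "cancel", "delete_data", "change_billing",
               "send_message", "admin"}


@dataclass
class ProposedAction:
    kind: str
    args: Dict = field(default_factory=dict)
    status: str = "proposed"


def submit(action: ProposedAction, approved_by: Optional[str] = None) -> ProposedAction:
    """Execute low-risk actions; park risky ones until a human approves."""
    if action.kind in RISKY_KINDS and approved_by is None:
        action.status = "pending_approval"
        return action
    # Stub: a real system would perform the side effect here.
    action.status = "executed"
    return action


if __name__ == "__main__":
    refund = ProposedAction("refund", {"order_id": "ord-1", "amount": 20})
    print(submit(refund).status)
    print(submit(refund, approved_by="agent-lead").status)
```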
Evaluate context before output prose
Score whether the right source was retrieved, the tool choice was correct, the memory write was justified, the approval gate fired, and the final answer cited the expected facts. Only then tune the prompt wording.
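Those checks can be scored per layer before anyone reads the prose. The sketch below assumes a hypothetical trace shape: the dictionary keys for `run` and `expected` are invented for this example and would need to match whatever your tracing actually records.

```python
from typing import Dict


def score_run(run: Dict, expected: Dict) -> Dict[str, bool]:
    """Score one traced run against an eval case, one boolean per layer."""
    return {
        "retrieval": expected["source_id"] in run["retrieved_source_ids"],
        "tool_choice": run["tools_called"] == expected["tools_called"],
        # Any memory write outside the allowed set counts as unjustified.
        "memory": set(run["memory_writes"]) <= set(expected["allowed_memory_writes"]),
        "approval": run["approval_fired"] == expected["approval_required"],
        "citations": set(expected["cited_ids"]) <= set(run["cited_ids"]),
    }


if __name__ == "__main__":
    run = {
        "retrieved_source_ids": ["POL-7"],
        "tools_called": ["get_refund_policy"],
        "memory_writes": ["contact_pref"],
        "approval_fired": False,
        "cited_ids": ["POL-7"],
    }
    expected = {
        "source_id": "POL-7",
        "tools_called": ["get_refund_policy"],
        "allowed_memory_writes": [],
        "approval_required": True,
        "cited_ids": ["POL-7"],
    }
    failed = [layer for layer, ok in score_run(run, expected).items() if not ok]
    print(failed)
```

A failing layer name points at the component to fix, which keeps prompt edits as the last resort.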
This rollout pattern also gives you a clean framework decision. If the agent is mostly an OpenAI-native run with tracing and function tools, OpenAI Agents SDK may be enough. If the system needs durable graph state, branching, manual interrupts, and provider flexibility, the framework decision starts to look more like OpenAI Agents SDK vs LangGraph for production agents. The point is not the logo. The point is whether the runtime makes the context path inspectable.
What To Log And Evaluate Before Traffic
Context engineering is incomplete until a bad run can be replayed. OpenAI Agents SDK includes built-in tracing for LLM generations, tool calls, handoffs, guardrails, and custom events during an agent run. That is the right category of artifact: not just the final answer, but the path that produced it.
Log these fields for every production run:
- workflow_name, trace_id, tenant_id, and user-safe request metadata.
- Prompt version and instruction hash.
- Retrieved source IDs, versions, scopes, and ranking scores.
- Tool names, arguments after validation, output summaries, and errors.
- Local approval state and whether a human approved, rejected, or edited the action.
- Memory reads, memory write candidates, accepted writes, rejected writes, and expiry rules.
- Token usage and cost by context layer when your runtime exposes it.
- Final answer, citations, action proposal, and safety outcome.
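The fields above fit in one structured record per run. This is a minimal sketch with hypothetical field names and sample values; it is independent of any tracing product and only shows the shape of the record.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class RunLog:
    workflow_name: str
    trace_id: str
    tenant_id: str
    prompt_version: str
    instruction_hash: str
    retrieved_sources: List[Dict] = field(default_factory=list)
    tool_calls: List[Dict] = field(default_factory=list)
    approval_state: str = "not_required"
    memory_events: List[Dict] = field(default_factory=list)
    token_usage: Optional[Dict] = None  # only when the runtime exposes it
    final_answer: str = ""
    safety_outcome: str = "unknown"


def to_log_line(log: RunLog) -> str:
    """Serialize one run as a single JSON line with stable key order."""
    return json.dumps(asdict(log), sort_keys=True)


if __name__ == "__main__":
    log = RunLog(
        workflow_name="refund_triage",
        trace_id="trace-001",
        tenant_id="tenant-a",
        prompt_version="v4",
        instruction_hash="ab12cd34ef56",
        retrieved_sources=[{"id": "POL-7", "version": "2024-03", "score": 0.91}],
        approval_state="approved",
    )
    print(to_log_line(log))
```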
The sensitive-data rule matters. OpenAI's tracing docs note that generation spans and function spans may capture inputs and outputs, and provide trace_include_sensitive_data to control that capture. They also note tracing is unavailable for organizations operating under Zero Data Retention. For regulated workflows, build your trace plan before launch, not after the first incident review.
The eval set should mirror the same layers: retrieval, tool choice, memory writes, approval gates, and final-answer citations.
If those evals fail, do not start by rewriting the prompt. Fix the retrieval scope, tool schema, approval state, memory policy, or context budget that produced the bad run.
Where Prompt Engineering Still Wins
Prompt engineering still earns its place after the context path is reliable. It is the right tool for output shape, tone, role boundaries, refusal style, examples, and task-specific heuristics.
Use prompt work for:
- Final answer schema.
- Citation requirements.
- Tone and audience.
- Escalation language.
- Refusal behavior.
- Few-shot examples of the final output.
- Short instructions that tell the model how to use tool output.
Do not use prompt work for:
- Copying full policy manuals into every request.
- Encoding tenant permissions as prose.
- Hiding approval logic inside a system prompt.
- Remembering user facts without a memory policy.
- Explaining a huge overlapping tool set.
- Compensating for retrieval that returns stale or conflicting sources.
The practical line is simple. If the failure can be fixed by clearer instruction, tune the prompt. If the failure needs different facts, state, tools, memory, permissions, or logs, it is context engineering.
What is the difference between prompt and context?
Prompt is the instruction layer: how the task is phrased, what format to return, and which behavior to follow. Context is the information layer: retrieved facts, tools, tool outputs, memory, state, approvals, and conversation history the model can use while following those instructions.
What is replacing prompt engineering?
Context engineering is not replacing prompt engineering. It wraps prompt work inside a broader runtime system, because production agents need current facts, scoped tools, durable state, memory policy, approvals, and trace output in addition to clear instructions.
What is an example of context engineering?
A support agent that retrieves the right policy section, fetches account state through a typed tool, reads short-term task state, checks long-term memory policy, pauses for approval before account changes, and logs every source before answering is using context engineering.
What are the four pillars of context engineering?
For production agents, use a practical four-part model: instructions, retrieved facts, runtime state, and tool outputs. Memory, approvals, and traces govern what persists, what can act, and what can be audited.
Should a team build context engineering or prompt engineering first?
Build the minimum context path first for any agent that retrieves, remembers, uses tools, or touches user state. Tune prompt wording after retrieval, tool choice, approval behavior, memory writes, and trace quality pass evals.
Jun 24, 2026




