OpenAI Agents SDK vs LangGraph for Production Agents
Choose OpenAI Agents SDK for OpenAI-native loops. Choose LangGraph when durable graph state, provider freedom, and custom control matter.

Use OpenAI Agents SDK when your production agent is mostly an OpenAI-native loop with managed traces, approvals, MCP tools, and evals. Use LangGraph when the agent is really a durable state machine that must survive pauses, retries, provider changes, and custom deployment constraints.
The Verdict
OpenAI Agents SDK is the faster production choice when your app is already betting on the OpenAI platform and the agent loop is bounded enough for the SDK to own the useful surfaces: orchestration, tool execution, guardrails, approvals, resumable state, traces, and evals.
LangGraph is the better choice when the agent is not just "a model with tools." It is the right fit when you need explicit graph state, durable checkpoints, a persistent thread_id, provider flexibility, custom retry semantics, and human review that can pause a workflow indefinitely without losing context.
The practical line is simple:
- Choose OpenAI Agents SDK for OpenAI-first assistants, internal copilots, review queues, MCP-backed tools, customer-support triage, and workflows where built-in traces and evals get you to production faster.
- Choose LangGraph for long-running research agents, multi-step operations workflows, custom planning loops, multi-provider routing, and agents where the graph itself is the product boundary.
Do not choose either because the demo is shorter. Choose the one whose runtime boundary matches the thing that will break first in production: state, approvals, cost, deployment, trace quality, or provider control.

The Axis That Separates Them
The real difference is not "simple versus complex." The real difference is who owns durable state.
OpenAI Agents SDK treats an agent run as a structured platform loop. The result surface gives you finalOutput (TypeScript) or final_output (Python), replay-ready history, the last specialist agent, the stored response ID for continuation, and, in approval flows, interruptions plus state (TypeScript) or to_state() (Python) for a resumable snapshot. That is enough for many production apps: pause before a risky tool call, show the pending approval in your UI, approve or reject, and pass the saved state back into the runtime.
LangGraph treats state as the center of the architecture. Its persistence layer saves graph state as checkpoints at every step of execution, organized into threads. A thread_id becomes the pointer for loading state and resuming execution. Checkpoints enable human review, conversational memory, time travel debugging, and fault-tolerant execution.
That distinction changes the production design:
- In OpenAI Agents SDK, the run result is the thing your app stores, audits, and resumes.
- In LangGraph, the graph checkpoint is the thing your app stores, audits, and resumes.
If your team mostly needs "pause before this refund, shell command, private MCP tool, or account edit," OpenAI Agents SDK is probably enough. If your team needs "resume this research graph next week from the last successful super-step, fork it, inspect state, and swap providers," LangGraph is the cleaner foundation.
For the observability stack around either choice, the companion decision is covered in Langfuse vs LangSmith for production observability. For tool boundaries, pair this with the MCP vs function calling production decision rule.
Production Comparison
The right framework is the one that makes your control plane smaller, not the one with the most flexible demo.
OpenAI Agents SDK: Best for OpenAI-Native Control Loops
OpenAI Agents SDK is the better default when your production app wants an agent loop without inventing a graph runtime.
In the OpenAI model, agents are applications that plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work. The important production phrase is "applications." Your server still owns the workflow boundary. The SDK gives you the primitives inside that boundary: running agents, orchestration, guardrails, results and state, integrations and observability, and agent workflow evals.
The clean OpenAI Agents SDK architecture looks like this:
- Your app receives a user or system request.
- The SDK runs a primary agent with narrow instructions and scoped tools.
- Handoffs move control to a specialist only when that branch should own the next response.
- Agents-as-tools let a manager agent call specialists while keeping ownership of the final answer.
- Guardrails block invalid input, output, or tool behavior before it leaves the system.
- Human review pauses risky side effects like cancellations, edits, shell commands, or sensitive MCP actions.
- Traces capture the workflow, model calls, tool calls, handoffs, guardrails, and custom spans.
- Trace grading and eval datasets turn production failures into repeatable tests.
Start with one owner
Build one primary agent first. Add a specialist only when it gives you policy isolation, tool isolation, clearer prompts, or cleaner traces. If a specialist does not improve one of those four things, it is probably extra routing surface.
Gate the side effect
Use guardrails for automatic checks and human review before sensitive actions. A production refund, account edit, shell command, private MCP call, or data export should pause with a clear approval payload before the tool fires.
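A minimal sketch of that gate in plain Python, independent of either framework. The tool names, the PendingApproval payload, and the ToolGate class are all hypothetical; the only point is that a sensitive call returns an approval payload instead of executing.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical tool names; which tools count as sensitive is a policy decision.
SENSITIVE_TOOLS = {"issue_refund", "edit_account", "run_shell", "export_data"}


@dataclass
class PendingApproval:
    """Payload a reviewer sees before a sensitive tool fires."""
    tool_name: str
    arguments: dict[str, Any]
    requested_by: str


@dataclass
class ToolGate:
    tools: dict[str, Callable[..., Any]]
    pending: list[PendingApproval] = field(default_factory=list)

    def call(self, tool_name: str, arguments: dict[str, Any], requested_by: str):
        """Run safe tools immediately; queue sensitive ones for review."""
        if tool_name in SENSITIVE_TOOLS:
            approval = PendingApproval(tool_name, arguments, requested_by)
            self.pending.append(approval)
            return approval  # nothing has executed yet
        return self.tools[tool_name](**arguments)

    def resolve(self, approval: PendingApproval, approved: bool):
        """Execute the tool only after an explicit approval decision."""
        self.pending.remove(approval)
        if not approved:
            return None
        return self.tools[approval.tool_name](**approval.arguments)
```

In a real system the pending list would live in durable storage, and the payload would carry whatever the reviewer needs to decide without reading the whole conversation.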
Store the resumable surface
When a run pauses, store the pending interruptions, the SDK state or to_state() snapshot, the requesting user, the policy version, and the approval decision. The approval UI should resume the saved state, not replay the whole conversation from memory.
Turn traces into evals
Do not wait for a generic eval system. Start by grading traces for the failure classes you already care about: wrong tool, missing approval, bad handoff, unsafe output, cost spike, or incomplete task.
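As a rough sketch, trace grading can start as a plain function over exported traces. The trace shape below (a dict with steps, cost_usd, and completed) is an assumption for illustration, not the schema of either SDK; map real traces into it first.

```python
from collections import Counter
from typing import Any


def grade_trace(trace: dict[str, Any], allowed_tools: set[str],
                sensitive_tools: set[str], cost_limit_usd: float) -> list[str]:
    """Return the failure classes found in one trace (empty list means pass)."""
    failures = []
    for step in trace["steps"]:
        if step["type"] != "tool_call":
            continue
        if step["tool"] not in allowed_tools:
            failures.append("wrong_tool")
        if step["tool"] in sensitive_tools and not step.get("approved_by"):
            failures.append("missing_approval")
    if trace["cost_usd"] > cost_limit_usd:
        failures.append("cost_spike")
    if not trace["completed"]:
        failures.append("incomplete_task")
    return failures


def failure_counts(traces: list[dict[str, Any]], **policy: Any) -> Counter:
    """Aggregate failure classes across traces to see what to test first."""
    counts: Counter = Counter()
    for trace in traces:
        counts.update(grade_trace(trace, **policy))
    return counts
```

Each failing trace can then be saved as a dataset row, so the same check runs again on the next prompt or model change.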
A support triage agent is a good example. The primary agent classifies the issue, calls a billing specialist as a tool for invoice questions, hands off to an account-security specialist only when that specialist should own the next response, and pauses before account edits. The trace tells you which model call chose the path, which tool arguments were proposed, which guardrail ran, and who approved the side effect.
The failure mode is also clear. If the agent needs to pause for three days, survive an app deploy, resume a multi-branch graph, and keep historical checkpoints that can be forked for debugging, OpenAI Agents SDK becomes only part of the system. You will need more durable workflow state around it, or you should reach for LangGraph earlier.
LangGraph: Best for Durable Agent State Machines
LangGraph is the better default when state is the application, not a side effect of the agent loop.
LangGraph is a low-level orchestration framework and runtime for building, managing, and deploying long-running, stateful agents. It is focused on durable execution, streaming, human-in-the-loop, and persistence. It can be used without LangChain, although LangChain components are common for model and tool integrations.

LangGraph persistence saves graph state as checkpoints at every step of execution. Those checkpoints are organized into threads, and thread_id is the key that tells the runtime which state to load. That design is why LangGraph is strong for long-running workflows: if a node fails, the graph can resume from the last successful boundary instead of recomputing everything. Its pending-write behavior also means successful nodes inside a failed super-step do not need to run again when execution resumes.
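The resume-from-checkpoint idea can be illustrated with a toy model. This is not LangGraph's implementation: CheckpointStore and run are hypothetical names, the store is in memory, and the sketch runs steps sequentially without modeling super-steps or pending writes.

```python
from typing import Any, Callable, Optional

Step = Callable[[dict[str, Any]], dict[str, Any]]


class CheckpointStore:
    """In-memory stand-in for a durable checkpointer, keyed by thread_id."""

    def __init__(self) -> None:
        self._threads: dict[str, list[dict[str, Any]]] = {}

    def save(self, thread_id: str, next_step: int, state: dict[str, Any]) -> None:
        history = self._threads.setdefault(thread_id, [])
        history.append({"next_step": next_step, "state": dict(state)})

    def latest(self, thread_id: str) -> Optional[dict[str, Any]]:
        history = self._threads.get(thread_id)
        return history[-1] if history else None


def run(steps: list[Step], store: CheckpointStore, thread_id: str,
        initial_state: dict[str, Any]) -> dict[str, Any]:
    """Run steps in order, resuming from the last checkpoint if one exists."""
    checkpoint = store.latest(thread_id)
    index = checkpoint["next_step"] if checkpoint else 0
    state = dict(checkpoint["state"]) if checkpoint else dict(initial_state)
    while index < len(steps):
        state = steps[index](state)  # may raise; earlier checkpoints survive
        index += 1
        store.save(thread_id, index, state)
    return state
```

Calling run again with the same thread_id after a failure skips every step that already checkpointed, which is the behavior that makes long workflows cheap to retry.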
Human-in-the-loop works differently in LangGraph. An interrupt() call pauses graph execution, saves state through the persistence layer, and waits indefinitely for external input. Execution resumes when the caller invokes the graph again with Command(resume=...) on the same thread_id. The interrupt payload should be JSON-serializable. In production, the interrupt needs a durable checkpointer and a stable thread_id.
The important production caveat: a graph node resumes from the beginning after an interrupt. Any side effect before the interrupt must be idempotent, or it should move after the interrupt. That is not a footnote. It is the difference between a review queue and a duplicate charge, duplicate ticket, duplicate email, or duplicate database write.
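When a side effect cannot move after the interrupt, one common mitigation is an idempotency key. The sketch below is framework-independent; IdempotentExecutor, create_ticket_node, and the key format are hypothetical, and the dict stands in for a durable store with a unique constraint.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs a side effect at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: production code would use a durable store here,
        # not a process-local dict.
        self._results: dict[str, Any] = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # replayed node: reuse the first result
        result = effect()
        self._results[key] = result
        return result


def create_ticket_node(state: dict[str, Any], executor: IdempotentExecutor,
                       create_ticket: Callable[[str], str]) -> dict[str, Any]:
    """A node body that may be re-entered from the top after a pause."""
    key = f"create_ticket:{state['thread_id']}:{state['ticket_ref']}"
    ticket = executor.run_once(key, lambda: create_ticket(state["ticket_ref"]))
    return {**state, "ticket": ticket}
```

Running the node twice with the same state creates one ticket and returns the same ticket both times.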
Use this LangGraph shape for a long-running research or operations agent:
```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt


class AgentState(TypedDict):
    ticket_id: str
    draft_action: dict
    approved: bool


def propose_action(state: AgentState) -> AgentState:
    # Pure function: no external side effect before approval.
    return {
        **state,
        "draft_action": {
            "type": "account_change",
            "ticket_id": state["ticket_id"],
        },
    }


def approval_gate(state: AgentState) -> AgentState:
    approved = interrupt({
        "kind": "approval_required",
        "ticket_id": state["ticket_id"],
        "draft_action": state["draft_action"],
    })
    return {**state, "approved": bool(approved)}


def commit_action(state: AgentState) -> AgentState:
    if not state["approved"]:
        return state
    # Put the side effect here, after approval.
    return state


builder = StateGraph(AgentState)
builder.add_node("propose_action", propose_action)
builder.add_node("approval_gate", approval_gate)
builder.add_node("commit_action", commit_action)
builder.add_edge(START, "propose_action")
builder.add_edge("propose_action", "approval_gate")
builder.add_edge("approval_gate", "commit_action")
builder.add_edge("commit_action", END)
```
That pattern is more ceremony than a small SDK loop. It earns the ceremony when the run is valuable enough to pause, inspect, resume, replay, and audit at graph boundaries.
The Cost Model That Actually Matters
The framework is rarely the largest line item. Model calls, tool calls, trace retention, approval retries, deployment uptime, and eval volume usually dominate.

As of June 3, 2026, OpenAI lists GPT-5.5 at $5.00 per 1M input tokens, $0.50 per 1M cached input tokens, and $30.00 per 1M output tokens. GPT-5.4 mini is $0.75 per 1M input tokens, $0.075 per 1M cached input tokens, and $4.50 per 1M output tokens. The pricing page states those flagship rates are standard processing prices for context lengths under 270K.
The tool bill matters too. OpenAI web search is listed at $10.00 per 1K calls, with search content tokens free. Containers are listed at $0.03 for 1 GB and $1.92 for 64 GB per container, with the same amounts applying per 20-minute session per container starting March 31, 2026. Batch API saves 50% on inputs and outputs for asynchronous work over 24 hours.
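To make those rates concrete, here is a small per-run estimator. The rates are the ones quoted above and should be re-checked against the current pricing page. The token counts in the usage note are hypothetical, and the function assumes input_tokens means uncached input only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRates:
    """USD per 1M tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# Rates as quoted in this article (June 3, 2026).
GPT_5_5 = ModelRates(5.00, 0.50, 30.00)
GPT_5_4_MINI = ModelRates(0.75, 0.075, 4.50)
WEB_SEARCH_PER_CALL = 10.00 / 1000  # $10.00 per 1K calls


def run_cost(rates: ModelRates, input_tokens: int, cached_tokens: int,
             output_tokens: int, web_searches: int = 0) -> float:
    """Estimated USD cost of one run; input_tokens excludes cached tokens."""
    model_cost = (
        input_tokens * rates.input_per_m
        + cached_tokens * rates.cached_input_per_m
        + output_tokens * rates.output_per_m
    ) / 1_000_000
    return model_cost + web_searches * WEB_SEARCH_PER_CALL
```

For a hypothetical run with 20,000 uncached input tokens, 80,000 cached input tokens, 2,000 output tokens, and 3 web searches, this comes to about $0.23 on GPT-5.5 and about $0.06 on GPT-5.4 mini. On the smaller model, the three searches cost as much as the tokens.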

For LangGraph, the model bill is separate from the LangGraph runtime. If you use LangSmith, the current pricing surface is explicit. Developer is $0 per seat per month with up to 5K base traces per month. Plus is $39 per seat per month with up to 10K base traces per month and one dev-sized agent deployment included. Base traces have 14-day retention and cost $2.50 per 1K traces. Extended traces have 400-day retention and cost $5.00 per 1K traces.
Deployment pricing adds another control-plane cost. LangSmith Plus additional deployment runs are $0.005 per run. Production deployment uptime is $0.0036 per minute, and development deployment uptime is $0.0007 per minute. Fleet includes 50 runs per month on Developer and 500 runs per month on Plus, then additional Fleet runs are $0.05 per run. Engine usage is $1.50 per LCU, and sandboxes list CPU at $0.0576 per vCPU-hour.
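The same arithmetic works for the platform side. This sketch uses the Plus-tier figures quoted above. How those line items combine on a real invoice is an assumption here, including treating the 10K included base traces as a single monthly allowance, and the workload numbers in the usage note are hypothetical.

```python
# Figures as quoted in this article; verify against current pricing.
PLUS_SEAT_PER_MONTH = 39.00
BASE_TRACE_PER_1K = 2.50
EXTENDED_TRACE_PER_1K = 5.00
PROD_UPTIME_PER_MINUTE = 0.0036


def monthly_platform_cost(seats: int, base_traces: int, extended_traces: int,
                          prod_uptime_minutes: int,
                          included_base_traces: int = 10_000) -> float:
    """Rough monthly USD estimate for seats, traces, and deployment uptime."""
    # Assumption: only base traces beyond the included allowance are billed.
    billable_base = max(0, base_traces - included_base_traces)
    return (
        seats * PLUS_SEAT_PER_MONTH
        + billable_base / 1000 * BASE_TRACE_PER_1K
        + extended_traces / 1000 * EXTENDED_TRACE_PER_1K
        + prod_uptime_minutes * PROD_UPTIME_PER_MINUTE
    )
```

With 3 seats, 110,000 base traces, 5,000 extended traces, and one production deployment running for a 30-day month (43,200 minutes), the estimate is about $547.52. Of that, $155.52 is uptime alone.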
The production budget should track:
- model input tokens, cached input tokens, and output tokens
- built-in tool calls such as web search, containers, or retrieval
- graph or run retries caused by failed tools
- approval pauses and resumed runs
- trace retention tier and sampled trace volume
- eval runs per release, per prompt change, and per incident
- deployment uptime for any always-on runtime
For a cost-sensitive team, the cheapest architecture is usually boring: classify early, use a smaller model for routine nodes, reserve frontier models for expensive branches, cache stable context, batch offline evals, sample low-risk traces, keep high-retention traces only for incidents and labeled failures, and log cost per run before the CFO asks for it.
The Decision Rule That Flips the Choice
Pick OpenAI Agents SDK when the agent can be expressed as a platform-native control loop. Pick LangGraph when the workflow is a durable graph with its own state semantics.
OpenAI Agents SDK is the call when:
- your model stack is OpenAI-first
- the agent is mostly request-response with bounded pauses
- tool calls, MCP tools, guardrails, and handoffs are the core runtime
- built-in traces are good enough as the first audit log
- evals can start from trace grading and grow into datasets
- your approval UI can store and resume the SDK result state
LangGraph is the call when:
- state persistence is a first-class requirement
- workflows are long-running, branchy, or resumable across days
- checkpoints, replay, pending writes, and time travel debugging matter
- human approval lives inside graph logic, not only around a tool call
- model-provider portability is a real architecture requirement
- your deployment path needs self-hosting, hybrid hosting, or custom runtime control
The simplest architecture we would ship for an early production agent is often OpenAI Agents SDK plus a small application control table:
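One way such a control table could look, as a minimal sketch using SQLite. The table name, column names, and helper functions are assumptions built from the fields listed earlier (pending interruptions, state snapshot, requesting user, policy version, approval decision). They are not a schema from either SDK.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_runs (
    run_id         TEXT PRIMARY KEY,
    requested_by   TEXT NOT NULL,
    policy_version TEXT NOT NULL,
    status         TEXT NOT NULL,   -- 'paused', 'approved', or 'rejected'
    interruptions  TEXT NOT NULL,   -- JSON list of pending approvals
    state_snapshot TEXT NOT NULL,   -- serialized run state, treated as opaque
    decided_by     TEXT             -- NULL until someone decides
)
"""


def open_control_table(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_pause(conn, run_id, requested_by, policy_version,
                 interruptions, state_snapshot) -> None:
    """Persist everything the approval UI needs to resume later."""
    conn.execute(
        "INSERT INTO agent_runs (run_id, requested_by, policy_version, status,"
        " interruptions, state_snapshot) VALUES (?, ?, ?, 'paused', ?, ?)",
        (run_id, requested_by, policy_version,
         json.dumps(interruptions), state_snapshot),
    )
    conn.commit()


def record_decision(conn, run_id, approved: bool, decided_by: str) -> str:
    """Store the decision once and return the snapshot to resume from."""
    status = "approved" if approved else "rejected"
    cur = conn.execute(
        "UPDATE agent_runs SET status = ?, decided_by = ?"
        " WHERE run_id = ? AND status = 'paused'",
        (status, decided_by, run_id),
    )
    if cur.rowcount != 1:
        raise ValueError(f"run {run_id!r} is not awaiting approval")
    conn.commit()
    row = conn.execute(
        "SELECT state_snapshot FROM agent_runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0]
```

The status guard in the UPDATE means a second decision on the same run fails instead of silently overwriting the first, which keeps the audit record unambiguous.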
Move to LangGraph when that table starts pretending to be a graph runtime. Signs include ad hoc resume pointers, custom retry graphs, manual checkpoint blobs, long-running approval branches, provider routing rules hidden in application code, or incident reviews where engineers cannot reconstruct why the agent took a path.
The uncomfortable truth is that both tools can ship a demo. Production asks a narrower question: which one makes the failure record legible when a user, auditor, or engineer asks what happened?
FAQ
Is OpenAI Agents SDK better than LangGraph?
OpenAI Agents SDK is better for OpenAI-first workflows where integrated traces, approvals, MCP tools, and evals matter more than graph-level runtime independence. LangGraph is better when durable graph state, checkpointing, replay, and provider freedom are requirements.
Is LangGraph production ready?
LangGraph is designed around production-relevant primitives: durable execution, persistence, checkpoints, human-in-the-loop interrupts, streaming, and fault-tolerant resume behavior. Your team still owns the architecture around secrets, auth, deployment, cost limits, evals, data retention, and incident review.
Can OpenAI Agents SDK use non-OpenAI models?
The SDK has model and provider surfaces, but its strongest production path is the OpenAI platform loop. If model portability is a hard requirement rather than a future preference, LangGraph is usually the cleaner starting point.
Do I need LangSmith with LangGraph?
No. LangGraph can run as an open-source orchestration runtime, and it can be used without LangChain. LangSmith is the first-party platform for tracing, evaluation, prompts, and managed deployment across frameworks, so it becomes relevant when your team wants that control plane.
Which should a startup use first?
Use OpenAI Agents SDK first if speed matters and the product can commit to OpenAI-native execution. Use LangGraph first if the startup's moat depends on a custom agent workflow, long-running state, self-hosting, or multi-provider routing.
Jun 3, 2026




