OpenAI Agents SDK vs LangGraph for Production Agents

Choose OpenAI Agents SDK for OpenAI-native loops. Choose LangGraph when durable graph state, provider freedom, and custom control matter.

Wednesday, June 3, 2026

Omid Saffari

Use OpenAI Agents SDK when your production agent is mostly an OpenAI-native loop with managed traces, approvals, MCP tools, and evals. Use LangGraph when the agent is really a durable state machine that must survive pauses, retries, provider changes, and custom deployment constraints.

The Verdict

OpenAI Agents SDK is the faster production choice when your app is already betting on the OpenAI platform and the agent loop is bounded enough for the SDK to own the useful surfaces: orchestration, tool execution, guardrails, approvals, resumable state, traces, and evals.

LangGraph is the better choice when the agent is not just "a model with tools." It is the right fit when you need explicit graph state, durable checkpoints, a persistent thread_id, provider flexibility, custom retry semantics, and human review that can pause a workflow indefinitely without losing context.

The practical line is simple:

Choose OpenAI Agents SDK for OpenAI-first assistants, internal copilots, review queues, MCP-backed tools, customer-support triage, and workflows where built-in traces and evals get you to production faster.
Choose LangGraph for long-running research agents, multi-step operations workflows, custom planning loops, multi-provider routing, and agents where the graph itself is the product boundary.

Do not choose either because the demo is shorter. Choose the one whose runtime boundary matches the thing that will break first in production: state, approvals, cost, deployment, trace quality, or provider control.

Figure: OpenAI Agents SDK documentation. OpenAI Agents SDK is strongest when your application owns the agent loop and OpenAI owns the model, tool, trace, and eval surfaces.

The Axis That Separates Them

The real difference is not "simple versus complex." The real difference is who owns durable state.

OpenAI Agents SDK treats an agent run as a structured platform loop. The result surface gives you finalOutput (TypeScript) or final_output (Python), replay-ready history, the last specialist agent, the stored response ID for continuation, and, in approval flows, interruptions plus state (TypeScript) or to_state() (Python) for a resumable snapshot. That is enough for many production apps: pause before a risky tool call, show the pending approval in your UI, approve or reject, and pass the saved state back into the runtime.

LangGraph treats state as the center of the architecture. Its persistence layer saves graph state as checkpoints at every step of execution, organized into threads. A thread_id becomes the pointer for loading state and resuming execution. Checkpoints enable human review, conversational memory, time travel debugging, and fault-tolerant execution.

That distinction changes the production design:

In OpenAI Agents SDK, the run result is the thing your app stores, audits, and resumes.
In LangGraph, the graph checkpoint is the thing your app stores, audits, and resumes.

If your team mostly needs "pause before this refund, shell command, private MCP tool, or account edit," OpenAI Agents SDK is probably enough. If your team needs "resume this research graph next week from the last successful super-step, fork it, inspect state, and swap providers," LangGraph is the cleaner foundation.

For the observability stack around either choice, the companion decision is covered in Langfuse vs LangSmith for production observability. For tool boundaries, pair this with the MCP vs function calling production decision rule.

Production Comparison

The right framework is the one that makes your control plane smaller, not the one with the most flexible demo.

Axis	OpenAI Agents SDK	LangGraph	Production call	Watch first
Best fit	OpenAI-native agent loops with tools, approvals, traces, and evals	Long-running, stateful graph workflows	Pick the runtime that already models your failure mode	Accidental lock-in to the wrong state model
Model/provider control	Strongest with OpenAI platform models and tools	Works across providers and can be used without LangChain	Use OpenAI SDK when OpenAI is the committed model stack; use LangGraph when portability is a requirement	Hidden provider assumptions in prompts, tools, and evals
Orchestration	Handoffs and agents-as-tools for specialist routing	Explicit graph nodes, edges, state, and checkpointed execution	SDK for bounded routing; LangGraph for custom state machines	Too many specialists before the workflow has evidence
State	Result surfaces include history, last agent, response ID, interruptions, and resumable state	Checkpoints are organized by thread and saved through a checkpointer	SDK state is simpler; LangGraph state is more controllable	Losing context during approval, retry, or deployment restart
Human approval	Guardrails validate automatically; human review pauses sensitive actions	Interrupts pause graph execution and resume via `Command`	SDK for approval around tool calls; LangGraph for approval inside graph logic	Side effects before approval and non-idempotent retries
Tracing and evals	Built-in tracing is enabled by default in the normal server-side SDK path; traces can include model calls, tool calls, handoffs, guardrails, and custom spans	LangSmith is first-party for tracing, evaluation, prompts, and deployment across frameworks	Use traces as the audit log, not just debugging output	Unstructured logs that cannot answer why a run failed
Deployment	Your app owns server deployment unless you move to OpenAI hosted workflow surfaces	Open source runtime, with LangSmith deployment as a managed path	SDK for app-native deployment; LangGraph for custom runtime or LangSmith deployment	State migration and replay behavior during deploys
Cost surface	Model tokens, built-in tools, containers, and any extra tracing/eval usage	Model provider bill plus LangSmith trace, deployment, uptime, Fleet, Engine, and sandbox costs if used	Price the whole control plane, not only tokens	Approval retries and trace retention multiplying cost

OpenAI Agents SDK: Best for OpenAI-Native Control Loops

OpenAI Agents SDK is the better default when your production app wants an agent loop without inventing a graph runtime.

In the OpenAI model, agents are applications that plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work. The important production phrase is "applications." Your server still owns the workflow boundary. The SDK gives you the primitives inside that boundary: running agents, orchestration, guardrails, results and state, integrations and observability, and agent workflow evals.

The clean OpenAI Agents SDK architecture looks like this:

Your app receives a user or system request.
The SDK runs a primary agent with narrow instructions and scoped tools.
Handoffs move control to a specialist only when that branch should own the next response.
Agents-as-tools let a manager agent call specialists while keeping ownership of the final answer.
Guardrails block invalid input, output, or tool behavior before it leaves the system.
Human review pauses risky side effects like cancellations, edits, shell commands, or sensitive MCP actions.
Traces capture the workflow, model calls, tool calls, handoffs, guardrails, and custom spans.
Trace grading and eval datasets turn production failures into repeatable tests.

Start with one owner

Build one primary agent first. Add a specialist only when it gives you policy isolation, tool isolation, clearer prompts, or cleaner traces. If a specialist does not improve one of those four things, it is probably extra routing surface.

Gate the side effect

Use guardrails for automatic checks and human review before sensitive actions. A production refund, account edit, shell command, private MCP call, or data export should pause with a clear approval payload before the tool fires.

Store the resumable surface

When a run pauses, store the pending interruptions, the SDK state or to_state() snapshot, the requesting user, the policy version, and the approval decision. The approval UI should resume the saved state, not replay the whole conversation from memory.

Turn traces into evals

Do not wait for a generic eval system. Start by grading traces for the failure classes you already care about: wrong tool, missing approval, bad handoff, unsafe output, cost spike, or incomplete task.
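The "store the resumable surface" step can be sketched as a small record type. This is a minimal illustration, not the SDK's own storage model: `PausedRun`, `record_decision`, `to_json`, and `from_json` are hypothetical names, and the state snapshot is treated as an opaque string that your app obtained by serializing the SDK state.

```python
import json
from dataclasses import asdict, dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PausedRun:
    """Hypothetical record of everything the approval UI needs to resume a run."""
    run_id: str
    requested_by: str
    policy_version: str
    interruptions: list            # JSON-serializable summaries of the pending tool calls
    state_snapshot: str            # opaque serialized SDK state; never parsed here
    decision: Optional[str] = None
    approver: Optional[str] = None
    decided_at: Optional[str] = None


def record_decision(paused: PausedRun, approver: str, decision: str) -> PausedRun:
    """Return a copy with the reviewer's decision; refuse a second decision."""
    if decision not in ("approved", "rejected"):
        raise ValueError(f"unknown decision: {decision!r}")
    if paused.decision is not None:
        raise RuntimeError(f"run {paused.run_id} was already decided")
    return replace(
        paused,
        decision=decision,
        approver=approver,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )


def to_json(paused: PausedRun) -> str:
    """Serialize the record for whatever durable store the app uses."""
    return json.dumps(asdict(paused))


def from_json(raw: str) -> PausedRun:
    """Rebuild the record when the approval UI loads it."""
    return PausedRun(**json.loads(raw))
```

The design point is that the approval UI loads this record and hands `state_snapshot` back to the runtime, so resuming never depends on reconstructing the conversation.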

A support triage agent is a good example. The primary agent classifies the issue, calls a billing specialist as a tool for invoice questions, hands off to an account-security specialist only when that specialist should own the next response, and pauses before account edits. The trace tells you which model call chose the path, which tool arguments were proposed, which guardrail ran, and who approved the side effect.

The failure mode is also clear. If the agent needs to pause for three days, survive an app deploy, resume a multi-branch graph, and keep historical checkpoints that can be forked for debugging, OpenAI Agents SDK becomes only part of the system. You will need more durable workflow state around it, or you should reach for LangGraph earlier.

LangGraph: Best for Durable Agent State Machines

LangGraph is the better default when state is the application, not a side effect of the agent loop.

LangGraph is a low-level orchestration framework and runtime for building, managing, and deploying long-running, stateful agents. It is focused on durable execution, streaming, human-in-the-loop, and persistence. It can be used without LangChain, although LangChain components are common for model and tool integrations.

Figure: LangGraph documentation. LangGraph is strongest when the agent is a long-running state machine with checkpoints, interrupts, and explicit graph control.

LangGraph persistence saves graph state as checkpoints at every step of execution. Those checkpoints are organized into threads, and thread_id is the key that tells the runtime which state to load. That design is why LangGraph is strong for long-running workflows: if a node fails, the graph can resume from the last successful boundary instead of recomputing everything. Its pending-write behavior also means successful nodes inside a failed super-step do not need to run again when execution resumes.
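To make the thread and checkpoint idea concrete, here is a toy model in plain Python. It is not LangGraph's implementation and uses none of its APIs: `ToyCheckpointStore` and `run_with_checkpoints` are hypothetical names, and each "step" stands in for one graph boundary. It only illustrates why a per-thread checkpoint history lets a failed run resume without recomputing earlier work, and how a fork copies history up to a chosen point.

```python
from copy import deepcopy
from typing import Callable, Optional


class ToyCheckpointStore:
    """In-memory stand-in for a checkpointer: one checkpoint list per thread_id."""

    def __init__(self) -> None:
        self._threads: dict = {}

    def save(self, thread_id: str, step: int, values: dict) -> None:
        # Each checkpoint records how many steps have completed and the state after them.
        self._threads.setdefault(thread_id, []).append(
            {"step": step, "values": deepcopy(values)}
        )

    def latest(self, thread_id: str) -> Optional[dict]:
        history = self._threads.get(thread_id)
        return deepcopy(history[-1]) if history else None

    def fork(self, thread_id: str, upto_step: int, new_thread_id: str) -> None:
        # Copy history up to and including `upto_step` into a new thread.
        self._threads[new_thread_id] = [
            deepcopy(c) for c in self._threads[thread_id] if c["step"] <= upto_step
        ]


def run_with_checkpoints(
    store: ToyCheckpointStore,
    thread_id: str,
    steps: list,
    initial: dict,
) -> dict:
    """Run steps in order, checkpointing after each; a rerun skips completed steps."""
    checkpoint = store.latest(thread_id)
    if checkpoint is None:
        store.save(thread_id, 0, initial)
        checkpoint = {"step": 0, "values": initial}
    values = checkpoint["values"]
    for index in range(checkpoint["step"], len(steps)):
        step: Callable[[dict], dict] = steps[index]
        values = step(values)  # may raise; checkpoints saved so far survive
        store.save(thread_id, index + 1, values)
    return values
```

Calling `run_with_checkpoints` again with the same `thread_id` after a failure starts at the first step that has no checkpoint, which is the behavior the paragraph above describes at graph scale.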

Human-in-the-loop works differently in LangGraph. An interrupt() call pauses graph execution, saves state through the persistence layer, waits indefinitely for external input, and resumes via Command. The payload should be JSON-serializable. In production, the interrupt needs a durable checkpointer and a stable thread_id.

The important production caveat: a graph node resumes from the beginning after an interrupt. Any side effect before the interrupt must be idempotent, or it should move after the interrupt. That is not a footnote. It is the difference between a review queue and a duplicate charge, duplicate ticket, duplicate email, or duplicate database write.
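The caveat can be simulated without LangGraph. The sketch below mimics a node body that executes twice, once before the pause and once on resume, and shows how an idempotency key keeps an early side effect from repeating. `IdempotentSideEffects` and `node_with_early_side_effect` are hypothetical names, and the in-memory dictionary is an assumption for illustration; a real system would need a durable store, such as a database table with a unique key.

```python
from typing import Any, Callable, Optional


class IdempotentSideEffects:
    """Hypothetical helper: runs each side effect at most once per key."""

    def __init__(self) -> None:
        self._results: dict = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> Any:
        # Only the first call for a key executes the effect; later calls reuse the result.
        if key not in self._results:
            self._results[key] = effect()
        return self._results[key]


tickets_created: list = []
effects = IdempotentSideEffects()


def node_with_early_side_effect(ticket_id: str, resume_value: Optional[bool] = None) -> str:
    """Stand-in for a node whose body runs from the top on both the first pass and the resume."""
    # This side effect sits before the pause, so it is reached on both executions.
    effects.run_once(
        f"create-ticket:{ticket_id}",
        lambda: tickets_created.append(ticket_id),
    )
    if resume_value is None:
        return "paused"  # first pass: waiting for a reviewer
    return "approved" if resume_value else "rejected"
```

Without the key, the second execution would append a duplicate ticket. Moving the side effect after the approval point, as the graph below does, avoids the problem entirely.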

Use this LangGraph shape for a long-running research or operations agent:

Python

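from typing import TypedDict

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt


class AgentState(TypedDict):
    ticket_id: str
    draft_action: dict
    approved: bool


def propose_action(state: AgentState) -> AgentState:
    # Pure function: no external side effect before approval.
    return {
        **state,
        "draft_action": {
            "type": "account_change",
            "ticket_id": state["ticket_id"],
        },
    }


def approval_gate(state: AgentState) -> AgentState:
    # interrupt() pauses the graph here. On resume, this node runs again from
    # the top and interrupt() returns the value passed to Command(resume=...).
    approved = interrupt({
        "kind": "approval_required",
        "ticket_id": state["ticket_id"],
        "draft_action": state["draft_action"],
    })
    return {**state, "approved": bool(approved)}


def commit_action(state: AgentState) -> AgentState:
    if not state["approved"]:
        return state
    # Put the side effect here, after approval.
    return state


builder = StateGraph(AgentState)
builder.add_node("propose_action", propose_action)
builder.add_node("approval_gate", approval_gate)
builder.add_node("commit_action", commit_action)
builder.add_edge(START, "propose_action")
builder.add_edge("propose_action", "approval_gate")
builder.add_edge("approval_gate", "commit_action")
builder.add_edge("commit_action", END)

# interrupt() needs a checkpointer. InMemorySaver is for local testing only;
# production needs a durable checkpointer.
graph = builder.compile(checkpointer=InMemorySaver())

# The thread_id value is illustrative; use one stable ID per workflow.
config = {"configurable": {"thread_id": "ticket-123"}}

# The first call runs until approval_gate pauses on interrupt().
graph.invoke({"ticket_id": "ticket-123", "draft_action": {}, "approved": False}, config)

# Later, once a reviewer decides, resume the same thread with the decision.
graph.invoke(Command(resume=True), config)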

That pattern is more ceremony than a small SDK loop. It earns the ceremony when the run is valuable enough to pause, inspect, resume, replay, and audit at graph boundaries.

The Cost Model That Actually Matters

The framework is rarely the largest line item. Model calls, tool calls, trace retention, approval retries, deployment uptime, and eval volume usually dominate.

Figure: OpenAI API pricing page. OpenAI Agents SDK cost depends on the OpenAI model and tool surfaces you attach to the agent loop.

As of June 3, 2026, OpenAI lists GPT-5.5 at $5.00 per 1M input tokens, $0.50 per 1M cached input tokens, and $30.00 per 1M output tokens. GPT-5.4 mini is $0.75 per 1M input tokens, $0.075 per 1M cached input tokens, and $4.50 per 1M output tokens. The pricing page states those flagship rates are standard processing prices for context lengths under 270K tokens.

The tool bill matters too. OpenAI web search is listed at $10.00 per 1K calls, with search content tokens free. Containers are listed at $0.03 for 1 GB and $1.92 for 64 GB per container, with the same amounts applying per 20-minute session per container starting March 31, 2026. Batch API saves 50% on inputs and outputs for asynchronous work that can wait up to 24 hours.
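As a worked example, the per-run arithmetic for the token and web search prices quoted above can be sketched like this. The dictionary keys are plain labels for this example, not confirmed API model identifiers, and the token counts in the usage lines are made up. `input_tokens` is assumed to mean uncached input tokens only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per 1M tokens, copied from the prices quoted in this article."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# Keys are labels for this example only, not confirmed API model names.
PRICES = {
    "gpt-5.5": ModelPrice(5.00, 0.50, 30.00),
    "gpt-5.4-mini": ModelPrice(0.75, 0.075, 4.50),
}

WEB_SEARCH_USD_PER_CALL = 10.00 / 1000  # $10.00 per 1K calls


def run_cost_usd(
    model: str,
    input_tokens: int,          # uncached input tokens (assumption)
    cached_input_tokens: int,
    output_tokens: int,
    web_search_calls: int = 0,
) -> float:
    """Estimate one run's cost from token counts and web search calls."""
    price = PRICES[model]
    token_cost = (
        input_tokens * price.input_per_m
        + cached_input_tokens * price.cached_input_per_m
        + output_tokens * price.output_per_m
    ) / 1_000_000
    return token_cost + web_search_calls * WEB_SEARCH_USD_PER_CALL


# Hypothetical run: 20K uncached input, 80K cached input, 2K output, 2 searches.
print(f"{run_cost_usd('gpt-5.5', 20_000, 80_000, 2_000, 2):.4f}")       # 0.2200
print(f"{run_cost_usd('gpt-5.4-mini', 20_000, 80_000, 2_000, 2):.4f}")  # 0.0500
```

In this made-up run the two web searches cost the same $0.02 on both models, so on the smaller model they account for 40% of the total. That is why the tool bill deserves its own line in the budget.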

Figure: LangSmith pricing page. LangGraph itself can run open source, but LangSmith pricing matters when you use managed tracing, evals, deployment, Fleet, Engine, or sandboxes.

For LangGraph, the model bill is separate from the LangGraph runtime. If you use LangSmith, the current pricing surface is explicit. Developer is $0 per seat per month with up to 5K base traces per month. Plus is $39 per seat per month with up to 10K base traces per month and one dev-sized agent deployment included. Base traces have 14-day retention and cost $2.50 per 1K traces. Extended traces have 400-day retention and cost $5.00 per 1K traces.

Deployment pricing adds another control-plane cost. LangSmith Plus additional deployment runs are $0.005 per run. Production deployment uptime is $0.0036 per minute, and development deployment uptime is $0.0007 per minute. Fleet includes 50 runs per month on Developer and 500 runs per month on Plus, then additional Fleet runs are $0.05 per run. Engine usage is $1.50 per LCU, and sandboxes list CPU at $0.0576 per vCPU-hour.
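A rough monthly estimate for the LangSmith Plus numbers quoted above might look like the sketch below. It rests on assumptions the pricing summary does not settle: that the 10K included traces offset base traces only, that extended traces are billed at the full listed rate, and that the caller already knows how many deployment runs count as "additional". The function name and the example volumes are hypothetical.

```python
PLUS_SEAT_USD = 39.00
PLUS_INCLUDED_BASE_TRACES = 10_000
BASE_TRACE_USD_PER_1K = 2.50
EXTENDED_TRACE_USD_PER_1K = 5.00
PROD_UPTIME_USD_PER_MIN = 0.0036
ADDITIONAL_DEPLOYMENT_RUN_USD = 0.005


def langsmith_plus_monthly_usd(
    seats: int,
    base_traces: int,
    extended_traces: int,
    prod_uptime_minutes: int,
    additional_deployment_runs: int,
) -> float:
    """Estimate a monthly bill from the listed Plus prices (assumptions noted above)."""
    # Assumption: the included allowance applies to base traces only.
    billable_base = max(0, base_traces - PLUS_INCLUDED_BASE_TRACES)
    return (
        seats * PLUS_SEAT_USD
        + billable_base / 1000 * BASE_TRACE_USD_PER_1K
        + extended_traces / 1000 * EXTENDED_TRACE_USD_PER_1K
        + prod_uptime_minutes * PROD_UPTIME_USD_PER_MIN
        + additional_deployment_runs * ADDITIONAL_DEPLOYMENT_RUN_USD
    )


# Hypothetical month: 3 seats, 60K base traces, 2K extended traces,
# one production deployment up for 30 days, 20K additional runs.
minutes_in_30_days = 30 * 24 * 60  # 43,200
total = langsmith_plus_monthly_usd(3, 60_000, 2_000, minutes_in_30_days, 20_000)
print(f"{total:.2f}")  # 507.52
```

In this made-up month, always-on uptime ($155.52) and trace volume ($135.00) each exceed the seat cost ($117.00), which is the practical reason to track them separately.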

The production budget should track:

model input tokens, cached input tokens, and output tokens
built-in tool calls such as web search, containers, or retrieval
graph or run retries caused by failed tools
approval pauses and resumed runs
trace retention tier and sampled trace volume
eval runs per release, per prompt change, and per incident
deployment uptime for any always-on runtime

For a cost-sensitive team, the cheapest architecture is usually boring: classify early, use a smaller model for routine nodes, reserve frontier models for expensive branches, cache stable context, batch offline evals, sample low-risk traces, keep high-retention traces only for incidents and labeled failures, and log cost per run before the CFO asks for it.
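One way to express the "sample low-risk traces, keep long retention for incidents and labeled failures" rule is a small deterministic policy function. This is a sketch under stated assumptions: the tier names `"extended"`, `"base"`, and `"drop"`, the `risk` labels, and the default 10% sample rate are all illustrative choices, not settings from any tracing product.

```python
import hashlib


def retention_tier(
    run_id: str,
    risk: str,                       # illustrative labels: "low", "medium", "high"
    is_incident: bool = False,
    labeled_failure: bool = False,
    low_risk_sample_rate: float = 0.10,
) -> str:
    """Decide how long to keep a trace: 'extended', 'base', or 'drop'."""
    # Incidents and labeled failures always get the long-retention tier.
    if is_incident or labeled_failure:
        return "extended"
    # Anything not explicitly low risk is always kept at the base tier.
    if risk != "low":
        return "base"
    # Hash the run ID so the same run always gets the same sampling decision.
    bucket = int(hashlib.sha256(run_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return "base" if bucket < low_risk_sample_rate * 10_000 else "drop"
```

Hashing the run ID rather than calling a random number generator keeps the decision reproducible, so a retry or a later audit sees the same tier for the same run.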

The Decision Rule That Flips the Choice

Pick OpenAI Agents SDK when the agent can be expressed as a platform-native control loop. Pick LangGraph when the workflow is a durable graph with its own state semantics.

OpenAI Agents SDK is the call when:

your model stack is OpenAI-first
the agent is mostly request-response with bounded pauses
tool calls, MCP tools, guardrails, and handoffs are the core runtime
built-in traces are good enough as the first audit log
evals can start from trace grading and grow into datasets
your approval UI can store and resume the SDK result state

LangGraph is the call when:

state persistence is a first-class requirement
workflows are long-running, branchy, or resumable across days
checkpoints, replay, pending writes, and time travel debugging matter
human approval lives inside graph logic, not only around a tool call
model-provider portability is a real architecture requirement
your deployment path needs self-hosting, hybrid hosting, or custom runtime control

The simplest architecture we would ship for an early production agent is often OpenAI Agents SDK plus a small application control table:

Table	Purpose
`agent_runs`	run ID, user ID, agent name, prompt version, model, status, started_at, finished_at
`agent_tool_calls`	run ID, tool name, arguments hash, result status, latency, token cost
`agent_approvals`	run ID, interruption ID, approver, policy version, decision, decided_at
`agent_eval_results`	run ID, grader, score, failure class, dataset version
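The control tables above could start as a schema like the following. This is a minimal sketch using SQLite for portability: the column types, the surrogate `id` keys, the `latency_ms` and `token_cost_usd` units, and the `runs_awaiting_approval` helper are assumptions added for illustration, while the table and column names follow the table above.

```python
import sqlite3

CONTROL_SCHEMA = """
CREATE TABLE agent_runs (
    run_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    agent_name TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    model TEXT NOT NULL,
    status TEXT NOT NULL,
    started_at TEXT NOT NULL,
    finished_at TEXT
);
CREATE TABLE agent_tool_calls (
    id INTEGER PRIMARY KEY,
    run_id TEXT NOT NULL REFERENCES agent_runs(run_id),
    tool_name TEXT NOT NULL,
    arguments_hash TEXT NOT NULL,
    result_status TEXT NOT NULL,
    latency_ms INTEGER,
    token_cost_usd REAL
);
CREATE TABLE agent_approvals (
    id INTEGER PRIMARY KEY,
    run_id TEXT NOT NULL REFERENCES agent_runs(run_id),
    interruption_id TEXT NOT NULL,
    approver TEXT,
    policy_version TEXT NOT NULL,
    decision TEXT,       -- NULL until a reviewer decides
    decided_at TEXT
);
CREATE TABLE agent_eval_results (
    id INTEGER PRIMARY KEY,
    run_id TEXT NOT NULL REFERENCES agent_runs(run_id),
    grader TEXT NOT NULL,
    score REAL,
    failure_class TEXT,
    dataset_version TEXT
);
"""


def create_control_tables(db: sqlite3.Connection) -> None:
    """Create the four control tables in an empty database."""
    db.executescript(CONTROL_SCHEMA)


def runs_awaiting_approval(db: sqlite3.Connection) -> list:
    """Return run IDs that have at least one undecided approval."""
    rows = db.execute(
        "SELECT DISTINCT run_id FROM agent_approvals "
        "WHERE decision IS NULL ORDER BY run_id"
    ).fetchall()
    return [row[0] for row in rows]
```

A query like `runs_awaiting_approval` is the kind of question this table should answer directly. When answering it starts to require custom resume pointers or checkpoint blobs, the table is drifting toward a graph runtime.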

Move to LangGraph when that table starts pretending to be a graph runtime. Signs include ad hoc resume pointers, custom retry graphs, manual checkpoint blobs, long-running approval branches, provider routing rules hidden in application code, or incident reviews where engineers cannot reconstruct why the agent took a path.

The uncomfortable truth is that both tools can ship a demo. Production asks a narrower question: which one makes the failure record legible when a user, auditor, or engineer asks what happened?

FAQ

Is OpenAI Agents SDK better than LangGraph?

OpenAI Agents SDK is better for OpenAI-first workflows where integrated traces, approvals, MCP tools, and evals matter more than graph-level runtime independence. LangGraph is better when durable graph state, checkpointing, replay, and provider freedom are requirements.

Is LangGraph production ready?

LangGraph is designed around production-relevant primitives: durable execution, persistence, checkpoints, human-in-the-loop interrupts, streaming, and fault-tolerant resume behavior. Your team still owns the architecture around secrets, auth, deployment, cost limits, evals, data retention, and incident review.

Can OpenAI Agents SDK use non-OpenAI models?

The SDK exposes model and provider configuration, so non-OpenAI models can be plugged in, but its strongest production path is the OpenAI platform loop. If model portability is a hard requirement rather than a future preference, LangGraph is usually the cleaner starting point.

Do I need LangSmith with LangGraph?

No. LangGraph can run as an open-source orchestration runtime, and it can be used without LangChain. LangSmith is the first-party platform for tracing, evaluation, prompts, and managed deployment across frameworks, so it becomes relevant when your team wants that control plane.

Which should a startup use first?

Use OpenAI Agents SDK first if speed matters and the product can commit to OpenAI-native execution. Use LangGraph first if the startup's moat depends on a custom agent workflow, long-running state, self-hosting, or multi-provider routing.

Last Updated

Jun 3, 2026
