LangChain vs LangGraph for Production Agents

Use LangChain for simple agent harnesses. Use LangGraph when production agents need durable state, retries, interrupts, approvals, and deployment.

Friday, June 5, 2026

Omid Saffari

For production agents, LangChain is the faster agent harness and LangGraph is the safer runtime once the workflow needs durable state. Use LangChain for a linear tool-calling loop; use LangGraph when a run has to pause, resume, retry, branch, or wait for approval without losing context.

Verdict: LangChain for the harness, LangGraph for the runtime

Use LangChain when the agent is still a straightforward model-plus-tools loop. Use LangGraph when the agent is becoming a workflow that needs explicit state, checkpoints, approval gates, recovery, or deployment discipline.

That is the dividing line for production work. The LangChain docs describe an agent as a model calling tools in a loop until a task is complete, and position create_agent as a configurable harness around the model, prompt, tools, and middleware. That is useful when the main job is to assemble a capable agent quickly: choose a model, attach tools, set the system prompt, add middleware where behavior needs shaping, and trace it.

LangChain agents documentation — LangChain is strongest when the job is a configurable agent harness around a model, prompt, tools, and middleware.

LangGraph is the lower-level orchestration runtime. LangGraph docs describe it as a framework and runtime for long-running, stateful agents, focused on durable execution, streaming, human-in-the-loop, and persistence. That matters when the agent is no longer a single request path. A support agent that opens a ticket, waits for approval, retries a tool, resumes after a worker restart, and explains its last action to an operator is not just a chain. It is a state machine with operational obligations.

LangGraph overview documentation — LangGraph is the production choice when orchestration, persistence, and controlled execution matter more than a quick harness.

The common mistake is treating LangChain vs LangGraph as a winner-takes-all framework choice. It is a layering choice. LangChain gives you model and tool integrations plus the simple agent harness. LangGraph gives you the runtime shape when the work needs to survive across steps, people, failures, and deploys. The best production architecture often uses LangChain components inside a LangGraph workflow.

The Axis That Separates Them Is Durable State

The separating axis is not popularity, syntax, or the number of integrations. It is whether state is an implementation detail inside your code or a first-class runtime object you can inspect, persist, resume, and audit.

LangGraph has a built-in persistence layer that saves graph state as checkpoints. With a checkpointer, graph state is saved at every step of execution and organized into threads. The docs tie that persistence directly to human-in-the-loop workflows, conversational memory, time travel debugging, and fault-tolerant execution.
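To make the checkpoint-and-thread idea concrete, here is a toy model of the concept in plain Python. It is not LangGraph's implementation or API: `ToyCheckpointStore`, `save`, `latest`, and `history` are hypothetical names, and a real checkpointer also has to deal with serialization, durable storage, and pending writes.

```python
from collections import defaultdict
from copy import deepcopy


class ToyCheckpointStore:
    """Keeps an ordered list of state snapshots per thread (illustrative only)."""

    def __init__(self):
        self._threads = defaultdict(list)

    def save(self, thread_id: str, state: dict) -> int:
        # Copy so later mutation of `state` cannot rewrite saved history.
        self._threads[thread_id].append(deepcopy(state))
        return len(self._threads[thread_id]) - 1  # index of the new checkpoint

    def latest(self, thread_id: str):
        history = self._threads.get(thread_id)
        return deepcopy(history[-1]) if history else None

    def history(self, thread_id: str) -> list:
        return deepcopy(self._threads.get(thread_id, []))


store = ToyCheckpointStore()
state = {"step": "research", "notes": []}
store.save("ticket-123", state)

state["step"] = "tool_call"
state["notes"].append("fetched order")
store.save("ticket-123", state)

# After a crash, a new worker picks up the last snapshot for the thread
# instead of replaying the run from the beginning.
resumed = store.latest("ticket-123")
```

The point of the sketch is the lookup key: everything a worker needs to continue is reachable from the thread identifier, and earlier snapshots stay available for inspection.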

| Production decision | LangChain | LangGraph | Production impact | Choose this when |
| --- | --- | --- | --- | --- |
| Basic agent loop | `create_agent` gives a configurable model, prompt, tools, and middleware harness | Can model the same loop as a graph, but adds structure | LangChain is faster to assemble and easier to read for simple paths | The agent has one main path and limited branching |
| Durable state | State is usually handled in your app code, memory layer, or custom persistence | Checkpoints save graph state at each step, organized into threads | Operators can resume and inspect the run instead of replaying from scratch | The agent performs multi-step work that cannot be lost |
| Human approval | Possible with custom app logic | Interrupts pause execution and wait for external input before continuing | Approval becomes part of the runtime flow, not a side-channel flag | A person must approve a tool call, edit state, or reject an action |
| Failure recovery | You own retries, partial writes, and resume rules | Checkpoints and pending writes support recovery after node failures | Failed work can restart from a safe boundary | Tool failures or worker restarts are expected |
| Observability | LangSmith can trace LangChain runs | LangSmith can trace graph runs and node-level behavior | Graph nodes create clearer audit points | You need run, node, tool, approval, and checkpoint visibility |
| Deployment | You deploy the app like normal service code | LangSmith Deployment and Agent Server can deploy graphs, persistence, and queues | The runtime owns more of the production agent lifecycle | You want managed agent runs, threads, queues, and persistence |

For a production team, this is the practical test: can you write down the legal states of a run? If the answer is "requested, researching, tool_pending, approval_pending, approved, executing, failed_retriable, escalated, complete," use LangGraph. If the answer is "call the model, maybe call a tool, return a response," use LangChain.
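One way to run that test is to write the legal states and transitions down as data. The sketch below is framework-agnostic and uses only the standard library. The state names come from the list above; the transition table and the `can_transition` helper are illustrative assumptions, not part of LangChain or LangGraph.

```python
from enum import Enum


class RunState(str, Enum):
    REQUESTED = "requested"
    RESEARCHING = "researching"
    TOOL_PENDING = "tool_pending"
    APPROVAL_PENDING = "approval_pending"
    APPROVED = "approved"
    EXECUTING = "executing"
    FAILED_RETRIABLE = "failed_retriable"
    ESCALATED = "escalated"
    COMPLETE = "complete"


# Hypothetical transition table: each state maps to the states that may follow it.
ALLOWED = {
    RunState.REQUESTED: {RunState.RESEARCHING},
    RunState.RESEARCHING: {RunState.TOOL_PENDING, RunState.ESCALATED},
    RunState.TOOL_PENDING: {RunState.APPROVAL_PENDING, RunState.FAILED_RETRIABLE},
    RunState.APPROVAL_PENDING: {RunState.APPROVED, RunState.ESCALATED},
    RunState.APPROVED: {RunState.EXECUTING},
    RunState.EXECUTING: {RunState.COMPLETE, RunState.FAILED_RETRIABLE},
    RunState.FAILED_RETRIABLE: {RunState.TOOL_PENDING, RunState.ESCALATED},
    RunState.ESCALATED: set(),  # terminal in this sketch
    RunState.COMPLETE: set(),   # terminal in this sketch
}


def can_transition(current: RunState, nxt: RunState) -> bool:
    """Return True only if the table explicitly allows current -> nxt."""
    return nxt in ALLOWED[current]
```

If a table like this is easy to fill in and has more than a handful of rows, the workflow is already a state machine, and the graph runtime is the natural place to encode it.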

The state test also changes how you debug. With LangChain, a failed tool call often pushes you toward log reconstruction: request payload, model output, tool arguments, error, retry. With LangGraph, the graph itself gives you a place to attach that evidence: node name, thread ID, checkpoint, current state, interrupt payload, human decision, resumed command, and next edge.

That is why a production agent should not wait until late-stage reliability work to model its state explicitly. If approval, retries, and recovery are core to the product, model them at the orchestration layer from the start.

Use LangChain When the Agent Loop Is Still Linear

LangChain is the right default when the agent's control flow is simple enough to fit in one readable harness. A linear agent can still be useful in production if the operational boundary is narrow.

A good LangChain-shaped workload looks like this:

One user request enters the system.
The agent gets a bounded set of tools.
The model decides whether to call one or more tools.
The response returns in the same request or a short background job.
Retry and audit rules can live in ordinary service code.

Example: an internal engineering assistant that checks a deployment status, fetches the current incident summary, and drafts a Slack update. The production risk is real, but the action is bounded. The assistant reads data and drafts text. It does not need to wait overnight, hold a long-running transaction, coordinate multiple actors, or preserve a partial workflow after approval.

In that case, LangChain's create_agent harness is useful because it lets you configure the basics directly: model, tools, and system_prompt. For advanced behavior, you extend the harness with middleware. The system still needs production controls, but those controls can sit around the harness:

Keep the tool surface small

Expose only the tools the agent needs for the current job. A deployment assistant should not also get billing, customer data, or repo-write tools just because they are available in the same app.

Trace every model and tool step

LangSmith tracing captures a complete record of every step during a request, from inputs to final output. Turn tracing on before a team relies on the agent, not after the first incident.

Move risky actions out of the loop

If the agent drafts an action that changes production state, write the draft to a queue and let a human or deterministic service execute it. That keeps LangChain useful without pretending the harness is an approval runtime.

Set the migration trigger

The first time you add approval waits, multi-step recovery, or long-lived task state, migrate the workflow into LangGraph instead of layering custom state flags across handlers.
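Two of those controls, the small tool surface and the draft queue, can be sketched in ordinary service code. This is a minimal illustration under stated assumptions: the tool functions, the `ALLOWLIST` mapping, `tools_for`, `DraftAction`, and `propose` are all hypothetical names, and a production queue would be durable rather than an in-process `queue.Queue`.

```python
import queue
from dataclasses import dataclass


# Stand-in tools; real ones would call internal services.
def get_deploy_status(service: str) -> str:
    return f"{service}: healthy"


def get_incident_summary(incident_id: str) -> str:
    return f"summary for {incident_id}"


def issue_refund(order_id: str) -> str:
    return f"refunded {order_id}"  # risky: changes external state


REGISTRY = {
    "get_deploy_status": get_deploy_status,
    "get_incident_summary": get_incident_summary,
    "issue_refund": issue_refund,
}

# Each job sees only the tools it needs, not everything in the registry.
ALLOWLIST = {
    "deployment_assistant": ["get_deploy_status", "get_incident_summary"],
}


def tools_for(job: str) -> dict:
    """Return the allowed tools for a job; unknown jobs get no tools."""
    return {name: REGISTRY[name] for name in ALLOWLIST.get(job, [])}


@dataclass
class DraftAction:
    tool_name: str
    arguments: dict
    reason: str


review_queue: "queue.Queue[DraftAction]" = queue.Queue()


def propose(action: DraftAction) -> None:
    """Record a risky action for review instead of executing it."""
    review_queue.put(action)
```

The agent would receive `tools_for("deployment_assistant").values()` as its tool list, while anything that mutates production state goes through `propose` and waits for a person or a deterministic service.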

LangChain also remains valuable in a LangGraph system. The LangGraph docs describe LangChain as the agent framework for abstractions and integrations around models, tools, and agent loops. If your team already has LangChain model wrappers or tool definitions, keep them. The change is where the control flow lives.

Use LangGraph When the Run Has to Pause and Resume

LangGraph is the production choice when a run must survive beyond one request. The moment a human approval, retry boundary, long-running queue, or recovery path becomes part of the product, the runtime needs durable state.

LangGraph's persistence layer saves graph state as checkpoints. When a graph uses a checkpointer, LangGraph requires a thread_id in the configurable config so the checkpointer can load saved state. That thread_id becomes the persistent cursor for the work. Without it, the system cannot safely resume after an interrupt.
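Because the thread ID is the handle for resuming, it helps to derive it from a business key rather than generate it ad hoc. The helper below is a small sketch of that idea; `thread_config` is a hypothetical name, and the only part taken from LangGraph is the `{"configurable": {"thread_id": ...}}` config shape.

```python
def thread_config(ticket_id: str) -> dict:
    """Build the run config that carries a stable thread ID for a ticket."""
    if not ticket_id or not ticket_id.strip():
        raise ValueError("ticket_id is required to derive a stable thread_id")
    # Deterministic: the same ticket always maps to the same thread,
    # so any worker can find the saved state for that ticket.
    return {"configurable": {"thread_id": ticket_id.strip()}}
```

Rejecting an empty key up front is a deliberate choice here: a run started without a usable thread ID is a run that cannot be resumed later.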

Interrupts are the key approval primitive. The interrupt() function pauses graph execution and accepts a JSON-serializable payload that is surfaced to the caller. When the graph resumes with Command, that resumed value becomes the return value of interrupt() inside the node. The docs are explicit that interrupts require a checkpointer, a thread ID, and an interrupt() call at the pause point.

A production approval gate should be modeled like this:

Create an explicit state schema

Put the business state in the graph state: ticket_id, requested_action, risk_level, approval_status, operator_id, and audit_reason. Do not hide these values in logs only.

Checkpoint before the risky action

Place the interrupt before the tool that changes external state. The approval payload should show the exact action, target resource, generated arguments, and expected side effect.

Resume with the human decision

The resumed command should carry approved, rejected, or needs_edit, plus the operator identity from your app. The graph should branch from that state, not inspect a loose external flag.

Trace the whole thread

Use the same thread_id across the run, and emit it into logs, traces, approval queue rows, and billing events. That gives support and engineering one handle for debugging.

Here is the implementation shape. The production version should use a durable checkpointer, but the control pattern is the same:

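```python
from typing_extensions import TypedDict

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.types import Command, interrupt


class State(TypedDict):
    ticket_id: str
    requested_action: str
    approved: bool


def approval_gate(state: State):
    # interrupt() pauses the run and surfaces this payload to the caller.
    # On resume, it returns the value passed in Command(resume=...).
    decision = interrupt({
        "ticket_id": state["ticket_id"],
        "requested_action": state["requested_action"],
    })
    # Only an explicit True approves. bool(decision) would treat any
    # non-empty value, such as the string "rejected", as an approval.
    return {"approved": decision is True}


builder = StateGraph(State)
builder.add_node("approval_gate", approval_gate)
builder.add_edge(START, "approval_gate")
builder.add_edge("approval_gate", END)

# InMemorySaver is for demonstration only; it does not survive a restart.
graph = builder.compile(checkpointer=InMemorySaver())
# The thread_id ties the paused run to its saved checkpoints.
config = {"configurable": {"thread_id": "ticket-123"}}

# First call runs until the interrupt and returns with the run paused.
paused = graph.invoke(
    {"ticket_id": "ticket-123", "requested_action": "refund_order", "approved": False},
    config=config,
)

# Second call resumes the same thread with the human decision.
resumed = graph.invoke(Command(resume=True), config=config)
```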

The important part is not the in-memory checkpointer in the example. It is the shape: checkpointed graph, stable thread, interrupt payload, explicit resume. In production, the checkpointer must be durable because the point of the design is to survive worker failures, deploys, and slow human response.
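The gate above resumes with a single boolean, while the approval steps earlier call for a three-way decision. One way to express that branch is a small routing function over the state. The sketch below is plain Python; the node names `execute_action`, `revise_action`, and `close_rejected` are hypothetical, and wiring the function into a graph (for example through LangGraph's `add_conditional_edges`) is left out so the logic can be tested on its own.

```python
from typing import Literal, TypedDict


class ApprovalState(TypedDict):
    ticket_id: str
    requested_action: str
    approval_status: str  # expected: "approved", "rejected", or "needs_edit"
    operator_id: str


def route_after_approval(
    state: ApprovalState,
) -> Literal["execute_action", "revise_action", "close_rejected"]:
    """Pick the next node from the recorded human decision."""
    status = state["approval_status"]
    if status == "approved":
        return "execute_action"
    if status == "needs_edit":
        return "revise_action"
    # Anything else, including unexpected values, fails closed.
    return "close_rejected"
```

Failing closed on unknown values is the design choice worth keeping: a typo in the approval UI should never route to the node that performs the side effect.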

LangGraph also gives you better places to attach observability. Event streaming supports stream modes such as updates, values, messages, custom, checkpoints, tasks, and debug. LangGraph v1.2 introduced event streaming as a typed-projection API. For an agent team, that means the UI can show tokens to the user while ops sees node updates, checkpoint events, and task progress without treating every stream chunk as the same thing.
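A rough sketch of that split, assuming events arrive as `(mode, chunk)` pairs, which is the shape a multi-mode stream is expected to yield. The `UI_MODES` and `OPS_MODES` groupings and the `split_stream` helper are illustrative choices, not a LangGraph API, and the function works on any iterable of pairs.

```python
from typing import Any, Iterable, List, Tuple

# Assumed grouping: what the end user sees versus what operators watch.
UI_MODES = {"messages", "custom"}
OPS_MODES = {"updates", "checkpoints", "tasks", "debug"}


def split_stream(
    events: Iterable[Tuple[str, Any]],
) -> Tuple[List[Any], List[Tuple[str, Any]]]:
    """Separate user-facing chunks from operator-facing events."""
    ui: List[Any] = []
    ops: List[Tuple[str, Any]] = []
    for mode, chunk in events:
        if mode in UI_MODES:
            ui.append(chunk)
        elif mode in OPS_MODES:
            # Keep the mode so dashboards can tell event types apart.
            ops.append((mode, chunk))
        # "values" snapshots and unknown modes are ignored in this sketch.
    return ui, ops
```

In a real service the two lists would be two sinks, such as a websocket to the user and a structured log for operators, but the routing decision is the same.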

This is also where the existing DVNC.dev guidance on OpenAI Agents SDK vs LangGraph connects: once the workflow is stateful and cross-step, runtime behavior matters more than API convenience.

Production Cost and Deployment Line

The cost line is not LangChain vs LangGraph open source. The cost line appears when you add LangSmith tracing, evaluation, deployment, managed persistence, and operator workflows around the agent.

LangSmith pricing page — LangSmith pricing is the relevant production cost surface once tracing, evaluation, deployment, and managed agent operations enter the build.

LangSmith's Developer plan is $0 per seat per month with 1 seat, and includes up to 5k base traces per month before pay-as-you-go pricing applies. The Plus plan is $39 per seat per month and includes up to 10k base traces per month, one dev-sized agent deployment, unlimited seats, up to 3 workspaces, LangSmith Engine, and LangSmith Sandboxes.

For deployment, the pricing page lists $0.005 per deployment run for additional deployments. Uptime costs $0.0036 per minute for each Production deployment and $0.0007 per minute for each Development deployment. The Developer plan includes 50 Fleet runs per month and the Plus plan includes 500. LLM usage is billed separately by the model provider.
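The per-minute figures are easier to reason about as monthly totals. The arithmetic below uses only the rates quoted above; the 30-day month, the always-on assumption, and the run volume passed to `run_cost` are illustrative inputs, and the result excludes plan allowances, trace overages, and LLM usage.

```python
from decimal import Decimal

# Rates quoted above, in US dollars.
PROD_PER_MINUTE = Decimal("0.0036")
DEV_PER_MINUTE = Decimal("0.0007")
PER_DEPLOYMENT_RUN = Decimal("0.005")

# Assumption: a deployment that stays up for a full 30-day month.
MINUTES_PER_30_DAYS = 60 * 24 * 30  # 43,200 minutes


def monthly_uptime(rate_per_minute: Decimal, deployments: int = 1) -> Decimal:
    """Uptime cost for always-on deployments over a 30-day month."""
    return rate_per_minute * MINUTES_PER_30_DAYS * deployments


def run_cost(runs: int) -> Decimal:
    """Cost of a given number of billed deployment runs."""
    return PER_DEPLOYMENT_RUN * runs


prod_month = monthly_uptime(PROD_PER_MINUTE)
dev_month = monthly_uptime(DEV_PER_MINUTE)
```

Under those assumptions, one always-on Production deployment comes to about $155.52 per month and one Development deployment to about $30.24, and a hypothetical 100,000 billed runs would add $500.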

The production implication is straightforward: do not evaluate LangGraph only by library complexity. Evaluate the whole operating model:

How many traces will the team retain per month?
Which runs need evals before release?
Which actions need annotation queues or human feedback?
How many deployments need to stay warm?
What is the expected uptime cost of a production deployment?
Does the team need self-hosting, SSO, RBAC, or support SLA?

LangSmith Cloud deployment runs on AWS and GCP and requires Plus or above. It can deploy from the LangSmith UI connected to GitHub or through the langgraph deploy CLI. The Cloud deployment docs also state that production deployments can serve up to 500 requests per second. That is a useful ceiling for initial architecture planning, but it should not replace your own load test against real tool latency, model latency, queue behavior, and approval waits.

Agent Server is the managed runtime layer worth understanding here. Its docs describe built-in persistence and a task queue, with assistants, threads, runs, and cron jobs. It persists core resource data, checkpoints, and long-term memory in PostgreSQL by default. Redis handles signaling, cancellation, and streaming pub/sub, and stores only ephemeral data. The recommendation is to export an already compiled graph so the server loads it once at container startup and reuses it for every run; factory functions add overhead on every invocation.

That deployment shape is a strong fit for a serious agent product, but it is too much machinery for a simple read-only assistant. The practical build rule is to start LangChain-only when the service boundary is small, and move to LangGraph before you invent your own half-runtime with cron jobs, status tables, retry handlers, manual approval rows, and replay scripts.

Migration Rule: Compose LangChain Inside LangGraph

The best migration is not a rewrite from LangChain to LangGraph. It is moving orchestration into LangGraph while keeping useful LangChain components for models, tools, and agent loops.

The LangGraph docs state that LangChain components are commonly used to integrate models and tools, but that LangChain is not required to use LangGraph. That gives teams a clean migration path:

Keep the existing model adapters and tool definitions.
Identify the business states that matter.
Move the control flow into graph nodes and edges.
Add a checkpointer and require a stable thread_id.
Replace ad hoc approval flags with interrupts.
Trace by thread, node, tool, approval, and checkpoint.

The migration point is usually visible in the incident log. If support asks "what happened to this run?" and the answer requires stitching together application logs, queue rows, model traces, tool calls, and Slack approvals by timestamp, the migration is already overdue. That is exactly the problem LangGraph's threads and checkpoints are meant to solve.

For a technical founder or platform lead, the decision should be made by blast radius. A read-only helper can start with LangChain. A production agent that files tickets, edits records, sends messages, touches billing, or coordinates work across systems should start with LangGraph or move there before launch. The extra graph structure is cheaper than explaining an irreversible action that no one can reconstruct.

For monitoring, use the operational fields from the start: thread_id, run_id, node_name, checkpoint_id, tool_name, tool arguments hash, approval status, operator ID, retry count, model provider, latency, token usage, and final outcome. The companion AI agent monitoring playbook covers the dashboard side; the LangChain vs LangGraph decision determines whether those fields exist as first-class runtime facts or after-the-fact logs.
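As a starting point, those fields can be captured in one structured record per step. The sketch below is a standard-library illustration, not a LangSmith or LangGraph schema: `RunEvent`, `hash_arguments`, and the exact field names and units are assumptions chosen to mirror the list above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def hash_arguments(arguments: dict) -> str:
    """Stable SHA-256 digest of tool arguments, so logs can avoid raw payloads."""
    # Sorted keys and fixed separators make the digest independent of key order.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunEvent:
    thread_id: str
    run_id: str
    node_name: str
    checkpoint_id: str
    tool_name: str
    tool_args_hash: str
    approval_status: str
    operator_id: str
    retry_count: int
    model_provider: str
    latency_ms: int
    token_usage: int
    outcome: str

    def to_log_line(self) -> str:
        """Serialize to one JSON line for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)
```

Hashing the arguments keeps customer data out of the log stream while still letting an operator confirm that the approved call and the executed call carried the same payload.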

FAQ

What is the difference between LangGraph and LangChain?

LangChain is the agent framework and configurable harness for model, prompt, tools, and middleware. LangGraph is the orchestration runtime for long-running, stateful agents that need durable execution, streaming, human-in-the-loop, and persistence.

Can we use LangGraph without LangChain?

Yes. LangGraph docs state that LangChain components are commonly used for models and tools, but LangChain is not required to use LangGraph.

Do you need LangChain for LangGraph?

No. You need a graph, state schema, checkpointer, thread ID discipline, and runtime observability. LangChain remains useful when you want its model and tool integrations inside the graph.

Should I learn LangChain for LangGraph?

Learn LangChain if your immediate job is wiring model and tool integrations behind a simple agent harness. Learn LangGraph if your job is production orchestration: state, retries, approval, streaming, recovery, and deployment.

Is LangGraph better than LangChain?

LangGraph is better for stateful production agents. LangChain is better for a smaller, faster harness when the workflow is linear and the app can own persistence, retries, and approval outside the agent loop.

Last Updated

Jun 5, 2026

Category: Agents
