Langfuse vs LangSmith for Production Observability

Choose Langfuse for self-hosted, framework-neutral traces. Choose LangSmith for managed LangChain evals, review, alerting, and deployment.

Tuesday, June 2, 2026 · Omid Saffari
Choose Langfuse when you need framework-neutral tracing, self-hosting, and cost visibility across a mixed AI stack. Choose LangSmith when your production loop is already built around LangChain or LangGraph and you want tracing, evals, human review, alerting, and agent deployment in one managed platform.

The Verdict

Langfuse is the better default for teams that need observability to belong to their own stack. It is strongest when your agents, RAG flows, model calls, and product workflows span several frameworks, when trace data cannot live in a vendor cloud, or when you want pricing that scales around usage units and your own infrastructure choices. Its current public pricing starts with a free Hobby plan, then Core at $29/month, Pro at $199/month, and Enterprise at $2499/month, with 100k units included on paid cloud plans and additional usage at $8 per 100k units, lower with volume.

LangSmith is the better default when LangChain or LangGraph is already the application framework, and the team wants less platform assembly. It gives you tracing, online and offline evals, prompt tools, annotation queues, monitoring and alerting, and adjacent LangSmith products for deployment, Fleet, Engine, and Sandboxes. Its Developer plan is $0 per seat per month with 5k base traces per month, and Plus is $39 per seat per month with 10k base traces per month.

The production choice is not open source versus proprietary in the abstract. The choice is whether the observability layer should be a self-owned data plane with open instrumentation, or a managed agent engineering system tied tightly to the LangChain ecosystem.

The Axis That Actually Separates Them

The split is control versus lifecycle integration. Langfuse gives you more control over where telemetry lives and how the observability layer plugs into the rest of your platform. LangSmith gives you a more integrated managed loop around traces, evals, review, alerting, and deployment.

Langfuse can be deployed locally, in cloud infrastructure, within a VPC, or on-premises, with internet access optional. Its self-hosted architecture uses Postgres for transactional workloads, ClickHouse for traces, observations, and scores, Redis or Valkey for queues and cache, and S3 or blob storage for events, multimodal inputs, and large exports. That is real control, and it is also real infrastructure.

LangSmith is simpler when the application is already LangChain or LangGraph. Its docs say LangSmith tracing can be enabled for LangChain or LangGraph with a single environment variable, and the quickstart uses LANGSMITH_TRACING=true and LANGSMITH_API_KEY. For other providers, it supports wrappers for OpenAI, Anthropic, and Google Gemini, plus manual tracing with @traceable.
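
Because that setup depends on two environment variables, a service can check for them at startup instead of discovering later that runs were never traced. The sketch below is a minimal, standard-library-only check; the variable names come from the quickstart mentioned above, while `tracing_config_problems` is a hypothetical helper, not part of any LangSmith SDK.

```python
import os
from typing import Mapping


def tracing_config_problems(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable problems with the tracing environment (empty if none).

    Hypothetical helper: it only inspects environment variables and never
    contacts LangSmith.
    """
    problems = []
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not set to 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing or empty")
    return problems
```

Calling this once at boot and refusing to start in production when the list is non-empty is one way to keep tracing from being optional per engineer.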

Langfuse vs LangSmith: Production Comparison

| Axis | Langfuse | LangSmith | Production Decision | Watch First |
|---|---|---|---|---|
| Best fit | Mixed frameworks, self-hosting, broad LLM telemetry | LangChain or LangGraph teams that want a managed lifecycle | Pick by application architecture, not dashboard preference | Framework lock-in |
| Hosting | Cloud or self-hosted, including VPC and on-premises | Cloud on self-serve plans; Enterprise includes cloud, hybrid, or self-hosted options | Langfuse is easier to justify when data location is non-negotiable | Infra ownership |
| Pricing model | Cloud plans plus billable units | Seats, trace usage, retention, and adjacent product usage | Langfuse is simpler for unlimited-user teams; LangSmith is clearer for managed LangChain workflows | Retention math |
| Free tier | Hobby is free with 50k units per month, 30 days data access, and 2 users | Developer is $0 per seat with 5k base traces per month and 1 seat | Both are enough for evaluation, not enough to model production cost alone | Production traffic shape |
| Paid entry | Core is $29/month with 100k units, 90 days data access, and unlimited users | Plus is $39 per seat per month with 10k base traces per month | Langfuse scales by usage and unlimited users; LangSmith adds seats | Team size |
| Trace retention | Hobby 30 days, Core 90 days, Pro 3 years | Base traces retain for 14 days at $2.50 per 1k; extended traces retain for 400 days at $5.00 per 1k | Long retention is a cost decision, not just a feature | Which traces deserve history |
| Evals | Online and offline evals, datasets, experiments, scores, LLM-as-a-Judge, code evaluators, annotation queues | Online and offline evals, regression testing, production monitoring, anomaly detection, annotation queues | Both can support a real eval loop; the surrounding process decides value | Golden dataset quality |
| Cost tracking | Usage and cost on generation and embedding observations, via ingestion or model-based inference | Trace and usage billing plus cost visibility in the LangSmith product | Cost attribution still needs normalized metadata from your app | Tenant and feature tags |
| Deployment | Observability platform only | LangSmith Deployment, Fleet, Engine, and Sandboxes available with separate usage pricing | LangSmith wins if managed agent deployment belongs in the same platform | Operational coupling |

Figure: Langfuse pricing page. Langfuse pricing centers on plans, included units, retention, and cloud usage overage.
Figure: LangSmith pricing page. LangSmith pricing combines seats, traces, retention, deployment, Fleet, Engine, and Sandbox usage.

Langfuse: Best When Observability Has To Belong To Your Stack

Langfuse is the stronger choice when observability is a platform component you expect to operate, extend, and connect to internal systems. It captures prompts, model responses, token usage, latency, tool calls, retrieval steps, timing, inputs, outputs, and metadata. That is the minimum viable trace for production AI: enough context to explain why a request failed, which model path it used, what it cost, and which retrieval or tool step changed the answer.

Langfuse is also the clearer choice when data residency or customer contracts make vendor-hosted traces difficult. The core product is MIT-licensed outside the /ee folders, and Langfuse says its product features are freely available under that license. The exception is the enterprise modules in /ee, such as SCIM, extended audit logging, and data retention policies, which require a commercial license when self-hosted. That boundary matters: a team can run the core system without treating every production trace as a SaaS procurement decision, while still paying for enterprise controls when those controls are needed.

Self-hosting is not a shortcut. The Langfuse production stack includes Postgres, ClickHouse, Redis or Valkey, and S3 or blob storage. Docker Compose is useful for testing and low-scale deployments, but production-scale self-hosting means Kubernetes with Helm, Terraform on AWS, Azure, or GCP, Railway, or a comparable operations setup. If the team does not already run databases, object storage, backups, and migrations, and does not already monitor its own monitoring stack, the managed cloud plan may be the more honest path.

The cost model is straightforward enough to forecast. Hobby includes 50k units per month, 30 days data access, and 2 users. Core includes 100k units per month, 90 days data access, unlimited users, and additional usage at $8 per 100k units, lower with volume. Pro keeps 100k included units and moves to 3 years data access. The pricing calculator lists graduated tiers: 0-100k units free, 100k-1M units at $8 per 100k, 1-10M at $7 per 100k, 10-50M at $6.5 per 100k, and 50M+ at $6 per 100k.
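
The graduated tiers above can be turned into a small forecast. The sketch below assumes the tiers are marginal (each rate applies only to the units inside its band) and that usage is billed pro rata rather than in whole 100k blocks; both are assumptions to confirm against the pricing calculator. It returns usage cost only and excludes the plan's base fee. `langfuse_usage_cost` is a hypothetical name.

```python
# (upper bound of the tier in units, USD per 100k units); None means no upper bound.
TIERS = [
    (100_000, 0.0),
    (1_000_000, 8.0),
    (10_000_000, 7.0),
    (50_000_000, 6.5),
    (None, 6.0),
]


def langfuse_usage_cost(units: int) -> float:
    """Estimate monthly usage cost in USD for a number of billable units."""
    if units < 0:
        raise ValueError("units must be non-negative")
    cost = 0.0
    lower = 0
    for upper, rate in TIERS:
        # Portion of this tier that the usage actually reaches.
        tier_top = units if upper is None else min(units, upper)
        if tier_top > lower:
            cost += (tier_top - lower) * rate / 100_000
        if upper is None or units <= upper:
            break
        lower = upper
    return cost
```

Under these assumptions, 2M units in a month works out to 900k units at $8 plus 1M units at $7 per 100k, or $142 on top of the plan price.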

Langfuse cost tracking is useful, but only if the application sends the right fields. It tracks usage and cost on observations of type generation and embedding. Cost can be ingested through the API, SDKs, or integrations, or inferred from the model parameter with predefined models and tokenizers. For reasoning models such as the OpenAI o1 model family, Langfuse says cost inference is not supported when no token counts are ingested. In production, that means you should send provider usage directly whenever the model response includes it.
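
One way to enforce that rule in application code is to classify each generation by where its usage numbers will come from before it is sent. The sketch below is illustrative only: the `input_tokens` and `output_tokens` keys and the `usage_source` helper are assumptions rather than Langfuse field names, and the `o1` prefix check is a stand-in for whatever list of reasoning models a team maintains.

```python
from typing import Optional

# Assumption: model names in the reasoning family share a recognizable prefix.
REASONING_MODEL_PREFIXES = ("o1",)


def usage_source(model: str, provider_usage: Optional[dict]) -> str:
    """Classify where usage for a generation will come from.

    Returns "provider" when the response carried token counts, "inferred" when
    the platform could plausibly infer them from the model name, and "missing"
    when neither is possible (reasoning models without ingested counts).
    """
    if (
        provider_usage
        and "input_tokens" in provider_usage
        and "output_tokens" in provider_usage
    ):
        return "provider"
    if model.startswith(REASONING_MODEL_PREFIXES):
        return "missing"
    return "inferred"
```

Counting how often `"missing"` appears per release is a cheap signal that cost dashboards are under-reporting.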

LangSmith: Best When The Agent Lifecycle Belongs In One Managed System

LangSmith is the stronger choice when the application is already built around LangChain or LangGraph and the team wants a managed loop from trace to eval to review to deployment. LangSmith defines a trace as a single execution of an application that can include many individual steps, such as LLM calls and other tracked events. Its tracing quickstart describes a trace as the complete record of every step in a request, from inputs to final output.

The setup advantage is real for LangChain and LangGraph teams. If the codebase already uses those frameworks, tracing can be turned on with one environment variable and an API key. That matters during rollout because observability fails most often when it is optional per engineer or bolted on after launch. If every run through the framework is traced consistently, the team can debug behavior, collect examples, and build evals without first designing a telemetry standard from scratch.

LangSmith's evaluation model is also production-friendly. Its docs frame evaluation as measuring quality from pre-deployment testing to production monitoring. Offline evaluations cover benchmarking, regression testing, unit testing, and backtesting. Online evaluations cover real-time monitoring, anomaly detection, and production feedback on live traffic. The docs recommend creating 5-10 examples of what good looks like for each critical component, such as retrieval, tool selection, argument formatting, or final answer quality.

That workflow is useful when a team wants evals to be an operating rhythm, not a notebook. A production agent needs failed traces turned into dataset examples, dataset examples turned into regression tests, regression tests tied to release gates, and live quality checks tied to alerting or review. LangSmith gives more of that managed surface in one place.
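
The first two steps of that loop, turning a failed trace into a dataset example and replaying examples as a regression test, can be sketched without any vendor SDK. Everything here is hypothetical scaffolding: the trace dictionary shape, the helper names, and the exact-match comparison, which stands in for whatever evaluator a team would really use.

```python
from typing import Callable


def trace_to_example(trace: dict, expected_output: str) -> dict:
    """Convert a reviewed trace into a dataset example with a corrected answer."""
    return {
        "inputs": trace["inputs"],
        "expected": expected_output,
        "source_trace_id": trace["trace_id"],  # keeps the link back to the incident
    }


def run_regression(
    examples: list[dict],
    app: Callable[[dict], str],
    min_pass_rate: float = 1.0,
) -> tuple[bool, float]:
    """Replay examples through the app and return (gate_passed, pass_rate)."""
    if not examples:
        raise ValueError("regression set is empty")
    # Exact match is a placeholder evaluator; swap in a scored comparison as needed.
    passed = sum(1 for ex in examples if app(ex["inputs"]) == ex["expected"])
    pass_rate = passed / len(examples)
    return pass_rate >= min_pass_rate, pass_rate
```

Wiring the boolean into CI is what turns the dataset into a release gate rather than a notebook artifact.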

The tradeoff is cost and platform coupling. Plus is $39 per seat per month, includes 10k base traces per month, and supports unlimited seats and up to 3 workspaces. Base traces retain for 14 days and cost $2.50 per 1k traces. Extended traces retain for 400 days and cost $5.00 per 1k traces, with base-to-extended upgrades at $2.50 per 1k traces. If a team stores every trace as extended history, retention becomes a meaningful bill. If it keeps only failures, reviewed runs, eval examples, and release-critical traces, the economics are easier to control.
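
To see how much the retention choice moves the bill, the listed Plus prices can be combined into a rough monthly estimate. The sketch assumes the 10k included base traces are a single monthly allowance, that every trace is first billed at the base rate, and that extended retention is paid as the base-to-extended upgrade; how the allowance is actually applied should be checked against the pricing page. `langsmith_plus_monthly` is a hypothetical helper.

```python
from decimal import Decimal

SEAT_PRICE = Decimal("39")          # USD per seat per month (Plus)
INCLUDED_BASE_TRACES = 10_000       # assumed to be one allowance per month
BASE_PER_1K = Decimal("2.50")       # USD per 1k base traces
UPGRADE_PER_1K = Decimal("2.50")    # USD per 1k base-to-extended upgrades


def langsmith_plus_monthly(seats: int, traces: int, extended_traces: int) -> Decimal:
    """Estimate seats plus trace cost for one month on the Plus plan."""
    if not 0 <= extended_traces <= traces:
        raise ValueError("extended_traces must be between 0 and traces")
    billable_base = max(0, traces - INCLUDED_BASE_TRACES)
    return (
        seats * SEAT_PRICE
        + Decimal(billable_base) / 1000 * BASE_PER_1K
        + Decimal(extended_traces) / 1000 * UPGRADE_PER_1K
    )
```

Under these assumptions, 3 seats and 500k traces cost about $1,404.50 a month when 5% of traces are kept as extended history, and about $2,592 when all of them are.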

LangSmith also adds adjacent product pricing. Plus includes 1 free dev-sized agent deployment, but additional deployment runs cost $0.005 each. Production deployment uptime costs $0.0036 per minute, development deployment uptime costs $0.0007 per minute, additional Fleet runs cost $0.05 per Fleet run, Engine costs $1.50 per LCU, and Sandboxes cost $0.0576 per vCPU-hour, $0.0185 per GiB-hour memory, and $0.000123 per GiB-hour storage. None of those numbers are bad by themselves. They just belong in the architecture decision before LangSmith becomes the default control plane.
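
Per-minute prices are easy to underestimate, so it helps to convert them to a monthly figure. The sketch below assumes an always-on deployment and a 30-day month; the helper names are hypothetical and the rates are the ones quoted above.

```python
from decimal import Decimal

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

PRODUCTION_UPTIME_PER_MIN = Decimal("0.0036")
DEVELOPMENT_UPTIME_PER_MIN = Decimal("0.0007")
DEPLOYMENT_RUN_PRICE = Decimal("0.005")


def always_on_uptime_cost(rate_per_minute: Decimal) -> Decimal:
    """Monthly uptime cost in USD for a deployment that never scales to zero."""
    return rate_per_minute * MINUTES_PER_30_DAY_MONTH


def deployment_runs_cost(additional_runs: int) -> Decimal:
    """Monthly cost in USD of deployment runs beyond any included allowance."""
    return DEPLOYMENT_RUN_PRICE * additional_runs
```

With these inputs, one always-on production deployment comes to roughly $155.52 a month and a development deployment to roughly $30.24, before any run charges.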

The Cost Line

Langfuse is usually easier to reason about when the team has many internal users and wants usage to dominate the bill. LangSmith is usually easier to justify when the team is paying for a managed lifecycle around a LangChain or LangGraph application, not just trace storage.

Here is the production cost question to ask before choosing either one:

monthly observability cost =
  traced requests
  x spans per request
  x retention class
  x review/eval sampling rate
  x team access model
  + deployment/control-plane usage
  + self-host infrastructure and operations

That formula keeps the comparison honest. Langfuse Cloud charges around units, plan, and retention. LangSmith charges around seats, trace allowance, trace retention, and optional managed products. Langfuse self-hosting replaces vendor usage fees with infrastructure and operations. LangSmith Enterprise self-hosting exists, but it is part of custom Enterprise packaging rather than the self-serve path.

The dangerous cost pattern is not high traffic alone. It is retaining low-value traces at the long-retention price, sending untagged spans that cannot be aggregated by customer or feature, and running evals without a sampling policy. Trace everything briefly. Retain selectively. Promote only the traces that have learning value: failures, regressions, human-reviewed runs, unusual latency, high-cost requests, and examples that become dataset rows.
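
A retention policy like that is easier to audit when it is written as code that returns its reasons. The sketch below is a minimal version, assuming a hypothetical `TraceSummary` record and illustrative latency and cost thresholds; neither tool defines these names.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    failed: bool = False
    regression: bool = False
    human_reviewed: bool = False
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    dataset_candidate: bool = False


def promotion_reasons(
    trace: TraceSummary,
    latency_threshold_ms: float = 10_000,  # illustrative threshold
    cost_threshold_usd: float = 0.50,      # illustrative threshold
) -> list[str]:
    """Return why a trace deserves long retention; an empty list means let it expire."""
    reasons = []
    if trace.failed:
        reasons.append("failure")
    if trace.regression:
        reasons.append("regression")
    if trace.human_reviewed:
        reasons.append("human_reviewed")
    if trace.latency_ms > latency_threshold_ms:
        reasons.append("unusual_latency")
    if trace.cost_usd > cost_threshold_usd:
        reasons.append("high_cost")
    if trace.dataset_candidate:
        reasons.append("dataset_example")
    return reasons
```

Storing the returned reasons as tags on the promoted trace makes it possible to later ask which category is driving retention spend.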

What To Log Before Either Tool Is Useful

Both products become weak if the application sends thin telemetry. A trace UI cannot fix missing business context. A production AI system should send the fields that let engineering, product, support, and compliance answer the same incident without arguing about what happened.

At minimum, log:

  • run_id, trace_id, span_id, and parent_span_id
  • user, tenant, workspace, or account identifiers, with privacy-safe masking
  • environment, release, route, feature, and prompt version
  • model provider, model name, temperature, tool policy, and fallback path
  • retrieved document IDs, retrieval scores, reranker scores, and permission filters
  • tool name, arguments, result status, and external system latency
  • token usage, provider-reported cost, inferred cost, and total request latency
  • evaluator names, evaluator versions, score values, and pass/fail thresholds
  • approval state, reviewer, reason code, and handoff target for human decisions
  • error class, retry count, fallback used, and final user-visible outcome

This is where many teams get the tool choice backward. They pick a vendor before they define the run contract. The run contract is the product. Langfuse or LangSmith is the database, UI, eval system, and workflow around it.
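
A run contract can start as nothing more than a list of required keys and a check that runs before a record is exported. The sketch below covers only a subset of the fields listed above; apart from `run_id`, `trace_id`, and `span_id`, the key names are illustrative choices, and `missing_fields` is a hypothetical helper. `parent_span_id` is left out of the required set on the assumption that root spans have no parent.

```python
REQUIRED_FIELDS = (
    "run_id", "trace_id", "span_id",
    "tenant_id", "environment", "release", "feature", "prompt_version",
    "model_provider", "model_name",
    "total_latency_ms", "outcome",
)


def missing_fields(record: dict) -> list[str]:
    """List required contract fields that are absent, None, or empty strings."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
```

Rejecting or flagging records with a non-empty result in staging keeps thin telemetry from reaching either vendor in the first place.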

Decision Rules

Choose Langfuse if you need self-hosting without an enterprise-only hosting gate, have mixed frameworks, want open instrumentation, need unlimited users on paid cloud plans, or expect your observability layer to connect deeply to internal analytics, billing, and security systems. It is also the better fit when the team can operate the stack or can start on Langfuse Cloud and move selected environments into its own infrastructure later.

Choose LangSmith if the application is LangChain or LangGraph-heavy, the team wants the fastest path to consistent tracing, and the managed eval, review, alerting, and deployment surface will reduce platform work. It is the better fit when the agent lifecycle matters more than framework neutrality, and when per-seat and retention pricing are acceptable for the way the team will sample, store, and review traces.

Pick neither as a substitute for an eval policy. The policy decides what gets scored online, what gets turned into a dataset, what blocks a deployment, what triggers human approval, and what gets paged. Without that policy, both tools become expensive screenshots of confusing behavior.
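
Such a policy can live as plain data next to the code, independent of which tool stores the scores. The sketch below is one possible shape, with made-up evaluator names, thresholds, and action labels; it only shows how a score at a given stage maps to the decisions described above.

```python
# Hypothetical policy: thresholds per evaluator, on a 0-1 score scale.
POLICY = {
    "faithfulness": {"block_release_below": 0.80, "review_below": 0.70, "page_below": 0.50},
    "tool_selection": {"block_release_below": 0.90, "review_below": 0.80, "page_below": 0.60},
}


def actions_for(evaluator: str, score: float, stage: str) -> list[str]:
    """Map one evaluator score to policy actions for 'pre_release' or 'production'."""
    rule = POLICY.get(evaluator)
    if rule is None:
        return ["unscored_by_policy"]
    actions = []
    if stage == "pre_release":
        if score < rule["block_release_below"]:
            actions.append("block_deployment")
    elif stage == "production":
        if score < rule["page_below"]:
            actions.append("page_on_call")
        elif score < rule["review_below"]:
            actions.append("queue_human_review")
        if score < rule["review_below"]:
            actions.append("add_to_dataset")
    else:
        raise ValueError(f"unknown stage: {stage}")
    return actions
```

Keeping the thresholds in version control means a change to what blocks a release is reviewed like any other code change.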

For teams still choosing a broader observability direction, keep the decision inside the evals and observability lane: traces, evals, cost, approval, and release gates should move together.

What is the difference between LangSmith and Langfuse?

Langfuse is stronger for framework-neutral tracing, self-hosting, open instrumentation, and cost visibility across a mixed AI stack. LangSmith is stronger when the application is already built on LangChain or LangGraph and the team wants managed tracing, evals, review, alerting, and deployment in one platform.

Is LangSmith free or paid?

LangSmith has a Developer plan at $0 per seat per month with up to 5k base traces per month and 1 seat. Its Plus plan is $39 per seat per month with up to 10k base traces per month, then pay-as-you-go usage.

What is Langfuse used for?

Langfuse is used for LLM application tracing, token and cost tracking, prompt management, datasets, experiments, evaluation scores, human annotation, online evals, and offline evals.

Can Langfuse be self-hosted for production?

Yes. Langfuse can be deployed in cloud infrastructure, inside a VPC, or on-premises, but production-scale self-hosting means operating Postgres, ClickHouse, Redis or Valkey, S3 or blob storage, migrations, backups, and throughput.

Last Updated

Jun 2, 2026
