Evals & Observability

What separates demos from deployed systems: Langfuse and LangSmith, OpenAI tracing, golden datasets, regression testing, and cost telemetry. The instrumentation every production AI system needs.

Articles: 4

Topics: 6

All Articles

Trace-to-Eval Builder Build Log

A build log for Trace-to-Eval Builder, a BYOK app that turns agent traces into replayable eval packs.

Agent Runbook Auditor: A BYOK Launch Review Tool for Agent Workflows

A build log for Agent Runbook Auditor, a BYOK OpenAI demo that reviews agent runbooks for launch risk, trace design, eval cases, guardrails, and rollout readiness.

OpenAI Agents SDK Tracing: What It Shows in Production

Use OpenAI Agents SDK tracing as run inspection, not full observability. Configure sensitive data, flushing, trace exports, evals, and approvals.

Langfuse vs LangSmith for Production Observability

Choose Langfuse for self-hosted, framework-neutral traces. Choose LangSmith for managed LangChain evals, review, alerting, and deployment.
