AI Agent Observability: What to Log Before Production

A production logging contract for AI agents: trace IDs, tool calls, retrieval, cost, eval scores, approvals, and alerts before launch.

Friday, June 26, 2026 · Omid Saffari

AI agent observability starts with one rule: every run must be explainable as a trace, not as a pile of prompts and log lines. Before an agent touches production traffic, log the user-visible outcome, the model and prompt version, every tool call, every retrieval, the cost and latency, the eval result, and the human approval state under one trace ID.

The Production Contract Is One Trace Per Agent Run

The minimum production contract is simple: one user request becomes one trace, and every model call, tool call, retrieval, approval, eval, and final response becomes a span or event attached to it. OpenTelemetry defines traces as the big picture of what happens when a request moves through an application, spans as the units of work inside that path, and attributes as key-value metadata on those spans. That model maps cleanly to agents because an agent run is not one API call. It is a sequence of decisions.

The mistake is treating observability as "store the prompt and completion." That helps during demos. It fails the first time a customer asks why the agent updated the wrong record, skipped a source, retried an expensive tool, or produced an answer that passed syntax checks but violated policy. Production debugging needs the decision chain, not only the transcript.

A useful trace has this shape:

JSON
{
  "trace_id": "agent_run_01",
  "workflow": "support_refund_review",
  "tenant_id": "tenant_hash",
  "user_request": {
    "intent": "refund_status",
    "input_hash": "sha256:...",
    "contains_sensitive_data": true
  },
  "spans": [
    {
      "name": "classify_request",
      "kind": "model_call",
      "model": "model_family_and_version",
      "prompt_version": "refund_router_v7",
      "status": "ok"
    },
    {
      "name": "lookup_order",
      "kind": "tool_call",
      "tool": "orders.read",
      "policy_decision": "allowed",
      "approval_state": "not_required",
      "status": "ok"
    },
    {
      "name": "compose_response",
      "kind": "model_call",
      "eval_result": "pass",
      "status": "ok"
    }
  ]
}

The OpenAI Agents SDK already works this way. Its built-in tracing records LLM generations, tool calls, handoffs, guardrails, and custom events during an agent run, and the SDK describes a trace as a single end-to-end workflow composed of spans. If you use that SDK, treat its built-in traces as the starting point, then add the product fields your support, compliance, and engineering teams will need later: tenant, feature flag, prompt version, policy decision, approval state, eval result, and incident link.
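The one-trace-per-run contract can be sketched without any tracing library. The example below is a minimal, standard-library-only illustration, not an OpenTelemetry or Agents SDK integration: `AgentTrace`, `Span`, and `record` are hypothetical names, and the lambdas stand in for real model and tool calls.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    kind: str  # "model_call", "tool_call", "retrieval", ...
    status: str = "ok"
    latency_ms: Optional[float] = None
    attributes: dict = field(default_factory=dict)


@dataclass
class AgentTrace:
    workflow: str
    tenant_id: str
    # Hypothetical ID format; any unique, stable run ID works.
    trace_id: str = field(default_factory=lambda: "agent_run_" + uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, name, kind, fn, **attributes):
        """Run fn(), time it, and append the outcome as a span on this trace."""
        start = time.perf_counter()
        status = "ok"
        try:
            return fn()
        except Exception:
            status = "error"
            raise
        finally:
            latency = round((time.perf_counter() - start) * 1000, 2)
            self.spans.append(Span(name, kind, status, latency, attributes))

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


trace = AgentTrace(workflow="support_refund_review", tenant_id="tenant_hash")
# The lambdas are placeholders for a real model call and a real tool call.
trace.record("classify_request", "model_call", lambda: "refund_status",
             prompt_version="refund_router_v7")
trace.record("lookup_order", "tool_call", lambda: {"found": True},
             tool="orders.read", policy_decision="allowed")
print(trace.to_json())
```

Because the span is appended in a `finally` block, a step that raises still leaves a span with `status` set to `error`, so the failure point stays on the trace.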

Log Decisions, Not Only Prompts

The most useful agent logs record intent, action, and outcome at every boundary where the system could do damage. A prompt log says "the model asked to call refund.create." A decision log says "the model asked to call refund.create, policy limited it to read-only because the order was outside the refund window, a human approval was required, and the final customer response was a refusal with the policy reason." That is the difference between debugging and guessing.

Use a fixed event contract for every agent span:

| Field | Why it exists | Example |
| --- | --- | --- |
| trace_id | Correlates every step in the run | agent_run_01 |
| span_kind | Separates model calls, retrieval, tool calls, approval, eval, and final response | tool_call |
| model | Ties output changes to a model version or routing choice | provider/model-version |
| prompt_version | Makes prompt regressions reviewable | refund_router_v7 |
| tool_name | Shows the capability the agent tried to use | orders.read |
| tool_intent | Captures why the tool was requested | verify refund window |
| input_ref | Stores a redacted body or hash, not raw sensitive data by default | sha256:... |
| output_summary | Keeps the operational result reviewable without storing full private payloads | order found, refund expired |
| policy_decision | Records allow, deny, redact, or escalate | escalate |
| approval_state | Shows whether a human gate was required and completed | approved |
| status | Separates success, retry, timeout, policy block, and error | policy_blocked |
| latency_ms | Finds slow steps by span, not only by final response | available if measured |
| cost | Links spend to workflow, model, and tenant | available if provider reports it |
| eval_result | Keeps quality judgment attached to the live run | pass |
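A fixed contract is only useful if something checks it. The sketch below validates a span event against the fields in the table. The set of required fields and the exact allowed values are assumptions made for illustration (the table names the categories but does not fix the spelling of each value), and `validate_span_event` is a hypothetical helper.

```python
# Assumed minimum; extend per workflow.
REQUIRED_FIELDS = {"trace_id", "span_kind", "status"}

# Assumed value vocabularies, based on the categories named in the table.
ALLOWED_VALUES = {
    "span_kind": {"model_call", "retrieval", "tool_call", "approval",
                  "eval", "final_response"},
    "policy_decision": {"allow", "deny", "redact", "escalate"},
    "status": {"ok", "retry", "timeout", "policy_blocked", "error"},
}


def validate_span_event(event):
    """Return a list of contract problems; an empty list means the event conforms."""
    problems = []
    for name in sorted(REQUIRED_FIELDS - set(event)):
        problems.append(f"missing field: {name}")
    for name, allowed in ALLOWED_VALUES.items():
        if name in event and event[name] not in allowed:
            problems.append(f"unexpected {name}: {event[name]!r}")
    # A tool call that does not say which tool it used cannot be audited.
    if event.get("span_kind") == "tool_call" and not event.get("tool_name"):
        problems.append("tool_call span without tool_name")
    return problems


event = {
    "trace_id": "agent_run_01",
    "span_kind": "tool_call",
    "tool_name": "orders.read",
    "tool_intent": "verify refund window",
    "policy_decision": "escalate",
    "status": "policy_blocked",
}
print(validate_span_event(event))
```

Running the same check at write time and in CI keeps the live schema from drifting away from the one the release gate expects.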

Some teams avoid logging too much because prompts and tool payloads can carry private data. The answer is not thin logs. The answer is a redaction policy. Store raw payloads only where you have a retention and access-control reason. For the default trace, store hashes, structured summaries, source IDs, policy decisions, and pointers to secure payload storage.
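A redact-by-default policy can be as small as one helper that turns a payload into a stable reference before it reaches the trace. This is a sketch under assumptions: `payload_ref` and `redacted_span_fields` are hypothetical names, the payload is assumed to be JSON-serializable, and `payload_pointer` is an illustrative field for the secure-storage reference.

```python
import hashlib
import json


def payload_ref(payload):
    """Return a stable 'sha256:<hex>' reference for a JSON-serializable payload."""
    # Sorting keys makes the hash independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def redacted_span_fields(payload, summary, secure_pointer=None):
    """Build the trace-safe fields for a span: a hash, a summary, an optional pointer."""
    fields = {"input_ref": payload_ref(payload), "output_summary": summary}
    if secure_pointer is not None:
        # Pointer into access-controlled storage; the raw body never enters the trace.
        fields["payload_pointer"] = secure_pointer
    return fields


raw = {"email": "a@example.com", "order": 42}
print(redacted_span_fields(raw, "order found, refund expired"))
```

A plain hash of a low-entropy value such as an email address can be guessed by brute force, so a keyed hash (for example HMAC with a secret key) may be the safer choice for such fields.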

Langfuse describes application tracing as structured logs of every request that capture the exact prompt, model response, token usage, latency, and any tools or retrieval steps in between. That is useful, but the production contract should add business context. If the trace cannot answer "which customer, which permission, which policy, which approval, which release," it is not enough for a real incident.

  1. Name the workflow

    Use a stable workflow name such as support_refund_review, sales_quote_check, or contract_clause_lookup. Do not use the endpoint path as the workflow name when several agent behaviors share the same route.

  2. Split the run into spans

    Create separate spans for routing, model calls, retrieval, tool calls, memory writes, approval gates, evals, and final response. A single "agent" span hides the failure point.

  3. Attach release context

    Add prompt version, model version, feature flag, tool manifest version, and deployment SHA. When the agent changes behavior after a deploy, these fields shorten the incident review.

  4. Redact by default

    Store hashes or structured summaries for sensitive payloads. Keep raw prompts and outputs only behind retention, access, and audit controls.
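Step 3 is the easiest of these to sketch. Assuming each span is a plain dict, the release context can be stamped on once and then compared across two runs to see what changed between them. `ReleaseContext`, `stamp`, and `release_diff` are hypothetical names, and the version strings are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseContext:
    prompt_version: str
    model_version: str
    feature_flag: str
    tool_manifest_version: str
    deployment_sha: str


def stamp(span, release):
    """Return a copy of the span with the release fields attached under 'release'."""
    return {**span, "release": asdict(release)}


def release_diff(before, after):
    """Return the release fields that differ between two stamped spans."""
    old, new = before["release"], after["release"]
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


last_week = ReleaseContext("refund_router_v6", "model-a", "refund_flow_on",
                           "tools_v3", "abc123")
today = ReleaseContext("refund_router_v7", "model-a", "refund_flow_on",
                       "tools_v3", "def456")
span = {"name": "classify_request", "kind": "model_call"}
print(release_diff(stamp(span, last_week), stamp(span, today)))
```

During an incident review, a diff like this narrows the question from "what changed?" to the specific fields that changed between a good run and a bad one.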

Retrieval And Memory Need Separate Spans

Retrieval failures look like model failures when you do not log retrieval separately. A model can only answer from the context it receives. If the search query was rewritten badly, the permissions filter removed the right document, the memory store returned stale state, or the citation mapper dropped the source, the final answer may look hallucinated while the root cause sits upstream.

For a RAG-backed agent, add spans for:

  • query_rewrite: original user intent, rewritten query, and rewrite prompt version.
  • permission_filter: tenant, role, document visibility rule, and block reason.
  • retrieval: source IDs, rank, score bucket, index version, and freshness timestamp when available.
  • rerank: input count, selected source IDs, and reranker version.
  • context_pack: included source IDs, excluded source IDs, token budget status, and citation map.
  • memory_read: memory namespace, record IDs, age bucket, and privacy class.
  • memory_write: write intent, approval state, storage namespace, and retention policy.

The key is to log the source chain without dumping private documents into the trace. Store document IDs, chunk IDs, citation IDs, timestamps, and access decisions. If an engineer needs the raw source, the trace should point to the secured system of record.
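As a sketch of logging the source chain without the source text, the helper below builds a retrieval span from search hits and keeps only IDs, rank, and a coarse score bucket. The hit structure (`document_id`, `chunk_id`, `score`, `text`), the bucket thresholds, and the function names are all assumptions for illustration.

```python
def score_bucket(score):
    """Coarse bucket so the trace shows relevance without exact scores."""
    # Thresholds are illustrative; calibrate them against your own retriever.
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def retrieval_span(hits, index_version):
    """Build a retrieval span from hits, keeping IDs and dropping chunk text."""
    return {
        "name": "retrieval",
        "kind": "retrieval",
        "index_version": index_version,
        "status": "ok" if hits else "empty",
        "sources": [
            {
                "document_id": hit["document_id"],
                "chunk_id": hit["chunk_id"],
                "rank": rank,
                "score_bucket": score_bucket(hit["score"]),
            }
            for rank, hit in enumerate(hits, start=1)
        ],
    }


hits = [
    {"document_id": "doc_7", "chunk_id": "doc_7#3", "score": 0.91,
     "text": "private refund policy text"},
    {"document_id": "doc_2", "chunk_id": "doc_2#1", "score": 0.42,
     "text": "another private passage"},
]
print(retrieval_span(hits, "idx_v3"))
```

An empty hit list produces a span with `status` set to `empty`, which keeps a retrieval miss visible on the trace instead of letting it look like a model failure.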

This is where the observability layer starts to overlap with retrieval quality. A clean trace can show that the agent chose the wrong tool. It can also show that the retrieval pipeline never sent the right fact. That distinction matters when you are deciding whether to tune prompts, rebuild the index, add a reranker, tighten permissions, or change the agent's fallback behavior.

Evals Belong On The Same Trace

An eval that cannot be traced back to the run is a report, not an operating control. Store eval results as span attributes or child spans on the same trace that produced the answer. That lets an engineer open a failed run and see the model call, prompt version, tool sequence, retrieved sources, approval state, and quality result in one place.

Use different evals for different failure modes:

| Eval | What it checks | Where it attaches |
| --- | --- | --- |
| Task success | The final outcome satisfied the user's intent | Final response span |
| Tool correctness | The agent called the right tool with acceptable arguments | Tool call span |
| Evidence support | The answer is grounded in the retrieved sources | Context pack or final response span |
| Policy compliance | The run followed privacy, safety, and business rules | Policy or approval span |
| Regression | A new prompt, model, or tool version did not break known cases | Release trace or CI run |
| Human review | A reviewer accepted, corrected, or rejected the output | Review span |
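Attaching an eval to the span it judges can be sketched with plain dicts. The trace layout (`trace_id`, `spans`, each span with a `name`) and the helpers `attach_eval` and `failed_evals` are assumptions for illustration, not the API of any particular platform.

```python
def attach_eval(trace, span_name, eval_name, passed, note=""):
    """Record an eval outcome on the named span of the trace that produced it."""
    for span in trace["spans"]:
        if span["name"] == span_name:
            span.setdefault("evals", []).append(
                {"eval": eval_name, "result": "pass" if passed else "fail",
                 "note": note}
            )
            return
    # An eval with no matching span is exactly the untraceable report to avoid.
    raise KeyError(f"no span named {span_name!r} in trace {trace['trace_id']}")


def failed_evals(trace):
    """Return (span name, eval name) pairs for every failed eval on the trace."""
    return [
        (span["name"], result["eval"])
        for span in trace["spans"]
        for result in span.get("evals", [])
        if result["result"] == "fail"
    ]


trace = {
    "trace_id": "agent_run_01",
    "spans": [{"name": "lookup_order"}, {"name": "compose_response"}],
}
attach_eval(trace, "lookup_order", "tool_correctness", True)
attach_eval(trace, "compose_response", "evidence_support", False,
            note="answer cites no retrieved source")
print(failed_evals(trace))
```

Raising when the span is missing is a deliberate choice in this sketch: it surfaces evals that would otherwise float free of the run they describe.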

LangSmith positions observability as visibility from individual traces to production-wide performance metrics, with dashboards, alerts, automation rules, webhooks, online evaluations, annotation queues, and feedback capture. Langfuse adds LLM-native concepts such as token usage, model parameters, prompt and completion pairs, evaluation scores, LLM-as-a-Judge evaluation, prompt management, experiments, datasets, and dashboards. The exact tool matters less than the operating rule: evals have to sit beside the run that produced them.

If you already use OpenAI Agents SDK tracing, start with the production tracing checklist for OpenAI agents and add your own eval spans around product-specific risks. If you are choosing between LLM observability platforms, the Langfuse and LangSmith production comparison is the right next read.

A practical release gate looks like this:

YAML
agent_release_gate:
  required_trace_fields:
    - trace_id
    - workflow
    - model
    - prompt_version
    - tool_manifest_version
    - approval_state
    - eval_result
  blocking_failures:
    - missing_trace_id
    - unlogged_tool_call
    - missing_retrieval_source
    - failed_policy_eval
    - approval_required_but_absent
  review_queue:
    sample: production_risk_based
    required_for:
      - money_movement
      - account_change
      - private_data_export
      - legal_or_medical_claim

That gate can run in CI for offline datasets, in staging against scripted runs, and in production as online monitoring. The release gate should not ask whether the agent "seems good." It should ask whether every risky decision is observable, evaluated, and blocked when it fails.
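The blocking half of that gate can be sketched as a single function over a run record. This is an illustration under stated assumptions: the run is a flat dict carrying the required trace fields plus a `spans` list, `tools_invoked` is a hypothetical list reported by the agent runtime independently of the trace, `policy_eval` holds the policy eval outcome, and an `approval_state` of `required` means a gate was needed but never resolved.

```python
REQUIRED_TRACE_FIELDS = (
    "trace_id", "workflow", "model", "prompt_version",
    "tool_manifest_version", "approval_state", "eval_result",
)


def blocking_failures(run):
    """Return the blocking failures for one run; an empty list means it passes."""
    failures = []
    for name in REQUIRED_TRACE_FIELDS:
        if not run.get(name):
            failures.append("missing_trace_id" if name == "trace_id"
                            else f"missing_{name}")
    spans = run.get("spans", [])
    # Compare what the runtime says it invoked with what the trace recorded.
    logged_tools = {s.get("tool") for s in spans if s.get("kind") == "tool_call"}
    if set(run.get("tools_invoked", [])) - logged_tools:
        failures.append("unlogged_tool_call")
    if any(s.get("kind") == "retrieval" and not s.get("source_ids") for s in spans):
        failures.append("missing_retrieval_source")
    if run.get("policy_eval") == "fail":
        failures.append("failed_policy_eval")
    if any(s.get("approval_state") == "required" for s in spans):
        failures.append("approval_required_but_absent")
    return failures


good_run = {
    "trace_id": "agent_run_01", "workflow": "support_refund_review",
    "model": "provider/model-version", "prompt_version": "refund_router_v7",
    "tool_manifest_version": "tools_v3", "approval_state": "not_required",
    "eval_result": "pass", "policy_eval": "pass",
    "tools_invoked": ["orders.read"],
    "spans": [
        {"kind": "tool_call", "tool": "orders.read",
         "approval_state": "not_required"},
        {"kind": "retrieval", "source_ids": ["doc_7"]},
    ],
}
print(blocking_failures(good_run))
```

The same function can run over an offline dataset in CI and over sampled production runs, which keeps the gate definition in one place.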

Alert On Failure Modes, Not Novelty

Alerting should follow the ways agents break in production: tool failure, policy bypass, retrieval miss, cost spike, latency drift, eval failure, approval gap, and silent fallback. A dashboard full of token counts is useful for finance and capacity planning, but it will not wake the right engineer when the agent starts writing to the wrong system.

Create alerts around failure classes:

| Failure class | Alert signal | First debugging question |
| --- | --- | --- |
| Tool reliability | Tool call errors, timeouts, retries, or malformed arguments | Did the API change, did auth fail, or did the agent choose the wrong tool? |
| Policy control | Deny events, escalation events, or approval bypass attempts | Did the policy work, or did the agent reach a restricted path? |
| Retrieval quality | Empty source sets, stale source sets, or answer-without-citation events | Did retrieval fail before the model answered? |
| Output quality | Failed task eval, failed evidence eval, or reviewer rejection | Is this a prompt, model, retrieval, or tool-selection regression? |
| Cost control | Spend by workflow, model, tenant, or tool route | Did routing choose a costly model or enter a retry loop? |
| Latency | Slow spans by model, retrieval, tool call, or approval queue | Which step controls user-visible delay? |
| Data protection | Privacy scanner hit, raw sensitive payload stored, or unexpected export | Which span crossed the boundary? |

Datadog's LLM Observability docs describe traces that can represent any of three things: an individual LLM inference with tokens, error information, and latency; a predetermined workflow that groups LLM calls with contextual operations such as tool calls or preprocessing; or a dynamic workflow executed by an LLM agent. That is the right mental model for alerting: do not alert only on the final request. Alert on the span that failed.
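Span-level alerting can be sketched as a classifier that maps each span to a failure class and emits one alert per failing span. The span fields (`kind`, `status`, `policy_decision`, `source_ids`, `eval_result`) and the class names are assumptions for illustration; the sketch covers four failure classes, and cost, latency, and data protection would need thresholds and scanners that are out of scope here.

```python
def failure_class(span):
    """Map a span to a failure class, or None when the span looks healthy."""
    kind, status = span.get("kind"), span.get("status")
    if kind == "tool_call" and status in {"error", "timeout", "retry"}:
        return "tool_reliability"
    if span.get("policy_decision") in {"deny", "escalate"}:
        return "policy_control"
    if kind == "retrieval" and not span.get("source_ids"):
        return "retrieval_quality"
    if span.get("eval_result") == "fail":
        return "output_quality"
    return None


def span_alerts(trace):
    """Return one alert per failing span, each pointing at the span by name."""
    alerts = []
    for span in trace["spans"]:
        cls = failure_class(span)
        if cls is not None:
            alerts.append({"trace_id": trace["trace_id"],
                           "span": span["name"], "failure_class": cls})
    return alerts


trace = {
    "trace_id": "agent_run_01",
    "spans": [
        {"name": "classify_request", "kind": "model_call", "status": "ok"},
        {"name": "lookup_order", "kind": "tool_call", "status": "timeout"},
        {"name": "retrieval", "kind": "retrieval", "status": "ok", "source_ids": []},
        {"name": "compose_response", "kind": "model_call", "status": "ok",
         "eval_result": "fail"},
    ],
}
for alert in span_alerts(trace):
    print(alert)
```

Each alert carries the span name, so the person paged starts at the failing step instead of at the endpoint.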

Phoenix describes a trace as a record of a single run, broken into spans that show how agents, tasks, and tools executed. That trace becomes the raw data for evaluation and iteration. Treat that as the operational loop: trace the run, evaluate the result, fix the release contract, and then promote the agent.

The Practical Stack

The durable stack is vendor-neutral at the trace layer and LLM-aware at the review layer. OpenTelemetry gives you correlation across services and a shared trace vocabulary. Its GenAI semantic conventions repository covers spans, metrics, and events for GenAI clients, MCP, and provider-specific conventions. That makes it a good foundation when agent runs need to connect to your API logs, queues, databases, billing, and incident system.

Above that, pick the smallest LLM-aware tool that fits your workflow:

| Stack choice | Use it when | Production note |
| --- | --- | --- |
| OpenAI Agents SDK tracing | You build with the Agents SDK and need immediate traces for model calls, tools, handoffs, guardrails, and custom events | Add business fields and eval results yourself; built-in tracing is not the whole release gate |
| OpenTelemetry plus Langfuse | You want open-source LLM tracing, prompt tracking, evals, datasets, dashboards, and self-hosting options | Keep redaction and retention explicit before storing prompts and outputs |
| LangSmith | You already use LangChain or LangGraph, or you want integrated traces, dashboards, online evals, automations, and feedback queues | Make sure non-LangChain services still correlate through trace IDs |
| Datadog LLM Observability | You want agent traces tied to an existing production observability and incident workflow | Use span-level fields so alerts point to the failing decision, not just the endpoint |
| Phoenix | You want a trace-first open-source workflow for understanding runs and moving into evaluations | Keep the trace schema consistent if you later export elsewhere |

The wrong stack is a disconnected mix: one dashboard for prompts, one log system for APIs, one spreadsheet for reviewer notes, and no trace ID tying them together. The right stack lets an engineer open one run and answer:

  • What did the user ask?
  • Which workflow handled it?
  • Which prompt and model version ran?
  • Which tools were called, and why?
  • Which sources or memories were read?
  • Which policy decisions applied?
  • Was human approval required?
  • What did it cost?
  • Which evals passed or failed?
  • Which release introduced the behavior?

If those questions need separate manual searches, the observability layer is not ready for production.
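One way to test that readiness is to check that a single function over a single trace can answer most of those questions. The sketch below assumes a dict-shaped trace with the illustrative fields shown; `run_summary` is a hypothetical helper, not part of any platform.

```python
def run_summary(trace):
    """Answer the review questions from one trace, with no second lookup."""
    spans = trace["spans"]
    return {
        "workflow": trace["workflow"],
        "versions": [(s["model"], s["prompt_version"]) for s in spans
                     if s.get("kind") == "model_call"],
        "tools": [(s["tool"], s.get("tool_intent")) for s in spans
                  if s.get("kind") == "tool_call"],
        "sources": [sid for s in spans for sid in s.get("source_ids", [])],
        "policy_decisions": [s["policy_decision"] for s in spans
                             if "policy_decision" in s],
        "approval_required": any(
            s.get("approval_state", "not_required") != "not_required"
            for s in spans
        ),
        "total_cost": sum(s.get("cost", 0.0) for s in spans),
        "failed_evals": [s["name"] for s in spans
                         if s.get("eval_result") == "fail"],
        "deployment_sha": trace.get("deployment_sha"),
    }


trace = {
    "trace_id": "agent_run_01",
    "workflow": "support_refund_review",
    "deployment_sha": "def456",
    "spans": [
        {"name": "classify_request", "kind": "model_call",
         "model": "provider/model-version", "prompt_version": "refund_router_v7",
         "cost": 0.25},
        {"name": "retrieval", "kind": "retrieval", "source_ids": ["doc_7", "doc_2"]},
        {"name": "lookup_order", "kind": "tool_call", "tool": "orders.read",
         "tool_intent": "verify refund window", "policy_decision": "allowed",
         "approval_state": "not_required"},
        {"name": "compose_response", "kind": "model_call",
         "model": "provider/model-version", "prompt_version": "refund_reply_v2",
         "cost": 0.5, "eval_result": "fail"},
    ],
}
print(run_summary(trace))
```

If a field in the summary cannot be filled from the trace alone, that gap is the manual search the paragraph above warns about.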

The Launch Checklist

Launch readiness is not "we have traces." It is "we can explain, evaluate, and stop a bad run before it becomes a product incident." Use this checklist before opening the agent to real users.

  1. Define the run schema

    Write the required trace fields, span kinds, redaction rules, retention rules, and owner for each workflow. If a span kind is optional, write the condition that makes it optional.

  2. Instrument the risky boundaries

    Trace model calls, tool calls, retrieval, memory reads, memory writes, approvals, guardrails, and final responses. The riskiest boundary is the one where the agent can change external state.

  3. Attach evals to traces

    Store task success, evidence support, policy compliance, tool correctness, and reviewer feedback on the same trace as the run.

  4. Create blocking gates

    Block release when traces are missing required fields, tool calls are unlogged, retrieval sources are absent, policy evals fail, or required approvals are missing.

  5. Route incidents by span

    Send alerts to the owner of the failing span: model routing, retrieval, tool API, policy engine, approval queue, or product workflow.

The operating principle is that every agent run should leave enough evidence for an engineer to reproduce the decision chain without re-running the model. Re-running a model is not debugging. It changes the system under inspection. The trace is the artifact you can review, compare, evaluate, and hand to an incident owner.

What is AI agent observability?

AI agent observability is the trace, metric, log, eval, and feedback layer that explains what an agent did, why it did it, what it cost, and whether the outcome met the release contract. The useful unit is a full agent run, not a single prompt.

What should be logged for an AI agent?

Log the trace ID, workflow, user or tenant context, model and prompt version, tool calls, retrieval sources, memory reads and writes, approval state, final outcome, cost, latency, eval result, and policy exceptions. Redact sensitive payloads by default and store secure references where full payload review is required.

Is OpenTelemetry enough for AI agents?

OpenTelemetry is the right correlation and transport layer, especially when agent traces need to connect to the rest of your system. Most teams still need an LLM-aware run store or eval platform for prompts, tool calls, retrieval, review queues, and quality scoring.

How is observability different from evals?

Observability explains a live run. Evals judge whether the run met a quality, safety, or business bar. A production setup stores eval outcomes on the same trace so a failed score can be debugged against the exact model call, tool call, retrieval set, and approval path.

Which AI agent observability tool should a team start with?

Start with the tool that fits your execution stack, then enforce your own trace schema. OpenAI Agents SDK tracing is the fastest path for Agents SDK apps, LangSmith fits LangChain and LangGraph teams, Langfuse fits open-source LLM observability and eval workflows, Datadog fits teams already operating production systems there, and Phoenix fits trace-first open-source evaluation work.

Last Updated

Jun 26, 2026
