AI Agent Observability: What to Log Before Production

A production logging contract for AI agents: trace IDs, tool calls, retrieval, cost, eval scores, approvals, and alerts before launch.

Friday, June 26, 2026

Omid Saffari

AI agent observability starts with one rule: every run must be explainable as a trace, not as a pile of prompts and log lines. Before an agent touches production traffic, log the user-visible outcome, the model and prompt version, every tool call, every retrieval, the cost and latency, the eval result, and the human approval state under one trace ID.

The Production Contract Is One Trace Per Agent Run

The minimum production contract is simple: one user request becomes one trace, and every model call, tool call, retrieval, approval, eval, and final response becomes a span or event attached to it. OpenTelemetry defines traces as the big picture of what happens when a request moves through an application, spans as the units of work inside that path, and attributes as key-value metadata on those spans. That model maps cleanly to agents because an agent run is not one API call. It is a sequence of decisions.

The mistake is treating observability as "store the prompt and completion." That helps during demos. It fails the first time a customer asks why the agent updated the wrong record, skipped a source, retried an expensive tool, or produced an answer that passed syntax checks but violated policy. Production debugging needs the decision chain, not only the transcript.

A useful trace has this shape:

JSON

{
  "trace_id": "agent_run_01",
  "workflow": "support_refund_review",
  "tenant_id": "tenant_hash",
  "user_request": {
    "intent": "refund_status",
    "input_hash": "sha256:...",
    "contains_sensitive_data": true
  },
  "spans": [
    {
      "name": "classify_request",
      "span_kind": "model_call",
      "model": "model_family_and_version",
      "prompt_version": "refund_router_v7",
      "status": "ok"
    },
    {
      "name": "lookup_order",
      "span_kind": "tool_call",
      "tool_name": "orders.read",
      "policy_decision": "allow",
      "approval_state": "not_required",
      "status": "ok"
    },
    {
      "name": "compose_response",
      "span_kind": "model_call",
      "eval_result": "pass",
      "status": "ok"
    }
  ]
}

OpenAI Agents SDK tracing already works this way. It records LLM generations, tool calls, handoffs, guardrails, and custom events during an agent run, and the SDK describes a trace as a single end-to-end workflow composed of spans. If you use that SDK, treat its built-in traces as the starting point, then add the product fields your support, compliance, and engineering teams will need later: tenant, feature flag, prompt version, policy decision, approval state, eval result, and incident link.
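As a rough illustration of the one-trace-per-run contract, the sketch below records spans under a single trace ID using only the Python standard library. `AgentTrace` and `Span` are hypothetical names for this example, not part of OpenTelemetry or the OpenAI Agents SDK. In a real system the same fields would more likely be set as attributes on your tracing library's spans.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    span_kind: str  # e.g. "model_call", "tool_call", "retrieval", "approval", "eval"
    status: str = "ok"
    latency_ms: Optional[float] = None
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentTrace:
    """Hypothetical in-memory trace: one instance per user request."""
    workflow: str
    tenant_id: str
    trace_id: str = field(default_factory=lambda: "agent_run_" + uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, span_kind: str, **attributes: Any) -> Iterator[Span]:
        """Record one unit of work, including its latency and failure status."""
        current = Span(name=name, span_kind=span_kind, attributes=dict(attributes))
        self.spans.append(current)
        started = time.perf_counter()
        try:
            yield current
        except Exception:
            current.status = "error"  # the span is kept even when the step fails
            raise
        finally:
            current.latency_ms = round((time.perf_counter() - started) * 1000, 3)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


trace = AgentTrace(workflow="support_refund_review", tenant_id="tenant_hash")
with trace.span("classify_request", "model_call", prompt_version="refund_router_v7"):
    pass  # the model call would run here
with trace.span("lookup_order", "tool_call", tool_name="orders.read",
                approval_state="not_required") as lookup:
    lookup.attributes["output_summary"] = "order found, refund expired"
print(trace.to_json())
```

Because the span is appended before the work starts, a tool call that raises still leaves a span with `status` set to `error`, which is the evidence an incident review needs.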

Log Decisions, Not Only Prompts

The most useful agent logs record intent, action, and outcome at every boundary where the system could do damage. A prompt log says "the model asked to call refund.create." A decision log says "the model asked to call refund.create, policy limited it to read-only because the order was outside the refund window, a human approval was required, and the final customer response was a refusal with the policy reason." That is the difference between debugging and guessing.

Use a fixed event contract for every agent span:

| Field | Why it exists | Example |
| --- | --- | --- |
| `trace_id` | Correlates every step in the run | `agent_run_01` |
| `span_kind` | Separates model calls, retrieval, tool calls, approval, eval, and final response | `tool_call` |
| `model` | Ties output changes to a model version or routing choice | `provider/model-version` |
| `prompt_version` | Makes prompt regressions reviewable | `refund_router_v7` |
| `tool_name` | Shows the capability the agent tried to use | `orders.read` |
| `tool_intent` | Captures why the tool was requested | `verify refund window` |
| `input_ref` | Stores a redacted body or hash, not raw sensitive data by default | `sha256:...` |
| `output_summary` | Keeps the operational result reviewable without storing full private payloads | `order found, refund expired` |
| `policy_decision` | Records allow, deny, redact, or escalate | `escalate` |
| `approval_state` | Shows whether a human gate was required and completed | `approved` |
| `status` | Separates success, retry, timeout, policy block, and error | `policy_blocked` |
| `latency_ms` | Finds slow steps by span, not only by final response | `available if measured` |
| `cost` | Links spend to workflow, model, and tenant | `available if provider reports it` |
| `eval_result` | Keeps quality judgment attached to the live run | `pass` |
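One way to keep the contract fixed is to validate span events before they are exported. The sketch below is a minimal example under stated assumptions: the field names come from the table above, but the split between always-required fields and fields required per span kind (`REQUIRED_BY_KIND`) is an illustrative choice, not something the contract prescribes.

```python
from typing import Any, Dict, List

# Required on every span event (assumption for this sketch).
REQUIRED_FIELDS = ("trace_id", "span_kind", "status")

# Extra fields required for particular span kinds (assumption for this sketch).
REQUIRED_BY_KIND = {
    "model_call": ("model", "prompt_version"),
    "tool_call": ("tool_name", "tool_intent", "policy_decision", "approval_state"),
}

# Values taken from the policy_decision row of the contract table.
ALLOWED_POLICY_DECISIONS = {"allow", "deny", "redact", "escalate"}


def validate_span_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the event conforms."""
    problems = []
    required = REQUIRED_FIELDS + REQUIRED_BY_KIND.get(event.get("span_kind"), ())
    for name in required:
        if not event.get(name):
            problems.append(f"missing field: {name}")
    decision = event.get("policy_decision")
    if decision is not None and decision not in ALLOWED_POLICY_DECISIONS:
        problems.append(f"unknown policy_decision: {decision}")
    return problems


event = {
    "trace_id": "agent_run_01",
    "span_kind": "tool_call",
    "tool_name": "orders.read",
    "tool_intent": "verify refund window",
    "policy_decision": "escalate",
    "approval_state": "approved",
    "status": "ok",
}
print(validate_span_event(event))
```

Returning a list of problems instead of raising keeps the validator usable both as a CI check and as a non-blocking production monitor.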

Some teams avoid logging too much because prompts and tool payloads can carry private data. The answer is not thin logs. The answer is a redaction policy. Store raw payloads only where you have a retention and access-control reason. For the default trace, store hashes, structured summaries, source IDs, policy decisions, and pointers to secure payload storage.
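A redact-by-default policy can be sketched as a small helper that turns a payload into a stable reference plus a summary. The function names and the `raw_payload_uri` field are hypothetical, chosen for this example only.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def payload_ref(payload: Any) -> str:
    """Return a stable sha256 reference for a JSON-serializable payload."""
    # Canonical form: sorted keys and fixed separators, so equal payloads hash equally.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def redacted_span_fields(payload: Any, summary: str,
                         raw_payload_uri: Optional[str] = None) -> Dict[str, Any]:
    """Build the default trace fields: a hash and a summary, never the raw payload."""
    fields = {"input_ref": payload_ref(payload), "output_summary": summary}
    if raw_payload_uri is not None:
        # Pointer to secured storage; only set when a retention reason exists.
        fields["raw_payload_uri"] = raw_payload_uri
    return fields


fields = redacted_span_fields(
    {"order_id": "A1", "email": "x@example.com"},
    summary="order found, refund expired",
)
print(fields)
```

A plain hash of a low-entropy value such as an email address or order number can be reversed by guessing inputs, so a keyed hash (for example the standard library's `hmac` with a secret key) may be a safer default where that matters.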

Langfuse describes application tracing as structured logs of every request that capture the exact prompt, model response, token usage, latency, and any tools or retrieval steps in between. That is useful, but the production contract should add business context. If the trace cannot answer "which customer, which permission, which policy, which approval, which release," it is not enough for a real incident.

Name the workflow

Use a stable workflow name such as support_refund_review, sales_quote_check, or contract_clause_lookup. Do not use the endpoint path as the workflow name when several agent behaviors share the same route.

Split the run into spans

Create separate spans for routing, model calls, retrieval, tool calls, memory writes, approval gates, evals, and final response. A single "agent" span hides the failure point.

Attach release context

Add prompt version, model version, feature flag, tool manifest version, and deployment SHA. When the agent changes behavior after a deploy, these fields shorten the incident review.

Redact by default

Store hashes or structured summaries for sensitive payloads. Keep raw prompts and outputs only behind retention, access, and audit controls.

Retrieval And Memory Need Separate Spans

Retrieval failures look like model failures when you do not log retrieval separately. A model can only answer from the context it receives. If the search query was rewritten badly, the permissions filter removed the right document, the memory store returned stale state, or the citation mapper dropped the source, the final answer may look hallucinated while the root cause sits upstream.

For a RAG-backed agent, add spans for:

query_rewrite: original user intent, rewritten query, and rewrite prompt version.
permission_filter: tenant, role, document visibility rule, and block reason.
retrieval: source IDs, rank, score bucket, index version, and freshness timestamp when available.
rerank: input count, selected source IDs, and reranker version.
context_pack: included source IDs, excluded source IDs, token budget status, and citation map.
memory_read: memory namespace, record IDs, age bucket, and privacy class.
memory_write: write intent, approval state, storage namespace, and retention policy.

The key is to log the source chain without dumping private documents into the trace. Store document IDs, chunk IDs, citation IDs, timestamps, and access decisions. If an engineer needs the raw source, the trace should point to the secured system of record.
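The source-chain idea can be sketched as a function that turns retrieval hits into `retrieval` and `permission_filter` spans holding IDs only. The hit shape (`doc_id`, `chunk_id`, `text`), the status strings, and the choice to filter after retrieval are assumptions for this example; a pipeline that filters before retrieval would emit the spans in the other order.

```python
import json
from typing import Any, Dict, List, Set


def source_chain_spans(trace_id: str, hits: List[Dict[str, Any]],
                       visible_doc_ids: Set[str], index_version: str) -> List[Dict[str, Any]]:
    """Return retrieval and permission_filter spans that hold IDs, never document text."""
    refs = [
        {"doc_id": hit["doc_id"], "chunk_id": hit["chunk_id"], "rank": rank}
        for rank, hit in enumerate(hits, start=1)
    ]
    kept = [ref for ref in refs if ref["doc_id"] in visible_doc_ids]
    blocked = [ref for ref in refs if ref["doc_id"] not in visible_doc_ids]
    retrieval = {
        "trace_id": trace_id,
        "span_kind": "retrieval",
        "index_version": index_version,
        "source_ids": refs,
        "status": "ok" if refs else "empty_source_set",
    }
    permission_filter = {
        "trace_id": trace_id,
        "span_kind": "permission_filter",
        "kept_source_ids": kept,
        "blocked_source_ids": blocked,
        "block_reason": "document_not_visible" if blocked else None,
        "status": "all_sources_blocked" if refs and not kept else "ok",
    }
    return [retrieval, permission_filter]


hits = [
    {"doc_id": "policy_12", "chunk_id": "policy_12#3", "text": "Refunds close after 30 days."},
    {"doc_id": "internal_7", "chunk_id": "internal_7#1", "text": "Internal escalation notes."},
]
spans = source_chain_spans("agent_run_01", hits, {"policy_12"}, index_version="index_v1")
print(json.dumps(spans, indent=2))
```

With both spans on the trace, an answer that looks hallucinated can be checked against what was retrieved and what the permission filter removed before anyone touches the prompt.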

This is where the observability layer starts to overlap with retrieval quality. A clean trace can show that the agent chose the wrong tool. It can also show that the retrieval pipeline never sent the right fact. That distinction matters when you are deciding whether to tune prompts, rebuild the index, add a reranker, tighten permissions, or change the agent's fallback behavior.

Evals Belong On The Same Trace

An eval that cannot be traced back to the run is a report, not an operating control. Store eval results as span attributes or child spans on the same trace that produced the answer. That lets an engineer open a failed run and see the model call, prompt version, tool sequence, retrieved sources, approval state, and quality result in one place.

Use different evals for different failure modes:

| Eval | What it catches | Where it attaches |
| --- | --- | --- |
| Task success | The final outcome satisfied the user's intent | Final response span |
| Tool correctness | The agent called the right tool with acceptable arguments | Tool call span |
| Evidence support | The answer is grounded in the retrieved sources | Context pack or final response span |
| Policy compliance | The run followed privacy, safety, and business rules | Policy or approval span |
| Regression | A new prompt, model, or tool version did not break known cases | Release trace or CI run |
| Human review | A reviewer accepted, corrected, or rejected the output | Review span |
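A minimal sketch of attaching eval results to the span they judge, following the table above. The trace is a plain dictionary here, and the eval names, span kind names, and `evals` list are hypothetical conventions for this example. Regression evals are left out because the table attaches them to a release trace or CI run, not to a live span.

```python
from typing import Any, Dict, List, Tuple

# Which span kinds each eval may attach to, following the table above.
EVAL_TARGET_KINDS = {
    "task_success": ("final_response",),
    "tool_correctness": ("tool_call",),
    "evidence_support": ("context_pack", "final_response"),
    "policy_compliance": ("policy", "approval"),
    "human_review": ("review",),
}


def attach_eval(trace: Dict[str, Any], eval_name: str, result: str,
                detail: str = "") -> Dict[str, Any]:
    """Attach an eval result to the first span of a matching kind and return that span."""
    target_kinds = EVAL_TARGET_KINDS[eval_name]
    for span in trace["spans"]:
        if span["span_kind"] in target_kinds:
            span.setdefault("evals", []).append(
                {"name": eval_name, "result": result, "detail": detail}
            )
            return span
    raise LookupError(f"trace {trace['trace_id']} has no span for eval {eval_name}")


def failed_evals(trace: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (span name, eval name) pairs for every failed eval on the trace."""
    return [
        (span["name"], ev["name"])
        for span in trace["spans"]
        for ev in span.get("evals", [])
        if ev["result"] == "fail"
    ]


run = {"trace_id": "agent_run_01", "spans": [
    {"name": "lookup_order", "span_kind": "tool_call"},
    {"name": "compose_response", "span_kind": "final_response"},
]}
attach_eval(run, "tool_correctness", "pass")
attach_eval(run, "task_success", "fail", detail="answer omitted the refund window")
print(failed_evals(run))
```

This version attaches to the first matching span only; a run with several tool calls would need the caller to name the span being judged.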

LangSmith positions observability as visibility from individual traces to production-wide performance metrics, with dashboards, alerts, automation rules, webhooks, online evaluations, annotation queues, and feedback capture. Langfuse adds LLM-native concepts such as token usage, model parameters, prompt and completion pairs, evaluation scores, LLM-as-a-Judge evaluation, prompt management, experiments, datasets, and dashboards. The exact tool matters less than the operating rule: evals have to sit beside the run that produced them.

If you already use OpenAI Agents SDK tracing, start with the production tracing checklist for OpenAI agents and add your own eval spans around product-specific risks. If you are choosing between LLM observability platforms, the Langfuse and LangSmith production comparison is the right next read.

A practical release gate looks like this:

YAML

agent_release_gate:
  required_trace_fields:
    - trace_id
    - workflow
    - model
    - prompt_version
    - tool_manifest_version
    - approval_state
    - eval_result
  blocking_failures:
    - missing_trace_id
    - unlogged_tool_call
    - missing_retrieval_source
    - failed_policy_eval
    - approval_required_but_absent
  review_queue:
    sample: production_risk_based
    required_for:
      - money_movement
      - account_change
      - private_data_export
      - legal_or_medical_claim

That gate can run in CI for offline datasets, in staging against scripted runs, and in production as online monitoring. The release gate should not ask whether the agent "seems good." It should ask whether every risky decision is observable, evaluated, and blocked when it fails.
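The gate above could be enforced with a check along these lines. This is a sketch under assumptions: the trace is a dictionary, `executed_tools` is a list of tool names reported by the tool runtime, eval spans carry `eval_name` and `eval_result`, and `approval_required` is a hypothetical boolean set by the policy layer.

```python
from typing import Any, Dict, Iterable, List

REQUIRED_TRACE_FIELDS = (
    "trace_id", "workflow", "model", "prompt_version",
    "tool_manifest_version", "approval_state", "eval_result",
)


def gate_failures(trace: Dict[str, Any], executed_tools: Iterable[str]) -> List[str]:
    """Return blocking failures for one run; an empty list means the gate passes."""
    failures = []
    for name in REQUIRED_TRACE_FIELDS:
        if not trace.get(name):
            failures.append("missing_trace_id" if name == "trace_id" else f"missing_field:{name}")

    spans = trace.get("spans", [])
    logged_tools = {s.get("tool_name") for s in spans if s.get("span_kind") == "tool_call"}
    if set(executed_tools) - logged_tools:
        failures.append("unlogged_tool_call")

    for span in spans:
        kind = span.get("span_kind")
        if kind == "retrieval" and not span.get("source_ids"):
            failures.append("missing_retrieval_source")
        if (kind == "eval" and span.get("eval_name") == "policy_compliance"
                and span.get("eval_result") == "fail"):
            failures.append("failed_policy_eval")
        if span.get("approval_required") and span.get("approval_state") != "approved":
            failures.append("approval_required_but_absent")
    return failures


good_trace = {
    "trace_id": "agent_run_01",
    "workflow": "support_refund_review",
    "model": "provider/model-version",
    "prompt_version": "refund_router_v7",
    "tool_manifest_version": "tools_v1",
    "approval_state": "not_required",
    "eval_result": "pass",
    "spans": [
        {"span_kind": "tool_call", "tool_name": "orders.read"},
        {"span_kind": "retrieval", "source_ids": ["policy_12"]},
    ],
}
print(gate_failures(good_trace, executed_tools=["orders.read"]))
```

Comparing the runtime's list of executed tools against the logged spans is what makes `unlogged_tool_call` detectable; the trace alone cannot show what it failed to record.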

Alert On Failure Modes, Not Novelty

Alerting should follow the ways agents break in production: tool failure, policy bypass, retrieval miss, cost spike, latency drift, eval failure, approval gap, and silent fallback. A dashboard full of token counts is useful for finance and capacity planning, but it will not wake the right engineer when the agent starts writing to the wrong system.

Create alerts around failure classes:

| Failure class | Alert signal | First debugging question |
| --- | --- | --- |
| Tool reliability | Tool call errors, timeouts, retries, or malformed arguments | Did the API change, did auth fail, or did the agent choose the wrong tool? |
| Policy control | Deny events, escalation events, or approval bypass attempts | Did the policy work, or did the agent reach a restricted path? |
| Retrieval quality | Empty source sets, stale source sets, or answer-without-citation events | Did retrieval fail before the model answered? |
| Output quality | Failed task eval, failed evidence eval, or reviewer rejection | Is this a prompt, model, retrieval, or tool-selection regression? |
| Cost control | Spend by workflow, model, tenant, or tool route | Did routing choose a costly model or enter a retry loop? |
| Latency | Slow spans by model, retrieval, tool call, or approval queue | Which step controls user-visible delay? |
| Data protection | Privacy scanner hit, raw sensitive payload stored, or unexpected export | Which span crossed the boundary? |

Datadog's LLM Observability docs describe three things a trace can represent: an individual LLM inference with tokens, error information, and latency; a predetermined workflow that groups LLM calls with contextual operations such as tool calls or preprocessing; or a dynamic workflow executed by an LLM agent. That is the right mental model for alerting: do not alert only on the final request. Alert on the span that failed.
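Span-level alerting can be sketched as a classifier that maps each span to one of the failure classes in the table. The status values, field names, and the 2000 ms latency budget are assumptions for this example, and only five of the seven classes are covered; cost and data-protection signals usually need aggregation or a scanner, not a single-span check.

```python
from typing import Any, Dict, List, Optional


def failure_class(span: Dict[str, Any], latency_budget_ms: float) -> Optional[str]:
    """Map one span to a failure class, or None when the span looks healthy."""
    kind, status = span.get("span_kind"), span.get("status")
    if kind == "tool_call" and status in {"error", "timeout", "retry"}:
        return "tool_reliability"
    if status == "policy_blocked" or span.get("policy_decision") in {"deny", "escalate"}:
        return "policy_control"
    if kind == "retrieval" and not span.get("source_ids"):
        return "retrieval_quality"
    if span.get("eval_result") == "fail":
        return "output_quality"
    if (span.get("latency_ms") or 0) > latency_budget_ms:
        return "latency"
    return None


def span_alerts(trace: Dict[str, Any], latency_budget_ms: float = 2000.0) -> List[Dict[str, Any]]:
    """Return one alert per failing span, so the alert names the step that broke."""
    alerts = []
    for span in trace.get("spans", []):
        cls = failure_class(span, latency_budget_ms)
        if cls is not None:
            alerts.append({"trace_id": trace["trace_id"], "span": span.get("name"),
                           "failure_class": cls})
    return alerts


observed = {"trace_id": "agent_run_01", "spans": [
    {"name": "lookup_order", "span_kind": "tool_call", "status": "timeout"},
    {"name": "retrieve_policy", "span_kind": "retrieval", "status": "ok", "source_ids": []},
    {"name": "compose_response", "span_kind": "final_response", "status": "ok", "latency_ms": 120},
]}
print(span_alerts(observed))
```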

Phoenix describes a trace as a record of a single run, broken into spans that show how agents, tasks, and tools executed. That trace becomes the raw data for evaluation and iteration. Treat that as the operational loop: trace the run, evaluate the result, fix the release contract, and then promote the agent.

The Practical Stack

The durable stack is vendor-neutral at the trace layer and LLM-aware at the review layer. OpenTelemetry gives you correlation across services and a shared trace vocabulary. Its GenAI semantic conventions repository covers spans, metrics, and events for GenAI clients, MCP, and provider-specific conventions. That makes it a good foundation when agent runs need to connect to your API logs, queues, databases, billing, and incident system.

Above that, pick the smallest LLM-aware tool that fits your workflow:

| Stack choice | Use it when | Production note |
| --- | --- | --- |
| OpenAI Agents SDK tracing | You build with the Agents SDK and need immediate traces for model calls, tools, handoffs, guardrails, and custom events | Add business fields and eval results yourself; built-in tracing is not the whole release gate |
| OpenTelemetry plus Langfuse | You want open-source LLM tracing, prompt tracking, evals, datasets, dashboards, and self-hosting options | Keep redaction and retention explicit before storing prompts and outputs |
| LangSmith | You already use LangChain or LangGraph, or you want integrated traces, dashboards, online evals, automations, and feedback queues | Make sure non-LangChain services still correlate through trace IDs |
| Datadog LLM Observability | You want agent traces tied to an existing production observability and incident workflow | Use span-level fields so alerts point to the failing decision, not just the endpoint |
| Phoenix | You want a trace-first open-source workflow for understanding runs and moving into evaluations | Keep the trace schema consistent if you later export elsewhere |

The wrong stack is a disconnected mix: one dashboard for prompts, one log system for APIs, one spreadsheet for reviewer notes, and no trace ID tying them together. The right stack lets an engineer open one run and answer:

What did the user ask?
Which workflow handled it?
Which prompt and model version ran?
Which tools were called, and why?
Which sources or memories were read?
Which policy decisions applied?
Was human approval required?
What did it cost?
Which evals passed or failed?
Which release introduced the behavior?

If those questions need separate manual searches, the observability layer is not ready for production.
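One way to test that claim is to write the lookup as a single function over one trace. The sketch below assumes the dictionary-shaped trace and field names used in the event contract, plus a hypothetical `deployment_sha` field for release context; `cost` is assumed to be a number in one consistent unit.

```python
from typing import Any, Dict


def explain_run(trace: Dict[str, Any]) -> Dict[str, Any]:
    """Answer the run-review questions from a single trace, with no other lookups."""
    spans = trace.get("spans", [])
    return {
        "workflow": trace.get("workflow"),
        "release": {key: trace.get(key) for key in ("prompt_version", "model", "deployment_sha")},
        "tools": [(s.get("tool_name"), s.get("tool_intent"))
                  for s in spans if s.get("span_kind") == "tool_call"],
        "sources": [source_id
                    for s in spans if s.get("span_kind") in ("retrieval", "memory_read")
                    for source_id in s.get("source_ids", [])],
        "policy_decisions": [s["policy_decision"] for s in spans if "policy_decision" in s],
        "approval_required": any(
            s.get("approval_state") not in (None, "not_required") for s in spans
        ),
        "total_cost": sum(s.get("cost") or 0 for s in spans),
        "failed_evals": [s.get("name") for s in spans if s.get("eval_result") == "fail"],
    }


reviewed = {
    "trace_id": "agent_run_01",
    "workflow": "support_refund_review",
    "prompt_version": "refund_router_v7",
    "model": "provider/model-version",
    "spans": [
        {"name": "classify_request", "span_kind": "model_call", "cost": 0.5},
        {"name": "lookup_order", "span_kind": "tool_call", "tool_name": "orders.read",
         "tool_intent": "verify refund window", "policy_decision": "allow",
         "approval_state": "not_required"},
        {"name": "retrieve_policy", "span_kind": "retrieval", "source_ids": ["policy_12"]},
        {"name": "compose_response", "span_kind": "model_call", "cost": 0.25,
         "eval_result": "pass"},
    ],
}
print(explain_run(reviewed))
```

Any key that comes back as `None` or empty for a run that should have it is a gap in the trace schema, which is cheaper to find here than during an incident.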

The Launch Checklist

Launch readiness is not "we have traces." It is "we can explain, evaluate, and stop a bad run before it becomes a product incident." Use this checklist before opening the agent to real users.

Define the run schema

Write the required trace fields, span kinds, redaction rules, retention rules, and owner for each workflow. If a span kind is optional, write the condition that makes it optional.

Instrument the risky boundaries

Trace model calls, tool calls, retrieval, memory reads, memory writes, approvals, guardrails, and final responses. The riskiest boundary is the one where the agent can change external state.

Attach evals to traces

Store task success, evidence support, policy compliance, tool correctness, and reviewer feedback on the same trace as the run.

Create blocking gates

Block release when traces are missing required fields, tool calls are unlogged, retrieval sources are absent, policy evals fail, or required approvals are missing.

Route incidents by span

Send alerts to the owner of the failing span: model routing, retrieval, tool API, policy engine, approval queue, or product workflow.
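Routing by span can be as small as an ownership table keyed by span kind. The sketch below is illustrative only: the owner names and span kind names are hypothetical, and any status other than `ok` is treated as a failure.

```python
from typing import Any, Dict, Optional

# Hypothetical ownership map: span kind -> team that receives the incident.
SPAN_OWNERS = {
    "model_call": "model-routing",
    "retrieval": "retrieval",
    "tool_call": "tool-api",
    "policy": "policy-engine",
    "approval": "approval-queue",
}
DEFAULT_OWNER = "product-workflow"


def first_failing_span(trace: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the earliest span whose status is not ok, or None for a clean run."""
    for span in trace.get("spans", []):
        if span.get("status", "ok") != "ok":
            return span
    return None


def route_incident(trace: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build an incident for the owner of the first failing span."""
    span = first_failing_span(trace)
    if span is None:
        return None
    return {
        "owner": SPAN_OWNERS.get(span.get("span_kind"), DEFAULT_OWNER),
        "trace_id": trace["trace_id"],
        "span": span.get("name"),
        "status": span.get("status"),
    }


failing = {"trace_id": "agent_run_01", "spans": [
    {"name": "classify_request", "span_kind": "model_call", "status": "ok"},
    {"name": "lookup_order", "span_kind": "tool_call", "status": "timeout"},
    {"name": "compose_response", "span_kind": "final_response", "status": "error"},
]}
print(route_incident(failing))
```

Routing on the earliest failure is a deliberate simplification: later failures in the same run are often consequences of the first one.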

The operating principle is that every agent run should leave enough evidence for an engineer to reproduce the decision chain without re-running the model. Re-running a model is not debugging. It changes the system under inspection. The trace is the artifact you can review, compare, evaluate, and hand to an incident owner.

What is AI agent observability?

AI agent observability is the trace, metric, log, eval, and feedback layer that explains what an agent did, why it did it, what it cost, and whether the outcome met the release contract. The useful unit is a full agent run, not a single prompt.

What should be logged for an AI agent?

Log the trace ID, workflow, user or tenant context, model and prompt version, tool calls, retrieval sources, memory reads and writes, approval state, final outcome, cost, latency, eval result, and policy exceptions. Redact sensitive payloads by default and store secure references where full payload review is required.

Is OpenTelemetry enough for AI agents?

OpenTelemetry is the right correlation and transport layer, especially when agent traces need to connect to the rest of your system. Most teams still need an LLM-aware run store or eval platform for prompts, tool calls, retrieval, review queues, and quality scoring.

How is observability different from evals?

Observability explains a live run. Evals judge whether the run met a quality, safety, or business bar. A production setup stores eval outcomes on the same trace so a failed score can be debugged against the exact model call, tool call, retrieval set, and approval path.

Which AI agent observability tool should a team start with?

Start with the tool that fits your execution stack, then enforce your own trace schema. OpenAI Agents SDK tracing is the fastest path for Agents SDK apps, LangSmith fits LangChain and LangGraph teams, Langfuse fits open-source LLM observability and eval workflows, Datadog fits teams already operating production systems there, and Phoenix fits trace-first open-source evaluation work.

Roll Out the Agentic SDLC

Instrument agent workflows with traces, evals, approval gates, and CI release checks before they carry production traffic.

Last Updated

Jun 26, 2026

Category: Evals & Observability

