OpenAI Agents SDK Tracing: What It Shows in Production

Use OpenAI Agents SDK tracing as run inspection, not full observability. Configure sensitive data, flushing, trace exports, evals, and approvals.

Monday, June 8, 2026

Omid Saffari

OpenAI Agents SDK tracing is a strong debug layer for agent runs, not a complete production observability system. Use it to see the agent path, tool calls, guardrails, handoffs, and sensitive payload risk, then add your own logs, evals, retention policy, and approval queues before real traffic.

The verdict: tracing is necessary, not sufficient

Use OpenAI Agents SDK tracing when you need to inspect how an agent actually moved through a run. Do not treat it as the whole production control layer.

The official Python tracing docs and TypeScript tracing docs say the SDK records LLM generations, tool calls, handoffs, guardrails, and custom events. That is exactly the shape you want when an agent gives a wrong answer, calls the wrong function, triggers a guardrail, or hands work to the wrong specialist.

The missing layer is everything outside the agent runtime. A trace does not decide your retention policy. It does not become your customer-visible incident trail. It does not define when a human must approve a destructive action. It does not turn production traces into a clean eval dataset. It does not replace the product logs that join a run back to account, plan, feature flag, deployment, model version, prompt version, approval outcome, and support ticket.

The production rule is simple: keep SDK tracing on for visibility unless your data policy says otherwise, then wrap it in an application-owned control plane. If the runtime choice itself is still open, settle that first. The tradeoff between OpenAI Agents SDK and graph-first orchestration is covered in OpenAI Agents SDK vs LangGraph for production agents. This piece assumes the SDK is already the right runtime and focuses on the tracing boundary.

What the SDK records by default

The SDK gives you useful trace coverage before you write custom instrumentation. That is the point of starting here instead of bolting on generic request logs after launch.

For Python, the docs say Runner.{run, run_sync, run_streamed}() is wrapped in a trace by default. Each agent run becomes an agent_span(). LLM generations become generation_span(). Function tool calls become function_span(). Guardrails become guardrail_span(). Handoffs become handoff_span().

For TypeScript, the docs describe the same model with Trace, AgentSpan, GenerationSpan, FunctionSpan, GuardrailSpan, and HandoffSpan. The official Agents guide also positions the SDK for applications that own orchestration, tool execution, approvals, and state across Python and TypeScript.

That default coverage answers the first debugging question: where did the run go? It shows the path through the agent loop, not only the final response. That matters when the final answer is wrong but the root cause was earlier: a missing tool result, a malformed function output, a guardrail that blocked the wrong branch, or a handoff that transferred ownership too early.

It also tells you what the trace does not prove. A trace can show that a tool was called. It cannot prove the tool result was correct unless your tool has its own validation and domain checks. A trace can show that a guardrail fired. It cannot prove the guardrail is calibrated unless you test it against known bad and known acceptable cases. A trace can show a handoff. It cannot prove the next agent had the right business context unless the handoff payload is designed and evaluated.
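As a minimal sketch of the "own validation and domain checks" point, assume a hypothetical order-lookup tool whose result is a dict with order_id, status, and total_cents fields. None of these names come from the SDK. The tool validates its own output and returns a verdict that the application can log next to the trace:

```python
from dataclasses import dataclass

# Hypothetical domain rules for an illustrative order-lookup tool.
ALLOWED_STATUSES = {"pending", "shipped", "delivered", "cancelled"}


@dataclass(frozen=True)
class ValidationResult:
    ok: bool
    reasons: tuple[str, ...]


def validate_order_result(result: dict) -> ValidationResult:
    """Check a tool result against domain rules the trace cannot see."""
    reasons = []
    order_id = result.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        reasons.append("missing order_id")
    if result.get("status") not in ALLOWED_STATUSES:
        reasons.append("unknown status")
    total = result.get("total_cents")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(total, int) or isinstance(total, bool) or total < 0:
        reasons.append("invalid total_cents")
    return ValidationResult(ok=not reasons, reasons=tuple(reasons))
```

The function span still shows that the tool ran. The validation result is what tells you whether its output should be trusted.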

The first production pass is to map each trace span to the failure it helps diagnose:

Trace surface	What it helps answer	What still belongs outside the trace
Agent span	Which agent owned this step?	Whether that agent should have owned it
Generation span	What did the model receive and return?	Whether the answer meets your product quality bar
Function span	Which tool ran and with what payload?	Whether the tool result is valid, authorized, and complete
Guardrail span	Which safety or policy check ran?	Whether the guardrail catches your real launch risks
Handoff span	Where did the workflow transfer control?	Whether the transfer preserved enough state

Configure trace identity like production telemetry

Use trace identity fields as join keys, not decorative labels. A trace that cannot be joined back to product context becomes a screenshot for debugging, not production telemetry.

The docs define a trace as a single end-to-end workflow composed of spans. Trace properties include workflow_name, trace_id, optional group_id, disabled, and optional metadata. If you do not provide a trace_id, the SDK generates one; if you supply your own, it must match the format trace_<32_alphanumeric>. Spans carry started_at, ended_at, trace_id, optional parent_id, and span_data.
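If the application wants to write the trace ID into its own logs before the run starts, it can build one in the trace_<32_alphanumeric> shape itself. The sketch below assumes a UUID4 hex string is an acceptable source for the 32 characters; the helper names are illustrative and not part of the SDK:

```python
import re
import uuid

# Mirrors the documented trace_<32_alphanumeric> shape.
TRACE_ID_PATTERN = re.compile(r"trace_[A-Za-z0-9]{32}")


def new_trace_id() -> str:
    """Build a trace ID from a UUID4 hex string (32 hex characters)."""
    return f"trace_{uuid.uuid4().hex}"


def is_valid_trace_id(value: str) -> bool:
    """Return True only when the whole string matches the expected format."""
    return TRACE_ID_PATTERN.fullmatch(value) is not None
```

Validating IDs at the boundary is cheap and catches the case where a product request ID is passed by mistake.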

The default trace name is Agent workflow. That is fine for a demo and too vague for a live system. Production naming should separate the stable workflow from the specific account, user, ticket, or conversation. Put the stable name in workflow_name. Use group_id for the conversation or thread join. Use metadata for dimensions you will actually query later.

Name the workflow, not the incident

Use names such as contract_review_intake, support_triage, or sales_research_handoff. Avoid names that include user text, ticket titles, or incident descriptions. The workflow name should survive many runs.

Join runs with group IDs

Use group_id to connect traces from the same conversation or product thread. This lets you see repeated attempts and follow-up actions without leaking raw user input into the workflow name.

Keep metadata boring and queryable

Add fields such as environment, release, surface, prompt_version, model_alias, and approval_policy. Do not put secrets, raw documents, or long prompts in metadata.

Here is the shape we use for Python workers that need trace identity and a deterministic flush point:

Python

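from agents import Agent, Runner, RunConfig, flush_traces, trace


def run_checked_agent(agent: Agent, prompt: str, thread_id: str) -> str:
    """Run one agent turn inside a named trace, then flush the exporter."""
    try:
        # Stable workflow name; group_id joins traces from the same thread.
        with trace(
            "support_triage",
            group_id=thread_id,
            metadata={
                "environment": "production",
                "surface": "inbox",
                "prompt_version": "support_triage_current",
            },
        ):
            result = Runner.run_sync(
                agent,
                prompt,
                # Keep generation and tool payloads out of the trace.
                run_config=RunConfig(trace_include_sensitive_data=False),
            )
            return result.final_output
    finally:
        # Runs after the trace context has exited, on success or failure.
        flush_traces()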

The exact metadata names are less important than consistency. Decide the fields before launch, then make them part of the agent runtime contract.
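One way to make the fields part of a contract is to build metadata through a single function that rejects anything unexpected. This is a sketch under assumptions: the allowed keys are the example fields named earlier in this section, and the 64-character cap is an arbitrary illustrative limit, not an SDK rule.

```python
ALLOWED_METADATA_KEYS = frozenset(
    {"environment", "release", "surface", "prompt_version", "model_alias", "approval_policy"}
)
MAX_VALUE_LENGTH = 64  # assumed cap; long values usually mean raw content leaked in


def build_trace_metadata(**fields: str) -> dict[str, str]:
    """Return a metadata dict, rejecting unknown keys and oversized values."""
    unknown = set(fields) - ALLOWED_METADATA_KEYS
    if unknown:
        raise ValueError(f"unknown metadata keys: {sorted(unknown)}")
    for key, value in fields.items():
        if not isinstance(value, str) or len(value) > MAX_VALUE_LENGTH:
            raise ValueError(f"metadata value for {key!r} must be a short string")
    return dict(fields)
```

A call such as build_trace_metadata(environment="production", surface="inbox") returns the dict unchanged, while a stray key like user_text raises before anything reaches the trace.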

Sensitive data is the first launch gate

Do the sensitive-data decision before you enable production tracing, not after the first incident. The Python docs state that generation_span() stores LLM generation inputs and outputs, and function_span() stores function call inputs and outputs. They also state that trace_include_sensitive_data defaults to True.

That default is useful for debugging because it gives you the missing context around model calls and tool calls. It is also risky if prompts, retrieved documents, tool arguments, or tool outputs can contain customer data, internal documents, credentials, private messages, contract text, medical information, financial data, or regulated records.

The SDK gives you controls. Python tracing can be disabled globally with OPENAI_AGENTS_DISABLE_TRACING=1, disabled in code with set_tracing_disabled(True), or disabled for one run with agents.run.RunConfig.tracing_disabled=True. Sensitive-data capture can be controlled with RunConfig.trace_include_sensitive_data or the OPENAI_AGENTS_TRACE_INCLUDE_SENSITIVE_DATA environment variable.

The sharpest constraint is Zero Data Retention. The Python tracing docs say tracing is unavailable for organizations operating under a Zero Data Retention policy using OpenAI APIs. If ZDR is a hard requirement, design the observability path around your own logs, your own redaction layer, and an external trace store that satisfies the policy. Do not discover this during procurement review.

The practical launch gate is a policy matrix:

Payload class	Trace policy	Application log policy
User prompt	Capture only if allowed by data policy	Store request ID and safe category label
Retrieved document	Prefer document ID over raw text	Store source ID, permission scope, retrieval rank, and citation outcome
Tool arguments	Redact secrets and personal data	Store schema version, validation result, and authorization decision
Tool output	Avoid raw payloads when outputs are sensitive	Store checksum, status, and domain validation result
Human approval	Trace the pause and resume point	Store approver role, approval state, and policy reason

This is the difference between a useful trace and a liability. The trace should explain the run without becoming a second copy of everything sensitive in the product.
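Two rows of the matrix above, redacting tool arguments and storing a checksum instead of raw tool output, can be sketched in a few lines of standard-library Python. The list of sensitive key names and the placeholder string are assumptions for illustration; a real system would derive them from its own tool schemas.

```python
import hashlib
import json

REDACTED = "[REDACTED]"
# Assumed list of sensitive argument names; replace with your own schema.
SENSITIVE_KEYS = frozenset({"password", "api_key", "ssn", "email"})


def redact_tool_args(args: dict) -> dict:
    """Replace sensitive values, recursing into nested dicts."""
    cleaned = {}
    for key, value in args.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = REDACTED
        elif isinstance(value, dict):
            cleaned[key] = redact_tool_args(value)
        else:
            cleaned[key] = value
    return cleaned


def summarize_tool_output(payload: dict, status: str) -> dict:
    """Produce a checksum and status to log instead of the raw output."""
    # Sorted keys make the checksum independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    checksum = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"checksum": checksum, "status": status}
```

The checksum lets you confirm later that two runs saw the same tool output without keeping a copy of the output itself.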

Flush traces where runtimes end quickly

Flush behavior is a production reliability issue, not a cosmetic dashboard detail. A missing trace after a failed job is exactly when the team needs the trace most.

For Python, OpenAI's docs say the default BatchTraceProcessor exports traces in the background and performs a final flush when the process exits. They also call out long-running workers such as Celery, RQ, Dramatiq, and FastAPI background tasks: traces are usually exported automatically, but they may not appear in the Traces dashboard immediately after each job finishes. If you need an immediate delivery guarantee at the end of a unit of work, call flush_traces() after the trace context exits.

For TypeScript, the runtime matters. The TypeScript docs say that in supported server runtimes, traces are exported on a regular interval. In Cloudflare Workers, the automatic export loop is unavailable even though tracing is still enabled, so you should call getGlobalTraceProvider().forceFlush() as part of the request lifecycle. The same docs say tracing is disabled in browsers by default.

For an edge worker, the pattern is explicit:

TypeScript

import { getGlobalTraceProvider } from "@openai/agents";

// Placeholder: build and run the agent here, then map its result to a Response.
async function runAgentRequest(request: Request): Promise<Response> {
  return new Response("ok");
}

export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext) {
    try {
      return await runAgentRequest(request);
    } catch (error) {
      console.error(error);
      return new Response("agent error", { status: 500 });
    } finally {
      // Keep the worker alive until buffered traces are exported.
      ctx.waitUntil(getGlobalTraceProvider().forceFlush());
    }
  },
};

Do not wait until traces vanish intermittently. Add the flush point wherever the runtime can end before the exporter has naturally sent the batch: queue workers, short-lived serverless functions, edge runtimes, and background jobs that exit immediately after an agent run.
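For Python queue workers and background jobs, one way to avoid forgetting the flush is a small decorator that always runs it after the job body. This is a sketch, not an SDK feature: flush_after is a hypothetical helper, and the flush callable is passed in so the pattern can be tested without the SDK. In a real worker you would pass the SDK's flush_traces.

```python
import functools
from typing import Callable, TypeVar

T = TypeVar("T")


def flush_after(flush: Callable[[], None]) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Decorator factory: run `flush` after the wrapped job, even if it raises."""

    def decorator(job: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(job)
        def wrapper(*args, **kwargs) -> T:
            try:
                return job(*args, **kwargs)
            finally:
                # Executes on both success and failure, after the job body
                # (and any trace context inside it) has finished.
                flush()

        return wrapper

    return decorator
```

Because the trace context lives inside the job body, the flush in the wrapper runs after that context has exited, which matches the ordering described for flush_traces() above.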

Export traces, but do not replace observability with a dashboard

Use custom processors when traces need to feed another backend, but decide what the external backend is responsible for. A trace export is not automatically an eval system, incident process, or governance layer.

The Python docs describe the tracing architecture as a global TraceProvider, a BatchTraceProcessor, and a BackendSpanExporter that exports spans and traces to the OpenAI backend in batches. They give two customization paths. add_trace_processor() adds an additional processor that receives traces and spans while still sending to OpenAI's backend. set_trace_processors() replaces the default processors, which means traces will not be sent to OpenAI's backend unless you include a processor that does so.

The TypeScript docs describe the same production footgun with addTraceProcessor() and setTraceProcessors(). Additive export is usually the safer first move. Replacement is for teams that have already decided where traces live, how access is controlled, how retention works, and how incident review pulls evidence.
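A rough sketch of what an additive processor might look like when it forwards only join-safe fields. It assumes the processor interface exposes on_trace_start, on_trace_end, on_span_start, on_span_end, shutdown, and force_flush hooks, so check the SDK's processor base class before relying on those names. The sink callable is hypothetical, and the class is written as a plain duck-typed object so it runs without the SDK installed.

```python
from typing import Any, Callable


class SafeFieldProcessor:
    """Forward identifiers and timings only, never span payloads."""

    def __init__(self, sink: Callable[[dict], None]) -> None:
        self._sink = sink  # hypothetical destination, e.g. a log shipper

    def on_trace_start(self, trace: Any) -> None:
        self._sink({"event": "trace_start", "trace_id": trace.trace_id})

    def on_trace_end(self, trace: Any) -> None:
        self._sink({"event": "trace_end", "trace_id": trace.trace_id})

    def on_span_start(self, span: Any) -> None:
        pass  # nothing useful to record until the span has finished

    def on_span_end(self, span: Any) -> None:
        self._sink({
            "event": "span_end",
            "trace_id": span.trace_id,
            "parent_id": span.parent_id,
            "started_at": span.started_at,
            "ended_at": span.ended_at,
            # Record the kind of span, not its inputs or outputs.
            "span_type": type(span.span_data).__name__,
        })

    def shutdown(self) -> None:
        pass  # the sink owns its own lifecycle in this sketch

    def force_flush(self) -> None:
        pass  # nothing is buffered here
```

Registered through add_trace_processor(), a processor like this would run alongside the default export to OpenAI's backend rather than replacing it.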

External observability still needs a data contract. If you send traces into another system, decide which fields are authoritative:

OpenAI trace ID: the run inspection key.
Product request ID: the user-facing support and incident key.
Prompt version: the eval and regression key.
Model alias: the routing and cost-governance key.
Tool schema version: the tool failure and rollback key.
Approval state: the human-control key.
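The six keys above can be captured as one record that the application writes per run. This is a minimal sketch; the class name, field names, and JSON log format are illustrative choices, not something the SDK prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """One application-owned row that joins a trace to product context."""

    openai_trace_id: str
    product_request_id: str
    prompt_version: str
    model_alias: str
    tool_schema_version: str
    approval_state: str

    def to_log_line(self) -> str:
        """Serialize with sorted keys so log lines diff and index cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Making every field required means a run cannot be logged without all six join keys, which is the discipline this section is arguing for.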

This is where a comparison such as Langfuse vs LangSmith for production observability matters. The choice of external observability tool is less important than the discipline of sending it useful, low-risk, joinable data.

The production split should be clear:

Layer	Owns
SDK trace	Agent path, spans, tool calls, guardrails, handoffs
Application log	Account, request, release, feature flag, permission scope
Eval store	Labeled examples, expected behavior, scorer outputs, regression history
Approval queue	Human decision, policy reason, resume token, audit trail
Cost ledger	Model alias, call class, budget owner, routing policy

The trace is one source. The control layer is the system you build around it.
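As one small example of a decision that lives in the control layer rather than the trace, here is a sketch of an approval rule. The tool names, the rule itself, and the ApprovalDecision shape are all hypothetical; the point is only that the policy reason is produced by application code and can be stored in the approval queue.

```python
from dataclasses import dataclass

# Hypothetical tool names; which tools count as destructive is an application decision.
DESTRUCTIVE_TOOLS = frozenset({"delete_account", "issue_refund", "send_email"})


@dataclass(frozen=True)
class ApprovalDecision:
    requires_human: bool
    policy_reason: str


def approval_policy(tool_name: str, environment: str) -> ApprovalDecision:
    """Decide, outside the trace, whether a tool call must pause for a human."""
    if tool_name in DESTRUCTIVE_TOOLS and environment == "production":
        return ApprovalDecision(True, f"{tool_name} is destructive in production")
    return ApprovalDecision(False, "no approval rule matched")
```

The trace can then record the pause and resume point, while the decision and its reason stay in a store the product team controls.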

The production checklist before enabling tracing

Ship tracing with a checklist, because the default is easy and production consequences are not. The checklist is short enough to enforce in code review.

Decide whether tracing is allowed

Check data retention, customer commitments, and Zero Data Retention requirements before enabling tracing in production. If tracing is not allowed, turn it off and build an internal trace path that matches the policy.

Set the sensitive-data default

Choose whether generation and function inputs and outputs can be captured. If not, set trace_include_sensitive_data to false and replace raw payload visibility with safe identifiers and validation outcomes.

Define workflow names and metadata

Name each workflow and standardize metadata before launch. The trace should join back to product logs without exposing raw customer content.

Add flush points

Call flush_traces() in Python units of work that need immediate export. Call forceFlush() in TypeScript runtimes such as Cloudflare Workers where the automatic export loop is unavailable.

Choose additive or replacement export

Start with additive processors when you want OpenAI traces plus another backend. Use replacement processors only when you intend to stop sending traces to OpenAI or have explicitly restored that exporter.

Promote traces into evals deliberately

Use traces to find examples, then curate them. Do not blindly pour production traces into eval datasets without labels, privacy review, deduplication, and failure taxonomy.
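The last checklist item can be sketched as a filter that sits between trace-derived candidates and the eval store. The candidate dict shape (input, label, privacy_reviewed, failure_category) and the taxonomy values are assumptions for illustration only.

```python
import hashlib

# Assumed failure taxonomy; replace with the categories your team agrees on.
FAILURE_TAXONOMY = frozenset({"wrong_tool", "bad_handoff", "guardrail_miss", "wrong_answer"})


def promote_candidates(candidates: list[dict]) -> list[dict]:
    """Keep only labeled, privacy-reviewed, taxonomy-tagged, non-duplicate examples."""
    seen: set[str] = set()
    promoted = []
    for item in candidates:
        if not item.get("label") or not item.get("privacy_reviewed"):
            continue
        if item.get("failure_category") not in FAILURE_TAXONOMY:
            continue
        # Deduplicate on whitespace- and case-normalized input text.
        normalized = " ".join(item.get("input", "").lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        promoted.append(item)
    return promoted
```

Anything the filter drops is still useful as a signal: a pile of unlabeled or uncategorized candidates usually means the review step is the bottleneck, not the tracing.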

The durable rule is to keep traces close to the run and keep decisions close to the product. Tracing tells you what happened inside the agent. Production observability tells you whether that behavior was allowed, useful, costly, safe, and worth repeating.

Is OpenAI Agents SDK tracing enabled by default?

Yes. The Python tracing docs say tracing is enabled by default and can be disabled globally with OPENAI_AGENTS_DISABLE_TRACING=1, in code with set_tracing_disabled(True), or for a single run with RunConfig.tracing_disabled=True. In TypeScript, the docs describe tracing as active in supported server runtimes but disabled in browsers by default.

Does OpenAI Agents SDK tracing work with Zero Data Retention?

No, not through OpenAI's hosted tracing path. The Python docs say tracing is unavailable for organizations operating under a Zero Data Retention policy using OpenAI APIs.

Does tracing capture tool inputs and outputs?

Yes. The Python docs say function_span() stores function call inputs and outputs, and generation_span() stores LLM generation inputs and outputs. Treat that as a launch gate for redaction and sensitive-data policy.

Can OpenAI Agents SDK traces go to another backend?

Yes. The SDK supports custom trace processors. Use additive processors when you want another destination alongside OpenAI's backend. Use replacement processors only when you accept that the default OpenAI export stops unless you include a processor that sends to it.

Should traces become eval examples automatically?

No. Traces are a strong source of candidate examples, but production evals need labels, expected behavior, privacy review, and a stable failure taxonomy. Promote examples deliberately.

Last Updated

Jun 8, 2026

Category: Evals & Observability
