Trace-to-Eval Builder Build Log

A build log for Trace-to-Eval Builder, a BYOK app that turns agent traces into replayable eval packs.

Monday, June 22, 2026

Trace-to-Eval Builder is the second Omid Saffari Labs open-source release: a BYOK Next.js app that turns raw agent traces into replayable eval cases, scorer ideas, and instrumentation gaps. It is aimed at teams shipping agent workflows where a passing demo is not enough; the useful artifact is the eval that catches the same failure next time.

Why this shipped

The first release, Agent Runbook Auditor, reviews an agent plan before launch. This release covers the other side of the loop: after a run fails, paste the trace and convert the failure into a small eval pack.

The tool supports a specific production habit: when an agent ignores a tool result, invents a cause, loops on retries, or gives a polished but unsupported final answer, the trace should become a regression case within minutes. Trace-to-Eval Builder turns that trace into failure modes, replay prompts, scorer checks, JSONL seeds, and a list of missing telemetry.

Build shape

Repository: trace-to-eval-builder
Demo: trace-to-eval-builder.vercel.app
Category: evals and observability
Runtime: Next.js App Router on Vercel
Model path: OpenAI Responses API with gpt-5.5
Trust boundary: visitor-supplied OpenAI key only

The project was created from the Omid Saffari Labs golden template. Only the approved project surface changed: the capability slot, project metadata, the page UI, the README, and the OpenAI SDK dependency. The BYOK route, key storage, redaction helper, GitHub Actions workflow, and Playwright leak test stayed frozen.

Capability

The capability prompt asks the model to read a trace as a QA engineer, separate observed facts from inferred risks, and return a practical markdown pack. The output contract is intentionally simple:

Trace readout
Failure modes
Replay eval cases
Scorers
JSONL seed cases
Next instrumentation
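
A fixed list of sections like this is easy to check mechanically. The sketch below is illustrative only and is not part of the app: missingSections is a hypothetical helper, and it assumes the model emits each section title as a markdown heading, which the contract above does not state.

```typescript
// Section titles copied from the output contract listed above.
const REQUIRED_SECTIONS: string[] = [
  "Trace readout",
  "Failure modes",
  "Replay eval cases",
  "Scorers",
  "JSONL seed cases",
  "Next instrumentation",
];

// Hypothetical helper: returns the contract sections that do not appear
// as a markdown heading (any level) in the generated pack.
// Assumption: each section is emitted as a "#"-style heading line.
function missingSections(markdown: string): string[] {
  const headings = new Set<string>();
  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*?)\s*$/.exec(line);
    if (match !== null) {
      headings.add((match[1] ?? "").toLowerCase());
    }
  }
  return REQUIRED_SECTIONS.filter(
    (section) => !headings.has(section.toLowerCase()),
  );
}
```

A check like this could run against a saved pack before it is committed as a regression fixture, so that a truncated stream is caught early.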

The app streams response.output_text.delta events back to the browser. Requests use store: false, low reasoning effort, and the key supplied in the session-only BYOK input. There is no server-side fallback key and no provider key in the repository.
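
The request settings and the delta filtering described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the app's actual route code: buildRequestParams, textDelta, and joinDeltas are hypothetical helpers, and StreamEvent is a reduced stand-in for the richer event types the OpenAI SDK emits.

```typescript
// Reduced event shape for illustration; real stream events carry more fields.
type StreamEvent = { type: string; delta?: string };

// Request settings named in the text: streaming on, storage off, low reasoning effort.
type RequestParams = {
  model: string;
  instructions: string;
  input: string;
  stream: boolean;
  store: boolean;
  reasoning: { effort: string };
};

// Hypothetical helper: assembles the parameters for one trace.
// The instructions string stands in for the capability prompt.
function buildRequestParams(trace: string, instructions: string): RequestParams {
  return {
    model: "gpt-5.5",
    instructions,
    input: trace,
    stream: true,
    store: false,
    reasoning: { effort: "low" },
  };
}

// Returns the text chunk for output-text delta events, or null for every other event.
function textDelta(event: StreamEvent): string | null {
  if (event.type === "response.output_text.delta" && typeof event.delta === "string") {
    return event.delta;
  }
  return null;
}

// Concatenates the text chunks from a sequence of events, skipping non-text events.
function joinDeltas(events: StreamEvent[]): string {
  let text = "";
  for (const event of events) {
    const chunk = textDelta(event);
    if (chunk !== null) text += chunk;
  }
  return text;
}
```

In a route handler, parameters like these would presumably be passed to the SDK client constructed from the visitor's key, with each non-null chunk written to the response stream as it arrives.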

Gate results

Local gates passed before release:

bun run lint
bun run typecheck
bun run build
PORT=3214 bun run test

GitHub Actions then passed the full template gate set on commit 7afd891: install, lint, build, typecheck, gitleaks, Chromium install, and the BYOK no-leak smoke test. The only local wrinkle was a port collision during the first Playwright run: port 3000 was already serving another DVNC app, so the test was rerun on port 3214 through the template's configurable PORT setting.
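
For readers unfamiliar with the pattern, a configurable test port usually comes down to reading an environment variable with a fallback. The sketch below is an assumption about how that might look, not the template's actual code: resolvePort is a hypothetical helper, and the 3000 default is inferred from the collision described above.

```typescript
// Hypothetical helper: picks the port for the local test server.
// Assumption: PORT is read from the environment, with 3000 as the fallback.
function resolvePort(
  env: Record<string, string | undefined>,
  fallback: number = 3000,
): number {
  const raw = env.PORT;
  if (raw === undefined || raw.trim() === "") return fallback;
  const port = Number(raw);
  // Reject values that are not valid TCP port numbers instead of silently falling back.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${raw}`);
  }
  return port;
}
```

Failing loudly on a malformed value avoids the confusing case where a typo in PORT quietly sends the test back to the occupied default port.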

Deploy notes

The first Vercel deploy attempt did not produce a working demo because the new Vercel project used a generic framework preset and expected a public/ output directory. After the project's framework setting was changed to nextjs, the production deploy completed and was aliased to the clean demo URL.

The public demo was verified with an HTTP 200 response from Vercel and page content containing the Trace-to-Eval Builder UI. The GitHub repository was kept private until CI passed, then flipped public before deployment and catalog publication.

What to inspect

Use the sample trace in the app to see the intended workflow. It describes an agent that queries refund metrics, finds a retry-policy deploy, then gives a final answer that invents a UI cause. The generated eval pack should preserve the evidence path and turn the unsupported final answer into replayable checks.
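
To make the idea of a replayable check concrete, here is a rough sketch of what one seed case and a scorer for the sample trace might look like. Everything in it is hypothetical: the EvalCase shape, the field names, the phrases in sampleCase, and the substring-based scoring are illustrative assumptions, not the app's actual schema or output.

```typescript
// Hypothetical shape for one replayable eval case; not the app's actual schema.
type EvalCase = {
  id: string;
  replayPrompt: string;
  // Phrases the final answer should mention because the trace supports them.
  requiredEvidence: string[];
  // Phrases the final answer should not assert because the trace does not support them.
  unsupportedClaims: string[];
};

type ScoreResult = {
  pass: boolean;
  missingEvidence: string[];
  unsupportedFound: string[];
};

// Crude case-insensitive substring scorer; a real scorer would likely need
// something more robust than phrase matching.
function scoreFinalAnswer(answer: string, evalCase: EvalCase): ScoreResult {
  const text = answer.toLowerCase();
  const missingEvidence = evalCase.requiredEvidence.filter(
    (phrase) => !text.includes(phrase.toLowerCase()),
  );
  const unsupportedFound = evalCase.unsupportedClaims.filter((phrase) =>
    text.includes(phrase.toLowerCase()),
  );
  return {
    pass: missingEvidence.length === 0 && unsupportedFound.length === 0,
    missingEvidence,
    unsupportedFound,
  };
}

// Serializes cases as JSONL: one JSON object per line.
function toJsonl(cases: EvalCase[]): string {
  return cases.map((c) => JSON.stringify(c)).join("\n");
}

// Illustrative case loosely modeled on the sample trace; the phrases are made up.
const sampleCase: EvalCase = {
  id: "refund-metrics-unsupported-cause",
  replayPrompt: "Explain the refund metrics using only evidence from tool results.",
  requiredEvidence: ["retry-policy deploy"],
  unsupportedClaims: ["UI bug"],
};
```

Separating required evidence from unsupported claims mirrors the distinction the capability prompt draws between observed facts and inferred risks: an answer can fail either by omitting the evidence or by asserting a cause the trace never showed.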

For teams building agents, that is the habit this release is designed to reinforce: every meaningful trace failure should become a reusable eval, not just a note in a debugging thread.

Last updated: Jun 22, 2026
