Trace-to-Eval Builder Build Log

A build log for Trace-to-Eval Builder, a bring-your-own-key (BYOK) app that turns agent traces into replayable eval packs.

Monday, June 22, 2026 · Omid Saffari

Trace-to-Eval Builder is the second Omid Saffari Labs open-source release: a BYOK Next.js app that turns raw agent traces into replayable eval cases, scorer ideas, and instrumentation gaps. It is aimed at teams shipping agent workflows where a passing demo is not enough; the useful artifact is the eval that catches the same failure next time.

Why this shipped

The first release, Agent Runbook Auditor, reviews an agent plan before launch. This release covers the other side of the loop: after a run fails, paste the trace and convert the failure into a small eval pack.

That makes the tool useful for a specific production habit. When an agent ignores a tool result, invents a cause, loops on retries, or gives a polished but unsupported final answer, the trace should become a regression case within minutes. Trace-to-Eval Builder turns that trace into failure modes, replay prompts, scorer checks, JSONL seeds, and missing telemetry.

Build shape

The project was created from the Omid Saffari Labs golden template. Only the approved project surface changed: the capability slot, project metadata, the page UI, the README, and the OpenAI SDK dependency. The BYOK route, key storage, redaction helper, GitHub Actions workflow, and Playwright leak test stayed frozen.

Capability

The capability prompt asks the model to read a trace as a QA engineer, separate observed facts from inferred risks, and return a practical markdown pack. The output contract is intentionally simple:

  • Trace readout
  • Failure modes
  • Replay eval cases
  • Scorers
  • JSONL seed cases
  • Next instrumentation
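One way to hold the model to this contract is a small check that every section is present in the returned markdown. The sketch below is illustrative only and is not taken from the repository: the function name missingSections is hypothetical, and it assumes each section appears as a markdown heading whose text matches the section title.

```typescript
// Section titles copied from the output contract above.
const REQUIRED_SECTIONS = [
  "Trace readout",
  "Failure modes",
  "Replay eval cases",
  "Scorers",
  "JSONL seed cases",
  "Next instrumentation",
] as const;

// Assumption: each section is a markdown heading (levels 1 to 6) whose text
// equals the section title, compared case-insensitively.
function missingSections(markdown: string): string[] {
  const headings = new Set(
    markdown
      .split("\n")
      .map((line) => /^#{1,6}\s+(.+?)\s*$/.exec(line))
      .filter((m): m is RegExpExecArray => m !== null)
      .map((m) => m[1].toLowerCase()),
  );
  // Keep the contract order so the report reads top to bottom.
  return REQUIRED_SECTIONS.filter((s) => !headings.has(s.toLowerCase()));
}
```

An empty result means the pack has all six sections; anything else names the sections that still need to be generated.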

The app streams response.output_text.delta events back to the browser. Requests use store: false, low reasoning effort, and the key supplied in the session-only BYOK input. There is no server-side fallback key and no provider key in the repository.
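As a rough sketch of the streaming step, the filter below forwards only the text carried by response.output_text.delta events and ignores every other event type. It is not the app's actual route code: the StreamEvent type and both function names are assumptions made for illustration, and the event source is left abstract so the snippet does not depend on the SDK.

```typescript
// Minimal event shape: only the fields this sketch reads.
type StreamEvent = { type: string; delta?: string };

// Returns the text fragment for output-text delta events, otherwise null.
function textDelta(event: StreamEvent): string | null {
  return event.type === "response.output_text.delta" && typeof event.delta === "string"
    ? event.delta
    : null;
}

// Forwards text deltas from an event stream in arrival order.
async function* textDeltas(events: AsyncIterable<StreamEvent>): AsyncGenerator<string> {
  for await (const event of events) {
    const delta = textDelta(event);
    if (delta !== null) yield delta;
  }
}
```

Keeping the filter separate from the request makes it easy to test with a fake event stream, without a key or a network call.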

Gate results

Local gates passed before release:

  • bun run lint
  • bun run typecheck
  • bun run build
  • PORT=3214 bun run test

GitHub Actions then passed the full template gate set on commit 7afd891: install, lint, build, typecheck, gitleaks, Chromium install, and the BYOK no-leak smoke test. The only local wrinkle was a port collision during the first Playwright run: port 3000 was already serving another DVNC app, so the test was rerun on the template's configurable port.

Deploy notes

The first Vercel deploy attempt did not produce a working demo because the new Vercel project had a generic framework preset and expected a public/ output directory. After the project's framework preset was changed to nextjs, the production deploy completed and was aliased to the clean demo URL.

The public demo was verified with an HTTP 200 response from Vercel and page content containing the Trace-to-Eval Builder UI. The GitHub repository was kept private until CI passed, then flipped public before deployment and catalog publication.
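The same two checks (status 200 and recognizable page content) can be scripted. This is a minimal sketch, not the command that was actually run: demoLooksHealthy and FetchLike are hypothetical names, and the fetch function is injected so the check can be exercised without a network.

```typescript
// Narrow fetch-like signature: just what the check needs.
type FetchLike = (url: string) => Promise<{ status: number; text(): Promise<string> }>;

// True only when the page answers 200 and its body contains the marker text.
async function demoLooksHealthy(
  fetchFn: FetchLike,
  url: string,
  marker: string,
): Promise<boolean> {
  const res = await fetchFn(url);
  if (res.status !== 200) return false;
  const body = await res.text();
  return body.includes(marker);
}
```

In a runtime with a global fetch, passing fetch as the first argument, the demo URL as the second, and "Trace-to-Eval Builder" as the marker reproduces the manual verification.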

What to inspect

Use the sample trace in the app to see the intended workflow. It describes an agent that queries refund metrics, finds a retry-policy deploy, then gives a final answer that invents a UI cause. The generated eval pack should preserve the evidence path and turn the unsupported final answer into replayable checks.
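To make "replayable checks" concrete, here is one possible shape for a seed case and a scorer built from the sample trace. Everything in it is an assumption for illustration: the post does not specify the JSONL fields, so SeedCase, toJsonl, and scoreAnswer are hypothetical, and the substring matching is deliberately naive.

```typescript
// Hypothetical seed-case shape; the real JSONL fields are not documented here.
type SeedCase = {
  id: string;
  input: string; // replay prompt
  evidence: string[]; // facts observed in tool results
  forbiddenClaims: string[]; // causes the trace does not support
};

// One JSON object per line, as JSONL expects.
function toJsonl(cases: SeedCase[]): string {
  return cases.map((c) => JSON.stringify(c)).join("\n");
}

// Passes only if the answer cites at least one piece of evidence and
// repeats none of the unsupported claims (case-insensitive substring match).
function scoreAnswer(seed: SeedCase, answer: string): { pass: boolean; reasons: string[] } {
  const text = answer.toLowerCase();
  const reasons: string[] = [];
  if (!seed.evidence.some((e) => text.includes(e.toLowerCase()))) {
    reasons.push("no evidence cited");
  }
  for (const claim of seed.forbiddenClaims) {
    if (text.includes(claim.toLowerCase())) reasons.push(`unsupported claim: ${claim}`);
  }
  return { pass: reasons.length === 0, reasons };
}

// Illustrative case modeled on the sample trace described above.
const refundCase: SeedCase = {
  id: "refund-unsupported-cause",
  input: "Why did refund failures increase?",
  evidence: ["retry-policy deploy"],
  forbiddenClaims: ["UI change"],
};
```

A substring scorer like this is only a starting point; a model-graded or structured check would be a reasonable next step for answers that paraphrase the evidence.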

For teams building agents, that is the habit this release is designed to reinforce: every meaningful trace failure should become a reusable eval, not just a note in a debugging thread.

Last Updated

Jun 22, 2026
