Reference system

AI Ops Dashboard

A control layer for agent runs, costs, evals, approvals, and failures.

The problem

Teams ship AI features into production but lose visibility once traffic starts — they can't see what an agent did, what it cost, or why it failed. Without a control layer, debugging is guesswork and spend is unbounded.

System architecture

The dashboard is a read model over an append-only run log: every agent invocation writes a structured trace, which the dashboard aggregates into run, cost, and eval views. Sensitive actions wait for a human approval before they execute.
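A minimal sketch of the read-model idea, assuming an in-memory log in place of the relational store. The names TraceEvent, RunLog, RunView, and project_runs are illustrative, not part of the system described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    """One immutable entry in the run log (field names are assumptions)."""
    run_id: str
    span_id: str
    kind: str            # "run_started", "call", or "run_finished"
    ok: bool = True
    cost_usd: float = 0.0


@dataclass
class RunView:
    """Per-run view derived purely from the log."""
    run_id: str
    status: str = "running"
    calls: int = 0
    cost_usd: float = 0.0
    failing_span: Optional[str] = None


class RunLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[TraceEvent] = []

    def append(self, event: TraceEvent) -> None:
        self._events.append(event)

    def events(self) -> tuple[TraceEvent, ...]:
        return tuple(self._events)


def project_runs(log: RunLog) -> dict[str, RunView]:
    """Fold the whole log into per-run views; can be rebuilt from scratch."""
    views: dict[str, RunView] = {}
    for e in log.events():
        view = views.setdefault(e.run_id, RunView(run_id=e.run_id))
        if e.kind == "call":
            view.calls += 1
            view.cost_usd += e.cost_usd
        if not e.ok and view.failing_span is None:
            # Keep the first failing span so the failure is attributable.
            view.status = "failed"
            view.failing_span = e.span_id
        elif e.kind == "run_finished" and view.status == "running":
            view.status = "succeeded"
    return views
```

Because the views are a pure function of the log, they can be dropped and recomputed at any time without touching the agents that write to it.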

Workflow

  • An agent run starts and emits a structured trace event with run id, inputs, and model.

  • Each tool call, retry, and model hop appends to the run's timeline.

  • Token usage and provider cost are recorded per call and rolled up per run and per day.

  • Actions flagged as sensitive pause for a human approval before they execute.

  • Eval scores attach to the run so quality is visible alongside cost and latency.

  • The dashboard surfaces failures, cost spikes, and pending approvals in one view.
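The cost step of this workflow could look roughly like the sketch below. The price table, the model name "example-model", and the CallUsage fields are assumptions made for illustration; real provider pricing and usage payloads differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens.
PRICES = {
    "example-model": {"input": Decimal("0.50"), "output": Decimal("1.50")},
}


@dataclass(frozen=True)
class CallUsage:
    run_id: str
    model: str
    input_tokens: int
    output_tokens: int
    at: datetime  # timezone-aware timestamp of the call


def call_cost(usage: CallUsage) -> Decimal:
    """Cost of one call from its token counts and the price table."""
    price = PRICES[usage.model]
    total = (
        price["input"] * usage.input_tokens
        + price["output"] * usage.output_tokens
    )
    return total / 1000


def roll_up(calls):
    """Return (cost per run id, cost per UTC day) for a list of calls."""
    per_run = defaultdict(Decimal)
    per_day = defaultdict(Decimal)
    for c in calls:
        cost = call_cost(c)
        per_run[c.run_id] += cost
        day = c.at.astimezone(timezone.utc).date().isoformat()
        per_day[day] += cost
    return dict(per_run), dict(per_day)
```

Decimal is used instead of float so that many small per-call amounts add up without rounding drift.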

Stack

  • Next.js + React for the dashboard UI

  • A structured trace log persisted to a relational store

  • OpenTelemetry-style span model for agent runs

  • A daily cost cap enforced at the call boundary

  • An approval queue with role-gated actions
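One way the approval queue with role-gated actions might be shaped, as a sketch only. The role names, the ApprovalRequest fields, and the in-memory storage are assumptions; a real deployment would take roles from its auth system and persist requests.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

# Hypothetical roles allowed to decide approvals.
APPROVER_ROLES = {"admin", "reviewer"}


@dataclass
class ApprovalRequest:
    request_id: str
    description: str
    action: Callable[[], str]          # the sensitive action, deferred
    status: str = "pending"            # "pending", "approved", "rejected"
    decided_by: Optional[str] = None
    decided_at: Optional[datetime] = None
    result: Optional[str] = None


class ApprovalQueue:
    def __init__(self) -> None:
        self._requests: dict[str, ApprovalRequest] = {}

    def submit(self, request: ApprovalRequest) -> None:
        self._requests[request.request_id] = request

    def pending(self) -> list[ApprovalRequest]:
        return [r for r in self._requests.values() if r.status == "pending"]

    def decide(self, request_id: str, user: str, role: str,
               approve: bool) -> ApprovalRequest:
        if role not in APPROVER_ROLES:
            raise PermissionError(f"role {role!r} cannot decide approvals")
        request = self._requests[request_id]
        if request.status != "pending":
            raise ValueError(f"request {request_id} is already {request.status}")
        request.decided_by = user
        request.decided_at = datetime.now(timezone.utc)
        if approve:
            request.status = "approved"
            # The action runs only here, after an explicit approval.
            request.result = request.action()
        else:
            request.status = "rejected"
        return request
```

Nothing in the queue executes an action on a timer or by default, so a request nobody looks at simply stays in pending().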

What gets logged

  • Run id, agent name, and the inputs that started the run

  • Every model and tool call with provider, model, latency, and outcome

  • Token counts and computed cost per call, rolled up per run and per day

  • Retries, fallbacks, and the error that triggered them

  • Approval decisions: who approved or rejected, and when

  • Eval scores attached to the run
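The fields above could be captured in a record like the following sketch. Every class and field name here is an assumption chosen to mirror the list; the actual schema in the relational store is not specified.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CallRecord:
    provider: str
    model: str
    latency_ms: int
    outcome: str                     # e.g. "ok", "error", "timeout"
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    retry_of: Optional[int] = None   # index of the call this one retried
    error: Optional[str] = None      # the error that triggered a retry


@dataclass
class ApprovalRecord:
    action: str
    decision: str                    # "approved" or "rejected"
    decided_by: str
    decided_at: str                  # ISO 8601 timestamp


@dataclass
class RunRecord:
    run_id: str
    agent: str
    inputs: dict
    calls: list = field(default_factory=list)
    approvals: list = field(default_factory=list)
    eval_scores: dict = field(default_factory=dict)


def to_log_line(run: RunRecord) -> str:
    """Serialize one run, including nested records, as a single JSON line."""
    return json.dumps(asdict(run), sort_keys=True)
```

Keeping the retry link (retry_of) and the triggering error on the call itself means a retried call can be traced back to its cause without joining against a separate error table.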

Where evals run

Evals run as a post-run step against recorded traces and as scheduled batch jobs over historical runs, so regressions surface without re-invoking the live system.
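A sketch of evals over recorded traces, assuming each stored run keeps its inputs and final output. The two evaluators are toy checks chosen for illustration, and RecordedRun, run_evals, and mean_score are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RecordedRun:
    run_id: str
    inputs: str
    output: str
    eval_scores: dict[str, float] = field(default_factory=dict)


Evaluator = Callable[[RecordedRun], float]


def non_empty(run: RecordedRun) -> float:
    """Toy evaluator: 1.0 if the agent produced any output."""
    return 1.0 if run.output.strip() else 0.0


def within_length(run: RecordedRun, limit: int = 200) -> float:
    """Toy evaluator: 1.0 if the output stays under a character limit."""
    return 1.0 if len(run.output) <= limit else 0.0


def run_evals(runs: list[RecordedRun], evaluators: dict[str, Evaluator]) -> None:
    """Score stored runs in place; the agent itself is never called."""
    for run in runs:
        for name, evaluate in evaluators.items():
            run.eval_scores[name] = evaluate(run)


def mean_score(runs: list[RecordedRun], name: str) -> float:
    scores = [r.eval_scores[name] for r in runs if name in r.eval_scores]
    return sum(scores) / len(scores) if scores else 0.0
```

The same run_evals call serves both cases: pass a single finished run for the post-run step, or a page of historical runs for a scheduled batch.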

Failure modes

  • A provider times out or returns an error mid-run — the run is marked failed with the failing span, not silently dropped.

  • Cost crosses the daily cap — further paid calls are refused at the boundary rather than running unbounded.

  • An approval is never actioned — the action stays pending and visible instead of executing by default.

  • Trace writes lag the live system: the dashboard reads only from the log, so its views may trail briefly, but the agent is never blocked on logging.
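The cost-cap failure mode can be sketched as a guard wrapped around every paid call. This assumes spend is tracked in memory per calendar day and that each call reports its actual cost afterwards; DailyCostCap, CostCapExceeded, and paid_call are hypothetical names.

```python
from datetime import date
from decimal import Decimal


class CostCapExceeded(RuntimeError):
    """Raised instead of making a paid call once the day's cap is reached."""


class DailyCostCap:
    def __init__(self, cap_usd: Decimal) -> None:
        self.cap_usd = cap_usd
        self._spent: dict[date, Decimal] = {}

    def spent(self, day: date) -> Decimal:
        return self._spent.get(day, Decimal("0"))

    def check(self, day: date) -> None:
        if self.spent(day) >= self.cap_usd:
            raise CostCapExceeded(
                f"daily cap of {self.cap_usd} USD reached on {day.isoformat()}"
            )

    def record(self, day: date, actual_usd: Decimal) -> None:
        self._spent[day] = self.spent(day) + actual_usd


def paid_call(cap: DailyCostCap, day: date, call):
    """Run `call` only if the cap allows it; `call` returns (result, cost)."""
    cap.check(day)                 # refuse at the boundary, before spending
    result, actual_usd = call()
    cap.record(day, actual_usd)
    return result
```

In this variant the call that crosses the cap still completes and only later calls are refused. Checking an estimated cost before each call would be a stricter alternative.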

What this demo proves

That an AI system can be observed, costed, and governed in production — runs, spend, evals, and approvals in one control layer rather than scattered logs.
