Reference system

AI Ops Dashboard

A control layer for agent runs, costs, evals, approvals, and failures.

The problem

Teams ship AI features into production but lose visibility once traffic starts — they can't see what an agent did, what it cost, or why it failed. Without a control layer, debugging is guesswork and spend is unbounded.

System architecture

A read model over an append-only run log: every agent invocation writes a structured trace, and the dashboard aggregates those traces into run, cost, and eval views. Human approvals gate sensitive actions before they execute.
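
As a rough illustration of the read-model idea, the sketch below folds an append-only list of events into one view per run. The event shapes, field names, and the choice to store cost as integer micro-USD are all assumptions made for this example, not the system's actual schema.

```typescript
// Hypothetical event shapes; all field names are assumptions for illustration.
// Costs are integer micro-USD so that sums stay exact.
type TraceEvent =
  | { kind: "run_started"; runId: string; agent: string; at: number }
  | { kind: "call"; runId: string; costMicros: number; at: number }
  | { kind: "eval"; runId: string; score: number; at: number }
  | { kind: "run_finished"; runId: string; ok: boolean; at: number };

interface RunView {
  runId: string;
  agent: string;
  status: "running" | "succeeded" | "failed";
  costMicros: number;
  evalScores: number[];
}

// Fold the log into one view per run. The log itself is never mutated,
// so the views can be rebuilt from scratch at any time.
function buildRunViews(log: readonly TraceEvent[]): Map<string, RunView> {
  const views = new Map<string, RunView>();
  for (const event of log) {
    if (event.kind === "run_started") {
      views.set(event.runId, {
        runId: event.runId,
        agent: event.agent,
        status: "running",
        costMicros: 0,
        evalScores: [],
      });
      continue;
    }
    const view = views.get(event.runId);
    if (view === undefined) continue; // event for an unknown run: skipped in this sketch
    switch (event.kind) {
      case "call":
        view.costMicros += event.costMicros;
        break;
      case "eval":
        view.evalScores.push(event.score);
        break;
      case "run_finished":
        view.status = event.ok ? "succeeded" : "failed";
        break;
    }
  }
  return views;
}
```

Because the views are derived purely from the log, a bug in aggregation can be fixed and the views recomputed without touching recorded history.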

Workflow

An agent run starts and emits a structured trace event with run id, inputs, and model.
Each tool call, retry, and model hop appends to the run's timeline.
Token usage and provider cost are recorded per call and rolled up per run and per day.
Actions flagged as sensitive pause for a human approval before they execute.
Eval scores attach to the run so quality is visible alongside cost and latency.
The dashboard surfaces failures, cost spikes, and pending approvals in one view.
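
The approval step in this workflow can be sketched as a small queue in which an action is executable only after an explicit approval. The class, method names, and the set of roles allowed to decide are hypothetical; the source only says that sensitive actions pause and that actions are role-gated.

```typescript
type Role = "viewer" | "operator" | "admin";
type ApprovalStatus = "pending" | "approved" | "rejected";

interface ApprovalRequest {
  id: string;
  runId: string;
  action: string;
  status: ApprovalStatus;
  decidedBy?: string;
  decidedAt?: number;
}

// Assumption: only these roles may approve or reject.
const DECIDER_ROLES: ReadonlySet<Role> = new Set<Role>(["operator", "admin"]);

class ApprovalQueue {
  private readonly requests = new Map<string, ApprovalRequest>();
  private nextId = 1;

  // A sensitive action enters the queue as "pending".
  request(runId: string, action: string): ApprovalRequest {
    const req: ApprovalRequest = {
      id: `apr-${this.nextId++}`,
      runId,
      action,
      status: "pending",
    };
    this.requests.set(req.id, req);
    return req;
  }

  // Record who decided and when; refuse users without a deciding role.
  decide(
    id: string,
    user: { name: string; role: Role },
    approve: boolean,
    now: number,
  ): ApprovalRequest {
    const req = this.requests.get(id);
    if (req === undefined) throw new Error(`unknown approval ${id}`);
    if (!DECIDER_ROLES.has(user.role)) {
      throw new Error(`role ${user.role} may not decide approvals`);
    }
    if (req.status !== "pending") {
      throw new Error(`approval ${id} is already ${req.status}`);
    }
    req.status = approve ? "approved" : "rejected";
    req.decidedBy = user.name;
    req.decidedAt = now;
    return req;
  }

  // Only an explicit approval unlocks execution; pending never executes.
  canExecute(id: string): boolean {
    return this.requests.get(id)?.status === "approved";
  }

  pending(): ApprovalRequest[] {
    return Array.from(this.requests.values()).filter((r) => r.status === "pending");
  }
}
```

The key design choice in this sketch is that `canExecute` checks for "approved" rather than for "not rejected", so an unactioned request can never run by default.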

Stack

Next.js + React for the dashboard UI
A structured trace log persisted to a relational store
OpenTelemetry-style span model for agent runs
A daily cost cap enforced at the call boundary
An approval queue with role-gated actions
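
One way the daily cost cap at the call boundary might look is sketched below. The class name, the UTC day key, and the integer micro-USD unit are assumptions for illustration. In this sketch the cost of a call is only known after it returns, so the last permitted call of a day can overshoot the cap; a stricter version would need a cost estimate up front.

```typescript
type GuardedResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "daily_cost_cap" };

class DailyCostCap {
  private readonly capMicros: number;
  private readonly spentByDay = new Map<string, number>();

  constructor(capMicros: number) {
    this.capMicros = capMicros;
  }

  // Assumption: days are bucketed in UTC, keyed as "YYYY-MM-DD".
  private dayKey(at: Date): string {
    return at.toISOString().slice(0, 10);
  }

  spentOn(at: Date): number {
    return this.spentByDay.get(this.dayKey(at)) ?? 0;
  }

  record(at: Date, costMicros: number): void {
    this.spentByDay.set(this.dayKey(at), this.spentOn(at) + costMicros);
  }

  // Refuse the paid call outright once the day's spend has reached the cap;
  // otherwise run it and record what it cost.
  guard<T>(
    at: Date,
    paidCall: () => { value: T; costMicros: number },
  ): GuardedResult<T> {
    if (this.spentOn(at) >= this.capMicros) {
      return { ok: false, reason: "daily_cost_cap" };
    }
    const { value, costMicros } = paidCall();
    this.record(at, costMicros);
    return { ok: true, value };
  }
}
```

Returning a tagged result instead of throwing lets the caller record the refusal as a normal outcome on the run's timeline.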

What gets logged

Run id, agent name, and the inputs that started the run
Every model and tool call with provider, model, latency, and outcome
Token counts and computed cost per call, rolled up per run and per day
Retries, fallbacks, and the error that triggered them
Approval decisions: who approved or rejected, and when
Eval scores attached to the run
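
To make the per-call cost and its roll-ups concrete, here is a minimal sketch of a call record and two aggregations over it. The record fields mirror the list above, but the exact names, the `PRICES` table, and its numbers are placeholders invented for this example and are not real provider pricing.

```typescript
interface CallRecord {
  runId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  outcome: "ok" | "error";
  at: string; // ISO 8601 timestamp in UTC
}

// Placeholder prices in micro-USD per 1,000 tokens (hypothetical values).
interface Price {
  inputPer1k: number;
  outputPer1k: number;
}
const PRICES: Record<string, Price> = {
  "example-model": { inputPer1k: 500, outputPer1k: 1500 },
};

// Cost of one call in integer micro-USD.
function callCostMicros(call: CallRecord): number {
  const price = PRICES[call.model];
  if (!price) throw new Error(`no price configured for model ${call.model}`);
  const raw =
    call.inputTokens * price.inputPer1k + call.outputTokens * price.outputPer1k;
  return Math.round(raw / 1000);
}

// Sum call costs under whatever key the caller chooses.
function rollUp(
  calls: readonly CallRecord[],
  keyOf: (call: CallRecord) => string,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const call of calls) {
    const key = keyOf(call);
    totals.set(key, (totals.get(key) ?? 0) + callCostMicros(call));
  }
  return totals;
}

const byRun = (call: CallRecord): string => call.runId;
const byDay = (call: CallRecord): string => call.at.slice(0, 10);
```

Calling `rollUp(calls, byRun)` and `rollUp(calls, byDay)` over the same records gives the per-run and per-day totals without storing either one separately.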

Where evals run

Evals run as a post-run step against recorded traces and as scheduled batch jobs over historical runs, so regressions surface without re-invoking the live system.
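
A minimal sketch of scoring recorded traces, rather than live calls, might look like the following. The `RecordedRun` shape, the example evaluator, and the regression rule (mean score dropping by more than a tolerance) are assumptions chosen for illustration; the source does not specify how scores are computed or compared.

```typescript
interface RecordedRun {
  runId: string;
  input: string;
  output: string;
}

interface EvalScore {
  runId: string;
  evalName: string;
  score: number;
}

interface Evaluator {
  name: string;
  score: (run: RecordedRun) => number;
}

// Hypothetical evaluator: 1 when the recorded output is non-empty, else 0.
const nonEmptyOutput: Evaluator = {
  name: "non_empty_output",
  score: (run) => (run.output.trim().length > 0 ? 1 : 0),
};

// Score stored runs only; nothing here calls a model or a tool.
function evaluateRecorded(
  runs: readonly RecordedRun[],
  evaluators: readonly Evaluator[],
): EvalScore[] {
  const scores: EvalScore[] = [];
  for (const run of runs) {
    for (const evaluator of evaluators) {
      scores.push({
        runId: run.runId,
        evalName: evaluator.name,
        score: evaluator.score(run),
      });
    }
  }
  return scores;
}

function meanScore(scores: readonly EvalScore[], evalName: string): number | undefined {
  const matching = scores.filter((s) => s.evalName === evalName);
  if (matching.length === 0) return undefined;
  return matching.reduce((sum, s) => sum + s.score, 0) / matching.length;
}

// Assumed regression rule: the mean dropped by more than `tolerance`.
function hasRegressed(
  baseline: readonly EvalScore[],
  current: readonly EvalScore[],
  evalName: string,
  tolerance: number,
): boolean {
  const before = meanScore(baseline, evalName);
  const after = meanScore(current, evalName);
  if (before === undefined || after === undefined) return false;
  return before - after > tolerance;
}
```

The same `evaluateRecorded` function serves both uses described above: called once on a run that just finished, or called by a scheduled job over a batch of historical runs.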

Failure modes

A provider times out or returns an error mid-run: the run is marked failed with the failing span, not silently dropped.
Cost crosses the daily cap: further paid calls are refused at the boundary rather than running unbounded.
An approval is never actioned: the action stays pending and visible instead of executing by default.
Trace writes lag the live system: the dashboard reads only from the log, so its views may trail briefly, but the agent is never blocked on logging.
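
The first failure mode, a run marked failed with its failing span, can be sketched as below. Steps are modeled as synchronous functions to keep the example short, which is a simplification since real provider calls would be asynchronous; the `Span`, `Step`, and `RunResult` shapes are hypothetical.

```typescript
interface Span {
  name: string;
  status: "ok" | "error";
  error?: string;
}

interface RunResult {
  runId: string;
  status: "succeeded" | "failed";
  spans: Span[];
  failingSpan?: string;
}

interface Step {
  name: string;
  run: () => void;
}

// Execute steps in order and record one span per step. On the first error,
// stop and return a failed run that names the span where it happened.
function executeRun(runId: string, steps: readonly Step[]): RunResult {
  const spans: Span[] = [];
  for (const step of steps) {
    try {
      step.run();
      spans.push({ name: step.name, status: "ok" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      spans.push({ name: step.name, status: "error", error: message });
      return { runId, status: "failed", spans, failingSpan: step.name };
    }
  }
  return { runId, status: "succeeded", spans };
}
```

The error is caught and turned into data, so a failed run still produces a complete record up to the point of failure instead of disappearing from the log.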

What this demo proves

That an AI system can be observed, costed, and governed in production — runs, spend, evals, and approvals in one control layer rather than scattered logs.
