Writing

Agentic engineering, written down

Build logs, agentic engineering decisions, agent failures, evals, and what survives real users — on Claude Code, agents, RAG, MCP, and AI ops.

design-md-drift-check Build Log

A build log for design-md-drift-check, a skill that checks DESIGN.md drift against real frontend tokens, components, and UI patterns.

Latest articles

Context Engineering vs Prompt Engineering for Production Agents

Context engineering is the production control plane for agents. Learn when prompts matter, what context layers to ship, and what to log before traffic.

Agent Memory for Production AI Systems

Design agent memory as governed state: what to store, what to forget, how to retrieve it, and which evals catch stale or unsafe recall.

Video Probe MCP Build Log

A build log for video-probe-mcp, a narrow MCP server that lets agents inspect local video and audio files with ffprobe.

Trace-to-Eval Builder Build Log

A build log for Trace-to-Eval Builder, a BYOK app that turns agent traces into replayable eval packs.

All Writing

Agent Runbook Auditor: A BYOK Launch Review Tool for Agent Workflows

A build log for Agent Runbook Auditor, a BYOK OpenAI demo that reviews agent runbooks for launch risk, trace design, eval cases, guardrails, and rollout readiness.

AWS MCP Server for Production Agents: The Build-or-Boundary Rule

Use AWS MCP Server for AWS-native agent access, then add custom approval, tenant policy, evals, and run logs where production risk starts.

CLAUDE.md File Best Practices for Production Teams

Write a CLAUDE.md that keeps Claude Code useful in production: scope memory, keep rules concise, move enforcement to hooks, and review drift.

OpenAI Agents SDK vs Pydantic AI for Production Agents

Choose OpenAI Agents SDK for OpenAI-native runs. Choose Pydantic AI when typed Python, provider flexibility, and durable approvals matter.

Codex vs Claude Code vs Gemini CLI for Production Teams

A production rollout comparison of Codex CLI, Claude Code, and Gemini CLI after Google's Antigravity transition, with security and telemetry rules.

Vercel AI Gateway vs OpenRouter for Production Model Routing

Choose Vercel AI Gateway for Vercel-native routing and OpenRouter for broad provider reach. Compare pricing, failover, BYOK, and observability.

MCP Sampling vs Elicitation for Production Servers

Use MCP sampling for client-owned model calls and elicitation for user input. Set the production boundary, approval flow, and logging rules.

Hybrid Search for Production RAG: The BM25, Vector, and Rerank Rule

Use hybrid search when vector-only misses exact terms. Compare BM25, vectors, fusion, reranking, evals, and the production logging gate.

Claude Code Planning Mode for Production Teams

Use Claude Code planning mode as a production review gate: when to plan first, how to approve safely, and what teams should log before rollout.

Google ADK vs LangGraph for Production Agents

Compare Google ADK and LangGraph for production agents: state, human approval, deployment, observability, pricing, and the decision rule.

MCP Resources vs Tools: The Production Server Rule

Use resources for client-controlled context, tools for model-invoked actions, and prompts for reusable user-selected workflows.

MCP Authorization for Production Servers

Build MCP authorization with OAuth, Protected Resource Metadata, token audience checks, consent, approvals, logs, and production release gates.

OpenAI Agents SDK TypeScript vs Python for Production Agents

Choose TypeScript or Python for OpenAI Agents SDK by production ownership: product runtime, worker path, tracing, guardrails, handoffs, MCP, and evals.

Claude Code Security Review for Production Teams

Use Claude Code security review as early signal, not an approval gate. Here is the rollout pattern, CI split, permissions, hooks, and logs.

OpenAI Agents SDK Tracing: What It Shows in Production

Use OpenAI Agents SDK tracing as run inspection, not full observability. Configure sensitive data, flushing, trace exports, evals, and approvals.

Gemini CLI vs Antigravity CLI: The Production Migration Rule

Google is moving Gemini CLI users to Antigravity CLI. Compare migration work, access risk, controls, and when to choose Claude Code or Codex.

Model Routing for Production AI Apps

A production workflow for model routing across managed routers, provider marketplaces, and app-owned policy layers.

AI Agent Observability Tools for Production Teams

Compare LangSmith, Langfuse, Phoenix, and Helicone for production AI agent traces, evals, cost telemetry, and approval control.

Prompt Versioning for Production AI Systems

Version prompts with code, evals, promotion labels, rollback, and run telemetry after OpenAI deprecated reusable prompt objects.

RAG Evaluation Metrics Before Launch

Use a production RAG eval gate for retrieval quality, groundedness, answer correctness, answer relevance, and regression risk.

Human-in-the-Loop AI Agents: Approval Gates for Production

Where to put human approval gates, how to preserve agent state, what reviewers need, and when to move from human-in-the-loop to monitored automation.

LangChain vs LangGraph for Production Agents

Use LangChain for simple agent harnesses. Use LangGraph when production agents need durable state, retries, interrupts, approvals, and deployment.

AI Agent Monitoring: What to Track Before Production Traffic

Trace runs, score outcomes, attribute cost, and route risky actions before production AI agents receive more traffic.

MCP Security Best Practices for Production Servers

Ship MCP servers with per-client consent, audience-bound tokens, strict schemas, approval gates, isolation, and logs that catch tool abuse.

Codex vs Claude Code for Production Teams

Claude Code is the better first rollout for terminal-first teams. Codex wins for delegated cloud tasks, PR review, and compliance-visible agent work.

Claude Code Pricing for Teams: The Production Rollout Cost

Claude Code pricing starts at $20 per month, but teams need a seat mix, usage credits, API fallback controls, analytics, and rollout policy.

Pgvector vs Pinecone for Production RAG

Choose pgvector when retrieval belongs in Postgres. Choose Pinecone when scale, namespaces, or managed ops justify a separate vector system.

Claude Code Hooks for Production Teams

Use Claude Code hooks as deterministic team guardrails for tests, protected files, command logging, permissions, and safe rollout.

OpenAI Agents SDK vs LangGraph for Production Agents

Choose OpenAI Agents SDK for OpenAI-native loops. Choose LangGraph when durable graph state, provider freedom, and custom control matter.

MCP vs Function Calling: The Production Decision Rule

Use function calling for app-local tools. Build MCP when a capability must be shared, discovered, approved, logged, and reused across agents.

Claude Code vs Cursor for Production Teams

Use Cursor for daily IDE work and Claude Code for governed terminal delegation. Compare costs, controls, security, and rollout rules.

Langfuse vs LangSmith for Production Observability

Choose Langfuse for self-hosted, framework-neutral traces. Choose LangSmith for managed LangChain evals, review, alerting, and deployment.
