Reference system

Model Routing System

Routes tasks across OpenAI, Claude, Gemini, and Grok with fallback and cost control.

The problem

Pinning every task to one model overpays on simple work and has no recourse when a provider degrades. A routing layer needs to match each task to the right model, fall back on failure, and stay inside a cost budget.

System architecture

A routing layer in front of the providers: each task type maps to a primary model and an ordered fallback chain, calls pass through a single boundary that enforces a cost budget, and failures cascade to the next model rather than erroring out.
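
The task-to-model mapping described above can be sketched as a small lookup table. This is a minimal illustration, not the system's actual configuration: the task types, the model labels, and the names `Route`, `ROUTING_TABLE`, `DEFAULT_ROUTE`, and `route_for` are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A primary model plus an ordered fallback chain (hypothetical shape)."""
    primary: str
    fallbacks: tuple[str, ...] = ()

    def chain(self) -> tuple[str, ...]:
        # Order matters: the primary is tried first, then each fallback in turn.
        return (self.primary, *self.fallbacks)


# Placeholder task types and model labels; these are not real model identifiers.
ROUTING_TABLE = {
    "summarize": Route("gemini-small", ("openai-small", "claude-small")),
    "code_review": Route("claude-large", ("openai-large", "grok-large")),
}

# Used when a task type is missing from the table.
DEFAULT_ROUTE = Route("openai-small", ("gemini-small",))


def route_for(task_type: str) -> Route:
    """Return the route for a task type, or the default route if it is unknown."""
    return ROUTING_TABLE.get(task_type, DEFAULT_ROUTE)
```

Keeping the table as plain data means a route can be changed or reviewed without touching the dispatch code.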

Workflow

  • A task arrives tagged with its type and constraints.

  • The router selects the primary model for that task type.

  • The call passes through the cost boundary and is dispatched.

  • On provider error or timeout, the router falls back to the next model in the chain.

  • Token usage and cost are recorded per call against the budget.

  • The result returns with the model and provider that served it.
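
The dispatch-and-fallback steps above might look roughly like the loop below. It is a sketch under assumptions: `ProviderError`, `run_with_fallback`, and the adapter signature (a callable that takes a prompt and returns text) are hypothetical, and the cost boundary and usage recording are left out to keep the control flow visible.

```python
from typing import Callable


class ProviderError(Exception):
    """Hypothetical error an adapter raises on a provider failure."""


def run_with_fallback(
    prompt: str,
    chain: list[str],
    adapters: dict[str, Callable[[str], str]],
) -> dict:
    """Try each model in order and return the first successful result."""
    errors = []  # (model, error) pairs for every fallback taken
    for model in chain:
        try:
            text = adapters[model](prompt)
        except (ProviderError, TimeoutError) as exc:
            # Record why this model was skipped, then move down the chain.
            errors.append((model, repr(exc)))
            continue
        # The result carries the model that actually served it.
        return {"text": text, "served_by": model, "fallbacks": errors}
    raise RuntimeError(f"all models in the chain failed: {errors}")
```

Returning the served model alongside the text is what lets a caller see when a fallback, rather than the primary, produced the answer.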

Stack

  • A provider-agnostic routing layer

  • Adapters for OpenAI, Claude, Gemini, and Grok

  • A task-to-model routing table with fallback chains

  • A cost boundary enforcing a spend budget

  • A call log for usage and cost

What gets logged

  • The task type and the model it routed to

  • Each provider call with latency and outcome

  • Fallbacks taken and the error that triggered them

  • Tokens and cost per call, rolled up against the budget

  • The model and provider that ultimately served the result
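
One way to hold the logged fields listed above is a record per routed task with one entry per provider call. The field names and the classes `CallAttempt` and `CallRecord` are assumptions made for illustration; the source does not specify a log schema, and the provider and model labels are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CallAttempt:
    """One provider call: latency, outcome, and usage (hypothetical fields)."""
    provider: str
    model: str
    latency_ms: float
    outcome: str                 # "ok", "error", or "timeout"
    error: Optional[str] = None  # set when the attempt triggered a fallback
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0


@dataclass
class CallRecord:
    """Everything logged for one routed task."""
    task_type: str
    routed_model: str
    attempts: list[CallAttempt] = field(default_factory=list)

    def fallbacks(self) -> list[CallAttempt]:
        # Every attempt that did not succeed counts as a fallback taken.
        return [a for a in self.attempts if a.outcome != "ok"]

    def served_by(self) -> Optional[tuple[str, str]]:
        # The (provider, model) pair that ultimately served the result, if any.
        for a in self.attempts:
            if a.outcome == "ok":
                return (a.provider, a.model)
        return None

    def total_cost_usd(self) -> float:
        return sum(a.cost_usd for a in self.attempts)


record = CallRecord(
    task_type="summarize",
    routed_model="gemini-small",
    attempts=[
        CallAttempt("gemini", "gemini-small", 10000.0, "timeout",
                    error="timeout after 10s"),
        CallAttempt("openai", "openai-small", 820.0, "ok",
                    tokens_in=512, tokens_out=128, cost_usd=0.002),
    ],
)
print(json.dumps(asdict(record), indent=2))
```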

Where evals run

Routing decisions are evaluated offline against a labelled task set, checking that each task type lands on an appropriate model. Fallback behaviour is exercised separately with fault injection.
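
Both kinds of check can be small. The sketch below is illustrative only: `routing_accuracy`, `inject_faults`, and `first_success` are hypothetical helpers, the labelled set is assumed to be a list of (task type, acceptable models) pairs, and an injected fault is modelled as a `ConnectionError`.

```python
from typing import Callable

Adapter = Callable[[str], str]


def routing_accuracy(route: Callable[[str], str], labelled: list) -> tuple:
    """Score a routing function against (task_type, acceptable_models) pairs."""
    misses = []
    for task_type, acceptable in labelled:
        chosen = route(task_type)
        if chosen not in acceptable:
            misses.append((task_type, chosen))
    return 1 - len(misses) / len(labelled), misses


def inject_faults(adapters: dict, failing: set) -> dict:
    """Return a copy of the adapters in which the named ones always fail."""
    def broken(prompt: str) -> str:
        raise ConnectionError("injected fault")
    return {name: (broken if name in failing else fn)
            for name, fn in adapters.items()}


def first_success(chain: list, adapters: dict, prompt: str):
    """Minimal fallback loop used only to exercise the injected faults."""
    for model in chain:
        try:
            return model, adapters[model](prompt)
        except ConnectionError:
            continue
    return None
```

A fault-injection test then asserts that, with the primary forced to fail, the second model in the chain serves the result, and that `None` comes back when every model is broken.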

Failure modes

  • A provider degrades or rate-limits: the router falls back to the next model in the chain.

  • Spend is about to exceed the budget: paid calls are refused at the boundary rather than overrunning.

  • A task is mistyped: a default route handles it and the misclassification is logged for tuning.

  • All providers in a chain fail: the call returns a typed error instead of hanging.
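
The budget refusal and the typed error could be modelled as below. This is a sketch under assumptions: the class names `RoutingError`, `BudgetExceeded`, `AllProvidersFailed`, and `CostBoundary` are hypothetical, and the boundary is assumed to check a pre-call cost estimate, so it is only as strict as that estimate is accurate.

```python
class RoutingError(Exception):
    """Base class so callers can catch every router failure in one place."""


class BudgetExceeded(RoutingError):
    """Raised when a call is refused at the cost boundary."""


class AllProvidersFailed(RoutingError):
    """Raised when every model in a chain has failed."""

    def __init__(self, errors: list):
        super().__init__(f"{len(errors)} provider(s) failed")
        self.errors = errors  # keep per-model errors for the call log


class CostBoundary:
    """Tracks spend and refuses calls that would exceed the budget."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def authorize(self, estimated_cost_usd: float) -> None:
        # Refuse before dispatch rather than discovering the overrun afterwards.
        if self.spent_usd + estimated_cost_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} of {self.budget_usd:.4f} USD"
            )

    def record(self, actual_cost_usd: float) -> None:
        # Called after each provider call with the cost actually incurred.
        self.spent_usd += actual_cost_usd
```

A caller would run `authorize` before each dispatch and `record` after it, and raise `AllProvidersFailed` once the chain is exhausted, so both failure modes surface as exceptions a caller can handle by type.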

What this demo proves

That a multi-provider stack can be operated as one system — task-aware routing, fallback, and a hard cost boundary instead of a single hardcoded model.
