Reference system

RAG Pipeline

Ingestion to hybrid retrieval to reranking to cited answers.

The problem

Naive retrieval returns context that looks plausible but is wrong, and answers cite nothing, so users can't tell a grounded answer from a hallucination. To be trustworthy, a pipeline needs hybrid retrieval, reranking, and enforced citations.

System architecture

Documents are chunked, embedded into a vector index, and also written to a keyword index. Each query hits both indexes; the results are fused and reranked, and the top passages are passed to the model with a citation contract that ties every claim back to a source.
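
The fusion step is not specified here. One common choice is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions, so dense and lexical scores never need to share a scale. The sketch below is an illustration under that assumption, not this system's actual method; the function name, the passage ids, and the constant k=60 are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one ranking.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a widely used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


dense_hits = ["p3", "p1", "p7"]    # hypothetical ids from the vector index
keyword_hits = ["p1", "p9", "p3"]  # hypothetical ids from the keyword index
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Passages found by both retrievers rise to the top, and the fused list is what a reranker would then rescore.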

Workflow

Source documents are ingested, normalized, and split into overlapping chunks.
Chunks are embedded and written to a vector index with metadata.
Each query runs hybrid retrieval: dense vector search plus keyword search.
Candidate passages are fused and passed through a reranker.
The top passages are assembled into a grounded prompt with source ids.
The model answers with inline citations resolved back to the source passages.
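
The first and fifth steps above can be sketched in a few lines. This is a minimal illustration, assuming word-based chunk windows and a bracketed source-id format such as [doc1#0]; the chunk sizes, the id format, and the prompt wording are assumptions, not details taken from this system.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` words that share `overlap` words
    with the previous window, so a sentence cut at one boundary still
    appears whole in a neighbouring chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def build_grounded_prompt(question, passages):
    """Assemble a prompt from (source_id, text) pairs in reranked order.

    Every passage is labelled with its source id so the model can cite it.
    """
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return (
        "Answer using only the passages below. "
        "Cite the source id in square brackets after each claim. "
        "If the passages do not contain the answer, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Counting in words keeps the sketch dependency-free; a real pipeline would more likely count tokens with the embedding model's tokenizer.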

Stack

An embedding model for dense vectors
A vector index for similarity search
A keyword index for lexical recall
A reranking model over fused candidates
A generation model with a citation contract

What gets logged

The query and the retrieval mode used
Candidate passages with their dense and lexical scores
Reranker scores and the final passage order
The passages actually passed to the model
The answer and the source ids it cited
Retrieval latency and token cost per query
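
One way to keep these fields together is a single structured record per query. The sketch below mirrors the list above; every class and field name is hypothetical, and splitting token cost into prompt and completion tokens is an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Candidate:
    passage_id: str
    dense_score: Optional[float]    # None if only keyword search found it
    lexical_score: Optional[float]  # None if only dense search found it
    rerank_score: Optional[float] = None


@dataclass
class QueryLog:
    query: str
    retrieval_mode: str            # e.g. "hybrid", "dense", "keyword"
    candidates: List[Candidate]    # everything retrieved, with scores
    final_order: List[str]         # passage ids after reranking
    passages_sent: List[str]       # ids actually placed in the prompt
    answer: str
    cited_ids: List[str]
    retrieval_latency_ms: float
    prompt_tokens: int
    completion_tokens: int

    def to_json(self) -> str:
        # asdict recurses into the nested Candidate records.
        return json.dumps(asdict(self), sort_keys=True)
```

Writing one JSON line per query makes it possible to replay a bad answer later: the record shows what was retrieved, what was sent, and what was cited.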

Where evals run

Evals run offline against a fixed question set and score retrieval recall, citation faithfulness, and answer relevance. They are supplemented by spot checks on live queries flagged as low-confidence.
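
Of the three scores, retrieval recall is the simplest to compute without a judge model. The sketch below assumes each question in the fixed set carries hand-labelled relevant passage ids; the dictionary keys and function names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labelled relevant passages found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each eval question needs at least one relevant id")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(eval_set, k):
    """Average recall@k over a question set.

    eval_set: list of dicts with 'retrieved' (ranked ids returned by the
    pipeline) and 'relevant' (labelled ids), both assumed formats.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    total = sum(recall_at_k(q["retrieved"], q["relevant"], k) for q in eval_set)
    return total / len(eval_set)


eval_set = [
    {"retrieved": ["a", "b", "c"], "relevant": ["a", "c"]},
    {"retrieved": ["x", "y", "z"], "relevant": ["q"]},
]
print(mean_recall_at_k(eval_set, k=2))
```

Citation faithfulness and answer relevance usually need a human or model grader, so they are left out of this sketch.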

Failure modes

Retrieval returns no relevant passage: the system answers that it doesn't know rather than fabricating.
A chunk is too large or too small: chunk size is tuned and overlap is preserved so context isn't cut mid-thought.
The model cites a passage it didn't use: the citation contract validates that cited ids appear in the retrieved set.
The index drifts from the source after updates: ingestion is idempotent and re-runnable, so the index can be rebuilt.
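
The first and third mitigations can be enforced in code rather than left to the prompt. The sketch below assumes citations appear in the answer as bracketed ids such as [doc1] and that the reranker returns a score per passage; the citation format, the threshold, and all names are hypothetical.

```python
import re

CITATION = re.compile(r"\[([\w-]+)\]")  # assumed citation format: [source-id]
ABSTAIN = "I don't know: no relevant passage was retrieved."


def validate_citations(answer, retrieved_ids):
    """Check that every cited id was among the retrieved passages.

    An answer with no citations at all also fails the check.
    """
    cited = CITATION.findall(answer)
    unknown = sorted(set(cited) - set(retrieved_ids))
    return {"cited": cited, "unknown": unknown, "ok": bool(cited) and not unknown}


def answer_or_abstain(reranked, min_score, generate):
    """Call `generate` only if some passage clears the relevance threshold.

    reranked: list of (passage_id, score) pairs, best first.
    generate: callable that takes the kept pairs and returns an answer.
    """
    kept = [(pid, score) for pid, score in reranked if score >= min_score]
    if not kept:
        return ABSTAIN
    return generate(kept)


result = validate_citations(
    "Paris is the capital [doc1]. It has a river [doc9].",
    retrieved_ids=["doc1", "doc2"],
)
print(result["unknown"])
```

This check only confirms that cited ids exist in the retrieved set. Whether a cited passage actually supports the claim is a separate question that the faithfulness eval has to answer.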

What this demo proves

That retrieval can be made grounded and auditable: hybrid recall, reranking, and enforced citations replace a single embedding lookup feeding an unverified answer.
