RAG Evaluation Metrics Before Launch

Use a production RAG eval gate for retrieval quality, groundedness, answer correctness, answer relevance, and regression risk.

Friday, June 5, 2026

Omid Saffari

The RAG metrics that matter before launch are not a leaderboard. They are a release gate: did retrieval find the right evidence, did the answer stay grounded in that evidence, did it answer the user, and will the same test fail the build when the pipeline regresses?

The Launch Gate Is Smaller Than The Metric List

A production RAG eval should be small enough to run every time the retrieval pipeline changes and strict enough to block a bad release. The mistake is treating every available metric as equally important. Ragas lists RAG metrics including Context Precision, Context Recall, Noise Sensitivity, Response Relevancy, and Faithfulness, plus comparison metrics such as Factual Correctness, Semantic Similarity, ROUGE Score, and Exact Match. That list is useful, but it is not a launch plan.

The launch plan starts from the system shape. A RAG system has a retrieval side and a generation side, and the RAG evaluation survey calls out that this hybrid structure makes evaluation harder than checking a standalone model. The retriever can fail while the generator sounds fluent. The generator can fail even when the right context was retrieved. The corpus can change underneath both.

The minimum useful gate is:

Retrieval relevance: did the retrieved chunks belong in the answer path?
Retrieval coverage: did the retriever find the evidence the answer needed?
Groundedness or faithfulness: did the generated answer stay inside the retrieved context?
Answer relevance: did the response address the user's question?
Answer correctness: when reference answers exist, did the response match the expected facts?
Regression behavior: did a prompt, model, chunking, embedding, or reranker change break known queries?

That set is intentionally boring. It maps each score to an engineering action. If retrieval relevance drops, inspect chunking, metadata filters, query rewriting, hybrid search, or reranking. If groundedness drops, inspect prompts, citation requirements, context formatting, and refusal behavior. If correctness drops while groundedness holds, the model is faithfully repeating evidence that is incomplete, outdated, or misleading, so the fix belongs in the corpus or the retriever rather than the prompt. If a regression check fails, the release does not ship.
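The score-to-action mapping above can be written down as data so a failing gate points at a subsystem instead of just a number. This is a minimal sketch: the `GateCheck` type, the `triage` helper, and every threshold value are hypothetical placeholders, not recommended targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCheck:
    name: str
    threshold: float                # hypothetical minimum passing score in [0, 1]
    inspect_first: tuple[str, ...]  # subsystems to look at when the check fails


# Thresholds are illustrative placeholders; set them from your own risk profile.
GATE = (
    GateCheck("retrieval_relevance", 0.80,
              ("chunking", "metadata filters", "query rewriting",
               "hybrid search", "reranking")),
    GateCheck("groundedness", 0.90,
              ("prompts", "citation requirements", "context formatting",
               "refusal behavior")),
    GateCheck("correctness", 0.85,
              ("incomplete evidence", "outdated evidence", "misleading evidence")),
)


def triage(scores: dict[str, float]) -> dict[str, tuple[str, ...]]:
    """Map each failing check to the subsystems worth inspecting first."""
    failures = {}
    for check in GATE:
        score = scores.get(check.name)
        # A missing score counts as a failure: the gate cannot pass on absent data.
        if score is None or score < check.threshold:
            failures[check.name] = check.inspect_first
    return failures
```

An empty result means the gate passed; any non-empty result names both the failing check and where to start debugging.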

For a founder or platform lead, this is the line between demo confidence and production confidence. A demo asks, "Did the answer look right?" A production gate asks, "Can we reproduce the failure, identify which subsystem caused it, and stop the next bad deploy?"

Measure Retrieval Before You Measure The Answer

Retrieval quality is the first gate because the generator cannot ground an answer in evidence it never received. LlamaIndex separates Response Evaluation from Retrieval Evaluation, and that split is the right mental model. Before grading the final answer, grade whether the system fetched the right source material.

The retrieval eval dataset should store, at minimum:

Field	Purpose	Failure it exposes
`query`	The user question or task	Query rewrite and intent handling failures
`expected_context_ids`	The chunks or documents that should appear	Retriever misses, bad filters, stale indexes
`retrieved_context_ids`	The chunks actually returned	Ranking, permission, and chunking regressions
`retrieved_context_text`	The text sent to the model	Context bloat, irrelevant snippets, citation mismatch
`metadata_filters`	Tenant, role, product, date, or source filters	Permission leakage and filter drift
`retriever_version`	Embedding, index, reranker, and query version	Release-to-release regression analysis

The exact metric depends on the retrieval problem. For "find the one policy paragraph that answers this support question," Mean Reciprocal Rank matters because the first relevant result should be near the top. IBM describes MRR as measuring the position of the first relevant document, with values close to 1 indicating that relevant results appear near the top. For "retrieve all source documents needed to compare contract terms," coverage and recall matter more than the first hit.
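For reference, MRR is the average over queries of 1 divided by the rank of the first relevant result, with 0 when nothing relevant is returned. A minimal sketch, assuming each eval row provides a ranked list of retrieved IDs and a collection of relevant IDs (the function names are illustrative):

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

A query whose first relevant chunk sits at rank 1 contributes 1.0, rank 2 contributes 0.5, and a complete miss contributes 0.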

When you have graded relevance labels, NDCG and MAP become useful. IBM describes NDCG@k as DCG@k divided by ideal DCG@k, with the metric ranging from 0 to 1. IBM describes MAP as evaluating how correctly retrieved documents are ranked across a result list. Those are stronger than a single hit-rate check when the answer depends on multiple chunks and their order.
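As a rough illustration of both metrics, the sketch below computes NDCG@k and average precision for one query (MAP is the mean of average precision across queries). It assumes linear gain with a log2 rank discount, which is one common convention; some implementations use an exponential gain instead. It also assumes retrieved IDs are unique. The function names and the `graded_relevance` mapping are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 is undiscounted)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(retrieved_ids, graded_relevance, k):
    """DCG@k divided by ideal DCG@k. graded_relevance maps ID -> relevance grade."""
    gains = [graded_relevance.get(doc_id, 0.0) for doc_id in retrieved_ids]
    ideal_gains = sorted(graded_relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal


def average_precision(retrieved_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant ID appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)
```

Both functions reward putting the most useful chunks first, which a plain hit-rate check cannot see.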

When you do not have graded labels yet, start with retrieval relevance and context recall. Ragas includes Context Precision and Context Recall in its RAG metrics. LlamaIndex says retrieval evaluation can use ranking metrics such as MRR, hit rate, and precision when you have questions and ground-truth rankings. The operational point is not which library name wins. The point is to make the retrieved context auditable before it reaches the model.
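Before adopting a library, an ID-overlap check is often enough to make retrieved context auditable. The sketch below is a set-based proxy that compares `expected_context_ids` with `retrieved_context_ids`; it is not the Ragas implementation of Context Precision or Context Recall, and the function names are illustrative.

```python
def context_recall(expected_ids, retrieved_ids):
    """Fraction of expected context IDs that were actually retrieved."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved_ids)) / len(expected)


def context_precision(expected_ids, retrieved_ids):
    """Fraction of retrieved context IDs that were expected."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    expected = set(expected_ids)
    return sum(1 for cid in retrieved if cid in expected) / len(retrieved)
```

Low recall points at retriever misses or stale indexes; low precision points at context bloat reaching the prompt.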

Create the retrieval fixture
Pick the queries that must not fail: onboarding questions, billing questions, legal policy questions, product edge cases, and known historical failures. For each query, store the expected context IDs, not just a reference answer.
Run the retriever without generation
Execute the retriever and reranker as a standalone target. Capture returned context IDs, text, source URL, document version, tenant or role filter, and retriever version.
Block on the retrieval failure
If the expected evidence is absent or buried behind irrelevant chunks, fix retrieval before tuning prompts. A better answer prompt cannot cite a document the model never sees.
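The three steps above can be sketched as a retrieval-only check that never calls the generator. Everything here is an assumption for illustration: `RetrieveFn` stands in for whatever callable wraps your retriever and reranker, fixture rows are plain dicts, and `max_rank` is a hypothetical cutoff for "buried".

```python
from typing import Callable, Sequence

# Hypothetical interface: query text in, ranked context IDs out.
RetrieveFn = Callable[[str], Sequence[str]]


def run_retrieval_fixture(retrieve: RetrieveFn, fixture, max_rank=5):
    """Return one failure record per query whose evidence is missing or buried."""
    failures = []
    for row in fixture:
        retrieved = list(retrieve(row["query"]))
        top = retrieved[:max_rank]
        # Missing: the expected chunk never came back at all.
        missing = [cid for cid in row["expected_context_ids"] if cid not in retrieved]
        # Buried: it came back, but below the rank the prompt will actually use.
        buried = [cid for cid in row["expected_context_ids"]
                  if cid in retrieved and cid not in top]
        if missing or buried:
            failures.append({"query": row["query"], "missing": missing, "buried": buried})
    return failures
```

A non-empty list is the signal to fix retrieval first; prompt tuning only starts once it comes back empty.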

For a SaaS knowledge base, this is where permission and freshness bugs surface. If the query belongs to one tenant and the retrieved context includes another tenant's material, the eval should fail even when the generated answer is polite and fluent. If a billing policy changed but the retrieved chunk still comes from the old version, the eval should fail before a user sees the answer.
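Both checks can run on retrieved chunks alone, without looking at the generated answer. A minimal sketch, assuming each chunk record carries `chunk_id`, `tenant_id`, `doc_id`, and `doc_version`, and that `current_versions` maps each document ID to its current version; all of these field names are hypothetical.

```python
def context_violations(chunks, tenant_id, current_versions):
    """Return (chunk_id, reason) pairs for tenant leaks and stale document versions."""
    violations = []
    for chunk in chunks:
        # Any chunk from another tenant fails the row, however good the answer reads.
        if chunk["tenant_id"] != tenant_id:
            violations.append((chunk["chunk_id"], "tenant_leak"))
        # Only flag staleness for documents whose current version is known.
        current = current_versions.get(chunk["doc_id"])
        if current is not None and chunk["doc_version"] != current:
            violations.append((chunk["chunk_id"], "stale_version"))
    return violations
```

Treating any violation as a hard failure keeps a fluent answer from masking a permission or freshness bug.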

Grade The Answer Against Context, Query, And Reference

Answer evaluation needs separate checks because "sounds right" hides different failure modes. LangSmith's RAG evaluation tutorial frames evaluators as correctness, relevance, groundedness, and retrieval relevance. That split is useful because each evaluator compares a different pair of objects.

Evaluator	Compares	Needs a reference answer?	Use it when
Correctness	Response vs reference answer	Yes	The answer must match known expected facts
Relevance	Response vs user input	No	The answer may be grounded but dodges the question
Groundedness	Response vs retrieved documents	No	The answer may invent facts outside the supplied context
Retrieval relevance	Retrieved documents vs user input	No	The documents may be unrelated before generation starts

Correctness is the cleanest gate when you can afford reference answers. LangSmith defines correctness as response versus reference answer and says it requires a ground-truth answer supplied through a dataset. LlamaIndex similarly says Correctness and Semantic Similarity require labels. This is the metric to use for canonical questions where the business already knows the right answer: entitlement limits, integration setup, pricing policy, refund policy, compliance wording, product compatibility, or a documented support decision.

Groundedness is the gate for hallucination control. IBM defines Faithfulness as measuring whether output is based on the given context or whether the model produced hallucinated responses. Ragas includes Faithfulness and Response Relevancy in its RAG metric list. LangSmith's groundedness evaluator checks the response against retrieved docs rather than a reference answer, which makes it useful before a full golden dataset exists.

Relevance is the gate for answer usefulness. A response can be grounded and still fail if it answers a neighboring question. LlamaIndex lists Answer Relevancy as whether the generated answer is relevant to the query. LangSmith's relevance evaluator checks response versus input and does not require a reference answer. Use it on broad support, internal knowledge, or documentation queries where many answers could be factually grounded but only some satisfy the actual request.

Retrieval relevance is the guardrail for context pollution. LangSmith checks retrieved documents versus input. IBM's RAG triad, as presented in its RAG cookbook, is Context Relevance, Groundedness, and Answer Relevance. That triad is a strong default for early RAG systems because it catches the common failure chain: irrelevant context enters the prompt, the model builds an answer from weak evidence, and the user receives a confident but wrong response.
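Because each evaluator compares a different pair of objects, the fields present on a row decide which evaluators can run. The sketch below encodes that rule; the field names follow the eval row fields used in this article, and `applicable_evaluators` is an illustrative helper, not part of LangSmith, Ragas, or LlamaIndex.

```python
def applicable_evaluators(row):
    """List the evaluators whose required inputs are present on this eval row."""
    has_query = bool(row.get("query"))
    has_docs = bool(row.get("retrieved_context_text"))
    has_answer = bool(row.get("answer"))
    has_reference = bool(row.get("reference_answer"))

    evaluators = []
    if has_query and has_docs:
        evaluators.append("retrieval_relevance")  # retrieved documents vs user input
    if has_answer and has_docs:
        evaluators.append("groundedness")         # response vs retrieved documents
    if has_answer and has_query:
        evaluators.append("relevance")            # response vs user input
    if has_answer and has_reference:
        evaluators.append("correctness")          # response vs reference answer
    return evaluators
```

Rows without a reference answer still get three of the four checks, which is why a gate can start before the golden dataset is complete.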

A practical eval row should look like a trace, not a school test:

YAML

query: "<user_question>"
reference_answer: "<expected_answer_when_available>"
expected_context_ids:
  - "<doc_id_or_chunk_id>"
retrieved_context_ids:
  - "<doc_id_or_chunk_id>"
answer: "<model_response>"
scores:
  retrieval_relevance: "<pass_or_fail_or_score>"
  context_recall: "<score_when_reference_contexts_exist>"
  groundedness: "<pass_or_fail_or_score>"
  answer_relevance: "<pass_or_fail_or_score>"
  correctness: "<pass_or_fail_or_score_when_reference_answer_exists>"
metadata:
  corpus_version: "<corpus_version>"
  retriever_version: "<retriever_version>"
  prompt_version: "<prompt_version>"
  model: "<model_name>"
  evaluator_version: "<evaluator_version>"

The placeholder values are deliberate. The targets depend on your risk profile. A public documentation assistant can tolerate a different threshold than a medical, legal, financial, or internal-permissioned workflow. What cannot vary is traceability. Every score should point back to the question, evidence, answer, and version that produced it.
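Traceability can be enforced mechanically before any score is trusted. A minimal sketch, assuming the eval row has already been loaded into a dict with the same keys as the row above; `traceability_gaps` and the two tuples of required keys are illustrative choices.

```python
REQUIRED_FIELDS = ("query", "retrieved_context_ids", "answer", "scores", "metadata")
REQUIRED_METADATA = ("corpus_version", "retriever_version", "prompt_version",
                     "model", "evaluator_version")


def traceability_gaps(row):
    """Return the names of required fields that are missing or empty on a row."""
    gaps = [field for field in REQUIRED_FIELDS if not row.get(field)]
    metadata = row.get("metadata") or {}
    gaps += [f"metadata.{key}" for key in REQUIRED_METADATA if not metadata.get(key)]
    return gaps
```

A row with any gap can be rejected before scoring, so thresholds stay a policy choice while traceability stays fixed.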

Build The Golden Dataset Before Tuning The Stack

The golden dataset is the production asset. Embedding models, vector stores, rerankers, prompts, and model providers can change. The dataset is what tells you whether the new stack is better for your actual users.

IBM warns that building an evaluation engine should not be underestimated, especially when it includes a golden dataset with reference answers and reference contexts. That warning matches the failure pattern we see in production RAG: teams tune chunk size, swap vector stores, or add a reranker before they have a stable set of questions that represent the workload.

Build the dataset in layers:

Critical known answers: questions where an incorrect answer creates business, trust, security, or compliance risk.
High-volume questions: the recurring queries that drive support load or user friction.
Boundary questions: questions that should be refused, escalated, or answered only with scoped evidence.
Retrieval stress cases: questions that require synonyms, acronyms, old product names, sparse keywords, or multiple documents.
Freshness cases: questions where a new policy, release, or document version should beat older content.
Permission cases: questions where the answer must differ by tenant, role, account, region, or plan.

LlamaIndex describes dataset generation from unstructured corpora and retrieval evaluation over generated question and context pairs. Synthetic questions are useful for breadth, but they cannot be the whole gate. The launch set needs human-owned examples for the questions that carry risk. The synthetic set finds obvious recall holes. The human-owned set protects the product.

Start with source-linked examples
Each dataset row should include the expected answer and the source context IDs that justify it. A reference answer without reference context is not enough for RAG, because it cannot tell you whether retrieval or generation failed.
Tag each row by failure class
Use tags such as `permission`, `freshness`, `multi_doc`, `safety`, `billing`, `setup`, `legal`, or `integration`. When a release fails, tags show which product surface became weaker.
Version the corpus and the evaluators
A passing score is only meaningful with the corpus version, prompt version, retriever version, model name, and evaluator version attached. Without those fields, an eval run becomes an anecdote.
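Two of these rules are easy to check in code: every row needs source context IDs, and failures should roll up by tag. A minimal sketch, assuming dataset rows and result rows are dicts with hypothetical `id`, `expected_context_ids`, `tags`, and `passed` fields.

```python
from collections import Counter


def rows_missing_sources(rows):
    """IDs of dataset rows that have no source context IDs attached."""
    return [row["id"] for row in rows if not row.get("expected_context_ids")]


def failures_by_tag(results):
    """Count failing rows per tag so a bad release points at a product surface."""
    counts = Counter()
    for result in results:
        if not result["passed"]:
            counts.update(result["tags"])
    return dict(counts)
```

A release that fails three `permission` rows and no `billing` rows tells a different story than a flat "three failures".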

This is also where tool choice becomes clearer. Ragas is useful when you want a metric library and can wire the run into your own pipeline. LangSmith is useful when datasets, experiments, traces, and evaluator runs should sit in one managed workflow. LlamaIndex is useful when your RAG stack is already built around its retrieval and response evaluation modules. The tool should fit the control plane you already need, not the other way around.

For retrieval infrastructure choices, keep the eval dataset independent of the store. If you are deciding whether to keep RAG in Postgres or move to a dedicated vector database, the decision belongs to retrieval quality, operational constraints, and product boundaries. The same golden set should be able to test either side. That is the connection to "Pgvector vs Pinecone for Production RAG": storage only wins if it improves the measured retrieval path without weakening governance.
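One way to keep the golden set store-independent is to score any backend through a narrow interface. The sketch below uses `typing.Protocol`; the `Retriever` protocol, the `DictBackedRetriever` stub, and `mean_recall` are all hypothetical names, and the stub stands in for whatever adapter you would write around a Postgres or vector database client.

```python
from typing import Protocol, Sequence


class Retriever(Protocol):
    """Hypothetical interface any store adapter would implement."""

    def retrieve(self, query: str, top_k: int) -> Sequence[str]:
        ...


class DictBackedRetriever:
    """In-memory stand-in for a real store adapter."""

    def __init__(self, index):
        self._index = index

    def retrieve(self, query, top_k):
        return self._index.get(query, [])[:top_k]


def mean_recall(retriever: Retriever, golden_set, top_k=5) -> float:
    """Average fraction of expected context IDs found in the top_k results."""
    if not golden_set:
        return 0.0
    total = 0.0
    for row in golden_set:
        expected = set(row["expected_context_ids"])
        found = expected & set(retriever.retrieve(row["query"], top_k))
        total += len(found) / len(expected) if expected else 1.0
    return total / len(golden_set)
```

Running `mean_recall` with the same `golden_set` against two adapters gives a like-for-like comparison that does not depend on either store's tooling.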

Wire The Eval Gate Into CI And Production Monitoring

The eval gate has to run before deployment and keep watching after deployment. Offline evals catch regressions before launch. Production monitoring catches corpus drift, user behavior changes, model changes, and evaluator blind spots.

LangSmith's tutorial shows `client.evaluate` running a target over a dataset with correctness, groundedness, relevance, and retrieval relevance evaluators. OpenAI's evals guidance says evals test model outputs against style and content criteria you specify, and lays out the workflow in four steps: describe the task, run with test inputs, analyze results, and iterate. Those are the right mechanics, but the RAG-specific gate should treat the retriever as part of the target, not just the final model call.

A release gate should record:

Query and user-visible answer.
Retrieved context IDs, document versions, and source URLs.
Metadata filters applied during retrieval.
Prompt version and model name.
Evaluator name and evaluator version.
Individual evaluator results, not just aggregate status.
Failure reason and owner when a row fails.

The CI version can be small and strict. Run the critical set whenever chunking, embedding, retrieval, reranking, prompt, model, or policy logic changes. The production version should sample real queries, replay known incidents, and compare live behavior against the same evaluator definitions. If your RAG system includes citations, production monitoring should also check that citations point to retrieved evidence actually used by the answer.
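A CI gate can be as small as a script that returns a non-zero exit code when any critical row fails. This sketch assumes result rows are dicts with a `query` and an `evaluators` mapping of evaluator name to pass/fail; those shapes, and the `load_results` loader mentioned in the comment, are hypothetical.

```python
def failing_rows(results):
    """Keep individual evaluator failures instead of a single aggregate status."""
    failing = []
    for row in results:
        failed = [name for name, passed in row["evaluators"].items() if not passed]
        if failed:
            failing.append({"query": row["query"], "failed_evaluators": failed})
    return failing


def citations_outside_context(cited_ids, retrieved_ids):
    """Citations that point at documents the retriever never returned."""
    retrieved = set(retrieved_ids)
    return [cid for cid in cited_ids if cid not in retrieved]


def main(results):
    """Return a process exit code: 0 when every row passes, 1 otherwise."""
    failing = failing_rows(results)
    for failure in failing:
        print(f"FAIL {failure['query']!r}: {', '.join(failure['failed_evaluators'])}")
    return 1 if failing else 0


# In CI this would end with something like: sys.exit(main(load_results()))
# where load_results is a loader you supply for your own eval output.
```

The citation helper only confirms that cited IDs were retrieved; whether the answer actually used that evidence still needs a groundedness evaluator.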

For observability tooling, the same principle applies. Langfuse, LangSmith, custom traces, or your internal dashboard can all work if they preserve the right fields. The comparison in "Langfuse vs LangSmith for Production Observability" covers the adjacent tool decision. For the RAG launch gate, the non-negotiable requirement is that each answer links back to its query, retrieved documents, prompt version, model, evaluator, and score.

What Breaks First In Production

The first production break is usually not a dramatic model failure. It is a quiet mismatch between the user's question, the retrieved evidence, and the answer boundary.

Common failures:

Stale context wins because the retriever sees an old document as more semantically similar than the current source.
Permission filters are applied in the UI but not in retrieval, so the prompt receives evidence the user should not see.
Chunk boundaries split the answer from the condition that limits it.
Reranking improves generic relevance while hiding the source that contains the deciding clause.
The answer is grounded in retrieved context, but the context itself is not the right evidence.
Evaluator prompts drift and start passing answers that a human reviewer would reject.
Live queries move into topics that were never represented in the golden set.

The fix is to make every failure class observable. Stale context needs document version and source timestamp in the trace. Permission leakage needs tenant, role, and filter metadata. Bad chunking needs retrieved chunk IDs and neighbor chunks. Reranker regressions need pre-rerank and post-rerank rankings. Grounded-but-wrong answers need reference context IDs and correctness checks. Evaluator drift needs evaluator versioning and periodic human review.
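As one example of making a failure class observable, a reranker regression shows up when the trace keeps both rankings. A minimal sketch, assuming the trace stores pre-rerank and post-rerank ID lists and the row has expected context IDs; the function name and `top_k` cutoff are illustrative.

```python
def rerank_demotions(pre_rerank_ids, post_rerank_ids, expected_ids, top_k):
    """Expected IDs that were inside top_k before reranking but fell out afterwards."""
    pre_top = set(pre_rerank_ids[:top_k])
    post_top = set(post_rerank_ids[:top_k])
    return [cid for cid in expected_ids if cid in pre_top and cid not in post_top]
```

Any ID in the result is evidence the first-stage retriever found and the reranker then pushed out of the prompt window.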

The durable launch rule is simple: if the system cannot tell you which of retrieval, grounding, answer relevance, correctness, or monitoring failed, it is not ready for production traffic. A RAG system that answers well in demos but cannot explain its failures will become expensive to support, hard to tune, and risky to trust.

What are RAG evaluation metrics?

RAG evaluation metrics are checks for the retrieval side, the answer side, and the release process. The useful set includes retrieval relevance, context recall or coverage, groundedness, answer relevance, correctness when references exist, and regression checks tied to corpus and pipeline versions.

Which RAG metric should come first?

Retrieval should come first when the answer depends on specific evidence. If the expected context is missing from the prompt, a grounded answer is impossible, so a failure in retrieval relevance or context recall should block the build before any prompt tuning begins.

Do RAG evals require ground-truth answers?

Some do and some do not. LangSmith and LlamaIndex both separate reference-based correctness from reference-less relevance or groundedness checks, so you can start with query, context, and response evaluators while building a stronger golden dataset.

Is Ragas or LangSmith better for RAG evaluation?

Ragas is a strong fit when you want metric primitives such as context precision, context recall, response relevancy, and faithfulness inside your own pipeline. LangSmith is a stronger fit when managed datasets, experiments, traces, and evaluator runs should live together.

What is the difference between faithfulness and correctness?

Faithfulness checks whether the answer is supported by the retrieved context. Correctness checks whether the answer matches the expected answer or known facts. A response can be faithful to incomplete context and still be incorrect.

Scope Your RAG Pipeline

Design the retrieval, evaluation, and monitoring layer your RAG system needs before it reaches production traffic.

Last Updated

Jun 5, 2026
