Hybrid Search for Production RAG: The BM25, Vector, and Rerank Rule

Use hybrid search when vector-only misses exact terms. Compare BM25, vectors, fusion, reranking, evals, and the production logging gate.

Saturday, June 13, 2026 · Omid Saffari
Use hybrid search for RAG when your corpus contains identifiers, product names, error strings, policy clauses, or domain terms that embeddings can smooth over. Vector-only is acceptable only after an eval set proves it can recover exact-match queries; the production default is BM25 plus vectors, fused, then reranked before the context hits the model.

The Verdict: Vector-Only Has To Earn Its Way Into Production

Hybrid search should be the default retrieval pattern for production RAG unless your evals prove vector-only retrieval can recover both semantic questions and exact-match questions. Supabase defines hybrid search as full text search (searching by keyword) combined with semantic search (searching by meaning), so that the retrieval stage can find results that are both directly and contextually relevant to the query.

The failure mode is simple: embeddings are good at meaning, but they can smooth over the token that matters. A support engineer asks for ERR_AUTH_TOKEN, a compliance reviewer asks for a clause number, or a product team asks for a customer-specific integration name. A vector index may retrieve conceptually similar chunks. A lexical branch catches the exact string. The RAG system needs both signals before the model writes an answer with citations.

The rule we use is strict:

  • Keep vector-only retrieval only when the labeled retrieval set proves it recovers exact-match, acronym, error-code, SKU, policy, and named-entity queries at the target recall.
  • Add BM25 or full text search when the corpus contains exact identifiers, domain vocabulary, or user wording that cannot be safely paraphrased.
  • Add reranking after fusion when the top results are noisy enough that the model receives plausible but wrong context.

That last clause matters. Hybrid retrieval is not a license to send more context to the model. It is a way to build a better candidate set, then shrink it into a cleaner context window.

RAG evaluation still needs its own release gate. Hybrid search improves recall, but it does not prove the final answer is grounded. The production test is whether the right chunk appears, whether the cited chunk supports the answer, and whether the system logs enough evidence to debug a bad answer later.
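One way to make the vector-only rule testable is to score recall@k per query class instead of as a single average. The sketch below assumes a hypothetical labeled set in which each example carries a query class, the ID of the chunk that should be retrieved, and the IDs a retriever actually returned. The field names and sample data are illustrative and not taken from any specific eval framework.

```python
from collections import defaultdict

def recall_at_k_by_class(examples, k=10):
    """Return {query_class: fraction of examples whose gold chunk is in the top k}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        totals[ex["query_class"]] += 1
        if ex["gold_chunk_id"] in ex["retrieved_ids"][:k]:
            hits[ex["query_class"]] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}

# Hypothetical results from a vector-only run over four labeled queries.
vector_only_run = [
    {"query_class": "semantic", "gold_chunk_id": "c1", "retrieved_ids": ["c1", "c7"]},
    {"query_class": "semantic", "gold_chunk_id": "c2", "retrieved_ids": ["c9", "c2"]},
    {"query_class": "exact_id", "gold_chunk_id": "c3", "retrieved_ids": ["c8", "c4"]},
    {"query_class": "exact_id", "gold_chunk_id": "c5", "retrieved_ids": ["c5", "c6"]},
]

report = recall_at_k_by_class(vector_only_run, k=2)
print(report)
```

In this made-up run the semantic class looks healthy while the exact-ID class does not, which is the kind of split an overall average would hide.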

The Comparison That Matters

The useful comparison is not "vector databases versus search engines." The useful comparison is which retrieval signal owns which query class, and what must be logged before production traffic depends on it.

| Retrieval pattern | Best use | Retrieval signal | Production setup | What breaks first | Release gate |
|---|---|---|---|---|---|
| Vector-only | Semantic questions over clean prose | Embedding similarity | One vector index plus metadata filters | Rare terms, exact IDs, version strings, and names disappear into semantic similarity | Labeled eval proves exact-match recall is good enough |
| BM25 or full text only | Exact search over docs, support tickets, policies, and IDs | Lexical match and term frequency | Inverted index, tsvector, BM25, or BM25F | Synonyms and paraphrased user questions miss relevant chunks | Query set is mostly exact terms and users search like the corpus is written |
| Hybrid search | Mixed semantic and exact queries | Vector branch plus keyword branch | Two retrieval branches, fusion, dedupe, metadata filters | Score fusion drifts, one branch dominates, filters hide candidates | Branch-level recall and fused recall beat either branch alone |
| Hybrid plus rerank | High-stakes answers, long docs, noisy chunks, many near-duplicates | Fused shortlist plus cross-encoder or reranker | Retrieve candidates, fuse, rerank, return final context | Latency and truncation increase, and reranker inputs become hard to inspect | Reranker improves final context precision without breaking latency budget |

Pinecone's hybrid docs make the scoring problem explicit: dense vectors scored with dotproduct against unit-norm embeddings produce values roughly in [-1, 1], while BM25-style sparse weights and sparse model outputs are unbounded positive values. Without explicit weighting, the sparse component dominates the combined score.

That is why fusion and weighting are not cosmetic settings. They are production behavior. Pinecone documents a convex combination, combined = alpha * dense + (1 - alpha) * sparse, where alpha = 1.0 is pure semantic, alpha = 0.0 is pure lexical, and alpha = 0.5 gives both signals equal weight. Pinecone also recommends evaluating multiple alpha values against a labeled relevance set from the workload.

The practical default is not "set alpha once." The practical default is to log the query class and compare branch behavior. Natural-language support questions may want a dense-leaning setting. Identifier-heavy queries may need the keyword branch to carry more weight. A production RAG system should be able to explain why a chunk won.
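The arithmetic behind the weighting problem is easy to see in a few lines. This is a minimal sketch of the convex combination described above, not Pinecone's client API. The candidate names and scores are made up, and a real deployment may apply the weights to the vectors before querying instead of to the scores afterwards.

```python
def hybrid_score(dense: float, sparse: float, alpha: float) -> float:
    """Convex combination: alpha=1.0 is pure dense, alpha=0.0 is pure sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dense + (1.0 - alpha) * sparse

# Hypothetical candidates: (chunk_id, dense similarity, unbounded sparse score).
candidates = [
    ("semantic_match", 0.90, 2.0),
    ("keyword_match", 0.20, 14.0),
]

def top_chunk(alpha: float) -> str:
    """Return the ID of the candidate with the highest combined score."""
    return max(candidates, key=lambda c: hybrid_score(c[1], c[2], alpha))[0]

# Sweep alpha to see where the winning chunk flips.
for alpha in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(alpha, top_chunk(alpha))
```

With these invented scores the keyword candidate still wins at alpha = 0.9, because the sparse score is an order of magnitude larger than the dense one. That is why the sweep belongs in the eval set, per query class, instead of in a one-time config choice.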

Build The First Version In Postgres When The Product Boundary Is Already SQL

Postgres is the right first production boundary when the documents, permissions, tenancy, and audit trail already live in Postgres. You avoid a second operational system while still getting lexical search and vector search in one transactionally familiar place.

Supabase's hybrid-search example is the clean shape. It stores a generated tsvector for full text search and a vector(512) embedding for semantic search:

SQL
create table documents (
  id bigint primary key generated always as identity,
  content text,
  fts tsvector generated always as (to_tsvector('english', content)) stored,
  embedding extensions.vector(512)
);

Then it indexes each retrieval branch separately:

SQL
create index on documents using gin(fts);
create index on documents using hnsw (embedding vector_ip_ops);

That split is the point. PostgreSQL provides to_tsvector for converting a document into tsvector, and websearch_to_tsquery accepts raw user-supplied input without raising syntax errors. PostgreSQL also provides ts_rank and ts_rank_cd; ts_rank_cd uses cover density and considers proximity of matching lexemes. The text branch can rank exact terms with mature database primitives, while the vector branch handles semantic similarity.

Supabase's example function accepts query_text, query_embedding, match_count, full_text_weight, semantic_weight, and rrf_k int = 50. It limits each branch to least(match_count, 30) * 2, joins full-text and semantic results by document ID, and orders by weighted reciprocal rank fusion:

SQL
coalesce(1.0 / (rrf_k + full_text.rank_ix), 0.0) * full_text_weight +
coalesce(1.0 / (rrf_k + semantic.rank_ix), 0.0) * semantic_weight

The example JavaScript call uses match_count: 10. That is a reasonable first shipped shape: retrieve enough candidates from both branches, fuse them, and hand a small context set to the generator.
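For readers who want to test the fusion logic outside the database, the same weighted reciprocal rank fusion can be sketched in Python. The function name and the sample ID lists are hypothetical. The parameter names and the rrf_k default mirror the Supabase function described above, and each input list is assumed to be ordered best-first with no duplicate IDs.

```python
def weighted_rrf(full_text_ids, semantic_ids, full_text_weight=1.0,
                 semantic_weight=1.0, rrf_k=50, match_count=10):
    """Fuse two ranked ID lists; an ID missing from a branch contributes 0 for it."""
    scores = {}
    branches = ((full_text_weight, full_text_ids), (semantic_weight, semantic_ids))
    for weight, ids in branches:
        for rank_ix, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank_ix)
    # Highest fused score first; ties are broken by ID so the output is stable.
    order = sorted(scores, key=lambda d: (-scores[d], str(d)))
    return [(doc_id, scores[doc_id]) for doc_id in order[:match_count]]

fused = weighted_rrf(["a", "b", "c"], ["c", "a", "d"])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents found by both branches ("a" and "c" here) rise above documents found by only one, and no raw text or vector score ever enters the calculation.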

For teams deciding whether Postgres is enough, the pgvector versus Pinecone decision still applies. pgvector is open-source vector similarity search for Postgres. It supports exact and approximate nearest neighbor search over single-precision, half-precision, binary, and sparse vectors, with L2 distance, inner product, cosine distance, L1 distance, Hamming distance, and Jaccard distance. Its installation examples currently use branch v0.8.2.

The index choice is a production tradeoff. pgvector supports HNSW and IVFFlat. HNSW has better query performance than IVFFlat on the speed-recall tradeoff, but slower build times and higher memory use. IVFFlat builds faster and uses less memory, but has lower query performance on the same tradeoff. pgvector also documents dimension limits for indexed columns; for HNSW these are vector up to 2,000 dimensions, halfvec up to 4,000 dimensions, bit up to 64,000 dimensions, and sparsevec up to 1,000 non-zero elements.

The footgun is filters. Supabase's pgvector docs warn that with IVFFlat or HNSW, naive filtering on another column can return fewer rows than requested because the embedding index may not return enough rows matching the filter. In a RAG system with permissions, tenancy, or source filters, that is not a corner case. It is usually where retrieval bugs start.
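One common workaround, offered here as an assumption and not as something quoted from the Supabase docs, is to over-fetch from the approximate index and widen the fetch until enough rows survive the filter. In this sketch `ann_search` and `passes_filter` are hypothetical callables standing in for the index query and the permission or tenancy check.

```python
def filtered_top_k(ann_search, passes_filter, k, max_fetch=1000):
    """Over-fetch from an approximate index until k rows survive the filter."""
    fetch = k
    while True:
        rows = ann_search(fetch)               # nearest-first rows from the index
        kept = [r for r in rows if passes_filter(r)]
        exhausted = len(rows) < fetch          # the index has nothing more to return
        if len(kept) >= k or exhausted or fetch >= max_fetch:
            return kept[:k], fetch
        fetch = min(fetch * 2, max_fetch)

# Hypothetical corpus: 40 rows, nearest first; one row in four belongs to "acme".
corpus = [{"id": i, "tenant": "acme" if i % 4 == 0 else "other"} for i in range(40)]

rows, fetched = filtered_top_k(
    ann_search=lambda n: corpus[:n],
    passes_filter=lambda r: r["tenant"] == "acme",
    k=5,
)
print([r["id"] for r in rows], fetched)
```

Logging the final fetch size, and any case where fewer than k rows come back, makes filter-induced shortfalls visible instead of silent. Newer pgvector releases also document iterative index scans for filtered queries, so check what your installed version supports before hand-rolling this.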

  1. Create Two Retrieval Signals

    Store a normalized full-text field and an embedding for the same document or chunk. Keep the permission and source metadata next to both signals so filters apply before generation.

  2. Fuse By Rank, Not Raw Score

    Use RRF or a documented weighted fusion method before trusting raw scores. Raw vector similarity and keyword relevance are not naturally comparable.

  3. Log The Branch Evidence

    For every answer, store the text-rank list, vector-rank list, fused rank, weights, final context IDs, and citation IDs. Debugging RAG without branch evidence turns every miss into guesswork.

Move To A Search Service When Hybrid Becomes Its Own Product Surface

A search service makes sense when retrieval has become its own product surface: many corpora, high query volume, independent scaling, multi-stage search, hosted reranking, or search teams tuning relevance outside the app database.

Pinecone documents three hybrid patterns: a single index for dense and sparse vectors, separate indexes for dense and sparse vectors, and a multi-field document schema. Its single-index pattern is the simpler vector-API path, but Pinecone calls out the weighting problem directly: BM25 scores and pinecone-sparse-english-v0 sparse-weight outputs are not normalized to the dense vector range, and without explicit weighting the sparse component dominates.

That means Pinecone is a good fit when you are ready to make hybrid weighting an explicit retrieval policy. If the workload has both vector and sparse vectors per record, the vector-API pattern works. If the workload is text-centric, Pinecone's document schema lets one schema declare full-text fields, dense vectors, and sparse vectors side by side.

Weaviate's model is similar at the decision level. Its hybrid search combines vector search and keyword BM25F search by fusing the two result sets. The fusion method and relative weights are configurable. Weaviate says alpha of 1 is pure vector search and alpha of 0 is pure keyword search, and Relative Score Fusion is the default fusion method starting in v1.24.

Qdrant is useful when the retrieval plan needs explicit multi-stage shape. Its hybrid and multi-stage queries are available as of v1.10.0; the prefetch parameter enables sub-requests, and when at least one prefetch exists, Qdrant performs the prefetch query or queries and applies the main query over those results. Qdrant supports fusing different queries with rrf and dbsf, shows sparse and dense prefetches with limit: 20, and documents parameterized RRF with k: 60 as of v1.16.0. Weighted RRF is available as of v1.17.0.

The product decision is not which service sounds more "AI-native." The decision is where the retrieval policy should live.

  • Keep it in Postgres when the product already needs SQL permissions, transactions, admin workflows, and simple hybrid retrieval.
  • Move it to a search service when ranking policy, multi-stage retrieval, sparse/dense weighting, reranking, or search operations need independent ownership.
  • Keep the API between app and retrieval explicit either way: query text, query embedding, filters, branch weights, top candidates, final citations, and trace ID.
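The explicit API in the last bullet can be pinned down as a pair of typed records. This is a sketch under assumptions: the class and field names are hypothetical, chosen to match the items listed above, and a real service would add validation and a wire format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class RetrievalRequest:
    query_text: str
    query_embedding: list[float]
    filters: dict[str, str]
    full_text_weight: float = 1.0
    semantic_weight: float = 1.0
    match_count: int = 10
    # Generated once per request so every downstream log line can be joined on it.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    fused_rank: int
    full_text_rank: Optional[int]   # None when the text branch did not return it
    semantic_rank: Optional[int]    # None when the vector branch did not return it

@dataclass(frozen=True)
class RetrievalResponse:
    trace_id: str
    candidates: list[Candidate]
    citation_ids: list[str]

request = RetrievalRequest("ERR_AUTH_TOKEN on login", [0.1, 0.2], {"tenant": "acme"})
response = RetrievalResponse(request.trace_id, [Candidate("c3", 1, 1, None)], ["c3"])
print(response)
```

The same contract works whether the implementation behind it is a Postgres function or a hosted search service, which keeps a later migration from leaking into application code.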

Reranking Belongs After Fusion, Not Before Retrieval

Reranking is a precision step on a candidate set, not a replacement for retrieval. Put it after BM25 and vector retrieval have produced a fused shortlist. That lets the reranker spend attention on plausible candidates instead of scanning the corpus.

Cohere's Rerank API takes a query and a list of texts and produces an ordered array with each text assigned a relevance score. The model field example is rerank-v3.5. Cohere recommends against sending more than 1,000 documents in a single request, and max_tokens_per_doc defaults to 4096, with long documents automatically truncated to that value.

That default truncation is operationally important. If the chunk is too large, the reranker may score a truncated version of the evidence. If the list is too large, reranking becomes slower and more expensive than it needs to be. If the logs do not preserve reranker inputs, a bad answer can look like a generation problem when the real failure was a truncated or misplaced evidence chunk.

The shipped pipeline should look like this:

Text
query text
  -> normalize and classify query
  -> full-text retrieval with filters
  -> vector retrieval with the same filters
  -> fuse and dedupe candidates
  -> rerank fused shortlist
  -> pass final cited chunks to the model
  -> log answer, citations, scores, and eval labels

Do not rerank before the lexical branch has had a chance to find exact terms. Do not pass the full hybrid result set to the model. Do not treat reranker score as a truth score. It is a relevance score for a query-document pair, and it still needs downstream groundedness checks.
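The ordering above can be sketched as a small orchestration function. Everything here is illustrative: `text_search`, `vector_search`, and `rerank` are hypothetical callables you would back with your own database and reranker, the query normalization and classification step is omitted for brevity, and the fusion uses plain reciprocal rank fusion with an assumed constant of 60.

```python
def run_pipeline(query, filters, text_search, vector_search, rerank,
                 candidates_per_branch=20, final_k=5, rrf_k=60):
    """Retrieve from both branches, fuse by rank, rerank, and return a trace."""
    # Both branches receive the same filters.
    text_ids = text_search(query, filters, candidates_per_branch)
    vector_ids = vector_search(query, filters, candidates_per_branch)

    # Reciprocal rank fusion; the dict also dedupes chunks found by both branches.
    fused = {}
    for ids in (text_ids, vector_ids):
        for rank, chunk_id in enumerate(ids, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    shortlist = sorted(fused, key=lambda c: (-fused[c], c))

    reranked = rerank(query, shortlist)        # expected to return IDs, best first
    return {
        "query": query,
        "filters": filters,
        "text_ids": text_ids,
        "vector_ids": vector_ids,
        "fused_ids": shortlist,
        "reranked_ids": reranked,
        "final_ids": reranked[:final_k],       # only these chunks go to the model
    }

# Stub branches and a stub reranker that moves chunk "c3" to the front.
trace = run_pipeline(
    "ERR_AUTH_TOKEN", {"tenant": "acme"},
    text_search=lambda q, f, n: ["c3", "c1"],
    vector_search=lambda q, f, n: ["c1", "c2"],
    rerank=lambda q, ids: sorted(ids, key=lambda c: c != "c3"),
    final_k=2,
)
print(trace["fused_ids"], trace["final_ids"])
```

Returning the whole trace and not just the final chunks is deliberate: the intermediate lists are what make a later miss attributable to a specific stage.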

What To Log Before You Trust Hybrid Retrieval

Hybrid search changes what you need to observe. A vector-only RAG failure usually asks one question: did the nearest-neighbor result include the right chunk? A hybrid RAG failure asks several: did the text branch find it, did the vector branch find it, did fusion bury it, did reranking demote it, did a permission filter remove it, or did the model ignore it?

The minimum production log should include:

  • Query text, normalized query, and query class, such as exact ID, natural-language question, policy clause, or mixed.
  • Filter set, including tenant, permission scope, source type, date window, and content status.
  • Full-text candidates with rank and text score.
  • Vector candidates with rank, distance or similarity, embedding model, and index name.
  • Fusion method, weights, and fused rank.
  • Reranker model, candidate count, truncation setting, score, and final rank.
  • Final chunks sent to the model, answer citations, and whether each citation supports the answer.
  • Eval label, failure reason, and rollback or retrieval-policy version.
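The list above maps naturally onto a single structured record per answer. The sketch below is one possible shape, with hypothetical field names and sample values; it is not a schema from any logging library.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class RetrievalLog:
    trace_id: str
    policy_version: str
    query_text: str
    normalized_query: str
    query_class: str                 # e.g. "exact_id", "natural_language", "mixed"
    filters: dict                    # tenant, permission scope, source type, ...
    full_text_candidates: list       # [{"chunk_id", "rank", "score"}]
    vector_candidates: list          # [{"chunk_id", "rank", "distance"}]
    embedding_model: str
    index_name: str
    fusion_method: str
    fusion_weights: dict
    fused_ranks: dict                # chunk_id -> fused rank
    reranker_model: str
    reranker_max_tokens_per_doc: int
    reranked: list                   # [{"chunk_id", "score", "rank"}]
    final_chunk_ids: list
    citations: list                  # [{"chunk_id", "supports_answer"}]
    eval_label: str = ""
    failure_reason: str = ""

    def to_json(self) -> str:
        """Serialize to one JSON line suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)

record = RetrievalLog(
    trace_id="t-001", policy_version="p-7", query_text="ERR_AUTH_TOKEN?",
    normalized_query="err_auth_token", query_class="exact_id",
    filters={"tenant": "acme"},
    full_text_candidates=[{"chunk_id": "c3", "rank": 1, "score": 0.42}],
    vector_candidates=[{"chunk_id": "c1", "rank": 1, "distance": 0.18}],
    embedding_model="example-embed", index_name="documents_embedding_idx",
    fusion_method="rrf", fusion_weights={"full_text": 1.0, "semantic": 1.0},
    fused_ranks={"c3": 1, "c1": 2},
    reranker_model="rerank-v3.5", reranker_max_tokens_per_doc=4096,
    reranked=[{"chunk_id": "c3", "score": 0.91, "rank": 1}],
    final_chunk_ids=["c3"], citations=[{"chunk_id": "c3", "supports_answer": True}],
)
print(record.to_json())
```

The reranker candidate count is recoverable as the length of `reranked`, so it is not stored twice.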

The first dashboard should not be a chart of average answer quality. It should show the failure split by retrieval stage. If exact-ID queries fail, tune the text branch or filters. If semantic questions fail, tune chunking, embeddings, or vector index recall. If both branches find the chunk and the final answer still fails, the problem is likely reranking, context assembly, instruction following, or answer evaluation.
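A failure split by stage can be computed directly from logged ID lists once a labeled gold chunk is known. This is a minimal sketch with hypothetical stage names and sample traces; `fused_ids` is assumed to be the shortlist handed to the reranker and `final_ids` the chunks sent to the model.

```python
from collections import Counter

def failure_stage(gold_id, text_ids, vector_ids, fused_ids, final_ids, cited_ids):
    """Return the earliest stage at which the gold chunk was lost, or "ok"."""
    if gold_id not in text_ids and gold_id not in vector_ids:
        return "retrieval"      # neither branch found it: check filters, index, chunking
    if gold_id not in fused_ids:
        return "fusion"         # a branch found it but the fused shortlist cut it
    if gold_id not in final_ids:
        return "rerank"         # the shortlist had it but reranking dropped it
    if gold_id not in cited_ids:
        return "generation"     # the model received it but did not cite it
    return "ok"

# Hypothetical traces for three queries that share the gold chunk "c3".
traces = [
    {"gold_id": "c3", "text_ids": ["c1"], "vector_ids": ["c2"],
     "fused_ids": ["c1", "c2"], "final_ids": ["c1"], "cited_ids": ["c1"]},
    {"gold_id": "c3", "text_ids": ["c3"], "vector_ids": ["c2"],
     "fused_ids": ["c2", "c3"], "final_ids": ["c2"], "cited_ids": ["c2"]},
    {"gold_id": "c3", "text_ids": ["c3"], "vector_ids": ["c3"],
     "fused_ids": ["c3"], "final_ids": ["c3"], "cited_ids": ["c3"]},
]

split = Counter(failure_stage(**t) for t in traces)
print(dict(split))
```

Grouping the same counter by query class would show whether exact-ID queries and semantic questions fail at different stages.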

The strongest hybrid systems treat retrieval policy like application code. A prompt change gets versioned. An embedding model change gets versioned. The same should be true for alpha, RRF constant, branch weights, filter order, reranker model, chunk size, and citation policy.
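One lightweight way to version these settings together is to hash them into a single policy ID that every log line carries. The sketch below assumes hypothetical field names and example values; the hashing scheme is an illustration, not a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class RetrievalPolicy:
    alpha: float
    rrf_k: int
    full_text_weight: float
    semantic_weight: float
    filter_order: tuple
    reranker_model: str
    chunk_size: int
    citation_policy: str

    def version(self) -> str:
        """Short content hash: changing any setting yields a different version ID."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

# Example values are placeholders, not recommendations.
base = RetrievalPolicy(
    alpha=0.5, rrf_k=50, full_text_weight=1.0, semantic_weight=1.0,
    filter_order=("tenant", "permission"), reranker_model="rerank-v3.5",
    chunk_size=512, citation_policy="cite-every-claim",
)
tuned = replace(base, rrf_k=60)
print(base.version(), tuned.version())
```

Because the ID is derived from content, two environments with identical settings report the same version, and a silent config drift shows up as a new ID in the logs.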

The Build Decision

The cleanest production path is to ship hybrid retrieval before users teach you that vector-only was too optimistic. Start in the product database when permissions and content lifecycle are already there. Move to a search service when retrieval needs independent scale and ranking ownership. Add reranking only after fusion has created a bounded, inspectable candidate set.

The decision rule is:

  • Use vector-only when evals prove exact-match recall and citation support are already strong.
  • Use BM25 plus vector search when corpus language and user language differ, but exact strings still matter.
  • Use hybrid plus reranking when fused results contain too many plausible distractors for the model.
  • Use a dedicated search service when ranking policy has become a product surface.

Hybrid search is not the finish line for RAG quality. It is the retrieval baseline that gives the rest of the system a fair chance: better recall before generation, clearer failure evidence after generation, and a safer path to production traffic.

What is BM25 in RAG?

BM25 is the keyword retrieval signal that catches exact terms, IDs, names, and phrases. In RAG it complements vector search, which is better at semantic similarity but can miss rare or literal strings.

What is BM25 hybrid search?

BM25 hybrid search runs lexical retrieval and vector retrieval as separate branches, then fuses their ranked results before generation. The production version also dedupes, applies permissions, logs branch evidence, and usually reranks the fused shortlist.

How does hybrid search work in RAG?

The system creates a text query and an embedding query, retrieves candidates from keyword and vector indexes, fuses the ranked lists, reranks the best candidates when needed, then sends the final cited chunks to the model.

What is the difference between BM25 and vector search?

BM25 is exact-term retrieval based on lexical evidence. Vector search is semantic retrieval based on embedding similarity. BM25 is stronger for identifiers and domain terms; vector search is stronger for paraphrases and concept matches.

Last Updated

Jun 13, 2026

Category: RAG