
Hybrid Search: Combining BM25 and Vector Retrieval for Production RAG

Hybrid search combines BM25 keyword scoring with vector similarity and fuses the rankings — the practical default for production RAG because real user queries come in both styles. This tutorial walks through the fusion strategies, weight tuning, and failure modes on a hypothetical e-commerce support bot.

SurePrompts Team
April 22, 2026
13 min read

TL;DR

Hybrid search pairs BM25 keyword retrieval with vector semantic retrieval and fuses the rankings — handling both product-code lookups and paraphrased natural-language questions from a single index. This tutorial walks through weighted fusion vs. RRF, weight tuning, and the failure modes worth knowing before you ship.


A single retriever rarely handles both "error code E-207 on model RX-400" and "my device keeps shutting off when I plug in the charger." BM25 wins the first; vectors win the second. Hybrid search runs both and fuses the rankings, either by weighted score fusion or reciprocal rank fusion. RRF is the pragmatic starting point. Tune weights only after you have an eval set.

Key takeaways:

  • The failure modes of BM25 and vector retrieval are nearly opposite. BM25 misses synonyms; vectors smear rare tokens. Hybrid search exists because real user queries come in both styles.
  • Reciprocal rank fusion (RRF) is score-scale-agnostic and a sane default. Weighted score fusion gives finer control but requires calibrating two different score distributions.
  • Query style determines which signal dominates. Product codes, error codes, SKUs, and acronyms lean BM25. Paraphrased questions and conceptual queries lean vector.
  • Tuning weights without an eval set is guessing. Build a 50-100 query labeled set first, then sweep.
  • Hybrid search is orthogonal to reranking, HyDE, and query rewriting. The strongest production stacks combine all of them.
  • The scenario in this post is hypothetical. Numbers illustrate shape, not measurements.

Why one retriever is not enough

Before you touch code, get clear on the problem. A retriever's job is to return the documents that contain the information the generator needs. The question is what "contain" means.

For a query like error code E-207 on model RX-400, "contain" means the document has those exact tokens. The user is not asking a conceptual question; they are searching for a string. A vector database that encodes both query and documents as dense embeddings will return something, but "E-207" is a low-frequency, high-signal token that embedding models tend to flatten. In vector space, E-207 sits uncomfortably close to E-208, E-107, and any other short alphanumeric pattern. The retriever returns plausible-looking documents that do not contain the specific error code the user named.

For a query like my device keeps shutting off when I plug in the charger, "contain" means something different. The document that explains this — probably titled "Intermittent shutdown during charging" — does not share a single non-stopword token with the query. A BM25 retriever, which scores documents by the rare tokens they share with the query, returns noise. A vector retriever, which encodes semantic proximity, returns the right document.

Both queries are normal. Both come in through the same endpoint. You cannot ship a single retriever that wins on both. This is the problem hybrid search solves.

The two signals

Hybrid search combines two retrievers over the same corpus. Understand each one on its own before fusing them.

BM25 over an inverted index

BM25 is a keyword scoring function. It assigns each (query, document) pair a score based on which query terms the document contains, weighted by how rare each term is across the corpus (inverse document frequency) and normalized for document length. Longer documents do not automatically win; very rare tokens contribute more than common ones.
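The scoring above can be sketched in a few lines of Python. This is an illustrative implementation of the classic BM25 formula (k1 and b are the usual free parameters), not a production inverted index:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with classic BM25.

    corpus is a list of tokenized documents; doc is one of them.
    Illustrative sketch only: a real engine precomputes an inverted index.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # Rare terms get a larger IDF weight
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        # Term-frequency saturation plus document-length normalization
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Because the contribution of a term absent from the document is zero, a document sharing no query tokens scores exactly 0.0, which is the synonym failure mode described below.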

What BM25 is good at: exact matches, rare tokens, product codes, error codes, SKUs, function names, proper nouns, acronyms, and phrase queries. When the user types ValueError: invalid literal for int(), BM25 finds the document that contains that literal string. When they type RX-400, BM25 finds the product page.

What BM25 is bad at: synonyms, paraphrases, conceptual queries, and morphological variants (unless you stem). A BM25 index has never heard of synonyms unless you tell it. If the document says "unit" and the query says "device," there is no overlap, so there is no score.

Implementations: Elasticsearch and OpenSearch ship BM25 as the default scorer. Postgres has tsvector and ts_rank_cd for full-text search (technically a different scoring function, with BM25 extensions available). Most serious search stacks already have a BM25 index sitting somewhere.

Vector similarity over dense embeddings

Vector retrieval encodes both query and document chunks as dense vectors using an embedding model, then retrieves by approximate nearest neighbor — cosine similarity, dot product, or L2 distance. Semantically similar inputs land near each other, so paraphrases retrieve well even without lexical overlap.
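A minimal exact-search sketch of the same idea, with toy 2-d vectors standing in for real embeddings (production stores replace the linear scan with an ANN index such as HNSW):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, doc_vecs, top_k=3):
    """Exact nearest-neighbor search by cosine similarity.

    doc_vecs maps doc id -> embedding. Illustrative only: real vector
    stores use approximate indexes instead of scoring every document.
    """
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```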

What vector retrieval is good at: paraphrases, synonyms, conceptual queries, multilingual queries (with the right model), and queries where the user and the documents use different vocabulary for the same idea.

What vector retrieval is bad at: rare tokens, exact matches, and anything that looks like a structured identifier. Low-frequency tokens get averaged into the surrounding context during encoding — embedding space has no mechanism that rewards "this exact string appears in this exact place." It rewards semantic neighborhood. A product code is semantically close to every other product code.

Implementations: Postgres with pgvector, Pinecone, Weaviate, Qdrant, Milvus, and every other dedicated vector store. The mechanics differ; the signal does not.

Fusion strategies

You now have two ranked lists for the same query. The question is how to combine them into one.

Weighted score fusion

The most intuitive approach: compute a weighted sum of the two scores per document.

```
final_score(d) = w_bm25 * normalize(bm25_score(d)) + w_vector * normalize(vector_score(d))
```

The trap is in normalize. Raw BM25 scores are unbounded and corpus-dependent. Raw cosine similarities sit in [-1, 1]. Summing them directly is meaningless — BM25 will dominate or disappear depending on corpus size and query characteristics.

Normalization options: min-max scaling per query (maps both to [0, 1] but is sensitive to outliers), z-score normalization (requires running statistics you may not have), or a learned calibration (needs training data). All of them add complexity. And weighted score fusion only behaves well when your normalization is actually calibrated, not just "divided by something."
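A sketch of weighted fusion with per-query min-max scaling. The doc ids and scores below are invented for illustration, and a document missing from one retriever's list is treated as scoring 0 in that list:

```python
def minmax(scores):
    """Per-query min-max scaling to [0, 1]; a degenerate list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(bm25_scores, vector_scores, w_bm25=0.5, w_vector=0.5):
    """Weighted sum over normalized scores; returns doc ids, best first."""
    b, v = minmax(bm25_scores), minmax(vector_scores)
    docs = set(b) | set(v)
    fused = {d: w_bm25 * b.get(d, 0.0) + w_vector * v.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

Note that min-max is the simplest of the options above and inherits its outlier sensitivity: one anomalously high BM25 score compresses every other BM25 score toward zero for that query.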

Where weighted fusion earns its complexity: once you have a labeled eval set and an honest normalization, it gives you fine-grained, per-query-style control. A query router can pick higher BM25 weights for queries that look like product codes and higher vector weights for queries that look like natural-language questions.

Reciprocal rank fusion (RRF)

RRF sidesteps the normalization problem by throwing away the scores entirely and using only the ranks.

```
rrf_score(d) = sum over lists of 1 / (k + rank_in_list(d))
```

k is a smoothing constant, usually 60. A document at rank 1 in the BM25 list contributes 1 / 61. A document at rank 1 in the vector list contributes another 1 / 61. Documents that appear near the top of both lists win. Documents that appear only once, or only at low ranks, lose.
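The whole strategy fits in a few lines. A sketch where each input is just a ranked list of document ids:

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists of doc ids by reciprocal rank; raw scores are ignored."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the sum, the two input lists can come from retrievers with completely incomparable score scales.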

RRF has three properties that make it the pragmatic default:

  • It requires no score normalization. The two retrievers can be on wildly different scales, from different model versions, from different vendors — RRF does not care.
  • It is robust to a single retriever collapsing. If one retriever returns a terrible list for a particular query, documents will still be pulled toward the top by the other list, because the bad list contributes small scores at every rank.
  • It has exactly one parameter, k, and results are not sensitive to its exact value.

Where RRF is weaker: it cannot express "BM25 is twice as important as vectors on this query style." It treats both lists equally. Most teams start with RRF because it just works, and graduate to weighted fusion only once they have the eval infrastructure to tune it responsibly.

Worked example: a hypothetical e-commerce support bot

Hypothetical scenario, illustrative numbers only. An e-commerce company runs a support bot backed by RAG over its help center: product manuals, FAQ articles, error code references, and troubleshooting guides. The corpus is about 40,000 chunks.

Two representative queries land in the same hour.

Query A: error code E-207 on model RX-400

Query B: my device keeps shutting off when I plug in the charger

The pure-vector retriever (running on a recent general-purpose embedding model at 1024 dimensions) returns the following for Query A:

| Rank | Chunk | Note |
|------|-------|------|
| 1 | "Common error codes overview" | General error list, no E-207 section |
| 2 | "RX-500 troubleshooting" | Wrong model |
| 3 | "Handling device errors" | Generic guidance |
| 4 | "Error code reference — RX series" | Contains E-207, buried |
| 5 | "Charging issues guide" | Unrelated |

The correct chunk — the error code reference for the RX series, which explicitly describes E-207 — is at rank 4. If the generator sees the top 3, it answers from the wrong context and the user gets a generic "have you tried restarting" response. The failure is silent: the generator produces fluent output, grounding against the retrieved (wrong) context looks fine, but the answer is useless.

Add a BM25 index. For Query A, BM25 returns:

| Rank | Chunk | Note |
|------|-------|------|
| 1 | "Error code reference — RX series" | Contains "E-207" and "RX-400" |
| 2 | "RX-400 product manual" | Contains "RX-400" |
| 3 | "Error code reference — general" | Contains "E-207" in passing |
| 4 | "Firmware changelog Q2" | Mentions error codes |
| 5 | "RX-300 troubleshooting" | Mentions RX |

BM25 nails it. The exact chunk is at rank 1.

Now fuse with RRF, k=60:

  • "Error code reference — RX series": 1/(60+1) + 1/(60+4) ≈ 0.0164 + 0.0156 = 0.0320
  • "RX-400 product manual": 0 + 1/(60+2) ≈ 0.0161
  • "Common error codes overview": 1/(60+1) + 0 ≈ 0.0164

The RX-series error reference wins the fused ranking. The generator sees the right chunk at position 1.

For Query B — the conversational one — the situation reverses. BM25 scores poorly because the relevant document ("Intermittent shutdown during charging") shares almost no rare tokens with the query. Vector retrieval places it near the top. RRF fusion carries the vector-preferred chunk into the top of the combined list, because BM25's list for this query is noisy enough that no alternative dominates both lists.

The behavior that matters: hybrid search is not a compromise that makes both queries mediocre. For each query, whichever signal is right is preserved, because fusion does not average — it rewards consensus and forgives single-retriever failures.

Tuning the weights

RRF gets you most of the way. When you outgrow it, the question becomes how to pick weights responsibly.

Build an eval set first. Assemble 50-100 real queries covering the query styles your system actually sees — product codes, natural-language questions, multi-word phrases, support-ticket-style prose. For each, label the documents that should be retrieved. Version it alongside your code. This is the same golden-set discipline the RAGAS evaluation walkthrough describes for end-to-end RAG.

Pick a target metric. Recall at k (did any relevant document make it into the top k?) is the right starting point for a retriever feeding a generator — the generator can tolerate some noise in the candidate set, but it cannot recover from relevant documents being absent. MRR and nDCG layer on precision-of-ordering once recall is acceptable.
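The recall-at-k metric described above can be computed with a small helper. This is a hypothetical hit-rate-style function: retrieved is the ranked doc-id list per query, relevant the labeled set of doc ids per query:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top k.

    retrieved: list of ranked doc-id lists, one per query.
    relevant: list of sets of labeled relevant doc ids, one per query.
    """
    hits = sum(
        1 for docs, rel in zip(retrieved, relevant)
        if any(d in rel for d in docs[:k])
    )
    return hits / len(retrieved)
```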

Sweep, then segment. Start with RRF. Move to weighted fusion only if you see a clear ceiling. Sweep BM25:vector weights from pure-vector (0:1) through pure-BM25 (1:0) in 10 increments. Report the metric for each setting on the whole eval set — then, critically, segment by query style and check each slice. A single aggregate weight can mask that you helped natural-language queries by 5 points while regressing product-code queries by 20. Fusion that only works on average is not fusion.
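The sweep itself is mechanical once the eval set exists. A sketch, where retrieve_fused is a hypothetical hook you supply that runs both retrievers and fuses them at the given BM25 weight (the vector weight is its complement):

```python
def sweep_weights(eval_set, retrieve_fused, steps=10, k=5):
    """Sweep the BM25 weight from 0.0 (pure vector) to 1.0 (pure BM25).

    eval_set: list of (query, set_of_relevant_doc_ids) pairs.
    retrieve_fused(query, w_bm25): returns a ranked doc-id list.
    Returns {weight: recall_at_k} for each setting in the sweep.
    """
    results = {}
    for i in range(steps + 1):
        w = i / steps
        hits = sum(
            1 for query, rel in eval_set
            if any(d in rel for d in retrieve_fused(query, w)[:k])
        )
        results[w] = hits / len(eval_set)
    return results
```

Run the same sweep per query-style segment, not just on the aggregate, before committing to a weight.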

Consider query-style routing. Once you can detect query style (regex for product codes and error codes, length/question-mark heuristics for natural language, or a small classifier), you can pick weights per query rather than globally. This is where weighted fusion finally earns its complexity — a router that uses 0.8:0.2 BM25:vector for queries matching [A-Z]+-\d+ and 0.3:0.7 otherwise.
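The regex router described above is a few lines. The weights and the pattern mirror the example in the text but are illustrative, not tuned values:

```python
import re

# Matches identifier-style tokens like "RX-400" or "E-207" (illustrative pattern)
CODE_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")

def fusion_weights(query):
    """Pick (w_bm25, w_vector) per query style; thresholds are assumptions."""
    if CODE_PATTERN.search(query):
        return (0.8, 0.2)   # identifier-heavy query: lean on exact matching
    return (0.3, 0.7)       # conversational query: lean on semantics
```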

Regression-test in CI. Any retrieval parameter that matters enough to tune matters enough to regress on. Run your eval set in CI, track the metric over time, and alert when it drops. The context engineering maturity model is explicit about this: at Level 5, retrieval parameters are observed, not assumed.

Failure modes

Four anti-patterns to flag.

Fusing without inspecting the individual lists. If BM25 is returning garbage on a broad class of queries — because the index was built with no stopword list, or tokenization is splitting product codes on dashes — fusion drags the combined result down. Inspect each retriever's list on a dozen diverse queries before you tune fusion.

Weighted fusion without score normalization. The most common footgun. Raw BM25 scores and raw cosine similarities are not comparable. A "weighted" fusion where one score is in the low thousands and the other is in [0, 1] is not weighted by the weights — it is weighted by the scale. Either normalize properly or use RRF.

Calling "top-k concatenation" hybrid search. Some systems take the top 10 from BM25 and the top 10 from vector search, deduplicate, and pass 20 chunks to the generator. This is not fusion — there is no ranking signal, and the generator gets 2x the context with no quality gain. It also usually blows out the context window. Actual fusion produces a single ranked list and truncates it.

Treating hybrid search as a substitute for good chunking or a reranker. Hybrid search improves retrieval from the index you built. If chunking splits the relevant passage across two chunks, hybrid search cannot fix that. If the generator needs precision neither signal alone provides, add a reranker after fusion. Necessary, not sufficient.

Our position

Five opinionated stances.

  • Ship RRF before you tune weights. RRF has one parameter, needs no calibration, and rarely loses to weighted fusion without infrastructure you probably do not have yet. Start here. Graduate later.
  • The eval set is the real work. Picking RRF vs. weighted fusion is an afternoon. Assembling 50-100 labeled queries that represent your actual traffic is weeks. The fusion algorithm is the easy part of this problem.
  • Segment your eval set by query style. An aggregate metric will hide regressions on the 15% of queries that matter most to your users. Report per-segment and refuse to ship if any segment regresses more than a threshold.
  • One system, not three. Operationally, the hybrid search "database" that matters is the one your team can maintain. Postgres with pgvector plus a BM25 extension in the same database beats a three-system stack you cannot keep in sync. Choose for ops reality, not benchmark pedigree.
  • Hybrid search is a retrieval pattern, not an architecture. It composes with query rewriting, HyDE, reranking, and RAGAS-style evals. Strong 2026 RAG stacks are hybrid-by-default and layer the rest on top — what the agentic prompt stack calls the retrieval substrate.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
