
Hybrid Search: Combining BM25 and Vector Retrieval for Production RAG

Hybrid search combines BM25 keyword scoring with vector similarity and fuses the rankings — the practical default for production RAG because real user queries come in both styles. This tutorial walks through the fusion strategies, weight tuning, and failure modes on a hypothetical e-commerce support bot.

SurePrompts Team
April 22, 2026
13 min read

TL;DR

Hybrid search pairs BM25 keyword retrieval with vector semantic retrieval and fuses the rankings — handling both product-code lookups and paraphrased natural-language questions from a single index. This tutorial walks through weighted fusion vs. RRF, weight tuning, and the failure modes worth knowing before you ship.


A single retriever rarely handles both "error code E-207 on model RX-400" and "my device keeps shutting off when I plug in the charger." BM25 wins the first; vectors win the second. Hybrid search runs both and fuses the rankings, either by weighted score fusion or reciprocal rank fusion. RRF is the pragmatic starting point. Tune weights only after you have an eval set.

Key takeaways:

  • The failure modes of BM25 and vector retrieval are nearly opposite. BM25 misses synonyms; vectors smear rare tokens. Hybrid search exists because real user queries come in both styles.
  • Reciprocal rank fusion (RRF) is score-scale-agnostic and a sane default. Weighted score fusion gives finer control but requires calibrating two different score distributions.
  • Query style determines which signal dominates. Product codes, error codes, SKUs, and acronyms lean BM25. Paraphrased questions and conceptual queries lean vector.
  • Tuning weights without an eval set is guessing. Build a 50-100 query labeled set first, then sweep.
  • Hybrid search is orthogonal to reranking, HyDE, and query rewriting. The strongest production stacks combine all of them.
  • The scenario in this post is hypothetical. Numbers illustrate shape, not measurements.

Why one retriever is not enough

Before you touch code, get clear on the problem. A retriever's job is to return the documents that contain the information the generator needs. The question is what "contain" means.

For a query like error code E-207 on model RX-400, "contain" means the document has those exact tokens. The user is not asking a conceptual question; they are searching for a string. A vector database that encodes both query and documents as dense embeddings will return something, but "E-207" is a low-frequency, high-signal token that embedding models tend to flatten. In vector space, E-207 sits uncomfortably close to E-208, E-107, and any other short alphanumeric pattern. The retriever returns plausible-looking documents that do not contain the specific error code the user named.

For a query like my device keeps shutting off when I plug in the charger, "contain" means something different. The document that explains this — probably titled "Intermittent shutdown during charging" — does not share a single non-stopword token with the query. A BM25 retriever, which scores documents by the rare tokens they share with the query, returns noise. A vector retriever, which encodes semantic proximity, returns the right document.

Both queries are normal. Both come in through the same endpoint. You cannot ship a single retriever that wins on both. This is the problem hybrid search solves.

The two signals

Hybrid search combines two retrievers over the same corpus. Understand each one on its own before fusing them.

BM25 over an inverted index

BM25 is a keyword scoring function. It assigns each (query, document) pair a score based on which query terms the document contains, weighted by how rare each term is across the corpus (inverse document frequency) and normalized for document length. Longer documents do not automatically win; very rare tokens contribute more than common ones.
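The scoring above can be sketched in a few lines of Python. This is an illustrative implementation of the classic BM25 formula (k1 and b are the usual free parameters), not a production inverted index:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with classic BM25.

    corpus is a list of tokenized documents; doc is one of them.
    Illustrative sketch only: a real engine precomputes an inverted index.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # Rare terms get a larger IDF weight
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        # Term-frequency saturation plus document-length normalization
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Because the contribution of a term absent from the document is zero, a document sharing no query tokens scores exactly 0.0, which is the synonym failure mode described below.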

What BM25 is good at: exact matches, rare tokens, product codes, error codes, SKUs, function names, proper nouns, acronyms, and phrase queries. When the user types ValueError: invalid literal for int(), BM25 finds the document that contains that literal string. When they type RX-400, BM25 finds the product page.

What BM25 is bad at: synonyms, paraphrases, conceptual queries, and morphological variants (unless you stem). A BM25 index has never heard of synonyms unless you tell it. If the document says "unit" and the query says "device," there is no overlap, so there is no score.

Implementations: Elasticsearch and OpenSearch ship BM25 as the default scorer. Postgres has tsvector and ts_rank_cd for full-text search (technically a different scoring function, with BM25 extensions available). Most serious search stacks already have a BM25 index sitting somewhere.

Vector similarity over dense embeddings

Vector retrieval encodes both query and document chunks as dense vectors using an embedding model, then retrieves by approximate nearest neighbor — cosine similarity, dot product, or L2 distance. Semantically similar inputs land near each other, so paraphrases retrieve well even without lexical overlap.
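A minimal exact-search sketch of the same idea, with toy 2-d vectors standing in for real embeddings (production stores replace the linear scan with an ANN index such as HNSW):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, doc_vecs, top_k=3):
    """Exact nearest-neighbor search by cosine similarity.

    doc_vecs maps doc id -> embedding. Illustrative only: real vector
    stores use approximate indexes instead of scoring every document.
    """
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```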

What vector retrieval is good at: paraphrases, synonyms, conceptual queries, multilingual queries (with the right model), and queries where the user and the documents use different vocabulary for the same idea.

What vector retrieval is bad at: rare tokens, exact matches, and anything that looks like a structured identifier. Low-frequency tokens get averaged into the surrounding context during encoding — embedding space has no mechanism that rewards "this exact string appears in this exact place." It rewards semantic neighborhood. A product code is semantically close to every other product code.

Implementations: Postgres with pgvector, Pinecone, Weaviate, Qdrant, Milvus, and every other dedicated vector store. The mechanics differ; the signal does not.

Fusion strategies

You now have two ranked lists for the same query. The question is how to combine them into one.

Weighted score fusion

The most intuitive approach: compute a weighted sum of the two scores per document.

```
final_score(d) = w_bm25 * normalize(bm25_score(d)) + w_vector * normalize(vector_score(d))
```

The trap is in normalize. Raw BM25 scores are unbounded and corpus-dependent. Raw cosine similarities sit in [-1, 1]. Summing them directly is meaningless — BM25 will dominate or disappear depending on corpus size and query characteristics.

Normalization options: min-max scaling per query (maps both to [0, 1] but is sensitive to outliers), z-score normalization (requires running statistics you may not have), or a learned calibration (needs training data). All of them add complexity. And weighted score fusion only behaves well when your normalization is actually calibrated, not just "divided by something."
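A sketch of weighted fusion with per-query min-max scaling. The doc ids and scores below are invented for illustration, and a document missing from one retriever's list is treated as scoring 0 in that list:

```python
def minmax(scores):
    """Per-query min-max scaling to [0, 1]; a degenerate list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(bm25_scores, vector_scores, w_bm25=0.5, w_vector=0.5):
    """Weighted sum over normalized scores; returns doc ids, best first."""
    b, v = minmax(bm25_scores), minmax(vector_scores)
    docs = set(b) | set(v)
    fused = {d: w_bm25 * b.get(d, 0.0) + w_vector * v.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

Note that min-max is the simplest of the options above and inherits its outlier sensitivity: one anomalously high BM25 score compresses every other BM25 score toward zero for that query.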

Where weighted fusion earns its complexity: once you have a labeled eval set and an honest normalization, it gives you fine-grained, per-query-style control. A query router can pick higher BM25 weights for queries that look like product codes and higher vector weights for queries that look like natural-language questions.

Reciprocal rank fusion (RRF)

RRF sidesteps the normalization problem by throwing away the scores entirely and using only the ranks.

```
rrf_score(d) = sum over lists of 1 / (k + rank_in_list(d))
```

k is a smoothing constant, usually 60. A document at rank 1 in the BM25 list contributes 1 / 61. A document at rank 1 in the vector list contributes another 1 / 61. Documents that appear near the top of both lists win. Documents that appear only once, or only at low ranks, lose.
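The whole strategy fits in a few lines. A sketch where each input is just a ranked list of document ids:

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists of doc ids by reciprocal rank; raw scores are ignored."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the sum, the two input lists can come from retrievers with completely incomparable score scales.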

RRF has three properties that make it the pragmatic default:

  • It requires no score normalization. The two retrievers can be on wildly different scales, from different model versions, from different vendors — RRF does not care.
  • It is robust to a single retriever collapsing. If one retriever returns a terrible list for a particular query, documents will still be pulled toward the top by the other list, because the bad list contributes small scores at every rank.
  • It has exactly one parameter, k, and results are not sensitive to its exact value.

Where RRF is weaker: it cannot express "BM25 is twice as important as vectors on this query style." It treats both lists equally. Most teams start with RRF because it just works, and graduate to weighted fusion only once they have the eval infrastructure to tune it responsibly.

Worked example: a hypothetical e-commerce support bot

Hypothetical scenario, illustrative numbers only. An e-commerce company runs a support bot backed by RAG over its help center: product manuals, FAQ articles, error code references, and troubleshooting guides. The corpus is about 40,000 chunks.

Two representative queries land in the same hour.

Query A: error code E-207 on model RX-400

Query B: my device keeps shutting off when I plug in the charger

The pure-vector retriever (running on a recent general-purpose embedding model at 1024 dimensions) returns the following for Query A:

| Rank | Chunk | Note |
|------|-------|------|
| 1 | "Common error codes overview" | General error list, no E-207 section |
| 2 | "RX-500 troubleshooting" | Wrong model |
| 3 | "Handling device errors" | Generic guidance |
| 4 | "Error code reference — RX series" | Contains E-207, buried |
| 5 | "Charging issues guide" | Unrelated |

The correct chunk — the error code reference for the RX series, which explicitly describes E-207 — is at rank 4. If the generator sees the top 3, it answers from the wrong context and the user gets a generic "have you tried restarting" response. The failure is silent: the generator produces fluent output, grounding against the retrieved (wrong) context looks fine, but the answer is useless.

Add a BM25 index. For Query A, BM25 returns:

| Rank | Chunk | Note |
|------|-------|------|
| 1 | "Error code reference — RX series" | Contains "E-207" and "RX-400" |
| 2 | "RX-400 product manual" | Contains "RX-400" |
| 3 | "Error code reference — general" | Contains "E-207" in passing |
| 4 | "Firmware changelog Q2" | Mentions error codes |
| 5 | "RX-300 troubleshooting" | Mentions RX |

BM25 nails it. The exact chunk is at rank 1.

Now fuse with RRF, k=60:

  • "Error code reference — RX series": 1/(60+1) + 1/(60+4) ≈ 0.0164 + 0.0156 = 0.0320
  • "RX-400 product manual": 0 + 1/(60+2) ≈ 0.0161
  • "Common error codes overview": 1/(60+1) + 0 ≈ 0.0164

The RX-series error reference wins the fused ranking. The generator sees the right chunk at position 1.

For Query B — the conversational one — the situation reverses. BM25 scores poorly because the relevant document ("Intermittent shutdown during charging") shares almost no rare tokens with the query. Vector retrieval places it near the top. RRF fusion carries the vector-preferred chunk into the top of the combined list, because BM25's list for this query is noisy enough that no alternative dominates both lists.

The behavior that matters: hybrid search is not a compromise that makes both queries mediocre. For each query, whichever signal is right is preserved, because fusion does not average — it rewards consensus and forgives single-retriever failures.

Tuning the weights

RRF gets you most of the way. When you outgrow it, the question becomes how to pick weights responsibly.

Build an eval set first. Assemble 50-100 real queries covering the query styles your system actually sees — product codes, natural-language questions, multi-word phrases, support-ticket-style prose. For each, label the documents that should be retrieved. Version it alongside your code. This is the same golden-set discipline the RAGAS evaluation walkthrough describes for end-to-end RAG.

Pick a target metric. Recall at k (did any relevant document make it into the top k?) is the right starting point for a retriever feeding a generator — the generator can tolerate some noise in the candidate set, but it cannot recover from relevant documents being absent. MRR and nDCG layer on precision-of-ordering once recall is acceptable.
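The recall-at-k metric described above can be computed with a small helper. This is a hypothetical hit-rate-style function: retrieved is the ranked doc-id list per query, relevant the labeled set of doc ids per query:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top k.

    retrieved: list of ranked doc-id lists, one per query.
    relevant: list of sets of labeled relevant doc ids, one per query.
    """
    hits = sum(
        1 for docs, rel in zip(retrieved, relevant)
        if any(d in rel for d in docs[:k])
    )
    return hits / len(retrieved)
```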

Sweep, then segment. Start with RRF. Move to weighted fusion only if you see a clear ceiling. Sweep BM25:vector weights from pure-vector (0:1) through pure-BM25 (1:0) in 10 increments. Report the metric for each setting on the whole eval set — then, critically, segment by query style and check each slice. A single aggregate weight can mask that you helped natural-language queries by 5 points while regressing product-code queries by 20. Fusion that only works on average is not fusion.
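The sweep itself is mechanical once the eval set exists. A sketch, where retrieve_fused is a hypothetical hook you supply that runs both retrievers and fuses them at the given BM25 weight (the vector weight is its complement):

```python
def sweep_weights(eval_set, retrieve_fused, steps=10, k=5):
    """Sweep the BM25 weight from 0.0 (pure vector) to 1.0 (pure BM25).

    eval_set: list of (query, set_of_relevant_doc_ids) pairs.
    retrieve_fused(query, w_bm25): returns a ranked doc-id list.
    Returns {weight: recall_at_k} for each setting in the sweep.
    """
    results = {}
    for i in range(steps + 1):
        w = i / steps
        hits = sum(
            1 for query, rel in eval_set
            if any(d in rel for d in retrieve_fused(query, w)[:k])
        )
        results[w] = hits / len(eval_set)
    return results
```

Run the same sweep per query-style segment, not just on the aggregate, before committing to a weight.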

Consider query-style routing. Once you can detect query style (regex for product codes and error codes, length/question-mark heuristics for natural language, or a small classifier), you can pick weights per query rather than globally. This is where weighted fusion finally earns its complexity — a router that uses 0.8:0.2 BM25:vector for queries matching [A-Z]+-\d+ and 0.3:0.7 otherwise.
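The regex router described above is a few lines. The weights and the pattern mirror the example in the text but are illustrative, not tuned values:

```python
import re

# Matches identifier-style tokens like "RX-400" or "E-207" (illustrative pattern)
CODE_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")

def fusion_weights(query):
    """Pick (w_bm25, w_vector) per query style; thresholds are assumptions."""
    if CODE_PATTERN.search(query):
        return (0.8, 0.2)   # identifier-heavy query: lean on exact matching
    return (0.3, 0.7)       # conversational query: lean on semantics
```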

Regression-test in CI. Any retrieval parameter that matters enough to tune matters enough to regress on. Run your eval set in CI, track the metric over time, and alert when it drops. The context engineering maturity model is explicit about this: at Level 5, retrieval parameters are observed, not assumed.

Failure modes

Four anti-patterns to flag.

Fusing without inspecting the individual lists. If BM25 is returning garbage on a broad class of queries — because the index was built with no stopword list, or tokenization is splitting product codes on dashes — fusion drags the combined result down. Inspect each retriever's list on a dozen diverse queries before you tune fusion.

Weighted fusion without score normalization. The most common footgun. Raw BM25 scores and raw cosine similarities are not comparable. A "weighted" fusion where one score is in the low thousands and the other is in [0, 1] is not weighted by the weights — it is weighted by the scale. Either normalize properly or use RRF.

Calling "top-k concatenation" hybrid search. Some systems take the top 10 from BM25 and the top 10 from vector search, deduplicate, and pass 20 chunks to the generator. This is not fusion — there is no ranking signal, and the generator gets 2x the context with no quality gain. It also usually blows out the context window. Actual fusion produces a single ranked list and truncates it.

Treating hybrid search as a substitute for good chunking or a reranker. Hybrid search improves retrieval from the index you built. If chunking splits the relevant passage across two chunks, hybrid search cannot fix that. If the generator needs precision neither signal alone provides, add a reranker after fusion. Necessary, not sufficient.

Our position

Five opinionated stances.

  • Ship RRF before you tune weights. RRF has one parameter, needs no calibration, and rarely loses to weighted fusion without infrastructure you probably do not have yet. Start here. Graduate later.
  • The eval set is the real work. Picking RRF vs. weighted fusion is an afternoon. Assembling 50-100 labeled queries that represent your actual traffic is weeks. The fusion algorithm is the easy part of this problem.
  • Segment your eval set by query style. An aggregate metric will hide regressions on the 15% of queries that matter most to your users. Report per-segment and refuse to ship if any segment regresses more than a threshold.
  • One system, not three. Operationally, the hybrid search "database" that matters is the one your team can maintain. Postgres with pgvector plus a BM25 extension in the same database beats a three-system stack you cannot keep in sync. Choose for ops reality, not benchmark pedigree.
  • Hybrid search is a retrieval pattern, not an architecture. It composes with query rewriting, HyDE, reranking, and RAGAS-style evals. Strong 2026 RAG stacks are hybrid-by-default and layer the rest on top — what the agentic prompt stack calls the retrieval substrate.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
