Infrastructure

Hybrid Search in Production: BM25 + Dense Retrieval

Balys Kriksciunas · 7 min read
#ai#infrastructure#hybrid-search#bm25#rag#reranker#retrieval#vector-database


Ask ten RAG builders what retrieval strategy they use and nine will say “we embed chunks and do vector search.” That works well enough for demos. In production, dense-only retrieval leaves 20–40% of recall on the floor, especially on queries with proper nouns, version numbers, exact phrases, or specialized vocabulary.

The fix — which every serious RAG system eventually adopts — is hybrid search: combine keyword-based BM25 with dense vectors, then rerank. This post walks through why, how, and the pitfalls.


Why Dense Alone Misses Things

Dense embeddings capture semantic similarity. “How do I fix a broken login?” and “account access issues resolution” embed close together even though they share few words. That’s the magic.

The failure modes:

1. Proper nouns and rare terms. A query for “error code B2-4471” should surface the doc mentioning “error B2-4471 workaround,” but embedding models treat the identifier as noise — both strings probably embed closer to generic error-related content than to each other.

2. Version numbers and identifiers. “React 18.2.0” vs “React 18.3.1” — embeddings barely distinguish these, but your docs absolutely must.

3. Short queries. A 2-word query has little semantic content to work with. Embeddings are noisy on short inputs.

4. Out-of-domain vocabulary. An embedding model trained on general text may not separate “torque converter stall speed” from “torque converter slip.” Domain-specific terms benefit from exact matching.

5. Multi-intent queries. “How do I install Python AND set up a virtual env?” — dense search picks one intent; keyword search respects both.

BM25 handles all of these naturally. It’s lexical matching done well: terms, term frequency, inverse document frequency, length normalization.
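To make those ingredients concrete, here is a minimal Okapi BM25 scorer over pre-tokenized documents — a sketch for intuition only (production systems use Lucene/Elasticsearch or a library; `k1` and `b` are the standard parameters):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query_terms with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average doc length, for normalization
    df = Counter()                          # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            # inverse document frequency: rare terms weigh more
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term frequency, saturated by k1 and length-normalized by b
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

docs = [["error", "code", "b2-4471", "workaround"],
        ["generic", "error", "handling", "guide"],
        ["login", "troubleshooting"]]
print(bm25_scores(["error", "b2-4471"], docs))
```

Note how the rare identifier "b2-4471" dominates the score for the first document — exactly the behavior dense embeddings struggle with.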


Why BM25 Alone Misses Things

Flip side:

1. Paraphrasing. “Subscribe to our newsletter” vs “join our mailing list” — no word overlap, BM25 gives zero.

2. Conceptual queries. “Best way to handle user authentication” won’t match “JWT tokens with refresh flow” without semantic understanding.

3. Natural-language queries in general. Modern users phrase queries as full questions; BM25 was designed for keyword-style queries.

Dense handles all of these.

Conclusion: dense and BM25 are complementary. Good retrieval uses both.


The Production Recipe

Three stages:

Stage 1: Parallel retrieval

Run BM25 and dense search against the same corpus, independently. Get top-K from each (typically K=100).

Stage 2: Fusion

Combine the two ranked lists into one. Most common method: Reciprocal Rank Fusion (RRF), which combines ranks (not scores):

RRF_score(d) = sum over retrievers r of (1 / (k + rank_r(d)))

Typical k=60. Documents in both lists get boosted; documents in one list survive; the top of the combined list is usually better than either individual list.

RRF is simple, parameter-free, and remarkably effective. More sophisticated methods (CombSUM, CombMNZ, learned fusion) exist but RRF is the default we recommend.
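RRF is short enough to write out in full. A minimal sketch, assuming each retriever hands back a ranked list of document ids:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: combine ranked lists by rank, not score.

    Each input is a list of doc ids, best first. A doc appearing in
    several lists accumulates 1/(k + rank) from each.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]
dense_top = ["d1", "d5", "d3"]
print(rrf_fuse([bm25_top, dense_top]))  # d1 and d3 appear in both lists, so they rise
```

Because ranks are used instead of raw scores, there is no need to normalize BM25 and cosine scores onto a common scale — which is exactly why RRF is so robust.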

Stage 3: Reranking

Take the top ~50 from fused results. Run them through a cross-encoder reranker that scores each (query, document) pair with a full attention pass. Keep top 3–10 for the LLM.

Reranking is the single most impactful addition to a RAG stack after fusion. Quality delta is usually larger than any individual retrieval improvement.

Reranker options in 2025:

Cohere Rerank — managed API, no infrastructure to run.
bge-reranker — open weights, self-hostable.

Both appear in the benchmark table below. Rerankers are small enough to run on a single GPU (or even CPU for small volumes). Latency is typically 20–80ms for top-50 scoring.
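The rerank stage itself is just score-and-sort. A sketch with a pluggable `score_fn`, since the actual scorer depends on your stack (in production it would be a cross-encoder, e.g. sentence-transformers' `CrossEncoder(...).predict` over (query, doc) pairs); the word-overlap function below is a toy stand-in to keep the example self-contained:

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Score every (query, doc) pair and keep the best top_n.

    score_fn(query, doc) -> float; plug in a cross-encoder in production.
    """
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def overlap(query, doc):
    # toy stand-in for a cross-encoder: fraction of query words in the doc
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

print(rerank("configure webhooks",
             ["configure webhooks for events", "billing faq"],
             overlap, top_n=1))
```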


What The Numbers Look Like

Our internal benchmark on an enterprise technical docs corpus, 250K chunks, 1K held-out queries:

| Setup | Recall@10 | NDCG@10 |
| --- | --- | --- |
| Dense only (bge-large) | 0.67 | 0.52 |
| BM25 only | 0.58 | 0.44 |
| Hybrid (RRF fusion) | 0.78 | 0.61 |
| Hybrid + Cohere rerank top 50 | 0.84 | 0.72 |
| Hybrid + bge-reranker top 50 | 0.83 | 0.71 |

Hybrid beats either method alone by 10–20 points. Reranking adds another 6–11. Together, that is a 25–40% relative improvement over pure dense, depending on the metric.

For consumer search workloads, gains are typically smaller but still positive. For specialized corpora (law, medicine, code, technical docs), hybrid + rerank is essentially required.


Implementation Options

Option 1: Two databases

Run an Elasticsearch or OpenSearch cluster for BM25. Run a separate vector DB (Qdrant, Pinecone) for dense. Application queries both in parallel, fuses, reranks.

Works, but it is operationally heavy: two clusters to run, two query latencies to manage, and fusion logic living in the application.
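The application-side fan-out for the two-database setup can be as simple as two threads, so end-to-end latency is the max of the two backends rather than their sum. A sketch where `bm25_search` and `dense_search` are placeholders for your two clients:

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_retrieve(query, bm25_search, dense_search, k=100):
    """Query both backends in parallel; return both top-K lists for fusion.

    bm25_search / dense_search are placeholders: callables taking
    (query, k) and returning a ranked list of doc ids.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query, k)
        dense_future = pool.submit(dense_search, query, k)
        # .result() blocks, so total wait ~ max(bm25 latency, dense latency)
        return bm25_future.result(), dense_future.result()
```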

Option 2: Unified database

Some vector databases now support BM25 (or BM25-style lexical scoring) alongside dense vectors — Weaviate and Milvus are established options, and Postgres gets close with pgvector plus built-in full-text search.

Unified is operationally simpler. Use it unless you have a strong reason not to.

Option 3: Use a managed service

Several managed search offerings bundle hybrid retrieval out of the box (Azure AI Search and Elastic Cloud are two well-known examples). Managed is the fastest path. Cost scales with corpus size and query volume.


Hybrid in Weaviate (Example)

Weaviate’s hybrid query (Python client v4) is as simple as it should be:

from weaviate.classes.query import MetadataQuery

response = client.collections.get("Docs").query.hybrid(
    query="how to configure webhooks",
    alpha=0.5,  # 0 = pure BM25, 1 = pure vector, 0.5 = even
    limit=20,
    return_metadata=MetadataQuery(score=True),
)

alpha tunes the weighting. Start at 0.5; tune based on eval performance.


Hybrid in pgvector + Postgres FTS

Postgres has solid full-text search via tsvector. Note that ts_rank_cd is a cover-density ranking, not true BM25, but it is a reasonable lexical stand-in. Combine with pgvector:

WITH dense AS (
  SELECT id, content, 1 - (embedding <=> $1) AS dense_score
  FROM docs ORDER BY embedding <=> $1 LIMIT 100
),
lex AS (
  SELECT id, content, ts_rank_cd(content_tsv, plainto_tsquery($2)) AS bm25_score
  FROM docs WHERE content_tsv @@ plainto_tsquery($2)
  ORDER BY bm25_score DESC LIMIT 100
),
fused AS (
  SELECT id, SUM(1.0 / (60 + rnk)) AS rrf_score
  FROM (
    SELECT id, ROW_NUMBER() OVER (ORDER BY dense_score DESC) AS rnk FROM dense
    UNION ALL
    SELECT id, ROW_NUMBER() OVER (ORDER BY bm25_score DESC) AS rnk FROM lex
  ) r
  GROUP BY id
)
SELECT d.id, d.content FROM fused f
JOIN docs d ON d.id = f.id
ORDER BY f.rrf_score DESC LIMIT 20;

Not pretty. Works. For moderate workloads, staying in Postgres is worth the ugliness.

See pgvector at Scale.


Tuning Alpha and K

Two knobs that matter:

Alpha (BM25 vs dense weight). Start at 0.5. Measure eval performance. If your domain is keyword-heavy (legal, technical docs), bias toward BM25 (alpha ~0.3). If paraphrase-heavy (customer support, natural questions), bias toward dense (alpha ~0.7).

Top-K at each stage. Retrieve 100 from each, fuse to 50, rerank to 5–10. More candidates at retrieval means better recall but higher rerank cost; fewer candidates means lower latency but a higher chance of missing relevant documents.

Tune on your eval set. Our defaults work for most cases; don’t over-tune.
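Tuning alpha is just a small grid search against labeled queries. A sketch, where `search_fn` and the relevance labels are stand-ins for your own eval harness:

```python
def tune_alpha(queries, relevant, search_fn, alphas=(0.3, 0.5, 0.7)):
    """Pick the alpha with the best mean recall@10 on a labeled eval set.

    search_fn(query, alpha) -> ranked doc ids (your hybrid search);
    relevant[q] is the set of known-good doc ids for query q.
    """
    def recall_at_10(q, alpha):
        hits = set(search_fn(q, alpha)[:10]) & relevant[q]
        return len(hits) / len(relevant[q])

    return max(alphas,
               key=lambda a: sum(recall_at_10(q, a) for q in queries) / len(queries))
```

The same loop extends naturally to tuning the retrieval K or the rerank depth — just swap the parameter being swept.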


Metadata Filtering

Hybrid search with metadata filters is a common real-world need: “find docs matching this query, only from this tenant, only from last 30 days.”

All unified databases (Weaviate, Milvus, etc.) support metadata filtering in hybrid queries. The filter applies before scoring, so it doesn’t compromise recall.

Be careful: highly selective filters (matching <1% of docs) can badly degrade ANN recall, since the index must discard most of the candidates it visits. Over-retrieve in those cases. See pgvector’s filtering section for general patterns.


When NOT To Use Hybrid

A few cases where hybrid isn’t worth the complexity:

1. Tiny corpora. Under ~5K docs, a well-tuned dense search plus a reranker is often enough. Keep things simple.

2. Perfectly uniform natural language queries. If your users type clean questions and never proper nouns, pure dense can be competitive.

3. Multilingual corpora where BM25 is hard to tune. Dense embeddings from multilingual models (e5, BGE-M3) often outperform BM25 here. Tokenization for many languages is a separate fight.

4. When latency is hyper-critical. A 20ms rerank pass is a non-starter for sub-50ms P95 workloads. Skip rerank; consider skipping hybrid.

For 90% of production RAG, hybrid + rerank is the right default.


Evaluating Retrieval Quality

Before tuning retrieval, you need evals:

Start with synthetic evals. Build a set of ~100 query/chunk pairs where the “right answer” chunk is known. Measure recall@10 and NDCG@10 across retrieval configs. See Model Evals in Production.
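Both metrics fit in a few lines for the binary-relevance case (one or more known-good chunks per query):

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance NDCG@k: discounted gain of hits over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

retrieved = ["d1", "d9", "d3", "d4"]
relevant = {"d1", "d3"}
print(recall_at_k(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant))
```

Recall tells you whether the right chunk made it into the candidate set at all; NDCG additionally rewards putting it near the top, which matters once a reranker or LLM context budget is in play.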


The Short Version

Run BM25 and dense retrieval in parallel (top-100 each), fuse with RRF, rerank the top ~50 with a cross-encoder, and pass the top 3–10 chunks to the LLM. Tune alpha and the stage sizes on your own eval set. For most production RAG, this is the right default.


Further Reading

pgvector at Scale
Model Evals in Production

Tuning a RAG retrieval stack? Reach out — we can audit and improve retrieval quality quickly.
