I Added Hybrid Search. It Made Things Worse.

An experiment with BM25 + vector search fusion on 624k arXiv abstracts — what happened, why, and when hybrid retrieval actually helps.

A common recommendation for production search systems: don’t rely on dense vector search alone. Combine it with BM25 for lexical matching, fused via Reciprocal Rank Fusion. The intuition is that semantic and lexical signals are complementary — one captures meaning, the other captures exact terminology. This should improve recall on queries containing specific technical terms that may not embed distinctively.

I had two concrete retrieval failures that made this seem worth testing.

The failures that motivated it

During conversational evaluation, two topics consistently underperformed.

Retrieval evaluation — queries about “Recall@k vs MRR” repeatedly pulled in papers from adjacent fields (recommendation systems, dialog systems, agent memory) that use these metrics without being about them. Retrieval quality for that topic: 1 out of 3. The hypothesis was that BM25 on exact terms like “Mean Reciprocal Rank” would anchor results to papers in the information retrieval literature specifically.

Mixture of Experts — top retrieval scores for relevant queries hovered at 0.709–0.718 (the lowest of any topic), and the top results skewed toward application papers rather than the foundational MoE architecture work. BM25 on “mixture of experts” and “sparse gating” seemed like it should help pull the architecture papers up.

Implementation

I used Reciprocal Rank Fusion over two ranked lists: k-NN results and BM25 results. RRF was chosen over OpenSearch’s native hybrid query for portability — it requires no search pipeline configuration. The formula for each document’s fused score:

score(d) = Σ_i  1 / (60 + rank_i(d))

where 60 is the standard k constant from Cormack et al. (2009) and the sum is over the two ranked lists.
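The formula can be sketched in a few lines of Python. This is a minimal illustration, not the code from this project: the function name rrf_fuse, the document IDs, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_k=None):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k] if top_k is not None else fused

# Hypothetical ranked lists, best match first.
knn_ids = ["a", "b", "c"]
bm25_ids = ["c", "a", "d"]
fused = rrf_fuse([knn_ids, bm25_ids])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists ("a" and "c" here) accumulate two terms and rise above documents that appear in only one, regardless of the raw similarity or BM25 scores, which RRF never looks at.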

No re-indexing was required. The title and abstract fields were already mapped as type: "text" in the original index schema, so BM25 scoring was available without any index changes. The implementation added a bm25_search() function and a hybrid_search() wrapper that fetches 3× the requested top-k from each source, fuses the two lists, and returns the top k.
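A rough sketch of what such a wrapper could look like follows. The signatures, the use of a multi_match query, and the idea of passing the two search functions in as callables are assumptions for illustration; only the field names, the 3× over-fetch, and the RRF constant come from the description above.

```python
from typing import Callable, Dict, List

# Assumed shape of a search function: (query, size) -> ranked document IDs.
SearchFn = Callable[[str, int], List[str]]

def bm25_query_body(query: str, size: int) -> Dict:
    """Build a lexical query over both text fields (BM25 is the default scorer)."""
    return {
        "size": size,
        "query": {"multi_match": {"query": query, "fields": ["title", "abstract"]}},
    }

def hybrid_search(query: str, knn_search: SearchFn, bm25_search: SearchFn,
                  top_k: int = 10, overfetch: int = 3, rrf_k: int = 60) -> List[str]:
    """Over-fetch from both sources, fuse with RRF, return the top_k doc IDs."""
    n = top_k * overfetch
    scores: Dict[str, float] = {}
    for search in (knn_search, bm25_search):
        for rank, doc_id in enumerate(search(query, n), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_k]

# Stub backends standing in for real index calls (hypothetical IDs).
def fake_knn(query: str, size: int) -> List[str]:
    return ["p1", "p2", "p3", "p4", "p5", "p6"][:size]

def fake_bm25(query: str, size: int) -> List[str]:
    return ["p9", "p2", "p8", "p7", "p1", "p6"][:size]

results = hybrid_search("mixture of experts", fake_knn, fake_bm25, top_k=2)
```

Keeping the backends injectable means the fusion logic can be exercised against stubs like these without a running cluster.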

Measuring it

Before running the experiment I built a labeled evaluation set: 10 standalone retrieval queries (one per topic), each with 15 k-NN candidates rated relevant or not relevant. Of the 150 rated candidates, 144 were relevant.

Metrics: Precision@k and Recall@k at k = 3, 5, 10, plus MRR (Mean Reciprocal Rank: the reciprocal of the rank of the first relevant result, averaged over queries). The evaluation script runs against the fixed ground-truth set in about 30 seconds.
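For reference, these metrics reduce to a few short functions. The sketch below uses the standard definitions; the function names and the example IDs are illustrative and not taken from the actual evaluation script.

```python
from typing import List, Sequence, Set, Tuple

def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (divides by k, not by hits returned)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_results, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical query: the first relevant paper shows up at rank 3.
ranked = ["x1", "x2", "r1", "r2", "x3"]
relevant = {"r1", "r2", "r3"}
p3 = precision_at_k(ranked, relevant, 3)
r5 = recall_at_k(ranked, relevant, 5)
rr = reciprocal_rank(ranked, relevant)
```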

The results

k-NN baseline

Metric | P@3 | P@5 | P@10 | MRR
Average | 1.000 | 0.980 | 0.990 | 1.000

Hybrid (k-NN + BM25 via RRF)

Metric | P@3 | P@5 | P@10 | MRR
Average | 0.767 | 0.680 | 0.610 | 0.925

P@5 dropped from 0.98 to 0.68. P@10 dropped from 0.99 to 0.61. MRR dropped from 1.0 to 0.925. The experiment was supposed to fix two specific topics. Instead it regressed nearly everything.

Why it failed

The retrieval evaluation topic went from P@3=1.0 / MRR=1.0 to P@3=0.0 / MRR=0.25, meaning the first relevant paper fell to rank 4. It was the largest single-topic regression, and the most ironic one.

The query was: "information retrieval evaluation Mean Reciprocal Rank Recall Precision NDCG". BM25 matched hundreds of papers that use these terms as evaluation metrics for something else — recommendation ranking, dialog systems, agent memory retrieval. Every paper about ranking anything mentions “Precision” and “Recall.” RRF promoted those matches into the top-3 positions, displacing the correct papers that k-NN had already surfaced.
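The arithmetic of RRF shows how this displacement can happen. The sketch below assumes a candidate depth of 30 per list (3× over-fetch for a top-10 request); the specific ranks are hypothetical, not measured values from this experiment.

```python
RRF_K = 60
DEPTH = 30  # assumed candidates per list: 3x over-fetch for top-10

def rrf_score(*ranks: int) -> float:
    """Fused score for a document given its rank in each list it appears in."""
    return sum(1.0 / (RRF_K + r) for r in ranks)

# Best possible score for a document found by only one retriever.
best_single_list = rrf_score(1)
# Worst possible score for a document found by both retrievers.
worst_in_both = rrf_score(DEPTH, DEPTH)

print(f"{best_single_list:.4f}  {worst_in_both:.4f}")  # 0.0164  0.0222
```

Under these assumptions, any document that appears in both candidate lists outranks every document that appears in only one, even the k-NN rank-1 result. A paper that k-NN placed near the bottom of its candidates but that BM25 also matched can therefore leapfrog the papers k-NN ranked highest.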

BM25 has no notion of topic primacy. It scores term overlap, so an abstract that mentions “Mean Reciprocal Rank” once as a reported metric can score nearly as well as the abstract of a paper whose entire contribution is about MRR as an evaluation criterion. Dense search was already doing the right thing on these semantically rich queries. Adding lexical signal introduced noise.
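How much extra credit BM25 gives for repeated mentions is governed by term-frequency saturation. The sketch below evaluates the per-term frequency factor of the classic BM25 formula with k1 = 1.2 and b = 0.75 (the usual defaults); the document lengths are made-up values, and the IDF factor is omitted because it is the same for every document matching the term.

```python
def bm25_tf_component(tf: int, doc_len: int, avg_doc_len: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency factor of classic BM25 (without the IDF multiplier)."""
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + length_norm)

# Hypothetical abstracts of average length (150 tokens).
mentioned_once = bm25_tf_component(tf=1, doc_len=150, avg_doc_len=150.0)
mentioned_five_times = bm25_tf_component(tf=5, doc_len=150, avg_doc_len=150.0)
upper_bound = 1.2 + 1.0  # the factor approaches k1 + 1 as tf grows
```

With these numbers a single mention already earns 1.0 out of a possible 2.2, and five mentions earn about 1.77. A short abstract leaves little room for a term to repeat, so a paper that is about a metric and a paper that merely reports it end up with similar lexical scores.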

The Mixture of Experts failure was different and clarifying. BM25 didn’t help there either, but the problem wasn’t the retrieval method: the corpus is simply thin on foundational MoE architecture papers. Lexical matching cannot fix a corpus coverage gap. My original diagnosis of that failure was wrong.

The one improvement

For the attention mechanisms topic, P@5 improved from 0.800 to 1.000. Two papers that k-NN missed at ranks 4–5 were surfaced by BM25 through exact phrase matching. That’s the scenario hybrid retrieval is designed for. P@10 still regressed (0.900 → 0.700), meaning BM25 pushed irrelevant papers into the broader result set even as it improved the top-5.

When hybrid search would actually help

The results make the principle concrete: hybrid retrieval is most valuable when semantic and lexical retrieval contribute complementary signal. On this corpus, k-NN over ML paper abstracts was already well aligned with the query distribution, plausibly because the embedding model (Titan Embeddings) was trained on text that overlaps heavily with arXiv. BM25 mostly introduced lexical noise.

Hybrid fusion would be worth trying again for:

  • Queries containing named entities — specific model names (GPT-4o, Llama 3.1), dataset names, author names — things that may not embed discriminatively
  • Corpora with short documents where semantic signal is weak
  • Domains where the embedding model is undertrained on the target vocabulary

A more targeted approach for this system would be named-entity detection on the query, with BM25 applied selectively on queries that contain model identifiers or proper nouns. Full hybrid fusion on all queries isn’t the right tool here.
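One way such routing could look is sketched below. It is a deliberately crude heuristic standing in for real named-entity detection, and it has not been evaluated on this corpus: the function names, the regular expression, and the routing labels are all hypothetical.

```python
import re

VERSION = re.compile(r"\d+(\.\d+)*")  # e.g. "3", "3.1"
PUNCT = ".,;:!?()\"'"

def has_model_identifier(query: str) -> bool:
    """Rough check for tokens that look like model or dataset identifiers."""
    tokens = [t.strip(PUNCT) for t in query.split()]
    for i, word in enumerate(tokens):
        has_alpha = any(c.isalpha() for c in word)
        has_digit = any(c.isdigit() for c in word)
        if has_alpha and has_digit:
            return True  # letters and digits in one token, e.g. "GPT-4o"
        next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
        if word[:1].isupper() and VERSION.fullmatch(next_word):
            return True  # capitalized name plus version number, e.g. "Llama 3.1"
    return False

def choose_retriever(query: str) -> str:
    """Route identifier-bearing queries to hybrid search, everything else to k-NN."""
    return "hybrid" if has_model_identifier(query) else "knn"

route_a = choose_retriever("GPT-4o vs Llama 3.1 on long-context benchmarks")
route_b = choose_retriever(
    "information retrieval evaluation Mean Reciprocal Rank Recall Precision NDCG"
)
```

The heuristic will misfire on inputs such as "P@5" or "In 2023", so a production version would need a proper entity tagger or a curated list of identifiers, and the routing itself would need to go through the same labeled evaluation before shipping.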

The broader point

The more useful thing this experiment validated was the evaluation infrastructure itself. Having a labeled ground-truth set and a 30-second evaluation run meant the decision was empirical rather than intuitive. Without it, I might have shipped hybrid search, observed no obvious degradation in casual testing, and left a worse retrieval system in production. The hypothesis seemed reasonable before the data; the data was clear.

The hybrid endpoint (/search-hybrid) remains deployed for future comparison experiments. The chat pipeline reverted to k-NN only.