Hybrid Retrieval in Practice

February 21, 2026 | 12 minutes

In the previous post, I walked through how BM25, vector search, and LLM analysis each solve a different retrieval problem. The takeaway was that these are layers, not rivals, and that the right approach is to orchestrate them intentionally rather than pick one.

This post puts that idea into practice. I'll build a small agentic search pipeline over The Wonderful Wizard of Oz (from Project Gutenberg) that combines BM25, vector search, LLM re-ranking, and bounded reasoning into a single hybrid retrieval flow. Along the way, I'll show how each layer contributes to the final result: exact quote search where BM25 dominates, character co-occurrence where BM25 plus structure shines, semantic theme search where vectors take over, re-ranked hybrid retrieval that gets the best of both, and LLM reasoning that stays grounded in retrieved evidence.

The thesis is that we often don't need a bigger context window. We need a better funnel.

If you want to skip ahead and try it yourself, the live demo lets you run queries against The Wizard of Oz and see how each retrieval layer contributes to the result.

The Hybrid Pipeline

Before diving into the implementation, here's the high-level shape of what we're building:

BM25 gives you precision. Vectors give you recall. Re-ranking gives you relevance. The LLM gives you interpretation. Each stage narrows the funnel so the next stage has less work to do, and the final answer is grounded in evidence rather than hallucinated from a full-document scan.
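
As a rough sketch, the whole flow is a single funnel of calls. The helper names and signatures here are placeholders for the pieces covered in the rest of this post:

// Minimal sketch of the funnel. Each helper is covered in its own section
// below, so treat the names and signatures as placeholders.
async function hybridSearch(query: string) {
  const bm25Hits = bm25Search(query);                     // lexical precision, top ~50
  const vecHits = await vectorSearch(query);              // semantic recall
  const candidates = mergeCandidates(bm25Hits, vecHits);  // de-duplicated candidate pool
  const top12 = await rerankWithLLM(query, candidates);   // LLM picks the 12 most relevant
  return answerWithLLM(query, top12);                     // bounded reasoning, with citations
}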

Loading and Chunking the Source Text

The first step is getting the text into a shape we can index. I downloaded the plain text of The Wonderful Wizard of Oz from Project Gutenberg, then split it by chapter so results feel navigable when they come back. Chapters are a natural structural boundary, and keeping that structure around pays off later for grouping and co-occurrence queries.

function splitByChapters(text: string) {
  return text.split(/\nCHAPTER\s+[IVXLC]+\b[^\n]*\n/i);
}

Each chapter gets further divided into chunks. For this demo I used fixed-size splits of roughly 1,200 to 1,500 characters, which keeps chunks small enough for embedding and large enough to carry meaningful context. But character-count splitting isn't optimal for real projects. It cuts through the middle of paragraphs and has no awareness of what's actually happening in the text, so you can end up with a chunk that contains the second half of one scene and the first half of the next.

A better approach is semantic chunking: splitting on natural boundaries in the content so each chunk represents a coherent unit of meaning. Depending on the source material, those boundaries might be paragraph breaks, section headings, topic shifts, or — for narrative text like this one — scene changes. The challenge is that natural boundaries aren't always explicitly marked, so you either write heuristics to detect them or use an LLM in a preprocessing step. Even rough semantic chunking tends to produce better retrieval units than fixed-size splits, because a coherent passage has internal context that an arbitrary 1,200-character window doesn't.
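
To make that concrete, here's a rough sketch of a boundary-aware chunker. It isn't the fixed-size splitter the demo uses; it packs whole paragraphs into chunks near the same 1,200 to 1,500 character target, so nothing starts or ends mid-paragraph:

// Rough sketch: pack whole paragraphs into ~1,400-character chunks so no
// chunk starts or ends mid-paragraph. Not the demo's fixed-size splitter.
function chunkByParagraphs(chapter: string, maxChars = 1400): string[] {
  const paragraphs = chapter.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";

  for (const p of paragraphs) {
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}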

BM25 Index with SQLite FTS5

The BM25 layer uses SQLite's FTS5 extension, same as in the previous post. You create a virtual table, insert your chunks, and query with MATCH:

CREATE VIRTUAL TABLE chunks_fts USING fts5(
  docId UNINDEXED,
  chunkId UNINDEXED,
  chapter UNINDEXED,
  text,
  tokenize='unicode61'
);
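
Populating the table is one INSERT per chunk. The demo's exact database wiring isn't shown here; this sketch assumes the better-sqlite3 binding, but any SQLite driver with FTS5 enabled works the same way:

import Database from "better-sqlite3";

// Sketch only: assumes the better-sqlite3 binding and a Chunk shape that
// mirrors the FTS5 columns above.
type Chunk = { docId: string; chunkId: string; chapter: number; text: string };

const db = new Database("oz.db");
const insert = db.prepare(
  "INSERT INTO chunks_fts (docId, chunkId, chapter, text) VALUES (?, ?, ?, ?)"
);

// Wrapping the loop in a transaction keeps bulk inserts fast.
const insertAll = db.transaction((chunks: Chunk[]) => {
  for (const c of chunks) insert.run(c.docId, c.chunkId, c.chapter, c.text);
});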

The query uses FTS5's built-in bm25() function for relevance scoring (lower, more negative scores mean better matches, which is why the ORDER BY is ascending) and snippet() to pull highlighted excerpts:

SELECT chunkId, chapter,
       bm25(chunks_fts) AS score,
       snippet(chunks_fts, 3, '[', ']', '…', 16) AS snip
FROM chunks_fts
WHERE docId = ? AND chunks_fts MATCH ?
ORDER BY score ASC
LIMIT 50;

BM25 excels at queries where you need literal precision: finding the exact phrase "Pay no attention to that man behind the curtain," locating passages where Dorothy AND Scarecrow appear together, or matching on rare character names. It's fast, deterministic, and scores consistently across runs.

The tradeoff is that BM25 can't understand meaning. If the text says "she stepped forward even though her hands were shaking" and you search for "courage," BM25 won't find it. That's where vectors come in.

Vector Search for Semantic Recall

The vector layer converts each chunk of text into a list of numbers (an "embedding") that captures its meaning. Two passages about similar topics will have similar numbers, even if they use completely different words. To find relevant chunks, you convert the search query into the same kind of number list and then compare it against every stored chunk using cosine similarity, which measures the angle between two lists of numbers. A smaller angle means the two passages are closer in meaning.

For a relatively small work like a single book, comparing the query against every chunk directly via brute-force works fine. But once you're dealing with tens of thousands of chunks or more (like in a library of documents or a large codebase), brute-force gets slow because you're computing similarity against every single embedding on every query. That's when you'd reach for an approximate nearest neighbor (ANN) index like HNSW (used by pgvector and LanceDB) or IVF, which pre-organizes the embeddings so you only need to compare against a subset. The tradeoff is a small accuracy loss because the index might miss a few borderline-relevant results, but queries go from hundreds of milliseconds to single digits.

For this demo, we'll stick with the brute-force approach:

function cosine(a: number[], b: number[]) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-8);
}
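
With that, the search itself is just a scan over every stored embedding. In this sketch, embedQuery is a placeholder for whichever embedding API you use, and stored is the array of chunk embeddings computed at index time:

// Brute-force scan: embed the query, score every chunk, keep the top k.
// `embedQuery` is a placeholder for your embedding call.
async function vectorSearch(
  query: string,
  stored: { chunkId: string; chapter: number; text: string; embedding: number[] }[],
  k = 30
) {
  const qVec = await embedQuery(query);
  return stored
    .map(c => ({ ...c, score: cosine(qVec, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}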

Vector search excels at the queries BM25 can't handle, such as "courage in the face of fear," "leadership under pressure," or "home as emotional anchor." These are conceptual queries where the exact words might never appear in the text, but the ideas do.

Like everything, though, there's a tradeoff. Vector search can return things that are semantically adjacent but not actually relevant. It's good at recall, but its precision is lower. That's why you don't stop here.

Merging and De-duplicating Candidates

Once both retrieval paths return their candidates, we merge them into a single list and de-duplicate by chunk ID. If the same chunk appeared in both result sets, that's a strong signal, but we only need it once:

function mergeCandidates(bm25Hits, vecHits) {
  const map = new Map();

  for (const h of bm25Hits)
    map.set(h.chunkId, { ...h, source: "bm25" });

  for (const h of vecHits) {
    const existing = map.get(h.chunkId);
    // Showing up in both result sets is a strong signal, so record it
    // instead of overwriting the bm25 provenance.
    map.set(h.chunkId, { ...(existing ?? h), source: existing ? "both" : "vector" });
  }

  return [...map.values()];
}

At this point we might have 60 to 80 unique candidates. That's too many to pass directly to an LLM; the more text you ask it to reason over, the slower and less accurate it gets. But it's the right size for a re-ranking step.

A useful rule of thumb: if your merged candidate set is under 15 or so, you can probably skip re-ranking and pass them straight to the LLM for reasoning. The noise-to-signal ratio is low enough that the LLM can sort through it. Once you're above 20 to 30 candidates, re-ranking starts paying for itself: you're filtering out the marginal results that would otherwise dilute the LLM's attention and eat up context window. Above 50, re-ranking is basically mandatory. Without it, the LLM either ignores half the passages or gets confused trying to weigh too many competing pieces of evidence. The cost of a re-ranking call (a few hundred milliseconds) is cheap compared to the cost of a bad answer from an overwhelmed reasoning step.
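
Expressed as code, the gate is nothing more than a threshold check; the numbers are the rules of thumb above, not tuned values:

// Rule-of-thumb gate: small pools go straight to reasoning, mid-sized pools
// benefit from re-ranking, large pools require it. Thresholds are rough.
function rerankPolicy(candidateCount: number): "skip" | "recommended" | "required" {
  if (candidateCount <= 15) return "skip";
  if (candidateCount <= 50) return "recommended";
  return "required";
}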

Re-ranking with an LLM

Using the LLM as a re-ranker is fundamentally different from using it as a searcher that scans a full document. The LLM never sees the whole document. It only sees the candidate passages from the merge step, and its job is to rank the 12 most relevant ones:

const prompt = `
You are a re-ranker.

Rank the top 12 passages for this query.
Return JSON: { ranked: [{chunkId, reason}] }

Query: ${query}

Candidates:
${candidates.map(c => `(${c.chunkId}) ${c.text}`).join("\n\n")}
`;
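
Sending the prompt and applying the result is a few lines. In this sketch, callLLM is a placeholder for whichever chat-completion API you're using, buildRerankPrompt wraps the template above, and a production version would also guard against malformed JSON:

// `callLLM` and `buildRerankPrompt` are placeholders: the model call and
// the prompt template shown above.
async function rerankWithLLM(query, candidates) {
  const raw = await callLLM(buildRerankPrompt(query, candidates));
  const { ranked } = JSON.parse(raw);

  // Map the model's ranking back onto real chunks, dropping any chunkId it
  // invented or that fell outside the candidate set.
  const byId = new Map(candidates.map(c => [c.chunkId, c]));
  return ranked
    .map(r => byId.get(r.chunkId))
    .filter(Boolean)
    .slice(0, 12);
}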

Re-ranking fixes several problems at once. It filters out vector noise: passages that are semantically adjacent but not actually about the query. It corrects BM25's over-literal ranking, where a passage with three mentions of a keyword outranks a more relevant passage with only one. And it handles weak co-mention hits, where a term appears incidentally rather than substantively.

This single step tends to have the biggest impact on result quality in the entire pipeline.

Bounded Reasoning for the Final Answer

Now the LLM reasons over only the top 12 passages. The scope is bounded, citations are enforced, and the evidence is explicit:

const analysisPrompt = `
Answer the question using only the passages below.
Cite chunkId in parentheses.

Question: ${query}

Passages:
${top12.map(c => `(${c.chunkId}) ${c.text}`).join("\n\n")}
`;

This is where hallucinations drop. The LLM can't invent passages that aren't in its context. It can't drift into general knowledge about The Wizard of Oz because its only input is the retrieved evidence. And because citations are enforced, you can verify every claim against a specific chunk.

Example Queries

To see how the layers interact in practice, here are four queries that exercise different parts of the pipeline.

Exact Quote

If you search for "Pay no attention to that man behind the curtain", BM25 dominates. This is a literal phrase match, and BM25 finds it immediately with high confidence. Vector search might retrieve passages about deception or hidden authority, which are thematically related but not the quote itself.

Character Co-Occurrence

For Dorothy AND Scarecrow, BM25 works well. BM25 finds chunks where both names appear literally. But raw BM25 results are a flat list of chunks — you might get 15 hits scattered across the book with no sense of which parts of the story actually feature Dorothy and the Scarecrow together. If you group those hits by the chapter they came from (using the chapter column from the FTS5 schema), you can see that chapters 3 through 5 have dense clusters of co-occurrence while chapter 12 has a single passing mention. That distinction between "these characters share extended scenes" and "one character briefly names the other" is hard to get from a flat ranked list.
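
The grouping itself is cheap, because each hit already carries its chapter from the FTS5 query. A sketch:

// Group BM25 hits by the chapter column returned with each row, so dense
// clusters of co-occurrence stand out from single passing mentions.
function groupByChapter(hits: { chapter: number; chunkId: string }[]) {
  const byChapter = new Map<number, string[]>();
  for (const h of hits) {
    const list = byChapter.get(h.chapter) ?? [];
    list.push(h.chunkId);
    byChapter.set(h.chapter, list);
  }
  return [...byChapter.entries()]
    .map(([chapter, chunkIds]) => ({ chapter, count: chunkIds.length, chunkIds }))
    .sort((a, b) => b.count - a.count);
}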

Semantic Theme

For courage in the face of fear, vectors dominate. The book might never use the word "courage" in the passage you're looking for, but vector search surfaces passages about characters acting bravely despite being afraid. BM25 would miss these entirely.

Hard Mode

For Where does Dorothy show leadership?, hybrid retrieval plus re-ranking wins. BM25 finds passages mentioning Dorothy. Vectors find passages about leadership-like behavior. But neither alone can reliably surface the passages where Dorothy specifically demonstrates leadership. Re-ranking looks at the merged candidates and identifies the ones that actually answer the question.

Agent Router Logic

In a more complete system, the agent can classify the query type and adjust retrieval weights accordingly:

function classifyQuery(query: string) {
  if (query.includes('"')) return "quote";
  if (query.match(/\b(AND|OR)\b/)) return "boolean";
  if (query.length < 5) return "keyword";
  return "conceptual";
}

For quote queries, lean heavily on BM25. For conceptual queries, lean on vectors. For boolean queries, BM25 handles the strict matching. For everything else, run the full hybrid pipeline with balanced weights. This classification doesn't need to be perfect. It just needs to avoid the worst mismatches, like running a pure vector search on an exact quote lookup.
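
One way to wire the classification into the pipeline is a small weight table. These numbers are illustrative, not tuned; the only goal is to avoid the worst mismatches:

// Illustrative weights for blending each retriever's contribution.
// Quote and boolean queries lean on BM25, conceptual queries lean on
// vectors, and everything else runs the full hybrid with balanced weights.
function retrievalWeights(kind: string) {
  switch (kind) {
    case "quote":      return { bm25: 1.0, vector: 0.1 };
    case "boolean":    return { bm25: 1.0, vector: 0.2 };
    case "conceptual": return { bm25: 0.3, vector: 1.0 };
    default:           return { bm25: 0.6, vector: 0.6 };
  }
}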

Latency Profile

Here's what the latency funnel looks like in practice: BM25 and vector search together run in under 100 milliseconds, while re-ranking and final analysis take 600 to 1,600 milliseconds.

The key insight is that retrieval is cheap and reasoning is expensive. That's fine, because the LLM is working over 12 passages instead of the entire book.

If you skip the retrieval funnel and pass the full document to the LLM, reasoning time scales with document length on every single query. With the funnel, reasoning time stays constant regardless of how large the source document is.

Why This Matters for Agents

Without hybrid retrieval, we're stuck choosing between bad options. LLM-only search has to scan everything, which is slow and expensive. Vector-only search returns semantic soup where half the results are thematically adjacent but not actually relevant. BM25-only search misses conceptual meaning entirely.

Hybrid retrieval gives you precision, recall, relevance, and stability. Agents become more reliable when their evidence is structured. When an agent can trust that the passages it's reasoning over are actually relevant, it makes fewer reasoning errors. It's less likely to hallucinate, less likely to contradict the source, and less likely to miss important context. Structured evidence also makes the answer easier to verify, because every claim can be checked against a specific chunk of the document.

This is not redundancy. It's division of labor. Each layer does one job well, and the layers compose into a system that's more reliable than any single approach.

Try it Yourself

You can try the pipeline yourself in the live demo.

Next Up

This post focused on building a hybrid retrieval pipeline where all the content is indexed upfront. But in a real application, documents change. New content gets added, old content gets updated, and you don't want to re-index everything from scratch every time.

In the next post, I'll cover incremental indexing: how to keep your BM25 and vector indexes in sync as documents evolve, without re-building the entire index on every change. That's where hybrid retrieval starts to feel like a production system rather than a demo.