Retrieval Is Not One Thing

February 17, 2026 | 11 minutes

I built a small app that compares three ways of searching a document: BM25 (lexical search), vector search (semantic embeddings), and Gemini 3 Flash used as a full-context search engine.

The point isn't that one of these "wins." The point is that they solve different problems. In an agentic workflow, they complement each other. When you treat them as interchangeable, you get weird results. When you orchestrate them intentionally, you get a more reliable system.

BM25: Literal, Deterministic, Fast

BM25 answers the question: where in this document do these words appear?

It doesn't understand meaning and it doesn't paraphrase. It matches tokens and ranks chunks by how well those tokens match.

BM25 works well for exact quotes, character names, repeated phrases, consistency audits, and "find the line where..." type searches.

For example, if you search for "we cannot afford this", BM25 finds the literal phrase quickly (within milliseconds) and ranks matches consistently.

Vector Search: Meaning Without Words

Vector search answers a different question: what passages are semantically similar to this idea?

It doesn't need literal word overlap. If you search for courage in the face of fear, the document might never use the word "courage," but it could contain something like "She stepped forward even though her hands were shaking."

Vector search improves recall because it finds ideas, not just words. The tradeoff is that it can retrieve things that are "kind of" similar but not actually relevant.
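
"Semantically similar" here means the embedding vectors point in roughly the same direction, which is usually measured with cosine similarity. A minimal sketch for intuition (the vector store used later in this post computes this for you):

// Cosine similarity between two embedding vectors:
// 1.0 = same direction, ~0 = unrelated, -1.0 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}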

Gemini as Search: It Works, but It's Doing Two Jobs at Once

You can also use Gemini 3 Flash (or a similar LLM) as a search engine by passing it the full document and asking it to find relevant excerpts:

const prompt = `
Find excerpts relevant to "${query}".
Return verbatim quotes.
Document:
${doc}
`;

This works, but it re-reads the entire document on every query, which is slower and more expensive. It can also omit literal matches or return "verbatim" quotes that aren't actually verbatim. Gemini is a reasoning engine, and treating it like a search engine increases the surface area for failure.

Concrete Comparisons

Here are a few examples that show how these approaches differ in practice.

Exact Quotes

If you search for "we cannot afford this", BM25 is the best option. It's built for literal matching and ranking. Vector search can retrieve it, but it might rank paraphrases higher. Gemini might quote correctly, but it can also paraphrase or miss one instance.

Characters Appearing Together in a Passage

If you search for Dorothy AND Scarecrow, BM25 is the strongest choice for finding literal co-occurrence. It excels when the names appear literally and you can use AND semantics. Vector search isn't designed for "both terms present" constraints. Gemini can reason about scenes where both characters are present, but it has to scan the whole document unless you retrieve first.

A practical upgrade is to group results by scene or structural node, so "co-occurrence" becomes "both appear in the same scene node" instead of just the same chunk.
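
A minimal sketch of that grouping, assuming each chunk is tagged with a hypothetical sceneId at indexing time (the indexing code later in this post stores only docId and text, so this field is an extension, not part of the app's schema):

interface SceneChunk { sceneId: string; text: string; }

// Scenes in which both names appear literally, in any of the scene's chunks.
function scenesWithBoth(chunks: SceneChunk[], a: string, b: string): string[] {
  const byScene = new Map<string, string>();
  for (const c of chunks) {
    byScene.set(c.sceneId, `${byScene.get(c.sceneId) ?? ''} ${c.text.toLowerCase()}`);
  }
  return [...byScene.entries()]
    .filter(([, text]) => text.includes(a.toLowerCase()) && text.includes(b.toLowerCase()))
    .map(([sceneId]) => sceneId);
}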

Semantic Similarities and Themes

If you search for courage in the face of fear, vector search is the best fit. BM25 fails unless those exact words show up. Vector search will retrieve passages about implied bravery and related language. Gemini can also do this, but it'll be slower if it's scanning the full document.

Repetition Audits

If you want to know how many times a phrase like ruby slippers appears, BM25 is the right tool. It gives you exact matches with consistent ranking.

Conceptual Search with Evidence

For a question like Where does Dorothy start to doubt herself?, the best approach is a hybrid. Vector search is good at surfacing candidate passages, and Gemini is good at analyzing those passages. Asking Gemini to both find and analyze in one go is where things get unreliable.

The Real Pattern: Orchestration

These systems are layers, not rivals. A solid agentic workflow looks like:

  1. Retrieve candidates with BM25 (precision)
  2. Expand or rerank with vector search (recall)
  3. Analyze with Gemini using only the retrieved evidence (reasoning)

How the Layers Fit Together

[Diagrams: the high-level pipeline; separating search from reasoning; an example workflow combining "Dorothy + Scarecrow together" with "courage in the face of fear".]

Example Code

The companion app implements all three search modes. Here are the important parts, restructured so you can adapt them to your own project.

BM25 Retrieval (Lexical)

SQLite's FTS5 extension gives you BM25 scoring for free. You create a virtual table, insert your chunks, and query with MATCH:

import Database from 'better-sqlite3';

const db = new Database('demo.db');

// Create the FTS5 virtual table.
// docId is UNINDEXED: stored for filtering, but not searchable.
db.exec(`
  CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(
    docId UNINDEXED,
    text,
    tokenize = 'unicode61'
  );
`);

// Insert chunks during indexing
function insertChunk(docId: string, text: string) {
  db.prepare('INSERT INTO chunks_fts (docId, text) VALUES (?, ?)').run(docId, text);
}

The query is where the interesting bits are. FTS5's built-in bm25() function handles relevance scoring, but you can layer on an exact-phrase bonus so literal matches rank higher than partial token hits:

const STOP_WORDS = new Set([
  'a', 'an', 'the', 'is', 'are', 'was', 'were', 'in', 'on', 'at',
  'to', 'for', 'of', 'with', 'and', 'but', 'or', 'not', 'it',
  // ... ~80 common English words
]);

function sanitizeFts5Query(query: string): string {
  return query
    .replace(/[^\w\s]/g, '')
    .split(/\s+/)
    .filter(Boolean)
    .map((t) => t.toLowerCase())
    .filter((t) => !STOP_WORDS.has(t))
    .map((term) => `"${term}"`)
    .join(' ');
}

function searchBM25(docId: string, query: string) {
  const ftsQuery = sanitizeFts5Query(query);
  if (!ftsQuery) return [];

  const phrase = query.replace(/[^\w\s]/g, '').trim().toLowerCase();

  // bm25() returns negative values (more negative = more relevant).
  // Subtract 5.0 when the original phrase appears as a substring
  // so exact phrase matches always rank higher.
  return db.prepare(`
    SELECT text,
      bm25(chunks_fts) - (CASE WHEN instr(lower(text), ?) > 0
                                THEN 5.0 ELSE 0.0 END) AS score
    FROM chunks_fts
    WHERE docId = ? AND chunks_fts MATCH ?
    ORDER BY score ASC
    LIMIT 10
  `).all(phrase, docId, ftsQuery);
}

Stop-word filtering matters here. Without it, a query like "where is the house" would match on is and the across the entire corpus, drowning out the meaningful terms.
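
For instance, with the function above (the exact output depends on which words the full ~80-word list contains):

// 'is' and 'the' are stop words, so they never reach FTS5;
// punctuation is stripped and each surviving term is quoted.
sanitizeFts5Query('Where is the house?');
// => '"where" "house"' (or just '"house"' if 'where' is also a stop word)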

Vector Search (Semantic Recall)

For vector search, you need two things: an embedding model and a vector store. The companion app uses Gemini's embedding model and LanceDB for the vector store.

Generating embeddings with the Gemini SDK:

import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-embedding-001' });

// Single query embedding
async function embedQuery(query: string): Promise<number[]> {
  const result = await model.embedContent(query);
  return result.embedding.values;
}

// Batch embedding for indexing (batches of 20 to stay under rate limits)
async function embedTexts(texts: string[]): Promise<number[][]> {
  const BATCH_SIZE = 20;
  const allEmbeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);
    const result = await model.batchEmbedContents({
      requests: batch.map((text) => ({
        content: { role: 'user', parts: [{ text }] },
      })),
    });
    allEmbeddings.push(...result.embeddings.map((e) => e.values));
  }

  return allEmbeddings;
}

Storing and searching with LanceDB:

import * as lancedb from '@lancedb/lancedb';

const lanceDb = await lancedb.connect('data/lancedb');

// Store chunks + vectors during indexing
async function insertVectors(docId: string, chunks: string[], vectors: number[][]) {
  const data = chunks.map((text, i) => ({ text, vector: vectors[i] }));
  await lanceDb.createTable(`chunks_${docId}`, data, { mode: 'overwrite' });
}

// Search by cosine similarity
async function searchVectors(docId: string, queryVector: number[], limit = 10) {
  const table = await lanceDb.openTable(`chunks_${docId}`);
  const results = await table
    .vectorSearch(queryVector)
    .distanceType('cosine')
    .limit(limit)
    .toArray();

  // LanceDB returns _distance (cosine distance).
  // Subtract from 1 to get similarity (1.0 = identical).
  return results.map((r: { text: string; _distance?: number }) => ({
    text: r.text,
    score: 1 - (r._distance ?? 0),
  }));
}

Gemini Analysis (Reason Over Evidence)

The key idea from this whole post: Gemini should reason over retrieved evidence, not scan the full document. You pass in the top chunks from BM25 and vector search, and ask it to analyze only those:

import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-3-flash-preview' });

async function analyzeEvidence(query: string, evidence: { text: string }[]) {
  const numbered = evidence.map((e, i) => `[${i + 1}] ${e.text}`).join('\n\n');

  const prompt = `
You are analyzing excerpts from a document.

Query: "${query}"

Evidence:
${numbered}

Based only on the evidence above, answer the query.
Cite passages by number (e.g. [1], [3]).
If the evidence does not support an answer, say so.
`;

  const resp = await model.generateContent(prompt);
  return resp.response.text();
}

This is fundamentally different from passing the full document in the prompt. The evidence is bounded, the prompt token count is predictable, and Gemini can focus entirely on reasoning instead of searching.
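
If you want a hard ceiling on prompt size, one option (a sketch, not something the companion app does) is to cap the evidence by a character budget before building the prompt; characters are only a rough proxy for tokens, so treat the limit as tunable:

// Keep evidence chunks in ranked order until a character budget is hit.
function capEvidence(evidence: { text: string }[], maxChars = 12_000) {
  const kept: { text: string }[] = [];
  let total = 0;
  for (const e of evidence) {
    if (total + e.text.length > maxChars) break;
    kept.push(e);
    total += e.text.length;
  }
  return kept;
}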

Putting It Together

At indexing time, you chunk the document and populate both stores in one pass:

function chunkText(text: string, size = 1500): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function indexDocument(docId: string, text: string) {
  const chunks = chunkText(text);

  // Populate the FTS5 index for BM25
  for (const chunk of chunks) insertChunk(docId, chunk);

  // Generate embeddings and store in LanceDB
  const vectors = await embedTexts(chunks);
  await insertVectors(docId, chunks, vectors);
}

At query time, BM25 and vector search retrieve independently, then the results merge before Gemini analyzes the top evidence:

async function search(docId: string, query: string) {
  // 1. BM25: fast lexical retrieval
  const bm25Results = searchBM25(docId, query);

  // 2. Vector search: semantic recall
  const queryVector = await embedQuery(query);
  const vectorResults = await searchVectors(docId, queryVector);

  // 3. Merge, dedupe by text content, and pass top evidence to Gemini
  const seen = new Set<string>();
  const evidence = [...bm25Results, ...vectorResults]
    .filter((r) => !seen.has(r.text) && seen.add(r.text))
    .slice(0, 12);
  const analysis = await analyzeEvidence(query, evidence);

  return { bm25Results, vectorResults, analysis };
}

BM25 handles precision, vector search adds recall, and Gemini reasons over a bounded set of evidence. Each layer does one job.
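
One detail worth noting about the merge step: BM25 scores and cosine similarities live on different scales, so the code above simply concatenates the lists, and with ten BM25 hits ahead of the vector results the lexical side tends to fill most of the twelve evidence slots. A hedged alternative (a sketch, not what the companion app does) is to interleave the two ranked lists before deduplicating so both retrievers contribute:

// Round-robin interleave of two ranked result lists, deduped by text.
function interleave<T extends { text: string }>(a: T[], b: T[], limit = 12): T[] {
  const seen = new Set<string>();
  const out: T[] = [];
  for (let i = 0; out.length < limit && (i < a.length || i < b.length); i++) {
    for (const r of [a[i], b[i]]) {
      if (r && !seen.has(r.text) && out.length < limit) {
        seen.add(r.text);
        out.push(r);
      }
    }
  }
  return out;
}

Swapping that in for the concatenate-and-slice step in search() keeps roughly half the evidence slots for each retriever.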

Latency: What to Expect

The takeaway here is simple:

  • BM25 latency scales with query complexity and index size; it doesn't re-read the document on each request
  • Gemini full-context "search" re-reads the document on every query, so its latency scales with document length every time

If you're building this into an app, measure these separately:

  1. BM25 retrieve time
  2. Vector retrieve/rerank time
  3. LLM analysis time
  4. End-to-end time

You'll also want to track the number of chunks retrieved, the number of chunks passed to Gemini, prompt tokens and output tokens (if available), and cache hits if you're caching embeddings or LLM results.
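
As a sketch, a record for those numbers might look like this (field names are illustrative, not taken from the companion app):

interface SearchMetrics {
  bm25Ms: number;
  vectorMs: number;
  llmMs: number;
  totalMs: number;
  chunksRetrieved: number;
  chunksPassedToLlm: number;
  promptTokens?: number;  // if the model reports token usage
  outputTokens?: number;
  cacheHit?: boolean;     // embedding or LLM cache, if you add one
}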

Here's a simple timing wrapper that can help:

async function timed<T>(label: string, fn: () => Promise<T>) {
  const t0 = performance.now();
  const value = await fn();
  const t1 = performance.now();
  return { label, ms: Math.round(t1 - t0), value };
}
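
For example, a version of the earlier search() function instrumented with that wrapper might look like this (a sketch; the labels and logging are arbitrary):

async function timedSearch(docId: string, query: string) {
  const bm25 = await timed('bm25', async () => searchBM25(docId, query));
  const qvec = await timed('embed-query', () => embedQuery(query));
  const vec = await timed('vector-search', () => searchVectors(docId, qvec.value));

  const seen = new Set<string>();
  const evidence = [...bm25.value, ...vec.value]
    .filter((r) => !seen.has(r.text) && seen.add(r.text))
    .slice(0, 12);

  const llm = await timed('gemini-analysis', () => analyzeEvidence(query, evidence));

  for (const step of [bm25, qvec, vec, llm]) {
    console.log(`${step.label}: ${step.ms}ms`);
  }
  console.log(`chunks retrieved: ${bm25.value.length + vec.value.length}, passed to Gemini: ${evidence.length}`);

  return llm.value;
}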

In my experience, the general latency profile looks like this, though it varies by hosting, model, and document size:

  • BM25 is usually the fastest
  • Vector search is fast, but often slower than BM25
  • Gemini full-context search is the slowest, and grows quickly with document size
  • Gemini analysis over top chunks is slower than retrieval, but stable if evidence size is bounded

The key optimization is to keep Gemini doing analysis, not scanning.

Precision vs. Recall

The mental model that helped me think about this clearly:

  • BM25 gives you precision
  • Vector search gives you recall
  • The LLM gives you reasoning

Precision without recall misses things. Recall without precision floods you with noise. Reasoning without retrieval hallucinates. Hybrid systems balance all three.

Where This Fits in Agent Design

If you're building agentic systems over large documents, don't ask the model to search, filter, reason, summarize, and verify in one pass. Split the responsibilities: retrieval retrieves, the LLM reasons, and verification validates. The system becomes more testable and much harder to trick into hallucinating.

Next Up

This post sets up a series on building better retrieval for agentic systems. In upcoming posts, I'll be covering hybrid retrieval (BM25 + vectors) in practice, reranking strategies, structure-aware retrieval using chapters or scenes as a tree, contradiction detection and consistency checking, and verification layers for agent outputs.