The Untaught Lessons of RAG Retrieval: Cosine Is Not the Foundation

This article is a companion to Enterprise Document Intelligence, the series whose philosophy is laid out in Amplify the Expert. It zooms in on brick 3 (retrieval) of the four-brick architecture and surfaces the lessons most tutorials skip.

The mainstream story describes retrieval as three steps: embed the question, return the top-k by cosine, optionally rerank. We disagree with almost every part of it. Retrieval is filtering on structured tables, not searching free text. Embeddings are the optional fallback, not the foundation. Anchor and context are two granularities, not one. Each of these is a position we can defend, with consequences you can measure.

Where this article sits in the series: brick 3 (retrieval) highlighted – Image by author

📓 Runnable companion notebooks are on GitHub: doc-intel/notebooks-vol1.

The public companion-code repo at doc-intel/notebooks-vol1 – Image by author

The naive baseline this article pushes back on

The architectural contrast: a single cosine signal over chunks vs three signals in parallel on structured tables – Image by author

The naive pipeline chunks the document, embeds every chunk, embeds the question, and ranks by cosine. That single signal is opaque, and it throws away the document’s structure. We keep the document as line_df + toc_df and run three retrieval signals in parallel (keyword on lines, TOC reasoning, embedding cosine), then let an LLM arbiter rank once at the end with all three sets of hits in view.

Keywords always run, the TOC always reasons, embeddings fire only when the vocabulary mismatches – Image by author

Below are the six untaught lessons of this brick.

Lesson 1 – Retrieval is filtering, not searching

Once parsing is done, retrieval is a SQL-like filtering problem over line_df and toc_df, the reverse of the chunk-embed-cosine-top-k framing. The shift is simple to state: the question has columns, the document has columns, and retrieval is the join.

Why it matters. Search and filter are not synonyms: the two operations have different mechanics. Search scores every candidate on a continuous similarity (cosine, BM25), forces a top-k cutoff, and always returns something, even when the answer is not in the document. Filter applies a boolean condition (line.contains("X"), toc.title in [...]), retains every row that matches and no more, and can return zero rows when the document does not carry the answer. The audit consequence is the largest part of the gap: a filter’s condition is one line of inspectable code that runs the same way in six months; a search’s ranking depends on which dimensions of the embedding mattered, and you cannot replay that judgment without re-running the model.

Concrete contrast. The user asks “What positional encoding does the paper use?”. Naive RAG embeds the question, scores 300+ chunks, and returns the top 5. Series RAG filters line_df where the line contains "positional encoding" (4 hits) and filters toc_df where the section title contains "positional" (1 section, 3.5 Positional Encoding). The arbiter sees both: the anchor is the line, the scope is the section. No cosine needed.
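The contrast above can be sketched as two boolean filters. This is a minimal sketch, not the series' implementation: plain lists of dicts stand in for line_df and toc_df so the example has no dependencies, and the column names (line_no, section, text, title) and the sample rows are assumptions made for illustration.

```python
# Hypothetical stand-ins for line_df and toc_df; column names are assumed.
line_rows = [
    {"line_no": 10, "section": "3.2", "text": "An attention function maps a query to an output."},
    {"line_no": 41, "section": "3.5", "text": "We add positional encoding to the input embeddings."},
    {"line_no": 42, "section": "3.5", "text": "The positional encoding uses sine and cosine functions."},
    {"line_no": 77, "section": "5.1", "text": "Training used eight GPUs."},
]
toc_rows = [
    {"section": "3.2", "title": "Attention"},
    {"section": "3.5", "title": "Positional Encoding"},
    {"section": "5.1", "title": "Training Data and Batching"},
]


def filter_lines(rows, phrase):
    """Keep every line whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [r for r in rows if needle in r["text"].lower()]


def filter_toc(rows, word):
    """Keep every TOC entry whose title contains the word (case-insensitive)."""
    needle = word.lower()
    return [r for r in rows if needle in r["title"].lower()]


anchors = filter_lines(line_rows, "positional encoding")  # precise lines
scopes = filter_toc(toc_rows, "positional")               # enclosing sections
missing = filter_lines(line_rows, "rotary embedding")     # a filter may return nothing

print([r["line_no"] for r in anchors])
print([r["title"] for r in scopes])
print(len(missing))
```

The same conditions translate directly to boolean masks on a DataFrame. The point of the sketch is that each condition is one inspectable line, and that an empty result is a legitimate outcome rather than an error.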

Article 7A: Retrieval is filtering, not search lays out the mental model.

Lesson 2 – Anchor and context, kept apart

You anchor on the single line that mentions “premium” (precise) but pass the whole surrounding section to generation (sufficient context); conflating them breaks precision and coverage in one move. Top-k forces you to pick: tiny chunks lose context, huge chunks lose precision. We get both, by keeping them apart.

Concrete contrast. For a definition question, the anchor is the one line ("the deductible is the amount the insured pays before coverage begins"), and the scope is the paragraph around it (three sentences of context the LLM needs to phrase the answer). Naive top-k either returns the line (no context) or the paragraph (anchor unclear). Series retrieval returns anchor + scope as a typed pair.
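One way to picture the "typed pair" is a small dataclass that carries the anchor line and its surrounding scope separately. The class names, the paragraph_id field, and the sample lines below are hypothetical; the series may model this differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Line:
    line_no: int
    paragraph_id: int  # assumed grouping key; a section id would work the same way
    text: str


@dataclass(frozen=True)
class Retrieval:
    anchor: Line             # the single line that matched (precision)
    scope: Tuple[Line, ...]  # every line of the enclosing paragraph (context)


def retrieve(lines: List[Line], keyword: str) -> List[Retrieval]:
    """Anchor on matching lines, then widen each anchor to its paragraph."""
    needle = keyword.lower()
    results = []
    for line in lines:
        if needle in line.text.lower():
            scope = tuple(l for l in lines if l.paragraph_id == line.paragraph_id)
            results.append(Retrieval(anchor=line, scope=scope))
    return results


lines = [
    Line(1, 1, "Definitions used in this policy are listed below."),
    Line(4, 2, "The deductible is the amount the insured pays before coverage begins."),
    Line(5, 2, "It applies once per claim."),
    Line(6, 2, "It is stated in the schedule."),
    Line(7, 3, "The premium is the price of the policy."),
]

pairs = retrieve(lines, "deductible")
print(pairs[0].anchor.line_no)
print([l.line_no for l in pairs[0].scope])
```

Keeping the two fields apart means the citation can point at the anchor line while the generation prompt receives the whole scope.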

Article 7A: Retrieval is filtering, not search draws the line between anchor and context.

Lesson 3 – Embeddings come last, not first

Keywords always run (cheap, deterministic); the document’s own TOC is a first-class retrieval method; embeddings are the optional final signal, only when vocabulary mismatch is expected. The 2024-era reflex starts with embeddings; we leave them for the cases where the cheaper signals failed.

Concrete contrast. A factual lookup on an insurance policy: “effective date?”. Naive RAG embeds the question and returns 5 chunks. Series retrieval runs a keyword match on "effective" and "date" → 1 line found → done. Embeddings never ran. Cost: one regex pass over line_df, a few milliseconds. The 2-cent cosine search did not happen.
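The ordering described above can be sketched as a short cascade in which the embedding step is only called when the keyword step comes back empty. This is an illustrative sketch under assumptions: the function names are invented, the TOC signal is left out for brevity, and embedding_search is a placeholder callable rather than a real embedding API.

```python
import re


def keyword_hits(lines, terms):
    """Lines that contain every term as a whole word, case-insensitive."""
    patterns = [re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in terms]
    return [line for line in lines if all(p.search(line) for p in patterns)]


def retrieve(lines, terms, embedding_search):
    """Run the cheap signal first; pay for embeddings only if it fails."""
    hits = keyword_hits(lines, terms)
    if hits:
        return {"signal": "keyword", "hits": hits}
    return {"signal": "embedding", "hits": embedding_search(lines, terms)}


# Hypothetical policy lines.
policy_lines = [
    "Policy number: 12345",
    "Effective date: 1 January 2025",
    "Insured: Example Holder",
]

embedding_calls = []


def fake_embedding_search(lines, terms):
    """Placeholder for a real embedding search; records that it was invoked."""
    embedding_calls.append(terms)
    return []


result = retrieve(policy_lines, ["effective", "date"], fake_embedding_search)
print(result["signal"], result["hits"])
print(len(embedding_calls))
```

Passing the embedding step in as a callable keeps the expensive dependency out of the hot path and makes it easy to verify that it was never invoked.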

Article 7B: Finding the right anchors builds the three-signal pipeline.

Lesson 4 – Keywords prove absence; embeddings cannot

A zero on keyword search means the answer is genuinely not there; a zero on embedding similarity could be absence or just different words, so embeddings are a refinement, not a decision gate. This asymmetry is the case for keywords as the primary signal in enterprise RAG.

Concrete contrast. The user asks “does this contract cover earthquake damage?” on a flood-only policy. Keyword search for "earthquake" returns zero matches in line_df. The pipeline can ship answer_found = False confidently. Embedding cosine returns 5 chunks (the closest topically related lines about natural disasters) and the LLM, seeing them, may infer a wrong yes. Keywords saved the day.
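The asymmetry can be shown with a toy example. In the sketch below, keyword_gate is a hypothetical helper that sets answer_found, and top_k uses a crude word-overlap score as a stand-in for cosine similarity (it is not an embedding model); the policy lines are invented.

```python
def keyword_gate(lines, term):
    """Boolean filter: an empty match list is a usable 'not in the document'."""
    needle = term.lower()
    matches = [l for l in lines if needle in l.lower()]
    return {"term": term, "answer_found": bool(matches), "matches": matches}


def top_k(lines, question, k=2):
    """Toy similarity search: rank by shared words and always return k lines."""
    q_words = set(question.lower().split())
    ranked = sorted(
        lines,
        key=lambda l: len(q_words & set(l.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical flood-only policy.
policy_lines = [
    "This policy covers flood damage to the insured building.",
    "Storm surge is treated as flood damage.",
    "Claims must be filed within 30 days.",
]

gate = keyword_gate(policy_lines, "earthquake")
neighbours = top_k(policy_lines, "does this contract cover earthquake damage")

print(gate["answer_found"], gate["matches"])
print(len(neighbours))
```

The filter reports absence explicitly, while the ranked search still hands two flood-related lines to whatever runs next, even though none of them mentions earthquakes.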

Article 7B: Finding the right anchors explains the keyword-first discipline.

Lesson 5 – Co-occurrence beats BM25 on narrow corpora

BM25 ranks by term frequency, but the enterprise answer shape is one mention of a topic next to a specific value, so co-occurrence boosts and high-value regex anchors beat statistical IDF on narrow corpora. The IDF assumptions break on a 20-document corpus where every term is “rare” by Wikipedia standards.

Concrete contrast. The question is “what is the deductible amount?”. BM25 ranks by frequency of "deductible", so the glossary section that repeats the term 12 times ranks first. Co-occurrence search ranks lines that contain both "deductible" and a number; the actual policy line ("the deductible is $1000") ranks first because it co-occurs with $1000, and the LLM can extract the value cleanly.
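A small sketch of the difference, under stated assumptions: term_frequency is a simplified stand-in for the frequency component of BM25 (not BM25 itself), the AMOUNT regex is one hypothetical "high-value anchor" for dollar values, and the sample lines are invented.

```python
import re

# Assumed pattern for a dollar amount such as $1000 or $1,000.50.
AMOUNT = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?")


def term_frequency(line, term):
    """Count whole-word occurrences of the term (case-insensitive)."""
    return len(re.findall(rf"\b{re.escape(term)}\b", line, re.IGNORECASE))


def cooccurrence_score(line, term):
    """1 if the line mentions the term and a dollar amount together, else 0."""
    has_term = term_frequency(line, term) > 0
    return int(has_term and AMOUNT.search(line) is not None)


lines = [
    "Deductible: see deductible schedule; a deductible may vary by deductible type.",
    "The deductible is $1000.",
    "The premium is $250 per month.",
]

best_by_frequency = max(lines, key=lambda l: term_frequency(l, "deductible"))
best_by_cooccurrence = [l for l in lines if cooccurrence_score(l, "deductible")]

print(best_by_frequency)
print(best_by_cooccurrence)
```

Frequency alone favours the glossary-style line that repeats the term; requiring the term and a value on the same line isolates the one line that actually carries the answer.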

Article 7B: Finding the right anchors measures co-occurrence against BM25.

Lesson 6 – One LLM pass over the TOC

Handing the 20-100 row toc_df to a small model and asking which sections answer the question costs one cached call and catches the paraphrases (“exit early” ≈ “Termination”) keyword matching misses.

TOC reasoning is one of the most under-used retrieval signals in production RAG.

Concrete contrast. The user asks “when can I leave the policy early?”. Substring matching on "leave" returns zero TOC entries. An LLM call on the full TOC (28 rows, fits in a single small prompt) returns section “Termination and Cancellation”, the correct paraphrase. One cached LLM call, deterministic afterwards, and the right anchor.
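A minimal sketch of the cached TOC call, with several assumptions: call_llm is a placeholder for whatever model client is in use (here a fake that returns a fixed reply), the prompt wording and the number-parsing convention are invented for illustration, and caching uses functools.lru_cache from the standard library.

```python
from functools import lru_cache


def build_toc_prompt(question, toc_titles):
    """Render the whole TOC as a numbered list followed by the question."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(toc_titles, start=1))
    return (
        "Table of contents:\n" + numbered + "\n\n"
        f"Question: {question}\n"
        "Reply with the numbers of the sections that could answer it, comma-separated."
    )


def parse_section_numbers(reply, n_sections):
    """Keep only tokens that are valid section numbers."""
    picked = []
    for token in reply.replace(",", " ").split():
        if token.isdigit() and 1 <= int(token) <= n_sections:
            picked.append(int(token))
    return picked


def make_toc_selector(call_llm):
    """Wrap an LLM callable so each (question, TOC) pair is asked only once."""
    @lru_cache(maxsize=None)
    def select(question, toc_titles):  # toc_titles must be a tuple (hashable)
        reply = call_llm(build_toc_prompt(question, toc_titles))
        numbers = parse_section_numbers(reply, len(toc_titles))
        return tuple(toc_titles[i - 1] for i in numbers)
    return select


toc = ("Definitions", "Premium Payment", "Termination and Cancellation", "Claims")
llm_calls = []


def fake_llm(prompt):
    """Stand-in for a real model; always picks section 3."""
    llm_calls.append(prompt)
    return "3"


select = make_toc_selector(fake_llm)
question = "when can I leave the policy early?"
substring_hits = [t for t in toc if "leave" in t.lower()]
first = select(question, toc)
second = select(question, toc)  # served from the cache

print(substring_hits)
print(first, len(llm_calls))
```

Validating the reply against the known section numbers keeps a malformed model answer from producing an anchor that does not exist in the TOC.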

Article 7B reasons over the TOC, and Article 7C: An LLM as arbiter adds the arbiter.

The six lessons share one move: refuse the chunk-embed-cosine reflex, and treat retrieval as filtering on structured tables instead. Keywords always run because they prove absence; the TOC is a first-class signal because the document already declared its structure; embeddings are the optional refinement, not the foundation. The deep-dives (7A, 7B, 7C, 7bis) ship runnable code on real documents; this piece is the catalogue that points at them.

Across sectors and professions

The same three-signal retrieval pattern (keyword on line_df + reasoning on toc_df + embedding fallback) holds in every domain. The vocabulary and the TOC depth differ; the signal hierarchy does not. Five sectors below, one retrieval pattern, one audit trace per call.

Embeddings fire only on the medical row where vocabulary diverges from the document – Image by author

Embeddings fire only on the medical row, where the user’s vocabulary (“tachycardia”) diverges from the document’s (“rapid heart rate”). The other four rows resolve entirely on keyword + TOC. Keywords prove absence (Lesson 4), the TOC catches paraphrases (Lesson 6), and the anchor / scope split keeps precision and context apart (Lesson 2) in every row. The cost gradient is real: the four keyword-resolved rows run in milliseconds with zero LLM tokens; the medical row pays for one embedding pass and one arbiter call.

Sources and further reading

The mainstream literature on retrieval is shaped by web-scale search and shorter consumer corpora. The series stance assumes a small enterprise corpus where the structure is known and the vocabulary is the asset.


