Retrieval Is Filtering, Not Search: A Mental Model for Enterprise RAG

This article is part of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Retrieval is the third brick, and this is the first of its three parts, the mental model: retrieval is filtering, not search; filter line_df and toc_df, pick anchors small, expand context large.

*Where this article sits in the series: Article 7 (retrieval), the mental-model part, inside Part II (the four bricks) – Image by author*

Watch how a human searches a document.

Someone at work wants to know how many vacation days they get this year. They open the HR policy PDF. They press Ctrl+F. They type “vacation”. Fifteen hits scroll past, some in headings, some in the TOC. They jump to the right paragraph, read the rule, and have their answer in 60 seconds.

That’s not a novice doing it wrong. That’s a professional doing it the most efficient way they know: keywords they know are in the document, the TOC the author already wrote, reading a whole section when they suspect it’s the right one. Where in this process is “embedding similarity”? Nowhere.

Sometimes Ctrl+F finds nothing: the doc calls it “PTO”, not “vacation”. Or the text sits inside a scanned page that Ctrl+F can’t see. The expert tries a synonym, then a third. Still zero hits.

Then the expert opens the table of contents. They scan the section titles, click the most likely one (“Leave and Time Off”), and read the body. That fallback (keyword first, TOC navigation when the keyword fails) is what professional document work has run on for thirty years.

This article (the first of three) builds the mental model behind that workflow: retrieval is a filtering problem on two structured tables (line_df and toc_df), not a search problem. It also introduces the anchor / context distinction (where the match lands versus what gets passed to generation), which the other two articles build on. The pipeline mechanics (“Anchor detection for RAG: parallel detectors, then one LLM call at the end”) and the arbiter that ranks the results (“Letting an LLM pick the right RAG page: the arbiter pattern at the end of retrieval”) come next.

The “amplify the expert” stance: codify the expert’s workflow, then do it better than they can manually. Three concrete lifts:

The expert types one keyword at a time. The system can detect co-occurrence of multiple keywords on the same page or section in a single pass.
The expert sees nothing when words are locked in scanned images. The parsing brick runs OCR at ingestion, so image-bound text becomes searchable like any other line.
The expert scans the TOC manually. The system joins TOC and content programmatically: pick the right section from the map, then scope the keyword search inside that section’s body.

*Each expert pain point maps to one programmatic lift – Image by author*
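The first lift, detecting several keywords on the same page in one pass, can be sketched in a few lines of pandas. The toy `line_df` below, its column names (`page_num`, `text`), and the helper `pages_with_all` are assumptions made for illustration, not the series’ actual schema or code.

```python
import pandas as pd

# Toy stand-in for line_df; rows and column names are invented for illustration.
line_df = pd.DataFrame({
    "page_num": [1, 1, 2, 2, 3],
    "text": [
        "Employees accrue vacation days monthly.",
        "The office is closed on public holidays.",
        "Unused vacation may carry over to next year.",
        "Carry over is capped at five days.",
        "Parking permits are renewed annually.",
    ],
})

def pages_with_all(line_df: pd.DataFrame, keywords: list[str]) -> list[int]:
    """Return the pages on which every keyword appears on at least one line."""
    # One boolean column per keyword, one row per line.
    hits = pd.DataFrame({
        kw: line_df["text"].str.contains(kw, case=False, regex=False)
        for kw in keywords
    })
    # A page qualifies when each keyword column has at least one True on it.
    per_page = hits.groupby(line_df["page_num"]).any()
    return per_page.index[per_page.all(axis=1)].tolist()

print(pages_with_all(line_df, ["vacation", "carry over"]))
```

Grouping by `section_id` instead of `page_num` would give the same co-occurrence test at section level.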

Once the parsing brick has produced clean DataFrames, retrieval becomes a filtering problem on structured tables: filter line_df (the text) and toc_df (the map). This article builds up that mental model. The next two articles build the mechanics on top.

Throughout this article we work on a single document, Attention Is All You Need (Vaswani et al. 2017, 15 pages; arXiv non-exclusive distribution license, declared on the arXiv abstract page). It carries a clean native TOC in the PDF outline (22 entries, 3 levels deep), and the content is familiar territory for any engineer touching RAG: encoder, decoder, attention, queries, keys, values. That keeps the focus on the retrieval methods rather than on parsing a domain-specific corpus. This article also assumes the document carries its own TOC; recovering one from raw text is left to follow-up work.

*Every method in this article starts from `line_df` and `toc_df` – Image by author*

1. Retrieval as filtering on structured tables

The standard framing of retrieval is “find the passages most similar to the query.” That framing is misleading because it imports the wrong mental model.

Once parsing has produced clean DataFrames, retrieval is no longer a search problem in the classical sense. It is a filtering problem on structured tables. Every method we discuss is a different way of filtering rows of line_df (the document’s text) and toc_df (the document’s map). The mental model is closer to a SQL query than to a Google search.
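To make the SQL analogy concrete, here is a minimal sketch of a retrieval step written as a row filter. The four rows, the `section_id` values, and the paraphrased line texts are invented for illustration; only the idea of filtering `line_df` by column comes from the article.

```python
import pandas as pd

# Toy line_df: a handful of lines loosely paraphrased from the running paper.
line_df = pd.DataFrame({
    "page_num":   [3, 3, 4, 4],
    "line_num":   [10, 11, 1, 2],
    "section_id": ["3.1", "3.1", "3.2.1", "3.2.1"],
    "text": [
        "The encoder is composed of a stack of identical layers.",
        "Each layer has two sub-layers.",
        "We call our particular attention Scaled Dot-Product Attention.",
        "The input consists of queries and keys of dimension d_k.",
    ],
})

# Roughly equivalent to:
#   SELECT page_num, line_num, text FROM line_df
#   WHERE section_id LIKE '3.2%' AND text ILIKE '%attention%'
mask = (
    line_df["section_id"].str.startswith("3.2")
    & line_df["text"].str.contains("attention", case=False, regex=False)
)
anchors = line_df.loc[mask, ["page_num", "line_num", "text"]]
print(anchors)
```

The filter either returns rows or returns an empty frame, so “not found” is an ordinary result rather than a special case.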

This shift unlocks methods that don’t appear when you treat retrieval as free-text search:

The parsing brick produces structured DataFrames: line_df carries the textual content with line numbers, page coordinates, and section IDs; toc_df carries the table of contents as a navigable hierarchy; page_df and image_df round out the data model. The document is no longer free text by the time retrieval runs.

The question side arrives equally pre-processed. The question parsing brick turns the user’s string into a RetrievalQuery brief (a derived view of ParsedQuestion) that carries the keywords, scope filters, structural hints, and the answer-context width retrieval should respect. Retrieval reads one typed object on each side: the document-as-tables on the left, the RetrievalQuery on the right.
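As a rough picture of what such a typed brief could look like, here is a hypothetical dataclass. The article names the object `RetrievalQuery` and lists what it carries; every field name, type, and default below is an assumption for illustration, not the series’ actual definition.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class RetrievalQuery:
    """Hypothetical shape of the retrieval brief; field names are illustrative."""
    keywords: tuple[str, ...]                          # terms to look for in line_df / toc_df
    scope_filters: dict = field(default_factory=dict)  # e.g. {"section_id": "3.2"}
    structural_hint: Optional[str] = None              # e.g. "formula", "table"
    context_width: str = "paragraph"                   # "paragraph" | "section" | "window"

query = RetrievalQuery(
    keywords=("attention", "softmax"),
    scope_filters={"section_id": "3.2"},
    structural_hint="formula",
    context_width="section",
)
print(query)
```

Whatever the real fields are, the point is that retrieval reads attributes from an object instead of re-parsing the user’s string.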

None of this is exotic. It is what teams do when they sit down with a real document and a real question. But it is invisible in the standard tutorial framing because that framing assumes “document = unstructured string, retrieval = vector search.”

1.1 Documents become tables

Two DataFrames from the parsing brick carry most of the retrieval load. Their sizes determine which filtering methods are even possible on each.

line_df, the dense table: One row per line of the document, tens of thousands of rows for a long contract. Columns include the text, the page number, the line number on the page, the bounding box, and the section_id that links each line to its section in toc_df. Dense, large, fine-grained: every line is a candidate, so filtering needs to be cheap per row. This is where the answer lives: the actual text the LLM will read and cite comes from here.

toc_df, the sparse table: One row per section in the table of contents, typically 20 to 100 rows for most enterprise documents, sometimes as few as 10. Columns include the section title, the level (1, 2, 3…), the page range, the parent section, and a stable section_id. Sparse, small, coarse-grained: each row covers a large block of the document, so filtering can be expensive per row. This is the map of where the answer might be: it tells you which section to look in, not what the section says.

The size difference between these two tables reshapes the entire grid. A method that is feasible on toc_df (passing the whole table to an LLM, embedding every entry, running multi-hop reasoning) may be entirely infeasible on line_df. Conversely, a method that is natural on line_df (regex over thousands of lines, fast keyword scoring) is wasteful on toc_df because there is not enough data to discriminate.

A good retrieval pipeline uses both. It uses toc_df to narrow down to the right section, then line_df to find the precise lines within that section. The two tables collaborate via the section_id join.
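A minimal sketch of that two-step collaboration, assuming both tables share a `section_id` column as described above. The toy rows and the keyword choices are invented for illustration.

```python
import pandas as pd

# Sparse table: one row per section (the map).
toc_df = pd.DataFrame({
    "section_id": ["s1", "s2", "s3"],
    "title": ["Introduction", "Attention", "Training"],
})
# Dense table: one row per line (the text).
line_df = pd.DataFrame({
    "section_id": ["s1", "s2", "s2", "s3"],
    "line_num": [1, 2, 3, 4],
    "text": [
        "Attention mechanisms have become integral to sequence modeling.",
        "An attention function maps a query and key-value pairs to an output.",
        "The output is computed as a weighted sum of the values.",
        "We trained on a standard translation dataset.",
    ],
})

# Step 1: coarse filter on the map.
sections = toc_df[toc_df["title"].str.contains("attention", case=False, regex=False)]

# Step 2: join on section_id to scope line_df, then apply the fine filter on the text.
scoped = line_df.merge(sections[["section_id", "title"]], on="section_id", how="inner")
hits = scoped[scoped["text"].str.contains("weighted sum", case=False, regex=False)]
print(hits[["title", "line_num", "text"]])
```

The introduction line also mentions attention, but it never reaches the second filter because the join has already scoped the search to the selected section.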

1.2 Why no single method is enough

Four real questions on an insurance contract show why no single filter and no single granularity is enough. Each question wants a different scope at two levels: the anchor (where the matching signal lands in the document) and the context (what gets passed to the LLM around that anchor). Section 2 develops both terms; for now just notice how they vary across the four questions.

A pipeline that uses cosine similarity with top-k=5 (return the 5 closest chunks by score) for all four cases will be wrong on at least three. The right answer depends on which structure you filter, what column you filter on, what anchor you detect on, and what context you pass downstream. The rest of the article develops that grid.

The first question on the list, “What is the policy number?”, is exactly what the Needle-in-a-Haystack benchmark tests (Kamradt, 2023, github.com/gkamradt/LLMTest_NeedleInAHaystack). Drop a verbatim sentence into a long context, ask about it, watch the model find it. Frontier models score near-perfectly. The benchmark is real, the result is real.

The trap is generalizing from it. “Skip retrieval, dump the corpus in context” works on category 1 and fails on categories 2, 3, 4. Listing every obligation in a contract isn’t a single needle. Comparing premiums across three policies isn’t a single needle. Summarizing a warranty section without missing anything isn’t a single needle.

The benchmark validates one question class, not the other three. It is a research result for an isolated retrieval task, not a license to skip retrieval in production.

To make the contrast concrete, here is the naive RAG baseline that most tutorials show, applied to the running paper. We pick a representative question, “How is attention computed?”. The answer is the formula box in section 3.2. We then run cosine top-k against the page embeddings.

*Five pages ranked by similarity, no section context, no “not found” path – Image by author*
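For readers who want the shape of that baseline in code, here is a minimal sketch of cosine top-k. The vectors are random stand-ins, not real page embeddings, and `cosine_top_k` is a hypothetical helper; the sketch shows the mechanics only and makes no claim about the scores a real embedding model would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
page_vecs = rng.normal(size=(15, 8))  # stand-in for 15 page embeddings
query_vec = rng.normal(size=8)        # stand-in for the question embedding

def cosine_top_k(query_vec, page_vecs, k=5):
    """Return (row_index, score) pairs for the k rows most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per page
    top = np.argsort(-scores)[:k]       # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]

results = cosine_top_k(query_vec, page_vecs)
for page_idx, score in results:
    print(f"page {page_idx + 1}: {score:.3f}")
```

Note what the function cannot do: it always returns k pages, whatever the query, and nothing in its output says which section a page belongs to.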

The rest of this article is methods that beat this baseline by exploiting the structure parsing already extracted.

2. Anchor and context: the two granularities

Filtering rows is half the picture. The other half is just as foundational: the anchor (the row where you detect the signal) and the context (the chunk you pass downstream to generation) are not the same unit. The anchor terminology is local to this article; adjacent literature uses “hit” in IR, “trigger” in information extraction, or “evidence span” in QA, all for the same idea. We keep “anchor” because it pairs naturally with “context”.

Retrieval runs in two phases. Phase 1 finds where the answer lives: keyword detection and embeddings run on line_df and toc_df, an LLM ranks the candidates once at the end, and the output is a small set of anchors (section, page, or lines). Phase 2 sizes the context around each anchor: paragraph, section, or a window of N lines, driven by the question’s intent and scope width (already parsed on the question side). Article 7B (anchor detection) develops phase 1’s detectors and Article 7C (the LLM arbiter) develops the arbiter that ranks them; section 2.4 below develops phase 2.

*Phase 1 finds anchors; phase 2 sizes the context around each anchor – Image by author*

A compliance officer searching “liability” with Ctrl+F lands on a single matching line. They never read just that line. They read the surrounding paragraph, often the whole section. The anchor is one line; the context is hundreds.

Concretely: you may anchor on a single line of line_df that mentions “premium”, but you pass the whole surrounding section to generation so the LLM sees the value in context. You may anchor on a toc_df title (“Section 5: Specific Exclusions”), but the context is the entire section’s body lines from line_df. The two granularities are independent design decisions.

Collapsing the two scopes, detecting at chunk level and passing the same chunk downstream, is the most common mistake in RAG pipelines. It loses precision (chunks are too coarse for fine-grained anchoring) and richness (chunks are too narrow for grounding the answer). The rest of the article keeps the two scopes separate at every step.

2.4 From match to context: three expansion strategies

Suppose the anchor is a single matched line:

“€125,000 annually.”

Without context, the LLM does not know what €125,000 refers to. With one paragraph of expansion:

“The annual premium for the policy is set as follows. The base premium amount is €125,000 annually. This may be adjusted in accordance with section 3.4.”

Three expansion strategies cover most cases.

Paragraph expansion: Take the paragraph that contains the matched line. Works for most QA tasks.

Section expansion: For listing or synthesis questions, expand to the full section. Use toc_df to find boundaries.

Window expansion: For documents without clear paragraph or section boundaries (transcripts, long-form prose), expand to N lines before and after.

The choice depends on the question (intent, expected answer shape) and the document (does it have paragraph structure? sections?). The dispatcher picks; the orchestration applies the chosen strategy uniformly. With one of these, the LLM can answer correctly and cite precisely.
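The dispatcher’s decision can be sketched as a small pure function. The intent labels (`"listing"`, `"synthesis"`, `"lookup"`) and the function name `choose_expansion` are hypothetical; the rule itself only restates the three strategies described above.

```python
def choose_expansion(intent: str, has_sections: bool, has_paragraphs: bool) -> str:
    """Pick an expansion strategy from the question intent and the document shape.

    Intent labels are illustrative placeholders, not the series' actual taxonomy.
    """
    # Listing and synthesis questions need the whole section, if the document has sections.
    if intent in {"listing", "synthesis"} and has_sections:
        return "section"
    # Most other questions are served by the paragraph around the match.
    if has_paragraphs:
        return "paragraph"
    # No usable structure (transcripts, long-form prose): fall back to N lines around.
    return "window"

print(choose_expansion("listing", has_sections=True, has_paragraphs=True))
print(choose_expansion("lookup", has_sections=False, has_paragraphs=False))
```

Keeping the choice in one function is what lets the orchestration apply the chosen strategy uniformly afterwards.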

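Section and window expansion can each be written as a short function over the two tables:

def expand_to_section(line_num, page_num, line_df, toc_df):
    """Return the text of the section that contains the anchor line."""
    # Look up the anchor row to learn which section it belongs to.
    anchor = line_df[(line_df["line_num"] == line_num) & (line_df["page_num"] == page_num)]
    if anchor.empty:
        raise ValueError(f"no line {line_num} on page {page_num}")
    sec = toc_df[toc_df["section_id"] == anchor["section_id"].iloc[0]].iloc[0]
    # Page-range bound (inclusive): lines of neighbouring sections that share the
    # section's first or last page are included too.
    in_section = line_df["page_num"].between(sec["start_page"], sec["end_page"])
    return "\n".join(line_df.loc[in_section, "text"])

def expand_window(line_num, page_num, line_df, n=5):
    """Return the anchor line plus up to n lines before and after it, on the same page."""
    page_lines = line_df[line_df["page_num"] == page_num].reset_index(drop=True)
    matches = page_lines.index[page_lines["line_num"] == line_num]
    if len(matches) == 0:
        raise ValueError(f"no line {line_num} on page {page_num}")
    i = matches[0]
    # The window is clipped at the top and bottom of the page.
    return "\n".join(page_lines.iloc[max(0, i - n) : i + n + 1]["text"])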

Run both on the first body line of page 4 (where the Attention(Q,K,V) formula sits) and the difference between the two scopes shows up:

*Same anchor, two expansion strategies: 7-line window vs full section – Image by author*
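Paragraph expansion, the first of the three strategies, follows the same pattern. The sketch below assumes the parsing brick assigns a `para_id` to every line; that column and the function name `expand_to_paragraph` are hypothetical. The toy rows reuse the premium example from above, plus one invented line from a neighbouring paragraph.

```python
import pandas as pd

def expand_to_paragraph(line_num, page_num, line_df):
    """Return the text of the paragraph containing the anchor line.

    Assumes a hypothetical `para_id` column assigned at parsing time.
    """
    anchor = line_df[(line_df["line_num"] == line_num) & (line_df["page_num"] == page_num)]
    if anchor.empty:
        raise ValueError(f"no line {line_num} on page {page_num}")
    para_id = anchor["para_id"].iloc[0]
    return "\n".join(line_df.loc[line_df["para_id"] == para_id, "text"])

line_df = pd.DataFrame({
    "page_num": [7, 7, 7, 7],
    "line_num": [1, 2, 3, 4],
    "para_id":  ["p1", "p1", "p1", "p2"],
    "text": [
        "The annual premium for the policy is set as follows.",
        "The base premium amount is €125,000 annually.",
        "This may be adjusted in accordance with section 3.4.",
        "Payment is due within thirty days of invoicing.",  # invented neighbour paragraph
    ],
})

paragraph = expand_to_paragraph(2, 7, line_df)
print(paragraph)
```

The anchor is still the single line carrying the amount; only the text handed to generation grows.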

2.5 When there’s no TOC: where does the section end?

Section expansion is straightforward when toc_df gives the page range. The start of the answer is the anchor, the end is the next TOC entry’s start page, no thinking required. When the document has no native TOC and no synthesized one, that easy bound disappears. The end of the section has to come from the content itself, and that is one of the hard problems in document AI.
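When the TOC is present, the “next entry’s start page” rule is a one-line column operation. This is a minimal sketch on a flat, invented `toc_df`; the page numbers and `last_page` are illustrative values, and a nested TOC would need the next entry at the same or a higher level rather than simply the next row.

```python
import pandas as pd

# Toy flat TOC; titles and start pages are invented for illustration.
toc_df = pd.DataFrame({
    "section_id": ["s1", "s2", "s3"],
    "title": ["Introduction", "Model Architecture", "Training"],
    "start_page": [1, 2, 7],
})
last_page = 15  # total page count of the document

# End of a section = start page of the next entry; the last section runs to the final page.
toc_df["end_page"] = toc_df["start_page"].shift(-1).fillna(last_page).astype(int)
print(toc_df)
```

Because a section can end on the same page where the next one begins, the bound is inclusive and adjacent sections overlap by up to one page.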

A short tour of what research has tried.

Each of these is a paper or three, with tuning knobs, failure modes, and benchmark folklore. Stacked together, they make a small research project on their own.

Our position in this series: the LLM you already have in generation does this job. The generation brick reads the context retrieval handed it and produces the answer. The same call can also report whether the context went off-topic, whether the answer continues past the window, whether more context is needed. A follow-up integrated pipeline wires this back into a feedback loop: generation flags “the topic continues past the last line I was shown”, retrieval extends the window, generation re-runs. The boundary is found by the same model that uses the boundary. No second machinery, no segmenter to train, no threshold to calibrate.
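The feedback loop can be sketched without any real model. Below, `fake_generate` is a stub standing in for the generation call, and the `continues_past_window` flag, the function name `answer_with_extension`, and the step sizes are all hypothetical; the sketch only shows the control flow of “flag, extend, re-run”.

```python
from typing import Callable

def answer_with_extension(
    lines: list[str],
    anchor: int,
    generate: Callable[[str], dict],
    n: int = 2,
    step: int = 2,
    max_rounds: int = 3,
) -> dict:
    """Widen the window around `anchor` while the generator reports truncation."""
    result: dict = {}
    for _ in range(max_rounds):
        window = lines[max(0, anchor - n) : anchor + n + 1]
        result = generate("\n".join(window))
        if not result.get("continues_past_window"):
            break          # the generator saw the whole topic
        n += step          # otherwise extend the window and re-run
    return result

# Stub for the LLM call: it "knows" the topic is complete once it sees the closing line.
def fake_generate(context: str) -> dict:
    complete = "END OF RULE" in context
    return {"answer": context if complete else None,
            "continues_past_window": not complete}

lines = [f"rule line {i}" for i in range(10)] + ["END OF RULE"] + ["other topic"] * 3
out = answer_with_extension(lines, anchor=5, generate=fake_generate)
print(out["continues_past_window"])
```

The `max_rounds` cap keeps the cost bounded: the loop stops even if the model keeps asking for more context.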

The wider point: R&D engineering and enterprise engineering are different jobs. Section-end detection has a real research bibliography. Most enterprise teams do not need to solve it; they need a system that works on their actual documents with their actual tools. The discipline is to ask, before reaching for TextTiling or a custom segmenter, whether one extra LLM call would have done the same job. Today, the answer is usually yes. Big-model inference is not what it was two years ago: pricing is down by an order of magnitude, latency is acceptable, and an enterprise pipeline can afford one extra call per question on a problem like this without breaking the budget. The trade is clear: spend on inference, not on a custom segmentation stack you then own forever.

That is not a blanket “use the LLM for everything” claim. Cost still matters, latency still matters, and a small deterministic rule beats an LLM call when the rule covers the case cleanly. But the default has shifted. R&D explores the technique space and pushes the techniques forward; enterprise engineering picks the simplest tool that meets the requirement and ships. In retrieval, the temptation to confuse the two is high because every paper looks like an opportunity. The right call is usually the one that does not require new R&D.

3. Conclusion

Retrieval is not search; it is filtering on two structured tables (line_df and toc_df) that parsing already produced. Every retrieval method is a different way of picking rows out of these two tables.

Anchor and context are different granularities. Anchor small (line, title): that is what you score. Context large (paragraph, section): that is what you pass to generation. The two scopes are independent design decisions; collapsing them is the single most common pipeline mistake.

Two phases. Phase 1 finds where the answer lives. Phase 2 sizes the context around each anchor, driven by the question’s intent and scope width (already parsed on the question side).

With the mental model in place, the natural next question is how the pipeline produces the anchors: which detectors run on line_df and toc_df, in what order, and when the LLM enters. Article 7B (anchor detection) walks that three-stage pipeline (keyword + embeddings in parallel, aggregate to a structural unit, one LLM arbiter at the end), with runnable code on the Transformer paper as the running example. Article 7C (the LLM arbiter) closes the loop on the arbiter call, the decision tree that picks methods per question, the “not found” path, and the unified JSON contract retrieval hands to generation.

This article is part of the Enterprise Document Intelligence series. The minimal RAG pipeline shows retrieval in use end to end on a real PDF.

4. Sources and further reading

This article reframes retrieval as filtering on two structured tables, with anchor and context as separate granularities. The references below cover that mental model.

Same direction as the article:

Sarthi et al., RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, ICLR 2024 (arXiv:2401.18059). The closest published equivalent to the TOC-as-retriever framing; RAPTOR clusters its own tree, this article uses the TOC the document already declares.
Anthropic, Contextual Retrieval (Sept 2024). Hybrid-search consensus aligned with the anchor-and-context separation this article introduces.

Earlier in the series:

Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it is worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.
