brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Parsing comes first, and this is the second of its two parts. The previous part turned a PDF into line_df, one row per line of text on the page. This one covers the rest of the model: the full set of tables a parser should emit, what each one holds, and how they link together, so the table on page 14 keeps its columns and the renewal fee stays attached to its label. The other three bricks, and the highlighted answer at the end, all read these tables, never the raw PDF.
RAG tutorials start the same way: text = extract_text(pdf). That single line is where the PDF problems begin.
You build a RAG pipeline. It works on a few clean documents. Then a customer sends you a real contract: 30 pages, with a Schedule of Charges table on page 14. The user asks “what’s the renewal fee?” and the model returns the wrong number.
The team says: “the model can’t read tables.”
The model reads tables fine. The problem is upstream. Your parser walked the table cell by cell and joined them into one long string. The column structure is gone. The link between a label and its amount is gone. Your model is asked to guess which number is the renewal fee. Sometimes it guesses right. Often it doesn’t.
The parser did not fail. It gave you what you asked for. You asked for the wrong thing.
A good PDF parser does not extract text. It models the document as a relational set of tables. One PDF in, one table per kind of thing out (seven or eight today, and more as new needs show up).
- toc_df: the sections, like the author wrote them.
- page_df and line_df: the body. Every page. Every line.
- image_df: every figure on every page.
- span_df: bold, italic, color, font size. Every span of every line.
- object_registry: every figure caption, every table caption, every annex.
- cross_ref_df: every “see Figure 2”, every “see Table 4”, every “see Annex B”.
- parsing_summary: tells you if the PDF is born-digital, scanned, or mixed. Tells you if the OCR is good or bad.
Retrieval reads these tables. Generation reads these tables. Highlighting reads these tables. You open the PDF once. After that, you only work with tables.
This article covers each table in detail, then runs parse_pdf side by side on two very different PDFs to show that the same columns cover both. The previous article (“Beyond extract_text: the two layers of a PDF that drive RAG quality”) covers the upstream side: the declared signals the parser reads first and the page-level classification it runs before any line gets a number.
1. One table per entity
Everything we’ve extracted gets returned as a dictionary of tables plus a parsing summary, one table per entity of the document model.
The naming convention makes granularity readable from the name itself. The diagram at the top of this article shows how each table is produced. Four come straight from the parse: line_df (the text lines), parsing_summary (the doc-level synthesis), toc_df (the native outline, via doc.get_toc), and image_df (via page.get_image_info). The other four are derived from line_df: page_df aggregates it by page, while span_df, object_registry, and cross_ref_df are extracted from its lines. How the tables then join one another is a separate question, taken up in section 2.
1.1. toc_df: table of contents
TOCs are everywhere in enterprise documents. Contracts, reports, policies, employee manuals, regulatory filings: almost all of them ship with a declared section structure, and that structure is the cheapest semantic signal you can hand a retriever.
The catch: it isn’t always native. Sometimes it’s only typographic (bold headings, numbered sections, indented subheadings) and has to be reconstructed from line_df + span_df.
We focus here on the native case (the common one for born-digital LaTeX, Word, and InDesign exports); reconstructing a TOC from typography when bookmarks are absent is its own topic, sketched by an adaptive parser and treated in full in a dedicated follow-up.
parent_idx and breadcrumb; empty when no native bookmarks – Image by author
How to build it: build_toc_df(doc) calls doc.get_toc(simple=False) (one entry per bookmark, with the destination dict attached) and walks the result to compute parent_idx, breadcrumb, end_page, and start_y. Run on the Attention paper, you get the 22 entries already shown in section 1.2 above: three levels of headings, native bookmarks, no reconstruction needed.
The implicit end_page convention: TOCs mark where sections begin, almost never where they end. build_toc_df materializes the end as a column anyway: for each row, end_page is the start_page of the next entry at the same level or shallower (the next peer or ancestor), with total_pages as the fallback for the last section. Look at Conclusion on the Attention paper: start_page=10, end_page=15. The document only has 15 pages, so the last section absorbs everything to the document’s end. The convention keeps a one-page overlap by design (a section’s end_page is its successor’s start_page, not successor.start_page - 1), which makes the generation brick’s next-page peek (a strong completeness signal that catches truncated lists at section boundaries) a single lookup rather than a runtime scan.
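The end_page rule, together with parent_idx and breadcrumb, can be sketched in a few lines. This is a minimal illustration, not the article's build_toc_df: the function name build_toc_rows, the plain (level, title, start_page) triples, and the sample outline are all hypothetical, and pages are assumed to be 1-based as in the start_page=10 / end_page=15 example above (adapt the bounds check if your page_num is 0-based).

```python
def build_toc_rows(toc: list[tuple[int, str, int]], total_pages: int) -> list[dict]:
    """toc holds (level, title, start_page) triples in document order."""
    rows: list[dict] = []
    stack: list[int] = []  # row indices of the open ancestors, shallowest first
    for level, title, page in toc:
        if not 1 <= page <= total_pages:
            continue  # drop entries that point outside the document
        # Close every open entry that is a peer of, or deeper than, this one.
        while stack and rows[stack[-1]]["level"] >= level:
            stack.pop()
        crumbs = [rows[i]["title"] for i in stack] + [title]
        rows.append({
            "toc_idx": len(rows),
            "level": level,
            "title": title,
            "start_page": page,
            "parent_idx": stack[-1] if stack else None,
            "breadcrumb": " > ".join(crumbs),
            "end_page": total_pages,  # fallback for the last section
        })
        stack.append(len(rows) - 1)
    # end_page is the start_page of the next peer or ancestor (one-page overlap).
    for i, row in enumerate(rows):
        for nxt in rows[i + 1:]:
            if nxt["level"] <= row["level"]:
                row["end_page"] = nxt["start_page"]
                break
    return rows


# Hypothetical outline; the last entry points past the end of the document.
sample_toc = [
    (1, "Introduction", 1),
    (1, "Methods", 3),
    (2, "3.5 Positional Encoding", 6),
    (1, "Conclusion", 10),
    (1, "Broken bookmark", 99),
]
toc_rows = build_toc_rows(sample_toc, total_pages=15)
```

In this sample, the subsection inherits the breadcrumb "Methods > 3.5 Positional Encoding", and "Conclusion" keeps the total_pages fallback because nothing follows it.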
The start_y column, for info: Each bookmark in a PDF outline carries a destination Point(x, y) on its target page, not just a page number. build_toc_df exposes the y as start_y (raw value as returned by fitz). It pins each section header to a precise position within start_page, which is what enables line-level resolution: the same (target_page, target_y) → line join used for native links in section 1.6. Same coordinate-orientation caveat: 720 on the Attention paper (LaTeX, bottom-up) and 72 on NIST CSF (Acrobat, top-down) both point at the top of the page, just from opposite origins. We store the raw value; callers normalize when they need to land on a specific line.
start_page and end_page are page-level anchors. Line-level anchors (start_line, end_line) are the natural refinement: they let downstream stages pinpoint a section to the exact line in line_df, and they enable TOC offset detection when the document has front matter inserted after the TOC was generated (the entire TOC drifts by 1 or 2 pages, a real-world failure mode). The full treatment lives in a dedicated bonus article on TOC anchoring and validation; for now, toc_df stops at page-level granularity (with start_y as the bonus column for callers ready to resolve to a line).
The role: toc_df is the cheapest semantic signal in the entire pipeline. Each entry names a section: knowing that lines 100–150 belong to “3.5 Positional Encoding” tells the retriever and the LLM what those lines are about, before any embedding is computed. Embeddings give you topical proximity; the TOC gives you the document’s own structural meaning of each region, declared by the author, not inferred. The breadcrumb extends this with hierarchical context: a chunk gets stamped with “Methods > 3.5 Positional Encoding”, giving the language model section-level grounding without inflating the chunk text. end_page is what lets the generation brick peek one page past a retrieved section and detect truncated answers without a vision pass. When the document has a native TOC, all of this is free.
Watch out: TOC entries can point to pages that don’t exist (a corrupt or truncated export). Validate 0 <= page_num < n_pages before recording a row, or a section anchor lands nowhere and the page-range join from section 2 silently returns empty.
1.2. line_df: line granularity
The source of truth for text content. Every line of the PDF, with its position and dominant typographic style.
column_position – Image by author
How to build it: fitz_pdf_to_line_df(pdf_path) walks every text block of every page and emits one row per line. assign_column_positions(line_df) then annotates each row with single / left / right / multi. Run on data/paper/1706.03762v7.pdf, the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page). Here is page 4 of the paper (the two-column Figure 2 region):
The role: line_df is the unified per-element manifest of the document. Text lines first, but the same row structure also carries image placeholders and table placeholders: each visible content element on a page is one row, with its own bbox, column_position, and a content_type flag (text, image, table). Text-specific fields (font, render_mode) are NaN for non-text rows; the rich image and table metadata lives in image_df and the table extractor’s output, joined back via (page_num, line_num). The result is that a single sorted query against line_df.page_num returns every element on a page in reading order, regardless of its kind. Downstream stages don’t have to join three tables to know what’s on this page.
Watch out: on multi-GB or thousand-page PDFs, holding every line (and image) in memory at once is a problem. A lightweight mode that skips line_df and image_df for endpoints needing only parsing_summary (classification, the doc-level summary) keeps those cheap; gate the full parse at ingestion time for the rest.
The screenshot below is from Enterprise Document Intelligence, the desktop app I’m building. The Text panel on the right is line_df made visible: the page’s native text, line by line, parsed once and read straight from the table, next to the original page it came from.
1.3. page_df: page granularity
Per-page synthesis. Classification, flags, aggregated metrics.
page_type, additive flags, char counts, n_columns – Image by author
How to build it: build_page_df(line_df) groups line_df by page_num. detect_columns_per_page(line_df) computes n_columns and the result is merged in.
What else fits here: build_page_df is the right home for any per-page signal you can aggregate from line_df on the same pass. Beyond the core triplet, simple aggregations land here for free: n_lines (page density), native_chars versus ocr_chars (a fast scanned-or-native verdict, no classifier needed), n_fonts and font-size spread (a rough structure indicator that separates heading-heavy pages from plain prose), image_coverage_ratio (a join with image_df). The columns that need a downstream pass wait: page_type is produced by classify_page (covered in the previous article) and parsing_method / context_structured are produced by an adaptive cascade that escalates to a heavier parser when fitz is not enough.
Run on the Attention paper:
The role: page_df is where extraction is anchored. Every parser, every OCR run, every classifier operates page by page; page_df is the table that records what each page is and how it should be handled. The page is also a good semantic unit on its own: roughly one or two ideas per page in academic papers, one clause per page in contracts, one sub-topic per page in technical reports. Small enough to be focused, large enough to carry context. That’s why retrieval typically defaults to page-level chunks in a minimal RAG pipeline and why most downstream coordination keys off page_num. When you query “what is page 5 about”, page_df is the row that answers; when you query “all scanned pages with bad OCR”, page_df is what you filter.
Watch out: store page_width and page_height per row, never once per document. Letter and A4 mix in technical publishing, and a landscape page is often inserted for a wide table; a single document-level page size makes every bbox-derived metric (column detection, full-page-image coverage) drift on the odd-sized pages.
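The per-page aggregations described in this section can be sketched with a single groupby. This is a rough illustration under assumptions, not the article's build_page_df: the function name build_page_df_sketch is mine, and it assumes line_df carries text, font_name, content_type, page_width, and page_height columns (the last two stored per row, as the warning above recommends). The sample rows are made up.

```python
import pandas as pd


def build_page_df_sketch(line_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate text rows of line_df into one row per page."""
    text_rows = line_df[line_df["content_type"] == "text"]
    return (
        text_rows.groupby("page_num")
        .agg(
            n_lines=("text", "size"),                            # page density
            n_chars=("text", lambda s: int(s.str.len().sum())),  # character count
            n_fonts=("font_name", "nunique"),                    # rough structure signal
            page_width=("page_width", "first"),                  # kept per page, never per document
            page_height=("page_height", "first"),
        )
        .reset_index()
    )


# Made-up rows: a portrait page and a landscape page holding an image and a caption.
line_df = pd.DataFrame([
    {"page_num": 1, "line_num": 0, "text": "Abstract", "font_name": "Bold",
     "content_type": "text", "page_width": 612.0, "page_height": 792.0},
    {"page_num": 1, "line_num": 1, "text": "We propose", "font_name": "Regular",
     "content_type": "text", "page_width": 612.0, "page_height": 792.0},
    {"page_num": 2, "line_num": 0, "text": "", "font_name": None,
     "content_type": "image", "page_width": 792.0, "page_height": 612.0},
    {"page_num": 2, "line_num": 1, "text": "Table 1", "font_name": "Regular",
     "content_type": "text", "page_width": 792.0, "page_height": 612.0},
])
page_df = build_page_df_sketch(line_df)
```

A page that holds only images would drop out of this groupby; a fuller builder would reindex on the complete list of pages before merging in page_type and n_columns.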
1.4. image_df: image granularity
One row per embedded image.
How to build it: The parser walks every page and calls page.get_image_info(), which returns each embedded image with its displayed bounding box and intrinsic dimensions. The Attention paper has three:
Describing the image content: So far image_df only locates each image: a bounding box, a size, a content hash. It says nothing about what the image shows, and a bounding box is not retrievable. A chart or a diagram holds no extractable text, so OCR and layout-based parsers leave that part empty: to them the region is invisible. To make the figure searchable we run a vision LLM over each extracted image and store a short description alongside the row, for example “a line chart of commodity prices since 2022” or “the Transformer architecture, an encoder of N stacked layers”. That description is text, so retrieval can match it. A companion piece on vision-LLM enrichment walks this step in full.
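As a rough sketch of what one image_df row might hold, the helper below turns already-fetched image info dicts into rows and leaves the description empty until a vision pass fills it. The function name image_rows, the column names, and the sample values are hypothetical. The input keys (bbox, width, height, digest) mirror what I understand PyMuPDF's page.get_image_info(hashes=True) to return; check them against your PyMuPDF version before relying on them.

```python
def image_rows(page_num: int, page_width: float, page_height: float,
               infos: list[dict]) -> list[dict]:
    """Build one row per embedded image from per-page image info dicts."""
    page_area = page_width * page_height
    rows = []
    for image_num, info in enumerate(infos):
        x0, y0, x1, y1 = info["bbox"]  # displayed bounding box on the page
        shown_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        rows.append({
            "page_num": page_num,
            "image_num": image_num,
            "bbox": (x0, y0, x1, y1),
            "width_px": info["width"],    # intrinsic pixel dimensions
            "height_px": info["height"],
            "content_hash": info.get("digest"),
            "coverage_ratio": round(shown_area / page_area, 4),
            "description": None,          # filled later by an optional vision pass
        })
    return rows


# Made-up info dict: an image covering the top-left quarter of a Letter page.
sample_infos = [{"bbox": (0.0, 0.0, 306.0, 396.0), "width": 1200, "height": 1553,
                 "digest": b"\x00" * 16}]
sample_image_rows = image_rows(3, 612.0, 792.0, sample_infos)
```

Keeping description as an explicit empty column makes the enrichment incremental: rows exist for every image, and the vision pass only has to fill the ones that matter.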
1.5. object_registry: cross-reference TARGETS
A cross-reference has two sides. The target is where a named object lives in the document: the line “Figure 2: The Transformer model architecture” on page 3, the line “Table 1: BLEU scores” on page 8. The source is a body-text mention pointing at the target: “as shown in Figure 2”, “see Table 1”. object_registry captures the target side, one row per caption. The next subsection (section 1.6) captures the source side. Resolving sources to target pages, so a retrieved chunk that mentions “see Table 1” also pulls the page where Table 1 lives, is a follow-up cross-reference pass that consumes both tables.
(object_type, object_id) is the join key – Image by author
How to build it: Detection uses regex patterns ANCHORED at the start of a line (a real caption starts there, a body-text mention does not); build_object_registry walks line_df, matches each line against the patterns, and keeps the first hit for every (object_type, object_id) pair. The patterns and the builder signature:
import re

import pandas as pd

OBJECT_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_object_registry(line_df: pd.DataFrame) -> pd.DataFrame:
    """Returns one row per (object_type, object_id), first match wins."""
Run on the Attention paper, the builder lands one row per named object, with the caption line as the anchor:
1.6. cross_ref_df: cross-reference SOURCES
The symmetric half of object_registry. Each row is one body-text mention of a named object: “as shown in Figure 2” on page 4, “refer to Table 1” on page 7, “see Annex B for details” on page 12. Every such mention is a source that, when resolved, jumps to a page recorded in object_registry.
Same pattern as the TOC, two methods can produce these rows: native PDF links (the deterministic source, when the document carries them) and text-pattern matching on line_df (the general fallback, what build_cross_ref_df ships). Method 1 is exact but partial. Method 2 is approximate but complete.
Method 1, native PDF links: A PDF can carry its own clickable cross-references. fitz.Page.get_links() returns one entry per link rectangle, with the target encoded as a (target_page, to.x, to.y) triple for an internal jump or a URI for an external one:
import fitz

doc = fitz.open("data/nist/NIST.CSWP.29.pdf")
for page in doc:
    for ln in page.get_links():
        tgt_page = ln.get("page")
        tgt_pt = ln.get("to")  # Point(x, y) on the target page
        print(page.number + 1, ln.get("kind"), tgt_page, tgt_pt, ln.get("uri"))
The interesting bit is to.y. Knowing only the target page tells you where in the document the link lands but not what it points at; the y coordinate pins the line within that page. We split the destination into two scalar columns, tgt_page and tgt_y, and resolve the target line by finding the row in line_df whose y0 is closest to tgt_y on tgt_page.
Two practical caveats here:
- PDF generators differ on y orientation. LaTeX returns bottom-up, Acrobat returns top-down. The normalizer tries both and keeps the closer match.
- tgt_y may sit between two lines. We round to the nearest one.
The payoff: once we know the landing line, we can join (target_page, landing_text) against toc_df and recover the section index directly. No regex, no text matching against breadcrumbs. The native link tells us exactly which toc_idx we landed in.
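The landing-line resolution described above (try both y orientations, keep the closest line) can be sketched as follows. It is an illustration under assumptions: plain dicts stand in for line_df rows, y0 is assumed to be measured from the top of the page, and the function name resolve_landing_line and the sample coordinates are hypothetical.

```python
from typing import Optional


def resolve_landing_line(lines: list[dict], tgt_page: int, tgt_y: float,
                         page_height: float) -> Optional[dict]:
    """Return the line on tgt_page whose y0 is closest to the link destination.

    The destination is read twice, once as top-down and once as bottom-up,
    and the closer of the two matches wins.
    """
    candidates = [ln for ln in lines if ln["page_num"] == tgt_page]
    best: Optional[tuple[float, dict]] = None
    for y in (tgt_y, page_height - tgt_y):  # top-down reading, then bottom-up
        for ln in candidates:
            dist = abs(ln["y0"] - y)
            if best is None or dist < best[0]:
                best = (dist, ln)
    return best[1] if best else None


# Made-up lines on a 792-point-high page.
sample_lines = [
    {"page_num": 2, "line_num": 0, "y0": 72.0, "text": "1 Introduction"},
    {"page_num": 2, "line_num": 1, "y0": 400.0, "text": "Recurrent models typically..."},
]
bottom_up_hit = resolve_landing_line(sample_lines, tgt_page=2, tgt_y=720.0, page_height=792.0)
top_down_hit = resolve_landing_line(sample_lines, tgt_page=2, tgt_y=72.0, page_height=792.0)
missing_page = resolve_landing_line(sample_lines, tgt_page=9, tgt_y=72.0, page_height=792.0)
```

Both 720 (bottom-up) and 72 (top-down) land on the same heading here. The approach can still pick the wrong line when both readings sit equally close to different lines, so a distance threshold is a reasonable addition.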
toc_df – Image by author
The same pipeline on the Attention paper turns up a different shape of link: citations that resolve to bibliography entries rather than TOC section starts.
landing_text – Image by author
Coverage is the catch. The two demo PDFs show the same pattern:
- Attention paper: 95 internal links, all citations jumping to bibliography entries, plus 18 external URIs (github, arxiv). Zero native links for body-text mentions like “as shown in Figure 2”.
- NIST Cybersecurity Framework 2.0 (CSWP-29; US Government work, public domain in the US, see NIST copyright statement): 47 internal links, all TOC entries and the list of figures pointing at section starts, plus 56 external URIs. Same story: no body-text figure or table mentions are linked.
Enterprise documents are usually worse, with no native links at all (scans, screenshots, exports from tools that drop link metadata). So native links are excellent signal when present (deterministic, resolvable to a toc_idx when the target is a section header) but never cover the full set of cross-references an article carries.
Method 2, text-pattern matching: Detection uses the same vocabulary as OBJECT_PATTERNS, but UNANCHORED so the regex matches anywhere inside a line; caption lines are excluded so the line that DEFINES Figure 2 isn’t also counted as a mention of it.
object_registry – Image by author
The patterns and the builder signature:
import re

import pandas as pd

REFERENCE_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_cross_ref_df(line_df: pd.DataFrame) -> pd.DataFrame:
    """One row per body-text mention, with ~30 chars of context."""
Run on the Attention paper, every body-text mention of a figure or table lands as a row, joinable back to object_registry:
Run on the demo PDFs, the Attention paper has 13 body-text mentions covering 6 unique objects (Figure 1, Figure 2, Table 1–4): some figures are referenced multiple times, which is exactly what the source-side table is meant to capture.
NIST CSF 2.0 has 13 mentions (7 figure references, 5 annex references, 1 table reference) covering 10 unique objects (5 figures, 4 annexes, 1 table). The mismatch with NIST’s object_registry (6 figures + 3 annexes + 2 tables) is informative:
- one annex is mentioned in the body without an anchored caption in the document (the regex catches a reference whose target lives outside the parsed text)
- one registered figure and one registered table are never referenced
Both are real-world signals worth surfacing to a downstream cross-reference resolver.
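The two sides, and the mismatch report between them, can be sketched end to end. This is not the article's implementation: it reuses the two pattern lists shown above but works on plain dicts instead of DataFrames, and the function names registry_sketch and cross_refs_sketch, the 15-character context window on each side, and the sample lines are all assumptions made for illustration.

```python
import re

OBJECT_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]
REFERENCE_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]


def registry_sketch(lines: list[dict]) -> list[dict]:
    """One row per (object_type, object_id); the first caption line wins."""
    found: dict[tuple[str, str], dict] = {}
    for ln in lines:
        for pattern, object_type in OBJECT_PATTERNS:
            m = pattern.match(ln["text"])
            if m:
                key = (object_type, m.group(1).upper())
                found.setdefault(key, {"object_type": key[0], "object_id": key[1],
                                       "page_num": ln["page_num"],
                                       "line_num": ln["line_num"]})
                break
    return list(found.values())


def cross_refs_sketch(lines: list[dict], registry: list[dict]) -> list[dict]:
    """One row per body-text mention; caption lines are excluded."""
    captions = {(r["page_num"], r["line_num"]) for r in registry}
    refs = []
    for ln in lines:
        if (ln["page_num"], ln["line_num"]) in captions:
            continue  # the line that defines an object is not a mention of it
        for pattern, ref_type in REFERENCE_PATTERNS:
            for m in pattern.finditer(ln["text"]):
                refs.append({"ref_type": ref_type, "ref_id": m.group(1).upper(),
                             "page_num": ln["page_num"], "line_num": ln["line_num"],
                             "context": ln["text"][max(0, m.start() - 15):m.end() + 15]})
    return refs


# Made-up lines: three captions, two body lines carrying three mentions.
sample_lines = [
    {"page_num": 3, "line_num": 10, "text": "Figure 2: The Transformer model architecture"},
    {"page_num": 4, "line_num": 2, "text": "as shown in Figure 2, the encoder stacks layers"},
    {"page_num": 7, "line_num": 5, "text": "results are listed in Table 1 and see Annex B"},
    {"page_num": 8, "line_num": 1, "text": "Table 1: BLEU scores"},
    {"page_num": 9, "line_num": 1, "text": "Table 3: Variations"},
]
registry = registry_sketch(sample_lines)
refs = cross_refs_sketch(sample_lines, registry)

registered = {(r["object_type"], r["object_id"]) for r in registry}
mentioned = {(r["ref_type"], r["ref_id"]) for r in refs}
never_referenced = registered - mentioned  # captions nobody points at
dangling = mentioned - registered          # mentions with no caption in the parsed text
```

The two set differences are the signals listed above: a registered object that is never referenced, and a mention whose target has no anchored caption. Note that object ids are kept as strings here because annex ids are letters.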
1.7. span_df: sub-line granularity (optional)
The line is sometimes too coarse. A line can mix bold and non-bold text (a defined term in a contract). A line in a research paper can include an inline equation in italic alongside prose. A line in an amendment can have the original text in black and the modification in red.
from pydantic import BaseModel

class Span(BaseModel):
    # Identity & ordering
    pdf_hash: str
    page_num: int
    line_num: int
    span_id: int
    # What it says, where it sits
    text: str
    bbox: tuple[float, float, float, float]
    # Typography signals
    font_name: str
    font_size: float
    is_bold: bool
    is_italic: bool
    color_rgb: tuple[int, int, int]
A span_df is more granular than line_df. On the Attention paper the ratio is 3,480 spans for 1,048 lines, about 3.3× heavier. The cost only pays off for stages that inspect typography:
- Heading detection: A line in a larger font, possibly bold, is probably a heading. A TOC reconstruction pass uses this when native bookmarks are absent.
- Listing detection: A bold span starting a paragraph is often the marker of an enumeration item.
- Defined terms in contracts: Bold or italicized terms in legal documents are often defined elsewhere; capturing them at parse time enables glossary linking later.
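The typography fields of the Span model could be filled from PyMuPDF's span dictionaries roughly as below. This is a sketch, not the article's build_span_df: the function name span_row and the sample values are hypothetical. It relies on the span keys returned by page.get_text("dict") (text, bbox, font, size, flags, color) and on PyMuPDF's documented flag bits, where 2 marks italic and 16 marks bold.

```python
def span_row(pdf_hash: str, page_num: int, line_num: int, span_id: int,
             span: dict) -> dict:
    """Map one PyMuPDF span dict onto the fields of the Span model."""
    flags = span["flags"]
    color = span["color"]  # packed sRGB integer, 0xRRGGBB
    return {
        "pdf_hash": pdf_hash,
        "page_num": page_num,
        "line_num": line_num,
        "span_id": span_id,
        "text": span["text"],
        "bbox": tuple(span["bbox"]),
        "font_name": span["font"],
        "font_size": span["size"],
        "is_bold": bool(flags & 16),   # bit 4: bold
        "is_italic": bool(flags & 2),  # bit 1: italic
        "color_rgb": ((color >> 16) & 255, (color >> 8) & 255, color & 255),
    }


# Made-up span: bold serif text (flags 4 + 16) drawn in pure red.
sample_span = {"text": "Defined Term", "bbox": (108.0, 72.0, 180.0, 84.0),
               "font": "ExampleSerif-Bold", "size": 11.0,
               "flags": 20, "color": 0xFF0000}
sample_row = span_row("abc123", 4, 12, 0, sample_span)
```

Some generators signal bold only through the font name rather than the flag bits, so a check on the font name is a common complement to the bit test.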
How to build it: By default, parse_pdf(...) returns span_df empty. The downstream stages that need it call a dedicated builder on the same PDF:
paper = parse_pdf(paper_pdf)
paper["span_df"] = build_span_df(paper_pdf)  # 3,480 rows on the Attention paper
Keeping the spans behind an explicit call avoids paying their cost on every parse for stages that only need line_df. Run on the Attention paper:
is_bold keys the TOC reconstructor – Image by author
1.8. parsing_summary: technical synthesis
A single JSON-serializable dictionary per document. It answers at a glance: “is this PDF scanned?”, “does it need OCR?”, “what extraction strategy should the next stage use?” And one more, the semantic one downstream bricks read: “what kind of document is this and what is it about?”
The dict is organised in five zones. The first four are deterministic, built by the parser without an LLM call. The fifth, semantic, carries the document type plus a short LLM-written summary that the question parser injects into its system prompt.
{
  "pdf_hash": "abc123...",
  "n_pages": 87,
  "pdf_version": "1.7",
  "source_software": "word_export",
  "creator_raw": "Microsoft Word 2019",
  "producer_raw": "Microsoft Word for Microsoft 365",
  "content_type": "scanned_with_ocr",
  "is_scanned": true,
  "has_text_layer": true,
  "ocr_quality": "good",
  "page_type_counts": {"scanned_ocr_good": 80, "native": 5, "empty": 2},
  "scanned_page_ratio": 0.92,
  "has_toc": true,
  "n_toc_entries": 24,
  "n_named_objects": 11,
  "is_encrypted": false,
  "has_form_fields": false,
  "recommended_strategy": "use_existing_ocr",
  "needs_reocr": false,
  "pages_needing_ocr": [],
  "doc_type": "annual_report",
  "typical_fields": ["fiscal_year", "revenue", "net_income", "auditor"],
  "summary": "87-page annual report for fiscal year 2023. Covers revenue, net income, and auditor's notes across operating segments. Standard sections: Letter to Shareholders, MD&A, Financial Statements, Notes."
}
The distinction between source_software (from metadata) and content_type (inferred from content) matters. The two can diverge: a PDF whose Producer is “Microsoft Word” but whose content is 100% scanned means somebody pasted images into a Word doc and exported. That’s useful information; don’t overwrite one with the other.
The semantic zone follows the same rule on a different axis. doc_type is a coarse family (resume, contract, academic_paper, invoice, memo, annual_report, …) derived from filename + first-page text. Deterministic, no LLM. typical_fields is the per-doc_type table of field names a question about this kind of document is most likely to target; a resume gets [name, email, phone, experience, …], a contract gets [policyholder, premium, deductible, …]. summary is the only LLM-derived value in the dict: three to four factual sentences naming the document type, the main subject, and the fields it carries. One LLM call at parsing time, cached forever, injected into the question parser’s system prompt so “what is the name?” on a CV no longer returns not found. The companion article on what to read before any line gets a number (“Beyond extract_text”) walks the full design of that summary.
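Two of the deterministic fields can be illustrated from the sample dict above: scanned_page_ratio as a plain ratio over page_type_counts, and a flag for the metadata-versus-content divergence. Both helpers are hypothetical sketches. The rule that every page type starting with "scanned" counts as scanned, and the OFFICE_SOURCES set, are assumptions of mine and not the parser's actual logic.

```python
# Assumption: source_software values that denote an office-suite export.
OFFICE_SOURCES = {"word_export"}


def scanned_page_ratio(page_type_counts: dict[str, int]) -> float:
    """Share of pages whose page_type starts with 'scanned' (assumed naming rule)."""
    total = sum(page_type_counts.values())
    scanned = sum(n for page_type, n in page_type_counts.items()
                  if page_type.startswith("scanned"))
    return round(scanned / total, 2) if total else 0.0


def metadata_content_divergence(summary: dict) -> bool:
    """True when metadata says office export but the content is scanned."""
    return summary["source_software"] in OFFICE_SOURCES and summary["is_scanned"]


# Values copied from the sample parsing_summary above.
sample_summary = {
    "source_software": "word_export",
    "is_scanned": True,
    "page_type_counts": {"scanned_ocr_good": 80, "native": 5, "empty": 2},
}
ratio = scanned_page_ratio(sample_summary["page_type_counts"])
diverges = metadata_content_divergence(sample_summary)
```

With 80 scanned pages out of 87, the ratio rounds to the 0.92 shown in the sample, and the divergence flag keeps both signals visible instead of letting one overwrite the other.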
2. The relational model: how the tables link
Producing the tables is one thing; linking them is another. Once the tables exist, the keys they share turn eight separate DataFrames into one queryable model, and almost every link resolves back to line_df, the per-line source of truth.
A few links carry most of the weight:
- toc_df → line_df. A TOC entry knows its start_page (and start_y), so from any section you jump straight to the lines that belong to it. “Summarize section 3.5” becomes a page-range filter on line_df, no search required.
- image_df ↔ line_df. An image occupies a position on the page, so it has a line slot in line_df. That line’s text is empty at first, since an image carries no extractable text. Optionally, a vision pass reads the image and writes a short description back into that text cell, so retrieval can match “the architecture diagram” later. The link is what makes that enrichment incremental: fill it when you need it, leave it empty when you don’t.
- cross_ref_df → its target. A body-text mention resolves to wherever the target lives. “see Figure 2” resolves to object_registry on (ref_type, ref_id); “see section 2.3” resolves to a toc_df entry. The table fills in as references are matched, so resolution runs lazily, mention by mention.
- page_df, span_df, object_registry anchor to line_df on page_num or (page_num, line_num), the same join every downstream brick relies on.
Concretely, common questions collapse into one or two filters:
- “Summarize section 3.5.” Look up its start_page and end_page in toc_df, then line_df[line_df.page_num.between(start, end)]. No embedding, no keyword search, just the section’s lines.
- “What are the totals?” on the invoice from section 3.2 → line_df[line_df.column_position == "right"]. The column the parser detected is now a query.
- “What does Figure 2 show?” object_registry resolves the caption to its page and line; line_df returns the caption text; and if a vision pass has filled the image’s slot, you get the description too.
- “Where is Table 1 referenced?” cross_ref_df[(cross_ref_df.ref_type == "table") & (cross_ref_df.ref_id == 1)] lists every mention with its (page_num, line_num), joined back to toc_df to name the section each one sits in.
Each is a filter or a join on tables already in memory, never a re-parse.
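The filters and joins listed above can be tried on toy tables. The sketch below is illustrative only: the four small DataFrames are made up, column names follow the ones used in this article, and ref_id is stored as a string here (annex ids are letters), so the comparison is against "1" rather than 1.

```python
import pandas as pd

# Made-up miniature tables with the article's column names.
toc_df = pd.DataFrame({"toc_idx": [0, 1],
                       "title": ["3 Model Architecture", "6 Results"],
                       "start_page": [2, 7], "end_page": [7, 10]})
line_df = pd.DataFrame({"page_num": [2, 4, 8], "line_num": [0, 5, 3],
                        "text": ["3 Model Architecture", "as shown in Figure 2",
                                 "see Table 1 and Annex B"]})
object_registry = pd.DataFrame({"object_type": ["figure", "table"],
                                "object_id": ["2", "1"],
                                "page_num": [3, 8], "line_num": [10, 1]})
cross_ref_df = pd.DataFrame({"ref_type": ["figure", "table", "annex"],
                             "ref_id": ["2", "1", "B"],
                             "page_num": [4, 8, 8], "line_num": [5, 3, 3]})

# "Summarize section 3": a page-range filter, no search.
section = toc_df.loc[toc_df["title"] == "3 Model Architecture"].iloc[0]
section_lines = line_df[line_df["page_num"].between(section["start_page"],
                                                    section["end_page"])]

# Resolve every mention to its caption; unmatched mentions stay visible.
resolved = cross_ref_df.merge(
    object_registry, how="left",
    left_on=["ref_type", "ref_id"], right_on=["object_type", "object_id"],
    suffixes=("_src", "_tgt"), indicator=True)
unresolved = resolved[resolved["_merge"] == "left_only"]

# "Where is Table 1 referenced?": mentions, each tagged with its section.
table_1 = cross_ref_df[(cross_ref_df["ref_type"] == "table")
                       & (cross_ref_df["ref_id"] == "1")]
located = table_1.merge(toc_df, how="cross")
located = located[located["page_num"].between(located["start_page"],
                                              located["end_page"])]
```

Because of the one-page overlap in end_page, a mention sitting on a boundary page would match two sections in the last join; keeping the deeper or later match is one way to break the tie.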
This is what the joins buy you downstream. Retrieval pulls a section from toc_df, expands it to its lines in line_df, and widens to the figures it mentions through object_registry; generation reads those lines; highlighting renders citations back onto the page by (page_num, line_num). The whole pipeline becomes a chain of cheap joins on one parse, instead of re-reading the PDF at every step. How these joins become concrete SQL primary keys, foreign keys, and indexes is the storage layer’s job, beyond this article’s scope.
3. parse_pdf on two real PDFs, side by side
parse_pdf is the single entry point that calls every helper above and returns the full set of linked tables in one go. Run it on two very different PDFs and the output structure is identical: same keys, comparable shapes.
3.1. parse_pdf side-by-side on two real PDFs
Running both calls and laying the two returned dicts side by side shows that the keys hold up, with per-cell tallies that reflect each document’s shape:
A LaTeX research paper and the NIST Cybersecurity Framework 2.0 (CSWP-29, US government work, public domain). Two very different documents: one has 15 pages of math notation in a NeurIPS-style two-column layout, the other 32 pages of policy text mixing single and two-column sections. Same parse_pdf call, same keys, every column comparable. The Attention paper drops a useful surprise on the way: this arXiv version carries 22 native TOC entries, contrary to the common assumption that arXiv strips bookmarks.
The PDF is opened once with fitz, every helper consumes the same document state, and the file is closed before returning. No reopening, no redownload from S3, no inconsistency between two helpers seeing different page versions. From here, retrieval, generation, and annotation never touch the PDF again. They query the dict.
3.2. column_position in action (an invoice)
Invoices are the canonical case for column_position: line items run down the left column (descriptions), prices and totals stack down the right column. We pick a one-page fictional invoice (data/invoices/invoice_01.pdf, openly-licensed, generated for the series) so the layout is honest two-column billing instead of a research paper’s figure caption.
Look at the source page first. Each line is boxed by the column the parser gave it: blue for the left (descriptions), green for the right (amounts and totals). assign_column_positions picks that split cleanly:
The header line sits in the left column at x0 = 54. Below the items table, the totals stack on the right: “TOTAL DUE:” at x0 ≈ 391, the amount $2,027.56 at x0 ≈ 497. The line item at y0 = 397.13 shows the split clearly: the description “Staff training” sits at x0 = 54 (left), the quantity 0.5 and unit price $197.58 sit at x0 ≈ 343 and x0 ≈ 395 (right). Downstream, asking for “the totals” becomes a one-line query against line_df: line_df[line_df["column_position"] == "right"].
No vision pass, no bbox arithmetic. Just a column filter on a structured table.
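To make the idea of a column label concrete, here is one possible heuristic that reproduces the left/right split on the invoice coordinates quoted above. It is a stand-in, not the article's assign_column_positions: the name guess_column_position, the midline rule, the 612-point page width, and the x1 values in the examples are all assumptions, and the "multi" label is not covered.

```python
def guess_column_position(x0: float, x1: float, page_width: float) -> str:
    """Label a line by where its bbox sits relative to the page midline."""
    mid = page_width / 2
    if x1 <= mid:
        return "left"    # the whole line sits in the left half
    if x0 >= mid:
        return "right"   # the whole line sits in the right half
    return "single"      # the line crosses the midline


# x0 values from the invoice example; x1 values and page width are made up.
description = guess_column_position(54.0, 130.0, 612.0)
total_due = guess_column_position(391.0, 460.0, 612.0)
amount = guess_column_position(497.0, 560.0, 612.0)
full_width = guess_column_position(54.0, 558.0, 612.0)
```

A midline rule breaks on asymmetric layouts; clustering x0 values per page is a sturdier option once documents stop being neat two-column pages.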
3.3. Two PDFs, same parser, same shape
Two very different documents, the same parser, directly comparable structured outputs:
What this would have looked like with a naive get_text() parser: a string per document, no way to tell which lines were OCR’d and which were native, no idea where each figure caption sits, no separation between left and right halves of a two-column page. The retrieval and generation stages would have built on sand.
4. Save once, reload forever
Parsing can be the most expensive brick in the pipeline. Question parsing, retrieval, and generation each cost one LLM call; parsing reads bytes and resolves layout. With PyMuPDF it stays cheap (sub-second on a small paper). With heavier engines (Azure Layout, Tesseract, vision-LLM fallback), the same PDF can take 30 seconds to several minutes per run. Three iterations on a downstream prompt then mean three OCR runs, and there is no reason to pay for any of them after the first.
The fix is path-driven. Each PDF writes its parsed tables to a mirror folder under the output directory, matching the source path exactly. From the PDF path alone, every downstream step (retrieval, generation, annotation) knows where the cache lives.
data/ has a twin folder in output/ carrying its parsed tables – Image by author
The relational tables go to .xlsx (one file per table, opens with a double-click), parsing_summary to JSON. Excel is enough at this stage: pandas round-trips cleanly, and each table stays inspectable in any spreadsheet tool. A production storage layer swaps in SQLite (foreign keys, joins across documents, append-on-update), but the downstream bricks consume DataFrames either way.
save_parsed writes the folder; load_parsed returns the same dict, or None if the cache is missing. The calling pattern is a few lines:
parsed = load_parsed(pdf_path)
if parsed is None:
    parsed = parse_pdf(pdf_path)
    save_parsed(pdf_path, parsed)
The downstream bricks follow suit. Question parsing writes its ParsedQuestion to questions/, retrieval saves retrieved_pages.xlsx, generation saves answer.json. Every step is fully recoverable from disk, every step can be replayed without touching the LLM again. When you tweak a generation prompt, you’re not paying for parsing or retrieval to re-run.
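The mirror-folder convention can be sketched with pathlib alone. This is an assumption-laden illustration, not the article's save_parsed / load_parsed: the helper names, the exact folder layout (one folder per PDF, named after the file without its extension), and the parsing_summary.json file name are mine, and only the JSON summary is handled here. The tables would go through pandas' to_excel and read_excel in the same folder.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def cache_dir_for(pdf_path, data_root="data", output_root="output") -> Path:
    """Mirror data/<sub>/<name>.pdf to output/<sub>/<name>/ (assumed layout)."""
    relative = Path(pdf_path).relative_to(data_root)
    return Path(output_root) / relative.with_suffix("")


def save_summary(pdf_path, summary: dict, **roots) -> None:
    folder = cache_dir_for(pdf_path, **roots)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "parsing_summary.json").write_text(
        json.dumps(summary, indent=2), encoding="utf-8")


def load_summary(pdf_path, **roots) -> Optional[dict]:
    """Return the cached summary, or None when the cache is missing."""
    path = cache_dir_for(pdf_path, **roots) / "parsing_summary.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


mirrored = cache_dir_for("data/paper/1706.03762v7.pdf")

# Round trip inside a temporary directory so nothing touches the real tree.
with tempfile.TemporaryDirectory() as tmp:
    roots = {"data_root": Path(tmp) / "data", "output_root": Path(tmp) / "output"}
    pdf = roots["data_root"] / "paper" / "1706.03762v7.pdf"
    before = load_summary(pdf, **roots)
    save_summary(pdf, {"n_pages": 15}, **roots)
    after = load_summary(pdf, **roots)
```

Deriving the cache location from the PDF path alone is what lets every downstream step find the parse without a shared registry.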
5. Conclusion
A good RAG parser does not extract text. It turns an unstructured PDF into a relational model of the document: a set of linked tables, joined by shared identifiers (page_num, line_num, (ref_type, ref_id)), each carrying one entity. Retrieval, generation, and annotation never re-read the PDF afterwards; they query DataFrames. Saving the parse once and reloading it forever turns a 30-second-per-question latency into a per-corpus one-shot cost.
A relational set of tables, one PDF in, no flat string out. Every downstream tool the team wires onto the parser (keyword search, embedding similarity, section retrieval, citation rendering, audit log, change tracking) reads from these tables rather than from the original bytes. The PDF is opened once, at ingest. After that, everything is SQL or pandas. That property is what makes the parsing brick worth the engineering investment: the cost is paid once per document, and every iteration on the rest of the pipeline runs against a stable, queryable artefact.
This article is part of the Enterprise Document Intelligence series. The minimal RAG pipeline shows the relational tables in use end-to-end on a real PDF.
6. Sources and further reading
The parser this article describes follows the same architecture as Docling (Auer et al., Docling Technical Report, IBM Research 2024): layout detection, TableFormer, reading-order. Borderless table extraction uses the model from Smock et al. (PubTables-1M / Table Transformer, CVPR 2022). The page-class taxonomy is built on the same baseline as Pfitzmann et al. (DocLayNet, KDD 2022). The article adds a render-mode detection pass (native / scanned / mixed) with OCR-quality scoring on top. The parser produces a relational set of tables (line_df, page_df, image_df, toc_df, object_registry, cross_ref_df, span_df, plus a parsing_summary dict); retrieval, generation, and annotation downstream do not read the PDF again, they query DataFrames.
Same direction as the article:
- Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Reference architecture for the pipeline this article describes: layout detection, TableFormer, reading-order, unified document representation.
- Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). Vision-based table detection and structure recognition; the model behind most modern table parsers.
- Pfitzmann et al., DocLayNet, KDD 2022 (arXiv:2206.01062). Empirical baseline for the page-class taxonomy and layout detection benchmarks.
- Lo et al., PaperMage, EMNLP 2023 demos. Maps to the indexing-vs-reading split (parsing for retrieval is not parsing for answer generation).
Different angle, different context:
- Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image. The context is retrieval where the page image is the artefact, no parsing-into-tables step. This article uses bounding-box-anchored DataFrames as the foundation instead.
- Wang et al., DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding, JPMorgan 2024 (arXiv:2401.00908). Layout-aware LLM that reads the PDF directly without an explicit relational parsing brick. Same family of approach as ColPali; different from this article’s queryable relational artefact.
- Kim et al., OCR-free Document Understanding Transformer (Donut), ECCV 2022 (arXiv:2111.15664). End-to-end OCR-free document understanding; useful contrast with the OCR-quality-scoring pass this article adds on top of the render-mode detection.
