Parse Scanned PDFs for RAG with EasyOCR: Free OCR Gives You Words, Not a Document

This article is part of Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz), which returns empty text on a scanned page with no text layer. This companion swaps the engine for EasyOCR, a free OCR package that recovers that text. It is the one case in this family where the new engine gives you less, not more: it recovers the text and nothing around it, and that gap is the lesson.

Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), with a different parsing engine – Image by author

Scanned PDFs are not solved by “just throw OCR at it”. The OCR step recovers text; that’s necessary but not sufficient for an enterprise RAG pipeline. What the pipeline also needs is everything around the text: where the page boundaries are, which lines are section headings, what is a figure, what is a table row vs a free paragraph. “Traditional OCR” (the term of art for text-detection + text-recognition engines like EasyOCR, Tesseract, PaddleOCR) gives you the text. It gives you nothing else. The rest is the layout problem, and the layout problem is the harder half.

This article makes that distinction concrete. The traditional-OCR engine is EasyOCR, JaidedAI’s free text-detection + recognition library and the simplest and fastest engine of the family (Apache 2.0, declared in the project’s LICENSE file). The layout-aware engine is Docling (Article 5ter; MIT license, declared in the project’s LICENSE file). Both can OCR a scanned page. They differ in what they do with the result. The whole article sets up the head-to-head in section 5, on a real public-domain 1974 scan.

EasyOCR is the OCR floor: line_df only, no layout. The rest of the family adds structure – Image by author

1. What “traditional OCR” does (and doesn’t)

Traditional OCR reads pixels and returns text rectangles. Everything else (sections, tables, figures, reading order) is a separate layout problem that the engine does not address. Two models sit behind it: text detection (find rectangular regions of the image that contain text) and text recognition (read each region’s pixels and return characters with a confidence score). The output is a flat list of (bbox, text, confidence) triples, one per detected region.

That is everything EasyOCR (or Tesseract, or PaddleOCR) does. A two-column page comes back as a flat list of left-and-right text boxes intermixed by y-coordinate; the engine does not know there are two columns. A table comes back as a grid of disconnected cells that the engine cannot tell apart from regular paragraphs. A figure caption is just another text box. The page header, the page footer, and marginalia all show up as boxes too.

Anything that needs “this text is a section heading” or “these four boxes are one table row” needs a second model on top, a layout model. The layout model reads the OCR output plus the page image and classifies each region (heading, paragraph, table cell, figure, caption, footer…) and groups them into a reading order. That is what Article 5bis (Azure DI), Article 5ter (Docling), and Article 5quater (vision LLM) all add over the OCR step. Without one, you have “OCR output”, not “a parsed document”.
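To make "these four boxes are one table row" concrete, here is a minimal sketch of the crudest possible grouping step: boxes whose vertical centres sit close together are treated as one row. The `Box` type, the `group_into_rows` helper, the 5-unit tolerance and the sample boxes are all hypothetical illustrations, not part of EasyOCR or of the series' code, and a real layout model is a learned classifier that does far more than this.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One OCR detection reduced to an axis-aligned box (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def centre_y(box: Box) -> float:
    return (box.y0 + box.y1) / 2


def group_into_rows(boxes: list[Box], tolerance: float = 5.0) -> list[list[str]]:
    """Naive heuristic: a box joins the current row when its vertical centre
    is within `tolerance` of the centre of that row's first box."""
    rows: list[list[Box]] = []
    for box in sorted(boxes, key=centre_y):
        if rows and abs(centre_y(box) - centre_y(rows[-1][0])) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Inside each row, read left to right.
    return [[b.text for b in sorted(row, key=lambda b: b.x0)] for row in rows]


# Made-up detections for a 2 x 3 table, in arbitrary order.
boxes = [
    Box(300, 100, 360, 112, "Q2"),
    Box(50, 101, 120, 113, "Region"),
    Box(200, 100, 260, 112, "Q1"),
    Box(50, 121, 120, 133, "North"),
    Box(300, 120, 360, 132, "14"),
    Box(200, 120, 260, 132, "12"),
]
rows = group_into_rows(boxes)
print(rows)
```

The heuristic recovers rows on this tidy input, but it has no notion of headers, merged cells, or whether the boxes form a table at all. That classification is the part the OCR engine does not attempt.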

2. EasyOCR: the canonical traditional OCR

EasyOCR is the cleanest demonstration of “traditional OCR” as a class. The library is small (~150 MB of model weights cached on first call), free, local, and able to run on CPU alone (the examples here pass gpu=False). The whole library API is two calls: build a Reader for the languages you need, then hand readtext an image. Each detection comes back as a triple: the polygon around the text, the recognised string, and the recogniser’s own confidence.

import easyocr
import fitz
import numpy as np

reader = easyocr.Reader(["en"], gpu=False)        # first call downloads ~150 MB

# render page 1 of a scanned PDF to a numpy array EasyOCR can read
page = fitz.open("data/contracts/scanned_amendment.pdf")[0]
pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0))   # 2x zoom = ~144 DPI
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
    pix.height, pix.width, pix.n,
)

# the recogniser: one image in, one triple per detected text region out
detections = reader.readtext(img)
for quad, text, conf in detections:
    # quad = [[x0,y0], [x1,y0], [x1,y1], [x0,y1]]  in pixel coords
    print(round(conf, 2), text)

parse_pdf_easyocr wraps that loop. It walks every page of the PDF, renders each to a numpy array, calls readtext, converts the pixel-space polygons back to PDF coordinates, and packs the detections into the same dict-of-tables contract as the other parsers, same line_df, same parsing_summary, same downstream consumers, except that only those two keys carry data. Every other slot (page_df, image_df, toc_df, span_df, object_registry, cross_ref_df) comes back as an empty DataFrame. That isn’t a missing-feature bug; it’s exactly what “traditional OCR” means.
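The coordinate conversion inside that loop can be sketched as follows. This is a minimal illustration, not the actual body of parse_pdf_easyocr: the helper name `quad_to_pdf_bbox` is hypothetical, and it assumes the page was rendered whole and unrotated with the same scale on both axes, so pixel coordinates are simply PDF points multiplied by `render_scale`.

```python
def quad_to_pdf_bbox(quad, render_scale: float) -> tuple[float, float, float, float]:
    """Collapse a 4-point pixel polygon to an axis-aligned (x0, y0, x1, y1)
    box in PDF points. Assumes a full-page, unrotated render (hypothetical helper)."""
    xs = [float(x) for x, _ in quad]
    ys = [float(y) for _, y in quad]
    return (
        min(xs) / render_scale,
        min(ys) / render_scale,
        max(xs) / render_scale,
        max(ys) / render_scale,
    )


# A made-up detection polygon in pixel space, rendered at 2x.
quad = [[100, 200], [340, 200], [340, 236], [100, 236]]
bbox = quad_to_pdf_bbox(quad, render_scale=2.0)
print(bbox)
```

Taking the min and max over the four corners also handles slightly skewed polygons, at the cost of a box a little larger than the text.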

parsed = parse_pdf_easyocr(
    "data/contracts/scanned_amendment.pdf",
    languages=("en",),         # add "fr", "de", ... for multilingual scans
    render_scale=2.0,          # 2.0 = ~144 DPI ; raise for small fonts
    gpu=False,                 # CPU-only by default ; set True if CUDA available
    confidence_threshold=0.0,  # filter low-confidence detections if needed
)

parsed["line_df"]              # text + bbox + confidence per detection
parsed["parsing_summary"]      # method, page count, line count, render scale
# Every other key (page_df, image_df, toc_df, span_df, object_registry,
# cross_ref_df) is an empty DataFrame ; EasyOCR has nothing to put there.

The signature kwargs are the only knobs:

  • languages: tuple of EasyOCR language codes, mostly ISO-639-1 (en, fr, de, …; Chinese is ch_sim or ch_tra rather than zh). A multilingual corpus loads one Reader per language set; the @lru_cache in get_easyocr_reader keeps a handful of these in memory across calls.
  • render_scale: how many pixels per PDF unit when rasterising each page. 1.0 is native (~72 DPI, often too small). 2.0 is the sweet spot for body text. Raise to 3.0 for tiny fonts; lower if you’re memory-bound.
  • gpu: CPU is the default so the module works on any machine. CUDA gives a 3-5x speedup on text-heavy pages.
  • confidence_threshold: drop low-confidence detections. 0.0 keeps everything (the column is preserved so downstream code can filter), 0.3 cuts most noise on degraded scans.
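The render_scale trade-off is plain arithmetic: a PDF unit is 1/72 inch, so the effective DPI is 72 × render_scale, and the raster grows with the square of the scale. The sketch below works it through for a US Letter page (612 × 792 points), assuming an uncompressed 3-channel RGB buffer; `render_stats` is a hypothetical helper, and the exact pixel dimensions a renderer produces may differ by a pixel from this rounding.

```python
PDF_POINTS_PER_INCH = 72


def render_stats(width_pt: float, height_pt: float, render_scale: float,
                 channels: int = 3) -> dict:
    """Estimate DPI, pixel size and raw buffer size for one rendered page."""
    width_px = round(width_pt * render_scale)
    height_px = round(height_pt * render_scale)
    return {
        "dpi": PDF_POINTS_PER_INCH * render_scale,
        "width_px": width_px,
        "height_px": height_px,
        "bytes": width_px * height_px * channels,
    }


# US Letter page at the three scales discussed above.
for scale in (1.0, 2.0, 3.0):
    stats = render_stats(612, 792, scale)
    print(scale, stats["dpi"], stats["width_px"], stats["height_px"],
          f'{stats["bytes"] / 1e6:.1f} MB')
```

Going from 2.0 to 3.0 multiplies the pixel count by 2.25, which is why raising the scale is worth it only when small fonts actually need it.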

3. What line_df looks like

The sample rows come from the NIST FIPS 199 cover (US Government work, public domain in the US, see NIST copyright statement). There is one row per detected text region, holding the page coordinate, the OCR’d text, and the recogniser’s own confidence score. That is the whole output.

Same column shape as fitz’s line_df, plus a confidence column EasyOCR adds for free – Image by author

The shape is deliberately small:

  • text + bbox: the recogniser’s payload, one row per detected text region.
  • confidence: float between 0 and 1, EasyOCR’s self-score. Useful both as a filter (drop below 0.3 on noisy scans) and as a feedback signal (Article 8’s generation can flag low-confidence passages to the user).
  • character_count: kept for symmetry with the other parsers; on EasyOCR it’s just len(text).
  • No column index and no reading-order column. A two-column page comes back as a flat list, left-and-right boxes intermixed by y-coordinate.
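The two uses of the confidence column, filter and feedback signal, can be sketched on plain rows. The rows below are made up for illustration, `split_by_confidence` is a hypothetical helper, and the 0.6 flagging threshold is an assumption rather than a value from the series; only the 0.3 drop threshold comes from the text above.

```python
def split_by_confidence(rows, drop_below: float = 0.3, flag_below: float = 0.6):
    """Drop rows under `drop_below`; among the kept rows, flag those under
    `flag_below` so a later stage can warn the user about them."""
    kept, flagged = [], []
    for row in rows:
        if row["confidence"] < drop_below:
            continue  # too noisy to index at all
        kept.append(row)
        if row["confidence"] < flag_below:
            flagged.append(row)  # indexed, but marked as uncertain
    return kept, flagged


# Made-up line_df-style rows (text + confidence only).
rows = [
    {"text": "Standards for Security Categorization", "confidence": 0.94},
    {"text": "Federal lnformation", "confidence": 0.52},
    {"text": "~ .,", "confidence": 0.11},
]
kept, flagged = split_by_confidence(rows)
print([r["text"] for r in kept])
print([r["text"] for r in flagged])
```

Keeping the flagged rows in the index rather than dropping them preserves recall while still letting generation say which passages came from uncertain reads.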

Every other key in the returned dict (page_df, image_df, toc_df, span_df, object_registry, cross_ref_df) is an empty DataFrame. A consumer that calls parsed["image_df"] does not crash; it iterates an empty frame.

4. What traditional OCR misses, the layout gap, item by item

The RAG pipeline needs five structural artefacts that traditional OCR cannot produce, regardless of how big the recognition model is. Each missing artefact breaks a downstream operation the rest of the series relies on.

Take the third one, reading order, because it is the one that quietly corrupts an answer. EasyOCR returns text boxes sorted by their y-coordinate. On a two-column page the two columns sit at the same heights, so the boxes come back interleaved: first line of the left column, first line of the right column, second line of the left, and so on. The prose reads as a zigzag, and generation quotes the zigzag.

With no layout model, boxes come back sorted by y, so a two-column page interleaves into a zigzag – Image by author
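The zigzag is easy to reproduce on toy data. The sketch below sorts hypothetical boxes by y, as described above, then applies a deliberately naive fix that splits the page at its horizontal midpoint. The boxes, the helper names and the 612-point page width are assumptions for illustration; the midpoint split hard-codes the two-column knowledge that a layout model has to infer, and it breaks on full-width headings or three-column pages.

```python
# Hypothetical detections on a two-column page: (x0, y0, text).
boxes = [
    (50, 100, "L1"), (320, 100, "R1"),
    (50, 120, "L2"), (320, 120, "R2"),
    (50, 140, "L3"), (320, 140, "R3"),
]


def flat_order(boxes):
    """What a layout-blind pipeline sees: top to bottom, then left to right."""
    return [text for _, _, text in sorted(boxes, key=lambda b: (b[1], b[0]))]


def two_column_order(boxes, page_width: float):
    """Naive fix: read everything left of the midpoint, then everything right."""
    mid = page_width / 2
    left = sorted((b for b in boxes if b[0] < mid), key=lambda b: b[1])
    right = sorted((b for b in boxes if b[0] >= mid), key=lambda b: b[1])
    return [text for _, _, text in left + right]


print(flat_order(boxes))
print(two_column_order(boxes, page_width=612))
```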

In a single sentence: the OCR step recovers text, the layout step recovers what makes the text usable. Article 5ter (Docling) and Article 5bis (Azure DI) add the layout step on top of the same OCR. Article 5quater (vision LLM) folds the two into one call. EasyOCR stops at the OCR step.

5. EasyOCR vs Docling on a real scanned PDF

On the same 1974 scan, Docling extracts more characters (5,423 vs 4,952), the page boundaries, eleven TOC entries, and four figure regions. EasyOCR extracts text rectangles and stops. The two engines agree at the character level, both OCR with the same recogniser-class accuracy, but Docling’s layout pass turns the OCR output into a document.

The interesting comparison is not against fitz (fitz returns zero on a scan) but against the next engine up: Docling, the local layout-aware parser from Article 5ter. The comparison is cleaner than it looks: Docling’s default OCR backend is EasyOCR itself. The same recogniser reads the same pixels; the difference is everything Docling builds around it.

The test case is a real public-domain scan: pages 1–5 of karg74.pdf, the 1974 USAF MULTICS Security Evaluation (Karger & Schell, ESD-TR-74-193 Vol. II). NIST hosts it in their Early Computer Security Papers archive; the work is in the public domain as the output of US Air Force officers. The PDF has Adobe’s “Paper Capture” OCR layer baked in, but we ignore it: both engines re-OCR from page images, which is the realistic scenario, since an embedded OCR layer, when present at all, is often unreliable.

The real comparison. Both re-OCR the page images; Docling adds layout – Image by author

The two columns tell different stories.

EasyOCR (left). Faster (59.7 s vs 134.4 s, no layout model to load and run), ships the recogniser’s confidence as a column (mean 0.81 on this scan), produces more row-level detections (346 boxes) because every text region in the page becomes one row. Zero structure: no page_df, no toc_df, no image_df. The output is text in bbox form, nothing else.

Docling (right). Slower (about 2.3× the run time), joins detections into 105 lines/paragraphs rather than 346 boxes, no confidence column. The structural gain is real: 5 page_df rows, 11 toc_df entries (Docling’s layout model classifies headings as sections), 4 image_df rows (figures detected inside the page as separate objects). On a PDF with tables, the gap widens further: Docling’s TableFormer recognises rows × columns × headers, which EasyOCR cannot do at all. Article 5ter develops the table case in full.
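The headline ratios follow directly from the figures quoted in this section. A short worked check, using only those numbers (the dictionary layout and variable names are just for illustration):

```python
# Figures quoted in this section for pages 1-5 of the 1974 scan.
easyocr_run = {"seconds": 59.7, "rows": 346, "chars": 4952}
docling_run = {"seconds": 134.4, "rows": 105, "chars": 5423}

# How much longer Docling takes on the same pages.
slowdown = docling_run["seconds"] / easyocr_run["seconds"]

# How many more characters Docling extracts, as a percentage.
extra_chars_pct = (docling_run["chars"] - easyocr_run["chars"]) / easyocr_run["chars"] * 100

# How many EasyOCR boxes correspond, on average, to one Docling line/paragraph.
boxes_per_line = easyocr_run["rows"] / docling_run["rows"]

print(f"slowdown: {slowdown:.2f}x")
print(f"extra characters: {extra_chars_pct:.1f}%")
print(f"boxes per Docling line: {boxes_per_line:.1f}")
```

The last ratio is the grouping step made visible: on average, a little over three EasyOCR boxes end up inside each Docling line or paragraph.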

Both engines OCR with similar character-level error rates on this 1974 scan (Karger → “Karger” by EasyOCR, “Karger” by Docling on the cleanest page; degraded regions yield similar noise on both, “Laboralory”, “und” instead of “and”). The OCR engine inside Docling (EasyOCR or OnnxTR depending on install) is not magically more accurate than calling EasyOCR directly. What Docling adds is how it organises the OCR output, not how it OCRs.
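One rough way to put a number on "similar character-level error rates" is to score each OCR string against a reference transcription. The sketch below uses the standard library's difflib similarity ratio on the two misreads quoted above; it assumes "Laboratory" is the intended word behind "Laboralory", and the ratio is a similarity score, not a formal character error rate.

```python
from difflib import SequenceMatcher


def char_similarity(ocr_text: str, reference: str) -> float:
    """Similarity in [0, 1]: twice the matched characters over the total length."""
    return SequenceMatcher(None, ocr_text, reference).ratio()


# (OCR output, assumed reference) pairs taken from the misreads quoted above.
pairs = [
    ("Karger", "Karger"),
    ("Laboralory", "Laboratory"),
    ("und", "and"),
]
scores = {ocr: char_similarity(ocr, ref) for ocr, ref in pairs}
for ocr, score in scores.items():
    print(f"{ocr!r}: {score:.2f}")
```

Running the same scoring over both engines' output against a hand-checked transcription would show whether their character-level quality really is as close as the spot checks suggest.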

For enterprise RAG, the right call is almost always Docling. The 2.3× run-time cost is paid once at ingestion (the parse cache from Article 5 (document parsing) reuses results forever); the structural gain (TOC, figures, table cells, reading order) is paid back on every downstream query. The one thing Docling does not ship is EasyOCR’s row-level confidence signal, and keeping that signal is rarely worth giving up sections + figures + tables.

6. When traditional OCR still earns its keep

EasyOCR is the emergency package of the family: less visibility into the document, simpler dependencies, faster to deploy when the constraint is operational rather than pedagogical. Four narrow cases keep the door open.

Outside these cases: default to Docling on scans, Azure DI in shops where a regulated cloud is acceptable, and a vision LLM when the document has handwriting / signatures / a non-textual semantic layer. The adaptive-parsing dispatcher (Article 10) routes automatically.

7. Conclusion

OCR recovers characters. Layout recovers what makes the characters useful: sections, figures, table cells, reading order. The default engine for scans is the one that does both. The full Article-5 family lines up along the same axis:

EasyOCR sits at the OCR floor (line_df only); every other engine in the family adds a layout step on top – Image by author

EasyOCR is the OCR floor, what you get when you stop at recognising characters and never ask “and where are they on the page?” The question matters. “Where on the page” makes the difference between a list of text boxes and a parsed document. The dispatcher of Article 10 (adaptive parsing) picks the right engine per page; this article exists so the dispatcher knows what it gives up when it picks the cheap one.

8. Sources and further reading

EasyOCR is the most reachable traditional OCR engine in 2026; PaddleOCR (Baidu) and Tesseract (Google, decades-old) sit beside it in the same family. The layout step on top is what separates “OCR” from “document parsing”; Docling (Article 5ter) and Azure DI (Article 5bis) both add it, on local hardware and in the cloud respectively. The right cross-reading is the layout literature (Smock et al. 2022 for table structure, Auer et al. 2024 for the full layout cascade) and the alternative OCR engines for non-Latin scripts.

Same direction as the article:

  • JaidedAI, EasyOCR. The library this article documents, including the 80+ language model packs.
  • PaddleOCR (Baidu). Same-class traditional OCR engine; better Chinese coverage, similar layout blindness.
  • Tesseract OCR. The decades-old reference, still widely deployed; same architectural shape as EasyOCR (detection + recognition, no layout).

Different angle, different context:

  • Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). The layout-aware cascade that turns OCR output into a parsed document; the comparison point in section 5 of this article.
  • Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). The research lineage behind cell-level table extraction, the single biggest capability EasyOCR lacks.
