Interactive Deep Dive

Parsing Customer
Complaint Forms

Head-to-head benchmark of open-source PDF parsers on customer complaint forms — ablation tests, rubric scoring, and a decision framework.

Haoming Koo · April 2026 · Open Source · On-Prem · Complaint Forms
Background

Why complaint forms break every parser

Most PDF parsing benchmarks test clean academic papers or financial reports. Complaint forms are harder — they stack six difficulty layers at once.

Difficulty Stack — Customer Complaint Forms
Content Mix: Print + Handwriting
Layout Variety: KVP · Tables · Fields
Scan Quality: Fax · Skew · Noise
Mixed Docs: Form + Attachments in one PDF
Field Semantics: Cross-field validation
Compliance: CFPB · GDPR · PII
The benchmark trap: OmniDocBench (CVPR 2025) tested 9 document types and found over 55 percentage points of accuracy variance across document categories — for the same parser.[2] Domain composition matters more than which tool you pick. Measure on your actual forms.
Field-Level Accuracy Expectations (Before Tool Selection)
Printed text at 300 DPI: 99%+
Checkbox detection: 90–96%
Numeric fields (printed): 80–90%
Handwritten free text: 70–88%
No benchmark covers real complaint forms. FUNSD (the closest proxy) contains 199 administrative forms — mostly machine-typed, not handwritten.[1] Build a proprietary annotated corpus from 200+ real forms before making a production tool selection.
Tool Overview

8 parsers, 6 categories

The local tools below are free and can run on-prem. The Claude API row is included only as a cloud comparison baseline, not as an on-prem option.

Tool · License · Scanned · Tables · Handwriting · Speed (CPU) · Best For
Docling ★ (strong local option) · MIT · 97.9%* · Limited · ~1.3 p/s · Complex layouts, production on-prem
PyMuPDF · AGPL-3.0† · Basic · ~32 p/s · Digital PDFs, rendering + coordinates
pdfplumber · MIT · Rule-based · ~150 p/s · Table extraction from digital PDFs
PaddleOCR · Apache-2.0 · ✓ (ML) · CER ~24% · ~2–4 p/s · Scanned PDFs, multilingual (80+ langs)
Tesseract · Apache-2.0 · Poor · ~1 p/s · Baseline OCR, widest language support
TrOCR · MIT · CER ~3%‡ · <1 p/s · Handwritten field extraction only
Claude API (not on-prem) · Proprietary · ~90% · 0.3 p/s · Research comparison only — data leaves infra
pypdf · BSD-3 · ~415 p/s · AcroForm digital fields (fastest)

Speed figures are indicative CPU timings and vary strongly by document type; the detailed measured tables below are the source of truth for this benchmark. * Procycons 2025 benchmark on sustainability reports.[10] Not measured on complaint forms specifically. † AGPL-3.0 is free for internal/on-prem use; commercial redistribution requires a paid Artifex license. ‡ CER on IAM clean handwriting dataset[7] — real-world complaint form handwriting will be higher.

Docling[4]
MIT   IBM Research · 37k ★
ML layout parser for mixed document types. Runs fully local and air-gap compatible. Good first choice when tables or scanned pages are involved.
Scanned · Tables · Multi-format · PII ready
PyMuPDF
AGPL-3.0   Artifex · 8.7k ★
Very fast text-layer PDF extractor. Best for digital text PDFs and simple AcroForm workflows. Not useful for scanned inputs without an OCR stage.
Digital PDF · AcroForm · Fast + tables
PaddleOCR
Apache-2.0   Baidu · 70k ★
The broadest open-source OCR suite. PP-OCRv4[6] + PP-StructureV2 for layout and tables. 80+ language support — useful for multilingual complaint forms.
Scanned · Tables (ML) · 80+ languages · GPU ready
Under the Hood

How parsers actually detect words

The tools called "PDF parsers" use three completely different mechanisms. Which one a tool uses determines what it can and cannot read — that's what separates the benchmark rows.

Category 1 — Text extraction (pypdf · PyMuPDF · pdfplumber)

No image processing. No machine learning. Pure parsing of PDF's internal byte structure.

Step-by-step walkthrough — one PDF operator at a time

Step 1 of 7. Open the PDF binary. Everything is stored as numbered objects — like rows in a database. Object 12 is the Page. Object 14 is the Font. Object 15 is the Contents stream.

PDF object tree — what the parser traverses
xref table: Object 12 → Page 1.
Page object 12: links to Font obj 14 and Contents obj 15.
Font object 14: defines /Helvetica + ToUnicode table.
Contents stream 15: holds all the drawing operators.
Key insight
Storage: objects, not pixels
Scanned PDF: the object is an image — no text
A scanned PDF stores the page as a raster image object. The contents stream has almost no text operators — extractors return empty string.
Contents stream — BT is executing (blue)
BT: Begin text block — opens drawing context.
Tf: Set font (next step).
Td: Move cursor (later).
Tj: Draw string (later).
ET: End text block (closes context).
What BT means
Action: opens a text context
Characters drawn? None yet
Closes with: ET operator
Every visible text string in a PDF sits between a BT and ET pair. The parser looks for these pairs to know where text regions are.
Contents stream — Tf is executing (blue)
BT: Begin text block.
Tf: /Helvetica 12 — set font + size.
Td: Move cursor (next step).
Tj: Draw string (later).
What Tf does
Font name: /Helvetica
Size: 12 points
Affects: all Tj draws until the next Tf
The font name refers to an object in the PDF resource dictionary. That object contains the ToUnicode table — critical for correct character decoding.
Contents stream — Td is executing (blue)
BT: Begin text block.
Tf: /Helvetica 12 — font set.
Td: 72 680 — move pen to column 72, row 680.
Tj: Draw string (next step).
What Td does
X position: 72 pt from left
Y position: 680 pt from bottom
Ink drawn? None — cursor move only
PDF coordinates start at bottom-left (0,0). Y=680 places the text near the top of an A4 page. This is how parsers know the reading order — by sorting Td positions.
Contents stream — Tj is executing (blue)
BT: Begin text block.
Tf: /Helvetica 12.
Td: 72 680 — cursor positioned.
Tj: (Complaint Date:) — draw this string now.
Td: 180 0 — advance right for the value.
Tj: (01/15/2024) — draw the value.
What Tj does
Draws: the visible string
Bytes extracted: "Complaint Date:"
Then: "01/15/2024"
Tj is the operator the parser actually reads text from. Blue = both Tj calls. The second Td advances the cursor right so the value appears inline.
ToUnicode table: raw glyph IDs → readable characters
<0043> → C
<006F> → o
<006D> → m
<F001> → # (custom)
<0031> → 1
<0035> → 5
Why this step can fail
With CMap: clean Unicode text
Missing table: garbled bytes or blanks
Some PDFs use private glyph IDs (like F001 above) that map to non-standard characters. Without the CMap, the parser has no way to recover the real character.
Sorted by position → final extracted text
Y=680, X=72: "Complaint Date:"
Y=680, X=252: "01/15/2024"
AcroForm /V: read via field / widget dictionaries
Characters sorted by Y (top first), then X (left-to-right) reconstruct reading order. AcroForm /V values bypass this entirely — they're read from the form field structure instead, which is why pypdf reaches 1.00 Field F1 on the synthetic AcroForm slice used here.
Final result
Speed: 415 pages / second
Field F1: 1.00 on this slice
Works on scans? No — needs a text layer

Inside a PDF content stream

BT
  /Helvetica 12 Tf        % set font + size
  72 680 Td               % move cursor to (72, 680)
  (Complaint Date:) Tj    % draw string
  180 0 Td                % advance right
  (01/15/2024) Tj         % draw value
ET

Every character's position is encoded as PDF operators. Tj draws a string. Td moves the cursor. Tf selects a font.

Glyph → Unicode mapping

% ToUnicode CMap (inside PDF font object)
/CIDInit /ProcSet findresource begin
beginbfchar
  <0041> <0041>   % glyph 0x41 → "A"
  <0042> <0042>   % glyph 0x42 → "B"
  <F001> <0023>   % custom glyph → "#"
endbfchar

When a font embeds a ToUnicode CMap, parsers can map glyph codes to characters. When it's missing (common in scanned or form-flattened PDFs), parsers produce mojibake — or nothing.

PDF bytes (xref table) → object tree (font + content) → content stream (BT/ET operators) → glyph codes (ToUnicode CMap) → Unicode text (reading-order sort)
Why this fails on complaint forms: scanned PDFs usually have no usable text layer — the page content may just draw a raster image object instead of text operators. Text extractors therefore return empty string. Also, AcroForm field values live in /V dictionary entries, not in the visible text stream, so most extractors need a separate code path for forms.
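The two extraction paths described above are easy to see in a few lines of pypdf. A minimal sketch, assuming a local file named complaint.pdf that contains AcroForm widgets; only the documented PdfReader API is used:

```python
# Minimal sketch: both extraction paths on one file. "complaint.pdf" is a placeholder.
from pypdf import PdfReader

reader = PdfReader("complaint.pdf")

# Path 1: content-stream text. pypdf walks the BT/ET blocks, decodes glyphs via
# ToUnicode, and sorts by position. On a scanned page with no text layer this
# returns an empty string.
page_text = reader.pages[0].extract_text()

# Path 2: AcroForm values. Read straight from each field's /V entry, bypassing
# the drawing operators entirely.
fields = reader.get_fields() or {}
for name, field in fields.items():
    print(f"{name!r}: {field.value!r}")
```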
Lecture — tracing a real PDF file in memory
🎓
Professor
Here's what pypdf sees when it opens a digital complaint form. The binary file starts with %PDF-1.4. Inside are numbered objects — think of them as rows in a database. Object 12 is the Page. Object 14 is the Font. Object 15 is the Contents stream — a list of drawing operators like BT (begin text block), Tf (select font), Td (move cursor to position), Tj (draw a text string at the cursor).
🤔
Student
Wait — so text in a PDF is stored as drawing instructions, like a painter following a recipe? Not as actual words in a file?
🎓
Professor
Exactly. And here's the trap: those drawn strings use the font's own character encoding — not plain Unicode. The ToUnicode table inside the font definition converts raw glyph IDs to readable characters. If that table is missing (common in older PDFs), the parser returns garbled characters — even though the PDF looks perfect in Adobe Reader.
🤔
Student
That explains why pypdf shows CER = 0.49 even on this clean synthetic digital form slice. But its Field F1 is 1.0 for field extraction. How?
🎓
Professor
AcroForm field values live in a completely separate location: the /V entry of each widget annotation object — not inside the content stream at all. pypdf has a dedicated path that reads these /V entries directly, bypassing the drawing commands entirely. The high CER is a measurement artefact — the ground truth interleaves labels with values in a specific order that doesn't match the content stream's drawing sequence.
Expert insight
This is the most common trap in PDF benchmarks: do not rely on CER or WER alone for AcroForms. CER measures whether two text strings match character-for-character in the same order — but AcroForms are structured data, not prose. The key metric is Field F1: did you extract the right value for each named field? On the synthetic slice in this benchmark, pypdf scores 1.0 on Field F1, which is the outcome that matters for a complaint-processing pipeline.
Category 2 — Traditional OCR (Tesseract · PaddleOCR PP-OCRv4[6])

Converts a raster image to text through a multi-stage pipeline — each stage feeds the next.

Animated walkthrough

Step 1. Start with pixels from a scan. The page looks readable to us, but it is only an image.

Simple OCR flow
Scanned page strip
At this stage the parser only sees grey pixels, not text objects.
Short summary
Input: raster image
Problem: no direct characters yet
After cleaning
Background noise drops. Letters become darker and easier to separate.
What changed
Binarization: push text away from the background.
Deskew: rotate the lines back into place.
Detected word boxes
Complaint
Date
01/15/2024
Now the engine knows where each word probably lives on the page.
Tesseract vs PaddleOCR
Tesseract: uses projection-profile heuristics.
PaddleOCR: uses a learned detection model.
Raw character read — what OCR sees
C o r n p l a i n t
Green = confident match. Amber = low-confidence: OCR read rn but the actual glyph is m.
Model output
Raw decode: "Cornplaint"
Low-confidence area: "rn" confused for "m"
Rescored output
Correction: "cornplaint" → "complaint"
Final text: "Complaint Date 01/15/2024"
Short summary
Strength: works on scans and photos.
Weakness: quality falls when pixels are poor.
Raster image (scanned page) → binarize (Otsu / Sauvola) → deskew (Hough transform) → detect words (find ink blobs) → word split (projection profiles) → LSTM read (char → token) → LM rescore (beam search)

How Tesseract segments words

  1. Binarize — Otsu global threshold converts grey pixels to pure black/white. Sauvola[8] is better for mixed handwriting and print (adapts per tile).
  2. Find baselines — horizontal projection profiles detect row gaps. Each ink-dense row is a text line.
  3. Find words — vertical projection profiles within each line detect white gaps. Each connected blob is a word candidate.
  4. Classify characters — LSTM (Tesseract 4+) reads each word image left-to-right and predicts character sequences with a CTC loss decoder.
  5. Language model rescoring — a word-n-gram model corrects low-confidence character predictions (e.g. rn → m).
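The same stages can be driven from Python. A minimal sketch, assuming pytesseract and PyMuPDF are installed; scan.pdf is a placeholder, and the --oem/--psm flags are generic defaults rather than values tuned in this benchmark:

```python
# Minimal sketch: render page 1 at ~300 DPI with PyMuPDF, then let Tesseract run
# its own binarisation, segmentation, and LSTM decoding. "scan.pdf" is a placeholder.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scan.pdf")
pix = doc[0].get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))  # ~300 DPI raster
img = Image.open(io.BytesIO(pix.tobytes("png")))

data = pytesseract.image_to_data(
    img, lang="eng", config="--oem 1 --psm 6",
    output_type=pytesseract.Output.DICT,
)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(f"{conf:>4}  {word}")  # per-word confidence, useful for HITL gating
```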

PaddleOCR additions

PaddleOCR replaces Tesseract's heuristic segmentation with a detection model (DB++ — a differentiable binarization network) that directly predicts word bounding boxes as a segmentation map. This handles rotated, curved, or irregularly spaced text that trips up projection profiles.

DB++ detection → word bbox polygons
CRNN recognition → character sequence
PP-StructureV2 → table HTML

PP-StructureV2 runs a separate layout model that classifies regions as text / title / table / figure, then applies a table-specific parser to reconstruct row/col structure as HTML.
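For reference, a minimal PaddleOCR sketch of the detection and recognition path. It follows the 2.x-style call; option names shift between releases, and complaint_scan.png is a placeholder:

```python
# Minimal sketch of the detect-then-recognise path (PaddleOCR 2.x-style API).
# "complaint_scan.png" is a placeholder; result nesting differs slightly by version.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en", use_angle_cls=True)  # DB detector + learned recogniser
result = ocr.ocr("complaint_scan.png", cls=True)

for box, (text, confidence) in result[0]:  # one entry per detected text region
    print(f"{confidence:.2f}  {text}")
```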

Why 300 DPI matters: a character that is 3mm tall on paper = 35 px at 300 DPI vs 24 px at 200 DPI. Below roughly 200-300 DPI, printed OCR often loses stroke detail and segmentation becomes less stable. The exact drop depends on the engine, document quality, and preprocessing.
Lecture — pixel density, binarization, and why preprocessing can hurt
🎓
Professor
Tesseract works on line or word-height image crops, reading left to right and predicting character probabilities across each region. When resolution is too low, adjacent strokes blur together and the recognizer sees ambiguous blobs where distinct characters should be.
🤔
Student
Can't we just scan at 600 DPI to give the model more pixels and better accuracy?
🎓
Professor
Higher DPI helps up to a point — but the bottleneck often shifts to binarization. The first real step in any OCR pipeline is converting the image to pure black and white. Otsu's algorithm finds a single global brightness threshold for the entire page. Now, if your complaint form has a coffee stain in one corner and clean white paper elsewhere, that global threshold is simultaneously too high in one area and too low in another.
🤔
Student
Is that what Sauvola binarization fixes? Our preprocessing code uses it...
🎓
Professor
Exactly — Sauvola divides the page into overlapping tiles and computes a local threshold for each one, adapting to local contrast. For fax-quality or coffee-stained scans it is dramatically better. But — and this is what our benchmark showed on FUNSD — on an already-clean 300 DPI scan, Sauvola's local window over-sharpens thin pen strokes. CER went from 0.484 → 0.560. Preprocessing hurt.
Expert insight
The practical rule: gate preprocessing on measured image quality, not a global flag. In the code, check_dpi() returns the scan resolution. If DPI < 200, or skew angle > 1°, or contrast ratio < 50, applying the full pipeline (deskew → denoise → CLAHE → Sauvola) is worth trying. On already-adequate inputs, the local rerun here showed the full pipeline was slower and less accurate, so it is usually not worth applying by default.
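A hypothetical sketch of that gate: should_preprocess and its thresholds mirror the rule of thumb described here, not the benchmark's actual check_dpi() code, and the skew angle is assumed to be measured separately.

```python
# Hypothetical gate mirroring the rule of thumb above. Not the benchmark's actual
# check_dpi(); skew is assumed to be measured separately (e.g. from a Hough fit).
import numpy as np
from PIL import Image

def should_preprocess(path: str, skew_deg: float) -> bool:
    img = Image.open(path)
    dpi = img.info.get("dpi", (72, 72))[0]               # scan resolution, if recorded
    grey = np.asarray(img.convert("L"), dtype=float)
    contrast = grey.max() - grey.min()                    # crude global contrast proxy
    return dpi < 200 or abs(skew_deg) > 1.0 or contrast < 50

# A clean 300 DPI scan with 0.3 degrees of skew returns False, so the heavy pipeline is skipped.
```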
Category 3 — ML layout parsing (Docling · TableFormer)

Treats a document page as a 2D photograph. A vision model recognises regions — text block, table, figure, heading — and TableFormer reconstructs table structure from patterns learned across large document corpora.

Animated walkthrough — Docling 5-stage pipeline

Step 1. The page image is sliced into a grid of 16×16 pixel patches — the same way modern AI image models process photographs. Each patch becomes one token (a 768-dimensional vector).

Page divided into 16×16 px patches — colour shows region type
■ heading ■ text ■ table ■ blank
What this produces
Input: full page image
Each patch: 768-dim embedding vector
Sequence length: hundreds to low thousands of tokens
This is the same patchification idea used by modern multimodal vision models and image classifiers. The document is treated as a visual object — no text layer required, works on scans.
Attention map — table header row attends to data cells below
Blue intensity = attention weight. Row 2 (header) strongly attends to the rows below — the model learns column membership, not just proximity.
Why attention beats rules
Rules: fail on borderless tables
Rules: fail on merged cells
Attention: handles any distance, any layout
A spanning header and its column cells can be 300+ pixels apart. Self-attention handles this trivially; rule-based line detection cannot.
Detection head output — labelled bounding boxes
HEADING: "Customer Complaint Form — Q1 2024"
TEXT: "The customer reported that the product arrived damaged on 15 January…"
TABLE (→ TableFormer): "Claim ID · Date · Amount · Status · Assigned To"
Output classes (DiT vocabulary)
Heading: title, section header
Text block: body paragraph, list item
Table: cropped → TableFormer
Figure / Caption: image or chart region
Tables are cropped at this point and handed off to the specialised TableFormer model for structure reconstruction.
TableFormer predicts HTML tokens one by one
<thead> <tr> <td> Claim ID </td> <td> Date </td> <td colspan="2"> Amount & Status </td> </tr> </thead> <tbody> <tr> <td> C-0042
Blue = HTML structure tokens predicted by the model. Green = text content (filled from OCR / text layer after structure is finalised). Notice colspan="2" — a merged cell, impossible to detect with line-finding rules.
Why sequence prediction wins
Rule-based: find lines → infer grid
TableFormer: predict structure → fill text
Training: large annotated table corpora
The model has seen a wide range of merged-cell, spanning-header, and nested-table configurations in large scientific-paper table datasets.
Docling final output — ready for downstream NLP
## Customer Complaint Form — Q1 2024

The customer reported that the product arrived
damaged on 15 January...

| Claim ID | Date       | Amount & Status |
|----------|------------|------------------|
| C-0042   | 2024-01-15 | $142 · Pending   |
| C-0043   | 2024-01-16 | $89  · Resolved  |
What Docling exports
Formats: Markdown, JSON, HTML
Tables: correct row/column grid + spans
Reading order: correct multi-column sequence
No regex cleanup. No column-merging hacks. No reading-order guessing. The output is directly usable by LLMs, search indexes, and downstream form extraction pipelines.
Lecture — why ML layout is a completely different category
🎓
Professor
Docling's core insight is to treat a document page as a 2D photograph and apply object detection — the same technology your phone uses to identify faces. Instead of 'face' or 'car', the output classes are 'text block', 'table', 'heading', 'figure', 'caption', 'footnote'.
🤔
Student
Can't you detect tables more simply, by just looking for horizontal and vertical lines on the page?
🎓
Professor
That works on clean, bordered tables. Real complaint forms often have borderless tables with rows separated only by spacing. Merged cells. Headers spanning three columns. Scanned forms where the ink lines are smudged or missing. Rule-based detection fails on many of these cases. TableFormer predicts the structure as a sequence of HTML tokens, using patterns learned from large annotated table corpora instead of only visible ruling lines.
🤔
Student
If Docling is so much better, why not always use it for everything?
🎓
Professor
Cost and speed. Docling runs a vision model on every page, so CPU throughput is typically in the seconds-per-page range rather than the hundreds-of-pages-per-second range you can see from byte-level extraction on clean digital PDFs. The right tool depends on the document type: layout understanding buys robustness, but you pay for it in compute.
Expert insight
Docling combines a DeiT-B-class layout backbone with a TableFormer-style structure model trained on large document and table corpora such as DocBank, PubLayNet, and PubTabNet. That breadth is why it generalises better than line-finding heuristics, but the end-to-end pipeline is still materially heavier than byte-level extraction on clean digital PDFs.
When to choose Docling: documents with complex tables, multi-column layouts, scanned pages, or mixed content where reading order matters. For the simple digital AcroForm slice in this benchmark, Category 1 text extraction was orders of magnitude faster and produced the same field-extraction outcome.
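A minimal Docling sketch, following the project's quickstart pattern; model weights are fetched on first run, and complaint_form.pdf is a placeholder:

```python
# Minimal Docling sketch; layout and TableFormer models download on first run.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("complaint_form.pdf")

markdown = result.document.export_to_markdown()  # tables arrive as proper grids
print(markdown)
```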
Category 4 — Handwriting recognition (TrOCR[5])

Microsoft TrOCR is a Vision-Encoder–Language-Decoder transformer fine-tuned for handwritten text recognition. Unlike traditional OCR, it skips explicit character segmentation entirely.

Animated walkthrough

Step 1. Crop the handwritten field tightly. TrOCR works best on a single field, not a full busy page.

Field-level TrOCR flow
Handwritten field crop
J o n e s
Keep the crop small so background boxes, tables, and nearby labels do not distract the model.
Short summary
Input: one field image
Best use: name, date, note, signature line
How this is usually taught

The common ViT mental model is: one image → fixed-size patches → a token sequence. The decoder only turns those image tokens into text in the next step.

1. Start with one cropped field image
J o n e s

TrOCR is meant for a tight word or line crop, not a full complaint-form page.

2. Resize to 384×384, cut into 16×16 patches → 24×24 grid
Jones · one square = 16×16 px · 24 × 24 = 576 patches

The crop is resized to 384×384 px. The grid cuts it into 16×16 px squares — uniform, letter-blind. 384 ÷ 16 = 24, so the grid is 24×24 = 576 patches. Green squares = patches that contain pen strokes.

Schematic only. The model does not split the image into letter columns. It cuts the whole resized image into equal squares regardless of where letters start or end.

3. Turn those patches into a token sequence
patch 1 · patch 2 · patch 3 · patch 4 · … · patch 576

This is still vision data, not text. The next step is where the decoder reads those image tokens and starts generating letters.

What the encoder does
Unit: 16×16 pixel patches
Default resize: 384×384 crop
Patch count: 24×24 = 576 patches
Output: a sequence of visual tokens
Still missing: no final text yet
This follows the standard Vision Transformer teaching pattern: image → patches → embeddings → token sequence. TrOCR then adds the text decoder on top.
Decoder attending → emitting "J"
Cross-attention — decoder step 1 of 5: emitting "J", with attention weights shown over the "Jones" patches.

To emit each letter, the decoder queries cross-attention over the 576 patch embeddings. Brighter blue = higher attention weight for that patch at this decoding step.

Schematic only. Real attention maps are distributed across all heads and all layers — not a single clean spotlight.

One output token at a time
J o n e s
The highlighted token is being emitted now. Each step re-runs cross-attention over the full patch sequence before picking the next character.
Attention scope: all 576 patches, every step
Not used: no handcrafted character boxes
Connected strokes: handled as one word image
Field-level output
Model text: "Jones"
Use this on: short handwritten fields
Avoid: full-page mixed layouts
Practical takeaway
Strength: cleaner cursive transcription
Risk: messy real forms still need review
Handwritten field crop (from form region) → ViT encoder (16×16 px patches → patch embeddings) → cross-attention (decoder attends to patches) → autoregressive decode (token by token, beam=4) → text tokens → final string
What TrOCR gets right: because it reads a field as a whole image, connected script and ligatures that break character classifiers just work. The public large handwritten model reports CER ~2.9% on the clean IAM benchmark.[7]
What TrOCR gets wrong: real complaint forms have noisy, degraded, variable-size handwriting. Performance drops on messy real-world inputs — the IAM benchmark uses much cleaner line images than complaint forms typically are. Use it per-field, not whole-page, and calibrate on your own data.
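A minimal per-field sketch using the Hugging Face transformers API. The checkpoint name matches the variant used in this benchmark; name_field_crop.png is a placeholder for a tight, coordinate-based crop:

```python
# Minimal per-field sketch with Hugging Face transformers. "name_field_crop.png"
# is a placeholder for a tight crop of one handwritten field.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

field = Image.open("name_field_crop.png").convert("RGB")
pixel_values = processor(images=field, return_tensors="pt").pixel_values
ids = model.generate(pixel_values, num_beams=4, max_new_tokens=32)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])  # e.g. "Jones"
```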
Lecture — why TrOCR does not rely on explicit character boundaries
🎓
Professor
Traditional OCR has a fundamental chicken-and-egg problem: to read handwriting you need to segment it into individual characters, but to know where the boundaries are, you already need to know what the characters look like. Tesseract and PaddleOCR work around this using projection profiles — looking for white vertical gaps between characters. That approach fails completely on cursive script where letters are physically connected with no white gap.
🤔
Student
So how does TrOCR solve this? Where does the letter boundary detection happen?
🎓
Professor
It does not happen as a separate preprocessing stage. TrOCR takes the entire field image as input and asks a language decoder: "given all these visual patches, what is the most likely Unicode character sequence?" At each decoding step, cross-attention focuses on whichever patches contain the ink relevant to the current character. For the letter 'J', the decoder attends to the top-left patches where the vertical descending stroke lives. For 'o', it attends to the curved-loop patches. Segmentation is implicit in the attention weights rather than computed by a standalone character splitter.
🤔
Student
So if someone writes 'Jones' in connected cursive with no white gaps at all, TrOCR just reads the whole word and the model figures out the boundaries from context?
🎓
Professor
That's why TrOCR performs so well on clean IAM handwriting. Traditional character-segmentation OCR breaks badly on connected script, while TrOCR handles it more naturally. The limitation is training data: the public model was evaluated on cleaner, curated line images. Real complaint forms have degraded ink, mixed print and handwriting, and much more writer variation, which is why we run TrOCR per-field, with a coordinate-based crop, not whole-page.
Expert insight
TrOCR pairs a BEiT vision transformer (encoder) with a RoBERTa-family language model (decoder). The variant used here, microsoft/trocr-large-handwritten, is larger than the base family, but the architecture is the same. Cross-attention is what resolves ambiguous strokes — "rn" that looks like "m" gets pushed toward whichever makes a real word. A purely visual classifier just sees pixels.
In short: Categories 1–4 aren't interchangeable. A digital complaint form submitted as AcroForm → Category 1 (text extraction) wins. A faxed, crooked, coffee-stained form → Category 2 + preprocessing. A form with multi-column tables and headers → Category 3. A form with handwritten notes in the margins → Category 4. Production pipelines route to the right category per page, not per document.
Ablation Study

Performance across document types

Real measured results on this benchmark's test corpus, plus published benchmark references for parsers not yet run. ✓ = measured on this machine. * = from published research.

The measured tables below come from local reruns on this machine. Corpus: 20 synthetic AcroForms plus a 10-document FUNSD slice from the available scanned set. Speed is CPU-only on Apple Silicon. Rows marked * still come from published benchmarks, not this machine.
Measured Results — Digital AcroForms (N=20 synthetic) ✓

Synthetic complaint forms with real AcroForm field widgets. Ground truth: programmatically exact. CER measures full text extraction; Field F1 measures AcroForm field value extraction.

| Parser | CER ↓ | WER ↓ | FER ↑ | Field F1 ↑ | Speed (p/s) ↑ | AcroForm fields |
|---|---|---|---|---|---|---|
| pypdf | 0.490 | 0.524 | 1.000 | 1.000 | 415 | |
| pymupdf | 0.470 | 0.437 | 1.000 | 1.000 | 32 | |
| pdfplumber | 0.490 | 0.524 | 0.000 | 0.000 | 151 | |
| tesseract | 0.534 | 0.611 | 0.000 | 0.000 | 0.68 | |
CER ~0.49 is a measurement artifact: ground truth interleaves label+value; text parsers read label text and widget values in separate passes (non-interleaved order). Field F1 is the correct signal for AcroForms. Tesseract renders to image and runs OCR even on digital PDFs — correct for scanned forms, wasteful here.
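To make that distinction concrete, here is a minimal sketch of both metrics as typically computed: character error rate over the full extracted text, and exact-match F1 over named field values. The helper names are illustrative, not the benchmark's sandbox/evaluate.py.

```python
# Illustrative helpers, not the benchmark's sandbox/evaluate.py.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def field_f1(truth: dict, predicted: dict) -> float:
    true_positives = sum(1 for k, v in truth.items() if predicted.get(k) == v)
    precision = true_positives / max(len(predicted), 1)
    recall = true_positives / max(len(truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# A parser can score badly on cer() because label/value ordering differs from the
# ground truth, yet still score 1.0 on field_f1(): the pypdf case in the table above.
```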

Measured Results — Scanned Forms FUNSD (N=10) ✓

Real scanned business forms from FUNSD (EPFL 2019).[1] Human-annotated ground truth. Text parsers return empty string (no text layer in raster-image PDFs). Tesseract runs OCR on the rendered image.

| Parser | CER ↓ | WER ↓ | FER ↑ | Field F1 ↑ | Speed (p/s) ↑ | Reads scanned |
|---|---|---|---|---|---|---|
| tesseract (raw) | 0.484 | 0.749 | 0.000 | 0.000 | 1.05 | |
| tesseract +preproc | 0.560 | 0.866 | 0.000 | 0.000 | 0.25 | |
| pypdf | 1.000 | 1.000 | 0.000 | 0.000 | 1165 | |
| pymupdf | 1.000 | 1.000 | 0.000 | 0.000 | 938 | |
| pdfplumber | 1.000 | 1.000 | 0.000 | 0.000 | 1117 | |

Tesseract CER=0.484 on this noisy 10-document FUNSD slice is plausible — these are degraded real-world business forms, not clean print. Text parsers "speed" is misleadingly high because they immediately return empty string (no text layer to process). Docling, PaddleOCR, and TrOCR results are still pending local reruns because they require larger model downloads.

Published benchmark numbers below (marked *) are from external research, not this machine. Sources: Applied AI 2025[9] (800+ docs, 17 parsers), Procycons 2025[10] (Docling table accuracy on sustainability reports), Koncile 2025[11], NVIDIA 2025[12]. Your results on your corpus will differ.
Table Cell Accuracy — Published Benchmarks
Table extraction accuracy (various doc types)
CPU throughput — pages per second

Speed rows marked ✓ come from the latest local reruns, but they are best-fit task measurements rather than one universal race: pypdf / PyMuPDF / pdfplumber are from digital AcroForms, while Tesseract is from scanned forms.

Handwriting Recognition — On Documents with Handwritten Notes

Source: Braincuber 2025 independent benchmark.[13] This chart mixes two kinds of evidence: cloud OCR numbers from document benchmarks with handwritten annotations, and a clean-reference TrOCR score from the IAM handwriting dataset.[7] Treat it as directional, not apples-to-apples.

Preprocessing Pipeline Impact on Degraded Scans
Raw fax scan: ~200 DPI, bitonal
Baseline accuracy: ~70%
Deskew + denoise: +12%
CLAHE + Sauvola: +10%
Result: ~92% accuracy (+22% median gain)

Source: Compiled from industry benchmarks (2023–2025).[9][12] The +22% is a median figure across degraded document types — your mileage will vary by scan quality and OCR engine.

Measured finding (2026-04-10): Preprocessing hurts already-good scans.
Running the full pipeline (deskew → denoise → CLAHE → Sauvola binarization) on FUNSD 300 DPI scans increased CER from 0.484 → 0.560 (+0.076 absolute, +15.7% relative). Sauvola binarization (window=25) over-thresholds fine strokes on adequate-quality 300 DPI images, destroying information that Tesseract could read from the raw scan.

Rule of thumb: Gate preprocessing on measured document quality — not a global boolean flag. Use check_dpi() + measure skew angle before deciding to preprocess. Reserve the full pipeline for fax-quality inputs (≤200 DPI, skew >1°, contrast ratio <50).
Preprocessing Effect — FUNSD N=10, Tesseract eng, 300 DPI ✓
| Configuration | CER ↓ | WER ↓ | Speed (p/s) ↑ | Verdict |
|---|---|---|---|---|
| tesseract (raw) | 0.484 | 0.749 | 1.05 | Use this |
| tesseract +preproc | 0.560 | 0.866 | 0.25 | Worse (+0.076 CER, 4.3× slower) |

Measured on this machine (Apple Silicon, CPU). Preprocessing pipeline: deskew (Hough) → denoise (fast non-local means) → CLAHE → Sauvola binarization. For degraded/fax inputs the pipeline is beneficial — this finding is specific to already-adequate 300 DPI business form scans.
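For reference, a minimal sketch of the preprocessing stages benchmarked above, using OpenCV and scikit-image. The deskew stage is omitted for brevity; window_size=25 matches the Sauvola setting quoted in the finding.

```python
# Sketch of the benchmarked stages (deskew omitted for brevity).
import cv2
import numpy as np
from skimage.filters import threshold_sauvola

def preprocess(grey: np.ndarray) -> np.ndarray:
    """grey: uint8 grayscale page image; returns a binarised uint8 image."""
    denoised = cv2.fastNlMeansDenoising(grey, h=10)                 # scanner noise
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))     # local contrast
    equalised = clahe.apply(denoised)
    threshold = threshold_sauvola(equalised, window_size=25)        # local binarisation
    return (equalised > threshold).astype(np.uint8) * 255

# Helpful on fax-quality input; on clean 300 DPI scans it can erase thin strokes,
# which is exactly the regression measured in the table above.
```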

Evaluation Rubric

How to score your parsers

A 10-criterion weighted rubric designed for complaint form requirements. Weights reflect the specific priorities of this use case — handwriting is weighted highest (0.22) because it is the primary failure mode.

| Criterion | Weight | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| A. Text Extraction | 0.12 | CER ≤ 2% | CER 5–10% | CER > 20% |
| B. Handwriting | 0.22 | CER ≤ 5% (your forms) | CER 15–30% | No capability |
| C. Table Structure | 0.08 | Cell acc ≥ 95% | 70–85% | No extraction |
| D. Form Field / KVP | 0.18 | Field F1 ≥ 0.95 | F1 0.70–0.85 | F1 < 0.50 |
| E. Figures / Diagrams | 0.05 | No hallucination near figures | Minor text bleed | Hallucination observed |
| F. Degraded Scans | 0.10 | CER < 10% at 200 DPI fax | CER 20–35% | Fails below 300 DPI |
| G. Speed | 0.05 | ≥ 5 p/s CPU | 0.2–1 p/s | < 0.05 p/s |
| H. Compliance | 0.10 | Zero egress, MIT/Apache | Self-hosted option w/ cloud activation | Cloud-only, no DPA by default |
| I. License | 0.05 | MIT / Apache-2.0 / BSD | GPL-3.0 (code open, weights NC) | Proprietary API |
| J. Maintenance | 0.05 | Active release < 6 months | Release within 18 months | Effectively abandoned (> 3 years) |
Mandatory gate: Criterion H (Compliance) must score ≥ 3 before any tool processes real customer complaint PII. A tool that scores 1 on Criterion E (hallucinated text from figures/signatures) is disqualified if your forms contain signature fields.

Estimated scores (from published research — measure on your corpus)

| Tool | A | B | C | D | E | F | G | H | I | J | Weighted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Docling + TrOCR Hybrid | 5 | 5 | 5 | 4 | 4 | 4 | 3 | 5 | 5 | 5 | 4.53 |
| Docling | 5 | 3 | 5 | 4 | 4 | 3 | 3 | 5 | 5 | 5 | 4.05 |
| PyMuPDF (digital only) | 5 | 1 | 3 | 3 | 5 | 1 | 5 | 4 | 4 | 4 | 3.20 |
| PaddleOCR | 4 | 2 | 3 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 3.47 |
| Tesseract | 3 | 1 | 1 | 2 | 3 | 3 | 3 | 5 | 5 | 5 | 2.73 |
| Claude API* | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 1 | 1 | 5 | 2.98* |

* Claude API scores Compliance (H) = 1 because data leaves your infrastructure. Not recommended for real complaint PII without a signed DPA. These are pre-measurement estimates — do not use as final scores without running the ablation on your corpus.
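The weighted-sum mechanics are simple enough to keep in a script next to your measured scores. A minimal sketch follows; the weights mirror the rubric above, while the example scores are illustrative placeholders, not measurements.

```python
# Weights mirror the rubric above; the example scores are illustrative placeholders.
WEIGHTS = {"A": 0.12, "B": 0.22, "C": 0.08, "D": 0.18, "E": 0.05,
           "F": 0.10, "G": 0.05, "H": 0.10, "I": 0.05, "J": 0.05}

def weighted_score(scores: dict) -> float:
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

def passes_gates(scores: dict, forms_have_signatures: bool) -> bool:
    # Mandatory gates from the rubric: Compliance (H) must score at least 3, and a
    # tool that hallucinates near figures (E == 1) is out if forms contain signatures.
    return scores["H"] >= 3 and not (forms_have_signatures and scores["E"] == 1)

candidate = {"A": 5, "B": 4, "C": 4, "D": 5, "E": 3, "F": 4, "G": 2, "H": 5, "I": 5, "J": 4}
print(round(weighted_score(candidate), 2), passes_gates(candidate, forms_have_signatures=True))
```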

Decision Framework

Which parser for which form?

Four questions narrow you to the right tool. A hybrid pipeline — different parsers for different form subtypes — almost always beats running a single parser on everything.

flowchart TD
  A([📄 Incoming complaint form PDF]) --> B{🔒 Air-gap or data residency required?}
  B -->|YES| C["Rule out all cloud APIs\nUse local tools only:\nDocling · PaddleOCR · TrOCR · pypdf"]
  B -->|NO| D
  C --> D{📄 Does the PDF have a text layer?}
  D -->|YES — Digital PDF| E{Is it an AcroForm\nwith fillable fields?}
  E -->|YES| F["✅ pypdf.get_fields\nBest current measured fit\non synthetic AcroForms"]
  E -->|NO| G{Complex tables\nor multi-column layout?}
  G -->|YES| H["✅ Docling\nStrong layout + table handling"]
  G -->|NO| I["✅ PyMuPDF\nFast plain text + geometry"]
  D -->|NO — Scanned PDF| J{🔍 Scan quality?}
  J -->|Less than 200 DPI| K["❌ Reject\nor request rescan"]
  J -->|Low DPI or visibly degraded| L["If you preprocess:\ndeskew → denoise → CLAHE → Sauvola"]
  J -->|Already readable 300 DPI| M["Usually skip heavy binarization\nor keep cleanup light"]
  L --> N{✍️ Handwritten fields present?}
  M --> N
  N -->|YES| O{High monthly volume\nor strict local control?}
  O -->|YES| P["✅ Docling + TrOCR\nSelf-hosted pipeline"]
  O -->|NO| Q{Cloud OK?}
  Q -->|YES| R["☁️ AWS Textract Forms+Tables\n$0.065 per page"]
  Q -->|NO| P
  N -->|NO — Printed text only| S{High monthly volume\nor lowest unit cost?}
  S -->|YES| U["✅ Self-hosted OCR/layout\nDocling or Tesseract path"]
  S -->|NO| T["☁️ Google Document AI\n$0.030 per page"]
  style F fill:#1a3a1a,stroke:#22c55e,color:#22c55e
  style H fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style I fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style P fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style U fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style K fill:#3a1a1a,stroke:#ef4444,color:#ef4444
  style R fill:#2a1a3a,stroke:#a855f7,color:#a855f7
  style T fill:#2a1a3a,stroke:#a855f7,color:#a855f7
Production Architecture

Recommended hybrid pipeline

No single parser handles all complaint form types optimally. Route by document type and confidence — different paths use different tools.

Hybrid Pipeline — Components and Flow
Intake: S3 + SQS
Classifier: LayoutLM
Digital path: PyMuPDF
Scanned path: preproc → Docling
Handwritten fields: TrOCR
Validation: Pydantic v2
HITL routing: < 0.70 confidence
PII redact: Presidio
Audit log: immutable
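A minimal routing sketch for the intake step. The handler names and the text-layer probe are illustrative glue code, not a library API; the branching mirrors the decision framework above.

```python
# Illustrative glue code: probe the PDF, then hand it to the best-fit path.
from pypdf import PdfReader

def route(path: str) -> str:
    reader = PdfReader(path)
    has_fields = bool(reader.get_fields())
    has_text = any((page.extract_text() or "").strip() for page in reader.pages)

    if has_fields:
        return "pypdf_acroform"      # digital AcroForm: read /V values directly
    if has_text:
        return "digital_text_path"   # PyMuPDF for plain text, Docling if tables/layout
    return "scanned_ocr_path"        # preprocess if degraded, then OCR / Docling

# Handwritten field crops and low-confidence results then branch to TrOCR and the
# HITL queue, as in the component list above.
```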
Illustrative Routing Benefit — Single parser vs hybrid pipeline

Routing each document type to the best-fit tool instead of running one parser on everything is often the highest-leverage improvement. The chart below is an illustrative scenario that mixes local measurements (✓), published benchmarks, and operational assumptions — your numbers will vary.

What routing buys you, concretely:
  • Digital AcroForms → pypdf: 1.00 Field F1 on the synthetic slice at 415 p/s. OCR on that path mostly wastes compute.
  • Scanned 300 DPI → a routed OCR/layout path can materially outperform a single default parser, especially when tables or mixed layout matter.
  • Handwritten fields → per-field HTR is usually a better fit than whole-page OCR, but the exact gain depends heavily on your handwriting corpus.
  • Low-confidence → HITL queue: catches the 5–10% of edge cases that degrade any automated pipeline.
The ceiling is not 100%:

Even a strong hybrid pipeline hits a wall on sub-200 DPI fax scans, heavy coffee-stained or torn forms, and extreme cursive handwriting. In practice, many teams still reserve the last 5–10% of uncertain cases for Human-in-the-Loop review rather than promise full automation.

✓ Must-do

  • Preprocessing before OCR, gated on actual scan quality
  • Empirical confidence calibration (not vendor defaults)
  • Field-level metrics, not form-level
  • Pydantic v2 cross-field validation (see the sketch after these lists)
  • Immutable audit log (strong compliance practice; no specific CFPB rule cited)
  • Presidio PII redaction before storage
  • DPA in place before any cloud API

✗ Anti-patterns

  • One parser for all document types
  • LLMs as primary OCR for numeric fields
  • Trusting vendor confidence as calibrated probability
  • Unstructured for real-time intake (51–140s/form)
  • LlamaParse without enterprise DPA on PII data
  • Benchmarking on vendor sample documents
  • Evaluating form-level accuracy (hides field failures)

⚠ Watch out for

  • Fax-quality scans (<200 DPI) — reject or upsample
  • Few-shot prompt contamination → hallucination
  • VLMs hallucinating chart/signature content
  • PyMuPDF AGPL if redistributing commercially
  • Marker/Surya GPL blocking proprietary products
  • Active learning without OOD validation
  • Static thresholds without recalibration over time
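A minimal Pydantic v2 sketch of the cross-field validation item from the must-do list. The field names and rules are illustrative, not the production schema.

```python
# Illustrative Pydantic v2 model; field names and rules are not the production schema.
from datetime import date
from pydantic import BaseModel, model_validator

class ComplaintForm(BaseModel):
    complaint_date: date
    incident_date: date
    claim_amount: float
    currency: str = "USD"

    @model_validator(mode="after")
    def check_cross_field_consistency(self) -> "ComplaintForm":
        if self.incident_date > self.complaint_date:
            raise ValueError("incident_date cannot be after complaint_date")
        if self.claim_amount < 0:
            raise ValueError("claim_amount must be non-negative")
        return self

# A ValidationError here routes the form to the HITL queue instead of straight through.
```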

Cost at scale — cloud vs self-hosted

Monthly Cost Estimate — Structured Extraction (1 form ≈ 3 pages)

These numbers assume Docling on a shared server with reasonable utilisation. At 10K forms/month it already undercuts both cloud options shown. The actual crossover depends on your hardware cost and server load — calculate it against your own setup. Vendor pricing checked April 2026.

Practical starting point for most teams: If you need one parser for both scanned and layout-heavy documents, Docling is the safe default. For handwritten fields, fine-tune TrOCR on real complaint samples. For digital AcroForms, pypdf.get_fields() was fastest and most accurate on the synthetic corpus here. Keep HITL for low-confidence cases — no single parser handles every subtype well.
Before you start: Annotate 200 real forms as ground truth JSON, run sandbox/evaluate.py, and measure CER/WER/FER on your actual documents. Confidence thresholds need to come from real data, not published benchmarks. The numbers here are a starting point.

References

  1. [1] Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. ICDAR-OST 2019. arXiv:1905.13538 · Dataset page Dataset · Paper
  2. [2] Ouyang, L., et al. (2024/2025). OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. CVPR 2025. arXiv:2412.07626 · GitHub Conference Paper
  3. [3] Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A. S., & Staar, P. (2022). DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. KDD 2022. IBM Research. arXiv:2206.01062 Dataset · Conference Paper
  4. [4] Auer, C., et al. (2024). Docling Technical Report. IBM Research Zurich. arXiv:2408.09869 · GitHub Technical Report
  5. [5] Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2021). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. Microsoft Research. arXiv:2109.10282 Paper
  6. [6] Du, Y., et al. (2020). PP-OCR: A Practical Ultra Lightweight OCR System. Baidu Inc. arXiv:2009.09941 (PP-OCRv3: arXiv:2206.03001) Paper
  7. [7] Marti, U.-V., & Bunke, H. (2002). The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1), 39–46. Dataset page Dataset
  8. [8] Sauvola, J., & Pietikäinen, M. (2000). Adaptive document image binarization. Pattern Recognition, 33(2), 225–236. doi:10.1016/S0031-3203(99)00055-2 Journal Paper
  9. [9] Applied AI. (2025, December). The State of PDF Parsing: What 800+ Documents and 7 Frontier LLMs Taught Us About Parser Selection. applied-ai.com/briefings/pdf-parsing-benchmark/ · GitHub Industry Benchmark
  10. [10] Procycons. (2025). PDF Data Extraction Benchmark 2025: Comparing Docling, Unstructured, and LlamaParse for Document Processing Pipelines. procycons.com/en/blogs/pdf-data-extraction-benchmark/ Industry Report
  11. [11] Koncile. (2025). Document Parsing in 2025: Tools, Accuracy, and Comparisons. koncile.ai/en/resources Industry Report
  12. [12] NVIDIA Developer Blog. (2025, August). Approaches to PDF Data Extraction for Information Retrieval. developer.nvidia.com/blog/approaches-to-pdf-data-extraction Technical Blog
  13. [13] Braincuber Technologies. (2025). AWS Textract vs Google Document AI: OCR Comparison. braincuber.com/blog/aws-textract-vs-google-document-ai-ocr-comparison Industry Comparison