Interactive Deep Dive

Parsing Customer
Complaint Forms

Head-to-head benchmark of open-source PDF parsers on customer complaint forms — ablation tests, rubric scoring, and a decision framework.

Haoming Koo · April 2026 · Open Source · On-Prem · Complaint Forms
Background

Why complaint forms break every parser

Most PDF parsing benchmarks test clean academic papers or financial reports. Complaint forms are harder — they stack six difficulty layers at once.

Difficulty Stack — Customer Complaint Forms
Content Mix: Print + Handwriting
Layout Variety: KVP · Tables · Fields
Scan Quality: Fax · Skew · Noise
Mixed Docs: Form + Attachments in one PDF
Field Semantics: Cross-field validation
Compliance: CFPB · GDPR · PII
The benchmark trap: OmniDocBench (CVPR 2025) tested 9 document types and found over 55 percentage points of accuracy variance across document categories — for the same parser.[2] Domain composition matters more than which tool you pick. Measure on your actual forms.
Field-Level Accuracy Expectations (Before Tool Selection)
Printed text at 300 DPI: 99%+
Checkbox detection: 90–96%
Numeric fields (printed): 80–90%
Handwritten free text: 70–88%
No benchmark covers real complaint forms. FUNSD (the closest proxy) contains 199 administrative forms — mostly machine-typed, not handwritten.[1] Build a proprietary annotated corpus from 200+ real forms before making a production tool selection.
Tool Overview

8 parsers, 6 categories

The local tools below are free and can run on-prem. The Claude API row is included only as a cloud comparison baseline, not as an on-prem option.

Tool · License · Scanned · Tables · Handwriting · Speed (CPU) · Best For
Docling ★ (strong local option) · MIT · 97.9%* · Limited · ~1.3 p/s · Complex layouts, production on-prem
PyMuPDF · AGPL-3.0† · Basic · ~32 p/s · Digital PDFs, rendering + coordinates
pdfplumber · MIT · Rule-based · ~150 p/s · Table extraction from digital PDFs
PaddleOCR · Apache-2.0 · ✓ (ML) · CER ~24% · ~2–4 p/s · Scanned PDFs, multilingual (80+ langs)
Tesseract · Apache-2.0 · Poor · ~1 p/s · Baseline OCR, widest language support
TrOCR · MIT · CER ~3%‡ · <1 p/s · Handwritten field extraction only
Claude API (not on-prem) · Proprietary · ~90% · 0.3 p/s · Research comparison only — data leaves infra
pypdf · BSD-3 · ~415 p/s · AcroForm digital fields (fastest)

Speed figures are indicative CPU timings and vary strongly by document type; the detailed measured tables below are the source of truth for this benchmark. * Procycons 2025 benchmark on sustainability reports.[10] Not measured on complaint forms specifically. † AGPL-3.0 is free for internal/on-prem use; commercial redistribution requires a paid Artifex license. ‡ CER on IAM clean handwriting dataset[7] — real-world complaint form handwriting will be higher.

Docling[4]
MIT   IBM Research · 37k ★
ML layout parser for mixed document types. Runs fully local and air-gap compatible. Good first choice when tables or scanned pages are involved.
Scanned · Tables · Multi-format · PII ready
PyMuPDF
AGPL-3.0   Artifex · 8.7k ★
Very fast text-layer PDF extractor. Best for digital text PDFs and simple AcroForm workflows. Not useful for scanned inputs without an OCR stage.
Digital PDF · AcroForm · Fast + tables
PaddleOCR
Apache-2.0   Baidu · 70k ★
The broadest open-source OCR suite. PP-OCRv4[6] + PP-StructureV2 for layout and tables. 80+ language support — useful for multilingual complaint forms.
Scanned · Tables (ML) · 80+ languages · GPU ready
Under the Hood

How parsers actually detect words

The tools called "PDF parsers" use three completely different mechanisms. Which one a tool uses determines what it can and cannot read — that's what separates the benchmark rows.

Category 1 — Text extraction (pypdf · PyMuPDF · pdfplumber)

No image processing. No machine learning. Pure parsing of PDF's internal byte structure.

Step-by-step walkthrough — one PDF operator at a time

Step 1 of 7. Open the PDF binary. Everything is stored as numbered objects — like rows in a database. Object 12 is the Page. Object 14 is the Font. Object 15 is the Contents stream.

PDF object tree — what the parser traverses
xref table: Object 12 → Page 1.
Page object 12: links to Font obj 14 and Contents obj 15.
Font object 14: defines /Helvetica + ToUnicode table.
Contents stream 15: holds all the drawing operators.
Key insight
Storage: objects, not pixels
Scanned PDF: the object is an image — no text
A scanned PDF stores the page as a raster image object. The contents stream has almost no text operators — extractors return empty string.
Contents stream — BT is executing (blue)
BT: Begin text block — opens drawing context.
Tf: Set font (next step).
Td: Move cursor (later).
Tj: Draw string (later).
ET: End text block (closes context).
What BT means
Action: opens a text context
Characters drawn? None yet
Closes with: ET operator
Every visible text string in a PDF sits between a BT and ET pair. The parser looks for these pairs to know where text regions are.
Contents stream — Tf is executing (blue)
BT: Begin text block.
Tf: /Helvetica 12 — set font + size.
Td: Move cursor (next step).
Tj: Draw string (later).
What Tf does
Font name: /Helvetica
Size: 12 points
Affects: all Tj draws until the next Tf
The font name refers to an object in the PDF resource dictionary. That object contains the ToUnicode table — critical for correct character decoding.
Contents stream — Td is executing (blue)
BT: Begin text block.
Tf: /Helvetica 12 — font set.
Td: 72 680 — move pen to column 72, row 680.
Tj: Draw string (next step).
What Td does
X position: 72 pt from left
Y position: 680 pt from bottom
Ink drawn? None — cursor move only
PDF coordinates start at bottom-left (0,0). Y=680 places the text near the top of an A4 page. This is how parsers know the reading order — by sorting Td positions.
Contents stream — Tj is executing (blue)
BT: Begin text block.
Tf: /Helvetica 12.
Td: 72 680 — cursor positioned.
Tj: (Complaint Date:) — draw this string now.
Td: 180 0 — advance right for the value.
Tj: (01/15/2024) — draw the value.
What Tj does
Draws: the visible string
Bytes extracted: "Complaint Date:"
Then: "01/15/2024"
Tj is the operator the parser actually reads text from. Blue = both Tj calls. The second Td advances the cursor right so the value appears inline.
ToUnicode table: raw glyph IDs → readable characters
<0043> → C
<006F> → o
<006D> → m
<F001> → # (custom)
<0031> → 1
<0035> → 5
Why this step can fail
With CMap: clean Unicode text
Missing table: garbled bytes or blanks
Some PDFs use private glyph IDs (like F001 above) that map to non-standard characters. Without the CMap, the parser has no way to recover the real character.
Sorted by position → final extracted text
Y=680, X=72: "Complaint Date:"
Y=680, X=252: "01/15/2024"
AcroForm /V: read via field / widget dictionaries
Characters sorted by Y (top first), then X (left-to-right) reconstruct reading order. AcroForm /V values bypass this entirely — they're read from the form field structure instead, which is why pypdf reaches 1.00 Field F1 on the synthetic AcroForm slice used here.
Final result
Speed: 415 pages / second
Field F1: 1.00 on this slice
Works on scans? No — needs a text layer

Inside a PDF content stream

BT
  /Helvetica 12 Tf        % set font + size
  72 680 Td               % move cursor to (72, 680)
  (Complaint Date:) Tj    % draw string
  180 0 Td                % advance right
  (01/15/2024) Tj         % draw value
ET

Every character's position is encoded as PDF operators. Tj draws a string. Td moves the cursor. Tf selects a font.

Glyph → Unicode mapping

% ToUnicode CMap (inside PDF font object)
/CIDInit /ProcSet findresource begin
beginbfchar
  <0041> <0041>   % glyph 0x41 → "A"
  <0042> <0042>   % glyph 0x42 → "B"
  <F001> <0023>   % custom glyph → "#"
endbfchar

When a font embeds a ToUnicode CMap, parsers can map glyph codes to characters. When it's missing (common in scanned or form-flattened PDFs), parsers produce mojibake — or nothing.

PDF bytes (xref table) → object tree (font + content) → content stream (BT/ET operators) → glyph codes (ToUnicode CMap) → Unicode text (reading-order sort)
Why this fails on complaint forms: scanned PDFs usually have no usable text layer — the page content may just draw a raster image object instead of text operators. Text extractors therefore return empty string. Also, AcroForm field values live in /V dictionary entries, not in the visible text stream, so most extractors need a separate code path for forms.
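The two extraction paths described above are easy to see in a few lines of pypdf. A minimal sketch, assuming a local file named complaint.pdf that contains AcroForm widgets; only the documented PdfReader API is used:

```python
# Minimal sketch: both extraction paths on one file. "complaint.pdf" is a placeholder.
from pypdf import PdfReader

reader = PdfReader("complaint.pdf")

# Path 1: content-stream text. pypdf walks the BT/ET blocks, decodes glyphs via
# ToUnicode, and sorts by position. On a scanned page with no text layer this
# returns an empty string.
page_text = reader.pages[0].extract_text()

# Path 2: AcroForm values. Read straight from each field's /V entry, bypassing
# the drawing operators entirely.
fields = reader.get_fields() or {}
for name, field in fields.items():
    print(f"{name!r}: {field.value!r}")
```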
Lecture — tracing a real PDF file in memory
🎓
Professor
Here's what pypdf sees when it opens a digital complaint form. The binary file starts with %PDF-1.4. Inside are numbered objects — think of them as rows in a database. Object 12 is the Page. Object 14 is the Font. Object 15 is the Contents stream — a list of drawing operators like BT (begin text block), Tf (select font), Td (move cursor to position), Tj (draw a text string at the cursor).
🤔
Student
Wait — so text in a PDF is stored as drawing instructions, like a painter following a recipe? Not as actual words in a file?
🎓
Professor
Exactly. And here's the trap: those drawn strings use the font's own character encoding — not plain Unicode. The ToUnicode table inside the font definition converts raw glyph IDs to readable characters. If that table is missing (common in older PDFs), the parser returns garbled characters — even though the PDF looks perfect in Adobe Reader.
🤔
Student
That explains why pypdf shows CER = 0.49 even on this clean synthetic digital form slice. But its Field F1 is 1.0 for field extraction. How?
🎓
Professor
AcroForm field values live in a completely separate location: the /V entry of each widget annotation object — not inside the content stream at all. pypdf has a dedicated path that reads these /V entries directly, bypassing the drawing commands entirely. The high CER is a measurement artefact — the ground truth interleaves labels with values in a specific order that doesn't match the content stream's drawing sequence.
Expert insight
This is the most common trap in PDF benchmarks: do not rely on CER or WER alone for AcroForms. CER measures whether two text strings match character-for-character in the same order — but AcroForms are structured data, not prose. The key metric is Field F1: did you extract the right value for each named field? On the synthetic slice in this benchmark, pypdf scores 1.0 on Field F1, which is the outcome that matters for a complaint-processing pipeline.
Category 2 — Traditional OCR (Tesseract · PaddleOCR PP-OCRv4[6])

Converts a raster image to text through a multi-stage pipeline — each stage feeds the next.

Animated walkthrough

Step 1. Start with pixels from a scan. The page looks readable to us, but it is only an image.

Simple OCR flow
Scanned page strip
At this stage the parser only sees grey pixels, not text objects.
Short summary
Input: raster image
Problem: no direct characters yet
After cleaning
Background noise drops. Letters become darker and easier to separate.
What changed
Binarization: push text away from the background.
Deskew: rotate the lines back into place.
Detected word boxes
Complaint
Date
01/15/2024
Now the engine knows where each word probably lives on the page.
Tesseract vs PaddleOCR
Tesseract: uses projection-profile heuristics.
PaddleOCR: uses a learned detection model.
Raw character read — what OCR sees
C o r n p l a i n t
Green = confident match. Amber = low-confidence: OCR read rn but the actual glyph is m.
Model output
Raw decode: "Cornplaint"
Low-confidence area: "rn" confused for "m"
Rescored output
Correction: "cornplaint" → "complaint"
Final text: "Complaint Date 01/15/2024"
Short summary
Strength: works on scans and photos.
Weakness: quality falls when pixels are poor.
Raster image (scanned page) → binarize (Otsu / Sauvola) → deskew (Hough transform) → detect words (find ink blobs) → word split (projection profiles) → LSTM read (char → token) → LM rescore (beam search)

How Tesseract segments words

  1. Binarize — Otsu global threshold converts grey pixels to pure black/white. Sauvola[8] is better for mixed handwriting and print (adapts per tile).
  2. Find baselines — horizontal projection profiles detect row gaps. Each ink-dense row is a text line.
  3. Find words — vertical projection profiles within each line detect white gaps. Each connected blob is a word candidate.
  4. Classify characters — LSTM (Tesseract 4+) reads each word image left-to-right and predicts character sequences with a CTC loss decoder.
  5. Language model rescoring — a word-n-gram model corrects low-confidence character predictions (e.g. rn → m).
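The same stages can be driven from Python. A minimal sketch, assuming pytesseract and PyMuPDF are installed; scan.pdf is a placeholder, and the --oem/--psm flags are generic defaults rather than values tuned in this benchmark:

```python
# Minimal sketch: render page 1 at ~300 DPI with PyMuPDF, then let Tesseract run
# its own binarisation, segmentation, and LSTM decoding. "scan.pdf" is a placeholder.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scan.pdf")
pix = doc[0].get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))  # ~300 DPI raster
img = Image.open(io.BytesIO(pix.tobytes("png")))

data = pytesseract.image_to_data(
    img, lang="eng", config="--oem 1 --psm 6",
    output_type=pytesseract.Output.DICT,
)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(f"{conf:>4}  {word}")  # per-word confidence, useful for HITL gating
```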

PaddleOCR additions

PaddleOCR replaces Tesseract's heuristic segmentation with a detection model (DB++ — a differentiable binarization network) that directly predicts word bounding boxes as a segmentation map. This handles rotated, curved, or irregularly spaced text that trips up projection profiles.

DB++ detection → word bbox polygons
CRNN recognition → character sequence
PP-StructureV2 → table HTML

PP-StructureV2 runs a separate layout model that classifies regions as text / title / table / figure, then applies a table-specific parser to reconstruct row/col structure as HTML.
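For reference, a minimal PaddleOCR sketch of the detection and recognition path. It follows the 2.x-style call; option names shift between releases, and complaint_scan.png is a placeholder:

```python
# Minimal sketch of the detect-then-recognise path (PaddleOCR 2.x-style API).
# "complaint_scan.png" is a placeholder; result nesting differs slightly by version.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en", use_angle_cls=True)  # DB detector + learned recogniser
result = ocr.ocr("complaint_scan.png", cls=True)

for box, (text, confidence) in result[0]:  # one entry per detected text region
    print(f"{confidence:.2f}  {text}")
```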

Why 300 DPI matters: a character that is 3mm tall on paper = 35 px at 300 DPI vs 24 px at 200 DPI. Below roughly 200-300 DPI, printed OCR often loses stroke detail and segmentation becomes less stable. The exact drop depends on the engine, document quality, and preprocessing.
Lecture — pixel density, binarization, and why preprocessing can hurt
🎓
Professor
Tesseract works on line or word-height image crops, reading left to right and predicting character probabilities across each region. When resolution is too low, adjacent strokes blur together and the recognizer sees ambiguous blobs where distinct characters should be.
🤔
Student
Can't we just scan at 600 DPI to give the model more pixels and better accuracy?
🎓
Professor
Higher DPI helps up to a point — but the bottleneck often shifts to binarization. The first real step in any OCR pipeline is converting the image to pure black and white. Otsu's algorithm finds a single global brightness threshold for the entire page. Now, if your complaint form has a coffee stain in one corner and clean white paper elsewhere, that global threshold is simultaneously too high in one area and too low in another.
🤔
Student
Is that what Sauvola binarization fixes? Our preprocessing code uses it...
🎓
Professor
Exactly — Sauvola divides the page into overlapping tiles and computes a local threshold for each one, adapting to local contrast. For fax-quality or coffee-stained scans it is dramatically better. But — and this is what our benchmark showed on FUNSD — on an already-clean 300 DPI scan, Sauvola's local window over-sharpens thin pen strokes. CER went from 0.484 → 0.560. Preprocessing hurt.
Expert insight
The practical rule: gate preprocessing on measured image quality, not a global flag. In the code, check_dpi() returns the scan resolution. If DPI < 200, or skew angle > 1°, or contrast ratio < 50, applying the full pipeline (deskew → denoise → CLAHE → Sauvola) is worth trying. On already-adequate inputs, the local rerun here showed the full pipeline was slower and less accurate, so it is usually not worth applying by default.
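A hypothetical sketch of that gate: should_preprocess and its thresholds mirror the rule of thumb described here, not the benchmark's actual check_dpi() code, and the skew angle is assumed to be measured separately.

```python
# Hypothetical gate mirroring the rule of thumb above. Not the benchmark's actual
# check_dpi(); skew is assumed to be measured separately (e.g. from a Hough fit).
import numpy as np
from PIL import Image

def should_preprocess(path: str, skew_deg: float) -> bool:
    img = Image.open(path)
    dpi = img.info.get("dpi", (72, 72))[0]               # scan resolution, if recorded
    grey = np.asarray(img.convert("L"), dtype=float)
    contrast = grey.max() - grey.min()                    # crude global contrast proxy
    return dpi < 200 or abs(skew_deg) > 1.0 or contrast < 50

# A clean 300 DPI scan with 0.3 degrees of skew returns False, so the heavy pipeline is skipped.
```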
Category 3 — ML layout parsing (Docling · TableFormer)

Treats a document page as a 2D photograph. A vision model recognises regions — text block, table, figure, heading — and TableFormer reconstructs table structure from patterns learned across large document corpora.

Animated walkthrough — Docling 5-stage pipeline

Step 1. The page image is sliced into a grid of 16×16 pixel patches — the same way modern AI image models process photographs. Each patch becomes one token (a 768-dimensional vector).

Page divided into 16×16 px patches — colour shows region type
■ heading ■ text ■ table ■ blank
What this produces
Input: full page image
Each patch: 768-dim embedding vector
Sequence length: hundreds to low thousands of tokens
This is the same patchification idea used by modern multimodal vision models and image classifiers. The document is treated as a visual object — no text layer required, works on scans.
Attention map — table header row attends to data cells below
Blue intensity = attention weight. Row 2 (header) strongly attends to the rows below — the model learns column membership, not just proximity.
Why attention beats rules
Rules: fail on borderless tables
Rules: fail on merged cells
Attention: handles any distance, any layout
A spanning header and its column cells can be 300+ pixels apart. Self-attention handles this trivially; rule-based line detection cannot.
Detection head output — labelled bounding boxes
HEADING: "Customer Complaint Form — Q1 2024"
TEXT: "The customer reported that the product arrived damaged on 15 January…"
TABLE (→ TableFormer): "Claim ID · Date · Amount · Status · Assigned To"
Output classes (DiT vocabulary)
Heading: title, section header
Text block: body paragraph, list item
Table: cropped → TableFormer
Figure / Caption: image or chart region
Tables are cropped at this point and handed off to the specialised TableFormer model for structure reconstruction.
TableFormer predicts HTML tokens one by one
<thead> <tr> <td> Claim ID </td> <td> Date </td> <td colspan="2"> Amount & Status </td> </tr> </thead> <tbody> <tr> <td> C-0042
Blue = HTML structure tokens predicted by the model. Green = text content (filled from OCR / text layer after structure is finalised). Notice colspan="2" — a merged cell, impossible to detect with line-finding rules.
Why sequence prediction wins
Rule-based: find lines → infer grid
TableFormer: predict structure → fill text
Training: large annotated table corpora
The model has seen a wide range of merged-cell, spanning-header, and nested-table configurations in large scientific-paper table datasets.
Docling final output — ready for downstream NLP
## Customer Complaint Form — Q1 2024

The customer reported that the product arrived
damaged on 15 January...

| Claim ID | Date       | Amount & Status |
|----------|------------|------------------|
| C-0042   | 2024-01-15 | $142 · Pending   |
| C-0043   | 2024-01-16 | $89  · Resolved  |
What Docling exports
Formats: Markdown, JSON, HTML
Tables: correct row/column grid + spans
Reading order: correct multi-column sequence
No regex cleanup. No column-merging hacks. No reading-order guessing. The output is directly usable by LLMs, search indexes, and downstream form extraction pipelines.
Lecture — why ML layout is a completely different category
🎓
Professor
Docling's core insight is to treat a document page as a 2D photograph and apply object detection — the same technology your phone uses to identify faces. Instead of 'face' or 'car', the output classes are 'text block', 'table', 'heading', 'figure', 'caption', 'footnote'.
🤔
Student
Can't you detect tables more simply, by just looking for horizontal and vertical lines on the page?
🎓
Professor
That works on clean, bordered tables. Real complaint forms often have borderless tables with rows separated only by spacing. Merged cells. Headers spanning three columns. Scanned forms where the ink lines are smudged or missing. Rule-based detection fails on many of these cases. TableFormer predicts the structure as a sequence of HTML tokens, using patterns learned from large annotated table corpora instead of only visible ruling lines.
🤔
Student
If Docling is so much better, why not always use it for everything?
🎓
Professor
Cost and speed. Docling runs a vision model on every page, so CPU throughput is typically in the seconds-per-page range rather than the hundreds-of-pages-per-second range you can see from byte-level extraction on clean digital PDFs. The right tool depends on the document type: layout understanding buys robustness, but you pay for it in compute.
Expert insight
Docling combines a DeiT-B-class layout backbone with a TableFormer-style structure model trained on large document and table corpora such as DocBank, PubLayNet, and PubTabNet. That breadth is why it generalises better than line-finding heuristics, but the end-to-end pipeline is still materially heavier than byte-level extraction on clean digital PDFs.
When to choose Docling: documents with complex tables, multi-column layouts, scanned pages, or mixed content where reading order matters. For the simple digital AcroForm slice in this benchmark, Category 1 text extraction was orders of magnitude faster and produced the same field-extraction outcome.
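A minimal Docling sketch, following the project's quickstart pattern; model weights are fetched on first run, and complaint_form.pdf is a placeholder:

```python
# Minimal Docling sketch; layout and TableFormer models download on first run.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("complaint_form.pdf")

markdown = result.document.export_to_markdown()  # tables arrive as proper grids
print(markdown)
```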
Category 4 — Handwriting recognition (TrOCR[5])

Microsoft TrOCR is a Vision-Encoder–Language-Decoder transformer fine-tuned for handwritten text recognition. Unlike traditional OCR, it skips explicit character segmentation entirely.

Animated walkthrough

Step 1. Crop the handwritten field tightly. TrOCR works best on a single field, not a full busy page.

Field-level TrOCR flow
Handwritten field crop
J o n e s
Keep the crop small so background boxes, tables, and nearby labels do not distract the model.
Short summary
Input: one field image
Best use: name, date, note, signature line
How this is usually taught

The common ViT mental model is: one image → fixed-size patches → a token sequence. The decoder only turns those image tokens into text in the next step.

1. Start with one cropped field image
J o n e s

TrOCR is meant for a tight word or line crop, not a full complaint-form page.

2. Resize to 384×384, cut into 16×16 patches → 24×24 grid
Jones · one square = 16×16 px · 24 × 24 = 576 patches

The crop is resized to 384×384 px. The grid cuts it into 16×16 px squares — uniform, letter-blind. 384 ÷ 16 = 24, so the grid is 24×24 = 576 patches. Green squares = patches that contain pen strokes.

Schematic only. The model does not split the image into letter columns. It cuts the whole resized image into equal squares regardless of where letters start or end.

3. Turn those patches into a token sequence
patch 1 · patch 2 · patch 3 · patch 4 · … · patch 576

This is still vision data, not text. The next step is where the decoder reads those image tokens and starts generating letters.

What the encoder does
Unit: 16×16 pixel patches
Default resize: 384×384 crop
Patch count: 24×24 = 576 patches
Output: a sequence of visual tokens
Still missing: no final text yet
This follows the standard Vision Transformer teaching pattern: image → patches → embeddings → token sequence. TrOCR then adds the text decoder on top.
Decoder attending → emitting "J"
Cross-attention — decoder step 1 of 5: emitting "J", with attention weights shown over the "Jones" patches.

To emit each letter, the decoder queries cross-attention over the 576 patch embeddings. Brighter blue = higher attention weight for that patch at this decoding step.

Schematic only. Real attention maps are distributed across all heads and all layers — not a single clean spotlight.

One output token at a time
J o n e s
The highlighted token is being emitted now. Each step re-runs cross-attention over the full patch sequence before picking the next character.
Attention scope: all 576 patches, every step
Not used: no handcrafted character boxes
Connected strokes: handled as one word image
Field-level output
Model text: "Jones"
Use this on: short handwritten fields
Avoid: full-page mixed layouts
Practical takeaway
Strength: cleaner cursive transcription
Risk: messy real forms still need review
Handwritten field crop (from form region) → ViT encoder (16×16 px patches → patch embeddings) → cross-attention (decoder attends to patches) → autoregressive decode (token by token, beam=4) → text tokens → final string
What TrOCR gets right: because it reads a field as a whole image, connected script and ligatures that break character classifiers just work. The public large handwritten model reports CER ~2.9% on the clean IAM benchmark.[7]
What TrOCR gets wrong: real complaint forms have noisy, degraded, variable-size handwriting. Performance drops on messy real-world inputs — the IAM benchmark uses much cleaner line images than complaint forms typically are. Use it per-field, not whole-page, and calibrate on your own data.
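A minimal per-field sketch using the Hugging Face transformers API. The checkpoint name matches the variant used in this benchmark; name_field_crop.png is a placeholder for a tight, coordinate-based crop:

```python
# Minimal per-field sketch with Hugging Face transformers. "name_field_crop.png"
# is a placeholder for a tight crop of one handwritten field.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

field = Image.open("name_field_crop.png").convert("RGB")
pixel_values = processor(images=field, return_tensors="pt").pixel_values
ids = model.generate(pixel_values, num_beams=4, max_new_tokens=32)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])  # e.g. "Jones"
```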
Lecture — why TrOCR does not rely on explicit character boundaries
🎓
Professor
Traditional OCR has a fundamental chicken-and-egg problem: to read handwriting you need to segment it into individual characters, but to know where the boundaries are, you already need to know what the characters look like. Tesseract and PaddleOCR work around this using projection profiles — looking for white vertical gaps between characters. That approach fails completely on cursive script where letters are physically connected with no white gap.
🤔
Student
So how does TrOCR solve this? Where does the letter boundary detection happen?
🎓
Professor
It does not happen as a separate preprocessing stage. TrOCR takes the entire field image as input and asks a language decoder: "given all these visual patches, what is the most likely Unicode character sequence?" At each decoding step, cross-attention focuses on whichever patches contain the ink relevant to the current character. For the letter 'J', the decoder attends to the top-left patches where the vertical descending stroke lives. For 'o', it attends to the curved-loop patches. Segmentation is implicit in the attention weights rather than computed by a standalone character splitter.
🤔
Student
So if someone writes 'Jones' in connected cursive with no white gaps at all, TrOCR just reads the whole word and the model figures out the boundaries from context?
🎓
Professor
That's why TrOCR performs so well on clean IAM handwriting. Traditional character-segmentation OCR breaks badly on connected script, while TrOCR handles it more naturally. The limitation is training data: the public model was evaluated on cleaner, curated line images. Real complaint forms have degraded ink, mixed print and handwriting, and much more writer variation, which is why we run TrOCR per-field, with a coordinate-based crop, not whole-page.
Expert insight
TrOCR pairs a BEiT vision transformer (encoder) with a RoBERTa-family language model (decoder). The variant used here, microsoft/trocr-large-handwritten, is larger than the base family, but the architecture is the same. Cross-attention is what resolves ambiguous strokes — "rn" that looks like "m" gets pushed toward whichever makes a real word. A purely visual classifier just sees pixels.
In short: Categories 1–4 aren't interchangeable. A digital complaint form submitted as AcroForm → Category 1 (text extraction) wins. A faxed, crooked, coffee-stained form → Category 2 + preprocessing. A form with multi-column tables and headers → Category 3. A form with handwritten notes in the margins → Category 4. Production pipelines route to the right category per page, not per document.
Ablation Study

Performance across document types

Real measured results on this benchmark's test corpus, plus published benchmark references for parsers not yet run. ✓ = measured on this machine. * = from published research.

The measured tables below come from local reruns on this machine. Corpus: 20 synthetic AcroForms plus a 10-document FUNSD slice from the available scanned set. Speed is CPU-only on Apple Silicon. Rows marked * still come from published benchmarks, not this machine.
Measured Results — Digital AcroForms (N=20 synthetic) ✓

Synthetic complaint forms with real AcroForm field widgets. Ground truth: programmatically exact. CER measures full text extraction; Field F1 measures AcroForm field value extraction.

| Parser | CER ↓ | WER ↓ | FER ↑ | Field F1 ↑ | Speed (p/s) ↑ | AcroForm fields |
|---|---|---|---|---|---|---|
| pypdf | 0.490 | 0.524 | 1.000 | 1.000 | 415 | |
| pymupdf | 0.470 | 0.437 | 1.000 | 1.000 | 32 | |
| pdfplumber | 0.490 | 0.524 | 0.000 | 0.000 | 151 | |
| tesseract | 0.534 | 0.611 | 0.000 | 0.000 | 0.68 | |
CER ~0.49 is a measurement artifact: ground truth interleaves label+value; text parsers read label text and widget values in separate passes (non-interleaved order). Field F1 is the correct signal for AcroForms. Tesseract renders to image and runs OCR even on digital PDFs — correct for scanned forms, wasteful here.
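To make that distinction concrete, here is a minimal sketch of both metrics as typically computed: character error rate over the full extracted text, and exact-match F1 over named field values. The helper names are illustrative, not the benchmark's sandbox/evaluate.py.

```python
# Illustrative helpers, not the benchmark's sandbox/evaluate.py.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def field_f1(truth: dict, predicted: dict) -> float:
    true_positives = sum(1 for k, v in truth.items() if predicted.get(k) == v)
    precision = true_positives / max(len(predicted), 1)
    recall = true_positives / max(len(truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# A parser can score badly on cer() because label/value ordering differs from the
# ground truth, yet still score 1.0 on field_f1(): the pypdf case in the table above.
```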

Measured Results — Scanned Forms FUNSD (N=10) ✓

Real scanned business forms from FUNSD (EPFL 2019).[1] Human-annotated ground truth. Text parsers return empty string (no text layer in raster-image PDFs). Tesseract runs OCR on the rendered image.

| Parser | CER ↓ | WER ↓ | FER ↑ | Field F1 ↑ | Speed (p/s) ↑ | Reads scanned |
|---|---|---|---|---|---|---|
| tesseract (raw) | 0.484 | 0.749 | 0.000 | 0.000 | 1.05 | |
| tesseract +preproc | 0.560 | 0.866 | 0.000 | 0.000 | 0.25 | |
| pypdf | 1.000 | 1.000 | 0.000 | 0.000 | 1165 | |
| pymupdf | 1.000 | 1.000 | 0.000 | 0.000 | 938 | |
| pdfplumber | 1.000 | 1.000 | 0.000 | 0.000 | 1117 | |

Tesseract CER=0.484 on this noisy 10-document FUNSD slice is plausible — these are degraded real-world business forms, not clean print. Text parsers "speed" is misleadingly high because they immediately return empty string (no text layer to process). Docling, PaddleOCR, and TrOCR results are still pending local reruns because they require larger model downloads.

Published benchmark numbers below (marked *) are from external research, not this machine. Sources: Applied AI 2025[9] (800+ docs, 17 parsers), Procycons 2025[10] (Docling table accuracy on sustainability reports), Koncile 2025[11], NVIDIA 2025[12]. Your results on your corpus will differ.
Table Cell Accuracy — Published Benchmarks
Table extraction accuracy (various doc types)
CPU throughput — pages per second

Speed rows marked ✓ come from the latest local reruns, but they are best-fit task measurements rather than one universal race: pypdf / PyMuPDF / pdfplumber are from digital AcroForms, while Tesseract is from scanned forms.

Handwriting Recognition — On Documents with Handwritten Notes

Source: Braincuber 2025 independent benchmark.[13] This chart mixes two kinds of evidence: cloud OCR numbers from document benchmarks with handwritten annotations, and a clean-reference TrOCR score from the IAM handwriting dataset.[7] Treat it as directional, not apples-to-apples.

Preprocessing Pipeline Impact on Degraded Scans
Raw fax scan: ~200 DPI, bitonal
Baseline accuracy: ~70%
Deskew + denoise: +12%
CLAHE + Sauvola: +10%
Result: ~92% accuracy (+22% median gain)

Source: Compiled from industry benchmarks (2023–2025).[9][12] The +22% is a median figure across degraded document types — your mileage will vary by scan quality and OCR engine.

Measured finding (2026-04-10): Preprocessing hurts already-good scans.
Running the full pipeline (deskew → denoise → CLAHE → Sauvola binarization) on FUNSD 300 DPI scans increased CER from 0.484 → 0.560 (+0.076 absolute, +15.7% relative). Sauvola binarization (window=25) over-thresholds fine strokes on adequate-quality 300 DPI images, destroying information that Tesseract could read from the raw scan.

Rule of thumb: Gate preprocessing on measured document quality — not a global boolean flag. Use check_dpi() + measure skew angle before deciding to preprocess. Reserve the full pipeline for fax-quality inputs (≤200 DPI, skew >1°, contrast ratio <50).
Preprocessing Effect — FUNSD N=10, Tesseract eng, 300 DPI ✓
| Configuration | CER ↓ | WER ↓ | Speed (p/s) ↑ | Verdict |
|---|---|---|---|---|
| tesseract (raw) | 0.484 | 0.749 | 1.05 | Use this |
| tesseract +preproc | 0.560 | 0.866 | 0.25 | Worse (+0.076 CER, 4.3× slower) |

Measured on this machine (Apple Silicon, CPU). Preprocessing pipeline: deskew (Hough) → denoise (fast non-local means) → CLAHE → Sauvola binarization. For degraded/fax inputs the pipeline is beneficial — this finding is specific to already-adequate 300 DPI business form scans.
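For reference, a minimal sketch of the preprocessing stages benchmarked above, using OpenCV and scikit-image. The deskew stage is omitted for brevity; window_size=25 matches the Sauvola setting quoted in the finding.

```python
# Sketch of the benchmarked stages (deskew omitted for brevity).
import cv2
import numpy as np
from skimage.filters import threshold_sauvola

def preprocess(grey: np.ndarray) -> np.ndarray:
    """grey: uint8 grayscale page image; returns a binarised uint8 image."""
    denoised = cv2.fastNlMeansDenoising(grey, h=10)                 # scanner noise
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))     # local contrast
    equalised = clahe.apply(denoised)
    threshold = threshold_sauvola(equalised, window_size=25)        # local binarisation
    return (equalised > threshold).astype(np.uint8) * 255

# Helpful on fax-quality input; on clean 300 DPI scans it can erase thin strokes,
# which is exactly the regression measured in the table above.
```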

Evaluation Rubric

How to score your parsers

A 10-criterion weighted rubric designed for complaint form requirements. Weights reflect the specific priorities of this use case — handwriting is weighted highest (0.22) because it is the primary failure mode.

| Criterion | Weight | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| A. Text Extraction | 0.12 | CER ≤ 2% | CER 5–10% | CER > 20% |
| B. Handwriting | 0.22 | CER ≤ 5% (your forms) | CER 15–30% | No capability |
| C. Table Structure | 0.08 | Cell acc ≥ 95% | 70–85% | No extraction |
| D. Form Field / KVP | 0.18 | Field F1 ≥ 0.95 | F1 0.70–0.85 | F1 < 0.50 |
| E. Figures / Diagrams | 0.05 | No hallucination near figures | Minor text bleed | Hallucination observed |
| F. Degraded Scans | 0.10 | CER < 10% at 200 DPI fax | CER 20–35% | Fails below 300 DPI |
| G. Speed | 0.05 | ≥ 5 p/s CPU | 0.2–1 p/s | < 0.05 p/s |
| H. Compliance | 0.10 | Zero egress, MIT/Apache | Self-hosted option w/ cloud activation | Cloud-only, no DPA by default |
| I. License | 0.05 | MIT / Apache-2.0 / BSD | GPL-3.0 (code open, weights NC) | Proprietary API |
| J. Maintenance | 0.05 | Active release < 6 months | Release within 18 months | Effectively abandoned (> 3 years) |
Mandatory gate: Criterion H (Compliance) must score ≥ 3 before any tool processes real customer complaint PII. A tool that scores 1 on Criterion E (hallucinated text from figures/signatures) is disqualified if your forms contain signature fields.

Estimated scores (from published research — measure on your corpus)

| Tool | A | B | C | D | E | F | G | H | I | J | Weighted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Docling + TrOCR Hybrid | 5 | 5 | 5 | 4 | 4 | 4 | 3 | 5 | 5 | 5 | 4.53 |
| Docling | 5 | 3 | 5 | 4 | 4 | 3 | 3 | 5 | 5 | 5 | 4.05 |
| PyMuPDF (digital only) | 5 | 1 | 3 | 3 | 5 | 1 | 5 | 4 | 4 | 4 | 3.20 |
| PaddleOCR | 4 | 2 | 3 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 3.47 |
| Tesseract | 3 | 1 | 1 | 2 | 3 | 3 | 3 | 5 | 5 | 5 | 2.73 |
| Claude API* | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 1 | 1 | 5 | 2.98* |

* Claude API scores Compliance (H) = 1 because data leaves your infrastructure. Not recommended for real complaint PII without a signed DPA. These are pre-measurement estimates — do not use as final scores without running the ablation on your corpus.
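The weighted-sum mechanics are simple enough to keep in a script next to your measured scores. A minimal sketch follows; the weights mirror the rubric above, while the example scores are illustrative placeholders, not measurements.

```python
# Weights mirror the rubric above; the example scores are illustrative placeholders.
WEIGHTS = {"A": 0.12, "B": 0.22, "C": 0.08, "D": 0.18, "E": 0.05,
           "F": 0.10, "G": 0.05, "H": 0.10, "I": 0.05, "J": 0.05}

def weighted_score(scores: dict) -> float:
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

def passes_gates(scores: dict, forms_have_signatures: bool) -> bool:
    # Mandatory gates from the rubric: Compliance (H) must score at least 3, and a
    # tool that hallucinates near figures (E == 1) is out if forms contain signatures.
    return scores["H"] >= 3 and not (forms_have_signatures and scores["E"] == 1)

candidate = {"A": 5, "B": 4, "C": 4, "D": 5, "E": 3, "F": 4, "G": 2, "H": 5, "I": 5, "J": 4}
print(round(weighted_score(candidate), 2), passes_gates(candidate, forms_have_signatures=True))
```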

Decision Framework

Which parser for which form?

Four questions narrow you to the right tool. A hybrid pipeline — different parsers for different form subtypes — almost always beats running a single parser on everything.

flowchart TD
  A([📄 Incoming complaint form PDF]) --> B{🔒 Air-gap or data residency required?}
  B -->|YES| C["Rule out all cloud APIs\nUse local tools only:\nDocling · PaddleOCR · TrOCR · pypdf"]
  B -->|NO| D
  C --> D{📄 Does the PDF have a text layer?}
  D -->|YES — Digital PDF| E{Is it an AcroForm\nwith fillable fields?}
  E -->|YES| F["✅ pypdf.get_fields\nBest current measured fit\non synthetic AcroForms"]
  E -->|NO| G{Complex tables\nor multi-column layout?}
  G -->|YES| H["✅ Docling\nStrong layout + table handling"]
  G -->|NO| I["✅ PyMuPDF\nFast plain text + geometry"]
  D -->|NO — Scanned PDF| J{🔍 Scan quality?}
  J -->|Less than 200 DPI| K["❌ Reject\nor request rescan"]
  J -->|Low DPI or visibly degraded| L["If you preprocess:\ndeskew → denoise → CLAHE → Sauvola"]
  J -->|Already readable 300 DPI| M["Usually skip heavy binarization\nor keep cleanup light"]
  L --> N{✍️ Handwritten fields present?}
  M --> N
  N -->|YES| O{High monthly volume\nor strict local control?}
  O -->|YES| P["✅ Docling + TrOCR\nSelf-hosted pipeline"]
  O -->|NO| Q{Cloud OK?}
  Q -->|YES| R["☁️ AWS Textract Forms+Tables\n$0.065 per page"]
  Q -->|NO| P
  N -->|NO — Printed text only| S{High monthly volume\nor lowest unit cost?}
  S -->|YES| U["✅ Self-hosted OCR/layout\nDocling or Tesseract path"]
  S -->|NO| T["☁️ Google Document AI\n$0.030 per page"]
  style F fill:#1a3a1a,stroke:#22c55e,color:#22c55e
  style H fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style I fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style P fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style U fill:#1a2a3a,stroke:#3b82f6,color:#3b82f6
  style K fill:#3a1a1a,stroke:#ef4444,color:#ef4444
  style R fill:#2a1a3a,stroke:#a855f7,color:#a855f7
  style T fill:#2a1a3a,stroke:#a855f7,color:#a855f7
Production Architecture

Recommended hybrid pipeline

No single parser handles all complaint form types optimally. Route by document type and confidence — different paths use different tools.

Hybrid Pipeline — Components and Flow
Intake: S3 + SQS
Classifier: LayoutLM
Digital path: PyMuPDF
Scanned path: preproc → Docling
Handwritten fields: TrOCR
Validation: Pydantic v2
HITL routing: < 0.70 confidence
PII redact: Presidio
Audit log: immutable
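A minimal routing sketch for the intake step. The handler names and the text-layer probe are illustrative glue code, not a library API; the branching mirrors the decision framework above.

```python
# Illustrative glue code: probe the PDF, then hand it to the best-fit path.
from pypdf import PdfReader

def route(path: str) -> str:
    reader = PdfReader(path)
    has_fields = bool(reader.get_fields())
    has_text = any((page.extract_text() or "").strip() for page in reader.pages)

    if has_fields:
        return "pypdf_acroform"      # digital AcroForm: read /V values directly
    if has_text:
        return "digital_text_path"   # PyMuPDF for plain text, Docling if tables/layout
    return "scanned_ocr_path"        # preprocess if degraded, then OCR / Docling

# Handwritten field crops and low-confidence results then branch to TrOCR and the
# HITL queue, as in the component list above.
```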
Illustrative Routing Benefit — Single parser vs hybrid pipeline

Routing each document type to the best-fit tool instead of running one parser on everything is often the highest-leverage improvement. The chart below is an illustrative scenario that mixes local measurements (✓), published benchmarks, and operational assumptions — your numbers will vary.

What routing buys you, concretely:
  • Digital AcroForms → pypdf: 1.00 Field F1 on the synthetic slice at 415 p/s. OCR on that path mostly wastes compute.
  • Scanned 300 DPI → a routed OCR/layout path can materially outperform a single default parser, especially when tables or mixed layout matter.
  • Handwritten fields → per-field HTR is usually a better fit than whole-page OCR, but the exact gain depends heavily on your handwriting corpus.
  • Low-confidence → HITL queue: catches the 5–10% of edge cases that degrade any automated pipeline.
The ceiling is not 100%:

Even a strong hybrid pipeline hits a wall on sub-200 DPI fax scans, heavy coffee-stained or torn forms, and extreme cursive handwriting. In practice, many teams still reserve the last 5–10% of uncertain cases for Human-in-the-Loop review rather than promise full automation.

✓ Must-do

  • Preprocessing before OCR, gated on actual scan quality
  • Empirical confidence calibration (not vendor defaults)
  • Field-level metrics, not form-level
  • Pydantic v2 cross-field validation (see the sketch after these lists)
  • Immutable audit log (strong compliance practice; no specific CFPB rule cited)
  • Presidio PII redaction before storage
  • DPA in place before any cloud API

✗ Anti-patterns

  • One parser for all document types
  • LLMs as primary OCR for numeric fields
  • Trusting vendor confidence as calibrated probability
  • Unstructured for real-time intake (51–140s/form)
  • LlamaParse without enterprise DPA on PII data
  • Benchmarking on vendor sample documents
  • Evaluating form-level accuracy (hides field failures)

⚠ Watch out for

  • Fax-quality scans (<200 DPI) — reject or upsample
  • Few-shot prompt contamination → hallucination
  • VLMs hallucinating chart/signature content
  • PyMuPDF AGPL if redistributing commercially
  • Marker/Surya GPL blocking proprietary products
  • Active learning without OOD validation
  • Static thresholds without recalibration over time
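A minimal Pydantic v2 sketch of the cross-field validation item from the must-do list. The field names and rules are illustrative, not the production schema.

```python
# Illustrative Pydantic v2 model; field names and rules are not the production schema.
from datetime import date
from pydantic import BaseModel, model_validator

class ComplaintForm(BaseModel):
    complaint_date: date
    incident_date: date
    claim_amount: float
    currency: str = "USD"

    @model_validator(mode="after")
    def check_cross_field_consistency(self) -> "ComplaintForm":
        if self.incident_date > self.complaint_date:
            raise ValueError("incident_date cannot be after complaint_date")
        if self.claim_amount < 0:
            raise ValueError("claim_amount must be non-negative")
        return self

# A ValidationError here routes the form to the HITL queue instead of straight through.
```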

Cost at scale — cloud vs self-hosted

Monthly Cost Estimate — Structured Extraction (1 form ≈ 3 pages)

These numbers assume Docling on a shared server with reasonable utilisation. At 10K forms/month it already undercuts both cloud options shown. The actual crossover depends on your hardware cost and server load — calculate it against your own setup. Vendor pricing checked April 2026.

Practical starting point for most teams: If you need one parser for both scanned and layout-heavy documents, Docling is the safe default. For handwritten fields, fine-tune TrOCR on real complaint samples. For digital AcroForms, pypdf.get_fields() was fastest and most accurate on the synthetic corpus here. Keep HITL for low-confidence cases — no single parser handles every subtype well.
Before you start: Annotate 200 real forms as ground truth JSON, run sandbox/evaluate.py, and measure CER/WER/FER on your actual documents. Confidence thresholds need to come from real data, not published benchmarks. The numbers here are a starting point.

References

  1. [1] Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. ICDAR-OST 2019. arXiv:1905.13538 · Dataset page Dataset · Paper
  2. [2] Ouyang, L., et al. (2024/2025). OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. CVPR 2025. arXiv:2412.07626 · GitHub Conference Paper
  3. [3] Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A. S., & Staar, P. (2022). DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. KDD 2022. IBM Research. arXiv:2206.01062 Dataset · Conference Paper
  4. [4] Auer, C., et al. (2024). Docling Technical Report. IBM Research Zurich. arXiv:2408.09869 · GitHub Technical Report
  5. [5] Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2021). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. Microsoft Research. arXiv:2109.10282 Paper
  6. [6] Du, Y., et al. (2020). PP-OCR: A Practical Ultra Lightweight OCR System. Baidu Inc. arXiv:2009.09941 (PP-OCRv3: arXiv:2206.03001) Paper
  7. [7] Marti, U.-V., & Bunke, H. (2002). The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1), 39–46. Dataset page Dataset
  8. [8] Sauvola, J., & Pietikäinen, M. (2000). Adaptive document image binarization. Pattern Recognition, 33(2), 225–236. doi:10.1016/S0031-3203(99)00055-2 Journal Paper
  9. [9] Applied AI. (2025, December). The State of PDF Parsing: What 800+ Documents and 7 Frontier LLMs Taught Us About Parser Selection. applied-ai.com/briefings/pdf-parsing-benchmark/ · GitHub Industry Benchmark
  10. [10] Procycons. (2025). PDF Data Extraction Benchmark 2025: Comparing Docling, Unstructured, and LlamaParse for Document Processing Pipelines. procycons.com/en/blogs/pdf-data-extraction-benchmark/ Industry Report
  11. [11] Koncile. (2025). Document Parsing in 2025: Tools, Accuracy, and Comparisons. koncile.ai/en/resources Industry Report
  12. [12] NVIDIA Developer Blog. (2025, August). Approaches to PDF Data Extraction for Information Retrieval. developer.nvidia.com/blog/approaches-to-pdf-data-extraction Technical Blog
  13. [13] Braincuber Technologies. (2025). AWS Textract vs Google Document AI: OCR Comparison. braincuber.com/blog/aws-textract-vs-google-document-ai-ocr-comparison Industry Comparison