Head-to-head benchmark of open-source PDF parsers on customer complaint forms — ablation tests, rubric scoring, and a decision framework.
Most PDF parsing benchmarks test clean academic papers or financial reports. Complaint forms are harder — they stack six difficulty layers at once.
The local tools below are free and can run on-prem. The Claude API row is included only as a cloud comparison baseline, not as an on-prem option.
| Tool | License | Scanned | Tables | Handwriting | Speed (CPU) | Best For |
|---|---|---|---|---|---|---|
| Docling ★ (strong local option) | MIT | ✓ | 97.9%* | Limited | ~1.3 p/s | Complex layouts, production on-prem |
| PyMuPDF | AGPL-3.0† | ✗ | Basic | ✗ | ~32 p/s | Digital PDFs, rendering + coordinates |
| pdfplumber | MIT | ✗ | Rule-based | ✗ | ~150 p/s | Table extraction from digital PDFs |
| PaddleOCR | Apache-2.0 | ✓ | ✓ (ML) | CER ~24% | ~2–4 p/s | Scanned PDFs, multilingual (80+ langs) |
| Tesseract | Apache-2.0 | ✓ | ✗ | Poor | ~1 p/s | Baseline OCR, widest language support |
| TrOCR | MIT | ✓ | ✗ | CER ~3%‡ | <1 p/s | Handwritten field extraction only |
| pypdf | BSD-3 | ✗ | ✗ | ✗ | ~415 p/s | AcroForm digital fields (fastest) |
| Claude API (not on-prem) | Proprietary | ✓ | ✓ | ~90% | 0.3 p/s | Research comparison only — data leaves infra |
Speed figures are indicative CPU timings and vary strongly by document type; the detailed measured tables below are the source of truth for this benchmark.

\* Procycons 2025 benchmark on sustainability reports.[10] Not measured on complaint forms specifically.
† AGPL-3.0 is free for internal/on-prem use; commercial redistribution requires a paid Artifex license.
‡ CER on the IAM clean handwriting dataset[7] — real-world complaint form handwriting will be higher.
The tools called "PDF parsers" use three completely different mechanisms. Which one a tool uses determines what it can and cannot read — that's what separates the benchmark rows.
Mechanism 1: direct text extraction. No image processing. No machine learning. Pure parsing of the PDF's internal byte structure.
Step 1. Open the PDF binary. Everything is stored as numbered objects — like rows in a database. Object 12 is the Page. Object 14 is the Font. Object 15 is the Contents stream.
Inside a PDF content stream
Every character's position is encoded as PDF operators. Tj draws a string. Td moves the cursor. Tf selects a font.
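A hand-written fragment makes the operator flow concrete. This is illustrative only; real content streams are usually Flate-compressed binary, and the coordinates and font name here are invented:

```
BT                      % begin a text block
/F1 11 Tf               % select font resource F1 at 11 pt
72 708 Td               % move the text cursor to (72, 708)
(Customer Name:) Tj     % draw the string at the cursor position
ET                      % end the text block
```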
Glyph → Unicode mapping
When a font embeds a ToUnicode CMap, parsers can map glyph codes to characters. When it's missing (common in scanned or form-flattened PDFs), parsers produce mojibake — or nothing — even though the page renders perfectly in Adobe Reader.
AcroForm field values live in /V dictionary entries, not in the visible text stream, so most extractors need a separate code path for forms. /V values bypass the content stream entirely — they're read from the form field structure instead, which is why pypdf reaches 1.00 Field F1 on the synthetic AcroForm slice used here.
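A minimal sketch of that separate code path using pypdf, whose documented get_fields() helper reads the /V entries directly (the file name is hypothetical):

```python
from pypdf import PdfReader

reader = PdfReader("complaint_form.pdf")  # hypothetical input

# Path 1: visible text, recovered from content stream operators
page_text = reader.pages[0].extract_text()

# Path 2: AcroForm values, read from the /V entry of each field
fields = reader.get_fields() or {}
for name, field in fields.items():
    print(name, "->", field.get("/V"))
```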
Mechanism 2: OCR. Converts a raster image to text through a multi-stage pipeline — each stage feeds the next.
Step 1. Start with pixels from a scan. The page looks readable to us, but it is only an image.
How Tesseract segments words

Tesseract's classic pipeline is heuristic: binarize the image, find line and word boundaries with projection profiles, then classify each segmented character. When strokes touch or spacing is tight, segmentation errors cascade; the classic failure is "rn" read as "m".
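A baseline sketch with pytesseract, assuming a page already rendered to an image; image_to_data exposes the per-word confidences that matter later for routing:

```python
import pytesseract
from PIL import Image

img = Image.open("scanned_form.png")  # hypothetical 300 DPI scan

# Full-page text via Tesseract's default segmentation heuristics
text = pytesseract.image_to_string(img)

# Word-level confidences, useful for flagging fields for review
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
low_conf = [w for w, c in zip(data["text"], data["conf"])
            if w.strip() and int(c) < 60]
```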
PaddleOCR additions

PaddleOCR replaces Tesseract's heuristic segmentation with a detection model (DB++ — a differentiable binarization network) that directly predicts word bounding boxes as a segmentation map. This handles rotated, curved, or irregularly spaced text that trips up projection profiles.
PP-StructureV2 runs a separate layout model that classifies regions as text / title / table / figure, then applies a table-specific parser to reconstruct row/col structure as HTML.
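A usage sketch against the PaddleOCR 2.x Python API (the 3.x release renamed entry points, so treat this as version-specific; the input path is hypothetical):

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en", use_angle_cls=True)  # downloads models on first run
result = ocr.ocr("scanned_form.png")

# 2.x returns, per image: [bounding_box, (text, confidence)] per text line
for box, (text, conf) in result[0]:
    print(f"{conf:.2f}  {text}")
```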
Preprocess only when needed: check_dpi() returns the scan resolution. If DPI < 200, skew angle > 1°, or contrast ratio < 50, applying the full pipeline (deskew → denoise → CLAHE → Sauvola) is worth trying. On already-adequate inputs, the local rerun here showed the full pipeline was slower and less accurate, so it is usually not worth applying by default.
Mechanism 3: layout-aware vision models. Treats a document page as a 2D photograph. A vision model recognises regions — text block, table, figure, heading — and TableFormer reconstructs table structure from patterns learned across large document corpora.

Step 1. The page image is sliced into a grid of 16×16 pixel patches — the same way modern AI image models process photographs. Each patch becomes one token (a 768-dimensional vector).
colspan="2" — a merged cell, impossible to detect with line-finding rules.## Customer Complaint Form — Q1 2024 The customer reported that the product arrived damaged on 15 January... | Claim ID | Date | Amount & Status | |----------|------------|------------------| | C-0042 | 2024-01-15 | $142 · Pending | | C-0043 | 2024-01-16 | $89 · Resolved |
Microsoft TrOCR is a Vision-Encoder–Language-Decoder transformer fine-tuned for handwritten text recognition. Unlike traditional OCR, it skips explicit character segmentation entirely.
Step 1. Crop the handwritten field tightly. TrOCR works best on a single field, not a full busy page.
The common ViT mental model is: one image → fixed-size patches → a token sequence. The decoder only turns those image tokens into text in the next step.
TrOCR is meant for a tight word or line crop, not a full complaint-form page.
The crop is resized to 384×384 px. The grid cuts it into 16×16 px squares — uniform, letter-blind. 384 ÷ 16 = 24, so the grid is 24×24 = 576 patches. Green squares = patches that contain pen strokes.
Schematic only. The model does not split the image into letter columns. It cuts the whole resized image into equal squares regardless of where letters start or end.
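The patch arithmetic is easy to check with plain numpy. This is a sketch of the slicing only, not TrOCR's actual preprocessing, which also normalises pixel values and applies a learned linear projection:

```python
import numpy as np

crop = np.zeros((384, 384, 3), dtype=np.uint8)  # the resized field crop
P = 16                                           # patch side length

# Slice into a 24×24 grid of 16×16 patches, then flatten each patch
patches = crop.reshape(384 // P, P, 384 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

print(patches.shape)  # (576, 768): 576 tokens of 768 raw dimensions each
```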
This is still vision data, not text. The next step is where the decoder reads those image tokens and starts generating letters.
To emit each letter, the decoder queries cross-attention over the 576 patch embeddings. Brighter blue = higher attention weight for that patch at this decoding step.
Schematic only. Real attention maps are distributed across all heads and all layers — not a single clean spotlight.
The checkpoint used here, microsoft/trocr-large-handwritten, is larger than the base family, but the architecture is the same. Cross-attention is what resolves ambiguous strokes — "rn" that looks like "m" gets pushed toward whichever reading makes a real word. A purely visual classifier just sees pixels.
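Running that checkpoint follows the standard Hugging Face VisionEncoderDecoder recipe; only the crop path is hypothetical:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

crop = Image.open("name_field.png").convert("RGB")  # one tight field crop
pixel_values = processor(images=crop, return_tensors="pt").pixel_values

ids = model.generate(pixel_values)  # decoder cross-attends over the 576 patches
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```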
Real measured results on this benchmark's test corpus follow, plus published benchmark references for parsers not yet run. ✓ = measured on this machine. * = from published research.

Synthetic complaint forms with real AcroForm field widgets. Ground truth: programmatically exact. CER measures full text extraction; Field F1 measures AcroForm field value extraction.
| Parser | CER ↓ | WER ↓ | FER ↑ | Field F1 ↑ | Speed (p/s) ↑ | AcroForm fields |
|---|---|---|---|---|---|---|
| pypdf | 0.490 | 0.524 | 1.000 | 1.000 | 415 | ✓ |
| pymupdf | 0.470 | 0.437 | 1.000 | 1.000 | 32 | ✓ |
| pdfplumber | 0.490 | 0.524 | 0.000 | 0.000 | 151 | ✗ |
| tesseract | 0.534 | 0.611 | 0.000 | 0.000 | 0.68 | ✗ |
CER ~0.49 is a measurement artifact: ground truth interleaves label+value; text parsers read label text and widget values in separate passes (non-interleaved order). Field F1 is the correct signal for AcroForms. Tesseract renders to image and runs OCR even on digital PDFs — correct for scanned forms, wasteful here.
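For reference, CER and WER are plain edit-distance ratios; a minimal sketch with the jiwer package and invented strings:

```python
import jiwer

reference = "Claim ID: C-0042  Amount: $142"   # ground truth (invented)
hypothesis = "Claim 1D: C-0042  Amount: $142"  # parser output (invented)

print(f"CER={jiwer.cer(reference, hypothesis):.3f}")  # character level
print(f"WER={jiwer.wer(reference, hypothesis):.3f}")  # word level
```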
Real scanned business forms from FUNSD (EPFL 2019).[1] Human-annotated ground truth. Text parsers return empty string (no text layer in raster-image PDFs). Tesseract runs OCR on the rendered image.
| Parser | CER ↓ | WER ↓ | FER ↑ | Field F1 ↑ | Speed (p/s) ↑ | Reads scanned |
|---|---|---|---|---|---|---|
| tesseract (raw) | 0.484 | 0.749 | 0.000 | 0.000 | 1.05 | ✓ |
| tesseract +preproc | 0.560 | 0.866 | 0.000 | 0.000 | 0.25 | ✓ |
| pypdf | 1.000 | 1.000 | 0.000 | 0.000 | 1165 | ✗ |
| pymupdf | 1.000 | 1.000 | 0.000 | 0.000 | 938 | ✗ |
| pdfplumber | 1.000 | 1.000 | 0.000 | 0.000 | 1117 | ✗ |
Tesseract CER=0.484 on this noisy 10-document FUNSD slice is plausible — these are degraded real-world business forms, not clean print. The text parsers' "speed" is misleadingly high because they immediately return an empty string (there is no text layer to process). Docling, PaddleOCR, and TrOCR results are still pending local reruns because they require larger model downloads.
Speed rows marked ✓ come from the latest local reruns, but they are best-fit task measurements rather than one universal race: pypdf / PyMuPDF / pdfplumber are from digital AcroForms, while Tesseract is from scanned forms.
Source: Braincuber 2025 independent benchmark.[13] This chart mixes two kinds of evidence: cloud OCR numbers from document benchmarks with handwritten annotations, and a clean-reference TrOCR score from the IAM handwriting dataset.[7] Treat it as directional, not apples-to-apples.
Source: Compiled from industry benchmarks (2023–2025).[9][12] The +22% is a median figure across degraded document types — your mileage will vary by scan quality and OCR engine.
Run check_dpi() and measure the skew angle before deciding to preprocess. Reserve the full pipeline for fax-quality inputs (≤200 DPI, skew >1°, contrast ratio <50); see the sketch after the table below.
| Configuration | CER ↓ | WER ↓ | Speed (p/s) ↑ | Verdict |
|---|---|---|---|---|
| tesseract (raw) | 0.484 | 0.749 | 1.05 | Use this |
| tesseract +preproc | 0.560 | 0.866 | 0.25 | Worse (+0.076 CER, 4.3× slower) |
Measured on this machine (Apple Silicon, CPU). Preprocessing pipeline: deskew (Hough) → denoise (fast non-local means) → CLAHE → Sauvola binarization. For degraded/fax inputs the pipeline is beneficial — this finding is specific to already-adequate 300 DPI business form scans.
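A sketch of the gate-then-preprocess logic with OpenCV and scikit-image. The thresholds mirror the heuristics above, check_dpi() is assumed to exist as described, and the deskew stage is omitted here for brevity:

```python
import cv2
import numpy as np
from skimage.filters import threshold_sauvola

def needs_preprocessing(gray: np.ndarray, dpi: int) -> bool:
    # Gate: only fax-quality inputs justify the ~4x slowdown measured above
    contrast = int(gray.max()) - int(gray.min())
    return dpi < 200 or contrast < 50  # skew check omitted for brevity

def preprocess(gray: np.ndarray) -> np.ndarray:
    gray = cv2.fastNlMeansDenoising(gray, h=10)        # denoise
    gray = cv2.createCLAHE(clipLimit=2.0).apply(gray)  # local contrast (CLAHE)
    t = threshold_sauvola(gray, window_size=25)        # adaptive binarization
    return ((gray > t) * 255).astype(np.uint8)
```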
A 10-criterion weighted rubric designed for complaint form requirements. Weights reflect the specific priorities of this use case — handwriting is weighted highest (0.22) because it is the primary failure mode.
| Criterion | Weight | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| A. Text Extraction | 0.12 | CER ≤ 2% | CER 5–10% | CER >20% |
| B. Handwriting | 0.22 | CER ≤ 5% (your forms) | CER 15–30% | No capability |
| C. Table Structure | 0.08 | Cell acc ≥ 95% | 70–85% | No extraction |
| D. Form Field / KVP | 0.18 | Field F1 ≥ 0.95 | F1 0.70–0.85 | F1 < 0.50 |
| E. Figures / Diagrams | 0.05 | No hallucination near figures | Minor text bleed | Hallucination observed |
| F. Degraded Scans | 0.10 | CER <10% at 200 DPI fax | CER 20–35% | Fails below 300 DPI |
| G. Speed | 0.05 | ≥5 p/s CPU | 0.2–1 p/s | <0.05 p/s |
| H. Compliance | 0.10 | Zero egress, MIT/Apache | Self-hosted option w/ cloud activation | Cloud-only, no DPA by default |
| I. License | 0.05 | MIT / Apache-2.0 / BSD | GPL-3.0 (code open, weights NC) | Proprietary API |
| J. Maintenance | 0.05 | Active release <6 months | Release within 18 months | Effectively abandoned (>3 years) |
| Tool | A | B | C | D | E | F | G | H | I | J | Weighted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Docling + TrOCR Hybrid | 5 | 5 | 5 | 4 | 4 | 4 | 3 | 5 | 5 | 5 | 4.53 |
| Docling | 5 | 3 | 5 | 4 | 4 | 3 | 3 | 5 | 5 | 5 | 4.05 |
| PyMuPDF Digital only | 5 | 1 | 3 | 3 | 5 | 1 | 5 | 4 | 4 | 4 | 3.20 |
| PaddleOCR | 4 | 2 | 3 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 3.47 |
| Tesseract | 3 | 1 | 1 | 2 | 3 | 3 | 3 | 5 | 5 | 5 | 2.73 |
| Claude API* | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 1 | 1 | 5 | 2.98* |
* Claude API scores Compliance (H) = 1 because data leaves your infrastructure. Not recommended for real complaint PII without a signed DPA. These are pre-measurement estimates — do not use as final scores without running the ablation on your corpus.
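The Weighted column method itself is a plain weighted sum; a minimal sketch with hypothetical 1–5 scores, not the table's exact inputs:

```python
WEIGHTS = {"A": 0.12, "B": 0.22, "C": 0.08, "D": 0.18, "E": 0.05,
           "F": 0.10, "G": 0.05, "H": 0.10, "I": 0.05, "J": 0.05}

def weighted_score(scores: dict[str, int]) -> float:
    assert set(scores) == set(WEIGHTS)  # one 1-5 score per criterion
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical candidate tool, strong on handwriting but slow
print(round(weighted_score({"A": 4, "B": 5, "C": 3, "D": 4, "E": 3,
                            "F": 4, "G": 2, "H": 5, "I": 5, "J": 4}), 2))
```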
Four questions narrow you to the right tool. A hybrid pipeline — different parsers for different form subtypes — almost always beats running a single parser on everything.
No single parser handles all complaint form types optimally. Route by document type and confidence — different paths use different tools.
Routing each document type to the best-fit tool instead of running one parser on everything is often the highest-leverage improvement. The chart below is an illustrative scenario that mixes local measurements (✓), published benchmarks, and operational assumptions — your numbers will vary.
Even a strong hybrid pipeline hits a wall on sub-200 DPI fax scans, heavy coffee-stained or torn forms, and extreme cursive handwriting. In practice, many teams still reserve the last 5–10% of uncertain cases for Human-in-the-Loop review rather than promise full automation.
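A sketch of the routing skeleton, using PyMuPDF for the cheap probes. The route names are placeholders, and the confidence threshold for the HITL queue is something to calibrate on your own corpus:

```python
import fitz  # PyMuPDF

def route(pdf_path: str) -> str:
    page = fitz.open(pdf_path)[0]

    if page.first_widget is not None:   # AcroForm widgets present
        return "acroform_fields"        # pypdf reads /V values directly
    if page.get_text().strip():         # digital text layer present
        return "digital_text"           # PyMuPDF / pdfplumber
    return "ocr"                        # raster only: Tesseract/PaddleOCR/TrOCR

# Low-confidence OCR output then falls through to human review (HITL)
```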
These numbers assume Docling on a shared server with reasonable utilisation. At 10K forms/month it already undercuts both cloud options shown. The actual crossover depends on your hardware cost and server load — calculate it against your own setup. Vendor pricing checked April 2026.
Run sandbox/evaluate.py and measure CER/WER/FER on your actual documents. Confidence thresholds need to come from real data, not published benchmarks. The numbers here are a starting point.