
Ratesheet ingestion: the hard parts

Excel parses cleanly. PDF mostly does. Image-only PDFs do not. OCR is the failure mode that decides whether your ingestion pipeline survives Monday morning.

By RateStack Team | Published 2026-03-26 | Reviewed 2026-03-26 | 10 min read

Excel is the easy case. Open with POI, walk sheets, extract rows, done. PDF is harder but tractable: PDFBox extracts text spans with positions; you reconstruct the table by clustering on the y-coordinate. Image-only PDFs and scans are where it gets interesting.
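To make the y-coordinate clustering concrete, here is a minimal sketch in plain Java. It is not our production code and it does not call PDFBox: the `Span` record, the `RowClusterer` class, and the `yTolerance` parameter are all hypothetical stand-ins for whatever positioned-text type your extractor gives you. It assumes y grows down the page and that spans on the same row differ in y by less than the tolerance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RowClusterer {
    /** A text span with its position on the page (hypothetical type, not a PDFBox class). */
    record Span(String text, double x, double y) {}

    /**
     * Groups spans into table rows. Spans are sorted by y; a new row starts whenever
     * the gap to the previous span's y exceeds yTolerance. Each row is then sorted
     * left to right by x and reduced to its cell texts.
     */
    static List<List<String>> clusterRows(List<Span> spans, double yTolerance) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble(Span::y));

        List<List<Span>> rows = new ArrayList<>();
        double lastY = Double.NaN;
        for (Span s : sorted) {
            // Start a new row on the first span or when the vertical gap is too large.
            if (rows.isEmpty() || Math.abs(s.y() - lastY) > yTolerance) {
                rows.add(new ArrayList<>());
            }
            rows.get(rows.size() - 1).add(s);
            lastY = s.y();
        }

        List<List<String>> table = new ArrayList<>();
        for (List<Span> row : rows) {
            row.sort(Comparator.comparingDouble(Span::x));
            table.add(row.stream().map(Span::text).toList());
        }
        return table;
    }
}
```

Comparing each span to the previous one (rather than to a fixed row origin) tolerates slight baseline drift across a row, at the cost of merging rows if the tolerance is set larger than the row spacing.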

Why OCR is the long tail

A surprising number of investor ratesheets arrive as image-only PDFs. Sometimes intentionally (the investor's pricing system exports an image to prevent scraping). Sometimes accidentally (someone scanned a paper ratesheet a decade ago and the workflow stuck). Either way, you can't extract text positions because there isn't any text — you have a rasterized image.

Tess4J on Tesseract gets us to maybe 95% character accuracy on a clean image. That sounds good until you realize a ratesheet is a 50×30 grid of three-digit numbers, and 95% character accuracy means a three-digit cell is right only about 86% of the time (0.95³ ≈ 0.857, assuming character errors are independent). With 1,500 cells, that is a couple hundred wrong cells per ratesheet.
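The arithmetic is easy to check, and easy to rerun for your own grid sizes. The sketch below assumes per-character errors are independent, which real OCR errors are not (they cluster on bad scans), so treat the result as a rough estimate. The class and method names are made up for illustration.

```java
final class OcrErrorBudget {
    /** Probability that every character in a cell is right, assuming independent errors. */
    static double cellAccuracy(double charAccuracy, int charsPerCell) {
        return Math.pow(charAccuracy, charsPerCell);
    }

    /** Expected number of cells with at least one wrong character. */
    static double expectedWrongCells(double charAccuracy, int charsPerCell, int cells) {
        return cells * (1.0 - cellAccuracy(charAccuracy, charsPerCell));
    }
}
```

With the figures from the text, `cellAccuracy(0.95, 3)` is about 0.857 and `expectedWrongCells(0.95, 3, 50 * 30)` is roughly 214.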

The two-step we landed on

Step 1: detect the table structure on the image (lines, cell boundaries) before OCR. Step 2: OCR each cell independently with a numeric-only dictionary when we know the column is numeric (rate, price). This dramatically improves accuracy because the OCR pass has a tighter prior.
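As an illustration of Step 1, one simple way to find ruling lines is a projection profile: count dark pixels along each pixel row and column, and treat any row or column that is almost entirely dark as part of a line. This is a sketch of that general technique, not a description of our detector. The `GridDetector` class, the fixed mid-gray threshold, and the `minFill` parameter are assumptions, and it only works on deskewed images with fully ruled tables.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

final class GridDetector {
    /** True if the pixel is darker than mid-gray (fixed threshold; an assumption). */
    static boolean isDark(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    /**
     * Returns the centre coordinate of each ruling line along one axis. A pixel row
     * (or column) counts as line when at least minFill of its pixels are dark;
     * adjacent qualifying rows are merged into a single line.
     */
    static List<Integer> findLines(BufferedImage img, boolean horizontal, double minFill) {
        int outer = horizontal ? img.getHeight() : img.getWidth();
        int inner = horizontal ? img.getWidth() : img.getHeight();
        List<Integer> lines = new ArrayList<>();
        int runStart = -1;
        // Iterate one step past the end so a run touching the image edge is closed.
        for (int i = 0; i <= outer; i++) {
            boolean isLine = false;
            if (i < outer) {
                int dark = 0;
                for (int j = 0; j < inner; j++) {
                    if (horizontal ? isDark(img, j, i) : isDark(img, i, j)) dark++;
                }
                isLine = dark >= minFill * inner;
            }
            if (isLine && runStart < 0) runStart = i;
            if (!isLine && runStart >= 0) {
                lines.add((runStart + i - 1) / 2);
                runStart = -1;
            }
        }
        return lines;
    }

    /**
     * Builds one rectangle per cell, in row-major order, from consecutive line
     * positions. Assumes thin lines: a thick rule will bleed into the rectangle.
     */
    static List<Rectangle> cells(List<Integer> rowLines, List<Integer> colLines) {
        List<Rectangle> out = new ArrayList<>();
        for (int r = 0; r + 1 < rowLines.size(); r++) {
            for (int c = 0; c + 1 < colLines.size(); c++) {
                int x = colLines.get(c) + 1, y = rowLines.get(r) + 1;
                out.add(new Rectangle(x, y, colLines.get(c + 1) - x, rowLines.get(r + 1) - y));
            }
        }
        return out;
    }
}
```

Each rectangle can then be handed to the OCR engine on its own, which is what lets Step 2 apply a numeric-only restriction per column instead of per page.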

The QC dashboard sits on top of those two steps. The pipeline never auto-activates a sheet where confidence falls below threshold. Instead, a QC operator reviews the uncertain cells side-by-side with the source image and corrects them. Confidence-thresholded auto-activation is a roadmap item; we will not ship it until its false-positive rate is provably below what manual entry produces.
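The gating rule itself is small. The sketch below shows the shape of it under some assumptions: the `Cell` record, the `QcGate` class, and a confidence value normalized to the range 0 to 1 are hypothetical, and the threshold you pass in is something you would have to tune against your own data.

```java
import java.util.List;

final class QcGate {
    /** One OCR'd cell with the engine's confidence in [0, 1] (hypothetical shape). */
    record Cell(int row, int col, String text, double confidence) {}

    /** The cells a QC operator should review: everything below the threshold. */
    static List<Cell> needsReview(List<Cell> cells, double threshold) {
        return cells.stream().filter(c -> c.confidence() < threshold).toList();
    }

    /** A sheet is held back from activation if any cell is uncertain. */
    static boolean mustHold(List<Cell> cells, double threshold) {
        return !needsReview(cells, threshold).isEmpty();
    }
}
```

Holding the whole sheet on a single uncertain cell is deliberately conservative: one wrong price is enough to make an activated sheet harmful.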

What this teaches

The lesson generalizes beyond OCR: when an automation pipeline can be 90%+ correct, the QC layer is more important than the additional automation. The right answer is high-recall automation with high-precision human review, not 99% automation with no review.

We've seen vendors claim "100% automated ratesheet ingestion." We don't make that claim, because the failure modes we just described are not solvable by a smarter regex. They're solvable by a human looking at the data for five minutes a day.
