
Ratesheet ingestion: the hard parts

Excel parses cleanly. PDF mostly does. Image-only PDFs do not. OCR is the failure mode that decides whether your ingestion pipeline survives Monday morning.

By RateStack Team · 10 min read

Excel is the easy case. Open with POI, walk sheets, extract rows, done. PDF is harder but tractable: PDFBox extracts text spans with positions; you reconstruct the table by clustering on the y-coordinate. Image-only PDFs and scans are where it gets interesting.
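The y-coordinate clustering can be sketched as follows. This is a minimal illustration, not our production code: `Span` is a hypothetical stand-in for whatever positioned-text type your PDF extraction yields, and `yTolerance` is an assumed tuning parameter (roughly a fraction of the line height).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RowClusterer {
    /** A text span and the coordinates of its origin (hypothetical type, not a PDFBox class). */
    static final class Span {
        final String text;
        final double x;
        final double y;

        Span(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups spans into table rows. Spans are sorted by y; a new row starts
     * whenever the gap to the previous span's y exceeds yTolerance. Each row
     * is then ordered left to right by x.
     */
    static List<List<String>> toRows(List<Span> spans, double yTolerance) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble((Span s) -> s.y));

        List<List<Span>> rows = new ArrayList<>();
        double lastY = 0;
        for (Span s : sorted) {
            if (rows.isEmpty() || s.y - lastY > yTolerance) {
                rows.add(new ArrayList<>());
            }
            rows.get(rows.size() - 1).add(s);
            lastY = s.y;
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Span> row : rows) {
            row.sort(Comparator.comparingDouble((Span s) -> s.x));
            List<String> texts = new ArrayList<>();
            for (Span s : row) {
                texts.add(s.text);
            }
            result.add(texts);
        }
        return result;
    }
}
```

Comparing each span to the previous one, rather than to a fixed row baseline, tolerates slight baseline drift within a row. The cost is that a tolerance set too high can chain adjacent rows together.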

Why OCR is the long tail

A surprising number of investor ratesheets arrive as image-only PDFs. Sometimes intentionally (the investor's pricing system exports an image to prevent scraping). Sometimes accidentally (someone scanned a paper ratesheet a decade ago and the workflow stuck). Either way, you can't extract text positions because there isn't any text — you have a rasterized image.

Tess4J on Tesseract gets us to maybe 95% character accuracy on a clean image. That sounds good until you realize a ratesheet is a 50×30 grid of three-digit numbers. If character errors are independent, a 3-digit cell is fully correct only 0.95³ ≈ 86% of the time. With 1,500 cells, that leaves roughly 200 wrong cells per ratesheet.
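The arithmetic is easy to check. The sketch below assumes character errors are independent, which is a simplification, since real OCR errors tend to cluster on smudged or skewed regions. The class and method names are illustrative.

```java
final class CellAccuracy {
    /** Probability that every character in a cell is read correctly, assuming independent errors. */
    static double cellAccuracy(double charAccuracy, int charsPerCell) {
        return Math.pow(charAccuracy, charsPerCell);
    }

    /** Expected number of cells containing at least one wrong character. */
    static double expectedWrongCells(double charAccuracy, int charsPerCell, int cells) {
        return cells * (1.0 - cellAccuracy(charAccuracy, charsPerCell));
    }
}
```

For example, `CellAccuracy.cellAccuracy(0.95, 3)` is about 0.857, and `CellAccuracy.expectedWrongCells(0.95, 3, 50 * 30)` is about 214.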

The two-step we landed on

Step 1: detect the table structure on the image (lines, cell boundaries) before OCR. Step 2: OCR each cell independently with a numeric-only dictionary when we know the column is numeric (rate, price). This dramatically improves accuracy because the OCR pass has a tighter prior.
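One simple way to approach step 1 is a projection profile: count dark pixels along each pixel row and column, and treat any that is mostly dark as a ruling line. The sketch below is an illustration under assumptions, not a description of our detector. It assumes a deskewed image with drawn ruling lines; the darkness cutoff of 128 and the `minFill` fraction are made-up tuning values; `GridDetector` is a hypothetical name.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

final class GridDetector {
    /** True if the pixel's mean RGB value is below 128 (an assumed cutoff). */
    static boolean isDark(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    /**
     * Returns the center positions of ruling lines along one axis. A pixel row
     * (or column, when horizontal is false) counts as part of a line when at
     * least minFill of its pixels are dark; adjacent qualifying rows are
     * merged into a single line.
     */
    static List<Integer> findLines(BufferedImage img, boolean horizontal, double minFill) {
        int outer = horizontal ? img.getHeight() : img.getWidth();
        int inner = horizontal ? img.getWidth() : img.getHeight();
        List<Integer> lines = new ArrayList<>();
        int runStart = -1;
        for (int i = 0; i <= outer; i++) {
            boolean isLine = false;
            if (i < outer) {
                int dark = 0;
                for (int j = 0; j < inner; j++) {
                    int rgb = horizontal ? img.getRGB(j, i) : img.getRGB(i, j);
                    if (isDark(rgb)) {
                        dark++;
                    }
                }
                isLine = dark >= minFill * inner;
            }
            if (isLine && runStart < 0) {
                runStart = i;
            }
            if (!isLine && runStart >= 0) {
                lines.add((runStart + i - 1) / 2); // center of the merged run
                runStart = -1;
            }
        }
        return lines;
    }

    /**
     * Builds one rectangle per cell, in row-major order, spanning the pixels
     * between consecutive line centers. Thick lines would need an extra inset.
     */
    static List<Rectangle> cells(List<Integer> ys, List<Integer> xs) {
        List<Rectangle> out = new ArrayList<>();
        for (int r = 0; r + 1 < ys.size(); r++) {
            for (int c = 0; c + 1 < xs.size(); c++) {
                int x = xs.get(c) + 1;
                int y = ys.get(r) + 1;
                out.add(new Rectangle(x, y, xs.get(c + 1) - x, ys.get(r + 1) - y));
            }
        }
        return out;
    }
}
```

Each rectangle can then be handed to the OCR engine as its own region, with the character set restricted for columns known to be numeric. Borderless tables have no ruling lines to find and would need a different approach, such as whitespace gaps.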

Those two steps feed a third: the QC dashboard. The pipeline never auto-activates a sheet where confidence falls below the threshold. The QC operator reviews the uncertain cells side by side with the source image and corrects them. Confidence-thresholded auto-activation is a roadmap item; we will not ship it before the false-positive rate is provably below what manual entry produces.
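The gating rule can be sketched as below. This is an illustration under assumptions, not our dashboard code: `CellResult` is a hypothetical type, confidence is assumed to be normalized to the range 0 to 1, and the sketch assumes a single uncertain cell is enough to hold the whole sheet.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class QcGate {
    /** One OCR'd cell with the engine's confidence in [0, 1] (hypothetical type). */
    static final class CellResult {
        final int row;
        final int col;
        final String text;
        final double confidence;

        CellResult(int row, int col, String text, double confidence) {
            this.row = row;
            this.col = col;
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Cells whose confidence is below the threshold, least confident first. */
    static List<CellResult> reviewQueue(List<CellResult> cells, double threshold) {
        List<CellResult> queue = new ArrayList<>();
        for (CellResult c : cells) {
            if (c.confidence < threshold) {
                queue.add(c);
            }
        }
        queue.sort(Comparator.comparingDouble((CellResult c) -> c.confidence));
        return queue;
    }

    /** A sheet is held for operator review whenever at least one cell is uncertain. */
    static boolean mustHoldForReview(List<CellResult> cells, double threshold) {
        return !reviewQueue(cells, threshold).isEmpty();
    }
}
```

Sorting the queue by ascending confidence puts the cells most likely to be wrong in front of the operator first.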

What this teaches

The lesson generalizes beyond OCR: when an automation pipeline can be 90%+ correct, the QC layer is more important than the additional automation. The right answer is high-recall automation with high-precision human review, not 99% automation with no review.

We've seen vendors claim "100% automated ratesheet ingestion." We don't make that claim, because the failure modes we just described are not solvable by a smarter regex. They're solvable by a human looking at the data for five minutes a day.
