Skip to content
RateStack
Pillar · Ratesheet automation

Ratesheet automation, end to end

The seven-stage pipeline that turns inbound investor documents into versioned, queryable ratesheets — and the operational disciplines that keep it boring.

RTBy RateStack TeamPublishedReviewed16 min read

Why automate

The 7am ratesheet ritual — log into vendor portals, download PDFs, copy cells into the LOS — is the single most expensive recurring task in small and mid-size lender ops. Manual entry has a non-trivial error rate, no audit trail outside the inbox, and no resilience: if the person who runs it is sick, prices don't update.

Automation isn't about replacing humans. It's about moving them up the value chain — from data entry to QC and exception handling. The goal of a ratesheet automation pipeline is to take 90% of the morning's work off humans while making the remaining 10% (the unusual cases) more visible and auditable.

Sources: email, portal, scrape, upload

Investors deliver ratesheets through four channels. A serious pipeline handles all four with the same downstream stages:

  • Email inbox. An IMAP poller monitors a dedicated ingestion mailbox; new mail with attachments lands as raw documents for the pipeline. Sender-allowlists prevent random spam from triggering ingestion.
  • Vendor portal. Some investors require login to their portal to retrieve the daily sheet. A headless browser (Playwright, in production environments) handles login and download. This is best-effort: portals occasionally break automation through UI redesigns or anti-bot defenses, and the pipeline must surface failures explicitly.
  • Web scrape / API. Some investors publish ratesheets to a public URL or a JSON feed. Scrape with an SSRF-safe HTTP client and host allowlist; never let a portal config talk to internal IP space.
  • Direct upload. Operators upload edge-case sheets (corrections, mid-day reissues, paper sheets) through a UI. Same pipeline from there.

Conversion: any format → canonical JSON

Sources deliver in three formats: native Excel (XLSX), PDF (text-based or image-based), and HTML/JSON (scrape). The conversion stage normalizes every input into the same shape: { sheets: [{ name, rows: [[cell, cell, ...]] }] }.

  • Excel: Apache POI walks sheets, extracts rows. Reliable.
  • Text PDF: PDFBox extracts text spans with positions; the extractor reconstructs the table from y-coordinate clustering. Works well; some columnar layouts need tuning.
  • Image PDF / scan: Tess4J (Tesseract) runs OCR with numeric-mode dictionaries when columns are known to be numeric. This is the long tail; expect 95-98% character accuracy and plan for QC.
  • HTML / JSON: jsoup parses HTML tables; JSON ingest is direct.

Critically: the conversion stage is content-addressed. Every raw document is hashed (SHA-256), stored in object storage by hash, and deduplicated. The same investor sending the same sheet twice in a morning is a no-op after the first insert.

Extraction: tables and classification

Conversion produces undifferentiated rows. Extraction segments those rows into tables and classifies each table by purpose: a rate grid, an LLPA matrix, a program-eligibility block, a notes section, a margin schedule. Each classification path feeds different downstream normalization logic.

Classification is heuristic: row-density patterns, header-row recognition (presence of expected fields like "rate," "price," "LTV," "FICO"), and column-type inference. Confidence below threshold raises a QC item before the ratesheet can activate.

Header mapping: the learning template pattern

The most consequential design decision in the pipeline: how do you normalize an investor's column headers (which vary by investor and change occasionally) into the engine's canonical fields?

A naive approach hardcodes per-investor mappings. This is brittle: a column reorder or a header rename breaks the parser, and adding a new investor requires code.

A serious approach uses a three-tier resolver:

  1. Stored template: if a template already exists for this investor and matches the current header set with high similarity (≥0.85), use it. This is the fast path; most days no AI call needed.
  2. AI fallback: when no template matches, send the header strings (header text only — no borrower data, no rate values) to an LLM with a structured prompt asking for a mapping. Acceptance threshold around 0.7. The resolved mapping is persisted as a new template for the investor; next time, it's the fast path.
  3. Regex heuristic: as a final fallback, rule-based header normalization. Captures the 80/20 of common patterns. Always writes to the template store on success.

The downstream property is that the system learns. New investors cost an AI call once; subsequent sheets from that investor match the stored template. Vendor changes (column reorder, header rename) are absorbed by re-running the resolver and writing a new template. The pipeline keeps running.

Versioning: DRAFT, ACTIVE, SUPERSEDED

A ratesheet is not a configuration file you overwrite. It is a versioned data object with three states:

  • DRAFT: ingested and parsed; not yet visible to the pricing engine. The QC dashboard renders a diff against the currently ACTIVE version.
  • ACTIVE: the current production view. Pricing requests read this version. The version id is captured in every quote's trace.
  • SUPERSEDED: a previous ACTIVE version. Still queryable, still replayable for historical pricing, just not the current view.

Activation is an explicit human action by default — an operator approves the DRAFT through the QC dashboard. The transition is atomic: previous ACTIVE flips to SUPERSEDED, new ACTIVE goes live, and an event fires for cache invalidation.

Rollback is a single API call. Rolling back to a prior version flips states again. The pricing engine reflects the rollback within the event-driven cache invalidation window — typically under a second.

Quality control: what to surface

QC is what makes automation viable in production. The dashboard shows, per DRAFT ratesheet:

  • Diff vs. previous ACTIVE: rows added, rows removed, cells changed. Numeric changes flagged with delta amounts.
  • Confidence scores per column: where the mapping resolver was uncertain.
  • OCR confidence per cell (image-source documents): cells below the threshold rendered side-by-side with the source image patch.
  • Sample pricing comparison: a fixed sample loan priced under both DRAFT and currently-ACTIVE versions. Material price deltas highlight the impact of the change.
  • Approve / reject controls with a reason field. Rejection writes a journal entry; the source document is preserved for post-mortem.

Failure modes

Real pipelines fail in predictable ways:

  • Vendor changes its layout. Mapping confidence drops; QC catches it; new template lands; pipeline resumes. Don't auto- activate below threshold — block until human resolves.
  • OCR misreads numeric cells. Use numeric-only dictionaries on known-numeric columns. Cells below confidence go to QC. The sample-pricing diff is the safety net — a typo that moves the price by 25 bps will be visible.
  • Portal automation breaks. Portals redesign UIs. Plan for it. Surface failures explicitly with the last successful fetch timestamp; have a manual-fallback runbook.
  • Email forwarding loops. An ingestion mailbox that forwards into itself (via a misconfigured rule) creates an infinite loop. Sender allowlist + content-addressed dedup catch this; don't skip either.
  • Time-zone confusion on activation timestamps. A ratesheet labeled "effective 5/3/2026 EST" uploaded in PST should activate when EST hits 5/3, not when PST does. Store timestamps in UTC; render with explicit zones.
  • Investor sends the wrong file. Manual content review catches this. Don't over-automate; the sample-pricing diff is the guardrail.

What to measure

A healthy ingestion pipeline reports on:

  • Documents received per source per day.
  • Time from receipt to ACTIVE (per source).
  • Pipeline failure rate by stage (convert, extract, ingest).
  • Mapping confidence distribution.
  • QC items raised vs. resolved per day.
  • Time to QC resolution.
  • Auto-rejection rate for thresholded confidence.
  • Cache invalidation latency post-activation.

A team that watches these metrics knows when to add a vendor template, when to retrain the OCR for a problematic source, and when to escalate to the investor about a delivery problem.

Cluster

More on this topic.

Deep dives that pick up where the pillar leaves off.

Frequently asked

Questions readers send us.

What ratesheet formats do you actually accept?

XLSX (Excel), text-based PDFs, image-based PDFs (via OCR), HTML scrapes, and direct JSON. Manual upload supports any of those plus CSV.

How long does first-time setup take?

For an investor we already have a template for, instant. For a new investor with a typical layout, the first sheet costs an AI call; subsequent sheets match the template directly. Operationally, most teams reach steady-state ingestion across their full investor pack within two weeks.

Can the pipeline auto-activate?

It can, but we recommend against it for new deployments. The QC step is what catches subtle mapping errors and unusual investor changes. Once you trust the pipeline on a per-investor basis, you can configure auto-activation for that investor specifically — most teams keep human approval as the default.

What about vendor portal anti-bot defenses?

Portal automation is a contractual relationship — we use credentials you provide and respect the portal's terms. We don't bypass CAPTCHAs or anti-automation defenses. When a portal breaks automation, we surface the failure explicitly so you can fall back to manual download.

Do you store the raw documents?

Yes, content-addressed by SHA-256, encrypted at rest, with bucket versioning and configurable lifecycle expiration. Default retention is 90 days post-supersede; configurable on Business and Enterprise.

Ready to see it on your data?

See ratesheet automation, end to end in production.

Spin up a sandbox or talk to us about a guided demo. Everything in this guide is wired into the platform — not aspirational.

Ratesheet automation — the pillar guide | RateStack