RateStack
Capability · Ratesheet ingestion

Ratesheet automation that learns.

Email IMAP, portal automation, web/API scraping, file upload — feed any source into one normalization pipeline. Excel, PDF, and OCR conversion, table detection and classification, and a three-tier mapping resolver (stored template → AI → regex) that learns templates as it goes.

Overview

What it is, in one paragraph

Every source is normalized into the same {sheets:[{name, rows:[...]}]} JSON shape: Excel via Apache POI, text PDFs via PDFBox, image-only PDFs via Tess4J OCR. Tables are detected, segmented, and classified. Headers are normalized through a three-tier resolver: a stored template (≥0.85 similarity) first, an AI fallback (≥0.7) next, and a regex heuristic last. The resolver writes new templates back to MySQL as it learns. Final ratesheets land as DRAFT, transition to ACTIVE on QC approval, and mark the previous version SUPERSEDED, with a JPA-backed audit trail.
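To make the three-tier order concrete, here is a minimal Python sketch of a resolver. It is an illustration under stated assumptions, not RateStack's implementation: the similarity measure (difflib), the per-header granularity, the regex rules, the field names, and the `ai_suggest` callable are all hypothetical, and a plain dict stands in for the MySQL template store. Only the tier order and the 0.85 / 0.7 thresholds come from the description above.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Tuple

TEMPLATE_MIN = 0.85  # stored-template similarity threshold (from the text)
AI_MIN = 0.70        # AI fallback confidence threshold (from the text)

# Hypothetical last-tier heuristics; patterns and field names are illustrative.
REGEX_RULES = [
    (re.compile(r"\brate\b", re.I), "rate"),
    (re.compile(r"\b(price|px)\b", re.I), "price"),
    (re.compile(r"\block\b", re.I), "lock_days"),
]

AiSuggest = Callable[[str], Optional[Tuple[str, float]]]


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: difflib ratio on normalized text."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_header(
    header: str, templates: Dict[str, str], ai_suggest: AiSuggest
) -> Tuple[Optional[str], str]:
    """Return (canonical_field, tier) for one raw header."""
    # Tier 1: the closest stored template wins if it clears the threshold.
    best_field, best_score = None, 0.0
    for known, field in templates.items():
        score = similarity(header, known)
        if score > best_score:
            best_field, best_score = field, score
    if best_field is not None and best_score >= TEMPLATE_MIN:
        return best_field, "template"

    # Tier 2: AI fallback; an accepted answer is persisted as a new template.
    suggestion = ai_suggest(header)
    if suggestion is not None:
        field, confidence = suggestion
        if confidence >= AI_MIN:
            templates[header] = field  # "learning" step, dict stands in for MySQL
            return field, "ai"

    # Tier 3: regex heuristic as the last resort.
    for pattern, field in REGEX_RULES:
        if pattern.search(header):
            return field, "regex"
    return None, "unresolved"
```

In this sketch, a header resolved by the AI tier is matched by the template tier on its next appearance, which is the cost-saving behavior the paragraph describes.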

  • Format-agnostic

    Excel (POI), PDF text (PDFBox), image PDFs and scans (Tess4J OCR), HTML scrape (jsoup), API JSON. Same downstream pipeline.

  • Learning header mapping

    Headers seen for the first time from a vendor go to the AI tier; the resolved mapping is persisted as a template. The next sheet from that vendor matches the template directly, with no AI cost.

  • Versioned activation

    DRAFT → ACTIVE → SUPERSEDED. Roll back to a prior version with a single API call; the engine immediately uses the rolled-back grid.

  • QC dashboard

    Side-by-side diff vs. prior version, sample-row preview, approve/reject. Approval is an explicit human action, never automatic.

  • Email-in

    IMAP poller pulls inbound ratesheet emails, stores them in a content-addressed MinIO bucket (SHA-256 dedup), and emits documents.raw.received.

  • Content-addressed store

    Every blob is keyed by SHA-256 hash with SSE-S3 encryption, bucket versioning, lifecycle expiration, and abort-incomplete-multipart cleanup.
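The content-addressed idea from the last two bullets can be sketched in a few lines of Python. This is an assumption-laden illustration: the `ContentAddressedStore` class and the `sha256/` key prefix are hypothetical, and an in-memory dict stands in for the MinIO bucket (encryption, versioning, and lifecycle rules are bucket configuration and are not modeled here).

```python
import hashlib
from typing import Dict, Tuple


class ContentAddressedStore:
    """In-memory stand-in for an object-store bucket (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> Tuple[str, bool]:
        """Store data under its SHA-256 digest.

        Returns (key, is_new). is_new is False when the same bytes were
        already stored, so the caller can skip re-processing the document.
        """
        key = "sha256/" + hashlib.sha256(data).hexdigest()
        if key in self._blobs:
            return key, False
        self._blobs[key] = data
        return key, True

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```

Because the key is derived only from the bytes, the same attachment arriving by email and by upload resolves to one object, and the `is_new` flag is a natural place to decide whether to emit a new event.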

How it works

The pipeline, end to end.

Numbered steps from input to output. Each step maps to a specific subsystem you can inspect via OpenTelemetry.

  1.

    Source delivers a document

    An email lands in the inbox, a portal scrape completes, or a user uploads a file. The raw blob is hashed (SHA-256), deduped, and stored in MinIO; documents.raw.received fires on NATS.

  2.

    Convert to canonical JSON

    conversion-service picks up the event, dispatches by content type to POI, PDFBox, or Tess4J OCR, and emits documents.converted with the {sheets:[{name, rows:[...]}]} JSON shape.

  3.

    Detect and classify tables

    extraction-service runs heuristic table detection, segments by header row patterns, and classifies each table (rate grid vs. text-headed). Output: extraction.metadata.extracted.

  4.

    Resolve headers

    ingestion-service runs the three-tier resolver: stored templates are tried first, the AI fallback second, and the regex heuristic last. A mapping whose confidence falls below threshold raises an event for human QC.

  5.

    Land as DRAFT

    Normalized rows land in MySQL with status DRAFT. ratesheet-service emits ratesheet.draft.ready; the QC dashboard renders a diff vs. the prior ACTIVE version.

  6.

    Activate

    On approval, the version transitions to ACTIVE and the previous version becomes SUPERSEDED. ratesheets.activated fires, every pricing-service node invalidates its cache, and webhooks dispatch to subscribers.
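The status transitions in steps 5 and 6 can be sketched as a small state machine. This Python sketch is illustrative only: the `RatesheetVersions` class, its method names, and the audit tuple layout are hypothetical, and an in-memory list stands in for the audit trail. It assumes at most one ACTIVE version at a time and treats rollback as re-activating a SUPERSEDED version.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Status(Enum):
    DRAFT = "DRAFT"
    ACTIVE = "ACTIVE"
    SUPERSEDED = "SUPERSEDED"


@dataclass
class RatesheetVersions:
    statuses: Dict[int, Status] = field(default_factory=dict)
    # Audit entries: (action, version, previous_active, actor)
    audit: List[Tuple[str, int, Optional[int], str]] = field(default_factory=list)

    def add_draft(self, version: int) -> None:
        if version in self.statuses:
            raise ValueError(f"version {version} already exists")
        self.statuses[version] = Status.DRAFT

    def active(self) -> Optional[int]:
        for version, status in self.statuses.items():
            if status is Status.ACTIVE:
                return version
        return None

    def activate(self, version: int, approved_by: str) -> None:
        """DRAFT -> ACTIVE; the current ACTIVE version becomes SUPERSEDED."""
        if self.statuses.get(version) is not Status.DRAFT:
            raise ValueError("only a DRAFT version can be activated")
        self._make_active(version, approved_by, "activate")

    def rollback(self, version: int, requested_by: str) -> None:
        """SUPERSEDED -> ACTIVE; the current ACTIVE version becomes SUPERSEDED."""
        if self.statuses.get(version) is not Status.SUPERSEDED:
            raise ValueError("only a SUPERSEDED version can be rolled back to")
        self._make_active(version, requested_by, "rollback")

    def _make_active(self, version: int, actor: str, action: str) -> None:
        previous = self.active()
        if previous is not None:
            self.statuses[previous] = Status.SUPERSEDED
        self.statuses[version] = Status.ACTIVE
        self.audit.append((action, version, previous, actor))
```

Routing both activation and rollback through one private method keeps the "exactly one ACTIVE version" invariant and the audit entry in a single place.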

Hands on

ratesheets.activated webhook payload

Live JSON sample — copy, paste, ship.

# Webhook payload — ratesheets.activated
{
  "eventId": "01J2K8RM6X3T7Y9PNVA1F1Q3W2",
  "correlationId": "01J2K8RJYP4Z6N5GA2T9DH8C0H",
  "occurredAt": "2026-05-03T09:14:23.118Z",
  "type": "ratesheets.activated",
  "data": {
    "ratesheetId": "rs_C1c8a4ed",
    "investorCode": "INV_42",
    "version": 217,
    "previousVersion": 216,
    "activatedBy": "ingestion-service",
    "rowCount": 4128,
    "checksum": "sha256:f57a8d2c..."
  }
}
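A subscriber might consume this payload roughly as follows. This Python sketch is an assumption, not a documented client: the `RatesheetActivated` and `ActivationHandler` names are hypothetical, and skipping repeated `eventId` values assumes deliveries can be retried, which the page does not state. Field names are taken from the sample above.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class RatesheetActivated:
    event_id: str
    ratesheet_id: str
    investor_code: str
    version: int
    previous_version: Optional[int]
    row_count: int
    checksum: str


def parse_activated(body: str) -> RatesheetActivated:
    """Parse a ratesheets.activated payload; reject other event types."""
    event = json.loads(body)
    if event.get("type") != "ratesheets.activated":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    data = event["data"]
    return RatesheetActivated(
        event_id=event["eventId"],
        ratesheet_id=data["ratesheetId"],
        investor_code=data["investorCode"],
        version=int(data["version"]),
        previous_version=data.get("previousVersion"),
        row_count=int(data["rowCount"]),
        checksum=data["checksum"],
    )


class ActivationHandler:
    """Tracks the active version per ratesheet; ignores repeated eventIds."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.active_version: Dict[str, int] = {}

    def handle(self, body: str) -> bool:
        """Return True if the event was applied, False if it was a repeat."""
        event = parse_activated(body)
        if event.event_id in self._seen:
            return False
        self._seen.add(event.event_id)
        # No "newer version only" check: a rollback may activate a lower number.
        self.active_version[event.ratesheet_id] = event.version
        return True
```

A real subscriber would typically also authenticate the request before parsing it; how RateStack signs webhooks is not described here, so that step is left out.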

Why this matters

The pain it removes.

No more morning copy-paste

The IMAP poller and the portal automation handle the dailies. Your team reviews a diff, not a spreadsheet.

Auditable provenance

Every ratesheet row traces back to the source document — by SHA-256, by upload timestamp, by who approved which version.

Vendor changes don't break you

A vendor reorders two columns? The mapping resolver adapts on the next sheet, persists the new template, and your pipeline keeps running.

Frequently asked

Direct answers, no marketing spin.

What if the AI mapping is wrong?

The QC dashboard shows confidence scores per column. Below threshold, the activation is blocked until a human resolves it. Resolutions are written back as templates so the next sheet matches without AI.
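The gating rule in this answer can be sketched as a small check. This is a hypothetical Python illustration: the function names are invented, and the threshold is a parameter because the page does not state the QC threshold value.

```python
from typing import Dict, List


def columns_needing_review(
    confidences: Dict[str, float], threshold: float
) -> List[str]:
    """Return the columns whose mapping confidence is below the threshold."""
    return sorted(col for col, score in confidences.items() if score < threshold)


def can_activate(confidences: Dict[str, float], threshold: float) -> bool:
    """Activation is allowed only when no column is below the threshold."""
    return not columns_needing_review(confidences, threshold)
```

For example, with per-column scores of `{"rate": 0.95, "price": 0.62}` and an illustrative threshold of 0.7, `columns_needing_review` returns `["price"]` and `can_activate` returns `False` until a human resolves that column.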

How do you handle vendor portal scraping legally?

Portal automation is scoped to credentials you provide and only reaches hosts on your allowlist. agent-service never bypasses CAPTCHAs or bot detection — if a portal blocks automation, we surface the error and you handle it manually.

Can you ingest from our LOS?

Yes. The ingestion API accepts pre-normalized rows directly; you skip the conversion / extraction stages and feed straight into ingestion.

How are duplicate documents handled?

Blobs are stored in MinIO under content-addressed keys (SHA-256). A second copy of the same document hashes to the same key, is recognized as a duplicate, and skips the rest of the pipeline.

Can a rolled-back ratesheet be re-activated?

Yes. Versions are immutable; a rollback moves the ACTIVE version to SUPERSEDED but leaves its contents intact. Re-activating it is a single API call, fully audited.

Ready to see it on your data?

Wire ratesheet automation that learns up to your real workflow.

We'll spin you a sandbox, load your actual ratesheets, and walk you through this capability against your top scenarios.
