Capability · Ratesheet ingestion

Ratesheet automation that learns.

IMAP email, portal automation, web/API scraping, and file upload all feed one normalization pipeline: Excel, PDF, and OCR conversion, table detection and classification, and a three-tier mapping resolver (stored template → AI → regex) that learns templates as it goes.

Overview

What it is, in one paragraph

Every source is normalized into the same {sheets:[{name, rows:[...]}]} JSON shape: Excel via Apache POI, PDF via PDFBox, image-only PDFs via Tess4J OCR. Tables are detected, segmented, and classified. Headers are normalized through a three-tier resolver — stored template (≥0.85 similarity), AI fallback (≥0.7), regex heuristic last — and the resolver writes new templates back to MySQL as it learns. Final ratesheets land as DRAFT, transition to ACTIVE on QC approval, and SUPERSEDE the previous version with a JPA-backed audit trail.
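
The three-tier order described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the 0.85 and 0.7 thresholds come from the text, while the similarity measure (Python's `difflib`), the sample templates, the regex rules, and the `ai_mapper` callback are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

TEMPLATE_THRESHOLD = 0.85  # stored-template similarity, from the text
AI_THRESHOLD = 0.7         # AI fallback confidence, from the text

# Hypothetical stored templates: vendor header -> canonical field.
TEMPLATES = {"note rate": "rate", "30 day lock": "price_30_day"}

# Hypothetical last-resort regex heuristics.
REGEX_RULES = [
    (re.compile(r"\brate\b", re.I), "rate"),
    (re.compile(r"\d+\s*-?\s*day", re.I), "price_by_lock_days"),
]


def similarity(a, b):
    """String similarity in [0, 1]; difflib is an assumed stand-in."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_header(header, ai_mapper=None):
    """Return (canonical_field, tier), or (None, 'unresolved')."""
    # Tier 1: closest stored template, accepted at >= 0.85 similarity.
    best = max(TEMPLATES, key=lambda t: similarity(header, t), default=None)
    if best is not None and similarity(header, best) >= TEMPLATE_THRESHOLD:
        return TEMPLATES[best], "template"
    # Tier 2: AI fallback, accepted at >= 0.7 confidence and persisted.
    if ai_mapper is not None:
        field, confidence = ai_mapper(header)
        if confidence >= AI_THRESHOLD:
            TEMPLATES[header.lower().strip()] = field  # learn the template
            return field, "ai"
    # Tier 3: regex heuristic.
    for pattern, field in REGEX_RULES:
        if pattern.search(header):
            return field, "regex"
    return None, "unresolved"
```

In this sketch the second call for the same header skips the AI tier, because the first call wrote the mapping back into `TEMPLATES`.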

Format-agnostic
Excel (POI), PDF text (PDFBox), image PDFs and scans (Tess4J OCR), HTML scrape (jsoup), API JSON. Same downstream pipeline.
Learning header mapping
First-seen vendor headers run AI; the resolved mapping is persisted as a template. Next sheet from that vendor matches the template directly — no AI cost.
Versioned activation
DRAFT → ACTIVE → SUPERSEDED. Roll back to a prior version with a single API call; the engine immediately uses the rolled-back grid.
QC dashboard
Side-by-side diff vs. prior version, sample-row preview, approve/reject. Approval is an explicit human action, never automatic.
Email-in
IMAP poller pulls inbound ratesheet emails, stores their attachments in a content-addressed MinIO bucket (SHA-256 dedup), and emits documents.raw.received.
Content-addressed store
Every blob is keyed by SHA-256 hash with SSE-S3 encryption, bucket versioning, lifecycle expiration, and abort-incomplete-multipart cleanup.
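
To make the QC diff concrete, here is a rough sketch of a row-level comparison between a prior version and a new draft. The row fields (`product`, `rate`, `price`), the choice of key fields, and the sample values are illustrative assumptions, not the product's schema.

```python
def diff_rows(prior, draft, key_fields=("product", "rate")):
    """Compare two row lists and report added, removed, and changed rows."""
    def index(rows):
        # Key each row by the fields assumed to identify it.
        return {tuple(row[k] for k in key_fields): row for row in rows}

    old, new = index(prior), index(draft)
    return {
        "added": [new[k] for k in sorted(new.keys() - old.keys())],
        "removed": [old[k] for k in sorted(old.keys() - new.keys())],
        "changed": [
            (old[k], new[k])
            for k in sorted(old.keys() & new.keys())
            if old[k] != new[k]
        ],
    }


# Illustrative data only.
prior = [
    {"product": "30YR_FIXED", "rate": 6.5, "price": 100.25},
    {"product": "30YR_FIXED", "rate": 6.625, "price": 100.75},
]
draft = [
    {"product": "30YR_FIXED", "rate": 6.5, "price": 100.125},
    {"product": "30YR_FIXED", "rate": 6.75, "price": 101.0},
]
result = diff_rows(prior, draft)
```

A reviewer would then see one added row, one removed row, and one price change rather than two full grids.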

How it works

The pipeline, end to end.

Numbered steps from input to output. Each step maps to a specific subsystem you can inspect via OpenTelemetry.

1
Source delivers a document
An email lands in the inbox, a portal scrape completes, or a user uploads a file. The raw blob is hashed (SHA-256), deduped, and stored in MinIO; documents.raw.received fires on NATS.
2
Convert to canonical JSON
conversion-service picks up the event, dispatches by content type to POI, PDFBox, or Tess4J OCR, and emits documents.converted with the {sheets:[{name, rows:[...]}]} JSON shape.
3
Detect and classify tables
extraction-service runs heuristic table detection, segments by header row patterns, and classifies each table (rate grid vs. text-headed). Output: extraction.metadata.extracted.
4
Resolve headers
ingestion-service runs the three-tier resolver. Stored templates match first; AI falls back next; regex heuristic last. Confidence below threshold raises an event for human QC.
5
Land as DRAFT
Normalized rows land in MySQL with status DRAFT. ratesheet-service emits ratesheet.draft.ready; the QC dashboard renders a diff vs. the prior ACTIVE version.
6
Activate
On approval, the version transitions to ACTIVE and the previous version becomes SUPERSEDED. ratesheets.activated fires, every pricing-service node invalidates its cache, and webhooks dispatch to subscribers.
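
Step 2 above, dispatching by content type and emitting one canonical shape, might look roughly like this. The registry, the function names, and the CSV converter are assumptions for illustration; the real converters are the POI, PDFBox, and Tess4J paths named in the text, and CSV is used here only because it runs with the standard library.

```python
import csv
import io


def convert_csv(blob: bytes) -> dict:
    """Stand-in converter: parse CSV bytes into the canonical shape."""
    rows = list(csv.reader(io.StringIO(blob.decode("utf-8"))))
    return {"sheets": [{"name": "Sheet1", "rows": rows}]}


# Hypothetical registry keyed by MIME type; real entries would wrap the
# Excel, PDF, and OCR converters.
CONVERTERS = {"text/csv": convert_csv}


def validate_canonical(doc: dict) -> dict:
    """Check the {sheets:[{name, rows:[...]}]} shape before emitting."""
    sheets = doc.get("sheets")
    if not isinstance(sheets, list):
        raise ValueError("canonical document needs a 'sheets' list")
    for sheet in sheets:
        name_ok = isinstance(sheet.get("name"), str)
        rows_ok = isinstance(sheet.get("rows"), list)
        if not (name_ok and rows_ok):
            raise ValueError("each sheet needs a 'name' string and a 'rows' list")
    return doc


def convert(content_type: str, blob: bytes) -> dict:
    """Pick a converter by content type and return validated canonical JSON."""
    converter = CONVERTERS.get(content_type)
    if converter is None:
        raise ValueError(f"unsupported content type: {content_type}")
    return validate_canonical(converter(blob))
```

Validating at the dispatch boundary keeps every downstream stage free of format-specific branches, which is the point of a single canonical shape.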

Hands on

ratesheets.activated webhook payload

Live JSON sample — copy, paste, ship.

# Webhook payload — ratesheets.activated
{
  "eventId": "01J2K8RM6X3T7Y9PNVA1F1Q3W2",
  "correlationId": "01J2K8RJYP4Z6N5GA2T9DH8C0H",
  "occurredAt": "2026-05-03T09:14:23.118Z",
  "type": "ratesheets.activated",
  "data": {
    "ratesheetId": "rs_C1c8a4ed",
    "investorCode": "INV_42",
    "version": 217,
    "previousVersion": 216,
    "activatedBy": "ingestion-service",
    "rowCount": 4128,
    "checksum": "sha256:f57a8d2c..."
  }
}
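
A subscriber could parse this payload along the following lines. This is a minimal sketch: the field names are taken from the sample above, while the `RatesheetActivated` class and `parse_activated` function are hypothetical names, and signature verification or retries are out of scope.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RatesheetActivated:
    event_id: str
    correlation_id: str
    occurred_at: datetime
    ratesheet_id: str
    investor_code: str
    version: int
    previous_version: int
    row_count: int
    checksum: str


def parse_activated(body: str) -> RatesheetActivated:
    """Parse a ratesheets.activated webhook body into a typed record."""
    event = json.loads(body)
    if event.get("type") != "ratesheets.activated":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    data = event["data"]
    return RatesheetActivated(
        event_id=event["eventId"],
        correlation_id=event["correlationId"],
        # Swap the trailing "Z" for an explicit UTC offset before parsing.
        occurred_at=datetime.fromisoformat(event["occurredAt"].replace("Z", "+00:00")),
        ratesheet_id=data["ratesheetId"],
        investor_code=data["investorCode"],
        version=int(data["version"]),
        previous_version=int(data["previousVersion"]),
        row_count=int(data["rowCount"]),
        checksum=data["checksum"],
    )


SAMPLE = """{
  "eventId": "01J2K8RM6X3T7Y9PNVA1F1Q3W2",
  "correlationId": "01J2K8RJYP4Z6N5GA2T9DH8C0H",
  "occurredAt": "2026-05-03T09:14:23.118Z",
  "type": "ratesheets.activated",
  "data": {
    "ratesheetId": "rs_C1c8a4ed",
    "investorCode": "INV_42",
    "version": 217,
    "previousVersion": 216,
    "activatedBy": "ingestion-service",
    "rowCount": 4128,
    "checksum": "sha256:f57a8d2c..."
  }
}"""
event = parse_activated(SAMPLE)
```

Keeping `correlationId` on the parsed record lets a consumer tie its own logs back to the per-document trace.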

Why this matters

The pain it removes.

No more morning copy-paste

The IMAP poller and the portal automation handle the dailies. Your team reviews a diff, not a spreadsheet.

Auditable provenance

Every ratesheet row traces back to the source document — by SHA-256, by upload timestamp, by who approved which version.

Vendor changes don't break you

A vendor reorders two columns? The mapping resolver adapts on the next sheet, persists the new template, and your pipeline keeps running.

Frequently asked

Direct answers, no marketing spin.

What if the AI mapping is wrong?

The QC dashboard shows confidence scores per column. Below threshold, the activation is blocked until a human resolves it. Resolutions are written back as templates so the next sheet matches without AI.
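
The gating rule in this answer can be sketched as a simple check over per-column confidence scores. The 0.7 threshold here is an assumed value (the answer does not state one), and the column names, scores, and function names are illustrative.

```python
REVIEW_THRESHOLD = 0.7  # assumed value; the QC threshold is not stated in the text


def columns_needing_review(confidences, threshold=REVIEW_THRESHOLD):
    """Return the columns whose mapping confidence is below the threshold."""
    return sorted(col for col, score in confidences.items() if score < threshold)


def can_activate(confidences, threshold=REVIEW_THRESHOLD):
    """Activation stays blocked while any column still needs review."""
    return not columns_needing_review(confidences, threshold)


def apply_resolution(confidences, templates, header, field):
    """Record a human decision: store it as a template and trust the column."""
    templates[header] = field
    confidences[header] = 1.0


# Illustrative scores only.
confidences = {"Note Rate": 0.93, "Adj.": 0.41}
templates = {}
```

Once a reviewer resolves `"Adj."`, the stored template means the next sheet with that header would not need the AI tier at all.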

How do you handle vendor portal scraping legally?

Portal automation is scoped to credentials you provide and only reaches hosts on your allowlist. agent-service never bypasses CAPTCHAs or bot detection — if a portal blocks automation, we surface the error and you handle it manually.

Can you ingest from our LOS?

Yes. The ingestion API accepts pre-normalized rows directly; you skip the conversion / extraction stages and feed straight into ingestion.

How are duplicate documents handled?

MinIO uses content-addressed keys (SHA-256). A second copy of the same document hashes to the same key, dedupes server-side, and skips the rest of the pipeline.
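
The dedup behavior can be illustrated with an in-memory stand-in for the bucket. This sketch assumes nothing about the MinIO client API; the `ContentAddressedStore` class and the `sha256/` key prefix are hypothetical, and only the SHA-256 keying comes from the text.

```python
import hashlib


class ContentAddressedStore:
    """In-memory stand-in for a bucket keyed by content hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes):
        """Store a blob under its SHA-256 key; return (key, is_new)."""
        key = "sha256/" + hashlib.sha256(blob).hexdigest()
        is_new = key not in self._blobs
        if is_new:
            self._blobs[key] = blob
        return key, is_new

    def get(self, key: str) -> bytes:
        return self._blobs[key]


store = ContentAddressedStore()
first_key, first_new = store.put(b"ratesheet bytes")
second_key, second_new = store.put(b"ratesheet bytes")
```

The second `put` returns the same key with `is_new` set to `False`, which is the signal a caller would use to skip the rest of the pipeline.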

Can a rolled-back ratesheet be re-activated?

Yes. Versions are immutable; a rollback moves the current version from ACTIVE to SUPERSEDED without deleting it. Re-activating that version is a single API call, fully audited.
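
The version lifecycle behind this answer can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the `RatesheetVersions` class, its method names, and the in-memory audit list are hypothetical, and only the DRAFT, ACTIVE, and SUPERSEDED states come from the text.

```python
DRAFT, ACTIVE, SUPERSEDED = "DRAFT", "ACTIVE", "SUPERSEDED"


class RatesheetVersions:
    """Tracks version statuses for one ratesheet plus an audit trail."""

    def __init__(self):
        self.status = {}  # version number -> status
        self.audit = []   # (actor, version, action) tuples

    def add_draft(self, version):
        if version in self.status:
            raise ValueError(f"version {version} already exists")
        self.status[version] = DRAFT

    def activate(self, version, actor):
        """Make one version ACTIVE; any currently ACTIVE one is superseded."""
        if version not in self.status:
            raise KeyError(version)
        for other, state in self.status.items():
            if state == ACTIVE and other != version:
                self.status[other] = SUPERSEDED
        self.status[version] = ACTIVE
        self.audit.append((actor, version, "ACTIVATED"))

    def active_version(self):
        return next((v for v, s in self.status.items() if s == ACTIVE), None)


versions = RatesheetVersions()
versions.add_draft(216)
versions.activate(216, actor="qc-reviewer")
versions.add_draft(217)
versions.activate(217, actor="qc-reviewer")
versions.activate(216, actor="qc-reviewer")  # rollback to the prior version
versions.activate(217, actor="qc-reviewer")  # re-activate the rolled-back one
```

Because versions are never mutated or deleted in this model, a rollback and a re-activation are the same operation pointed at different version numbers.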

How we compare

Ratesheet automation that learns vs. the alternatives.

Specific angles, not generic feature checklists. Each row links to a longer side-by-side; we're transparent about where competitors are the better choice.

vs Optimal Blue

Email-in, portal scraping, and OCR are all in the box — no premium ingestion module to license separately.

See the side-by-side

vs Polly

Learning header-mapping templates persist per-investor, so onboarding a new ratesheet stops being an engineer ticket after the first run.

See the side-by-side

vs Lender Price

DRAFT → ACTIVE → SUPERSEDED versioning is exposed via API and the UI — rollback is a single call, not a sales request.

See the side-by-side

vs Mortech

Open ingestion pipeline with a published canonical JSON shape — no proprietary mapping format to learn.

See the side-by-side

Comparisons reflect each vendor's public positioning. Where a fact is unverifiable, we mark it "Depends" or "Unknown" instead of guessing.

Related capabilities

Pieces that work better together.

Pricing engine

What runs against the ratesheets you ingest.

Observability

Per-document trace via correlationId.

Security

Encrypted blobs, hash-chained audit.

Ready to see it on your data?

Wire ratesheet automation that learns up to your real workflow.

We'll spin you a sandbox, load your actual ratesheets, and walk you through this capability against your top scenarios.

