No more morning copy-paste
The IMAP poller and the portal automation handle the dailies. Your team reviews a diff, not a spreadsheet.
Feed any source (IMAP email, portal automation, web/API scraping, file upload) into one normalization pipeline: Excel, PDF, and OCR conversion; table detection and classification; and a three-tier mapping resolver (stored template → AI → regex) that learns templates as it goes.
Overview
Every source is normalized into the same {sheets:[{name, rows:[...]}]} JSON shape: Excel via Apache POI, PDF via PDFBox, image-only PDFs via Tess4J OCR. Tables are detected, segmented, and classified. Headers are normalized through a three-tier resolver — stored template (≥0.85 similarity), AI fallback (≥0.7), regex heuristic last — and the resolver writes new templates back to MySQL as it learns. Final ratesheets land as DRAFT, transition to ACTIVE on QC approval, and SUPERSEDE the previous version with a JPA-backed audit trail.
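The three-tier resolver described above can be sketched roughly as follows. This is a minimal illustration in Python, not the ingestion-service code: the similarity metric (difflib's SequenceMatcher over a sorted header signature), the `REGEX_RULES` table, the `ai_resolver` callable, and the in-memory `templates` dict standing in for MySQL are all assumptions. Only the 0.85 and 0.7 thresholds and the tier order come from the text.

```python
import re
from difflib import SequenceMatcher

TEMPLATE_THRESHOLD = 0.85  # stored-template similarity floor (from the text)
AI_THRESHOLD = 0.7         # AI fallback floor (from the text)

# Hypothetical regex heuristics: canonical field -> pattern.
REGEX_RULES = {
    "rate": re.compile(r"\brate\b", re.I),
    "price": re.compile(r"\b(price|px)\b", re.I),
    "lock_days": re.compile(r"\b(lock|days)\b", re.I),
}


def signature(headers):
    """Order-insensitive signature for a header row (an assumed design)."""
    return "|".join(sorted(h.strip().lower() for h in headers))


def regex_mapping(headers):
    """Last-resort tier: map each header to the first rule that matches."""
    mapping = {}
    for header in headers:
        for field, pattern in REGEX_RULES.items():
            if pattern.search(header):
                mapping[header] = field
                break
    return mapping


def resolve(headers, templates, ai_resolver):
    """Return (tier, mapping). `templates` maps signature -> mapping."""
    sig = signature(headers)

    # Tier 1: pick the most similar stored template, if it clears the floor.
    best_sig, best_score = None, 0.0
    for stored_sig in templates:
        score = SequenceMatcher(None, sig, stored_sig).ratio()
        if score > best_score:
            best_sig, best_score = stored_sig, score
    if best_sig is not None and best_score >= TEMPLATE_THRESHOLD:
        return "template", templates[best_sig]

    # Tier 2: ask the AI resolver; persist its answer as a new template.
    mapping, confidence = ai_resolver(headers)
    if confidence >= AI_THRESHOLD:
        templates[sig] = mapping
        return "ai", mapping

    # Tier 3: regex heuristic.
    return "regex", regex_mapping(headers)
```

Because the signature ignores column order in this sketch, a sheet whose columns are merely reordered matches the stored template without a second AI call. A near match returns the stored mapping keyed by the original header names; a real resolver would presumably re-key it to the new headers.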
Excel (POI), PDF text (PDFBox), image PDFs and scans (Tess4J OCR), HTML scrape (jsoup), API JSON. Same downstream pipeline.
First-seen vendor headers run AI; the resolved mapping is persisted as a template. Next sheet from that vendor matches the template directly — no AI cost.
DRAFT → ACTIVE → SUPERSEDED. Roll back to a prior version with a single API call; the engine immediately uses the rolled-back grid.
Side-by-side diff vs. prior version, sample-row preview, approve/reject. Approval is an explicit human action, never automatic.
The IMAP poller pulls inbound ratesheet emails, stores them in a content-addressed MinIO bucket (SHA-256 dedup), and emits documents.raw.received.
Every blob is keyed by SHA-256 hash with SSE-S3 encryption, bucket versioning, lifecycle expiration, and abort-incomplete-multipart cleanup.
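To make the content-addressed idea concrete, here is a small sketch of SHA-256 keying and dedup. It uses an in-memory dict in place of MinIO and a plain list in place of the message bus; the class name `ContentAddressedStore` and its `put` method are hypothetical, and encryption, versioning, and lifecycle rules are bucket configuration that this sketch does not model.

```python
import hashlib


class ContentAddressedStore:
    """In-memory stand-in for a content-addressed bucket (hypothetical)."""

    def __init__(self):
        self._blobs = {}   # hex digest -> bytes
        self.events = []   # stand-in for the message bus

    def put(self, blob):
        """Store a blob under its SHA-256 hex digest.

        Returns (key, is_new). A repeat upload returns the same key with
        is_new=False and emits no event, so downstream stages never rerun.
        """
        key = hashlib.sha256(blob).hexdigest()
        if key in self._blobs:
            return key, False
        self._blobs[key] = blob
        self.events.append(("documents.raw.received", key))
        return key, True
```

Keying by digest means dedup needs no lookup table: the key itself answers whether the document has been seen before.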
How it works
Six steps from input to output. Each step maps to a specific subsystem you can inspect via OpenTelemetry.
An email lands in the inbox, a portal scrape completes, or a user uploads a file. The raw blob is hashed (SHA-256), deduped, and stored in MinIO; documents.raw.received fires on NATS.
conversion-service picks up the event, dispatches by content type to POI, PDFBox, or Tess4J OCR, and emits documents.converted with the {sheets:[{rows:[]}]} JSON shape.
extraction-service runs heuristic table detection, segments by header row patterns, and classifies each table (rate grid vs. text-headed). Output: extraction.metadata.extracted.
ingestion-service runs the three-tier resolver. Stored templates match first; the AI fallback runs next; the regex heuristic runs last. Confidence below threshold raises an event for human QC.
Normalized rows land in MySQL with status DRAFT. ratesheet-service emits ratesheet.draft.ready; the QC dashboard renders a diff vs. the prior ACTIVE version.
On approval, the version transitions to ACTIVE and the previous version becomes SUPERSEDED. ratesheets.activated fires, every pricing-service node invalidates its cache, and webhooks dispatch to subscribers.
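The DRAFT → ACTIVE → SUPERSEDED lifecycle in the last two steps can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the `Ratesheet` class, its method names, and the tuple-based audit log are hypothetical, and cache invalidation and webhook dispatch are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DRAFT = "DRAFT"
    ACTIVE = "ACTIVE"
    SUPERSEDED = "SUPERSEDED"


@dataclass
class Ratesheet:
    versions: dict = field(default_factory=dict)  # version number -> Status
    audit: list = field(default_factory=list)     # (action, version, previous, actor)

    def add_draft(self, version):
        self.versions[version] = Status.DRAFT

    def active_version(self):
        for version, status in self.versions.items():
            if status is Status.ACTIVE:
                return version
        return None

    def approve(self, version, approved_by):
        """Explicit human approval: DRAFT -> ACTIVE."""
        if self.versions.get(version) is not Status.DRAFT:
            raise ValueError("only a DRAFT can be approved")
        self._activate(version, approved_by, "approve")

    def rollback(self, version, requested_by):
        """Re-activate a prior version: SUPERSEDED -> ACTIVE."""
        if self.versions.get(version) is not Status.SUPERSEDED:
            raise ValueError("only a SUPERSEDED version can be re-activated")
        self._activate(version, requested_by, "rollback")

    def _activate(self, version, actor, action):
        # At most one ACTIVE version: demote the current one first.
        previous = self.active_version()
        if previous is not None:
            self.versions[previous] = Status.SUPERSEDED
        self.versions[version] = Status.ACTIVE
        self.audit.append((action, version, previous, actor))
```

Routing approval and rollback through the same `_activate` step keeps the single-ACTIVE invariant in one place and gives every transition an audit entry.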
Hands on
Live JSON sample — copy, paste, ship.
# Webhook payload: ratesheets.activated
{
  "eventId": "01J2K8RM6X3T7Y9PNVA1F1Q3W2",
  "correlationId": "01J2K8RJYP4Z6N5GA2T9DH8C0H",
  "occurredAt": "2026-05-03T09:14:23.118Z",
  "type": "ratesheets.activated",
  "data": {
    "ratesheetId": "rs_C1c8a4ed",
    "investorCode": "INV_42",
    "version": 217,
    "previousVersion": 216,
    "activatedBy": "ingestion-service",
    "rowCount": 4128,
    "checksum": "sha256:f57a8d2c..."
  }
}
Why this matters
Every ratesheet row traces back to the source document — by SHA-256, by upload timestamp, by who approved which version.
A vendor reorders two columns? The mapping resolver adapts on the next sheet, persists the new template, and your pipeline keeps running.
Frequently asked
The QC dashboard shows confidence scores per column. Below threshold, the activation is blocked until a human resolves it. Resolutions are written back as templates so the next sheet matches without AI.
Portal automation is scoped to credentials you provide and only reaches hosts on your allowlist. agent-service never bypasses CAPTCHAs or bot detection — if a portal blocks automation, we surface the error and you handle it manually.
Yes. The ingestion API accepts pre-normalized rows directly; you skip the conversion / extraction stages and feed straight into ingestion.
MinIO uses content-addressed keys (SHA-256). A second copy of the same document hashes to the same key, dedupes server-side, and skips the rest of the pipeline.
Yes. Versions are immutable; rolling back ACTIVE → SUPERSEDED leaves the prior version available. Re-activating it is a single API call, fully audited.
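For teams sending pre-normalized rows straight to ingestion, a quick shape check against the canonical {sheets:[{name, rows:[...]}]} JSON can catch mistakes before upload. The function below is a hedged sketch: `validate_sheets` is a hypothetical helper, not part of the ingestion API, and it checks only the structure the page documents, since the row format itself is not specified here.

```python
def validate_sheets(payload):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(payload, dict) or not isinstance(payload.get("sheets"), list):
        return ["top level must be an object with a 'sheets' list"]

    problems = []
    for i, sheet in enumerate(payload["sheets"]):
        if not isinstance(sheet, dict):
            problems.append(f"sheets[{i}] must be an object")
            continue
        if not isinstance(sheet.get("name"), str):
            problems.append(f"sheets[{i}].name must be a string")
        if not isinstance(sheet.get("rows"), list):
            problems.append(f"sheets[{i}].rows must be a list")
    return problems
```

For example, `validate_sheets({"sheets": [{"name": "Conforming", "rows": []}]})` returns an empty list, while a sheet without a name produces one message per missing field.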
How we compare
Specific angles, not generic feature checklists. Each row links to a longer side-by-side; we're transparent about where competitors are the better choice.
Email-in, portal scraping, and OCR are all in the box; there is no premium ingestion module to license separately.
Learned header-mapping templates persist per investor, so onboarding a new ratesheet stops being an engineering ticket after the first run.
DRAFT → ACTIVE → SUPERSEDED versioning is exposed via the API and the UI; rollback is a single call, not a sales request.
Open ingestion pipeline with a published canonical JSON shape; there is no proprietary mapping format to learn.
Comparisons reflect each vendor's public positioning. Where a fact is unverifiable, we mark it "Depends" or "Unknown" instead of guessing.
Ready to see it on your data?
We'll spin up a sandbox for you, load your actual ratesheets, and walk you through this capability against your top scenarios.