The worst bug a financial converter can have
There's a hierarchy of badness in data extraction. Misreading a value is bad, but visible — a date that's obviously wrong, an amount that looks off — and a human or a validation rule can catch it. Far worse is the error that leaves no trace: a transaction that the converter simply never emits. The output looks clean, the columns line up, the totals are plausible, and nothing tells you that one row in ten quietly went missing.
For a bank statement, that's catastrophic, because every transaction is load-bearing. A dropped row throws off the balance, the category totals, the tax figures and the reconciliation — and you don't find out at extraction time. You find out weeks later, when a closing balance won't match or an accountant asks why the numbers don't tie out, and by then you're hunting for one missing line among thousands. We decided that silent row loss was the one failure mode our converter was not allowed to have, and this is how we got there.
Why row loss stays invisible
The reason row loss is so pernicious is that none of the usual quality signals catch it. A confidence score reports how sure the model is about the values it did read; it says nothing about the rows it never attempted. Field-accuracy benchmarks measure correctness on captured cells, so a tool can score 99% and still be dropping whole transactions. Even a human reviewer skimming the output sees a coherent, well-formed table; there's no gap, no error marker, nothing that says “a row should be here.”
That invisibility is exactly why a high headline accuracy figure is the wrong thing to trust. The number you actually need isn't “how often is a value right” but “did every row survive” — a completely different question with a completely different answer. Recognising that completeness and field-accuracy are separate problems, with separate failure modes, was the first step; the second was building a way to prove completeness rather than assume it.
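To make the distinction concrete, here is a minimal sketch of the two measurements computed side by side. The function names, the row ids and the sample data are all hypothetical, chosen for illustration; this is not the converter's actual scoring code.

```python
from typing import Dict

Row = Dict[str, str]


def field_accuracy(truth: Dict[str, Row], extracted: Dict[str, Row]) -> float:
    """Share of correct cells, counted only over rows the extractor emitted."""
    correct = total = 0
    for row_id, row in extracted.items():
        expected = truth.get(row_id, {})
        for column, value in row.items():
            total += 1
            correct += int(expected.get(column) == value)
    return correct / total if total else 0.0


def row_completeness(truth: Dict[str, Row], extracted: Dict[str, Row]) -> float:
    """Share of ground-truth rows that appear in the output at all."""
    if not truth:
        return 1.0
    return sum(1 for row_id in truth if row_id in extracted) / len(truth)


# Hypothetical ground truth: ten transactions keyed by an illustrative id.
truth = {f"t{i}": {"date": f"2024-01-{i:02d}", "amount": f"{i}.00"} for i in range(1, 11)}
# Hypothetical output: every emitted cell is right, but four rows are missing.
extracted = {k: v for k, v in truth.items() if k not in {"t3", "t5", "t7", "t9"}}

print(field_accuracy(truth, extracted))    # 1.0
print(row_completeness(truth, extracted))  # 0.6
```

The first metric is perfect because it never looks at rows that were not emitted, which is the blind spot described above.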
The old approach, and where it broke
Our original digital-PDF extractor did something that sounds reasonable: it split each page's text into table regions by looking for structural breaks — a row with very few items (a subtotal, a section marker, a blank) was treated as a boundary between tables. Each resulting region was then expected to have its own header row, which the extractor used to name the columns. Downstream, the pipeline picked the single “best” table per page by a confidence heuristic and used that.
On a clean, simple statement — one tidy table, one header, no page breaks mid-table — this works fine. The trouble is that real financial PDFs are almost never that tidy. A subtotal line in the middle of a statement isn't a table boundary; it's part of the table. A section that continues onto the next page doesn't repeat its header. A document can carry several related blocks that together form one logical ledger. Every one of those normal features looked, to the old logic, like a reason to split, demand a fresh header, and — when there wasn't one — throw the region away.
How the rows actually vanished
Three compounding behaviours did the damage. First, splitting on any low-item row chopped a single continuous statement into many fragments at every subtotal and marker. Second, the header-per-region requirement meant any fragment without its own header row — every continuation block, every headerless section — was treated as unparseable and dropped. Third, taking only the single best table per page meant that even when multiple valid blocks were extracted, the others were discarded downstream.
There was a fourth, subtler failure: the header detector would sometimes mistake a data row for a header. A row whose first cell happened to look label-like could be promoted to a column header, which both lost that transaction and mislabelled the column for everything beneath it. So rows didn't just fall off the ends of fragments; some were consumed as structure. Put together, on a statement with subtotals, page breaks and a couple of sections, the extractor could confidently emit a clean-looking table that contained a fraction of the real transactions.
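The first three behaviours can be reproduced in a toy sketch. This is a reconstruction for illustration, not the original extractor: the header vocabulary is invented, and "largest fragment" stands in for the confidence heuristic that picked the single best table.

```python
from typing import List

Row = List[str]
HEADER_WORDS = {"date", "description", "amount"}  # hypothetical header vocabulary


def split_on_sparse_rows(rows: List[Row], min_items: int = 2) -> List[List[Row]]:
    """Behaviour 1: any row with fewer than min_items cells ends the current fragment."""
    fragments: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if len(row) < min_items:
            if current:
                fragments.append(current)
            current = []
        else:
            current.append(row)
    if current:
        fragments.append(current)
    return fragments


def has_own_header(fragment: List[Row]) -> bool:
    """Behaviour 2: a fragment is only parseable if its first row is a header."""
    return all(cell.lower() in HEADER_WORDS for cell in fragment[0])


def fragment_and_drop(rows: List[Row]) -> List[Row]:
    with_header = [f for f in split_on_sparse_rows(rows) if has_own_header(f)]
    if not with_header:
        return []
    # Behaviour 3: keep a single "best" table (here simply the largest one).
    best = max(with_header, key=len)
    return best[1:]  # data rows only


page = [
    ["Date", "Description", "Amount"],
    ["2024-01-02", "Coffee", "-3.50"],
    ["2024-01-03", "Rent", "-900.00"],
    ["Subtotal"],  # sparse row: treated as a table boundary
    ["2024-01-04", "Salary", "2500.00"],
    ["2024-01-05", "Groceries", "-82.10"],
    ["2024-01-06", "Train", "-19.00"],
]

kept = fragment_and_drop(page)
print(len(kept), "of 5 transactions survive")  # 2 of 5 transactions survive
```

Nothing in the output signals that the three rows after the subtotal ever existed.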
The stress test that exposed it
We built a deliberately hostile test set: ten documents chosen to break naive extraction. An accounts-payable ledger, a bank reconciliation report, a multi-currency business statement, a corporate expense report, a premium credit-card statement, a cross-border payments register, an international invoice register, a marketplace settlement report, a neobank statement and a travel-expense claim — each with the subtotals, page breaks, mixed sections and unusual layouts that real finance teams actually deal with.
The results were grim and, crucially, measured. On the hardest files (the accounts-payable, cross-border, travel and multi-currency documents) the old extractor lost between 50% and 88% of the rows. The merged output across all ten files came to a fraction of the true transaction count, and, to add insult to injury, it sometimes invented a bogus grand-total row by summing across currencies that should never have been added together. A converter that drops most of a statement and fabricates a total is worse than no converter, because it looks like it worked.
The fix: a document-level model
The rewrite inverted the logic. Instead of splitting a page into fragments and hoping each had a header, the new extractor builds a model of the document. It infers column layouts from the x-geometry of the text — where the columns physically sit on the page — and treats a layout as stable across the whole statement rather than re-deriving it per fragment. Every data row is then streamed into the layout it belongs to.
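As a rough illustration of inferring a layout from x-geometry, the sketch below clusters the left x-coordinates of every word in the document into column edges, then places each row into that fixed layout. The `Word` tuple, the tolerance value and both function names are assumptions made for the example; a real extractor would also have to handle right-aligned amounts and multiple layouts per document.

```python
from typing import List, Tuple

Word = Tuple[float, str]  # (left x-coordinate, text); a hypothetical input shape


def infer_column_edges(rows: List[List[Word]], tolerance: float = 8.0) -> List[float]:
    """Cluster x-coordinates from the whole document into one stable set of column edges."""
    xs = sorted(x for row in rows for x, _ in row)
    clusters: List[List[float]] = []
    for x in xs:
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]


def place_row(row: List[Word], edges: List[float]) -> List[str]:
    """Stream one row into the layout by assigning each word to its nearest column edge."""
    cells = [""] * len(edges)
    for x, text in row:
        idx = min(range(len(edges)), key=lambda i: abs(edges[i] - x))
        cells[idx] = (cells[idx] + " " + text).strip()
    return cells


rows = [
    [(40.0, "Date"), (120.0, "Description"), (400.0, "Amount")],
    [(41.0, "2024-01-02"), (121.0, "Coffee"), (402.0, "-3.50")],
    [(120.0, "Subtotal"), (401.0, "-3.50")],  # sparse row, no split needed
    [(40.5, "2024-01-03"), (119.0, "Rent"), (399.0, "-900.00")],
]

edges = infer_column_edges(rows)
print(place_row(rows[2], edges))  # ['', 'Subtotal', '-3.50']
```

Because the layout comes from geometry rather than from a nearby header, the row after the subtotal lands in the same columns as the row before it.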
Two rules do most of the work. A strict header score means a row containing any amount, date or identifier value can never be classified as a header, which directly fixes the “data row eaten as a header” bug. And the extractor skips only what it can prove is non-data: repeated headers and footers. Everything else (continuation rows after a page break, sections without their own header, subtotal-adjacent lines) flows into the table instead of triggering a split. The output is one lossless table per file.
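A strict header rule of this kind might look like the sketch below. The regular expressions are deliberately simple stand-ins and the function names are hypothetical; the point is only the veto, where a single value-like cell disqualifies the whole row from being a header.

```python
import re
from typing import List

# Illustrative patterns only; real statements need far broader coverage.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")
AMOUNT_RE = re.compile(r"^[+-]?\(?[$€£]?\d[\d,]*(\.\d+)?\)?$")
ID_RE = re.compile(r"^(?=.*\d)[A-Z0-9-]{6,}$")


def looks_like_value(cell: str) -> bool:
    """True if the cell reads as a date, an amount or an identifier."""
    text = cell.strip()
    return bool(DATE_RE.match(text) or AMOUNT_RE.match(text) or ID_RE.match(text))


def can_be_header(row: List[str]) -> bool:
    """A row is a header candidate only if no non-empty cell looks like data."""
    cells = [c for c in row if c.strip()]
    return bool(cells) and not any(looks_like_value(c) for c in cells)


print(can_be_header(["Date", "Description", "Amount"]))  # True
print(can_be_header(["Fees", "2024-01-05", "-12.00"]))   # False
print(can_be_header(["Transfer", "", "INV-204817"]))     # False
```

The second and third rows start with a label-like cell, which is exactly the shape the old detector could promote to a header.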
Where a document genuinely contains several related blocks, two passes reconcile them: aligned tables with matching column shapes are merged, and related tables are unioned by canonical key — the same column-matching logic that powers consolidation, so “Date”, “Datum” and “Transaction Date” collapse to one column and nothing is stranded in an orphan block. The result is structural: rows can't fall through the cracks because there are no cracks to fall through.
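The union by canonical key can be sketched as follows. Only the three date spellings come from the example above; the alias table, the function names and the sample rows are otherwise assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical alias table; a real one would cover many more headers.
CANONICAL = {
    "date": "date",
    "datum": "date",
    "transaction date": "date",
}


def canonical_key(header: str) -> str:
    """Normalise case and whitespace, then map known aliases to one key."""
    key = " ".join(header.lower().split())
    return CANONICAL.get(key, key)


def union_tables(tables: List[List[Dict[str, str]]]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Stack every row from every table; columns are the union of canonical keys."""
    columns: List[str] = []
    renamed_rows: List[Dict[str, str]] = []
    for table in tables:
        for row in table:
            renamed = {canonical_key(k): v for k, v in row.items()}
            for column in renamed:
                if column not in columns:
                    columns.append(column)
            renamed_rows.append(renamed)
    # Fill gaps with empty strings so no row is dropped for lacking a column.
    return columns, [{c: r.get(c, "") for c in columns} for r in renamed_rows]


tables = [
    [{"Date": "2024-01-02", "Amount": "-3.50"}],
    [{"Datum": "03.01.2024", "Amount": "-900.00"}],
    [{"Transaction Date": "2024-01-04", "Amount": "2500.00"}],
]

columns, merged = union_tables(tables)
print(columns)      # ['date', 'amount']
print(len(merged))  # 3
```

Rows are only ever appended, so the row count of the result equals the sum of the inputs.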
Surviving the merge, not just the extraction
Fixing extraction is only half the battle when people consolidate dozens of statements at once. The consolidation engine itself was already lossless — it stacks rows verbatim and matches columns deterministically — but it could only be as complete as what extraction handed it. With the document-level model feeding it full tables, the whole pipeline became trustworthy end to end.
We also fixed a downstream insult to accuracy: a spurious grand-total row that appeared when a sheet mixed currencies. Totals are now only computed for genuinely numeric, same-currency columns, and suppressed where a sheet mixes currencies — so the merged workbook reflects the data, not an arithmetic artefact. The principle throughout: the combining step must never add, remove or invent a number, only carry through exactly what was extracted and approved.
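The total-suppression rule is small enough to sketch directly. The column names `amount` and `currency` and the function name are hypothetical; the behaviour shown is the one described above, where a total exists only when every row is numeric and in the same currency.

```python
from decimal import Decimal, InvalidOperation
from typing import Dict, List, Optional


def column_total(
    rows: List[Dict[str, str]],
    amount_key: str = "amount",
    currency_key: str = "currency",
) -> Optional[Decimal]:
    """Return a total only for a numeric, single-currency column; otherwise None."""
    currencies = {row[currency_key] for row in rows}
    if len(currencies) != 1:
        return None  # empty or mixed-currency sheet: no total is emitted
    try:
        return sum((Decimal(row[amount_key]) for row in rows), Decimal("0"))
    except InvalidOperation:
        return None  # a non-numeric cell means the column is not summable


same = [
    {"amount": "10.00", "currency": "EUR"},
    {"amount": "5.50", "currency": "EUR"},
]
mixed = same + [{"amount": "7.00", "currency": "USD"}]

print(column_total(same))   # 15.50
print(column_total(mixed))  # None
```

Returning `None` rather than a number keeps the merge step from inventing a figure that was never in any source document.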
The result: 100% fidelity on the hard set
With the document-level model in place, the same ten-document stress set went from a lossy subset to every single row present: ten files, 100% row fidelity, no missing transactions. The merged transaction count rose from 843 to 1,155 rows, and the fabricated cross-currency total was gone. Every file's extracted row count matched its ground truth, verified automatically rather than by eye.
The number that matters there isn't 1,155; it's the per-file checkmarks. “Average accuracy improved” would let a tool drop everything in one file and over-count another and still look fine. The harness asserts that each document survives intact, because that's what a user actually experiences: they don't convert an average, they convert their statement, and their statement has to be complete.
Old fragment approach vs document-level model
It's worth laying the two designs side by side, because the contrast explains why the new one is robust rather than just tuned. The old extractor made local decisions — split here, demand a header there — that each seemed reasonable but compounded into lost data. The new one makes a single global decision about the document's structure and then never has to throw anything away.
| Aspect | Old: fragment & drop | New: document-level |
|---|---|---|
| Unit of work | A region of one page | The whole document |
| Column layout | Re-derived per fragment from a header | Inferred once from x-geometry, held stable |
| Subtotals & markers | Treated as table boundaries | Treated as part of the table |
| Continuation across pages | Orphaned without a header | Streamed into the same layout |
| Data row that looks label-like | Could be eaten as a header | Strict header score forbids it |
| Multiple blocks per page | Only the 'best' one kept | Unioned by canonical key |
| Outcome | Silent row loss | Complete output, balance-checked |
The throughline is that every row in the right-hand column removes a way for data to disappear. You don't fix silent row loss by being more careful inside a fragile design; you fix it by choosing a design where the loss can't happen, then adding a check to catch the rare exception. That's the move from “usually accurate” to “provably complete.”
Proving it, not just claiming it
Hitting 100% on a test set is reassuring, but a user can't see our test set; they need to know their statement is complete. That's what the balance check provides on every statement: opening balance plus the sum of the transactions must equal the closing balance. If extraction ever did drop a row or misread an amount, the arithmetic would break and the discrepancy would be flagged for inspection.
This turns completeness from a promise into a property you can verify per document. It's the difference between “our extractor is accurate” and “this statement provably reconciles”, and for financial data only the latter is good enough. The document-level model makes row loss extremely unlikely; the balance check makes the rare remaining case visible rather than silent, which is the whole point. The deeper rationale is in the bank statement accuracy write-up.
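The reconciliation itself is plain arithmetic. A minimal sketch, assuming amounts arrive as signed decimal strings (the function name and sample figures are illustrative):

```python
from decimal import Decimal
from typing import Iterable


def balance_discrepancy(opening: str, closing: str, amounts: Iterable[str]) -> Decimal:
    """Opening balance plus all transactions, minus the closing balance; zero means it reconciles."""
    total = sum((Decimal(a) for a in amounts), Decimal("0"))
    return Decimal(opening) + total - Decimal(closing)


amounts = ["-3.50", "-900.00", "2500.00"]

# Complete extraction: 1000.00 - 3.50 - 900.00 + 2500.00 = 2596.50
print(balance_discrepancy("1000.00", "2596.50", amounts))  # 0.00

# One row silently dropped: the gap equals the missing amount's magnitude.
dropped = [a for a in amounts if a != "-900.00"]
print(balance_discrepancy("1000.00", "2596.50", dropped))  # 900.00
```

`Decimal` is used instead of floats so that cent-level values compare exactly. Note that a row with a zero amount, or two errors that cancel out, would still pass this check, which is one reason to keep an explicit row-count assertion alongside it.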
Keeping it fixed: the regression harness
Fixing a bug once is easy; keeping it fixed across months of changes is the hard part, and it's where most quietly-broken extractors lose ground. A refactor to handle a new bank's layout, a tweak to OCR pre-processing, a dependency bump — any of these can reintroduce row loss, and because the failure is silent, a casual test wouldn't catch it. So the fix isn't just the document-level model; it's a harness that makes a regression impossible to merge unnoticed.
The harness runs the real production code path — the same extract, consolidate and export functions the app calls, not a mock — against the ten-document stress set, every one with known ground truth. For each file it asserts three things: the exact row count matches, the balance reconciles, and no fabricated rows (like the old cross-currency grand total) have crept in. It runs over the merged output too, so the end-to-end count has to land on the expected figure, not just each file in isolation.
Asserting on per-file counts rather than an average is deliberate, and it's the detail that makes the harness meaningful. An aggregate check would let a change drop every row in one document and over-count another and still report “100% on average.” By pinning each document to its own ground truth, a regression in any single file fails the build, which is exactly the granularity a user experiences, because they convert their statement, not an average of ten.
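The difference between the two styles of assertion shows up in a few lines. The file names and counts below are invented for the example, and the helper names are hypothetical rather than taken from the real harness.

```python
from typing import Dict, List


def average_fidelity(expected: Dict[str, int], actual: Dict[str, int]) -> float:
    """Aggregate view: mean of extracted/expected row ratios across files."""
    return sum(actual.get(name, 0) / expected[name] for name in expected) / len(expected)


def per_file_failures(expected: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Per-file view: every file whose row count differs from its ground truth."""
    return [name for name, count in expected.items() if actual.get(name) != count]


# Hypothetical ground truth and a hypothetical regression:
# one file loses every row, the other double-counts.
expected = {"ledger.pdf": 100, "statement.pdf": 100}
actual = {"ledger.pdf": 0, "statement.pdf": 200}

print(average_fidelity(expected, actual))   # 1.0
print(per_file_failures(expected, actual))  # ['ledger.pdf', 'statement.pdf']

# A harness in this style would fail the build on any non-empty list:
# assert per_file_failures(expected, actual) == []
```

The average reports a perfect score for an output in which neither file is correct, while the per-file check names both.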
The same philosophy extends to the merge path, with a separate end-to-end harness that takes structured documents through consolidation and export and checks the resulting workbook. Together they mean the “843 to 1,155 rows” result isn't a one-off measurement from the week we shipped the fix — it's a property the test suite re-proves on every change, which is the only way a completeness guarantee stays true over time rather than decaying quietly after launch.
There's a broader lesson in choosing what to assert. It would have been tempting to measure something easy — “does the extractor produce a table?” — and call it tested. But that question can't fail in the way that actually hurts users, so it gives false confidence. The assertions worth writing are the ones aimed squarely at your worst failure mode: for a financial converter that means row counts and balance reconciliation, not table-shaped output. A test suite is only as honest as the thing it refuses to let regress, and the thing we refuse to let regress is a single missing transaction.
The human safety net: Merge Review
Automated checks should reduce human effort, not replace human judgement. So consolidating now opens Merge Review: an editable grid of the combined data with a quality score and every questionable cell highlighted, plus an issues panel that jumps you straight to each flagged date or amount. You fix anything in place and only export once you're satisfied, with every row keeping its source-file reference.
The design intent is a clean division of labour: the document-level model guarantees the rows are there, the balance check proves it, the confidence scoring points to anything uncertain, and the human spends thirty seconds on the genuine exceptions instead of re-reading a thousand rows. That's how accuracy scales — not by asking people to trust a black box, and not by asking them to check everything, but by surfacing exactly the few things worth a look.
Lessons we'd generalise
A few principles came out of this that apply well beyond our codebase. First, treat completeness as a first-class metric, separate from field accuracy — if you only measure value correctness, you're blind to your worst failure mode. Second, model the document, not the fragment: splitting eagerly and requiring local structure is what orphans data; inferring global structure and streaming into it is far more robust to the messiness of real PDFs.
Third, build the check that makes the failure visible. We can't promise extraction will never err, but a balance reconciliation converts a silent error into a flagged one, which is the difference that matters in practice. Fourth, test on hostile inputs and assert per-item, not on averages — the document that breaks your extractor is the one your user will upload, and an aggregate score will hide it. None of these are exotic; they're just the discipline financial data demands.
The one-line takeaway: don't ask whether your extractor is accurate on average — ask whether it can prove, document by document, that no row was left behind.
Three kinds of accuracy, and which one row loss breaks
Part of why this bug hid for as long as it did is that “accuracy” is really three different things, and the industry tends to quote only the first. Separating them is what let us see, and then close, the gap that mattered.
| Kind of accuracy | The question it answers | How FlowParse protects it |
|---|---|---|
| Field accuracy | Is each captured value correct? | AI extraction + confidence scoring + editable review |
| Row completeness | Did every transaction survive? | Document-level model — the focus of this work |
| Structural integrity | Do the numbers reconcile? | Balance check: opening + transactions = closing |
A headline “99% accurate” almost always refers to the first row of that table and quietly ignores the second and third — which is exactly where money goes missing. Treating all three as separate, measurable properties, each with its own safeguard, is the substance behind the accuracy claims; the document-level model is simply the piece that closed the middle row.
Why we publish the failure, not just the fix
Writing up a bug this serious is an unusual thing for a product to do — the comfortable path is to ship the fix quietly and only ever talk about the happy result. We're publishing the failure because, in this category, the failure is the most useful thing we can tell you. Almost every PDF-to-Excel and bank statement converter shares some version of the fragment-and-drop design, and almost none of them surface row loss to the user. If reading this makes you check your current tool on a hard statement, that's a good outcome whatever tool you land on.
It also reflects how we think trust should be earned with financial data: not with a marketing number, but with a falsifiable claim and a way for you to test it. We told you the failure mode, the measurement, the before-and-after (843 to 1,155 rows on the stress set), and the exact check — balance reconciliation — that would catch a regression. That's a claim you can hold us to, document by document, which is the only kind worth making about money.
If there's one thing to take from all of this, it's a question to ask any extraction tool — ours included — before you trust it with your books: not “how accurate are you?” but “how would I know if you dropped a row?” A tool that can't answer the second question is asking for blind faith. The honest answer is a per-document completeness check you can see, and that is precisely what we built this work to be able to give.
What it means if you convert statements
For anyone using FlowParse, the practical upshot is simple: a statement you convert today carries every transaction, the balance check confirms it reconciles, and the consolidation of a year across accounts is complete rather than approximately complete. The places this matters most — a practice closing client books, a lender reading applicant statements, a finance team consolidating across subsidiaries — are exactly the places where a single dropped row is most expensive.
And you don't have to take our word for it. Convert your own hardest statement (the scanned one, the multi-column one, the one that changed format mid-year) straight to Excel and check two things: did every transaction appear, and did the balance reconcile? That fifteen-second test on your worst document is the honest measure of a converter, and it's the one we built this work to pass.
Test it on your hardest statement
Convert a real statement free — no signup — and check it yourself: every row present, and the balance reconciled.
