Overview: data, not documents
A pay stub is built for a person to read, not for a computer to process. It carries gross pay, a list of earnings, a column of deductions, several taxes, net pay, and a year-to-date figure beside almost every line — and it arrives as a PDF, which you can't total, sort or compare. The goal of extraction is to flip that: to turn the document into structured data — one field per number, labelled and consistent — that you can add up, verify or hand to another system.
That's useful in three broad situations: an individual organising or proving their own pay, an employer or bookkeeper reconciling payroll into the books, and an organisation verifying someone's income at volume. This guide walks the whole method — from a single stub to a hundred — and the fastest entry points are pay stub to Excel (US) and payslip to Excel (UK/AU). For OCR and the API specifically, see paystub OCR. Throughout, the principle is the same: read the document by meaning, validate it with its own arithmetic, and keep the year-to-date figures — so the data you end up with is something you can total, reconcile and act on, not just a faster way to retype.
Before you start
- The pay stubs themselves — digital PDFs are fastest, but scans and phone photos work via OCR.
- A clear idea of which fields you need (gross, net, specific deductions or taxes, YTD) so you know what to check.
- A spreadsheet tool for totals and pivots, or an API key if you're feeding the data into a system.
- For income verification, the matching bank statements too, so you can cross-check pay against deposits.
There's nothing to install and no per-provider template to configure. You can start in the browser at pay stub to Excel, or, for systems, go straight to the document extraction API.
The anatomy of a pay stub
Every pay stub, whatever the provider, is assembled from the same building blocks. Knowing them makes extraction predictable: you know what should come out, and what to check. The stub is essentially four groups of numbers — earnings, deductions, taxes and totals — wrapped around identity fields, with a current-period and a year-to-date amount for most lines.
| Block | Examples | What's captured |
|---|---|---|
| Identity | Employee, employer, pay period, pay date | Keys each row; used to de-duplicate |
| Earnings | Regular, overtime, bonus, commission, PTO | Hours, rate, current and YTD amount |
| Pre-tax deductions | 401(k), health, dental, HSA, FSA | Per item, current and YTD |
| Taxes withheld | Federal, state, Social Security, Medicare | Per tax, current and YTD |
| Post-tax deductions | Garnishments, Roth, union dues | Per item, current and YTD |
| Totals | Gross pay, net pay, taxable wages | Current period and year-to-date |
The arithmetic that ties them together — gross minus deductions and taxes equals net — is what makes a pay stub self-checking, and it's the property a good extractor uses to validate its own output.
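The self-check described above can be sketched in a few lines. This is a minimal illustration, not the extractor's own code: the function name `reconciles`, the one-cent tolerance and all figures are assumptions chosen for the example.

```python
from decimal import Decimal

def reconciles(gross, deductions, taxes, net, tolerance=Decimal("0.01")):
    """Return True when gross minus all deductions and taxes equals net."""
    expected_net = gross - sum(deductions, Decimal("0")) - sum(taxes, Decimal("0"))
    return abs(expected_net - net) <= tolerance

# Illustrative figures only, not taken from a real stub.
gross = Decimal("2500.00")
deductions = [Decimal("125.00"), Decimal("80.00")]      # e.g. 401(k), health
taxes = [Decimal("300.00"), Decimal("100.00"),          # e.g. federal, state
         Decimal("155.00"), Decimal("36.25")]           # e.g. Social Security, Medicare
net = Decimal("1703.75")

stub_is_consistent = reconciles(gross, deductions, taxes, net)
```

Using `Decimal` rather than floats keeps currency arithmetic exact, so a mismatch points to a misread figure rather than a rounding artefact.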
How the extraction actually works
It helps to know what happens under the hood, because it's why this beats both manual entry and old template-based parsers. For a digital PDF, the engine reads the text and layout directly. For a scan or a photo, optical character recognition (OCR) converts the pixels to text first. Either way, the raw text alone isn't the answer — a page of numbers means nothing until you know which number is which.
That's the job of the AI structuring layer. It identifies each value by meaning rather than position: this figure is gross pay, this one is a 401(k) deduction, this is federal tax, this is the year-to-date column. Because it reads by meaning, it doesn't need a template for each provider: an ADP stub and a stub from an in-house system both resolve to the same labelled fields. For the deeper contrast between plain OCR and structured AI extraction, see OCR vs AI document extraction.
Finally, the output is validated and scored: gross minus deductions and taxes is checked against net, and every field gets a confidence value. The result is structured, checked, editable data rather than a flat dump of text, which is what makes everything after this fast and trustworthy.
Step-by-step: a pay stub → structured data
Step 1 — Upload the stub
Drop the pay stub PDF, scan or photo into the converter. Multiple stubs can go in together for a batch.
Step 2 — Read and structure
OCR (for images) plus AI structuring map earnings, deductions, taxes and totals to labelled fields by meaning — no template, no setup.
Step 3 — Review and validate
Check the editable preview; the gross-minus-deductions-equals-net check runs and low-confidence fields are flagged for a quick look.
Step 4 — Export the data
Download Excel or CSV, push to QuickBooks, or receive structured JSON over the API for a system to ingest automatically.
The whole loop takes seconds for one stub. The time saving compounds with volume; see "Extracting many stubs at once" below.
Field reference: what comes out
Extraction returns a consistent set of labelled fields regardless of which provider made the stub. The table below is the core schema; the current-period and year-to-date amounts are both returned wherever the stub prints them, and each field carries a confidence score.
| Field | Meaning | Typical use |
|---|---|---|
| gross_pay | Total earnings before deductions | Income totals, annualising |
| earnings[] | Each earnings line with hours and rate | Splitting base vs overtime/bonus |
| deductions[] | Each pre/post-tax deduction, signed | 401k tracking, benefit audits |
| taxes[] | Each tax withheld (federal/state/FICA) | Withholding checks |
| net_pay | Take-home after all deductions | Cash-flow, verification |
| ytd_* | Year-to-date figure per line | Annualising, cross-period checks |
| employee / employer | Identity fields | Keying and de-duplication |
| pay_period / pay_date | The period and date paid | Sorting, frequency, gaps |
Over the API these come back as JSON; in the browser they become columns in an Excel or CSV sheet. Either way the names are stable, so downstream code or formulas don't change when the stub's layout does.
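As a rough sketch of what consuming those fields might look like, the snippet below parses a hypothetical payload. The top-level names follow the table above, but the nesting, the `label`, `value` and `confidence` keys, and every figure are assumptions for illustration, not the documented response format.

```python
import json

# Hypothetical payload: the exact shape is an assumption, not the real API response.
payload = """
{
  "employee": "A. Example",
  "employer": "Example Co",
  "pay_date": "2024-06-14",
  "gross_pay": {"value": 2500.00, "confidence": 0.99},
  "net_pay": {"value": 1703.75, "confidence": 0.98},
  "ytd_gross_pay": {"value": 30000.00, "confidence": 0.97},
  "deductions": [
    {"label": "401(k)", "value": -125.00, "confidence": 0.99},
    {"label": "Health", "value": -80.00, "confidence": 0.95}
  ]
}
"""

stub = json.loads(payload)

# Signed deduction lines can simply be summed.
total_deductions = sum(item["value"] for item in stub["deductions"])

# Flatten the fields a spreadsheet row would need.
row = {
    "employee": stub["employee"],
    "pay_date": stub["pay_date"],
    "gross_pay": stub["gross_pay"]["value"],
    "net_pay": stub["net_pay"]["value"],
    "total_deductions": total_deductions,
}
```

Because the code reads fields by name, the same lines would work on a stub from any provider as long as the field names stay stable.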
Year-to-date: the second source of truth
The year-to-date column is more useful than it looks. Beyond the current period, every stub carries the running total of pay, each tax and each deduction for the year so far — and capturing it gives you two things. First, you can annualise income from a single mid-year stub (YTD gross divided by the number of periods elapsed, times periods per year), which is exactly what lenders do. Second, across a sequence of stubs the YTD figures must increase consistently, which is a powerful cross-check: a YTD that jumps or stalls reveals a missing or duplicated stub.
Because FlowParse captures both the current and YTD amount per line, you get this for free. When you process a run of stubs, the year-to-date progression is a built-in audit of completeness — the payroll equivalent of reconciling a bank statement's running balance.
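Both uses of the year-to-date column reduce to simple arithmetic. The sketch below is illustrative only: the function names `annualise` and `ytd_breaks` and the sample figures are assumptions, and it presumes all stubs fall in the same tax year and are sorted by pay date.

```python
from decimal import Decimal

def annualise(ytd_gross, periods_elapsed, periods_per_year):
    """YTD gross divided by periods elapsed, times periods per year."""
    return ytd_gross / periods_elapsed * periods_per_year

def ytd_breaks(stubs):
    """Return the positions where YTD is not the previous YTD plus current pay.

    `stubs` is a list of (current_gross, ytd_gross) pairs in pay-date order.
    A break suggests a missing or duplicated stub just before that position.
    """
    breaks = []
    for i in range(1, len(stubs)):
        previous_ytd = stubs[i - 1][1]
        current, ytd = stubs[i]
        if ytd != previous_ytd + current:
            breaks.append(i)
    return breaks

# Illustrative: twelve bi-weekly periods elapsed out of twenty-six.
estimated_annual = annualise(Decimal("30000.00"), 12, 26)

# Illustrative run in which one stub is missing before the last entry.
run = [
    (Decimal("2500.00"), Decimal("2500.00")),
    (Decimal("2500.00"), Decimal("5000.00")),
    (Decimal("2500.00"), Decimal("10000.00")),
]
suspect_positions = ytd_breaks(run)
```

A real run would also need to allow for the YTD resetting at the start of a new year, which this sketch deliberately leaves out.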
Any provider, any layout
Payroll providers each design their stubs differently — ADP, Gusto, Paychex, QuickBooks Payroll, Rippling, Workday and dozens of in-house systems all arrange earnings, deductions and taxes in their own way. A template-based parser has to be taught each one and breaks the moment a provider tweaks its design or you meet a format it has never seen.
Reading by meaning sidesteps that entirely. The extractor locates gross, the earnings lines, each deduction and tax, and the net and YTD totals wherever they sit, so every provider's stub yields the same clean fields. That layout-independence is what makes it usable across a real population of documents: a lender or a bookkeeper receives stubs from every employer imaginable, and they all need to come out identical.
UK, Irish and Australian payslips
Outside the US the document is usually called a payslip, and the fields differ: the extractor reads the local taxes and contributions in place of the US ones. UK payslips carry PAYE income tax, National Insurance, pension, a tax code and an NI category; Irish payslips substitute PRSI and USC; Australian payslips carry PAYG withholding and superannuation, plus the employer ABN. All are captured with their year-to-date figures.
| Region | Key fields | Term |
|---|---|---|
| United States | Federal, state, Social Security, Medicare; 401(k) | Pay stub |
| United Kingdom | PAYE, National Insurance, pension, tax code | Payslip |
| Ireland | PAYE, PRSI, USC, pension | Payslip |
| Australia | PAYG, superannuation, ordinary/overtime hours | Payslip |
The UK and Australian flows have their own entry point at payslip to Excel, and UK readers folding PAYE income into a return should see bank statements for Self Assessment.
Scanned and photographed stubs
Real stubs don't always arrive as clean PDFs. Someone photographs a printed stub on their phone, scans a stack on an office machine, or forwards a screenshot — and the quality varies. The OCR stage is built for exactly that: it copes with skew, shadows, moderate blur and low resolution, recovering text that a template parser would miss outright.
Where a read is genuinely uncertain — a creased line, a faint photocopy — the field is flagged with a low confidence score rather than guessed. That distinction is the whole game: it's the difference between OCR you can build an automated process around and OCR that quietly introduces errors. The dedicated paystub OCR page covers this in more depth.
Extracting many stubs at once
One stub is quick; the real saving is a stack. A year of your own pay is 24 or 26 stubs; an employer reconciling payroll has every employee's stub for every period; a lender processes stubs from many applicants a day. In the browser, upload up to 100 and consolidate them into one workbook, each row tagged to its pay period and employee, so you can pivot by person, by month or by field.
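For readers who consolidate outside the browser, the idea of one row per stub, keyed by its identity fields, can be sketched as follows. The function name `consolidate`, the column set and the sample rows are hypothetical; the identity key mirrors the de-duplication role of the identity block described earlier.

```python
import csv
import io

COLUMNS = ["employee", "employer", "pay_date", "gross_pay", "net_pay"]

def consolidate(stubs):
    """Drop duplicate stubs and return CSV text sorted by employee and pay date."""
    unique = {}
    for stub in stubs:
        key = (stub["employee"], stub["employer"], stub["pay_date"])
        unique.setdefault(key, stub)  # first occurrence wins, repeats are dropped
    rows = sorted(unique.values(), key=lambda s: (s["employee"], s["pay_date"]))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Illustrative rows: the first stub was uploaded twice.
sample = [
    {"employee": "A. Example", "employer": "Example Co", "pay_date": "2024-06-14",
     "gross_pay": "2500.00", "net_pay": "1703.75"},
    {"employee": "A. Example", "employer": "Example Co", "pay_date": "2024-06-14",
     "gross_pay": "2500.00", "net_pay": "1703.75"},
    {"employee": "A. Example", "employer": "Example Co", "pay_date": "2024-06-28",
     "gross_pay": "2500.00", "net_pay": "1703.75"},
]
csv_text = consolidate(sample)
```

ISO-formatted dates sort correctly as plain strings, which is why the sketch keeps `pay_date` as text.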
For higher or continuous volume, the same extraction runs over the document extraction API: POST each stub and receive structured JSON back, with no human in the loop for the clean ones. An income-verification or payroll platform embeds it so a stub becomes data the moment it's uploaded.
Validation and accuracy
Extraction is only useful if you can trust it, so trust is built into the output rather than assumed. Three checks run automatically: every field gets a confidence score; the stub's arithmetic — gross minus total deductions and taxes equals net — is verified; and across a sequence of stubs the year-to-date figures are cross-checked for consistent progression. Anything that doesn't reconcile is surfaced, not hidden.
FlowParse reaches around 98% field-level accuracy on standard stubs. For interactive use you confirm flagged fields in the editable preview; for automated use you set a confidence threshold and route low-confidence results to a human queue while clean ones pass straight through. Either way you decide the bar, rather than trusting a black box.
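The threshold routing described above might look like the sketch below. The `route` function, the 0.90 threshold, the 0 to 1 confidence scale and the field layout are all assumptions for illustration; the right bar depends on how costly a bad read is in your process.

```python
def route(fields, threshold=0.90):
    """Return ("auto", []) when every field clears the threshold,
    otherwise ("review", [names of the low-confidence fields])."""
    flagged = sorted(
        name for name, field in fields.items() if field["confidence"] < threshold
    )
    return ("review", flagged) if flagged else ("auto", [])

# Illustrative results: one clean read, one with a faint net-pay figure.
clean = {
    "gross_pay": {"value": 2500.00, "confidence": 0.99},
    "net_pay": {"value": 1703.75, "confidence": 0.98},
}
faint = {
    "gross_pay": {"value": 2500.00, "confidence": 0.99},
    "net_pay": {"value": 1703.75, "confidence": 0.62},
}

clean_decision = route(clean)
faint_decision = route(faint)
```

Returning the names of the flagged fields, not just a pass or fail, lets a reviewer go straight to the uncertain values.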
Exporting the data
The extracted data comes out in whatever shape the next step needs. For people, that's a clean Excel or CSV sheet of pay, deductions, taxes and net that you can total and chart, or a push into QuickBooks. For systems, it's structured JSON over the API — labelled fields with confidence scores, ready to store or score.
Because the schema is consistent across providers, a downstream payroll, HR or lending system ingests an ADP stub and an in-house stub identically. One integration, every layout — and no manual column-matching, because the fields arrive already labelled.
What pay-stub data is used for
Income verification and lending is the biggest driver. Lenders, mortgage brokers, landlords and benefits assessors need to confirm what someone earns, and the stub is the primary evidence. Extracting it to structured data makes the check fast, repeatable and auditable: compute average pay, annualise YTD, flag inconsistencies, and cross-check against bank statement deposits — turning a manual document review into a data step.
Payroll and bookkeeping is the other half. Employers reconcile payroll runs and audit deductions and contributions; bookkeepers pull client wages into the books without re-keying. And individuals use it to track take-home, total a year's pay, check tax withheld, or assemble income proof, which is the same need the receipt scanner and statement tools serve for the rest of someone's financial paperwork.
From one stub to a pipeline
The method scales without changing shape. For one stub or a handful, the browser is enough: drop them in, review, export. For a year of your own pay or a workforce's worth, batch up to 100 and consolidate them into a single workbook, each row keyed to its pay period and employee so you can pivot by person, month or field. The per-document effort is the same whether you do one or a hundred.
For continuous volume, extraction belongs in a pipeline rather than a browser tab. The document extraction API takes a stub PDF or image and returns structured JSON with confidence scores, so a lending, HR or payroll system can digitize a document the moment it arrives — clean results flowing through automatically and only low-confidence ones queued for review. Because the same API also reads bank statements and invoices, one integration covers a whole income or bookkeeping workflow rather than just the payroll part.
Reading the deductions: pre-tax vs post-tax
The deductions are the densest part of a stub, and the most valuable to get right, because the split between pre-tax and post-tax deductions changes taxable pay. Pre-tax deductions come out before tax is calculated, lowering the amount you're taxed on; post-tax deductions come out afterwards. Extracting each line separately and signed lets you total each kind and see exactly what's reducing take-home, something a dense PDF column hides.
| Deduction | Type | Effect |
|---|---|---|
| 401(k) / pension | Pre-tax | Reduces taxable pay; tracks toward annual limit |
| Health / dental / vision | Pre-tax | Insurance premiums before tax |
| HSA / FSA | Pre-tax | Health spending accounts |
| Roth 401(k) | Post-tax | Retirement, taxed now |
| Garnishments | Post-tax | Court-ordered, after tax |
| Union dues / charity | Post-tax | Voluntary, after tax |
Capturing the type as well as the amount means your spreadsheet can answer real questions: how much went into retirement this year, whether benefit deductions changed mid-year, what your true taxable pay was. Those are the analyses people actually want from a stub, and they're only possible once the deductions are structured rather than printed.
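Once each deduction carries a type and a signed amount, those questions become one-line sums. The sketch below uses hypothetical line items and key names (`label`, `type`, `amount`), and treats taxable pay as gross less pre-tax deductions, which is a simplification: on a real stub the taxable base can differ from one tax to another.

```python
from decimal import Decimal

# Illustrative line items; amounts are signed (negative means deducted).
deductions = [
    {"label": "401(k)", "type": "pre_tax", "amount": Decimal("-125.00")},
    {"label": "Health", "type": "pre_tax", "amount": Decimal("-80.00")},
    {"label": "Roth 401(k)", "type": "post_tax", "amount": Decimal("-50.00")},
    {"label": "Union dues", "type": "post_tax", "amount": Decimal("-20.00")},
]

def totals_by_type(items):
    """Sum the signed amounts for each deduction type."""
    totals = {}
    for item in items:
        totals[item["type"]] = totals.get(item["type"], Decimal("0")) + item["amount"]
    return totals

gross_pay = Decimal("2500.00")
totals = totals_by_type(deductions)

# Simplified: pre-tax deductions lower the amount that is taxed.
taxable_pay = gross_pay + totals["pre_tax"]
```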
Cross-checking stubs against the bank
Extraction gets more powerful when you pair a stub with the bank statement that paid it. The net pay on each stub should match a deposit on the statement, so once both are structured you can line up pay against deposits and confirm the money landed and the amounts agree. For income verification that cross-check is far stronger evidence than either document alone, and the bank side reads the same way through bank statement analysis for loans.
It catches problems, too. A deposit that doesn't match any stub's net pay, or a stub with no matching deposit, points to a missing document, a changed deduction or a timing quirk worth a look. Because both documents come out of the same engine as consistent, dated, signed rows, the comparison is a simple spreadsheet — net pay in one column, matching deposit in another, the difference in a third — that you can refresh each pay period rather than reconcile by hand.
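The same comparison can be done in code. This is a minimal sketch under stated assumptions: the `match_deposits` function, the three-day matching window and the record layout are hypothetical, and a real reconciliation may need looser rules for split deposits or bank delays.

```python
from datetime import date
from decimal import Decimal

def match_deposits(stubs, deposits, window_days=3):
    """Pair each stub with an unused deposit of the same amount near its pay date.

    Returns (matched pairs, stubs with no deposit, deposits with no stub).
    """
    unused = list(deposits)
    matched, unmatched_stubs = [], []
    for stub in stubs:
        hit = next(
            (d for d in unused
             if d["amount"] == stub["net_pay"]
             and abs((d["date"] - stub["pay_date"]).days) <= window_days),
            None,
        )
        if hit is None:
            unmatched_stubs.append(stub)
        else:
            unused.remove(hit)  # each deposit can satisfy only one stub
            matched.append((stub, hit))
    return matched, unmatched_stubs, unused

# Illustrative data: the second stub has no matching deposit.
stubs = [
    {"pay_date": date(2024, 6, 14), "net_pay": Decimal("1703.75")},
    {"pay_date": date(2024, 6, 28), "net_pay": Decimal("1703.75")},
]
deposits = [
    {"date": date(2024, 6, 14), "amount": Decimal("1703.75")},
    {"date": date(2024, 7, 1), "amount": Decimal("950.00")},
]
matched, unmatched_stubs, unmatched_deposits = match_deposits(stubs, deposits)
```

The two leftover lists are the interesting output: each entry is a stub or deposit that deserves a second look.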
A worked example
To make it concrete, picture a single bi-weekly stub. The page shows gross pay made up of, say, a base amount plus an overtime line; a block of deductions (a 401(k) percentage, a health premium, a dental premium); four taxes (federal, state, Social Security and Medicare); and net pay at the bottom, with a year-to-date figure beside almost every line. To a person that's a thirty-second read and a five-minute transcription; to the extractor it's one pass.
The output is a row per line item: each earnings line with its hours, rate and amount; each deduction signed and typed pre- or post-tax; each tax with its current and year-to-date amount; and the totals. The validation confirms gross minus deductions and taxes equals net, and every field carries a confidence score. What was a dense page is now a small table you can total, file, or feed onward — and doing the same to twenty-six stubs builds a full year's view with no extra effort per document.
That's the whole point of extraction: the value isn't in reading one stub faster, it's in turning a stream of them into data that totals, reconciles and integrates. From here you can export to a spreadsheet or push the same fields into a system over the API.
Common mistakes (and how to avoid them)
Trusting raw OCR text. Plain OCR gives you characters, not meaning — and no validation. Use structured extraction that labels fields and checks gross minus deductions against net, so errors surface instead of flowing through.
Ignoring the YTD column. The year-to-date figures are your completeness check and your fastest route to annualised income. Capturing only the current period throws away the easiest cross-check you have.
Hand-keying at volume. A single stub has thirty-odd numbers; a payroll's worth is thousands. Manual entry is slow and silently error-prone — exactly the work to automate.
Assuming one provider's layout. Template-based tools break on an unfamiliar stub. If you handle stubs from many employers, you need meaning-based extraction that doesn't depend on recognising the design.
No confidence threshold in automation. In an automated flow, passing every result through unchecked lets a bad read reach your data. Route low-confidence fields to review and let only clean ones auto-process.
Best practices & checklist
Put together, a reliable pay-stub extraction process looks like this — whether you're doing one stub or wiring up a pipeline:
- Prefer digital PDFs; for scans and photos, let OCR run and check the flagged fields.
- Capture both current-period and year-to-date amounts for every line.
- Let the gross-minus-deductions-equals-net check run on every stub.
- Use meaning-based extraction so any provider's layout works without setup.
- Set a confidence threshold; review flagged fields, auto-process clean ones.
- Use the YTD progression across a sequence to confirm no stub is missing or duplicated.
- Export to the format your next step needs — Excel/CSV for people, JSON for systems.
- Keep handling private: TLS, delete after processing, no model training on your data.
Bottom line: read by meaning, validate with the stub's own arithmetic, and keep the year-to-date figures, and pay-stub data becomes trustworthy enough to base a tax filing or loan decision on.
Extract your pay stubs now
Upload one stub or a hundred and get clean, validated fields — gross, deductions, taxes, net and YTD — as a spreadsheet or structured JSON.
