Turn PDFs into JSON over REST
A PDF to JSON API removes the worst part of document automation: getting reliable structured data out of a PDF. Instead of writing per-layout templates or fragile regexes over raw text, you POST the file and receive typed JSON. FlowParse's `POST /api/v1/extract` does this for financial documents: it classifies the file, extracts every field, and returns a schema you can insert into a database without reshaping.
The output is deliberately boring in the best way: stable, with extracted fields in snake_case, and identical to what the rest of the API consumes. That means the JSON you get from extraction is exactly what you pass to validation, export or reconciliation, with no glue code translating between shapes. For statement-specific behaviour see the bank statement API; this page covers the generic PDF-to-JSON contract.
Under the hood it reads the PDF's own text geometry for pixel-accurate tables and only falls back to AI OCR for scanned pages, so a 40-page document keeps all 40 pages of rows. Every numeric field is parsed to a number, every date to ISO-8601, and the untouched source columns are preserved in `raw_table` for audit.
One call, structured JSON back
Get a key from the API dashboard, base64-encode the PDF, and POST it to `https://flowparse.io/api/v1/extract`. The whole request is plain JSON, so there's no multipart upload to wrangle from your backend.
curl -X POST https://flowparse.io/api/v1/extract \
  -H "Authorization: Bearer pf_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "file": "JVBERi0xLjcK...", "filename": "invoice-4471.pdf" }'
Invoice example
For an invoice, the API returns the supplier and buyer, totals, tax breakdown and a typed `line_items` array — with the original table preserved in `raw_table` so nothing is lost.
{
  "type": "invoice",
  "pages": 1,
  "billedPages": 1,
  "data": {
    "type": "invoice",
    "data": {
      "supplier_name": "Globex Ltd",
      "invoice_number": "INV-4471",
      "invoice_date": "2024-10-11",
      "currency": "EUR",
      "subtotal": 1000.00,
      "tax_amount": 190.00,
      "total": 1190.00,
      "line_items": [
        { "description": "Design retainer", "quantity": 2, "unit_price": 500.00, "tax_rate": 19, "amount": 1000.00 }
      ]
    }
  }
}
A stable, typed schema
Every response is `{ type, pages, billedPages, data }`, where `data` is `{ type, data }` — the inner `type` is the classified document type (`invoice`, `bank_statement`, `receipt`, or `mixed`) and the inner `data` holds the typed fields. The contract is versioned under `/api/v1`, so you can build against it with confidence.
| Document type | Key fields | Array |
|---|---|---|
| invoice / receipt | supplier_name, invoice_number, invoice_date, subtotal, tax_amount, total | line_items[] |
| bank_statement | bank_name, account_holder, currency, opening_balance, closing_balance | transactions[] |
| mixed | both an invoice object and a bank_statement object | both |
| all | raw_table { columns, rows } — original layout preserved 1:1 | — |
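As a rough illustration of working with this envelope, the sketch below unwraps the nested `data` object from the invoice example and runs a local arithmetic check. `unwrap` and `totals_consistent` are hypothetical helper names, not part of the API, and the check is only a client-side sanity test under the assumption that line items sum to the subtotal; it does not replace `/api/v1/validate`.

```python
import json
from decimal import Decimal
from typing import Tuple

# Sample response copied from the invoice example on this page.
SAMPLE = """
{
  "type": "invoice", "pages": 1, "billedPages": 1,
  "data": {
    "type": "invoice",
    "data": {
      "supplier_name": "Globex Ltd", "invoice_number": "INV-4471",
      "invoice_date": "2024-10-11", "currency": "EUR",
      "subtotal": 1000.00, "tax_amount": 190.00, "total": 1190.00,
      "line_items": [
        {"description": "Design retainer", "quantity": 2,
         "unit_price": 500.00, "tax_rate": 19, "amount": 1000.00}
      ]
    }
  }
}
"""


def unwrap(response: dict) -> Tuple[str, dict]:
    """Return (classified document type, typed fields) from the envelope."""
    inner = response["data"]
    return inner["type"], inner["data"]


def totals_consistent(fields: dict) -> bool:
    """Hypothetical local check: line items sum to the subtotal, and
    subtotal + tax_amount equals total."""
    items_sum = sum(item["amount"] for item in fields["line_items"])
    return (items_sum == fields["subtotal"]
            and fields["subtotal"] + fields["tax_amount"] == fields["total"])


# parse_float=Decimal keeps money values exact instead of binary floats.
response = json.loads(SAMPLE, parse_float=Decimal)
doc_type, fields = unwrap(response)
```

Parsing amounts as `Decimal` is a design choice of this sketch: it makes the equality comparisons exact, which plain floats would not guarantee for arbitrary amounts.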
Why not regex or templates?
Template-based PDF parsing breaks the moment a vendor changes its layout, a bank tweaks a column, or a customer uploads a format you've never seen. You end up maintaining a growing library of brittle rules. An AI extraction API generalises across layouts instead: it understands *what a field means*, not *where it sits in pixels*, so a new supplier or bank works on day one without a code change.
Raw text extraction (e.g. pdf-to-text plus regex) has the opposite problem — it gives you a wall of text with the table structure destroyed, so multi-column statements and line-item tables become guesswork. FlowParse keeps structure by reading table geometry directly and reconstructing rows, which is why amounts line up with the right dates and descriptions.
Validate the JSON before you trust it
Structured doesn't automatically mean correct, so pair extraction with `POST /api/v1/validate`. It returns a 0–100 quality score, a grade and concrete checks — totals that don't add up, tax math that's off, balance breaks on statements, duplicate or out-of-order rows. Use the grade to auto-accept clean documents and queue only the doubtful ones for review.
curl -X POST https://flowparse.io/api/v1/validate \
  -H "Authorization: Bearer pf_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "type": "invoice", "data": { ... } }'
# → { "validations": [ { "score": { "value": 98, "grade": "A" }, "checks": [ ... ] } ] }
From JSON to files
When the destination is a spreadsheet or accounting system rather than your database, `POST /api/v1/export` turns the JSON into XLSX, CSV, XML, QuickBooks/Quicken bank feeds (`.QBO`/`.QFX`/`.OFX`), Xero, DATEV or 1С — returned as a base64 file. Previews are free, so you can confirm the column mapping before you generate the billed file. This is the same export engine described on PDF to QBO and bank statement to Xero.
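Since the export comes back as base64, the receiving side has to decode it before writing the file. A minimal sketch of that step follows; `save_base64_file` is a hypothetical helper, and it takes the base64 string directly because this page does not state which response field carries it.

```python
import base64
import binascii
from pathlib import Path


def save_base64_file(file_b64: str, dest: Path) -> int:
    """Decode a base64 payload, write it to dest, and return the byte count."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        raw = base64.b64decode(file_b64, validate=True)
    except binascii.Error as exc:
        raise ValueError("export payload is not valid base64") from exc
    dest.write_bytes(raw)
    return len(raw)
```

Strict decoding is deliberate here: a truncated or corrupted payload fails loudly rather than producing a spreadsheet that will not open.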
A clean integration pattern
Authenticate
Create a key in the dashboard and send it as `Authorization: Bearer pf_live_…`.
Extract
POST the base64 PDF to /api/v1/extract and read the typed JSON `data`.
Validate
Send `data` to /api/v1/validate; branch on the score/grade to auto-accept or review.
Persist or export
Store the fields you need, or call /api/v1/export to produce the file your workflow expects.
Handle 429
If the page budget is exhausted you get a 429 — top up and retry; no unbilled data is returned.
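The steps above can be sketched as one small function. This is an illustration under stated assumptions rather than an official client: `process_document`, the injected `post` transport, and the `min_score` threshold of 90 are all hypothetical, while the request and response shapes follow the examples on this page. A canned `fake_post` stands in for the network so the sketch runs offline.

```python
from typing import Callable

# A transport takes (path, JSON body) and returns the decoded JSON response.
Transport = Callable[[str, dict], dict]


def process_document(post: Transport, file_b64: str, filename: str,
                     min_score: int = 90) -> dict:
    """Extract, validate, then decide between auto-accept and human review."""
    extracted = post("/api/v1/extract", {"file": file_b64, "filename": filename})
    data = extracted["data"]                      # {"type": ..., "data": {...}}
    validated = post("/api/v1/validate", data)
    score = validated["validations"][0]["score"]  # {"value": ..., "grade": ...}
    # Assumed policy: accept at or above the threshold, otherwise review.
    decision = "auto_accept" if score["value"] >= min_score else "review"
    return {"decision": decision, "score": score, "data": data,
            "billed_pages": extracted["billedPages"]}


def fake_post(path: str, body: dict) -> dict:
    """Stand-in transport with canned responses (not real API output)."""
    if path == "/api/v1/extract":
        return {"type": "invoice", "pages": 1, "billedPages": 1,
                "data": {"type": "invoice", "data": {"total": 1190.00}}}
    return {"validations": [{"score": {"value": 98, "grade": "A"}, "checks": []}]}


result = process_document(fake_post, "JVBERi0xLjcK", "invoice-4471.pdf")
```

Injecting the transport keeps authentication, retries and `429` handling in one place, and lets the branching logic be tested without spending pages.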
Pricing and rate limits
Extraction and file exports are billed per page from your page balance; validation and previews are free. A request that would exceed your balance returns `429` with the exact shortfall, so spend is always predictable. Plans and the per-page rate are on the pricing page; usage per key is tracked in the dashboard, and you can rotate keys without downtime.
Start free: build the entire flow against validation and previews at no cost, then enable billed extraction when you go live. The complete reference — every field, every format, error codes and live examples — is in the API docs, with an interactive playground for quick experiments.
| Capability | Endpoint | Billing |
|---|---|---|
| PDF → JSON | POST /api/v1/extract | Per page |
| Validate JSON | POST /api/v1/validate | Free |
| Export to file | POST /api/v1/export | Per page (preview free) |
| Merge documents | POST /api/v1/merge | Per page (preview free) |
Call it from any language
Because the API is plain JSON over HTTPS, there's no SDK to install and no language lock-in: anything that can make a POST request can use it. The pattern is identical everywhere: read the file as bytes, base64-encode it, send it in the `file` field, and parse the JSON response. In Node you'd `readFileSync(path).toString('base64')`; in Python `base64.b64encode(open(path,'rb').read()).decode()`; in Ruby `Base64.strict_encode64(File.binread(path))`; in Go `base64.StdEncoding.EncodeToString(bytes)`. The request body and the response schema don't change.
That uniformity makes the API easy to wrap in your own thin client: one function `extract(file) → data`, one `validate(data) → score`, one `export(data, format) → file`. Because the schema is stable and versioned under `/api/v1`, that wrapper stays valid as your product grows. The API docs include copy-paste curl for every endpoint, and the playground lets you fire a real request from the browser before you write a line of code.
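import base64
import requests
KEY = "pf_live_xxx"  # your API key from the dashboard
with open("invoice-4471.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode()
r = requests.post(
    "https://flowparse.io/api/v1/extract",
    headers={"Authorization": "Bearer " + KEY},
    json={"file": file_b64, "filename": "invoice-4471.pdf"},
    timeout=60,
)
r.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
data = r.json()["data"]  # {"type": ..., "data": {...typed fields...}}
Status codes, limits and retries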
The API uses standard HTTP status codes and never returns unbilled data, so error handling is straightforward. A `200` carries the structured JSON. A `400` means the request body was wrong (a missing or invalid `file`, or bad base64) and should be fixed rather than retried. A `401` is an invalid or missing key. A `422` means the document couldn't be read or had nothing extractable (a blank image, a non-financial file); the request isn't billed, so re-scan at higher quality or send the original PDF. A `429` means your page budget is exhausted, with the exact pages-needed-versus-available in the message, and a `503` is a transient issue to retry with backoff.
Files can be up to 20 MB per request, and multi-page documents are fully supported. For very large batches, extract documents individually, ideally in parallel across workers with a sensible concurrency cap, rather than trying to send everything in one call. Billing is per page from your page balance; validation and previews are free, so you can build and test the whole pipeline at zero cost and only switch on billed extraction when you go live. Plans and the per-page rate are on the pricing page.
| Code | Meaning | Action |
|---|---|---|
| 200 | Success — JSON returned | Process the data |
| 400 | Bad request (file/base64) | Fix the request body |
| 401 | Invalid or missing key | Check Authorization / rotate key |
| 422 | Unreadable / nothing extractable (not billed) | Re-scan or send the original PDF |
| 429 | Page budget exhausted | Top up or upgrade, then retry |
| 503 | Temporarily unavailable | Retry with backoff |
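One way to turn the table above into code is a lookup from status code to action, plus a delay schedule for the retryable case. The action labels, the `fail` default for unlisted codes, and the backoff parameters (1 s base, 30 s cap, full jitter) are assumptions of this sketch, not values documented by the API.

```python
import random
from typing import List, Optional

# One action per status code, mirroring the table above.
ACTIONS = {
    200: "process",
    400: "fix_request",
    401: "check_key",
    422: "rescan",
    429: "top_up",
    503: "retry",
}


def action_for(status: int) -> str:
    """Map an HTTP status to an action; unlisted codes are treated as failures."""
    return ACTIONS.get(status, "fail")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Exponential backoff with full jitter: delay i is drawn uniformly
    from [0, min(cap, base * 2**i)] seconds."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

Only `503` maps to an automatic retry. A `429` needs a top-up first, so retrying it on a timer would just burn requests.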
What teams build with it
A reliable PDF-to-JSON step unlocks a lot of automation that was previously stuck behind manual data entry. Accounts-payable teams extract supplier invoices straight into their ledger; expense tools turn receipt photos into line items; lenders and finance teams pull statement transactions for cash-flow and affordability analysis; and vertical SaaS products embed document capture so their own customers upload a PDF and get structured data inside the app. In each case the value isn't just the JSON — it's removing a human from the loop without losing accuracy.
Accounts payable
Invoices to ledger rows, validated, with no manual keying.
Expense & receipts
Receipt photos to typed line items for reports and reimbursement.
Finance & lending
Statement transactions for cash-flow, affordability and underwriting input.
Vertical SaaS
Embed capture so customers upload a PDF and your app gets structured data.
How structure and accuracy are preserved
The hardest part of turning a PDF into JSON is not reading the characters — it is reconstructing the *structure* so values land in the right fields. A line-item table and a transaction table are both grids, and a flat text dump destroys the grid: you lose which number is a quantity, which is a unit price, and which column a date belongs to. FlowParse reads the PDF's own text geometry to rebuild those grids cell by cell, so a unit price stays a unit price and a balance stays a balance, even across page breaks and multi-line descriptions.
On top of structure sits typing. Amounts are parsed to real numbers (no stray currency symbols or thousands separators), dates are normalised to ISO-8601, and bank debit/credit columns collapse into a single signed amount. Where the source is ambiguous — a faint scan, an unusual layout — the engine records lower confidence rather than guessing silently, and the deterministic checks behind `/api/v1/validate` give you a score you can act on. The original, untouched values always remain in `raw_table`, so you can reconcile the typed output against exactly what was printed whenever you need an audit trail.
This is why the JSON is safe to insert directly: it is not a best-effort text scrape but a structured, typed, checkable representation of the document. The same discipline applies whether the input is an invoice, a receipt or a statement, which is what lets one schema serve every financial PDF your product encounters.
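To make the typing step concrete, here is an illustrative sketch of the kind of normalisation described above. It is not FlowParse's implementation: the helper names are hypothetical, `parse_amount` assumes `.` as the decimal mark and `,` as the thousands separator, `to_iso_date` assumes a day/month/year source format, and the sign convention (credits positive, debits negative) is an assumption.

```python
import re
from datetime import datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Strip currency symbols and thousands separators from a printed amount.
    Assumes '.' is the decimal mark and ',' the thousands separator."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    return Decimal(cleaned)


def to_iso_date(text: str, fmt: str = "%d/%m/%Y") -> str:
    """Normalise a printed date to ISO-8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, fmt).date().isoformat()


def signed_amount(debit: str, credit: str) -> Decimal:
    """Collapse separate debit/credit columns into one signed value.
    Assumed convention: credits are positive, debits negative."""
    if credit.strip():
        return parse_amount(credit)
    return -parse_amount(debit)
```

For example, a statement row printed as `11/10/2024`, debit `45.10`, empty credit would become the date `2024-10-11` with a signed amount of `-45.10`.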
Best practices for a reliable integration
A few habits make a PDF-to-JSON integration dependable in production. First, always validate before you trust — send the extracted `data` to `/api/v1/validate` and branch on the grade, so clean documents flow straight through and only doubtful ones reach a human. Second, prefer the original digital PDF over a photo or screenshot whenever one exists; the text layer is read exactly, while images must be OCR-ed. Third, persist the validation score and `billedPages` with each record, so you have a queryable history of what was processed, what it cost, and what was auto-accepted.
Operationally, run extraction behind a queue with a capped pool of workers rather than inline on a request, make each job idempotent on a file hash so retries don't duplicate records, and handle `429`/`503` with top-up-and-resume and exponential backoff respectively. Store only the fields you need, drop `raw_table` if you don't use it, and keep PII out of logs. None of this is exotic — it's the same shape as any robust third-party integration — and it turns a single call into an unattended pipeline. The guide to parsing bank statements with an API shows the full pattern with code.
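The idempotency and concurrency habits above might look like this in outline. It is a sketch under assumptions: `file_key` and `run_batch` are hypothetical names, the in-memory `done` dict stands in for whatever store holds processed records, the default of four workers is arbitrary, and `extract` is any callable that performs the actual API call.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def file_key(content: bytes) -> str:
    """Idempotency key: SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(content).hexdigest()


def run_batch(files: Iterable[bytes], extract: Callable[[bytes], dict],
              done: Dict[str, dict], max_workers: int = 4) -> Dict[str, dict]:
    """Extract each distinct, not-yet-processed file once, with at most
    max_workers calls in flight. `done` maps file hash to stored result
    and is updated in place."""
    pending: Dict[str, bytes] = {}
    for content in files:
        key = file_key(content)
        if key not in done and key not in pending:
            pending[key] = content
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so keys and results align.
        results = pool.map(extract, pending.values())
        for key, result in zip(pending.keys(), results):
            done[key] = result
    return done
```

Because duplicates are filtered by hash before any call is made, a retried job or a re-uploaded file costs no additional pages.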
Pick the right entry point
If your documents are mostly statements, start with the bank statement API. For scanned files, the bank statement OCR API explains the OCR path. For invoices and receipts at scale, see the document extraction API. And to wire a full integration end to end — extract, validate, export and reconcile — follow the guide to parsing bank statements with an API.
Convert your first PDF to JSON now
Grab a key, POST a PDF to /api/v1/extract, and get a stable, typed JSON schema back — ready to validate, store or export.
