What is a document extraction API?

A REST endpoint that converts documents into structured data. FlowParse's /api/v1/extract classifies an invoice, receipt or bank statement and returns typed JSON — fields, line items or transactions — without per-vendor templates.

Which document types are supported?

Invoices, receipts and bank statements, plus mixed documents that contain more than one. The response `type` tells you what was detected so you can branch on it.

Do I have to tell it the document type?

No. Classification is automatic and returned in the response. You can send any supported document to the same endpoint.

What file formats can I send?

PDF (digital and scanned), PNG/JPG images, XLSX and CSV — base64-encoded in the `file` field, up to 20 MB per request.

Is the schema stable across types?

Yes. Every response uses the envelope { type, pages, billedPages, data }, and `data` is snake_case — the same shape the validate, export, reconcile and merge endpoints accept.

How accurate is it, and how do I check?

It reads text geometry for digital files and OCRs scans. You can then score any result with /api/v1/validate (totals, tax math, balance reconciliation, duplicates) and gate on the grade.

Can it handle new vendors and layouts?

Yes — it extracts by meaning rather than fixed coordinates, so unfamiliar suppliers and banks work without authoring a template.

How do I authenticate?

Send an API key as a bearer token (Authorization: Bearer pf_live_…) or X-API-Key. Create and revoke keys at /get-api-key.

Can I export to accounting software?

Yes. /api/v1/export produces XLSX, CSV, XML, QuickBooks/Quicken (QBO/QFX/OFX), Xero, DATEV or 1С from any extracted document; previews are free.

How is it priced?

Extraction and file exports are billed per page from your page balance; validation, reconciliation and previews are free. The per-page rate and plans are on the pricing page.

What happens when I run out of pages?

The API returns 429, reports the pages needed versus the pages available, and returns no unbilled data. Top up or upgrade and retry.

Can I process documents in bulk?

Yes. Extract each document (parallelise across workers) and use /api/v1/merge to consolidate up to 100 of them into one Excel workbook; requests with preview:true are free.

What happens to the files I upload?

Uploaded files are processed to produce the response and are not kept as downloadable documents. Requests use HTTPS and a hashed key, are logged for your audit trail, and are never used to train models.

Where are the full docs and a sandbox?

At /api-docs for the full reference and /api-playground to run requests in the browser. Validation and previews are free, so you can build the whole flow before enabling billed extraction.

Document Extraction API — Invoices, Receipts & Statements to JSON | FlowParse

Can I reconcile invoices against payments?

Yes. /api/v1/reconcile matches invoices to bank payments and returns matched and unmatched items with a report — the same engine as the reconciliation feature.

One API for every financial document

A document extraction API replaces a wall of bespoke parsers with a single endpoint that understands documents by meaning. FlowParse's `POST /api/v1/extract` ingests a PDF, scan, image, XLSX or CSV, classifies it as an invoice, receipt or bank statement (or a mixed document), and returns typed JSON for that type: supplier, totals and `line_items` for an invoice, account header and `transactions` for a statement.

It's intelligent document processing (IDP) you can call from code: classification, OCR, field extraction, table reconstruction and validation behind one bearer key. Because it generalises across layouts, a supplier or bank you've never seen works on the first request — no template to author, no rule to maintain. For statement-specific behaviour see the bank statement API; for the generic contract see the PDF to JSON API.

Every response shares one envelope — `{ type, pages, billedPages, data }` — and the inner `data` is the same snake_case schema the rest of the API consumes. That uniformity is the point: classify, extract, validate, export and reconcile all speak the same shape, so your integration is a short, linear pipeline rather than a pile of adapters.

flowparse.io

Extract any document in one call

Base64-encode the document and POST it. You don't need to tell the API what it is — classification is automatic and returned in the response `type`. Create a key in the API dashboard.

POST /api/v1/extract

curl -X POST https://flowparse.io/api/v1/extract \
  -H "Authorization: Bearer pf_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "file": "JVBERi0xLjcK...", "filename": "receipt.pdf" }'
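
The same request can be assembled in Python. This is a minimal sketch rather than an official client: the helper name `build_extract_request` is hypothetical, and it assumes the 20 MB limit is measured on the raw file before base64 encoding, which this page does not specify. It only builds the URL, headers and JSON body, so any HTTP client can send the result.

```python
import base64
import json

EXTRACT_URL = "https://flowparse.io/api/v1/extract"
MAX_FILE_BYTES = 20 * 1024 * 1024  # assumption: limit applies to the raw file


def build_extract_request(file_bytes: bytes, filename: str, api_key: str):
    """Return (url, headers, body) for an extract call; nothing is sent."""
    if not file_bytes:
        raise ValueError("file is empty")
    if len(file_bytes) > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 20 MB request limit")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        # The `file` field carries the document as base64 text.
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "filename": filename,
    })
    return EXTRACT_URL, headers, body
```

With the standard library, the result can be posted via `urllib.request.Request(url, data=body.encode("utf-8"), headers=headers, method="POST")`.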

What it extracts, by type

The API auto-classifies and returns the fields that matter for each type. A mixed document (an invoice with an attached statement, say) returns both objects.

Type	Extracted fields	Array
invoice	supplier_name, invoice_number, invoice_date, subtotal, tax_amount, total, currency	line_items[]
receipt	merchant, date, total, tax, payment_method	line_items[]
bank_statement	bank_name, account_holder, currency, opening_balance, closing_balance	transactions[]
mixed	an invoice object and a bank_statement object together	both

A receipt, structured

200 OK — receipt

{
  "type": "receipt",
  "pages": 1,
  "billedPages": 1,
  "data": {
    "type": "invoice",
    "data": {
      "supplier_name": "Corner Cafe",
      "invoice_date": "2024-10-02",
      "currency": "USD",
      "tax_amount": 1.20,
      "total": 15.20,
      "line_items": [
        { "description": "Flat white", "quantity": 2, "unit_price": 5.00, "amount": 10.00 },
        { "description": "Croissant",  "quantity": 1, "unit_price": 4.00, "amount": 4.00 }
      ]
    }
  }
}
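
The sample can be sanity-checked locally before it goes anywhere else. The sketch below is not FlowParse's validation logic: `totals_consistent` is a hypothetical helper, and it assumes the nested `data.data` layout shown in the sample and that `total` should equal the sum of line-item amounts plus `tax_amount`.

```python
from decimal import Decimal


def totals_consistent(envelope: dict) -> bool:
    """Check that line-item amounts plus tax add up to the stated total."""
    doc = envelope["data"]["data"]  # assumption: nested layout from the sample
    # Go through str() so 1.20 becomes Decimal("1.2") instead of a binary
    # floating-point approximation.
    items = sum(
        (Decimal(str(item["amount"])) for item in doc["line_items"]),
        Decimal("0"),
    )
    tax = Decimal(str(doc["tax_amount"]))
    total = Decimal(str(doc["total"]))
    return items + tax == total


sample = {
    "type": "receipt", "pages": 1, "billedPages": 1,
    "data": {"type": "invoice", "data": {
        "tax_amount": 1.20, "total": 15.20,
        "line_items": [
            {"description": "Flat white", "quantity": 2, "unit_price": 5.00, "amount": 10.00},
            {"description": "Croissant", "quantity": 1, "unit_price": 4.00, "amount": 4.00},
        ],
    }},
}
print(totals_consistent(sample))  # True: 10.00 + 4.00 + 1.20 == 15.20
```

Decimal arithmetic is used because sums of currency amounts held as floats can miss an exact equality check by a rounding error.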

Validation as a quality gate

Extraction is only useful if you can trust it, so the API ships a deterministic validator. `POST /api/v1/validate` returns a 0–100 quality score, a grade and concrete checks: invoice totals and tax math, statement balance reconciliation, duplicate and out-of-order rows, and low-confidence fields. Auto-accept high grades and queue the rest for review. The full rule set is documented on the validation engine page, and the AI VAT auditor adds tax-specific review on top.

POST /api/v1/validate

curl -X POST https://flowparse.io/api/v1/validate \
  -H "Authorization: Bearer pf_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "type": "invoice", "data": { ... } }'
# → { "validations": [ { "score": { "value": 96, "grade": "A" }, "checks": [ ... ] } ] }
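
Gating on the grade might look like the sketch below. It relies only on the response shape shown in the comment above (`validations[].score.grade`). The function name `route` is hypothetical, and treating only grade "A" as auto-accept is an assumption: the full grade scale is not listed on this page, so adjust `accept_grades` to your own policy.

```python
def route(validate_response: dict, accept_grades=frozenset({"A"})) -> str:
    """Return "accept" if every validation has an accepted grade, else "review"."""
    validations = validate_response.get("validations", [])
    if not validations:
        return "review"  # nothing was scored, so a human should look
    for validation in validations:
        if validation["score"]["grade"] not in accept_grades:
            return "review"
    return "accept"
```

Defaulting to "review" whenever the response is empty or a grade is unrecognised keeps the straight-through path opt-in.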

Classify → extract → act

Authenticate

Send your key as Authorization: Bearer pf_live_….

Extract

POST the base64 document to /api/v1/extract; read the classified `type` and typed `data`.

Validate

Score `data` with /api/v1/validate and branch on the grade for straight-through vs review.

Export

Turn invoices and statements into XLSX/CSV/XML or accounting files (QBO/QFX/OFX/Xero/DATEV/1С) via /api/v1/export.

Reconcile

Match invoices to bank payments with /api/v1/reconcile, or merge a batch with /api/v1/merge.
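
The first three steps can be chained in a few lines. This is a sketch under stated assumptions: `post` is a hypothetical transport callable that attaches the bearer key, sends JSON to the given path and returns the parsed response, and `process_document` is an illustrative name. Export and reconcile are left out because their request bodies are not shown on this page.

```python
import base64
from typing import Callable

# Hypothetical transport signature: post(path, payload) -> parsed JSON dict.
Post = Callable[[str, dict], dict]


def process_document(post: Post, file_bytes: bytes, filename: str) -> dict:
    """Extract, then validate, and return a record describing the outcome."""
    extracted = post("/api/v1/extract", {
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "filename": filename,
    })
    # The extracted `data` is forwarded unchanged, relying on the statement
    # that validate accepts the same shape that extract returns.
    validated = post("/api/v1/validate", extracted["data"])
    grades = [v["score"]["grade"] for v in validated["validations"]]
    return {
        "type": extracted["type"],
        "billed_pages": extracted["billedPages"],
        "grades": grades,
        "data": extracted["data"],
    }
```

Injecting the transport keeps the pipeline testable with a fake `post` and avoids tying the sketch to one HTTP library.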

Beyond extraction: export and reconcile

Most pipelines need more than JSON. `POST /api/v1/export` converts any extracted document into XLSX, CSV, XML, QuickBooks/Quicken (`.QBO`/`.QFX`/`.OFX`), Xero, DATEV or 1С — base64, previews free. `POST /api/v1/reconcile` matches a set of invoices against bank payments and returns matched and unmatched items with a reconciliation report, the same engine behind the reconciliation feature. And `POST /api/v1/merge` consolidates up to 100 documents into one reconciled Excel.

Together these turn the extraction API into a full back-office automation surface: capture an invoice, validate it, export it to your ledger, then reconcile it against the bank statement you extracted from the same API — no manual re-keying anywhere in the chain.

Pricing, keys and limits

Extraction and file exports bill per page from your page balance; classification is part of extraction, and validation, reconciliation and previews are free. Over-budget calls return `429` with the exact shortfall. Manage keys, rotate per environment, and watch per-key usage in the dashboard. The per-page rate and plans are on the pricing page, the complete reference is in the API docs, and you can try requests in the playground.

Capability	Endpoint	Billing
Extract any document → JSON	POST /api/v1/extract	Per page
Validate / quality score	POST /api/v1/validate	Free
Export to file / accounting	POST /api/v1/export	Per page (preview free)
Reconcile invoices ↔ payments	POST /api/v1/reconcile	Free
Merge many → one Excel	POST /api/v1/merge	Per page (preview free)

What teams automate with it

Accounts payable

Extract supplier invoices, validate totals and tax, and post to the ledger without manual entry.

Expense & receipts

Turn receipt photos into line-item data for expense reports and reimbursement.

Lending & finance ops

Combine statement and invoice extraction to assess cash flow and obligations.

Vertical SaaS

Embed document capture in your product so customers upload PDFs and you get structured data.

Why AI extraction, not templates or rules

Traditional intelligent-document-processing stacks are built from per-template rules: you define where each field lives for each vendor or document layout, and maintain that library forever. It works until a supplier redesigns an invoice, a bank moves a column, or a customer uploads a format nobody anticipated — then the rule misfires silently and bad data flows downstream. The whole approach scales badly because every new layout is an engineering ticket.

An AI document extraction API generalises instead of memorising. It understands that 'the total is the largest tax-inclusive amount near the bottom' or 'this column of dated rows is a transaction table' regardless of exact position, font or wording, so a document it has never seen returns the same clean schema on the first request. That's the difference between an automation that needs constant babysitting and one that just keeps working as your document mix grows. Pair it with the deterministic validation engine and you get generalisation *and* a hard correctness check — the combination most rule-based stacks lack.

Integrate from any stack

There's no SDK to adopt and no language constraint: the API is JSON over HTTPS, so any HTTP client works. The shape is always the same — base64-encode the document, POST it to `/api/v1/extract`, read the classified `type` and typed `data`, then branch your logic on the type. Because every endpoint shares that schema, a thin internal wrapper of `extract`, `validate`, `export` and `reconcile` functions is usually all you need, and it stays valid as the contract is versioned under `/api/v1`.

A clean pattern at volume is a queue plus workers: enqueue each uploaded document, have a worker call extract then validate, write the result and notify your own system — effectively your own webhook, fully under your control. Make jobs idempotent on a file hash so retries don't duplicate records, cap worker concurrency so a big batch can't drain your page budget at once, and log the billed pages and validation grade for a complete audit trail. The API docs have copy-paste curl, and the playground runs real requests in the browser.
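
The idempotency and concurrency cap described above can be sketched with the standard library. Everything here is illustrative: `extract_unique` is a hypothetical helper, `extract` stands in for whatever function calls `/api/v1/extract`, and the in-memory set is a placeholder for a durable store such as a database table with a unique key on the file hash.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def extract_unique(files, extract, max_workers=4):
    """Call `extract` once per distinct file content, with capped concurrency.

    Returns a dict mapping each file's SHA-256 hex digest to its result.
    """
    results = {}
    claimed = set()
    lock = Lock()

    def job(file_bytes):
        key = hashlib.sha256(file_bytes).hexdigest()
        with lock:
            if key in claimed:
                return  # same content already claimed, so skip the billed call
            claimed.add(key)
        result = extract(file_bytes)
        with lock:
            results[key] = result

    # max_workers bounds how many extractions (and billed pages) run at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for completion and re-raises any worker exception here.
        list(pool.map(job, files))
    return results
```

Hashing the file content rather than the filename means a re-upload under a different name is still recognised as a duplicate.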

Status codes, billing and limits

Standard HTTP codes keep error handling simple, and no call returns unbilled data. `200` carries the structured JSON; `400` is a malformed request; `401` is a bad or missing key; `422` means the document was unreadable or had nothing extractable (the file is not billed); `429` means the page budget is exhausted, with the exact shortfall in the message; and `503` is transient, so retry with backoff. Extraction and file exports bill per page from your page balance. Classification is part of extraction, and validation, reconciliation and previews are free.

Documents up to 20 MB and multiple pages are supported. For high throughput, extract each document individually and in parallel rather than in one giant call, and consolidate the results with Smart Merge when you need a single workbook. You can build and test the entire flow for free against validation and previews, then enable billed extraction at go-live — plans and the per-page rate are on the pricing page.

Code	Meaning	Action
200	Success — classified JSON returned	Branch on type, then process
400	Bad request (file/base64)	Fix the request body
401	Invalid or missing key	Check Authorization / rotate key
422	Unreadable / nothing extractable (not billed)	Re-scan or send a cleaner file
429	Page budget exhausted	Top up or upgrade, then retry
503	Temporarily unavailable	Retry with backoff
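
The table translates directly into a dispatch map. In this sketch the action labels and function names are illustrative, and the backoff parameters (1 second base, 30 second cap, full jitter) are assumptions, since the page only says to retry with backoff.

```python
import random

# Actions follow the status table above; the short labels are this sketch's own.
ACTIONS = {
    200: "process",
    400: "fix_request",
    401: "check_key",
    422: "resend_file",
    429: "top_up",
    503: "retry",
}


def action_for(status: int) -> str:
    """Map a status code to a handling decision; unlisted codes fail loudly."""
    return ACTIONS.get(status, "raise")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0):
    """Yield exponential backoff delays in seconds, with full jitter."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

Only `503` is retried automatically here. A `429` is not retried until the page balance has been topped up, because repeating the call earlier would fail the same way.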

Classification and accuracy you can gate on

A document extraction API earns its place only if you can trust what it returns, so two things matter: correct classification and correct fields. Classification comes first — the engine decides whether a file is an invoice, a receipt, a bank statement or a mixed document, and returns that in the response `type` so your code branches correctly without you sniffing the file yourself. Get this wrong and everything downstream is wrong; get it right and each document flows to the right handler automatically.

Field accuracy is then protected the same way across every type. The engine reconstructs tables from the document's own geometry rather than guessing from flat text, types every value (numbers as numbers, dates as ISO-8601), and records lower confidence instead of inventing data when a source is ambiguous. You convert that into a decision with `/api/v1/validate`, which scores invoices on totals and tax math and statements on balance reconciliation, duplicates and date order. Auto-accept the clean ones, review the rest — and keep the original values in `raw_table` for a complete audit trail. That combination of broad coverage and a hard, per-document check is what makes the API safe to run unattended.

Best practices for document automation

Treat the extraction API as one step in a pipeline, not a magic box. Validate every document and branch on the grade so a human only ever sees the genuinely ambiguous cases. Prefer original digital files over scans where you can, since the text layer is read exactly. Persist the classified type, the validation score and `billedPages` with each record so you have a queryable audit trail of what was captured, what it cost and what was auto-accepted versus reviewed — invaluable when finance or compliance asks how a number got into the books.

Run extraction behind a queue with a capped worker pool rather than inline on a request, make jobs idempotent on a file hash so retries don't duplicate records, and handle `429` (top up and resume) and `503` (exponential backoff) explicitly. Store only the fields you need and keep PII out of logs. Build and test the whole flow for free against validation and previews, then switch on billed extraction at go-live. Done this way, a document extraction API turns accounts payable, expense capture and statement onboarding into reliable, unattended automation — the guide walks the full pattern with code.

Monitoring usage and controlling cost

Running extraction at volume means watching two things: spend and quality. Every API key tracks its own request count, page total and cost, visible in the dashboard, so you can see exactly what each integration or customer is consuming and spot anomalies — a sudden spike usually means a retry loop or a malformed batch. Setting a budget that matches your plan, and capping worker concurrency so a single run can't exhaust it, keeps cost predictable rather than surprising.

On the quality side, log the validation grade and `billedPages` with every document and chart the auto-accept rate over time. A falling auto-accept rate is an early signal that input quality has dropped — a new scanner, a worse photo flow, a new bank format — and lets you act before bad data reaches the books. Together these two habits turn the API from a black box into an observable, controllable part of your pipeline; the per-page rate and plan limits are on the pricing page.
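
Both habits reduce to simple aggregates over your own log records. The sketch below assumes each record is a dict with `grade` and `billed_pages` keys. Those field names are this example's own, not part of the API, and treating only grade "A" as auto-accepted is likewise an assumption.

```python
def auto_accept_rate(records, accept_grades=frozenset({"A"})) -> float:
    """Fraction of logged documents whose validation grade was auto-accepted."""
    records = list(records)
    if not records:
        return 0.0
    accepted = sum(1 for record in records if record["grade"] in accept_grades)
    return accepted / len(records)


def total_billed_pages(records) -> int:
    """Total pages billed across the logged documents."""
    return sum(record["billed_pages"] for record in records)
```

Computing these per day or per API key and charting them makes a drop in input quality or a spike in spend visible.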

Choose where to start

If statements are your focus, start with the bank statement API or the bank statement OCR API for scans. For the generic PDF contract, see the PDF to JSON API. To build a complete integration with batching and error handling, follow the guide to parsing bank statements with an API. Everything is documented at /api-docs.

Automate document capture end to end

One endpoint to classify and extract invoices, receipts and statements — then validate, export and reconcile over the same API key.

Frequently asked questions
