What does the PDF to JSON API do?

It converts a financial PDF into clean, typed JSON in one REST call. POST the file as base64 to /api/v1/extract and receive a classified document with all fields, line items or transactions, and a raw_table preserving the original layout.

What document types are supported?

Invoices, receipts and bank statements, plus mixed documents that contain both an invoice and a bank statement. The response includes the classified `type` so you can branch on it.

Is the JSON schema stable?

Yes. It's snake_case, versioned under /api/v1, and identical to what the validate, export, reconcile and merge endpoints accept — so output flows into the next call without reshaping.

How do I send the PDF?

Base64-encode it and put it in the `file` field of a JSON body, with an optional `filename`. No multipart upload is required.

Does it work on scanned PDFs?

Yes. The engine reads the text layer for digital PDFs and falls back to OCR for scanned pages and images. See the bank statement OCR API for details on the OCR path.

Are numbers and dates typed?

Yes. Amounts are parsed to numbers, dates to ISO-8601, and bank debits/credits are normalised to a single signed amount. The untouched source values remain in raw_table.

How is it different from pdf-to-text plus regex?

Raw text extraction destroys table structure, so multi-column data becomes guesswork and every new layout needs new rules. AI extraction understands fields by meaning and reconstructs tables, so new vendors and banks work without code changes.

How do I know the JSON is correct?

Call /api/v1/validate to get a 0–100 score, a grade and specific checks (totals, tax math, balance reconciliation, duplicates). Auto-accept clean documents and review only the doubtful ones.

Can I get an Excel or accounting file instead of JSON?

Yes — pass the JSON to /api/v1/export for XLSX, CSV, XML, QuickBooks/Quicken (QBO/QFX/OFX), Xero, DATEV or 1С, returned as base64. Previews are free.

What happens when the budget runs out?

You get HTTP 429 with the exact pages needed versus available, and no unbilled data is returned. Top up or upgrade and retry.

How large can a PDF be?

Up to 20 MB per request, multi-page supported. For big batches, extract each document and combine results with /api/v1/merge.

Which languages can I use?

Any with an HTTP client — it's plain JSON over REST. The docs include curl you can paste, and the playground runs requests in the browser.

Is my file stored or used for training?

The file is processed to produce the JSON response and isn't kept as a downloadable document. Requests use HTTPS and a hashed key, are logged for your audit trail, and are never used to train models.

How do I get started?

Create a key at /get-api-key, POST one PDF to /api/v1/extract, and read the JSON. The full reference is at /api-docs and you can experiment in /api-playground.

PDF to JSON API — Convert PDFs to Structured JSON | FlowParse

What does it cost?

Per page for extraction and file exports, drawn from your page balance; validation and previews are free. The per-page rate and plans are on the pricing page.

Turn PDFs into JSON over REST

A PDF to JSON API removes the worst part of document automation: getting reliable structured data out of a PDF. Instead of writing per-layout templates or fragile text regex, you POST the file and receive typed JSON. FlowParse's `POST /api/v1/extract` does this for financial documents — it classifies the file, extracts every field, and returns a schema you can insert into a database without reshaping.

The output is deliberately boring in the best way: stable, snake_case, and identical to what the rest of the API consumes. That means the JSON you get from extraction is exactly what you pass to validation, export or reconciliation — no glue code translating between shapes. For statement-specific behaviour see the bank statement API; this page covers the generic PDF-to-JSON contract.

Under the hood it reads the PDF's own text geometry for pixel-accurate tables and only falls back to AI OCR for scanned pages, so a 40-page document keeps all 40 pages of rows. Every numeric field is parsed to a number, every date to ISO-8601, and the untouched source columns are preserved in `raw_table` for audit.

flowparse.io

One call, structured JSON back

Get a key from the API dashboard, base64-encode the PDF, and POST it to `https://flowparse.io/api/v1/extract`. The whole request is plain JSON, so there's no multipart upload to wrangle from your backend.

POST /api/v1/extract

curl -X POST https://flowparse.io/api/v1/extract \
  -H "Authorization: Bearer pf_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "file": "JVBERi0xLjcK...", "filename": "invoice-4471.pdf" }'

Invoice example

For an invoice, the API returns the supplier and buyer, totals, tax breakdown and a typed `line_items` array, with the original table preserved in `raw_table` so nothing is lost. The sample below is abbreviated: it omits the buyer fields and `raw_table`.

200 OK — invoice

{
  "type": "invoice",
  "pages": 1,
  "billedPages": 1,
  "data": {
    "type": "invoice",
    "data": {
      "supplier_name": "Globex Ltd",
      "invoice_number": "INV-4471",
      "invoice_date": "2024-10-11",
      "currency": "EUR",
      "subtotal": 1000.00,
      "tax_amount": 190.00,
      "total": 1190.00,
      "line_items": [
        { "description": "Design retainer", "quantity": 2, "unit_price": 500.00, "tax_rate": 19, "amount": 1000.00 }
      ]
    }
  }
}
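As a quick sanity check on the sample above, the sketch below parses it and confirms that the line items sum to the subtotal and that subtotal plus tax equals the total. The helper name `totals_are_consistent`, the one-cent tolerance and the reading of `amount` as a net (pre-tax) figure are assumptions made for illustration; `/api/v1/validate` remains the documented way to run checks like these.

```python
import json
import math

# The sample response from this page, pasted as-is.
SAMPLE = """
{
  "type": "invoice",
  "pages": 1,
  "billedPages": 1,
  "data": {
    "type": "invoice",
    "data": {
      "supplier_name": "Globex Ltd",
      "invoice_number": "INV-4471",
      "invoice_date": "2024-10-11",
      "currency": "EUR",
      "subtotal": 1000.00,
      "tax_amount": 190.00,
      "total": 1190.00,
      "line_items": [
        { "description": "Design retainer", "quantity": 2, "unit_price": 500.00, "tax_rate": 19, "amount": 1000.00 }
      ]
    }
  }
}
"""


def totals_are_consistent(invoice: dict, tol: float = 0.01) -> bool:
    """Check line items against the subtotal, and subtotal + tax against the total."""
    items_sum = sum(item["amount"] for item in invoice["line_items"])
    return math.isclose(items_sum, invoice["subtotal"], abs_tol=tol) and math.isclose(
        invoice["subtotal"] + invoice["tax_amount"], invoice["total"], abs_tol=tol
    )


response = json.loads(SAMPLE)
envelope = response["data"]   # {"type": ..., "data": {...}}
invoice = envelope["data"]    # the typed fields
print(envelope["type"], totals_are_consistent(invoice))  # prints: invoice True
```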

A stable, typed schema

Every response is `{ type, pages, billedPages, data }`, where `data` is `{ type, data }` — the inner `type` is the classified document type (`invoice`, `bank_statement`, `receipt`, or `mixed`) and the inner `data` holds the typed fields. The contract is versioned under `/api/v1`, so you can build against it with confidence.

Document type	Key fields	Array
invoice / receipt	supplier_name, invoice_number, invoice_date, subtotal, tax_amount, total	line_items[]
bank_statement	bank_name, account_holder, currency, opening_balance, closing_balance	transactions[]
mixed	both an invoice object and a bank_statement object	both
all	raw_table { columns, rows } — original layout preserved 1:1	—
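One way to consume this envelope is to branch on the inner `type` and pick up the matching array. The sketch below is a minimal illustration, not part of the API: the function name `rows_for` is hypothetical, and for `mixed` documents it assumes the two objects sit under keys named `invoice` and `bank_statement`, which the table implies but does not spell out.

```python
def rows_for(response: dict) -> dict:
    """Return the row arrays found in an extract response, keyed by array name."""
    envelope = response["data"]
    doc_type, fields = envelope["type"], envelope["data"]

    if doc_type in ("invoice", "receipt"):
        return {"line_items": fields["line_items"]}
    if doc_type == "bank_statement":
        return {"transactions": fields["transactions"]}
    if doc_type == "mixed":
        # Assumption: key names "invoice" and "bank_statement" for the two objects.
        return {
            "line_items": fields["invoice"]["line_items"],
            "transactions": fields["bank_statement"]["transactions"],
        }
    raise ValueError(f"unexpected document type: {doc_type!r}")


statement = {
    "type": "bank_statement",
    "pages": 1,
    "billedPages": 1,
    "data": {"type": "bank_statement", "data": {"transactions": [{"amount": -45.10}]}},
}
print(rows_for(statement))
```

Raising on an unknown type keeps a future document type from being silently treated as an invoice.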

Why not regex or templates?

Template-based PDF parsing breaks the moment a vendor changes its layout, a bank tweaks a column, or a customer uploads a format you've never seen. You end up maintaining a growing library of brittle rules. An AI extraction API generalises across layouts instead: it understands *what a field means*, not *where it sits in pixels*, so a new supplier or bank works on day one without a code change.

Raw text extraction (e.g. pdf-to-text plus regex) has the opposite problem — it gives you a wall of text with the table structure destroyed, so multi-column statements and line-item tables become guesswork. FlowParse keeps structure by reading table geometry directly and reconstructing rows, which is why amounts line up with the right dates and descriptions.
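To make the contrast concrete, here is a toy illustration of why position matters. It is not FlowParse's implementation; the column names, the left-edge coordinates and the word positions are all invented for the example. A whitespace split cannot tell a debit from a credit when one cell is empty, whereas assigning each word to a column by its x-position can.

```python
from bisect import bisect_right

# Hypothetical statement layout: column names and their left edges in points.
COLUMNS = ["date", "description", "debit", "credit", "balance"]
LEFT_EDGES = [0, 80, 300, 380, 460]


def row_from_words(words: list) -> dict:
    """Place (x, text) pairs into columns by x-position; empty cells stay empty."""
    row = {name: "" for name in COLUMNS}
    for x, text in words:
        idx = max(0, bisect_right(LEFT_EDGES, x) - 1)  # last column starting at or before x
        name = COLUMNS[idx]
        row[name] = f"{row[name]} {text}".strip()
    return row


# Flat text: four tokens, and nothing says whether 1000.00 is a debit or a credit.
flat = "2024-10-11 Salary 1000.00 5000.00".split()

# The same rows with (made-up) x-positions attached to each word.
salary = [(0, "2024-10-11"), (80, "Salary"), (385, "1000.00"), (465, "5000.00")]
rent = [(0, "2024-10-12"), (80, "Rent"), (305, "800.00"), (465, "4200.00")]

print(row_from_words(salary)["credit"])  # prints: 1000.00
print(row_from_words(rent)["debit"])     # prints: 800.00
```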

Validate the JSON before you trust it

Structured doesn't automatically mean correct, so pair extraction with `POST /api/v1/validate`. It returns a 0–100 quality score, a grade and concrete checks — totals that don't add up, tax math that's off, balance breaks on statements, duplicate or out-of-order rows. Use the grade to auto-accept clean documents and queue only the doubtful ones for review.

POST /api/v1/validate

curl -X POST https://flowparse.io/api/v1/validate \
  -H "Authorization: Bearer pf_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{ "type": "invoice", "data": { ... } }'
# → { "validations": [ { "score": { "value": 98, "grade": "A" }, "checks": [ ... ] } ] }
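The accept-or-review decision can be a few lines of code. The sketch below assumes the response shape shown in the comment above (`validations[].score.value`); the function name `decide` and the threshold of 90 are illustrative choices, not values the API prescribes.

```python
def decide(validate_response: dict, min_score: int = 90) -> str:
    """Return "accept" if every validation meets min_score, otherwise "review"."""
    validations = validate_response.get("validations", [])
    if not validations:
        return "review"  # nothing was scored, so do not auto-accept
    lowest = min(v["score"]["value"] for v in validations)
    return "accept" if lowest >= min_score else "review"


clean = {"validations": [{"score": {"value": 98, "grade": "A"}, "checks": []}]}
print(decide(clean))  # prints: accept
```

Taking the lowest score means a multi-document response is only auto-accepted when every part of it is clean.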

From JSON to files

When the destination is a spreadsheet or accounting system rather than your database, `POST /api/v1/export` turns the JSON into XLSX, CSV, XML, QuickBooks/Quicken bank feeds (`.QBO`/`.QFX`/`.OFX`), Xero, DATEV or 1С — returned as a base64 file. Previews are free, so you can confirm the column mapping before you generate the billed file. This is the same export engine described on PDF to QBO and bank statement to Xero.
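Because the export comes back as base64, the caller has to decode it before writing it to disk. The sketch below covers only that decoding step; it assumes you have already pulled the base64 string out of the response, since the name of the response field is not stated on this page. The helper name `decode_export` is hypothetical.

```python
import base64
import binascii


def decode_export(b64_payload: str) -> bytes:
    """Decode a base64 export payload, rejecting malformed input instead of guessing."""
    try:
        return base64.b64decode(b64_payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("export payload is not valid base64") from exc


# Stand-in payload for the example: a two-line CSV, base64-encoded locally.
payload = base64.b64encode(b"date,amount\n2024-10-11,1190.00\n").decode("ascii")
print(decode_export(payload).decode("utf-8"))
```

From there, `pathlib.Path("statement.csv").write_bytes(decode_export(payload))` would save the file; the filename is only an example.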

A clean integration pattern

Authenticate

Create a key in the dashboard and send it as `Authorization: Bearer pf_live_…`.

Extract

POST the base64 PDF to /api/v1/extract and read the typed JSON `data`.

Validate

Send `data` to /api/v1/validate; branch on the score/grade to auto-accept or review.

Persist or export

Store the fields you need, or call /api/v1/export to produce the file your workflow expects.

Handle 429

If the page budget is exhausted you get a 429 — top up and retry; no unbilled data is returned.
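The steps above can be sketched as one small function. To keep the example runnable without a network or a key, the HTTP layer is injected as a `post` callable; in a real integration it would wrap an HTTP client that adds the `Authorization` header. The names `process_pdf`, `Post` and `fake_post`, the canned responses and the threshold of 90 are all assumptions for illustration.

```python
import base64
from typing import Callable

# (endpoint path, JSON body) -> parsed JSON response.
Post = Callable[[str, dict], dict]


def process_pdf(pdf_bytes: bytes, filename: str, post: Post, min_score: int = 90) -> dict:
    """Extract, validate, and label a document as accepted or needing review."""
    body = {"file": base64.b64encode(pdf_bytes).decode("ascii"), "filename": filename}
    extracted = post("/api/v1/extract", body)
    document = extracted["data"]  # {"type": ..., "data": {...}}, passed on unchanged

    report = post("/api/v1/validate", document)
    scores = [v["score"]["value"] for v in report["validations"]]
    status = "accepted" if scores and min(scores) >= min_score else "needs_review"
    return {"status": status, "document": document, "billed_pages": extracted["billedPages"]}


def fake_post(path: str, body: dict) -> dict:
    """Stand-in transport returning canned responses shaped like the examples on this page."""
    if path == "/api/v1/extract":
        return {"type": "invoice", "pages": 1, "billedPages": 1,
                "data": {"type": "invoice", "data": {"total": 1190.00}}}
    return {"validations": [{"score": {"value": 98, "grade": "A"}, "checks": []}]}


result = process_pdf(b"%PDF-1.7\n", "invoice-4471.pdf", fake_post)
print(result["status"])  # prints: accepted
```

Injecting the transport also makes the flow easy to unit-test, and gives one place to add 429 and 503 handling later.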

Pricing and rate limits

Extraction and file exports are billed per page from your page balance; validation and previews are free. A request that would exceed your balance returns `429` with the exact shortfall, so spend is always predictable. Plans and the per-page rate are on the pricing page; usage per key is tracked in the dashboard, and you can rotate keys without downtime.

Start free: build the entire flow against validation and previews at no cost, then enable billed extraction when you go live. The complete reference — every field, every format, error codes and live examples — is in the API docs, with an interactive playground for quick experiments.

Capability	Endpoint	Billing
PDF → JSON	POST /api/v1/extract	Per page
Validate JSON	POST /api/v1/validate	Free
Export to file	POST /api/v1/export	Per page (preview free)
Merge documents	POST /api/v1/merge	Per page (preview free)

Call it from any language

Because the API is plain JSON over HTTPS, there's no SDK to install and no language lock-in: anything that can make a POST request can use it. The pattern is identical everywhere: read the file as binary, base64-encode it, send it in the `file` field, and parse the JSON response. In Node you'd `readFileSync(path).toString('base64')`; in Python `base64.b64encode(open(path,'rb').read()).decode()`; in Ruby `Base64.strict_encode64(File.binread(path))`; in Go `base64.StdEncoding.EncodeToString(bytes)`. The request body and the response schema don't change.

That uniformity makes the API easy to wrap in your own thin client: one function `extract(file) → data`, one `validate(data) → score`, one `export(data, format) → file`. Because the schema is stable and versioned under `/api/v1`, that wrapper stays valid as your product grows. The API docs include copy-paste curl for every endpoint, and the playground lets you fire a real request from the browser before you write a line of code.

Python

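import base64
import os

import requests

# Read the key from the environment; FLOWPARSE_API_KEY is a name chosen for this example.
KEY = os.environ["FLOWPARSE_API_KEY"]

# Base64-encode the PDF so it can travel inside a JSON body.
with open("invoice-4471.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode()

r = requests.post(
    "https://flowparse.io/api/v1/extract",
    headers={"Authorization": "Bearer " + KEY},
    json={"file": file_b64, "filename": "invoice-4471.pdf"},
    timeout=60,
)
r.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
data = r.json()["data"]  # {"type": ..., "data": {...}}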

Status codes, limits and retries

The API uses standard HTTP status codes and never returns unbilled data, so error handling is straightforward. A `200` carries the structured JSON. A `400` means the request body was wrong (a missing or invalid `file`, or bad base64) and should be fixed rather than retried. A `401` is an invalid or missing key. A `422` means the document couldn't be read or had nothing extractable (a blank image, a non-financial file); the file isn't billed, so re-scan at higher quality or send the original PDF. A `429` means your page budget is exhausted, with the exact pages-needed-versus-available in the message, and a `503` is a transient issue to retry with backoff.

Files can be up to 20 MB per request and multiple pages are fully supported. For very large batches, extract documents individually — ideally in parallel across workers with a sensible concurrency cap — rather than trying to send everything in one call. Billing is per page from your page balance; validation and previews are free, so you can build and test the whole pipeline at zero cost and only switch on billed extraction when you go live. Plans and the per-page rate are on the pricing page.

Code	Meaning	Action
200	Success — JSON returned	Process the data
400	Bad request (file/base64)	Fix the request body
401	Invalid or missing key	Check Authorization / rotate key
422	Unreadable / nothing extractable (not billed)	Re-scan or send the original PDF
429	Page budget exhausted	Top up or upgrade, then retry
503	Temporarily unavailable	Retry with backoff
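The table translates directly into a dispatch function plus a backoff schedule for the one retryable code. This is a minimal sketch: the action labels, the function names and the 1-second base with a 30-second cap are illustrative assumptions, not values the API specifies.

```python
# Status code -> what the caller should do next, mirroring the table above.
ACTIONS = {
    200: "process",
    400: "fix_request",
    401: "check_key",
    422: "rescan",
    429: "top_up",
    503: "retry",
}


def next_action(status: int) -> str:
    """Map an HTTP status to an action; anything unlisted is treated as a failure."""
    return ACTIONS.get(status, "fail")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


print(next_action(503), backoff_delays(4))  # prints: retry [1.0, 2.0, 4.0, 8.0]
```

Only `503` is retried automatically here; `429` needs a top-up first, and `400`, `401` and `422` will fail the same way on every retry until the input or key changes. In production, adding random jitter to each delay helps avoid many workers retrying at the same instant.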

What teams build with it

A reliable PDF-to-JSON step unlocks a lot of automation that was previously stuck behind manual data entry. Accounts-payable teams extract supplier invoices straight into their ledger; expense tools turn receipt photos into line items; lenders and finance teams pull statement transactions for cash-flow and affordability analysis; and vertical SaaS products embed document capture so their own customers upload a PDF and get structured data inside the app. In each case the value isn't just the JSON — it's removing a human from the loop without losing accuracy.

Accounts payable

Invoices to ledger rows, validated, with no manual keying.

Expense & receipts

Receipt photos to typed line items for reports and reimbursement.

Finance & lending

Statement transactions for cash-flow, affordability and underwriting input.

Vertical SaaS

Embed capture so customers upload a PDF and your app gets structured data.

How structure and accuracy are preserved

The hardest part of turning a PDF into JSON is not reading the characters — it is reconstructing the *structure* so values land in the right fields. A line-item table and a transaction table are both grids, and a flat text dump destroys the grid: you lose which number is a quantity, which is a unit price, and which column a date belongs to. FlowParse reads the PDF's own text geometry to rebuild those grids cell by cell, so a unit price stays a unit price and a balance stays a balance, even across page breaks and multi-line descriptions.

On top of structure sits typing. Amounts are parsed to real numbers (no stray currency symbols or thousands separators), dates are normalised to ISO-8601, and bank debit/credit columns collapse into a single signed amount. Where the source is ambiguous — a faint scan, an unusual layout — the engine records lower confidence rather than guessing silently, and the deterministic checks behind `/api/v1/validate` give you a score you can act on. The original, untouched values always remain in `raw_table`, so you can reconcile the typed output against exactly what was printed whenever you need an audit trail.
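To show what this typing step means in practice, here is a deliberately simple sketch. It is not the engine's implementation. It assumes English-style number formatting (comma for thousands, dot for decimals), a known day/month/year date format, and a sign convention where debits are negative; the text above says only that debit and credit collapse into one signed amount. All three function names are hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Strip currency symbols and thousands separators, keeping digits, '.' and '-'."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ".-")
    return Decimal(cleaned)


def signed_amount(debit: str, credit: str) -> Decimal:
    """Collapse separate debit/credit cells into one value (assumed: debits negative)."""
    if debit.strip():
        return -parse_amount(debit)
    return parse_amount(credit)


def to_iso(text: str, fmt: str = "%d/%m/%Y") -> str:
    """Normalise a printed date to ISO-8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, fmt).date().isoformat()


print(parse_amount("€1,190.00"))    # prints: 1190.00
print(signed_amount("45.10", ""))   # prints: -45.10
print(to_iso("11/10/2024"))         # prints: 2024-10-11
```

`Decimal` is used here so that cents survive exactly; a real parser would also need to handle locale differences and accounting-style negatives such as `(45.10)`, which this sketch does not.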

This is why the JSON is safe to insert directly: it is not a best-effort text scrape but a structured, typed, checkable representation of the document. The same discipline applies whether the input is an invoice, a receipt or a statement, which is what lets one schema serve every financial PDF your product encounters.

Best practices for a reliable integration

A few habits make a PDF-to-JSON integration dependable in production. First, always validate before you trust — send the extracted `data` to `/api/v1/validate` and branch on the grade, so clean documents flow straight through and only doubtful ones reach a human. Second, prefer the original digital PDF over a photo or screenshot whenever one exists; the text layer is read exactly, while images must be OCR-ed. Third, persist the validation score and `billedPages` with each record, so you have a queryable history of what was processed, what it cost, and what was auto-accepted.

Operationally, run extraction behind a queue with a capped pool of workers rather than inline on a request, make each job idempotent on a file hash so retries don't duplicate records, and handle `429`/`503` with top-up-and-resume and exponential backoff respectively. Store only the fields you need, drop `raw_table` if you don't use it, and keep PII out of logs. None of this is exotic — it's the same shape as any robust third-party integration — and it turns a single call into an unattended pipeline. The guide to parsing bank statements with an API shows the full pattern with code.
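Two of these habits, idempotency on a file hash and a capped worker pool, fit in a short sketch. The names `file_key` and `process_batch` are hypothetical, the in-memory `seen` set stands in for whatever durable store you use, and `extract` is any callable that takes PDF bytes and returns the parsed result.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_key(pdf_bytes: bytes) -> str:
    """Idempotency key: the SHA-256 of the file contents."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def process_batch(files: list, extract, seen: set, max_workers: int = 4) -> dict:
    """Extract each not-yet-seen file once, with at most `max_workers` in flight."""
    pending = {}
    for pdf in files:
        key = file_key(pdf)
        if key not in seen and key not in pending:
            pending[key] = pdf  # duplicates within the batch collapse onto one key

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the pending keys.
        results = dict(zip(pending, pool.map(extract, pending.values())))

    seen.update(results)  # record keys only after extraction succeeded
    return results


calls = []

def fake_extract(pdf: bytes) -> dict:
    """Stand-in for the real extract call; records each invocation."""
    calls.append(pdf)
    return {"size": len(pdf)}

seen = set()
first = process_batch([b"a", b"bb", b"a"], fake_extract, seen)
second = process_batch([b"a", b"bb"], fake_extract, seen)
print(len(first), len(second), len(calls))  # prints: 2 0 2
```

If `extract` raises, the exception propagates and `seen` is left untouched, so the whole batch can be retried safely.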

Pick the right entry point

If your documents are mostly statements, start with the bank statement API. For scanned files, the bank statement OCR API explains the OCR path. For invoices and receipts at scale, see the document extraction API. And to wire a full integration end to end — extract, validate, export and reconcile — follow the guide to parsing bank statements with an API.

Convert your first PDF to JSON now

Grab a key, POST a PDF to /api/v1/extract, and get a stable, typed JSON schema back — ready to validate, store or export.
