Free tool

Statement Format Validator — know what you're dealing with

Drop a PDF and we'll tell you what kind it is (digital with a text layer, scanned image-only, or password-protected), how many pages it has, and which bank it probably came from. Useful before you commit to any conversion workflow.

We inspect the PDF using pdf.js in your browser. The file never leaves your device.

Runs in your browser
No upload, no signup. Your file never leaves your device.
Free and unmetered
Use it as often as you need. No daily quota, no credit card.
EU-built, GDPR-first
Hosted in Frankfurt. Built by a small EU team that takes privacy seriously.

Why pre-flight a PDF at all?

Bank statement PDFs look uniform from the outside but differ wildly in how they're built. A "PDF" can be: a text-based document with a full text layer (the easy case), a scanned image with no extractable text, an AES-encrypted file requiring a password, or any combination of the three. The conversion path that works for one fails on another.

Most people only discover what kind of PDF they have after dragging it into a converter and getting confused output: empty rows (scanned PDF run through a text extractor), a wall of 0.00 values (locale mismatch), or a flat "couldn't read" error (encryption). This validator tells you up front, so you can pick the right tool first.

Once you know what you have, drop it into our bank statement converter — it auto-detects the format and switches between text extraction and AI vision-based OCR transparently. For encrypted PDFs, use our free PDF password remover first.

What the validator checks

  1. Encryption. If the PDF requires a password to open, we report it immediately. No further inspection is possible until it's unlocked.
  2. Page count. Useful for quota planning: at 7 free pages a day, a 60-page statement takes nine days on the free tier, but is a 1-minute job on our €19 Starter plan with 500 pages a month.
  3. Text layer presence. We sample the first few pages and measure the average number of extractable text characters per page. Above 80 characters per page, we call the PDF digital; below that, it is probably scanned.
  4. Bank guess. We look for known bank brand keywords (HDFC, Lloyds, BNP Paribas, Chase, etc.) in the extracted text. This is a hint, not authoritative — white-label or business statements often don't carry the bank brand on every page.
  5. Size hints. Anything over 60 pages prompts a warning about the free tier; anything over 50 MB is rejected outright (almost certainly a high-DPI scan).
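
The page-count and size checks above come down to simple arithmetic. Here is a minimal sketch, assuming the figures quoted in the list (7 free pages a day, a warning above 60 pages, rejection above 50 MB). The names `sizeHints` and `SizeReport` are hypothetical, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```typescript
// Figures quoted in the checklist above; constant names are illustrative.
const FREE_PAGES_PER_DAY = 7;
const PAGE_WARNING_THRESHOLD = 60;
const MAX_FILE_BYTES = 50 * 1024 * 1024; // assumption: "50 MB" read as 50 MiB

interface SizeReport {
  rejected: boolean;     // file is over the size limit
  warnFreeTier: boolean; // page count is over the warning threshold
  freeTierDays: number;  // days needed at the free daily quota
}

function sizeHints(pageCount: number, fileBytes: number): SizeReport {
  return {
    rejected: fileBytes > MAX_FILE_BYTES,
    warnFreeTier: pageCount > PAGE_WARNING_THRESHOLD,
    // A partial day still counts as a day, so round up.
    freeTierDays: Math.ceil(pageCount / FREE_PAGES_PER_DAY),
  };
}
```

With these assumptions, a 60-page statement needs `Math.ceil(60 / 7)` = 9 days on the free tier and does not yet trigger the warning, because the threshold is "over 60".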

Digital vs scanned — how to tell, and why it matters

A digital PDF is one where you can drag your mouse and highlight text. A scanned PDF is a picture of a statement — open it in any viewer and try selecting text; nothing happens. The validator does the equivalent check programmatically by asking pdf.js for the text content of each page.
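
The programmatic check could look roughly like the sketch below. It is not the validator's actual code. The `DocLike` and `PageLike` interfaces are hypothetical stand-ins that mirror the parts of the pdf.js API used here (`numPages`, `getPage`, `getTextContent`). The sample size of 3 pages and the choice to ignore whitespace-only items are assumptions.

```typescript
// Minimal structural types mirroring the pdf.js objects used below.
interface TextItemLike { str?: string }
interface PageLike { getTextContent(): Promise<{ items: TextItemLike[] }> }
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

const CHARS_PER_PAGE_THRESHOLD = 80; // threshold quoted on this page
const SAMPLE_PAGES = 3;              // assumption for "the first few pages"

async function averageCharsPerPage(
  doc: DocLike,
  samplePages: number = SAMPLE_PAGES,
): Promise<number> {
  const pagesToRead = Math.min(samplePages, doc.numPages);
  if (pagesToRead === 0) return 0;
  let total = 0;
  for (let n = 1; n <= pagesToRead; n++) { // pdf.js page numbers start at 1
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    for (const item of content.items) {
      // Items without a string, or with only whitespace, add nothing.
      total += (item.str ?? "").trim().length;
    }
  }
  return total / pagesToRead;
}

function classify(avgChars: number): "digital" | "probably scanned" {
  return avgChars > CHARS_PER_PAGE_THRESHOLD ? "digital" : "probably scanned";
}
```

In pdf.js, a document object of this shape is typically obtained from `getDocument(...).promise`. Passing it to `averageCharsPerPage` and then to `classify` reproduces the "can you select text?" test without opening a viewer.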

The distinction matters because:

  • Digital PDFs convert in 5–15 seconds with near-perfect accuracy. Text extraction is deterministic; there's nothing to guess.
  • Scanned PDFs need vision-based OCR — much slower (20–60 seconds), and accuracy depends on scan quality. Our converter uses Gemini's vision model with a reconciliation pass to catch OCR digit errors before they ship.
  • Mixed PDFs (e.g. a digital cover page + scanned transaction list) get treated as scanned overall, because the high-value rows are in the scanned section.
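
The routing rule in the list above can be sketched as follows. This is an illustration only: the page does not describe how the converter detects mixed documents, so the per-page classification here is an assumption, and `pageKind`, `documentKind`, and `conversionPath` are hypothetical names.

```typescript
type PageKind = "digital" | "scanned";
type DocKind = "digital" | "scanned" | "mixed";
type ConversionPath = "text-extraction" | "vision-ocr";

const CHARS_THRESHOLD = 80; // same per-page threshold the validator quotes

function pageKind(chars: number): PageKind {
  return chars > CHARS_THRESHOLD ? "digital" : "scanned";
}

function documentKind(charsPerPage: number[]): DocKind {
  const kinds = new Set(charsPerPage.map(pageKind));
  if (kinds.size === 2) return "mixed";
  // An empty page list falls through to "scanned", the cautious default.
  return kinds.has("digital") ? "digital" : "scanned";
}

function conversionPath(kind: DocKind): ConversionPath {
  // Mixed documents take the OCR path, as described in the list above.
  return kind === "digital" ? "text-extraction" : "vision-ocr";
}
```

For example, a statement with a 450-character cover page followed by two pages of 12 and 8 characters comes out as "mixed" and is routed to the OCR path.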

When the validator's hints can mislead you

A few edge cases the simple heuristics miss:

  • Vector-rendered scans. Some bank PDFs are scanned, then "re-paginated" by software that overlays an invisible text layer. The validator will call this "digital" — which is fine for conversion (the text layer is usable) but the text may be misaligned with what you see visually.
  • Image-heavy digital PDFs. Statements with embedded logos and charts can push the average chars/page down. If the validator says "scanned" but you can highlight text in a viewer, ignore the hint and treat it as digital.
  • Bank guess collisions. "Chase" appears in non-Chase statements when transactions reference a Chase account elsewhere. The guess is heuristic, not authoritative.
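
To see why a collision happens, consider a minimal keyword matcher. This is a sketch, not the validator's actual matching logic. The keyword list contains only the four brands named on this page, and ranking by mention count is an assumption added for illustration.

```typescript
// Only the brands named on this page; a real list would be longer.
const BANK_KEYWORDS = ["HDFC", "Lloyds", "BNP Paribas", "Chase"];

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function countMentions(text: string, keyword: string): number {
  // Whole-word, case-insensitive match, so "purchase" does not count as "Chase".
  const pattern = new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "gi");
  return (text.match(pattern) ?? []).length;
}

function guessBanks(text: string): { bank: string; mentions: number }[] {
  return BANK_KEYWORDS
    .map((bank) => ({ bank, mentions: countMentions(text, bank) }))
    .filter((hit) => hit.mentions > 0)
    .sort((a, b) => b.mentions - a.mentions); // most-mentioned brand first
}
```

Given the text "Lloyds Bank statement. Lloyds account 1234. Transfer to Chase account 5678.", both Lloyds and Chase match. Ranking by mentions puts Lloyds first here, but a statement with many transfers to a Chase account could still rank Chase on top, which is why the guess is only a hint.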

FAQ

Is the PDF uploaded anywhere?
No. The validator uses pdf.js running in your browser. The file is never sent to our servers or anyone else's. You can verify this in the network tab — no upload requests fire while the tool runs.
How accurate is the digital-vs-scanned classification?
It is reliable for typical statements. We sample the first few pages and measure the average number of text characters per page. Anything under 80 characters per page strongly suggests image-only content (a normal statement page has 300–800 characters). The edge cases, vector-rendered scans and image-heavy digital PDFs, are rare, and the converter handles both paths anyway.
Why does it sometimes say 'looks like Chase' when it isn't Chase?
The bank-guess feature looks for brand names in the extracted text. If your statement mentions another bank in a transaction description (e.g. a transfer to/from a different account), the heuristic can trigger. Treat the guess as a hint, not a fact.
What if the PDF is encrypted?
The validator reports it as encrypted and stops there — we can't inspect the contents until it's unlocked. Use our PDF password remover at /tools/unlock-pdf to get an unencrypted copy first.
Can this detect a fake or tampered statement?
Not directly — that requires forensic PDF analysis (font consistency, object stream patterns, modification timestamps). The validator is a format triage tool, not a forensics tool. If you need fraud detection, please reach out via /contact and we'll discuss what's possible.
Does it work on credit-card statements?
Yes — the heuristics work on any PDF financial document. Page count, encryption, and text-layer detection are format-agnostic.
