Why pre-flight a PDF at all?
Bank statement PDFs look uniform from the outside but differ wildly in how they're built. A "PDF" can be a text-based document with a full text layer (the easy case), a scanned image with no extractable text, an AES-encrypted file that needs a password, or a combination of these. The conversion path that works for one fails on another.
Most people only discover what kind of PDF they have after dragging it into a converter and getting confused output: empty rows (scanned PDF run through a text extractor), a wall of 0.00 values (locale mismatch), or a flat "couldn't read" error (encryption). This validator tells you up front, so you can pick the right tool first.
Once you know what you have, drop it into our bank statement converter — it auto-detects the format and switches between text extraction and AI vision-based OCR transparently. For encrypted PDFs, use our free PDF password remover first.
What the validator checks
- Encryption. If the PDF requires a password to open, we report it immediately. No further inspection is possible until it's unlocked.
- Page count. Useful for quota planning: at 7 free pages a day, a 60-page statement takes nine days on the free tier, but is a 1-minute job on our €19 Starter plan with 500 pages a month.
- Text layer presence. We sample the first few pages and count the extractable text characters on each. An average above 80 characters per page indicates a digital PDF; at or below that, the PDF is probably scanned.
- Bank guess. We look for known bank brand keywords (HDFC, Lloyds, BNP Paribas, Chase, etc.) in the extracted text. This is a hint, not authoritative — white-label or business statements often don't carry the bank brand on every page.
- Size hints. Anything over 60 pages prompts a warning about the free tier; anything over 50 MB is rejected outright (almost certainly a high-DPI scan).
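The checks above can be sketched as a single decision function. This is an illustrative sketch, not the validator's actual code: the names (`PdfFacts`, `preflight`), the order of the checks, and the choice of 1 MB = 1024 × 1024 bytes are assumptions. Only the thresholds (80 characters per page, 60 pages, 50 MB, 7 free pages a day) come from the text.

```typescript
// Thresholds quoted in the checklist above. All names here are illustrative.
const MIN_AVG_CHARS_PER_PAGE = 80;
const FREE_TIER_WARN_PAGES = 60;
const FREE_PAGES_PER_DAY = 7;
const MAX_FILE_BYTES = 50 * 1024 * 1024; // assumption: 1 MB = 1024 * 1024 bytes

interface PdfFacts {
  fileBytes: number;
  encrypted: boolean;
  pageCount: number;
  sampledCharsPerPage: number[]; // text characters found on each sampled page
}

type Verdict =
  | { status: "rejected"; reason: string }
  | { status: "encrypted" }
  | { status: "ok"; kind: "digital" | "scanned"; freeTierDays: number; warnings: string[] };

function preflight(facts: PdfFacts): Verdict {
  // Assumed order: file size can be checked without opening the PDF.
  if (facts.fileBytes > MAX_FILE_BYTES) {
    return { status: "rejected", reason: "file is larger than 50 MB" };
  }
  // An encrypted file cannot be inspected further until it is unlocked.
  if (facts.encrypted) {
    return { status: "encrypted" };
  }

  // Average characters per sampled page decides digital vs scanned.
  const samples = facts.sampledCharsPerPage;
  const total = samples.reduce((sum, n) => sum + n, 0);
  const average = samples.length > 0 ? total / samples.length : 0;
  const kind = average > MIN_AVG_CHARS_PER_PAGE ? "digital" : "scanned";

  // Days needed on the free tier, rounded up to whole days.
  const freeTierDays = Math.ceil(facts.pageCount / FREE_PAGES_PER_DAY);
  const warnings: string[] = [];
  if (facts.pageCount > FREE_TIER_WARN_PAGES) {
    warnings.push(`about ${freeTierDays} days on the free tier`);
  }
  return { status: "ok", kind, freeTierDays, warnings };
}
```

Under these assumptions, a 60-page statement with plenty of text on the sampled pages comes back as `ok` and `digital` with `freeTierDays` of 9 and no warning, since the warning only fires above 60 pages.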
Digital vs scanned — how to tell, and why it matters
A digital PDF is one where you can drag your mouse and highlight text. A scanned PDF is a picture of a statement — open it in any viewer and try selecting text; nothing happens. The validator does the equivalent check programmatically by asking pdf.js for the text content of each page.
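As a rough sketch of that programmatic check: pdf.js exposes `getPage(n)` on a loaded document and `getTextContent()` on a page, which resolves to an object whose `items` carry the extracted strings in a `str` field. The small interfaces below mirror only that part of the API so the example stays self-contained. The function names, the default of three sampled pages, and the decision to ignore whitespace are assumptions, not the validator's actual implementation.

```typescript
// Minimal structural types mirroring the slice of pdf.js used here.
// A real PDFDocumentProxy / PDFPageProxy from pdf.js fits these shapes.
interface TextContentLike {
  items: Array<{ str?: string }>; // marked-content items carry no `str`
}
interface PageLike {
  getTextContent(): Promise<TextContentLike>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>; // page numbers are 1-based
}

// Count non-whitespace characters in one page's text content.
// Ignoring whitespace is an assumption made for this sketch.
function countVisibleChars(content: TextContentLike): number {
  let count = 0;
  for (const item of content.items) {
    count += (item.str ?? "").replace(/\s/g, "").length;
  }
  return count;
}

// Sample the first few pages and return one character count per page.
async function sampleCharsPerPage(doc: DocLike, maxPages = 3): Promise<number[]> {
  const lastPage = Math.min(doc.numPages, maxPages);
  const counts: number[] = [];
  for (let pageNumber = 1; pageNumber <= lastPage; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    counts.push(countVisibleChars(await page.getTextContent()));
  }
  return counts;
}
```

A scanned page typically yields an empty `items` array, so its count is 0, while a digital statement page usually yields hundreds of characters.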
The distinction matters because:
- Digital PDFs convert in 5–15 seconds with near-perfect accuracy. Text extraction is deterministic; there's nothing to guess.
- Scanned PDFs need vision-based OCR — much slower (20–60 seconds), and accuracy depends on scan quality. Our converter uses Gemini's vision model with a reconciliation pass to catch OCR digit errors before they ship.
- Mixed PDFs (e.g. a digital cover page + scanned transaction list) get treated as scanned overall, because the high-value rows are in the scanned section.
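The mixed-PDF rule in the last bullet can be illustrated with a small per-page classifier. This is a sketch under assumptions: classifying every page individually, the function names, and the treatment of an empty document as scanned are all hypothetical. The 80-character threshold and the rule that mixed documents take the OCR path come from the text.

```typescript
type PageKind = "digital" | "scanned";
type DocumentKind = "digital" | "scanned" | "mixed";
type ConversionPath = "text-extraction" | "vision-ocr";

// Label each page by its character count (threshold from the validator's text-layer check).
function classifyPages(charsPerPage: number[], minChars = 80): PageKind[] {
  return charsPerPage.map((n) => (n > minChars ? "digital" : "scanned"));
}

// Summarise the document: all digital, all scanned, or a mix of both.
function documentKind(pages: PageKind[]): DocumentKind {
  const scannedCount = pages.filter((kind) => kind === "scanned").length;
  // Assumption: a document with no pages is handled conservatively as scanned.
  if (pages.length === 0 || scannedCount === pages.length) {
    return "scanned";
  }
  return scannedCount === 0 ? "digital" : "mixed";
}

// Mixed documents take the OCR path, because the transaction rows
// usually sit in the scanned pages.
function conversionPath(kind: DocumentKind): ConversionPath {
  return kind === "digital" ? "text-extraction" : "vision-ocr";
}
```

For example, a digital cover page followed by two scanned pages (`[1500, 0, 3]` characters) is labelled `mixed` and routed to `vision-ocr`.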
When the validator's hints can mislead you
A few edge cases the simple heuristics miss:
- Vector-rendered scans. Some bank PDFs are scanned, then "re-paginated" by software that overlays an invisible text layer. The validator will call this "digital". That is fine for conversion because the text layer is usable, but the extracted text may not line up with what you see on the page.
- Image-heavy digital PDFs. Statements with embedded logos and charts can push the average chars/page down. If the validator says "scanned" but you can highlight text in a viewer, ignore the hint and treat it as digital.
- Bank guess collisions. "Chase" appears in non-Chase statements when transactions reference a Chase account elsewhere. The guess is heuristic, not authoritative.
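The collision case is easy to reproduce with a minimal keyword matcher. The sketch below is hypothetical: the keyword list contains only the four brands named earlier, and the function name, whole-word matching, and ranking by hit count are assumptions rather than the validator's real logic.

```typescript
// Only the brands named in the text; a real list would be longer.
const KNOWN_BANKS = ["HDFC", "Lloyds", "BNP Paribas", "Chase"];

interface BankGuess {
  bank: string;
  hits: number;
}

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count case-insensitive whole-word matches per brand, most frequent first.
function guessBank(text: string): BankGuess[] {
  const guesses: BankGuess[] = [];
  for (const bank of KNOWN_BANKS) {
    const pattern = new RegExp(`\\b${escapeRegExp(bank)}\\b`, "gi");
    const hits = (text.match(pattern) ?? []).length;
    if (hits > 0) {
      guesses.push({ bank, hits });
    }
  }
  return guesses.sort((a, b) => b.hits - a.hits);
}

// A Lloyds statement whose transactions mention Chase several times.
const sample =
  "Lloyds Bank plc statement. Transfer to Chase account. " +
  "Payment from CHASE. Chase card repayment.";
const ranked = guessBank(sample);
```

Here `ranked` puts Chase (3 hits) ahead of Lloyds (1 hit) even though the statement is from Lloyds, which is why the guess is best shown as a hint. Whole-word matching at least stops words like "purchase" from counting as a Chase hit.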
What to do next
- Digital PDF, English/EU locale: drop it straight into our main converter. Output arrives in about 10 seconds.
- Scanned PDF: same place — our pipeline switches to vision-based OCR automatically. Read how we handle scanned statements.
- Encrypted PDF: unlock with our free PDF password remover first, then convert.
- Long statement (over 60 pages): check the plan estimator, because the free tier won't cover it.
- To verify the output: run our reconciliation checker to confirm the totals match.