Show HN: I just built a scanned PDF text extractor for public PDFs (1-300 pages)

Posted by fagnerbrack | 42 minutes ago | 1 comment

fagnerbrack 41 minutes ago

For comparison: Claude only uses OCR for the first 100 pages, then falls back to text-only extraction. Here it's public URL in, HTML page out, with AI throughout, up to 300 pages (spartaaaaa!).

Conveniently, 300 pages is also roughly where the cost math stops working for a free tool. OCR on scanned PDFs is best-effort, and tables that span multiple pages are still a weak spot.
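
To make the page-limit comparison concrete, here is a rough sketch, not the tool's actual code. The names `plan_pages` and `PagePlan` are made up for illustration, and rejecting anything over 300 pages is an assumption about how the limit is enforced. It only shows which pages would get OCR and which would fall back to the embedded text layer under a given cutoff.

```python
from dataclasses import dataclass

MAX_PAGES = 300         # upper bound from the post title (1-300 pages)
CLAUDE_OCR_PAGES = 100  # OCR cutoff the post attributes to Claude


@dataclass
class PagePlan:
    """Hypothetical split of a document into OCR and text-only page ranges."""
    ocr_pages: range        # 1-based pages that get OCR
    text_only_pages: range  # 1-based pages that fall back to text-only extraction


def plan_pages(page_count: int, ocr_limit: int, max_pages: int = MAX_PAGES) -> PagePlan:
    # Assumption: documents outside 1..max_pages are rejected outright.
    if not 1 <= page_count <= max_pages:
        raise ValueError(f"page count {page_count} is outside 1-{max_pages}")
    ocr_end = min(page_count, ocr_limit)
    return PagePlan(
        ocr_pages=range(1, ocr_end + 1),
        text_only_pages=range(ocr_end + 1, page_count + 1),
    )


if __name__ == "__main__":
    # A 250-page scan under a 100-page OCR cutoff vs. OCR for every page.
    print(plan_pages(250, CLAUDE_OCR_PAGES))
    print(plan_pages(250, MAX_PAGES))
```

With a 100-page cutoff, a 250-page scan leaves pages 101-250 on the text-only path, which for a pure scan usually means little or no text comes back for those pages.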

Here's a link you can check:

https://people.math.harvard.edu/~ctm/home/text/others/shanno...

Feel free to try it with your own PDF links to see what breaks; it will help improve the crawl logic and the parser (I still need to put some rate limits in place).
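
On the rate-limit point, one common approach is a token bucket per client. The sketch below is only an illustration of that technique under assumed numbers; `TokenBucket`, its parameters, and the idea of keying one bucket per IP are hypothetical and not something the tool is known to use.

```python
import time


class TokenBucket:
    """Minimal token bucket: allows bursts up to `capacity`, refills over time."""

    def __init__(self, capacity: int, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock              # injectable so the bucket can be tested
        self.tokens = float(capacity)   # start full
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and spend one token if a request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    # Assumed policy: bursts of 3 requests, then one request every 20 seconds.
    bucket = TokenBucket(capacity=3, refill_per_sec=1 / 20)
    print([bucket.allow() for _ in range(5)])
```

Since a 300-page OCR job costs far more than a 5-page one, a variant that spends tokens per page rather than per request might fit a tool like this better.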