TimoKerr 3 hours ago
TLDR: Cheap (and sometimes old) models perform on par with, or better than, flagship models on standard OCR tasks, at a fraction of the cost. This conclusion comes from a benchmark we ran across 18 models and over 7,500 LLM calls. The leaderboard and benchmark repo are completely open source.
Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.
So we investigated the topic and open-sourced everything, including a free tool to check your own documents.
We ran 18 models from OpenAI, Anthropic, Google, and Mistral on 42 real-world documents (invoices, receipts, bills of lading, transport orders). Each model ran 10 times per document to measure reliability, not just one-shot accuracy; 7,560 API calls total.
The finding: for standard document extraction, mid-tier and older models match or beat state-of-the-art, at a fraction of the cost. In some cases the cost difference is multiple orders of magnitude for equivalent accuracy.
We also track pass^n (how reliability degrades over repeated runs; see tau-bench), cost-per-success (not just cost-per-token), and critical-field accuracy. Full methodology and dataset are open source.
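For anyone who wants to reproduce the reliability numbers: here's a minimal sketch of the two metrics mentioned above. The pass^k estimator is the unbiased one from the tau-bench paper (pick k runs without replacement and require all of them to succeed); the cost-per-success helper is a hypothetical illustration of the idea, not our exact code.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k runs
    drawn from num_trials attempts all succeed, given that
    num_successes of those attempts succeeded (tau-bench estimator).
    Returns 0.0 when there are fewer successes than k."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # math.comb(n, k) returns 0 when k > n, which handles
    # the num_successes < k case for free.
    return comb(num_successes, k) / comb(num_trials, k)

def cost_per_success(total_cost: float, num_successes: int) -> float:
    """Total spend divided by the number of runs that produced a
    correct extraction; infinite if nothing succeeded."""
    if num_successes == 0:
        return float("inf")
    return total_cost / num_successes

# Example: 8 correct extractions out of 10 runs on one document.
# pass^5 drops well below the raw 80% pass rate.
print(pass_hat_k(10, 8, 5))
print(cost_per_success(0.42, 8))
```

Note how quickly pass^k falls off: a model that looks fine at 80% single-run accuracy only clears 5 consecutive runs about 22% of the time, which is exactly why we report repeated-run numbers instead of one-shot accuracy.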
Leaderboard: <https://www.arbitrhq.ai/leaderboards/>
Dataset + framework (GitHub): <https://github.com/ArbitrHq/ocr-mini-bench>
Or test your own documents for free: <https://app.arbitrhq.ai/benchmark-free>
Built by two founders in Antwerp. Very curious whether others have reached similar conclusions, or whether you've seen specific edge cases where the flagships still justify their price tag.