TimoKerr 3 hours ago
TLDR: Cheap (and sometimes old) models perform on par with, or better than, flagship models on standard OCR tasks, at a fraction of the cost. This conclusion comes from a benchmark we ran across 18 models and over 7,500 LLM calls. The leaderboard and benchmark repo are completely open source.
Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.
So we investigated the topic and open-sourced everything, including a free tool to check your own documents.
We ran 18 models from OpenAI, Anthropic, Google, and Mistral on 42 real-world documents (invoices, receipts, bills of lading, transport orders). Each model ran 10 times per document to measure reliability, not just one-shot accuracy; 7,560 API calls total.
The finding: for standard document extraction, mid-tier and older models match or beat state-of-the-art, at a fraction of the cost. In some cases the cost difference is multiple orders of magnitude for equivalent accuracy.
We also track pass^n (how reliability degrades over repeated runs; see tau-bench), cost-per-success (not just cost-per-token), and critical-field accuracy. Full methodology and dataset are open source.
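For anyone who wants to reproduce the reliability numbers: here's a minimal sketch of the two metrics mentioned above. The pass^k estimator is the unbiased one from the tau-bench paper (pick k runs without replacement and require all of them to succeed); the cost-per-success helper is a hypothetical illustration of the idea, not our exact code.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k runs
    drawn from num_trials attempts all succeed, given that
    num_successes of those attempts succeeded (tau-bench estimator).
    Returns 0.0 when there are fewer successes than k."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # math.comb(n, k) returns 0 when k > n, which handles
    # the num_successes < k case for free.
    return comb(num_successes, k) / comb(num_trials, k)

def cost_per_success(total_cost: float, num_successes: int) -> float:
    """Total spend divided by the number of runs that produced a
    correct extraction; infinite if nothing succeeded."""
    if num_successes == 0:
        return float("inf")
    return total_cost / num_successes

# Example: 8 correct extractions out of 10 runs on one document.
# pass^5 drops well below the raw 80% pass rate.
print(pass_hat_k(10, 8, 5))
print(cost_per_success(0.42, 8))
```

Note how quickly pass^k falls off: a model that looks fine at 80% single-run accuracy only clears 5 consecutive runs about 22% of the time, which is exactly why we report repeated-run numbers instead of one-shot accuracy.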
Leaderboard: <https://www.arbitrhq.ai/leaderboards/>
Dataset + framework (GitHub): <https://github.com/ArbitrHq/ocr-mini-bench>
Or test your own documents for free: <https://app.arbitrhq.ai/benchmark-free>
Built by two founders in Antwerp. Very curious whether others have reached similar conclusions, or whether you've seen specific edge cases where the flagships still justify their price tag.