
Benchmark accuracy retention is the wrong metric

Posted by fmaccomber | 3 hours ago | 1 comment

fmaccomber 3 hours ago

Whether model routing works is an empirical question. Existing empirical efforts rely on benchmark accuracy retention, i.e. how a model routing system scores relative to a sophisticated model like Opus 4.7 on a complex task benchmark like Terminal-Bench 2.0.

However, that metric is divorced from what we actually care about: it counts every benchmark task equally, no matter how much the task matters. The better metric is utility retention, which takes task importance into account.
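
To make the difference concrete, here is a rough Python sketch. The post does not give a formula, so this assumes utility retention is the importance-weighted version of the same ratio: the total importance of tasks the router solves, divided by the total importance of tasks the baseline model solves. The function names, the `importance` weights, and the four-task example are all made up for illustration, not taken from any real benchmark.

```python
from typing import Sequence


def accuracy_retention(router_ok: Sequence[bool],
                       baseline_ok: Sequence[bool]) -> float:
    """Router accuracy relative to baseline accuracy; every task counts once."""
    baseline_solved = sum(baseline_ok)
    if baseline_solved == 0:
        raise ValueError("baseline solved no tasks; retention is undefined")
    return sum(router_ok) / baseline_solved


def utility_retention(router_ok: Sequence[bool],
                      baseline_ok: Sequence[bool],
                      importance: Sequence[float]) -> float:
    """Same ratio, but each solved task contributes its importance weight.

    Assumption: utility is the sum of importance weights over solved tasks.
    """
    if not (len(router_ok) == len(baseline_ok) == len(importance)):
        raise ValueError("all inputs must cover the same tasks")
    router_utility = sum(w for ok, w in zip(router_ok, importance) if ok)
    baseline_utility = sum(w for ok, w in zip(baseline_ok, importance) if ok)
    if baseline_utility == 0:
        raise ValueError("baseline utility is zero; retention is undefined")
    return router_utility / baseline_utility


# Hypothetical example: four tasks, the last one matters ten times as much.
importance = [1.0, 1.0, 1.0, 10.0]
baseline_ok = [True, True, True, True]   # strong model solves everything
router_ok = [True, True, True, False]    # router misses only the important task

print(accuracy_retention(router_ok, baseline_ok))             # 0.75
print(utility_retention(router_ok, baseline_ok, importance))  # 3/13, about 0.23
```

In this toy case the router keeps 75% of the accuracy but only about 23% of the utility, because the one task it fails is the one that matters most. Where the importance weights should come from is left open here.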