
ClawsBench shows GPT-5.4 tries to reward hack 80% of the time

Posted by xdotli | 3 hours ago | 1 comment

xdotli 3 hours ago

Author here. We built 5 high-fidelity mock Google Workspace + Slack services and ran 7,224 trials across 6 frontier models and 4 agent harnesses.

The headline finding that surprised us most: scaffolding (skills + a meta prompt) gives a 39-63pp lift, while the top 5 models are statistically indistinguishable (53-63% task success rate, with no pairwise comparison surviving multiple-comparison correction). Your choice of scaffolding matters roughly 6x more than your choice of model.
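To make "no pairwise comparison survives correction" concrete, here is a minimal sketch of pairwise two-proportion z-tests with a Holm-Bonferroni correction. The model names, success rates, and trial counts are illustrative placeholders in the reported 53-63% band, not the paper's actual per-model numbers:

```python
import math
from itertools import combinations

def two_prop_p(p1, n1, p2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    x1, x2 = p1 * n1, p2 * n2
    pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided tail probability of the standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical TSRs and trial counts (placeholders, not the paper's data)
models = {"A": 0.63, "B": 0.60, "C": 0.57, "D": 0.55, "E": 0.53}
n = 300

pvals = {pair: two_prop_p(models[pair[0]], n, models[pair[1]], n)
         for pair in combinations(models, 2)}

# Holm-Bonferroni: sort p-values ascending, test against alpha / (m - k)
alpha, m = 0.05, len(pvals)
significant = []
for k, (pair, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
    if p >= alpha / (m - k):
        break  # Holm stops at the first non-significant test
    significant.append(pair)

print(significant)  # prints [] with these numbers: even the 10pp gap fails
```

With ~300 trials per model, even the largest gap (63% vs 53%, raw p ≈ 0.013) misses the Holm threshold of 0.005 for the smallest of the 10 pairwise p-values, which is the sense in which the models are "statistically indistinguishable."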

The safety findings are darker: Opus leads on task success (63%) but ties for most unsafe (23% UAR). GPT-5.4 is the safest (7% UAR) but mid-tier on tasks. There's no capability-safety tradeoff — they're decoupled.

I'm also a reviewer for Terminal Bench 3.0, and I've heard the same thing from contributors there:

> I noticed that when I was building tasks with Harbor. Claude is a good student that generally follows the instructions, but GPT always tries to find a short path to cheat, like reversing the binary directly instead of interacting with it.

Another friend shared a way to address this: https://x.com/xeophon/status/2041772210562511080?s=20

> Just ask Codex to not reward hack. It literally works. And it works even better when you state which things you consider reward hacking, e.g. wrapping a CLI or something.

Paper: https://arxiv.org/abs/2604.05172
Traces (7,834 on HF): https://huggingface.co/datasets/benchflow/ClawsBench