
We Analyzed 413K Agent Runs. Here's What Separates the Ones That Succeed

Posted by lihanc111 | 2 hours ago | 4 comments


Hey HN,

We dug into 17 billion tokens of behavioral data across 413K AI agent trajectories (CoderForge-Preview) attempting real GitHub issues. Instead of looking only at final SWE-bench scores, we compared successful and failing runs on the exact same problem, which controls for task-difficulty confounds.
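To make the matched-pair idea concrete, here is a minimal sketch of within-issue comparison. The record format, field names, and numbers are invented for illustration; the point is only that averaging pass/fail differences *within* each issue cancels out how hard the issue itself is.

```python
# Hypothetical sketch: compare failing vs. passing runs on the SAME issue,
# so task difficulty cancels out. Data and field names are made up.
from collections import defaultdict

runs = [
    {"issue": "repo#101", "passed": True,  "grep_count": 2},
    {"issue": "repo#101", "passed": False, "grep_count": 9},
    {"issue": "repo#202", "passed": True,  "grep_count": 1},
    {"issue": "repo#202", "passed": False, "grep_count": 6},
]

by_issue = defaultdict(lambda: {"pass": [], "fail": []})
for r in runs:
    by_issue[r["issue"]]["pass" if r["passed"] else "fail"].append(r["grep_count"])

# Within-issue gap: positive means failing runs grep more on the same task.
diffs = [
    sum(g["fail"]) / len(g["fail"]) - sum(g["pass"]) / len(g["pass"])
    for g in by_issue.values()
    if g["pass"] and g["fail"]
]
print(sum(diffs) / len(diffs))  # mean within-issue gap -> 6.0 on this toy data
```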

The biggest surprise? Agents are not junior developers, and prompting them to act like humans actively hurts their success rate.

Here is what the data actually shows:

Human exploration rituals predict failure: "View-before-edit" and "grep-before-edit" are negatively correlated with success. Humans do this to build mental models. Agents already have the codebase in their context window; if they are heavily grepping, they aren't learning; they're flailing.

TDD is the ultimate predictor of success: The single strongest behavioral signal of a passing agent is the fraction of early bash commands dedicated exclusively to running the test suite.
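A signal like this is easy to compute from a command log. Here is a hedged sketch: the log, the window size, and the test-runner patterns are all assumptions for illustration, not our actual feature extraction.

```python
# Sketch of an "early test fraction" feature: what share of the first N
# bash commands run the test suite? Patterns and log are hypothetical.
import re

TEST_CMD = re.compile(r"\b(pytest|python -m pytest|tox|make test)\b")

def early_test_fraction(bash_commands, early_n=10):
    """Fraction of the first `early_n` bash commands that invoke the test suite."""
    early = bash_commands[:early_n]
    if not early:
        return 0.0
    return sum(bool(TEST_CMD.search(c)) for c in early) / len(early)

log = [
    "pytest tests/test_parser.py",
    "grep -rn 'parse_header' src/",
    "pytest tests/test_parser.py -x",
    "ls src/",
]
print(early_test_fraction(log))  # 2 of 4 early commands run tests -> 0.5
```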

The Single Responsibility Principle is law: Agents that scatter edits across 3 or more files in the first 30% of their run see their success rate plummet. Successful agents fix one targeted thing at a time.
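The "scattered early edits" signal can be sketched the same way. The trajectory event format and thresholds below are invented for illustration; the check just counts distinct files edited in the first 30% of steps.

```python
# Sketch of a "scattered early edits" flag. Event schema is hypothetical:
# each step is a dict with a "type" and, for edits, the "file" touched.
def edits_too_scattered(events, frac=0.3, max_files=2):
    """True if edits in the first `frac` of the trajectory touch 3+ distinct files."""
    cutoff = int(len(events) * frac)
    early_files = {e["file"] for e in events[:cutoff] if e["type"] == "edit"}
    return len(early_files) > max_files

trajectory = (
    [{"type": "edit", "file": f"src/mod{i}.py"} for i in range(3)]
    + [{"type": "bash", "file": None}] * 7
)
print(edits_too_scattered(trajectory))  # 3 distinct files edited early -> True
```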

Perseverance is a myth: If an agent runs the exact same bash command twice early on, it’s a massive failure signal. They don't adapt; they just get stuck in a loop.
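Detecting that loop is a one-liner over the command log. A minimal sketch, assuming exact string equality counts as "the same command" and an invented window size:

```python
# Sketch of the "same bash command twice early" failure signal.
def repeats_early_command(bash_commands, early_n=8):
    """True if any exact command string appears twice in the first `early_n` steps."""
    early = bash_commands[:early_n]
    return len(early) != len(set(early))

print(repeats_early_command(["pytest -x", "ls src/", "pytest -x"]))   # True
print(repeats_early_command(["pytest -x", "ls src/", "cat setup.py"]))  # False
```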

Check out the full article for the complete analysis!