
An agentic verification loop to stop LLMs from faking tests

Posted by malka666 | 2 hours ago | 1 comment

malka666 2 hours ago

Hi HN,

I've been building infrastructure for autonomous agents and hit a wall many of you probably recognize: if you let an LLM write both the code and the tests, the agent will simply rewrite the test to pass and hide its own bugs. It doesn't fix things; it masks them.

I decided to tackle this by leaning heavily into Spec-Driven Development (SDD). I submitted a massive PR to smart-ralph (an excellent Claude agent project) introducing what I call "Verification Contracts". Instead of static scripts or Gherkin specs, the agent receives observable signals and hard invariants. It uses Playwright via MCP to explore the DOM, reasons about the system state the way a human QA engineer would, and autonomously backtracks to fix the code whenever an invariant is violated.
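To make the idea concrete, here is a minimal sketch of what a Verification Contract could look like as a data structure: observable signals the agent must collect (however it chooses to, e.g. via Playwright/MCP) plus hard invariants checked over them. All names here (`Invariant`, `VerificationContract`, the checkout-page signals) are illustrative assumptions, not the actual smart-ralph API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Invariant:
    name: str
    check: Callable[[dict], bool]  # predicate over the observed signals

@dataclass
class VerificationContract:
    signals: list[str]                              # what the agent must observe
    invariants: list[Invariant] = field(default_factory=list)

    def verify(self, observed: dict) -> list[str]:
        """Return names of violated invariants; an empty list means pass."""
        missing = [s for s in self.signals if s not in observed]
        if missing:
            return [f"missing signal: {s}" for s in missing]
        return [inv.name for inv in self.invariants if not inv.check(observed)]

# Hypothetical example: a checkout page the agent explored via the browser.
# The contract doesn't care *how* the signals were gathered, only that the
# invariants hold over them.
contract = VerificationContract(
    signals=["cart_total", "items"],
    invariants=[
        Invariant("total_matches_items",
                  lambda o: o["cart_total"] == sum(i["price"] for i in o["items"])),
        Invariant("no_negative_prices",
                  lambda o: all(i["price"] >= 0 for i in o["items"])),
    ],
)

observed = {"cart_total": 30, "items": [{"price": 10}, {"price": 20}]}
print(contract.verify(observed))  # [] -- all invariants hold
```

The key property is that the agent can rewrite the code freely, but it cannot rewrite the contract: a violated invariant names exactly what broke, which is what drives the backtracking step.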

The elephant in the room: tokens. Giving an agent this level of exploratory freedom burns through context and tokens at an insane rate. Doing this on a commercial API is cost-prohibitive.

To make this viable, I rely entirely on local inference. I've also open-sourced my local infrastructure stack for running this on Blackwell RTX 5090s so others can run deep verification loops locally:

Linux Optimizer for Blackwell: [Link to your optimizer repo]

Sovereign vLLM Stack: [Link to your vLLM repo]
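For context on how the agent loop talks to local inference: vLLM exposes an OpenAI-compatible HTTP endpoint, so the verification agent just points its chat-completions calls at localhost instead of a commercial API. A minimal sketch, assuming a server on the default port; the base URL and model name are placeholders, not values from the repos above.

```python
import json
import urllib.request

# Assumed local endpoint; vLLM's server speaks the OpenAI chat API shape.
VLLM_BASE = "http://localhost:8000/v1"

def build_request(model: str, messages: list[dict],
                  max_tokens: int = 1024) -> urllib.request.Request:
    """Build an OpenAI-style chat request aimed at the local vLLM server."""
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{VLLM_BASE}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_request(
    "local-model",  # placeholder model name
    [{"role": "user", "content": "Check the contract invariants on this DOM snapshot."}],
)
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req) would return the completion once the server
# is running; omitted here so the sketch stays self-contained.
```

Because the endpoint is local, the exploratory loop can afford to be token-hungry: retries and backtracking cost GPU time, not API dollars.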

Would love to hear your thoughts on SDD and how you are handling the 'agents faking tests' problem.