I’m experimenting with a small self-harness repo based on this paper. The idea is to run simulated users through an agent harness, collect the traces, group the recurring failures, and use those groups to propose small harness changes, each backed by regression checks. It’s still early, but I’d be interested to hear if anyone else is thinking about this workflow for agent development.
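
To make the loop a bit more concrete, here is a rough Python sketch of the "group recurring failures" and "regression check" steps. Everything in it is an assumption for illustration: the `Trace` fields, the `(task, error)` grouping key, the `min_count` threshold, and both function names are hypothetical and not taken from the paper or the repo.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One simulated-user run through the harness (hypothetical shape)."""
    user_id: str
    task: str
    succeeded: bool
    error: str = ""  # short failure label; empty when the run succeeded


def group_recurring_failures(traces, min_count=2):
    """Bucket failed traces by (task, error); keep buckets seen at least min_count times."""
    buckets = defaultdict(list)
    for trace in traces:
        if not trace.succeeded:
            buckets[(trace.task, trace.error)].append(trace)
    recurring = {key: runs for key, runs in buckets.items() if len(runs) >= min_count}
    # Most frequent failure groups first.
    return dict(sorted(recurring.items(), key=lambda item: len(item[1]), reverse=True))


def regression_check(group_key, rerun_traces):
    """Pass if no rerun trace reproduces the same (task, error) failure."""
    return not any(
        (t.task, t.error) == group_key and not t.succeeded for t in rerun_traces
    )


# Illustrative data only, not real results.
traces = [
    Trace("u1", "book_flight", False, "tool_timeout"),
    Trace("u2", "book_flight", False, "tool_timeout"),
    Trace("u3", "book_flight", True),
    Trace("u4", "cancel_order", False, "wrong_tool"),
]
groups = group_recurring_failures(traces)
print(list(groups))  # [('book_flight', 'tool_timeout')]
```

The one-off `wrong_tool` failure is dropped because it falls below `min_count`, which is the point of grouping: only failures that recur become candidates for a harness change. After a change, the same simulated users would be rerun and `regression_check` called per group to confirm that the failure no longer reproduces.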