What broke when I tried to evaluate an AI agent in production

Posted by colinfly | 2 hours ago | 1 comment

colinfly 2 hours ago

If helpful, I put together a small tool (Cane Eval) to structure this kind of eval loop:

github.com/colingfly/cane-eval

Still early, but would love feedback if anyone is working on similar problems.