I've been a contributor and reviewer for terminal bench since last August, and this post is about what I've learned designing and reviewing tasks. The guidance is broadly applicable to anyone building an agentic benchmark.I would love feedback from the HN community.