barefootsanders 3 hours ago
155 people used it over 3.5 weeks. I analyzed the results and found some patterns I didn't expect.
The headline finding: someone typed "I a bartender" (12 characters, with a typo) and scored 85/100. A 15,576-character technical specification about development process analysis scored 72/100. The bartender result was reproducible; I ran it twice and got the same score.
More surprisingly, "hey bro" scored 88/100. The system generated a "Casual Communication Skill" and suggested adding "quantifiable success metrics." The grading algorithm clearly has issues (acknowledged in the post).
What actually predicted quality:
- Specific, well-understood domains (plumber, bartender, OKR expert)
- Task-oriented descriptions (what you do vs. what you are)
- Brevity with clarity (top scores averaged under 100 characters)
- Named frameworks or methodologies
What didn't: length (negatively correlated with score), vague enthusiasm, attempts to jailbreak or override Claude's behavior.
The tool uses one Claude call to generate the skill, then a separate Claude call to grade it. The grading inconsistency is a known problem. To address the input-quality issue, I built a guided question flow that asks three follow-up questions when the input is too vague.
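A minimal sketch of that generate-then-grade flow, assuming the Anthropic TypeScript SDK; the model name, prompts, and the length-based vagueness check standing in for the guided question flow are all my own illustrative assumptions, not the tool's actual implementation:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MODEL = "claude-sonnet-4-5"; // assumption: any recent Claude model

// Helper: send one prompt, return the text of the first content block.
async function ask(prompt: string): Promise<string> {
  const msg = await client.messages.create({
    model: MODEL,
    max_tokens: 2048,
    messages: [{ role: "user", content: prompt }],
  });
  const block = msg.content[0];
  return block && block.type === "text" ? block.text : "";
}

export async function generateAndGrade(userInput: string) {
  // Crude vagueness gate (stand-in for the guided question flow):
  // very short input triggers follow-up questions instead of generation.
  if (userInput.trim().length < 20) {
    const questions = await ask(
      `The user described themselves as: "${userInput}". ` +
        `Ask three short follow-up questions about their domain, daily tasks, and tools.`
    );
    return { needsFollowUp: true, questions };
  }

  // Call 1: generate the skill from the user's description.
  const skill = await ask(
    `Write a reusable "skill" document for someone who describes themselves as: "${userInput}".`
  );

  // Call 2: a separate Claude call grades the generated skill 0-100.
  const grade = await ask(
    `Grade the following skill document from 0 to 100 and briefly explain the score:\n\n${skill}`
  );

  return { needsFollowUp: false, skill, grade };
}
```

Because the grader is an independent LLM call with no rubric shared across runs, scores for similar inputs can drift, which is consistent with the inconsistency noted above.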
Stack: Next.js, Supabase, Claude API. The blog post links to every skill mentioned, so you can see the actual outputs.