Published 2026-03-10 | skill-eval v0.4.0
Everyone's building agent skills. Prompt templates, workflow scaffolds, CLI wrappers -- the ClawHub marketplace has hundreds. But here's the question nobody's answering: do they actually make the agent better?
We built a system to find out. Not with vibes. Not with self-reported metrics. With blind A/B testing at scale.
skill-eval is a self-evolving evaluation engine. For each skill, it runs the same task twice: once with the skill loaded, once without (bare model). Same prompts, same assertions, same grading. The only variable is whether the skill is present.
We tested 123 skills from ClawHub using Claude Opus 4 as the execution model. Every skill got: - 2-3 realistic test prompts - Deterministic assertions (file checks, keyword presence, format compliance) - Rubric-based quality grading - Timing and token usage measurement
The result: a score from 0-10 combining quality (0-5), value-add delta (0-3), and efficiency (0-2).
76% of skills are genuinely useful. That's 94 out of 123 scoring 7+ ("Recommended").
The full breakdown:
92% of skills showed a positive quality delta over baseline. The model is measurably better with most skills loaded.
But not all skills are created equal.
The top scorer -- Explain Code at 10.0/10 -- teaches the model to use analogies, ASCII diagrams, and structured walkthroughs when explaining code. The model CAN do these things without the skill, but it doesn't by default. The skill makes it consistent.
This is the pattern: the best skills don't teach the model new facts. They encode preferences and processes the model wouldn't follow on its own.
Article Writer (9.5/10) defines a banned-word list, numbered module structure, and "golden sentence" convention. The baseline model uses filler words freely; the skill eliminates them. 100% delta -- our first perfect discriminator.
BookNotes (9.6/10) produces structured reading notes with a specific template. Baseline pass rate: 12%. With skill: 100%. The model needs to be told what format to use.
The common thread: opinionated skills that define specific, testable conventions score highest. Vague "be better at X" instructions produce minimal delta.
The bottom of the leaderboard tells an equally clear story.
bio-generator-pay and hashtag-generator-pay both scored 2.0/10 with 0% pass rate. Why? They require paid API credentials that weren't configured. The baseline model completed the same tasks perfectly. These aren't bad skills -- they're broken deployment stories.
finance-lite (2.8/10) failed because its error handling bypasses its own output formatting. When the market data API is down, it drops the source citations and freshness timestamps that its template requires. The template is good; the fallback path is not.
We cataloged every way skills fail:
The most common? Reference manual anti-pattern. Skill authors paste SQL templates, framework guides, and code patterns into SKILL.md thinking more context helps. It doesn't. The model already knows this stuff. You're just burning tokens.
We built an automated skill improvement pipeline. For skills scoring below 7, we rewrite the SKILL.md and re-evaluate.
The results:
| Skill | Before | After | What Changed |
|---|---|---|---|
| Data Analyst | 5.0 | 7.0 | Removed 80% of content |
| Data Model Designer | 5.5 | 7.0 | Replaced Python classes with instructions |
| Test Runner | 5.5 | 7.0 | Removed framework tables, added mandates |
The primary improvement lever in all three cases: deleting content. Not adding instructions. Not making the skill smarter. Just making it smaller.
Data Analyst's original SKILL.md was full of SQL templates, pandas patterns, and statistics references. The model already knows all of this. After removing 80% of the content and adding 10 lines of behavioral mandates (MUST include methodology section, NEVER use SELECT *, ALWAYS start with executive summary), the overhead dropped from +155% to +45% and the score jumped 2 points.
Skills should be behavioral contracts, not textbooks.
The evaluation system itself improves over time. It's not a static benchmark.
After each batch of evaluations, the engine: 1. Updates its lessons learned (what assertion patterns work for which skill categories) 2. Catalogs new failure modes 3. Absorbs proven patterns back into its own methodology document 4. Bumps its version when the methodology meaningfully changes
We went from v0.1.0 (basic blind A/B, 7 phases) to v0.4.0 (12 phases, multi-model support, self-evolving improvement engine) in 4 days. Each batch produced sharper assertions that caught problems the previous batch missed.
The improvement engine has its own knowledge base too -- learned patterns like "for reference-manual skills, delete 70% and add MUST/ALWAYS/NEVER mandates, expected gain: +1.5 to +2.0 points." It tracks which strategies work, which fail, and gets better at fixing skills over time.
If you're writing agent skills:
The leaderboard is live. The methodology is open. The engine keeps getting better.
Check the full leaderboard to see how your favorite skills rank.
Built with skill-eval v0.4.0. Methodology: SKILL-EVAL.md. Questions? Open an issue.