Most agent skills are built on vibes. You write a SKILL.md, try a few prompts, eyeball the output, and call it done. If something breaks later (after a model update, after a team member installs three more skills, after you edit one section), you won't know until someone complains.

Anthropic just shipped an update to their skill-creator that changes this. The tool now includes evals (structured tests for skill output), benchmarks (aggregate pass rates with timing and token data), multi-agent parallel testing, A/B comparators, and description optimization. It's available on Claude.ai, in Cowork, and as a Claude Code plugin.

This article breaks down what each feature does, when to use it, and what it means for anyone publishing skills on AgentSkillExchange or ClawHub.

Why Skill Testing Matters More Than You Think

Agent skills sit in an awkward spot. They're not traditional software: you can't write unit tests that assert function(x) === y. But they're not just prose either. A skill changes how Claude behaves, and that behavior needs to be consistent and correct.

Three things break skills in practice:

  1. Model updates. Anthropic ships model improvements regularly. A skill that compensated for a weakness in Claude 3.5 Sonnet might be unnecessary (or worse, counterproductive) on Claude 4. Without testing, you won't know.
  2. Skill interactions. When a user installs 15 skills, description overlap causes false triggers. Your carefully crafted skill fires on the wrong prompt, or never fires at all.
  3. Silent regressions. You edit one section of a SKILL.md to fix edge case A, and edge case B starts failing. Without evals, this goes undetected until it costs someone real time.

Anthropic’s blog post frames this well: “Testing turns a skill that seems to work into one you know works.” That distinction is the gap between a hobby project and a production tool.

Two Kinds of Skills, Two Kinds of Testing

The skill-creator update introduces a useful taxonomy. Every skill falls into one of two categories, and understanding which one you’re building determines how you should test it.

Capability uplift skills

These help Claude do something it either can't do or can't do consistently with the base model alone. Anthropic's document-creation skills are a good example: they encode techniques and patterns that produce better output than prompting alone.

The testing concern here: obsolescence. As models improve, the techniques a capability uplift skill teaches might get absorbed into the base model. If Claude starts passing your evals without the skill loaded, the skill has done its job; it's just no longer needed. Your evals detect this automatically.

Encoded preference skills

These document workflows where Claude already has the capability, but the skill sequences steps according to your team’s specific process. An NDA review skill that checks against particular criteria, or a weekly reporting skill that pulls data from specific MCPs in a specific order.

The testing concern here: fidelity drift. Your process changes. Someone updates step 3 but forgets step 7 depended on it. Evals verify that the skill still matches your actual workflow end-to-end.

The Eval Framework: How It Works

An eval is a structured test case. You define a prompt (optionally with input files), describe what good output looks like, and the skill-creator checks whether the skill produces it.

Here’s what the eval definition looks like in evals/evals.json:

{
  "skill_name": "my-review-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this pull request for security issues",
      "expected_output": "Should flag the SQL injection in line 42 and the hardcoded API key in config.py",
      "files": ["test-fixtures/vulnerable-pr.diff"]
    },
    {
      "id": 2,
      "prompt": "Review this PR - it's a clean refactor",
      "expected_output": "Should approve with no security flags. Should note the improved error handling.",
      "files": ["test-fixtures/clean-refactor.diff"]
    }
  ]
}

The framework runs each eval, then grades the output against your assertions. Results go into a grading.json file with three fields per assertion: text (what was checked), passed (boolean), and evidence (what the output actually contained).
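Based on those three fields, a grading.json entry might look roughly like this (the structure follows the description above; values are illustrative and the exact schema may differ):

```json
[
  {
    "text": "Flags the SQL injection in line 42",
    "passed": true,
    "evidence": "Output contained 'unsanitized string concatenation in the query at line 42'"
  },
  {
    "text": "Flags the hardcoded API key in config.py",
    "passed": false,
    "evidence": "Output discussed config.py but did not mention the API key"
  }
]
```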

For assertions that can be verified programmatically (file format checks, presence of specific strings, JSON schema validation) the skill-creator generates and runs scripts rather than relying on LLM judgment. Faster, more reliable, reusable across iterations.
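A string-presence check of this kind can be sketched as follows. This is an illustrative script, not the skill-creator's actual code; the helper name is hypothetical, and the result shape mirrors the three grading fields described above:

```python
import json

def check_contains(output_text, required_strings):
    """Check that each required string appears in the model's output.

    Returns one entry per assertion in the grading.json shape:
    text (what was checked), passed (bool), evidence (what was found).
    """
    results = []
    for s in required_strings:
        found = s in output_text
        results.append({
            "text": f"Output mentions '{s}'",
            "passed": found,
            "evidence": s if found else "not found",
        })
    return results

if __name__ == "__main__":
    sample = "Critical: SQL injection via string concatenation at line 42."
    print(json.dumps(check_contains(sample, ["SQL injection", "line 42"]), indent=2))
```

Because the check is deterministic, it can be re-run on every iteration of the skill at no model cost.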

Benchmarks: Aggregate Data Across Test Runs

Individual evals tell you if a specific case passed. Benchmarks give you the big picture.

After running evals, an aggregation step produces benchmark.json and benchmark.md with three metrics per configuration:

  • Pass rate: percentage of assertions passed, reported as mean ± standard deviation across runs.
  • Time: how long each eval took to complete (useful for catching skills that accidentally make Claude verbose or loop).
  • Token usage: total tokens consumed, which directly affects cost.
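Put together, a single configuration's entry in benchmark.json might look roughly like this (field names and values are illustrative; the actual schema may differ):

```json
{
  "configuration": "with_skill",
  "pass_rate": { "mean": 0.92, "std": 0.04 },
  "avg_time_seconds": 41.3,
  "total_tokens": 18250
}
```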

The benchmark viewer, launched via generate_review.py, renders both qualitative outputs (so you can read what Claude actually produced) and quantitative data side by side. For headless environments, a --static flag writes a standalone HTML file instead of starting a local server.

Benchmarks are organized by iteration. When you edit a skill and re-run evals, passing --previous-workspace shows the delta between versions. This is how you answer the question: “Did my edit actually help?”
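At its core, that version-to-version comparison is a simple delta computation, sketched here in Python (pass_rate_delta is a hypothetical helper for illustration, not part of the tool):

```python
def pass_rate_delta(current, previous):
    """Compare two iterations' results.

    current/previous: dicts mapping eval id -> pass rate (0.0 to 1.0).
    Returns per-eval deltas and the mean change, answering the
    question an iteration comparison exists to answer:
    did the edit actually help?
    """
    deltas = {eid: rate - previous.get(eid, 0.0)
              for eid, rate in current.items()}
    overall = sum(deltas.values()) / len(deltas)
    return deltas, overall
```

A positive overall delta with no individual eval regressing is the signal you want before publishing an iteration.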

Multi-Agent Testing and A/B Comparisons

Two practical problems with running evals sequentially: it’s slow, and context bleeds between runs. If eval 3 leaves something in the conversation context, eval 4 might behave differently than it would in isolation.

The skill-creator now spins up independent agents for each eval. Each agent gets a clean context, its own token counter, and its own timing metrics. Everything runs in parallel.

More interesting is the comparator system. For each eval, the framework spawns two agents simultaneously:

  1. With-skill run: Claude with your skill loaded, outputs saved to with_skill/outputs/
  2. Baseline run: Claude without the skill (for new skills) or with the previous version (for iterations), saved to without_skill/outputs/ or old_skill/outputs/

A separate comparator agent then judges both outputs without knowing which came from which configuration. This removes bias and gives you a straight answer: did the skill actually improve things, or were you just seeing what you wanted to see?
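The blinding step can be sketched like this: present the two outputs to the judge under anonymous labels in random order, then map the verdict back afterwards. A minimal illustration under those assumptions, not the skill-creator's implementation:

```python
import random

def blind_pair(with_skill_output, baseline_output, rng=random):
    """Label the two outputs A and B in random order.

    Returns (labeled, key): labeled maps "A"/"B" to output text for
    the judge; key maps "A"/"B" back to the real configuration.
    """
    pair = [("with_skill", with_skill_output), ("baseline", baseline_output)]
    rng.shuffle(pair)
    labeled = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return labeled, key

def unblind(verdict, key):
    """Map the judge's 'A' or 'B' verdict back to a configuration."""
    return key[verdict]
```

Because the judge only ever sees "A" and "B", it can't favor the skill-loaded run out of expectation bias.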

Anthropic's PDF skill provides a real example of this workflow in action. The skill previously struggled with non-fillable forms: Claude had to place text at exact coordinates with no form fields to guide it. Evals isolated the specific failure. The team shipped a fix that anchors positioning to extracted text coordinates. The before/after benchmark data confirmed the fix worked without regressing other cases.

Description Optimization: Fixing the Trigger Problem

Output quality doesn’t matter if the skill never fires. As we covered in a previous post, the description field determines when Claude activates a skill. Get it wrong, and your skill is invisible.

The updated skill-creator includes a description optimizer. It analyzes your current description against a set of sample prompts and identifies two failure modes:

  • False negatives: prompts that should trigger your skill but don't. Usually means the description is too narrow.
  • False positives: prompts that trigger your skill but shouldn't. Usually means the description is too broad and overlaps with other skills.

The optimizer suggests edits to reduce both. Anthropic tested it against their own document-creation skills and saw improved triggering accuracy on 5 out of 6 public skills.

The mechanics work like this: the tool generates an eval set of prompts (a mix of should-trigger and shouldn’t-trigger), splits it 60/40 into train and held-out test sets, evaluates the current description (running each query 3 times for reliability), then proposes revised descriptions and tests them against the held-out set.
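That split-and-score loop can be sketched in a few lines of Python. The 60/40 ratio and repeated runs come from the description above; the helper names are hypothetical:

```python
import random

def split_prompts(prompts, train_frac=0.6, seed=0):
    """Shuffle and split prompts into train and held-out test sets."""
    rng = random.Random(seed)
    shuffled = prompts[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def trigger_accuracy(results):
    """Score triggering behavior.

    results: list of (should_trigger, did_trigger) pairs, e.g. one
    pair per (prompt, run) with 3 runs per prompt for reliability.
    """
    correct = sum(1 for should, did in results if should == did)
    return correct / len(results)
```

Candidate descriptions are tuned against the train set, and only the held-out set's accuracy decides which revision wins, so the optimizer can't simply overfit to the prompts it saw.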

This is the kind of process most skill authors skip entirely. It’s also the part that makes the biggest difference when a user has 10+ skills installed and Claude has to pick the right one from a description scan alone.

What This Means for ASE Skill Authors

If you publish skills on AgentSkillExchange, here’s what to take from this update:

Run evals before publishing. The skill-creator lowers the bar for structured testing dramatically. You don’t need to write code. You define prompts, describe what good looks like, and the tool handles the rest. Skills with eval results will stand out in reviews.

Re-run evals after model updates. When Anthropic ships a new Claude version, your skill might behave differently. A 5-minute benchmark run catches regressions before your users do.

Optimize your descriptions. With 570+ skills on ASE across six frameworks, description collisions are a real problem. Two skills with overlapping descriptions mean unpredictable activation. The description optimizer is the fix.

Use the comparator for iterations. Before you publish a v2 of your skill, run both versions through the same evals. If v2 doesn’t measurably beat v1, maybe it’s not ready yet.

Consider whether your skill is still needed. Capability uplift skills have a natural shelf life. If your evals pass at the same rate with and without the skill, the base model caught up. That's a good thing: it means you can deprecate the skill and save your users context window space.

Looking Ahead: Skills as Specifications

Anthropic’s blog post ends with a thought worth sitting with: as models get better, the line between a “skill” (detailed implementation instructions) and a “specification” (a description of what you want) will blur. Today, a SKILL.md is an implementation plan. Tomorrow, a natural-language description of the desired behavior might be enough.

The eval framework already points in this direction. Evals describe the what: the desired output for a given input. If the model can consistently produce that output from the eval description alone, the detailed SKILL.md becomes optional.

We’re not there yet. But building evals now means you’ll know exactly when you are.

Getting Started

The skill-creator updates are available now on three platforms: Claude.ai, Cowork, and the Claude Code plugin.

If you're building skills for ASE, we strongly recommend running the description optimizer and at least a basic set of evals before submission. It takes 15 minutes and gives you, and your users, real confidence that the skill works.

Frequently Asked Questions

Do I need to know how to code to use the skill-creator evals?

No. The entire workflow is conversational. You describe test prompts in plain language, tell the skill-creator what good output looks like, and it handles the evaluation logic. For programmatic assertions (checking file formats, validating JSON), the tool generates and runs scripts for you.

How often should I re-run skill benchmarks?

At minimum, after every major model update from Anthropic and after any edit to your SKILL.md. If you publish a skill on a marketplace, monthly benchmark runs are reasonable. The multi-agent parallel testing makes this fast โ€” a set of 5 evals typically completes in under 3 minutes.

Can I use the skill-creator to test skills built for other frameworks like Codex or OpenClaw?

The skill-creator is designed for Claude Code’s skill format (SKILL.md with YAML frontmatter). However, the AgentSkills standard is shared across frameworks. Skills that follow the standard structure will work with the eval framework regardless of which agent they’re deployed on. Framework-specific behavior (hooks, plugins) won’t be tested, but the core instructions will be.
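For reference, a minimal skill in that format pairs YAML frontmatter with markdown instructions. A hypothetical example (the skill name and instructions are invented for illustration):

```markdown
---
name: pr-security-review
description: Reviews pull request diffs for security issues such as injection flaws and leaked credentials. Use when the user asks for a security-focused code review.
---

# PR Security Review

1. Read the diff and flag injection risks, hardcoded secrets, and unsafe deserialization.
2. Summarize findings by severity, citing file and line for each issue.
```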

What’s the difference between the skill-creator and the test-creator?

The skill-creator is the full authoring and testing tool: it helps you write skills, generate evals, run benchmarks, and optimize descriptions. The test-creator is a focused component specifically for generating structured test cases. If you already have a skill and just want to add evals, you can use either. If you're building from scratch, start with the skill-creator.


Related reading on AgentSkillExchange: