If you are building skills for Claude, OpenClaw, or any other agent that follows the Agent Skills pattern, the biggest gains usually do not come from writing more instructions. They come from deciding what should stay in prompt context, what should become executable code, and what should only load when needed.

That is the real thread connecting Anthropic’s recent skill guidance: prompt caching cuts repeated context cost, tool use turns fragile instructions into reliable operations, and extended thinking gives the model more room for hard reasoning when a task deserves it. Used together, those three choices make skills cheaper, faster, and more dependable.

This article is for skill authors, marketplace curators, and teams packaging reusable workflows. We will break down the practical patterns, show where they fit, and give you a checklist you can apply before your next publish.

Key takeaways

  • Use prompt caching for stable instructions, long references, and repeated multi-turn work.
  • Use tools and scripts for deterministic operations, especially anything error-prone or expensive in pure text.
  • Use extended thinking only for genuinely difficult reasoning, not every routine request.
  • Keep core skill files lean, then rely on progressive disclosure for the rest.

Why these Anthropic best practices matter

Anthropic’s own framing is useful here. In its engineering post on Agent Skills, the company describes skills as organized folders of instructions, scripts, and resources that agents can discover and load dynamically. It also calls progressive disclosure the core design principle, with skill metadata loaded first, the main SKILL.md loaded only when relevant, and extra files pulled in on demand.

That matters because context is scarce, even when models feel roomy. Anthropic’s skill authoring guide explicitly warns that the context window is a shared resource across the system prompt, conversation history, skill metadata, and the user’s actual task. In other words, every unnecessary paragraph in a skill competes with the work you want the model to do.

If you want background on the file-splitting side of this, read our guides on progressive disclosure and the skill creator’s checklist.

Prompt caching for reusable skill context

Prompt caching is the part many skill authors still underuse. Anthropic’s prompt caching docs describe it as a way to reuse prompt prefixes so repeated requests do not have to reprocess the same stable context from scratch. Anthropic says this can reduce latency by up to 85% and costs by up to 90%, and the default cache lifetime is 5 minutes.

For skill development, the takeaway is simple: cache the parts that are expensive, stable, and likely to repeat.

Good candidates for prompt caching

  • Long system instructions that rarely change
  • Large reference sections such as API rules, policy docs, or style guides
  • Multi-turn workflows where the same task scaffold persists across several requests
  • Evaluation runs where you repeatedly test the same skill against different inputs

Bad candidates are short prompts, highly variable prefixes, or one-off tasks where setup overhead is already tiny.

In the Messages API, cache_control goes on the specific content block you want cached, not at the top level of the request:

{
  "model": "claude-opus-4-6",
  "max_tokens": 1200,
  "system": [
    {
      "type": "text",
      "text": "You are validating agent skills against our publishing standards.",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Review this new skill for naming, description quality, and gotchas coverage."
    }
  ]
}

On a marketplace team, prompt caching is especially useful when reviewers are applying the same rubric again and again. If the rubric stays stable and the input changes, caching is usually worth it.
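That review loop can be sketched in a few lines. This is a minimal illustration, not a production reviewer: the rubric text, model name, and skill inputs are placeholders, and only the request payload is built here.

```python
# Sketch of a marketplace review loop: the stable rubric is marked for
# caching, while the skill under review changes per request.
# Rubric text and inputs are illustrative placeholders.

RUBRIC = "Check naming, description quality, and gotchas coverage."

def build_review_request(skill_text: str) -> dict:
    """Build a Messages API payload with the rubric as a cached prefix."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 1200,
        "system": [
            {
                "type": "text",
                "text": RUBRIC,
                # Stable across every review, so it is worth caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # Only this part varies between requests.
            {"role": "user", "content": f"Review this skill:\n{skill_text}"}
        ],
    }

# The same cached prefix serves every submission in the batch.
requests = [build_review_request(s) for s in ("skill-a", "skill-b", "skill-c")]
```

Because the cached prefix is identical across requests, every review after the first reuses it instead of reprocessing the rubric.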

This is also why concise SKILL.md files still matter even when caching exists. Caching helps repeated work, but it does not excuse bad structure. Keep the main skill short, move bulky references into separate files, and cache only the stable layers that are genuinely reused.

Tool use and scripts for reliable execution

The second best practice is about moving the right work out of prose and into tools. Anthropic’s engineering post makes the case directly: language models are flexible, but some operations are better handled by deterministic code for reliability and efficiency.

That sounds obvious, but many weak skills still explain a process in 300 words when they should just run a script.

Use tools when the operation is fragile

If a task has one safe path, encode that path. Database migrations, release tasks, schema validation, CI checks, PDF form extraction, and file normalization all fit this pattern. A model can decide when to run the tool, but the tool should handle the exact operation.
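A deterministic check like that is often just a small function. This is a hedged sketch of a manifest validator; the field names and length limit are hypothetical examples, not Anthropic’s actual publishing rules.

```python
# Sketch of a deterministic validation step that belongs in a script,
# not in prose. Field names and limits are hypothetical examples.

REQUIRED_FIELDS = ("name", "description")
MAX_DESCRIPTION_LEN = 1024

def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not manifest.get(field):
            problems.append(f"missing required field: {field}")
    if len(manifest.get("description", "")) > MAX_DESCRIPTION_LEN:
        problems.append("description too long")
    return problems
```

The model decides when to run the check; the script decides what passes. That split is the whole point.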

Anthropic’s skill docs now show another useful pattern too: dynamic shell-backed context in skill content, where command output is rendered before the model sees the final prompt. That is a clean way to feed live state into a skill without forcing the model to narrate every lookup step itself.

---
name: review-workflow
description: Reviews pull requests with live repo context. Use when summarizing changes, checking CI, or reviewing a PR.
allowed-tools: Bash(gh *)
---

## Pull request context
- PR diff: !`gh pr diff`
- PR comments: !`gh pr view --comments`
- Changed files: !`gh pr diff --name-only`

## Your task
Summarize the pull request, call out risk areas, and suggest next checks.

For ASE publishers and skill authors, that pattern points to a broader rule: write skills that orchestrate tools, not skills that impersonate tools. A strong skill gives the model the conditions, guardrails, and decision points. The script or CLI handles the exact mechanics.

Two ASE skill pages worth studying here are “GitHub Actions Workflow Linter” and “Normalize raw CLI output into JSON for reliable downstream parsing and automation.” Both reflect the same principle: use the model for orchestration and interpretation, and use code for consistency.
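The normalization idea behind the second of those skills can be shown in miniature. The input format here is an assumed "key: value" CLI dump for illustration, not any specific tool’s actual output.

```python
import json

# Sketch of normalizing line-oriented CLI output into JSON so downstream
# automation can parse it reliably. Assumes simple "key: value" lines,
# an illustrative format rather than a real tool's output.

def normalize_cli_output(raw: str) -> str:
    """Turn "key: value" lines into a stable, sorted JSON object."""
    record = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip banners, separators, and other noise
        key, _, value = line.partition(":")
        record[key.strip().lower().replace(" ", "_")] = value.strip()
    return json.dumps(record, sort_keys=True)
```

Once the output is JSON with stable keys, every downstream step can parse it the same way every time.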

When extended thinking actually helps

Extended thinking is the easiest best practice to misuse because it feels universally helpful. It is not. Anthropic’s docs describe it as a way to give Claude enhanced reasoning for complex tasks. For Claude Opus 4.6 and Sonnet 4.6, Anthropic now recommends adaptive thinking rather than older manual thinking budgets. In earlier examples, manual mode used values like 10,000 budget tokens inside a 16,000-token response budget, which shows the scale involved.

That should be your clue. Extended thinking is for tasks where the reasoning path is the work: hard debugging, planning across many constraints, difficult evaluations, or cases where several valid approaches need to be weighed carefully.
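For the older manual mode, the budget sits inside the request roughly like this sketch. The parameter shape follows Anthropic’s extended thinking docs; the model name and prompt are illustrative.

```python
# Sketch of the older manual thinking budget the docs describe:
# the thinking budget must fit inside the overall max_tokens response
# budget. Prompt text is illustrative.
request = {
    "model": "claude-opus-4-6",
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "messages": [
        {"role": "user", "content": "Trace this failure across logs, config, and code."}
    ],
}
```

Ten thousand tokens of reasoning before the answer even starts is a real cost, which is why the next two lists matter.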

Good uses of extended thinking

  • Comparing multiple architecture options with non-obvious tradeoffs
  • Tracing a subtle failure across logs, config, and code paths
  • Generating a structured remediation plan from messy evidence
  • Reviewing a complex skill for hidden overlap, security risk, or bad abstractions

Bad uses of extended thinking

  • Routine formatting
  • Simple extraction tasks
  • Straightforward file edits
  • Any request where a deterministic tool already answers the question

Anthropic’s Claude Code docs also note a practical skill-level shortcut: to enable extended thinking in a skill, include the word ultrathink somewhere in the skill content. That can be useful, but it should be treated as a scalpel, not a default setting. If every skill asks for deeper reasoning, your catalog gets slower and more expensive without becoming much smarter.
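One way to keep that trigger scalpel-like is to confine it to the hard path inside the skill. This excerpt is illustrative, not a real published skill:

```
## Routine path
Run the preflight script and report the results. No deep reasoning needed.

## Exception path
If the preflight results conflict with the release policy, ultrathink
through the tradeoffs before recommending whether to proceed or roll back.
```

The routine path stays cheap; only the genuinely ambiguous case pays for deeper reasoning.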

A practical skill architecture

The best skills combine all three patterns without forcing them into every path. A practical structure looks something like this:

release-skill/
├── SKILL.md
├── references/
│   ├── release-policy.md
│   └── rollback-rules.md
├── scripts/
│   ├── preflight.sh
│   └── deploy.sh
└── examples/
    └── incident-review.md

  • Prompt caching fits the stable policy and release guidance that gets reused across runs.
  • Tool use fits the preflight and deploy scripts, because release steps should be deterministic.
  • Extended thinking fits the exception path, where the model needs to reason through conflicting signals before deciding whether to proceed or stop.
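Those layers can map onto a single request. This sketch is an assumption-heavy illustration: the tool name, input schema, and prompt are invented for the example, and the policy text stands in for a file loaded on demand.

```python
# Sketch of how the layers of release-skill/ map onto one request:
# cached policy context plus a deterministic preflight tool.
# Tool name, schema, and prompt are illustrative.

policy_text = "…contents of references/release-policy.md…"  # loaded on demand

request = {
    "model": "claude-opus-4-6",
    "max_tokens": 2000,
    "system": [
        {
            "type": "text",
            "text": policy_text,
            # Stable release policy: cache it across runs.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "tools": [
        {
            # Deterministic step: the script, not the model, runs the checks.
            "name": "run_preflight",
            "description": "Run the preflight script and return its findings.",
            "input_schema": {"type": "object", "properties": {}},
        }
    ],
    "messages": [{"role": "user", "content": "Prepare the next release."}],
}
```

The model orchestrates; the cached policy frames the decision; the tool does the exact mechanics.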

This layered approach lines up with Anthropic’s broader design logic and with what performs well on ASE. It also keeps skills portable across ecosystems. OpenClaw, Claude Code, and other skill-aware tools all benefit when the skill is explicit about what belongs in context, what belongs in code, and what belongs in higher-effort reasoning.

Pre-publish checklist

  • Does the skill keep reusable stable context separate from one-off chatter?
  • Would prompt caching save meaningful time or cost on repeated runs?
  • Is any fragile process still described in prose when it should be scripted?
  • Are tools restricted to the smallest useful surface area?
  • Does the skill only invoke extended thinking for hard reasoning paths?
  • Is the main SKILL.md short enough to load cleanly?
  • Does the description clearly describe what the skill does and when to use it?
  • Does the skill link to supporting files instead of pasting everything inline?

If you can answer yes to those questions, you are much closer to a skill that earns repeated use instead of a one-time install.

Frequently Asked Questions

What is the best use of prompt caching in agent skills?

The best use is caching stable, repeated context such as instructions, policies, examples, or long references that appear across multiple runs. It is most valuable when the prompt prefix stays consistent and the task changes underneath it.

Should every agent skill use extended thinking?

No. Extended thinking is for difficult reasoning, not routine operations. If a task is mostly lookup, extraction, formatting, or deterministic execution, standard tool use is usually the better choice.

When should I use a script instead of more instructions?

Use a script when the operation is brittle, expensive to do by token generation, or needs high consistency. Migrations, validations, CI checks, API normalization, and repeatable file operations are strong candidates.

Conclusion

Anthropic’s best practices for agent skill development are not really about adding more clever prompt text. They are about choosing the right layer for the job. Cache the context you reuse. Script the operations that should not drift. Reserve extended thinking for the moments where careful reasoning is worth paying for.

That is the pattern that scales, and it is the pattern we expect to see more often in the strongest ASE submissions over the rest of 2026.

If you are refining your own catalog, start with one skill this week. Trim the core file, move the bulky reference content out, replace one fragile paragraph with a script, and only then decide whether the hard path deserves deeper reasoning. Small structural fixes usually outperform big prompt rewrites.

Sources: Anthropic engineering post on Agent Skills, Anthropic prompt caching docs, Anthropic skill authoring best practices, Anthropic extended thinking docs, and Claude Code skill and slash command docs.