At 3 AM, your phone buzzes. Something is wrong with production. You pull up a terminal, try to remember which logs to check first, and start the familiar ritual of poking at symptoms until a pattern emerges.
Now imagine your AI agent already knows that ritual. It knows which symptoms map to which causes, which commands to run in what order, and how to format its findings into a structured report your team can act on immediately. That’s what a runbook skill does.
Runbooks are Category 8 in Anthropic’s nine-category agent skill taxonomy. They follow a specific pattern (symptom → investigation → structured report), and they’re one of the highest-value skill types you can build. This post covers what makes them work, where they fall apart, and includes a full example you can adapt.
What Runbook Skills Actually Do
A runbook skill encodes diagnostic knowledge into a format an AI agent can execute. The agent receives a symptom description (“the API is returning 502 errors”), follows a structured investigation path, and produces a report with findings, probable cause, and recommended actions.
This is different from a tutorial or a checklist. Tutorials teach humans. Runbooks teach agents. The distinction matters because the agent needs:
- Decision trees, not step lists: “If X, check Y. If not X, check Z.”
- Tool commands it can run, not descriptions of what a human should click.
- Output format specifications, so the report is consistent and machine-parseable.
Think of a runbook skill as the senior engineer on your team who’s seen every failure mode, compressed into a file the agent reads when it needs that expertise.
The Symptom → Investigation → Report Pattern
Anthropic’s Thariq described runbooks as following a specific three-phase structure. Each phase has a distinct purpose:
Phase 1: Symptom Matching
The skill’s description field defines which symptoms trigger it. This is where description optimization matters most: runbook skills need to match against error messages, alert names, and the informal language engineers use when reporting problems.
```yaml
description: >
  Use when diagnosing server health issues: high CPU, memory leaks,
  disk full, OOM kills, load average spikes, process hangs, zombie
  processes, swap thrashing. Triggers on: "server is slow",
  "out of memory", "disk space", "load average", "process stuck",
  "can't SSH". NOT for: network/DNS issues (use network-triage),
  application-level errors (use app-debug).
```
Notice the inclusion of informal phrases like “server is slow” alongside technical terms like “OOM kills.” Your on-call engineer might say either one. The skill needs to catch both.
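The model itself performs symptom matching against the description field, but conceptually the description behaves like a trigger list. This toy matcher is purely illustrative of why informal phrases and technical terms both need to be present; `TRIGGERS` and `matches_runbook` are not part of any skill runtime.

```python
# Illustrative only: the agent's model does the real matching against the
# description field. This shows why "server is slow" and "OOM" both need
# to appear as triggers.
TRIGGERS = ["server is slow", "out of memory", "oom", "disk space",
            "load average", "process stuck", "can't ssh"]

def matches_runbook(symptom_report):
    """True if the report contains any trigger phrase (case-insensitive)."""
    text = symptom_report.lower()
    return any(trigger in text for trigger in TRIGGERS)
```

A description that only listed “OOM kills” would miss the 3 AM message that just says “the server is slow again.”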
Phase 2: Investigation
The investigation phase is where the skill earns its value. It provides a structured diagnostic path: commands to run, output to interpret, and branching logic based on what the agent finds.
Good investigation sections don’t prescribe a rigid sequence. They provide a priority-ordered set of checks with conditional branching:
```markdown
## Investigation Flow

### Step 1: Quick Health Snapshot

Run these commands and capture output:

- `uptime` – check load averages (1min, 5min, 15min)
- `free -h` – memory and swap usage
- `df -h` – disk usage across mount points
- `dmesg --since "1 hour ago" | tail -50` – recent kernel messages

### Step 2: Branch Based on Findings

**If load average > number of CPU cores:**
→ Jump to "CPU Investigation" section

**If memory usage > 90% or swap usage > 50%:**
→ Jump to "Memory Investigation" section

**If any filesystem > 85% capacity:**
→ Jump to "Disk Investigation" section

**If dmesg shows OOM killer entries:**
→ Jump to "OOM Investigation" section

**If none of the above:**
→ Jump to "Process-Level Investigation" section
```
This gives the agent a decision framework rather than a script. It can adapt to what it actually finds instead of blindly executing steps that don’t apply.
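The branching in Step 2 can be sketched as a small priority-ordered function, assuming the Step 1 snapshot has already been parsed into plain numbers. The function name and parameters are illustrative, not part of any skill runtime.

```python
# A sketch of the Step 2 decision framework. Checks run in priority
# order; the first matching condition picks the investigation branch.
def pick_branch(load_1min, cpu_cores, mem_pct, swap_pct,
                max_disk_pct, oom_in_dmesg):
    """Return which investigation section the agent should load next."""
    if load_1min > cpu_cores:
        return "cpu-investigation"
    if mem_pct > 90 or swap_pct > 50:
        return "memory-investigation"
    if max_disk_pct > 85:
        return "disk-investigation"
    if oom_in_dmesg:
        return "oom-investigation"
    return "process-investigation"
```

Note that the load check compares against the core count, not a fixed number; that mirrors the `nproc` gotcha later in this post.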
Phase 3: Structured Report
The report phase defines the output format. Consistency here is what makes runbook skills useful at scale: every investigation produces a report in the same structure, so your team can scan findings quickly.
```markdown
## Report Format

Produce the following structure after investigation:

### Triage Report: [Hostname] – [Date/Time UTC]

**Severity:** Critical / Warning / Informational
**Symptom:** [What was reported]
**Duration:** [When it started, if determinable]

**Findings:**
1. [Finding with supporting data]
2. [Finding with supporting data]

**Probable Cause:** [Assessment based on evidence]

**Recommended Actions:**
- [ ] [Immediate action]
- [ ] [Follow-up action]
- [ ] [Preventive measure]

**Raw Data:**
[Attach relevant command outputs]
```
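Filling that template is mechanical once the investigation data exists. Here is a minimal sketch of a renderer; `render_report` and its parameter names are illustrative, not part of any skill API.

```python
# A sketch of rendering the report template from investigation results.
# Every investigation producing the same shape is what makes reports
# scannable at scale.
def render_report(hostname, timestamp_utc, severity, symptom,
                  duration, findings, probable_cause, actions):
    """Render a triage report in the fixed structure the skill defines."""
    lines = [
        f"### Triage Report: {hostname} - {timestamp_utc}",
        f"**Severity:** {severity}",
        f"**Symptom:** {symptom}",
        f"**Duration:** {duration}",
        "**Findings:**",
    ]
    lines += [f"{i}. {finding}" for i, finding in enumerate(findings, 1)]
    lines.append(f"**Probable Cause:** {probable_cause}")
    lines.append("**Recommended Actions:**")
    lines += [f"- [ ] {action}" for action in actions]
    return "\n".join(lines)
```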
Anatomy of a Complete Runbook Skill
Here’s a full, working runbook skill for server triage. You can install this directly or use it as a template for your own runbooks.
```text
my-server-triage/
├── SKILL.md                        # Core triage logic (~300 lines)
├── config.json                     # SSH targets, thresholds, alert channels
├── references/
│   ├── cpu-investigation.md        # Deep CPU diagnostic procedures
│   ├── memory-investigation.md     # Memory leak detection and analysis
│   ├── disk-investigation.md       # Disk space and I/O troubleshooting
│   └── oom-investigation.md        # OOM killer analysis and prevention
└── scripts/
    ├── health-snapshot.sh          # Quick health data collection script
    └── process-report.sh           # Top processes by resource usage
```
The `SKILL.md` handles the triage flow and report generation. The `references/` directory holds deep-dive procedures the agent loads only when a specific issue type is identified. The `scripts/` directory contains reliable data collection commands the agent executes rather than reconstructing from memory.
This is progressive disclosure applied to operations. The agent doesn’t load the memory investigation guide when the problem is disk space. It loads what it needs, when it needs it.
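One way to picture progressive disclosure is a lazy loader that maps each branch to its reference file and only reads from disk when that branch is selected. The paths mirror the `references/` layout above; the loader itself is an illustration, not part of any skill runtime.

```python
# Illustrative sketch: each investigation branch points at one deep-dive
# guide, and nothing is read until the triage flow selects that branch.
from pathlib import Path

REFERENCES = {
    "cpu": "references/cpu-investigation.md",
    "memory": "references/memory-investigation.md",
    "disk": "references/disk-investigation.md",
    "oom": "references/oom-investigation.md",
}

def load_reference(branch, skill_root="my-server-triage"):
    """Load one branch's deep-dive guide; the rest stay on disk."""
    path = Path(skill_root) / REFERENCES[branch]
    return path.read_text()
```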
Building the Gotchas Section for Runbooks
Runbook skills need especially strong gotchas sections because the consequences of wrong diagnostic actions are higher than in most skill types. An agent that misreads `top` output or runs a destructive command during investigation can make things worse.
Here’s what a runbook gotchas section should cover:
```markdown
## Gotchas

### Command Safety

- NEVER run `kill -9` on processes without explicit user approval.
  Use `kill -15` (SIGTERM) first and wait 30 seconds.
- Do NOT restart services during investigation. Restarting destroys
  evidence (open file handles, connection states, core dumps).
- `rm` is forbidden in runbook context. If temp files need cleanup,
  flag them in the report for human action.

### Interpretation Pitfalls

- Load average on a 4-core machine: 4.0 is 100% utilization, not
  "moderate load." Always compare against `nproc` output.
- `free -h` output is easy to misread: buffer/cache inflates apparent
  usage. The actual available memory is the "available" column, not
  "total minus used."
- `df -h` shows filesystem capacity, but inodes can be exhausted
  independently. Always run `df -i` alongside `df -h`.
- A process showing 400% CPU in `top` is using 4 cores, not
  exceeding capacity. Compare against the total core count.

### SSH and Access

- If SSH connection fails, do NOT retry more than 3 times.
  Report SSH failure as a finding: the server may be too
  overloaded to accept connections.
- Use non-interactive commands only. Never launch `vim`, `less`,
  or any pager. Pipe through `head` or `tail` instead.
- Timeout all commands at 30 seconds. A hanging command means
  the server is in worse shape than expected; report that.
```
Every entry here comes from real failure modes. An agent that misreads `free` output, interpreting buffer/cache memory as genuinely consumed, will produce incorrect triage reports. An agent that kills processes without asking will create new incidents while trying to diagnose existing ones.
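The timeout rules above can be enforced in a wrapper rather than trusted to each command. This is a minimal sketch assuming a Python-based harness; `run_diagnostic` and the result-dict shape are illustrative, not part of any skill runtime.

```python
# A sketch of the command-safety rules as a wrapper: read-only commands
# only, a hard timeout, and a timeout reported as a finding rather than
# retried, since a hang is itself diagnostic.
import subprocess

def run_diagnostic(argv, timeout=30):
    """Run one read-only diagnostic command; never block past the timeout."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
        return {"cmd": argv, "timed_out": False, "output": proc.stdout}
    except subprocess.TimeoutExpired:
        # The server may be too overloaded to answer; say so in the report.
        return {"cmd": argv, "timed_out": True,
                "output": "FINDING: command exceeded timeout"}
```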
Configuration: Making Runbooks Portable
Runbook skills need more configuration than most skill types. Different teams have different servers, different threshold definitions, and different escalation paths. All of this belongs in `config.json`:
```json
{
  "thresholds": {
    "cpu_warning": 80,
    "cpu_critical": 95,
    "memory_warning": 85,
    "memory_critical": 95,
    "disk_warning": 80,
    "disk_critical": 90,
    "load_factor": 1.5
  },
  "ssh_defaults": {
    "timeout_seconds": 30,
    "max_retries": 3
  },
  "report": {
    "format": "markdown",
    "include_raw_output": true,
    "max_raw_lines": 100
  },
  "escalation": {
    "critical_channel": "#incidents",
    "warning_channel": "#ops-alerts"
  }
}
```
The `load_factor` setting is a multiplier on the CPU core count that defines when load average counts as elevated. A value of 1.5 means a 4-core machine with a load average above 6.0 triggers the CPU investigation branch. Different teams have different tolerances: some run hot and only care at 2.0×; others want early warnings at 1.0×.
By externalizing these thresholds, one runbook skill serves every team. Install it, edit `config.json`, and the agent adapts.
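Applying the externalized threshold is one line of arithmetic; the sketch below uses only the `load_factor` key from the config above, and `load_threshold` is an illustrative helper, not part of any skill runtime.

```python
# A sketch of applying the externalized load_factor threshold.
import json

def load_threshold(config, cores):
    """Load average above load_factor x cores triggers the CPU branch."""
    return config["thresholds"]["load_factor"] * cores

config = json.loads('{"thresholds": {"load_factor": 1.5}}')
# A 4-core machine with load_factor 1.5 branches above 6.0:
assert load_threshold(config, cores=4) == 6.0
```

Swapping the config value, not the skill, is what changes a team's tolerance.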
Runbooks vs. Monitoring: Complementary, Not Competing
A common objection to runbook skills: “We already have monitoring and alerting. Why do we need this?”
Monitoring tells you something is wrong. A runbook skill tells the agent what to do about it. They’re different layers:
| Layer | Tool | Function |
|---|---|---|
| Detection | Prometheus, Datadog, CloudWatch | Alert when metrics cross thresholds |
| Diagnosis | Runbook skill | Investigate the alert and identify root cause |
| Resolution | CI/CD skills, deployment skills | Apply fixes (with human approval) |
The gap between detection and resolution is where on-call engineers spend most of their time. That gap is exactly what runbook skills fill. Your monitoring system fires an alert at 3 AM. Instead of a bleary-eyed engineer running diagnostic commands from memory, the agent executes the runbook, produces a structured report, and presents findings with recommended actions. The engineer reviews the report and approves the fix โ or escalates with a complete picture already assembled.
This doesn’t replace human judgment. It front-loads the tedious data collection so human judgment can focus on the decisions that matter.
Real-World Runbook Patterns
Beyond server triage, runbook skills fit a wide range of operational scenarios. Here are patterns teams are building on AgentSkillExchange:
Database Performance Triage
Symptom: slow queries, connection pool exhaustion, replication lag. Investigation: check `pg_stat_activity`, identify long-running queries, analyze `EXPLAIN` plans, check replication status. Report: query performance breakdown, lock analysis, recommended index additions.
Deployment Failure Analysis
Symptom: deploy failed, rollback triggered, health checks failing post-deploy. Investigation: compare current vs. previous container images, check resource limits, review recent config changes, analyze health check endpoint responses. Report: failure timeline, diff summary, recommended rollback or fix.
Certificate and TLS Issues
Symptom: SSL errors, certificate expiry warnings, mixed content. Investigation: check certificate chain, verify renewal automation, scan for expiring certs across services. Report: certificate inventory, expiry timeline, renewal action items.
Cost Spike Investigation
Symptom: cloud bill unexpectedly high. Investigation: identify top cost contributors, compare against baseline, check for orphaned resources, review recent scaling events. Report: cost attribution breakdown, anomalous resources, recommended cleanup.
Each of these follows the same symptom → investigation → report pattern. The structure is consistent; the domain knowledge changes.
Testing Runbook Skills
Runbook skills require more rigorous testing than most skill types because they interact with live systems. Three testing strategies that work:
1. Dry-run mode. Add a `dry_run` flag in `config.json` that makes the skill print commands instead of executing them. This lets you verify the investigation flow without touching real servers.
2. Canned output testing. Save real command outputs from past incidents into `test-data/`. Feed these to the agent and verify it produces correct triage reports. This tests the interpretation logic without needing access to infrastructure.
3. Chaos engineering integration. Tools like Chaos Monkey or Chaos Mesh can inject controlled failures into staging environments. Run the runbook skill against those failures and verify it identifies the injected problem correctly.
The third approach is the gold standard. If your runbook skill can correctly diagnose a chaos-engineered failure, it’ll handle the real thing.
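The canned-output strategy is cheap to start with. This sketch feeds a saved `free -m` line to a tiny parser and asserts it reads the "available" column, matching the interpretation gotcha from earlier; the parser, the sample output, and `available_mb` are all illustrative.

```python
# Canned-output testing sketch: saved command output from a past
# incident, plus an assertion on how the agent should interpret it.
SAVED_FREE_OUTPUT = (
    "              total        used        free      shared  buff/cache   available\n"
    "Mem:           7821        6100         300         120        1421        1500\n"
)

def available_mb(free_output):
    """Return the 'available' column from `free -m` output, in MB."""
    mem_line = next(line for line in free_output.splitlines()
                    if line.startswith("Mem:"))
    return int(mem_line.split()[-1])

# The reliable figure is the kernel's "available" estimate (1500 MB here),
# not an arithmetic guess from the other columns.
assert available_mb(SAVED_FREE_OUTPUT) == 1500
```

A folder of these fixtures, one per past incident, turns your incident history into a regression suite for the skill's interpretation logic.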
Common Mistakes When Building Runbook Skills
After reviewing dozens of runbook skill submissions, these are the patterns that cause rejection or poor performance:
- **No safety boundaries.** Runbooks that include `rm`, `kill -9`, or service restarts in the investigation phase. Investigation should be read-only. Remediation actions go in the report as recommendations, not automated steps.
- **Missing timeout handling.** Commands that hang on an overloaded server will block the agent indefinitely. Every command in a runbook needs a timeout, and “command timed out” should be treated as diagnostic information.
- **Hardcoded thresholds.** “CPU above 80% is critical” is not universally true. A batch processing server running at 95% CPU might be perfectly healthy. Thresholds belong in `config.json`.
- **Flat investigation structure.** Running every check regardless of findings wastes time and produces noisy reports. Use conditional branching: skip the disk investigation if disk usage is at 20%.
- **No report template.** Without a consistent output format, every investigation produces a different-shaped report. Your team can’t build processes around inconsistent outputs.
Getting Started
If you want to build your first runbook skill:
- Pick your most common incident type. Check your on-call logs from the last 3 months. What gets paged most? That’s your first runbook.
- Document what the best engineer does. Shadow your most experienced on-call engineer through one of these incidents. Write down every command they run, every decision point, every “I check this because…” explanation. That’s your investigation flow.
- Build the skill structure. Follow the skill creation guide for the basics, then apply the runbook patterns from this post.
- Test with past incidents. Replay 3–5 real incidents through the skill. Compare the agent’s report against what actually happened. Refine the gotchas section based on where the agent got it wrong.
- Ship it and iterate. Publish to your team’s skill directory or to AgentSkillExchange. Every real incident that uses the runbook will surface new gotchas and refinements.
The goal isn’t a perfect runbook on day one. It’s a runbook that gets better with every incident, accumulating your team’s operational knowledge in a form that compounds over time.
Frequently Asked Questions
Can a runbook skill actually fix problems, or just diagnose them?
Runbook skills focus on diagnosis and reporting. Remediation actions should appear as recommendations in the report, requiring human approval before execution. You can build skills that also remediate, but mixing diagnosis and automated remediation in one skill creates safety risks. Keep them separate โ a runbook skill identifies the problem, a deployment or CI/CD skill applies the fix.
How do runbook skills handle multi-service architectures?
For microservice environments, build one runbook per service boundary (database, API gateway, message queue) rather than one mega-runbook. Each skill’s description field scopes it to relevant symptoms. The agent selects the right runbook based on the reported symptom, and can chain multiple runbooks if the investigation reveals cross-service issues.
What’s the difference between a runbook skill and just pasting commands into ChatGPT?
Three things: persistence, structure, and execution. A runbook skill lives in your codebase, improves over time, and the agent can actually run the commands against real infrastructure. Pasting commands into a chat is one-shot with no learning loop, no consistent output format, and no connection to your systems.
Browse more category deep dives and skill-building tutorials on the ASE blog, or explore runbook skills on the marketplace.