If your team still handles routine server work by pasting commands into SSH sessions, infrastructure operations skills are worth your attention. The short version is simple: a good ops skill turns repeatable maintenance, diagnosis, and recovery work into a guarded workflow that an agent can execute without improvising the dangerous parts.

This article is for platform engineers, SREs, DevOps leads, and solo operators who want AI help with real systems work without handing over a blank check. We will look at the patterns that make infrastructure operations skills useful, where they fail, and how to design them so they are fast during an incident and boring in the best possible way during normal maintenance.

Key takeaways

  • Infrastructure operations is the missing category in many skill libraries, even though it benefits the most from repeatable guardrails.
  • Anthropic recommends concise descriptions and keeping SKILL.md under roughly 500 lines, with a description field capped at 1024 characters, which matters even more for risky ops tasks.
  • Google’s SRE workbook says operational work should stay under about 50% of team time, a useful benchmark for deciding what to automate first.
  • On ASE today, the marketplace exposes 2,315 published skills across 17 live categories and 10 frameworks, which means discoverability and scope discipline matter more than ever.

What infrastructure operations skills actually cover

Infrastructure operations skills sit next to runbook skills and CI/CD and deployment skills, but they solve a different problem. Runbooks are mostly about diagnosis and reporting. Deployment skills are about shipping. Infrastructure operations skills are about the repetitive work that keeps systems healthy between those moments.

That usually includes things like log rotation checks, disk and memory triage, service restarts, backup verification, certificate expiry checks, cron validation, queue inspection, and basic recovery procedures. In other words, the tasks people know how to do, rarely enjoy doing, and frequently do under time pressure.

Google’s SRE workbook defines toil as manual, repetitive, automatable work and notes that Google limits operational work to roughly 50% of team time because anything higher becomes hard to sustain. That is a useful filter for deciding what should become a skill first. If the work is predictable, frequent, and written down in a wiki anyway, you probably have a good candidate.

Why this category matters now

The ASE marketplace has grown quickly. The public API currently reports 2,315 published skills, 17 live categories, and 10 frameworks. In a catalog that large, generic “server helper” skills disappear. Specific operations skills do better because they solve one narrow job, map cleanly to one trigger set, and are easier to trust.

This is also where agent skills can do work that prompt libraries cannot. A prompt can remind an operator to “check disk usage and restart the service if needed.” An infrastructure operations skill can bundle the exact commands, the dangerous edge cases, the environment-specific exclusions, and the evidence the agent must collect before it touches anything. That shift from reminder to guarded execution is the entire point.

Anthropic’s skill authoring guidance reinforces this. The description field is not marketing copy. It is the discovery layer the model uses to decide whether a skill should load. Their docs also recommend keeping the main SKILL.md body under about 500 lines and using progressive disclosure for references and scripts. For infrastructure work, that is not just a token optimization. It is a safety feature. The agent should load the high-level policy first, then read the exact reference file for the specific subsystem it is about to touch.

The design patterns that make ops skills safe

1. Separate observe, decide, and act

The fastest way to make an ops skill scary is to combine diagnosis and action in one fuzzy paragraph. Safer skills break the workflow into three phases:

  • Observe: gather service status, logs, metrics, config, and timestamps.
  • Decide: compare the evidence to explicit thresholds or known failure patterns.
  • Act: run the smallest reversible change that addresses the verified issue.

If your skill jumps straight from “something looks wrong” to “restart the service,” it is not ready.
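The observe phase can be sketched as a read-only evidence script. This is a minimal sketch, not a reference implementation: the service name, the probed paths, and the output format are all assumptions for illustration.

```shell
#!/usr/bin/env sh
# Hypothetical collect_evidence.sh: observe only, never mutate.
# The service name and probed paths are assumptions for illustration.
SERVICE="${1:-myapp}"
OUT="$(mktemp)"

{
  echo "== timestamp (UTC) =="
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  echo "== service status =="
  systemctl status "$SERVICE" --no-pager 2>&1 || true
  echo "== disk usage =="
  df -h / 2>&1 || true
  echo "== recent service logs =="
  journalctl -u "$SERVICE" -n 50 --no-pager 2>&1 || true
} > "$OUT"

# Print the evidence path so the agent can summarize from it.
echo "$OUT"
```

Every command is non-mutating, and failures are captured as evidence rather than aborting the collection, so the decide phase always has something to read.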

2. Encode permission boundaries in plain language

A strong infrastructure operations skill says what the agent may do, what needs approval, and what is never allowed. This is much more useful than generic warnings.

## Allowed without approval
- Read logs, systemd status, disk usage, memory usage, and recent deploy metadata
- Run non-mutating health checks

## Requires approval
- Restarting a service
- Changing config files
- Rotating credentials
- Deleting files outside documented cache paths

## Never do automatically
- Drop a database
- Disable backups
- Change firewall defaults
- Modify DNS or TLS settings during an active incident without human sign-off

This kind of boundary keeps the model useful while removing ambiguity from the dangerous parts.

3. Put fragile commands into scripts

Anthropic’s best practices are direct on this point: use scripts when consistency matters. For ops work, that is often the difference between safe automation and a messy postmortem. A restart, backup verification, or log bundle collection step should usually live in scripts/, not as a command the model reassembles from memory.
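A minimal sketch of what such a restart script might look like, assuming systemd. The `OPS_APPROVED` approval variable and the two-second settle delay are made up for illustration, not a real convention.

```shell
# Hypothetical guarded restart helper. OPS_APPROVED and the settle
# delay are assumptions for illustration, not a real convention.
guarded_restart() {
  unit="$1"

  # Refuse to act without explicit human sign-off.
  if [ "${OPS_APPROVED:-}" != "yes" ]; then
    echo "refusing: set OPS_APPROVED=yes after human approval" >&2
    return 2
  fi

  systemctl restart "$unit" || return 1

  # Verify recovery instead of assuming success.
  sleep 2
  if ! systemctl is-active --quiet "$unit"; then
    echo "restart did not recover $unit" >&2
    return 1
  fi
  echo "restarted and verified: $unit"
}
```

Because the approval gate and the post-restart check live in the script, the model cannot accidentally skip either step when reassembling the workflow.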

That is the same reason skills like “Remote-control tmux sessions for interactive CLI agents” and “Audit OpenClaw host security posture and hardening gaps” are compelling. They package a bounded workflow, not just an idea.

4. Build gotchas from real incidents

Ops skills need a gotchas section even more than code-generation skills do. The useful gotchas are the ones that came from real failures, not imagined risks.

## Gotchas
- If disk usage is high on /var, check journald size before deleting app logs. On this fleet, journald growth is the usual root cause.
- Do not restart nginx before confirming the certificate renewal timer succeeded. A failed renewal plus restart can take all HTTPS traffic down.
- If system load is high after deploy, compare queue depth and worker concurrency first. The symptom often looks like CPU pressure but is caused by worker misconfiguration.

That is the kind of content an LLM will not reliably infer on its own.
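The first gotcha can even become a check the agent runs before proposing any deletion. This helper is a sketch that parses the human-readable size out of `journalctl --disk-usage` output; the parsing approach and the threshold are assumptions for illustration.

```shell
# Hypothetical helper: is journald the likely culprit for disk pressure?
# Takes the `journalctl --disk-usage` output line and a threshold in MB.
# The parsing and threshold are assumptions for illustration.
journald_over_threshold() {
  usage_line="$1"
  threshold_mb="$2"

  # Extract the first size token, e.g. "3.2G" or "500.0M".
  size=$(printf '%s\n' "$usage_line" | grep -oE '[0-9]+(\.[0-9]+)?[KMG]' | head -n 1)
  unit=$(printf '%s' "$size" | tail -c 1)
  num=${size%?}

  case "$unit" in
    K) mb=$(awk "BEGIN{print $num/1024}") ;;
    M) mb=$num ;;
    G) mb=$(awk "BEGIN{print $num*1024}") ;;
    *) return 2 ;;
  esac

  # Exit 0 (true) when journald usage meets or exceeds the threshold.
  awk "BEGIN{exit !($mb >= $threshold_mb)}"
}
```

If the check comes back true, the safer remediation is `journalctl --vacuum-size=...` rather than deleting application logs.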

5. Prefer evidence-rich output over “fixed it” output

A good infrastructure skill ends with a structured report: what was observed, what changed, what command ran, what the new health state is, and what still needs follow-up. This matters because the operator who takes over later was not inside the model’s head.
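One way to keep that report consistent is to emit it from a small helper rather than freeform prose. This is a sketch; the JSON shape and field names are illustrative assumptions, not a standard.

```shell
# Hypothetical report helper: structured evidence instead of "fixed it".
# The field names and JSON shape are assumptions for illustration.
emit_report() {
  observed="$1"; action="$2"; command_run="$3"; new_state="$4"; follow_up="$5"
  cat <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "observed": "$observed",
  "action": "$action",
  "command": "$command_run",
  "new_state": "$new_state",
  "follow_up": "$follow_up"
}
EOF
}
```

For example: `emit_report "disk 92% on /var" "vacuumed journald" "journalctl --vacuum-size=500M" "disk 61% on /var" "review log retention policy"`.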

A practical skill skeleton

Here is a minimal pattern for an infrastructure operations skill that handles service health triage and safe restart decisions.

ops-service-health/
├── SKILL.md
├── config.json
├── references/
│   ├── service-thresholds.md
│   ├── incident-patterns.md
│   └── rollback-rules.md
└── scripts/
    ├── collect_evidence.sh
    ├── restart_service.sh
    └── verify_recovery.sh
And the SKILL.md itself stays short:

---
name: checking-service-health
description: Checks Linux service health, gathers logs and resource signals, and performs guarded restart workflows when a service is degraded. Use when users mention failing services, restart loops, high disk usage, memory pressure, unhealthy systemd units, or incident triage on a host.
---

# Service Health Operations

1. Run scripts/collect_evidence.sh first and summarize findings.
2. Read references/service-thresholds.md before recommending action.
3. Restart only when thresholds match a documented recovery case.
4. After any action, run scripts/verify_recovery.sh and return a structured report.
5. If evidence is mixed, stop and ask for approval instead of escalating changes.

Notice what is missing: long prose about what Linux is, vague reminders to “be careful,” and giant all-purpose sections that cover every possible subsystem. The skill stays narrow. The references hold the detail. The scripts hold the fragile steps.
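The verification step from the skeleton might look like the sketch below. The systemd unit, the HTTP health endpoint, and the particular set of checks are assumptions for illustration; the point is that every check failure is counted and reported rather than silently ignored.

```shell
# Hypothetical verify_recovery.sh core: count failed checks and report.
# The systemd unit and HTTP health endpoint are illustrative assumptions.
verify_recovery() {
  unit="$1"; health_url="$2"; failures=0

  systemctl is-active --quiet "$unit" 2>/dev/null \
    || { echo "unit not active: $unit"; failures=$((failures + 1)); }

  curl -fsS --max-time 5 "$health_url" >/dev/null 2>&1 \
    || { echo "health check failed: $health_url"; failures=$((failures + 1)); }

  if [ "$failures" -eq 0 ]; then
    echo "recovery verified"
  else
    echo "$failures check(s) failed"
  fi
  return "$failures"
}
```

Returning the failure count as the exit status lets the calling workflow decide whether to report success, retry, or stop and ask for approval.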

If you want a simple checklist to adapt, copy this into your own repo:

- Define one operational surface only, such as backups, disk pressure, TLS renewals, or worker restarts
- List the exact user phrases and incident signals that should trigger the skill
- Write the three to seven gotchas learned from real incidents
- Move irreversible or fragile steps into scripts/
- Document what requires approval
- Define success checks after every action
- Return a structured incident summary with timestamps and next steps

If you want examples from the marketplace, the tmux session control and OpenClaw security audit skills mentioned above are good starting points. Pair them with earlier ASE blog coverage on progressive disclosure and gotchas sections. Those patterns are not abstract writing advice. They are what make ops automation survivable.

For external references, Anthropic’s skill authoring best practices are the clearest guide to discovery and structure, while Google’s Eliminating Toil chapter is still the best short argument for why repetitive ops work should become software.

Frequently asked questions

What is the difference between a runbook skill and an infrastructure operations skill?

A runbook skill usually focuses on diagnosis and reporting. An infrastructure operations skill includes the guarded actions that may follow, such as restart, cleanup, verification, or rollback, with explicit approval boundaries.

Should infrastructure operations skills always be fully automated?

No. Most teams should automate observation first, semi-automate remediation second, and reserve full automation for the narrow cases with proven recovery patterns and strong verification checks.

What should an infrastructure operations skill never do on its own?

It should never make irreversible, high-blast-radius changes without human approval. Database destruction, firewall rewrites, credential rotation, and DNS changes are common examples.

Conclusion

Infrastructure operations skills are where agent skills stop being clever and start being operationally valuable. The best ones reduce toil, speed up diagnosis, and make routine changes safer because they narrow the problem instead of pretending to solve everything.

If you are building one next, start small. Pick a single maintenance workflow that already has clear evidence, clear thresholds, and a known recovery path. Encode the guardrails, the gotchas, and the verification step. That is how you get an ops skill people will actually trust.

And if you publish it on ASE, make the scope obvious in the description. In a marketplace with thousands of skills, the safest skill is usually the one that knows exactly where it should stop.