Skill Detail

Investigate production incidents across Kubernetes and cloud signals with HolmesGPT

Use HolmesGPT when an on-call agent needs one investigation loop that pulls alerts, logs, metrics, and infrastructure context from multiple systems and returns a root-cause path instead of forcing a human to hop across separate observability products.

Runbooks & Diagnostics, Custom Agents, Security Reviewed
⭐ 2.3k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill investigate-production-incidents-across-kubernetes-and-cloud-signals-with-holmesgpt
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required: HolmesGPT CLI or operator deployment, one supported LLM provider, and connected observability/toolset integrations
Install & setup: Follow the HolmesGPT installation docs to deploy the CLI or Kubernetes operator, configure an LLM provider, and connect the relevant observability toolsets before running incident investigations.
Author: HolmesGPT
Publisher: Organization
Last updated: Apr 21, 2026
Quick brief

HolmesGPT is publishable as a skill because the user-facing job is specific and operational: investigate a live production incident by querying connected observability and infrastructure systems, correlating the evidence, and returning a likely root-cause path with remediation direction. The upstream project describes itself as an open-source AI agent for production incident investigation, with built-in integrations for Kubernetes, Prometheus, Grafana, Datadog, cloud services, databases, ticketing systems, and more.

How it works

What this skill actually does

Invoke this skill, rather than working in the underlying products directly, when the real need is cross-source incident investigation and not dashboard-by-dashboard inspection. Reach for HolmesGPT when an agent should gather the relevant signals, follow the evidence across systems, and explain what is probably broken, so that nobody has to pivot manually through kubectl, alerting tools, logs, traces, and cloud consoles.
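As a rough illustration of handing a cross-source question to HolmesGPT from an agent, here is a minimal Python sketch. It assumes the CLI is installed as `holmes` and accepts a free-form question through an `ask` subcommand; confirm this with `holmes --help` for your installed version. The helper names `build_holmes_command` and `investigate` are hypothetical and not part of HolmesGPT.

```python
import shutil
import subprocess
from typing import List, Optional


def build_holmes_command(question: str) -> List[str]:
    """Build the argv list for one free-form investigation question.

    Assumption: the installed CLI is named `holmes` and takes the
    question as a single argument after `ask`.
    """
    if not question.strip():
        raise ValueError("question must not be empty")
    return ["holmes", "ask", question]


def investigate(question: str) -> Optional[str]:
    """Run the question through the CLI and return its stdout.

    Returns None when no `holmes` binary is found on PATH, so the
    caller can fall back to manual triage.
    """
    cmd = build_holmes_command(question)
    if shutil.which(cmd[0]) is None:
        return None
    # check=False: a non-zero exit is left for the caller to interpret.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling `investigate("Why is the checkout service returning 5xx errors?")` would return whatever the CLI prints, or `None` if the CLI is not installed. Which signals HolmesGPT can actually consult depends on the toolsets connected during setup.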

The scope boundary keeps this from collapsing into a plain product card: it is neither a generic observability platform listing nor just a connector bundle. The bounded workflow is incident triage and root-cause investigation across existing telemetry sources. That is a concrete agent job, with a narrower operator outcome than the platforms HolmesGPT connects to.