Skill Detail

RunPod Serverless GPU Inference

Deploy and manage GPU inference endpoints on RunPod Serverless using their REST API. Handles endpoint creation, cold start optimization, request queuing, and auto-scaling configuration for image generation models.

Image & Creative AutomationClaude Agents
Image & Creative Automation Claude Agents Security Reviewed
Tool match: docker
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill runpod-serverless-gpu-inference Copy
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Last updated
Mar 24, 2026
Quick brief

Run GPU-accelerated inference workloads on RunPod’s serverless platform via their management and inference APIs. This skill automates endpoint deployment and request orchestration for ML model serving.

How it works

What this skill actually does

Endpoint creation uses the RunPod GraphQL API to configure worker templates with Docker image references, GPU type selection (A100, A40, RTX 4090), and scaling parameters including min/max workers, idle timeout, and request queue depth.

Inference requests are submitted via POST /v2/{endpoint_id}/run for async execution or /v2/{endpoint_id}/runsync for synchronous responses. The skill manages request lifecycle by polling /v2/{endpoint_id}/status/{job_id} for async jobs and implements client-side timeout handling.

Cold start optimization strategies include maintaining minimum active workers during peak hours, pre-warming endpoints with lightweight health check requests, and configuring execution timeout buffers. The skill tracks per-endpoint metrics including queue depth, average execution time, and GPU utilization via the RunPod status API.

Cost management features monitor GPU-seconds consumed per endpoint and alert when spend approaches configured budget thresholds.