
Whisper Subtitle Generator

Generates accurate subtitles and captions using OpenAI Whisper API with word-level timestamps. Outputs SRT, VTT, and ASS formats with configurable line length and speaker diarization via pyannote.

Media & Transcription · Gemini · Security Reviewed
Tool match: whisper · ⭐ 97.8k GitHub stars · MIT license
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill whisper-subtitle-generator
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Last updated: Mar 24, 2026
Quick brief

The Whisper Subtitle Generator skill combines OpenAI Whisper speech recognition with professional subtitle formatting to produce broadcast-ready caption files. It supports both the Whisper API and local whisper.cpp inference for flexible deployment options.
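The backend choice above can be sketched as a single dispatch function. This is a hedged illustration, not the skill's actual code: the API-branch parameters (`model`, `response_format`, `timestamp_granularities`) are real options of the OpenAI transcriptions endpoint, but the whisper.cpp binary name and flags shown in the local branch are assumptions that vary between releases.

```python
def build_transcription_request(audio_path: str, backend: str = "api") -> dict:
    """Return the parameters for a transcription run without executing it.

    The "api" branch mirrors the OpenAI transcriptions endpoint; the
    "local" branch assumes whisper.cpp's CLI (binary name and flags are
    illustrative and should be checked against your build).
    """
    if backend == "api":
        return {
            "endpoint": "audio/transcriptions",
            "model": "whisper-1",
            "file": audio_path,
            "response_format": "verbose_json",
            # Word-level timing requires the verbose_json response format.
            "timestamp_granularities": ["word"],
        }
    # Local whisper.cpp invocation (hypothetical binary name and model path).
    return {"argv": ["whisper-cli", "-m", "models/ggml-base.en.bin",
                     "-f", audio_path, "--output-srt"]}
```

Separating request construction from execution like this also makes the batch-processing path easier to test and retry.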

How it works

What this skill actually does

The skill processes audio through the Whisper API transcriptions endpoint with word-level timestamp granularity, enabling precise subtitle synchronization. It formats output into industry-standard formats: SRT (SubRip) for universal compatibility, WebVTT for HTML5 video players with CSS styling support, and ASS (Advanced SubStation Alpha) for complex typographical layouts. Line-breaking algorithms respect maximum characters-per-line and reading-speed constraints (typically 20 characters per second) to ensure viewer comfort.

Speaker diarization integrates with pyannote.audio to identify and label different speakers, creating color-coded or prefixed subtitle tracks for multi-speaker content such as interviews and panel discussions. The skill handles pre-processing through pydub for audio segmentation and format normalization before transcription. Post-processing includes punctuation restoration, profanity filtering via custom word lists, and translation into English via the Whisper translations endpoint (which outputs English only). Batch processing covers entire video libraries, with progress tracking and error recovery for interrupted jobs.
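In practice the diarization turns come from a pyannote.audio pipeline; the sketch below shows only the alignment step that follows, assigning each subtitle cue the speaker whose turn overlaps it most. The tuple shapes and speaker labels are illustrative assumptions.

```python
def label_speakers(cues, turns):
    """Prefix each subtitle cue with the diarization speaker whose turn
    overlaps it the most.

    cues:  list of (start, end, text) subtitle cues
    turns: list of (start, end, speaker_label) diarization segments
    """
    labeled = []
    for c_start, c_end, text in cues:
        best, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Overlap between the cue and this speaker turn (<= 0 if disjoint).
            overlap = min(c_end, t_end) - max(c_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        prefix = f"[{best}] " if best else ""
        labeled.append((c_start, c_end, prefix + text))
    return labeled
```

Cues with no overlapping turn (e.g. music or silence mis-segmented as speech) are left unprefixed rather than guessed, which matches the error-recovery posture described above.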