
whisper.cpp: High-Performance C/C++ Speech Recognition Engine

whisper.cpp is a lightweight, high-performance C/C++ port of OpenAI's Whisper automatic speech recognition model. It runs on CPU and GPU across all major platforms with zero dependencies and zero runtime memory allocations.

Media & Transcription, Multi-Framework, Security Reviewed
โญ 48.3k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill whisper-cpp-speech-recognition-engine
At a glance
Last updated: Mar 31, 2026
Quick brief

whisper.cpp is a portable C/C++ implementation of OpenAI's Whisper speech recognition model, maintained by the ggml-org team. With over 48,000 GitHub stars, it is one of the most widely used native implementations for on-device speech-to-text, and it supports macOS, iOS, Android, Linux, Windows, WebAssembly, and Raspberry Pi.

How it works

What this skill actually does

Architecture and Performance

The implementation is pure C/C++ with no external dependencies. It uses ARM NEON and Apple Accelerate on ARM platforms, AVX/AVX2 intrinsics on x86, and supports GPU acceleration through Apple Metal, NVIDIA CUDA, Vulkan, and OpenVINO. Mixed F16/F32 precision and integer quantization (Q4, Q5, Q8) reduce memory usage while maintaining accuracy. The engine achieves zero runtime memory allocations for predictable performance.
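To make the memory effect of quantization concrete, here is a rough back-of-the-envelope sketch. It multiplies the parameter counts quoted on this page (39M for tiny, 1.5B for large-v3) by an assumed number of bits per weight. The bits-per-weight figures are assumptions based on ggml's block layouts (32 weights per block plus a 16-bit scale), and the helper names are hypothetical. Real model files differ somewhat because not every tensor is quantized and files carry metadata.

```python
# Rough weight-storage estimate: parameters x bits per weight / 8.
# ASSUMPTION: bits per weight follow ggml block layouts
# (32 weights per block plus a 16-bit scale; Q5_0 adds a 32-bit mask).
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_0": 5.5,
    "Q4_0": 4.5,
}

# Parameter counts as quoted on this page.
PARAMS = {"tiny": 39_000_000, "large-v3": 1_500_000_000}


def estimate_mib(n_params: int, fmt: str) -> float:
    """Approximate size of the weights alone, in MiB."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / (1024 ** 2)


def size_table() -> dict:
    """Estimated MiB for every model/format pair, rounded to one decimal."""
    return {
        model: {fmt: round(estimate_mib(n, fmt), 1) for fmt in BITS_PER_WEIGHT}
        for model, n in PARAMS.items()
    }


if __name__ == "__main__":
    for model, sizes in size_table().items():
        print(model, sizes)
```

Under these assumptions a Q4_0 model needs roughly 28% of the storage of the same model in F16, which is why quantized variants are attractive on memory-constrained devices.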

Voice Activity Detection

whisper.cpp includes built-in Voice Activity Detection (VAD) that segments audio into speech regions before transcription. This reduces processing time on files with long silences and can improve accuracy by keeping non-speech audio out of the transcription pass. VAD can be configured independently and is available through both the CLI and the API.
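The idea of segmenting audio into speech regions can be illustrated with a much simpler technique than the one whisper.cpp ships. The sketch below is not whisper.cpp's detector: it substitutes a plain energy threshold over fixed-size frames, and every name and default in it (frame length, threshold, silence hangover) is a hypothetical choice for illustration.

```python
import math
from typing import List, Tuple


def frame_rms(samples: List[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of consecutive non-overlapping frames."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def speech_regions(
    samples: List[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold: float = 0.02,       # ASSUMPTION: samples are floats in [-1, 1]
    min_silence_frames: int = 3,   # quiet frames needed to close a region
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose frame energy exceeds the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_rms(samples, frame_len)
    regions = []
    start = None   # index of the first frame of the open region
    silence = 0    # consecutive quiet frames seen inside the open region
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # first quiet frame after the speech
                regions.append((start * frame_len / sample_rate,
                                end * frame_len / sample_rate))
                start, silence = None, 0
    if start is not None:
        # Audio ended while a region was still open.
        end = len(energies) - silence
        regions.append((start * frame_len / sample_rate,
                        end * frame_len / sample_rate))
    return regions
```

Only the returned spans would then be handed to the recognizer, which is where the time savings on silence-heavy files come from. An energy threshold is easily fooled by background noise, so treat this as a teaching aid only.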

CLI Usage

The primary interface is whisper-cli, which transcribes audio files to text with timestamps. Build with CMake:

cmake -B build && cmake --build build

Then run:

./build/bin/whisper-cli -f audio.wav

Input must be 16-bit WAV at 16 kHz; use ffmpeg to convert:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

Output formats include plain text, SRT subtitles, VTT, CSV, and JSON.
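When scripting the conversion and transcription steps above, it can help to check whether a file already matches the target format before invoking ffmpeg. The following is a minimal sketch using only Python's standard library; the function names are hypothetical, and the mono check simply mirrors the -ac 1 flag in the ffmpeg command rather than a documented whisper-cli requirement.

```python
import wave
from typing import List


def matches_conversion_target(path: str) -> bool:
    """True if the file is a mono, 16-bit PCM WAV sampled at 16 kHz."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getnchannels() == 1          # mirrors ffmpeg's -ac 1
                and wav.getsampwidth() == 2      # 2 bytes = 16 bits
                and wav.getframerate() == 16000  # mirrors ffmpeg's -ar 16000
            )
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file (for example an MP3 or an empty file).
        return False


def ffmpeg_convert_cmd(src: str, dst: str) -> List[str]:
    """Argument list for the ffmpeg conversion command shown above."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", dst]


def whisper_cli_cmd(wav_path: str,
                    cli: str = "./build/bin/whisper-cli") -> List[str]:
    """Argument list for the whisper-cli invocation shown above."""
    return [cli, "-f", wav_path]
```

The argument lists can be passed to subprocess.run(cmd, check=True). Building them as lists rather than shell strings avoids quoting problems with file names that contain spaces.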

Server Mode

The built-in HTTP server (whisper-server) provides a REST API compatible with the OpenAI Audio Transcriptions endpoint. This allows applications that use the OpenAI Whisper API to switch to local inference by changing only the base URL. The server supports concurrent requests and streaming output.
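As an illustration of talking to a local server, the sketch below builds a multipart upload request with Python's standard library, without sending it. Several details are assumptions rather than facts from this page: the default address 127.0.0.1:8080, the /inference path, and the form field names file and response_format. Check them against your server's configuration, and pass a different path if your deployment exposes an OpenAI-style route.

```python
import urllib.request
import uuid


def build_transcription_request(
    base_url: str,
    audio_bytes: bytes,
    filename: str = "audio.wav",
    path: str = "/inference",        # ASSUMPTION: adjust to your server
    response_format: str = "json",   # ASSUMPTION: field name and value
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying the audio."""
    boundary = uuid.uuid4().hex
    text_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="response_format"\r\n'
        "\r\n"
        f"{response_format}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n"
        "\r\n"
    ).encode("utf-8")
    closing = f"--{boundary}--\r\n".encode("utf-8")
    body = text_part + file_header + audio_bytes + b"\r\n" + closing
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With a server running, the request could be sent with urllib.request.urlopen(req) and the response body decoded as text. Keeping the path a parameter means only the base URL and route need to change when moving between a hosted API and local inference.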

Language Bindings

Official and community bindings exist for Python, Ruby, Java, Go, and Node.js (available as whisper.cpp on npm). The C-style API in whisper.h makes integration straightforward for any language with FFI support.
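To show what FFI access to the C-style API can look like, here is a minimal ctypes sketch that calls a single function, whisper_print_system_info, which takes no arguments and returns a C string. It assumes whisper.cpp has been built as a shared library that the system loader can find under the name "whisper"; that library name and the wrapper function are assumptions for illustration, and the sketch returns None when no library is available.

```python
import ctypes
import ctypes.util
from typing import Optional


def whisper_system_info(lib_path: Optional[str] = None) -> Optional[str]:
    """Call whisper_print_system_info() via ctypes.

    Returns None if the shared library or the symbol cannot be loaded.
    """
    # ASSUMPTION: the shared library is discoverable under the name "whisper".
    path = lib_path or ctypes.util.find_library("whisper")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        fn = lib.whisper_print_system_info
    except (OSError, AttributeError):
        return None
    fn.restype = ctypes.c_char_p   # const char * return value
    fn.argtypes = []               # the function takes no arguments
    result = fn()
    return result.decode("utf-8", errors="replace") if result else None
```

Declaring restype and argtypes explicitly matters with ctypes: without them, return values default to C int and pointers can be truncated on 64-bit platforms. Full transcription involves more of the API (context creation, parameter structs, segment retrieval), which the existing bindings already wrap.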

Models

Pre-converted GGML format models are available via the included download script: sh ./models/download-ggml-model.sh base.en. Models range from tiny (39M parameters) to large-v3 (1.5B parameters), with English-only variants for improved speed on English content.