Skill Detail

Vosk Offline Speech Recognition Toolkit

Perform offline speech recognition across 20+ languages with Vosk. Provides compact models, low-latency streaming transcription, and bindings for Python, Node.js, Java, C#, and Go, all without cloud API dependencies.

Media & Transcription Multi-Framework Published

Tool match: vosk-api ⭐ 14.5k GitHub stars Apache-2.0 license

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill vosk-offline-speech-recognition-toolkit

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Last updated

Apr 1, 2026

Quick brief

Vosk is an open-source offline speech recognition toolkit developed by Alpha Cephei. It provides speech-to-text transcription for 20+ languages using compact acoustic models that run entirely on-device. Unlike cloud-based ASR services, Vosk requires no internet connection and sends no audio data externally. Its streaming API returns partial results while audio is still arriving, with no network round trip, which makes it a good fit for privacy-sensitive applications, embedded devices, and real-time transcription scenarios.

How it works

What this skill actually does

How This Skill Works

This skill enables agents to transcribe audio to text using the Vosk speech recognition API. Agents load a pre-trained language model (roughly 50 MB for lightweight models, or 1-2 GB for large-vocabulary models), create a recognizer instance configured for the audio sample rate, and feed audio data through the recognizer in chunks. Vosk returns partial results in real time as audio streams in and a final result when each utterance completes. Results are formatted as JSON; when word output is enabled on the recognizer, they include word-level timestamps and confidence scores.
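The load, configure, and feed-in-chunks flow described above can be sketched in Python roughly as follows. This is a minimal illustration rather than the skill's actual implementation: the function names, the `chunk_frames` default, and the `on_partial` callback are assumptions made for the example, and it presumes the `vosk` package is installed and a model has been unpacked to a local directory.

```python
import json
import wave


def words_from_result(result_json):
    """Extract (word, start, end, conf) tuples from one Vosk result string.

    Word entries appear only when word output was enabled on the recognizer.
    """
    payload = json.loads(result_json)
    return [
        (w["word"], w["start"], w["end"], w["conf"])
        for w in payload.get("result", [])
    ]


def transcribe_wav(wav_path, model_dir, chunk_frames=4000, on_partial=None):
    """Stream a 16-bit mono PCM WAV file through Vosk.

    Returns the list of final result JSON strings, one per utterance.
    `on_partial` is a hypothetical callback that receives partial text.
    """
    # Imported lazily so the helper above works without vosk installed.
    from vosk import Model, KaldiRecognizer

    finals = []
    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        rec.SetWords(True)  # request per-word timings and confidences
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            if rec.AcceptWaveform(data):
                finals.append(rec.Result())  # an utterance just ended
            elif on_partial is not None:
                on_partial(json.loads(rec.PartialResult())["partial"])
        finals.append(rec.FinalResult())  # flush whatever is still buffered
    return finals
```

Each string returned by `transcribe_wav` can be passed to `words_from_result` to get per-word timings for downstream processing.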

Key Capabilities

Multi-language support: Pre-trained models for English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Arabic, Greek, Hindi, Japanese, Ukrainian, and more.
Streaming recognition: Process audio in real time with a streaming API that returns partial hypotheses as speech is detected, enabling live captioning and voice command scenarios.
Speaker identification: With a separate speaker model loaded alongside the language model, Vosk returns a speaker embedding for each utterance, which can be compared across utterances to tell speakers apart in multi-party conversations.
Reconfigurable vocabulary: Dynamically set grammar and vocabulary constraints to improve accuracy for domain-specific recognition (menu navigation, command sets, specialized terminology).
Cross-platform bindings: Native API bindings for Python, Node.js, Java, C#, C++, Rust, and Go. Works on Linux, Windows, macOS, Android, iOS, and Raspberry Pi.
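As an illustration of the reconfigurable-vocabulary capability, the Python binding accepts a grammar as a JSON array string passed as the third argument to `KaldiRecognizer`. The sketch below is an assumption-laden example: the helper names are hypothetical, and runtime grammars are generally expected to work only with models that support dynamic vocabulary (typically the small ones), so this should be checked against the chosen model.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Serialize phrases into the JSON array string used as a Vosk grammar."""
    items = [p.strip() for p in phrases if p.strip()]
    if allow_unknown:
        # "[unk]" lets speech outside the phrase list map to an unknown token
        # instead of being forced onto one of the listed phrases.
        items.append("[unk]")
    return json.dumps(items)


def make_command_recognizer(model_dir, phrases, sample_rate=16000):
    """Create a recognizer restricted to the given phrases (hypothetical helper)."""
    # Imported lazily so build_grammar can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    return KaldiRecognizer(Model(model_dir), sample_rate, build_grammar(phrases))
```

Restricting the recognizer to a short phrase list tends to suit menu navigation and command sets, where a small closed vocabulary matters more than open dictation.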

Integration Points

Vosk accepts 16-bit mono PCM audio, typically at 8 kHz or 16 kHz; the sample rate given to the recognizer must match the audio being fed in. It integrates with ffmpeg for audio format conversion, with WebSocket servers for browser-based real-time transcription, and with microphone input via PyAudio or PortAudio for live transcription. The Python package installs via pip install vosk and the Node.js package via npm install vosk. Output is structured JSON containing transcript text and, when word output is enabled, word-level timings and confidence scores suitable for downstream NLP processing.
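One way to handle the ffmpeg conversion step is to have ffmpeg decode any input file to raw 16-bit mono PCM on stdout and read it in chunks. The following is a sketch under assumptions: the function names and the `chunk_bytes` default are made up for the example, and it presumes an `ffmpeg` binary is available on PATH.

```python
import subprocess


def ffmpeg_pcm_command(source, sample_rate=16000):
    """Build an ffmpeg command that writes raw 16-bit mono PCM to stdout."""
    return [
        "ffmpeg", "-loglevel", "quiet", "-i", source,
        "-ar", str(sample_rate),  # resample to the rate the recognizer expects
        "-ac", "1",               # downmix to mono
        "-f", "s16le", "-",       # signed 16-bit little-endian PCM on stdout
    ]


def pcm_chunks(source, sample_rate=16000, chunk_bytes=4000):
    """Yield raw PCM chunks decoded by ffmpeg from any supported media file."""
    cmd = ffmpeg_pcm_command(source, sample_rate)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            data = proc.stdout.read(chunk_bytes)
            if not data:
                break
            yield data
```

Each chunk yielded by `pcm_chunks` can be handed to a recognizer's `AcceptWaveform` call, provided the recognizer was created with the same sample rate.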

Source

GitHub: alphacep/vosk-api (14.5k stars, Apache-2.0 license). Docs: alphacephei.com/vosk

Best fit

When to reach for it

Best when the job fits Media & Transcription.
Works naturally with Multi-Framework setups.

Trust & provenance

Why this listing is credible

Built around the vosk-api toolchain.
Trust status: Published.
14.5k GitHub stars on the linked upstream source.
License: Apache-2.0.
Last updated Apr 1, 2026.
