
Silero VAD Pre-Trained Enterprise Voice Activity Detection

Silero VAD is a pre-trained enterprise-grade Voice Activity Detector that identifies speech segments in audio streams. It runs locally via PyTorch or ONNX Runtime with minimal resource requirements, making it ideal for real-time audio processing pipelines.


Media & Transcription Multi-Framework Security Reviewed

Tool match: silero-vad ⭐ 8.6k GitHub stars MIT license

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill silero-vad-voice-activity-detection

Works best when you want a reusable capability, not another fragile one-off prompt.


At a glance

Last updated

Mar 30, 2026

Quick brief

Silero VAD is a pre-trained, enterprise-grade Voice Activity Detector developed by the Silero team. It detects speech segments in audio with high accuracy, running efficiently on CPU with minimal memory footprint. The model supports both PyTorch and ONNX Runtime backends, enabling deployment on embedded systems, mobile devices, and servers alike.

How it works

What this skill actually does

Core Capabilities

Silero VAD processes audio sampled at 16 kHz or 8 kHz and returns the start and end timestamps of each speech region. It distinguishes speech from background noise, music, and other non-speech audio. Each chunk of roughly 30 ms of audio is processed in under 1 ms on a single CPU thread, so the detector keeps up comfortably with real-time input. It needs about 1 GB of RAM and a modern CPU with AVX instruction support.
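To put the latency figure in context, a rough calculation helps. The sketch below assumes a 512-sample chunk at 16 kHz and a 256-sample chunk at 8 kHz, plus a 1 ms processing time per chunk. These numbers are illustrative assumptions, not values stated in this listing.

```python
def chunk_duration_ms(num_samples: int, sample_rate: int) -> float:
    """Length of audio covered by one chunk, in milliseconds."""
    return 1000.0 * num_samples / sample_rate


def real_time_factor(processing_ms: float, num_samples: int, sample_rate: int) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_ms / chunk_duration_ms(num_samples, sample_rate)


# Assumed chunk sizes and processing time (illustrative only).
print(chunk_duration_ms(512, 16000))      # 32.0
print(chunk_duration_ms(256, 8000))       # 32.0
print(real_time_factor(1.0, 512, 16000))  # 0.03125
```

Under these assumptions, one CPU thread spends about 3% of wall-clock time on detection, which leaves the rest of the budget for transcription or other downstream work.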

How It Works

Install with pip install silero-vad and load the model with a single function call. Feed audio data in chunks and receive speech probability scores, or pass a whole recording and receive timestamp ranges. The library provides utility functions for reading audio files, extracting speech segments, and saving results. The model can also be loaded through torch.hub, which avoids installing the pip package but still requires PyTorch.
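A minimal sketch of the batch workflow follows, assuming the silero-vad pip package is installed. The wrapper names detect_speech and to_seconds are hypothetical helpers introduced here for illustration; load_silero_vad, read_audio, and get_speech_timestamps come from the package. The import sits inside the function so the conversion helper can be used on its own.

```python
from typing import Dict, List, Tuple


def detect_speech(path: str, sampling_rate: int = 16000) -> List[Dict[str, int]]:
    """Return speech regions as sample offsets, e.g. [{'start': 0, 'end': 16000}]."""
    # Lazy import: only needed when detection is actually run.
    from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

    model = load_silero_vad()
    wav = read_audio(path, sampling_rate=sampling_rate)
    return get_speech_timestamps(wav, model, sampling_rate=sampling_rate)


def to_seconds(
    timestamps: List[Dict[str, int]], sampling_rate: int = 16000
) -> List[Tuple[float, float]]:
    """Convert sample-offset regions into (start, end) pairs in seconds."""
    return [(t["start"] / sampling_rate, t["end"] / sampling_rate) for t in timestamps]


# Conversion on hand-written offsets; no model or audio file required.
print(to_seconds([{"start": 16000, "end": 48000}]))  # [(1.0, 3.0)]
```

Calling to_seconds(detect_speech("meeting.wav")) would run the detector on a file, where "meeting.wav" is a placeholder path.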

Agent Integration

AI agents use Silero VAD as a preprocessing step in audio intelligence pipelines. Before running expensive transcription with Whisper or speaker diarization, VAD identifies which portions of the audio contain speech, so only those portions are processed, which reduces time and cost. faster-whisper uses it for its built-in VAD filter, WhisperX supports it as a VAD option, and many real-time transcription systems rely on it as well. The model runs entirely offline with no API calls required.
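The preprocessing step can be sketched as follows. This is an illustrative outline, not the method used by any particular pipeline: merge_segments and speech_fraction are hypothetical helpers, and the 0.5 s gap is an assumed default. Merging nearby segments avoids handing the transcriber many tiny clips, and the speech fraction gives a rough estimate of how much audio still needs transcription.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_segments(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Join speech segments whose gap is at most max_gap seconds."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def speech_fraction(segments: List[Segment], total_seconds: float) -> float:
    """Share of the recording that contains speech (0.0 to 1.0)."""
    if total_seconds <= 0:
        return 0.0
    return sum(end - start for start, end in segments) / total_seconds


# Hand-written VAD output for a 20-second recording.
vad_output = [(0.0, 2.0), (2.25, 4.0), (10.0, 12.0)]
merged = merge_segments(vad_output)
print(merged)                         # [(0.0, 4.0), (10.0, 12.0)]
print(speech_fraction(merged, 20.0))  # 0.3
```

In this example only 30% of the recording would be passed on to the transcription model.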

Key Features

Enterprise-grade accuracy with sub-millisecond processing time per audio chunk
PyTorch and ONNX Runtime backends
Supports 16kHz and 8kHz audio
Real-time streaming and batch processing
Minimal dependencies (torch or onnxruntime)
MIT license, free for commercial use
Used for VAD filtering in faster-whisper and supported as a VAD option in WhisperX

Best fit

When to reach for it

Best when the job fits Media & Transcription.
Works naturally with Multi-Framework setups.

Trust & provenance

Why this listing is credible

Built around the silero-vad toolchain.
Trust status: Security Reviewed.
8.6k GitHub stars on the linked upstream source.
License: MIT.
Last updated Mar 30, 2026.
