Skill Detail

RealtimeSTT Low-Latency Speech-to-Text Python Library

RealtimeSTT is a Python library for real-time speech-to-text with advanced voice activity detection, wake word activation, and instant transcription. It combines WebRTC VAD, Silero VAD, and Faster Whisper for production-grade voice input in agent applications.

Media & TranscriptionCustom Agents

Media & Transcription Custom Agents Security Reviewed

Tool match: realtimestt ⭐ 9.6k GitHub stars MIT license

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill realtimestt-low-latency-speech-to-text-python Copy

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Last updated

Mar 30, 2026

Quick brief

RealtimeSTT is a Python library that provides easy-to-use, low-latency speech-to-text conversion designed for real-time applications. Created by Kolja Beigel, it combines industry-standard components for voice activity detection and speech recognition into a unified API that handles the complex pipeline of listening, detecting speech boundaries, and transcribing audio.

How it works

What this skill actually does

Architecture

The library uses a multi-stage processing pipeline. First, WebRTC VAD performs initial voice activity detection with minimal latency. Silero VAD then provides more accurate verification to reduce false positives. Finally, Faster Whisper handles GPU-accelerated transcription using OpenAI’s Whisper models. For wake word detection, it supports both Picovoice Porcupine and OpenWakeWord.

Core Features

Automatic voice activity detection that identifies speech start and stop points
Real-time transcription with configurable Whisper model sizes (tiny through large-v3)
Wake word activation using Porcupine or OpenWakeWord for hands-free operation
Client-server architecture with AudioToTextRecorderClient for distributed deployments
CLI interface with stt-server and stt commands for quick integration
Callback-based API for processing transcribed text as it becomes available

Agent Integration

RealtimeSTT is particularly valuable for building voice-controlled AI agents and assistants. Its AudioToTextRecorder class provides a simple interface where agents receive transcribed text through callbacks, making it straightforward to pipe voice input into LLM-based conversation systems. The library handles all the complexity of microphone input, silence detection, and transcription timing.

Usage

Basic usage requires just a few lines of Python. Initialize AudioToTextRecorder, then call recorder.text() in a loop with a callback function. The library automatically manages audio capture, speech detection, and transcription. GPU acceleration via CUDA significantly improves transcription speed for larger models.

The library is available on PyPI via pip install RealtimeSTT and is MIT-licensed. It has over 9,600 GitHub stars and an active community. The companion library RealtimeTTS handles the output side for complete voice conversation systems.

Best fit

When to reach for it

Best when the job fits Media & Transcription.
Works naturally with Custom Agents setups.

Trust & provenance

Why this listing is credible

Built around the realtimestt toolchain.
Trust status: Security Reviewed.
9.6k GitHub stars on the linked upstream source.
License: MIT.
Last updated Mar 30, 2026.

View source ↗