Skill Detail

RealtimeSTT Low-Latency Speech-to-Text Python Library

RealtimeSTT is a Python library for real-time speech-to-text with advanced voice activity detection, wake word activation, and instant transcription. It combines WebRTC VAD, Silero VAD, and Faster Whisper for production-grade voice input in agent applications.

Media & TranscriptionCustom Agents
Media & Transcription Custom Agents Security Reviewed
Tool match: realtimestt โญ 9.6k GitHub stars MIT license
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill realtimestt-low-latency-speech-to-text-python Copy
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Last updated
Mar 30, 2026
Quick brief

RealtimeSTT is a Python library that provides easy-to-use, low-latency speech-to-text conversion designed for real-time applications. Created by Kolja Beigel, it combines industry-standard components for voice activity detection and speech recognition into a unified API that handles the complex pipeline of listening, detecting speech boundaries, and transcribing audio.

How it works

What this skill actually does

Architecture

The library uses a multi-stage processing pipeline. First, WebRTC VAD performs initial voice activity detection with minimal latency. Silero VAD then provides more accurate verification to reduce false positives. Finally, Faster Whisper handles GPU-accelerated transcription using OpenAI’s Whisper models. For wake word detection, it supports both Picovoice Porcupine and OpenWakeWord.

Core Features

  • Automatic voice activity detection that identifies speech start and stop points
  • Real-time transcription with configurable Whisper model sizes (tiny through large-v3)
  • Wake word activation using Porcupine or OpenWakeWord for hands-free operation
  • Client-server architecture with AudioToTextRecorderClient for distributed deployments
  • CLI interface with stt-server and stt commands for quick integration
  • Callback-based API for processing transcribed text as it becomes available

Agent Integration

RealtimeSTT is particularly valuable for building voice-controlled AI agents and assistants. Its AudioToTextRecorder class provides a simple interface where agents receive transcribed text through callbacks, making it straightforward to pipe voice input into LLM-based conversation systems. The library handles all the complexity of microphone input, silence detection, and transcription timing.

Usage

Basic usage requires just a few lines of Python. Initialize AudioToTextRecorder, then call recorder.text() in a loop with a callback function. The library automatically manages audio capture, speech detection, and transcription. GPU acceleration via CUDA significantly improves transcription speed for larger models.

The library is available on PyPI via pip install RealtimeSTT and is MIT-licensed. It has over 9,600 GitHub stars and an active community. The companion library RealtimeTTS handles the output side for complete voice conversation systems.