Skill Detail

Self-host an OpenAI-compatible speech API for local transcription, translation, and TTS with Speaches

Use Speaches when an agent stack expects OpenAI-style audio endpoints but you want a self-hosted speech backend for transcription, translation, and text-to-speech instead of a hosted API.

Media & Transcription Multi-Framework Security Reviewed
⭐ 3.2k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill self-host-an-openai-compatible-speech-api-for-local-transcription-translation-and-tts-with-speaches
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
Docker or Python-based deployment environment, CPU or GPU runtime, supported speech models, and any client or agent stack that can call OpenAI-compatible audio endpoints.
Install & setup
Deploy Speaches with Docker or the supported local setup from the project docs, download or configure the speech models you plan to serve, start the API, then point your agent or application at the Speaches base URL using the same OpenAI-style audio calls it already expects.
Author
speaches-ai
Publisher
Company
Last updated
Apr 14, 2026
Quick brief

Tool: Speaches. This skill gives an agent operator a narrow, practical job: stand up a local or self-hosted speech server that speaks the OpenAI audio API shape, then swap existing agent audio workflows onto that endpoint with minimal integration churn. Speaches supports streaming transcription, translation, and speech generation, with dynamic model loading and both CPU and GPU deployment paths.
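To make "speaks the OpenAI audio API shape" concrete, here is a minimal sketch of a text-to-speech request built with only the Python standard library. The `/audio/speech` path and the `model`, `input`, and `voice` fields follow the OpenAI audio API. The base URL `http://localhost:8000/v1`, the helper name `build_speech_request`, and the model and voice strings are assumptions for illustration, not values taken from the Speaches docs; check your own deployment for the real ones.

```python
import json
import urllib.request

# Assumption: Speaches is reachable at this address; adjust host and port to your deployment.
SPEACHES_BASE_URL = "http://localhost:8000/v1"


def build_speech_request(text, model, voice, base_url=SPEACHES_BASE_URL):
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "example-tts-model" and "example-voice" are placeholders, not real model or voice IDs.
    req = build_speech_request(
        "Hello from a local server.", "example-tts-model", "example-voice"
    )
    print(req.get_method(), req.full_url)
    # Against a running server, urllib.request.urlopen(req).read() would return audio bytes.
```

The request is only constructed here, so the sketch runs without a server. Sending it requires a running instance with a speech model that it can actually serve.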

How it works

What this skill actually does

When to use it: invoke this when your agents already call OpenAI-compatible speech endpoints but you want local control over models, lower vendor dependence, or a unified speech backend for multiple clients. It is useful for voice assistants, transcription pipelines, speech-enabled support tools, and multimodal agent setups where the integration surface matters more than training or model research. Using this skill differs from using the product ad hoc because the operator workflow is explicit: deploy the server, choose speech models, expose compatible endpoints, and repoint downstream agent tools.
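The "repoint downstream agent tools" step can be sketched as a base URL swap: the audio paths stay the same and only the host changes. The three paths below are the OpenAI audio endpoints. `OPENAI_BASE_URL` is the environment variable the official OpenAI Python SDK reads; whether your agent stack honors it is an assumption, and the helper `audio_endpoints` and the local address `http://localhost:8000/v1` are hypothetical.

```python
import os

# Default used when no override is configured.
HOSTED_BASE_URL = "https://api.openai.com/v1"

# OpenAI-style audio paths, relative to the base URL.
AUDIO_PATHS = {
    "transcription": "/audio/transcriptions",
    "translation": "/audio/translations",
    "speech": "/audio/speech",
}


def audio_endpoints(env=None):
    """Resolve the audio endpoint URLs from an environment-style mapping."""
    env = os.environ if env is None else env
    base = env.get("OPENAI_BASE_URL", HOSTED_BASE_URL).rstrip("/")
    return {name: base + path for name, path in AUDIO_PATHS.items()}


if __name__ == "__main__":
    # Assumption: a self-hosted server listening on localhost:8000.
    local = audio_endpoints({"OPENAI_BASE_URL": "http://localhost:8000/v1"})
    for name, url in local.items():
        print(f"{name}: {url}")
```

Keeping the override in configuration rather than in code means the same client logic can be pointed back at a hosted API if the self-hosted server is unavailable.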

Scope boundary: this is not a generic speech-product listing, not a hosted API comparison page, and not a broad model zoo entry. Its boundary is specific: run a self-hosted OpenAI-compatible speech server so existing agent pipelines can keep working while you control the runtime and model selection.