Self-host an OpenAI-compatible speech API for local transcription, translation, and TTS with Speaches

Use Speaches when an agent stack expects OpenAI-style audio endpoints but you want a self-hosted speech backend for transcription, translation, and text-to-speech instead of a hosted API.

Media & Transcription, Multi-Framework, Security Reviewed

⭐ 3.2k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill self-host-an-openai-compatible-speech-api-for-local-transcription-translation-and-tts-with-speaches

Works best when you want a reusable capability, not another fragile one-off prompt.

View source Documentation

At a glance

Tools required

A Docker or Python-based deployment environment, a CPU or GPU runtime, supported speech models, and any client or agent stack that can call OpenAI-compatible audio endpoints.

Install & setup

Deploy Speaches with Docker or the supported local setup from the project docs, download or configure the speech models you plan to serve, start the API, then point your agent or application at the Speaches base URL using the same OpenAI-style audio calls it already expects.
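As a rough illustration of the last step, the sketch below builds (but does not send) an OpenAI-style transcription request against a self-hosted base URL. The base URL `http://localhost:8000/v1`, the model id, and the helper name `build_transcription_request` are assumptions made for this example, not values taken from the Speaches docs; the `/audio/transcriptions` path and the `model` and `file` form fields follow the OpenAI audio API shape.

```python
import urllib.request
import uuid

# Assumption: where your Speaches server listens; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"
# Assumption: placeholder id; use a model your server actually serves.
MODEL = "your-transcription-model"


def build_transcription_request(base_url, model, filename, audio_bytes):
    """Build a multipart POST for the OpenAI-style transcription endpoint."""
    boundary = uuid.uuid4().hex
    # Two form fields: the model id, then the audio file itself.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_transcription_request(BASE_URL, MODEL, "sample.wav", b"")
    print(request.get_method(), request.full_url)
    # To actually send it once the server is running:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

A client that already uses an OpenAI SDK would normally not build requests by hand; the point is only that the request shape stays the same and the host changes.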

Author

speaches-ai

Publisher

Company

Last updated

Apr 14, 2026

Quick brief

Tool: Speaches. This skill gives an agent operator a narrow, practical job: stand up a local or self-hosted speech server that speaks the OpenAI audio API shape, then swap existing agent audio workflows onto that endpoint with minimal integration churn. Speaches supports streaming transcription, translation, and speech generation, with dynamic model loading and both CPU and GPU deployment paths.
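For the speech generation side, a minimal sketch of an OpenAI-style text-to-speech request might look like the following. The `/audio/speech` path and the `model`, `voice`, and `input` fields follow the OpenAI audio API shape; the base URL, model id, voice name, and the helper `build_speech_request` are hypothetical placeholders that you would replace with whatever your deployment serves.

```python
import json
import urllib.request


def build_speech_request(base_url, model, voice, text):
    """Build a JSON POST for the OpenAI-style text-to-speech endpoint."""
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumptions: local base URL, placeholder model id and voice name.
    request = build_speech_request(
        "http://localhost:8000/v1",
        "your-tts-model",
        "your-voice",
        "Hello from a self-hosted speech server.",
    )
    print(request.get_method(), request.full_url)
    # Sending it would return audio bytes to write to a file:
    # with urllib.request.urlopen(request) as response:
    #     open("speech.out", "wb").write(response.read())
```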

How it works

What this skill actually does

When to use it: invoke this when your agents already know how to call OpenAI-compatible speech endpoints, but you want local control over models, lower vendor dependence, or a unified speech backend for multiple clients. It is useful for voice assistants, transcription pipelines, speech-enabled support tools, and multimodal agent setups where the integration surface matters more than training or model research. Unlike ad hoc use of the product, this skill makes the operator workflow explicit: deploy the server, choose speech models, expose compatible endpoints, and repoint downstream agent tools.
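The "repoint downstream agent tools" step can be sketched as a single configuration switch. The example below assumes the downstream tool reads its base URL from an `OPENAI_BASE_URL` environment variable, a convention used by the official OpenAI SDKs; whether your agent stack honors it is an assumption to verify. The helper name `resolve_audio_endpoints` and the local URL are hypothetical.

```python
import os

# Hosted default used when no override is configured.
HOSTED_DEFAULT = "https://api.openai.com/v1"


def resolve_audio_endpoints(env=None):
    """Derive the three OpenAI-style audio endpoints from one base URL."""
    env = os.environ if env is None else env
    base = env.get("OPENAI_BASE_URL", HOSTED_DEFAULT).rstrip("/")
    return {
        "transcriptions": f"{base}/audio/transcriptions",
        "translations": f"{base}/audio/translations",
        "speech": f"{base}/audio/speech",
    }


if __name__ == "__main__":
    # Assumption: a self-hosted server listening on localhost:8000.
    local = resolve_audio_endpoints({"OPENAI_BASE_URL": "http://localhost:8000/v1"})
    for name, url in local.items():
        print(name, url)
```

Keeping the base URL in one place means each client only needs that one value changed, and rolling back to a hosted API is the same one-line change.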

Scope boundary: this is not a generic speech-product listing, not a hosted API comparison page, and not a broad model zoo entry. Its boundary is specific: run a self-hosted OpenAI-compatible speech server so existing agent pipelines can keep working while you control the runtime and model selection.

Best fit

When to reach for it

Best when the job fits Media & Transcription.
Works naturally with Multi-Framework setups.
Requires Docker or Python-based deployment environment, CPU or GPU runtime, supported…
Installation is straightforward: Deploy Speaches with Docker or the supported local setup from the project docs, download or configure…

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
3.2k GitHub stars on the linked upstream source.
Last updated Apr 14, 2026.