
Kokoro FastAPI OpenAI-Compatible Text-to-Speech Server

Kokoro-FastAPI is a Dockerized FastAPI wrapper around the Kokoro-82M text-to-speech model with OpenAI-compatible speech endpoints. It supports local TTS serving, multi-language synthesis, web UI access, and timestamped audio generation workflows.

Media & Transcription Multi-Framework Security Reviewed
โญ 4.7k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill kokoro-fastapi-openai-compatible-text-to-speech-server
At a glance
Tools required
Docker
Install & setup
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
Author
remsky
Publisher
Individual Developer
Last updated
Mar 31, 2026
Quick brief

Kokoro-FastAPI is a self-hostable text-to-speech server built around the Kokoro-82M model and exposed through a FastAPI application. The upstream project packages the model behind an OpenAI-compatible speech endpoint, so it can drop into existing agent and automation systems that already call OpenAI-style speech APIs. The project README highlights multi-language support, CPU and NVIDIA GPU inference modes, a local web UI, debug endpoints, phoneme-aware generation, voice mixing, and per-word timestamped caption output.
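As a rough illustration of what "OpenAI-compatible" means in practice, the sketch below posts text to a locally running container and saves the returned audio. It is a minimal example under stated assumptions, not the project's reference client: the port 8880 comes from the docker run command on this page, while the /v1/audio/speech path, the model name "kokoro", and the voice name "af_bella" are assumed from the OpenAI speech API shape and should be checked against the upstream README. The helper names build_speech_request and synthesize are hypothetical.

```python
import json
import urllib.request

# Assumption: the server is reachable on the port published by the docker run command.
BASE_URL = "http://localhost:8880"


def build_speech_request(text, voice="af_bella", response_format="mp3", base_url=BASE_URL):
    """Build a POST request in the OpenAI speech-API shape (assumed path and field names)."""
    payload = {
        "model": "kokoro",          # assumed model identifier
        "input": text,
        "voice": voice,             # assumed voice name; check the server's voice list
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.mp3", **kwargs):
    """Send the request and write the raw audio bytes to out_path."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


if __name__ == "__main__":
    # Requires the container to be running locally.
    print(synthesize("Hello from a self-hosted TTS server."))
```

Because the request follows the OpenAI shape, an existing OpenAI client pointed at the local base URL would likely work as well; the standard-library version is used here only to keep the sketch dependency-free.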

How it works

What this skill actually does

The upstream source is the remsky/Kokoro-FastAPI repository. It is distributed with Docker and Docker Compose workflows, plus wiki-based integration guides for environments such as Kubernetes, DigitalOcean, and OpenWebUI. That makes it a strong fit for teams that want local or self-hosted speech generation instead of routing every request through a managed SaaS voice provider.

An ASE skill built around Kokoro-FastAPI is useful when an agent needs to stand up a local TTS service, generate speech from prompts, produce aligned captions, or expose a speech endpoint to downstream apps that expect an OpenAI-like interface. Typical outputs include generated audio files, API-ready speech responses, caption timestamps, deployment guidance for CPU or GPU hosts, and integration steps for chat interfaces or voice-enabled assistants. It also pairs well with content narration, accessibility tooling, demo voiceovers, and internal prototyping where controllable, self-hosted TTS infrastructure is more important than a managed black-box API.