Kokoro FastAPI OpenAI-Compatible Text-to-Speech Server

Kokoro-FastAPI is a Dockerized FastAPI wrapper around the Kokoro-82M text-to-speech model with OpenAI-compatible speech endpoints. It supports local TTS serving, multi-language synthesis, web UI access, and timestamped audio generation workflows.

Media & Transcription | Multi-Framework | Security Reviewed

⭐ 4.7k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill kokoro-fastapi-openai-compatible-text-to-speech-server

Works best when you want a reusable capability, not another fragile one-off prompt.

View source Documentation

At a glance

Tools required

Docker

Install & setup

docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

Author

remsky

Publisher

Individual Developer

Last updated

Mar 31, 2026

Quick brief

Kokoro-FastAPI is a self-hostable text-to-speech server built around the Kokoro-82M model and exposed through a FastAPI application. The upstream project packages the model behind an OpenAI-compatible speech endpoint, so it can be dropped into agent and automation systems that already call OpenAI-style speech APIs. The project README highlights multi-language support, CPU and NVIDIA GPU inference modes, a local web UI, debug endpoints, phoneme-aware generation, voice mixing, and per-word timestamped caption output.
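As a rough illustration of what "OpenAI-compatible" means for a client, the sketch below builds a speech request using only the Python standard library. Port 8880 comes from the docker command in this listing. The /v1/audio/speech path mirrors OpenAI's speech API, and the model name "kokoro" and voice name "af_bella" are assumptions for illustration. Check the upstream README for the values your deployment actually accepts.

```python
import json
import urllib.request

# Port 8880 is taken from the listing's docker command. The path, model name,
# and voice name below are assumptions, not confirmed by this listing.
BASE_URL = "http://localhost:8880"
SPEECH_PATH = "/v1/audio/speech"


def build_speech_request(text, voice="af_bella", model="kokoro",
                         response_format="mp3", base_url=BASE_URL):
    """Build (but do not send) a POST request in the OpenAI speech format."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + SPEECH_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, timeout=60.0, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With the container running, calling synthesize("Hello from Kokoro", "hello.mp3") would write the returned bytes to hello.mp3. Building the request separately from sending it makes the payload easy to inspect without a live server.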

How it works

What this skill actually does

The upstream source is the remsky/Kokoro-FastAPI repository. It is distributed with Docker and Docker Compose workflows, plus wiki-based integration guides for environments such as Kubernetes, DigitalOcean, and OpenWebUI. That makes it a strong fit for teams that want local or self-hosted speech generation instead of routing every request through a managed SaaS voice provider.

An Agent Skill Exchange (ASE) skill built around Kokoro-FastAPI is useful when an agent needs to stand up a local TTS service, generate speech from prompts, produce aligned captions, or expose a speech endpoint to downstream apps that expect an OpenAI-like interface. Typical outputs include generated audio files, API-ready speech responses, caption timestamps, deployment guidance for CPU or GPU hosts, and integration steps for chat interfaces or voice-enabled assistants. It also suits content narration, accessibility tooling, demo voiceovers, and internal prototyping, where controllable, self-hosted TTS infrastructure matters more than a managed black-box API.
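To show how per-word caption timestamps might be turned into a deliverable, here is a minimal sketch that groups word timings into SRT subtitle cues. The input shape (a list of dicts with "word", "start_time", and "end_time" keys, times in seconds) is a hypothetical format chosen for illustration. The server's actual timestamp response may differ and would need to be mapped into this shape first.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=6):
    """Group word timings into numbered SRT cues of at most max_words words.

    `words` is a hypothetical structure: each item has "word", "start_time",
    and "end_time" (seconds). Adapt the keys to the real response format.
    """
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_srt_time(chunk[0]["start_time"])
        end = format_srt_time(chunk[-1]["end_time"])
        text = " ".join(item["word"] for item in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Each cue spans from the first word's start to the last word's end in its group. A fixed word count per cue is the simplest grouping rule, and splitting on punctuation or pauses would be a natural refinement.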

Best fit

When to reach for it

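Best suited to media and transcription work such as narration, voiceovers, and captioning.
Listed as Multi-Framework, so it is not tied to a single agent framework.
Requires Docker.
Install with a single command (CPU image, served on port 8880): docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest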
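After starting the container, an agent usually needs to wait until the server accepts connections before sending requests. The sketch below polls the published port with a plain TCP connect. The host and port defaults follow the docker command above, while the timeout and interval values are arbitrary choices.

```python
import socket
import time


def wait_for_port(host="localhost", port=8880, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful connect only shows that the port is open. It does not prove the model has finished loading, so a first synthesis request may still be slow or fail and is worth retrying.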

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
4.7k GitHub stars on the linked upstream source.
Last updated Mar 31, 2026.
