All apps · 0 apps

Kokoro-FastAPI---GPU

Overview

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching.
This is a version meant for Nvidia GPUs.

Readme

View on GitHub

Kokoro TTS Banner

_{_FastKoko}

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model

OpenAI-compatible Speech endpoint, multi-language support
- English (US/GB), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, Mandarin Chinese
Per-word timestamped caption generation, voice mixing with weighted combinations
Phoneme endpoints: generate phonemes from text, or generate audio from phonemes
Prebuilt multiplatform images
- CPU and NVIDIA GPU (CUDA): linux/amd64 + linux/arm64
- AMD GPU (ROCm, experimental): linux/amd64 only
Apple Silicon (MPS) supported when running directly via UV (no image)

Integration Guides

Get Started

Quickest Start (docker run)

Pre-built multi-arch images with models baked in.

:latest is available, but please pin to a release tag for stable usage.

Your hardware	Image
No GPU (any laptop, VPS, CPU-only server)	`kokoro-fastapi-cpu:latest`
Apple Silicon (M1/M2/M3)	`kokoro-fastapi-cpu:latest` in Docker, or `./start-gpu_mac.sh` natively for MPS
NVIDIA GTX 9xx, 10xx, 20xx, 30xx, 40xx (x86_64)	`kokoro-fastapi-gpu:latest-cu126` or `kokoro-fastapi-gpu:latest`
NVIDIA RTX 50-series / Blackwell (x86_64)	`kokoro-fastapi-gpu:latest-cu128`
NVIDIA on arm64 (Jetson, GH200)	`kokoro-fastapi-gpu:latest` (ships cu129, no cu126 arm64 wheels upstream)
AMD GPU	`kokoro-fastapi-rocm:latest` (experimental, x86_64 only)

docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest                                       # CPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest                            # NVIDIA (x86_64 or arm64)
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest-cu128                      # NVIDIA Blackwell / RTX 50-series
docker run --device=/dev/kfd --device=/dev/dri -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-rocm:latest  # AMD

Configuration via environment variables, see core/config.py. The :latest and :latest-cu126 tags resolve to the same multi-arch image.

Quick Start (docker compose)

Install prerequisites, and start the service using Docker Compose (Full setup including UI):

Install Docker

Clone the repository:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

cd docker/gpu   # For NVIDIA GPU support
# or cd docker/cpu   # For CPU support
# or cd docker/rocm  # For AMD GPU (ROCm, experimental, amd64 only)
docker compose up --build

# *Note for Apple Silicon (M1/M2/M3) users:
# The Docker GPU image is CUDA-only and won't run on Apple Silicon. With Docker, use `docker/cpu`.
# For native MPS (Apple GPU) acceleration, run directly via UV with `./start-gpu_mac.sh`.

cd ../..  # back to repo root for the paths below

# Models will auto-download, but if needed you can manually download:
python docker/scripts/download_model.py --output api/src/models/v1_0

# Or run directly via UV:
./start-gpu.sh  # For GPU support
./start-cpu.sh  # For CPU support

Direct Run (via uv)

Install prerequisites ():
- Install astral-uv
- Install espeak-ng in your system if you want it available as a fallback for unknown words/sounds. The upstream libraries may attempt to handle this, but results have varied.
- Clone the repository:
```
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
```
  Run the model download script if you haven't already
  
  Start directly via UV (with hot-reload)
  
  Linux and macOS
```
./start-cpu.sh OR
./start-gpu.sh 
```
  Windows
```
.\start-cpu.ps1 OR
.\start-gpu.ps1 
```

Up and Running?

Run locally as an OpenAI-Compatible Speech Endpoint

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed"
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_sky+af_bella", #single or multiple voicepack combo
    input="Hello world!"
  ) as response:
      response.stream_to_file("output.mp3")

The API will be available at http://localhost:8880
API Documentation: http://localhost:8880/docs
Web Interface: http://localhost:8880/web

Features

OpenAI-Compatible Speech Endpoint

# Using OpenAI's Python library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",  
    voice="af_bella+af_sky", # see /api/src/core/openai_mappings.json to customize
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")

Or Via Requests:

import requests


response = requests.get("http://localhost:8880/v1/audio/voices")
voices = [v["id"] for v in response.json()["voices"]]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",  
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)

Quick tests (run from another terminal):

python examples/assorted_checks/test_openai/test_openai_tts.py # Test OpenAI Compatibility
python examples/assorted_checks/test_voices/test_all_voices.py # Test all available voices

Voice Combination

Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for 67%/33% mix)
Ratios are automatically normalized to sum to 100%
Available through any endpoint by adding weights in parentheses
Saves generated voicepacks for future use

Combine voices and generate audio:

import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = [v["id"] for v in response.json()["voices"]]

# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella+af_sky",  # Equal weights
        "response_format": "mp3"
    }
)

# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella(2)+af_sky(1)",  # 2:1 ratio = 67%/33%
        "response_format": "mp3"
    }
)

# Example 3: Download combined voice as .pt file
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json="af_bella(2)+af_sky(1)"  # 2:1 ratio = 67%/33%
)

# Save the .pt file
with open("combined_voice.pt", "wb") as f:
    f.write(response.content)

# Use the downloaded voice file
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "combined_voice",  # Use the saved voice file
        "response_format": "mp3"
    }
)

Voice Analysis Comparison

Multiple Output Audio Formats

mp3
wav
opus
flac
m4a
pcm

Audio Format Comparison

Streaming Support

# OpenAI-compatible streaming
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream to file
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")

# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16, 
    channels=1, 
    rate=24000, 
    output=True
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    response_format="pcm",
    input="Hello world!"
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)

Or via requests:

import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "pcm"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass

GPU First Token Timeline CPU First Token Timeline

Key Streaming Metrics:

First token latency @ chunksize
- ~300ms (GPU) @ 400
- ~3500ms (CPU) @ 200 (older i7)
- ~<1s (CPU) @ 200 (M3 Pro)
Adjustable chunking settings for real-time playback

Note: Artifacts in intonation can increase with smaller chunks

Processing Details

Performance & Benchmarks

Hardware variants

# GPU: Requires NVIDIA driver with CUDA 12.6+ support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build

# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build

# AMD GPU: ROCm 6.4 (experimental, amd64 only)
cd docker/rocm
docker compose up --build

Throughput

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:

Windows 11 Home w/ WSL2
NVIDIA 4060Ti 16gb GPU @ CUDA 12.1
11th Gen i7-11700 @ 2.5GHz
64gb RAM
WAV native output
H.G. Wells - The Time Machine (full text)

Processing Time Realtime Factor

Key Performance Metrics:

Realtime Speed: Ranges between 35x-100x (generation time to output audio length)
Average Processing Rate: 137.67 tokens/second (cl100k_base)

Model Unload / VRAM Reclaim

POST /dev/unload frees the model from VRAM and reloads lazily on the next request. Reclaim scales with load (the activation pool, not just weights) but plateaus: chunks cap at 450 tokens. Long-form = ~30 paragraphs. Same setup as above.

Short workload Long-form workload

Workload	Loaded	Floor	Reclaimed	Reload
Short (6s audio)	3.11 GB	2.37 GB	758 MiB	+4.9s
Long-form (7.5m)	3.98 GB	2.37 GB	1,656 MiB	+5.1s

Floor is host + CUDA context. Reproduce with uv run --extra benchmarks assorted_checks/benchmarks/benchmark_model_unload.py from examples/.

Transcription roundtrip (WER/CER)

End-to-end roundtrip: synthesize with Kokoro, transcribe the result back with faster-whisper, compare to the source text. Scripts and data live under examples/assorted_checks/test_transcription/.

Long-form English (full book, A Journey to the Centre of the Earth, Project Gutenberg, voice af_heart, base.en Whisper on CUDA float16, baseline captured on cu126 GPU build):

Run	Input chars	Audio length	Synth speedup	Transcribe speedup	WER
Short (~ch.7)	64,996	66m 06s	36.4x rt	62.4x rt	0.047
Full book	502,766	507m 52s	45.7x rt	65.1x rt	0.033

See examples/assorted_checks/test_transcription/BASELINE.md for the full regression bands.

Per-language check (single-sentence per voice, multilingual Whisper small. WER for Latin scripts, CER for ja/zh/hi):

Language	Voice	Metric	Score
English	`af_heart`	WER	0.000
English (UK)	`bf_emma`	WER	0.111
Spanish	`ef_dora`	WER	0.000
French	`ff_siwis`	WER	0.000
Italian	`if_sara`	WER	0.000
Portuguese	`pf_dora`	WER	0.000
Hindi	`hf_alpha`	CER	0.059
Japanese	`jf_alpha`	CER	0.000
Chinese	`zf_xiaobei`	CER	0.143

Caveat: these are single short sentences, not a comprehensive per-language quality benchmark. They confirm each voice produces transcribable audio in its target language; deeper quality evaluation per language is open work.

To reproduce, see examples/assorted_checks/test_transcription/README.md.

Natural Boundary Detection

Automatically splits and stitches at sentence boundaries
Helps to reduce artifacts and allow long form processing as the base model is only currently configured for approximately 30s output

The model is capable of processing up to a 510 phonemized token chunk at a time, however, this can often lead to 'rushed' speech or other artifacts. An additional layer of chunking is applied in the server, that creates flexible chunks with a TARGET_MIN_TOKENS , TARGET_MAX_TOKENS, and ABSOLUTE_MAX_TOKENS which are configurable via environment variables, and set to 175, 250, 450 by default

Timestamped Captions & Phonemes

Generate audio with word-level timestamps without streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": False,
    },
    stream=False
)

with open("output.mp3","wb") as f:

    audio_json=json.loads(response.content)
    
    # Decode base 64 stream to bytes
    chunk_audio=base64.b64decode(audio_json["audio"].encode("utf-8"))
    
    # Process streaming chunks
    f.write(chunk_audio)
    
    # Print word level timestamps
    print(audio_json["timestamps"])

Generate audio with word-level timestamps with streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": True,
    },
    stream=True
)

f=open("output.mp3","wb")
for chunk in response.iter_lines(decode_unicode=True):
    if chunk:
        chunk_json=json.loads(chunk)
        
        # Decode base 64 stream to bytes
        chunk_audio=base64.b64decode(chunk_json["audio"].encode("utf-8"))
        
        # Process streaming chunks
        f.write(chunk_audio)
        
        # Print word level timestamps
        print(chunk_json["timestamps"])

Phoneme & Token Routes

Convert text to phonemes and/or generate audio directly from phonemes:

import requests

def get_phonemes(text: str, language: str = "a"):
    """Get phonemes and tokens for input text"""
    response = requests.post(
        "http://localhost:8880/dev/phonemize",
        json={"text": text, "language": language}  # "a" for American English
    )
    response.raise_for_status()
    result = response.json()
    return result["phonemes"], result["tokens"]

def generate_audio_from_phonemes(phonemes: str, voice: str = "af_bella"):
    """Generate audio from phonemes"""
    response = requests.post(
        "http://localhost:8880/dev/generate_from_phonemes",
        json={"phonemes": phonemes, "voice": voice},
        headers={"Accept": "audio/wav"}
    )
    if response.status_code != 200:
        print(f"Error: {response.text}")
        return None
    return response.content

# Example usage
text = "Hello world!"
try:
    # Convert text to phonemes
    phonemes, tokens = get_phonemes(text)
    print(f"Phonemes: {phonemes}")  # e.g. ðɪs ɪz ˈoʊnli ɐ tˈɛst
    print(f"Tokens: {tokens}")      # Token IDs including start/end tokens

    # Generate and save audio
    if audio_bytes := generate_audio_from_phonemes(phonemes):
        with open("speech.wav", "wb") as f:
            f.write(audio_bytes)
        print(f"Generated {len(audio_bytes)} bytes of audio")
except Exception as e:
    print(f"Error: {e}")

See examples/phoneme_examples/generate_phonemes.py for a sample script.

Inline Control Tokens

Two tokens can be embedded in the input text and are parsed server-side (API, WebUI, or any client):

Pause: [pause:1.5s] inserts that much silence. Must be exactly this form (colon, trailing s, case-insensitive). [pause=1.5], [PAUSE 1.0], and SSML <break/> are not recognized and get read aloud.
Pronunciation: [Worcester](/wˈʊstər/) speaks the IPA between the slashes instead of the word. English only; use /dev/phonemize to find the IPA.

The city of [Worcester](/wˈʊstər/) is easy. [pause:1s] See?

Debug Endpoints

Monitor system state and resource usage with these endpoints. The /debug/* routes expose host and process internals, so they are off by default; set ENABLE_DEBUG_ENDPOINTS=true to enable.

/debug/threads - Get thread information and stack traces
/debug/storage - Monitor temp file and output directory usage
/debug/system - Get system information (CPU, memory, GPU)
POST /dev/unload - Release model from VRAM; reloads lazily on next request. Off by default; set ALLOW_DEV_UNLOAD=true to enable

Useful for debugging resource exhaustion or performance issues.

Stability: the /v1/* OpenAI-compatible routes are the stable API. /dev/* and /debug/* are operational helpers, and may change or move behind flags between minor releases.

Logging

Global API loguru logging level can be set using the API_LOG_LEVEL environment variable. Defaults to DEBUG.

Docker

Modify the appropriate compose yml or append to command line.

docker run --env 'API_LOG_LEVEL=WARNING' ...

Direct via UV

Linux and macOS

export API_LOG_LEVEL=WARNING
./start-cpu.sh OR
./start-gpu.sh

Windows

$env:API_LOG_LEVEL = 'WARNING'
.\start-cpu.ps1 OR
.\start-gpu.ps1

Known Issues & Troubleshooting

Missing words & Missing some timestamps

The api will automatically do text normalization on input text which may incorrectly remove or change some phrases. This can be disabled by adding "normalization_options":{"normalize": false} to your request json:

import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_heart",
        "response_format": "pcm",
        "normalization_options":
        {
            "normalize": False
        }
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass

Linux GPU Permissions

Some Linux users may encounter GPU permission issues when running as non-root. Can't guarantee anything, but here are some common solutions, consider your security requirements carefully

Option 1: Container Groups (Likely the best option)

services:
  kokoro-tts:
    # ... existing config ...
    group_add:
      - "video"
      - "render"

Option 2: Host System Groups

services:
  kokoro-tts:
    # ... existing config ...
    user: "${UID}:${GID}"
    group_add:
      - "video"

Note: May require adding host user to groups: sudo usermod -aG docker,video $USER and system restart.

Option 3: Device Permissions (Use with caution)

services:
  kokoro-tts:
    # ... existing config ...
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-uvm:/dev/nvidia-uvm

⚠️ Warning: Reduces system security. Use only in development environments.

Prerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.

Visit NVIDIA Container Toolkit installation for more detailed information

WAV duration reported as nonsense in some readers

WAV responses ship with streaming-sentinel (0xFFFFFFFF) size fields in the header. Most readers (soundfile, pydub/ffmpeg, browsers, OS players) handle this fine. Python's stdlib wave does not, and reports a bogus duration. Use soundfile.info(path).duration or ffprobe for exact length.

Project

Versioning & Development

Branching Strategy:

release branch: Contains the latest stable build, recommended for production use. Docker images tagged with specific versions are built from this branch.
master branch: Used for active development. It may contain experimental features, ongoing changes, or fixes not yet in a stable release. Use this branch if you want the absolute latest code, but be aware it might be less stable. The latest Docker tag often points to builds from this branch.

Note: This is a development focused project at its core.

If you run into trouble, you may have to roll back a version on the release tags if something comes up, or build up from source and/or troubleshoot + submit a PR.

Free and open source is a community effort, and there's only really so many hours in a day. If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs/features/etc you find during use.

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:

The Kokoro model weights are licensed under Apache 2.0 (see model page)
The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0

Contributor Stats

Alt

Made with contrib.rocks.

Media gallery

1 / 2

Install Kokoro-FastAPI---GPU on Unraid in a few clicks.

Find Kokoro-FastAPI---GPU in Community Apps on your Unraid server, review the template, and click Install. Unraid handles the Docker app or plugin setup from the published template.

Open the Apps tab on your Unraid server Search Community Apps for Kokoro-FastAPI---GPU Review the template variables and paths Click Install

Explore Unraid OS

Related apps

Explore more like this

Explore all

AI Productivity Utilities apps Other

Links

Projectgithub.com Supportgithub.com GHCRregistry.hub.docker.com Templateraw.githubusercontent.com

Details

Repository

ghcr.io/remsky/kokoro-fastapi-gpu:latest

Registry

https://registry.hub.docker.com/r/ghcr.io/remsky/kokoro-fastapi-gpu

Last Updated2026-07-17

First Seen2025-04-25

Runtime arguments

Web UI: http://[IP]:[PORT:8880]/
Network: bridge
Privileged: false
Extra Params: --gpus all

Template configuration

Web UI PortPorttcp

Container Port: 8880

Target: 8880
Default: 8880
Value: 8880

Download ModelVariable

Download model on start

Target: DOWNLOAD_MODEL
Default: true|false

App DataPath

Path to the app data folder

Target: /app/api
Default: /mnt/user/appdata/kokoro-fastapi/data
Value: /mnt/user/appdata/kokoro-fastapi/data

Python PathVariable

Python path environment variable

Target: PYTHONPATH
Default: /app:/app/api
Value: /app:/app/api

Log LevelVariable

Logging level for the API

Target: API_LOG_LEVEL
Default: DEBUG
Value: DEBUG

Use GPUVariable

Enable GPU usage for PyTorch model inference

Target: USE_GPU
Default: true
Value: true

Python UnbufferedVariable

Set Python output to be unbuffered

Target: PYTHONUNBUFFERED
Default: 1
Value: 1

Library

Curated

Categories