A high-performance OpenAI-compatible Text-to-Speech API server powered by VUI, a small conversational speech model that runs on-device.
- 🎯 OpenAI-compatible API - Drop-in replacement for OpenAI's TTS API
- 🚀 High Performance - GPU acceleration with optional `torch.compile()` optimization
- 🎵 Multiple Audio Formats - WAV, MP3, Opus, FLAC, AAC, and PCM support
- 📡 Streaming Support - Real-time audio streaming capabilities
- 🐳 Docker Ready - Easy deployment with CUDA support
- 💬 Conversational Quality - Natural speech with human-like characteristics
- NVIDIA GPU with CUDA support
- Docker with NVIDIA Container Toolkit
- Hugging Face account with accepted terms for the required gated models
- Clone the repository

```bash
git clone https://github.com/dwain-barnes/vui-fastapi-server.git
cd vui-fastapi-server
```
- Get your Hugging Face token
  - Visit Hugging Face Settings
  - Create a new token with read permissions
  - Accept the terms for the required models
- Build the Docker image

```bash
docker build --build-arg HUGGING_FACE_HUB_TOKEN=your_hf_token_here -t vui-fastapi .
```
- Run the server

```bash
docker run --gpus all -p 8000:8000 -e USE_GPU=1 vui-fastapi
```
The server will be available at http://localhost:8000
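Model download and warmup can take a while on first startup, so it can be handy to poll until the server answers before sending requests. The sketch below is an illustrative helper (not part of this project) that polls the docs page with the standard library only; the function name and defaults are assumptions.

```python
import time
import urllib.request
import urllib.error

def wait_for_server(url="http://localhost:8000/docs", timeout=30, interval=1.0):
    """Poll `url` until it responds with HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```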
The server implements the OpenAI Text-to-Speech API specification:
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vui",
    "input": "Hello world! This is a test of the VUI text-to-speech system.",
    "voice": "default",
    "response_format": "wav"
  }' \
  --output speech.wav
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | `"vui"` | Model identifier (currently only `"vui"` is supported) |
| `input` | string | required | Text to convert to speech (max 4096 characters) |
| `voice` | string | `null` | Voice selection (currently ignored) |
| `response_format` | string | `"wav"` | Audio format: `wav`, `mp3`, `opus`, `flac`, `aac`, `pcm` |
| `speed` | float | `1.0` | Speech speed (currently ignored) |
| `stream` | boolean | `false` | Enable streaming response |
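The constraints in the table above can be checked client-side before posting. The sketch below is illustrative, not part of the server; the names `build_payload`, `MAX_INPUT_CHARS`, and `SUPPORTED_FORMATS` are assumptions chosen to mirror the table.

```python
MAX_INPUT_CHARS = 4096  # per the `input` row of the parameter table
SUPPORTED_FORMATS = {"wav", "mp3", "opus", "flac", "aac", "pcm"}

def build_payload(text, response_format="wav", stream=False):
    """Build a request body that satisfies the documented parameter constraints."""
    if not text:
        raise ValueError("input is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    return {
        "model": "vui",
        "input": text,
        "response_format": response_format,
        "stream": stream,
    }
```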
Basic usage:
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world!", "response_format": "mp3"}' \
  --output hello.mp3
```
Streaming response:
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "This is a streaming test", "stream": true}' \
  --output stream.wav
```
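Since streaming uses chunked transfer encoding, a Python client can write audio to disk as chunks arrive rather than buffering the whole response. This is an illustrative sketch using `requests`; the helper names `write_chunks` and `stream_speech` are assumptions, not part of the project.

```python
import requests

def write_chunks(chunks, out_path):
    """Write an iterable of byte chunks to a file, skipping empty keep-alive chunks."""
    total = 0
    with open(out_path, "wb") as f:
        for chunk in chunks:
            if chunk:
                f.write(chunk)
                total += len(chunk)
    return total  # bytes written

def stream_speech(text, url="http://localhost:8000/v1/audio/speech",
                  out_path="stream.wav"):
    """POST with stream=True and save audio incrementally as it is generated."""
    with requests.post(url, json={"input": text, "stream": True},
                       stream=True, timeout=60) as resp:
        resp.raise_for_status()
        return write_chunks(resp.iter_content(chunk_size=4096), out_path)
```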
Python example:
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "vui",
        "input": "Hello from Python!",
        "response_format": "wav",
    },
)
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)
```
```
vui-fastapi-server/
├── main.py               # FastAPI application
├── requirements.txt      # Python dependencies
├── download_pyannote.py  # PyAnnote model downloader
├── Dockerfile            # Docker configuration
└── README.md             # This file
```
- Install dependencies

```bash
pip install -r requirements.txt
```
- Install VUI

```bash
git clone https://github.com/fluxions-ai/vui.git
pip install -e ./vui
```
- Set environment variables

```bash
export HUGGING_FACE_HUB_TOKEN=your_token_here
export USE_GPU=1  # Optional: force GPU usage
```
- Run the server

```bash
uvicorn main:app --host 0.0.0.0 --port 8000
```
Environment variables:
- `USE_GPU`: Set to `1` to force GPU usage, `0` for CPU
- `HUGGING_FACE_HUB_TOKEN`: Required for downloading gated models
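The `USE_GPU` convention above can be sketched as a small device-resolution function. This is an illustration of the documented behavior, not the server's actual code; `resolve_device` and the `"auto"` fallback are assumptions.

```python
import os

def resolve_device():
    """Mirror the documented USE_GPU convention:
    '1' forces GPU, '0' forces CPU, unset falls back to auto-detection."""
    flag = os.environ.get("USE_GPU")
    if flag == "1":
        return "cuda"
    if flag == "0":
        return "cpu"
    return "auto"  # let the server detect CUDA availability itself
```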
The server includes several optimizations:
- GPU Acceleration: Automatic CUDA detection and usage
- Model Compilation: `torch.compile()` for improved inference speed
- Model Warmup: Pre-loads and warms up the model during startup
- Streaming: Chunked transfer encoding for real-time audio streaming
Typical performance on modern GPUs:
- Generation Speed: 1-5x real-time depending on text length
- Latency: ~1-3 seconds for short phrases
- Quality: High-quality conversational speech with natural characteristics
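The "1-5x real-time" figure is a real-time factor: audio duration divided by generation time. A quick illustrative calculation (the function name is an assumption):

```python
def real_time_factor(audio_seconds, generation_seconds):
    """RTF > 1 means faster than real time, e.g. 10s of audio generated in 2s is 5x."""
    if generation_seconds <= 0:
        raise ValueError("generation time must be positive")
    return audio_seconds / generation_seconds
```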
Once the server is running, visit:
- Interactive API docs: http://localhost:8000/docs
- OpenAPI spec: http://localhost:8000/v1/openapi.json
Model compilation errors:

```
RuntimeError: Failed to find C compiler
```

The Dockerfile includes build tools to resolve this. If building locally, install them:

```bash
# Ubuntu/Debian
sudo apt-get install build-essential gcc g++

# macOS
xcode-select --install
```
GPU not detected:

```
VUI model loaded on cpu
```

Ensure the NVIDIA Container Toolkit is installed and the `--gpus all` flag is used.
Hugging Face token errors:

```
Token is required but no token found
```
Make sure you've accepted the terms for the required models and provided a valid token.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is built on top of VUI by Fluxions AI. Please refer to the original VUI repository for licensing information.
- Fluxions AI for the VUI model
- pyannote.audio for voice activity detection
- OpenAI for the TTS API specification