Create Audio
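Generate high-quality audio from text. The default provider, Pocket TTS, runs locally on your CPU; optional providers target Apple Silicon, other open-source models, or the ElevenLabs cloud API.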
Quick Start
Basic usage:
# Generate with default voice
scripts/create_audio.py --text "Hello world" --output hello.wav
# Use specific voice
scripts/create_audio.py --voice marius --text "Bonjour" --output greeting.wav
# From stdin
echo "Hello from stdin" | scripts/create_audio.py --output output.wav
Available Providers
Pocket TTS (Default)
- Provider ID: pocket-tts
- Model: Kyutai Labs Pocket TTS (100M parameters)
- Performance: ~6x real-time on MacBook Air M4, ~200ms first chunk
- Requirements: CPU only (no GPU needed)
- Language: English
- Installation:
pip install pocket-tts or uvx pocket-tts
Built-in Voices:
- alba (default) - Female voice
- marius - Male voice
- javert, jean, fantine, cosette, eponine, azelma - Character voices
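As a rough worked example of the ~6x real-time figure above: at that rate, one minute of speech takes about ten seconds to synthesize. The helper below is only a back-of-the-envelope sketch. `estimate_generation_seconds` is a hypothetical name, and the factor will differ on other hardware.

```python
def estimate_generation_seconds(audio_seconds: float, realtime_factor: float = 6.0) -> float:
    """Estimate synthesis time for a clip, given a speed-up over real time.

    The default factor of 6.0 is the approximate MacBook Air M4 figure
    quoted above; treat it as an assumption, not a measurement.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

For instance, `estimate_generation_seconds(60)` gives 10.0 seconds for a one-minute clip.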
MLX-Audio Providers (Apple Silicon Optimized)
MLX-Audio provides 7 different TTS models optimized for Apple Silicon (M1/M2/M3/M4). Each model is registered as a separate provider.
Installation: pip install mlx-audio
Requirements: Apple Silicon Mac (M1/M2/M3/M4), Python 3.9+
1. Kokoro (mlx-audio-kokoro)
- Languages: English, Japanese, Chinese, French, Spanish, Italian, Portuguese, Hindi
- Voices: 10 built-in voices + voice cloning
  - American: af_heart, af_bella, af_nova, af_sky, am_adam, am_echo
  - British: bf_alice, bf_emma, bm_daniel, bm_george
- Best for: Multilingual content, fast generation
- Features: Speed control, voice cloning
2. CSM (mlx-audio-csm)
- Languages: English
- Features: Conversational Speech Model with voice cloning
- Best for: Natural conversations, voice cloning
3. Dia (mlx-audio-dia)
- Languages: English
- Best for: Dialogue-focused content
4. OuteTTS (mlx-audio-oute)
- Languages: English
- Best for: Efficient, fast generation
5. Spark (mlx-audio-spark)
- Languages: English, Chinese
- Best for: Bilingual content
6. Chatterbox (mlx-audio-chatterbox)
- Languages: 16 languages (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh, ja, hu, ko)
- Best for: Expressive multilingual content
7. Soprano (mlx-audio-soprano)
- Languages: English
- Best for: High-quality English TTS
ElevenLabs (Cloud API)
Premium cloud-based TTS with high-quality voices and a large voice library.
Provider ID: elevenlabs
- Languages: 30+ languages (en, es, fr, de, it, pt, pl, uk, nl, sv, da, fi, no, cs, sk, el, ro, bg, hr, sr, mk, lv, lt, et, sl, hu, tr, vi, ar, hi, bn, ta, ko, zh, ja, and others)
- Voices: 10,000+ voices in Voice Library
- Features: Professional voice cloning, voice design, emotional control
- Requirements: API key (ELEVEN_API_KEY environment variable)
- Pricing: Free tier available, paid plans for production
- Best for: Production-quality voiceovers, professional content
- Installation:
pip install elevenlabs
Popular Voices:
- rachel, drew, clyde, paul, domi - English
- antoni, thomas, charlie, george - Various accents
- emily, elli, charlotte, alice, matilda - Female voices
- Or use any custom voice_id from the Voice Library
Coqui TTS Providers (Open Source)
Coqui TTS provides 4 different models for various use cases - all free and open source.
Installation: pip install TTS
Requirements: Python 3.10-3.14, CPU or GPU
1. XTTS v2 (coqui-xtts_v2)
- Languages: 17 languages (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi)
- Features: Best quality, voice cloning with 3-10 seconds of audio
- Best for: Multilingual projects with voice cloning needs
- Streaming: <200ms latency
2. VITS (coqui-vits)
- Languages: English
- Features: Fast, single speaker
- Best for: Quick English TTS without voice customization
3. YourTTS (coqui-yourtts)
- Languages: English, French, Portuguese
- Features: Multilingual voice cloning, multi-speaker
- Best for: Voice cloning in multiple languages
4. Bark (coqui-bark)
- Languages: 13 languages (en, de, es, fr, hi, it, ja, ko, pl, pt, ru, tr, zh)
- Features: Highly expressive, multi-speaker
- Best for: Expressive, natural-sounding speech
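The language lists above can be turned into a small lookup when choosing a provider for a given language. The sketch below is illustrative only: `PROVIDER_LANGUAGES` and `providers_for` are hypothetical names that are not part of the skill, and the table copies only some of the provider entries from this section.

```python
from typing import Dict, List, Set

# Language codes copied from the provider descriptions above.
# Only a subset of providers is included; extend the table as needed.
PROVIDER_LANGUAGES: Dict[str, Set[str]] = {
    "pocket-tts": {"en"},
    "mlx-audio-spark": {"en", "zh"},
    "mlx-audio-chatterbox": {
        "en", "es", "fr", "de", "it", "pt", "pl", "tr",
        "ru", "nl", "cs", "ar", "zh", "ja", "hu", "ko",
    },
    "coqui-yourtts": {"en", "fr", "pt"},
    "coqui-bark": {
        "en", "de", "es", "fr", "hi", "it", "ja",
        "ko", "pl", "pt", "ru", "tr", "zh",
    },
}


def providers_for(lang: str) -> List[str]:
    """Return the provider IDs (sorted) whose language list contains `lang`."""
    return sorted(
        provider
        for provider, langs in PROVIDER_LANGUAGES.items()
        if lang in langs
    )
```

The returned IDs can be passed straight to `--provider`.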
Core Capabilities
1. Basic Text-to-Speech
Generate audio from any text input:
scripts/create_audio.py \
--text "Your text here" \
--output audio.wav
2. Voice Selection
Choose from built-in voices:
scripts/create_audio.py \
--voice marius \
--text "Hello in a different voice" \
--output voice_demo.wav
3. Voice Cloning
Clone any voice by providing a WAV file:
scripts/create_audio.py \
--voice /path/to/reference_voice.wav \
--text "This will sound like the reference" \
--output cloned.wav
4. Provider Selection
Switch between TTS providers (extensible architecture):
# List all available providers
scripts/create_audio.py --list
# Use specific provider
scripts/create_audio.py \
--provider pocket-tts \
--text "Hello" \
--output audio.wav
5. Pipeline Integration
Integrate with other tools via stdin/stdout:
# From file
cat script.txt | scripts/create_audio.py --output narration.wav
# From command output
echo "This is generated text" | scripts/create_audio.py -o result.wav
# Chain with other tools
cat content.md | sed 's/#//g' | scripts/create_audio.py -o doc_audio.wav
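The same stdin pipeline can be driven from Python. This is a minimal sketch that assumes the CLI reads text from stdin exactly as in the shell examples above; `strip_heading_markers` and `narrate` are hypothetical helper names. Unlike `sed 's/#//g'`, which deletes every `#`, the helper removes only the heading markers at the start of a line.

```python
import re
import subprocess
from typing import List


def strip_heading_markers(markdown: str) -> str:
    """Remove leading '#' heading markers, keeping the heading text."""
    return re.sub(r"^#+[ \t]*", "", markdown, flags=re.MULTILINE)


def narrate(text: str, output: str) -> List[str]:
    """Pipe `text` to create_audio.py on stdin and return the argv used.

    Assumes the script is reachable at scripts/create_audio.py relative
    to the current working directory.
    """
    argv = ["scripts/create_audio.py", "--output", output]
    subprocess.run(argv, input=text, text=True, check=True)
    return argv
```

A typical call would be `narrate(strip_heading_markers(open("content.md").read()), "doc_audio.wav")`.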
Common Use Cases
Voiceover Generation
# Pocket TTS voiceover
scripts/create_audio.py \
--voice alba \
--text "Welcome to this tutorial about..." \
--output intro_voiceover.wav
# MLX-Audio Kokoro (multilingual)
scripts/create_audio.py \
--provider mlx-audio-kokoro \
--voice af_heart \
--text "Welcome to this tutorial" \
--output intro_mlx.wav
Multi-Voice Content
# Character 1 (Pocket TTS)
scripts/create_audio.py --voice jean --text "First character speaks" -o char1.wav
# Character 2 (MLX-Audio)
scripts/create_audio.py --provider mlx-audio-kokoro --voice bm_george --text "Second character responds" -o char2.wav
Multilingual Content
# English
scripts/create_audio.py \
--provider mlx-audio-kokoro \
--voice af_heart \
--params '{"lang": "en"}' \
--text "Hello world" \
--output en.wav
# Japanese
scripts/create_audio.py \
--provider mlx-audio-kokoro \
--params '{"lang": "ja"}' \
--text "こんにちは世界" \
--output ja.wav
# French
scripts/create_audio.py \
--provider mlx-audio-kokoro \
--params '{"lang": "fr"}' \
--text "Bonjour le monde" \
--output fr.wav
Custom Voice Branding
# Pocket TTS voice cloning
scripts/create_audio.py \
--voice /path/to/your_voice_sample.wav \
--text "All content in my voice" \
--output branded_audio.wav
# MLX-Audio CSM voice cloning
scripts/create_audio.py \
--provider mlx-audio-csm \
--voice custom \
--params '{"custom_voice_path": "/path/to/voice.wav"}' \
--text "Cloned voice with CSM" \
--output csm_cloned.wav
Speed Control
# Faster speech (1.5x)
scripts/create_audio.py \
--provider mlx-audio-kokoro \
--voice af_nova \
--params '{"speed": 1.5}' \
--text "This will be faster" \
--output fast.wav
# Slower speech (0.8x)
scripts/create_audio.py \
--provider mlx-audio-kokoro \
--voice bf_emma \
--params '{"speed": 0.8}' \
--text "This will be slower" \
--output slow.wav
ElevenLabs Production Quality
# Set API key first
export ELEVEN_API_KEY="your_api_key_here"
# Use default voice
scripts/create_audio.py \
--provider elevenlabs \
--voice rachel \
--text "Professional quality voiceover" \
--output professional.wav
# Use custom voice from Voice Library
scripts/create_audio.py \
--provider elevenlabs \
--voice custom \
--params '{"voice_id": "your_voice_id_here"}' \
--text "Custom voice from library" \
--output custom.wav
# Advanced settings (stability, similarity, style)
scripts/create_audio.py \
--provider elevenlabs \
--voice rachel \
--params '{"stability": 60, "similarity_boost": 80, "style": 20, "speed": 1.1}' \
--text "Fine-tuned voice settings" \
--output tuned.wav
# Multilingual with language enforcement
scripts/create_audio.py \
--provider elevenlabs \
--voice antoni \
--params '{"language_code": "es"}' \
--text "Hola mundo" \
--output spanish.wav
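Because the ElevenLabs provider depends on the ELEVEN_API_KEY environment variable, a wrapper script may want to fail early with a clear message when it is missing. The following is a sketch under that assumption; `elevenlabs_key` is a hypothetical helper and is not part of the skill.

```python
import os
from typing import Mapping, Optional


def elevenlabs_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the ElevenLabs API key, or raise a clear error if it is unset.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    key = env.get("ELEVEN_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ELEVEN_API_KEY is not set; export it before using --provider elevenlabs"
        )
    return key
```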
Coqui TTS Voice Cloning
# XTTS v2 voice cloning (best quality)
scripts/create_audio.py \
--provider coqui-xtts_v2 \
--voice custom \
--params '{"speaker_wav": "/path/to/reference.wav", "language": "en"}' \
--text "Cloned voice with XTTS v2" \
--output xtts_cloned.wav
# YourTTS multilingual cloning
scripts/create_audio.py \
--provider coqui-yourtts \
--voice custom \
--params '{"speaker_wav": "/path/to/voice.wav", "language": "fr"}' \
--text "Bonjour le monde" \
--output yourtts_fr.wav
# VITS fast English
scripts/create_audio.py \
--provider coqui-vits \
--text "Fast English synthesis" \
--output vits.wav
# Bark expressive speech
scripts/create_audio.py \
--provider coqui-bark \
--voice custom \
--params '{"speaker_idx": 0}' \
--text "Very expressive and natural sounding" \
--output bark.wav
Adding New TTS Providers
The skill uses an extensible provider architecture. To add a new provider:
- Create a provider class in scripts/<provider>_provider.py:
from typing import List

from tts_provider import TTSProvider


class NewProvider(TTSProvider):
    @property
    def name(self) -> str:
        return "new-provider"

    @property
    def supported_voices(self) -> List[str]:
        return ["voice1", "voice2"]

    # Implement other required methods...
- Register the provider in scripts/provider_registry.py:
from new_provider import NewProvider
ProviderRegistry.register(NewProvider)
- Use immediately:
scripts/create_audio.py --provider new-provider --voice voice1 --text "Test"
See scripts/pocket_tts_provider.py for a complete implementation example.
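To make the register-then-use flow concrete, here is a minimal, self-contained sketch of how a name-keyed registry could work. It is not the skill's actual `provider_registry.py`: the `TTSProvider` stand-in, the `get` and `names` methods, and the error message are all assumptions made for illustration. Only `ProviderRegistry.register(NewProvider)` mirrors the call shown above.

```python
from typing import Dict, List, Type


class TTSProvider:
    """Stand-in for the base class in tts_provider.py (assumed shape)."""

    @property
    def name(self) -> str:
        raise NotImplementedError

    @property
    def supported_voices(self) -> List[str]:
        raise NotImplementedError


class ProviderRegistry:
    """Hypothetical registry keyed by each provider's `name`."""

    _providers: Dict[str, Type[TTSProvider]] = {}

    @classmethod
    def register(cls, provider_cls: Type[TTSProvider]) -> None:
        # Instantiate once to read the provider's identifier.
        cls._providers[provider_cls().name] = provider_cls

    @classmethod
    def get(cls, name: str) -> TTSProvider:
        if name not in cls._providers:
            known = ", ".join(sorted(cls._providers)) or "none"
            raise ValueError(f"Unknown provider '{name}' (registered: {known})")
        return cls._providers[name]()

    @classmethod
    def names(cls) -> List[str]:
        return sorted(cls._providers)


class NewProvider(TTSProvider):
    @property
    def name(self) -> str:
        return "new-provider"

    @property
    def supported_voices(self) -> List[str]:
        return ["voice1", "voice2"]


ProviderRegistry.register(NewProvider)
```

Keying the registry by `name` is what would let `--provider new-provider` resolve to the new class without further changes to the CLI.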
Advanced Usage
Provider-Specific Parameters
Pass provider-specific parameters as JSON:
scripts/create_audio.py \
--voice custom \
--params '{"custom_voice_path": "/path/to/voice.wav"}' \
--text "Using custom voice" \
--output result.wav
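When the command is assembled from a script, serializing the parameters with `json.dumps` avoids hand-written quoting mistakes in the `--params` value. The sketch below uses only the flags shown in this document; `build_command` is a hypothetical helper name.

```python
import json
import shlex
from typing import Any, Dict, List, Optional


def build_command(
    text: str,
    output: str,
    provider: Optional[str] = None,
    voice: Optional[str] = None,
    params: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Build an argv list for create_audio.py, encoding params as JSON."""
    argv = ["scripts/create_audio.py"]
    if provider:
        argv += ["--provider", provider]
    if voice:
        argv += ["--voice", voice]
    if params:
        argv += ["--params", json.dumps(params)]
    argv += ["--text", text, "--output", output]
    return argv


def as_shell(argv: List[str]) -> str:
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run` bypasses the shell entirely, so text containing quotes or spaces needs no escaping.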
Batch Processing
Generate multiple audio files:
# Simple loop
for voice in alba marius jean; do
  scripts/create_audio.py \
    --voice "$voice" \
    --text "Sample text" \
    --output "${voice}_sample.wav"
done
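For larger batches it can help to keep going after a failure and report which voices did not render. The following Python sketch does the same work as the loop above under that assumption. `run_batch` and its `run` parameter are hypothetical; the injectable runner exists only so the loop can be exercised without invoking the real CLI.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def run_batch(
    voices: Iterable[str],
    text: str,
    run: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Generate one file per voice and return the voices that failed.

    `run` takes an argv list and returns an exit code; by default it
    executes the command with subprocess.
    """
    if run is None:
        run = lambda argv: subprocess.run(argv).returncode
    failed: List[str] = []
    for voice in voices:
        argv = [
            "scripts/create_audio.py",
            "--voice", voice,
            "--text", text,
            "--output", f"{voice}_sample.wav",
        ]
        if run(argv) != 0:
            failed.append(voice)
    return failed
```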
Installation Requirements
Pocket TTS
# Via pip
pip install pocket-tts
# Or use uvx (no installation needed)
uvx pocket-tts generate --text "Test"
Dependencies:
- Python 3.10+ (supports up to 3.14)
- PyTorch 2.5+ (CPU version)
- scipy (for WAV file writing)
Troubleshooting
Model not loading?
- First run takes longer (downloads model)
- Model is cached for subsequent uses
- Check internet connection for initial download
Import errors?
pip install pocket-tts scipy
Voice file not found?
- Ensure WAV files are valid audio files
- Use absolute paths for custom voices
- Check file permissions
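The three checks above can be automated before a reference file is passed to `--voice`. This is a sketch using only the Python standard library; `check_voice_file` is a hypothetical helper, and it only verifies that the file parses as WAV, not that a given provider will accept it.

```python
import os
import wave
from typing import List


def check_voice_file(path: str) -> List[str]:
    """Return a list of problems found with a reference WAV (empty if none)."""
    problems: List[str] = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    if not os.path.isfile(path):
        problems.append("file not found")
        return problems
    if not os.access(path, os.R_OK):
        problems.append("file is not readable")
        return problems
    try:
        with wave.open(path, "rb") as wf:
            if wf.getnframes() == 0:
                problems.append("WAV file contains no audio frames")
    except (wave.Error, EOFError):
        # wave raises Error for non-RIFF data and EOFError for truncated files.
        problems.append("not a valid WAV file")
    return problems
```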
Technical Details
Architecture
create_audio.py (CLI)
↓
provider_registry.py (Provider management)
↓
tts_provider.py (Base interface)
↓
[pocket_tts_provider.py, future_provider.py, ...]
Provider Interface
All providers implement:
- name: Provider identifier
- supported_voices: List of available voices
- default_voice: Fallback voice
- generate_audio(): Core synthesis method
- get_provider_params(): Provider-specific options
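One way to express these five members is an abstract base class. The sketch below is an assumption about the shape of `tts_provider.py`, not a copy of it: the method signatures, the `resolve_voice` helper, and `DemoProvider` are hypothetical and exist only to show how the pieces could fit together.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class TTSProvider(ABC):
    """Hypothetical base interface mirroring the five members listed above."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def supported_voices(self) -> List[str]: ...

    @property
    @abstractmethod
    def default_voice(self) -> str: ...

    @abstractmethod
    def generate_audio(
        self,
        text: str,
        voice: str,
        output_path: str,
        params: Optional[Dict[str, Any]] = None,
    ) -> str:
        """Synthesize `text` and return the path of the written file."""

    def get_provider_params(self) -> Dict[str, Any]:
        """Provider-specific options and their defaults (none by default)."""
        return {}

    def resolve_voice(self, voice: Optional[str]) -> str:
        # Assumed behaviour: fall back to default_voice when none is given.
        return voice if voice else self.default_voice


class DemoProvider(TTSProvider):
    """Toy provider that performs no synthesis; for illustration only."""

    @property
    def name(self) -> str:
        return "demo"

    @property
    def supported_voices(self) -> List[str]:
        return ["voice1", "voice2"]

    @property
    def default_voice(self) -> str:
        return "voice1"

    def generate_audio(self, text, voice, output_path, params=None) -> str:
        # A real provider would run its model here and write a WAV file.
        return output_path
```

Marking the members abstract means a provider that forgets one of them fails at instantiation rather than midway through synthesis.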
Output Format
- Format: WAV (uncompressed)
- Sample Rate: Provider-specific (Pocket TTS: 16kHz)
- Channels: Mono
- Bit Depth: 16-bit PCM
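The format above maps directly onto the header fields exposed by Python's standard `wave` module, which makes it easy to confirm what a provider actually wrote. In this sketch, `write_silence` and `wav_format` are hypothetical helpers; the 16000 Hz default is taken from the Pocket TTS figure above, and other providers may use a different rate.

```python
import wave


def write_silence(path: str, sample_rate: int = 16000, seconds: float = 0.25) -> None:
    """Write a mono, 16-bit PCM WAV file containing silence."""
    frame_count = int(sample_rate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * frame_count)


def wav_format(path: str) -> dict:
    """Read the header fields that correspond to the list above."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "bit_depth": wf.getsampwidth() * 8,
            "sample_rate": wf.getframerate(),
            "seconds": wf.getnframes() / wf.getframerate(),
        }
```

Running `wav_format` on a generated file is a quick way to check the sample rate before mixing output from different providers.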
Resources
Scripts
- create_audio.py - Main CLI tool
- tts_provider.py - Base provider interface
- pocket_tts_provider.py - Pocket TTS implementation
- provider_registry.py - Provider management system