VibeVoice TTS Skill
Generate high-quality speech audio from text using the VibeVoice-1.5B model running locally.
Quick Reference
- Skill directory: {SKILL_DIR} (resolve at runtime, e.g. ~/.codebuddy/skills/vibevoice-tts)
- Setup script: {SKILL_DIR}/scripts/setup.sh
- TTS script: {SKILL_DIR}/scripts/tts_generate.py
- Virtual env: {SKILL_DIR}/.venv/
- Bundled voices: {SKILL_DIR}/voices/ (9 presets included)
- Output directory: {SKILL_DIR}/outputs/
- Default voice: zh-Xinran_woman (Chinese female)
- Default model: vibevoice/VibeVoice-1.5B (auto-downloaded from HuggingFace)
Note for AI agent:
{SKILL_DIR} refers to the parent directory of scripts/. At runtime, resolve it to the actual installed skill path. All commands use uv run --python {SKILL_DIR}/.venv/bin/python to execute within the skill's own virtual environment.
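One way an agent or wrapper might resolve {SKILL_DIR} is from the location of a file inside scripts/. The sketch below is illustrative only: resolve_skill_dir and venv_python are hypothetical helper names, not part of the skill.

```python
from pathlib import Path


def resolve_skill_dir(script_path: str) -> Path:
    """Return the parent of the scripts/ directory that contains script_path."""
    script = Path(script_path).expanduser().resolve()
    if script.parent.name != "scripts":
        raise ValueError(f"{script} is not inside a scripts/ directory")
    return script.parent.parent


def venv_python(skill_dir: Path) -> Path:
    """Path of the interpreter inside the skill's own virtual environment."""
    return skill_dir / ".venv" / "bin" / "python"
```

Inside tts_generate.py itself, the same idea would be resolve_skill_dir(__file__).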
First-Time Setup
Run the setup script once to create the virtual environment and install dependencies:
bash {SKILL_DIR}/scripts/setup.sh
This will:
- Create a .venv/ virtual environment in the skill directory (Python 3.11)
- Install vibevoice from GitHub source (latest version) with all dependencies
Prerequisites: only uv needs to be installed on the system. Install with:
curl -LsSf https://astral.sh/uv/install.sh | sh
Verify Setup
Check that the environment is ready:
uv run --python {SKILL_DIR}/.venv/bin/python -c "import vibevoice; print('vibevoice OK')"
Workflow
Step 0: Ensure Environment
Before generating, check if {SKILL_DIR}/.venv/ exists. If not, run bash {SKILL_DIR}/scripts/setup.sh first.
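The check in this step can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the presence of the .venv/ directory is a sufficient signal that setup has completed.

```python
import subprocess
from pathlib import Path


def environment_ready(skill_dir: Path) -> bool:
    """True when the skill's .venv/ directory already exists."""
    return (skill_dir / ".venv").is_dir()


def ensure_environment(skill_dir: Path) -> bool:
    """Run setup.sh if .venv/ is missing; return True if setup was run."""
    if environment_ready(skill_dir):
        return False
    # check=True raises CalledProcessError if the setup script fails.
    subprocess.run(["bash", str(skill_dir / "scripts" / "setup.sh")], check=True)
    return True
```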
Step 1: Determine Input
Identify the text to synthesize and the number of speakers.
- Plain text (no Speaker N: tags): auto-wrapped as single speaker.
- Multi-speaker script: must follow Speaker N: ... format (N from 1 to 4).
For Chinese text, prefer English punctuation (commas , and periods .) to avoid pronunciation instability. Refer to references/script_format.md for detailed format guidance.
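As a rough illustration of the rules above, the following sketch counts speakers, checks the 1 to 4 range, and swaps full-width commas and periods for their ASCII forms. The helper prepare_script is hypothetical, and wrapping untagged text as a single Speaker 1 line is an assumption about how auto-wrapping works; references/script_format.md remains the authority on the format.

```python
import re
from typing import Tuple

SPEAKER_TAG = re.compile(r"^Speaker (\d+):")
# Full-width comma and period mapped to their ASCII forms.
PUNCTUATION = str.maketrans({",": ",", "。": "."})


def prepare_script(text: str) -> Tuple[str, int]:
    """Return (script, speaker count) for plain text or a tagged script."""
    lines = [ln.strip() for ln in text.translate(PUNCTUATION).splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no text to synthesize")
    speakers = {int(m.group(1)) for ln in lines if (m := SPEAKER_TAG.match(ln))}
    if not speakers:
        # Assumption: untagged text becomes a single Speaker 1 turn.
        return "Speaker 1: " + " ".join(lines), 1
    bad = sorted(n for n in speakers if not 1 <= n <= 4)
    if bad:
        raise ValueError(f"speaker numbers must be 1 to 4, got {bad}")
    return "\n".join(lines), len(speakers)
```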
Step 2: Choose Voices
Select voice presets for each speaker. Refer to references/voice_presets.md for the full list.
Common choices:
- Chinese female (default): zh-Xinran_woman
- Chinese male: zh-Bowen_man
- English female: en-Alice_woman or en-Maya_woman
- English male: en-Frank_man or en-Carter_man
Short aliases like Xinran, Bowen, Alice, Frank are also accepted.
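Alias matching could work roughly like the sketch below. It is not the script's actual lookup: resolve_voice is a hypothetical helper, case-insensitive matching is an assumption, and PRESETS lists only the six presets named in this section (the full set is in references/voice_presets.md).

```python
from typing import Sequence

# Only the presets named in this document; the bundled set has 9.
PRESETS = [
    "zh-Xinran_woman", "zh-Bowen_man",
    "en-Alice_woman", "en-Maya_woman",
    "en-Frank_man", "en-Carter_man",
]


def resolve_voice(name: str, presets: Sequence[str] = PRESETS) -> str:
    """Map a full preset name or a short alias such as Xinran to a preset."""
    if name in presets:
        return name
    wanted = name.lower()
    # The alias is the part between the language prefix and the gender suffix.
    matches = [
        p for p in presets
        if p.split("-", 1)[-1].split("_", 1)[0].lower() == wanted
    ]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous voice: {name!r}")
    return matches[0]
```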
Step 3: Generate Audio
All commands use the skill's virtual environment via uv run:
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py [arguments]
Single speaker from plain text (most common)
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
--text "你好,欢迎收听本期节目."
Output is saved to {SKILL_DIR}/outputs/tts_TIMESTAMP.wav.
Single speaker with custom voice and output
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
--text "Hello and welcome to today's episode." \
--speaker_names en-Alice_woman \
--output /path/to/output.wav
Multi-speaker from a script file
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
--txt_path /path/to/script.txt \
--speaker_names zh-Xinran_woman zh-Bowen_man
Use custom voice files from a different directory
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
--text "Hello world." \
--voices_dir /path/to/my/custom/voices
All available arguments
| Argument | Default | Description |
|----------|---------|-------------|
| --text | (required if no --txt_path) | Plain text string to synthesize |
| --txt_path | (required if no --text) | Path to a .txt script file |
| --model_path | vibevoice/VibeVoice-1.5B | HuggingFace model id or local path |
| --speaker_names | zh-Xinran_woman | Voice preset name(s), space-separated |
| --output / -o | auto-generated | Output .wav file path |
| --output_dir | {SKILL_DIR}/outputs/ | Directory for auto-named outputs |
| --device | auto-detect | cuda, mps, or cpu |
| --cfg_scale | 1.3 | CFG guidance scale |
| --seed | random | Random seed for reproducibility |
| --ddpm_steps | 10 | DDPM denoising steps (more = higher quality, slower) |
| --disable_prefill | false | Skip voice cloning / speaker conditioning |
| --checkpoint_path | None | Path to fine-tuned LoRA checkpoint directory (optional) |
| --voices_dir | bundled voices | Directory containing custom voice .wav files |
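When calling the script from Python rather than a shell, the argument list can be assembled programmatically, which also avoids shell quoting problems. This is a minimal sketch using only flags from the table above. build_tts_command is a hypothetical helper, and rejecting a call that passes both text and txt_path is a choice of this sketch, not documented behavior.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_tts_command(
    skill_dir: Path,
    text: Optional[str] = None,
    txt_path: Optional[Path] = None,
    speaker_names: Sequence[str] = (),
    output: Optional[Path] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble the uv run argument list for tts_generate.py."""
    if (text is None) == (txt_path is None):
        raise ValueError("pass exactly one of text or txt_path")
    cmd = [
        "uv", "run", "--python", str(skill_dir / ".venv" / "bin" / "python"),
        str(skill_dir / "scripts" / "tts_generate.py"),
    ]
    cmd += ["--text", text] if text is not None else ["--txt_path", str(txt_path)]
    if speaker_names:
        cmd += ["--speaker_names", *speaker_names]
    if output is not None:
        cmd += ["--output", str(output)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True).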
Step 4: Verify Output
After generation, the script prints a summary with output path, duration, and generation time. Inform the user of the output file location.
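An independent check of the output file might look like the sketch below. It assumes the script writes PCM-encoded WAV, which the standard library wave module can read; a float-encoded file would need a different reader. Both function names and the min_seconds threshold are hypothetical.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def verify_output(path: Path, min_seconds: float = 0.1) -> float:
    """Raise if the file is missing or suspiciously short; return its duration."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(path)
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(f"{path} is only {duration:.3f}s long")
    return duration
```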
Important Notes
- First run downloads the model (~3 GB) from HuggingFace. Subsequent runs use the cached model.
- No git clone needed: the setup script installs vibevoice directly from GitHub via uv pip install.
- Mac MPS: uses float32 + sdpa attention. Requires ~6 GB unified memory for the 1.5B model.
- CUDA: uses bfloat16 + flash_attention_2 for optimal speed. Falls back to sdpa if flash attention is unavailable.
- Generation speed: on Apple Silicon, expect generation to take roughly 2-5x the audio duration (e.g. 30s of audio takes 60-150s).
- Long text: the model supports up to ~90 minutes of audio. For very long scripts, generation may take a long time.
- Multi-line text: --text is for single-line, short text only (newlines are easily lost or mangled when passed through the shell). When users provide multi-line text or long paragraphs, always prepare a temporary .txt file in the proper Speaker N: format and use --txt_path instead.
- Fine-tuned models: use --checkpoint_path /path/to/lora/dir to load LoRA adapters trained via the fine-tuning pipeline.
- Custom voices: place .wav files in a directory and pass --voices_dir to use them.
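For the multi-line case, the temporary script file could be prepared along these lines. write_script_file is a hypothetical helper, and collapsing each turn onto a single line is an assumption of this sketch; check references/script_format.md for what the format actually allows.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def write_script_file(turns: Iterable[Tuple[int, str]]) -> Path:
    """Write (speaker number, text) turns to a temporary .txt script."""
    lines = []
    for speaker, text in turns:
        if not 1 <= speaker <= 4:
            raise ValueError(f"speaker number must be 1 to 4, got {speaker}")
        # Collapse internal newlines and runs of spaces so each turn is one line.
        lines.append(f"Speaker {speaker}: {' '.join(text.split())}")
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write("\n".join(lines) + "\n")
        return Path(handle.name)
```

The returned path is what gets passed to --txt_path. The file is not deleted automatically, so remove it once generation finishes.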
Loading a fine-tuned LoRA checkpoint
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
--txt_path /path/to/script.txt \
--checkpoint_path /path/to/finetuned/checkpoint \
--speaker_names en-Alice_woman
Portability
This skill is fully self-contained and portable across machines. Everything lives under the skill directory:
{SKILL_DIR}/
├── .venv/ ← virtual environment (created by setup.sh)
├── voices/ ← 9 bundled voice presets (~10 MB)
├── outputs/ ← generated audio files
├── scripts/
│ ├── setup.sh ← one-time environment setup
│ └── tts_generate.py
├── references/
└── SKILL.md
To use on a new machine:
- Copy the skill directory (or let CodeBuddy sync it)
- Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
- Run bash {SKILL_DIR}/scripts/setup.sh
- Done! Start generating speech.
No hardcoded paths, no external project dependencies, no git clone needed.