VibeVoice TTS Skill

Generate high-quality speech audio from text using the VibeVoice-1.5B model running locally.

Quick Reference

Skill directory: {SKILL_DIR} (resolve at runtime, e.g. ~/.codebuddy/skills/vibevoice-tts)
Setup script: {SKILL_DIR}/scripts/setup.sh
TTS script: {SKILL_DIR}/scripts/tts_generate.py
Virtual env: {SKILL_DIR}/.venv/
Bundled voices: {SKILL_DIR}/voices/ (9 presets included)
Output directory: {SKILL_DIR}/outputs/
Default voice: zh-Xinran_woman (Chinese female)
Default model: vibevoice/VibeVoice-1.5B (auto-downloaded from HuggingFace)

Note for AI agent: {SKILL_DIR} refers to the parent directory of scripts/. At runtime, resolve it to the actual installed skill path. All commands use uv run --python {SKILL_DIR}/.venv/bin/python to execute within the skill's own virtual environment.
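
The path resolution described in the note can be sketched in a few lines. This is an illustration only: the helper names `resolve_skill_dir` and `venv_python` are hypothetical and are not taken from the skill's scripts.

```python
from pathlib import Path


def resolve_skill_dir(script_path: str) -> Path:
    """Return the skill directory, i.e. the parent of the scripts/ folder."""
    script = Path(script_path).resolve()
    # <skill>/scripts/tts_generate.py -> <skill>/scripts -> <skill>
    return script.parent.parent


def venv_python(skill_dir: Path) -> Path:
    """Interpreter inside the skill's own virtual environment."""
    return skill_dir / ".venv" / "bin" / "python"
```

Inside a script under scripts/, calling `resolve_skill_dir(__file__)` would give the directory that `{SKILL_DIR}` stands for.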

First-Time Setup

Run the setup script once to create the virtual environment and install dependencies:

bash {SKILL_DIR}/scripts/setup.sh

This will:

Create a .venv/ virtual environment in the skill directory (Python 3.11)
Install vibevoice from GitHub source (latest version) with all dependencies

Prerequisites: only uv needs to be installed on the system. Install with:

curl -LsSf https://astral.sh/uv/install.sh | sh

Verify Setup

Check that the environment is ready:

uv run --python {SKILL_DIR}/.venv/bin/python -c "import vibevoice; print('vibevoice OK')"

Workflow

Step 0: Ensure Environment

Before generating, check if {SKILL_DIR}/.venv/ exists. If not, run bash {SKILL_DIR}/scripts/setup.sh first.
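
A minimal sketch of this check, assuming the presence of the .venv/ directory is a sufficient signal that setup has run. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def needs_setup(skill_dir: Path) -> bool:
    """True when the skill's virtual environment has not been created yet."""
    return not (skill_dir / ".venv").is_dir()


def setup_command(skill_dir: Path) -> list[str]:
    """Command line that runs the one-time setup script."""
    return ["bash", str(skill_dir / "scripts" / "setup.sh")]


def ensure_environment(skill_dir: Path) -> None:
    """Run setup.sh only if the virtual environment is missing."""
    if needs_setup(skill_dir):
        subprocess.run(setup_command(skill_dir), check=True)
```

A directory check does not prove the install succeeded; the import test under Verify Setup is the stronger confirmation.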

Step 1: Determine Input

Identify the text to synthesize and the number of speakers.

Plain text (no Speaker N: tags): auto-wrapped as single speaker.
Multi-speaker script: must follow Speaker N: ... format (N from 1 to 4).

For Chinese text, prefer English punctuation (commas , and periods .) to avoid pronunciation instability. Refer to references/script_format.md for detailed format guidance.
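
The input rules above can be sketched as a small pre-processing step. This is a sketch under assumptions, not the script's actual logic: it assumes plain text is wrapped as `Speaker 1:`, that every non-empty line of a multi-speaker script carries its own tag, and the name `prepare_script` is hypothetical.

```python
import re

# Matches the "Speaker N:" tag at the start of a line.
TAG = re.compile(r"^Speaker (\d+):")

# Full-width comma and period mapped to their ASCII equivalents.
PUNCT_MAP = str.maketrans({",": ",", "。": "."})


def prepare_script(text: str) -> str:
    """Normalize punctuation, then wrap or validate the speaker tags."""
    lines = [ln.strip() for ln in text.translate(PUNCT_MAP).splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no text to synthesize")
    tags = [TAG.match(ln) for ln in lines]
    if not any(tags):
        # Plain text: treat everything as a single speaker (assumed to be Speaker 1).
        return "Speaker 1: " + " ".join(lines)
    for ln, tag in zip(lines, tags):
        if tag is None:
            raise ValueError(f"line lacks a 'Speaker N:' tag: {ln!r}")
        if not 1 <= int(tag.group(1)) <= 4:
            raise ValueError(f"speaker number must be 1 to 4: {ln!r}")
    return "\n".join(lines)
```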

Step 2: Choose Voices

Select voice presets for each speaker. Refer to references/voice_presets.md for the full list.

Common choices:

Chinese female (default): zh-Xinran_woman
Chinese male: zh-Bowen_man
English female: en-Alice_woman or en-Maya_woman
English male: en-Frank_man or en-Carter_man

Short aliases such as Xinran, Bowen, Alice, and Frank are also accepted.
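
How aliases map to full preset names is not specified here. One plausible approach, shown purely as an illustration, is to match the alias against the name part of each preset. The `PRESETS` list contains only the six presets named in this step (not all nine bundled voices), and `resolve_voice` is a hypothetical helper.

```python
PRESETS = [
    "zh-Xinran_woman",
    "zh-Bowen_man",
    "en-Alice_woman",
    "en-Maya_woman",
    "en-Frank_man",
    "en-Carter_man",
]


def resolve_voice(name: str, presets=PRESETS) -> str:
    """Map a full preset name or a short alias to a full preset name."""
    if name in presets:
        return name
    # "zh-Xinran_woman" -> "Xinran"; compare case-insensitively.
    matches = [
        p for p in presets
        if p.split("-", 1)[-1].split("_", 1)[0].lower() == name.lower()
    ]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous voice: {name!r}")
    return matches[0]
```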

Step 3: Generate Audio

All commands use the skill's virtual environment via uv run:

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py [arguments]

Single speaker from plain text (most common)

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "你好,欢迎收听本期节目."

Output is saved to {SKILL_DIR}/outputs/tts_TIMESTAMP.wav.

Single speaker with custom voice and output

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "Hello and welcome to today's episode." \
    --speaker_names en-Alice_woman \
    --output /path/to/output.wav

Multi-speaker from a script file

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --txt_path /path/to/script.txt \
    --speaker_names zh-Xinran_woman zh-Bowen_man
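
The script file passed to --txt_path can be prepared programmatically. The sketch below writes a list of (speaker number, text) turns to a temporary .txt file in the Speaker N: format. The function name and the one-turn-per-line layout are assumptions; references/script_format.md remains the authority on the format.

```python
import tempfile
from pathlib import Path


def write_script_file(turns: list[tuple[int, str]]) -> Path:
    """Write (speaker_number, text) turns to a temporary .txt script file."""
    lines = []
    for speaker, text in turns:
        if not 1 <= speaker <= 4:
            raise ValueError(f"speaker number must be 1 to 4, got {speaker}")
        # Collapse internal whitespace and newlines so each turn stays on one line.
        lines.append(f"Speaker {speaker}: {' '.join(text.split())}")
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as fh:
        fh.write("\n".join(lines) + "\n")
    return Path(fh.name)
```

The returned path can be passed directly as the --txt_path value. The file is not deleted automatically, so remove it once generation finishes.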

Use custom voice files from a different directory

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "Hello world." \
    --voices_dir /path/to/my/custom/voices

All available arguments

| Argument | Default | Description |
|----------|---------|-------------|
| --text | (required if no --txt_path) | Plain text string to synthesize |
| --txt_path | (required if no --text) | Path to a .txt script file |
| --model_path | vibevoice/VibeVoice-1.5B | HuggingFace model id or local path |
| --speaker_names | zh-Xinran_woman | Voice preset name(s), space-separated |
| --output / -o | auto-generated | Output .wav file path |
| --output_dir | {SKILL_DIR}/outputs/ | Directory for auto-named outputs |
| --device | auto-detect | cuda, mps, or cpu |
| --cfg_scale | 1.3 | CFG guidance scale |
| --seed | random | Random seed for reproducibility |
| --ddpm_steps | 10 | DDPM denoising steps (more = higher quality, slower) |
| --disable_prefill | false | Skip voice cloning / speaker conditioning |
| --checkpoint_path | None | Path to fine-tuned LoRA checkpoint directory (optional) |
| --voices_dir | bundled voices | Directory containing custom voice .wav files |
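
When the command is launched from code rather than typed into a shell, building it as an argument list avoids quoting problems. The sketch below covers only a handful of the arguments listed above. `build_tts_command` is a hypothetical helper, and requiring exactly one of text or txt_path is a simplifying assumption, since the behavior when both are passed is not documented here.

```python
from pathlib import Path
from typing import Optional, Sequence


def build_tts_command(
    skill_dir: Path,
    *,
    text: Optional[str] = None,
    txt_path: Optional[str] = None,
    speaker_names: Sequence[str] = (),
    output: Optional[str] = None,
    seed: Optional[int] = None,
) -> list[str]:
    """Assemble the argv list for one tts_generate.py run."""
    # Simplifying assumption: exactly one input source per run.
    if (text is None) == (txt_path is None):
        raise ValueError("pass exactly one of text or txt_path")
    cmd = [
        "uv", "run", "--python", str(skill_dir / ".venv" / "bin" / "python"),
        str(skill_dir / "scripts" / "tts_generate.py"),
    ]
    cmd += ["--text", text] if text is not None else ["--txt_path", txt_path]
    if speaker_names:
        cmd += ["--speaker_names", *speaker_names]
    if output is not None:
        cmd += ["--output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. No shell is involved, so the text needs no extra escaping.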

Step 4: Verify Output

After generation, the script prints a summary with output path, duration, and generation time. Inform the user of the output file location.
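
As an independent check on the printed summary, the output file's duration can be read back with Python's standard `wave` module. This sketch assumes the output is a PCM-encoded WAV file; `wave.open` raises `wave.Error` for encodings it does not support, in which case another audio library would be needed.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()
```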

Important Notes

First run downloads the model (~3 GB) from HuggingFace. Subsequent runs use the cached model.
No git clone needed: the setup script installs vibevoice directly from GitHub via uv pip install.
Mac MPS: uses float32 + sdpa attention. Requires ~6 GB unified memory for the 1.5B model.
CUDA: uses bfloat16 + flash_attention_2 for optimal speed. Falls back to sdpa if flash attention is unavailable.
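Generation speed: on Apple Silicon, expect generation to take roughly 2-5x the duration of the audio (e.g. 30 s of audio takes 60-150 s).
Long text: the model supports up to ~90 minutes of audio. Generation time grows with audio length, so at the Apple Silicon ratio above a 90-minute script could take roughly 3 to 7.5 hours.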
Multi-line text: --text is for short, single-line text only, because newlines are easily lost or mangled when passed as a shell argument. When the user provides multi-line text or long paragraphs, always write a temporary .txt file in the proper Speaker N: format and pass it with --txt_path instead.
Fine-tuned models: use --checkpoint_path /path/to/lora/dir to load LoRA adapters trained via the fine-tuning pipeline.
Custom voices: place .wav files in a directory and pass --voices_dir to use them.
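
The generation-speed note above reduces to simple arithmetic. The sketch below turns a target audio length into a rough wall-clock range using the 2-5x figure quoted for Apple Silicon. It assumes generation time scales linearly with audio length and excludes model download and load time; the function name is hypothetical.

```python
def estimate_generation_seconds(
    audio_seconds: float, rtf_low: float = 2.0, rtf_high: float = 5.0
) -> tuple[float, float]:
    """Return (best case, worst case) generation time in seconds."""
    return audio_seconds * rtf_low, audio_seconds * rtf_high


# 30 s of audio gives the 60-150 s range quoted in the note.
low, high = estimate_generation_seconds(30)
print(f"{low:.0f}-{high:.0f} s")
```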

Loading a fine-tuned LoRA checkpoint

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --txt_path /path/to/script.txt \
    --checkpoint_path /path/to/finetuned/checkpoint \
    --speaker_names en-Alice_woman

Portability

This skill is fully self-contained and portable across machines. Everything lives under the skill directory:

{SKILL_DIR}/
├── .venv/          ← virtual environment (created by setup.sh)
├── voices/         ← 9 bundled voice presets (~10 MB)
├── outputs/        ← generated audio files
├── scripts/
│   ├── setup.sh    ← one-time environment setup
│   └── tts_generate.py
├── references/
└── SKILL.md

To use on a new machine:

Copy the skill directory (or let CodeBuddy sync it)
Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
Run bash {SKILL_DIR}/scripts/setup.sh
Done! Start generating speech.

No hardcoded paths, no external project dependencies, no git clone needed.