Category: Content & Media · No API key required

audio-generation

Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends.

Author: jakexiaohubgithub

Audio Generation

Generate audio using generate_media with mode="audio". Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.

Quick Start

# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)

Audio Types

| Type | Backends | Description |
|------|----------|-------------|
| "speech" (default) | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| "music" | ElevenLabs only | Music generation from text prompt |
| "sound_effect" | ElevenLabs only | Sound effect generation |
| "voice_conversion" | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| "audio_isolation" | ElevenLabs only | Remove background noise, isolate vocals |
| "voice_design" | ElevenLabs only | Create a new synthetic voice from text description |
| "voice_clone" | ElevenLabs only | Clone a voice from audio samples |
| "dubbing" | ElevenLabs only | Translate and dub audio to another language |

Backend Comparison

| Backend | Default Model | Supports | API Key |
|---------|--------------|----------|---------|
| ElevenLabs (priority 1) | eleven_multilingual_v2 | Speech, music, SFX | ELEVENLABS_API_KEY |
| OpenAI (priority 2) | gpt-4o-mini-tts | Speech only | OPENAI_API_KEY |

If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.
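The priority and fallback rules above can be pictured with a small sketch. This is not MassGen's implementation: `candidate_backends` and `run_with_fallback` are hypothetical helpers, and the lookup tables simply restate the Audio Types and Backend Comparison tables.

```python
import os

# Assumption: these tables mirror the documentation above, not MassGen's source.
BACKEND_PRIORITY = ["elevenlabs", "openai"]
API_KEY_ENV = {"elevenlabs": "ELEVENLABS_API_KEY", "openai": "OPENAI_API_KEY"}
ELEVENLABS_ONLY_TYPES = {
    "music", "sound_effect", "voice_conversion", "audio_isolation",
    "voice_design", "voice_clone", "dubbing",
}


def candidate_backends(audio_type="speech", env=None):
    """Return the backends to try, in priority order (hypothetical helper)."""
    env = os.environ if env is None else env
    # A backend counts as available only if its API key variable is set.
    available = [name for name in BACKEND_PRIORITY if env.get(API_KEY_ENV[name])]
    if audio_type in ELEVENLABS_ONLY_TYPES:
        available = [name for name in available if name == "elevenlabs"]
    return available


def run_with_fallback(backends, call):
    """Try call(backend) for each backend in order; return the first success."""
    errors = []
    for backend in backends:
        try:
            return backend, call(backend)
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{backend}: {exc}")
    raise RuntimeError("all audio backends failed: " + "; ".join(errors))
```

With both keys set, a speech request would try ElevenLabs first and OpenAI second, while a music request has no fallback because only ElevenLabs supports it.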

Key Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| prompt | Text to speak (speech) or description (music/SFX) | "Hello world!" |
| voice | Voice name or ID | "Rachel", "nova", "alloy" |
| audio_type | Type of audio | "speech", "music", "sound_effect" |
| duration | Length in seconds (music/SFX only) | 30 |
| instructions | Speaking style (OpenAI gpt-4o-mini-tts only) | "warm, reflective tone" |
| audio_format | Output format | "mp3", "wav", "opus" |
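One way to catch mismatched parameters before making a call is a small pre-flight check. The sketch below is illustrative only: `check_audio_args` is a hypothetical helper, its rules are read off the table above, and the positive-duration and non-empty-prompt rules are added assumptions. How generate_media itself reacts to these combinations is not documented here.

```python
# Assumption: rules are read off the parameter table; the real tool may differ.
TIMED_TYPES = {"music", "sound_effect"}


def check_audio_args(prompt, audio_type="speech", duration=None):
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []
    if not prompt or not prompt.strip():
        problems.append("prompt is empty")
    if duration is not None:
        if audio_type not in TIMED_TYPES:
            # The table lists duration as applying to music/SFX only.
            problems.append(f"duration applies to music/sound_effect, not {audio_type!r}")
        if duration <= 0:
            # Assumption: a length in seconds should be positive.
            problems.append("duration must be a positive number of seconds")
    return problems
```

An empty list means the arguments are consistent with the table; for example, `check_audio_args("Upbeat jazz piano", audio_type="music", duration=30)` reports no problems.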

Voice Quick Reference

ElevenLabs (top voices):

| Voice | Character |
|-------|-----------|
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |

OpenAI voices: alloy, echo, fable, onyx, nova, shimmer, coral, sage

Important: prompt vs instructions

For speech, prompt is the literal text to speak. Style guidance goes in instructions:

# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai"
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!

The instructions parameter works only with OpenAI gpt-4o-mini-tts. ElevenLabs has no equivalent; with that backend, set the tone by choosing a voice whose character matches it.
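To keep the spoken text and the style guidance from getting mixed, the arguments can be assembled by a small wrapper. This is a sketch under assumptions: `speech_kwargs` is a hypothetical helper, and it forwards the style only when the OpenAI backend is requested explicitly, since that is the only case the text above says will honor it.

```python
def speech_kwargs(text, style=None, voice=None, backend_type=None):
    """Build keyword arguments for a speech request (hypothetical helper)."""
    # prompt always carries the literal text to speak, never style hints.
    kwargs = {"prompt": text, "mode": "audio"}
    if voice:
        kwargs["voice"] = voice
    if backend_type:
        kwargs["backend_type"] = backend_type
    # Assumption: style is forwarded only for an explicit OpenAI request,
    # because instructions is documented as OpenAI-only.
    if style and backend_type == "openai":
        kwargs["instructions"] = style
    return kwargs
```

The result can then be unpacked into the call, for example `generate_media(**speech_kwargs("Welcome to the annual report presentation.", style="warm, reflective tone", voice="alloy", backend_type="openai"))`.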

Audio Understanding

Use read_media (not generate_media) to analyze existing audio:

read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")

Need More Control?