Speaker Diarization
Advanced speaker diarization using pyannote-audio - state-of-the-art neural network models for speaker identification.
When to Use
Use this skill when:
- Video has multiple speakers (podcasts, interviews, panels)
- You need accurate speaker identification
- Content has overlapping speech (people talking over each other)
- You want speaker-specific clips
- Gemini's diarization isn't accurate enough
- Speakers use different languages or switch between them
Don't use when:
- Single speaker content (use basic transcription instead)
- Real-time processing needed (this is offline/batch)
- Storage is limited (models require ~2GB)
Why pyannote-audio?
Benchmarks (Diarization Error Rate - lower is better):
- pyannote community-1: 17.0% on AMI dataset
- Gemini/Whisper: ~25-30% error rate
- Roughly 35% lower error than cloud APIs (a relative DER reduction, taking the figures above at face value)
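The relative improvement implied by these figures can be checked with a few lines of arithmetic. This sketch assumes the quoted error rates are directly comparable (same dataset and scoring rules), which the figures above do not confirm; the helper name `relative_der_reduction` is hypothetical.

```python
def relative_der_reduction(baseline_der: float, new_der: float) -> float:
    """Return the reduction in error rate as a fraction of the baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der

# Figures quoted above: 17.0% for pyannote, 25-30% for the cloud baseline.
low = relative_der_reduction(25.0, 17.0)
high = relative_der_reduction(30.0, 17.0)
print(f"{low:.0%} to {high:.0%} relative reduction")
```

Depending on which end of the 25-30% range is used as the baseline, the relative reduction falls between about one third and just over two fifths.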
Advantages:
- ✅ Local processing - Privacy, no API costs
- ✅ Better accuracy - State-of-the-art neural models
- ✅ Overlapping speech detection - Identifies when people talk simultaneously
- ✅ Precise timestamps - Millisecond-accurate speaker boundaries
- ✅ No internet required - After initial model download
- ✅ Works with any language - Language-agnostic
Available Scripts
scripts/diarize.py
Main diarization script.
Usage:
python skills/speaker-diarization/scripts/diarize.py <video_path> [options]
Options:
- --output, -o: Output format (json, rttm, srt) - default: json
- --min-speakers: Minimum number of speakers to expect
- --max-speakers: Maximum number of speakers to expect
- --num-speakers: Exact number of speakers (if known)
- --device: Processing device (cpu, cuda) - default: auto
- --huggingface-token: HuggingFace token (or use env var)
Examples:
Basic diarization:
export HUGGINGFACE_TOKEN="your-token"
python skills/speaker-diarization/scripts/diarize.py podcast.mp4
Specify speaker count range:
python skills/speaker-diarization/scripts/diarize.py interview.mp4 --min-speakers 2 --max-speakers 3
Output to RTTM format:
python skills/speaker-diarization/scripts/diarize.py panel.mp4 --output rttm
Output (JSON):
{
"success": true,
"video_path": "podcast.mp4",
"num_speakers": 3,
"duration": 1200.5,
"speakers": {
"SPEAKER_00": {"duration": 450.2, "segments": 45},
"SPEAKER_01": {"duration": 380.5, "segments": 38},
"SPEAKER_02": {"duration": 369.8, "segments": 42}
},
"segments": [
{
"start": 0.0,
"end": 5.2,
"speaker": "SPEAKER_00",
"duration": 5.2
},
{
"start": 5.2,
"end": 12.8,
"speaker": "SPEAKER_01",
"duration": 7.6
}
],
"overlapping_segments": [
{
"start": 45.2,
"end": 47.8,
"speakers": ["SPEAKER_00", "SPEAKER_01"]
}
]
}
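The `overlapping_segments` list can in principle be recomputed from `segments` wherever two speakers' turns intersect in time. The following is a minimal sketch of that idea, not the logic used by `diarize.py`; the function name `find_overlaps` is hypothetical and the input is assumed to follow the JSON layout above.

```python
from itertools import combinations

def find_overlaps(segments):
    """Return time ranges where two different speakers are active at once."""
    overlaps = []
    for a, b in combinations(segments, 2):
        if a["speaker"] == b["speaker"]:
            continue
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if start < end:  # keep only intersections with positive length
            overlaps.append({
                "start": start,
                "end": end,
                "speakers": sorted([a["speaker"], b["speaker"]]),
            })
    return sorted(overlaps, key=lambda o: o["start"])

segments = [
    {"start": 40.0, "end": 47.8, "speaker": "SPEAKER_00"},
    {"start": 45.2, "end": 52.0, "speaker": "SPEAKER_01"},
    {"start": 52.0, "end": 60.0, "speaker": "SPEAKER_00"},
]
overlaps = find_overlaps(segments)
```

The pairwise comparison is quadratic in the number of segments, which is fine for a single recording; a sweep over sorted boundaries would scale better for very long files.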
scripts/extract_speaker_segments.py
Extract video segments for specific speakers.
Usage:
python skills/speaker-diarization/scripts/extract_speaker_segments.py <video_path> <diarization_json> [options]
Options:
- --speaker: Speaker ID to extract (SPEAKER_00, SPEAKER_01, etc.) - default: all
- --min-segment-duration: Minimum segment duration (seconds) - default: 5.0
- --context: Add context seconds before/after - default: 2.0
- --output-dir: Output directory
Examples:
Extract all speakers separately:
python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json
Extract only SPEAKER_00:
python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json --speaker SPEAKER_00
Extract with 3-second context:
python skills/speaker-diarization/scripts/extract_speaker_segments.py interview.mp4 diarization.json --context 3.0
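The `--min-segment-duration` and `--context` options suggest a select, pad, and merge step before any cutting happens. A minimal sketch of that step follows; `plan_clips` is a hypothetical helper, and merging padded ranges that run into each other is an assumption rather than documented behavior of `extract_speaker_segments.py`.

```python
def plan_clips(segments, speaker, min_duration=5.0, context=2.0, total_duration=None):
    """Pick one speaker's segments, pad them, and merge ranges that overlap."""
    padded = []
    for seg in segments:
        if seg["speaker"] != speaker:
            continue
        if seg["end"] - seg["start"] < min_duration:
            continue  # too short to be a useful clip
        start = max(0.0, seg["start"] - context)
        end = seg["end"] + context
        if total_duration is not None:
            end = min(end, total_duration)
        padded.append((start, end))

    merged = []
    for start, end in sorted(padded):
        if merged and start <= merged[-1][1]:
            # Padded range runs into the previous one: extend it instead.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

segments = [
    {"start": 0.0, "end": 6.0, "speaker": "SPEAKER_00"},
    {"start": 6.0, "end": 9.0, "speaker": "SPEAKER_01"},
    {"start": 9.0, "end": 15.0, "speaker": "SPEAKER_00"},
    {"start": 30.0, "end": 32.0, "speaker": "SPEAKER_00"},
]
clips = plan_clips(segments, "SPEAKER_00", total_duration=40.0)
```

Here the two long SPEAKER_00 turns are padded into ranges that touch, so they collapse into one clip, and the two-second turn is dropped by the minimum-duration filter.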
scripts/analyze_speaker_dynamics.py
Analyze speaker interactions and dynamics.
Usage:
python skills/speaker-diarization/scripts/analyze_speaker_dynamics.py <diarization_json> [options]
Output:
{
"speaker_dynamics": {
"total_speakers": 3,
"dominant_speaker": "SPEAKER_00",
"speaker_balance": 0.72,
"interaction_moments": [
{
"type": "debate",
"start": 120.5,
"end": 145.2,
"speakers": ["SPEAKER_00", "SPEAKER_01"],
"intensity": 0.85
},
{
"type": "overlapping_speech",
"start": 200.0,
"end": 202.5,
"speakers": ["SPEAKER_01", "SPEAKER_02"]
}
]
}
}
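The output does not define `speaker_balance`. One plausible reading, shown here purely as an assumption, is the normalized entropy of each speaker's share of talk time: 1.0 when everyone speaks equally, approaching 0.0 when one speaker dominates. The helper name `summarize_talk_time` is hypothetical, and `analyze_speaker_dynamics.py` may compute something different.

```python
import math

def summarize_talk_time(speakers):
    """Return (dominant_speaker, balance) from per-speaker durations.

    `speakers` maps speaker IDs to dicts with a "duration" key, as in the
    diarization JSON. Balance is normalized Shannon entropy, an assumed
    definition that may not match the script's own metric.
    """
    durations = {sid: info["duration"] for sid, info in speakers.items()}
    total = sum(durations.values())
    if total <= 0:
        raise ValueError("no speech time to analyze")
    dominant = max(durations, key=durations.get)
    if len(durations) < 2:
        return dominant, 0.0
    shares = [d / total for d in durations.values() if d > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return dominant, entropy / math.log(len(durations))

speakers = {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42},
}
dominant, balance = summarize_talk_time(speakers)
```

With the sample durations from the `diarize.py` output this yields a value close to 1.0, so the 0.72 shown above comes either from a different recording or from a different formula.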
Setup
1. Install Dependencies
pip install pyannote.audio torch torchaudio speechbrain
2. Get HuggingFace Token
- Create account at huggingface.co
- Generate token at hf.co/settings/tokens
- Accept terms at pyannote/speaker-diarization-community-1
3. Set Environment Variable
export HUGGINGFACE_TOKEN="your-token-here"
Or use --huggingface-token flag.
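A script that accepts both the flag and the environment variable needs a precedence rule. The sketch below assumes the flag wins over `HUGGINGFACE_TOKEN` and that `auto` means CUDA whenever PyTorch reports a GPU; both are assumptions about `diarize.py`, and the helper names are hypothetical.

```python
import os

def resolve_token(cli_token=None, env=None):
    """Return the HuggingFace token, preferring the CLI flag over the env var."""
    env = os.environ if env is None else env
    token = cli_token or env.get("HUGGINGFACE_TOKEN")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found: pass --huggingface-token "
            "or set HUGGINGFACE_TOKEN"
        )
    return token

def resolve_device(requested="auto"):
    """Map 'auto' to 'cuda' when a GPU is visible to PyTorch, else 'cpu'."""
    if requested != "auto":
        return requested
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Failing early with a clear message is useful here because a missing or unaccepted token otherwise surfaces only when the model download is attempted.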
How AI Agents Decide
When to use pyannote vs Gemini diarization:
def select_diarization_method(video_info, user_instructions):
    """Return "pyannote" or "gemini" for a video and the user's request."""
    # Match keywords case-insensitively
    text = user_instructions.lower()
    # User asks for high accuracy
    if "accurate" in text or "precise" in text:
        return "pyannote"
    # More than two speakers detected
    if video_info.get('num_speakers', 1) > 2:
        return "pyannote"
    # Podcast/interview format
    if any(word in text for word in ['podcast', 'interview', 'panel', 'debate']):
        return "pyannote"
    # Overlapping speech expected
    if 'overlapping' in text or 'talk over' in text:
        return "pyannote"
    # Privacy requirement
    if 'private' in text or 'offline' in text:
        return "pyannote"
    # Single speaker or simple case - use Gemini (faster)
    return "gemini"
Agent decision criteria:
- Use pyannote: Multi-speaker, accuracy-critical, offline needed
- Use Gemini: Single speaker, speed-critical, simple scenario
Integration with Other Skills
Enhanced video-transcriber
# Transcribe with pyannote diarization
python skills/video-transcriber/scripts/transcribe.py video.mp4 \
--model whisper \
--diarization pyannote \
--output-format srt-with-speakers
Enhanced highlight-scanner
# Find highlights considering speaker dynamics
python skills/highlight-scanner/scripts/find_highlights.py video.mp4 \
--transcript-path video.srt \
--diarization-path video_diarization.json \
--speaker-dynamics
Enhanced autocut-shorts
# Autocut focusing on specific speaker
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 \
--use-speaker-diarization \
--focus-speaker SPEAKER_00 \
--num-clips 5
Output Formats
JSON (default)
Full metadata including speaker statistics and overlapping segments.
RTTM
Standard diarization format for research/annotation:
SPEAKER podcast 1 0.0 5.2 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER podcast 1 5.2 7.6 <NA> <NA> SPEAKER_01 <NA> <NA>
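Each RTTM line carries ten space-separated fields; the fourth and fifth are turn onset and turn duration in seconds (not start and end), which is why the second line above reads `5.2 7.6` for a turn that ends at 12.8. A minimal sketch of converting JSON segments to RTTM lines follows, using a hypothetical `to_rttm` helper and three-decimal formatting chosen for illustration.

```python
def to_rttm(segments, file_id, channel=1):
    """Format diarization segments as RTTM SPEAKER lines."""
    lines = []
    for seg in segments:
        onset = seg["start"]
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} {channel} {onset:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return lines

segments = [
    {"start": 0.0, "end": 5.2, "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 12.8, "speaker": "SPEAKER_01"},
]
rttm_lines = to_rttm(segments, "podcast")
```

Fixed-precision formatting also hides floating-point noise such as 12.8 - 5.2 evaluating to slightly more than 7.6.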
SRT with Speakers
1
00:00:00,000 --> 00:00:05,200
[SPEAKER_00]: Welcome to the show everyone
2
00:00:05,200 --> 00:00:12,800
[SPEAKER_01]: Thanks for having me on today
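SRT timestamps use `HH:MM:SS,mmm`, with a comma before the milliseconds. A minimal sketch of producing speaker-labeled cues follows; `srt_timestamp` and `srt_cue` are hypothetical helpers, and the `text` argument is assumed to come from a separate transcription step.

```python
def srt_timestamp(seconds):
    """Convert seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def srt_cue(index, start, end, speaker, text):
    """Build one numbered SRT cue with a [SPEAKER_XX] prefix."""
    return (
        f"{index}\n"
        f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
        f"[{speaker}]: {text}\n"
    )

cue = srt_cue(2, 5.2, 12.8, "SPEAKER_01", "Thanks for having me on today")
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids rounding a value like 59.9996 into an invalid `60` seconds field.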
Performance
Processing Speed:
- CPU: ~30 seconds per hour of audio (Intel i7)
- GPU: ~10 seconds per hour of audio (NVIDIA RTX 3060)
- First run: +60 seconds (model download)
Accuracy:
- 2-3 speakers: 95%+ accuracy
- 4-6 speakers: 85-90% accuracy
- 7+ speakers: 70-80% accuracy
Tips
- Specify speaker count if known - improves accuracy
- Use for podcasts/interviews - better than cloud APIs
- Combine with transcription - pairing diarization with Whisper yields speaker-labeled transcripts
- Check overlapping speech - identifies heated discussions
- Export to SRT - easy to import into video editors
References
- pyannote.audio: https://github.com/pyannote/pyannote-audio
- Model hub: https://huggingface.co/pyannote
- Paper: https://arxiv.org/abs/2310.11347