Back to skills
extension
Category: Content & MediaNo API key required

speaker-diarization

Advanced speaker diarization using pyannote-audio. Identify who speaks when, detect multiple speakers, handle overlapping speech, and create speaker-specific segments. Use when you need accurate speaker identification, multi-speaker content analysis, or speaker-specific clip extraction. More accurate than Gemini's built-in diarization for complex scenarios.

personAuthor: jakexiaohubgithub

Speaker Diarization

Advanced speaker diarization using pyannote-audio - state-of-the-art neural network models for speaker identification.

When to Use

Use this skill when:

  • Video has multiple speakers (podcasts, interviews, panels)
  • You need accurate speaker identification
  • Content has overlapping speech (people talking over each other)
  • You want speaker-specific clips
  • Gemini's diarization isn't accurate enough
  • Working with multi-language or mixed speakers

Don't use when:

  • Single speaker content (use basic transcription instead)
  • Real-time processing needed (this is offline/batch)
  • Storage is limited (models require ~2GB)

Why pyannote-audio?

Benchmarks (Diarization Error Rate - lower is better):

  • pyannote community-1: 17.0% on AMI dataset
  • Gemini/Whisper: ~25-30% error rate
  • 35% accuracy improvement over cloud APIs!

Advantages:

  • Local processing - Privacy, no API costs
  • Better accuracy - State-of-the-art neural models
  • Overlapping speech detection - Identifies when people talk simultaneously
  • Precise timestamps - Millisecond-accurate speaker boundaries
  • No internet required - After initial model download
  • Works with any language - Language-agnostic

Available Scripts

scripts/diarize.py

Main diarization script.

Usage:

python skills/speaker-diarization/scripts/diarize.py <video_path> [options]

Options:

  • --output, -o: Output format (json, rttm, srt) - default: json
  • --min-speakers: Minimum number of speakers to expect
  • --max-speakers: Maximum number of speakers to expect
  • --num-speakers: Exact number of speakers (if known)
  • --device: Processing device (cpu, cuda) - default: auto
  • --huggingface-token: HuggingFace token (or use env var)

Examples:

Basic diarization:

export HUGGINGFACE_TOKEN="your-token"
python skills/speaker-diarization/scripts/diarize.py podcast.mp4

Specify speaker count range:

python skills/speaker-diarization/scripts/diarize.py interview.mp4 --min-speakers 2 --max-speakers 3

Output to RTTM format:

python skills/speaker-diarization/scripts/diarize.py panel.mp4 --output rttm

Output (JSON):

{
  "success": true,
  "video_path": "podcast.mp4",
  "num_speakers": 3,
  "duration": 1200.5,
  "speakers": {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42}
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "duration": 5.2
    },
    {
      "start": 5.2,
      "end": 12.8,
      "speaker": "SPEAKER_01",
      "duration": 7.6
    }
  ],
  "overlapping_segments": [
    {
      "start": 45.2,
      "end": 47.8,
      "speakers": ["SPEAKER_00", "SPEAKER_01"]
    }
  ]
}

scripts/extract_speaker_segments.py

Extract video segments for specific speakers.

Usage:

python skills/speaker-diarization/scripts/extract_speaker_segments.py <video_path> <diarization_json> [options]

Options:

  • --speaker: Speaker ID to extract (SPEAKER_00, SPEAKER_01, etc.) - default: all
  • --min-segment-duration: Minimum segment duration (seconds) - default: 5.0
  • --context: Add context seconds before/after - default: 2.0
  • --output-dir: Output directory

Examples:

Extract all speakers separately:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json

Extract only SPEAKER_00:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json --speaker SPEAKER_00

Extract with 3-second context:

python skills/speaker-diarization/scripts/extract_speaker_segments.py interview.mp4 diarization.json --context 3.0

scripts/analyze_speaker_dynamics.py

Analyze speaker interactions and dynamics.

Usage:

python skills/speaker-diarization/scripts/analyze_speaker_dynamics.py <diarization_json> [options]

Output:

{
  "speaker_dynamics": {
    "total_speakers": 3,
    "dominant_speaker": "SPEAKER_00",
    "speaker_balance": 0.72,
    "interaction_moments": [
      {
        "type": "debate",
        "start": 120.5,
        "end": 145.2,
        "speakers": ["SPEAKER_00", "SPEAKER_01"],
        "intensity": 0.85
      },
      {
        "type": "overlapping_speech",
        "start": 200.0,
        "end": 202.5,
        "speakers": ["SPEAKER_01", "SPEAKER_02"]
      }
    ]
  }
}

Setup

1. Install Dependencies

pip install pyannote.audio torch torchaudio speechbrain

2. Get HuggingFace Token

  1. Create account at huggingface.co
  2. Generate token at hf.co/settings/tokens
  3. Accept terms at pyannote/speaker-diarization-community-1

3. Set Environment Variable

export HUGGINGFACE_TOKEN="your-token-here"

Or use --huggingface-token flag.

How AI Agents Decide

When to use pyannote vs Gemini diarization:

def select_diarization_method(video_info, user_instructions):
    # User explicitly wants pyannote
    if "accurate" in user_instructions or "precise" in user_instructions:
        return "pyannote"
    
    # Multi-speaker content detected
    if video_info.get('num_speakers', 1) > 2:
        return "pyannote"
    
    # Podcast/interview format
    if any(word in user_instructions for word in ['podcast', 'interview', 'panel', 'debate']):
        return "pyannote"
    
    # Overlapping speech expected
    if 'overlapping' in user_instructions or 'talk over' in user_instructions:
        return "pyannote"
    
    # Privacy requirement
    if 'private' in user_instructions or 'offline' in user_instructions:
        return "pyannote"
    
    # Single speaker or simple case - use Gemini (faster)
    return "gemini"

Agent decision criteria:

  • Use pyannote: Multi-speaker, accuracy-critical, offline needed
  • Use Gemini: Single speaker, speed-critical, simple scenario

Integration with Other Skills

Enhanced video-transcriber

# Transcribe with pyannote diarization
python skills/video-transcriber/scripts/transcribe.py video.mp4 \
  --model whisper \
  --diarization pyannote \
  --output-format srt-with-speakers

Enhanced highlight-scanner

# Find highlights considering speaker dynamics
python skills/highlight-scanner/scripts/find_highlights.py video.mp4 \
  --transcript-path video.srt \
  --diarization-path video_diarization.json \
  --speaker-dynamics

Enhanced autocut-shorts

# Autocut focusing on specific speaker
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 \
  --use-speaker-diarization \
  --focus-speaker SPEAKER_00 \
  --num-clips 5

Output Formats

JSON (default)

Full metadata including speaker statistics and overlapping segments.

RTTM

Standard diarization format for research/annotation:

SPEAKER podcast 1 0.0 5.2 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER podcast 1 5.2 7.6 <NA> <NA> SPEAKER_01 <NA> <NA>

SRT with Speakers

1
00:00:00,000 --> 00:00:05,200
[SPEAKER_00]: Welcome to the show everyone

2
00:00:05,200 --> 00:00:12,800
[SPEAKER_01]: Thanks for having me on today

Performance

Processing Speed:

  • CPU: ~30 seconds per hour of audio (Intel i7)
  • GPU: ~10 seconds per hour of audio (NVIDIA RTX 3060)
  • First run: +60 seconds (model download)

Accuracy:

  • 2-3 speakers: 95%+ accuracy
  • 4-6 speakers: 85-90% accuracy
  • 7+ speakers: 70-80% accuracy

Tips

  1. Specify speaker count if known - improves accuracy
  2. Use for podcasts/interviews - better than cloud APIs
  3. Combine with transcription - diarization + Whisper = perfect
  4. Check overlapping speech - identifies heated discussions
  5. Export to SRT - easy to import into video editors

References

  • pyannote.audio: https://github.com/pyannote/pyannote-audio
  • Model hub: https://huggingface.co/pyannote
  • Paper: https://arxiv.org/abs/2310.11347