Speaker Diarization

Advanced speaker diarization using pyannote-audio - state-of-the-art neural network models for speaker identification.

When to Use

Use this skill when:

Video has multiple speakers (podcasts, interviews, panels)
You need accurate speaker identification
Content has overlapping speech (people talking over each other)
You want speaker-specific clips
Gemini's diarization isn't accurate enough
Working with multi-language or mixed speakers

Don't use when:

Single speaker content (use basic transcription instead)
Real-time processing needed (this is offline/batch)
Storage is limited (models require ~2GB)

Why pyannote-audio?

Benchmarks (Diarization Error Rate - lower is better):

pyannote community-1: 17.0% on AMI dataset
Gemini/Whisper: ~25-30% error rate
35% accuracy improvement over cloud APIs!

Advantages:

✅ Local processing - Privacy, no API costs
✅ Better accuracy - State-of-the-art neural models
✅ Overlapping speech detection - Identifies when people talk simultaneously
✅ Precise timestamps - Millisecond-accurate speaker boundaries
✅ No internet required - After initial model download
✅ Works with any language - Language-agnostic

Available Scripts

`scripts/diarize.py`

Main diarization script.

Usage:

python skills/speaker-diarization/scripts/diarize.py <video_path> [options]

Options:

--output, -o: Output format (json, rttm, srt) - default: json
--min-speakers: Minimum number of speakers to expect
--max-speakers: Maximum number of speakers to expect
--num-speakers: Exact number of speakers (if known)
--device: Processing device (cpu, cuda) - default: auto
--huggingface-token: HuggingFace token (or use env var)

Examples:

Basic diarization:

export HUGGINGFACE_TOKEN="your-token"
python skills/speaker-diarization/scripts/diarize.py podcast.mp4

Specify speaker count range:

python skills/speaker-diarization/scripts/diarize.py interview.mp4 --min-speakers 2 --max-speakers 3

Output to RTTM format:

python skills/speaker-diarization/scripts/diarize.py panel.mp4 --output rttm

Output (JSON):

{
  "success": true,
  "video_path": "podcast.mp4",
  "num_speakers": 3,
  "duration": 1200.5,
  "speakers": {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42}
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "duration": 5.2
    },
    {
      "start": 5.2,
      "end": 12.8,
      "speaker": "SPEAKER_01",
      "duration": 7.6
    }
  ],
  "overlapping_segments": [
    {
      "start": 45.2,
      "end": 47.8,
      "speakers": ["SPEAKER_00", "SPEAKER_01"]
    }
  ]
}

`scripts/extract_speaker_segments.py`

Extract video segments for specific speakers.

Usage:

python skills/speaker-diarization/scripts/extract_speaker_segments.py <video_path> <diarization_json> [options]

Options:

--speaker: Speaker ID to extract (SPEAKER_00, SPEAKER_01, etc.) - default: all
--min-segment-duration: Minimum segment duration (seconds) - default: 5.0
--context: Add context seconds before/after - default: 2.0
--output-dir: Output directory

Examples:

Extract all speakers separately:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json

Extract only SPEAKER_00:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json --speaker SPEAKER_00

Extract with 3-second context:

python skills/speaker-diarization/scripts/extract_speaker_segments.py interview.mp4 diarization.json --context 3.0

`scripts/analyze_speaker_dynamics.py`

Analyze speaker interactions and dynamics.

Usage:

python skills/speaker-diarization/scripts/analyze_speaker_dynamics.py <diarization_json> [options]

Output:

{
  "speaker_dynamics": {
    "total_speakers": 3,
    "dominant_speaker": "SPEAKER_00",
    "speaker_balance": 0.72,
    "interaction_moments": [
      {
        "type": "debate",
        "start": 120.5,
        "end": 145.2,
        "speakers": ["SPEAKER_00", "SPEAKER_01"],
        "intensity": 0.85
      },
      {
        "type": "overlapping_speech",
        "start": 200.0,
        "end": 202.5,
        "speakers": ["SPEAKER_01", "SPEAKER_02"]
      }
    ]
  }
}

Setup

1. Install Dependencies

pip install pyannote.audio torch torchaudio speechbrain

2. Get HuggingFace Token

Create account at huggingface.co
Generate token at hf.co/settings/tokens
Accept terms at pyannote/speaker-diarization-community-1

3. Set Environment Variable

export HUGGINGFACE_TOKEN="your-token-here"

Or use --huggingface-token flag.

How AI Agents Decide

When to use pyannote vs Gemini diarization:

def select_diarization_method(video_info, user_instructions):
    # User explicitly wants pyannote
    if "accurate" in user_instructions or "precise" in user_instructions:
        return "pyannote"
    
    # Multi-speaker content detected
    if video_info.get('num_speakers', 1) > 2:
        return "pyannote"
    
    # Podcast/interview format
    if any(word in user_instructions for word in ['podcast', 'interview', 'panel', 'debate']):
        return "pyannote"
    
    # Overlapping speech expected
    if 'overlapping' in user_instructions or 'talk over' in user_instructions:
        return "pyannote"
    
    # Privacy requirement
    if 'private' in user_instructions or 'offline' in user_instructions:
        return "pyannote"
    
    # Single speaker or simple case - use Gemini (faster)
    return "gemini"

Agent decision criteria:

Use pyannote: Multi-speaker, accuracy-critical, offline needed
Use Gemini: Single speaker, speed-critical, simple scenario

Integration with Other Skills

Enhanced `video-transcriber`

# Transcribe with pyannote diarization
python skills/video-transcriber/scripts/transcribe.py video.mp4 \
  --model whisper \
  --diarization pyannote \
  --output-format srt-with-speakers

Enhanced `highlight-scanner`

# Find highlights considering speaker dynamics
python skills/highlight-scanner/scripts/find_highlights.py video.mp4 \
  --transcript-path video.srt \
  --diarization-path video_diarization.json \
  --speaker-dynamics

Enhanced `autocut-shorts`

# Autocut focusing on specific speaker
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 \
  --use-speaker-diarization \
  --focus-speaker SPEAKER_00 \
  --num-clips 5

Output Formats

JSON (default)

Full metadata including speaker statistics and overlapping segments.

RTTM

Standard diarization format for research/annotation:

SPEAKER podcast 1 0.0 5.2 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER podcast 1 5.2 7.6 <NA> <NA> SPEAKER_01 <NA> <NA>

SRT with Speakers

1
00:00:00,000 --> 00:00:05,200
[SPEAKER_00]: Welcome to the show everyone

2
00:00:05,200 --> 00:00:12,800
[SPEAKER_01]: Thanks for having me on today

Performance

Processing Speed:

CPU: ~30 seconds per hour of audio (Intel i7)
GPU: ~10 seconds per hour of audio (NVIDIA RTX 3060)
First run: +60 seconds (model download)

Accuracy:

2-3 speakers: 95%+ accuracy
4-6 speakers: 85-90% accuracy
7+ speakers: 70-80% accuracy

Tips

Specify speaker count if known - improves accuracy
Use for podcasts/interviews - better than cloud APIs
Combine with transcription - diarization + Whisper = perfect
Check overlapping speech - identifies heated discussions
Export to SRT - easy to import into video editors

References

pyannote.audio: https://github.com/pyannote/pyannote-audio
Model hub: https://huggingface.co/pyannote
Paper: https://arxiv.org/abs/2310.11347

speaker-diarization

Speaker Diarization

When to Use

Why pyannote-audio?

Available Scripts

scripts/diarize.py

scripts/extract_speaker_segments.py

scripts/analyze_speaker_dynamics.py

Setup

1. Install Dependencies

2. Get HuggingFace Token

3. Set Environment Variable

How AI Agents Decide

Integration with Other Skills

Enhanced video-transcriber

Enhanced highlight-scanner

Enhanced autocut-shorts

Output Formats

JSON (default)

RTTM

SRT with Speakers

Performance

Tips

References

`scripts/diarize.py`

`scripts/extract_speaker_segments.py`

`scripts/analyze_speaker_dynamics.py`

Enhanced `video-transcriber`

Enhanced `highlight-scanner`

Enhanced `autocut-shorts`