SDK-first: always use the official SDK — see sdk-integration for policy, setup, and fallback criteria.
References
Consult these sibling skills as needed:
- ../sdk-integration/SKILL.md -- SDK setup, client initialization, error handling, and SDK vs raw API decision guide
- ../sdk-integration/references/sdk-versions.md -- Current SDK versions (auto-synced by CI)
- ../troubleshooting/SKILL.md -- Common errors, gotchas, and verification checklist
- ../live-transcription/SKILL.md -- Live streaming transcription
- ../pre-recorded-transcription/SKILL.md -- Pre-recorded file transcription
name: Gladia
description: Use when building speech-to-text transcription features, processing audio or video files, implementing real-time transcription, extracting insights from audio (translation, summarization, speaker identification), or integrating audio intelligence into applications.
metadata:
  mintlify-proj: gladia
  version: "1.0"
Gladia Skill
Product summary
Gladia is a speech-to-text API that transcribes audio and video files in two modes: pre-recorded (asynchronous, batch) and live (real-time, WebSocket-based). The API returns structured transcripts with word-level timing, confidence scores, and optional audio intelligence features (translation, diarization, summarization, entity recognition, sentiment analysis, PII redaction, subtitles). Use the JavaScript/TypeScript SDK (@gladiaio/sdk) or Python SDK (gladiaio-sdk) for simplified integration, or call REST/WebSocket endpoints directly. Authenticate with x-gladia-key header. Primary docs: https://docs.gladia.io
When to use
- Pre-recorded transcription: Transcribe uploaded audio/video files (MP3, WAV, MP4, YouTube links, etc.) asynchronously. Typical latency: seconds to minutes depending on file length.
- Live transcription: Stream audio in real-time via WebSocket for immediate transcripts (e.g., call centers, live events, voice assistants).
- Audio intelligence: Extract metadata from transcripts — translate to multiple languages, identify speakers, detect sentiment, redact PII, generate summaries, create subtitles, recognize named entities.
- Custom vocabulary: Improve accuracy for domain-specific terms, brand names, proper nouns by providing phonetic hints.
- Multi-speaker scenarios: Use diarization to attribute speech to individual speakers, or send multi-channel audio to preserve speaker identity.
Quick reference
Authentication
# All requests require x-gladia-key header
curl -H "x-gladia-key: YOUR_API_KEY" https://api.gladia.io/v2/pre-recorded
Pre-recorded workflow (SDK)
import { GladiaClient } from "@gladiaio/sdk";
const client = new GladiaClient({ apiKey: "YOUR_KEY" });
const result = await client.preRecorded().transcribe("audio_url_or_local_path");
Live workflow (SDK)
const session = client.liveV2().startSession({
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 1,
  language_config: { languages: ["en"] }
});
session.on("message", (msg) => console.log(msg));
session.sendAudio(audioChunk);
session.stopRecording();
Audio formats
| Type | Examples |
|------|----------|
| Audio | MP3, WAV, FLAC, AAC, OGG, Opus |
| Video | MP4, MOV, AVI, WebM, Matroska |
| Online | YouTube, TikTok, Instagram, Facebook, Vimeo, LinkedIn |
Limits
| Limit | Value |
|-------|-------|
| Pre-recorded max duration | 135 minutes (free/paid); 4h15 (enterprise) |
| Pre-recorded max file size | 1000 MB |
| Live session max duration | 3 hours |
| Free tier monthly usage | 10 hours |
| Concurrent pre-recorded jobs (free) | 3 |
| Concurrent pre-recorded jobs (paid) | 25 |
| Concurrent live sessions (free) | 1 |
| Concurrent live sessions (paid) | 30 |
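The pre-recorded limits above can be checked locally before uploading. The following is a minimal sketch, not part of the Gladia SDK: `checkPreRecordedLimits`, `AudioFileInfo`, and the plan labels are hypothetical names, and the numbers are copied from the table (4h15 is written as 255 minutes).

```typescript
// Hypothetical helper: the limits come from the table above, the names do not
// come from the Gladia SDK.
type Plan = "free" | "paid" | "enterprise";

interface AudioFileInfo {
  sizeMb: number; // file size in megabytes
  durationMin: number; // audio length in minutes
}

const MAX_FILE_SIZE_MB = 1000;
const MAX_DURATION_MIN: Record<Plan, number> = {
  free: 135,
  paid: 135,
  enterprise: 255, // 4h15 expressed in minutes
};

// Returns one message per exceeded limit; an empty array means the file fits.
function checkPreRecordedLimits(file: AudioFileInfo, plan: Plan): string[] {
  const problems: string[] = [];
  if (file.sizeMb > MAX_FILE_SIZE_MB) {
    problems.push(`file is ${file.sizeMb} MB, limit is ${MAX_FILE_SIZE_MB} MB`);
  }
  const maxMin = MAX_DURATION_MIN[plan];
  if (file.durationMin > maxMin) {
    problems.push(`audio is ${file.durationMin} min, limit is ${maxMin} min`);
  }
  return problems;
}
```

How the size and duration are measured (for example with ffprobe) is left to the caller.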
Audio intelligence features
| Feature | Pre-recorded | Live | Purpose |
|---------|--------------|------|---------|
| Diarization | ✓ | ✗ | Identify speakers |
| Translation | ✓ | ✓ | Multi-language output |
| Summarization | ✓ | ✗ | Generate summaries/bullet points |
| Sentiment analysis | ✓ | ✓ | Detect emotions and tone |
| Named entity recognition | ✓ | ✓ | Extract people, orgs, dates |
| PII redaction | ✓ | ✗ | Anonymize sensitive data |
| Subtitles | ✓ | ✗ | Generate SRT/VTT files |
| Custom vocabulary | ✓ | ✓ | Improve domain-specific terms |
| Custom spelling | ✓ | ✓ | Normalize misspellings |
| Chapterization | ✓ | ✗ | Segment long audio into chapters |
| Audio-to-LLM | ✓ | ✗ | Run custom prompts on transcript |
Decision guidance
When to use pre-recorded vs. live
| Scenario | Pre-recorded | Live |
|----------|--------------|------|
| Batch processing uploaded files | ✓ | ✗ |
| Real-time streaming (calls, events) | ✗ | ✓ |
| Need diarization | ✓ | ✗ |
| Need immediate partial results | ✗ | ✓ (with receive_partial_transcripts: true) |
| Need summarization | ✓ | ✗ |
| Multi-hour content | ✓ (up to 135 min) | ✓ (up to 3 hours per session) |
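The decision table above can be encoded as a small function. This is an illustrative sketch only: `chooseMode`, `Requirements`, and `ModeDecision` are hypothetical names, and the logic reflects just the diarization and summarization rows of the table.

```typescript
// Hypothetical encoding of the table above; not part of the Gladia SDK.
type TranscriptionMode = "pre-recorded" | "live";

interface Requirements {
  realTime: boolean; // transcripts are needed while audio is still arriving
  diarization: boolean;
  summarization: boolean;
}

interface ModeDecision {
  mode: TranscriptionMode;
  unsupported: string[]; // requested features the chosen mode lacks per the table
}

function chooseMode(req: Requirements): ModeDecision {
  if (!req.realTime) {
    // Per the table, pre-recorded covers both diarization and summarization.
    return { mode: "pre-recorded", unsupported: [] };
  }
  const unsupported: string[] = [];
  if (req.diarization) unsupported.push("diarization");
  if (req.summarization) unsupported.push("summarization");
  return { mode: "live", unsupported };
}
```

A non-empty `unsupported` list signals that the live requirement conflicts with a feature the table marks as pre-recorded only.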
When to use SDK vs. raw API
| Approach | Best for |
|----------|----------|
| SDK | Rapid development, automatic error handling, built-in polling/retry logic |
| Raw API | Custom workflows, specific language/framework, fine-grained control |
When to use diarization vs. multi-channel audio
| Approach | Use when |
|----------|----------|
| Diarization | Single audio file with multiple speakers; you want the API to separate them |
| Multi-channel | Multiple audio sources (e.g., separate participant feeds); you can merge them into one multi-channel stream |
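Merging separate participant feeds into one multi-channel stream usually means interleaving their samples. The sketch below shows generic 16-bit PCM interleaving; it is not a Gladia API, `interleavePcm16` is a hypothetical helper name, and it assumes all tracks share the same sample rate.

```typescript
// Generic PCM technique, not a Gladia SDK call.
// Interleaves mono 16-bit tracks into one buffer laid out frame by frame:
// [ch0[0], ch1[0], ch0[1], ch1[1], ...]. Shorter tracks are padded with
// silence (0) so every frame carries one sample per channel.
function interleavePcm16(tracks: Int16Array[]): Int16Array {
  if (tracks.length === 0) return new Int16Array(0);
  const frames = Math.max(...tracks.map((t) => t.length));
  const out = new Int16Array(frames * tracks.length); // zero-filled by default
  tracks.forEach((track, channel) => {
    for (let i = 0; i < track.length; i++) {
      out[i * tracks.length + channel] = track[i];
    }
  });
  return out;
}
```

The resulting buffer would be sent with `channels` set to the number of tracks in the session configuration.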
When to use custom vocabulary vs. custom spelling
| Feature | Use when |
|---------|----------|
| Custom vocabulary | Word is mispronounced/garbled; you provide phonetic hints (e.g., "Nietzsche" → ["Niche", "Neechee"]) |
| Custom spelling | Word is recognized but misspelled (e.g., "Salesforce" → "Sales Force"); literal text matching |
Workflow
Pre-recorded transcription (typical task)
- Prepare audio: Ensure file is under 1000 MB and 135 minutes. Supported formats: MP3, WAV, MP4, YouTube URL, etc.
- Choose delivery method: Use SDK for simplicity, or raw API for control.
- Configure transcription:
  - Set language_config.languages explicitly if known (avoids detection overhead).
  - Enable diarization: true if multiple speakers.
  - Add custom_vocabulary for domain terms.
  - Enable audio intelligence features (translation, summarization, etc.) as needed.
- Submit job: Call transcribe() (SDK) or POST /v2/pre-recorded (API).
- Retrieve results: Poll GET /v2/pre-recorded/:id or configure webhooks/callbacks.
- Parse response: Extract transcription.utterances[] for text and timing, plus any audio intelligence results.
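The configure and parse steps above can be sketched as two small functions for the raw API path. This is a sketch under assumptions: the field names (`audio_url`, `language_config.languages`, `diarization`, `transcription.utterances`) are the ones named in this document, the `text` field on an utterance is assumed, and `buildPreRecordedBody` and `formatUtterances` are hypothetical helpers. Check the API reference for the full schema.

```typescript
// Hypothetical helpers; only the JSON field names come from this document.
interface PreRecordedOptions {
  audioUrl: string; // URL returned by the upload step, or a public media URL
  languages?: string[];
  diarization?: boolean;
}

// Builds the JSON body for POST /v2/pre-recorded, omitting unset options.
function buildPreRecordedBody(opts: PreRecordedOptions): Record<string, unknown> {
  const body: Record<string, unknown> = { audio_url: opts.audioUrl };
  if (opts.languages !== undefined && opts.languages.length > 0) {
    body["language_config"] = { languages: opts.languages };
  }
  if (opts.diarization === true) {
    body["diarization"] = true;
  }
  return body;
}

// Assumed utterance shape: text plus start/end times in seconds.
interface Utterance {
  text: string;
  start: number;
  end: number;
}

interface PreRecordedResult {
  transcription: { utterances: Utterance[] };
}

// Renders each utterance as "[start-end] text" for quick inspection.
function formatUtterances(result: PreRecordedResult): string[] {
  return result.transcription.utterances.map(
    (u) => `[${u.start.toFixed(2)}-${u.end.toFixed(2)}] ${u.text}`,
  );
}
```

When the SDK is used, `transcribe()` takes over submission and polling, so only the parsing half applies.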
Live transcription (typical task)
- Initialize session: Call POST /v2/live with audio config (encoding, sample_rate, bit_depth, channels).
- Connect WebSocket: Use returned URL to open WebSocket connection.
- Configure messages: Set messages_config to specify which message types to receive (transcripts, partial transcripts, post-processing events).
- Stream audio: Send audio chunks via sendAudio() (SDK) or binary/base64 JSON (raw API).
- Handle messages: Listen for transcript messages; check is_final to distinguish partials from finals.
- Stop recording: Call stopRecording() to trigger post-processing (diarization, translation, etc.).
- Retrieve final result: Poll GET /v2/live/:id or wait for callback with complete result.
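The message-handling step above can be illustrated with a filter that keeps only final transcripts. The message shape used here (`type`, `data.is_final`, `data.utterance.text`) is an assumption for illustration and should be verified against the live API reference; `collectFinalTranscripts` is a hypothetical helper, not an SDK function.

```typescript
// Assumed message shape; verify against the live API reference before relying on it.
interface LiveMessage {
  type: string;
  data?: {
    is_final?: boolean;
    utterance?: { text: string };
  };
}

// Keeps the text of transcript messages marked final and drops partials
// and every other message type.
function collectFinalTranscripts(messages: LiveMessage[]): string[] {
  const finals: string[] = [];
  for (const msg of messages) {
    if (msg.type !== "transcript") continue;
    if (msg.data?.is_final !== true) continue;
    const text = msg.data?.utterance?.text;
    if (text !== undefined) finals.push(text);
  }
  return finals;
}
```

In a streaming handler the same two checks would run per message inside the `session.on("message", ...)` callback.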
Adding custom vocabulary
- Identify problem terms: Transcribe without custom vocabulary; note mis-transcribed words.
- Categorize: Garbled/phonetically wrong → custom vocabulary; recognizable but misspelled → custom spelling.
- Build vocabulary list:
    {
      "custom_vocabulary": true,
      "custom_vocabulary_config": {
        "vocabulary": [
          "Gladia",
          { "value": "Salesforce", "pronunciations": ["sell force"], "intensity": 0.5 }
        ],
        "default_intensity": 0.4
      }
    }
- Test: Transcribe again; confirm targets appear and check for false positives.
- Refine: Lower intensity, add pronunciations, or move stubborn terms to custom spelling.
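The vocabulary payload from the steps above can be built and sanity-checked in code. This is a minimal sketch: the JSON field names follow the example in this section, while `buildCustomVocabulary`, `highIntensityTerms`, and the 0.6 ceiling (taken from the 0.4–0.6 range in the verification checklist) are illustrative choices rather than SDK features.

```typescript
// Hypothetical helpers; the JSON field names follow the example in this section.
interface VocabularyTerm {
  value: string;
  pronunciations?: string[];
  intensity?: number;
}

type VocabularyEntry = string | VocabularyTerm;

interface CustomVocabularyPayload {
  custom_vocabulary: true;
  custom_vocabulary_config: {
    vocabulary: VocabularyEntry[];
    default_intensity: number;
  };
}

// Wraps a list of entries in the request fields, defaulting intensity to 0.4.
function buildCustomVocabulary(
  entries: VocabularyEntry[],
  defaultIntensity = 0.4,
): CustomVocabularyPayload {
  return {
    custom_vocabulary: true,
    custom_vocabulary_config: {
      vocabulary: entries,
      default_intensity: defaultIntensity,
    },
  };
}

// Lists terms whose per-entry intensity exceeds the ceiling, as candidates
// for extra pronunciations or custom spelling instead.
function highIntensityTerms(entries: VocabularyEntry[], ceiling = 0.6): string[] {
  const flagged: string[] = [];
  for (const entry of entries) {
    if (typeof entry !== "string" && entry.intensity !== undefined && entry.intensity > ceiling) {
      flagged.push(entry.value);
    }
  }
  return flagged;
}
```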
Common gotchas
- Language detection overhead: Always set language_config.languages explicitly if you know the language. Auto-detection adds latency and can fail if audio starts with silence or music.
- Code switching without language list: Never enable code_switching: true with an empty languages array; the model will evaluate against 100+ languages, causing frequent misdetections. Always provide a constrained list (3–5 languages).
- Diarization hints are not hard constraints: number_of_speakers, min_speakers, max_speakers are hints, not guarantees. The model may detect a different count.
- Custom vocabulary intensity tuning: Start at default_intensity: 0.4 and adjust per-entry only. Raising intensity globally increases false positives. Add pronunciations variants before raising intensity.
- Live session 3-hour limit: A single WebSocket session cannot exceed 3 hours. For longer events, close the session and start a new one before hitting the limit.
- Pre-recorded 135-minute limit: Files longer than 135 minutes will fail. Split into ~60-minute chunks using ffmpeg or similar tools.
- Audio format conversion overhead: Large video files (e.g., AVI, MOV) take ~1 minute to convert to WAV/PCM. Plan for this latency.
- Polling without webhooks: If you poll GET /v2/pre-recorded/:id in a tight loop, you'll hit rate limits. Use webhooks or callbacks instead, or poll with exponential backoff.
- Multi-channel billing: Transcribing multi-channel audio is billed as duration × number_of_channels. A 10-minute 3-channel stream costs 30 minutes of usage.
- Partial transcripts in live mode: Partial transcripts are low-latency but less accurate. Always check is_final: true before using a transcript for critical decisions.
- Missing audio_url on upload: After uploading a file, the response includes audio_url; use this URL in the transcription request, not the local file path.
- WebSocket reconnection: If the WebSocket disconnects, reconnect to the same URL (returned from init) to resume the session without losing context.
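Three of the gotchas above reduce to small calculations: the polling schedule, multi-channel billing, and splitting long files. The helpers below are a hedged sketch; the function names, the 1-second base delay, and the 30-second cap are illustrative assumptions, not values documented by Gladia.

```typescript
// Hypothetical helpers; base delay and cap are illustrative defaults.

// Exponential backoff: the delay doubles on each attempt up to a cap.
function backoffDelaysMs(attempts: number, baseMs = 1000, capMs = 30000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(capMs, baseMs * Math.pow(2, i)));
  }
  return delays;
}

// Billing rule from the list above: duration multiplied by channel count.
function billedMinutes(durationMin: number, channels: number): number {
  return durationMin * channels;
}

// Splits a recording into [start, end) minute ranges of at most chunkMin each.
function planChunks(totalMin: number, chunkMin = 60): Array<[number, number]> {
  const chunks: Array<[number, number]> = [];
  for (let start = 0; start < totalMin; start += chunkMin) {
    chunks.push([start, Math.min(start + chunkMin, totalMin)]);
  }
  return chunks;
}
```

The ranges from `planChunks` can be passed to a cutting tool such as ffmpeg as start and end offsets, and the delays from `backoffDelaysMs` used as waits between successive status requests.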
Verification checklist
Before submitting transcription work:
- [ ] API key is valid and passed in x-gladia-key header.
- [ ] Audio file is under 1000 MB and 135 minutes (pre-recorded) or 3 hours (live).
- [ ] Audio format is supported (MP3, WAV, MP4, etc.).
- [ ] Language is set explicitly in language_config.languages if known.
- [ ] If using code switching, languages list is constrained to 3–5 expected languages.
- [ ] Diarization is enabled if multiple speakers need attribution.
- [ ] Custom vocabulary entries have realistic intensity (0.4–0.6) and pronunciations.
- [ ] Webhooks or callbacks are configured if polling is not feasible.
- [ ] Live sessions are closed before 3 hours; pre-recorded jobs are split if over 135 minutes.
- [ ] Response includes expected fields: transcription.utterances[], metadata, and any requested audio intelligence results.
- [ ] Confidence scores and timing (start, end) are present for quality validation.
- [ ] Multi-channel audio is correctly interleaved if merging multiple sources.
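The response-related items in the checklist can be automated with a small validator. This sketch assumes that `confidence`, `start`, and `end` sit on each utterance and that `metadata` is a top-level field; both placements are assumptions drawn from the checklist wording, and `findResponseProblems` is a hypothetical helper.

```typescript
// Assumed response shape; field placement should be checked against the API reference.
interface CheckedUtterance {
  start?: number;
  end?: number;
  confidence?: number;
}

interface CheckedResult {
  transcription?: { utterances?: CheckedUtterance[] };
  metadata?: unknown;
}

// Returns one message per missing or inconsistent field; empty means all checks passed.
function findResponseProblems(result: CheckedResult): string[] {
  const problems: string[] = [];
  if (result.metadata === undefined) problems.push("metadata is missing");
  const utterances = result.transcription?.utterances;
  if (utterances === undefined) {
    problems.push("transcription.utterances is missing");
    return problems;
  }
  utterances.forEach((u, i) => {
    if (u.start === undefined || u.end === undefined) {
      problems.push(`utterance ${i} has no timing`);
    } else if (u.end < u.start) {
      problems.push(`utterance ${i} ends before it starts`);
    }
    if (u.confidence === undefined) problems.push(`utterance ${i} has no confidence`);
  });
  return problems;
}
```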
Resources
- Comprehensive page listing: https://docs.gladia.io/llms.txt
- Getting started guide: https://docs.gladia.io/chapters/introduction/getting-started
- Pre-recorded quickstart: https://docs.gladia.io/chapters/pre-recorded-stt/quickstart
- Live transcription quickstart: https://docs.gladia.io/chapters/live-stt/quickstart
- API reference: https://docs.gladia.io/api-reference
- Recommended parameters by use case: https://docs.gladia.io/chapters/pre-recorded-stt/recommended-parameters
- Audio intelligence features: https://docs.gladia.io/chapters/audio-intelligence
- Supported formats and limits: https://docs.gladia.io/chapters/limits-and-specifications/supported-formats
For additional documentation and navigation, see: https://docs.gladia.io/llms.txt
This file is auto-synced from https://docs.gladia.io/.well-known/agent-skills/gladia/skill.md. Do not edit manually; changes will be overwritten by CI.