Speaker Diarization: Implementation Guide
Identifying and separating multiple speakers in audio
Diarization answers "who spoke when" — segmenting audio by speaker so a transcript becomes "Speaker A: … / Speaker B: …". It's essential for meeting notes, call analytics, and any multi-party audio. The go-to open toolkit is pyannote.audio.
What diarization does (and doesn't)
Diarization labels speaker turns (Speaker 1, 2, …) but does not identify who they are by name, and does not transcribe. The full "who said what" pipeline is: diarization + ASR, then align the two by timestamp.
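```python
# pip install pyannote.audio  (gated model: accept terms on Hugging Face)
import os

from pyannote.audio import Pipeline

HF_TOKEN = os.environ["HF_TOKEN"]  # assumes the access token is exported as HF_TOKEN
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN)
diary = pipe("meeting.wav")

# Each track is one speaker turn: a time segment plus a speaker label.
for turn, _, speaker in diary.itertracks(yield_label=True):
    print(f"{turn.start:.1f}-{turn.end:.1f} {speaker}")
```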
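For call analytics it often helps to tidy the raw turns before reporting on them. The following is a minimal sketch, assuming the turns from the loop above have been collected into plain `(start, end, speaker)` tuples. `merge_turns`, `talk_time`, and the `max_gap` threshold are illustrative names chosen here, not part of pyannote.audio.

```python
from collections import defaultdict


def merge_turns(turns, max_gap=0.5):
    """Join consecutive turns by the same speaker when the pause is <= max_gap seconds."""
    merged = []
    for start, end, speaker in sorted(turns):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


def talk_time(turns):
    """Sum the duration of every turn, per speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical turns, in seconds, standing in for real diarization output.
turns = [
    (0.0, 2.0, "SPEAKER_00"),
    (2.2, 4.0, "SPEAKER_00"),
    (4.5, 6.0, "SPEAKER_01"),
    (7.0, 8.0, "SPEAKER_00"),
]
merged = merge_turns(turns)
totals = talk_time(turns)
```

Here the first two turns are joined because the pause between them is shorter than `max_gap`. The right threshold depends on the recording, so treat 0.5 seconds as a starting point rather than a recommendation.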
Combining with transcription
Run Whisper for the words and pyannote for the speakers, then merge on overlapping timestamps to attribute each transcript segment to a speaker. See Multilingual ASR and OpenAI Whisper API. Hosted alternatives bundle diarization: see Whisper vs Deepgram and, for the meeting use case, 会议智能转录 (Intelligent Meeting Transcription).
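The merge step can be sketched as follows. This is one simple approach, not the only one: each transcript segment goes to the speaker whose turns overlap it the longest. The sketch assumes transcript segments are dicts with `start`, `end`, and `text` keys (the shape of the open-source Whisper package's segments) and that diarization turns are `(start, end, speaker)` tuples. `assign_speakers` and the `"UNKNOWN"` fallback label are hypothetical names used for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each transcript segment with the speaker who overlaps it the longest."""
    labeled = []
    for seg in segments:
        shared_by_speaker = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > 0:
                shared_by_speaker[speaker] = shared_by_speaker.get(speaker, 0.0) + shared
        if shared_by_speaker:
            best = max(shared_by_speaker, key=shared_by_speaker.get)
        else:
            best = unknown  # no diarization turn covers this segment
        labeled.append({**seg, "speaker": best})
    return labeled


# Hypothetical inputs standing in for real ASR and diarization output.
segments = [
    {"start": 0.0, "end": 3.0, "text": "Let's get started."},
    {"start": 3.0, "end": 6.0, "text": "Sounds good to me."},
    {"start": 9.0, "end": 10.0, "text": "(inaudible)"},
]
turns = [(0.0, 3.2, "SPEAKER_00"), (3.2, 6.5, "SPEAKER_01")]

labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(f'{seg["speaker"]}: {seg["text"]}')
```

Segment-level assignment is coarse: a segment that spans a speaker change is attributed entirely to whoever talks longest in it. If the ASR system provides word-level timestamps, the same overlap logic can be applied per word instead.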
Accuracy factors
Results degrade most with overlapping speech and with noisy or far-field audio. Clean, well-separated audio helps: record per-speaker channels when you can, and avoid over-denoising.
FAQ
Does diarization name speakers? No. It labels speakers as Speaker 1, 2, …; mapping labels to names needs enrollment or manual tagging.
Does it transcribe? No. Pair it with ASR and merge by timestamp.
Why are results poor? Usually overlapping speech or noisy/far-field audio.
Is there a hosted option? Deepgram and other ASR APIs offer built-in diarization.
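Manual tagging, as mentioned in the first answer, can be as simple as a lookup table filled in by someone who has listened to the recording. The sketch below assumes turns are `(start, end, speaker)` tuples. `tag_speakers` and the example names are hypothetical, and labels without an entry are left unchanged.

```python
def tag_speakers(turns, names):
    """Replace anonymous speaker labels with names; unmapped labels pass through."""
    return [(start, end, names.get(speaker, speaker)) for start, end, speaker in turns]


# Hypothetical mapping supplied by a human reviewer.
names = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
turns = [
    (0.0, 3.2, "SPEAKER_00"),
    (3.2, 6.5, "SPEAKER_01"),
    (6.5, 7.0, "SPEAKER_02"),
]
tagged = tag_speakers(turns, names)
```

Diarization labels are assigned per recording, so a mapping made for one file generally cannot be reused on another without checking it again.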
Summary
Diarization segments audio by speaker; combine it with ASR (merge on timestamps) to get "who said what." pyannote.audio is the go-to open toolkit. Quality hinges on clean, well-separated audio: use per-speaker channels when you can and avoid over-denoising.
*Last updated: June 2026. Verify APIs against the pyannote.audio docs.*