Voice Activity Detection: Implementation Guide
Detecting and segmenting speech in audio streams
Voice Activity Detection (VAD) decides which parts of an audio stream contain speech versus silence or noise. It is the gatekeeper in front of transcription and voice agents: it lets you run expensive ASR only on speech segments, and it detects when a user has stopped talking ("end-of-turn") so the agent can respond.
Why you need it
Without VAD you transcribe silence, waste API calls, and can't tell when a speaker finished. With it you cut cost, reduce latency, and enable natural turn-taking in voice agents.
Practical options
python
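# Install first (shell): pip install silero-vad torch torchaudio
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()  # load the pretrained Silero VAD model
wav = read_audio("call.wav", sampling_rate=16000)  # audio resampled to 16 kHz
# Returns a list of dicts whose start/end are sample indices by default
segments = get_speech_timestamps(wav, model, sampling_rate=16000)
# segments = [{'start': 1280, 'end': 20480}, ...] → feed only these to ASR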
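The start and end values in the example output are sample indices, not seconds. As a minimal sketch, assuming the 16 kHz rate used above, a small helper (the name `to_seconds` is hypothetical, not part of silero-vad) can convert them for logging or for ASR APIs that expect time offsets:

```python
SAMPLING_RATE = 16000  # assumption: matches the rate passed to read_audio above


def to_seconds(segments, sampling_rate=SAMPLING_RATE):
    """Convert {'start', 'end'} sample indices into seconds."""
    return [
        {"start": seg["start"] / sampling_rate, "end": seg["end"] / sampling_rate}
        for seg in segments
    ]


# The sample values shown in the example output above.
example = [{"start": 1280, "end": 20480}]
seconds = to_seconds(example)
print(seconds)
```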
Using VAD with transcription
The standard pipeline: VAD → segment → ASR. Slice the audio into speech segments, then send each to Whisper or a streaming ASR. This avoids transcribing dead air and improves accuracy. For the ASR step, see Multilingual ASR and OpenAI Whisper API.
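The segmentation step of this pipeline can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the helper names `pad_and_merge` and `slice_speech` are hypothetical, segments are assumed to be dicts of sample indices, and the 1600-sample pad (100 ms at 16 kHz) is an arbitrary starting value to tune.

```python
def pad_and_merge(segments, total_samples, pad_samples=1600):
    """Pad each {'start', 'end'} segment, clamp to the audio, merge overlaps."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        start = max(0, seg["start"] - pad_samples)
        end = min(total_samples, seg["end"] + pad_samples)
        if merged and start <= merged[-1]["end"]:
            # Padded segment touches the previous one: extend it instead.
            merged[-1]["end"] = max(merged[-1]["end"], end)
        else:
            merged.append({"start": start, "end": end})
    return merged


def slice_speech(wav, segments):
    """Return one audio chunk per segment; wav is any sliceable sequence."""
    return [wav[seg["start"]:seg["end"]] for seg in segments]


# Stand-in for two seconds of 16 kHz audio (a real pipeline would use the
# waveform that was passed to the VAD).
wav = [0.0] * 32000
segments = [{"start": 5000, "end": 6000}, {"start": 20000, "end": 21000}]
padded = pad_and_merge(segments, total_samples=len(wav))
chunks = slice_speech(wav, padded)  # each chunk would go to the ASR step
```

Merging after padding keeps two nearby utterances from being sent as overlapping chunks, which would otherwise duplicate words in the transcript.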
Tuning for real-time agents
FAQ
webrtcvad or Silero? Silero for accuracy in noise; webrtcvad when you need something tiny and ultra-fast.
Does VAD transcribe? No. It only marks speech regions; ASR does the transcription.
How do I detect end-of-turn? Fire after a tuned window of trailing silence.
Why pad segments? To avoid clipping the first/last words before ASR.
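The trailing-silence rule for end-of-turn can be sketched as a small state machine over per-frame speech decisions. The class name `EndOfTurnDetector`, the 32 ms frame length, and the 600 ms silence window are all illustrative assumptions; the right window depends on your users and latency budget.

```python
class EndOfTurnDetector:
    """Fire once when enough silence follows speech (hypothetical helper)."""

    def __init__(self, frame_ms=32, silence_ms=600):
        self.frame_ms = frame_ms      # duration of each VAD frame
        self.silence_ms = silence_ms  # trailing silence required to end a turn
        self.heard_speech = False
        self.trailing_silence_ms = 0

    def update(self, is_speech):
        """Feed one frame decision; return True when the turn has ended."""
        if is_speech:
            self.heard_speech = True
            self.trailing_silence_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence before any speech: keep waiting
        self.trailing_silence_ms += self.frame_ms
        if self.trailing_silence_ms >= self.silence_ms:
            # Reset so the detector is ready for the next turn.
            self.heard_speech = False
            self.trailing_silence_ms = 0
            return True
        return False


# Simulated frame decisions: 5 silent frames, 10 speech frames, 30 silent.
frames = [False] * 5 + [True] * 10 + [False] * 30
detector = EndOfTurnDetector()
fired = [i for i, is_speech in enumerate(frames) if detector.update(is_speech)]
```

A shorter window makes the agent respond faster but risks cutting in during natural pauses; a longer one feels sluggish, so the value is usually tuned against recorded conversations.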
Summary
VAD is the cheap, high-leverage front end to any speech pipeline: detect speech with Silero (or webrtcvad), feed only speech to ASR, and tune trailing-silence thresholds for natural turn-taking. It cuts cost and latency while improving transcription quality.
*Last updated: June 2026. Verify APIs against the Silero VAD / webrtcvad docs.*