Voice Activity Detection: Implementation Guide

Detecting and segmenting speech in audio streams

By AI Skill Navigation Editorial Team. Published June 9, 2026

Voice Activity Detection (VAD) decides which parts of an audio stream contain speech versus silence or noise. It's the gatekeeper in front of transcription and voice agents: run expensive ASR only on speech segments, and detect when a user has stopped talking ("end-of-turn") so the agent can respond.

Why you need it

Without VAD you transcribe silence, waste API calls, and can't tell when a speaker finished. With it you cut cost, reduce latency, and enable natural turn-taking in voice agents.

Practical options

webrtcvad: lightweight, fast, classic energy/GMM-based VAD that classifies short fixed-size frames (10, 20, or 30 ms) of 16-bit mono PCM; great for chunking before ASR.

Silero VAD: a small, accurate neural VAD that handles noise far better than energy-based methods; the modern default.

python
# Install first: pip install silero-vad torch torchaudio
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("call.wav", sampling_rate=16000)
# Timestamps are returned in samples by default, not seconds
segments = get_speech_timestamps(wav, model, sampling_rate=16000)
# segments -> [{'start': 1280, 'end': 20480}, ...]; feed only these to ASR
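
For comparison, here is a minimal sketch of the webrtcvad route. It assumes the `webrtcvad` package is installed and that the input is raw 16-bit mono PCM at 8, 16, 32, or 48 kHz. The helper names `split_frames` and `speech_flags` are illustrative and not part of the library; only `webrtcvad.Vad` and `is_speech` come from webrtcvad itself.

```python
from typing import Iterator, List


def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield fixed-size frames of 16-bit mono PCM; a trailing partial frame is dropped."""
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    for start in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        yield pcm[start:start + bytes_per_frame]


def speech_flags(pcm: bytes, sample_rate: int = 16000, mode: int = 2) -> List[bool]:
    """Return one speech/non-speech flag per 30 ms frame using webrtcvad."""
    import webrtcvad  # imported lazily so split_frames works without the package

    vad = webrtcvad.Vad(mode)  # mode 0-3; higher is more aggressive at rejecting non-speech
    return [vad.is_speech(frame, sample_rate) for frame in split_frames(pcm, sample_rate)]
```

Unlike Silero, which returns whole segments, this approach yields one decision per frame, so grouping consecutive speech frames into segments is left to the caller.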

Using VAD with transcription

The standard pipeline: VAD → segment → ASR. Slice the audio into speech segments, then send each to Whisper or a streaming ASR. This avoids transcribing dead air and can improve accuracy, since models such as Whisper sometimes emit spurious text on long stretches of silence. For the ASR step, see Multilingual ASR and OpenAI Whisper API.
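
The slicing step can be sketched as follows. This is an illustration rather than part of either library: `slice_speech` and `to_seconds` are hypothetical helper names, and the segments are assumed to be sample-indexed dictionaries with `start` and `end` keys, as in the Silero output shown earlier. The optional `pad_ms` argument widens each segment slightly so the ASR is less likely to clip word edges.

```python
from typing import Dict, List, Sequence


def slice_speech(
    samples: Sequence[float],
    segments: List[Dict[str, int]],
    sample_rate: int = 16000,
    pad_ms: int = 0,
) -> List[Sequence[float]]:
    """Cut one chunk per VAD segment, widening each side by pad_ms."""
    pad = sample_rate * pad_ms // 1000
    chunks = []
    for seg in segments:
        start = max(0, seg["start"] - pad)         # clamp to the start of the audio
        end = min(len(samples), seg["end"] + pad)  # clamp to the end of the audio
        chunks.append(samples[start:end])
    return chunks


def to_seconds(segment: Dict[str, int], sample_rate: int = 16000) -> Dict[str, float]:
    """Convert a sample-indexed segment to seconds, e.g. for logging or subtitles."""
    return {key: value / sample_rate for key, value in segment.items()}
```

At 16 kHz, the example segment `{'start': 1280, 'end': 20480}` spans 0.08 s to 1.28 s. Each returned chunk would then be passed to the ASR of your choice; padded chunks from neighbouring segments may overlap, which is usually harmless but worth keeping in mind when stitching transcripts back together.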

Tuning for real-time agents

End-of-turn detection: trigger a response after N ms of trailing silence (e.g. 500–800 ms). Too short a window interrupts the user mid-pause; too long a window makes the agent feel laggy.

Aggressiveness: webrtcvad has modes 0–3; higher modes reject more non-speech but may clip soft speech.

Pre/post padding: add a little audio before/after each segment so ASR doesn't clip word edges.
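
The trailing-silence rule for end-of-turn detection can be sketched as a small state machine fed with one VAD decision per frame. This is a minimal illustration under stated assumptions: the class name `EndOfTurnDetector`, the 600 ms default, and the 30 ms frame size are choices made for the example, not values prescribed by Silero or webrtcvad.

```python
class EndOfTurnDetector:
    """Signal end-of-turn once speech is followed by enough trailing silence."""

    def __init__(self, silence_ms: int = 600, frame_ms: int = 30) -> None:
        # Number of consecutive non-speech frames that count as "the user stopped"
        self.frames_needed = -(-silence_ms // frame_ms)  # ceiling division
        self.silent_frames = 0
        self.heard_speech = False

    def update(self, is_speech: bool) -> bool:
        """Feed one frame's VAD decision; return True when the turn has ended."""
        if is_speech:
            self.heard_speech = True
            self.silent_frames = 0  # any speech resets the silence window
            return False
        if not self.heard_speech:
            return False  # leading silence before the user speaks is ignored
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            # Reset so the detector is ready for the next turn
            self.heard_speech = False
            self.silent_frames = 0
            return True
        return False
```

In a live agent, `update` would be called once per incoming frame, and a `True` result would hand control to the response pipeline. Raising `silence_ms` trades responsiveness for fewer interruptions, which is the tuning knob described above.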

FAQ

webrtcvad or Silero? Silero for accuracy in noise; webrtcvad when you need something tiny and ultra-fast.

Does VAD transcribe? No, it only marks speech regions; ASR does the transcription.

How do I detect end-of-turn? Fire after a tuned window of trailing silence.

Why pad segments? To avoid clipping the first and last words before ASR.

Summary

VAD is the cheap, high-leverage front end to any speech pipeline: detect speech with Silero (or webrtcvad), feed only speech to ASR, and tune trailing-silence thresholds for natural turn-taking. It cuts cost and latency while improving transcription quality.

*Last updated: June 2026. Verify APIs against the Silero VAD / webrtcvad docs.*
