Voice Activity Detection (VAD): Detecting Whether Someone Is Speaking with Python
The first checkpoint for voice applications—getting it right can save most of the recognition cost
Voice Activity Detection: Detecting Human Speech with Python
When building a voice application, the first problem to solve isn't "what was said"—it's "is there actually someone speaking in this audio?" That's exactly what VAD (Voice Activity Detection) does.
Why is it important? Because feeding silence and noise segments to a speech recognition (ASR) system wastes money, time, and often causes errors. VAD first extracts the speech segments, making the rest of the pipeline efficient.
What It Solves
An audio recording typically contains a mix of: human speech, pauses/silence, and background noise. VAD's job is to label each short frame—"this is speech" or "this is not speech."
Typical use cases:
Option 1: webrtcvad (Lightweight, Fast)
This VAD comes from the Google WebRTC project. It's extremely lightweight, relies purely on signal features (no models), and is blazing fast.
python
import webrtcvad
vad = webrtcvad.Vad(2) # 0-3, higher is more aggressive (more likely to classify as non-speech)Audio must be 16kHz/8kHz, 16-bit mono, framed at 10/20/30ms
is_speech = vad.is_speech(frame_bytes, sample_rate=16000)
Pros: Fast, zero model dependencies, suitable for real-time and resource-constrained scenarios. Cons: Prone to errors in noisy environments—it can't distinguish between "human speech" and "noise that sounds like speech."
Option 2: Silero VAD (Accurate, Noise-Resistant)
A neural network-based VAD with significantly higher accuracy, especially in noisy conditions. The model is small enough to run in real-time on a CPU.
python
import torch
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utilswav = read_audio('audio.wav', sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)
speech_ts: [{'start': 12000, 'end': 35000}, ...] sample index ranges for speech segments
Pros: Accurate, noise-resistant, language-agnostic, and still lightweight. Cons: Slightly heavier than webrtcvad (requires loading a model), but the overhead is worth it for most scenarios.
How to Choose
Honestly, unless you're running on a very weak device, default to Silero VAD—it's reliable and accurate.
Practical Pitfalls
Sampling rate and format must be correct. webrtcvad is picky about input format (16/8kHz, 16-bit, mono, fixed frame size). If the format is wrong, it will error or misclassify. Always convert audio to the required format first.
Don't expect VAD to solve everything. In challenging scenarios like heavy noise or multiple simultaneous speakers, VAD will make mistakes. It's a "first coarse filter," not a perfect separator.
Add some padding. Leave a small margin (padding) before and after speech segments to avoid cutting off the first word or trailing sounds.
Next Steps
After VAD extracts speech segments, the next step is to send them to ASR for recognition. If you need multilingual recognition, check out Multilingual Speech Recognition or OpenAI Whisper API.
Summary
VAD is the most easily overlooked yet most crucial step in a voice pipeline. Get it right, and the subsequent recognition becomes faster and cheaper. For new projects, starting with Silero VAD is a safe bet.
Also available in 中文.