Voice Activity Detection (VAD): Detecting Whether Someone Is Speaking with Python

The first checkpoint for voice applications—getting it right can save most of the recognition cost

When building a voice application, the first problem to solve isn't "what was said"—it's "is there actually someone speaking in this audio?" That's exactly what VAD (Voice Activity Detection) does.

Why is it important? Because feeding silence and noise to a speech recognition (ASR) system wastes money and time, and often causes recognition errors. VAD extracts the speech segments first, so the rest of the pipeline only processes audio that matters.

What It Solves

An audio recording typically contains a mix of: human speech, pauses/silence, and background noise. VAD's job is to label each short frame—"this is speech" or "this is not speech."
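To make "label each short frame" concrete, here is a toy sketch that cuts a signal into 30 ms frames and labels each one with a plain energy threshold. This is not how webrtcvad or Silero VAD decide; the function name `label_frames`, the threshold value, and the synthetic test signal are all assumptions for illustration. An energy detector flags any loud sound as "speech", which is exactly the weakness real VADs try to overcome.

```python
import math

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per 30 ms frame


def label_frames(samples, threshold=500.0):
    """Return one True/False label per full frame, based on RMS energy.

    Hypothetical helper: a stand-in for a real VAD decision.
    """
    labels = []
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        labels.append(rms > threshold)
    return labels


# Synthetic input: 0.3 s of silence followed by 0.3 s of a loud 440 Hz tone.
silence = [0] * (SAMPLE_RATE * 3 // 10)
tone = [int(8000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE * 3 // 10)]
labels = label_frames(silence + tone)
```

The output is one boolean per frame. A real VAD produces the same kind of sequence, only with a much better decision rule behind each label.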

Typical use cases:

  • Real-time voice assistants: Detect when the user starts speaking to wake up, and when they finish to submit for recognition.
  • Reduce ASR costs: Send only speech segments for recognition; skip silence.
  • Voice segmentation: Split long recordings into chunks based on speech/pause boundaries.

    Option 1: webrtcvad (Lightweight, Fast)

    This VAD comes from Google's WebRTC project. It is extremely lightweight: it classifies frames with a small statistical model over signal features (no neural network to load), so it is very fast.

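    python
    import webrtcvad

    vad = webrtcvad.Vad(2)  # 0-3, higher is more aggressive (more likely to classify as non-speech)

    # Audio must be 16-bit mono PCM at 8, 16, 32, or 48 kHz, in frames of 10, 20, or 30 ms.
    # Placeholder input: one 30 ms frame of silence (480 samples x 2 bytes at 16 kHz).
    frame_bytes = b"\x00\x00" * 480
    is_speech = vad.is_speech(frame_bytes, sample_rate=16000)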

    Pros: Fast, zero model dependencies, suitable for real-time and resource-constrained scenarios. Cons: Prone to errors in noisy environments—it can't distinguish between "human speech" and "noise that sounds like speech."
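Since webrtcvad only accepts frames of an exact length, a recording has to be cut up before it can be classified. The sketch below shows one way to do that with the standard library only. The helper names `split_frames` and `label_pcm` are hypothetical, and the classifier is passed in as a callable so the example does not depend on webrtcvad being installed.

```python
def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Split 16-bit mono PCM into equal frames, dropping a trailing partial frame."""
    if frame_ms not in (10, 20, 30):
        raise ValueError("frame_ms must be 10, 20, or 30")
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]


def label_pcm(pcm: bytes, is_speech, sample_rate: int = 16000, frame_ms: int = 30):
    """Run a per-frame classifier over the recording and return a list of booleans.

    `is_speech` is any callable taking (frame_bytes, sample_rate).
    """
    return [bool(is_speech(frame, sample_rate))
            for frame in split_frames(pcm, sample_rate, frame_ms)]
```

With webrtcvad installed, `label_pcm(pcm, vad.is_speech)` should yield one label per 30 ms frame, since `Vad.is_speech` takes the frame bytes and the sample rate in that order.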

    Option 2: Silero VAD (Accurate, Noise-Resistant)

    A neural network-based VAD with significantly higher accuracy, especially in noisy conditions. The model is small enough to run in real-time on a CPU.

    python
    import torch

    model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio('audio.wav', sampling_rate=16000)
    speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)

    # speech_ts: [{'start': 12000, 'end': 35000}, ...] sample index ranges for speech segments

    Pros: Accurate, noise-resistant, language-agnostic, and still lightweight. Cons: Slightly heavier than webrtcvad (requires loading a model), but the overhead is worth it for most scenarios.
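The timestamps shown above are sample indices, which are awkward to read or to pass to tools that expect seconds. A minimal sketch of the conversion follows, together with the fraction of the recording that is speech (a rough proxy for how much audio would still be sent to ASR). The helper names `to_seconds` and `speech_ratio` and the total length of 46000 samples are assumptions for illustration.

```python
def to_seconds(speech_ts, sampling_rate=16000):
    """Convert sample-index segments to segments expressed in seconds."""
    return [{"start": seg["start"] / sampling_rate,
             "end": seg["end"] / sampling_rate}
            for seg in speech_ts]


def speech_ratio(speech_ts, total_samples):
    """Fraction of the recording covered by speech segments."""
    speech_samples = sum(seg["end"] - seg["start"] for seg in speech_ts)
    return speech_samples / total_samples


speech_ts = [{"start": 12000, "end": 35000}]  # same shape as the example output
segments_sec = to_seconds(speech_ts)          # [{'start': 0.75, 'end': 2.1875}]
ratio = speech_ratio(speech_ts, total_samples=46000)  # 0.5
```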

    How to Choose

    Scenario | Recommendation
    Extremely resource-constrained / embedded | webrtcvad
    Quiet environment, need maximum speed | webrtcvad
    Noisy environment / need accuracy | Silero VAD
    Most applications | Silero VAD

    Honestly, unless you're running on a very weak device, default to Silero VAD—it's reliable and accurate.

    Practical Pitfalls

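    Sampling rate and format must be correct. webrtcvad is picky about input format: 16-bit mono PCM at 8, 16, 32, or 48 kHz, in frames of exactly 10, 20, or 30 ms. A wrong frame length or an unsupported rate raises an error; audio that is really at a different rate than the one you declare is accepted but misclassified. Always convert audio to the required format first.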
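A cheap safeguard is to check a WAV file's header before handing it to the VAD. The sketch below uses only the standard-library `wave` module. The function name `wav_format_problems` is hypothetical, and the list of accepted rates (8, 16, 32, 48 kHz) is the set documented in the py-webrtcvad README; adjust it if you target a different VAD.

```python
import os
import tempfile
import wave

SUPPORTED_RATES = (8000, 16000, 32000, 48000)


def wav_format_problems(path):
    """Return the reasons a WAV file is not 16-bit mono PCM at a supported rate."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {wf.getsampwidth() * 8}-bit")
        if wf.getframerate() not in SUPPORTED_RATES:
            problems.append(f"unsupported sample rate {wf.getframerate()} Hz")
    return problems


# Demo: write a short stereo 44.1 kHz file and check it.
tmp_dir = tempfile.mkdtemp()
stereo_path = os.path.join(tmp_dir, "stereo_44k.wav")
with wave.open(stereo_path, "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00\x00\x00" * 441)  # 441 stereo frames of silence

problems = wav_format_problems(stereo_path)  # channel count and sample rate are flagged
```

An empty list means the header looks compatible. The check does not resample or downmix anything; that conversion still has to be done with an audio tool of your choice.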

    Don't expect VAD to solve everything. In challenging scenarios like heavy noise or multiple simultaneous speakers, VAD will make mistakes. It's a "first coarse filter," not a perfect separator.

    Add some padding. Leave a small margin (padding) before and after speech segments to avoid cutting off the first word or trailing sounds.
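Padding can be applied as a post-processing step on the detected segments. The sketch below widens each segment by a fixed number of samples, clamps it to the recording, and merges segments that end up overlapping. The helper name `pad_segments`, the 200 ms margin, and the example segments are assumptions for illustration; segments use the same start/end sample-index shape as the Silero output.

```python
def pad_segments(segments, pad, total_len):
    """Widen each segment by `pad` samples on both sides and merge overlaps."""
    padded = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        start = max(0, seg["start"] - pad)          # do not run before the recording
        end = min(total_len, seg["end"] + pad)      # or past its end
        if padded and start <= padded[-1]["end"]:
            # The padded segment touches the previous one: extend it instead.
            padded[-1]["end"] = max(padded[-1]["end"], end)
        else:
            padded.append({"start": start, "end": end})
    return padded


segments = [
    {"start": 1000, "end": 20000},
    {"start": 24000, "end": 40000},
    {"start": 60000, "end": 79000},
]
# 3200 samples is 200 ms at 16 kHz.
padded = pad_segments(segments, pad=3200, total_len=80000)
```

Here the first two segments are only 4000 samples apart, so after padding they merge into one, while the third is clamped at the end of the recording.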

    Next Steps

    After VAD extracts speech segments, the next step is to send them to ASR for recognition. If you need multilingual recognition, check out Multilingual Speech Recognition or OpenAI Whisper API.

    Summary

    VAD is the most easily overlooked yet most crucial step in a voice pipeline. Get it right, and the subsequent recognition becomes faster and cheaper. For new projects, starting with Silero VAD is a safe bet.
