Voice Activity Detection (VAD): Detecting Whether Someone Is Speaking with Python

The first checkpoint for voice applications—getting it right can save most of the recognition cost

By AI Skill Navigation Editorial TeamPublished July 22, 2026

When building a voice application, the first problem to solve isn't "what was said," but rather "is there actually someone speaking in this audio segment?" That's the core task of VAD (Voice Activity Detection). VAD plays a critical role in the voice pipeline—it filters out silence and noise segments, keeping only human speech, which dramatically improves the efficiency and accuracy of downstream Automatic Speech Recognition (ASR). Whether you're building a real-time voice assistant, processing long recordings, or optimizing API call costs, VAD is an indispensable foundational component.

Why VAD Matters So Much

A typical recording is a mix of: human speech, pauses of silence, and background noise. VAD's job is to label each small audio chunk—"this is speech" or "this is not speech." Typical use cases include:

Real-time voice assistants: Wake up only when the user starts speaking, and submit for recognition only after they finish. This avoids continuously sending silent data, reducing bandwidth and computational overhead.

Reducing ASR costs: Send only the speech segments for recognition, not the silence. With Whisper, for example, recognizing silence not only wastes compute but can also produce meaningless hallucinated text (like "um…" or misrecognized random noise).

Audio segmentation: Long recordings are split into speech/pause segments for downstream processing, such as speaker diarization or segment-by-segment transcription.

The accuracy of VAD directly impacts the quality of the entire voice pipeline. A false positive (classifying noise as speech) causes the ASR to produce garbage text; a false negative (classifying speech as silence) loses critical information. Therefore, choosing the right VAD solution is crucial.

Option 1: webrtcvad (Lightweight, Fast)

This is the VAD implementation from the Google WebRTC project. It's extremely lightweight, relying purely on signal features (like energy, zero-crossing rate, spectral features) for decision-making, requiring no model loading and running very fast. It uses a Gaussian Mixture Model (GMM) to model the statistical distributions of speech and noise, performing reliably in clean environments.

python
import webrtcvad
Create a VAD instance, aggressiveness parameter 0-3
0 is most permissive (more frames classified as speech), 3 is most aggressive (more frames classified as non-speech)
vad = webrtcvad.Vad(2)
Audio must meet strict constraints:
- Sample rate: 8000, 16000, 32000, or 48000 Hz
- Frame duration: 10ms, 20ms, or 30ms (must match exactly)
- Format: mono, 16-bit PCM
Assume frame_bytes is 30ms of 16kHz audio data
30ms * 16kHz * 2 bytes/sample = 960 bytes
frame_bytes = b'\x00\x00' * 480  # Example: 30ms silent frame
is_speech = vad.is_speech(frame_bytes, sample_rate=16000)
print(f"Is speech: {is_speech}")

Advantages: Zero model dependency, extremely low CPU overhead, suitable for embedded or resource-constrained scenarios. It can run in real-time on a Raspberry Pi or low-power MCU. Disadvantages: Prone to misclassification in noisy environments—it can't distinguish between "human speech" and "noise that sounds like speech" (e.g., fan noise, keyboard clicks). Increasing aggressiveness reduces false positives but increases false negatives, especially with soft speech or slow speaking rates.

Option 2: Silero VAD (Accurate, Noise-Resistant)

A neural network-based VAD with significantly higher accuracy, especially strong noise resistance. The model is small (about 2MB) and runs in real-time on a CPU. It uses deep neural networks (like LSTMs or CNNs) to learn complex patterns of speech and noise, offering some generalization to unseen noise types.

python
Method 1: via the silero-vad package (recommended)
pip install silero-vad
from silero_vad import load_silero_vad, get_speech_timestamps, read_audio
model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)  # Supports 8000 and 16000 Hz
speech_ts = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    threshold=0.5,  # Probability threshold, default 0.5
    min_speech_duration_ms=250,  # Minimum speech segment length
    min_silence_duration_ms=100,  # Minimum silence segment length
    speech_pad_ms=30  # Padding before and after
)
speech_ts: [{'start': 12000, 'end': 35000}, ...] sample index ranges
print(f"Detected {len(speech_ts)} speech segments")

python
Method 2: via torch.hub (alternative)
import torch
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils
Subsequent usage is the same as above

Advantages: Accurate, noise-resistant, language-agnostic, lightweight model. In noisy environments like a bustling café or factory floor, Silero VAD's false positive rate is far lower than webrtcvad. Disadvantages: Requires loading a model (about 2MB of memory), which is slightly heavier than webrtcvad, but this overhead is worthwhile for most scenarios. The first load requires downloading the weight file; offline caching is recommended.

Principle Comparison: Energy/Statistical Feature Method vs. Neural Network Model Method

MethodPrincipleSuitable ScenariosError Characteristics

Energy/Statistical Feature Method (e.g., webrtcvad)Based on handcrafted features like short-term energy, zero-crossing rate, spectral flatness + GMMQuiet environments, resource-constrained devicesProne to classifying non-speech as speech in noise (false positives); accurate in quiet settings Neural Network Model Method (e.g., Silero VAD)Based on deep neural networks (e.g., LSTM, CNN) learning complex speech/noise patternsNoisy environments, high accuracy requirementsHigh accuracy in noise; but may fail on unseen noise types

Key Difference: webrtcvad is "rule-driven," while Silero VAD is "data-driven." The former performs well in clean environments; the latter is significantly better in complex acoustic environments. In real projects, if latency and resource consumption aren't critical, prioritize Silero VAD.

Tying with ASR: VAD + Whisper to Save Costs

Using VAD to segment audio before feeding it to Whisper (or faster-whisper) reduces the computational cost of recognizing silence and avoids hallucinated text from silent segments. This is highly effective for long recordings or real-time streaming audio.

python
import torch
from silero_vad import load_silero_vad, get_speech_timestamps, read_audio
from faster_whisper import WhisperModel
1. VAD detects speech segments
model_vad = load_silero_vad()
wav = read_audio('long_audio.wav', sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model_vad, sampling_rate=16000)
2. Load Whisper model
model_asr = WhisperModel("base", device="cpu", compute_type="int8")
3. Recognize segment by segment
full_text = []
for seg in speech_ts:
    segment = wav[seg['start']:seg['end']]
    # faster-whisper accepts numpy arrays
    segments, _ = model_asr.transcribe(segment, beam_size=5)
    for s in segments:
        full_text.append(s.text)print("Recognition result:", " ".join(full_text))

Cost-saving principle: Suppose a 10-minute recording contains only 3 minutes of actual speech. Without VAD, Whisper must process all 10 minutes; with VAD, it only processes 3 minutes, reducing computation by 70%. For API calls (like the OpenAI Whisper API), which charge by audio duration, the cost savings are even more pronounced. Additionally, VAD segmentation prevents Whisper from producing hallucinated text like "um…" during silence, improving output quality.

Real-time Streaming vs. Offline Processing Design Differences

DimensionOffline ProcessingReal-time Streaming

InputComplete audio fileAudio stream (arrives in chunks) BufferingNo buffering neededRequires ring buffer for frame management Endpoint DetectionSimple: end when silence is detectedRequires hangover mechanism: wait for a period after silence detection before confirming end LatencyNo real-time requirementTypically requires response within 100-300ms

Real-time Streaming VAD Example (pseudocode logic):

python
Ring buffer + hangover mechanism
ring_buffer = []
hangover_frames = 0
HANGOVER_THRESHOLD = 10  # Confirm end only after 10 consecutive non-speech frameswhile streaming:
    frame = get_audio_frame()  # Get 30ms frame from microphone
    ring_buffer.append(frame)
    is_speech = vad.is_speech(frame, 16000)
    
    if is_speech:
        hangover_frames = 0
        # Continue accumulating audio
    else:
        hangover_frames += 1
        if hangover_frames >= HANGOVER_THRESHOLD:
            # Confirm speech segment end, process audio in ring buffer
            process_speech_segment(ring_buffer)
            ring_buffer.clear()

The key to real-time streaming VAD is the hangover mechanism: don't end immediately upon detecting silence; wait for several frames to confirm. This prevents erroneous segmentation due to brief pauses (like gaps in speech). The ring buffer manages audio frames, ensuring complete output when a speech segment ends.

Two parameters dominate real-world tuning. First, the aggressiveness or speech-probability threshold: in a quiet office, moderate settings work well, but noisy call-center or in-car audio usually demands stricter settings (higher aggressiveness for webrtcvad, a higher threshold for Silero)—at the cost of occasionally clipping soft speech. Always tune against recordings from your actual deployment environment, not clean demo audio. Second, frame size: 30ms frames mean fewer VAD calls and lower CPU overhead, while 10ms frames give finer granularity for faster endpointing. A common production pattern is to run the VAD at 30ms and let the ASR's word-level timestamps refine the final segment boundaries.

Common Pitfalls and Solutions

Sample rate mismatch: webrtcvad only supports 8000/16000/32000/48000 Hz. If your audio is 44100 Hz, you must resample first.

python
   import librosa
   audio, _ = librosa.load('audio.wav', sr=16000, mono=True)

Incorrect frame duration: webrtcvad frame duration must be exactly 10/20/30ms. Calculate frame byte size: frame_duration(seconds) * sample_rate * 2.

python
   frame_duration_ms = 30  # 30ms
   frame_size = int(16000 * frame_duration_ms / 1000) * 2  # 960 bytes

Adjusting aggressiveness in noise: webrtcvad's aggressiveness parameter ranges from 0-3. In noisy environments, use 2 or 3, but this may miss soft speech. Silero VAD's sensitivity can be adjusted via the threshold parameter; lowering the threshold captures more speech but may increase false positives.

pyaudio capture example:

python
   import pyaudio
   import webrtcvad
   
   CHUNK = 480  # 30ms @ 16kHz
   FORMAT = pyaudio.paInt16
   CHANNELS = 1
   RATE = 16000
   
   p = pyaudio.PyAudio()
   stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                   input=True, frames_per_buffer=CHUNK)
   vad = webrtcvad.Vad(2)
   
   while True:
       data = stream.read(CHUNK)
       is_speech = vad.is_speech(data, RATE)
       # Process speech segments...

How to Choose

ScenarioRecommendation

Extremely resource-constrained / embeddedwebrtcvad Quiet environment, need maximum speedwebrtcvad Noisy environment / need accuracySilero VAD Most applicationsSilero VAD

To be honest, unless you're running on very weak hardware, default to Silero VAD—it's hassle-free and accurate. Its accuracy and noise resistance far exceed webrtcvad, and the model loading overhead is negligible on today's hardware.

More Resources

After VAD extracts the speech segments, the next step is sending them to ASR for recognition. For more ASR integration options, see the API integration topic; for self-hosted inference of models like Whisper, see the model deployment topic. Additionally, for scenarios requiring speaker differentiation, VAD output can serve as input for speaker diarization, enabling further refinement.

FAQ

Q: What exactly is the difference between webrtcvad's aggressiveness parameters 0-3? A: 0 is the most permissive (more frames classified as speech), and 3 is the most aggressive (more frames classified as non-speech). In noisy environments, 2 or 3 is recommended, but this may miss soft or slow speech. In practice, test different values in your target scenario to find the balance between false positives and false negatives.

Q: What sample rates does Silero VAD support? A: Officially, 8000 Hz and 16000 Hz. Other sample rates require resampling first. 16000 Hz is recommended, as it's the standard input for most ASR models (like Whisper).

Q: Can VAD distinguish between different speakers? A: No. VAD only detects "is someone speaking," not who is speaking. Speaker diarization requires specialized models, such as those from pyannote-audio or NVIDIA NeMo.

Q: Why add padding after VAD segmentation? A: To avoid cutting off the beginning and end of speech (like consonants or trailing sounds). Typically, 30-100ms of padding is added on both sides. Silero VAD's speech_pad_ms parameter is designed for this purpose.

Q: How do you control latency in real-time streaming VAD? A: Latency mainly depends on frame duration and hangover time. A 30ms frame + 10-frame hangover ≈ 300ms latency. You can reduce latency by using a smaller frame duration (e.g., 10ms) or a lower hangover threshold, but this may increase misclassification. For real-time interactive scenarios, aim for latency under 200ms.

Q: Do I need a GPU for Silero VAD? A: No. The Silero VAD model is tiny (around 2 MB of weights) and runs faster than real time on CPU—even a modest VPS can handle many concurrent audio streams. A GPU only becomes relevant when you're batch-processing large audio archives and want maximum throughput.

*Last updated: July 2026. Always verify against each tool's official docs.*

Also available in 中文.

Voice Activity Detection (VAD): Detecting Whether Someone Is Speaking with Python

Why VAD Matters So Much

Option 1: webrtcvad (Lightweight, Fast)

Create a VAD instance, aggressiveness parameter 0-3

0 is most permissive (more frames classified as speech), 3 is most aggressive (more frames classified as non-speech)

Audio must meet strict constraints:

- Sample rate: 8000, 16000, 32000, or 48000 Hz

- Frame duration: 10ms, 20ms, or 30ms (must match exactly)

- Format: mono, 16-bit PCM

Assume frame_bytes is 30ms of 16kHz audio data

30ms * 16kHz * 2 bytes/sample = 960 bytes

Option 2: Silero VAD (Accurate, Noise-Resistant)

Method 1: via the silero-vad package (recommended)

pip install silero-vad

speech_ts: [{'start': 12000, 'end': 35000}, ...] sample index ranges

Method 2: via torch.hub (alternative)

Subsequent usage is the same as above

Principle Comparison: Energy/Statistical Feature Method vs. Neural Network Model Method

Tying with ASR: VAD + Whisper to Save Costs

1. VAD detects speech segments

2. Load Whisper model

3. Recognize segment by segment

Real-time Streaming vs. Offline Processing Design Differences

Ring buffer + hangover mechanism

Common Pitfalls and Solutions

How to Choose

More Resources

FAQ

Documentation

Getting Started

Learn more