← Back to tutorials

Voice Activity Detection: Implementation Guide

Detecting and segmenting speech in audio streams

Voice Activity Detection: Implementation Guide (2026)

Voice Activity Detection (VAD) decides which parts of an audio stream contain speech versus silence or noise. It's the gatekeeper in front of transcription and voice agents: run expensive ASR only on speech segments, and detect when a user has stopped talking ("end-of-turn") so the agent can respond.

Why you need it

Without VAD you transcribe silence, waste API calls, and can't tell when a speaker finished. With it you cut cost, reduce latency, and enable natural turn-taking in voice agents.

Practical options

  • webrtcvad: lightweight, fast, classic energy/GMM-based VAD — great for chunking before ASR.
  • Silero VAD: a small, accurate neural VAD that handles noise far better than energy-based methods; the modern default.
  • python
    

    pip install silero-vad torch torchaudio

    from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

    model = load_silero_vad() wav = read_audio("call.wav", sampling_rate=16000) segments = get_speech_timestamps(wav, model, sampling_rate=16000)

    segments = [{'start': 1280, 'end': 20480}, ...] → feed only these to ASR

    Using VAD with transcription

    The standard pipeline: VAD → segment → ASR. Slice the audio into speech segments, then send each to Whisper or a streaming ASR. This avoids transcribing dead air and improves accuracy. For the ASR step, see Multilingual ASR and OpenAI Whisper API.

    Tuning for real-time agents

  • End-of-turn detection: trigger a response after N ms of trailing silence (e.g. 500–800ms). Too short interrupts the user; too long feels laggy.
  • Aggressiveness: webrtcvad has modes 0–3; higher rejects more non-speech but may clip soft speech.
  • Pre/post padding: add a little audio before/after each segment so ASR doesn't clip word edges.
  • FAQ

    webrtcvad or Silero? Silero for accuracy in noise; webrtcvad when you need something tiny and ultra-fast. Does VAD transcribe? No — it only marks speech regions; ASR does the transcription. How do I detect end-of-turn? Fire after a tuned window of trailing silence. Why pad segments? To avoid clipping the first/last words before ASR.

    Summary

    VAD is the cheap, high-leverage front end to any speech pipeline: detect speech with Silero (or webrtcvad), feed only speech to ASR, and tune trailing-silence thresholds for natural turn-taking. It cuts cost and latency while improving transcription quality.


    *Last updated: June 2026. Verify APIs against the Silero VAD / webrtcvad docs.*

    Also available in 中文.

    Voice Activity Detection: Implementation Guide | AI Skill Navigation | AI Skill Navigation