AI Speech: Recognition, Synthesis, and Voice Applications

Build voice-enabled AI applications with modern speech tech

Advanced · 32 min

Complete guide to AI speech technologies including Whisper for transcription, ElevenLabs for synthesis, and building voice-first applications. Covers real-time processing, accent handling, and multilingual support.

Tags: speech-recognition, text-to-speech, whisper, elevenlabs, voice-ai

AI Speech Technologies

Speech Recognition with Whisper

python
import whisper

# Load model (tiny/base/small/medium/large)
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")
print(result["text"])

# Print each segment with its start and end timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s]: {segment['text']}")
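The segment timestamps above are often turned into subtitles. The following is a minimal sketch of that step, assuming each segment is a dict with "start", "end" (in seconds), and "text" keys as in the loop above. The helper names `format_srt_timestamp` and `segments_to_srt` are illustrative and not part of the whisper package.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build an SRT document from Whisper-style segment dicts.

    Each segment is assumed to carry "start", "end" (seconds) and "text".
    """
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        text = segment["text"].strip()
        # One SRT cue: sequence number, time range, then the caption text
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

With the `result` from the transcription above, `segments_to_srt(result["segments"])` would give a string that can be written to a `.srt` file.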

Real-Time Transcription

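python
import io
import wave

import numpy as np
import openai
import sounddevice as sd

client = openai.OpenAI()  # Reads OPENAI_API_KEY from the environment

def record_and_transcribe(duration: int = 5, sample_rate: int = 16000) -> str:
    # Record mono float32 audio from the default input device
    recording = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype=np.float32,
    )
    sd.wait()  # Block until the recording is finished

    # Convert to 16-bit PCM and wrap it in a WAV container;
    # raw PCM bytes without a header are not a valid .wav file
    pcm = (np.clip(recording, -1.0, 1.0) * 32767).astype(np.int16)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())

    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", buffer.getvalue(), "audio/wav"),
        language="en",
    )
    return result.text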
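The function above records a fixed-length clip before sending it for transcription. To get closer to real-time behavior, one common approach is to stop recording once the speaker goes quiet. The sketch below shows a simple energy-based end-of-speech check in pure Python. The function names, the threshold of 0.01, and the 300 ms silence window are illustrative assumptions to tune for your microphone, not values from any library.

```python
import math
from typing import Optional, Sequence


def frame_rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_end_of_speech(
    samples: Sequence[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    silence_threshold: float = 0.01,  # assumed value; tune per device
    min_silence_ms: int = 300,        # assumed value; tune per use case
) -> Optional[int]:
    """Return the sample index where trailing silence begins, or None.

    Speech counts as finished once at least `min_silence_ms` worth of
    consecutive frames fall below `silence_threshold` after some speech.
    """
    frame_len = sample_rate * frame_ms // 1000
    frames_needed = max(1, min_silence_ms // frame_ms)
    heard_speech = False
    silent_run = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= silence_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                # Index of the first sample of the silent run
                return start - (silent_run - 1) * frame_len
    return None
```

In a streaming setup, this check would run on the accumulated buffer after each new frame, and the audio up to the returned index would be sent for transcription. Production systems usually rely on a trained voice activity detector instead of a fixed energy threshold.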

Text-to-Speech with ElevenLabs

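python
from elevenlabs import Voice, VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="...")

def generate_speech(text: str, voice_id: str = "Rachel") -> bytes:
    # Note: "Rachel" is a voice name; voice_id normally expects the ID string
    # listed for that voice in your ElevenLabs account.
    audio = client.generate(
        text=text,
        voice=Voice(
            voice_id=voice_id,
            settings=VoiceSettings(
                stability=0.71,
                similarity_boost=0.5,
                style=0.0,
                use_speaker_boost=True,
            ),
        ),
        model="eleven_multilingual_v2",
    )
    # Depending on the SDK version, generate() returns bytes or an
    # iterator of audio chunks; normalize to a single bytes object
    return audio if isinstance(audio, bytes) else b"".join(audio)

# Save to file
audio_bytes = generate_speech("Hello, I'm an AI voice assistant!")
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)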
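Longer inputs, such as a full article, usually have to be split before synthesis, since TTS services limit how much text one request can carry. The sketch below splits text at sentence boundaries. The name `split_for_tts` and the default `max_chars=500` are illustrative assumptions, not an ElevenLabs limit; check your plan's actual limit.

```python
import re


def split_for_tts(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-word.
    """
    # Split after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `generate_speech` in turn and the resulting audio written out in order.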

Voice Cloning for Personalization

python
# Clone a voice from samples
voice = client.clone(
    name="Custom Brand Voice",
    description="Professional, warm, trustworthy",
    files=["sample1.mp3", "sample2.mp3", "sample3.mp3"],
)

Voice-First Application Architecture


User Speech
    ↓
Whisper (STT)
    ↓
Intent Detection
    ↓
LLM Response Generation
    ↓
ElevenLabs (TTS)
    ↓
Audio Output
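The flow in the diagram can be sketched as a small pipeline in which every stage is a plain callable. The class name `VoicePipeline` and the stub functions below are hypothetical stand-ins. In a real application, the stubs would be replaced by calls to Whisper, an intent classifier, an LLM, and ElevenLabs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the stages in the diagram; each stage is a plain callable."""
    transcribe: Callable[[bytes], str]         # STT, e.g. Whisper
    detect_intent: Callable[[str], str]        # intent label from text
    generate_reply: Callable[[str, str], str]  # LLM: (text, intent) -> reply
    synthesize: Callable[[str], bytes]         # TTS, e.g. ElevenLabs

    def handle(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        intent = self.detect_intent(text)
        reply = self.generate_reply(text, intent)
        return self.synthesize(reply)


# Stub stages so the flow can be exercised without any API calls
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def keyword_intent(text: str) -> str:
    return "weather" if "weather" in text.lower() else "chitchat"


def fake_llm(text: str, intent: str) -> str:
    return f"[{intent}] You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, keyword_intent, fake_llm, fake_tts)
audio_out = pipeline.handle(b"What is the weather today?")
```

Keeping the stages as injected callables makes it possible to test the flow offline and to swap providers without touching the pipeline itself.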

Use Cases

  • Call center automation
  • Podcast generation from articles
  • Accessibility tools (text readers)
  • Language learning apps
  • Voice-controlled applications

Related tools: whisper, elevenlabs, openai, azure-speech