AI Speech: Recognition, Synthesis, and Voice Applications
Build voice-enabled AI applications with modern speech tech
AI Speech Technologies
Speech Recognition with Whisper
python
import whisper
import numpy as np

# Load model (tiny/base/small/medium/large)
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")
print(result["text"])

# Transcribe with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s]: {segment['text']}")
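The per-segment timestamps are what make subtitles possible. As a minimal sketch, assuming each segment is a dict with `start`, `end` (both in seconds), and `text` keys as in the loop above, the segments can be turned into SRT text. The helper names `format_timestamp` and `segments_to_srt` are illustrative and not part of the Whisper package.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Build SRT subtitle text from Whisper-style segment dicts."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        # Whisper segment text usually starts with a space, so strip it
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Hand-written sample segments standing in for result["segments"]
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " Welcome to the demo."},
]
srt_text = segments_to_srt(sample_segments)
```

With a real transcription, passing `result["segments"]` in place of `sample_segments` and writing `srt_text` to a `.srt` file should give a subtitle track that most players can load.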
Real-Time Transcription
python
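import io
import wave

import numpy as np
import openai
import sounddevice as sd

client = openai.OpenAI()

def record_and_transcribe(duration: int = 5, sample_rate: int = 16000) -> str:
    # Record mono float32 audio from the default input device
    recording = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype=np.float32
    )
    sd.wait()  # Block until the recording is finished

    # Convert to 16-bit PCM and wrap it in a WAV container.
    # Raw PCM bytes have no header, so on their own they are not a valid .wav file.
    pcm = (np.clip(recording, -1.0, 1.0) * 32767).astype(np.int16)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())

    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", buffer.getvalue(), "audio/wav"),
        language="en"
    )
    return result.text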
Text-to-Speech with ElevenLabs
python
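from elevenlabs import Voice, VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="...")

def generate_speech(text: str, voice_id: str = "Rachel") -> bytes:
    # Note: voice_id is passed through as an ElevenLabs voice ID. "Rachel" is a
    # voice name, so it may need to be replaced with the matching ID from your account.
    audio = client.generate(
        text=text,
        voice=Voice(
            voice_id=voice_id,
            settings=VoiceSettings(
                stability=0.71,
                similarity_boost=0.5,
                style=0.0,
                use_speaker_boost=True
            )
        ),
        model="eleven_multilingual_v2"
    )
    # Depending on the SDK version, generate() may return an iterator of
    # byte chunks rather than a single bytes object, so join the chunks if needed
    return audio if isinstance(audio, bytes) else b"".join(audio)

# Save to file
audio_bytes = generate_speech("Hello, I'm an AI voice assistant!")
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)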
Voice Cloning for Personalization
python
# Clone a voice from samples
voice = client.clone(
    name="Custom Brand Voice",
    description="Professional, warm, trustworthy",
    files=["sample1.mp3", "sample2.mp3", "sample3.mp3"]
)
Voice-First Application Architecture
User Speech
↓
Whisper (STT)
↓
Intent Detection
↓
LLM Response Generation
↓
ElevenLabs (TTS)
↓
Audio Output
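
The flow above can be sketched as a small pipeline object in which each stage is a plain callable. This is only an illustration under stated assumptions: `VoicePipeline` and the four stub stages are hypothetical names, and the stubs stand in for real Whisper, LLM, and ElevenLabs calls so the chain can be run offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the stages from the diagram; each stage is a plain callable."""
    transcribe: Callable[[bytes], str]         # STT, e.g. a Whisper wrapper
    detect_intent: Callable[[str], str]        # transcript -> intent label
    generate_reply: Callable[[str, str], str]  # (transcript, intent) -> reply text
    synthesize: Callable[[str], bytes]         # TTS, e.g. an ElevenLabs wrapper

    def handle(self, audio_in: bytes) -> bytes:
        transcript = self.transcribe(audio_in)
        intent = self.detect_intent(transcript)
        reply = self.generate_reply(transcript, intent)
        return self.synthesize(reply)


# Stub stages (hypothetical) so the flow can be exercised without any API keys
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def keyword_intent(text: str) -> str:
    return "weather" if "weather" in text.lower() else "chitchat"


def canned_reply(text: str, intent: str) -> str:
    return f"[{intent}] You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, keyword_intent, canned_reply, fake_tts)
audio_out = pipeline.handle(b"What is the weather today?")
```

Keeping the stages injectable means a real transcription or speech function can be swapped in one at a time, and each stage can be tested separately from the others.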
Use Cases