AI Speech: Recognition, Synthesis, and Voice Applications
Build voice-enabled AI applications with modern speech tech
Complete guide to AI speech technologies including Whisper for transcription, ElevenLabs for synthesis, and building voice-first applications. Covers real-time processing, accent handling, and multilingual support.
AI Speech Technologies
Speech Recognition with Whisper
python
import whisper

# Load model (tiny/base/small/medium/large)
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")
print(result["text"])

# Print each transcribed segment with its start and end timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s]: {segment['text']}")
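The timestamped segments can also be turned into subtitles. The following is a minimal sketch, assuming each segment is a dict with `start`, `end`, and `text` keys as in the loop above; the helper names `format_timestamp` and `segments_to_srt` and the sample data are hypothetical and not part of the Whisper library.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments (start, end, text) as SRT subtitle text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical sample data in the shape of result["segments"]
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 61.25, "text": " This is a test."},
]
srt_text = segments_to_srt(sample_segments)
print(srt_text)
```

With a real transcription, `segments_to_srt(result["segments"])` could be written to a `.srt` file next to the audio.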
Real-Time Transcription
python
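import io
import wave

import numpy as np
import openai
import sounddevice as sd

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def record_and_transcribe(duration: int = 5, sample_rate: int = 16000) -> str:
    # Record mono float32 audio from the default input device
    recording = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype=np.float32,
    )
    sd.wait()  # block until the recording is finished

    # Convert to 16-bit PCM and wrap it in a WAV container; raw PCM bytes
    # without a header are not a valid audio file
    pcm = (np.clip(recording, -1.0, 1.0) * 32767).astype(np.int16)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())

    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", buffer.getvalue(), "audio/wav"),
        language="en",
    )
    return result.text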
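The function above records a fixed-length clip and transcribes it only after recording ends. One common way to get closer to real-time behavior is to process the audio in short overlapping windows, so that words falling on a boundary appear whole in at least one window. The sketch below shows only the windowing step, under stated assumptions: the name `iter_chunks` and its default values are hypothetical, and it works on any sliceable sequence of samples (a Python list here, a NumPy array in practice).

```python
from typing import Iterator, Sequence


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_seconds: float = 5.0,
    overlap_seconds: float = 1.0,
) -> Iterator[Sequence[float]]:
    """Yield fixed-length windows of audio that overlap with their neighbours."""
    chunk_size = int(chunk_seconds * sample_rate)
    step = chunk_size - int(overlap_seconds * sample_rate)
    if chunk_size <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")
    for start in range(0, len(samples), step):
        yield samples[start:start + chunk_size]
        if start + chunk_size >= len(samples):
            break  # the last window already reached the end of the audio


# Toy example: 12 "seconds" of audio at 1 sample per second
toy_audio = list(range(12))
chunks = list(iter_chunks(toy_audio, sample_rate=1, chunk_seconds=5, overlap_seconds=1))
print([(c[0], c[-1]) for c in chunks])
```

Each window could then be encoded and sent for transcription on its own; merging the overlapping text from neighbouring windows is a separate problem that this sketch does not address.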
Text-to-Speech with ElevenLabs
python
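from elevenlabs import Voice, VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="...")

def generate_speech(text: str, voice_id: str = "Rachel") -> bytes:
    # voice_id normally expects the ID string of a voice in your account;
    # "Rachel" is a display name and may need replacing with that voice's ID.
    audio = client.generate(
        text=text,
        voice=Voice(
            voice_id=voice_id,
            settings=VoiceSettings(
                stability=0.71,
                similarity_boost=0.5,
                style=0.0,
                use_speaker_boost=True,
            ),
        ),
        model="eleven_multilingual_v2",
    )
    # Depending on the SDK version, generate() may return the audio as an
    # iterator of byte chunks; join them so the caller always gets bytes
    return audio if isinstance(audio, bytes) else b"".join(audio)

# Save to file
audio_bytes = generate_speech("Hello, I'm an AI voice assistant!")
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)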
Voice Cloning for Personalization
python
# Clone a voice from samples
voice = client.clone(
    name="Custom Brand Voice",
    description="Professional, warm, trustworthy",
    files=["sample1.mp3", "sample2.mp3", "sample3.mp3"],
)
Voice-First Application Architecture
User Speech
↓
Whisper (STT)
↓
Intent Detection
↓
LLM Response Generation
↓
ElevenLabs (TTS)
↓
Audio Output
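The flow in the diagram can be sketched as a small pipeline in which each stage is a plain function passed in from outside. This is an illustration under stated assumptions, not a reference design: `VoicePipeline` and the `fake_*` and `keyword_intent` functions are hypothetical stand-ins, and in a real application the speech-to-text and text-to-speech stages would wrap the Whisper and ElevenLabs calls shown earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Wires the stages of the diagram together; every stage is injected."""
    transcribe: Callable[[bytes], str]          # STT, e.g. Whisper
    detect_intent: Callable[[str], str]         # intent detection
    generate_reply: Callable[[str, str], str]   # LLM response generation
    synthesize: Callable[[str], bytes]          # TTS, e.g. ElevenLabs

    def handle(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        intent = self.detect_intent(text)
        reply = self.generate_reply(text, intent)
        return self.synthesize(reply)


# Hypothetical stand-in stages so the sketch runs without any API keys
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def keyword_intent(text: str) -> str:
    return "weather" if "weather" in text.lower() else "chitchat"


def fake_reply(text: str, intent: str) -> str:
    return f"[{intent}] You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, keyword_intent, fake_reply, fake_synthesize)
audio_out = pipeline.handle(b"What is the weather today?")
print(audio_out.decode("utf-8"))
```

Keeping the stages as injected callables means any one of them can be swapped or tested in isolation without touching the others.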
Use Cases