OpenAI Whisper API: Complete Guide to Speech Recognition in Your App
Add accurate speech-to-text to any application using OpenAI Whisper API
Complete guide to integrating OpenAI Whisper for speech recognition: API setup, language detection, translation, real-time streaming, cost optimization, and handling audio quality issues.
OpenAI Whisper API: Complete Integration Guide
What Whisper Does
Whisper transcribes audio to text with state-of-the-art accuracy. It supports 57 languages and can translate non-English audio directly into English text.
Cost: $0.006 per minute of audio (about $0.36 per hour), which is affordable for most use cases.
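To make the pricing concrete, here is a minimal sketch that estimates the bill for a file of known length. The function name `estimate_cost` is hypothetical, and the sketch assumes cost is prorated by the second rather than rounded up to whole minutes; check your invoice if exact figures matter.

```python
WHISPER_PRICE_PER_MINUTE = 0.006  # USD, the rate quoted above


def estimate_cost(duration_seconds: float,
                  price_per_minute: float = WHISPER_PRICE_PER_MINUTE) -> float:
    """Return the estimated cost in USD for one audio file."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: billing is prorated by the second, not rounded up per minute.
    return duration_seconds / 60 * price_per_minute


if __name__ == "__main__":
    print(f"1-hour meeting: ${estimate_cost(3600):.2f}")
    print(f"90-second voice note: ${estimate_cost(90):.4f}")
```

Multiplying the result by your expected monthly volume gives a rough budget before you write any integration code.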
Basic Setup
```python
from openai import OpenAI

client = OpenAI(api_key='your-key')

# Transcribe an audio file
with open('meeting.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file
    )

print(transcript.text)
```
Language Detection and Translation
```python
# Auto-detect language
with open('meeting.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='verbose_json'  # Includes detected language
    )

print(f'Language: {transcript.language}')
print(f'Text: {transcript.text}')

# Translate non-English to English
# ('spanish.mp3' is a placeholder filename)
with open('spanish.mp3', 'rb') as spanish_audio:
    translation = client.audio.translations.create(
        model='whisper-1',
        file=spanish_audio
    )

print(translation.text)  # Always English output
```
Timestamps and Segmentation
```python
with open('meeting.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='verbose_json',
        timestamp_granularities=['segment', 'word']  # Both segment and word timestamps
    )

# Print each segment with its start and end time
for segment in transcript.segments:
    print(f'[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}')
```
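Segment timestamps map naturally onto subtitle cues. The sketch below turns a list of `(start, end, text)` tuples into SRT text; the helper names `format_srt_time` and `segments_to_srt` are hypothetical, and the input is plain tuples so the example runs without an API call.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from an iterable of (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: sequence number, time range, then the caption text.
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 2.5, " Hello everyone."), (2.5, 4.0, " Let's get started.")]
    print(segments_to_srt(demo))
```

With a real response you could pass `[(s.start, s.end, s.text) for s in transcript.segments]`. If you only need the subtitle file, requesting `response_format='srt'` from the API returns SRT text directly and makes this step unnecessary.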
Cost Optimization
```python
import librosa
import soundfile as sf


def optimize_audio_for_whisper(input_path: str, output_path: str) -> str:
    # Load and resample to 16kHz mono (the sample rate Whisper works with)
    audio, sr = librosa.load(input_path, sr=16000, mono=True)

    # Trim leading and trailing silence. Pauses in the middle of the
    # recording are not removed by librosa.effects.trim.
    audio_trimmed, _ = librosa.effects.trim(audio, top_db=20)

    # Save as 16-bit PCM WAV. This is uncompressed (about 1.9 MB per minute
    # at 16kHz mono), so long recordings can exceed the 25MB upload limit.
    sf.write(output_path, audio_trimmed, sr, subtype='PCM_16')

    # Compare durations from the loaded samples (avoids re-reading the file)
    original_duration = len(audio) / sr
    trimmed_duration = len(audio_trimmed) / sr
    savings = (original_duration - trimmed_duration) / original_duration
    print(f'Audio reduced by {savings:.1%}')

    return output_path
```
Handling Large Files
The Whisper API has a 25MB file size limit. For longer audio, split the file into chunks and transcribe each one:
```python
from pydub import AudioSegment


def transcribe_long_audio(file_path: str, chunk_minutes: int = 10) -> str:
    # Reuses the `client` created in Basic Setup
    audio = AudioSegment.from_file(file_path)

    # pydub slices audio by milliseconds
    chunk_ms = chunk_minutes * 60 * 1000
    chunks = [audio[i:i+chunk_ms] for i in range(0, len(audio), chunk_ms)]

    transcripts = []
    for i, chunk in enumerate(chunks):
        # Export each chunk as a low-bitrate MP3 to stay under the size limit
        chunk_path = f'/tmp/chunk_{i}.mp3'
        chunk.export(chunk_path, format='mp3', bitrate='64k')

        with open(chunk_path, 'rb') as f:
            result = client.audio.transcriptions.create(
                model='whisper-1', file=f
            )
        transcripts.append(result.text)

    return ' '.join(transcripts)
```
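The 10-minute default above is comfortably small. As a rough check, this sketch estimates how many minutes of constant-bitrate audio fit under the upload limit. The name `max_chunk_minutes` and the 10% safety margin are assumptions for illustration, and "25MB" is treated here as 25 MiB.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: the 25MB limit means 25 MiB


def max_chunk_minutes(bitrate_kbps: int, safety_margin: float = 0.9) -> int:
    """Whole minutes of constant-bitrate audio that fit under the limit."""
    if bitrate_kbps <= 0:
        raise ValueError("bitrate_kbps must be positive")
    # kilobits per second -> bytes per minute
    bytes_per_minute = bitrate_kbps * 1000 / 8 * 60
    # Leave headroom for container overhead and encoder variation.
    return int(MAX_UPLOAD_BYTES * safety_margin // bytes_per_minute)


if __name__ == "__main__":
    for kbps in (64, 128, 256):
        print(f"{kbps} kbps: up to {max_chunk_minutes(kbps)} minutes per chunk")
```

At 64 kbps a 10-minute chunk is roughly 4.8 MB, so larger chunks are possible. Fewer chunks also means fewer boundaries where a word can be cut in half.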
Real-World Applications
Meeting transcription: Record meetings, transcribe, then use GPT-4o to extract action items and summaries.
Customer service analytics: Transcribe support calls to identify common issues and sentiment patterns.
Subtitle generation: Request response_format='srt' to get an SRT subtitle file directly, or build cues yourself from segment or word timestamps.
Multilingual support: Support users in any of 57 languages without separate language-specific models.
Quality Tips