OpenAI Whisper API 2026: Speech-to-Text for AI Applications
Transcribe audio files, meetings, and real-time speech with Whisper
Complete Whisper API tutorial. Covers transcription with timestamps, translation, local faster-whisper, real-time recording, and meeting transcription with AI summary pipeline.
Whisper is OpenAI's state-of-the-art speech recognition model, available via API and locally.
Why Whisper?
API Transcription
python
from openai import OpenAI
import os

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Basic transcription
with open('audio.mp3', 'rb') as f:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=f,
        language='en',  # Optional - auto-detects if not specified
        response_format='text'  # text, json, srt, vtt, or verbose_json
    )
print(transcript)

# Verbose JSON with timestamps
with open('meeting.mp3', 'rb') as f:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=f,
        response_format='verbose_json',
        timestamp_granularities=['word', 'segment']
    )

print(f'Duration: {transcript.duration}s')
for seg in transcript.segments:
    print(f'[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}')

# Word-level timestamps
for word in transcript.words:
    print(f'{word.word}: {word.start:.2f}s - {word.end:.2f}s')
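If you already hold the `verbose_json` segments and want subtitles without sending a second request with `response_format='srt'`, the segment timestamps can be rendered locally. The following is a minimal sketch under stated assumptions: `Segment` is a hypothetical stand-in for the segment objects returned by the API (anything with `start`, `end`, and `text` attributes works), and `srt_timestamp` and `segments_to_srt` are illustrative helper names, not part of the OpenAI SDK.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical stand-in for a verbose_json segment (start, end, text)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)
```

With the response from the verbose JSON call, `segments_to_srt(transcript.segments)` should produce text that can be written straight to a `.srt` file.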
Translation (Non-English to English)
python
with open('french_interview.mp3', 'rb') as f:
    translation = client.audio.translations.create(
        model='whisper-1',
        file=f,
        response_format='text'
    )
print(translation)  # Always returns English
Local Whisper (Free, Private)
bash
pip install openai-whisper
# Or faster-whisper, which can be up to 4x faster
pip install faster-whisper
python
# faster-whisper (recommended for local use)
from faster_whisper import WhisperModel

# Models: tiny, base, small, medium, large-v3
model = WhisperModel('medium', device='cuda', compute_type='float16')
# CPU: model = WhisperModel('base', device='cpu', compute_type='int8')

segments, info = model.transcribe('audio.mp3', beam_size=5)
print(f'Detected language: {info.language} ({info.language_probability:.0%})')
for segment in segments:
    print(f'[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}')
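One detail worth knowing when working with faster-whisper: `segments` is a generator, so the transcription runs as you iterate and the results can only be consumed once. If you need both the timed segments and the full text, collect them in a single pass. The sketch below assumes only that each segment exposes `start`, `end`, and `text`; `TimedText` and `collect_segments` are hypothetical helper names introduced here for illustration.

```python
from typing import Iterable, NamedTuple


class TimedText(NamedTuple):
    """Plain record holding one transcribed segment."""
    start: float
    end: float
    text: str


def collect_segments(segments: Iterable) -> tuple[list[TimedText], str]:
    """Consume a segment iterable once; return the segments and the joined text."""
    collected = [TimedText(s.start, s.end, s.text.strip()) for s in segments]
    full_text = " ".join(item.text for item in collected)
    return collected, full_text
```

Usage would look like `items, text = collect_segments(segments)`, after which `items` can be reused freely while the original generator is exhausted.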
Real-Time Transcription
python
import os
import tempfile
import wave

import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16  # 16-bit PCM; the wave module writes integer PCM, not float32
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 5

def record_and_transcribe():
    # Records one fixed-length clip, then sends it to the API.
    # Reuses the `client` created in the API Transcription section.
    audio = pyaudio.PyAudio()
    sample_width = audio.get_sample_size(FORMAT)  # Query before terminate()
    stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)
    print('Recording...')
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]
    stream.stop_stream()
    stream.close()
    audio.terminate()

    # Reserve a temporary file name, then write the WAV data to it
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        wav_path = f.name
    with wave.open(wav_path, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))

    try:
        with open(wav_path, 'rb') as audio_file:
            result = client.audio.transcriptions.create(model='whisper-1', file=audio_file)
    finally:
        os.remove(wav_path)  # Clean up the temporary recording
    return result.text

print(record_and_transcribe())
Meeting Transcription + AI Summary Pipeline
python
def transcribe_and_summarize(audio_path: str) -> dict:
    # Transcribe
    with open(audio_path, 'rb') as f:
        transcript = client.audio.transcriptions.create(
            model='whisper-1', file=f, response_format='verbose_json'
        )
    text = transcript.text

    # Summarize with GPT-4o
    summary = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': f'Summarize this meeting transcript. Include:\n'
                       f'1. Key decisions made\n'
                       f'2. Action items with owners\n'
                       f'3. Next steps\n\nTranscript:\n{text}'
        }]
    )

    return {
        'transcript': text,
        'duration': transcript.duration,
        'summary': summary.choices[0].message.content
    }
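The pipeline above sends the whole transcript in one prompt. For long meetings, the transcript may be too large to summarize comfortably in a single request, so one option is to split it into chunks, summarize each, and then summarize the summaries. The following is a minimal sketch of the splitting step only. The name `chunk_transcript` and the `max_chars` default are assumptions made for illustration; a character budget is a rough proxy for a token limit and should be tuned for the model you use.

```python
import re


def chunk_transcript(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The sentence does not fit: close the current chunk first.
        if current:
            chunks.append(current)
        # A single sentence longer than max_chars is hard-split.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed through the same summarization prompt, with a final call that merges the partial summaries into one.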
Conclusion
Whisper remains a strong speech-to-text option in 2026. Use the API for convenience, or run faster-whisper locally for privacy and lower cost. The meeting transcription and AI summary pipeline above is a practical starting point for production use.