OpenAI Whisper API 2026: Speech-to-Text for AI Applications

Transcribe audio files, meetings, and real-time speech with Whisper

Level: Beginner (25 minutes)

Complete Whisper API tutorial. Covers transcription with timestamps, translation, local faster-whisper, real-time recording, and meeting transcription with AI summary pipeline.

Tags: whisper, speech to text, openai, audio, transcription

Whisper is OpenAI's speech recognition model. You can call it through the hosted API or run the open-source weights locally.

Why Whisper?

  • Covers 99 languages; accuracy varies by language and audio quality
  • Handles accents, background noise, and technical vocabulary
  • Outputs timestamps for word and segment alignment
  • Available via API or run locally for privacy

API Transcription

    python
    from openai import OpenAI
    import os

    client = OpenAI()

    # Basic transcription
    with open('audio.mp3', 'rb') as f:
        transcript = client.audio.transcriptions.create(
            model='whisper-1',
            file=f,
            language='en',          # Optional - auto-detects if not specified
            response_format='text'  # text, json, srt, vtt, or verbose_json
        )
    print(transcript)

    # Verbose JSON with timestamps
    with open('meeting.mp3', 'rb') as f:
        transcript = client.audio.transcriptions.create(
            model='whisper-1',
            file=f,
            response_format='verbose_json',
            timestamp_granularities=['word', 'segment']
        )

    print(f'Duration: {transcript.duration}s')
    for seg in transcript.segments:
        print(f'[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}')

    # Word-level timestamps
    for word in transcript.words:
        print(f'{word.word}: {word.start:.2f}s - {word.end:.2f}s')
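
Word-level timestamps are useful for building short captions, which the segment-level `srt` output does not give you directly. The following is a minimal sketch, not part of the original tutorial: `Word` is a hypothetical stand-in for the entries in `transcript.words` (anything with `word`, `start`, and `end` attributes works), and the `max_words` and `max_gap` defaults are arbitrary starting values.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for one word-level timestamp entry (times in seconds)."""
    word: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'


def words_to_cues(words, max_words=7, max_gap=0.8):
    """Group words into cues. A new cue starts after max_words words
    or when the pause before a word exceeds max_gap seconds."""
    cues, current = [], []
    for w in words:
        if current and (len(current) >= max_words or w.start - current[-1].end > max_gap):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)
    return [(c[0].start, c[-1].end, ' '.join(x.word.strip() for x in c)) for c in cues]


def cues_to_srt(cues) -> str:
    """Render (start, end, text) cues as an SRT document."""
    blocks = [
        f'{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}'
        for i, (start, end, text) in enumerate(cues, start=1)
    ]
    return '\n\n'.join(blocks) + '\n'
```

With a real response you would call `cues_to_srt(words_to_cues(transcript.words))` and write the result to a `.srt` file.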

Translation (Non-English to English)

    python
    with open('french_interview.mp3', 'rb') as f:
        translation = client.audio.translations.create(
            model='whisper-1',
            file=f,
            response_format='text'
        )
    print(translation)  # Always returns English
    

Local Whisper (Free, Private)

    bash
    pip install openai-whisper

    # Or faster-whisper, a reimplementation that is up to 4x faster
    pip install faster-whisper

    python
    # faster-whisper (recommended for local use)
    from faster_whisper import WhisperModel

    # Models: tiny, base, small, medium, large-v3
    model = WhisperModel('medium', device='cuda', compute_type='float16')
    # CPU: model = WhisperModel('base', device='cpu', compute_type='int8')

    segments, info = model.transcribe('audio.mp3', beam_size=5)
    print(f'Detected language: {info.language} ({info.language_probability:.0%})')

    # segments is a generator; transcription runs as you iterate
    for segment in segments:
        print(f'[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}')
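
The GPU and CPU settings above can be combined into a fallback: try the GPU configuration first and drop to the CPU one if loading fails. This is a rough sketch under assumptions, not something the library provides: `load_first_available` and `CONFIGS` are hypothetical names, and `factory` is expected to be `faster_whisper.WhisperModel` or any callable with the same shape.

```python
# Preference order taken from the tutorial: GPU float16 first, CPU int8 as fallback.
CONFIGS = [
    {'model_size': 'medium', 'device': 'cuda', 'compute_type': 'float16'},
    {'model_size': 'base', 'device': 'cpu', 'compute_type': 'int8'},
]


def load_first_available(factory, configs=CONFIGS):
    """Try each config in order and return (model, config) for the first that loads.

    `factory` is called as factory(model_size, device=..., compute_type=...).
    """
    errors = []
    for cfg in configs:
        try:
            model = factory(
                cfg['model_size'],
                device=cfg['device'],
                compute_type=cfg['compute_type'],
            )
            return model, cfg
        except Exception as exc:  # e.g. no GPU present or CUDA libraries missing
            errors.append(f"{cfg['device']}/{cfg['compute_type']}: {exc}")
    raise RuntimeError('No Whisper configuration could be loaded: ' + '; '.join(errors))
```

Typical use would be `model, cfg = load_first_available(WhisperModel)`, followed by the same `model.transcribe(...)` call as before. Catching a broad `Exception` is a deliberate simplification here, since the exact error raised depends on the installed backend.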

Real-Time Transcription

    python
    import os
    import tempfile
    import wave

    import pyaudio

    # Reuses `client` from the API Transcription example
    CHUNK = 1024
    FORMAT = pyaudio.paInt16  # 16-bit PCM; the wave module writes integer PCM, not float32
    CHANNELS = 1
    RATE = 16000
    RECORD_SECONDS = 5

    def record_and_transcribe():
        audio = pyaudio.PyAudio()
        sample_width = audio.get_sample_size(FORMAT)
        stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                            input=True, frames_per_buffer=CHUNK)
        print('Recording...')
        frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]
        stream.stop_stream()
        stream.close()
        audio.terminate()

        # Reserve a temp path, then write the captured frames as a WAV file
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            wav_path = f.name
        with wave.open(wav_path, 'wb') as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(sample_width)
            wf.setframerate(RATE)
            wf.writeframes(b''.join(frames))

        try:
            with open(wav_path, 'rb') as audio_file:
                result = client.audio.transcriptions.create(model='whisper-1', file=audio_file)
            return result.text
        finally:
            os.remove(wav_path)  # delete=False above, so clean up explicitly

    print(record_and_transcribe())
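
If you run the recorder in a loop to approximate continuous transcription, it helps to skip chunks that contain only silence so they are never uploaded. Here is a minimal sketch of an energy check, assuming the frames are 16-bit PCM (`pyaudio.paInt16`) in native byte order. The function names and the `threshold` default are hypothetical and need tuning for your microphone.

```python
import array
import math


def rms_int16(frame_bytes: bytes) -> float:
    """Root-mean-square amplitude of 16-bit PCM audio (native byte order)."""
    samples = array.array('h')
    # Drop a trailing odd byte so the buffer is a whole number of samples
    samples.frombytes(frame_bytes[: len(frame_bytes) // 2 * 2])
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(frame_bytes: bytes, threshold: float = 500.0) -> bool:
    """True when the chunk is quieter than `threshold` (an arbitrary starting value)."""
    return rms_int16(frame_bytes) < threshold
```

In the recording function you could call `is_silent(b''.join(frames))` before writing the WAV file and return an empty string when it is true.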

Meeting Transcription + AI Summary Pipeline

    python
    def transcribe_and_summarize(audio_path: str) -> dict:
        # Transcribe
        with open(audio_path, 'rb') as f:
            transcript = client.audio.transcriptions.create(
                model='whisper-1', file=f, response_format='verbose_json'
            )
        
        text = transcript.text
        
        # Summarize with GPT-4o
        summary = client.chat.completions.create(
            model='gpt-4o',
            messages=[{
                'role': 'user',
                'content': f'Summarize this meeting transcript. Include:\n'
                           f'1. Key decisions made\n'
                           f'2. Action items with owners\n'
                           f'3. Next steps\n\nTranscript:\n{text}'
            }]
        )
        
        return {
            'transcript': text,
            'duration': transcript.duration,
            'summary': summary.choices[0].message.content
        }
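
The pipeline sends the whole transcript in a single prompt, which may not fit for long meetings. One common workaround is to split the transcript, summarize each piece, and then summarize the summaries. The splitter below is a sketch of the first step only. `chunk_transcript` is a hypothetical helper, and the `max_chars` default is an arbitrary value, not a model limit.

```python
import re


def chunk_transcript(text: str, max_chars: int = 12000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split a single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) > max_chars and current:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then go through the same `client.chat.completions.create` call, with a final call that merges the partial summaries.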
    

Conclusion

Whisper is a strong general-purpose choice for speech-to-text. Use the API for convenience, or faster-whisper locally for privacy and lower cost. The meeting transcription and summary pipeline above is a working starting point; add error handling and a strategy for long recordings before relying on it in production.
