AI Multilingual Live Commentary and Subtitles for the World Cup (Whisper + Translation)

How do you turn one match's commentary into subtitles in dozens of languages in real time? Breaking down the ASR + translation + timeline-alignment pipeline

The World Cup is watched by billions worldwide, yet official commentary usually covers only a handful of languages. A Brazilian fan wants Portuguese commentary; a Japanese fan wants Japanese subtitles — exactly the gap AI speech technology can fill. This guide breaks down how to build a "commentary audio → multilingual real-time subtitles" pipeline, and where the real engineering challenges lie.

First, the pipeline has three stages: speech recognition (ASR) → machine translation → timeline-aligned rendering. Sounds straightforward, but each stage has traps in the "real-time + sports" setting.

Stage 1: Speech recognition (ASR)

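Turning the commentator's voice into text: Whisper is one of the most reliable open-source choices. Its multilingual recognition is strong, and it ships with a translation mode (which translates into English only, so other target languages still need a separate translation stage).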

python
import whisper

model = whisper.load_model("large-v3")

# Whisper recognizes and segments directly, returning timestamped segments
result = model.transcribe("commentary.wav", language="pt")  # Portuguese commentary
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")

But World Cup commentary is hell-mode for ASR, for several reasons:

  • Huge background noise: the roar of tens of thousands of fans sits on top of the commentary, terrible signal-to-noise. The moment of a goal, when the commentator shouts and the stadium erupts, is exactly when you most need accuracy and least can get it.
  • Fast, emotional speech: commentators rattle off names machine-gun style, with heavy slurring and elision.
  • Tons of proper nouns: player names, team names, tactical terms — many aren't in the general vocabulary, and Whisper tends to mis-recognize them as similar-sounding common words.

In practice, denoising the audio first and feeding Whisper a prompt (putting the match's roster into initial_prompt) noticeably improves accuracy. For full Whisper API usage, see OpenAI Whisper API speech-to-text; for in-depth multilingual recognition, see multilingual ASR.
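
The roster-prompt idea can be sketched as follows. This is a minimal illustration rather than a definitive setup: `build_initial_prompt`, `transcribe_with_roster`, the placeholder names, and the `max_chars` cap are all assumptions made for the example. `initial_prompt` itself is a real parameter of `model.transcribe` in openai-whisper. Whisper only conditions on a limited amount of prompt text, so the helper trims the list instead of passing everything.

```python
def build_initial_prompt(team_a, team_b, players, max_chars=600):
    """Build a short prompt that names the teams and players.

    max_chars is a rough, assumed cap: Whisper only uses a limited
    number of prompt tokens, so an over-long roster is cut short here.
    """
    header = f"Football match commentary: {team_a} vs {team_b}. Players: "
    kept = []
    length = len(header)
    for name in players:
        extra = len(name) + 2  # the name plus a ", " separator
        if length + extra > max_chars:
            break
        kept.append(name)
        length += extra
    return header + ", ".join(kept) + "."


def transcribe_with_roster(path, prompt, language="pt"):
    """Run Whisper with the roster prompt (requires openai-whisper)."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model("large-v3")
    return model.transcribe(path, language=language, initial_prompt=prompt)


# Placeholder names, for illustration only
prompt = build_initial_prompt("Team A", "Team B", ["Silva", "Tanaka"])
print(prompt)
```

Putting the names the commentator is likely to say at the front of the list matters more than completeness, since anything past the cap is dropped.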

Stage 2: Real-time is the biggest constraint

Offline subtitling is easy; the hard part is "real-time." In a live setting you can't wait for a full segment before processing; you have to produce text as you listen. This requires streaming processing:

  • Slice the audio into small windows (say 2-5 seconds), feeding them rolling to the model.
  • Too short a window loses context and cuts sentences in half; too long a window and latency climbs. It's a trade-off.
  • A common approach is a sliding window with overlap: leave a little overlap between adjacent windows to avoid dropping words at the boundary, then dedupe and stitch.

python
# Core idea of streaming: sliding window + overlap
WINDOW = 5.0   # seconds per window
OVERLAP = 1.0  # seconds of overlap, to prevent sentences being cut off

# Production uses faster-whisper or WhisperLive,
# implementations built for fast or streaming inference
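
To make the "sliding window, then dedupe and stitch" step concrete, here is a minimal sketch in plain Python. The function names `window_bounds` and `stitch` are hypothetical, and the stitching rule (drop the longest run of words that ends the previous text and begins the new text) is one simple assumption. Real ASR output for the overlapping audio often differs slightly between windows, so production systems usually need fuzzier matching or timestamp-based merging.

```python
def window_bounds(total_seconds, window=5.0, overlap=1.0):
    """Yield (start, end) times for overlapping windows covering the audio."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + window, total_seconds))
        if start + window >= total_seconds:
            break  # this window already reached the end of the audio
        start += step


def stitch(previous_words, new_words, max_overlap=10):
    """Append new_words to previous_words, dropping words repeated at the seam.

    Looks for the longest suffix of previous_words that equals a prefix
    of new_words (case-insensitive) and skips that prefix.
    """
    limit = min(max_overlap, len(previous_words), len(new_words))
    for k in range(limit, 0, -1):
        tail = [w.lower() for w in previous_words[-k:]]
        head = [w.lower() for w in new_words[:k]]
        if tail == head:
            return previous_words + new_words[k:]
    return previous_words + new_words


print(list(window_bounds(12.0)))
print(" ".join(stitch("he shoots and".split(), "and he scores".split())))
```

Each `(start, end)` pair would be used to cut a slice of the incoming audio buffer before handing it to the recognizer.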

Budget your latency: ASR takes a few hundred ms, translation a few hundred ms, plus rendering. Keep total latency within 2-3 seconds for viewers to accept it; beyond that, subtitles drift out of sync with the picture. For more on real-time transcription, see real-time AI transcription.
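
A latency budget like this is easy to keep honest with a small checker. The sketch below is illustrative only: the `LatencyBudget` class and its field names are hypothetical, the numbers are placeholders rather than measurements, and the `buffering_s` term (time audio spends waiting in the buffer before ASR starts) is an assumption of this sketch, not something the pipeline description specifies.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    buffering_s: float   # assumed: wait for enough audio before ASR starts
    asr_s: float         # speech recognition time
    translate_s: float   # translation time
    render_s: float      # subtitle rendering time
    limit_s: float = 3.0  # upper end of the 2-3 second target

    def total(self):
        """Sum of all stage latencies, in seconds."""
        return self.buffering_s + self.asr_s + self.translate_s + self.render_s

    def within_limit(self):
        """True if the end-to-end latency fits the budget."""
        return self.total() <= self.limit_s


# Placeholder figures, not measurements
budget = LatencyBudget(buffering_s=1.0, asr_s=0.5, translate_s=0.5, render_s=0.25)
print(budget.total(), budget.within_limit())
```

Logging real per-stage timings into a structure like this makes it obvious which stage to shrink when the total creeps past the limit.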

Stage 3: Translation: sports terminology is key

The recognized text must be translated into the target language. General translation models often produce embarrassing literal translations of sports commentary:

  • "He scored a brace" translated literally into another language can become a nonsensical sentence — these are football idioms with fixed renderings.
  • "Offside," "free kick," "penalty shootout" all have established translations in each language and can't be translated word-for-word.
  • Player nicknames and localized monikers need a dedicated glossary even more.

The solution is to build a sports-terminology glossary and constrain translation with it. If translating with an LLM, put the glossary into the system prompt and explicitly require the model to follow it:

python
from openai import OpenAI

client = OpenAI()

GLOSSARY = """Fixed football term translations:
brace = scored two goals in one match
offside = (target-language equivalent)
penalty shootout = (target-language equivalent)
free kick = (target-language equivalent)
"""

def translate(text, target_lang):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"You are a sports-commentary translator. Translate "
                        f"strictly per the glossary.\n{GLOSSARY}"},
            {"role": "user", "content": f"Translate into {target_lang}: {text}"},
        ],
        temperature=0.3,
    ).choices[0].message.content

The upside of LLM translation is that it understands context and follows the glossary; the downside is higher latency and cost than a dedicated translation API. In real-time settings, a practical hybrid is to use a dedicated translation API for high-frequency generic lines (fast) and the LLM for key lines (accurate).
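
One way to picture the hybrid routing is a small dispatcher that decides per line which translator to call. This is a sketch under stated assumptions: `KEY_TERMS`, `is_key_line`, and `translate_hybrid` are hypothetical names, the keyword list is a stand-in for whatever "key line" signal a real system would use, and both translators are passed in as plain callables so no particular translation API is assumed.

```python
# Hypothetical trigger words for "key" lines; tune per source language
KEY_TERMS = {"goal", "penalty", "offside", "red card", "brace"}


def is_key_line(text, key_terms=KEY_TERMS):
    """Return True if the line mentions any key term (substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in key_terms)


def translate_hybrid(text, target_lang, fast_translate, llm_translate):
    """Send key lines to the LLM path and everything else to the fast path."""
    if is_key_line(text):
        return llm_translate(text, target_lang)
    return fast_translate(text, target_lang)


# Stub translators standing in for a real translation API and an LLM call
fast = lambda text, lang: f"fast:{lang}:{text}"
llm = lambda text, lang: f"llm:{lang}:{text}"

print(translate_hybrid("What a goal!", "ja", fast, llm))
print(translate_hybrid("The ball goes out for a throw-in.", "ja", fast, llm))
```

Substring matching is deliberately crude (it also fires on "goalkeeper"); erring toward the slower, more accurate path is usually the safer mistake for a key-line detector.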

Timeline alignment: don't let subtitles drift

The last step is often overlooked: subtitles must sync with the picture. Whisper's timestamps are relative to the audio start; you need to align them to the video timeline. In a live stream, audio and video may be offset by tens of milliseconds, and if that offset accumulates, subtitles drift further and further out of sync. Production systems periodically re-align using audio feature points.
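
The mapping from audio-relative timestamps to the video timeline is simple arithmetic, sketched below. The names `to_stream_time` and `OffsetTracker` are hypothetical, and the sketch assumes the audio/video offset is measured by some external sync step (for example the periodic re-alignment mentioned above) and simply passed in.

```python
class OffsetTracker:
    """Holds the most recently measured audio-vs-video offset, in seconds."""

    def __init__(self):
        self.offset = 0.0

    def resync(self, measured_offset):
        """Replace the stored offset after a periodic re-alignment."""
        self.offset = measured_offset


def to_stream_time(seg_start, seg_end, window_start, av_offset=0.0):
    """Map segment times (relative to the audio window) onto the video timeline.

    window_start: where this audio window begins on the stream clock.
    av_offset: measured audio-vs-video offset (assumed to come from elsewhere).
    """
    shift = window_start + av_offset
    return (seg_start + shift, seg_end + shift)


tracker = OffsetTracker()
tracker.resync(0.25)  # placeholder value, not a measurement
print(to_stream_time(0.5, 2.0, window_start=8.0, av_offset=tracker.offset))
```

Because the offset is replaced at every re-sync instead of being added up, small measurement errors do not compound over a 90-minute match.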

Putting it together

The full pipeline: live audio stream → streaming ASR (Whisper) → glossary-constrained translation → multilingual subtitle tracks → aligned rendering. Run one translation track per target language in parallel, and you output subtitles in dozens of languages simultaneously.
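
Running one translation track per language in parallel can be sketched with a thread pool, which suits this workload because each translation call mostly waits on the network. `translate_all` is a hypothetical helper, and `translate_fn` is any callable taking `(text, target_lang)`; a stub is used here so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(text, target_langs, translate_fn, max_workers=8):
    """Translate one line into every target language concurrently.

    Returns a dict mapping each language code to its translated text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            lang: pool.submit(translate_fn, text, lang) for lang in target_langs
        }
        return {lang: future.result() for lang, future in futures.items()}


# Stub translator for illustration; swap in a real translate function
def fake_translate(text, target_lang):
    return f"[{target_lang}] {text}"


print(translate_all("Goal!", ["pt", "ja", "es"], fake_translate))
```

With this shape, the per-line latency is roughly that of the slowest language rather than the sum of all of them.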

This tech isn't just for subtitles: feed the recognized commentary text to an LLM and you can auto-generate multilingual match reports, covered in event content automation. For the big picture of AI at the World Cup, see the AI and 2026 World Cup roundup.

From a practical standpoint, get the offline version working first: take one commentary clip, run Whisper transcription plus LLM glossary translation, and output a bilingual subtitle file. Once that's smooth, tackle the streaming real-time part.
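
For the offline first version, the last step of writing a bilingual subtitle file might look like the sketch below. It assumes SRT as the output format (the text does not name one), and `srt_timestamp` and `to_bilingual_srt` are hypothetical helpers. The `segments` argument has the same shape as Whisper's `result["segments"]` (dicts with `start`, `end`, and `text`), and `translations` is a parallel list of already-translated strings.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_bilingual_srt(segments, translations):
    """Build SRT text with each original line above its translation."""
    cues = []
    for i, (seg, translated) in enumerate(zip(segments, translations), start=1):
        cues.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n{translated.strip()}\n"
        )
    return "\n".join(cues)  # a blank line separates consecutive cues


segments = [{"start": 0.0, "end": 1.5, "text": " Gol! "}]
print(to_bilingual_srt(segments, ["Goal!"]))
```

Writing the returned string to a `.srt` file with UTF-8 encoding is enough for most players to load it alongside the clip.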
