AI Multilingual Live Commentary and Subtitles for the World Cup (Whisper + Translation)

How do you turn one match's commentary into subtitles in dozens of languages in real time? Breaking down the ASR + translation + timeline-alignment pipeline


The World Cup is watched by billions worldwide, yet official commentary usually covers only a handful of languages. A Brazilian fan wants Portuguese commentary; a Japanese fan wants Japanese subtitles — exactly the gap AI speech technology can fill. This guide breaks down how to build a "commentary audio → multilingual real-time subtitles" pipeline, and where the real engineering challenges lie.

First, the pipeline has three stages: speech recognition (ASR) → machine translation → timeline-aligned rendering. Sounds straightforward, but each stage has traps in the "real-time + sports" setting.

Stage 1: Speech recognition (ASR)

Turning the commentator's voice into text: Whisper is a widely used open-source choice. Its multilingual recognition is strong, and it ships with a built-in translation task (which translates into English only).

python
import whisper
model = whisper.load_model("large-v3")
# Whisper recognizes and segments directly, returning timestamped segments
result = model.transcribe("commentary.wav", language="pt")  # Portuguese commentary
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")

But World Cup commentary is hell-mode for ASR, for several reasons:

Huge background noise: the roar of tens of thousands of fans sits on top of the commentary, so the signal-to-noise ratio is terrible. The moment of a goal, when the commentator shouts and the stadium erupts, is exactly when you most need accuracy and can least get it.

Fast, emotional speech: commentators rattle off names machine-gun style, with heavy slurring and elision.

Tons of proper nouns: player names, team names, tactical terms — many aren't in the general vocabulary, and Whisper tends to mis-recognize them as similar-sounding common words.

In practice, two things tend to improve accuracy: denoising the audio before recognition, and priming Whisper with a prompt (stuffing the match's roster into initial_prompt). For full Whisper API usage, see OpenAI Whisper API speech-to-text; for in-depth multilingual recognition, see multilingual ASR.
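
The roster-prompt idea can be sketched as follows. This is a minimal sketch, not a tuned recipe: `build_roster_prompt` and `transcribe_with_roster` are hypothetical helper names, the prefix wording and the `max_chars` cap are assumptions (Whisper only uses a limited amount of prompt text, so a crude character cap stands in for a real token count), and the names in the usage note are placeholders.

```python
def build_roster_prompt(teams, players, max_chars=600):
    """Build a short priming string of team and player names for initial_prompt.

    max_chars is an assumed, crude stand-in for Whisper's prompt length limit.
    """
    prefix = "Football commentary. Names: "
    kept = []
    # dict.fromkeys removes duplicate names while keeping their order
    for name in dict.fromkeys(list(teams) + list(players)):
        candidate = prefix + ", ".join(kept + [name]) + "."
        if len(candidate) > max_chars:
            break  # stop before the prompt grows past the cap
        kept.append(name)
    return prefix + ", ".join(kept) + "."


def transcribe_with_roster(audio_path, prompt, language="pt"):
    """Run Whisper with the roster prompt (needs the openai-whisper package)."""
    import whisper  # imported lazily so the prompt helper works without it

    model = whisper.load_model("large-v3")
    return model.transcribe(audio_path, language=language, initial_prompt=prompt)
```

For example, `build_roster_prompt(["Brasil", "Argentina"], ["Vinícius Júnior", "Messi"])` yields one sentence listing those names, which is then passed as `initial_prompt`. Writing the prefix in the commentary language may work better than English; treat that as something to test.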

Stage 2: Real-time is the biggest constraint

Offline subtitling is easy; the hard part is "real-time." In a live setting you can't wait for a full segment before processing — you have to produce text as you listen. This requires streaming processing:

Slice the audio into small windows (say 2-5 seconds) and feed them to the model on a rolling basis.

Windows that are too short lose context and cut sentences in half; windows that are too long push latency up. It's a trade-off.

A common approach is a sliding window with overlap: leave a little overlap between adjacent windows to avoid dropping words at the boundary, then dedupe and stitch.

python
# Core idea of streaming: sliding window + overlap
WINDOW = 5.0   # seconds
OVERLAP = 1.0  # seconds of overlap, to prevent sentences being cut off
# Production setups usually build on projects such as faster-whisper or WhisperLive
# rather than hand-rolling the streaming logic
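
To make the "overlap, then dedupe and stitch" step concrete, here is a minimal sketch. Both function names are hypothetical, and the stitching rule (drop the longest run of words that ends the previous transcript and starts the new one) is one simple assumption about how to dedupe. A real system would more likely compare timestamps, since ASR output in the overlap rarely matches word for word.

```python
def window_bounds(total_seconds, window=5.0, overlap=1.0):
    """Return (start, end) pairs covering the audio with overlapping windows."""
    step = window - overlap  # how far each new window advances
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the audio
        start += step
    return bounds


def stitch(prev_words, new_words, max_overlap=8):
    """Append new_words to prev_words, dropping words repeated in the overlap."""
    limit = min(max_overlap, len(prev_words), len(new_words))
    for k in range(limit, 0, -1):  # try the longest possible overlap first
        tail = [w.lower() for w in prev_words[-k:]]
        head = [w.lower() for w in new_words[:k]]
        if tail == head:
            return prev_words + new_words[k:]
    return prev_words + new_words  # no repeated words found at the boundary
```

With the defaults, 12 seconds of audio is split into windows starting at 0, 4, and 8 seconds, so each boundary is covered twice.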

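Budget your latency: ASR takes a few hundred ms, translation a few hundred more, plus rendering. Keep the total within 2-3 seconds for viewers to accept it; beyond that, subtitles are visibly out of sync with the picture. In a simple chunked setup the window length counts against this budget too, because a window has to fill before it can be transcribed. For more on real-time transcription, see real-time AI transcription.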
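
The budget is simple arithmetic, sketched below. The function name and all the numbers in the usage note are illustrative assumptions, not measurements. The worst-case buffering delay is taken to be one full window, which holds for a setup that waits for each window to fill before transcribing it.

```python
def latency_budget(window_ms, asr_ms, translate_ms, render_ms, limit_ms=3000):
    """Sum worst-case per-stage delays (in ms) and compare against the limit."""
    total = window_ms + asr_ms + translate_ms + render_ms
    return total, total <= limit_ms
```

With assumed stage costs of 300 ms for ASR, 300 ms for translation, and 100 ms for rendering, a 2-second window totals 2700 ms and fits a 3-second limit, while a 5-second window totals 5700 ms and does not. That is one reason to prefer short windows, or streaming implementations that emit partial results.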

Stage 3: Translation — sports terminology is key

The recognized text must be translated into the target language. General translation models often produce embarrassing literal translations of sports commentary:

"He scored a brace" translated literally into another language can become a nonsensical sentence — these are football idioms with fixed renderings.

"Offside," "free kick," "penalty shootout" all have established translations in each language and can't be translated word-for-word.

Player nicknames and localized monikers need a dedicated glossary even more.

The solution is to build a sports-terminology glossary and constrain translation with it. If translating with an LLM, stuff the glossary into the system prompt and explicitly require following it:

python
from openai import OpenAI
client = OpenAI()
GLOSSARY = """Fixed football term translations:
brace = scored two goals in one match
offside = (target-language equivalent)
penalty shootout = (target-language equivalent)
free kick = (target-language equivalent)
"""


def translate(text, target_lang):
    """Translate one commentary line, constrained by GLOSSARY."""
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You are a sports-commentary translator. Translate "
                                           f"strictly per the glossary.\n{GLOSSARY}"},
            {"role": "user", "content": f"Translate into {target_lang}: {text}"},
        ],
        temperature=0.3,
    ).choices[0].message.content
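
Asking the model to follow the glossary does not guarantee that it will, so a cheap check after translation can help. The sketch below is an assumption layered on the approach above, not part of it: `glossary_violations` is a hypothetical helper, and the Spanish renderings are illustrative examples of a per-language glossary.

```python
import re

# Illustrative English -> Spanish glossary (example entries, not a complete list)
GLOSSARY_ES = {
    "brace": "doblete",
    "offside": "fuera de juego",
    "free kick": "tiro libre",
    "penalty shootout": "tanda de penaltis",
}


def glossary_violations(source, translation, glossary):
    """Return source terms whose required rendering is missing from the translation."""
    missing = []
    lowered = translation.lower()
    for term, rendering in glossary.items():
        # Whole-word match, so "brace" does not fire on "embrace"
        if re.search(rf"\b{re.escape(term)}\b", source, re.IGNORECASE):
            if rendering.lower() not in lowered:
                missing.append(term)
    return missing
```

A non-empty result could trigger a retry or a fallback translation; which one is a design choice left open here.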

The upside of LLM translation is it understands context and follows the glossary; the downside is higher latency and cost than a dedicated translation API. In real-time settings, use a dedicated translation API for high-frequency generic lines (fast), and the LLM for key lines (accurate) — a practical hybrid compromise.
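
One way to sketch the hybrid routing is shown below. Which lines count as "key" is an assumption: this version uses a small hypothetical trigger list (`KEY_PATTERN`), and `route_line` only returns a label. The actual calls to a translation API or an LLM are left out.

```python
import re

# Hypothetical trigger phrases that mark a line as worth the slower LLM path
KEY_PATTERN = re.compile(r"\b(goal|penalty|red card|offside)\b", re.IGNORECASE)


def route_line(line):
    """Return "llm" for key moments and "fast_mt" for routine commentary."""
    return "llm" if KEY_PATTERN.search(line) else "fast_mt"
```

The word boundaries keep "goalkeeper" from matching "goal", so routine build-up play stays on the fast path.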

Timeline alignment: don't let subtitles drift

The last step is often overlooked: subtitles must sync with the picture. Whisper's timestamps are relative to the start of the audio it was given, so you need to map them onto the video timeline. In a live stream, audio and video may be offset by tens of milliseconds, and if small errors accumulate, subtitles drift further and further. Production systems periodically re-align using audio feature points.
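
The mapping itself is an addition of offsets, sketched below under stated assumptions. Segments are dicts with `start`, `end`, and `text` keys (the shape Whisper returns), `window_start` is where the transcribed window begins on the stream timeline, and `av_offset` is the current audio-to-video offset. How that offset is measured is out of scope, and the smoothing in `update_offset` is one hypothetical way to apply periodic re-alignment without making subtitles jump.

```python
def to_video_time(segment, window_start, av_offset=0.0):
    """Shift a window-relative segment onto the video timeline (seconds)."""
    shift = window_start + av_offset
    return {**segment, "start": segment["start"] + shift, "end": segment["end"] + shift}


def update_offset(current, measured, alpha=0.2):
    """Move the running offset part of the way toward a new measurement."""
    return current + alpha * (measured - current)
```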

Putting it together

The full pipeline: live audio stream → streaming ASR (Whisper) → glossary-constrained translation → multilingual subtitle tracks → aligned rendering. Run one translation track per target language in parallel, and you output subtitles in dozens of languages simultaneously.
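
Running one translation per target language in parallel can be sketched with a thread pool, which suits this case because each translation is a network call that mostly waits. `fan_out` is a hypothetical helper, and `translate_fn` stands for any callable that takes `(text, target_lang)`, such as the `translate` function above.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(text, target_langs, translate_fn, max_workers=8):
    """Translate one line into every target language concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all translations first so they run at the same time
        futures = {lang: pool.submit(translate_fn, text, lang) for lang in target_langs}
        # Then collect the results, keyed by language code
        return {lang: future.result() for lang, future in futures.items()}
```

The slowest language sets the latency for the whole line, so a per-call timeout is worth considering in a live setting.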

This tech isn't just for subtitles — feed the recognized commentary text to an LLM and you can auto-generate multilingual match reports, covered in event content automation. For the big picture of AI at the World Cup, see the AI and 2026 World Cup roundup.

From a practice standpoint, get the offline version working first — one commentary clip, Whisper transcription + LLM glossary translation, output a bilingual subtitle file. Once that's smooth, tackle the streaming real-time part.
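
For the offline version, the final step of writing a bilingual subtitle file might look like this. It is a minimal sketch that assumes segments in Whisper's shape (dicts with `start`, `end`, and `text`) and one translated string per segment. Both function names are hypothetical, and the output uses the common SRT layout.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def bilingual_srt(segments, translations):
    """Build SRT text with the original line above its translation."""
    if len(segments) != len(translations):
        raise ValueError("need exactly one translation per segment")
    blocks = []
    for index, (seg, translated) in enumerate(zip(segments, translations), start=1):
        timing = f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text'].strip()}\n{translated.strip()}\n")
    return "\n".join(blocks)  # blank line between cues
```

Writing the returned string to a `.srt` file with UTF-8 encoding gives a file most players can load alongside the clip.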

