AI Multilingual Live Commentary and Subtitles for the World Cup (Whisper + Translation)
How do you turn one match's commentary into subtitles in dozens of languages in real time? Breaking down the ASR + translation + timeline-alignment pipeline
The World Cup is watched by billions worldwide, yet official commentary usually covers only a handful of languages. A Brazilian fan wants Portuguese commentary; a Japanese fan wants Japanese subtitles — exactly the gap AI speech technology can fill. This guide breaks down how to build a "commentary audio → multilingual real-time subtitles" pipeline, and where the real engineering challenges lie.
First, the pipeline has three stages: speech recognition (ASR) → machine translation → timeline-aligned rendering. Sounds straightforward, but each stage has traps in the "real-time + sports" setting.
Stage 1: Speech recognition (ASR)
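Turning the commentator's voice into text: Whisper is the most reliable open-source choice. Its multilingual recognition is strong, and it ships with a translation mode, though that mode only translates into English, so other target languages still need a separate translation stage.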
python
import whisper

model = whisper.load_model("large-v3")

# Whisper recognizes and segments directly, returning timestamped segments
result = model.transcribe("commentary.wav", language="pt")  # Portuguese commentary
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
But World Cup commentary is hell-mode for ASR. Background noise and player names are two of the main culprits, and they are what the fixes below target.
In practice, denoising preprocessing + feeding Whisper a prompt (stuffing the match's roster into initial_prompt) noticeably improves accuracy. For full Whisper API usage, see OpenAI Whisper API speech-to-text; for in-depth multilingual recognition, see multilingual ASR.
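A minimal sketch of the roster-prompt idea, assuming the roster is available as a list of names. The helper names `build_initial_prompt` and `transcribe_with_roster`, the prompt wording, and the 600-character budget are all illustrative assumptions; the only Whisper feature relied on is the `initial_prompt` argument of `transcribe`. Denoising is left out here.

```python
def build_initial_prompt(roster, max_chars=600):
    """Join player names into a prompt string no longer than max_chars.

    max_chars is an assumed budget: Whisper only conditions on a limited
    amount of prompt text, so an oversized roster would be partly ignored.
    """
    prefix = "Football commentary. Players: "
    names = []
    length = len(prefix)
    for name in roster:
        extra = len(name) + (2 if names else 0)  # 2 for the ", " separator
        if length + extra + 1 > max_chars:       # +1 for the final "."
            break
        names.append(name)
        length += extra
    return prefix + ", ".join(names) + "."


def transcribe_with_roster(model, audio_path, roster, language="pt"):
    """Run Whisper with the roster as initial_prompt.

    `model` is expected to be the object returned by whisper.load_model().
    """
    return model.transcribe(
        audio_path,
        language=language,
        initial_prompt=build_initial_prompt(roster),
    )


# Example roster (names chosen for illustration only)
prompt = build_initial_prompt(["Vinícius Júnior", "Rodrygo", "Marquinhos"])
print(prompt)
```

Putting the names in the prompt nudges the decoder toward the correct spellings; it does not guarantee them, so a post-correction pass against the same roster may still be worthwhile.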
Stage 2: Real-time is the biggest constraint
Offline subtitling is easy; the hard part is "real-time." In a live setting you can't wait for a full segment before processing — you have to produce text as you listen. This requires streaming processing:
python
# Core idea of streaming: sliding window + overlap
WINDOW = 5.0   # seconds
OVERLAP = 1.0  # seconds of overlap, to prevent sentences being cut off
# Production setups use faster-whisper or WhisperLive, which are built for low-latency use
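To make the window-plus-overlap idea concrete, here is a small sketch that only computes the window boundaries and filters repeated segments. It is not how faster-whisper or WhisperLive work internally; `window_bounds` and `drop_overlap_duplicates` are hypothetical helpers, and the end-time filter is a deliberately simple heuristic.

```python
WINDOW = 5.0   # seconds of audio per ASR call
OVERLAP = 1.0  # seconds shared between consecutive windows


def window_bounds(total_seconds, window=WINDOW, overlap=OVERLAP):
    """Yield (start, end) times of overlapping windows covering the audio."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    hop = window - overlap
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + window, total_seconds))
        if start + window >= total_seconds:
            break
        start += hop


def drop_overlap_duplicates(segments, last_emitted_end):
    """Keep only segments that end after what has already been shown.

    `segments` are dicts with absolute 'start'/'end' seconds, i.e. Whisper's
    window-relative timestamps with the window start already added.
    """
    return [seg for seg in segments if seg["end"] > last_emitted_end]


print(list(window_bounds(12.0)))
# [(0.0, 5.0), (4.0, 9.0), (8.0, 12.0)]
```

Each window starts one hop (window minus overlap) after the previous one, so the last second of every window is heard twice. The filter then discards segments that fall entirely inside audio that has already been subtitled.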
Budget your latency: ASR a few hundred ms + translation a few hundred ms + rendering — keep total latency within 2-3 seconds for viewers to accept it. Beyond that, subtitles drift out of sync with the picture. For more on real-time transcription, see real-time AI transcription.
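The budget can be written down as plain arithmetic. In this sketch all numbers are placeholders rather than measurements, and the `buffering_ms` term (time spent waiting for enough new audio before ASR can run) is an assumption added here, since it also counts toward what the viewer experiences.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    buffering_ms: float    # waiting for enough new audio to run ASR (assumed term)
    asr_ms: float
    translation_ms: float
    rendering_ms: float

    def total_ms(self):
        return (self.buffering_ms + self.asr_ms
                + self.translation_ms + self.rendering_ms)

    def within(self, limit_ms):
        return self.total_ms() <= limit_ms


# Placeholder figures, to be replaced with measured values
budget = LatencyBudget(buffering_ms=1000, asr_ms=400, translation_ms=300, rendering_ms=100)
print(budget.total_ms(), budget.within(3000))
```

Tracking the stages separately makes it easier to see which one to attack first when the total creeps past the limit.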
Stage 3: Translation — sports terminology is key
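The recognized text must be translated into the target language. General translation models often produce embarrassing literal translations of sports commentary, because football jargon such as "brace" (two goals in one match) does not mean what its everyday sense suggests.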
The solution is to build a sports-terminology glossary and constrain translation with it. If translating with an LLM, stuff the glossary into the system prompt and explicitly require following it:
python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GLOSSARY = """Fixed football term translations:
brace = scored two goals in one match
offside = (target-language equivalent)
penalty shootout = (target-language equivalent)
free kick = (target-language equivalent)
"""


def translate(text, target_lang):
    """Translate one commentary line, constrained by the glossary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a sports-commentary translator. Translate "
                                          f"strictly per the glossary.\n{GLOSSARY}"},
            {"role": "user", "content": f"Translate into {target_lang}: {text}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content
The upside of LLM translation is that it understands context and follows the glossary; the downside is higher latency and cost than a dedicated translation API. In real-time settings, a practical hybrid compromise is to use a dedicated translation API for high-frequency generic lines (fast) and the LLM for key lines (accurate).
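One way to sketch that hybrid split is a small router in front of the two translators. Everything here is an assumption for illustration: the `KEY_TERMS` list, the substring heuristic for what counts as a "key line", and the two callables passed in as `fast_translate` and `llm_translate`.

```python
KEY_TERMS = ("goal", "penalty", "red card", "offside")  # hypothetical trigger words


def is_key_line(text, key_terms=KEY_TERMS):
    """Crude heuristic: a line is 'key' if it mentions any trigger word."""
    lowered = text.lower()
    return any(term in lowered for term in key_terms)


def translate_hybrid(text, target_lang, fast_translate, llm_translate):
    """Route key lines to the LLM and everything else to the fast API."""
    if is_key_line(text):
        return llm_translate(text, target_lang)
    return fast_translate(text, target_lang)


# Stand-in translators so the routing can be exercised offline
def fake_fast(text, lang):
    return f"fast[{lang}]: {text}"


def fake_llm(text, lang):
    return f"llm[{lang}]: {text}"


print(translate_hybrid("He passes it back to the keeper.", "ja", fake_fast, fake_llm))
print(translate_hybrid("PENALTY! The referee points to the spot.", "ja", fake_fast, fake_llm))
```

Substring matching is blunt (for instance, "goalkeeper" also contains "goal"), so a real router would likely need a better signal, but the shape of the decision stays the same.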
Timeline alignment: don't let subtitles drift
The last step is often overlooked: subtitles must stay in sync with the picture. Whisper's timestamps are relative to the start of the audio, so they have to be mapped onto the video timeline. In a live stream the audio and video can be offset by tens of milliseconds, and if that offset is left to accumulate, the subtitles drift further and further from the picture. Production systems periodically re-align using audio feature points.
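The mapping from audio time to video time can be sketched as an offset that is updated at each re-alignment. How the offset is measured (the audio feature matching itself) is out of scope here; `TimelineAligner` and its methods are hypothetical names, and the offsets in the usage lines are made-up values.

```python
import bisect


class TimelineAligner:
    """Map audio-relative timestamps onto the video timeline.

    Each sync point is (audio_time, offset), meaning: from audio_time onward,
    video_time = audio_time + offset, until a newer sync point replaces it.
    """

    def __init__(self, initial_offset=0.0):
        self._times = [0.0]
        self._offsets = [initial_offset]

    def add_sync_point(self, audio_time, measured_offset):
        """Record a freshly measured audio-to-video offset."""
        if audio_time < self._times[-1]:
            raise ValueError("sync points must be added in time order")
        self._times.append(audio_time)
        self._offsets.append(measured_offset)

    def to_video_time(self, audio_time):
        """Apply the most recent offset measured at or before audio_time."""
        index = bisect.bisect_right(self._times, audio_time) - 1
        return audio_time + self._offsets[max(index, 0)]


aligner = TimelineAligner(initial_offset=0.04)  # 40 ms at stream start
aligner.add_sync_point(60.0, 0.12)              # re-measured after one minute
print(f"{aligner.to_video_time(10.0):.3f}")
print(f"{aligner.to_video_time(75.0):.3f}")
```

Because every subtitle is shifted by the latest measured offset rather than the initial one, an error introduced early in the stream does not carry forward indefinitely.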
Putting it together
The full pipeline: live audio stream → streaming ASR (Whisper) → glossary-constrained translation → multilingual subtitle tracks → aligned rendering. Run one translation track per target language in parallel, and you output subtitles in dozens of languages simultaneously.
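A sketch of the per-language fan-out using a thread pool from the standard library. `translate_all` and `fake_translate` are illustrative names; in a real pipeline `translate_fn` would be whichever translator is in use, and threads are assumed to be adequate because the calls are network-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(text, target_langs, translate_fn, max_workers=8):
    """Translate one line into every target language concurrently.

    Returns {language: translated_text}.
    """
    if not target_langs:
        return {}
    workers = min(max_workers, len(target_langs))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {lang: pool.submit(translate_fn, text, lang) for lang in target_langs}
        return {lang: future.result() for lang, future in futures.items()}


def fake_translate(text, lang):
    # Stand-in for a real translation call, so the sketch runs offline.
    return f"[{lang}] {text}"


print(translate_all("What a goal!", ["ja", "es", "fr"], fake_translate))
```

With this layout the time to produce all tracks for a line is roughly that of the slowest single translation, not the sum of all of them, as long as the worker count and the translation service's rate limits allow it.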
This tech isn't just for subtitles — feed the recognized commentary text to an LLM and you can auto-generate multilingual match reports, covered in event content automation. For the big picture of AI at the World Cup, see the AI and 2026 World Cup roundup.
From a practice standpoint, get the offline version working first — one commentary clip, Whisper transcription + LLM glossary translation, output a bilingual subtitle file. Once that's smooth, tackle the streaming real-time part.
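For that offline first version, the output step might look like the sketch below: it takes timestamped segments in the shape Whisper returns (`start`, `end`, `text`) and writes SRT text with the original line above its translation. The function names are hypothetical, and `translate_fn` stands in for whatever translator is used.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_bilingual_srt(segments, translate_fn, target_lang):
    """Build SRT text with the original line above its translation."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        original = seg["text"].strip()
        translated = translate_fn(original, target_lang)
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{original}\n{translated}\n"
        )
    return "\n".join(blocks)


# Hand-written sample segment and a fixed translation, for illustration only
sample = [{"start": 0.0, "end": 2.5, "text": " Gol do Brasil!"}]
print(to_bilingual_srt(sample, lambda text, lang: "Goal for Brazil!", "en"))
```

Writing the returned string to a `.srt` file gives something most video players can load directly, which makes it easy to eyeball transcription and translation quality before moving on to streaming.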