Query World Cup Data in Plain Language: Text-to-SQL Hands-On (2026)

"Which teams had over 60% possession but still lost?" — turning that sentence into SQL automatically, and the engineering behind it

Query World Cup Data in Plain Language: Text-to-SQL Hands-On

The World Cup generates staggering amounts of data: thousands of touch, run, and pass events per match. To dig out insights — "which teams had over 60% possession but still lost?", "which position scored the most in the knockout stage?" — someone who can write SQL spends ages typing, and someone who can't just stares helplessly.

Text-to-SQL solves exactly this: you ask in plain language, AI automatically translates it into SQL to query the database, then explains the results back to you. This guide builds a World Cup data Q&A assistant and clarifies the few things production really must solve. For a general Text-to-SQL intro, see Text-to-SQL: querying databases in natural language; this article focuses on the sports-data scenario.

Core principle: schema is everything

Text-to-SQL is fundamentally about letting an LLM look at your database schema and translate natural language into SQL. So the top priority is feeding the schema to the model clearly — it doesn't know what your tables look like, so without it, it just makes up column names.

python
SCHEMA = """
Table matches:
  id, home_team, away_team, home_goals, away_goals, stage(group/round16/quarter/semi/final), match_date
Table team_stats (per-match team data):
  match_id, team, possession(%), shots, shots_on_target, passes, pass_accuracy(%)
Table players:
  id, name, team, position(GK/DF/MF/FW)
Table goals:
  match_id, player_id, minute, type(open_play/penalty/free_kick/own_goal)
"""

In the schema description, annotate the meaning and enum values of each column (e.g. which values stage can take), and the SQL the model generates gets far more accurate. This is the highest-ROI optimization.

Generating SQL

Splice the schema + the user's question into the prompt and have the model produce SQL:

python
from openai import OpenAI
client = OpenAI()
def to_sql(question):
    prompt = f"""You are a SQL expert. Using the database schema below, translate the user's
question into a single PostgreSQL query. Output only SQL, no explanation.
Only SELECT is allowed; any write operation is forbidden.
{SCHEMA}
Question: {question}"""
    sql = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    return sql.strip().strip('`')
print(to_sql("Which teams had over 60% possession but still lost the match?"))
→ SELECT ts.team, ... FROM team_stats ts JOIN matches m ... WHERE ts.possession > 60 AND (losing condition)

temperature=0 is mandatory — generating SQL needs no creativity, it needs determinism and accuracy.

The three must-answer questions for production

A demo is easy; for production you must solve these three, or you'll get burned eventually:

1. Safety: never let it execute writes

This is the first red line. The model-generated SQL hits the database directly, and if it ever generates a DELETE or DROP (lured by a malicious prompt, or just glitching), the consequences are dire. Layered defense:

Read-only database account: give the query connection an account with only SELECT permission. This is the hardest line of defense — blocked at the permission layer.

SQL validation: parse the SQL before execution; reject non-SELECT statements and multi-statements.

Timeout and row limits: prevent one SELECT from scanning the whole DB into a crash.

python
import sqlparsedef is_safe_select(sql):
    parsed = sqlparse.parse(sql)
    if len(parsed) != 1:  # reject multi-statements
        return False
    stmt = parsed[0]
    if stmt.get_type() != 'SELECT':  # allow SELECT only
        return False
    forbidden = ['insert', 'update', 'delete', 'drop', 'alter', 'truncate', 'grant']
    return not any(w in sql.lower() for w in forbidden)

Safety here is not optional. If your underlying data is in PostgreSQL, for how to configure account permissions, see PostgreSQL for AI Applications.

2. Accuracy: how do you know it didn't query wrong

The model can generate syntactically correct but logically wrong SQL — e.g. interpreting "over 60% possession" as "below 60%." Mitigations:

Echo the SQL to the user: display the generated SQL alongside, so those who can read it can verify.

EXPLAIN before executing: check the query plan is reasonable, and incidentally block full-table scans.

Give few-shot examples: put a few "question → correct SQL" examples in the prompt for the model to learn from; accuracy improves markedly.

3. Results should speak human

The query returns a pile of rows; have the LLM summarize the result into a natural-language answer. Feed the SQL result back to the model: "Based on the query, three teams dominated possession yet lost, namely..." This step turns the whole experience from a "database tool" into a "conversational assistant."

Advanced: wire it into an Agent

A more powerful pattern is wrapping Text-to-SQL as a tool and handing it to an Agent for multi-step reasoning: query one piece of data, then decide what to query next based on the result. For building such a "data-analysis Agent," see building a data-analysis agent with AI agents.

Compared with batch one's live-commentary Agent: that one uses tool calls to query live scores, this one uses Text-to-SQL to query a historical database — two different directions of the Agent's tool capability. For tool-calling basics, see live commentary with an LLM Agent.

Summary

Text-to-SQL lets people who don't know SQL query complex data, especially fitting for data-dense scenarios like the World Cup. But remember: the schema description determines accuracy, and read-only permission determines safety. Nail those two and the rest is experience polish.

For more AI applications at the World Cup, see the AI and 2026 World Cup roundup.

Also available in 中文.