
Computer Vision Watches Football: Offside Detection and Auto-Highlight Editing (2026)

From semi-automated offside tech to AI-generated highlights, breaking down the CV stack and engineering challenges behind football video analysis

Watching a World Cup broadcast, you've surely seen it: after a goal, a virtual offside line is drawn on screen, judging offside to the millimeter; or, moments after the final whistle, the platform serves up a highlight reel. Behind these is a full computer vision (CV) stack. This article breaks down how they work, using a stack you can actually pick up.

First, a point that's easy to miss: FIFA's official system is called "Semi-Automated Offside Technology" (SAOT) — note the words "semi-automated." The AI provides a recommendation; the video assistant referee makes the final call. This isn't a technical shortcoming — it's that high-stakes decisions must keep a human backstop. That design philosophy is worth borrowing for any AI deployment.

The overall stack

The core pipeline for football video analysis is roughly:

  • Object detection: find players, referees, and the ball in each frame.
  • Multi-object tracking: link the same person across frames into a trajectory.
  • Pose / keypoint estimation: precise down to a player's foot or shoulder — offside is judged on body parts.
  • Pitch calibration: map video pixel coordinates to real pitch coordinates.
  • Event recognition: identify key events like goals, shots, corners.
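Pitch calibration is the least intuitive step, so here is a minimal sketch of it, assuming you have already located at least four pitch landmarks in the image. The function names and the landmark coordinates are illustrative, not taken from any official system. The sketch fits a homography that maps ground-plane pixels to pitch metres.

```python
import numpy as np

def fit_homography(pixel_pts, pitch_pts):
    """Fit the 3x3 homography H mapping pixel (u, v) to pitch (x, y) from 4+ point pairs."""
    rows = []
    for (u, v), (x, y) in zip(pixel_pts, pitch_pts):
        rows.append([-u, -v, -1, 0, 0, 0, u * x, v * x, x])
        rows.append([0, 0, 0, -u, -v, -1, u * y, v * y, y])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def pixel_to_pitch(H, u, v):
    """Project one pixel onto the pitch plane (valid only for points on the ground)."""
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w

# Hypothetical pixel positions of the four penalty-area corners (40.32 m x 16.5 m)
pixel_pts = [(420, 310), (1480, 300), (1750, 820), (150, 840)]
pitch_pts = [(0.0, 0.0), (40.32, 0.0), (40.32, 16.5), (0.0, 16.5)]
H = fit_homography(pixel_pts, pitch_pts)
```

A homography only holds for points on the ground plane, which is one reason a single camera struggles with body parts that are off the ground.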

    Detecting players and the ball

    The most basic step; the YOLO family is the de facto standard — fast and accurate enough.

    python
    from ultralytics import YOLO

    model = YOLO("yolov8x.pt")  # use the large model for accuracy; the ball is tiny and hard to detect

    # Detect and track on each frame
    results = model.track(
        "match.mp4",
        classes=[0, 32],           # COCO class IDs: 0 = person, 32 = sports ball
        persist=True,              # keep tracker state across frames so each target keeps its ID
        tracker="bytetrack.yaml",  # use the built-in ByteTrack tracker
    )

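    tracker="bytetrack.yaml" selects YOLO's built-in ByteTrack tracker, which assigns each player a stable ID across frames; those IDs are what let you build trajectories. persist=True tells the tracker to carry its state over between calls, which matters mainly when you feed frames one at a time. In practice the ball is the biggest headache: it is small and fast, often occluded by legs, and lost for several frames at a time. This part usually needs a separately trained detector or trajectory interpolation.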
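As a rough illustration of trajectory interpolation, the sketch below fills short gaps in a ball track by drawing a straight line between the detections on either side. The function name, the dictionary layout, and the 10-frame gap limit are assumptions made for this example.

```python
import numpy as np

def fill_ball_gaps(detections, max_gap=10):
    """detections: {frame_index: (x, y)} for frames where the ball was found.

    Returns a new dict that also covers gaps of up to max_gap frames,
    filled by linear interpolation between the neighbouring detections.
    """
    known = sorted(detections)
    filled = dict(detections)
    for start, end in zip(known, known[1:]):
        gap = end - start - 1
        if 0 < gap <= max_gap:
            missing = np.arange(start + 1, end)
            xs = np.interp(missing, [start, end], [detections[start][0], detections[end][0]])
            ys = np.interp(missing, [start, end], [detections[start][1], detections[end][1]])
            for f, x, y in zip(missing, xs, ys):
                filled[int(f)] = (float(x), float(y))
    return filled

track = fill_ball_gaps({0: (100.0, 200.0), 4: (140.0, 220.0), 40: (500.0, 300.0)})
```

Longer gaps are left empty on purpose: a straight line is a poor guess once the ball may have been kicked or deflected in between.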

    Offside calls: where the difficulty lies

    The CV logic for offside sounds simple — compare the attacking player's position to the second-to-last defender's — but it's brutally hard in engineering:

  • You must judge "which body part": the rule looks at valid scoring parts (torso, legs, head), not the whole person. So a detection box isn't enough — you need pose estimation precise to keypoints.
  • It must be 3D: players are three-dimensional on the pitch; a single camera has perspective distortion. The official system uses a dozen-plus dedicated cameras for 3D reconstruction.
  • It must nail the pass moment: offside is judged at "the instant the ball is played" — a few frames off and the conclusion flips.

    python
    from ultralytics import YOLO

    # Use pose estimation to get each player's keypoints
    pose_model = YOLO("yolov8x-pose.pt")
    poses = pose_model(frame)  # frame: a single video frame, assumed already loaded as an image array

    # For each player, take the foremost valid scoring part in the attacking direction,
    # then compare against the second-to-last defender; that's the essence of the offside line
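The comparison itself can be sketched in a few lines, assuming every keypoint has already been mapped to pitch coordinates with x running along the pitch toward the goal being attacked. The function names, the 105 m pitch length behind the default halfway line, and the choice to count shoulders as scoring parts are assumptions for illustration. The keypoint indices follow the 17-point COCO layout that YOLO pose models output.

```python
# COCO keypoint indices; elbows (7, 8) and wrists (9, 10) are left out
# because arms and hands cannot be offside
SCORING_KEYPOINTS = [0, 1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16]

def foremost_x(keypoints_x):
    """Largest pitch x among one player's scoring keypoints (attack runs toward +x)."""
    return max(keypoints_x[i] for i in SCORING_KEYPOINTS)

def offside_line(defender_xs):
    """x of the second-to-last defending player; the goalkeeper counts as one."""
    if len(defender_xs) < 2:
        raise ValueError("need at least two defending players, goalkeeper included")
    return sorted(defender_xs, reverse=True)[1]

def in_offside_position(attacker_x, ball_x, defender_xs, halfway_x=52.5):
    """True if the attacker is in the opponents' half, ahead of the ball,
    and ahead of the second-to-last defender. Being level counts as onside."""
    return (
        attacker_x > halfway_x
        and attacker_x > ball_x
        and attacker_x > offside_line(defender_xs)
    )

# One attacker whose wrist (index 9) is furthest forward but does not count
attacker_kx = [87.0] * 17
attacker_kx[9] = 90.0
attacker_kx[15] = 88.5
defenders = [103.0, 88.0, 85.0]  # goalkeeper plus two outfield players
```

This only decides the offside position at one instant; whether an offence occurred also depends on the player's involvement in play, which is outside this sketch.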

    So a hobby reproduction can build a "rough offside hint," but millimeter-level calls are built on dedicated hardware plus multiple cameras — a single-camera video can't do it. Recognizing this boundary matters.
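To see why the pass moment is so sensitive, a quick back-of-the-envelope calculation helps. The 8 m/s sprint speed and the frame rates below are illustrative assumptions, not measured values.

```python
def metres_per_frame(speed_mps, fps):
    """Distance a player covers between two consecutive video frames."""
    return speed_mps / fps

# A sprinting player at an assumed 8 m/s
per_frame_25 = metres_per_frame(8.0, 25.0)  # typical broadcast frame rate
per_frame_50 = metres_per_frame(8.0, 50.0)  # higher-rate camera
```

At 25 fps that is 0.32 m per frame, so picking the pass frame one frame too early or too late already shifts the player by far more than the margins offside decisions are argued over.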

    Auto-highlight editing: a more approachable project

    Compared to offside, auto-highlights is a far more realistic project for the average developer. The idea: detect "exciting events" and automatically clip and splice the surrounding segments.

    Several signals indicate an exciting moment; combining them is most robust:

  • Visual events: detect the ball hitting the net, or players clustering to celebrate.
  • Audio energy: the commentator's pitch suddenly rising, the crowd's roar peaking — this is especially effective; audio energy almost always spikes at the moment of a goal.
  • Scoreboard change: OCR the corner scoreboard; if it changes, a goal was scored.

    python
    import librosa
    import numpy as np

    # Use audio energy peaks to locate exciting moments; simple but surprisingly effective
    # (reading audio straight from an .mp4 may need an ffmpeg-backed decoder; extracting a .wav first is safer)
    y, sr = librosa.load("match.mp4")
    energy = librosa.feature.rms(y=y)[0]
    threshold = np.percentile(energy, 97)  # keep the top 3% of frames by energy
    peaks = np.where(energy > threshold)[0]

    # Convert peaks to timestamps, clip 10s before and 5s after
    peak_times = librosa.frames_to_time(peaks, sr=sr)
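Turning peak timestamps into clip windows could look like the sketch below, which pads each peak by 10 seconds before and 5 seconds after and merges windows that overlap. The function name and parameters are illustrative.

```python
def peaks_to_clips(peak_times, before=10.0, after=5.0, duration=None):
    """Turn peak timestamps (seconds) into merged (start, end) clip windows."""
    clips = []
    for t in sorted(peak_times):
        start = max(0.0, t - before)
        end = t + after if duration is None else min(t + after, duration)
        if clips and start <= clips[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new clip
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            clips.append((start, end))
    return clips

clips = peaks_to_clips([100.0, 100.5, 300.0])
```

Merging matters because a single goal usually produces many adjacent high-energy frames, and without it you would cut dozens of nearly identical clips.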

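    Fusing the audio-peak and visual-detection signals usually gives more accurate highlights than either signal alone, because each one filters out the other's false positives: a loud reaction with nothing happening near goal, or a goalmouth scramble the crowd ignores. This is a project you can demo in a weekend, which makes it great for practice.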
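One simple way to fuse audio peaks and visual detections is to keep only the audio peaks that have a visual event close by in time. The function name and the 5-second window are assumptions for this sketch; a weighted score across signals would be a natural next step.

```python
def fuse_signals(audio_times, visual_times, window=5.0):
    """Keep audio peaks (seconds) that have a visual event within `window` seconds."""
    return [
        t for t in audio_times
        if any(abs(t - v) <= window for v in visual_times)
    ]

# Three audio peaks, but only the first has a visual event nearby
confirmed = fuse_signals([100.0, 300.0, 500.0], [102.0, 480.0])
```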

    Traps you'll hit in practice

  • The broadcast feed cuts shots: wide, close-up, replay — constant switching, so tracking IDs break often. Do shot-boundary detection and reset trajectories on each cut.
  • Compute: running a big YOLO frame by frame is slow; real-time processing needs lower resolution or frame sampling.
  • Annotation is expensive: training your own football-specific model has steep labeling costs — start with the public SoccerNet dataset.
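For the shot-cut problem, a crude shot-boundary detector can compare the intensity histograms of consecutive frames. The sketch below assumes grayscale frames as uint8 arrays; the function name, bin count, and threshold are illustrative and would need tuning on real footage.

```python
import numpy as np

def is_shot_cut(prev_frame, frame, bins=32, threshold=0.5):
    """Flag a cut when the intensity histograms of two consecutive frames differ a lot."""
    h1, _ = np.histogram(prev_frame, bins=bins, range=(0, 256))
    h2, _ = np.histogram(frame, bins=bins, range=(0, 256))
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    # Half the L1 distance between normalized histograms lies in [0, 1]
    return bool(0.5 * np.abs(h1 - h2).sum() > threshold)

dark = np.zeros((4, 4), dtype=np.uint8)
bright = np.full((4, 4), 200, dtype=np.uint8)
```

When a cut is flagged, drop the current track IDs and start fresh trajectories on the next frame. Gradual transitions such as fades will slip past a check this simple.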

    Where this sits in the World Cup picture

    CV is the "eyes" of AI watching the match, but it's just one link. For prediction see predicting scores with ML, for knowledge Q&A see the RAG knowledge base, and for the full picture see the AI and 2026 World Cup roundup.

    From a "what can I actually run" standpoint, start with auto-highlights — audio peaks plus YOLO detection, a weekend to a demo, far more realistic than tackling 3D offside reconstruction head-on.
