World Models: The Next Piece of the Puzzle from Theory to Embodied Intelligence

Exploring core advances like JEPA, Kairos, and γ-World, revealing how world models drive embodied intelligence to a new stage

Introduction: From Language to Physics—AI's Next Battlefield

In the past few years, large language models (LLMs) have achieved remarkable breakthroughs in text generation, code writing, and mathematical reasoning. But an awkward fact remains: AI still can't pour itself a glass of water.

This dilemma of "high IQ, low physical ability" is precisely captured by Moravec's paradox—tasks that are trivial for human infants, such as walking, grasping, and obstacle avoidance, are extremely difficult for AI. The root cause is that existing models lack common-sense understanding of the physical world: they don't know about gravity, friction, object permanence, or how to anticipate the consequences of their own actions.

World models are designed to address this problem. They aim to enable AI to build an internal representation of the physical world's operating laws, thereby enabling causal reasoning, action planning, and zero-shot adaptation. This article systematically reviews the latest progress and future directions of world models from three dimensions: theory, technology, and industry.

Theoretical Foundations of World Models

What is a World Model?

Yann LeCun gave a clear definition in his 2026 lecture at ETH Zurich: a world model is a causal model based on actions, or interventions. It takes observations of a system and actions as input and predicts the outcomes of those interventions. The key difference from a generative model is that a world model does not predict raw data in full detail; it predicts in an abstract representation space and actively ignores noise and unpredictable details.

This definition clears up a common misconception: a world model is not a digital twin, a full simulator, or a video generation system. It is an action-oriented abstract predictor whose core goal is to support reasoning and planning.

JEPA: Joint Embedding Predictive Architecture

JEPA (Joint Embedding Predictive Architecture) is the core technology for implementing world models. Unlike traditional generative models (e.g., VAEs, diffusion models), JEPA does not require predicting every pixel. Instead, it learns abstract representations of data and performs state prediction in the representation space. This makes it naturally friendly to high-dimensional, continuous, noisy data (e.g., video, sensors).
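
To make "prediction in the representation space" concrete, here is a minimal sketch in plain Python. It is not the JEPA implementation: the linear encoder and predictor, the dimensions, and every function name are assumptions chosen for illustration. The point is only that the loss compares two small latent vectors rather than two raw observations.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(observation, encoder_weights):
    """Map a high-dimensional observation to a small abstract representation."""
    return matvec(encoder_weights, observation)

def predict_next_latent(latent, action, predictor_weights):
    """Predict the next latent state from the current latent state and the action."""
    return matvec(predictor_weights, latent + action)  # list concatenation

def latent_prediction_loss(predicted, target):
    """Mean squared error between representations, not between pixels."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Hypothetical sizes: a 16-value "frame", a 4-value latent, a 2-value action.
random.seed(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 16, 4, 2
encoder_weights = [[random.gauss(0, 0.1) for _ in range(OBS_DIM)]
                   for _ in range(LATENT_DIM)]
predictor_weights = [[random.gauss(0, 0.1) for _ in range(LATENT_DIM + ACTION_DIM)]
                     for _ in range(LATENT_DIM)]

obs_now = [random.random() for _ in range(OBS_DIM)]
obs_next = [random.random() for _ in range(OBS_DIM)]
action = [1.0, 0.0]

latent_now = encode(obs_now, encoder_weights)
latent_target = encode(obs_next, encoder_weights)
latent_pred = predict_next_latent(latent_now, action, predictor_weights)
loss = latent_prediction_loss(latent_pred, latent_target)
print(f"latent size: {len(latent_pred)}, loss: {loss:.4f}")
```

In a real system the encoder and predictor would be deep networks trained by gradient descent, and minimizing this loss alone invites the collapse problem described next.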

Training JEPA faces a key challenge: representation collapse, where the model loses effective representational capacity. There are two main solutions:

  • Contrastive methods: Lower the energy of positive samples and raise the energy of negative samples, but scalability is poor in high-dimensional scenarios.
  • Regularization methods (advocated by AMI Labs): Constrain the volume of low-energy spaces, such as SIGReg (isotropic Gaussian regularization), VCReg, Barlow Twins, etc.
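
The regularization idea can be illustrated with a small sketch in the spirit of variance and covariance penalties. This is an assumption-laden toy, not SIGReg or any published loss as implemented by its authors: the function names, the `target_std` threshold, and the tiny batches are all hypothetical. It shows that a collapsed batch (every embedding identical) and a redundant batch (dimensions that copy each other) are both penalized, while a spread-out, decorrelated batch is not.

```python
import math

def column(batch, j):
    """Values of embedding dimension j across the batch."""
    return [row[j] for row in batch]

def variance_penalty(batch, target_std=1.0, eps=1e-4):
    """Hinge penalty that grows when a dimension's std falls below target_std."""
    n, dim = len(batch), len(batch[0])
    total = 0.0
    for j in range(dim):
        values = column(batch, j)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)
        total += max(0.0, target_std - math.sqrt(var + eps))
    return total / dim

def covariance_penalty(batch):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, dim = len(batch), len(batch[0])
    means = [sum(column(batch, j)) / n for j in range(dim)]
    total = 0.0
    for i in range(dim):
        for j in range(dim):
            if i == j:
                continue
            cov = sum((row[i] - means[i]) * (row[j] - means[j])
                      for row in batch) / (n - 1)
            total += cov ** 2
    return total / dim

collapsed = [[0.5, 0.5] for _ in range(4)]             # every embedding identical
correlated = [[1, 1], [1, 1], [-1, -1], [-1, -1]]      # second dim copies the first
decorrelated = [[1, 1], [1, -1], [-1, 1], [-1, -1]]    # spread out and independent

for name, batch in [("collapsed", collapsed),
                    ("correlated", correlated),
                    ("decorrelated", decorrelated)]:
    print(name, round(variance_penalty(batch), 3), round(covariance_penalty(batch), 3))
```

Adding penalties of this kind to the prediction loss is one way to keep the encoder from mapping every input to the same point.
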

The JEPA family has spawned several mature models:

  • I-JEPA: For static images, leading to the DINOv3 general-purpose visual foundation model.
  • V-JEPA: For video and dynamic scenes, setting new records on benchmarks like EK100 and SSv2, capable of learning intuitive physical common sense.
  • LeWorldModel (LeWM): An end-to-end JEPA world model that outperforms competitors like DINO-WM and PLDM in robot planning tasks.

Hierarchical Planning and Safety Constraints

World models support multi-level, multi-timescale hierarchical planning:

  • High level: Responsible for long-term, far-horizon predictions, with concise representations, generating subgoals.
  • Low level: Responsible for short-term, near-horizon predictions, retaining details, executing specific actions.
  • Safety guardrails: Constraints applied at all levels to ensure system controllability.

For example, in the task "go from the office to the airport": the top level selects the airport and transportation mode, the middle level plans the route and obstacle avoidance, and the bottom level executes actions like walking and grasping. This architecture enables world models to handle complex long-horizon tasks.
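
The division of labor between levels can be sketched with a toy planner on a one-dimensional line. Everything here is a hypothetical illustration: the function names, the `stride` between subgoals, and the `max_step` limit are assumptions, and for brevity the guardrail is applied only to low-level steps rather than at every level.

```python
def plan_subgoals(start, goal, stride):
    """High level: a coarse, long-horizon plan expressed as a few subgoals."""
    subgoals = []
    position = start
    direction = 1 if goal >= start else -1
    while abs(goal - position) > stride:
        position += direction * stride
        subgoals.append(position)
    subgoals.append(goal)
    return subgoals

def guardrail(step, max_step):
    """Safety constraint: clip any proposed step to the allowed magnitude."""
    return max(-max_step, min(max_step, step))

def reach_subgoal(position, subgoal, max_step):
    """Low level: short-horizon control that issues small, guarded steps.

    Positions are integers, so the loop reaches the subgoal exactly.
    """
    trajectory = []
    while position != subgoal:
        proposed = subgoal - position          # the low level wants to jump there
        position += guardrail(proposed, max_step)
        trajectory.append(position)
    return trajectory

def hierarchical_plan(start, goal, stride=10, max_step=3):
    """Chain the two levels: subgoals first, then guarded steps to each one."""
    position, trajectory = start, [start]
    subgoals = plan_subgoals(start, goal, stride)
    for subgoal in subgoals:
        trajectory.extend(reach_subgoal(position, subgoal, max_step))
        position = subgoal
    return subgoals, trajectory

subgoals, trajectory = hierarchical_plan(0, 25)
print(subgoals)      # [10, 20, 25]
print(trajectory)    # [0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 25]
```

The high level reasons over three subgoals while the low level handles ten individual steps, which is the compression that makes long-horizon tasks tractable.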

Frontier Advances: From Single-Agent to Multi-Agent World Models

Kairos: A 4B-Parameter Champion World Model That Defeats a 28B Model

Daxiao Robotics' Kairos world model took first place on four international benchmarks (RoboTwin 2.0, LIBERO-Plus, WorldModelBench Robot, and DreamGen Bench) with only 4B parameters.

Core technologies:

  • Native unified architecture: Integrates multimodal understanding, video generation, and state prediction into a single model, rather than post-training modifications on existing models.
  • Self-developed hybrid linear attention mechanism and global state sharing mechanism, enabling the three capabilities to operate synergistically.
  • Training data: Over 100,000 hours of human-centric real-world data + millions of hours of internet video, combined with explicit imitation learning and latent-space reinforcement learning.

Key performance:

  • RoboTwin 2.0 (dual-arm manipulation): Average success rate 96.1%, surpassing models like G0.5 and starVLA.
  • LIBERO-Plus (scene generalization): 89.0 points, first time surpassing mainstream VLA approaches (e.g., Pi 0.5).
  • WorldModelBench Robot: 4B parameters scored 9.30, defeating 28B Lingbot.
  • DreamGen Bench: First in both physical adherence and overall average score.

Kairos-4B is also the first embodied world model that can directly drive a robot body on the edge, reducing intermediate conversion latency and extending world models from "cognitive systems" to "execution systems."

γ-World: Real-Time Multi-Agent Shared-World Interaction

γ-World, introduced by NVIDIA in collaboration with Tsinghua University, extends world models from single-player mode to multi-player shared spaces. Its core innovations include:

  • SRAE (Simplex Rotation Agent Encoding): Maps N agents to N vertices of a regular simplex in a rotation angle space, with equal distances between any two vertices. This achieves identity symmetry and scalability without learning parameters. After two-player training, it can zero-shot generalize to four players.
  • SHA (Sparse Hub Attention): Introduces learnable Hub Tokens as intermediaries, reducing cross-agent attention cost from O(N²) to O(N).
  • Distillation pipeline: A teacher model (bidirectional diffusion) generates high-quality data, and a student model (chunked causal) achieves 24 FPS real-time inference via KV caching.
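
Two of the properties above can be checked numerically. The sketch below is not γ-World's code: it places the simplex in ordinary Euclidean coordinates rather than a rotation angle space, and the function names and the `num_hubs` parameter are assumptions made for illustration. It verifies that regular-simplex vertices are mutually equidistant without any learned parameters, and counts attention links to show why routing through a fixed number of hubs scales linearly with the number of agents.

```python
import itertools
import math

def simplex_vertices(num_agents):
    """N vertices of a regular simplex: centered standard basis vectors in R^N."""
    centroid = 1.0 / num_agents
    return [[(1.0 if i == j else 0.0) - centroid for j in range(num_agents)]
            for i in range(num_agents)]

def pairwise_distances(vertices):
    """Euclidean distance for every unordered pair of vertices."""
    return [math.dist(a, b) for a, b in itertools.combinations(vertices, 2)]

def dense_attention_links(num_agents):
    """Every agent attends to every other agent: grows as O(N^2)."""
    return num_agents * (num_agents - 1)

def hub_attention_links(num_agents, num_hubs):
    """Agents write to hubs and read back from hubs: O(N) for a fixed hub count."""
    return 2 * num_agents * num_hubs

for n in (2, 4):
    distances = pairwise_distances(simplex_vertices(n))
    print(n, "agents, pairwise distances:", [round(d, 6) for d in distances])

for n in (4, 16, 64):
    print(n, "agents:", dense_attention_links(n), "dense links vs",
          hub_attention_links(n, num_hubs=2), "hub links")
```

Because no vertex is privileged, swapping agent identities leaves the geometry unchanged, which is one intuition for why an encoding of this kind can extend from two players to four. The hub count only pays off once the number of agents is large relative to the number of hubs.
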

γ-World has been validated in both virtual games and real robot collaboration scenarios, providing a foundation for multi-robot collaboration, autonomous driving multi-vehicle interaction, and more.

Industry Deployment: From Data Flywheels to Open-Source Ecosystems

ForceMind: A Two-Way Match of Model and Scenario

The merger of ForceMind and Atomix combines "embodied large models" with "real-world scenario data." ForceMind's DM0 model took first place globally in the RoboChallenge real-robot evaluation with only 2.4B parameters; it was trained on a fusion of internet, autonomous driving, and robot manipulation data across 8 types of embodiments. Atomix, meanwhile, has accumulated over 500 projects in logistics warehousing scenarios, with a daily shipment volume of 600,000 items, providing massive real-world picking data for the model.

This merger builds a data flywheel: "model improves → robots become smarter → data gets better → model continues to improve." Tang Wenbin views picking as the "atomic task" of embodied intelligence, analogous to coding in the era of large models: it offers massive data, clear feedback, and strong transferability.

Accelerated Evolution: MVP on the Soccer Field and an OS Ambition

Accelerated Evolution chose robot soccer as the minimal closed loop for technology validation, persisting in this scenario for 20 years. Its K1 humanoid robot (priced at 39,900 RMB) is already sold on JD.com, and the T2 flagship model boasts high dynamic capabilities. The company does not bet on end-to-end large models but follows a layered deployment path: perception → decision → execution. It is also developing the Booster Studio development tool, aiming to build an operating system for the embodied agent ecosystem.

Cheng Hao believes that embodied large models will take another 5-10 years to mature; before that, operating systems and data flywheels are more pragmatic paths.

Jiuwen Symbiosis: An Open-Source Physical AI Framework

The openJiuwen community's open-source Jiuwen Symbiosis proposes a "situational awareness loop" architecture, where the cognitive layer and execution layer collaborate through a shared Workspace. Its core modules include multimodal perception, safe planning, physical execution, state observation, observation feedback, and spatial memory, supporting zero-shot cross-embodiment adaptation and long-horizon composite tasks.
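
The general pattern of two layers cooperating through a shared workspace can be sketched as a small blackboard loop. This does not use or reproduce the Jiuwen Symbiosis API: the `Workspace` fields, both layer functions, and the one-dimensional "world" are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Workspace:
    """Shared blackboard that both layers read from and write to."""
    observation: dict = field(default_factory=dict)
    plan: list = field(default_factory=list)
    log: list = field(default_factory=list)

def cognitive_layer(workspace, goal):
    """Write a new plan only when none is pending and the goal is not reached."""
    position = workspace.observation["position"]
    if not workspace.plan and position != goal:
        step = 1 if goal > position else -1
        workspace.plan = [step] * abs(goal - position)

def execution_layer(workspace, world):
    """Execute one planned action, then write the fresh observation back."""
    if workspace.plan:
        action = workspace.plan.pop(0)
        world["position"] += action
        workspace.log.append(action)
    workspace.observation = {"position": world["position"]}

def run_loop(goal, max_ticks=20):
    """Alternate the two layers until the observed position matches the goal."""
    world = {"position": 0}
    workspace = Workspace(observation={"position": 0})
    for _ in range(max_ticks):
        if workspace.observation["position"] == goal:
            break
        cognitive_layer(workspace, goal)
        execution_layer(workspace, world)
    return workspace

result = run_loop(goal=3)
print(result.observation, result.log)
```

Neither layer calls the other directly; each only reads and writes the workspace, so either side could be swapped out, or moved to a different machine, without changing the other.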

The framework adopts an edge-cloud collaborative architecture: cloud-side LLMs and VLMs handle complex reasoning, while edge-side Ascend NPUs and Kunpeng CPUs handle real-time perception and execution, reducing power consumption and deployment costs.

Challenges and Future Directions

Despite rapid progress, world models still face many challenges:

  • Data bottleneck: Real physical interaction data is expensive to acquire and limited in scale. Simulation data suffers from the Sim2Real gap.
  • Computational efficiency: The conflict between real-time inference requirements and model complexity, especially in multi-agent scenarios.
  • Safety and interpretability: Black-box models make fault localization difficult; more transparent architectures are needed.
  • Standardization: Lack of unified evaluation benchmarks and open-source ecosystems makes horizontal comparison across different approaches difficult.

Looking ahead, world models are expected to integrate deeply with AI agents, driving embodied intelligence from labs into homes, factories, and cities. Lightweight models and edge deployment will be key.

Conclusion

World models are moving from theory to practice, from single-agent to multi-agent, and from virtual to physical. Whether it's LeCun's JEPA theory, Kairos's champion performance, γ-World's multi-player interaction, or industry explorations by ForceMind, Accelerated Evolution, and Jiuwen Symbiosis, they all point in the same direction: enabling AI to truly understand and act upon the physical world.

For developers, now is the best time to deeply understand world models, participate in open-source ecosystems, and validate technologies in real-world scenarios.

FAQ

What is the difference between a world model and a video generation model?
Video generation models aim to generate realistic future frames but lack understanding of physical laws and causality. World models, on the other hand, make predictions in an abstract representation space, actively ignoring noise and unpredictable details. Their core goal is to support reasoning and planning, not pixel-level generation.

Why is JEPA more suitable for world models than generative models?
Generative models must predict every data detail, making them poorly compatible with high-dimensional continuous data like video. JEPA only learns abstract representations and predicts states in the representation space, naturally fitting image, video, and sensor data while avoiding pixel-level blurring and distortion.

What is the biggest challenge facing world models today?
The data bottleneck is the primary challenge: real physical interaction data is expensive to acquire and limited in scale, while simulation data suffers from the Sim2Real gap. Real-time inference efficiency, safety and interpretability, and standardized evaluation are also pressing issues.

What are typical application scenarios for multi-agent world models?
Multi-robot collaboration (e.g., dual-arm manipulation, warehouse coordination), autonomous driving multi-vehicle interaction, multi-player games, embodied intelligence training, etc. γ-World has demonstrated good scalability by zero-shot generalizing from two-player training to four-player scenarios.

How can small teams get started in world model research?
It is recommended to start with open-source frameworks like Jiuwen Symbiosis or Dexbotic, validating algorithms in simulation environments. Focus on a specific scenario (e.g., robot soccer, tabletop manipulation) as a minimal closed loop to accumulate data and experience. Also, refer to toolchains like LangChain for rapid prototyping.

What is the relationship between world models and reinforcement learning?
LeCun suggests prioritizing model predictive control (MPC) for planning, using reinforcement learning to fine-tune the world model only when planning fails. The world model itself can serve as an environment model for reinforcement learning, providing low-cost, high-throughput training data.
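
As a rough illustration of planning with MPC on top of a world model, the sketch below uses a brute-force planner: at every step it imagines all short action sequences inside the model, keeps the cheapest, executes only its first action, and then replans. The `world_model` here is a hand-written stand-in for a learned dynamics model, and the action set, horizon, and cost function are assumptions chosen to keep the example tiny.

```python
import itertools

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set

def world_model(state, action):
    """Toy stand-in for a learned dynamics model: a point moving on a line."""
    return state + action

def imagined_cost(state, actions, goal):
    """Roll the action sequence out inside the model and sum distance to the goal."""
    cost = 0
    for action in actions:
        state = world_model(state, action)
        cost += abs(goal - state)
    return cost

def plan_first_action(state, goal, horizon=4):
    """Evaluate every action sequence of length `horizon`; return the best first action."""
    best = min(itertools.product(ACTIONS, repeat=horizon),
               key=lambda seq: imagined_cost(state, seq, goal))
    return best[0]

def run_mpc(start, goal, max_steps=20):
    """Receding-horizon loop: plan, take one step, observe, replan."""
    state, visited = start, [start]
    for _ in range(max_steps):
        if state == goal:
            break
        action = plan_first_action(state, goal)
        # A real system would act in the environment here and re-observe;
        # this toy reuses the model as the environment.
        state = world_model(state, action)
        visited.append(state)
    return visited

print(run_mpc(0, 6))   # [0, 1, 2, 3, 4, 5, 6]
```

Exhaustive enumeration only works for tiny action sets and short horizons; practical planners sample or optimize candidate sequences instead, but the plan, act, replan structure is the same.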
