World Models: The Next Piece of the Puzzle from Theory to Embodied Intelligence
Exploring core advances like JEPA, Kairos, and γ-World, revealing how world models drive embodied intelligence to a new stage
Introduction: From Language to Physics—AI's Next Battlefield
In the past few years, large language models (LLMs) have achieved remarkable breakthroughs in text generation, code writing, and mathematical reasoning. But an awkward fact remains: AI still can't pour itself a glass of water.
This dilemma of "high IQ, low physical ability" is precisely captured by Moravec's paradox—tasks that are trivial for human infants, such as walking, grasping, and obstacle avoidance, are extremely difficult for AI. The root cause is that existing models lack common-sense understanding of the physical world: they don't know about gravity, friction, object permanence, or how to anticipate the consequences of their own actions.
World models are designed to address this problem. They aim to enable AI to build an internal representation of the physical world's operating laws, thereby enabling causal reasoning, action planning, and zero-shot adaptation. This article systematically reviews the latest progress and future directions of world models from three dimensions: theory, technology, and industry.
Theoretical Foundations of World Models
What is a World Model?
Yann LeCun gave a clear definition in his 2026 lecture at ETH Zurich: A world model is a causal model based on actions/interventions. It takes system observations and human actions as input and predicts the outcomes of interventions. The key difference is that world models do not predict raw data details but make predictions in an abstract representation space, actively ignoring noise and unpredictable details.
This definition clears up a common misconception: a world model is not a digital twin, a full simulator, or a video generation system. It is an action-oriented abstract predictor whose core goal is to support reasoning and planning.
JEPA: Joint Embedding Predictive Architecture
JEPA (Joint Embedding Predictive Architecture) is the core technology for implementing world models. Unlike traditional generative models (e.g., VAEs, diffusion models), JEPA does not need to predict every pixel. Instead, it learns abstract representations of the data and predicts future states in that representation space. This makes it well suited to high-dimensional, continuous, noisy data such as video and sensor streams.
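To make the idea of "predicting in representation space" concrete, here is a minimal sketch in plain Python. It is not a real JEPA: the encoder and predictor weights (`W_ENC`, `W_PRED`) are set by hand rather than learned, and the 4-number observation (a 2D position plus two noise channels) is an assumption chosen for illustration. It only shows that a loss measured between predicted and encoded latent states can be exact even when part of the raw observation is unpredictable.

```python
import random
from typing import List, Tuple

Vector = List[float]

def matvec(W: List[Vector], x: Vector) -> Vector:
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Hand-set weights (assumption): a real JEPA would learn these from data.
# Observation = [x, y, noise1, noise2]; the encoder keeps only the position.
W_ENC = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
# Predictor input = [z_x, z_y, a_x, a_y]; it adds the action to the latent state.
W_PRED = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]

def demo(seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    action = [0.25, 0.5]
    obs = [0.5, -1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
    # The position moves by the action; the noise channels are redrawn.
    next_obs = [0.75, -0.5, rng.uniform(-1, 1), rng.uniform(-1, 1)]

    # JEPA-style: encode, predict the next latent state, compare with the
    # encoding of what actually happened. List "+" concatenates state and action.
    z_pred = matvec(W_PRED, matvec(W_ENC, obs) + action)
    z_target = matvec(W_ENC, next_obs)
    latent_loss = mse(z_pred, z_target)

    # Raw-observation baseline: the noise channels cannot be predicted,
    # so the best available guess for them here is zero.
    raw_loss = mse(z_pred + [0.0, 0.0], next_obs)
    return latent_loss, raw_loss
```

Calling `demo()` returns a latent-space loss of zero and a strictly positive raw-observation loss: the unpredictable channels penalize the raw prediction but are simply discarded by the encoder.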
Training JEPA faces a key challenge: representation collapse, where the encoder maps every input to nearly the same embedding, so prediction becomes trivial and the representation carries no useful information. There are two main solutions:
The JEPA family has spawned several mature models:
Hierarchical Planning and Safety Constraints
World models support multi-level, multi-timescale hierarchical planning:
For example, in the task "go from the office to the airport": the top level selects the airport and transportation mode, the middle level plans the route and obstacle avoidance, and the bottom level executes actions like walking and grasping. This architecture enables world models to handle complex long-horizon tasks.
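The office-to-airport example can be sketched as a recursive refinement of a task through successively finer levels. The sketch below is only an illustration: the three lookup tables (`TOP`, `MID`, `LOW`) and every task name in them are hypothetical stand-ins for what would, in a real system, be planners operating on a learned world model at different timescales.

```python
from typing import Dict, List

Table = Dict[str, List[str]]

# Hypothetical hand-written tables, one per planning level.
TOP: Table = {
    "office -> airport": ["reach taxi stand", "ride taxi", "reach check-in"],
}
MID: Table = {
    "reach taxi stand": ["walk to elevator", "walk to street"],
    "ride taxi": ["open door", "sit"],
    "reach check-in": ["walk to terminal"],
}
LOW: Table = {
    "walk to elevator": ["step", "step"],
    "walk to street": ["step"],
    "open door": ["grasp handle", "pull"],
    "walk to terminal": ["step", "step", "step"],
}

def expand(task: str, levels: List[Table]) -> List[str]:
    """Refine a task through each level in turn, coarse to fine."""
    if not levels:
        return [task]  # no finer level left: this is a primitive action
    # Tasks a level does not know about pass through unchanged.
    subtasks = levels[0].get(task, [task])
    plan: List[str] = []
    for sub in subtasks:
        plan.extend(expand(sub, levels[1:]))
    return plan

primitive_plan = expand("office -> airport", [TOP, MID, LOW])
```

Here `primitive_plan` is a flat list of nine low-level actions (steps, a grasp, a pull, and so on). The design point is that each level reasons only over its own vocabulary and horizon, which keeps long-horizon tasks tractable.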
Frontier Advances: From Single-Agent to Multi-Agent World Models
Kairos: A 4B-Parameter Champion World Model Defeating 28B
Daxiao Robotics' Kairos world model achieved first place in four international benchmarks: RoboTwin 2.0, LIBERO-Plus, WorldModelBench Robot, and DreamGen Bench, with only 4B parameters.
Core technologies:
Key performance:
Kairos-4B is also the first embodied world model that can directly drive a robot body on the edge, reducing intermediate conversion latency and extending world models from "cognitive systems" to "execution systems."
γ-World: Real-Time Multi-Agent Shared-World Interaction
γ-World, introduced by NVIDIA in collaboration with Tsinghua University, extends world models from single-player mode to multi-player shared spaces. Its core innovations include:
γ-World has been validated in both virtual games and real robot collaboration scenarios, providing a foundation for multi-robot collaboration, autonomous driving multi-vehicle interaction, and more.
Industry Deployment: From Data Flywheels to Open-Source Ecosystems
ForceMind: A Two-Way Match of Model and Scenario
The merger of ForceMind and Atomix represents a powerful combination of "embodied large models" and "real-world scenario data." ForceMind's DM0 model achieved first place globally in the RoboChallenge real-robot evaluation, with only 2.4B parameters, trained on a fusion of internet, autonomous driving, and robot manipulation data across 8 types of embodiments. Atomix, meanwhile, has accumulated over 500 projects in logistics warehousing scenarios, with a daily shipment volume of 600,000 items, providing massive real-world picking data for the model.
This merger builds a data flywheel: "model improves → robots become smarter → data gets better → model continues to improve." Tang Wenbin views picking as the "atomic task" of embodied intelligence, analogous to coding in the era of large models—with massive data, clear feedback, and strong transferability.
Accelerated Evolution: MVP on the Soccer Field and an OS Ambition
Accelerated Evolution chose robot soccer as the minimal closed loop for technology validation, persisting in this scenario for 20 years. Its K1 humanoid robot (priced at 39,900 RMB) is already sold on JD.com, and the T2 flagship model boasts high dynamic capabilities. The company does not bet on end-to-end large models but follows a layered deployment path: perception → decision → execution. It is also developing the Booster Studio development tool, aiming to build an operating system for the embodied agent ecosystem.
Cheng Hao believes that embodied large models will take another 5-10 years to mature; before that, operating systems and data flywheels are more pragmatic paths.
Jiuwen Symbiosis: An Open-Source Physical AI Framework
The openJiuwen community's open-source Jiuwen Symbiosis proposes a "situational awareness loop" architecture, where the cognitive layer and execution layer collaborate through a shared Workspace. Its core modules include multimodal perception, safe planning, physical execution, state observation, observation feedback, and spatial memory, supporting zero-shot cross-embodiment adaptation and long-horizon composite tasks.
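As a rough illustration of two layers cooperating through a shared workspace, consider the sketch below. It is not the Jiuwen Symbiosis API: the `Workspace` fields, the function names, and the fixed three-step plan are all assumptions made for this example, and real perception, planning, and execution are replaced by string placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Workspace:
    """Shared state read and written by both layers (hypothetical layout)."""
    goal: str
    plan: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)
    done: bool = False

def cognitive_step(ws: Workspace) -> None:
    """Cognitive layer: plan when nothing has happened yet, and mark the
    goal done once the plan is exhausted and feedback has arrived."""
    if not ws.plan and not ws.observations:
        ws.plan = ["locate cup", "grasp cup", "place cup"]  # placeholder plan
    elif not ws.plan:
        ws.done = True

def execution_step(ws: Workspace) -> None:
    """Execution layer: run the next planned step and write feedback back."""
    if ws.plan:
        step = ws.plan.pop(0)
        ws.observations.append(f"ok: {step}")

def run(goal: str, max_ticks: int = 10) -> Workspace:
    """Alternate the two layers until the cognitive layer reports success."""
    ws = Workspace(goal)
    for _ in range(max_ticks):
        cognitive_step(ws)
        if ws.done:
            break
        execution_step(ws)
    return ws
```

For example, `run("tidy the table")` finishes with `done` set and three feedback entries in `observations`. Neither layer calls the other directly; all coordination goes through the workspace, which is the property the architecture description emphasizes.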
The framework adopts an edge-cloud collaborative architecture: cloud-side LLM/VLM handles complex reasoning, while edge-side Ascend NPU and Kunpeng CPU handle real-time perception and execution, reducing power consumption and deployment costs.
Challenges and Future Directions
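Despite rapid progress, world models still face several challenges: real physical interaction data is expensive to acquire and limited in scale, simulation data suffers from the Sim2Real gap, and real-time inference efficiency, safety and interpretability, and standardized evaluation all remain open problems.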
In the future, world models are expected to integrate deeply with AI agents, driving embodied intelligence from labs into homes, factories, and cities. Lightweight models and edge deployment will be key to that transition.
Conclusion
World models are moving from theory to practice, from single-agent to multi-agent, and from virtual to physical. Whether it's LeCun's JEPA theory, Kairos's champion performance, γ-World's multi-player interaction, or industry explorations by ForceMind, Accelerated Evolution, and Jiuwen Symbiosis, they all point in the same direction: enabling AI to truly understand and act upon the physical world.
For developers, now is the best time to deeply understand world models, participate in open-source ecosystems, and validate technologies in real-world scenarios.
FAQ
What is the difference between a world model and a video generation model?
Video generation models aim to generate realistic future frames but lack understanding of physical laws and causality. World models, on the other hand, make predictions in an abstract representation space, actively ignoring noise and unpredictable details. Their core goal is to support reasoning and planning, not pixel-level generation.
Why is JEPA more suitable for world models than generative models?
Generative models must predict every data detail, making them poorly compatible with high-dimensional continuous data like video. JEPA only learns abstract representations and predicts states in the representation space, naturally fitting image, video, and sensor data while avoiding pixel-level blurring and distortion.
What is the biggest challenge facing world models today?
The data bottleneck is the primary challenge: real physical interaction data is expensive to acquire and limited in scale, while simulation data suffers from the Sim2Real gap. Additionally, real-time inference efficiency, safety interpretability, and standardized evaluation are pressing issues.
What are typical application scenarios for multi-agent world models?
Multi-robot collaboration (e.g., dual-arm manipulation, warehouse coordination), autonomous driving multi-vehicle interaction, multi-player games, embodied intelligence training, etc. γ-World has demonstrated good scalability by zero-shot generalizing from two-player training to four-player scenarios.
How can small teams get started in world model research?
It is recommended to start with open-source frameworks like Jiuwen Symbiosis or Dexbotic, validating algorithms in simulation environments. Focus on a specific scenario (e.g., robot soccer, tabletop manipulation) as a minimal closed loop to accumulate data and experience. Also, refer to toolchains like LangChain for rapid prototyping.
What is the relationship between world models and reinforcement learning?
LeCun suggests prioritizing model predictive control (MPC) for planning, using reinforcement learning to fine-tune the world model only when planning fails. The world model itself can serve as an environment model for reinforcement learning, providing low-cost, high-throughput training data.
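To show what planning with MPC over a world model looks like in its simplest form, here is a small sketch. Everything in it is an assumption made for illustration: the "world model" is a one-line function (`state + action`) rather than a learned predictor, the action set is three discrete values, and candidate action sequences are enumerated exhaustively, which only works for tiny problems. Practical systems typically use sampling-based or gradient-based optimizers instead.

```python
from itertools import product
from typing import Callable, List, Sequence

def plan_first_action(
    state: float,
    model: Callable[[float, float], float],
    cost: Callable[[float], float],
    actions: Sequence[float],
    horizon: int,
) -> float:
    """Roll every candidate action sequence through the model and return
    the first action of the cheapest sequence."""
    best_cost, best_first = float("inf"), actions[0]
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)      # imagined next state
            total += cost(s)     # accumulate predicted cost
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first

def run_mpc(start: float, goal: float, steps: int = 6) -> List[float]:
    """Receding-horizon loop: plan, execute one action, then replan."""
    model = lambda s, a: s + a            # stand-in for a learned world model
    cost = lambda s: (s - goal) ** 2      # squared distance to the goal
    trajectory = [start]
    for _ in range(steps):
        a = plan_first_action(trajectory[-1], model, cost,
                              actions=(-1.0, 0.0, 1.0), horizon=3)
        # In this toy setup the real environment behaves exactly like the model.
        trajectory.append(model(trajectory[-1], a))
    return trajectory
```

With `run_mpc(0.0, 3.0)`, the state moves one unit per step until it reaches the goal and then stays there. Only the first action of each plan is executed before replanning, which is what lets MPC correct for model error in settings where the model and the environment differ.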