This is a companion blog post to the paper Learning to Drive from a World Model with some additional perspectives and visualizations.
We propose an end-to-end architecture for training driving policies in an on-policy manner using real-world driving data and simulation. We introduce two simulator types: one based on reprojective simulation and another using a learned world model.
More importantly, these end-to-end driving policies are currently deployed in openpilot and perform well in the real world.
The World Model-based simulation has the advantage of being a completely end-to-end, general-purpose method that scales with increased computation.
By now, most autonomous driving labs agree that building a fully autonomous driving policy based on hard-coded rules and engineered features is doomed to fail. The only realistic way to build an autonomous driving policy that scales to arbitrarily complex and diverse environments is to use methods that scale arbitrarily with computation and data: search and learning.
A key challenge in end-to-end learning is training a policy that performs well even though the i.i.d. assumption made by most supervised learning algorithms, such as Behavior Cloning, is violated at deployment time. In the real world, the policy's predictions influence its future observations. Small errors accumulate over time, compounding until the system is driven into states it never encountered during pure imitation learning training.
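To see this effect in miniature, here is a toy sketch (not from the paper): under a simple kinematic model, a small constant heading bias makes the lateral error grow quadratically rather than linearly.

```python
import math

# Toy illustration (not from the paper): a small constant heading bias
# compounds into quadratic lateral drift under a simple kinematic model.
dt, speed = 0.05, 20.0       # 20 Hz control steps, 20 m/s forward speed
heading_bias = 0.0002        # rad of heading error added every step

heading, lateral_error = 0.0, 0.0
for step in range(1, 201):   # 10 seconds of driving
    heading += heading_bias                            # grows linearly
    lateral_error += speed * math.sin(heading) * dt    # grows quadratically
    if step % 40 == 0:
        print(f"t = {step * dt:4.1f} s   lateral error = {lateral_error:.2f} m")
```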
In a previous blog post, we showed how a pure imitation learning policy does not recover from its mistakes, leading to a slow drift away from the desired trajectory. To overcome this, the driving policy needs to be trained on-policy, allowing it to learn from its own interactions with the environment and to recover from its own mistakes. However, running on-policy learning in the real world is costly and impractical.
Given a dense depth map, a 6 DOF pose, and an image, we can render a new image by reprojecting the 3D points in the depth map to a new desired pose. This process is called Reprojective Simulation.
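As a rough sketch of this step, assuming a pinhole camera with known intrinsics K and a small pose change (so the original depth map can be reused for backward warping), one could write something like the following. This is illustrative, not the openpilot implementation.

```python
import numpy as np
import cv2  # used only for the final image resampling

def reproject(image, depth, K, R, t):
    """Render `image` as seen from a new camera pose.

    Backward-warping sketch: (R, t) maps points from the new camera frame
    back into the original camera frame. For small pose changes we reuse
    the original depth map as an approximation of depth in the new view.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3xN homogeneous pixels

    # Back-project target pixels to 3D using the (approximate) depth.
    pts_new = np.linalg.inv(K) @ (pix * depth.reshape(1, -1))

    # Transform the points into the original camera frame.
    pts_orig = R @ pts_new + t.reshape(3, 1)

    # Project into the original image plane and sample the original image there.
    proj = K @ pts_orig
    uv = (proj[:2] / proj[2:]).reshape(2, h, w).astype(np.float32)
    return cv2.remap(image, uv[0], uv[1], interpolation=cv2.INTER_LINEAR)
```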
We shipped a model trained end-to-end with reprojective simulation to our users for lateral planning in openpilot 0.8.15, and for longitudinal planning in openpilot 0.9.0.
We talk extensively about the limitations of classical reprojective simulation in Learning a Driving Simulator | COMMA_CON 2023, and in Section 3 of the paper. They can be summarized as:
World Models
World Models can take many forms. The key idea is to compress the state into a lower-dimensional latent representation using a "compressor model," and to model the dynamics of that latent space using a "dynamics model."
The current system is based on the Stable Diffusion image VAE.
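A minimal sketch of this compressor/dynamics split might look like the following; the layers, sizes, and names are illustrative stand-ins, not the actual architecture.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Illustrative compressor + dynamics split; sizes and layers are made up."""

    def __init__(self, latent_dim=512, nhead=8, num_layers=6):
        super().__init__()
        # "Compressor model": maps a frame to a low-dimensional latent.
        # In the actual system this role is played by the Stable Diffusion image VAE
        # (the decoder, which maps latents back to pixels, is omitted here).
        self.compressor = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # "Dynamics model": predicts the next latent from a context of past latents.
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=nhead,
                                           batch_first=True)
        self.dynamics = nn.TransformerEncoder(layer, num_layers=num_layers)

    def encode(self, frames):           # frames: (batch, 3, H, W)
        return self.compressor(frames)  # (batch, latent_dim)

    def next_latent(self, past_latents):           # (batch, T, latent_dim)
        return self.dynamics(past_latents)[:, -1]  # latent of the next frame
```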
In order to be used as a simulator for training driving policies, the World Model also needs to provide an Action Ground Truth, i.e. the ideal curvature and acceleration given the current state. To do so, we add a "Plan Head" to the dynamics model, which predicts the trajectory to take.
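Conceptually, the Plan Head is a small decoder on top of the dynamics model's hidden state that regresses a short horizon of curvature and acceleration values. A hedged sketch, where the horizon length and layer sizes are made up:

```python
import torch.nn as nn

PLAN_HORIZON = 33  # illustrative number of future time steps in the plan

class PlanHead(nn.Module):
    """Illustrative "Plan Head": regresses a short horizon of
    (curvature, acceleration) pairs from the dynamics model's hidden state."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, PLAN_HORIZON * 2),  # 2 = (curvature, acceleration)
        )

    def forward(self, hidden):                        # (batch, hidden_dim)
        return self.mlp(hidden).view(-1, PLAN_HORIZON, 2)
```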
The "Plan Head" is trained using the human path. But only giving the past states to the world model is not enough to make it "recover," it essentially suffers from the off-policy training problems described above.
To overcome this, we "Anchor" the world model to a future state by providing the state at some fixed time step in the future. Knowing where the car is going to be allows the world model to recover from its mistakes and to predict images and plans that converge to that future state.
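One way to picture the anchoring is as an extra conditioning input: the dynamics model sees the past latents plus the latent of the anchor state, and the plan head predicts a trajectory consistent with both. In the sketch below, appending the anchor as a single extra token is an assumption, not the paper's exact mechanism.

```python
import torch

def anchored_step(dynamics, plan_head, past_latents, future_anchor):
    """Predict the next latent and a plan, conditioned on past latents plus an
    "anchor": the latent of a state at a fixed time step in the future.

    Illustrative only -- the anchor is appended as one extra token here,
    which is an assumption rather than the paper's exact mechanism.
    """
    # (batch, T + 1, latent_dim): past context followed by the future anchor
    tokens = torch.cat([past_latents, future_anchor.unsqueeze(1)], dim=1)
    hidden = dynamics(tokens)[:, -1]   # summary of past context + anchor
    plan = plan_head(hidden)           # trajectory that converges to the anchor
    next_latent = hidden               # (sketch) reused as the next-frame latent
    return next_latent, plan
```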
More implementation details are given in Section 4 of the paper.
Similar to the reprojective simulator, we can control the world model by providing a desired 6 DOF pose.
Both driving simulators are used to train a driving model using On-Policy Learning.
In practice, we use distributed and asynchronous rollout data collection and model updates, similar to IMPALA.
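At a high level, the loop alternates between rolling the policy out inside the simulator and updating it against the simulator's Action Ground Truth. Below is a simplified, synchronous sketch; the simulator interface names are assumptions, and the real setup is distributed and asynchronous.

```python
import torch

def train_on_policy(policy, simulator, optimizer, rollout_len=64, num_iters=1000):
    """Simplified, synchronous sketch of the on-policy loop.

    The policy drives inside the simulator (reprojective or world-model based),
    so its own mistakes show up in the states it is trained on. The simulator's
    plan (Action Ground Truth) supervises the policy at every visited state.
    """
    for _ in range(num_iters):
        obs = simulator.reset()
        losses = []
        for _ in range(rollout_len):
            action = policy(obs)                       # curvature, acceleration
            target = simulator.action_ground_truth()   # ideal plan at this state
            losses.append(torch.nn.functional.mse_loss(action, target))
            obs = simulator.step(action)               # next state depends on the policy
        optimizer.zero_grad()
        torch.stack(losses).mean().backward()
        optimizer.step()
```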
For attribution in academic contexts, please cite this work as
"Learning to Drive from a World Model", Autonomy team, comma.ai, 2025.
BibTeX citation
@misc{yousfi2025learningdriveworldmodel,
  title={Learning to Drive from a World Model},
  author={Mitchell Goff and Greg Hogan and George Hotz and Armand du Parc Locmaria and Kacper Raczy and Harald Schäfer and Adeeb Shihadeh and Weixing Zhang and Yassine Yousfi},
  year={2025},
  eprint={2504.19077},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.19077},
}