Jingyu Li1,2*, Junjie Wu3*, Dongnan Hu4,2, Xiangkai Huang3, Bin Sun3†, Zhihui Hao3†,
Xianpeng Lang3, Xiatian Zhu5, Li Zhang1,2✉
1Fudan University 2Shanghai Innovation Institute 3Li Auto Inc. 4Tongji University 5University of Surrey
(*) Equal contribution. (†) Project leader. (✉) Corresponding author.
arXiv 2026
Jan. 15th, 2026: We released results on NAVSIM v2 navhard_two_stage!
Jan. 09th, 2026: We released our paper on arXiv. Code and models are coming soon. Please stay tuned! ☕️
- Release Paper
- Release results on navhard_two_stage
- Release Full Models
- Release Training/Evaluation Framework
- News
- Updates
- Abstract
- Getting Started
- Driving Pretraining Datasets
- Qualitative Results on NAVSIM Navtest
- Contact
- Acknowledgement
- Citation
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
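To make the scene-agent-goal hierarchy described above concrete, here is a minimal illustrative sketch in Python. All class and field names are hypothetical placeholders for exposition, not the actual SGDrive interface: it shows how multi-level driving context (scene, safety-critical agents, short-term goal) could be serialized into a compact textual representation for a VLM planner.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# NOTE: illustrative sketch only. Names and fields are assumptions,
# not the real SGDrive data structures.

@dataclass
class SceneContext:
    description: str            # e.g. "urban intersection, light rain"
    navigation_command: str     # e.g. "turn left"

@dataclass
class CriticalAgent:
    category: str                    # e.g. "vehicle", "pedestrian"
    position: Tuple[float, float]    # BEV position relative to ego, in meters
    behavior: str                    # e.g. "crossing", "yielding"

@dataclass
class DrivingState:
    scene: SceneContext
    agents: List[CriticalAgent] = field(default_factory=list)
    short_term_goal: str = ""        # e.g. "yield, then turn left"

def to_prompt(state: DrivingState) -> str:
    """Serialize the scene -> agents -> goal hierarchy into one compact prompt."""
    agent_lines = "\n".join(
        f"- {a.category} at ({a.position[0]:.1f}, {a.position[1]:.1f}) m, {a.behavior}"
        for a in state.agents
    )
    return (
        f"Scene: {state.scene.description} | command: {state.scene.navigation_command}\n"
        f"Safety-critical agents:\n{agent_lines}\n"
        f"Short-term goal: {state.short_term_goal}"
    )
```

The ordering mirrors the human driving cognition described above: scene context first, then safety-critical agents, then the short-term goal that precedes trajectory execution.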
- Download NAVSIM datasets following official instruction
- Preparation of SGDrive environment
- SGDrive Training and Evaluation
We are still working toward better results!
Results on NAVSIM v1 navtest
| Method | Model Size | Training Method | PDMS | Weight Download (coming soon) |
|---|---|---|---|---|
| SGDrive-VLM | 2B | Q&A SFT | 85.5 | Model |
| SGDrive-IL | 2B | SFT | 87.4 | Model |
| SGDrive-RL | 2B | RFT | 91.1 | Model |
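For readers unfamiliar with the metric: PDMS is NAVSIM's Predictive Driver Model Score. The sketch below reflects our understanding of the NAVSIM v1 definition (multiplicative penalty terms times a weighted average of sub-scores, all in [0, 1]); the weights shown are assumptions, so consult the official NAVSIM repository for the authoritative implementation.

```python
def pdm_score(nc: float, dac: float, ep: float, ttc: float, comfort: float) -> float:
    """Sketch of the PDM score (all inputs in [0, 1]).

    nc      -- no at-fault collision term (hard penalty)
    dac     -- drivable area compliance term (hard penalty)
    ep      -- ego progress sub-score
    ttc     -- time-to-collision sub-score
    comfort -- comfort sub-score

    Weights (5/5/2) follow our reading of the NAVSIM codebase; verify
    against the official repository before relying on them.
    """
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted
```

Because the penalty terms are multiplicative, a single at-fault collision or drivable-area violation drives the score to zero regardless of progress or comfort.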
Results on NAVSIM v2 navtest
| Method | Model Size | Training Stage | EPDMS | Weight Download (coming soon) |
|---|---|---|---|---|
| SGDrive-IL | 2B | SFT | 86.2 | Model |
Results on NAVSIM v2 navhard_two_stage (* denotes results reproduced with the official code repository or official checkpoint.)
| Method | Model Size | Training Stage | EPDMS |
|---|---|---|---|
| DiffusionDrive* | -- | IL | 24.2 |
| GTRS-DP* | -- | IL | 23.8 |
| GuideFlow | -- | IL | 27.1 |
| ReCogDrive-IL* | 2B | SFT | 26.0 |
| SGDrive-IL | 2B | SFT | 27.1 |
Our qualitative results demonstrate strong alignment with ground truth across the scene–agent–goal hierarchy, indicating rich driving-world knowledge and reliable short-horizon representation.
SGDrive adaptively perceives the driving scene according to the ego-vehicle's motion state and navigation command. This demonstrates a more structured and effective representation of driving-relevant world knowledge, providing strong evidence that SGDrive successfully elicits the VLM's world-modeling ability.
We compare SGDrive (SFT) with ReCogDrive, both of which leverage structured driving-world knowledge and can extrapolate it reasonably to ensure safe and rational driving behavior. More visualizations are in the supplementary material.
If you have any questions, please contact Jingyu Li via email (jingyuli24@m.fudan.edu.cn).
SGDrive is greatly inspired by the following outstanding contributions to the open-source community: NAVSIM, ReCogDrive, GR00T.
If you find SGDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@article{li2026sgdrive,
title={SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving},
author={Li, Jingyu and Wu, Junjie and Hu, Dongnan and Huang, Xiangkai and Sun, Bin and Hao, Zhihui and Lang, Xianpeng and Zhu, Xiatian and Zhang, Li},
journal={arXiv preprint arXiv:2601.05640},
year={2026}
}


