
SGDrive

SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

Jingyu Li1,2*, Junjie Wu3*, Dongnan Hu4,2, Xiangkai Huang3, Bin Sun3†, Zhihui Hao3†,
Xianpeng Lang3, Xiatian Zhu5, Li Zhang1,2✉

1Fudan University  2Shanghai Innovation Institute  3Li Auto Inc.  4Tongji University  5University of Surrey

(*) Equal contribution. (†) Project leader. (✉) Corresponding author.

arXiv 2026

Paper PDF

News

  • Jan. 15th, 2026: We released results on NAVSIM v2 navhard_two_stage!
  • Jan. 09th, 2026: We released our paper on arXiv. Code and models are coming soon. Please stay tuned! ☕️

Updates

  • Release Paper
  • Release results on navhard_two_stage
  • Release Full Models
  • Release Training/Evaluation Framework

Table of Contents

  • News
  • Updates
  • Abstract
  • Getting Started
  • Qualitative Results on NAVSIM Navtest
  • Contact
  • Acknowledgement
  • Citation

Abstract

Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
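To make the hierarchy concrete, the sketch below shows one way the scene-agent-goal decomposition could be organized as data and serialized into a compact planning context. It is purely illustrative: every class name, field, and the prompt format here are assumptions made for this README, not the released SGDrive interface (code and models are still to come).

```python
# Illustrative sketch of the scene-agent-goal hierarchy described in the abstract.
# All class names and fields are hypothetical; the official SGDrive code is not yet released.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneContext:
    """Coarse description of the overall driving environment (top of the hierarchy)."""
    weather: str
    road_type: str
    ego_speed_mps: float
    navigation_command: str  # e.g. "turn left", "keep forward"


@dataclass
class CriticalAgent:
    """A safety-critical agent the planner should attend to (middle of the hierarchy)."""
    category: str                     # e.g. "vehicle", "pedestrian"
    position_xy: Tuple[float, float]  # meters, ego-centric BEV frame
    velocity_xy: Tuple[float, float]  # meters per second
    intent: str                       # e.g. "crossing", "yielding"


@dataclass
class ShortTermGoal:
    """Short-horizon goal formulated before trajectory execution (bottom of the hierarchy)."""
    target_xy: Tuple[float, float]    # goal point in the ego frame, meters
    target_speed_mps: float
    maneuver: str                     # e.g. "lane keep", "nudge right"


def build_planning_context(scene: SceneContext,
                           agents: List[CriticalAgent],
                           goal: ShortTermGoal) -> str:
    """Serialize the scene-agent-goal hierarchy into one compact textual context."""
    agent_lines = [
        f"- {a.category} at {a.position_xy}, velocity {a.velocity_xy}, intent: {a.intent}"
        for a in agents
    ]
    return "\n".join([
        f"Scene: {scene.road_type}, {scene.weather}, ego speed {scene.ego_speed_mps:.1f} m/s, "
        f"command: {scene.navigation_command}",
        "Critical agents:",
        *agent_lines,
        f"Goal: {goal.maneuver} toward {goal.target_xy} at {goal.target_speed_mps:.1f} m/s",
    ])


if __name__ == "__main__":
    scene = SceneContext("clear", "urban intersection", 6.2, "turn left")
    agents = [CriticalAgent("pedestrian", (8.0, 3.5), (-1.2, 0.0), "crossing")]
    goal = ShortTermGoal((12.0, 6.0), 4.0, "yield, then turn left")
    print(build_planning_context(scene, agents, goal))
```

In the full model, a context like this would condition the trajectory planner; here it only demonstrates how multi-level information can be packed into a compact, structured format.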

Getting Started

Checkpoint

We are still working toward achieving better results!

Results on NAVSIM v1 navtest

| Method | Model Size | Training Method | PDMS | Weight Download (coming soon) |
|---|---|---|---|---|
| SGDrive-VLM | 2B | Q&A SFT | 85.5 | Model |
| SGDrive-IL | 2B | SFT | 87.4 | Model |
| SGDrive-RL | 2B | RFT | 91.1 | Model |
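For reference, PDMS is the aggregate driving score computed by the NAVSIM benchmark from per-scenario sub-metrics. The snippet below is a hedged sketch of that aggregation, assuming the NAVSIM v1 weighting (hard multiplicative penalties for at-fault collisions and drivable-area violations, plus a weighted average of ego progress, time-to-collision, and comfort). The official NAVSIM repository remains the authoritative definition; EPDMS in NAVSIM v2 extends this with additional terms not shown here.

```python
# Hedged sketch of how a PDMS-style score aggregates NAVSIM sub-metrics.
# Terms and weights follow our reading of the NAVSIM v1 PDM score; treat the
# official NAVSIM repository as the authoritative implementation.
from dataclasses import dataclass


@dataclass
class SubScores:
    no_at_fault_collision: float      # 0 or 1 (hard penalty)
    drivable_area_compliance: float   # 0 or 1 (hard penalty)
    ego_progress: float               # in [0, 1]
    time_to_collision: float          # in [0, 1]
    comfort: float                    # in [0, 1]


def pdm_score(s: SubScores) -> float:
    """Multiplicative hard penalties times a weighted average of the soft terms."""
    weighted = (5.0 * s.ego_progress + 5.0 * s.time_to_collision + 2.0 * s.comfort) / 12.0
    return s.no_at_fault_collision * s.drivable_area_compliance * weighted


if __name__ == "__main__":
    # A scenario with full compliance and strong soft scores yields a high PDMS.
    print(round(pdm_score(SubScores(1.0, 1.0, 0.92, 0.95, 1.0)), 3))
```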

Results on NAVSIM v2 navtest

| Method | Model Size | Training Stage | EPDMS | Weight Download (coming soon) |
|---|---|---|---|---|
| SGDrive-IL | 2B | SFT | 86.2 | Model |

Results on NAVSIM v2 navhard_two_stage (* denotes results reproduced with the official code repository or official checkpoint)

| Method | Model Size | Training Stage | EPDMS |
|---|---|---|---|
| DiffusionDrive* | -- | IL | 24.2 |
| GTRS-DP* | -- | IL | 23.8 |
| GuideFlow | -- | IL | 27.1 |
| ReCogDrive-IL* | 2B | SFT | 26.0 |
| SGDrive-IL | 2B | SFT | 27.1 |

Qualitative Results on NAVSIM Navtest

Our qualitative results demonstrate strong alignment with ground truth across the scene–agent–goal hierarchy, indicating rich driving-world knowledge and reliable short-horizon representation.

SGDrive adaptively perceives the driving scene according to the ego-vehicle's motion state and navigation command. This demonstrates a more structured and effective representation of driving-relevant world knowledge, providing strong evidence that SGDrive successfully elicits the VLM's world-modeling ability.

We compare SGDrive (SFT) with ReCogDrive, both of which leverage structured driving-world knowledge and can extrapolate it reasonably to ensure safe and rational driving behavior. More visualizations are in the supplementary material.

Contact

If you have any questions, please contact Jingyu Li via email (jingyuli24@m.fudan.edu.cn).

Acknowledgement

SGDrive is greatly inspired by the following outstanding contributions to the open-source community: NAVSIM, ReCogDrive, GR00T.

Citation

If you find SGDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entries.

@article{li2026sgdrive,
  title={SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving},
  author={Li, Jingyu and Wu, Junjie and Hu, Dongnan and Huang, Xiangkai and Sun, Bin and Hao, Zhihui and Lang, Xianpeng and Zhu, Xiatian and Zhang, Li},
  journal={arXiv preprint arXiv:2601.05640},
  year={2026}
}

@misc{li2026sgdrivescenetogoalhierarchicalworld,
      title={SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving}, 
      author={Jingyu Li and Junjie Wu and Dongnan Hu and Xiangkai Huang and Bin Sun and Zhihui Hao and Xianpeng Lang and Xiatian Zhu and Li Zhang},
      year={2026},
      eprint={2601.05640},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.05640}, 
}
