4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

Accepted to NeurIPS 2025.

arXiv: https://arxiv.org/abs/2506.22242
Jiahui Zhang1*, Yurui Chen1*, Yueming Xu1, Ze Huang1, Yanpeng Zhou2, Yu-Jie Yuan2, Xinyue Cai2, Guowei Huang2, Xingyue Quan2, Hang Xu2, Li Zhang1
1Fudan University  2Huawei Noah’s Ark Lab 

Top: Our pretraining design philosophy highlights that prior methods often lack key input cues needed for accurate action inference. As a result, the target action distribution A_t(·) exhibits high variance or non-smoothness, which degrades pretraining performance. A rough analysis of the DROID dataset shows that the robot's base is occluded in 67% of samples, leading to inconsistent coordinate frames.
Bottom: We verify our method in both simulated and real-world robotic settings, and report performance for the OpenVLA baseline and our 4D-VLA approach.

🎥 Demo Video

vla-video.mp4

📚 Bibtex

If you find this project or dataset helpful, please consider citing our paper:

@article{zhang2025vla,
    title={4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration},
    author={Zhang, Jiahui and Chen, Yurui and Xu, Yueming and Huang, Ze and Zhou, Yanpeng and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
    year={2025},
    journal={arXiv preprint arXiv:2506.22242},
}

Codebase

This repository is organized as a pure 4D-VLA (InternVL-based) codebase:

  • Core training/model code: internvl_chat/
  • Experiments / evaluation entrypoints: experiments/

Install

If you see OSError: [Errno 18] Invalid cross-device link during pip install -e, point pip's temp and cache directories at paths inside this repo:

export TMPDIR="$PWD/.tmp" PIP_CACHE_DIR="$PWD/.pip-cache"
mkdir -p "$TMPDIR" "$PIP_CACHE_DIR"
pip install -e internvl_chat

If you run into missing dependencies, install the full list:

pip install -r internvl_chat/requirements.txt
pip install -e internvl_chat
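
As a quick sanity check that the editable install worked (assuming the package is importable as internvl, per the internvl_chat/internvl/ layout):

python -c "import internvl; print(internvl.__file__)"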

End-to-End Pipeline (4 Steps)

  1. Pretrain (DROID)
  2. Generate finetune data (Libero / RLDS → jsonl + RGBD frames)
  3. Finetune (Franka or Libero)
  4. Eval (Libero simulation)
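
Put end to end, one pass through the pipeline looks roughly like this (a condensed sketch of the commands detailed in the steps below; adjust paths to your setup):

# 1) Pretrain on DROID
bash internvl_chat/run_pretrain.sh

# 2) Generate finetune data (Libero / RLDS)
python internvl_chat/tools/prepare_libero_memory_bank.py \
  --data-root-dir data --data-mix libero_object_no_noops_3d

# 3) Finetune (Libero variant shown; see step 3 for Franka)
MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
  bash internvl_chat/run_finetune_libero.sh

# 4) Eval (Libero simulation)
PRETRAINED_CHECKPOINT=/path/to/finetune/checkpoint \
  bash experiments/robot/libero/scripts/eval_ablation_exp4_parallel_goal_seed.sh 0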

1) Pretrain (DROID)

  • Canonical entry: internvl_chat/run_pretrain.sh
  • Trainer: internvl_chat/internvl/train/vla_pretrain_droid_mb.py
  • Dataset meta: internvl_chat/shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json

Data:

  • Processed DROID data is hosted on ModelScope: hutchinsonian/droid_processed
  • The default meta expects (relative to internvl_chat/):
    • Images: ../data/droid/droid_processed_data
    • Annotations: ../data/droid/processed_jsonl_folder_20ws_5bs_simple

Example (download from ModelScope, then link into the expected layout):

mkdir -p data/droid

# Option A: via git-lfs
# sudo apt-get install git-lfs  # if needed
# git lfs install
git lfs clone https://www.modelscope.cn/hutchinsonian/droid_processed.git data/droid/_modelscope_droid_processed

# Option B: via Python (requires `pip install modelscope`)
# python - <<'PY'
# from modelscope.hub.snapshot_download import snapshot_download
# snapshot_download('hutchinsonian/droid_processed', local_dir='data/droid/_modelscope_droid_processed', local_dir_use_symlinks=False)
# PY

ln -sfn "$PWD/data/droid/_modelscope_droid_processed/droid_processed_data" data/droid/droid_processed_data
ln -sfn "$PWD/data/droid/_modelscope_droid_processed/processed_jsonl_folder_20ws_5bs_simple" data/droid/processed_jsonl_folder_20ws_5bs_simple
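
To confirm the links resolve to the expected layout before launching:

# Both should list files; a dangling link will make ls error out
ls data/droid/droid_processed_data | head
ls data/droid/processed_jsonl_folder_20ws_5bs_simple | head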

If your local paths differ, edit internvl_chat/shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json (root / annotation) accordingly.
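
A minimal sketch for inspecting what the meta currently points at (assuming each entry carries root and annotation fields, as the note above suggests):

# Print root / annotation for every dataset entry in the meta
python - <<'PY'
import json
with open('internvl_chat/shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json') as f:
    meta = json.load(f)
for name, cfg in meta.items():
    print(name, '| root:', cfg.get('root'), '| annotation:', cfg.get('annotation'))
PY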

Launch pretraining:

bash internvl_chat/run_pretrain.sh

Common overrides:

GPUS=8 \
MODEL_NAME_OR_PATH=../hugg_models/InternVL2-4B \
META_PATH=shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json \
OUTPUT_DIR=work_dirs/4dvla/pretrain_droid_mb_20ws_5bs \
bash internvl_chat/run_pretrain.sh

2) Generate finetune data (Libero / RLDS)

Scripts:

  • internvl_chat/tools/prepare_libero_memory_bank.py (wrapper)
  • internvl_chat/tools/prepare_libero_memory_bank_mb.py (implementation)

Example:

python internvl_chat/tools/prepare_libero_memory_bank.py \
  --data-root-dir data \
  --data-mix libero_object_no_noops_3d
#
# or equivalently:
# python internvl_chat/tools/prepare_libero_memory_bank_mb.py --data-root-dir data --data-mix libero_object_no_noops_3d

Notes:

  • This step typically requires a separate environment (TensorFlow / TFDS / dlimp). See docs/ENV_SETUP.md.
  • Use --dry-run to validate dataset access only (no outputs are written); see the example below.
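
A dry-run invocation, combining the flags shown above:

python internvl_chat/tools/prepare_libero_memory_bank.py \
  --data-root-dir data \
  --data-mix libero_object_no_noops_3d \
  --dry-run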

3) Finetune (Franka or Libero)

  • Franka:
    • Shell: internvl_chat/run_finetune.sh
    • Trainer: internvl_chat/internvl/train/finetune_franka_baseline_pt_mb_aug_dep.py
  • Libero:
    • Shell: internvl_chat/run_finetune_libero.sh
    • Trainer: internvl_chat/internvl/train/run_finetune_libero.py

Example (Franka):

MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
META_PATH=/path/to/franka_meta.json \
  bash internvl_chat/run_finetune.sh

Example (Libero):

MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
  bash internvl_chat/run_finetune_libero.sh

Finetune defaults:

  • The run_finetune*.sh scripts default to dynamic_image_size=False and use_thumbnail=False. Override via USE_DYNAMIC_IMAGE_SIZE=True / USE_THUMBNAIL=True (see the example below).
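
For example, to enable both dynamic image size and the thumbnail tile in a Franka run:

USE_DYNAMIC_IMAGE_SIZE=True USE_THUMBNAIL=True \
MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
  bash internvl_chat/run_finetune.sh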

4) Eval (Libero simulation)

Entrypoint:

  • Python: experiments/robot/libero/run_libero_eval_ablation.py

Scripts:

  • Goal tasks (parallel): experiments/robot/libero/scripts/eval_ablation_exp4_parallel_goal_seed.sh
  • Spatial tasks (parallel): experiments/robot/libero/scripts/eval_ablation_exp4_parallel_spatial_seed.sh
  • Memory-bank eval launcher: vla-scripts/exp_eval/eval_whole_mb.sh

Example:

PRETRAINED_CHECKPOINT=/path/to/finetune/checkpoint \
  bash experiments/robot/libero/scripts/eval_ablation_exp4_parallel_goal_seed.sh 0
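
The spatial-task script takes the same inputs, assuming it follows the same calling convention as the goal-task launcher:

PRETRAINED_CHECKPOINT=/path/to/finetune/checkpoint \
  bash experiments/robot/libero/scripts/eval_ablation_exp4_parallel_spatial_seed.sh 0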

Notes:

  • Eval requires external dependencies (LIBERO + robosuite). If they are missing, the code errors at runtime with a clear message; one common way to install them is sketched below.
  • draccus is used for CLI parsing; it is installed with the project requirements.
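
A sketch of installing the external eval dependencies (not pinned by this repo; check the upstream projects for the versions you need):

# LIBERO from its upstream repo, robosuite from PyPI
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install robosuite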
