Overview figure. Top: our pretraining design philosophy highlights that prior methods often lack key cues in their input for accurate action inference. This leads to target action distributions A_t(·) with high variance or non-smoothness, which hurts pretraining performance. A rough analysis shows that in the DROID dataset, 67% of the samples have the robot's base occluded, making the coordinate frame ambiguous. Bottom: we verify our method in both simulated and real-world robotic settings and report performance for the OpenVLA baseline and our 4D-VLA approach.

Demo video: `vla-video.mp4`
If you find this project or dataset helpful, please consider citing our paper:

```bibtex
@article{zhang2025vla,
  title={4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration},
  author={Zhang, Jiahui and Chen, Yurui and Xu, Yueming and Huang, Ze and Zhou, Yanpeng and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2506.22242},
}
```

This repository is organized as a pure 4D-VLA (InternVL-based) codebase:
- Core training/model code: `internvl_chat/`
- Experiments / evaluation entrypoints: `experiments/`
If you see `OSError: [Errno 18] Invalid cross-device link` during `pip install -e`, point pip's temp and cache directories at this repo:

```bash
export TMPDIR="$PWD/.tmp" PIP_CACHE_DIR="$PWD/.pip-cache"
mkdir -p "$TMPDIR" "$PIP_CACHE_DIR"
pip install -e internvl_chat
```

If you run into missing dependencies, install the full list:

```bash
pip install -r internvl_chat/requirements.txt
pip install -e internvl_chat
```

The typical workflow is:

- Pretrain (DROID)
- Generate finetune data (Libero / RLDS → jsonl + RGBD frames)
- Finetune (Franka or Libero)
- Eval (Libero simulation)
Pretrain (DROID):

- Canonical entry: `internvl_chat/run_pretrain.sh`
- Trainer: `internvl_chat/internvl/train/vla_pretrain_droid_mb.py`
- Dataset meta: `internvl_chat/shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json`
Data:

- Processed DROID data is hosted on ModelScope: `hutchinsonian/droid_processed`
- The default meta expects (relative to `internvl_chat/`):
  - Images: `../data/droid/droid_processed_data`
  - Annotations: `../data/droid/processed_jsonl_folder_20ws_5bs_simple`
Example (download from ModelScope, then link into the expected layout):

```bash
mkdir -p data/droid

# Option A: via git-lfs
# sudo apt-get install git-lfs  # if needed
# git lfs install
git lfs clone https://www.modelscope.cn/hutchinsonian/droid_processed.git data/droid/_modelscope_droid_processed

# Option B: via Python (requires `pip install modelscope`)
# python - <<'PY'
# from modelscope.hub.snapshot_download import snapshot_download
# snapshot_download('hutchinsonian/droid_processed', local_dir='data/droid/_modelscope_droid_processed', local_dir_use_symlinks=False)
# PY

ln -sfn "$PWD/data/droid/_modelscope_droid_processed/droid_processed_data" data/droid/droid_processed_data
ln -sfn "$PWD/data/droid/_modelscope_droid_processed/processed_jsonl_folder_20ws_5bs_simple" data/droid/processed_jsonl_folder_20ws_5bs_simple
```

If your local paths differ, edit `internvl_chat/shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json` (`root` / `annotation`) accordingly.
Run pretraining:

```bash
bash internvl_chat/run_pretrain.sh
```

Common overrides:

```bash
GPUS=8 \
MODEL_NAME_OR_PATH=../hugg_models/InternVL2-4B \
META_PATH=shell/exp_data/droid_pretrain_mb_20ws_5bs_simple.json \
OUTPUT_DIR=work_dirs/4dvla/pretrain_droid_mb_20ws_5bs \
bash internvl_chat/run_pretrain.sh
```

Generate finetune data (Libero / RLDS → jsonl + RGBD frames):

Scripts:
- `internvl_chat/tools/prepare_libero_memory_bank.py` (wrapper)
- `internvl_chat/tools/prepare_libero_memory_bank_mb.py` (implementation)
Example:

```bash
python internvl_chat/tools/prepare_libero_memory_bank.py \
  --data-root-dir data \
  --data-mix libero_object_no_noops_3d

# or equivalently:
# python internvl_chat/tools/prepare_libero_memory_bank_mb.py --data-root-dir data --data-mix libero_object_no_noops_3d
```

Notes:
- This step typically requires a separate env (TensorFlow/TFDS/dlimp). See `docs/ENV_SETUP.md`.
- Use `--dry-run` to only validate dataset access (no outputs are written).
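For example, to check dataset access before writing anything (assuming the wrapper accepts `--dry-run` together with the arguments shown above):

```bash
# Validate that the Libero / RLDS dataset can be read; no jsonl or RGBD frames are produced
python internvl_chat/tools/prepare_libero_memory_bank.py \
  --data-root-dir data \
  --data-mix libero_object_no_noops_3d \
  --dry-run
```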
Finetune (Franka or Libero):

- Franka:
  - Shell: `internvl_chat/run_finetune.sh`
  - Trainer: `internvl_chat/internvl/train/finetune_franka_baseline_pt_mb_aug_dep.py`
- Libero:
  - Shell: `internvl_chat/run_finetune_libero.sh`
  - Trainer: `internvl_chat/internvl/train/run_finetune_libero.py`
Franka:

```bash
MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
META_PATH=/path/to/franka_meta.json \
bash internvl_chat/run_finetune.sh
```

Libero:

```bash
MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
bash internvl_chat/run_finetune_libero.sh
```

Finetune defaults: `run_finetune*.sh` default to `dynamic_image_size=False` and `use_thumbnail=False`. Override via `USE_DYNAMIC_IMAGE_SIZE=True` / `USE_THUMBNAIL=True`.
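For example, to enable both overrides when finetuning on Libero (checkpoint path illustrative):

```bash
USE_DYNAMIC_IMAGE_SIZE=True \
USE_THUMBNAIL=True \
MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
bash internvl_chat/run_finetune_libero.sh
```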
Eval (Libero simulation):

Entrypoint:

- Python: `experiments/robot/libero/run_libero_eval_ablation.py`

Scripts:

- Goal tasks (parallel): `experiments/robot/libero/scripts/eval_ablation_exp4_parallel_goal_seed.sh`
- Spatial tasks (parallel): `experiments/robot/libero/scripts/eval_ablation_exp4_parallel_spatial_seed.sh`
- Memory-bank eval launcher: `vla-scripts/exp_eval/eval_whole_mb.sh`
Example:

```bash
PRETRAINED_CHECKPOINT=/path/to/finetune/checkpoint \
bash experiments/robot/libero/scripts/eval_ablation_exp4_parallel_goal_seed.sh 0
```
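The spatial-task launcher can presumably be invoked the same way; the trailing GPU index is an assumption carried over from the goal-task example:

```bash
# Same pattern as above, but for the spatial tasks (GPU index 0 assumed)
PRETRAINED_CHECKPOINT=/path/to/finetune/checkpoint \
bash experiments/robot/libero/scripts/eval_ablation_exp4_parallel_spatial_seed.sh 0
```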
Notes:

- Eval requires external deps (LIBERO + robosuite). If missing, the code will error with a clear message at runtime.
- `draccus` is used for CLI parsing; install it with the project requirements.
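Putting the pieces together, a minimal end-to-end run might look like the sketch below. It simply chains the commands from the sections above; all checkpoint paths are illustrative.

```bash
# 1) Pretrain on DROID (defaults from run_pretrain.sh and the meta file above)
bash internvl_chat/run_pretrain.sh

# 2) Generate Libero finetune data (run in the separate TF/TFDS env; see docs/ENV_SETUP.md)
python internvl_chat/tools/prepare_libero_memory_bank.py --data-root-dir data --data-mix libero_object_no_noops_3d

# 3) Finetune on Libero from a pretraining checkpoint (path illustrative)
MODEL_NAME_OR_PATH=/path/to/pretrain/checkpoint-XXXXX \
bash internvl_chat/run_finetune_libero.sh

# 4) Evaluate in Libero simulation (goal tasks, GPU 0)
PRETRAINED_CHECKPOINT=/path/to/finetune/checkpoint \
bash experiments/robot/libero/scripts/eval_ablation_exp4_parallel_goal_seed.sh 0
```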