The 3D data-curation pipeline that turns raw EPIC-KITCHENS-100 videos into per-object 6-DoF trajectory extractions.
This is the data pipeline behind ObjectForesight. It produces the extractions released as the raivn/ObjectForesight-EPIC dataset, which trains the model in RustinS/ObjectForesight.
Run step1→step10 in order. Each step is sharded (--num_shards N --shard_idx i) so it parallelizes across many GPUs, and runs in the conda env shown below (set up in Installation).
| Step | Script | What it does | Component | Env |
|---|---|---|---|---|
| 1 | step1_split.py | Split EPIC videos into action clips (by narration_id) | ffmpeg/decord | fpose |
| 2 | step2_egohos.py | Hand & object segmentation | EgoHOS | egohos |
| 3 | step3_filtering.py | Drop low-quality sequences | none | fpose |
| 4 | step4_sam.py | Propagate/refine object masks | SAM 2 | sam |
| 5 | step5_obj_filter.py | Verify object tracks (InternVL); write moved_by_hand.txt | InternVL | fpose |
| 6 | step6_frame_filter.py | Filter crops/masks by visibility (InternVL yes/partial/no) | InternVL | fpose |
| 7 | step7_vas.py | Amodal mask completion | Diffusion-VAS | vas |
| 8 | step8_trellis.py | Image-to-3D object mesh | TRELLIS | trellis |
| 9 | step9_spatracker.py | 3D point tracks + metric depth → spatracker.npz | SpaTrackerV2 | fpose |
| 10 | step10_fpose.py | 6-DoF object pose tracking → foundationpose10/ | FoundationPose | fpose |

utils.py holds shared helpers used across the steps. Default roots are ./EPIC-KITCHENS (source videos) and ./manip_data (clips + per-object outputs); override per script via --help.
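
As a rough picture of what --num_shards N --shard_idx i means, the sketch below splits a list of clips into N disjoint shards. The strided split, the `select_shard` name, and the clip names are assumptions for illustration; the actual partitioning used by the step scripts may differ.

```python
from typing import List, Sequence


def select_shard(items: Sequence[str], num_shards: int, shard_idx: int) -> List[str]:
    """Return the items one job would process (strided split; an assumed scheme)."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    if not 0 <= shard_idx < num_shards:
        raise ValueError("shard_idx must be in [0, num_shards)")
    # Sort first so every job sees the same ordering regardless of filesystem order.
    return sorted(items)[shard_idx::num_shards]


clips = [f"clip_{i:03d}" for i in range(10)]  # hypothetical clip names
shards = [select_shard(clips, 4, i) for i in range(4)]
for i, shard in enumerate(shards):
    print(i, shard)
```

Whatever the exact scheme, the property that matters is that the shards are disjoint and together cover every clip, so launching all N indices processes each clip exactly once.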
Each stage wraps a separate third-party system, bundled in this repo under its own subdirectory with its original install files, weight downloaders, and license. The pipeline uses five conda environments; install the ones you need. A recent CUDA toolkit (11/12) and a GPU are required throughout.
egohos: EgoHOS (step 2)
cd EgoHOS
conda create -n egohos python=3.9 -y && conda activate egohos
pip install -r requirements.txt # includes mmsegmentation
bash download_checkpoints.sh
cd ..

sam: SAM 2 (step 4)
cd sam2
conda create -n sam python=3.11 -y && conda activate sam
pip install -e .
# download the SAM 2.1 checkpoints (see sam2/INSTALL.md)
cd ..

vas: Diffusion-VAS (step 7)
cd diffusion-vas
conda create -n vas python=3.10 -y && conda activate vas
pip install -r requirements.txt
# download the Diffusion-VAS checkpoints (see diffusion-vas/README.md)
cd ..

trellis: TRELLIS (step 8)
cd trellis
. ./setup.sh --new-env --basic # creates the 'trellis' env + installs deps (see ./setup.sh --help)
cd ..

fpose: FoundationPose · SpaTrackerV2 · InternVL · video IO (steps 1, 3, 5, 6, 9, 10)
cd FoundationPose
bash build_all_conda.sh # builds its conda env + CUDA extensions (see FoundationPose/readme.md)
conda activate fpose # use the env name the build script creates
pip install -r ../SpaTrackerV2/requirements.txt # SpaTrackerV2 (step 9)
pip install transformers accelerate einops timm decord # InternVL filtering (5–6) + clip splitting (1)
cd ..

Each component downloads its own model weights on setup or first use; follow the install/README inside its subdirectory.
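
Before launching a long run it can help to confirm that the environments exist. The sketch below parses the JSON printed by `conda env list --json` and reports which of the five env names used in this README are missing. The `missing_envs` helper and the sample paths are hypothetical, and the FoundationPose build script may create its env under a different name than fpose, so adjust the list to match your setup.

```python
import json
import os
from typing import Iterable, List

# Env names as used in this README; adjust if your build scripts name them differently.
REQUIRED_ENVS = ("egohos", "sam", "vas", "trellis", "fpose")


def missing_envs(env_list_json: str, required: Iterable[str] = REQUIRED_ENVS) -> List[str]:
    """Return required env names absent from `conda env list --json` output."""
    prefixes = json.loads(env_list_json).get("envs", [])
    # Named envs live at <root>/envs/<name>, so the basename is the env name.
    present = {os.path.basename(p.rstrip("/\\")) for p in prefixes}
    return [name for name in required if name not in present]


# Illustrative input. In practice, capture the real listing with:
# subprocess.run(["conda", "env", "list", "--json"], capture_output=True, text=True).stdout
sample = json.dumps({"envs": ["/opt/conda", "/opt/conda/envs/sam", "/opt/conda/envs/fpose"]})
print(missing_envs(sample))
```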
- Get EPIC-KITCHENS-100 (agree to its terms) and place the source videos under ./EPIC-KITCHENS.
- Run the steps in order, activating the matching env and sharding across jobs. Every script takes --num_shards/--shard_idx and --help:
conda activate fpose && python step1_split.py --video_root ./EPIC-KITCHENS --out_root ./manip_data
conda activate egohos && python step2_egohos.py --data_root ./manip_data --num_shards 64 --shard_idx 0
conda activate fpose && python step3_filtering.py --data_root ./manip_data
conda activate sam && python step4_sam.py --video_root ./manip_data --num_shards 64 --shard_idx 0
# steps 5–10 likewise (see each script's --help for its flags)

Shard a step across N jobs by launching it once per --shard_idx in [0, N).
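
One way to launch a step once per shard index is to generate the command lines and hand them to a job scheduler. The sketch below is an illustration, not part of the repo: the `shard_commands` helper is hypothetical, the step-to-env mapping is copied from the pipeline table, and it uses `conda run -n <env>` in place of `conda activate` so that each job is a single command. Only the sharding flags are filled in; pass any other flags a script needs (check its --help) through `extra_args`.

```python
import shlex
from typing import Dict, List, Sequence, Tuple

# Step number -> (script, conda env), copied from the pipeline table.
STEPS: Dict[int, Tuple[str, str]] = {
    1: ("step1_split.py", "fpose"),
    2: ("step2_egohos.py", "egohos"),
    3: ("step3_filtering.py", "fpose"),
    4: ("step4_sam.py", "sam"),
    5: ("step5_obj_filter.py", "fpose"),
    6: ("step6_frame_filter.py", "fpose"),
    7: ("step7_vas.py", "vas"),
    8: ("step8_trellis.py", "trellis"),
    9: ("step9_spatracker.py", "fpose"),
    10: ("step10_fpose.py", "fpose"),
}


def shard_commands(step: int, num_shards: int, extra_args: Sequence[str] = ()) -> List[str]:
    """Build one shell command per shard index in [0, num_shards)."""
    script, env = STEPS[step]
    commands = []
    for shard_idx in range(num_shards):
        argv = [
            "conda", "run", "-n", env, "python", script,
            "--num_shards", str(num_shards),
            "--shard_idx", str(shard_idx),
            *extra_args,
        ]
        commands.append(shlex.join(argv))  # quote arguments safely for a shell
    return commands


cmds = shard_commands(2, 4, ["--data_root", "./manip_data"])
for cmd in cmds:
    print(cmd)
```

Because the steps must run in order, all shards of one step should finish before the next step's commands are submitted.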
Per action clip (<narration_id>/): EgoHOS + SAM 2 + amodal masks, a TRELLIS mesh per object, SpaTrackerV2 depth & 3D tracks (spatracker.npz), and FoundationPose 6-DoF poses (foundationpose10/). The dataloader in RustinS/ObjectForesight windows these into training trajectories; the cleaned, packaged result is raivn/ObjectForesight-EPIC.
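
After a full run, a quick completeness check over the output root can flag clips that never reached the last steps. The sketch below looks for the two artifact names mentioned above (spatracker.npz and foundationpose10/) anywhere beneath each clip directory. The helper names and the demo directory layout are assumptions; the real nesting of per-object outputs inside a clip may differ, which is why the search is recursive.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def clip_outputs(clip_dir: Path) -> Dict[str, bool]:
    """Report whether the final-stage artifacts exist anywhere under a clip directory."""
    return {
        "spatracker.npz": any(p.is_file() for p in clip_dir.rglob("spatracker.npz")),
        "foundationpose10": any(p.is_dir() for p in clip_dir.rglob("foundationpose10")),
    }


def incomplete_clips(data_root: Path) -> List[str]:
    """Return names of clip directories that lack at least one expected artifact."""
    return sorted(
        clip.name
        for clip in data_root.iterdir()
        if clip.is_dir() and not all(clip_outputs(clip).values())
    )


# Demo on a throwaway directory tree (hypothetical clip and object names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clip_a" / "obj0" / "foundationpose10").mkdir(parents=True)
    (root / "clip_a" / "obj0" / "spatracker.npz").touch()
    (root / "clip_b").mkdir()
    missing = incomplete_clips(root)

print(missing)
```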
This pipeline builds on EgoHOS, SAM 2, Diffusion-VAS, TRELLIS, SpaTrackerV2, FoundationPose, and InternVL (the pipeline table shows which stage uses each), bundled here under their respective licenses. The bundled copies may include minor local modifications for pipeline integration.
Pipeline code (step*.py, utils.py) is for non-commercial research use, consistent with EPIC-KITCHENS-100 (CC BY-NC 4.0). Each bundled component keeps its own license (see its subdirectory).
@article{soraki2026objectforesight,
title = {ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
author = {Soraki, Rustin and Bharadhwaj, Homanga and Farhadi, Ali and Mottaghi, Roozbeh},
journal = {arXiv preprint arXiv:2601.05237},
year = {2026}
}

Please also cite EPIC-KITCHENS-100 and each component (EgoHOS, SAM 2, Diffusion-VAS, TRELLIS, SpaTrackerV2, FoundationPose, InternVL).