ObjectForesight

Predicting future 3D object trajectories from human videos.

ObjectForesight is a 3D object-centric dynamics model: given a single egocentric observation — a scene point cloud and an object's recent 6-DoF pose — it predicts the object's H future 6-DoF poses. This repo is the model code (training, evaluation, inference). The data-curation pipeline that produces the training data lives in a separate repo, RustinS/ObjectForesight-Data; the extracted dataset and pretrained weights are on Hugging Face.

PoserV1 = PointTransformer V3 scene encoder (via Sonata) + a DiT diffusion temporal head. Each predicted pose is a 9-D token [t_x, t_y, t_z, rot6d(6)]; the 6-D rotation maps to SO(3) via Gram–Schmidt.
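The 9-D pose token can be unpacked as in the minimal sketch below. It assumes the common 6-D rotation convention, in which the two 3-vectors are orthonormalized and used as the first two columns of the rotation matrix; the repo's `src/geom/` code may order or stack them differently. The function names `rot6d_to_matrix` and `token_to_pose` are illustrative, not the repo's API.

```python
import numpy as np


def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Map a 6-D rotation vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: the first and last three entries are the (unnormalized)
    first and second columns of the matrix.
    """
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)             # normalize the first vector
    a2_orth = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                    # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)    # the three vectors become columns


def token_to_pose(token: np.ndarray) -> np.ndarray:
    """Convert a 9-D token [t_x, t_y, t_z, rot6d(6)] to a 4x4 pose matrix."""
    pose = np.eye(4)
    pose[:3, :3] = rot6d_to_matrix(token[3:9])
    pose[:3, 3] = token[:3]
    return pose


# Example token with an arbitrary, non-orthonormal 6-D rotation part.
example_token = np.array([0.1, -0.2, 0.5, 1.0, 0.2, 0.0, 0.1, 1.0, 0.3])
example_pose = token_to_pose(example_token)
```

Because the orthonormalization happens inside the mapping, any 6-D output with linearly independent halves yields a valid rotation, so the network does not have to emit an exactly orthonormal matrix.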


| Component | Details |
|---|---|
| Encoder | PTv3, `embed_dim=768`, `in_channels=6` (camera-xyz ⊕ object-centric-xyz), `attn_obj` pooling |
| Temporal head | DiT, 12 layers / 768-d / 12 heads, `adaln_zero` conditioning, cosine β-schedule, v-prediction, 50 DDIM steps |
| I/O | scene point cloud `[N,3]` + `context_len` past poses → `[H, 9]` future poses |
| Params | ~183 M |
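The temporal head's diffusion settings (cosine schedule, v-prediction, 50 DDIM steps) are named above but not spelled out. The sketch below uses the commonly used textbook forms of each; the 1000-step training schedule, the offset `s=0.008`, the evenly spaced sampling timesteps, and all function names are assumptions, and the repo's `src/temporal/` implementation may differ.

```python
import numpy as np


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal level alpha_bar[0..num_steps] for a cosine schedule."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]


def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def v_target(x0, eps, alpha_bar_t):
    """v-prediction target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0


def ddim_step(x_t, v_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) driven by a v-prediction."""
    a, s = np.sqrt(alpha_bar_t), np.sqrt(1.0 - alpha_bar_t)
    x0_hat = a * x_t - s * v_pred     # clean sample implied by v
    eps_hat = s * x_t + a * v_pred    # noise implied by v
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_hat


# 50 sampling jumps over the assumed 1000-step schedule: 1000, 980, ..., 0.
alpha_bar = cosine_alpha_bar(1000)
timesteps = np.linspace(1000, 0, 51).round().astype(int)
```

A sampler would start from Gaussian noise shaped `[H, 9]`, call the network for `v_pred` at each timestep, and apply `ddim_step` between consecutive entries of `timesteps`.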

Results (EPIC-KITCHENS-100)

6-DoF trajectory metrics from the paper (lower is better). ADE/FDE = average/final translation error (m); ARE/FRE = average/final rotation error (°).

| Model | ADE ↓ (m) | FDE ↓ (m) | ARE ↓ (°) | FRE ↓ (°) |
|---|---|---|---|---|
| ObjectForesight-DiT (this model) | 0.019 | 0.035 | 7.98 | 13.93 |
| ObjectForesight-AR (baseline) | 0.067 | 0.074 | 9.48 | 12.58 |

See the paper for the full table (DES/RES error-growth slopes, HOT3D, and the video-generation comparison).
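For readers who want to compute the four headline metrics on their own predictions, here is a minimal sketch. It assumes ADE/FDE are Euclidean translation errors and ARE/FRE are geodesic rotation angles, averaged over the horizon or taken at the last step; the paper's exact evaluation protocol (filtering, units, averaging over samples) may differ, and `trajectory_metrics` is a hypothetical helper, not the repo's evaluation code.

```python
import numpy as np


def rotation_angle_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> np.ndarray:
    """Geodesic angle in degrees between batches of rotation matrices [H, 3, 3]."""
    R_rel = np.swapaxes(R_pred, -1, -2) @ R_gt
    cos = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding


def trajectory_metrics(t_pred, t_gt, R_pred, R_gt) -> dict:
    """ADE/FDE in the translation unit, ARE/FRE in degrees, for one trajectory.

    t_pred, t_gt: [H, 3] translations; R_pred, R_gt: [H, 3, 3] rotations.
    """
    trans_err = np.linalg.norm(t_pred - t_gt, axis=-1)   # per-step error, [H]
    rot_err = rotation_angle_deg(R_pred, R_gt)           # per-step error, [H]
    return {
        "ADE": float(trans_err.mean()),
        "FDE": float(trans_err[-1]),
        "ARE": float(rot_err.mean()),
        "FRE": float(rot_err[-1]),
    }
```

Dataset-level numbers would then be the mean of these per-trajectory values over all evaluation windows.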

Setup

Requires Python 3.11, CUDA 12.x (with nvcc on PATH), and a GPU — the PTv3 encoder depends on spconv + torch-scatter, which are compiled from source.

# 1. install uv (https://github.com/astral-sh/uv) if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. one-command setup (run on a GPU node; ~20–30 min, builds CUDA packages from source)
./scripts/setup.sh                 # H100/H200 (sm_90) by default
./scripts/setup.sh --cuda-arch 89  # e.g. RTX 40-series (Ada)
./scripts/setup.sh --skip-gpu      # CPU-only (no spconv/flash-attn; for editing/CI)

setup.sh creates .venv (uv, Python 3.11), runs uv sync for the base deps, then builds torch-scatter, flash-attn (optional — the code falls back to PyTorch SDPA if absent), pytorch3d, and cumm/spconv. Compiled kernels are JIT-cached in ~/.cumm after the first run.
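To see which of the compiled extras ended up importable after setup, a small standard-library check like the one below can help. It is a sketch only: the module names are the usual import names of the packages listed above, `"sdpa"` stands for `torch.nn.functional.scaled_dot_product_attention`, and the helper names are hypothetical rather than part of this repo.

```python
import importlib.util

# Import names of the compiled extras built by the setup script.
OPTIONAL_MODULES = ("torch_scatter", "spconv", "pytorch3d", "flash_attn")


def available_modules(names=OPTIONAL_MODULES) -> dict:
    """Report which modules can be found, without actually importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def pick_attention_backend() -> str:
    """Prefer flash-attn when it is installed; otherwise fall back to SDPA."""
    has_flash = available_modules(("flash_attn",))["flash_attn"]
    return "flash_attn" if has_flash else "sdpa"


print(available_modules())
print("attention backend:", pick_attention_backend())
```

Using `find_spec` keeps the check fast and avoids triggering CUDA initialization or JIT compilation just to test for presence.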

Run anything with uv run (no manual activation needed):

uv run python -c "import torch, spconv, torch_scatter, src; print('env OK', torch.cuda.is_available())"

Manual install (advanced): uv venv --python 3.11 && uv sync, then install torch-scatter and flash-attn (--no-build-isolation) and build cumm/spconv matching your CUDA/PyTorch — see scripts/setup.sh for the exact, patched build steps.

Pretrained weights

The main EPIC-KITCHENS model (ObjectForesight-DiT) is on Hugging Face:

huggingface-cli download raivn/ObjectForesight-EPIC-DiT --local-dir checkpoints/of-epic-dit
# -> best.pt (repo-native) and model.safetensors (pickle-free)
uv run python -m src.eval_main --config-name epic_eval eval.ckpt=checkpoints/of-epic-dit/best.pt

Data

The extracted trajectories are released as the gated dataset raivn/ObjectForesight-EPIC:

huggingface-cli download raivn/ObjectForesight-EPIC --repo-type dataset --local-dir of-epic
cd of-epic && python examples/prepare.py    # untar shards -> ./manip_data

Point the loader at it with data.dataset_root=/path/to/manip_data (default: ./manip_data). The dataset ships the windowing/filtering loader; this repo's src/data/ performs the same trajectory construction at train time.
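The windowing step can be pictured with the sketch below, which slices one object track into (context, future) pairs. It assumes each training sample is `context_len` consecutive past poses followed by `H` future poses; the `stride` parameter and the function name are hypothetical, and the repo's loader additionally applies filtering that is not shown here.

```python
import numpy as np


def make_windows(poses: np.ndarray, context_len: int, horizon: int, stride: int = 1):
    """Split a pose track [T, 9] into (context, future) training pairs.

    Returns a list of tuples with shapes ([context_len, 9], [horizon, 9]).
    """
    total = context_len + horizon
    windows = []
    for start in range(0, len(poses) - total + 1, stride):
        context = poses[start:start + context_len]
        future = poses[start + context_len:start + total]
        windows.append((context, future))
    return windows


# Toy track of 10 frames; the values only encode the frame index.
track = np.arange(10, dtype=float)[:, None].repeat(9, axis=1)
pairs = make_windows(track, context_len=2, horizon=4)
```

Tracks shorter than `context_len + horizon` frames simply produce no windows.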

Usage

All runs are configured with Hydra (conf/epic.yaml is the primary config). Override any field on the command line.

# Train (single GPU)
uv run python -m src.train_main data.dataset_root=/path/to/manip_data

# Train (multi-GPU / Slurm)
uv run torchrun --standalone --nproc_per_node=8 -m src.train_main
bash scripts/submit.sh --nodes 1 --gpus-per-node 8

# Evaluate (paper-style filtered eval) / infer / visualize with a checkpoint
uv run python -m src.eval_main  --config-name epic_eval eval.ckpt=checkpoints/of-epic-dit/best.pt
uv run python -m src.infer_main infer.ckpt=checkpoints/of-epic-dit/best.pt
uv run python -m src.viz_main   viz.save_dir=outputs/overlays

# Quick smoke test (synthetic data, no dataset needed)
uv run python -m src.train_main data.dataset_name=synth data.use_synthetic=true \
  train.tiny_overfit=true train.tiny_n=8 train.epochs=1

Configuration highlights

| Section | Key | Meaning |
|---|---|---|
| `data` | `H`, `context_len`, `n_points` | horizon, # context frames, points sampled from the scene |
| `model` | `temporal_kind` | `dit` (default) or `ar_transformer` |
| `model.temporal_dit` | `conditioning`, `ddim_steps` | `adaln_zero`/`film`, # sampling steps |
| `train` | `batch_size`, `lr`, `amp`, `ema` | standard training knobs |
| `eval` | `eval_mode`, `steps`, `prefer_ema` | sampler vs loss eval, DDIM steps |
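Each key in the table is addressed from the command line by its dotted path, for example `model.temporal_dit.ddim_steps=25`. The toy function below illustrates how such `key=value` overrides land in a nested config. It is a plain-Python stand-in written for this example, not Hydra's parser, and the base values other than `dit`, `adaln_zero`, and 50 DDIM steps are placeholders.

```python
import copy
import json


def parse_value(raw: str):
    """Read '50', 'true', or '1e-4' as JSON scalars; keep anything else as a string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = merged
        for key in parents:
            node = node.setdefault(key, {})   # create missing sections on the way
        node[leaf] = parse_value(raw)
    return merged


base = {
    "model": {
        "temporal_kind": "dit",
        "temporal_dit": {"conditioning": "adaln_zero", "ddim_steps": 50},
    },
    "eval": {"prefer_ema": True},   # placeholder value
}
overridden = apply_overrides(
    base,
    [
        "model.temporal_kind=ar_transformer",
        "model.temporal_dit.ddim_steps=25",
        "data.dataset_root=/path/to/manip_data",
    ],
)
```

Hydra itself also handles config groups, interpolation, and type checking, none of which this sketch attempts.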

Training and evaluation on HOT3D

The code also supports HOT3D-Clips (egocentric Aria sequences with motion-capture ground-truth object poses) through the hot3d config. No HOT3D checkpoint is released, so you need to train your own.

Download HOT3D-Clips from Meta: the per-split clip tars (train_aria/, test_aria/) and the object models.

Build the SpaTrackerV2 depth cache (needs SpaTrackerV2; one NPZ per clip):

python scripts/preprocess_hot3d_spatracker.py \
  --clips_root /path/to/hot3d-clips/train_aria \
  --output_dir /path/to/hot3d-clips/depth_cache_pinhole
# multi-GPU: torchrun --standalone --nproc_per_node=8 scripts/preprocess_hot3d_spatracker.py \
#   --clips_root .../train_aria --output_dir .../depth_cache_pinhole --skip_existing

(Optional: scripts/preprocess_hot3d_metadata.py pre-extracts clip metadata for faster loading.)

Train with the hot3d config (frame-skip 4 plus stationary-window filtering, matching the paper). Here clips_root is the parent directory that holds train_aria/ and test_aria/:
```
uv run python -m src.train_main --config-name hot3d \
  data.hot3d.clips_root=/path/to/hot3d-clips \
  data.hot3d.depth_cache_dir=/path/to/hot3d-clips/depth_cache_pinhole \
  data.hot3d.object_library=/path/to/hot3d-clips/object_models_eval
```
To evaluate a checkpoint you trained, reuse the same config: `uv run python -m src.eval_main --config-name hot3d eval.ckpt=/path/to/best.pt`. The `data.hot3d.object_library` override is optional (it attaches object meshes); set `data.hot3d.split=test` to use the test split.
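The two HOT3D preprocessing choices mentioned above, frame-skip 4 and stationary-window filtering, can be sketched as follows. The stationarity criterion used here (total translation path length under a threshold) and the 1 cm default are assumptions made for illustration; the actual rule and threshold live in the hot3d config and loader, and both function names are hypothetical.

```python
import numpy as np


def subsample_frames(poses: np.ndarray, frame_skip: int = 4) -> np.ndarray:
    """Keep every frame_skip-th pose (frames 0, 4, 8, ... for frame_skip=4)."""
    return poses[::frame_skip]


def is_stationary(window: np.ndarray, min_path_m: float = 0.01) -> bool:
    """True if the object's translation path over the window is below min_path_m.

    window: [L, 9] pose tokens whose first three entries are the translation.
    """
    steps = np.diff(window[:, :3], axis=0)              # per-frame displacement
    path_length = np.linalg.norm(steps, axis=-1).sum()  # total distance travelled
    return bool(path_length < min_path_m)


# A 17-frame track moving 5 cm per frame along x, then subsampled.
moving = np.zeros((17, 9))
moving[:, 0] = 0.05 * np.arange(17)
kept = subsample_frames(moving)
```

Windows for which `is_stationary` returns `True` would be dropped, so training and evaluation focus on segments where the object actually moves.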

Repository structure

src/
├── models/poser_v1/   # PoserV1 (PTv3 encoder + DiT/AR temporal head)
├── encoders/          # PointTransformer V3 adapter + serialization
├── temporal/          # DiT diffusion (DDIM) and AR transformer
├── data/              # dataset loaders, windowing, point-cloud / pose IO
├── geom/              # SE(3) ops, 6-D rotation, pose canonicalization
├── dist/              # DDP / FSDP launch
└── utils/             # config adapter, normalization, logging
conf/                  # Hydra configs (epic.yaml [primary], default.yaml, epic_eval.yaml, hot3d.yaml)
scripts/               # setup.sh, submit.sh, preprocessing utilities

Citation

@article{soraki2026objectforesight,
  title   = {ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
  author  = {Soraki, Rustin and Bharadhwaj, Homanga and Farhadi, Ali and Mottaghi, Roozbeh},
  journal = {arXiv preprint arXiv:2601.05237},
  year    = {2026}
}

License & acknowledgments

Code released for non-commercial research use. The dataset and weights are derived from EPIC-KITCHENS-100 (CC BY-NC 4.0) — cite EPIC-KITCHENS-100 and comply with its terms when using them.

Built on PointTransformer V3 / Sonata, Hydra, and PyTorch.
