Shrey/hard scenes#5
Pull request overview
This PR introduces an HPC training workflow for the camera-based Waymo E2E model, along with changes to the data loader and training script, and it checks in several SLURM job logs from recent runs.
Changes:
- Add a SLURM batch script (`train_job.sbatch`) to run `train.py` on a V100 partition with a specific conda environment and scratch paths.
- Modify `train.py` to change how Waymo datasets and dataloaders are constructed (switching to pre-generated index files and turning off image loading), while keeping the Lightning training loop and logging.
- Update `loader.py` to change the data directory used in the `__main__` test harness and to decode JPEG images on GPU before moving them back to CPU.
- Add multiple `.err` and `.out` SLURM log artifacts (`waymo-e2e-*.err`/`.out`) from training runs to the repository.
Notable Issues Found
- **Missing imports in `train.py` (critical bug)**
  The script uses `WaymoE2E`, `torch`, `pickle`, `Path`, `pd` (pandas), and `plt` (matplotlib) without importing them. This will cause `NameError` exceptions as soon as those symbols are referenced (e.g., at dataset construction, DataLoader construction, and loss-plot generation).
- **Undefined variables `train_index` and `test_index` in `train.py` (critical bug)**
  The new code block:

  ```python
  with open("train_index.pkl", "wb") as f1:
      pickle.dump(train_index, f1)
  with open("test_index.pkl", "wb") as f2:
      pickle.dump(test_index, f2)
  ```

  references `train_index` and `test_index`, which are never defined. This will immediately raise a `NameError` and prevent training from starting. If the intention is to re-save or subset existing indices, those values need to be computed (e.g., from an existing dataset or index file) before dumping.
- **Dead / confusing dataset construction in `train.py` (maintainability)**
  `train_dataset` and `test_dataset` are first created with `index_train.pkl`/`index_val.pkl` and `images=True`, then immediately replaced with new `WaymoE2E` instances using `train_index.pkl`/`test_index.pkl` and `images=False`. Given the undefined `train_index`/`test_index`, the first pair of datasets is effectively unused and confusing. Even after fixing the undefined variables, this pattern is hard to follow and should be refactored to a single clear construction path.
- **SLURM job script environment setup currently fails on the target cluster (bug / operational implications)**
  The new `train_job.sbatch` uses:

  ```
  module load anaconda/2023.03
  source activate /scratch/gilbreth/shar1159/conda_envs/waymo
  ```

  The `.err` logs included in this PR show `Lmod has detected the following error: unknown module "anaconda/2023.03"` and `/var/spool/slurm/...: line 15: activate: No such file or directory`. This indicates the environment module name and/or activation command are incorrect for the target system, so the intended conda env is not actually being activated.
- **DataLoader worker count vs. system recommendation (best practices, non-blocking)**
  `train.py` now uses `num_workers=16` for both train and validation DataLoaders. The included `.err` logs from the same environment show PyTorch / Lightning warnings that the suggested maximum number of workers on that system is 4, and that an `IterableDataset` with `__len__` plus multi-process loading can yield inaccurate lengths. While not strictly a correctness bug, this is likely to harm performance and may cause confusing progress reporting; consider tuning `num_workers` based on actual cluster limits and Lightning's guidance.
- **GPU-only decoding in `loader.py` can break on CPU-only setups (best practices / operational, low priority)**
  `decode_img` now calls `torchvision.io.decode_jpeg(..., device='cuda')` and then moves the tensor back to CPU. This will fail on machines where CUDA is unavailable. For the main training code this is mitigated by using `images=False` (so `decode_img` returns an empty array and never touches CUDA), but the `__main__` test harness uses `images=True` and will crash on CPU-only systems.
- **Checked-in SLURM log artifacts (maintainability)**
  Files `waymo-e2e-9940377.err`, `waymo-e2e-9940269.err`, `waymo-e2e-9940263.err`, and `waymo-e2e-9940263.out` are pure runtime artifacts with job IDs, timestamps, and environment-specific warnings. Keeping these under version control adds noise, risks leaking environment-specific details, and will quickly become stale; typically they should be excluded (e.g., via `.gitignore`) unless there is a specific, documented need.
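For the undefined `train_index`/`test_index` issue, the minimal fix is to define those names before the pickle dump. A self-contained sketch of that shape (the index contents below are illustrative stand-ins, not the real Waymo indices, which would presumably be loaded or computed from the existing index files):

```python
import pickle

# Illustrative stand-ins: in the real script these would be loaded or
# derived from the existing pre-generated index files before re-saving.
train_index = list(range(250))
test_index = list(range(250, 300))

# With the names defined first, the dump the PR adds no longer raises NameError.
with open("train_index.pkl", "wb") as f1:
    pickle.dump(train_index, f1)
with open("test_index.pkl", "wb") as f2:
    pickle.dump(test_index, f2)
```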
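For the dead dataset construction, one possible refactor collapses the two construction sites into a single parameterized path per split. The `make_dataset` helper is hypothetical, and `WaymoE2E` is stubbed here only so the sketch is self-contained; in `train.py` it would be the real dataset class:

```python
class WaymoE2E:
    """Stand-in for the real loader.WaymoE2E, for illustration only."""
    def __init__(self, batch_size, indexFile, data_dir, images, n_items):
        self.batch_size = batch_size
        self.indexFile = indexFile
        self.data_dir = data_dir
        self.images = images
        self.n_items = n_items

def make_dataset(split, batch_size, data_dir):
    # One source of truth for per-split settings; pick a single index-file
    # naming convention instead of mixing index_train.pkl with train_index.pkl.
    cfg = {
        "train": dict(indexFile="index_train.pkl", n_items=250000),
        "val":   dict(indexFile="index_val.pkl",   n_items=50000),
    }[split]
    return WaymoE2E(batch_size=batch_size, data_dir=data_dir,
                    images=False, **cfg)

train_dataset = make_dataset("train", batch_size=32, data_dir="/data/waymo")
test_dataset = make_dataset("val", batch_size=32, data_dir="/data/waymo")
```

Each dataset is then constructed exactly once, and there is no throwaway first pair to confuse readers.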
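For the sbatch environment failure, a hedged sketch of the usual fix: verify the module name with `module avail` on the cluster (the plain `anaconda` name below is a placeholder, not confirmed for this system), and initialize conda for the shell before activating, since the `activate: No such file or directory` error indicates the legacy `source activate` script is not on `PATH`:

```shell
#!/bin/bash
#SBATCH --job-name=waymo-e2e
# ... remaining #SBATCH directives as in the PR's train_job.sbatch ...

# Placeholder module name: check `module avail anaconda` on the cluster;
# the attached .err logs show Lmod rejecting "anaconda/2023.03".
module load anaconda

# Initialize conda for this non-interactive shell, then use `conda activate`
# instead of the legacy `source activate`, which fails in the logs.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate /scratch/gilbreth/shar1159/conda_envs/waymo
```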
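For the worker-count issue, one way to avoid hardcoding `num_workers=16` is to derive the count from the CPUs SLURM actually allocated. `pick_num_workers` is an illustrative helper, not code from the PR; `SLURM_CPUS_PER_TASK` is the standard SLURM environment variable for the per-task CPU allocation:

```python
import os

def pick_num_workers(default=4, cap=None):
    """Choose a DataLoader worker count from the CPUs actually available.

    Under SLURM, SLURM_CPUS_PER_TASK reflects the allocation; elsewhere
    fall back to os.cpu_count(). `cap` lets you honor warnings like the
    "suggested max of 4 workers" seen in the attached .err logs.
    """
    cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    n = int(cpus) if cpus else (os.cpu_count() or default)
    if cap is not None:
        n = min(n, cap)
    return max(1, n)

# Both DataLoaders would then take num_workers=pick_num_workers(cap=4)
# instead of a hardcoded 16.
num_workers = pick_num_workers(cap=4)
```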
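For the CPU-only crash in `decode_img`, the usual pattern is to pick the decode device at runtime rather than hardcoding `'cuda'`. `pick_decode_device` is a hypothetical helper name; the `decode_jpeg` call in the comment shows where it would slot into `loader.py`:

```python
import torch

def pick_decode_device():
    """Device for torchvision.io.decode_jpeg: use the GPU only when CUDA exists."""
    return "cuda" if torch.cuda.is_available() else "cpu"

# In loader.py's decode_img, the call would then be roughly:
#   img = torchvision.io.decode_jpeg(data, device=pick_decode_device())
#   return img.cpu().numpy()
# so the __main__ harness (images=True) no longer crashes on CPU-only hosts.
```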
Reviewed changes
Copilot reviewed 7 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/camera-based-e2e/train.py | Adjusts dataset and DataLoader construction to use new index pickle files and disable image loading, while still building and training a Lightning model; however, it removes required imports and introduces undefined variables, breaking execution. |
| src/camera-based-e2e/loader.py | Changes image decoding to run on CUDA and then move tensors back to CPU, and updates the example `DATA_DIR` path in the `__main__` block. |
| src/camera-based-e2e/train_job.sbatch | New SLURM batch script to run train.py on a V100 partition using a specific conda environment and scratch directory; as written, its module and activation commands fail on the target cluster referenced in the `.err` logs. |
| src/camera-based-e2e/waymo-e2e-9940377.err | Captures stderr output from a SLURM training run, including module-loading and DataLoader warnings; committed as a log artifact rather than code. |
| src/camera-based-e2e/waymo-e2e-9940269.err | Another SLURM stderr log, from a canceled training job, showing the same module/activation issues and Lightning DataLoader warnings. |
| src/camera-based-e2e/waymo-e2e-9940263.err | SLURM stderr log for job 9940263, again showing the broken environment setup and Lightning warnings. |
| src/camera-based-e2e/waymo-e2e-9940263.out | SLURM stdout log with very detailed progress output for a training run (epoch 0), including loss values; not used by the code. |
In `train.py`:

```python
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```

This assignment to `train_dataset` is unnecessary as it is redefined before this value is used.

Suggested change:

```python
WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```
In `train.py` (diff hunk `@@ -31,8 +21,17 @@`):

```python
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```

This assignment to `test_dataset` is unnecessary as it is redefined before this value is used.

Suggested change:

```python
WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```