Shrey/hard scenes#5
Pull request overview
This PR introduces an HPC training workflow for the camera-based Waymo E2E model, along with changes to the data loader and training script, and it checks in several SLURM job logs from recent runs.
Changes:
- Add a SLURM batch script (`train_job.sbatch`) to run `train.py` on a V100 partition with a specific conda environment and scratch paths.
- Modify `train.py` to change how Waymo datasets and dataloaders are constructed (switching to pre-generated index files and turning off image loading), while keeping the Lightning training loop and logging.
- Update `loader.py` to change the data directory used in the `__main__` test harness and to decode JPEG images on GPU before moving them back to CPU.
- Add multiple `.err` and `.out` SLURM log artifacts (`waymo-e2e-*.err`/`.out`) from training runs to the repository.
Notable Issues Found
- **Missing imports in `train.py` (critical bug)**
  The script uses `WaymoE2E`, `torch`, `pickle`, `Path`, `pd` (pandas), and `plt` (matplotlib) without importing them. This will cause `NameError` exceptions as soon as those symbols are referenced (e.g., at dataset construction, DataLoader construction, and loss-plot generation).
- **Undefined variables `train_index` and `test_index` in `train.py` (critical bug)**
  The new code block:

  ```python
  with open("train_index.pkl", "wb") as f1:
      pickle.dump(train_index, f1)
  with open("test_index.pkl", "wb") as f2:
      pickle.dump(test_index, f2)
  ```

  references `train_index` and `test_index`, which are never defined. This will immediately raise a `NameError` and prevent training from starting. If the intention is to re-save or subset existing indices, those values need to be computed (e.g., from an existing dataset or index file) before dumping.
- **Dead / confusing dataset construction in `train.py` (maintainability)**
  `train_dataset` and `test_dataset` are first created with `index_train.pkl`/`index_val.pkl` and `images=True`, then immediately replaced with new `WaymoE2E` instances using `train_index.pkl`/`test_index.pkl` and `images=False`. Given the undefined `train_index`/`test_index`, the first pair of datasets is effectively unused and confusing. Even after fixing the undefined variables, this pattern is hard to follow and should be refactored to a single clear construction path.
- **SLURM job script environment setup currently fails on the target cluster (bug / operational implications)**
  The new `train_job.sbatch` uses:

  ```
  module load anaconda/2023.03
  source activate /scratch/gilbreth/shar1159/conda_envs/waymo
  ```

  The `.err` logs included in this PR show `Lmod has detected the following error: unknown module "anaconda/2023.03"` and `/var/spool/slurm/...: line 15: activate: No such file or directory`. This indicates the environment module name and/or activation command are incorrect for the target system, so the intended conda env is not actually being activated.
- **DataLoader worker count vs. system recommendation (best practices, non-blocking)**
  `train.py` now uses `num_workers=16` for both train and validation DataLoaders. The included `.err` logs from the same environment show PyTorch / Lightning warnings that the suggested maximum number of workers on that system is 4, and that an `IterableDataset` with `__len__` plus multi-process loading can yield inaccurate lengths. While not strictly a correctness bug, this is likely to harm performance and may cause confusing progress reporting; consider tuning `num_workers` based on actual cluster limits and Lightning's guidance.
- **GPU-only decoding in `loader.py` can break on CPU-only setups (best practices / operational, low priority)**
  `decode_img` now calls `torchvision.io.decode_jpeg(..., device='cuda')` and then moves the tensor back to CPU. This will fail on machines where CUDA is unavailable. For the main training code this is mitigated by using `images=False` (so `decode_img` returns an empty array and never touches CUDA), but the `__main__` test harness uses `images=True` and will crash on CPU-only systems.
- **Checked-in SLURM log artifacts (maintainability)**
  Files `waymo-e2e-9940377.err`, `waymo-e2e-9940269.err`, `waymo-e2e-9940263.err`, and `waymo-e2e-9940263.out` are pure runtime artifacts with job IDs, timestamps, and environment-specific warnings. Keeping these under version control adds noise, risks leaking environment-specific details, and will quickly become stale; typically they should be excluded (e.g., via `.gitignore`) unless there is a specific, documented need.
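For the undefined `train_index`/`test_index` issue, the minimal fix is to define those names before the pickle dump. A self-contained sketch of that shape (the index contents below are illustrative stand-ins, not the real Waymo indices, which would presumably be loaded or computed from the existing index files):

```python
import pickle

# Illustrative stand-ins: in the real script these would be loaded or
# derived from the existing pre-generated index files before re-saving.
train_index = list(range(250))
test_index = list(range(250, 300))

# With the names defined first, the dump the PR adds no longer raises NameError.
with open("train_index.pkl", "wb") as f1:
    pickle.dump(train_index, f1)
with open("test_index.pkl", "wb") as f2:
    pickle.dump(test_index, f2)
```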
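For the dead dataset construction, one possible refactor collapses the two construction sites into a single parameterized path per split. The `make_dataset` helper is hypothetical, and `WaymoE2E` is stubbed here only so the sketch is self-contained; in `train.py` it would be the real dataset class:

```python
class WaymoE2E:
    """Stand-in for the real loader.WaymoE2E, for illustration only."""
    def __init__(self, batch_size, indexFile, data_dir, images, n_items):
        self.batch_size = batch_size
        self.indexFile = indexFile
        self.data_dir = data_dir
        self.images = images
        self.n_items = n_items

def make_dataset(split, batch_size, data_dir):
    # One source of truth for per-split settings; pick a single index-file
    # naming convention instead of mixing index_train.pkl with train_index.pkl.
    cfg = {
        "train": dict(indexFile="index_train.pkl", n_items=250000),
        "val":   dict(indexFile="index_val.pkl",   n_items=50000),
    }[split]
    return WaymoE2E(batch_size=batch_size, data_dir=data_dir,
                    images=False, **cfg)

train_dataset = make_dataset("train", batch_size=32, data_dir="/data/waymo")
test_dataset = make_dataset("val", batch_size=32, data_dir="/data/waymo")
```

Each dataset is then constructed exactly once, and there is no throwaway first pair to confuse readers.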
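For the sbatch environment failure, a hedged sketch of the usual fix: verify the module name with `module avail` on the cluster (the plain `anaconda` name below is a placeholder, not confirmed for this system), and initialize conda for the shell before activating, since the `activate: No such file or directory` error indicates the legacy `source activate` script is not on `PATH`:

```shell
#!/bin/bash
#SBATCH --job-name=waymo-e2e
# ... remaining #SBATCH directives as in the PR's train_job.sbatch ...

# Placeholder module name: check `module avail anaconda` on the cluster;
# the attached .err logs show Lmod rejecting "anaconda/2023.03".
module load anaconda

# Initialize conda for this non-interactive shell, then use `conda activate`
# instead of the legacy `source activate`, which fails in the logs.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate /scratch/gilbreth/shar1159/conda_envs/waymo
```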
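For the worker-count issue, one way to avoid hardcoding `num_workers=16` is to derive the count from the CPUs SLURM actually allocated. `pick_num_workers` is an illustrative helper, not code from the PR; `SLURM_CPUS_PER_TASK` is the standard SLURM environment variable for the per-task CPU allocation:

```python
import os

def pick_num_workers(default=4, cap=None):
    """Choose a DataLoader worker count from the CPUs actually available.

    Under SLURM, SLURM_CPUS_PER_TASK reflects the allocation; elsewhere
    fall back to os.cpu_count(). `cap` lets you honor warnings like the
    "suggested max of 4 workers" seen in the attached .err logs.
    """
    cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    n = int(cpus) if cpus else (os.cpu_count() or default)
    if cap is not None:
        n = min(n, cap)
    return max(1, n)

# Both DataLoaders would then take num_workers=pick_num_workers(cap=4)
# instead of a hardcoded 16.
num_workers = pick_num_workers(cap=4)
```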
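For the CPU-only crash in `decode_img`, the usual pattern is to pick the decode device at runtime rather than hardcoding `'cuda'`. `pick_decode_device` is a hypothetical helper name; the `decode_jpeg` call in the comment shows where it would slot into `loader.py`:

```python
import torch

def pick_decode_device():
    """Device for torchvision.io.decode_jpeg: use the GPU only when CUDA exists."""
    return "cuda" if torch.cuda.is_available() else "cpu"

# In loader.py's decode_img, the call would then be roughly:
#   img = torchvision.io.decode_jpeg(data, device=pick_decode_device())
#   return img.cpu().numpy()
# so the __main__ harness (images=True) no longer crashes on CPU-only hosts.
```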
Reviewed changes
Copilot reviewed 7 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/camera-based-e2e/train.py | Adjusts dataset and DataLoader construction to use new index pickle files and disable image loading, while still building and training a Lightning model; however, it removes required imports and introduces undefined variables, breaking execution. |
| src/camera-based-e2e/loader.py | Changes image decoding to run on CUDA and then move tensors back to CPU, and updates the example `DATA_DIR` path in the `__main__` block. |
| src/camera-based-e2e/train_job.sbatch | New SLURM batch script to run train.py on a V100 partition using a specific conda environment and scratch directory; as written, its module and activation commands fail on the target cluster referenced in the `.err` logs. |
| src/camera-based-e2e/waymo-e2e-9940377.err | Captures stderr output from a SLURM training run, including module-loading and DataLoader warnings; committed as a log artifact rather than code. |
| src/camera-based-e2e/waymo-e2e-9940269.err | Another SLURM stderr log, from a canceled training job, showing the same module/activation issues and Lightning DataLoader warnings. |
| src/camera-based-e2e/waymo-e2e-9940263.err | SLURM stderr log for job 9940263, again showing the broken environment setup and Lightning warnings. |
| src/camera-based-e2e/waymo-e2e-9940263.out | SLURM stdout log with very detailed progress output for a training run (epoch 0), including loss values; not used by the code. |
In `train.py`:

```python
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```

This assignment to `train_dataset` is unnecessary as it is redefined before this value is used.

Suggested change:

```python
WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```
In `train.py` (diff hunk `@@ -31,8 +21,17 @@`):

```python
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```

This assignment to `test_dataset` is unnecessary as it is redefined before this value is used.

Suggested change:

```python
WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
```