Shrey/hard scenes#5

Open
shreyshrma1 wants to merge 3 commits into main from shrey/hard-scenes

Conversation

@shreyshrma1
Collaborator

No description provided.

Copilot AI (Contributor) left a comment


Pull request overview

This PR introduces an HPC training workflow for the camera-based Waymo E2E model, along with changes to the data loader and training script, and it checks in several SLURM job logs from recent runs.

Changes:

  • Add a SLURM batch script (train_job.sbatch) to run train.py on a V100 partition with a specific conda environment and scratch paths.
  • Modify train.py to change how Waymo datasets and dataloaders are constructed (switching to pre-generated index files and turning off image loading), while keeping the Lightning training loop and logging.
  • Update loader.py to change the data directory used in the __main__ test harness and to decode JPEG images on GPU before moving them back to CPU.
  • Add multiple .err and .out SLURM log artifacts (waymo-e2e-*.err/.out) from training runs into the repository.

Notable Issues Found

  • Missing imports in train.py (critical bug)
    The script uses WaymoE2E, torch, pickle, Path, pd (pandas), and plt (matplotlib) without importing them. This will cause NameError exceptions as soon as those symbols are referenced (e.g., at dataset construction, DataLoader construction, and loss-plot generation).

  • Undefined variables train_index and test_index in train.py (critical bug)
    The new code block:

    with open("train_index.pkl", "wb") as f1:
        pickle.dump(train_index, f1)
    
    with open("test_index.pkl", "wb") as f2:
        pickle.dump(test_index, f2)

    references train_index and test_index, which are never defined. This will immediately raise a NameError and prevent training from starting. If the intention is to re-save or subset existing indices, those values need to be computed (e.g., from an existing dataset or index file) before dumping.
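If the goal is to materialize the train/test index files before constructing the datasets, one stdlib-only sketch is the following (it assumes the index is simply a picklable list of sample records; the real layout depends on WaymoE2E, and `split_index` is a hypothetical helper, not part of the codebase):

```python
import pickle
import random


def split_index(index, test_fraction=0.2, seed=0):
    """Shuffle a list of sample records and split it into train/test lists."""
    items = list(index)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


# Stand-in index for illustration; in the real pipeline this would be
# loaded from an existing index file or built by the dataset's indexer.
full_index = [{"record": i} for i in range(100)]
train_index, test_index = split_index(full_index)

with open("train_index.pkl", "wb") as f1:
    pickle.dump(train_index, f1)
with open("test_index.pkl", "wb") as f2:
    pickle.dump(test_index, f2)
```

The key point is that `train_index` and `test_index` must be bound to real values before the `pickle.dump` calls run, whatever the actual source of the index data turns out to be.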

  • Dead / confusing dataset construction in train.py (maintainability)
    train_dataset and test_dataset are first created with index_train.pkl / index_val.pkl and images=True, then immediately replaced with new WaymoE2E instances using train_index.pkl / test_index.pkl and images=False. Given the undefined train_index/test_index, the first pair of datasets are effectively unused and confusing. Even after fixing the undefined variables, this pattern is hard to follow and should be refactored to a single clear construction path.

  • SLURM job script environment setup currently fails on the target cluster (bug / operational_implications)
    The new train_job.sbatch uses:

    module load anaconda/2023.03
    source activate /scratch/gilbreth/shar1159/conda_envs/waymo

    The .err logs included in this PR show "Lmod has detected the following error: unknown module 'anaconda/2023.03'" and "/var/spool/slurm/...: line 15: activate: No such file or directory". This indicates the environment module name and/or activation command are incorrect for the target system, so the intended conda env is not actually being activated.

  • DataLoader worker count vs. system recommendation (best_practices, non-blocking)
    train.py now uses num_workers=16 for both train and validation DataLoaders. The included .err logs from the same environment show PyTorch / Lightning warnings that the suggested maximum number of workers on that system is 4, and that IterableDataset with __len__ plus multi-process loading can yield inaccurate lengths. While not strictly a correctness bug, this is likely to harm performance and may cause confusing progress reporting; consider tuning num_workers based on actual cluster limits and Lightning’s guidance.

  • GPU-only decoding in loader.py can break on CPU-only setups (best_practices / operational, low priority)
    decode_img now calls torchvision.io.decode_jpeg(..., device='cuda') and then moves the tensor back to CPU. This will fail on machines where CUDA is unavailable. For the main training code, this is mitigated by using images=False (so decode_img returns an empty array and never touches CUDA), but the __main__ test harness uses images=True and will crash on CPU-only systems.

  • Checked-in SLURM log artifacts (maintainability)
    Files waymo-e2e-9940377.err, waymo-e2e-9940269.err, waymo-e2e-9940263.err, and waymo-e2e-9940263.out are pure runtime artifacts with job IDs, timestamps, and environment-specific warnings. Keeping these under version control adds noise, risks leaking environment-specific details, and will quickly become stale; typically these should be excluded (e.g., via .gitignore) unless there is a specific, documented need.

Reviewed changes

Copilot reviewed 7 out of 11 changed files in this pull request and generated 2 comments.

Per-file summary:

  • src/camera-based-e2e/train.py — Adjusts dataset and DataLoader construction to use new index pickle files and disable image loading, while still building and training a Lightning model; however, it removes required imports and introduces undefined variables, breaking execution.
  • src/camera-based-e2e/loader.py — Changes image decoding to run on CUDA and then move tensors back to CPU, and updates the example DATA_DIR path in the __main__ block.
  • src/camera-based-e2e/train_job.sbatch — New SLURM batch script to run train.py on a V100 partition using a specific conda environment and scratch directory; as written, its module and activation commands fail on the target cluster referenced in the .err logs.
  • src/camera-based-e2e/waymo-e2e-9940377.err — Captures stderr output from a SLURM training run, including module-loading and DataLoader warnings; committed as a log artifact rather than code.
  • src/camera-based-e2e/waymo-e2e-9940269.err — Another SLURM stderr log, from a canceled training job, showing the same module/activation issues and Lightning DataLoader warnings.
  • src/camera-based-e2e/waymo-e2e-9940263.err — SLURM stderr log for job 9940263, again showing the broken environment setup and Lightning warnings.
  • src/camera-based-e2e/waymo-e2e-9940263.out — SLURM stdout log with detailed progress output for a training run (epoch 0), including loss values; not used by the code.


Comment on lines 21 to 22
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)

Copilot AI Jan 29, 2026


This assignment to 'train_dataset' is unnecessary as it is redefined before this value is used.

Suggested change
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)

@@ -31,8 +21,17 @@
train_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_train.pkl', data_dir=args.data_dir, images=True, n_items=250000)
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)

Copilot AI Jan 29, 2026


This assignment to 'test_dataset' is unnecessary as it is redefined before this value is used.

Suggested change
test_dataset = WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
WaymoE2E(batch_size=args.batch_size, indexFile='index_val.pkl', data_dir=args.data_dir, images=True, n_items=50000)
