Bump CUDA base image from 13.1.0 to 13.2.0 #85

Open
hannahli-nv wants to merge 9 commits into main from bump-cuda-13.2

@hannahli-nv
Collaborator

@hannahli-nv hannahli-nv commented Mar 24, 2026

Summary

• Bump CI Docker base image from nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu22.04 to nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
• The latest cuda-tile (1.2.0) declares nvidia-cuda-tileiras >=13.2, <13.3 as its optional [tileiras] dependency. The CI image should ship a matching system tileiras so the compiler version is aligned.
• PyTorch cu130 wheels remain compatible with CUDA 13.2 (backward compatible).
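
The version range in the second bullet can be checked mechanically. The following is a minimal sketch, assuming versions are plain dotted integers (no pre-release or local suffixes, which full PEP 440 handling would need to cover); `parse_version` and `satisfies_tileiras_range` are hypothetical helper names, not part of cuda-tile or pip.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '13.2.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies_tileiras_range(version: str) -> bool:
    """Return True if `version` falls inside the range '>=13.2, <13.3'.

    Tuple comparison is lexicographic, so (13, 2) <= (13, 2, 0) < (13, 3).
    """
    parsed = parse_version(version)
    return (13, 2) <= parsed < (13, 3)


if __name__ == "__main__":
    for candidate in ("13.1.0", "13.2.0", "13.3.0"):
        print(candidate, satisfies_tileiras_range(candidate))
```

Under this check a 13.1.x tileiras from the old base image falls outside the range, which is the mismatch the bump is meant to remove.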

What changed

modeling/transformers/Dockerfile: updated BASE_IMAGE default from cuda:13.1.0 to cuda:13.2.0.

How to verify

In the CI docker build / test-ops job logs, check:
nvcc --version → should show 13.2
dpkg -l | grep tileiras → should show 13.2.x
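
The nvcc check could also be scripted rather than read from the logs by eye. This is a rough sketch, assuming `nvcc --version` prints a line containing `release X.Y` (the format nvcc has used in its version banner); `parse_nvcc_release` and `check_cuda_release` are hypothetical helpers and are not part of this PR.

```python
import re
import subprocess


def parse_nvcc_release(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' part of nvcc's banner."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def check_cuda_release(expected: tuple[int, int] = (13, 2)) -> bool:
    """Run `nvcc --version` and compare its release with `expected`."""
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    )
    return parse_nvcc_release(result.stdout) == expected
```

Calling `check_cuda_release()` inside the built image should return True once the base image is on CUDA 13.2; it raises if nvcc is missing or its output has an unexpected shape.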

CI Configuration

config:
  build: true
  test: ["ops", "benchmark"]

🤖 Generated with Claude Code

The latest cuda-tile (1.2.0) requires tileiras >=13.2. Update the CI
Docker base image to CUDA 13.2.0 so the system-installed tileiras
matches what cuda-tile expects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hannahli-nv
Collaborator Author

/ok to test 7edbbea

With CUDA 13.2, the tileiras compiler uses more memory per compilation.
Running 16 xdist workers with -n auto and different kernel modules
compiling simultaneously causes OOM (exit code 137) on the CI runner.

Adding --dist=loadscope groups tests by module/class so each worker
compiles one kernel type at a time, reducing peak memory. This matches
the Ocean CI configuration.
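
To illustrate the grouping, here is a small simulation of how loadscope assigns tests to scopes: the scope is the node ID with its last `::` component removed, so module-level tests group by file and methods group by class. It is a sketch of the grouping rule only, not pytest-xdist's scheduler, and the test IDs other than test_bmm are made up for the example.

```python
from collections import defaultdict


def scope_of(nodeid: str) -> str:
    """Module path for test functions, 'module::Class' for test methods."""
    return nodeid.rsplit("::", 1)[0]


def group_by_scope(nodeids: list[str]) -> dict[str, list[str]]:
    """Bucket test node IDs by scope; each bucket runs on a single worker."""
    groups: dict[str, list[str]] = defaultdict(list)
    for nodeid in nodeids:
        groups[scope_of(nodeid)].append(nodeid)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical node IDs, for illustration only.
    ids = [
        "tests/test_bmm.py::test_bmm[case0]",
        "tests/test_bmm.py::test_bmm[case1]",
        "tests/test_softmax.py::TestSoftmax::test_forward",
        "tests/test_softmax.py::TestSoftmax::test_backward",
    ]
    for scope, tests in group_by_scope(ids).items():
        print(scope, len(tests))
```

Because a whole scope stays on one worker, a module with many parametrized cases is never split across workers, which limits how many kernel types compile at once.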

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hannahli-nv
Collaborator Author

/ok to test 24bfa7b

With --dist=loadscope, each module's tests run serially on a single
worker to avoid OOM. This trades speed for memory safety, so the
previous 15-minute step timeout (17-minute job timeout) is too tight.
Bump to a 25-minute step timeout and a 30-minute job timeout to match
the test-benchmark job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hannahli-nv
Collaborator Author

/ok to test 4d051ce

The previous --dist=loadscope fix prevented OOM but caused timeouts
because it serializes all tests within a module to one worker (e.g.
156 test_bmm tests on a single worker).

Switch to -n 4 (4 parallel workers instead of 16 from -n auto):
- 4 workers use ~1/4 peak memory, avoiding the CUDA 13.2 OOM
- Tests distribute freely across workers, no serialization bottleneck
- Expected runtime ~12-15 min, well within the 25 min step timeout
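
The first bullet assumes peak memory grows roughly linearly with the number of workers. A back-of-the-envelope sketch under that assumption follows; the function names are hypothetical, and the memory figures in the usage lines are placeholders, not measurements from this CI runner.

```python
def estimated_peak_gib(workers: int, per_worker_gib: float) -> float:
    """Linear model: every worker compiles at once and holds its own memory."""
    return workers * per_worker_gib


def max_workers_within_budget(
    budget_gib: float, per_worker_gib: float, cpu_count: int
) -> int:
    """Largest worker count that fits the memory budget, capped by CPU count."""
    if per_worker_gib <= 0:
        raise ValueError("per_worker_gib must be positive")
    by_memory = int(budget_gib // per_worker_gib)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # Placeholder figures: 32 GiB budget, 4 GiB per compiling worker, 16 CPUs.
    print(max_workers_within_budget(32.0, 4.0, 16))
    print(estimated_peak_gib(4, 4.0) / estimated_peak_gib(16, 4.0))
```

The model ignores memory shared between workers and the fact that not all workers compile at the same moment, so it is only a starting point for choosing a value of -n to try.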

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hannahli-nv
Collaborator Author

/ok to test aa59188

- -n 16 (auto): OOM with CUDA 13.2 tileiras (~2.5 min)
- -n 4: no OOM but too slow, times out at 25 min
- -n 8: expected to use about half the memory of 16 workers and run
  about twice as fast as 4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hannahli-nv
Collaborator Author

/ok to test cc6a34b

hannahli-nv and others added 2 commits March 24, 2026 16:14
-n 8 was stable (no OOM) but timed out at 25 min (just barely).
Bump to -n 10 for more speed, and increase step timeout to 35 min
(job timeout 40 min) as safety margin for CUDA 13.2 compilation.

Summary of what we know:
- -n 16 (auto): OOM at ~2.5 min
- -n 4: no OOM, timeout at 25 min
- -n 8: no OOM, timeout at 25 min (nearly complete)
- -n 10: should complete in ~20 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hannahli-nv
Collaborator Author

/ok to test 3f2ffb8

@hannahli-nv hannahli-nv requested a review from xjmxyt March 24, 2026 09:09
With -n 12 passing in 19 min, 25m step timeout gives ~6 min headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hannahli-nv
Collaborator Author

/ok to test b433a75

@hannahli-nv
Collaborator Author

/ok to test 0a2582e
