Bump CUDA base image from 13.1.0 to 13.2.0 by hannahli-nv · Pull Request #85 · NVIDIA/TileGym

hannahli-nv · 2026-03-24T03:51:59Z

Summary

• Bump CI Docker base image from nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu22.04 to nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
• The latest cuda-tile (1.2.0) declares nvidia-cuda-tileiras >=13.2, <13.3 as its optional [tileiras] dependency. The CI image should ship a matching system tileiras so the compiler version is aligned.
• PyTorch cu130 wheels remain compatible with CUDA 13.2 (backward compatible).
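
The version range in the second bullet can be checked mechanically. The sketch below is illustrative only: `parse_version` and `satisfies_tileiras_range` are hypothetical helpers, and they compare plain dotted numeric versions rather than implementing full PEP 440 specifier semantics.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '13.2.0' into (13, 2, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies_tileiras_range(version: str) -> bool:
    """Return True when the version falls inside >=13.2, <13.3."""
    parsed = parse_version(version)
    # Tuple comparison is lexicographic, so (13, 2) <= (13, 2, 0) < (13, 3).
    return (13, 2) <= parsed < (13, 3)
```

Under this check a 13.1.x tileiras (as shipped with the old base image) falls outside the range, while any 13.2.x release falls inside it.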

What changed

modeling/transformers/Dockerfile: updated BASE_IMAGE default from cuda:13.1.0 to cuda:13.2.0.

How to verify

In the CI docker build / test-ops job logs, check:
• nvcc --version → should show 13.2
• dpkg -l | grep tileiras → should show 13.2.x
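
The two checks above could also be scripted. This is a minimal sketch, not part of the PR: the helper names are hypothetical, the `release X.Y` pattern is what `nvcc --version` is assumed to print, and because the exact Debian package name is not given here, any installed package whose name contains `tileiras` is matched.

```python
import re
import subprocess
from typing import List, Optional


def nvcc_release(output: str) -> Optional[str]:
    """Extract 'MAJOR.MINOR' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def tileiras_versions(dpkg_output: str) -> List[str]:
    """Return the version column of installed `dpkg -l` rows naming tileiras."""
    versions = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages have status 'ii'; name is column 2, version column 3.
        if len(fields) >= 3 and fields[0] == "ii" and "tileiras" in fields[1]:
            versions.append(fields[2])
    return versions


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def main() -> None:
    """Intended to be called inside the CI container, where nvcc and dpkg exist."""
    print("nvcc release:", nvcc_release(run(["nvcc", "--version"])))
    print("tileiras packages:", tileiras_versions(run(["dpkg", "-l"])))
```

Calling `main()` in the built image should report a 13.2 release and a 13.2.x tileiras package if the bump took effect.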

CI Configuration

config:
  build: true
  test: ["ops", "benchmark"]

🤖 Generated with Claude Code

The latest cuda-tile (1.2.0) requires tileiras >=13.2. Update the CI Docker base image to CUDA 13.2.0 so the system-installed tileiras matches what cuda-tile expects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-03-24T03:52:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

hannahli-nv · 2026-03-24T03:56:40Z

/ok to test 7edbbea

With CUDA 13.2, the tileiras compiler uses more memory per compilation. Running 16 xdist workers with -n auto and different kernel modules compiling simultaneously causes OOM (exit code 137) on the CI runner.

Adding --dist=loadscope groups tests by module/class so each worker compiles one kernel type at a time, reducing peak memory. This matches the Ocean CI configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
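
To make the grouping concrete, here is a simplified sketch of how tests can be bucketed by scope, which is the idea behind --dist=loadscope. It is not pytest-xdist's scheduler; `scope_of` and `group_by_scope` are hypothetical helpers, and the node IDs in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, List


def scope_of(node_id: str) -> str:
    """Drop the final '::' component, leaving the owning module or class."""
    return node_id.rsplit("::", 1)[0]


def group_by_scope(node_ids: List[str]) -> Dict[str, List[str]]:
    """Bucket test node IDs so each bucket can be handed to a single worker."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node_id in node_ids:
        groups[scope_of(node_id)].append(node_id)
    return dict(groups)
```

For example, `group_by_scope(["a.py::test_x", "a.py::test_y", "b.py::TestC::test_z"])` yields one bucket for `a.py` and one for `b.py::TestC`, so both `a.py` tests would land on the same worker.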

hannahli-nv · 2026-03-24T05:33:59Z

/ok to test 24bfa7b

With --dist=loadscope, tests run more sequentially per worker to avoid OOM. This trades speed for memory safety, so the previous 15-minute step timeout (17m job timeout) is too tight. Bump to 25m step / 30m job to match the test-benchmark job timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hannahli-nv · 2026-03-24T06:08:23Z

/ok to test 4d051ce

The previous --dist=loadscope fix prevented OOM but caused timeouts because it serializes all tests within a module to one worker (e.g. 156 test_bmm tests on a single worker).

Switch to -n 4 (4 parallel workers instead of 16 from -n auto):
- 4 workers use ~1/4 peak memory, avoiding the CUDA 13.2 OOM
- Tests distribute freely across workers, no serialization bottleneck
- Expected runtime ~12-15 min, well within the 25 min step timeout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
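
The serialization bottleneck can be illustrated with a toy model. The sketch assumes every test costs the same and uses a simple greedy assignment, which is not how pytest-xdist actually schedules work; the function names are hypothetical, and only the 156 test_bmm count comes from this PR.

```python
import heapq
import math
from typing import Dict


def loadscope_longest_queue(tests_per_module: Dict[str, int], workers: int) -> int:
    """Whole modules are indivisible: give each to the least-loaded worker."""
    loads = [0] * workers
    heapq.heapify(loads)
    for count in sorted(tests_per_module.values(), reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + count)
    return max(loads)


def free_longest_queue(tests_per_module: Dict[str, int], workers: int) -> int:
    """Tests spread evenly across workers regardless of their module."""
    return math.ceil(sum(tests_per_module.values()) / workers)
```

With 156 test_bmm tests plus three made-up modules of 20 tests each, the longest queue under module grouping stays at 156 even with 16 workers, while free distribution over 4 workers gives a longest queue of 54.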

hannahli-nv · 2026-03-24T07:02:04Z

/ok to test aa59188

- -n 16 (auto): OOM with CUDA 13.2 tileiras (~2.5 min)
- -n 4: no OOM but too slow, times out at 25 min
- -n 8: half the memory of 16 workers, twice the speed of 4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hannahli-nv · 2026-03-24T07:40:19Z

/ok to test cc6a34b

-n 8 was stable (no OOM) but timed out at 25 min (just barely). Bump to -n 10 for more speed, and increase step timeout to 35 min (job timeout 40 min) as safety margin for CUDA 13.2 compilation.

Summary of what we know:
- -n 16 (auto): OOM at ~2.5 min
- -n 4: no OOM, timeout at 25 min
- -n 8: no OOM, timeout at 25 min (nearly complete)
- -n 10: should complete in ~20 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hannahli-nv · 2026-03-24T08:31:13Z

/ok to test 3f2ffb8

With -n 12 passing in 19 min, 25m step timeout gives ~6 min headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
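
The headroom arithmetic, plus a rough way to relate worker count to runtime, can be sketched as follows. The inverse-scaling estimate is an assumption made here for illustration (it ignores fixed startup cost and uneven test durations), and both function names are hypothetical.

```python
def headroom_minutes(step_timeout: float, runtime: float) -> float:
    """Minutes left between the observed runtime and the step timeout."""
    return step_timeout - runtime


def estimated_runtime(known_workers: int, known_runtime: float, workers: int) -> float:
    """Naive estimate assuming runtime scales inversely with worker count."""
    return known_runtime * known_workers / workers
```

`headroom_minutes(25, 19)` gives the 6 minutes quoted above. Under the inverse-scaling assumption, `estimated_runtime(12, 19, 8)` is about 28.5 minutes, which is in the same ballpark as the -n 8 run that hit the 25-minute timeout.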

hannahli-nv · 2026-03-24T09:48:49Z

/ok to test b433a75

hannahli-nv · 2026-03-31T00:51:41Z

/ok to test 0a2582e

Bump CUDA base image from 13.1.0 to 13.2.0

7edbbea


hannahli-nv and others added 2 commits March 24, 2026 16:14

Bump pytest workers to -n 12

3f2ffb8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hannahli-nv requested a review from xjmxyt March 24, 2026 09:09

Reduce test-ops timeouts to 25m step / 30m job

b433a75


Merge branch 'main' into bump-cuda-13.2

0a2582e

hannahli-nv wants to merge 9 commits into main from bump-cuda-13.2
