[Common] Fix long compile time in padding.cu on arch 75 #2562

jberchtold-nvidia · 2026-01-05T19:52:03Z

Description

After PR #2548 we observe very long compile times in padding.cu for arch 75, >20+mins where previously it only took ~30seconds to compile.

To temporarily work around this issue, we are moving the scope of row_offset inward but keeping the cast to size_t

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Move scope of static_cast row offset inward to usage to work around long compilation time issue

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

for more information, see https://pre-commit.ci

jberchtold-nvidia · 2026-01-05T19:54:38Z

/te-ci L0

greptile-apps · 2026-01-05T19:56:05Z

Greptile Summary

Addresses severe compilation time regression on arch 75 introduced by PR #2548. The previous fix added size_t row_offset = static_cast<size_t>(row) * row_length as a variable to prevent integer overflow, but this caused compilation to take 20+ minutes instead of the usual 30 seconds on arch 75 GPUs.

Key Changes:

Removed row_offset variable declarations from both multi_padding_kernel and multi_unpadding_kernel
Inlined the static_cast<size_t>(row) * row_length computation at all 6 usage sites (3 per kernel)
Preserved the critical overflow protection by keeping the static_cast<size_t> cast

Technical Assessment:
The workaround trades potential register optimization for practical compilation feasibility. Modern compilers should recognize the repeated computation and optimize appropriately. The change maintains functional correctness since the overflow protection is preserved and the computation is mathematically identical.

Confidence Score: 5/5

This PR is safe to merge - it's a well-justified compiler workaround that maintains correctness while fixing a critical build issue
The change is minimal, focused, and directly addresses a specific compilation time issue on arch 75. The overflow protection from PR Fix overflow of padding/unpadding kernel #2548 is fully preserved through the inline static_cast<size_t> casts. The transformation is mathematically equivalent and should be optimized similarly by the compiler. No functional behavior changes are expected.
No files require special attention

Important Files Changed

Filename	Overview
transformer_engine/common/util/padding.cu	Moved `static_cast<size_t>(row) * row_length` computation inline to fix 20+ minute compile time issue on arch 75, while preserving overflow protection

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant PR2548 as PR #2548
    participant Compiler as NVCC Compiler (arch 75)
    participant PR2562 as PR #2562 (Current)
    
    Note over Dev,PR2548: Original Issue: Integer Overflow
    Dev->>PR2548: Add static_cast<size_t> for row offset
    PR2548->>PR2548: Declare row_offset outside inner loop
    Note over PR2548: size_t row_offset = static_cast<size_t>(row) * row_length
    
    Note over Compiler: Compilation Time Issue Discovered
    PR2548->>Compiler: Compile padding.cu for arch 75
    Compiler->>Compiler: Process row_offset variable scope
    Note over Compiler: Takes 20+ minutes (was 30 seconds)
    
    Note over Dev,PR2562: Workaround: Inline Computation
    Dev->>PR2562: Move static_cast inline to usage sites
    PR2562->>PR2562: Remove row_offset variable declaration
    PR2562->>PR2562: Use static_cast<size_t>(row) * row_length inline
    Note over PR2562: Applied to 3 locations per kernel<br/>(input read, output write, padding write)
    
    PR2562->>Compiler: Compile with inline computation
    Compiler->>Compiler: Optimize without intermediate variable
    Note over Compiler: Compile time restored to ~30 seconds
    
    Note over PR2562: Overflow Protection Maintained<br/>Performance Functionally Equivalent

greptile-apps · 2026-01-05T19:56:05Z

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

_{This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".}

Fix long compile time in padding.cu

03fdb9f

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

jberchtold-nvidia requested a review from ptrendx January 5, 2026 19:52

jberchtold-nvidia changed the title ~~[Common] Fix long compile time in padding.cu~~ [Common] Fix long compile time in padding.cu on arch 75 Jan 5, 2026

ptrendx previously approved these changes Jan 5, 2026

View reviewed changes

[pre-commit.ci] auto fixes from pre-commit.com hooks

dc6452d

for more information, see https://pre-commit.ci

pre-commit-ci bot dismissed ptrendx’s stale review via dc6452d January 5, 2026 19:53

ptrendx approved these changes Jan 5, 2026

View reviewed changes

jberchtold-nvidia merged commit df69100 into NVIDIA:main Jan 6, 2026
32 of 34 checks passed

jberchtold-nvidia deleted the jberchtold/fix-long-compile-time-in-padding-cu branch January 6, 2026 18:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Common] Fix long compile time in padding.cu on arch 75 #2562

[Common] Fix long compile time in padding.cu on arch 75 #2562

Uh oh!

jberchtold-nvidia commented Jan 5, 2026

Uh oh!

jberchtold-nvidia commented Jan 5, 2026

Uh oh!

greptile-apps bot commented Jan 5, 2026

Uh oh!

greptile-apps bot commented Jan 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Common] Fix long compile time in padding.cu on arch 75 #2562

[Common] Fix long compile time in padding.cu on arch 75 #2562

Uh oh!

Conversation

jberchtold-nvidia commented Jan 5, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

jberchtold-nvidia commented Jan 5, 2026

Uh oh!

greptile-apps bot commented Jan 5, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot commented Jan 5, 2026

Greptile's behavior is changing!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants