Skip to content

Conversation

@jberchtold-nvidia
Copy link
Collaborator

Description

After PR #2548 we observe very long compile times in padding.cu for arch 75, >20+mins where previously it only took ~30seconds to compile.

To temporarily work around this issue, we are moving the scope of row_offset inward but keeping the cast to size_t

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Move scope of static_cast row offset inward to usage to work around long compilation time issue

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
@jberchtold-nvidia jberchtold-nvidia changed the title [Common] Fix long compile time in padding.cu [Common] Fix long compile time in padding.cu on arch 75 Jan 5, 2026
ptrendx
ptrendx previously approved these changes Jan 5, 2026
@jberchtold-nvidia
Copy link
Collaborator Author

/te-ci L0

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile Summary

Addresses severe compilation time regression on arch 75 introduced by PR #2548. The previous fix added size_t row_offset = static_cast<size_t>(row) * row_length as a variable to prevent integer overflow, but this caused compilation to take 20+ minutes instead of the usual 30 seconds on arch 75 GPUs.

Key Changes:

  • Removed row_offset variable declarations from both multi_padding_kernel and multi_unpadding_kernel
  • Inlined the static_cast<size_t>(row) * row_length computation at all 6 usage sites (3 per kernel)
  • Preserved the critical overflow protection by keeping the static_cast<size_t> cast

Technical Assessment:
The workaround trades potential register optimization for practical compilation feasibility. Modern compilers should recognize the repeated computation and optimize appropriately. The change maintains functional correctness since the overflow protection is preserved and the computation is mathematically identical.

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-justified compiler workaround that maintains correctness while fixing a critical build issue
  • The change is minimal, focused, and directly addresses a specific compilation time issue on arch 75. The overflow protection from PR Fix overflow of padding/unpadding kernel #2548 is fully preserved through the inline static_cast<size_t> casts. The transformation is mathematically equivalent and should be optimized similarly by the compiler. No functional behavior changes are expected.
  • No files require special attention

Important Files Changed

Filename Overview
transformer_engine/common/util/padding.cu Moved static_cast<size_t>(row) * row_length computation inline to fix 20+ minute compile time issue on arch 75, while preserving overflow protection

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant PR2548 as PR #2548
    participant Compiler as NVCC Compiler (arch 75)
    participant PR2562 as PR #2562 (Current)
    
    Note over Dev,PR2548: Original Issue: Integer Overflow
    Dev->>PR2548: Add static_cast<size_t> for row offset
    PR2548->>PR2548: Declare row_offset outside inner loop
    Note over PR2548: size_t row_offset = static_cast<size_t>(row) * row_length
    
    Note over Compiler: Compilation Time Issue Discovered
    PR2548->>Compiler: Compile padding.cu for arch 75
    Compiler->>Compiler: Process row_offset variable scope
    Note over Compiler: Takes 20+ minutes (was 30 seconds)
    
    Note over Dev,PR2562: Workaround: Inline Computation
    Dev->>PR2562: Move static_cast inline to usage sites
    PR2562->>PR2562: Remove row_offset variable declaration
    PR2562->>PR2562: Use static_cast<size_t>(row) * row_length inline
    Note over PR2562: Applied to 3 locations per kernel<br/>(input read, output write, padding write)
    
    PR2562->>Compiler: Compile with inline computation
    Compiler->>Compiler: Optimize without intermediate variable
    Note over Compiler: Compile time restored to ~30 seconds
    
    Note over PR2562: Overflow Protection Maintained<br/>Performance Functionally Equivalent
Loading

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@jberchtold-nvidia jberchtold-nvidia merged commit df69100 into NVIDIA:main Jan 6, 2026
32 of 34 checks passed
@jberchtold-nvidia jberchtold-nvidia deleted the jberchtold/fix-long-compile-time-in-padding-cu branch January 6, 2026 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants