Ifu dev 260419 v2.15 by VeeraRajasekhar · Pull Request #616 · ROCm/TransformerEngine

VeeraRajasekhar · 2026-06-08T15:13:09Z

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

…#2757)

* [Common] Fix linker error for to_string(DType) in distributed tests

  Make transformer_engine::to_string(DType) inline in common.h so that translation units outside libtransformer_engine.so can resolve it without requiring the symbol to be exported. Regression introduced by 61f9594, which added to_string(DType) calls into TRANSFORMER_ENGINE_TYPE_SWITCH_* macros, causing test object files to reference the symbol that the linker version script hides.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

---------

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…-Cast Fusion Kernel (#2555)

* first draft
* pass numerical unit test
* format
* add benchmark script
* lint and format
* compile guard
* warning fix
* resolve greptile comment
* minor style fixes
* fix namespace
* resolve some comments
* fix comment
* attempt to fix compile CI with guard
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* better naming for tests
* fix deprecate message
* more compile guard
* new API name
* fix format all in one
* try to fix compile CI again
* AI code review comments
* to pass oldest compile CI with cuda 12.1
* add more guards to nvfp4
* make multiply inverse default numerics
* update numerics of nvfp4 partial cast as well
* resolve comments
* add NVTE_BUILD_NUM_PHILOX_ROUNDS after rebase
* simplify compile guard messages

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…Linear` (#2761)

* Load multi-param checkpoint from single-param config in GroupedLinear
* Multi-param to single param case
* Multi-param to single param case
* Better varnames

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix GMM cuBLAS version and SM arch checks
* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu
* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu
* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu
* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu
* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu
* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Pin python 3.13 in vermin check
* Update vermin version for python 3.14 support
* Use sha instead of tag

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* init
* work finished
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* lint fixes
* fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix
* removed warning.warn
* [PyTorch] Remove dead None-check for num_out_tokens in moe_permute_mask_map_forward

  num_out_tokens is typed as int in the custom_op signature and can never be None; the check was incorrectly carried over from the class-based upstream version during merge conflict resolution.

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…are detached (#2772)

[PyTorch] Change the restore tensor API to ensure tensors are detached from ctx

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

…ges it (#2781)

Install pytest in the ONNX L1 test, as the PyTorch container no longer packages it

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

…_descriptors (#2782)

* Fix zero-sized groups in update_tma_descriptors
* Update test_cast_mxfp8_grouped.cu
* Apply suggestions from code review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…+ (#2693)

* Enable sm120 support for fused attn if cuDNN is 9.18.1+
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Force intermediate tensors such as S, Sum_Exp, and Max to be BHS1 shape instead of TH1 for sm120
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add support for sm120 correct batch, seq dims
* Add support for sm120 BHS1-style max logit even when QKV are THD, to avoid an incorrect max logit calculation (one that includes padded tokens in the max)
* Disable fused and flash attn for sm120 filter:kv cache
* For CP P2P attn, set softmax_lse_in_packed_format to False if sm120+
* Assert in TE if T3HD/TH3D layout is used on sm120 before cuDNN F16 sdpa arbitrary kernel call
* Modify is_ragged_q && cudnn_runtime_version >= 90600 check to also include a check for sm120
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* nit: Code clean up
* Disable fused attn for T3HD and TH3D
* nit: Add missed sm120 guard
* Modify sm120 condition to be very specific to sm120 and not generalized to sm120+
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* nit: Fix missing sm120 check in fwd
* Move the check for sm120 T3HD/TH3D to nvte_get_fused_attn_backend() instead of higher layers in TE stack
* nit: Check for matching sm120 and not sm120+
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
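
The sm120 gating described in the commit list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `nvte_get_fused_attn_backend()`: the function name `passes_sm120_filter`, the constants, and the lowercase layout strings are hypothetical, and the integer version encoding (9.18.1 as 91801) is inferred from the `cudnn_runtime_version >= 90600` check mentioned in the commits.

```python
# Hypothetical constants for illustration only.
SM120 = 120
MIN_CUDNN_FOR_SM120 = 91801  # assumed encoding of cuDNN 9.18.1: major*10000 + minor*100 + patch
UNSUPPORTED_SM120_LAYOUTS = {"t3hd", "th3d"}


def passes_sm120_filter(sm_arch: int, cudnn_version: int, qkv_layout: str) -> bool:
    """Return False only when an sm120-specific restriction rules out fused attention."""
    # Exact match on sm120, not "sm120 or newer", as the later commits insist.
    if sm_arch != SM120:
        return True
    # Fused attention on sm120 needs a new enough cuDNN.
    if cudnn_version < MIN_CUDNN_FOR_SM120:
        return False
    # T3HD / TH3D layouts are disabled on sm120.
    return qkv_layout.lower() not in UNSUPPORTED_SM120_LAYOUTS
```

Comparing with `!=` rather than `<` is the point of the "sm120 and not sm120+" commits: a newer architecture is not silently subjected to restrictions that were only validated for sm120.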

* code drop
* code drop
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* docs
* nvfp4 internals support
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* lint fixes
* Update transformer_engine/debug/features/dump_tensors.py
* fix
* Update transformer_engine/debug/features/dump_tensors.py
* Update transformer_engine/debug/features/dump_tensors.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update tests/pytorch/debug/test_log.py
* Update transformer_engine/debug/features/dump_tensors.py
* fix
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Remove dump_quantized_internals support from DumpTensors

  Drop the dump_quantized_internals config option, the _get_quantized_internals method, and all helper functions for extracting scales/raw data from Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, and NVFP4Tensor. Remove corresponding tests: test_dump_tensors_nvfp4_unpacked_codes and NVFP4_DUMP_TENSORS_CONFIG, and scale/data assertions from test_dump_tensors_sanity.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Address Greptile review comments

  - Add dot ('.') to _sanitize_name to handle common PyTorch dotted layer names like 'encoder.layer.0.attention'
  - Add docstring note about pickle dependency for the 'quantized' key
  - Add comment explaining weights_only=False in test
  - Remove redundant local RecipeState import in test_nvfp4_numeric
* Remove portability suggestion from quantized key docstring
* Compute rank lazily in _expected_root_dir

  Avoids relying on stale self.rank when ensure_initialized is called before initialize() has set the rank. Consistent with how nvdlfw_inspect logger resolves rank.
* detach tensors before saving; verify dump filename in test

  Detach both high_precision and quantized tensors before saving to avoid serializing the autograd graph. For QuantizedTensor this is a zero-copy view (make_like), so no extra GPU allocation. Add filename format assertion to test_dump_tensors_sanity to catch regressions in _sanitize_name or the naming convention.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add empty dump_dict log; assert QuantizedTensor type in test

  Log a message when no tensors are available to dump so the user has an explicit signal that no file was written. Assert that the quantized key round-trips as a QuantizedTensor to catch regressions in detach() or serialisation path.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update transformer_engine/debug/features/dump_tensors.py
* Address review: iter subdirs, remove dead rank field, add allclose test and MSE example

  - Organize dumps into per-iteration subdirectories (iter_000000/) to keep file count manageable per directory.
  - Remove unused self.rank attribute from TensorLogger.
  - Add torch.allclose assertion in test to verify serialization correctness.
  - Add docstring example showing how to load dumps and compute MSE.

  Made-with: Cursor
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix: use detach().clone() to avoid shared storage in DumpTensors

  Using tensor.detach() creates a view sharing the same underlying storage. If any in-place operation modifies the tensor after the dump, the saved data would be silently corrupted. Use .clone() to ensure the dump captures an independent copy of the data.
* test: use torch.equal instead of torch.allclose for serialisation round-trip

  The saved tensor is an exact bit-for-bit copy (detach().clone()), so torch.equal is the correct check. torch.allclose with its default tolerances could mask a genuine dtype conversion or precision loss introduced by a future change to the serialisation path.
* fix: add tp_size to DumpTensors.inspect_tensor and fix KeyError in call_feature backward compat pop
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
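
The naming scheme these DumpTensors commits describe (sanitized layer names, one subdirectory per iteration) might look something like the sketch below. It is only an illustration: `sanitize_name` and `dump_path` are hypothetical helpers, and the set of replaced characters, the underscore replacement, the file-name pattern, and the `.pt` extension are all assumptions; only the dotted-name example and the `iter_000000/` directory format come from the commit messages.

```python
import re
from pathlib import Path


def sanitize_name(name: str) -> str:
    """Replace characters that are awkward in file names with underscores.

    The character set here is an assumption; the commits only confirm that
    '.' is handled so dotted PyTorch layer names become usable.
    """
    return re.sub(r"[./\\: ]", "_", name)


def dump_path(root: Path, iteration: int, layer_name: str, tensor_name: str) -> Path:
    """Build a per-iteration dump path such as root/iter_000000/<layer>_<tensor>.pt."""
    iter_dir = root / f"iter_{iteration:06d}"  # zero-padded, as in iter_000000/
    file_name = f"{sanitize_name(layer_name)}_{sanitize_name(tensor_name)}.pt"
    return iter_dir / file_name
```

For example, `dump_path(Path("dumps"), 0, "encoder.layer.0.attention", "activation")` places the file under `dumps/iter_000000/`, so each directory holds only one iteration's tensors.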

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* change distributed tests infra for fsdp2
* verbose flag for reporting
* add back comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* another minor fix
* not needed for this PR
* address review comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* unnecessary comments

* add cudnn dln+add
* try fixing cudnn build issue
* guard against cudnn version
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* change itype to wtype for add in rmsnorm_bwd
* remove dead code
* remove dangling todo

---------

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add blackwell support filter for 9.7<=cudnn<9.18.1
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* simplify conditionals
* fix conditionals again
* fix conditionals again
* update the error log
* remove the python filter and correct the cpp filter

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…quired (#2798)

* Disable fused attention for sm120 if determinism is required
* nit: disable fused attn for sm120 determinism, if training

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

… always and `Max` when `return_max_logit=True` (#2677)

* cudnn now returns Stats always and Max only with `return_max_logit=true`
* fix a typo that caused a bug
* update doc strings
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix more docs
* fixes from the feedback
* update cudnn-frontend to v1.19.1
* update the cudnn frontend
* fix a wrong omission
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Move cuDNN FE to v1.21.0 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add warning if using BSHD and max_segments_per_seq > 1
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update transformer_engine/jax/attention.py
* Update transformer_engine/jax/attention.py
* Apply suggestions from code review
* Remove warning test

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

* Refactor to group_sizes per tensor
* Support first_dims and last_dims instead of a single group_sizes per tensor
* Refactor GMM FFIs to store static attrs as structs
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Cleanup C++ v2 FFI
* Fix int64 workspace usage
* Address greptile comments
* Refactor wgrad-specific checks to be generic for GMM in gemm.py
* Refactor XLA FFI struct setup
* Fix edge case in TE v1 GMM
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix issues on Hopper
* Refactor
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Address comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Lint
* Fixes for Hopper
* Address review comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Grouped quantization test fixes

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Pass input_output_alias to TritonAutotunedKernelCall
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add jax version guard for the input_output_aliasing fix

---------

Signed-off-by: JAX Toolbox <jax@nvidia.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* done
* one review comment from greptile
* instead part of the comment not needed
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* address review comments
* Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py
* No need to set it to None

  Remove unnecessary columnwise data and scale inv assignments.

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* cudnn now returns Stats always and Max only with `return_max_logit=true`
* fix a typo that caused a bug
* update doc strings
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix more docs
* fixes from the feedback
* update cudnn-frontend to v1.19.1
* update the cudnn frontend
* fix a wrong omission
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* bugfix: mask out padding tokens when THD
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fixes from greptile feedback
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* minor nit
* fixes from feedback

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
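
Why padding has to be masked out for THD (packed) inputs can be seen in a small pure-Python sketch. It is a deliberate simplification and not the TransformerEngine implementation: it assumes one score per packed token position, and the function name `max_logit_thd` and the `cu_seqlens` / `cu_seqlens_padded` arguments (cumulative real lengths and cumulative padded offsets) are hypothetical names chosen for illustration.

```python
from typing import Sequence


def max_logit_thd(
    scores: Sequence[float],
    cu_seqlens: Sequence[int],
    cu_seqlens_padded: Sequence[int],
) -> float:
    """Maximum over valid (non-padding) positions of a packed buffer.

    cu_seqlens[i + 1] - cu_seqlens[i] is the real length of sequence i;
    cu_seqlens_padded[i] is where sequence i starts in the padded buffer.
    """
    if len(cu_seqlens) != len(cu_seqlens_padded):
        raise ValueError("offset arrays must have the same length")
    best = float("-inf")
    for i in range(len(cu_seqlens) - 1):
        start = cu_seqlens_padded[i]
        valid = cu_seqlens[i + 1] - cu_seqlens[i]
        # Only the first `valid` slots of each padded segment hold real tokens.
        for value in scores[start:start + valid]:
            best = max(best, value)
    return best
```

With two sequences of real length 2, each padded to 3 slots, a plain `max(scores)` would pick up whatever value happens to sit in a padding slot, while the masked version ignores it.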

* Enabled persistency with WorkID Query feature
* Added a struct with tunable parameters
* Added persistency with static scheduling
* Fixed test cases
* Ready for benchmarking
* Fixed out-of-boundary error
* Tuned kernel parameters
* Refactoring
* Refactoring 2
* Refactoring 3
* Removed the dynamic (WorkID Query) persistency
* Ready for PR
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixes per the review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Ready for benchmark
* Ready for benchmark - Regular kernel
* Added the source code to the profiler
* Added constructors to Job and Block descriptors
* Removed the prefetch overlapping between jobs
* Cache tensor ID
* ShapeRepresentation is not a template parameter
* Removed redundant fence_proxy
* Refactoring
* Used mixed precision FMA
* Added Quantize parameters
* Added the fast math branch
* Added the fast math to cpp test suite
* Align tests
* Use STS instead of generic ST
* Add zero-tensor cases
* Used LDS instead of generic LD in colwise path
* Used LDS instead of generic LD in rowwise
* Ready for merge
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Uncommented test cases
* Added FP16 Fast math path to rowwise processing
* Refactoring
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed lint
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix
* Fixes
* Fix
* Fixed test suite
* Fixed test suite
* Fixes per the review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Modifications per the review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Assert the buffer size
* Added fast math RCP for bf16
* Fast math for BF16 is now default
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed compilation error when compiling on previous archs
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Boundary condition fix
* Fixed compilation error
* Refactoring. Moved helpers to core-common
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Refactoring
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Refactoring per the review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Addressed the PR review comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed the compilation error when PTX was compiled for CUDA 13.0
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed pytorch extensions

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…os (#2823) * Fix: Use jitted kernels for generating THD (and BSHD) segment pos if only segment id is passed Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Make passing of segment_pos to from_segment_ids_and_pos for creating a SequenceDescriptor mandatory Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Make test changes for from_segment_ids_and_pos API change. Also some nits. Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Make segment_pos arg mandatory and not Optional Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Add comments for from_segment_ids_and_pos Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * nit: Change data types for BSHD seg pos and seg id to be int32 and consistent with THD when setting up test inputs Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Replace a TypeError if segment_pos is not passed with a ValueError with a message Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
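To make the idea of deriving segment positions from segment IDs concrete, here is a rough pure-Python sketch. It assumes that a segment position is simply the index of a token inside its run of equal segment IDs; that definition, and the helper name `segment_pos_from_ids`, are illustrative assumptions and not the jitted kernels the commit refers to.

```python
def segment_pos_from_ids(segment_ids):
    """Return, for each token, its offset inside the current run of equal IDs.

    Hypothetical helper for illustration only; the real implementation
    runs as jitted JAX kernels and may treat padding differently.
    """
    positions = []
    previous_id = None
    position = 0
    for segment_id in segment_ids:
        # Restart counting whenever the segment ID changes.
        position = position + 1 if segment_id == previous_id else 0
        positions.append(position)
        previous_id = segment_id
    return positions


example_ids = [1, 1, 1, 2, 2, 0, 0]
example_pos = segment_pos_from_ids(example_ids)
```

In this sketch a trailing run of padding IDs is numbered like any other segment; whether the library does the same is not stated in the commit message.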

* GEMM + Swiglu fused Grouped MLP for MXFP8 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * cleanup/lint Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Properly cache the alpha tensor Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * nD dummy grad Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 0 tokens in entire rank Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tmp downgrade cublas version check Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * delayed wgrad tests pass for basic gl Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * merge everything Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Rebase into fused_mxfp8_grouped_mlp; unit tests for delayed wgrad working Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests being skipped for fusible ops Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Integrate mxfp8 dbias kernel in group_quantize Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add bias/dbias fused support with cute GEMMs Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Check bias/dbias support Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Pack biases more efficiently Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * GroupedTensor for biases to avoid concat Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> 
* format Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Support 1D grouped tensor shape for bias and fix checkpointing Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes and tests Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Refactor grouped tensor marking for paged stashing Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove setting logical_shape in mark_grouped_tensor Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup logical_shape Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * pass the tests for now Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address some review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * more cleanups Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * refactor wgrad logic Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename argument from single_grouped_parameter to single_grouped_weight Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Check wgrad store context is not empty for 0 token case. 
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Test only checks for fusion if fused kernel is available Signed-off-by: Tim Moon <tmoon@nvidia.com> * fix the tolerance to be of bf16 for the cute gemm Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * Update transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> * address further review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address more review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address more review comments + test for zero grouped tensor work case Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * cublaslt remove zero work gemm avoidance Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the wgrad test Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * split dbias functionality from gq api Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Format and lint Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * port fixes and add better doc for page stashing war Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Guard fusion via env Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change to trigger CI Remove unnecessary blank line in docstring. 
* To retrigger CI * Space to trigger the pipeline * fix zero work cublas gemm Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: Tim Moon <tmoon@nvidia.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Varun Thumbe <vthumbe@nvidia.com> Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

Add tests that demonstrate two known memory issues with FSDP2 + FP8: - Issue #2681: FP8 weight copies created during te.autocast() forward pass accumulate across layers instead of being freed between layers, defeating FSDP2's memory efficiency. Detected by comparing per-layer forward memory increments against a bf16 baseline using layer hooks. - Issue #2717: Transpose cache tensors (_create_transpose) allocated during backward persist until the next forward pass instead of being freed after backward completes. Detected by comparing the backward memory delta (post_bwd - post_fwd) against a bf16 baseline. New tests: - test_bf16_no_excess_forward_memory: control, validates per-layer measurement - test_bf16_no_excess_backward_memory: control, validates backward delta comparison - test_fp8_temp_accumulation_across_layers: xfail, detects #2681 - test_transpose_cache_retained_after_backward: xfail, detects #2717 All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}. Signed-off-by: Peter St. John <pstjohn@nvidia.com> Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
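The per-layer comparison described for issue #2681 can be illustrated with a small pure-Python sketch. It assumes the memory readings have already been collected as byte counts by layer hooks; the function name `excess_per_layer` and the `tolerance_bytes` parameter are hypothetical and do not come from the test suite.

```python
def excess_per_layer(fp8_readings, bf16_readings, tolerance_bytes=0):
    """Return how far each FP8 per-layer increment exceeds the bf16 baseline.

    Each argument is a list of allocated-memory readings in bytes: one taken
    before the first layer and one after every layer's forward pass.
    """
    def increments(readings):
        return [after - before for before, after in zip(readings, readings[1:])]

    fp8_increments = increments(fp8_readings)
    bf16_increments = increments(bf16_readings)
    return [
        max(0, fp8 - bf16 - tolerance_bytes)
        for fp8, bf16 in zip(fp8_increments, bf16_increments)
    ]


# Made-up readings: the FP8 run grows 5 bytes more per layer than bf16.
bf16_run = [100, 110, 120, 130]
fp8_run = [100, 115, 130, 145]
excess = excess_per_layer(fp8_run, bf16_run)
```

A non-zero excess on every layer is the accumulation pattern the xfail test is meant to detect; an excess that appears once and then vanishes would suggest the temporary copies are being freed.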

…x (#2820) * Compute swizzle_idx once per thread and pass into ComputeKernel. Signed-off-by: Cael Ling <caell@nvidia.com> * one __syncthreads per stage in GroupHadamardAmaxTmaKernel Signed-off-by: Cael Ling <caell@nvidia.com> * streamline group Hadamard ComputeKernel loads Signed-off-by: Cael Ling <caell@nvidia.com> * streamline group Hadamard ComputeKernel loads Signed-off-by: Cael Ling <caell@nvidia.com> * streamline group Hadamard ComputeKernel loads Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * one __syncthreads per stage in GroupHadamardAmaxTmaKernel Signed-off-by: Cael Ling <caell@nvidia.com> Made-with: Cursor * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Compute swizzle_idx once per thread and pass into ComputeKernel. Signed-off-by: Cael Ling <caell@nvidia.com> * Fix kReturnIdentityAmax path Signed-off-by: Cael Ling <caell@nvidia.com> * Fix kReturnIdentityAmax path Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply the change to other variants Signed-off-by: Cael Ling <caell@nvidia.com> * Refactor the change to other variants Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor the change to other variants Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor the ldmatrix logics Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Cael Ling <caell@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Minor misc optimizations in fused GroupedMLP Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add new test to compare single vs multi-param fused GMLP case Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add bias support Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix: enabling fused _router to be able to handle large topk and number of experts - expanding shared memory when needed - switch to radix topk selection when topk is large - test_fused_router.py updated with large num experts and tolerances refined for different cases * added topk>=16 in tests/pytorch/test_fused_router.py added return value check of cudaFuncSetAttribute in transformer_engine/common/fused_router/fused_topk_with_score_function.cu added dtype dependent eps in tests/pytorch/test_fused_router.py removed unneeded code in transformer_engine/common/fused_router/utils.h * test_fused_router.py needs to skip topk >= num_experts case Signed-off-by: Harry Zhou <hhanyu@nvidia.com> cleaned up raw warp operations added comments added shared_memory check added return code check * warning about dtype for tolerance in test_fused_router.py Signed-off-by: Harry Zhou <hhanyu@nvidia.com> --------- Signed-off-by: Harry Zhou <hhanyu@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
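For readers unfamiliar with radix-based top-k selection, the sketch below shows the general technique on non-negative integers in plain Python. It is not the CUDA kernel from this commit: the real code works on floating-point scores across a warp, and both function names here are hypothetical.

```python
def radix_topk_threshold(values, k, bits=32):
    """Find the k-th largest value by fixing one bit at a time, high to low."""
    assert 0 < k <= len(values)
    candidates = list(values)
    threshold = 0
    remaining = k
    for bit in range(bits - 1, -1, -1):
        ones = [v for v in candidates if (v >> bit) & 1]
        if len(ones) >= remaining:
            # The k-th largest has this bit set; keep only those candidates.
            candidates = ones
            threshold |= 1 << bit
        else:
            # All values with the bit set rank above the k-th largest.
            remaining -= len(ones)
            candidates = [v for v in candidates if not (v >> bit) & 1]
    return threshold


def radix_topk(values, k, bits=32):
    """Return the k largest values, breaking ties at the threshold by order."""
    threshold = radix_topk_threshold(values, k, bits)
    above = [v for v in values if v > threshold]
    ties = [v for v in values if v == threshold]
    return sorted(above + ties[: k - len(above)], reverse=True)
```

The appeal for large `topk` is that the number of passes depends on the key width rather than on `k`, whereas repeatedly extracting the maximum costs one pass per selected element.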

Signed-off-by: Gaetan Lepage <gaetan@glepage.com>

* Grouped dequantize for MXFP8 Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Pytorch extension Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Fix CUDA graphs compatibility Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Handling non-full tiles Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Fixes Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes from review Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Refactor grouped MXFP8 dequantize kernel - Use common namespace helpers instead of group_quantize_kernel - Extract shared constants into DequantizeConfig struct - Replace SCALE_DIM template params with single ROWWISE bool - Use initialize_barriers/destroy_barriers helpers - Fix offsets array size (num_tensors + 1) - Skip TMA descriptor update for zero-sized groups - Fix off-by-one in max tensor descriptor check Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Tighten tensor_offsets validation to require num_tensors+1 All producers (splits_to_offsets, quantizer.cpp) and consumers (is_job_valid, get_current_tensor_id, hadamard transform) already use CSR-style num_tensors+1 offsets. Make the validation match. Also fix stale docstring in grouped_tensor_storage.py. Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix group_dequantize: attribute names, dtype, and shape handling In group_dequantize(), GroupedTensor inherits from torch.Tensor, so accessing .data returns the 2D wrapper tensor instead of the 1D quantized data buffer. 
Fix three issues: - Read "rowwise_data" attribute instead of "data" to get the flat 1D quantized buffer rather than torch.Tensor.data (2D wrapper). - Use quantizer->dtype (e.g. kFloat8E4M3) instead of deriving dtype from the raw tensor's scalar_type() which is just uint8. - Pass numel() as a 1-element shape vector to ensure the grouped tensor data is registered as 1D. Promote DType dtype from quantizer subclasses to the base Quantizer class (defaulting to kNumTypes) so group_dequantize can access it without downcasting. Update tests to compare per-tensor via split_into_quantized_tensors() instead of accessing .data on GroupedTensor. Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com> Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Przemek Tredak <ptredak@nvidia.com> Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
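The CSR-style `num_tensors + 1` offsets convention mentioned above can be sketched in plain Python as follows. The name `splits_to_offsets` appears in the commit message, but this body is only an assumed illustration of the convention, and `tensor_id_for_row` is a hypothetical helper.

```python
from bisect import bisect_right
from itertools import accumulate


def splits_to_offsets(splits):
    """Turn per-tensor row counts into num_tensors + 1 cumulative offsets."""
    return [0, *accumulate(splits)]


def tensor_id_for_row(offsets, row):
    """Return the index of the tensor that owns a given global row."""
    if not 0 <= row < offsets[-1]:
        raise ValueError("row is outside the grouped tensor")
    # bisect_right skips zero-sized groups that share the same offset.
    return bisect_right(offsets, row) - 1


splits = [3, 0, 2]
offsets = splits_to_offsets(splits)
```

Keeping the trailing total as the last entry lets every tensor's extent be read as `offsets[i + 1] - offsets[i]`, including the last one, which is why a validation that accepts only `num_tensors` entries would be off by one.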

* add fa4 support Signed-off-by: Xin Yao <xiny@nvidia.com> * comment out unused import for cp Signed-off-by: Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by: Xin Yao <xiny@nvidia.com> * install fa4 in L3 test Signed-off-by: Xin Yao <xiny@nvidia.com> * fix sm90 Signed-off-by: Xin Yao <xiny@nvidia.com> --------- Signed-off-by: Xin Yao <xiny@nvidia.com>

* Fix grouped quant checkpointing Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Cleanup Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* adds NVFP4 Fused Adam support Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com> * un xfail test Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> * cleanup Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> * adds back copy dispatch handler Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> --------- Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com> Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com> Co-authored-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>

Resolved all 30 conflicted files from the upstream v2.15 merge: - CMakeLists.txt: keep ROCm source/hipify blocks + upstream CUDA arch flags - cast/core/common.cuh: guard TensorMapStorage under !__HIP_PLATFORM_AMD__ - group_quantize_mxfp8.cuh: adopt upstream quant_config param, keep ROCm guard - rmsnorm_api.cpp: keep ROCm cuDNN/zero_centered_gamma guard in bwd_add - recipe/__init__.py: keep MXFP4BlockScaling class + adopt upstream __repr__ - util/logging.h: add cuSolverMp macro guarded under !__HIP_PLATFORM_AMD__ - util/ptx.cuh: keep ROCm stochastic rounding + upstream BF16_MANTISSA_BITS - extensions.h: add grouped_swizzle_for_gemm before USE_ROCM guard - pybind.cpp: keep Newton-Schulz bindings under USE_ROCM guard - quantizer.cpp: keep ROCm RHT cast fusion eligibility path - quantization.py: keep ROCm nvfp4/mxfp4 checks + adopt upstream cached wrappers - backends.py: keep AITER triton path + adopt upstream FA3 import with IS_HIP guard - utils.py (attn): add IS_HIP_EXTENSION guard + upstream FA3 deterministic check - module/base.py: keep get_weight_workspace method - module/linear.py: keep ROCm inline forward/backward + upstream non_tensor_args - module/grouped_linear.py: keep triton path + adopt upstream backward_override - module/layernorm_linear.py: keep ROCm FP8 state + adopt upstream backward_override - module/layernorm_mlp.py: adopt upstream qstate refactor (take_upstream) - jax/cpp_extensions/gemm.py: keep ROCm grouped gemm quantizer path - jax/csrc/extensions/gemm.cpp: keep ROCm swizzle guard + upstream attr changes - build_tools: keep rocm_build/rocm_path + adopt upstream setup_mpi_flags - test files: keep both ROCm and upstream test additions Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Audit of cleanly-merged upstream changes revealed three issues: 1. newton_schulz/newton_schulz.cpp: added to cuda_only_cpp_sources in CMakeLists.txt — includes cuda_runtime.h directly, uses cuSolverMp, not hipifiable as a .cpp file. 2. pytorch/csrc/extensions/newton_schulz.cpp: wrapped in #ifndef USE_ROCM guard — calls at::cuda::getCurrentCUDAStream() and cuSolverMp APIs. 3. pytorch/__init__.py: guarded newton_schulz import with IS_HIP_EXTENSION check — tex.newton_schulz pybind binding is not registered on ROCm. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Eight build errors resolved: 1. cast/core/common.cuh: restructured __HIP_PLATFORM_AMD__ guards: TensorMapStorage, g_tensor_maps, and TMA helper functions guarded; kernel namespace (reduce_dbias, grouped_reduce_dbias) and block decode helpers left unguarded for ROCm use; new upstream TMA dispatch functions (modify_base_tensor_map, update_tma_descriptors, prefetch/store_output_stage) guarded separately. 2. cast/mxfp8/group_dequantize_mxfp8.cuh: entire file guarded with #ifndef __HIP_PLATFORM_AMD__ because it uses CUtensorMap throughout. 3. cast/dispatch/dequantize.cuh: guard MXFP8 group_dequantize call with #ifndef __HIP_PLATFORM_AMD__ because it depends on group_dequantize_mxfp8.cuh. 4. util/ptx.cuh: added non-template float exp2f_rcp(e8m0_t) overload under #ifdef __HIP_PLATFORM_AMD__ because ROCm-specific files call it without the <float> template arg. Moved BF16_MANTISSA_BITS outside the CUDA-only guard so ROCm exp2f_rcp<bf16> works. 5. fused_router/utils.h: changed __ballot_sync mask from unsigned int to uint64_t and switched to __popcll because HIP requires a 64-bit ballot mask. 6. normalization/rmsnorm/rmsnorm_api.cpp: guarded use_cudnn_norm_bwd() / NVTE_Norm_Backend::Cudnn / use_zero_centered_gamma_in_weight_dtype() calls with #ifndef __HIP_PLATFORM_AMD__. 7. pytorch/csrc/quantizer.cpp: removed duplicate eligible_for_rht_cast_fusion declaration; fixed columnwise_quant_config → columnwise_quant_config_to_use; added out_transpose wrapper construction for the RHT unfused columnwise path. 8. pytorch/quantization.py: added missing `import functools`. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
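The reason the ballot mask width matters (item 5) can be shown with a small Python simulation. It assumes a 64-lane wavefront, as on many AMD GPUs, and models the ballot as one bit per lane; `ballot` and `popcount` here are illustrative stand-ins, not the HIP intrinsics themselves.

```python
def ballot(predicates):
    """Pack one predicate per lane into an integer bitmask (lane i -> bit i)."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def popcount(mask):
    """Count the set bits, i.e. the lanes whose predicate was true."""
    return bin(mask).count("1")


WAVEFRONT_SIZE = 64  # assumption: 64 lanes per wavefront
predicates = [lane % 2 == 0 for lane in range(WAVEFRONT_SIZE)]

full_mask = ballot(predicates)
truncated_mask = full_mask & 0xFFFFFFFF  # what a 32-bit variable would keep

lanes_counted_64 = popcount(full_mask)
lanes_counted_32 = popcount(truncated_mask)
```

Storing the result in a 32-bit variable silently drops lanes 32 to 63, so any count derived from it undercounts by whatever those lanes contributed.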

Resolved API mismatch from Phase 1 conflict resolution where upstream refactored _Linear.forward to receive quantizers as separate args instead of packed in non_tensor_args: 1. Updated _Linear.forward signature to accept 6 quantizer args separately matching the new call convention from Linear.forward. 2. Updated non_tensor_args unpacking to match the new tuple format (no quantizers). 3. Added ctx.save_for_backward and ctx.tensor_objects assignment. 4. Added ctx.backward_override initialization. 5. Fixed _Linear.backward return to reshape dgrad via ctx.inp_shape and return 11 None-padded values matching the new 11-arg forward signature. 6. Added import of restore_from_saved. 7. Added import functools to quantization.py. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…de ordering Two fixes in _Linear.forward: 1. non_tensor_args at call site incorrectly passed 'cache_name is not None' (bool) at the 'module' position — changed to pass 'self' (the Linear module object). 2. backward_override nullification of ctx fields was placed before the ctx assignments, causing them to be immediately overwritten — moved to run after all ctx fields are set. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
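The second fix is an ordering bug that is easy to reproduce in isolation. The sketch below uses a plain namespace object and a single hypothetical field, `saved_weight`; the real `ctx` carries many more fields and the actual override condition is not shown in the commit message.

```python
from types import SimpleNamespace


def setup_ctx_buggy(ctx, weight, backward_override):
    """Clears the field first, so the later assignment undoes the clearing."""
    if backward_override is not None:
        ctx.saved_weight = None
    ctx.saved_weight = weight
    return ctx


def setup_ctx_fixed(ctx, weight, backward_override):
    """Assigns every field first, then clears what the override makes unused."""
    ctx.saved_weight = weight
    if backward_override is not None:
        ctx.saved_weight = None
    return ctx


buggy = setup_ctx_buggy(SimpleNamespace(), "W", backward_override="custom")
fixed = setup_ctx_fixed(SimpleNamespace(), "W", backward_override="custom")
```

With an override present, the buggy ordering still holds a reference to the weight, while the fixed ordering leaves the field empty as intended.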

Two build fixes: 1. tests/cpp/operator/test_normalization.cu: guard cudnnGetVersion() call with #ifndef __HIP_PLATFORM_AMD__ — cuDNN is not available on ROCm and use_cudnn is always false on this platform. 2. transformer_engine/jax/csrc/extensions/gemm.cpp: fix GroupedGemmV2Handler FFI binding — replace individual .Attr<> entries (M, N, K, lhs_is_trans, rhs_is_trans, scaling_mode, is_grouped_dense_wgrad) with .Attrs<GroupedGemmV2Config>() to match the GroupedGemmV2FFI function signature. This was the Phase 1 conflict resolution error where the old attribute-based registration was kept instead of adopting the upstream struct-based API. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…ures - Skip GEMM-swizzled scale unswizzle on HIP since ROCm quantize kernel writes compact scales - Reshape MXFP8 scales to 2D on HIP to match the 2D-flattened data path in grouped dequantize - Fix test_grouped_gemm_fp16 to use parametrized use_async_d2h_group_size instead of hardcoded True - Remove duplicate input_shape/q_dtype parametrize decorators in TestGroupedQuantize

github-actions · 2026-06-12T02:21:42Z

 # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.


Two copyright-header issues here, both inconsistent with the convention used in every other ROCm-modified file in this PR (see e.g. transformer_engine/jax/cpp_extensions/quantization.py, transformer_engine/pytorch/module/linear.py):

Order is reversed. Per CLAUDE.md, when an existing NVIDIA-only file gains an AMD line, the AMD line must go above the NVIDIA line.

Missing the # This file was modified for portability to AMDGPU marker line that every other ROCm-touched Python file carries above the AMD copyright.

Suggested fix:

Suggested change

-# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+# This file was modified for portability to AMDGPU
+# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
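The header convention described in this comment lends itself to a mechanical check. The following is a minimal sketch, assuming the three marker strings quoted in the review are what identifies each line; `header_problems` is a hypothetical helper and not part of the repository's tooling.

```python
AMD_OWNER = "Advanced Micro Devices, Inc."
NVIDIA_OWNER = "NVIDIA CORPORATION"
PORTABILITY_MARKER = "This file was modified for portability to AMDGPU"


def first_index(lines, needle):
    """Return the index of the first line containing needle, or None."""
    return next((i for i, line in enumerate(lines) if needle in line), None)


def header_problems(lines):
    """List the ways a file header deviates from the convention in the review."""
    amd = first_index(lines, AMD_OWNER)
    nvidia = first_index(lines, NVIDIA_OWNER)
    marker = first_index(lines, PORTABILITY_MARKER)

    problems = []
    if amd is None:
        problems.append("missing AMD copyright line")
    elif nvidia is not None and amd > nvidia:
        problems.append("AMD line is below NVIDIA line")
    if marker is None:
        problems.append("missing AMDGPU portability marker")
    elif amd is not None and marker > amd:
        problems.append("marker is below AMD line")
    return problems


flagged_header = [
    "# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
    "# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.",
]
suggested_header = [
    "# This file was modified for portability to AMDGPU",
    "# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.",
    "# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
]
```

Running it on the two headers from this comment reports both issues for the flagged version and none for the suggested one.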

github-actions · 2026-06-12T02:21:47Z

+/*************************************************************************
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


This file was modified in this PR (commit 3d3f9e0 wraps the entire body in #ifndef __HIP_PLATFORM_AMD__) but the header still only carries the NVIDIA copyright. Per CLAUDE.md rule 2, ROCm-modified files with only an NVIDIA line need an AMD line added above it (with the AMDGPU portability marker):

Suggested change

-/*************************************************************************
- * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+/*************************************************************************
+ * This file was modified for portability to AMDGPU
+ * Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

github-actions · 2026-06-12T02:21:50Z

+/*************************************************************************
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


This file was added by the IFU and then ROCm-modified in this PR (commit 50a837f wraps it in #ifndef USE_ROCM), but the header carries only the NVIDIA copyright. Per CLAUDE.md rule 2, add an AMD line above the NVIDIA line (with the AMDGPU portability marker):

Suggested change

-/*************************************************************************
- * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+/*************************************************************************
+ * This file was modified for portability to AMDGPU
+ * Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

github-actions · 2026-06-12T02:21:55Z

 from .utils import rocm_build, rocm_path
 from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled


The newly added import on line 15 re-imports the same three symbols already imported on line 14 (all_files_in_dir, get_cuda_include_dirs, debug_build_enabled). Looks like a leftover from the merge conflict resolution — both sides of the conflict were kept rather than being unified into a single import (compare with build_tools/pytorch.py, where the equivalent conflict was correctly merged into one from .utils import (...) block).

It works at runtime, but please consolidate to avoid the duplicate, e.g.:

Suggested change

-from .utils import rocm_build, rocm_path
-from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled
+from .utils import rocm_build, rocm_path
+from .utils import (
+    all_files_in_dir,
+    get_cuda_include_dirs,
+    debug_build_enabled,
+    setup_mpi_flags,
+)

github-actions · 2026-06-12T02:22:17Z

Review summary (initial) — IFU dev → v2.15, ~89 commits / 207 files / +26K −7K LoC.

Most of the diff is the upstream v2.15 history; the ROCm-specific work concentrates in the 7 post-merge fix commits (a02dba78 conflict resolution, plus 50a837f6 / 3d3f9e0c / 92859293 / 5104a384 / a0cc9377 / 5dd2c0c9). Focused the review on those plus the seams where ROCm guards/refactors land in upstream-touched files. No leftover conflict markers in the tree.

Spot-checked highlights that look correct:

_Linear.forward signature and _Linear.backward 11-arg return now match the new call site (92859293, 5104a384).
cache_name is not None → self fix at the module arg position is correct.
__ballot_sync → 64-bit mask + __popcll in fused_router/utils.h is the right HIP idiom.
New #ifndef __HIP_PLATFORM_AMD__ guards on TMA / CUtensorMap paths (cast/core/common.cuh, cast/mxfp8/group_dequantize_mxfp8.cuh, cast/dispatch/dequantize.cuh) keep the ROCm build green and route MXFP8 grouped-dequantize to a clear NVTE_ERROR on ROCm.
newton_schulz is excluded from the ROCm build at all three layers (CMake, csrc, py import).

Findings (see inline comments):

Code: build_tools/jax.py:14-15 keeps both branches of the merge — three from .utils import … lines with overlapping symbols. Works, but should be consolidated like build_tools/pytorch.py.
Copyright headers: 4 files need updates — transformer_engine/jax/quantize/dequantizer.py (AMD line was added below NVIDIA and lacks the AMDGPU portability marker), and three ROCm-touched files still carrying only the NVIDIA line: transformer_engine/common/cast/mxfp8/group_dequantize_mxfp8.cuh, transformer_engine/pytorch/csrc/extensions/newton_schulz.cpp, and transformer_engine/pytorch/__init__.py (last one not pinned to an inline comment because the header lines aren't in the diff hunk — please add # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. above the existing NVIDIA line in that file).

Given the size, this review is best-effort over a sample of the most load-bearing seams; please flag any specific files you'd like a closer look at.

ipanfilo · 2026-06-11T23:59:35Z

Is this the new location of tests/pytorch/distributed/run_fsdp2_fused_adam.py?

ipanfilo · 2026-06-12T00:03:06Z

    test_cmd += ["--layer-type", layer_type]

    subprocess.run(test_cmd, env=os.environ, check=True)
+sys.path.pop(0)


Is it this way in upstream, i.e. present here but not on line 28?

ipanfilo · 2026-06-12T01:38:06Z

+
+    // Warp ballot to count how many lanes have a qualifying element
+    // Use 64-bit mask for ROCm compatibility (HIP requires uint64_t mask)
+    uint64_t ballot = __ballot_sync(0xFFFFFFFFFFFFFFFFull, is_greater);


This should be guarded, and __ballot should be used on ROCm.

ipanfilo · 2026-06-12T01:49:53Z

+}
+
+#ifdef __HIP_PLATFORM_AMD__
+// Non-template definition — delegates to the float specialization


Why not define it right where it is declared?

ipanfilo · 2026-06-12T02:05:07Z

+
+    # nvte_grouped_gemm (the v2 kernel) requires SM100+ (Blackwell or newer).
+    # Fall back to the v1 path on SM90 (Hopper) and older architectures.
+    if get_min_device_compute_capability() < 100:


Can we enable it for AMD GPUs?

ipanfilo · 2026-06-12T02:33:49Z

@@ -442,6 +462,18 @@ def get_attention_backend(
        if use_flash_attention_3 and FlashAttentionUtils.v3_is_installed:
            logger.debug("Disabling FlashAttention 3 for compute capability != sm90")
        use_flash_attention_3 = False
+    # FA4 supports SM80, SM90, SM100, SM120


Let's explicitly disable FA3 and FA4 for ROCm
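One way the requested guard could look is sketched below. This is only an illustration, not the code in this PR: `disable_flash_attention_on_rocm` is a hypothetical helper, and the `is_hip` flag is passed in explicitly so the snippet runs without PyTorch. In the real module the flag would presumably come from the `IS_HIP_EXTENSION` import mentioned elsewhere in this PR.

```python
from typing import Tuple


def disable_flash_attention_on_rocm(
    use_flash_attention_3: bool,
    use_flash_attention_4: bool,
    is_hip: bool,
) -> Tuple[bool, bool]:
    """Hypothetical helper: force FA3 and FA4 off when running on ROCm.

    `is_hip` is an assumption standing in for a platform check such as
    torch.utils.cpp_extension.IS_HIP_EXTENSION.
    """
    if is_hip:
        # On ROCm, neither backend is selected regardless of the
        # compute-capability checks that ran earlier.
        return False, False
    # On CUDA, leave the earlier decisions untouched.
    return use_flash_attention_3, use_flash_attention_4


if __name__ == "__main__":
    print(disable_flash_attention_on_rocm(True, True, is_hip=True))
    print(disable_flash_attention_on_rocm(True, False, is_hip=False))
```

Placing the check after the compute-capability filters keeps the CUDA behaviour unchanged and makes the ROCm decision explicit rather than a side effect of the SM version comparisons.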

Fixes for FSDP2 distributed tests on ROCm:

1. run_fsdp2_fused_adam.py:
   - Add IS_HIP_EXTENSION import and AIPYTORCH-427 synchronize() after loss.backward() and optimizer.step() in all training loops to prevent RCCL deadlocks from forward/backward stream overlap with FSDP2.
   - Add synchronize() in test_fuse_wgrad_accumulation after forward pass.
   - xfail NVFP4BlockScaling in test_fused_adam_bf16 and test_fused_adam_bf16_store_param_remainders on ROCm: RCCL allreduce_coalesced on NVFP4 amax tensors produces incorrect values, causing scale_inv = inf and NaN outputs. Confirmed by disable_rht=True workaround which bypasses the amax all-reduce path.

2. run_fsdp2_model.py:
   - Add IS_HIP_EXTENSION import and AIPYTORCH-427 synchronize() after backward() in training loop (matches existing pattern from dev branch).
   - Fix double dist.destroy_process_group() from keep_both merge: barrier (for torch < 2.6 teardown race) and destroy consolidated into finally.
   - xfail NVFP4BlockScaling + fp8_init + LayerNormLinear on ROCm: _check_fp8_fsdp2_allgather exceeds atol=5e-4 due to per-shard amax divergence between the FSDP2 unshard path and manual allgather path.

3. test_torch_fsdp2.py:
   - Fix NameError 'fp8' not defined: replaced deprecated fp8 module references with direct imports from quantization module.
   - Add _get_free_port() and --master_port to torchrun calls to prevent EADDRINUSE when tests run sequentially.
   - xfail NVFP4BlockScaling + fp8_init + LayerNormLinear on ROCm.

4. test_cast_master_weights_to_fp8.py:
   - Skip _test_cast_master_weights_to_nvfp4 on ROCm: same NVFP4 amax RCCL issue causes NaN loss; assert_close(nan, nan) then fails because NaN != NaN by IEEE 754.
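The free-port approach described for test_torch_fsdp2.py can be sketched as follows. This is a minimal illustration under assumptions, not the helper from the PR: `get_free_port` and `build_torchrun_cmd` are hypothetical names, and the script path in the usage line is a placeholder. Binding to port 0 asks the OS for an unused port, which is then passed to torchrun via `--master_port`.

```python
import socket
from typing import List


def get_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 lets the kernel pick any free ephemeral port.
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def build_torchrun_cmd(script: str, nproc: int, port: int) -> List[str]:
    """Hypothetical helper: assemble a torchrun command with an explicit port."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={port}",
        script,
    ]


if __name__ == "__main__":
    # "run_fsdp2_model.py" is used here only as a placeholder script name.
    print(build_torchrun_cmd("run_fsdp2_model.py", 2, get_free_port()))
```

The socket is closed before torchrun starts, so another process could in principle claim the port in between. For sequential test runs this window is small, but it is a trade-off of this pattern rather than a guarantee.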

ptrendx and others added 30 commits March 16, 2026 10:25

Changed VERSION to 2.15.0.dev0

4058842

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Update cudnnFE to v1.20.0 (#2774)

53a41b2

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

[PyT] Install pytest in onnx L1 test as Pyt container no longer packages it (#2781)

b7598aa

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

Upgrade cuDNN FE to v1.21.0 (#2799)

15cf65a

Move cuDNN FE to v1.21.0 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Remove integration test for Lightning-Thunder (#2822)

281ff06

Signed-off-by: Tim Moon <tmoon@nvidia.com>

ksivaman and others added 14 commits April 16, 2026 08:49

[PyTorch] Minor optimizations in fused grouped MLP (#2888)

c9035a4

Minor misc optimizations in fused GroupedMLP Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

fix CUDA architectures cmake logic (#2832)

fca261e

Signed-off-by: Gaetan Lepage <gaetan@glepage.com>

auto-merge basis for IFU-dev-260419-v2.15 (raw, conflicts present)

8d31dcc

VeeraRajasekhar self-assigned this Jun 8, 2026

VeeraRajasekhar marked this pull request as ready for review June 10, 2026 15:12

VeeraRajasekhar requested review from ipanfilo, wangye805 and wenchenvincent as code owners June 10, 2026 15:12

ipanfilo added the claude-review label Jun 12, 2026

github-actions Bot reviewed Jun 12, 2026

github-actions Bot removed the claude-review label Jun 12, 2026

ipanfilo requested changes Jun 12, 2026

VeeraRajasekhar wants to merge 90 commits into dev from IFU-dev-260419-v2.15

20 participants

		# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
		# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.

		/*************************************************************************
		* Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

		from .utils import rocm_build, rocm_path
		from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled
