Ifu dev 260419 v2.15 #616

Open
VeeraRajasekhar wants to merge 90 commits into dev from IFU-dev-260419-v2.15
@VeeraRajasekhar

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits March 16, 2026 10:25
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…#2757)

* [Common] Fix linker error for to_string(DType) in distributed tests

Make transformer_engine::to_string(DType) inline in common.h so that
translation units outside libtransformer_engine.so can resolve it
without requiring the symbol to be exported.

Regression introduced by 61f9594 which added to_string(DType) calls
into TRANSFORMER_ENGINE_TYPE_SWITCH_* macros, causing test object files
to reference the symbol that the linker version script hides.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

---------

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…-Cast Fusion Kernel (#2555)

* first draft

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* pass numerical unit test

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* format

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add benchmark script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* lint and format

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* compile guard

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* warning fix

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve greptile comment

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* minor style fixes

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix namespace

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve some comments

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix comment

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* attempt to fix compile CI with guard

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* better naming for tests

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix deprecate message

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* more compile guard

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* new API name

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix format all in one

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* try to fix compile CI again

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* AI code review comments

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* to pass oldest compile CI with cuda 12.1

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add more guards to nvfp4

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* make multiply inverse default numerics

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* update numerics of nvfp4 partial cast as well

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve comments

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add NVTE_BUILD_NUM_PHILOX_ROUNDS after rebase

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* simplify compile guard messages

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…Linear` (#2761)

* Load multi-param checkpoint from single-param config in GroupedLinear

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Multi-param to single param case

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Multi-param to single param case

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better varnames

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix GMM cuBLAS version and SM arch checks

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Pin python 3.13 in vermin check

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update vermin version for python 3.14 support

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Use sha instead of tag

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* work finished

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: root <pgadzinski@nvidia.com>

* removed warning.warn

Signed-off-by: root <pgadzinski@nvidia.com>

* [PyTorch] Remove dead None-check for num_out_tokens in moe_permute_mask_map_forward

num_out_tokens is typed as int in the custom_op signature and can never
be None; the check was incorrectly carried over from the class-based
upstream version during merge conflict resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…are detached (#2772)

[PyTorch] Change the restore tensor API to ensure tensors are detached from ctx

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…ges it (#2781)

Install pytest in onnx L1 test as Pyt container no longer packages it

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
…_descriptors (#2782)

* Fix zero-sized groups in update_tma_descriptors

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update test_cast_mxfp8_grouped.cu

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…+ (#2693)

* Enable sm120 support for fused attn if cuDNN is 9.18.1+

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Force intermediate tensors such as S, Sum_Exp, and Max to be BHS1 shape instead of TH1 for sm120

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for sm120 correct batch, seq dims

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for sm120 BHS1-style max logit even when QKV are THD, to avoid an incorrect max logit calculation (one that includes padded tokens in the max calculation)

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Disable fused and flash attn for sm120 filter:kv cache

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* For CP P2P attn, set softmax_lse_in_packed_format to False if sm120+

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in TE if T3HD/TH3D layout is used on sm120 before cuDNN F16 sdpa arbitrary kernel call

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify is_ragged_q && cudnn_runtime_version >= 90600 check to also include a check for sm120

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Code clean up

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Disable fused attn for T3HD and TH3D

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Add missed sm120 guard

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify sm120 condition to be very specific to sm120 and not generalized to sm120+

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Fix missing sm120 check in fwd

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Move the check for sm120 T3HD/TH3D to nvte_get_fused_attn_backend() instead of higher layers in TE stack

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Check for matching sm120 and not sm120+

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docs

Signed-off-by: root <pgadzinski@nvidia.com>

* nvfp4 internals support

Signed-off-by: root <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fixes

Signed-off-by: root <pgadzinski@nvidia.com>

* Update transformer_engine/debug/features/dump_tensors.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* fix

Signed-off-by: root <pgadzinski@nvidia.com>

* Update transformer_engine/debug/features/dump_tensors.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Update transformer_engine/debug/features/dump_tensors.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/pytorch/debug/test_log.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Update transformer_engine/debug/features/dump_tensors.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* fix

Signed-off-by: root <pgadzinski@nvidia.com>

* fix

Signed-off-by: root <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove dump_quantized_internals support from DumpTensors

Drop the dump_quantized_internals config option, the _get_quantized_internals
method, and all helper functions for extracting scales/raw data from
Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, and NVFP4Tensor.

Remove corresponding tests: test_dump_tensors_nvfp4_unpacked_codes and
NVFP4_DUMP_TENSORS_CONFIG, and scale/data assertions from test_dump_tensors_sanity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address Greptile review comments

- Add dot ('.') to _sanitize_name to handle common PyTorch dotted layer
  names like 'encoder.layer.0.attention'
- Add docstring note about pickle dependency for the 'quantized' key
- Add comment explaining weights_only=False in test
- Remove redundant local RecipeState import in test_nvfp4_numeric

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
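
The `_sanitize_name` change described in this commit can be pictured with a minimal sketch. The real implementation is not shown on this page, so the function below, its character set, and the underscore replacement are assumptions for illustration only. It shows why a dotted layer name needs handling before it becomes part of a dump filename.

```python
import re

# Hypothetical stand-in for _sanitize_name; the actual character set and
# replacement used in dump_tensors.py are not shown on this page.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_name(name: str) -> str:
    """Replace every character outside [A-Za-z0-9_-], including '.', with '_'."""
    return _UNSAFE_CHARS.sub("_", name)


print(sanitize_name("encoder.layer.0.attention"))
```

With this assumed rule, a name such as `encoder.layer.0.attention` yields a single flat filename component with no dots or path separators.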

* Remove portability suggestion from quantized key docstring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Compute rank lazily in _expected_root_dir

Avoids relying on stale self.rank when ensure_initialized is called
before initialize() has set the rank. Consistent with how nvdlfw_inspect
logger resolves rank.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* detach tensors before saving; verify dump filename in test

Detach both high_precision and quantized tensors before saving to avoid
serializing the autograd graph. For QuantizedTensor this is a zero-copy
view (make_like), so no extra GPU allocation.

Add filename format assertion to test_dump_tensors_sanity to catch
regressions in _sanitize_name or the naming convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add empty dump_dict log; assert QuantizedTensor type in test

Log a message when no tensors are available to dump so the user
has an explicit signal that no file was written.

Assert that the quantized key round-trips as a QuantizedTensor
to catch regressions in detach() or serialisation path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/debug/features/dump_tensors.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Address review: iter subdirs, remove dead rank field, add allclose test and MSE example

- Organize dumps into per-iteration subdirectories (iter_000000/) to keep
  file count manageable per directory.
- Remove unused self.rank attribute from TensorLogger.
- Add torch.allclose assertion in test to verify serialization correctness.
- Add docstring example showing how to load dumps and compute MSE.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Made-with: Cursor
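
The per-iteration layout mentioned in this commit can be sketched as a path helper. Only the `iter_000000/` directory pattern comes from the commit message; the function name, root directory, and file name below are hypothetical.

```python
from pathlib import Path


def dump_path(root: Path, iteration: int, tensor_file: str) -> Path:
    """Place a dump file under a zero-padded per-iteration directory."""
    # Six digits match the iter_000000/ example in the commit message.
    return root / f"iter_{iteration:06d}" / tensor_file


print(dump_path(Path("dumps"), 0, "layer_activation.pt").as_posix())
```

Zero-padding keeps the directories in iteration order under a plain lexicographic listing, and each directory holds only the files from one iteration.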

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: use detach().clone() to avoid shared storage in DumpTensors

Using tensor.detach() creates a view sharing the same underlying
storage. If any in-place operation modifies the tensor after the
dump, the saved data would be silently corrupted. Use .clone()
to ensure the dump captures an independent copy of the data.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
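
The aliasing hazard described in this commit can be illustrated without PyTorch. The sketch below is only a standard-library analogy: a `memoryview` over a `bytearray` shares storage the way `detach()` does, and `bytes(...)` makes an independent copy the way `.clone()` does. None of these names come from the PR.

```python
# Stdlib analogy only; no Transformer Engine or PyTorch code is involved.
live = bytearray([1, 2, 3])

shared_view = memoryview(live)  # shares storage, like tensor.detach()
snapshot = bytes(live)          # independent copy, like tensor.detach().clone()

live[0] = 99  # in-place modification after the "dump" was taken

print(shared_view[0])  # the view observes the later write
print(snapshot[0])     # the copy still holds the original value
```

The view reflects the later write while the copy does not, which is the property a dump needs if the source buffer may be modified in place afterwards.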

* test: use torch.equal instead of torch.allclose for serialisation round-trip

The saved tensor is an exact bit-for-bit copy (detach().clone()), so
torch.equal is the correct check. torch.allclose with its default
tolerances could mask a genuine dtype conversion or precision loss
introduced by a future change to the serialisation path.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
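
The difference between the two checks can be sketched in plain Python. `torch.allclose` is documented as testing `|a - b| <= atol + rtol * |b|` elementwise with defaults `rtol=1e-05` and `atol=1e-08`; the helpers below restate that rule on lists and are hypothetical stand-ins, not the test code from this PR.

```python
def allclose_like(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise |a - b| <= atol + rtol * |b|, mirroring the documented torch.allclose rule."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


def equal_like(a, b):
    """Exact elementwise comparison, standing in for torch.equal."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


saved = [1.0, 2.0, 3.0]
reloaded = [1.0, 2.0, 3.00001]  # small drift, e.g. from a lossy conversion

print(allclose_like(reloaded, saved))  # tolerance check accepts the drift
print(equal_like(reloaded, saved))     # exact check rejects it
```

A round-trip that is supposed to be bit-exact is therefore better guarded by the exact comparison, since the tolerance-based one would let a small precision loss pass.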

* fix: add tp_size to DumpTensors.inspect_tensor and fix KeyError in call_feature backward compat pop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* change distributed tests infra for fsdp2

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* verbose flag for reporting

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add back comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* another minor fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* not needed for this PR

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unnecessary comments
* add cudnn dln+add

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>

* try fixing cudnn build issue

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>

* guard against cudnn version

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change itype to wtype for add in rmsnorm_bwd

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>

* remove dead code

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>

* remove dangling todo

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>

---------

Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add blackwell support filter for 9.7<=cudnn<9.18.1

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
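
The `9.7<=cudnn<9.18.1` window named in this commit title can be written as an integer range check. The sketch assumes the cuDNN 9 integer encoding (major * 10000 + minor * 100 + patch), which is consistent with the `cudnn_runtime_version >= 90600` check mentioned elsewhere in this PR. The helper names are hypothetical and this is not the filter's actual code.

```python
def cudnn_version_int(major: int, minor: int, patch: int) -> int:
    """Encode a cuDNN 9.x version as major*10000 + minor*100 + patch (assumed encoding)."""
    return major * 10000 + minor * 100 + patch


def in_filtered_window(version: int) -> bool:
    """True for 9.7.0 <= cuDNN < 9.18.1, the range named in the commit title."""
    return cudnn_version_int(9, 7, 0) <= version < cudnn_version_int(9, 18, 1)


print(in_filtered_window(cudnn_version_int(9, 10, 0)))  # inside the window
print(in_filtered_window(cudnn_version_int(9, 18, 1)))  # upper bound is exclusive
```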

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* simplify conditionals

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix conditionals again

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix conditionals again

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update the error log

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove the python filter and correct the cpp filter

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…quired (#2798)

* Disable fused attention for sm120 if determinism is required

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: disable fused attn for sm120 determinism, if training

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
… always and `Max` when `return_max_logit=True` (#2677)

* cudnn now returns Stats always and Max only with `return_max_logit=true`

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix a typo that caused a bug

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update doc strings

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix more docs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixes from the feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update cudnn-frontend to v1.19.1

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update the cudnn frontend

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix a wrong omission

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Move cuDNN FE to v1.21.0

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add warning if using BSHD and max_segments_per_seq > 1

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/jax/attention.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update transformer_engine/jax/attention.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Remove warning test

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>
* Refactor to group_sizes per tensor

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support first_dims and last_dims instead of a single group_sizes per tensor

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Refactor GMM FFIs to store static attrs as structs

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleanup C++ v2 FFI

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix int64 workspace usage

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address greptile comments

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Refactor wgrad-specific checks to be generic for GMM in gemm.py

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Refactor XLA FFI struct setup

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix edge case in TE v1 GMM

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix issues on Hopper

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Refactor

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address comments

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fixes for Hopper

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address review comments

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped quantization test fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Pass input_output_alias to TritonAutotunedKernelCall

Signed-off-by: JAX Toolbox <jax@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add jax version guard for the input_output_aliasing fix

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: JAX Toolbox <jax@nvidia.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* done

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* one review comment from greptile

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* instead part of the comment not needed

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* No need to set it to None

Remove unnecessary columnwise data and scale inv assignments.

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* cudnn now returns Stats always and Max only with `return_max_logit=true`

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix a typo that caused a bug

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update doc strings

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix more docs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixes from the feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update cudnn-frontend to v1.19.1

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update the cudnn frontend

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix a wrong omission

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bugfix: mask out padding tokens when THD

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes from greptile feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor nit

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixes from feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Enabled persistency with WorkID Query feature

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added a struct with tunable parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added persistency with static scheduling

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmarking

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed out-of-boundary error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Tuned kernel parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 2

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 3

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the dynamic (WorkID Query) persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for PR

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Ready for benchmark

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmark - Regular kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the source code to the profiler

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added constructors to Job and Block descriptors

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the prefetch overlapping between jobs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Cache tensor ID

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* ShapeRepresentation is not a template parameter

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed redundant fence_proxy

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used mixed precision FMA

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added Quantize parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math branch

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math to cpp test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Align tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Use STS instead of generic ST

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Add zero-tensor cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in colwise path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in rowwise

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for merge

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Uncommented test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added FP16 Fast math path to rowwise processing

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed lint

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modifications per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Assert the buffer size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added fast math RCP for bf16

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fast math for BF16 is now default

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed compilation error when compiling on previous archs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Boundary condition fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed compilation error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring. Moved helpers to core-common

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Addressed the PR review comments

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the compilation error when PTX was compiled for CUDA 13.0

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed pytorch extensions

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…os (#2823)

* Fix: Use jitted kernels for generating THD (and BSHD) segment pos if only segment id is passed

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make passing of segment_pos to from_segment_ids_and_pos for creating a SequenceDescriptor mandatory

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make test changes for from_segment_ids_and_pos API change. Also some nits.

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Make segment_pos arg mandatory and not Optional

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add comments for from_segment_ids_and_pos

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Change data types for BSHD seg pos and seg id to be int32 and consistent with THD when setting up test inputs

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Replace a TypeError if segment_pos is not passed with a ValueError with a message

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
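
The segment-position change above can be illustrated with a minimal pure-Python sketch. It assumes a position is simply the token's index within its run of equal segment IDs; the helper names `segment_pos_from_ids` and `make_descriptor` are hypothetical, padding handling is not modeled, and the real code uses jitted JAX kernels rather than a Python loop.

```python
from typing import Dict, List, Optional


def segment_pos_from_ids(segment_ids: List[int]) -> List[int]:
    """Return each token's position inside its segment (hypothetical helper).

    The counter restarts at 0 whenever the segment ID changes.
    """
    positions: List[int] = []
    prev_id: Optional[int] = None
    pos = 0
    for seg_id in segment_ids:
        pos = pos + 1 if seg_id == prev_id else 0
        positions.append(pos)
        prev_id = seg_id
    return positions


def make_descriptor(
    segment_ids: List[int], segment_pos: Optional[List[int]] = None
) -> Dict[str, List[int]]:
    """Illustrative stand-in for a constructor that requires segment_pos."""
    if segment_pos is None:
        # A ValueError with a message, rather than a bare TypeError.
        raise ValueError("segment_pos must be passed together with segment_ids")
    if len(segment_pos) != len(segment_ids):
        raise ValueError("segment_pos and segment_ids must have the same length")
    return {"segment_ids": list(segment_ids), "segment_pos": list(segment_pos)}
```

A caller that only has segment IDs would first derive the positions and then pass both, for example `make_descriptor(ids, segment_pos_from_ids(ids))`.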
* GEMM + Swiglu fused Grouped MLP for MXFP8

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* cleanup/lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Properly cache the alpha tensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* nD dummy grad

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 0 tokens in entire rank

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tmp downgrade cublas version check

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* delayed wgrad tests pass for basic gl

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* merge everything

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Rebase into fused_mxfp8_grouped_mlp; unit tests for delayed wgrad working

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix tests being skipped for fusible ops

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Integrate mxfp8 dbias kernel in group_quantize

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add bias/dbias fused support with cute GEMMs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Check bias/dbias support

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Pack biases more efficiently

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* GroupedTensor for biases to avoid concat

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Support 1D grouped tensor shape for bias and fix checkpointing

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes and tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Refactor grouped tensor marking for paged stashing

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove setting logical_shape in mark_grouped_tensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup logical_shape

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* pass the tests for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address some review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more cleanups

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* refactor wgrad logic

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename argument from single_grouped_parameter to single_grouped_weight

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Check wgrad store context is not empty for 0 token case.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Test only checks for fusion if fused kernel is available

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix the tolerance to be of bf16 for the cute gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address further review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address more review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address more review comments + test for zero grouped tensor work case

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* cublaslt remove zero work gemm avoidance

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the wgrad test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* split dbias functionality from gq api

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Format and lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* port fixes and add better doc for paged stashing WAR (workaround)

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Guard fusion via env

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change to trigger CI

Remove unnecessary blank line in docstring.

* To retrigger CI

* Space to trigger the pipeline

* fix zero work cublas gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue #2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue #2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects #2681
- test_transpose_cache_retained_after_backward: xfail, detects #2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
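
The two detection strategies described above come down to simple arithmetic on memory readings. The sketch below assumes the readings are plain byte counts already collected (for example by layer hooks); the function names and the `tolerance_bytes` parameter are hypothetical and are not taken from the actual tests.

```python
from typing import List


def per_layer_increments(readings: List[int]) -> List[int]:
    """Memory growth between consecutive per-layer readings."""
    return [after - before for before, after in zip(readings, readings[1:])]


def forward_excess(
    fp8_readings: List[int], bf16_readings: List[int], tolerance_bytes: int = 0
) -> List[int]:
    """Per-layer FP8 growth beyond the bf16 baseline (0 when within tolerance)."""
    fp8 = per_layer_increments(fp8_readings)
    bf16 = per_layer_increments(bf16_readings)
    return [max(0, f - b - tolerance_bytes) for f, b in zip(fp8, bf16)]


def backward_delta(post_fwd: int, post_bwd: int) -> int:
    """Memory retained after backward relative to the end of forward."""
    return post_bwd - post_fwd
```

A non-zero entry in `forward_excess` would point at the accumulation pattern of #2681, and an FP8 `backward_delta` larger than the bf16 one would point at the retention pattern of #2717.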
…x (#2820)

* Compute swizzle_idx once per thread and pass into ComputeKernel.

Signed-off-by: Cael Ling <caell@nvidia.com>

* one __syncthreads per stage in GroupHadamardAmaxTmaKernel

Signed-off-by: Cael Ling <caell@nvidia.com>

* streamline group Hadamard ComputeKernel loads

Signed-off-by: Cael Ling <caell@nvidia.com>

* streamline group Hadamard ComputeKernel loads

Signed-off-by: Cael Ling <caell@nvidia.com>

* streamline group Hadamard ComputeKernel loads

Signed-off-by: Cael Ling <caell@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* one __syncthreads per stage in GroupHadamardAmaxTmaKernel

Signed-off-by: Cael Ling <caell@nvidia.com>
Made-with: Cursor

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compute swizzle_idx once per thread and pass into ComputeKernel.

Signed-off-by: Cael Ling <caell@nvidia.com>

* Fix kReturnIdentityAmax path

Signed-off-by: Cael Ling <caell@nvidia.com>

* Fix kReturnIdentityAmax path

Signed-off-by: Cael Ling <caell@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply the change to other variants

Signed-off-by: Cael Ling <caell@nvidia.com>

* Refactor the change to other variants

Signed-off-by: Cael Ling <caell@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactor the change to other variants

Signed-off-by: Cael Ling <caell@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactor the ldmatrix logics

Signed-off-by: Cael Ling <caell@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Cael Ling <caell@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
ksivaman and others added 14 commits April 16, 2026 08:49
Minor misc optimizations in fused GroupedMLP

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add new test to compare single vs multi-param fused GMLP case

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add bias support

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix: enable fused_router to handle large topk and number of experts
- expanding shared memory when needed
- switch to radix topk selection when topk is large
- test_fused_router.py updated with large num experts and tolerances refined for different cases

* added topk>=16 in tests/pytorch/test_fused_router.py
added return value check of cudaFuncSetAttribute in transformer_engine/common/fused_router/fused_topk_with_score_function.cu
added dtype dependent eps in tests/pytorch/test_fused_router.py
removed unneeded code in transformer_engine/common/fused_router/utils.h

* test_fused_router.py needs to skip topk >= num_experts case

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>


cleaned up raw warp operations
added comments
added shared_memory check
added return code check

* warning about dtype for tolerance in test_fused_router.py

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

---------

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
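
For readers unfamiliar with radix-based top-k selection, here is a generic pure-Python sketch of the idea. It is not the kernel from this change: it assumes non-negative integer keys that fit in `bits` bits (a GPU kernel would typically map floats to order-preserving unsigned integers first), and the function names are hypothetical.

```python
from typing import List


def radix_select_kth_largest(
    keys: List[int], k: int, bits: int = 32, radix_bits: int = 4
) -> int:
    """Find the k-th largest key by fixing one radix digit at a time, MSB first.

    Requires 1 <= k <= len(keys) and bits to be a multiple of radix_bits.
    """
    assert 1 <= k <= len(keys) and bits % radix_bits == 0
    mask = (1 << radix_bits) - 1
    candidates = list(keys)
    for shift in range(bits - radix_bits, -1, -radix_bits):
        # Histogram of the current digit over the remaining candidates.
        buckets = [0] * (mask + 1)
        for key in candidates:
            buckets[(key >> shift) & mask] += 1
        # Walk buckets from the largest digit down until the k-th key is inside.
        for digit in range(mask, -1, -1):
            if buckets[digit] >= k:
                candidates = [c for c in candidates if (c >> shift) & mask == digit]
                break
            k -= buckets[digit]
    return candidates[0]


def radix_topk(keys: List[int], k: int) -> List[int]:
    """Indices of the k largest keys; ties at the threshold keep input order."""
    threshold = radix_select_kth_largest(keys, k)
    above = [i for i, key in enumerate(keys) if key > threshold]
    ties = [i for i, key in enumerate(keys) if key == threshold]
    return above + ties[: k - len(above)]
```

The appeal for large topk is that the cost depends on the number of digits rather than on k, whereas selecting one winner at a time does k passes over the data.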
Signed-off-by: Gaetan Lepage <gaetan@glepage.com>
* Grouped dequantize for MXFP8

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Pytorch extension

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix CUDA graphs compatibility

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Handling non-full tiles

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes from review

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Refactor grouped MXFP8 dequantize kernel

- Use common namespace helpers instead of group_quantize_kernel
- Extract shared constants into DequantizeConfig struct
- Replace SCALE_DIM template params with single ROWWISE bool
- Use initialize_barriers/destroy_barriers helpers
- Fix offsets array size (num_tensors + 1)
- Skip TMA descriptor update for zero-sized groups
- Fix off-by-one in max tensor descriptor check

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Tighten tensor_offsets validation to require num_tensors+1

All producers (splits_to_offsets, quantizer.cpp) and consumers
(is_job_valid, get_current_tensor_id, hadamard transform) already
use CSR-style num_tensors+1 offsets. Make the validation match.
Also fix stale docstring in grouped_tensor_storage.py.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
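
The CSR-style convention mentioned above (num_tensors + 1 offsets, starting at 0) can be sketched in a few lines. This is an illustrative pure-Python version: `splits_to_offsets` reuses the name from the commit message but is not the library's implementation, and `validate_offsets` and `tensor_id_for_row` are hypothetical helpers.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


def splits_to_offsets(splits: List[int]) -> List[int]:
    """Turn per-tensor sizes into num_tensors + 1 cumulative offsets."""
    return [0, *accumulate(splits)]


def validate_offsets(offsets: List[int], num_tensors: int) -> None:
    """Require exactly num_tensors + 1 non-decreasing offsets starting at 0."""
    if len(offsets) != num_tensors + 1:
        raise ValueError(f"expected {num_tensors + 1} offsets, got {len(offsets)}")
    if offsets[0] != 0 or any(a > b for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must start at 0 and be non-decreasing")


def tensor_id_for_row(offsets: List[int], row: int) -> int:
    """Index of the tensor whose [start, end) range contains the row."""
    return bisect_right(offsets, row) - 1
```

With the extra trailing entry, tensor `i` spans `offsets[i]` to `offsets[i + 1]`, so zero-sized groups appear as repeated offsets and need no special case in the lookup.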

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix group_dequantize: attribute names, dtype, and shape handling

In group_dequantize(), GroupedTensor inherits from torch.Tensor, so
accessing .data returns the 2D wrapper tensor instead of the 1D
quantized data buffer. Fix three issues:

- Read "rowwise_data" attribute instead of "data" to get the flat 1D
  quantized buffer rather than torch.Tensor.data (2D wrapper).
- Use quantizer->dtype (e.g. kFloat8E4M3) instead of deriving dtype
  from the raw tensor's scalar_type() which is just uint8.
- Pass numel() as a 1-element shape vector to ensure the grouped
  tensor data is registered as 1D.

Promote DType dtype from quantizer subclasses to the base Quantizer
class (defaulting to kNumTypes) so group_dequantize can access it
without downcasting.

Update tests to compare per-tensor via split_into_quantized_tensors()
instead of accessing .data on GroupedTensor.

Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add fa4 support

Signed-off-by: Xin Yao <xiny@nvidia.com>

* comment out unused import for cp

Signed-off-by: Xin Yao <xiny@nvidia.com>

* fix lint

Signed-off-by: Xin Yao <xiny@nvidia.com>

* install fa4 in L3 test

Signed-off-by: Xin Yao <xiny@nvidia.com>

* fix sm90

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
* Fix grouped quant checkpointing

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Cleanup

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* adds NVFP4 Fused Adam support

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>

* un xfail test

Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com>

* cleanup

Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com>

* adds back copy dispatch handler

Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com>

---------

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Resolved all 30 conflicted files from the upstream v2.15 merge:

- CMakeLists.txt: keep ROCm source/hipify blocks + upstream CUDA arch flags
- cast/core/common.cuh: guard TensorMapStorage under !__HIP_PLATFORM_AMD__
- group_quantize_mxfp8.cuh: adopt upstream quant_config param, keep ROCm guard
- rmsnorm_api.cpp: keep ROCm cuDNN/zero_centered_gamma guard in bwd_add
- recipe/__init__.py: keep MXFP4BlockScaling class + adopt upstream __repr__
- util/logging.h: add cuSolverMp macro guarded under !__HIP_PLATFORM_AMD__
- util/ptx.cuh: keep ROCm stochastic rounding + upstream BF16_MANTISSA_BITS
- extensions.h: add grouped_swizzle_for_gemm before USE_ROCM guard
- pybind.cpp: keep Newton-Schulz bindings under USE_ROCM guard
- quantizer.cpp: keep ROCm RHT cast fusion eligibility path
- quantization.py: keep ROCm nvfp4/mxfp4 checks + adopt upstream cached wrappers
- backends.py: keep AITER triton path + adopt upstream FA3 import with IS_HIP guard
- utils.py (attn): add IS_HIP_EXTENSION guard + upstream FA3 deterministic check
- module/base.py: keep get_weight_workspace method
- module/linear.py: keep ROCm inline forward/backward + upstream non_tensor_args
- module/grouped_linear.py: keep triton path + adopt upstream backward_override
- module/layernorm_linear.py: keep ROCm FP8 state + adopt upstream backward_override
- module/layernorm_mlp.py: adopt upstream qstate refactor (take_upstream)
- jax/cpp_extensions/gemm.py: keep ROCm grouped gemm quantizer path
- jax/csrc/extensions/gemm.cpp: keep ROCm swizzle guard + upstream attr changes
- build_tools: keep rocm_build/rocm_path + adopt upstream setup_mpi_flags
- test files: keep both ROCm and upstream test additions

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Audit of cleanly-merged upstream changes revealed three issues:

1. newton_schulz/newton_schulz.cpp: added to cuda_only_cpp_sources in
   CMakeLists.txt — includes cuda_runtime.h directly, uses cuSolverMp,
   not hipifiable as a .cpp file.

2. pytorch/csrc/extensions/newton_schulz.cpp: wrapped in #ifndef USE_ROCM
   guard — calls at::cuda::getCurrentCUDAStream() and cuSolverMp APIs.

3. pytorch/__init__.py: guarded newton_schulz import with IS_HIP_EXTENSION
   check — tex.newton_schulz pybind binding is not registered on ROCm.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Eight build errors resolved:

1. cast/core/common.cuh: restructured __HIP_PLATFORM_AMD__ guards — TensorMapStorage,
   g_tensor_maps, and TMA helper functions guarded; kernel namespace (reduce_dbias,
   grouped_reduce_dbias) and block decode helpers left unguarded for ROCm use;
   new upstream TMA dispatch functions (modify_base_tensor_map, update_tma_descriptors,
   prefetch/store_output_stage) guarded separately.

2. cast/mxfp8/group_dequantize_mxfp8.cuh: entire file guarded with
   #ifndef __HIP_PLATFORM_AMD__ — uses CUtensorMap throughout.

3. cast/dispatch/dequantize.cuh: guard MXFP8 group_dequantize call with
   #ifndef __HIP_PLATFORM_AMD__ — depends on group_dequantize_mxfp8.cuh.

4. util/ptx.cuh: added non-template float exp2f_rcp(e8m0_t) overload under
   #ifdef __HIP_PLATFORM_AMD__ — ROCm-specific files call without <float> template arg.
   Moved BF16_MANTISSA_BITS outside the CUDA-only guard so ROCm exp2f_rcp<bf16> works.

5. fused_router/utils.h: changed __ballot_sync mask from unsigned int to uint64_t
   and switched to __popcll — HIP requires 64-bit ballot mask.

6. normalization/rmsnorm/rmsnorm_api.cpp: guarded use_cudnn_norm_bwd() /
   NVTE_Norm_Backend::Cudnn / use_zero_centered_gamma_in_weight_dtype() calls
   with #ifndef __HIP_PLATFORM_AMD__.

7. pytorch/csrc/quantizer.cpp: removed duplicate eligible_for_rht_cast_fusion
   declaration; fixed columnwise_quant_config → columnwise_quant_config_to_use;
   added out_transpose wrapper construction for the RHT unfused columnwise path.

8. pytorch/quantization.py: added missing `import functools`.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
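
Item 5 above is easiest to see with a small simulation. The sketch below models a ballot over a 64-lane wavefront in plain Python and shows what is lost when the result is truncated to 32 bits; it assumes a 64-lane wavefront, and the helper names are hypothetical stand-ins rather than HIP APIs.

```python
from typing import List


def ballot(predicates: List[bool]) -> int:
    """Pack one predicate bit per lane into an integer mask (lane 0 = bit 0)."""
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask


def popcount32(mask: int) -> int:
    """Count set bits after truncating to 32 bits (what an unsigned int keeps)."""
    return bin(mask & 0xFFFFFFFF).count("1")


def popcount64(mask: int) -> int:
    """Count set bits in the full 64-bit mask."""
    return bin(mask & 0xFFFFFFFFFFFFFFFF).count("1")


# Every even lane of a 64-lane wavefront votes true.
votes = [lane % 2 == 0 for lane in range(64)]
full_mask = ballot(votes)
```

Here `popcount64(full_mask)` sees all 32 voting lanes while `popcount32(full_mask)` sees only the 16 in lanes 0 to 31, which is the kind of undercount a 32-bit mask type would introduce on a 64-lane wavefront.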
Resolved API mismatch from Phase 1 conflict resolution where upstream
refactored _Linear.forward to receive quantizers as separate args instead
of packed in non_tensor_args:

1. Updated _Linear.forward signature to accept 6 quantizer args separately
   matching the new call convention from Linear.forward.
2. Updated non_tensor_args unpacking to match the new tuple format (no quantizers).
3. Added ctx.save_for_backward and ctx.tensor_objects assignment.
4. Added ctx.backward_override initialization.
5. Fixed _Linear.backward return to reshape dgrad via ctx.inp_shape and
   return 11 None-padded values matching the new 11-arg forward signature.
6. Added import of restore_from_saved.
7. Added import functools to quantization.py.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…de ordering

Two fixes in _Linear.forward:
1. non_tensor_args at call site incorrectly passed 'cache_name is not None' (bool)
   at the 'module' position — changed to pass 'self' (the Linear module object).
2. backward_override nullification of ctx fields was placed before the ctx
   assignments, causing them to be immediately overwritten — moved to run
   after all ctx fields are set.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
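
The second fix above is an ordering bug that is easy to reproduce in isolation. The following sketch uses a plain object in place of the autograd context; the field name `saved_input` and both function names are hypothetical, since the commit message does not list which ctx fields are cleared.

```python
class Ctx:
    """Stand-in for an autograd context object (hypothetical)."""


def forward_buggy(ctx: Ctx, inp: str, backward_override) -> None:
    if backward_override is not None:
        ctx.saved_input = None  # cleared too early ...
    ctx.saved_input = inp  # ... and immediately overwritten here
    ctx.backward_override = backward_override


def forward_fixed(ctx: Ctx, inp: str, backward_override) -> None:
    ctx.saved_input = inp
    ctx.backward_override = backward_override
    if backward_override is not None:
        ctx.saved_input = None  # runs after all assignments, so it sticks
```

In the buggy order the nullification has no effect, so state that was meant to be dropped stays attached to the context.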
@VeeraRajasekhar VeeraRajasekhar self-assigned this Jun 8, 2026
Two build fixes:

1. tests/cpp/operator/test_normalization.cu: guard cudnnGetVersion() call
   with #ifndef __HIP_PLATFORM_AMD__ — cuDNN is not available on ROCm and
   use_cudnn is always false on this platform.

2. transformer_engine/jax/csrc/extensions/gemm.cpp: fix GroupedGemmV2Handler
   FFI binding — replace individual .Attr<> entries (M, N, K, lhs_is_trans,
   rhs_is_trans, scaling_mode, is_grouped_dense_wgrad) with .Attrs<GroupedGemmV2Config>()
   to match the GroupedGemmV2FFI function signature. This was the Phase 1
   conflict resolution error where the old attribute-based registration was
   kept instead of adopting the upstream struct-based API.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@VeeraRajasekhar VeeraRajasekhar marked this pull request as ready for review June 10, 2026 15:12
…ures

- Skip GEMM-swizzled scale unswizzle on HIP since ROCm quantize kernel writes compact scales
- Reshape MXFP8 scales to 2D on HIP to match the 2D-flattened data path in grouped dequantize
- Fix test_grouped_gemm_fp16 to use parametrized use_async_d2h_group_size instead of hardcoded True
- Remove duplicate input_shape/q_dtype parametrize decorators in TestGroupedQuantize
Comment on lines 1 to +2
# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.

Two copyright-header issues here, both inconsistent with the convention used in every other ROCm-modified file in this PR (see e.g. transformer_engine/jax/cpp_extensions/quantization.py, transformer_engine/pytorch/module/linear.py):

  1. Order is reversed. Per CLAUDE.md, when an existing NVIDIA-only file gains an AMD line, the AMD line must go above the NVIDIA line.
  2. Missing the # This file was modified for portability to AMDGPU marker line that every other ROCm-touched Python file carries above the AMD copyright.

Suggested fix:

Suggested change
- # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+ # This file was modified for portability to AMDGPU
+ # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+ # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
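
The header convention described in this comment could be checked mechanically. The sketch below is a hypothetical stdlib-only checker, not a tool from the repository; it assumes the three header lines can be recognized by the substrings shown and that they appear within the first few lines of the file.

```python
from typing import List, Optional

AMD = "Advanced Micro Devices, Inc."
NVIDIA = "NVIDIA CORPORATION"
MARKER = "This file was modified for portability to AMDGPU"


def header_problems(text: str, max_lines: int = 10) -> List[str]:
    """Report deviations from the marker / AMD / NVIDIA header order."""
    lines = text.splitlines()[:max_lines]

    def first(needle: str) -> Optional[int]:
        return next((i for i, line in enumerate(lines) if needle in line), None)

    amd, nvidia, marker = first(AMD), first(NVIDIA), first(MARKER)
    problems: List[str] = []
    if nvidia is None:
        return problems  # only files with an existing NVIDIA header are checked
    if amd is None:
        problems.append("missing AMD copyright line")
    elif amd > nvidia:
        problems.append("AMD copyright line is below the NVIDIA line")
    if marker is None:
        problems.append("missing AMDGPU portability marker")
    elif amd is not None and marker > amd:
        problems.append("portability marker is below the AMD line")
    return problems
```

Such a check would only make sense for files known to be ROCm-modified, since an untouched upstream file legitimately carries only the NVIDIA line.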

Comment on lines +1 to +2
/*************************************************************************
* Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This file was modified in this PR (commit 3d3f9e0 wraps the entire body in #ifndef __HIP_PLATFORM_AMD__) but the header still only carries the NVIDIA copyright. Per CLAUDE.md rule 2, ROCm-modified files with only an NVIDIA line need an AMD line added above it (with the AMDGPU portability marker):

Suggested change
- /*************************************************************************
- * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ /*************************************************************************
+ * This file was modified for portability to AMDGPU
+ * Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Comment on lines +1 to +2
/*************************************************************************
* Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This file was added by the IFU and then ROCm-modified in this PR (commit 50a837f wraps it in #ifndef USE_ROCM), but the header carries only the NVIDIA copyright. Per CLAUDE.md rule 2, add an AMD line above the NVIDIA line (with the AMDGPU portability marker):

Suggested change
- /*************************************************************************
- * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ /*************************************************************************
+ * This file was modified for portability to AMDGPU
+ * Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Comment thread build_tools/jax.py
Comment on lines 14 to 15
from .utils import rocm_build, rocm_path
from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled

The newly added import on line 15 re-imports the same three symbols already imported on line 14 (all_files_in_dir, get_cuda_include_dirs, debug_build_enabled). Looks like a leftover from the merge conflict resolution — both sides of the conflict were kept rather than being unified into a single import (compare with build_tools/pytorch.py, where the equivalent conflict was correctly merged into one from .utils import (...) block).

It works at runtime, but please consolidate to avoid the duplicate, e.g.:

Suggested change
from .utils import rocm_build, rocm_path
from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled
from .utils import rocm_build, rocm_path
from .utils import (
    all_files_in_dir,
    get_cuda_include_dirs,
    debug_build_enabled,
    setup_mpi_flags,
)

@github-actions

Review summary (initial) — IFU dev → v2.15, ~89 commits / 207 files / +26K −7K LoC.

Most of the diff is the upstream v2.15 history; the ROCm-specific work concentrates in the 7 post-merge fix commits (a02dba78 conflict resolution, plus 50a837f6 / 3d3f9e0c / 92859293 / 5104a384 / a0cc9377 / 5dd2c0c9). Focused the review on those plus the seams where ROCm guards/refactors land in upstream-touched files. No leftover conflict markers in the tree.

Spot-checked highlights that look correct:

  • _Linear.forward signature and _Linear.backward 11-arg return now match the new call site (92859293, 5104a384).
  • cache_name is not None → self fix at the module arg position is correct.
  • __ballot_sync → 64-bit mask + __popcll in fused_router/utils.h is the right HIP idiom.
  • New #ifndef __HIP_PLATFORM_AMD__ guards on TMA / CUtensorMap paths (cast/core/common.cuh, cast/mxfp8/group_dequantize_mxfp8.cuh, cast/dispatch/dequantize.cuh) keep the ROCm build green and route MXFP8 grouped-dequantize to a clear NVTE_ERROR on ROCm.
  • newton_schulz is excluded from the ROCm build at all three layers (CMake, csrc, py import).

Findings (see inline comments):

  • Code: build_tools/jax.py:14-15 keeps both branches of the merge — three from .utils import … lines with overlapping symbols. Works, but should be consolidated like build_tools/pytorch.py.
  • Copyright headers: 4 files need updates — transformer_engine/jax/quantize/dequantizer.py (AMD line was added below NVIDIA and lacks the AMDGPU portability marker), and three ROCm-touched files still carrying only the NVIDIA line: transformer_engine/common/cast/mxfp8/group_dequantize_mxfp8.cuh, transformer_engine/pytorch/csrc/extensions/newton_schulz.cpp, and transformer_engine/pytorch/__init__.py (last one not pinned to an inline comment because the header lines aren't in the diff hunk — please add # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. above the existing NVIDIA line in that file).

Given the size, this review is best-effort over a sample of the most load-bearing seams; please flag any specific files you'd like a closer look at.

Collaborator

Is this the new location of tests/pytorch/distributed/run_fsdp2_fused_adam.py?

test_cmd += ["--layer-type", layer_type]

subprocess.run(test_cmd, env=os.environ, check=True)
sys.path.pop(0)

Collaborator

Is it this way upstream, with the call here but not on line 28?


// Warp ballot to count how many lanes have a qualifying element
// Use 64-bit mask for ROCm compatibility (HIP requires uint64_t mask)
uint64_t ballot = __ballot_sync(0xFFFFFFFFFFFFFFFFull, is_greater);

Collaborator

This should be guarded, and __ballot should be used on ROCm.

}

#ifdef __HIP_PLATFORM_AMD__
// Non-template definition — delegates to the float specialization

Collaborator

Why not define it right where it is declared?


# nvte_grouped_gemm (the v2 kernel) requires SM100+ (Blackwell or newer).
# Fall back to the v1 path on SM90 (Hopper) and older architectures.
if get_min_device_compute_capability() < 100:

Collaborator

Can we enable it for AMD GPUs?

@@ -442,6 +462,18 @@ def get_attention_backend(
if use_flash_attention_3 and FlashAttentionUtils.v3_is_installed:
logger.debug("Disabling FlashAttention 3 for compute capability != sm90")
use_flash_attention_3 = False
# FA4 supports SM80, SM90, SM100, SM120

Collaborator

Let's explicitly disable FA3 and FA4 for ROCm
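
One way the request above could look, as a minimal sketch rather than the actual change: a pure helper that forces both flags off when running on ROCm. The function name and the `use_flash_attention_4` flag are hypothetical (only `use_flash_attention_3` appears in the hunk above). In the real module the `is_rocm` value would presumably come from `torch.utils.cpp_extension.IS_HIP_EXTENSION`, which is an assumption here.

```python
def apply_rocm_flash_attention_policy(is_rocm, use_flash_attention_3, use_flash_attention_4):
    """Return the (FA3, FA4) flags, forcing both to False on ROCm.

    Hypothetical helper for illustration; not part of Transformer Engine.
    """
    if is_rocm:
        # Per the review request, FA3 and FA4 are disabled explicitly on ROCm
        # instead of relying on the compute-capability checks to filter them out.
        return False, False
    # On other platforms the flags pass through unchanged.
    return use_flash_attention_3, use_flash_attention_4


if __name__ == "__main__":
    print(apply_rocm_flash_attention_policy(True, True, True))
    print(apply_rocm_flash_attention_policy(False, True, False))
```

Making the ROCm case an early, explicit branch keeps the decision visible in the debug log path and independent of how compute capability is reported on AMD devices.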

Fixes for FSDP2 distributed tests on ROCm:

1. run_fsdp2_fused_adam.py:
   - Add IS_HIP_EXTENSION import and AIPYTORCH-427 synchronize() after
     loss.backward() and optimizer.step() in all training loops to prevent
     RCCL deadlocks from forward/backward stream overlap with FSDP2.
   - Add synchronize() in test_fuse_wgrad_accumulation after forward pass.
   - xfail NVFP4BlockScaling in test_fused_adam_bf16 and
     test_fused_adam_bf16_store_param_remainders on ROCm: RCCL
     allreduce_coalesced on NVFP4 amax tensors produces incorrect values,
     causing scale_inv = inf and NaN outputs. Confirmed by disable_rht=True
     workaround which bypasses the amax all-reduce path.

2. run_fsdp2_model.py:
   - Add IS_HIP_EXTENSION import and AIPYTORCH-427 synchronize() after
     backward() in training loop (matches existing pattern from dev branch).
   - Fix double dist.destroy_process_group() from keep_both merge: barrier
     (for torch < 2.6 teardown race) and destroy consolidated into finally.
   - xfail NVFP4BlockScaling + fp8_init + LayerNormLinear on ROCm:
     _check_fp8_fsdp2_allgather exceeds atol=5e-4 due to per-shard amax
     divergence between the FSDP2 unshard path and manual allgather path.

3. test_torch_fsdp2.py:
   - Fix NameError 'fp8' not defined: replaced deprecated fp8 module
     references with direct imports from quantization module.
   - Add _get_free_port() and --master_port to torchrun calls to prevent
     EADDRINUSE when tests run sequentially.
   - xfail NVFP4BlockScaling + fp8_init + LayerNormLinear on ROCm.

4. test_cast_master_weights_to_fp8.py:
   - Skip _test_cast_master_weights_to_nvfp4 on ROCm: same NVFP4 amax
     RCCL issue causes NaN loss; assert_close(nan, nan) then fails
     because NaN != NaN by IEEE 754.
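
The free-port approach described in item 3 can be sketched as follows. This is an illustrative version, not the code from the commit: `get_free_port` and `build_torchrun_cmd` are hypothetical names, and the script path and process count are placeholders. It assumes the usual trick of binding to port 0 so the OS picks an unused port.

```python
import socket


def get_free_port():
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        # getsockname() returns (host, port); the port is the one the OS chose.
        return sock.getsockname()[1]


def build_torchrun_cmd(script, nproc):
    """Build a torchrun command line that uses a fresh rendezvous port."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={get_free_port()}",
        script,
    ]


if __name__ == "__main__":
    print(build_torchrun_cmd("run_fsdp2_model.py", 2))
```

The socket is closed before torchrun starts, so another process could in principle take the port in between. For sequential test runs that window is small, but it is not a hard guarantee.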