Ifu dev 260419 v2.15 #616
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…#2757) * [Common] Fix linker error for to_string(DType) in distributed tests Make transformer_engine::to_string(DType) inline in common.h so that translation units outside libtransformer_engine.so can resolve it without requiring the symbol to be exported. Regression introduced by 61f9594 which added to_string(DType) calls into TRANSFORMER_ENGINE_TYPE_SWITCH_* macros, causing test object files to reference the symbol that the linker version script hides. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…-Cast Fusion Kernel (#2555) * first draft Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * pass numerical unit test Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * format Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * add benchmark script Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * lint and format Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * compile guard Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * warning fix Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * resolve greptile comment Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * minor style fixes Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * fix namespace Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * resolve some comments Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * fix comment Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * attempt to fix compile CI with guard Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * better naming for tests Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * fix deprecate message Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * more compile guard Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * new API name Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * fix format all in one Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * try to fix compile CI again Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * AI code review comments Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * to pass oldest compile CI with cuda 12.1 Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * add more guards to nvfp4 Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * make multiply inverse default numerics Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * update numerics of nvfp4 partial cast as well Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * resolve comments Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * add NVTE_BUILD_NUM_PHILOX_ROUNDS 
after rebase Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> * simplify compile guard messages Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> --------- Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…Linear` (#2761) * Load multi-param checkpoint from single-param config in GroupedLinear Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Multi-param to single param case Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Multi-param to single param case Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better varnames Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix GMM cuBLAS version and SM arch checks Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Pin python 3.13 in vermin check Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update vermin version for python 3.14 support Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Use sha instead of tag Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* init Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * work finished Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lint fixes Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by: root <pgadzinski@nvidia.com> * removed warning.warn Signed-off-by: root <pgadzinski@nvidia.com> * [PyTorch] Remove dead None-check for num_out_tokens in moe_permute_mask_map_forward num_out_tokens is typed as int in the custom_op signature and can never be None; the check was incorrectly carried over from the class-based upstream version during merge conflict resolution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by: root <pgadzinski@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…are detached (#2772) [PyTorch] Change the restore tensor API to ensure tensors are detached from ctx Signed-off-by: Kaining Zhong <kainingz@nvidia.com> Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…ges it (#2781) Install pytest in onnx L1 test as Pyt container no longer packages it Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
…_descriptors (#2782) * Fix zero-sized groups in update_tma_descriptors Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * Update test_cast_mxfp8_grouped.cu Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…+ (#2693) * Enable sm120 support for fused attn if cuDNN is 9.18.1+ Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Force intermediate tensors such as S, Sum_Exp, and Max to be BHS1 shape instead of TH1 for sm120 Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support for sm120 correct batch, seq dims Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Add support for sm120 BHS1 style max logit even QKV are THD to avoid incorrect max logit calculation (includes padded tokens in max calculation) Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Disable fused and flash attn for sm120 filter:kv cache Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * For CP P2P attn, set softmax_lse_in_packed_format to False if sm120+ Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Assert in TE if T3HD/TH3D layout is used on sm120 before cuDNN F16 sdpa arbitrary kernel call Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Modify is_ragged_q && cudnn_runtime_version >= 90600 check to also include a check for sm120 Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Code clean up Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Disable fused attn for T3HD and TH3D Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * nit: Add missed sm120 guard Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Modify sm120 condition to be very specific to sm120 and not generalized to sm120+ Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Fix missing sm120 check in fwd Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * 
Move the check for sm120 T3HD/TH3D to nvte_get_fused_attn_backend() instead of higher layers in TE stack Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * nit: Check for matching sm120 and not sm120+ Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
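The sm120 commits above turn on two conditions: the architecture must match sm120 exactly (not "sm120 or newer"), and cuDNN must be at least 9.18.1. A minimal sketch of that gating logic follows. The function names are hypothetical and are not Transformer Engine APIs. The integer version encoding (major*10000 + minor*100 + patch) is an assumption chosen to be consistent with the `90600` literal quoted in the commit message.

```python
def cudnn_version_code(major: int, minor: int, patch: int) -> int:
    """Encode a cuDNN 9.x version as a single integer (assumed encoding)."""
    return major * 10000 + minor * 100 + patch


# Hypothetical constants for illustration only.
SM120 = 120
MIN_CUDNN_FOR_SM120 = cudnn_version_code(9, 18, 1)


def sm120_fused_attn_enabled(sm_arch: int, cudnn_version: int) -> bool:
    """Return True only for an exact sm120 match with a new enough cuDNN.

    Using `==` instead of `>=` on the architecture keeps the rule from
    silently applying to later architectures.
    """
    return sm_arch == SM120 and cudnn_version >= MIN_CUDNN_FOR_SM120


if __name__ == "__main__":
    for arch, version in [(120, 91801), (120, 91800), (121, 91801)]:
        print(arch, version, sm120_fused_attn_enabled(arch, version))
```

The exact-match comparison is the point of the "sm120 and not sm120+" commits: a `>=` test would extend the same behavior to any later architecture without anyone deciding that it should.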
* code drop Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * docs Signed-off-by: root <pgadzinski@nvidia.com> * nvfp4 internals support Signed-off-by: root <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lint fixes Signed-off-by: root <pgadzinski@nvidia.com> * Update transformer_engine/debug/features/dump_tensors.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * fix Signed-off-by: root <pgadzinski@nvidia.com> * Update transformer_engine/debug/features/dump_tensors.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * Update transformer_engine/debug/features/dump_tensors.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update tests/pytorch/debug/test_log.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * Update transformer_engine/debug/features/dump_tensors.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * fix Signed-off-by: root <pgadzinski@nvidia.com> * fix Signed-off-by: root <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove 
dump_quantized_internals support from DumpTensors Drop the dump_quantized_internals config option, the _get_quantized_internals method, and all helper functions for extracting scales/raw data from Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, and NVFP4Tensor. Remove corresponding tests: test_dump_tensors_nvfp4_unpacked_codes and NVFP4_DUMP_TENSORS_CONFIG, and scale/data assertions from test_dump_tensors_sanity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address Greptile review comments - Add dot ('.') to _sanitize_name to handle common PyTorch dotted layer names like 'encoder.layer.0.attention' - Add docstring note about pickle dependency for the 'quantized' key - Add comment explaining weights_only=False in test - Remove redundant local RecipeState import in test_nvfp4_numeric Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * Remove portability suggestion from quantized key docstring Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * Compute rank lazily in _expected_root_dir Avoids relying on stale self.rank when ensure_initialized is called before initialize() has set the rank. Consistent with how nvdlfw_inspect logger resolves rank. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * detach tensors before saving; verify dump filename in test Detach both high_precision and quantized tensors before saving to avoid serializing the autograd graph. For QuantizedTensor this is a zero-copy view (make_like), so no extra GPU allocation. Add filename format assertion to test_dump_tensors_sanity to catch regressions in _sanitize_name or the naming convention. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add empty dump_dict log; assert QuantizedTensor type in test Log a message when no tensors are available to dump so the user has an explicit signal that no file was written. Assert that the quantized key round-trips as a QuantizedTensor to catch regressions in detach() or serialisation path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/debug/features/dump_tensors.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * Address review: iter subdirs, remove dead rank field, add allclose test and MSE example - Organize dumps into per-iteration subdirectories (iter_000000/) to keep file count manageable per directory. - Remove unused self.rank attribute from TensorLogger. - Add torch.allclose assertion in test to verify serialization correctness. - Add docstring example showing how to load dumps and compute MSE. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> Made-with: Cursor * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: use detach().clone() to avoid shared storage in DumpTensors Using tensor.detach() creates a view sharing the same underlying storage. If any in-place operation modifies the tensor after the dump, the saved data would be silently corrupted. Use .clone() to ensure the dump captures an independent copy of the data. 
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * test: use torch.equal instead of torch.allclose for serialisation round-trip The saved tensor is an exact bit-for-bit copy (detach().clone()), so torch.equal is the correct check. torch.allclose with its default tolerances could mask a genuine dtype conversion or precision loss introduced by a future change to the serialisation path. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * fix: add tp_size to DumpTensors.inspect_tensor and fix KeyError in call_feature backward compat pop Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by: root <pgadzinski@nvidia.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
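The switch from `torch.allclose` to `torch.equal` in the test commit above rests on one argument: a tolerance-based comparison can pass even when the saved data has lost precision. The sketch below illustrates that argument with the Python standard library only, as a stand-in for the PyTorch calls. It uses `math.isclose` with tolerances set to the documented `torch.allclose` defaults (`rtol=1e-05`, `atol=1e-08`), and `struct` to simulate an unintended float64 to float32 conversion. The helper name is hypothetical.

```python
import math
import struct


def roundtrip_through_float32(x: float) -> float:
    """Simulate accidental precision loss: store a float64 as float32, read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


original = 0.1                                   # float64 value
restored = roundtrip_through_float32(original)   # nearest float32, widened again

# Tolerance-based check: passes although the bits differ.
tolerant_match = math.isclose(original, restored, rel_tol=1e-5, abs_tol=1e-8)

# Exact check: fails, exposing the precision loss.
exact_match = original == restored

if __name__ == "__main__":
    print("tolerant:", tolerant_match)
    print("exact:   ", exact_match)
```

For a dump that is supposed to be a bit-for-bit copy, only the exact check detects this kind of regression, which is the reasoning given in the commit message.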
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * change distributed tests infra for fsdp2 Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * verbose flag for reporting Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * add back comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * another minor fix Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * not needed for this PR Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * unnecessary comments
* add cudnn dln+add Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> * try fixing cudnn build issue Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> * guard against cudnn version Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change itype to wtype for add in rmsnorm_bwd Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> * remove dead code Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> * remove dangling todo Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> --------- Signed-off-by: CarlosGomes98 <carlosmiguel.gomes@live.com.pt> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add blackwell support filter for 9.7<=cudnn<9.18.1 Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * simplify conditionals Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fix conditionals again Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fix conditionals again Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update the error log Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * remove the python filter and correct the cpp filter Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…quired (#2798) * Disable fused attention for sm120 if determinism is required Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * nit: disable fused attn for sm120 determinism, if training Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
… always and `Max` when `return_max_logit=True` (#2677) * cudnn now returns Stats always and Max only with `return_max_logit=true` Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fix a typo that caused a bug Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update doc strings Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix more docs Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fixes from the feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update cudnn-frontend to v1.19.1 Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update the cudnn frontend Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fix a wrong omission Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Move cuDNN FE to v1.21.0 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add warning if using BSHD and max_segments_per_seq > 1 Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/jax/attention.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * Update transformer_engine/jax/attention.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * Remove warning test Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>
* Refactor to group_sizes per tensor Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Support first_dims and last_dims instead of a single group_sizes per tensor Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Refactor GMM FFIs to store static attrs as structs Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleanup C++ v2 FFI Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Fix int64 workspace usage Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Address greptile comments Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Refactor wgrad-specific checks to be generic for GMM in gemm.py Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Refactor XLA FFI struct setup Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Fix edge case in TE v1 GMM Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix issues on Hopper Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Refactor Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address comments Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Lint Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Fixes for Hopper Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Address review comments Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Grouped quantization test fixes Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> 
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Pass input_output_alias to TritonAutotunedKernelCall Signed-off-by: JAX Toolbox <jax@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add jax version guard for the input_output_aliasing fix Signed-off-by: tdophung <tdophung@nvidia.com> --------- Signed-off-by: JAX Toolbox <jax@nvidia.com> Signed-off-by: tdophung <tdophung@nvidia.com> Co-authored-by: JAX Toolbox <jax@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* done Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * one review comment from greptile Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * instead part of the comment not needed Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> * No need to set it to None Remove unnecessary columnwise data and scale inv assignments. Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> --------- Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* cudnn now returns Stats always and Max only with `return_max_logit=true` Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fix a typo that caused a bug Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update doc strings Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix more docs Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fixes from the feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update cudnn-frontend to v1.19.1 Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * update the cudnn frontend Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fix a wrong omission Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bugfix: mask out padding tokens when THD Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes from greptile feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor nit Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> * fixes from feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Enabled persistency with WorkID Query feature Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added a struct with tunable parameters Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added persistency with static scheduling Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixed test cases Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Ready for benchmarking Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixed out-of-boundary error Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Tuned kernel parameters Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring 2 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring 3 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Removed the dynamic (WorkID Query) persistency Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Ready for PR Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Ready for benchmark Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Ready for benchmark - Regular kernel Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added the source code to the profiler Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added constructors to Job and Block descriptors Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Removed the prefetch overlapping between jobs Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Cache tensor ID Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * ShapeRepresentation is not a template parameter Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Removed redundant fence_proxy Signed-off-by: Oleg Goncharov 
<ogoncharov@nvidia.com> * Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Used mixed precision FMA Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added Quantize parameters Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added the fast math branch Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added the fast math to cpp test suite Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Align tests Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Use STS instead of generic ST Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Add zero-tensor cases Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Used LDS instead of generic LD in colwise path Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Used LDS instead of generic LD in rowwise Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Ready for merge Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Uncommented test cases Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added FP16 Fast math path to rowwise processing Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed lint Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixes Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fix Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixed test suite Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixed test suite Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixes per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes 
from pre-commit.com hooks for more information, see https://pre-commit.ci * Modifications per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Assert the buffer size Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Added fast math RCP for bf16 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fast math for BF16 is now default Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed compilation error when compiling on previous archs Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Boundary condition fix Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Fixed compilation error Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring. 
Moved helpers to core-common Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactoring per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Addressed the PR review comments Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed the compilation error when PTX was compiled for CUDA 13.0 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed pytorch extensions Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> --------- Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…os (#2823) * Fix: Use jitted kernels for generating THD (and BSHD) segment pos if only segment id is passed Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Make passing of segment_pos to from_segment_ids_and_pos for creating a SequenceDescriptor mandatory Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Make test changes for from_segment_ids_and_pos API change. Also some nits. Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Make segment_pos arg mandatory and not Optional Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * Add comments for from_segment_ids_and_pos Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * nit: Change data types for BSHD seg pos and seg id to be int32 and consistent with THD when setting up test inputs Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Replace a TypeError if segment_pos is not passed with a ValueError with a message Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* GEMM + Swiglu fused Grouped MLP for MXFP8 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * cleanup/lint Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Properly cache the alpha tensor Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * nD dummy grad Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 0 tokens in entire rank Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tmp downgrade cublas version check Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * delayed wgrad tests pass for basic gl Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * merge everything Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Rebase into fused_mxfp8_grouped_mlp; unit tests for delayed wgrad working Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests being skipped for fusible ops Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Integrate mxfp8 dbias kernel in group_quantize Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add bias/dbias fused support with cute GEMMs Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Check bias/dbias support Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Pack biases more efficiently Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * GroupedTensor for biases to avoid concat Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> 
* format Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Support 1D grouped tensor shape for bias and fix checkpointing Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes and tests Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Refactor grouped tensor marking for paged stashing Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove setting logical_shape in mark_grouped_tensor Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup logical_shape Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * pass the tests for now Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address some review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * more cleanups Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * refactor wgrad logic Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename argument from single_grouped_parameter to single_grouped_weight Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Check wgrad store context is not empty for 0 token case. 
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Test only checks for fusion if fused kernel is available Signed-off-by: Tim Moon <tmoon@nvidia.com> * fix the tolerance to be of bf16 for the cute gemm Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * Update transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> * address further review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address more review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * address more review comments + test for zero grouped tensor work case Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * cublaslt remove zero work gemm avoidance Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the wgrad test Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * split dbias functionality from gq api Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Format and lint Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * port fixes and add better doc for page stashing war Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Guard fusion via env Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change to trigger CI Remove unnecessary blank line in docstring. 
* To retrigger CI * Space to trigger the pipeline * fix zero work cublas gemm Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: Tim Moon <tmoon@nvidia.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Varun Thumbe <vthumbe@nvidia.com> Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Add tests that demonstrate two known memory issues with FSDP2 + FP8:
- Issue #2681: FP8 weight copies created during te.autocast() forward pass
accumulate across layers instead of being freed between layers, defeating
FSDP2's memory efficiency. Detected by comparing per-layer forward memory
increments against a bf16 baseline using layer hooks.
- Issue #2717: Transpose cache tensors (_create_transpose) allocated during
backward persist until the next forward pass instead of being freed after
backward completes. Detected by comparing the backward memory delta
(post_bwd - post_fwd) against a bf16 baseline.
New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects #2681
- test_transpose_cache_retained_after_backward: xfail, detects #2717
All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
…x (#2820) * Compute swizzle_idx once per thread and pass into ComputeKernel. Signed-off-by: Cael Ling <caell@nvidia.com> * one __syncthreads per stage in GroupHadamardAmaxTmaKernel Signed-off-by: Cael Ling <caell@nvidia.com> * streamline group Hadamard ComputeKernel loads Signed-off-by: Cael Ling <caell@nvidia.com> * streamline group Hadamard ComputeKernel loads Signed-off-by: Cael Ling <caell@nvidia.com> * streamline group Hadamard ComputeKernel loads Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * one __syncthreads per stage in GroupHadamardAmaxTmaKernel Signed-off-by: Cael Ling <caell@nvidia.com> Made-with: Cursor * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Compute swizzle_idx once per thread and pass into ComputeKernel. Signed-off-by: Cael Ling <caell@nvidia.com> * Fix kReturnIdentityAmax path Signed-off-by: Cael Ling <caell@nvidia.com> * Fix kReturnIdentityAmax path Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply the change to other variants Signed-off-by: Cael Ling <caell@nvidia.com> * Refactor the change to other variants Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor the change to other variants Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor the ldmatrix logics Signed-off-by: Cael Ling <caell@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Cael Ling <caell@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Minor misc optimizations in fused GroupedMLP Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add new test to compare single vs multi-param fused GMLP case Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add bias support Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix: enabling fused _router to be able to handle large topk and number of experts - expanding shared memory when needed - switch to radix topk selection when topk is large - test_fused_router.py updated with large num experts and tolerances refined for different cases * added topk>=16 in tests/pytorch/test_fused_router.py added return value check of cudaFuncSetAttribute in transformer_engine/common/fused_router/fused_topk_with_score_function.cu added dtype dependent eps in tests/pytorch/test_fused_router.py removed unneeded code in transformer_engine/common/fused_router/utils.h * test_fused_router.py needs to skip topk >= num_experts case Signed-off-by: Harry Zhou <hhanyu@nvidia.com> cleaned up raw warp operations added comments added shared_memory check added return code check * warning about dtype for tolerance in test_fused_router.py Signed-off-by: Harry Zhou <hhanyu@nvidia.com> --------- Signed-off-by: Harry Zhou <hhanyu@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Gaetan Lepage <gaetan@glepage.com>
* Grouped dequantize for MXFP8 Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Pytorch extension Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Fix CUDA graphs compatibility Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Handling non-full tiles Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Fixes Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes from review Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Refactor grouped MXFP8 dequantize kernel - Use common namespace helpers instead of group_quantize_kernel - Extract shared constants into DequantizeConfig struct - Replace SCALE_DIM template params with single ROWWISE bool - Use initialize_barriers/destroy_barriers helpers - Fix offsets array size (num_tensors + 1) - Skip TMA descriptor update for zero-sized groups - Fix off-by-one in max tensor descriptor check Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * Tighten tensor_offsets validation to require num_tensors+1 All producers (splits_to_offsets, quantizer.cpp) and consumers (is_job_valid, get_current_tensor_id, hadamard transform) already use CSR-style num_tensors+1 offsets. Make the validation match. Also fix stale docstring in grouped_tensor_storage.py. Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix group_dequantize: attribute names, dtype, and shape handling In group_dequantize(), GroupedTensor inherits from torch.Tensor, so accessing .data returns the 2D wrapper tensor instead of the 1D quantized data buffer. 
Fix three issues: - Read "rowwise_data" attribute instead of "data" to get the flat 1D quantized buffer rather than torch.Tensor.data (2D wrapper). - Use quantizer->dtype (e.g. kFloat8E4M3) instead of deriving dtype from the raw tensor's scalar_type() which is just uint8. - Pass numel() as a 1-element shape vector to ensure the grouped tensor data is registered as 1D. Promote DType dtype from quantizer subclasses to the base Quantizer class (defaulting to kNumTypes) so group_dequantize can access it without downcasting. Update tests to compare per-tensor via split_into_quantized_tensors() instead of accessing .data on GroupedTensor. Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com> Signed-off-by: Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Przemek Tredak <ptredak@nvidia.com> Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
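The CSR-style offsets convention this commit tightens (num_tensors + 1 entries) can be illustrated with a short Python sketch. It borrows the name splits_to_offsets from the commit message, but the signature and body here are assumptions for illustration, not the Transformer Engine implementation; group_range is a purely hypothetical helper.

```python
from itertools import accumulate
from typing import List, Tuple


def splits_to_offsets(splits: List[int]) -> List[int]:
    """Hypothetical helper: turn per-tensor sizes into CSR-style offsets.

    For num_tensors groups the result has num_tensors + 1 entries, so that
    group i occupies the half-open range [offsets[i], offsets[i + 1]).
    """
    if any(s < 0 for s in splits):
        raise ValueError("split sizes must be non-negative")
    return [0, *accumulate(splits)]


def group_range(offsets: List[int], i: int) -> Tuple[int, int]:
    """Return (start, end) for group i; an empty group has start == end."""
    return offsets[i], offsets[i + 1]


splits = [3, 0, 5]
offsets = splits_to_offsets(splits)
```

With splits [3, 0, 5] the offsets are [0, 3, 3, 8]. The zero-sized middle group appears as an empty range, which corresponds to the case where the commit skips the TMA descriptor update.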
* add fa4 support Signed-off-by: Xin Yao <xiny@nvidia.com> * comment out unused import for cp Signed-off-by: Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by: Xin Yao <xiny@nvidia.com> * install fa4 in L3 test Signed-off-by: Xin Yao <xiny@nvidia.com> * fix sm90 Signed-off-by: Xin Yao <xiny@nvidia.com> --------- Signed-off-by: Xin Yao <xiny@nvidia.com>
* Fix grouped quant checkpointing Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * Cleanup Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* adds NVFP4 Fused Adam support Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com> * un xfail test Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> * cleanup Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> * adds back copy dispatch handler Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> --------- Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com> Signed-off-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com> Co-authored-by: Jonathan Mitchell <jomitchell@r6515-0097.ipp1a1.colossus.nvidia.com> Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Resolved all 30 conflicted files from the upstream v2.15 merge:
- CMakeLists.txt: keep ROCm source/hipify blocks + upstream CUDA arch flags
- cast/core/common.cuh: guard TensorMapStorage under !__HIP_PLATFORM_AMD__
- group_quantize_mxfp8.cuh: adopt upstream quant_config param, keep ROCm guard
- rmsnorm_api.cpp: keep ROCm cuDNN/zero_centered_gamma guard in bwd_add
- recipe/__init__.py: keep MXFP4BlockScaling class + adopt upstream __repr__
- util/logging.h: add cuSolverMp macro guarded under !__HIP_PLATFORM_AMD__
- util/ptx.cuh: keep ROCm stochastic rounding + upstream BF16_MANTISSA_BITS
- extensions.h: add grouped_swizzle_for_gemm before USE_ROCM guard
- pybind.cpp: keep Newton-Schulz bindings under USE_ROCM guard
- quantizer.cpp: keep ROCm RHT cast fusion eligibility path
- quantization.py: keep ROCm nvfp4/mxfp4 checks + adopt upstream cached wrappers
- backends.py: keep AITER triton path + adopt upstream FA3 import with IS_HIP guard
- utils.py (attn): add IS_HIP_EXTENSION guard + upstream FA3 deterministic check
- module/base.py: keep get_weight_workspace method
- module/linear.py: keep ROCm inline forward/backward + upstream non_tensor_args
- module/grouped_linear.py: keep triton path + adopt upstream backward_override
- module/layernorm_linear.py: keep ROCm FP8 state + adopt upstream backward_override
- module/layernorm_mlp.py: adopt upstream qstate refactor (take_upstream)
- jax/cpp_extensions/gemm.py: keep ROCm grouped gemm quantizer path
- jax/csrc/extensions/gemm.cpp: keep ROCm swizzle guard + upstream attr changes
- build_tools: keep rocm_build/rocm_path + adopt upstream setup_mpi_flags
- test files: keep both ROCm and upstream test additions
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Audit of cleanly-merged upstream changes revealed three issues:
1. newton_schulz/newton_schulz.cpp: added to cuda_only_cpp_sources in CMakeLists.txt, because it includes cuda_runtime.h directly, uses cuSolverMp, and is not hipifiable as a .cpp file.
2. pytorch/csrc/extensions/newton_schulz.cpp: wrapped in an #ifndef USE_ROCM guard, because it calls at::cuda::getCurrentCUDAStream() and cuSolverMp APIs.
3. pytorch/__init__.py: guarded the newton_schulz import with an IS_HIP_EXTENSION check, because the tex.newton_schulz pybind binding is not registered on ROCm.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Eight build errors resolved:
1. cast/core/common.cuh: restructured __HIP_PLATFORM_AMD__ guards: TensorMapStorage, g_tensor_maps, and TMA helper functions guarded; kernel namespace (reduce_dbias, grouped_reduce_dbias) and block decode helpers left unguarded for ROCm use; new upstream TMA dispatch functions (modify_base_tensor_map, update_tma_descriptors, prefetch/store_output_stage) guarded separately.
2. cast/mxfp8/group_dequantize_mxfp8.cuh: entire file guarded with #ifndef __HIP_PLATFORM_AMD__, because it uses CUtensorMap throughout.
3. cast/dispatch/dequantize.cuh: guard MXFP8 group_dequantize call with #ifndef __HIP_PLATFORM_AMD__, because it depends on group_dequantize_mxfp8.cuh.
4. util/ptx.cuh: added non-template float exp2f_rcp(e8m0_t) overload under #ifdef __HIP_PLATFORM_AMD__, because ROCm-specific files call it without the <float> template arg. Moved BF16_MANTISSA_BITS outside the CUDA-only guard so ROCm exp2f_rcp<bf16> works.
5. fused_router/utils.h: changed __ballot_sync mask from unsigned int to uint64_t and switched to __popcll, because HIP requires a 64-bit ballot mask.
6. normalization/rmsnorm/rmsnorm_api.cpp: guarded use_cudnn_norm_bwd() / NVTE_Norm_Backend::Cudnn / use_zero_centered_gamma_in_weight_dtype() calls with #ifndef __HIP_PLATFORM_AMD__.
7. pytorch/csrc/quantizer.cpp: removed duplicate eligible_for_rht_cast_fusion declaration; fixed columnwise_quant_config → columnwise_quant_config_to_use; added out_transpose wrapper construction for the RHT unfused columnwise path.
8. pytorch/quantization.py: added missing `import functools`.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Resolved API mismatch from Phase 1 conflict resolution where upstream refactored _Linear.forward to receive quantizers as separate args instead of packed in non_tensor_args:
1. Updated _Linear.forward signature to accept 6 quantizer args separately matching the new call convention from Linear.forward.
2. Updated non_tensor_args unpacking to match the new tuple format (no quantizers).
3. Added ctx.save_for_backward and ctx.tensor_objects assignment.
4. Added ctx.backward_override initialization.
5. Fixed _Linear.backward return to reshape dgrad via ctx.inp_shape and return 11 None-padded values matching the new 11-arg forward signature.
6. Added import of restore_from_saved.
7. Added import functools to quantization.py.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…de ordering
Two fixes in _Linear.forward:
1. non_tensor_args at call site incorrectly passed 'cache_name is not None' (bool) at the 'module' position; changed to pass 'self' (the Linear module object).
2. backward_override nullification of ctx fields was placed before the ctx assignments, causing them to be immediately overwritten; moved to run after all ctx fields are set.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
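The ordering problem in the second fix can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the _Linear code: a SimpleNamespace stands in for the autograd ctx, and the weight field and both function names are hypothetical.

```python
from types import SimpleNamespace


def setup_ctx_buggy(ctx, weight, backward_override):
    # Bug pattern: the override clears the field first ...
    if backward_override is not None:
        ctx.weight = None
    # ... and the regular assignment then overwrites the cleared value.
    ctx.weight = weight
    ctx.backward_override = backward_override


def setup_ctx_fixed(ctx, weight, backward_override):
    # Set every ctx field first.
    ctx.weight = weight
    ctx.backward_override = backward_override
    # Only then drop what the override path must not keep alive.
    if backward_override is not None:
        ctx.weight = None


buggy_ctx = SimpleNamespace()
fixed_ctx = SimpleNamespace()
setup_ctx_buggy(buggy_ctx, weight="W", backward_override="high_precision")
setup_ctx_fixed(fixed_ctx, weight="W", backward_override="high_precision")
```

After both calls, buggy_ctx.weight still holds "W" while fixed_ctx.weight is None, which is the difference the commit describes.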
Two build fixes:
1. tests/cpp/operator/test_normalization.cu: guard cudnnGetVersion() call with #ifndef __HIP_PLATFORM_AMD__; cuDNN is not available on ROCm and use_cudnn is always false on this platform.
2. transformer_engine/jax/csrc/extensions/gemm.cpp: fix GroupedGemmV2Handler FFI binding: replace individual .Attr<> entries (M, N, K, lhs_is_trans, rhs_is_trans, scaling_mode, is_grouped_dense_wgrad) with .Attrs<GroupedGemmV2Config>() to match the GroupedGemmV2FFI function signature. This was the Phase 1 conflict resolution error where the old attribute-based registration was kept instead of adopting the upstream struct-based API.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…ures
- Skip GEMM-swizzled scale unswizzle on HIP since ROCm quantize kernel writes compact scales
- Reshape MXFP8 scales to 2D on HIP to match the 2D-flattened data path in grouped dequantize
- Fix test_grouped_gemm_fp16 to use parametrized use_async_d2h_group_size instead of hardcoded True
- Remove duplicate input_shape/q_dtype parametrize decorators in TestGroupedQuantize
| # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. |
Two copyright-header issues here, both inconsistent with the convention used in every other ROCm-modified file in this PR (see e.g. transformer_engine/jax/cpp_extensions/quantization.py, transformer_engine/pytorch/module/linear.py):
- Order is reversed. Per CLAUDE.md, when an existing NVIDIA-only file gains an AMD line, the AMD line must go above the NVIDIA line.
- Missing the "# This file was modified for portability to AMDGPU" marker line that every other ROCm-touched Python file carries above the AMD copyright.
Suggested fix:
| # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. | |
| # This file was modified for portability to AMDGPU | |
| # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. | |
| # Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
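The header convention described in this comment (portability marker, then AMD copyright, then NVIDIA copyright) could be checked mechanically. The following is a minimal sketch under that reading of the convention; header_problems and its messages are hypothetical and not part of the repository tooling.

```python
from typing import List, Optional

MARKER = "This file was modified for portability to AMDGPU"
AMD = "Advanced Micro Devices, Inc."
NVIDIA = "NVIDIA CORPORATION"


def _first_index(lines: List[str], needle: str) -> Optional[int]:
    """Index of the first line containing needle, or None if absent."""
    return next((i for i, line in enumerate(lines) if needle in line), None)


def header_problems(lines: List[str]) -> List[str]:
    """Hypothetical check: marker above AMD line, AMD line above NVIDIA line."""
    marker, amd, nvidia = (_first_index(lines, s) for s in (MARKER, AMD, NVIDIA))
    problems = []
    if amd is None:
        problems.append("missing AMD copyright line")
    elif nvidia is not None and amd > nvidia:
        problems.append("AMD copyright line is below the NVIDIA line")
    if marker is None:
        problems.append("missing AMDGPU portability marker")
    elif amd is not None and marker > amd:
        problems.append("portability marker is below the AMD copyright line")
    return problems


flagged_header = [
    "# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
    "# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.",
]
suggested_header = [
    "# This file was modified for portability to AMDGPU",
    "# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.",
    "# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
]
```

Run on the two headers from this thread, the check reports both issues for the flagged header and none for the suggested one.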
| /************************************************************************* | ||
| * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
This file was modified in this PR (commit 3d3f9e0 wraps the entire body in #ifndef __HIP_PLATFORM_AMD__) but the header still only carries the NVIDIA copyright. Per CLAUDE.md rule 2, ROCm-modified files with only an NVIDIA line need an AMD line added above it (with the AMDGPU portability marker):
| /************************************************************************* | |
| * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| /************************************************************************* | |
| * This file was modified for portability to AMDGPU | |
| * Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. | |
| * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| /************************************************************************* | ||
| * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
This file was added by the IFU and then ROCm-modified in this PR (commit 50a837f wraps it in #ifndef USE_ROCM), but the header carries only the NVIDIA copyright. Per CLAUDE.md rule 2, add an AMD line above the NVIDIA line (with the AMDGPU portability marker):
| /************************************************************************* | |
| * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| /************************************************************************* | |
| * This file was modified for portability to AMDGPU | |
| * Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. | |
| * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| from .utils import rocm_build, rocm_path | ||
| from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled |
The newly added import on line 15 re-imports the same three symbols already imported on line 14 (all_files_in_dir, get_cuda_include_dirs, debug_build_enabled). Looks like a leftover from the merge conflict resolution — both sides of the conflict were kept rather than being unified into a single import (compare with build_tools/pytorch.py, where the equivalent conflict was correctly merged into one from .utils import (...) block).
It works at runtime, but please consolidate to avoid the duplicate, e.g.:
| from .utils import rocm_build, rocm_path | |
| from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled | |
| from .utils import rocm_build, rocm_path | |
| from .utils import ( | |
| all_files_in_dir, | |
| get_cuda_include_dirs, | |
| debug_build_enabled, | |
| setup_mpi_flags, | |
| ) |
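Duplicates like this one are easy to catch automatically after a large merge. The sketch below uses Python's standard ast module to list names imported more than once through "from ... import" statements; duplicate_from_imports is a hypothetical helper, and the sample source is only an approximation of the two lines under discussion.

```python
import ast
from collections import Counter
from typing import List, Tuple


def duplicate_from_imports(source: str) -> List[Tuple[str, str]]:
    """Return (module, name) pairs imported more than once via 'from ... import'."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # Rebuild the dotted prefix of relative imports such as '.utils'.
            module = "." * node.level + (node.module or "")
            for alias in node.names:
                counts[(module, alias.name)] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)


# Illustrative sample only; the real file contents may differ.
sample = (
    "from .utils import all_files_in_dir, get_cuda_include_dirs, debug_build_enabled\n"
    "from .utils import (\n"
    "    all_files_in_dir,\n"
    "    get_cuda_include_dirs,\n"
    "    debug_build_enabled,\n"
    "    setup_mpi_flags,\n"
    ")\n"
)
duplicates = duplicate_from_imports(sample)
```

For the sample, the three re-imported names are reported and setup_mpi_flags is not, since it appears only once.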
|
Review summary (initial) — IFU Most of the diff is the upstream v2.15 history; the ROCm-specific work concentrates in the 7 post-merge fix commits ( Spot-checked highlights that look correct:
Findings (see inline comments):
Given the size, this review is best-effort over a sample of the most load-bearing seams; please flag any specific files you'd like a closer look at. |
Is this the new location of tests/pytorch/distributed/run_fsdp2_fused_adam.py?
| test_cmd += ["--layer-type", layer_type] | ||
|
|
||
| subprocess.run(test_cmd, env=os.environ, check=True) | ||
| sys.path.pop(0) |
Is this how it is in upstream: present here but not on line 28?
|
|
||
| // Warp ballot to count how many lanes have a qualifying element | ||
| // Use 64-bit mask for ROCm compatibility (HIP requires uint64_t mask) | ||
| uint64_t ballot = __ballot_sync(0xFFFFFFFFFFFFFFFFull, is_greater); |
This should be guarded, and __ballot should be used on ROCm.
| } | ||
|
|
||
| #ifdef __HIP_PLATFORM_AMD__ | ||
| // Non-template definition — delegates to the float specialization |
Why not define it right where it is declared?
|
|
||
| # nvte_grouped_gemm (the v2 kernel) requires SM100+ (Blackwell or newer). | ||
| # Fall back to the v1 path on SM90 (Hopper) and older architectures. | ||
| if get_min_device_compute_capability() < 100: |
Can we enable it for AMD GPUs?
| @@ -442,6 +462,18 @@ def get_attention_backend( | |||
| if use_flash_attention_3 and FlashAttentionUtils.v3_is_installed: | |||
| logger.debug("Disabling FlashAttention 3 for compute capability != sm90") | |||
| use_flash_attention_3 = False | |||
| # FA4 supports SM80, SM90, SM100, SM120 | |||
Let's explicitly disable FA3 and FA4 for ROCm
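One way to read this request is as an early, unconditional switch-off on ROCm, before any compute-capability checks run. The sketch below is only an illustration of that idea: the function name, the use_flash_attention_4 flag, and the is_hip parameter are assumptions, not names confirmed by the diff.

```python
import logging
from typing import Tuple

logger = logging.getLogger("attention_backend_sketch")


def filter_flash_attention_for_rocm(
    is_hip: bool, use_flash_attention_3: bool, use_flash_attention_4: bool
) -> Tuple[bool, bool]:
    """Hypothetical helper: force FA3 and FA4 off when running on ROCm."""
    if is_hip:
        if use_flash_attention_3:
            logger.debug("Disabling FlashAttention 3 on ROCm")
        if use_flash_attention_4:
            logger.debug("Disabling FlashAttention 4 on ROCm")
        return False, False
    # On non-ROCm builds the flags pass through for the existing checks.
    return use_flash_attention_3, use_flash_attention_4


rocm_flags = filter_flash_attention_for_rocm(True, True, True)
cuda_flags = filter_flash_attention_for_rocm(False, True, False)
```

In real code the is_hip value would presumably come from an IS_HIP_EXTENSION-style check such as the one mentioned elsewhere in this PR.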
Fixes for FSDP2 distributed tests on ROCm:
1. run_fsdp2_fused_adam.py:
- Add IS_HIP_EXTENSION import and AIPYTORCH-427 synchronize() after
loss.backward() and optimizer.step() in all training loops to prevent
RCCL deadlocks from forward/backward stream overlap with FSDP2.
- Add synchronize() in test_fuse_wgrad_accumulation after forward pass.
- xfail NVFP4BlockScaling in test_fused_adam_bf16 and
test_fused_adam_bf16_store_param_remainders on ROCm: RCCL
allreduce_coalesced on NVFP4 amax tensors produces incorrect values,
causing scale_inv = inf and NaN outputs. Confirmed by disable_rht=True
workaround which bypasses the amax all-reduce path.
2. run_fsdp2_model.py:
- Add IS_HIP_EXTENSION import and AIPYTORCH-427 synchronize() after
backward() in training loop (matches existing pattern from dev branch).
- Fix double dist.destroy_process_group() from keep_both merge: barrier
(for torch < 2.6 teardown race) and destroy consolidated into finally.
- xfail NVFP4BlockScaling + fp8_init + LayerNormLinear on ROCm:
_check_fp8_fsdp2_allgather exceeds atol=5e-4 due to per-shard amax
divergence between the FSDP2 unshard path and manual allgather path.
3. test_torch_fsdp2.py:
- Fix NameError 'fp8' not defined: replaced deprecated fp8 module
references with direct imports from quantization module.
- Add _get_free_port() and --master_port to torchrun calls to prevent
EADDRINUSE when tests run sequentially.
- xfail NVFP4BlockScaling + fp8_init + LayerNormLinear on ROCm.
4. test_cast_master_weights_to_fp8.py:
- Skip _test_cast_master_weights_to_nvfp4 on ROCm: same NVFP4 amax
RCCL issue causes NaN loss; assert_close(nan, nan) then fails
because NaN != NaN by IEEE 754.
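Two points from this commit message can be sketched in plain Python: picking a free port for each torchrun launch, and why a NaN-to-NaN comparison fails. The helper bodies below are assumptions for illustration (the commit's _get_free_port may differ), and torchrun_cmd is a hypothetical wrapper.

```python
import math
import socket
from typing import List


def get_free_port() -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def torchrun_cmd(script: str, nproc: int) -> List[str]:
    """Hypothetical wrapper: build a torchrun command with its own master port."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={get_free_port()}",
        script,
    ]


cmd = torchrun_cmd("run_fsdp2_model.py", nproc=2)

# IEEE 754: NaN compares unequal to everything, including itself.
nan = float("nan")
nan_equals_itself = nan == nan
nan_detected = math.isnan(nan)
```

The port is released before torchrun binds it, so a small race window remains; the approach reduces EADDRINUSE collisions between sequential tests rather than ruling them out. On the NaN side, torch.testing.assert_close treats NaNs as unequal unless equal_nan=True is passed, which is why a NaN loss on both sides still fails the comparison.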