[fix] Fix CUTLASS grouped GEMM segfault for empty groups by Baibaifan · Pull Request #3037 · NVIDIA/TransformerEngine

Baibaifan · 2026-05-22T04:06:46Z

Description

Handle grouped GEMM calls where all groups are empty.

MoE routing can legally produce a microbatch where no local expert receives
tokens. The PyTorch grouped GEMM wrapper filters those zero-token GEMMs, but
the CUTLASS grouped GEMM path could still be reached with num_gemms == 0 and
then dereference A[0]/B[0]/D[0], causing a native segfault.

Return early after filtering all GEMMs in te_general_grouped_gemm, and add a
defensive num_gemms <= 0 guard in nvte_multi_tensor_gemm.

Add a Hopper/CUTLASS regression test covering all-empty grouped GEMM inputs for
TN, NN, and NT layouts.

Type of change

Bug fix (non-breaking change which fixes an issue)

Changes

Please list the changes introduced in this PR:

Return early from te_general_grouped_gemm when all GEMMs were filtered.
Add a defensive num_gemms <= 0 guard in nvte_multi_tensor_gemm.
Add a Hopper-only regression test for all-empty grouped GEMM inputs under
NVTE_USE_CUTLASS_GROUPED_GEMM=1.
Cover TN, NN, and NT layouts.

Testing

pytest -q tests/pytorch/test_numerics.py::test_grouped_gemm_cutlass_empty_groups -s

Checklist:

I have read and followed the [contributing guidelines]
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: yangfan.bai <yangfan.bai@shopee.com>

greptile-apps · 2026-05-22T04:13:50Z

Greptile Summary

This PR fixes a native segfault in the CUTLASS grouped GEMM path when MoE routing produces a microbatch where every expert receives zero tokens. The PyTorch wrapper already filtered zero-token GEMMs individually, but did not handle the case where all groups were filtered, allowing nvte_multi_tensor_gemm to be invoked with an empty tensor array and dereference A[0]/B[0]/D[0].

gemm.cpp: After the per-group filtering loop, an early return is added when te_A_wrappers is empty, returning bias to match the function's existing normal return value.
cublaslt_gemm.cu: A num_gemms <= 0 guard is added at the top of nvte_multi_tensor_gemm as a secondary defence for direct C-API callers.
test_numerics.py: A Hopper-only regression test exercises TN, NN, and NT layouts with all-zero m_splits; the NT case meaningfully asserts that the wgrad buffer is zeroed in-place, while TN/NN serve primarily as no-crash guards.

Confidence Score: 4/5

The fix is narrowly scoped, targets a real crash path, and does not change behaviour for non-empty inputs. Safe to merge.

Both changes are small, isolated guards with no effect on the non-empty GEMM path. The early return in te_general_grouped_gemm correctly mirrors the existing return value semantics. The only notable weakness is that the TN/NN test assertions are trivially true (empty-vs-empty compare), so those sub-cases are purely a no-crash check rather than a semantic postcondition.

No files require special attention beyond the test assertion coverage noted in the inline comment on test_numerics.py.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/csrc/extensions/gemm.cpp	Adds an early return in te_general_grouped_gemm when all GEMM groups are filtered (te_A_wrappers is empty); returns the bias vector, consistent with the existing return at the bottom of the function. The fix is logically correct: empty groups are already handled by the in-loop continue + zero_(), and the early exit prevents a subsequent call to nvte_multi_tensor_gemm with a zero-length vector.
transformer_engine/common/gemm/cublaslt_gemm.cu	Adds a defensive num_gemms <= 0 guard at the entry of nvte_multi_tensor_gemm. This is a secondary safety net for callers that bypass te_general_grouped_gemm and invoke the C API directly with an empty array.
tests/pytorch/test_numerics.py	Adds a Hopper-only regression test for all-empty grouped GEMM. TN/NN out-tensors are 0-element so the zero-check is trivial (useful only as a no-crash guard); NT out-tensor is a real (n,k) buffer whose in-place zero_() is meaningfully exercised. Test env-var setup/teardown follows the existing pattern.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["te_general_grouped_gemm(A, B, D, ...)"] --> B["Loop over all GEMM groups"]
    B --> C{"te_A.numel()==0 or\nte_B.numel()==0?"}
    C -- Yes --> D["zero_() output if non-empty\nzero_() bias/gelu if grad\ncontinue"]
    D --> B
    C -- No --> E["Build te_A/B/D wrappers\nappend to vectors"]
    E --> B
    B -- loop done --> F{"te_A_wrappers.empty()?\n(ALL groups filtered)"}
    F -- Yes --> G["return bias  ← NEW early return\n(prevents null deref in CUTLASS path)"]
    F -- No --> H["swizzle scales\nbuild NVTETensor vectors"]
    H --> I["nvte_multi_tensor_gemm(...)"]
    I --> J{"num_gemms <= 0?\n← NEW guard"}
    J -- Yes --> K["return (no-op)"]
    J -- No --> L{"is_hopper &&\nuse_cutlass?"}
    L -- No --> M["multi_stream_cublas_gemm"]
    L -- Yes --> N["CUTLASS grouped GEMM\n(accesses A[0]/B[0]/D[0])"]

_{Reviews (1): Last reviewed commit: "[fix] fix grouped GEMM zero-work bug." | Re-trigger Greptile}

greptile-apps · 2026-05-22T04:13:54Z

+    for tensor in out:
+        torch.testing.assert_close(tensor, torch.zeros_like(tensor), rtol=0, atol=0)
+
+
 def _pack_grouped_tensor(grouped_tensor: GroupedTensor, tensors: List[torch.Tensor]) -> None:
    data = grouped_tensor.rowwise_data


Zero-assertion is trivially true for TN and NN layouts

For TN and NN, out is constructed as a list containing a single 0-element tensor (torch.empty(0, n/k, ...)). torch.testing.assert_close on two empty tensors passes unconditionally regardless of any computation, so those two sub-cases only serve as crash/segfault guards. The meaningful assertion only fires for NT, where out[0] is a full (n, k) buffer that the C++ code zeros in-place. Consider either documenting this in a comment or, for TN/NN, adding a small non-empty output tensor and asserting it is zero to provide the same level of postcondition coverage as NT.

ptrendx · 2026-05-29T23:30:28Z

Hi @Baibaifan, could you resolve the conflicts?

Baibaifan requested a review from ksivaman as a code owner May 22, 2026 04:06

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 22, 2026

[fix] fix grouped GEMM zero-work bug.

2b4b822

Signed-off-by: yangfan.bai <yangfan.bai@shopee.com>

Baibaifan force-pushed the zero_groupgemm branch from 58fd0f8 to 2b4b822 Compare May 22, 2026 04:09

greptile-apps Bot reviewed May 22, 2026

View reviewed changes

ptrendx approved these changes May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Fix CUTLASS grouped GEMM segfault for empty groups#3037

[fix] Fix CUTLASS grouped GEMM segfault for empty groups#3037
Baibaifan wants to merge 1 commit into
NVIDIA:mainfrom
Baibaifan:zero_groupgemm

Baibaifan commented May 22, 2026

Uh oh!

greptile-apps Bot commented May 22, 2026

Uh oh!

greptile-apps Bot May 22, 2026

Uh oh!

ptrendx commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Baibaifan commented May 22, 2026

Description

Type of change

Changes

Testing

Checklist:

Uh oh!

greptile-apps Bot commented May 22, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants