CK MXFP8 Group Gemm gfx1250 Enablement by aris134 · Pull Request #613 · ROCm/TransformerEngine

aris134 · 2026-06-08T14:46:19Z

Description

Integrates CK MXFP8 Group GEMM pipeline into TE.

Fixes https://github.com/ROCm/frameworks-internal/issues/16039

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Adds a new submodule, 3rdparty/rocm_libraries, with a sparse checkout of projects/composablekernel. It is needed for the CK MXFP8 group GEMM definitions.
Adds CK MXFP8 group GEMM integration into TE, with run-time arch detection for gfx1250 support.
Adds relevant C++ tests (tests/cpp/operator/test_ck_grouped_mxfp8.cu). Note that the PyTorch grouped linear CK test coverage already includes MXFP8 (set NVTE_ROCM_ENABLE_MXFP8=1 in addition to NVTE_USE_CUTLASS_GROUPED_GEMM=1).

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

matthiasdiener · 2026-06-09T18:03:11Z

+  if (arch == 94) {
+    return GPUArch::GFX942;
+  }
+  if (arch == 95) {
+    return GPUArch::GFX950;
+  }
+  if (arch == 1250) {
+    return GPUArch::GFX1250;
+  }


Could this be a switch?

Yeah that looks nicer, thanks. Done in f3ecda3

matthiasdiener · 2026-06-09T18:10:00Z

+  if (arch == 95) {
+    return GPUArch::GFX950;
+  }
+  if (arch == 1250) {


Should this be 125?

Yeah I think you're right, thanks. Fixed in f3ecda3
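
For readers following the two review threads above, here is a minimal sketch of what the switch-based mapping could look like once both suggestions are applied. It assumes the arch code for gfx1250 is 125, as agreed in the thread. The GPUArch enum here is a stand-in for the one in the PR, and the UNKNOWN member and the arch_from_code name are hypothetical, not taken from the diff.

```cpp
// Stand-in for the PR's GPUArch enum; UNKNOWN is an assumed fallback member.
enum class GPUArch { GFX942, GFX950, GFX1250, UNKNOWN };

// Hypothetical helper: maps a compact arch code (assumed to be 10 * major + minor)
// to a GPUArch value. Unrecognized codes fall through to UNKNOWN.
constexpr GPUArch arch_from_code(int arch) {
  switch (arch) {
    case 94:
      return GPUArch::GFX942;
    case 95:
      return GPUArch::GFX950;
    case 125:  // gfx1250, per the review discussion (not 1250)
      return GPUArch::GFX1250;
    default:
      return GPUArch::UNKNOWN;
  }
}

// Compile-time check that the corrected code is the one that matches.
static_assert(arch_from_code(125) == GPUArch::GFX1250, "gfx1250 should map from 125");
static_assert(arch_from_code(1250) == GPUArch::UNKNOWN, "1250 is not a valid arch code here");
```

Taking the code as a parameter rather than querying the device inside the function keeps the mapping testable on any host; the real detect_gpu_arch in the PR calls cuda::sm_arch directly.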

matthiasdiener · 2026-06-09T18:13:55Z

+  std::vector<mx_grouped_gemm_kargs> descs;
+  descs.reserve(group_num);
+
+  std::vector<std::unique_ptr<ck_tile::DeviceMem>> a_scale_shuffled_bufs;


Does ck_tile::DeviceMem allocate new memory? Can we use a workspace here?

Yes, we can use workspace here. Done in 94b0126
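
To illustrate the idea behind this change, the following is a rough sketch of carving per-group scale buffers out of a single pre-allocated workspace instead of allocating one device buffer per group. It is host-only and does not reflect the code in 94b0126: the WorkspacePlan struct, the plan_workspace function, and the 256-byte alignment are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rounds n up to the next multiple of align (align must be a power of two).
constexpr std::size_t align_up(std::size_t n, std::size_t align) {
  return (n + align - 1) & ~(align - 1);
}

// Hypothetical result type: one byte offset per group plus the total size needed.
struct WorkspacePlan {
  std::vector<std::size_t> offsets;
  std::size_t total_bytes = 0;
};

// Hypothetical helper: lays out one sub-buffer per group back to back,
// starting each sub-buffer on an aligned boundary (256 bytes is an assumed default).
inline WorkspacePlan plan_workspace(const std::vector<std::size_t> &bytes_per_group,
                                    std::size_t align = 256) {
  assert(align != 0 && (align & (align - 1)) == 0);
  WorkspacePlan plan;
  plan.offsets.reserve(bytes_per_group.size());
  std::size_t cursor = 0;
  for (std::size_t bytes : bytes_per_group) {
    plan.offsets.push_back(cursor);
    cursor = align_up(cursor + bytes, align);
  }
  plan.total_bytes = cursor;
  return plan;
}
```

With a plan like this, group i's shuffled scales would live at workspace_base + plan.offsets[i], and the caller only needs to check that plan.total_bytes fits in the workspace it was handed, so no allocation happens on the GEMM path.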

matthiasdiener · 2026-06-09T18:15:18Z

+};
+
+template <typename ScaleType, ck_tile::index_t ScaleBlockSize, bool KStride>
+__global__ void preshuffle_scale_gfx1250_kernel(const ScaleType* __restrict__ src,


Is this the same shuffling as in #605 ? Maybe we can add a comment here.

Not quite the same. The swizzle in #605 groups the scale-K dimension into tiles of 4, whereas this CK preshuffle additionally organizes scales into 32-row M blocks to match the layout expected by the CK gfx1250 WMMA kernel. I've added a comment to help clarify the layout in 479c509.

I think the comment you added goes in the right direction, I would additionally mention what you said here, that this is different from the other mxfp8 gemm swizzling, and that it is expected by CK 1250 WMMA kernel.

Done in f4c97ca

alextmagro

Sorry, my review was left as pending, so some of my comments may have already been addressed. Thanks!

alextmagro · 2026-06-09T04:55:10Z

+}
+
+template <typename T>
+static void fill_randn_cpu(Tensor* t, float scale, int seed) {


Why not use our hipRAND generator in test_common?

Good point. Changed it in 5b4b7fe

alextmagro · 2026-06-09T04:56:16Z

+  return cases;
+}
+
+static const std::vector<CaseConfig> kCases = make_cases();


I think we should probably use seeds generated from test names like the rest of the c++ tests

Yeah, it should now be consistent in 5b4b7fe
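
As a rough illustration of name-derived seeding, the sketch below hashes a test name into a 64-bit seed with FNV-1a. This is not the mechanism used in test_common; the seed_from_name function is hypothetical and only shows why a name-based seed gives each test case a reproducible but distinct random stream.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: derives a deterministic seed from a test name using
// the 64-bit FNV-1a hash. Unlike std::hash, the result is the same on every
// platform and standard library, so failures reproduce across machines.
inline std::uint64_t seed_from_name(const std::string &name) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (unsigned char c : name) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV-1a 64-bit prime
  }
  return h;
}
```

In a GoogleTest-based suite, the name could come from the current test's suite and case names, so that adding or reordering cases does not change the data any existing case sees.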

alextmagro · 2026-06-09T04:57:25Z

 #pragma once

 #include <hip/hip_runtime.h>
+#include "common/util/cuda_runtime.h"


nit: this belongs after common headers

Done in 68ed32a

alextmagro · 2026-06-09T04:57:39Z

 #include "ck_tile/core.hpp"
 #include "ck_tile/ops/epilogue.hpp"
 #include "ck_tile/ops/gemm.hpp"
+#include "ck_tile/host/kernel_launch.hpp"


nit: /host/ goes before /ops/, and /elementwise/ goes before /gemm/

Done in 68ed32a

alextmagro · 2026-06-09T05:04:49Z

+        NVTE_ERROR("ck_tile_mx_grouped_gemm: expected effective A/B scale_inv tensors to be rank-2.");
+      }
+
+      const int64_t M = ctx.transA ? Ad1 : Ad0;


I think these should be size_ts, unless negative values are needed.

Yeah that's fair. I changed that in bdc6b4e

alextmagro · 2026-06-09T05:06:53Z

+            KScale,
+            stream);
+      }
+      descs.emplace_back(mx_grouped_gemm_kargs(


Another stylistic comment, but there are lots of line breaks for functions with 1 parameter per line. I personally prefer a more compact style with only line breaks as needed, especially when variable names are relatively short

Made some additional stylistic changes in bdc6b4e

alextmagro · 2026-06-09T05:07:52Z

+        ok = invoke_mx_grouped_gemm<GroupedGemKernelParam_Wmma,
+                                    AType, BType, CType,
+                                    AScaleType, BScaleType>(descs,ctx,s);
+      });


We need // NOLINT(*) at the end of every TRANSFORMER_ENGINE_TYPE_SWITCH_* statement

Done in bdc6b4e

alextmagro · 2026-06-09T05:09:51Z

+ * License for AMD contributions = MIT. See LICENSE for more information
+ ************************************************************************/
+
+bool ck_tile_mx_grouped_gemm(const NVTETensor* A,


Missing #pragma once, and maybe name file .h instead of .hpp for consistency?

On second thought, can we just add this to ck_grouped_gemm_common.h?

Good catch. Since ck_grouped_gemm.h was meant to be the public API and ck_grouped_gemm_common.h is internal, I moved the declaration to ck_grouped_gemm.h, removing the need for ck_mx_grouped_gemm.h. Changes made in bdc6b4e

alextmagro · 2026-06-09T05:12:07Z

-        }
-        cublas_path();
+    auto *inputA = transformer_engine::convertNVTETensorCheck(A[0]);
+    const bool mxfp8_gemm = transformer_engine::is_mxfp8_scaling(inputA->scaling_mode);


Can probably inline this into the if statement since it is only used once

Done in 457bbc1

alextmagro · 2026-06-09T05:15:04Z

+
+static constexpr ck_tile::index_t ScaleBlockSize = 32;
+
+enum struct MxGemmPipelineType


I do prefer K&R style, and we lean towards that in the codebase. Consider moving open brackets to same line throughout, and maybe using post-increments and attaching references/pointers to the var instead of the type.

Thanks for pointing that out. Made the edits in 2e74a63

matthiasdiener · 2026-06-10T21:22:40Z

+};
+
+static inline GPUArch detect_gpu_arch() {
+  switch (cuda::sm_arch(0)) {


I think you can just use

Suggested change

-  switch (cuda::sm_arch(0)) {
+  switch (cuda::sm_arch()) {

Done in 19151f4

aris134 and others added 28 commits May 6, 2026 14:18

initial commit for CK Tile MXFP8 integration for gfx1250

1f707d7

ck mxfp8 gfx1250 integration builds successfully

e102f00

add entrypoint to ck mx group gemm in caller

52a2887

temporary hacky change to test_numerics for bringup testing

8022777

add warning print to confirm we are in fallback

bc6253d

MXFP8 grouped fwd/bwd now reaches CK path and runs without fallback/crash; remaining issue is numerical validation vs BF16 sequential reference.

d26f52e

add cpp test for ck tile group mxfp8 gemm forward

e295e74

Fix MXFP8 grouped GEMM scale handling for NN/TN/NT

1784045

update ck mxfp8 group gemm gtest to exercise mixed dtypes

fe99bf3

include renamed test file

e7159c4

clean up code

972cea3

Update cublaslt_gemm.cu

c0fabff

address pr comments

3db2e5a

fix ck group mxfp8 dispatch

910d30f

update CMakeLists.txt

1b66d29

Add direct ROCm libraries dependency for CK grouped GEMM

23b505f

Remove redundant MXFP8 env override from grouped linear test

746afea

factor out common definitions from mxfp8 ck ggemm

175855d

add pr comments

f00fb7f

add MXFP8 pre-swizzling for gfx1250 GEMM (#568)

45343f1

CK Tile Group GEMM gfx1250 (#576)

a67bbe9

Merge remote-tracking branch 'origin/gfx1250' into amartin/ck-mxfp8-group-gemm-gfx1250-clean

74744db

Add sparse rocm-libraries submodule for Composable Kernel

7c3f499

update submodule name

7b5ba68

Merge branch 'dev' into amartin/ck-mxfp8-group-gemm-gfx1250-clean

926701a

override CK_ROOT

508613c

fix util

12461ee

add runtime guard for arch

e656341

aris134 requested review from alextmagro and matthiasdiener June 8, 2026 14:46

aris134 added 8 commits June 8, 2026 19:21

Restore unrelated CK grouped GEMM files from dev

baeba44

update dispatch

b18099f

Remove rocm_libraries submodule

d69f40c

Add standalone Composable Kernel submodule

88bb3dd

update gitmodules

e077670

minor fixes

1f764d2

address PR comments

4a262d5

address PR comments

669c4cc

aris134 self-assigned this Jun 9, 2026

aris134 added the "ci-level 1" label (CI test level 1) Jun 9, 2026

aris134 marked this pull request as ready for review June 9, 2026 01:01

aris134 requested a review from ipanfilo June 9, 2026 01:01

matthiasdiener reviewed Jun 9, 2026

aris134 added 3 commits June 9, 2026 21:46

address pr comments: fix gfx1250 arch name and convert if-else to switch in detect_gpu_arch

f3ecda3

use workspace for ck group gemm mxfp8 scales

94b0126

add comment to ck gfx1250 mxfp8 scale swizzle

479c509

aris134 requested a review from matthiasdiener June 10, 2026 15:06

alextmagro requested changes Jun 10, 2026

aris134 added 5 commits June 10, 2026 19:37

change random generation in test_ck_grouped_mxfp8.cu to use pre-existing utilities from test_common.cu

5b4b7fe

address nits in ck_grouped_gemm_common.h

68ed32a

stylistic changes

2e74a63

address pr comment: explicitly mention purpose of ck gfx1250 swizzle over existing implementation

f4c97ca

address PR comments

bdc6b4e

matthiasdiener reviewed Jun 10, 2026

aris134 added 4 commits June 10, 2026 21:25

inline mxfp8_gemm bool into if statement

457bbc1

address nit

19151f4

Merge remote-tracking branch 'origin/dev' into amartin/ck-mxfp8-group-gemm-gfx1250-clean

b4b36c7

add warn fallback for mxfp8 ck

231c916

aris134 requested review from alextmagro and matthiasdiener June 10, 2026 21:51

