Add INT8 support for LDS transpose load #2214

Open
stefankoncarevic wants to merge 6 commits into lds-transpose-load-fp8 from lds-transpose-load-int8

@stefankoncarevic

⚠️ Do not merge until #2210 is merged; this PR depends on the LDS transpose load FP8 support.

Motivation

Extends the LDS transpose load optimization to INT8 data types for GEMM and Attention kernels on gfx950. This enables hardware-accelerated transposed loads (ds_read_tr8_b64) for all INT8 MFMAs (16x16x32, 16x16x64, 32x32x16, 32x32x32), improving performance for INT8 quantized inference.

Technical Details

  • LdsTransposeLoad.cpp: Added INT8 type support, offset formulas for (16,64) and (32,32) geometries, and double-rate K-coverage logic
  • AccelEmitter.cpp: Added K-dimension transformation for INT8 MFMAs with kBase=16 when kpack=1
  • RockDialect.cpp/RockOps.td: Updated validation and type support for INT8 LDS transpose

Test Plan

  • Added MLIR unit tests
  • Added E2E tests
  • All tests verified on gfx950 hardware with numerical correctness validation


@stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 3 times, most recently from f3176a8 to a75ab7a (January 29, 2026 14:10)
@stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 2 times, most recently from 24d9bf6 to 076a998 (February 27, 2026 13:37)

This commit extends the LDS transpose load optimization to support
workgroups with 8 waves (blockSize=512) and 16 waves (blockSize=1024).

Previously, the optimization was limited to 1-4 waves only. This
restriction has been lifted to enable LDS transpose load for larger
workgroup sizes commonly used in high-performance GEMM configurations.

Changes:
- Extended numWaves limit from 4 to 16 in decideLDSTransposeForOperands()
- Added wave grid layout computation for 8 waves:
  - 2×4, 4×2 (preferred balanced layouts)
  - 1×8, 8×1 (fallback layouts)
- Added wave grid layout computation for 16 waves:
  - 4×4 (preferred balanced layout)
  - 2×8, 8×2 (semi-balanced layouts)
  - 1×16, 16×1 (fallback layouts)
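
The layout preference above (balanced grids first, elongated grids as fallbacks) can be sketched as a ranking over the factor pairs of the wave count. This is an illustration only: `candidateWaveGrids` is a hypothetical helper, not the code in decideLDSTransposeForOperands(), and the real selection may also take tile sizes or other constraints into account.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not taken from the PR): enumerate every two-factor
// decomposition of numWaves and order the candidates from most balanced
// to least balanced.
std::vector<std::pair<int, int>> candidateWaveGrids(int numWaves) {
  assert(numWaves > 0 && "numWaves must be positive");
  std::vector<std::pair<int, int>> grids;
  for (int first = 1; first <= numWaves; ++first)
    if (numWaves % first == 0)
      grids.emplace_back(first, numWaves / first);
  // The product is fixed, so a smaller longest side means a more balanced
  // grid. stable_sort keeps the enumeration order (smaller first dimension
  // first) among equally balanced candidates.
  std::stable_sort(grids.begin(), grids.end(),
                   [](const std::pair<int, int> &a,
                      const std::pair<int, int> &b) {
                     return std::max(a.first, a.second) <
                            std::max(b.first, b.second);
                   });
  return grids;
}
```

Under these assumptions, `candidateWaveGrids(8)` yields 2×4, 4×2, 1×8, 8×1 and `candidateWaveGrids(16)` yields 4×4, 2×8, 8×2, 1×16, 16×1, which is the order listed above.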

Updated tests:
- lds_transpose_attributes_toblockwise.mlir: Changed CHECK-NOT to
  CHECK for 8 and 16 wave tests, confirming LDS transpose is now
  enabled for these configurations
- PrLdsTransposeLoad.toml: Added e2e test cases for 8-wave (4×2, 1×8)
  and 16-wave (8×2, 1×16) grid configurations

When one operand uses regular load and the other uses LDS transpose
load, the regular load must use a compatible K-access pattern.

The new formula is only applied when:
- useLdsTransposeLoad is true (hybrid scenario)
- kVec >= kBase (enough elements to decompose)

This ensures correct data alignment between regular and transpose
loads for MFMA operations, and prevents assertion failures when
kpack < kBase.
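
The two gating conditions can be written as a single predicate. This is a minimal sketch: `shouldUseHybridKAccess` is a hypothetical name, and the K-access formula itself (the blk_d/blk_k split) is not reproduced here because the commit message does not spell it out.

```cpp
#include <cassert>

// Hypothetical predicate (not taken from the PR): decide whether the
// regular-load operand should switch to the hybrid K-access pattern.
bool shouldUseHybridKAccess(bool useLdsTransposeLoad, int kVec, int kBase) {
  assert(kVec > 0 && kBase > 0 && "vector lengths must be positive");
  // Hybrid scenario only: the other operand uses LDS transpose load, and
  // there are at least kBase elements per vector to decompose.
  return useLdsTransposeLoad && kVec >= kBase;
}
```

When the predicate is false, for example `shouldUseHybridKAccess(true, 8, 16)`, the existing K-access pattern stays in place, which is how the kpack < kBase case avoids the new formula.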

Changes:
- Add useLdsTransposeLoad parameter to wrapLDSBufferForLoad
- Implement hybrid K-access formula with blk_d/blk_k split
- Pass LDS transpose state from BlockwiseGemmToThreadwise
- Update tests in PrLdsTransposeLoad.toml
- Add GEMM1 LDS transpose tests (V transpose + P prefetch) to nightly
- Create PrLdsTransposeLoadAttention.toml with 14 quick PR tests

- Add INT8 (i8) support in LdsTransposeLoad.cpp for ds_read_tr8_b64
- Support mfma_i32_16x16x32_i8, mfma_i32_16x16x64_i8, mfma_i32_32x32x16_i8, mfma_i32_32x32x32_i8
- Add INT8 16x64 and 32x32 MFMA geometries with double-rate K coverage
- Handle kpack=1 case for INT8 MFMAs with kBase=16 in AccelEmitter.cpp
- Add validation for INT8 MFMA geometries in RockDialect.cpp
- Add e2e tests for INT8 LDS transpose in GEMM and Attention

Disable LDS transpose load for INT8 convolutions when N=1600 (40x40
spatial output) and either K<=M or K>2*M. This fixes two significant
performance regressions:
- 1x64x40x40 K=64: -62.87% regression
- 1x384x40x40 K=128: -43.79% regression

The heuristic has no impact on GEMM INT8 (no problems with N=1600)
and does not affect any CONV INT8 improvements.
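
The heuristic can be sketched as a predicate over the GEMM dimensions. This is an illustration under stated assumptions: `disableLdsTransposeForInt8Conv` and its parameters are hypothetical names, and the condition is read as N == 1600 combined with (K <= M or K > 2*M), which is an interpretation of the wording above rather than a quote from the code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate (not taken from the PR): returns true when LDS
// transpose load should be turned off for an INT8 convolution.
bool disableLdsTransposeForInt8Conv(bool isConv, bool isInt8, int64_t m,
                                    int64_t n, int64_t k) {
  assert(m > 0 && n > 0 && k > 0 && "GEMM dimensions must be positive");
  // GEMM INT8 and non-INT8 problems are never affected.
  if (!isConv || !isInt8)
    return false;
  const bool is40x40Output = (n == 1600);
  // Assumed grouping: K at or below M, or K above twice M.
  const bool kOutsideRange = (k <= m) || (k > 2 * m);
  return is40x40Output && kOutsideRange;
}
```

With this reading, a convolution with M=128, N=1600, K=200 keeps LDS transpose load enabled, because K lies in the range (M, 2*M].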
@stefankoncarevic force-pushed the lds-transpose-load-int8 branch from 3ccbf35 to a6318df (March 2, 2026 10:05)