[Frontend] Represent kernel body as ordered steps; multi-step indirect#283
Open
YWHyuk wants to merge 1 commit into
Open
[Frontend] Represent kernel body as ordered steps; multi-step indirect#283YWHyuk wants to merge 1 commit into
YWHyuk wants to merge 1 commit into
Conversation
ddd5ebe to
a63020f
Compare
YWHyuk
commented
Jun 26, 2026
Collaborator
Author
There was a problem hiding this comment.
self.get_dma_info(name, index)
| @@ -584,14 +599,8 @@ def load(self, name: str, index: sympy.Expr): | |||
| dram_shape, tile_shape, dram_stride, tile_stride, int(padding)) | |||
| self.cse.generate(dma_buffer, code, assignment = False) # FIXME: assignment = False does not support caching | |||
| # FIXME. Any good idea? | ||
| out = sram_var | ||
| self.register_var_info(out, [compute_vec_size, mlir_dtype]) | ||
| with self.override_buffer_cse(buffer=load_buffer): |
a63020f to
2bf2e60
Compare
Rework per-kernel codegen so the loop body is an ordered list of load->compute->store steps (class Step + push_step + codegen_loops iteration) instead of the fixed dma_loads/compute/dma_stores buffers. A step may be compute-only or transfer-only; the compute_idx loop is emitted only for steps that actually have compute content. Drop the ad-hoc self.masks buffer: the reduction-tail mask (get_mask) now emits into the compute buffer, since a mask is just compute. Use the step model to express indirect (gather/scatter) access cleanly. When the indirect index is produced in the same pass, convert_indirect_indexing pushes a new step for the offset build, so the index load, the offset build and the gather become separate steps bridged through spad. This replaces the compute_dependecy -> dma_stores hack in both convert_indirect_indexing and load(). push_step clones the CSE so the name counter is shared but the dedup cache is per-step (each step is its own compute_idx region). Validated: x[idx+2]+1 emits two steps and matches CPU; add/matmul/reduce/softmax/layernorm and gather/scatter unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EEfyUDpkMLRYZ2NAMbb3jN
2bf2e60 to
65bd365
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Reworks per-kernel codegen so the loop body is an ordered list of
load -> compute -> storesteps (aStepclass +push_step()+ acodegen_loopsstep loop) instead of the fixeddma_loads/compute/dma_storesbuffers. A step may be compute-only or transfer-only; thecompute_idxloop is emitted only for steps that actually carry compute.Two cleanups ride on top:
self.masksbuffer. The reduction-tail mask (get_mask) now emits into the compute buffer, since a mask is just compute.convert_indirect_indexingpushes a new step for the offset build, so the index load, the offset build and the gather become separate steps bridged through spad. This replaces thecompute_dependecy -> dma_storeshack in bothconvert_indirect_indexingandload().push_stepclones the CSE so the tmp-name counter is shared but the dedup cache is per-step (each step is its owncompute_idxSSA region).Validation
x[idx+2]+1(computed index -> the multi-step path) emits twocompute_idxloops and matches CPU (allclose, maxdiff 0).test_add,test_matmul,test_reduce,test_softmax,test_layernorm, plus simple gather / 2D gather / scatter all unchanged. The non-multi-step single gather path stays one step (byte-identical).Follow-up
The next step is to make the indirect offset an explicit
togsim.transferoperand instead of smuggling it through the indexaffine.apply. That is blocked bymemref.dma_startbeing a registered op (cannot carry the extra operand) and needs a directtogsim.transfer -> gemminilowering; tracked in #282.🤖 Generated with Claude Code