[Frontend] Represent kernel body as ordered steps; multi-step indirect by YWHyuk · Pull Request #283 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-26T11:37:13Z

What

Reworks per-kernel codegen so the loop body is an ordered list of load -> compute -> store steps (a Step class + push_step() + a codegen_loops step loop) instead of the fixed dma_loads / compute / dma_stores buffers. A step may be compute-only or transfer-only; the compute_idx loop is emitted only for steps that actually carry compute.
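For readers who have not opened the diff, here is a minimal sketch of the step model described above. It is not the code in this PR: the field names (`loads`, `compute`, `stores`), the `KernelBody` wrapper and the emitted strings are assumptions made for illustration. Only `Step`, `push_step`, `codegen_loops` and `compute_idx` come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One load -> compute -> store stage (field names are hypothetical)."""
    loads: List[str] = field(default_factory=list)
    compute: List[str] = field(default_factory=list)
    stores: List[str] = field(default_factory=list)


class KernelBody:
    """Hypothetical container holding the ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Step] = [Step()]

    @property
    def current(self) -> Step:
        return self.steps[-1]

    def push_step(self) -> Step:
        """Open a new step; later loads/compute/stores land in it."""
        self.steps.append(Step())
        return self.current

    def codegen_loops(self) -> List[str]:
        """Emit steps in order; the compute loop appears only if a step has compute."""
        out: List[str] = []
        for step in self.steps:
            out.extend(step.loads)
            if step.compute:
                out.append("for compute_idx:")
                out.extend("    " + line for line in step.compute)
            out.extend(step.stores)
        return out


# One plausible arrangement for a computed-index gather (illustrative only).
body = KernelBody()
body.current.loads.append("load idx -> spad")          # transfer-only step
body.push_step()
body.current.compute.append("offset = idx + 2")        # offset build
body.push_step()
body.current.loads.append("gather x[offset] -> spad")  # gather
body.current.compute.append("y = x_tile + 1")
body.current.stores.append("store y")
lines = body.codegen_loops()
```

In this sketch the first step is transfer-only, so it contributes no `compute_idx` loop, while the other two steps each get their own.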

Two cleanups ride on top:

Drop the ad-hoc self.masks buffer. The reduction-tail mask (get_mask) now emits into the compute buffer, since a mask is just compute.
Express indirect (gather/scatter) as multiple steps. When the indirect index is produced in the same pass, convert_indirect_indexing pushes a new step for the offset build, so the index load, the offset build and the gather become separate steps bridged through spad. This replaces the compute_dependecy -> dma_stores hack in both convert_indirect_indexing and load(). push_step clones the CSE so the tmp-name counter is shared but the dedup cache is per-step (each step is its own compute_idx SSA region).
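The CSE behaviour in the second item can be pictured with a small stand-in. This is a sketch under assumptions, not Inductor's CSE class: `MiniCSE` and `clone_for_new_step` are hypothetical names, and the only property modelled is the one stated above (shared tmp-name counter, per-step dedup cache).

```python
import itertools
from typing import Dict, Iterator, Optional


class MiniCSE:
    """Toy common-subexpression cache (hypothetical, for illustration)."""

    def __init__(self, counter: Optional[Iterator[int]] = None) -> None:
        # The counter object is shared between clones so names never collide.
        self.counter = counter if counter is not None else itertools.count()
        # The dedup cache belongs to one step only.
        self.cache: Dict[str, str] = {}

    def generate(self, expr: str) -> str:
        """Return the cached name for expr, or allocate a fresh tmp name."""
        if expr in self.cache:
            return self.cache[expr]
        name = f"tmp{next(self.counter)}"
        self.cache[expr] = name
        return name

    def clone_for_new_step(self) -> "MiniCSE":
        """New step: same counter, empty cache."""
        return MiniCSE(counter=self.counter)


step0 = MiniCSE()
a = step0.generate("idx + 2")   # fresh name in step 0
b = step0.generate("idx + 2")   # deduplicated within step 0
step1 = step0.clone_for_new_step()
c = step1.generate("idx + 2")   # not reused across steps, so a new name
```

The point of the per-step cache is that a value defined inside one step's `compute_idx` region is not visible in the next one, so reusing its name there would reference an undefined value.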

Validation

x[idx+2]+1 (computed index -> the multi-step path) emits two compute_idx loops and matches CPU (allclose, maxdiff 0).
Regression: test_add, test_matmul, test_reduce, test_softmax, test_layernorm, plus simple gather / 2D gather / scatter all unchanged. The non-multi-step single gather path stays one step (byte-identical).
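As a reference for what the computed-index case must produce, the following pure-Python model compares the direct expression with a staged version that mirrors the index load, offset build and gather described earlier. The function names and the list-based "spad" buffers are assumptions for illustration; the actual test in the PR compares simulator output against CPU and is not reproduced here.

```python
from typing import List


def gather_direct(x: List[float], idx: List[int]) -> List[float]:
    """Reference semantics of x[idx + 2] + 1."""
    return [x[i + 2] + 1 for i in idx]


def gather_staged(x: List[float], idx: List[int]) -> List[float]:
    """Same result, split into stages bridged through intermediate buffers."""
    spad_idx = list(idx)                   # stage 1: index load
    spad_off = [i + 2 for i in spad_idx]   # stage 2: offset build (compute)
    spad_x = [x[o] for o in spad_off]      # stage 3: gather using the offsets
    return [v + 1 for v in spad_x]         # trailing compute on gathered data


x = [float(v) for v in range(8)]
idx = [0, 3, 1]
direct = gather_direct(x, idx)
staged = gather_staged(x, idx)
```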

Follow-up

The next step is to make the indirect offset an explicit togsim.transfer operand instead of smuggling it through the index affine.apply. That is blocked by memref.dma_start being a registered op (cannot carry the extra operand) and needs a direct togsim.transfer -> gemmini lowering; tracked in #282.

🤖 Generated with Claude Code

YWHyuk · 2026-06-26T11:47:46Z

self.get_dma_info(name, index)

YWHyuk · 2026-06-26T11:48:24Z

@@ -584,14 +599,8 @@ def load(self, name: str, index: sympy.Expr):
                                  dram_shape, tile_shape, dram_stride, tile_stride, int(padding))
        self.cse.generate(dma_buffer, code, assignment = False) # FIXME: assignment = False does not support caching


use self.dma_loads

YWHyuk · 2026-06-26T11:50:34Z

-            # FIXME. Any good idea?
-            out = sram_var
-            self.register_var_info(out, [compute_vec_size, mlir_dtype])
+        with self.override_buffer_cse(buffer=load_buffer):


use self.loads

Rework per-kernel codegen so the loop body is an ordered list of load->compute->store steps (class Step + push_step + codegen_loops iteration) instead of the fixed dma_loads/compute/dma_stores buffers. A step may be compute-only or transfer-only; the compute_idx loop is emitted only for steps that actually have compute content.

Drop the ad-hoc self.masks buffer: the reduction-tail mask (get_mask) now emits into the compute buffer, since a mask is just compute.

Use the step model to express indirect (gather/scatter) access cleanly. When the indirect index is produced in the same pass, convert_indirect_indexing pushes a new step for the offset build, so the index load, the offset build and the gather become separate steps bridged through spad. This replaces the compute_dependecy -> dma_stores hack in both convert_indirect_indexing and load(). push_step clones the CSE so the name counter is shared but the dedup cache is per-step (each step is its own compute_idx region).

Validated: x[idx+2]+1 emits two steps and matches CPU; add/matmul/reduce/softmax/layernorm and gather/scatter unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EEfyUDpkMLRYZ2NAMbb3jN

YWHyuk force-pushed the feature/codegen-ordered-steps branch 2 times, most recently from ddd5ebe to a63020f Compare June 26, 2026 11:46

YWHyuk commented Jun 26, 2026

YWHyuk force-pushed the feature/codegen-ordered-steps branch from a63020f to 2bf2e60 Compare June 26, 2026 11:54

YWHyuk force-pushed the feature/codegen-ordered-steps branch from 2bf2e60 to 65bd365 Compare June 26, 2026 12:19

YWHyuk mentioned this pull request Jun 26, 2026

[Frontend] Carry gather/scatter offset as an explicit transfer descriptor (#282) #285

Open

YWHyuk wants to merge 1 commit into feature/togsim-cpp-trace from feature/codegen-ordered-steps
