[Frontend] Represent kernel body as ordered steps; multi-step indirect #283

Open
YWHyuk wants to merge 1 commit into feature/togsim-cpp-trace from feature/codegen-ordered-steps

@YWHyuk commented Jun 26, 2026

Collaborator

What

Reworks per-kernel codegen so the loop body is an ordered list of load -> compute -> store steps (a Step class + push_step() + a codegen_loops step loop) instead of the fixed dma_loads / compute / dma_stores buffers. A step may be compute-only or transfer-only; the compute_idx loop is emitted only for steps that actually carry compute.
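
To make the step model concrete, here is a minimal sketch of an ordered-step body. It is not the code in this PR: only the names `Step`, `push_step`, `codegen_loops` and `compute_idx` come from the description, while the `KernelBody` class, the field names and the emitted strings are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One load -> compute -> store stage (hypothetical field layout)."""
    loads: List[str] = field(default_factory=list)    # transfer-in lines
    compute: List[str] = field(default_factory=list)  # compute lines
    stores: List[str] = field(default_factory=list)   # transfer-out lines


class KernelBody:
    """Hypothetical container standing in for the per-kernel codegen state."""

    def __init__(self) -> None:
        self.steps: List[Step] = [Step()]

    @property
    def current(self) -> Step:
        return self.steps[-1]

    def push_step(self) -> Step:
        """Open a new step; later emissions land in it."""
        self.steps.append(Step())
        return self.current

    def codegen_loops(self) -> List[str]:
        """Emit steps in order; wrap compute in a loop only if the step has any."""
        out: List[str] = []
        for step in self.steps:
            out.extend(step.loads)
            if step.compute:
                out.append("for compute_idx:")
                out.extend("    " + line for line in step.compute)
            out.extend(step.stores)
        return out


body = KernelBody()
body.current.loads.append("load idx -> spad")        # transfer-only step
body.push_step()
body.current.compute.append("off = idx + 2")         # compute-only step
body.push_step()
body.current.loads.append("gather x[off] -> spad")   # full load/compute/store step
body.current.compute.append("y = x_tile + 1")
body.current.stores.append("store y")
lines = body.codegen_loops()
```

In this toy run the transfer-only first step produces no `for compute_idx:` line, so only two loops appear in `lines`.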

Two cleanups ride on top:

  • Drop the ad-hoc self.masks buffer. The reduction-tail mask (get_mask) now emits into the compute buffer, since a mask is just compute.
  • Express indirect (gather/scatter) as multiple steps. When the indirect index is produced in the same pass, convert_indirect_indexing pushes a new step for the offset build, so the index load, the offset build and the gather become separate steps bridged through spad. This replaces the compute_dependecy -> dma_stores hack in both convert_indirect_indexing and load(). push_step clones the CSE so the tmp-name counter is shared but the dedup cache is per-step (each step is its own compute_idx SSA region).
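
The CSE-cloning behaviour described above (shared temp-name counter, per-step dedup cache) can be sketched roughly as follows. `StepCSE` and `clone_for_new_step` are hypothetical names; this does not reproduce the project's actual CSE class, only the sharing rule.

```python
import itertools
from typing import Dict, Iterator, Optional


class StepCSE:
    """Toy CSE: maps an expression string to the temp name that holds it."""

    def __init__(self, counter: Optional[Iterator[int]] = None) -> None:
        # The counter object is shared between clones, so names never collide.
        self.counter = counter if counter is not None else itertools.count()
        # The cache is private to one step, so values are not reused across steps.
        self.cache: Dict[str, str] = {}

    def generate(self, expr: str) -> str:
        if expr in self.cache:
            return self.cache[expr]
        name = f"tmp{next(self.counter)}"
        self.cache[expr] = name
        return name

    def clone_for_new_step(self) -> "StepCSE":
        """Same counter, fresh cache (assumed semantics of the clone)."""
        return StepCSE(counter=self.counter)


step0 = StepCSE()
a = step0.generate("idx + 2")   # first occurrence gets a new name
b = step0.generate("idx + 2")   # deduplicated within the same step
step1 = step0.clone_for_new_step()
c = step1.generate("idx + 2")   # new step: no cache hit, new name
```

Keeping the cache per step matters because a temp defined inside one `compute_idx` region is not visible in the next one; sharing the counter keeps names unique across the whole kernel.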

Validation

  • x[idx+2]+1 (computed index -> the multi-step path) emits two compute_idx loops and matches CPU (allclose, maxdiff 0).
  • Regression: test_add, test_matmul, test_reduce, test_softmax, test_layernorm, plus simple gather / 2D gather / scatter all unchanged. The non-multi-step single gather path stays one step (byte-identical).
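
As a rough illustration of what the `x[idx+2]+1` check compares, the sketch below emulates the multi-step path in plain Python and checks it against a direct reference. The `spad` dictionary and the function names are assumptions for illustration; the real test runs the generated kernel and compares against CPU.

```python
from typing import Dict, List


def reference(x: List[float], idx: List[int]) -> List[float]:
    """Direct evaluation of x[idx + 2] + 1."""
    return [x[i + 2] + 1 for i in idx]


def multi_step(x: List[float], idx: List[int]) -> List[float]:
    """Same result, staged through a dict standing in for the scratchpad."""
    spad: Dict[str, list] = {}
    spad["idx"] = list(idx)                      # index load
    spad["off"] = [i + 2 for i in spad["idx"]]   # offset build (compute)
    spad["x"] = [x[o] for o in spad["off"]]      # gather using the built offsets
    return [v + 1 for v in spad["x"]]            # trailing compute


x = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
idx = [0, 3, 1]
expected = reference(x, idx)
actual = multi_step(x, idx)
maxdiff = max(abs(a - e) for a, e in zip(actual, expected))
```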

Follow-up

The next step is to make the indirect offset an explicit togsim.transfer operand instead of smuggling it through the index affine.apply. That is blocked by memref.dma_start being a registered op (cannot carry the extra operand) and needs a direct togsim.transfer -> gemmini lowering; tracked in #282.

🤖 Generated with Claude Code

@YWHyuk force-pushed the feature/codegen-ordered-steps branch 2 times, most recently from ddd5ebe to a63020f, June 26, 2026 11:46

Collaborator Author

self.get_dma_info(name, index)

@@ -584,14 +599,8 @@ def load(self, name: str, index: sympy.Expr):
dram_shape, tile_shape, dram_stride, tile_stride, int(padding))
self.cse.generate(dma_buffer, code, assignment = False) # FIXME: assignment = False does not support caching

Collaborator Author

use self.dma_loads

# FIXME. Any good idea?
out = sram_var
self.register_var_info(out, [compute_vec_size, mlir_dtype])
with self.override_buffer_cse(buffer=load_buffer):

Collaborator Author

use self.loads

@YWHyuk force-pushed the feature/codegen-ordered-steps branch from a63020f to 2bf2e60, June 26, 2026 11:54

Rework per-kernel codegen so the loop body is an ordered list of
load->compute->store steps (class Step + push_step + codegen_loops
iteration) instead of the fixed dma_loads/compute/dma_stores buffers.
A step may be compute-only or transfer-only; the compute_idx loop is
emitted only for steps that actually have compute content.

Drop the ad-hoc self.masks buffer: the reduction-tail mask (get_mask)
now emits into the compute buffer, since a mask is just compute.

Use the step model to express indirect (gather/scatter) access cleanly.
When the indirect index is produced in the same pass,
convert_indirect_indexing pushes a new step for the offset build, so the
index load, the offset build and the gather become separate steps bridged
through spad. This replaces the compute_dependecy -> dma_stores hack in
both convert_indirect_indexing and load(). push_step clones the CSE so the
name counter is shared but the dedup cache is per-step (each step is its
own compute_idx region). Validated: x[idx+2]+1 emits two steps and matches
CPU; add/matmul/reduce/softmax/layernorm and gather/scatter unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EEfyUDpkMLRYZ2NAMbb3jN