Skip to content

[Frontend] Carry gather/scatter offset as an explicit transfer descriptor (#282)#285

Open
YWHyuk wants to merge 2 commits into
feature/codegen-ordered-stepsfrom
feature/indirect-offset-descriptor
Open

[Frontend] Carry gather/scatter offset as an explicit transfer descriptor (#282)#285
YWHyuk wants to merge 2 commits into
feature/codegen-ordered-stepsfrom
feature/indirect-offset-descriptor

Conversation

@YWHyuk

@YWHyuk YWHyuk commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #283 (base = feature/codegen-ordered-steps).

Makes the gather/scatter offset an explicit transfer descriptor instead of smuggling it through the DRAM-index affine.apply{indirect_access}.

  • convert_indirect_indexing returns the offset spad (drops the + sympy.Symbol(out) fold and the scalar-load/index_cast); the index stays clean (base only).
  • emit_transfer carries it as a togsim.transfer operand.
  • decompose_transfer lifts that operand to a memref.dma_start {indirect_offset = @spad_symbol} attribute. memref.dma_start is a registered op and rejects an extra operand, but it accepts the attribute (verified) -- so we do not need to drop memref.dma_start (the original concern in Indirect access: make offset an explicit togsim.transfer operand (blocked by memref.dma_start; needs direct togsim.transfer -> gemmini lowering) #282).
  • lower_dma_to_gemmini reads the attribute and resolves the global for CONFIG4 (drops _find_indirect).
  • build_skeleton adds the offset spad to the gather DMA's read_bufs, so the offset-build -> gather dependency edge forms in the trace.

Validation

  • Spike functional: computed-index gather + scatter allclose; add/matmul/reduce/softmax regression 0.
  • C++ trace timing path end-to-end (rebuilt TOGSim): allclose; the generated trace.cpp gather togsim_dma read_bufs now includes the offset spad.

Follow-up

Indirect addressing (scattered-DMA timing) in the new trace path is still missing -- tracked in #284. Remaining cleanup: the "tmp"-string indirect detection -> explicit indirect_indexing symbol set.

🤖 Generated with Claude Code

YWHyuk and others added 2 commits June 26, 2026 22:12
…ptor

Replace the affine.apply{indirect_access} symbol smuggle with an explicit
offset descriptor. convert_indirect_indexing returns the offset spad instead
of folding sympy.Symbol(out) into the index; emit_transfer carries it as a
togsim.transfer operand; decompose_transfer lifts that operand to a
memref.dma_start {indirect_offset = @spad_symbol} attribute (memref.dma_start
is a registered op and rejects an extra operand, but accepts the attribute);
lower_dma_to_gemmini reads the attribute and resolves the global for CONFIG4
(drops _find_indirect); build_skeleton adds the offset spad to the gather DMA
read_bufs so the offset-build -> gather dependency edge forms in the trace.

The index stays clean (base only). Validated on both paths: Spike functional
(computed-index gather + scatter allclose, pointwise/reduce regression 0) and
the C++ trace timing path end-to-end (allclose; gather togsim_dma read_bufs now
includes the offset spad).

Indirect addressing (scattered-DMA timing) in the new trace path is a separate
gap tracked in issue #284.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EEfyUDpkMLRYZ2NAMbb3jN
…ride, compute-loop multi-dim offset

Three refinements on top of the explicit offset descriptor:

1. Detect indirect access by an explicit symbol set instead of an "tmp"
   substring match. indirect_indexing now records str(index_var) in
   self.indirect_symbols; a _has_indirect(expr) helper tests the index
   free_symbols against it; the former "tmp"-string sites (store,
   get_dma_info, convert entry/indirect_dims/stride) use it. Removes the
   now-dead _find_indirect from lower_dma_to_gemmini.

2. Single indirect dim: pass the raw index spad and let the MVIN apply the
   gather stride per position (CONFIG4 offset_stride) instead of a
   vector_load/muli/vector_store round-trip that pre-multiplied the stride.
   emit_transfer carries offset_stride; decompose copies the attribute;
   lower programs CONFIG4 with it (was hardcoded to 1).

3. Multi indirect dim (e.g. x[ix, iy]): the offset is a sum of strided
   indices, which a single CONFIG4 channel cannot do, so the sum stays in
   the kernel -- but build it in the compute loop (chunked by
   compute_vec_size, not a single tile-wide vector) and store it to a
   DEDICATED offset spad so an index that is live elsewhere (x[ix, iy] + ix)
   is not clobbered. push_step separates the offset-build loop from the
   gather that reads it.

Adds multi-dim gather and index-reuse cases to test_indirect_access.
Validated: indirect/scatter/embedding + the two new multi-dim cases pass;
add/matmul/softmax regression 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EEfyUDpkMLRYZ2NAMbb3jN
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant