Summary
Indirect (gather/scatter) access is still represented in an ad-hoc, "apply-dependent" way. We want to make the per-position offset an explicit descriptor operand on togsim.transfer instead of smuggling it through the DRAM-index affine.apply. This is blocked by memref.dma_start being a registered op that cannot carry the extra operand, so it is deferred: doing it properly requires dropping memref.dma_start from the transfer-lowering path and converting togsim.transfer directly to Gemmini instructions. Filing this so the work is tracked.
Background — how indirect is represented today
For x[idx]-style access the offset is carried implicitly:
- Frontend detects indirect by string-matching "tmp" in the sympy index (5 sites in mlir_codegen_backend.py).
- convert_indirect_indexing builds the per-position offset into a spad buffer, then smuggles it back as a fake sympy symbol: return index + sympy.Symbol(str(out)).
- parse_indices threads that symbol as an indirect_dims affine symbol operand, producing affine.apply (...)[%off] {indirect_access} whose map result is just the clean base (the symbol is carried only so it can be found later).
- Both consumers recover the offset by pattern-matching that affine.apply:
  - passes/lower_dma_to_gemmini.py _find_indirect (Spike functional path -> CONFIG4),
  - passes/build_tog.py _process_dram_indices ("indirect_access" in apply.attributes).
This is fragile (string matching, symbol smuggling, two independent pattern-matchers) and couples indirect to the index affine.apply.
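To make the string-matching fragility concrete, here is a minimal sketch in plain Python. It uses strings as stand-ins for sympy index expressions, and looks_indirect is a hypothetical helper, not the code in mlir_codegen_backend.py. It only mimics the style of a "tmp" substring check.

```python
# Toy model: index expressions are plain strings standing in for sympy
# expressions. looks_indirect is a hypothetical name used for illustration.

def looks_indirect(index_expr: str) -> bool:
    """Name-based detection, in the style of a "tmp" substring check."""
    return "tmp" in index_expr


# A genuinely indirect index: clean base plus a smuggled offset symbol.
gather_index = "8*x0 + tmp3"

# A direct index that merely uses a variable with "tmp" in its name.
direct_index = "8*x0 + tmp_stride"

print(looks_indirect(gather_index))   # True
print(looks_indirect(direct_index))   # True as well: a false positive
print(looks_indirect("8*x0 + x1"))    # False
```

Any detection keyed on a name has this failure mode. An explicit operand avoids it because indirection is then stated by the op rather than inferred from a symbol name.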
Goal — explicit descriptor on togsim.transfer
Make the offset a first-class transfer operand, as agreed in the masked-DMA design (an offset descriptor analogous to the planned mask descriptor):
"togsim.transfer"(dram, BASE_idx, sram, sram_idx, tag, dma_type, vst, OFFSET_buf)
{..., indirect = true} : (..., memref<...xi64>) -> ()
BASE_idx is a clean affine index; OFFSET_buf is the per-position offset spad buffer (an SSA memref). Consumers read the operand directly — no affine.apply{indirect_access}, no _find_indirect.
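A minimal Python sketch of what "consumers read the operand directly" could look like. The Transfer class and its field names are hypothetical stand-ins for the op, and the base-plus-offset addressing shown in the output string is an assumption about how the per-position offset is applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transfer:
    """Toy stand-in for a togsim.transfer op; field names are illustrative."""
    dram: str
    base_idx: str                      # clean affine index, no smuggled symbol
    sram: str
    offset_buf: Optional[str] = None   # SSA name of the offset spad memref

    @property
    def indirect(self) -> bool:
        # Indirection follows from the operand list, not from a string
        # pattern or an attribute on a separate affine.apply.
        return self.offset_buf is not None


def describe(t: Transfer) -> str:
    """What a consumer does in this sketch: read the operand directly."""
    if t.indirect:
        return f"gather {t.dram}[{t.base_idx} + {t.offset_buf}[i]] -> {t.sram}"
    return f"copy {t.dram}[{t.base_idx}] -> {t.sram}"


print(describe(Transfer("%x", "%base", "%spad")))
print(describe(Transfer("%x", "%base", "%spad", offset_buf="%off")))
```

Both consumers would share this one access path, so no second pattern-matcher has to be kept in sync.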
What was prototyped (and reverted)
The frontend half works cleanly: emit_transfer gained an optional offset operand, convert_indirect_indexing returns (clean_index, offset_descriptor) (dropping the + sympy.Symbol smuggle and the scalar-load/index_cast), and load/store pass it through. togsim.transfer is unregistered, so it parses the extra operand fine.
The blocker
togsim.transfer is lowered in two steps: decompose_transfer -> memref.dma_start, then lower_dma_to_gemmini -> Gemmini asm. memref.dma_start is a registered MLIR op whose operand list is already full (we use its strided variant: src, src_idx, dst, dst_idx, num_elements, tag, tag_idx, stride, num_elements_per_stride). Appending the offset memref yields:
error: 'memref.dma_start' op incorrect number of operands
So the offset cannot ride on memref.dma_start. That is exactly why the current design smuggles it through the index affine.apply (the only operand memref.dma_start will accept), and why the representation is "apply-dependent".
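The operand-count failure can be modelled with a small Python sketch. The arity rule below is an assumption based on my reading of the memref.dma_start verifier (four fixed operands, one index per memref dimension, plus an optional stride pair) and should be checked against the MemRef dialect documentation. Rank-1 memrefs are assumed to match the nine operands listed above.

```python
def dma_start_operand_counts(src_rank: int, dst_rank: int, tag_rank: int):
    """Operand counts a memref.dma_start-like op accepts in this toy model.

    Assumed rule: src, dst, tag and num_elements (4 operands), plus one
    index per memref dimension, plus an optional
    (stride, num_elements_per_stride) pair.
    """
    base = 4 + src_rank + dst_rank + tag_rank
    return {base, base + 2}


def verify(num_operands: int, src_rank: int, dst_rank: int, tag_rank: int) -> str:
    """Return "ok" or the error string reported for a bad operand count."""
    allowed = dma_start_operand_counts(src_rank, dst_rank, tag_rank)
    if num_operands not in allowed:
        return "error: 'memref.dma_start' op incorrect number of operands"
    return "ok"


# Strided form with rank-1 memrefs: 4 + 1 + 1 + 1 + 2 = 9 operands.
strided = 9
print(verify(strided, 1, 1, 1))       # ok
print(verify(strided + 1, 1, 1, 1))   # error: one operand too many
```

Under this rule there is no slot for a tenth operand, which is why the offset memref is rejected.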
Path forward (the actual fix, deferred)
Drop memref.dma_start as the transfer-lowering target and lower togsim.transfer directly to Gemmini instructions in one pass. Because togsim.transfer is unregistered it can carry arbitrary operands (offset, and later the mask descriptor), so the explicit form becomes possible. Concretely this merges today's decompose_transfer (>4D peel) + lower_dma_to_gemmini (CONFIG/MVIN asm) into a single togsim.transfer -> gemmini lowering. This is a sizable change, hence deferred.
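A rough Python sketch of the merged pass, under stated assumptions: lower_transfer is a hypothetical name, the emitted strings are placeholders rather than real Gemmini encodings, and emitting CONFIG4 once per peeled transfer is a guess about where the offset would be consumed.

```python
from itertools import product


def lower_transfer(shape, offset_buf=None):
    """Toy single-pass lowering: peel outer dims, then emit instructions.

    shape: transfer extent per dimension.
    offset_buf: name of the offset spad buffer, or None for a direct DMA.
    """
    # Everything beyond the innermost four dimensions is peeled into loops.
    outer, inner = shape[:-4], shape[-4:]
    asm = []
    # One transfer of at most 4 dims per combination of peeled outer indices.
    for outer_idx in product(*(range(n) for n in outer)):
        asm.append(f"CONFIG inner={inner} outer_idx={outer_idx}")
        if offset_buf is not None:
            # The explicit operand is read directly; no affine.apply lookup.
            asm.append(f"CONFIG4 offset={offset_buf}")
        asm.append("MVIN")
    return asm


for line in lower_transfer((2, 3, 4, 5, 6), offset_buf="%off"):
    print(line)
```

Because the peel and the instruction emission live in one function, the offset operand never has to cross an intermediate op with a fixed operand list.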
Scope when picked up
- Frontend: explicit offset on emit_transfer / convert_indirect_indexing / load / store (prototyped).
- New togsim.transfer -> gemmini lowering replacing decompose_transfer + lower_dma_to_gemmini; read the explicit offset operand for CONFIG4.
- TOG / trace ABI (default path): build_skeleton + lower_to_emitc (and the C++ trace .so) must consume the explicit offset and model the indirect addressing, not the {indirect_access} attr.
- Dependency tracking: the gather DMA must depend on the offset-build. With the offset spad as an explicit operand, build_skeleton's last-writer-per-buffer DAG captures offset-build -> gather cleanly (better than the current implicit address-arithmetic reference); ensure the gather DMA registers the offset spad as a read buffer.
- Remove the "tmp"-string detection, the sympy.Symbol smuggle, the indirect_dims affine threading, the {indirect_access} attr, and both _find_indirect / _process_dram_indices pattern-matchers.
Notes
- The unrelated multi-step indirect refactor (gather offset build as its own codegen step, replacing the compute_dependecy -> dma_stores hack) already landed and is validated (x[idx+2]+1 -> 2 steps, allclose). This issue is only about the offset transfer-operand representation.
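For the dependency-tracking item in the scope list, a last-writer-per-buffer rule can be sketched in a few lines of Python. This is a toy model of the rule as described there, not build_skeleton itself. The op and buffer names are made up for illustration.

```python
def build_edges(ops):
    """ops: list of (name, reads, writes) in program order.

    Returns dependency edges (producer, consumer) using a
    last-writer-per-buffer rule: each read depends on the most recent
    earlier op that wrote the same buffer.
    """
    last_writer = {}
    edges = []
    for name, reads, writes in ops:
        for buf in reads:
            if buf in last_writer:
                edges.append((last_writer[buf], name))
        for buf in writes:
            last_writer[buf] = name
    return edges


offset_build = ("offset_build", ["idx_spad"], ["off_spad"])

# Gather DMA that registers the offset spad as a read buffer.
explicit = [offset_build, ("gather_dma", ["dram_x", "off_spad"], ["x_spad"])]
# Gather DMA that does not register it.
implicit = [offset_build, ("gather_dma", ["dram_x"], ["x_spad"])]

print(build_edges(explicit))  # [('offset_build', 'gather_dma')]
print(build_edges(implicit))  # []
```

The edge appears only when the gather lists the offset spad among its reads, which is why the scope item asks for that registration explicitly.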