Indirect access: make offset an explicit togsim.transfer operand (blocked by memref.dma_start; needs direct togsim.transfer -> gemmini lowering) #282

Description

@YWHyuk

Summary

Indirect (gather/scatter) access is still represented in an ad-hoc, "apply-dependent" way. We want to make the per-position offset an explicit descriptor operand on togsim.transfer instead of smuggling it through the DRAM-index affine.apply. This is blocked by memref.dma_start being a registered op that cannot carry the extra operand, so it is deferred: doing it properly requires dropping memref.dma_start from the transfer-lowering path and converting togsim.transfer directly to Gemmini instructions. Filing this so the work is tracked.

Background — how indirect is represented today

For x[idx]-style access the offset is carried implicitly:

  1. Frontend detects indirect by string-matching "tmp" in the sympy index (5 sites in mlir_codegen_backend.py).
  2. convert_indirect_indexing builds the per-position offset into a spad buffer, then smuggles it back as a fake sympy symbol: return index + sympy.Symbol(str(out)).
  3. parse_indices threads that symbol as an indirect_dims affine symbol operand, producing affine.apply (...)[%off] {indirect_access} whose map result is just the clean base (the symbol is carried only so it can be found later).
  4. Both consumers recover the offset by pattern-matching that affine.apply:
    • passes/lower_dma_to_gemmini.py _find_indirect (Spike functional path -> CONFIG4),
    • passes/build_tog.py _process_dram_indices ("indirect_access" in apply.attributes).

This is fragile (string matching, symbol smuggling, two independent pattern-matchers) and couples indirect access to the DRAM-index affine.apply.
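To illustrate why the substring detection is fragile, here is a minimal sketch. It uses plain strings in place of sympy expressions, and every name in it (looks_indirect, smuggle_offset, tmp3, tmp_stride) is hypothetical rather than taken from mlir_codegen_backend.py.

```python
# Hypothetical stand-ins: index expressions are plain strings, not sympy objects.

def looks_indirect(index_expr: str) -> bool:
    """Substring test in the style described above: "tmp" anywhere in the index."""
    return "tmp" in index_expr


def smuggle_offset(base_index: str, offset_name: str) -> str:
    """Append the offset as a fake symbol so that a later pass can find it again."""
    return f"{base_index} + {offset_name}"


direct = "32*i0 + i1"
gathered = smuggle_offset(direct, "tmp3")
# An ordinary (made-up) symbol whose name merely contains "tmp".
false_positive = "32*i0 + tmp_stride"

for expr in (direct, gathered, false_positive):
    print(f"{expr!r:24} indirect={looks_indirect(expr)}")
```

The third expression is classified as indirect even though nothing was gathered, which is the kind of accident an explicit operand would rule out.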

Goal — explicit descriptor on togsim.transfer

Make the offset a first-class transfer operand, as agreed in the masked-DMA design (an offset descriptor analogous to the planned mask descriptor):

"togsim.transfer"(dram, BASE_idx, sram, sram_idx, tag, dma_type, vst, OFFSET_buf)
   {..., indirect = true} : (..., memref<...xi64>) -> ()

BASE_idx is a clean affine index; OFFSET_buf is the per-position offset spad buffer (an SSA memref). Consumers read the operand directly — no affine.apply{indirect_access}, no _find_indirect.
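A rough Python model of the proposed operand layout may help. This is only a sketch: the Transfer dataclass and offset_of are hypothetical, the string fields stand in for SSA values, and it derives "indirect" from the presence of the offset operand, whereas the proposal also carries an explicit indirect attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transfer:
    """Hypothetical mirror of the proposed togsim.transfer operand list."""
    dram: str
    base_idx: str          # clean affine index, no smuggled symbol
    sram: str
    sram_idx: str
    tag: str
    dma_type: str
    vst: str
    offset_buf: Optional[str] = None   # per-position offset spad buffer, if any

    @property
    def indirect(self) -> bool:
        # In this sketch, indirect access is defined by the operand being present.
        return self.offset_buf is not None


def offset_of(op: Transfer) -> Optional[str]:
    """Consumers read the operand directly; no affine.apply pattern-matching."""
    return op.offset_buf


plain = Transfer("%x", "%base", "%spad", "%c0", "%tag", "mvin", "%vst")
gather = Transfer("%x", "%base", "%spad", "%c0", "%tag", "mvin", "%vst", "%off_buf")
print(plain.indirect, offset_of(plain))
print(gather.indirect, offset_of(gather))
```

Both consumers (the Gemmini lowering and the TOG builder) would then share one trivial accessor instead of two pattern-matchers.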

What was prototyped (and reverted)

The frontend half works cleanly: emit_transfer gained an optional offset operand, convert_indirect_indexing returns (clean_index, offset_descriptor) (dropping the + sympy.Symbol smuggle and the scalar-load/index_cast), and load/store pass it through. togsim.transfer is unregistered, so it parses the extra operand fine.

The blocker

togsim.transfer is lowered in two steps: decompose_transfer -> memref.dma_start, then lower_dma_to_gemmini -> Gemmini asm. memref.dma_start is a registered MLIR op whose operand list is already full (we use its strided variant: src, src_idx, dst, dst_idx, num_elements, tag, tag_idx, stride, num_elements_per_stride). Appending the offset memref yields:

error: 'memref.dma_start' op incorrect number of operands

So the offset cannot ride on memref.dma_start as an operand of its own. That is exactly why the current design smuggles it through the index affine.apply (an index is the only operand slot memref.dma_start already accepts that the offset can hide behind), and why the representation is "apply-dependent".
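The operand-count rule behind that error can be modelled in a few lines. This is a simplified sketch of the check as we understand it from the memref dialect (the op accepts either the base operand count or the base count plus the two stride operands). The function names and the ranks used below are illustrative, not taken from our passes.

```python
def expected_dma_start_operands(src_rank: int, dst_rank: int, tag_rank: int) -> int:
    """Operand count of the non-strided form:
    src + src indices, dst + dst indices, num_elements, tag + tag indices."""
    return (1 + src_rank) + (1 + dst_rank) + 1 + (1 + tag_rank)


def operand_count_ok(n_operands: int, src_rank: int, dst_rank: int, tag_rank: int) -> bool:
    """The strided variant adds exactly two operands (stride,
    num_elements_per_stride); any other count is rejected."""
    base = expected_dma_start_operands(src_rank, dst_rank, tag_rank)
    return n_operands in (base, base + 2)


# Illustrative ranks: 2-D source and destination memrefs, 1-D tag memref.
base = expected_dma_start_operands(2, 2, 1)
strided = base + 2
with_offset = strided + 1   # strided form plus one appended offset memref

print(base, operand_count_ok(base, 2, 2, 1))
print(strided, operand_count_ok(strided, 2, 2, 1))
print(with_offset, operand_count_ok(with_offset, 2, 2, 1))
```

With the strided form already in use, one appended operand lands on a count that is neither of the two accepted values, so there is no slot left for the offset.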

Path forward (the actual fix, deferred)

Drop memref.dma_start as the transfer-lowering target and lower togsim.transfer directly to Gemmini instructions in one pass. Because togsim.transfer is unregistered it can carry arbitrary operands (offset, and later the mask descriptor), so the explicit form becomes possible. Concretely this merges today's decompose_transfer (>4D peel) + lower_dma_to_gemmini (CONFIG/MVIN asm) into a single togsim.transfer -> gemmini lowering. This is a sizable change, hence deferred.

Scope when picked up

  • Frontend: explicit offset on emit_transfer / convert_indirect_indexing / load / store (prototyped).
  • New togsim.transfer -> gemmini lowering replacing decompose_transfer + lower_dma_to_gemmini; read the explicit offset operand for CONFIG4.
  • TOG / trace ABI (default path): build_skeleton + lower_to_emitc (and the C++ trace .so) must consume the explicit offset and model the indirect addressing — not the {indirect_access} attr.
  • Dependency tracking: the gather DMA must depend on the offset-build. With the offset spad as an explicit operand, build_skeleton's last-writer-per-buffer DAG captures offset-build -> gather cleanly (better than the current implicit address-arithmetic reference); ensure the gather DMA registers the offset spad as a read buffer.
  • Remove the "tmp"-string detection, the sympy.Symbol smuggle, the indirect_dims affine threading, the {indirect_access} attr, and both _find_indirect / _process_dram_indices pattern-matchers.
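The dependency-tracking item above can be sketched as a last-writer-per-buffer pass. This is a minimal illustration and not the actual build_skeleton code: Node, build_edges, and all buffer names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical op record: a name plus the buffers it reads and writes."""
    name: str
    reads: List[str] = field(default_factory=list)
    writes: List[str] = field(default_factory=list)


def build_edges(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Walk ops in program order; each read depends on that buffer's last writer."""
    last_writer: Dict[str, str] = {}
    edges: List[Tuple[str, str]] = []
    for node in nodes:
        for buf in node.reads:
            writer = last_writer.get(buf)
            if writer is not None:
                edges.append((writer, node.name))
        for buf in node.writes:
            last_writer[buf] = node.name
    return edges


# Gather DMA registers the offset spad as a read buffer (the explicit form).
explicit = [
    Node("offset_build", reads=["idx_spad"], writes=["offset_spad"]),
    Node("gather_dma", reads=["offset_spad"], writes=["x_spad"]),
    Node("compute", reads=["x_spad"], writes=["out_spad"]),
]
# Same program, but the gather DMA does not list the offset spad as a read.
implicit = [
    Node("offset_build", reads=["idx_spad"], writes=["offset_spad"]),
    Node("gather_dma", reads=[], writes=["x_spad"]),
    Node("compute", reads=["x_spad"], writes=["out_spad"]),
]

print(build_edges(explicit))
print(build_edges(implicit))
```

In the second program the offset_build -> gather_dma edge is missing, which is why the gather DMA must register the offset spad as a read buffer once the offset becomes an operand.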

Notes

  • The unrelated multi-step indirect refactor (gather offset build as its own codegen step, replacing the compute_dependecy -> dma_stores hack) already landed and is validated (x[idx+2]+1 -> 2 steps, allclose). This issue is only about the offset transfer-operand representation.
