From fd152bfc9d127b915bc02e5a10104ddcf04568d2 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 01/24] [Docs] C++ trace pipeline design (runtime-tag pairing,
 ABI)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 docs/design/togsim_cpp_trace.md         | 987 ++++++++++++++++++++++++
 docs/design/togsim_cpp_trace_HANDOFF.md | 191 +++++
 docs/design/togsim_cpp_trace_STATUS.md  | 219 ++++++
 3 files changed, 1397 insertions(+)
 create mode 100644 docs/design/togsim_cpp_trace.md
 create mode 100644 docs/design/togsim_cpp_trace_HANDOFF.md
 create mode 100644 docs/design/togsim_cpp_trace_STATUS.md

diff --git a/docs/design/togsim_cpp_trace.md b/docs/design/togsim_cpp_trace.md
new file mode 100644
index 00000000..b67662aa
--- /dev/null
+++ b/docs/design/togsim_cpp_trace.md
@@ -0,0 +1,987 @@
+# TOGSim C++ Trace Generation — Design Proposal
+
+**Status:** Implemented end-to-end through the real timing Core (256^3 GEMM); see
+§11 for remaining work.
+**Branch:** `feature/togsim-cpp-trace`
+**Scope:** Replace the timing-path TOG producer (MLIR → Python-dict → ONNX → C++
+parser) with a compiled, shape-parametric trace producer (MLIR → C++ → `.so`).
+TOGSim's timing core is preserved.
+
+**Note on the sync mechanism (read before §3, §5, §9).** An earlier version of
+this design synchronized an asynchronous DMA with the consumer that waits on its
+data using a compile-time integer `event_id` — one id per static `togsim.dma`/
+`togsim.wait` op, paired through a heap "event buffer" of opaque handles. That
+mechanism was *removed*: a single static `togsim.dma` op executes once per loop
+iteration, each iteration writing a different runtime tag slot, so one
+compile-time id per op cannot represent the per-iteration pairing. The current
+design (ABI v11) pairs an async DMA with its sync point by the **runtime tag
+slot** instead. Sections below have been rewritten to the runtime-tag model;
+where a section still mentions `event_id` / event handles / `togsim_wait` /
+`togsim_signal`, it is flagged as the superseded design, not current behavior.
+
+---
+
+## 1. Motivation
+
+The current Tile-Operation Graph (TOG) pipeline has accumulated structural debt
+that blocks where we want to go (notably dynamic shape for LLM decode / MoE):
+
+1. **"ONNX in name only."** TOG is serialized as ONNX, but every op is a custom
+   `torchsim_*` attribute. We pay ONNX's costs (rigid schema, protobuf,
+   stringly-typed attribute encoding) and use none of its interop value
+   (onnxruntime, standard ops, netron). The schema lives in three places —
+   Python dict (`extension_op.py`), ONNX (`AsmParser/onnx_utility.py`), C++
+   (`TOGSim/.../TileGraphParser`) — and drifts.
+
+2. **Synchronization is ad-hoc and DMA-specific.** Completion tracking is a
+   counting semaphore in disguise, but unnamed and tangled:
+   - `DMA.h`: `tag_table[subgraph][tag_key] -> uint32` with overloaded magic
+     values (`0` pending, `1` signaled, `>1` consumed-count, `-1` sparse) plus a
+     parallel `waiters` wait-queue. The `tag_key` is a hand-rolled
+     content-addressed vector computed from loop indices/strides (`calc_tag`),
+     with implicit fallbacks (push `0` when an index is missing, dedup by
+     silently `continue`-ing).
+   - A *second*, separate dependency mechanism — `Instruction::ready_counter` +
+     `child_inst` graph edges — handles structural ordering.
+   - Net: one concept ("an async op completed; a consumer may proceed") is
+     expressed two different ways, and the event-like one only works for DMA.
+
+3. **Static shape is baked in.** `build_tog._affine_for_bounds` resolves loop
+   bounds to constants (`_const_index_value`). The graph is fully materialized
+   per static shape, so dynamic shape forces recompile-per-shape — pathological
+   for decode (a new `seq_len` every step) and MoE (variable expert load).
+
+4. **Loop-flattening hackery.** Much of the roughness (`loop_end` tricks,
+   `calc_tag`, dedup-by-skip, magic offsets) exists only to flatten loop nests
+   into a static graph.
+
+See [Appendix A](#appendix-a-current-state-references) for file:line references.
+
+## 2. Key idea: trace-driven → execution-driven
+
+Instead of materializing a flattened graph, **TOG becomes a stream emitted by
+*running* a shape-parametric producer.** The producer is C++ compiled from the
+kernel's MLIR; it keeps loops as loops (with symbolic bounds) and calls a small
+**event-based API**. Each API call emits one trace record = one modeled
+instruction. TOGSim `dlopen`s the producer `.so`, injects a callback context
+that records and times the stream.
+
+This directly resolves the four problems:
+
+| Problem | Resolution |
+|---|---|
+| ONNX-in-name-only / 3-place schema | The API signature is the single contract. No ONNX. |
+| DMA-only, ad-hoc sync | An async DMA and the consumer that waits on its data are paired at runtime by the tile's tag slot, through the existing Core tag table (`prepare_tag_key`/`set_tag_finish`/`register_tag_waiter`). The DMA signals the tag when its data arrives; an explicit `togsim.memory_barrier` waits on it and becomes the last-writer of the loaded buffer, so consumers gate on data arrival. No content-hashed `calc_tag`, no magic values. |
+| Static shape | Loop bounds flow from MLIR as-is; symbolic bounds become native loop bounds in C++, so trip count is dynamic. |
+| Loop-flatten hacks | Loops stay loops; the trace is generated by executing them. `calc_tag`/dedup disappear. |
+
+The producer is *not* a dynamic hardware scheduler: control flow is still statically
+emitted by the compiler. The `.so` is a deterministic **trace generator**, not a
+timing model — it keeps the trace-as-data boundary, so TOGSim's timing core is
+untouched.
+
+## 3. Core algebra
+
+Small, orthogonal primitives. Everything else is composition (Layer-1 helpers
+like `double_buffered_loop`, not IR primitives).
+
+- `dma(dir, arg_id, offset, shape, is_async, tag_id, tag_slot, …)` —
+  `dir ∈ {LOAD, STORE}`. Returns void. A **synchronous** (non-async) DMA is
+  blocking: it finishes when its data arrives, and consumers depend on it
+  directly. An **async** DMA returns control immediately and signals its tag at
+  data arrival (DMA response-complete); a later `memory_barrier` is the explicit
+  point that waits on it.
+- `compute(tile_id, dims…)` — references a fixed-size tile kernel; cost is looked
+  up (§6), not computed here.
+- `memory_barrier(tag_id, tag_slot, write_bufs)` — the explicit async-DMA sync.
+  It waits until the async DMA carrying the same `(tag_id, tag_slot)` has
+  delivered its data, then becomes the last-writer of the loaded buffer so
+  consumers gate on data arrival. It is the original `memref.dma_wait` mapped
+  through from the source IR, not a synthesized barrier.
+- `compute_barrier()` — a compute fence inserted before a store, so the store
+  sees the drained accumulator. This is the **one** remaining auto-inserted
+  barrier; it is marked FIXME in the code as something that should also become
+  explicit in the source IR later (§10.7.3).
+- **Control flow lives in the producer** — ordinary `for`/`if`/`while` with
+  runtime bounds. Loop types (normal/parallel/accumulation/inner) and dynamic
+  shape are just producer loops; the emitted trace is already specialized.
+
+Two distinct things share the word "tag", and the design uses **both together**
+as the dma↔barrier pairing key:
+
+- **`tag_id`** — the identity of a DMA's *tag memref*. It plus the runtime
+  `tag_slot` index identifies which async DMA a `memory_barrier` is waiting on.
+- **`tag_slot`** — the SRAM tile slot the loaded tile occupies (the
+  double-buffer / SRAM-capacity index). It is *also* part of the pairing key
+  because each load's tile maps to its own slot.
+
+Pairing is done at runtime by the existing TOGSim Core tag table: the async DMA
+calls `prepare_tag_key` and `set_tag_finish` (signal at data arrival), the
+`memory_barrier` calls `register_tag_waiter` (wait on `(tag_id, tag_slot)`).
+A synchronous DMA needs no barrier — it blocks until data arrival itself.
+
+> **Superseded.** An earlier version used a neutral `event` completion token
+> (freely allocated, not tied to memory) with `signal`/`wait`/`wait_all`
+> primitives. That has been removed in favor of the runtime-tag mechanism above.
+
+## 4. Decisions (locked)
+
+| Axis | Decision |
+|---|---|
+| Input MLIR | Use the **given MLIR as-is**. Do not touch inductor / MLIR templates / shape plumbing. Whatever bounds the MLIR carries (const or symbolic) pass through verbatim. |
+| MLIR → C++ | **EmitC dialect + `mlir-translate --mlir-to-cpp`** (upstream). |
+| `.so` ↔ TOGSim | **`dlopen` + `EmitCtx` callback** (execution-driven). The ABI boundary is the main design surface. |
+| `.so` role | **Timing trace only.** Functional correctness stays on the existing Spike/LLVM path. Strip every op without a timing dependency; keep loop skeleton + API ops + ops feeding bounds/addresses. |
+| Compute cycle | A **separate annotation pass** reuses the existing **sample-mode** to produce a **precomputed `tile_id → cycle` table**, looked up at runtime. |
+| Dynamic shape | Falls out of symbolic loop bounds in the MLIR. Per-tile cost is static (tiles are fixed-size); only trip count is dynamic. |
+
+## 5. Architecture
+
+### 5.1 Artifacts (per kernel)
+
+- **Trace `.so`** — compiled from the skeleton+API MLIR. Shape-parametric:
+  symbolic bounds become C++ function parameters. Calls the runtime API
+  (`togsim_dma`, `togsim_compute`, `togsim_memory_barrier`, …).
+- **Cycle table** — `tile_id → cycle`, produced by the annotation pass.
+
+### 5.2 Pipeline (input = given MLIR)
+
+```
+given MLIR (affine/scf.for + memref.dma_start/dma_wait + vcix/vector compute)
+│
+├── Branch A (trace):
+│     C2 build_skeleton pass  (reuse build_tog traversal)
+│        • affine/scf.for kept, bounds as-is (symbolic preserved)
+│        • dma_start → togsim.dma(... tag_id, %tag[%idx], is_async)
+│        • dma_wait  → togsim.memory_barrier(tag_id, %tag[%idx], write_bufs)
+│        • compute block       → togsim.compute(tile_id, dims)
+│        • DCE: drop ops with no dependency to loop/address/API operands
+│     → C4 togsim→emitc lowering  (togsim.* → emitc.call_opaque;
+│        convert-scf/arith-to-emitc; func args incl. symbolic shapes)
+│     → mlir-translate --mlir-to-cpp
+│     → C5 compile → trace .so   (cached by kernel key)
+│
+└── Branch B (cost):
+      C3 annotation pass over the same MLIR
+        • extract per-tile compute bodies, assign tile_id
+        • run through existing sample-mode → tile_id → cycle table
+
+TOGSim (C6):
+  dlopen(trace.so) → resolve togsim_emit
+  inject EmitCtx { tag table; record sink; cost = cycle_table[tile_id] }
+  togsim_emit(ctx, runtime_shape_args...)   // producer runs, emits stream
+  → existing timing core consumes the recorded Instruction stream
+```
+
+### 5.3 Components
+
+- **C1 — `togsim` API op vocabulary.** `togsim.dma(...)` (void result, carrying
+  `tag_id`, the runtime tag-index operand, `is_async`),
+  `togsim.memory_barrier(tag_id, tag_slot, write_bufs)`,
+  `togsim.compute(tile_id, dims)`, `togsim.compute_barrier`. Kept *unregistered*
+  (like the existing `togsim.transfer`), so no C++ dialect registration; the
+  togsim→emitc step is a custom Python rewrite, not a registered ConversionPass.
+- **C2 — `build_skeleton` pass.** Sibling to `build_tog.py`, reusing its
+  traversal (matmul FSM, `_dma_start_fields`, loop typing). Emits the
+  skeleton+API MLIR instead of TOG nodes; preserves `is_async`. The original
+  `memref.dma_wait` is mapped through to an explicit `togsim.memory_barrier`
+  carrying the DMA's `tag_id` and the runtime tag-index operand.
+- **C3 — annotation pass + cycle table.** Reuses sample-mode to sample the
+  deterministic per-tile cycle; emits the `tile_id → cycle` table artifact.
+- **C4 — togsim→emitc lowering.** Maps each `togsim.*` op to an
+  `emitc.call_opaque "togsim_*"`; lowers control flow via `convert-scf-to-emitc`
+  / `convert-arith-to-emitc`; func arguments (including symbolic shapes) become
+  C++ parameters. Then `mlir-translate --mlir-to-cpp`.
+- **C5 — `.so` build.** Compile emitted `.cpp` + `togsim_runtime.h` to `.so`
+  via the existing toolchain; cache by kernel key.
+- **C6 — TOGSim runtime + loader.** `togsim_runtime.h/.cc`: `EmitCtx` and the
+  `togsim_dma/compute/memory_barrier/compute_barrier/core_alloc`
+  implementations (compute looks up the cycle table). Loader `dlopen`s the
+  `.so`, calls `togsim_emit` with runtime shape args, records the stream, feeds
+  the existing timing core. An async DMA and its `memory_barrier` are paired at
+  runtime by `(tag_id, tag_slot)` through the existing Core tag table.
+
+### 5.4 ABI sketch (current: v11)
+
+```c
+// togsim_runtime.h — shared contract between emitted .cpp and TOGSim
+typedef struct EmitCtx EmitCtx;
+
+void togsim_dma(EmitCtx*, int32_t dir, int32_t arg_id, uint64_t offset,
+                int32_t ndim, const int64_t* dims, const int64_t* strides,
+                int32_t elem_bits, int32_t is_async,
+                int32_t tag_id, uint64_t tag_slot,
+                const int64_t* read_bufs, int32_t n_read,
+                const int64_t* write_bufs, int32_t n_write);
+
+void togsim_memory_barrier(EmitCtx*, int32_t tag_id, uint64_t tag_slot,
+                           const int64_t* write_bufs, int32_t n_write);
+
+void togsim_compute(EmitCtx*, uint64_t tile_id, int32_t compute_type, /* dims */ ...);
+void togsim_compute_barrier(EmitCtx*);
+int32_t togsim_core_alloc(EmitCtx*);
+
+// entry point the loader resolves:
+void togsim_emit(EmitCtx*, int64_t* shape_args, int32_t n_shape_args);
+```
+
+`togsim_dma` returns void (no handle). An async DMA carries `(tag_id, tag_slot)`;
+the matching `togsim_memory_barrier` waits on the same pair through the Core tag
+table. The symbols are resolved as free `extern "C"` functions: the loaded `.so`
+links back into the Simulator binary (built with `ENABLE_EXPORTS`).
+
+> **Superseded.** v2–v10 evolved through a `togsim_event` handle type with
+> `togsim_dma` returning a handle and `togsim_wait`/`togsim_signal`/
+> `togsim_wait_all` plus `togsim_event_alloc`/`togsim_event_free`. v11 removed
+> all of those; see the note at the top of this doc and §9.6.1.
+
+## 6. Compute cost model
+
+The annotation pass (C3) reuses **sample-mode** to measure each tile's
+deterministic cycle once and stores a **precomputed `tile_id → cycle` table**.
+`togsim_compute` looks it up at runtime.
+
+This is consistent with dynamic shape because **tiles are fixed-size**
+(`TILE_M/N/K`): the per-tile cycle is invariant; only the *number* of tiles
+(loop trip count) varies, and that is handled by the symbolic loop in the `.so`.
+
+**Open edge case — remainder tiles.** When a dimension is not divisible by the
+tile size, edge tiles are partial and have a different cycle than the table
+entry. Options: pad to full-tile cost (simple, small error) vs. sample a
+separate `tile_id` for the remainder. To be decided at P4.
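+
+A minimal sketch of the two options, assuming a one-dimensional tiling and a
+cost that is simply summed per tile (no overlap or DMA modeled). The function
+names and the `remainder_cycle` parameter are hypothetical and exist only for
+this illustration; they are not part of the cycle-table artifact.
+
+```cpp
+#include <cstdint>
+
+// Option 1: charge every tile, including a partial edge tile, the full-tile cost.
+uint64_t padded_cycles(uint64_t dim, uint64_t tile, uint64_t full_cycle) {
+  uint64_t n_tiles = (dim + tile - 1) / tile;  // ceil(dim / tile)
+  return n_tiles * full_cycle;
+}
+
+// Option 2: charge full tiles the table cost and the edge tile a separately
+// sampled cost (remainder_cycle is an assumed second table entry).
+uint64_t split_cycles(uint64_t dim, uint64_t tile, uint64_t full_cycle,
+                      uint64_t remainder_cycle) {
+  uint64_t full_tiles = dim / tile;
+  bool has_remainder = (dim % tile) != 0;
+  return full_tiles * full_cycle + (has_remainder ? remainder_cycle : 0);
+}
+```
+
+For `dim = 1000` and `tile = 256`, padding charges four full tiles while the
+split form charges three full tiles plus one remainder entry; for a divisible
+`dim` the two agree.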
+
+## 7. Milestones
+
+- **P0** — DONE. New branch; runtime API header (C6 surface) + `togsim` op
+  vocabulary (C1).
+- **P1** — DONE. `build_skeleton` pass (C2) on a matmul kernel; verified against
+  the legacy `build_tog` TOG. The async DMA's `memref.dma_wait` is mapped through
+  to an explicit `togsim.memory_barrier` carrying the DMA's `tag_id` and the
+  runtime tag-index operand; the IR verifies across sibling prefetch/compute loop
+  nests because the pairing is by runtime tag slot, not a cross-region SSA edge.
+- **P2** — DONE. togsim→emitc (C4) + `mlir-translate` + compile (C5) → `.so` for
+  that kernel (static shape). C4 rewrites the unregistered `togsim.*` ops and the
+  function signature, then drives the upstream `lower-affine`/`convert-*-to-emitc`
+  passes, with a
+  small fold for residual `emitc.for` bound casts (see §8). Base addresses
+  stubbed to 0 (wired in P3).
+- **P3** — DONE. TOGSim loader + runtime (C6) + cycle table (C3); runs end-to-end
+  through the real Simulator/Core (256^3 GEMM via `--trace_so`). Parallelism /
+  reduction / core dispatch design is locked in **§9** (core-transparent work
+  function + `togsim_core_alloc` hook). Async DMA↔consumer sync is the runtime
+  tag-slot mechanism (`togsim.memory_barrier`), not an event-id.
+- **P4** — Symbolic bounds end-to-end on a decode-style kernel; verify trace
+  length scales with runtime shape; decide remainder-tile handling.
+- **P5** — Migrate remaining op families (conv, SDPA, vector).
+
+## 8. Risks / open questions
+
+- **Remainder tiles vs. precomputed table** (§6) — P4.
+- **ABI versioning** — RESOLVED. Free `extern "C"` symbols (the `.so` links back
+  into the Simulator binary via `ENABLE_EXPORTS`); `TOGSIM_ABI_VERSION` is v11.
+- **togsim→emitc for unregistered ops** — must be a custom rewrite to
+  `emitc.call_opaque`, since unregistered ops have no registered conversion
+  patterns.
+- **EmitC coverage** — RESOLVED (P2). C4 uses the upstream conversion passes
+  (`lower-affine`, `convert-scf-to-emitc`, `convert-arith-to-emitc`,
+  `convert-func-to-emitc`). One gap in this LLVM 20 build:
+  `convert-scf-to-emitc` emits `emitc.for` with `index` bounds, so
+  `convert-arith-to-emitc` leaves `builtin.unrealized_conversion_cast` on the
+  bounds (`emitc.size_t`↔`index`) that `--reconcile-unrealized-casts` cannot
+  fold and `mlir-to-cpp` cannot print. C4 adds a small post-pass
+  (`_retype_for_to_size_t`) that retypes each `emitc.for` to `!emitc.size_t`
+  bounds + IV (`emitc.for` accepts size_t with the explicit type) and folds the
+  residual index<->size_t casts. A size_t IV also makes the lowered *address*
+  arithmetic cast-free, which is what lets P3 wire real addresses (approach A):
+  `togsim_dma` passes `(arg_id, element offset)` where the offset is computed
+  from the loop IVs and lowered by `convert-arith-to-emitc`.
+- **async/fire-and-forget** — `is_async` preserved on `togsim.dma`. An async DMA
+  signals its tag at data arrival; a sync DMA is blocking. A DMA with no matching
+  `memory_barrier` is fire-and-forget (nothing waits on its tag).
+
+## 9. P3 design: parallelism, reduction, and core dispatch (locked)
+
+How the trace producer expresses *which core runs what*, *what is parallel*, and
+*what is a reduction* (cross-iteration dependency). This is the design for P3.
+
+### 9.1 Where the semantics come from
+
+Nothing new has to be inferred — the post-vcix `affine.for` already carries the
+mapping decision the frontend made, and `build_skeleton` preserves it:
+
+| attribute | meaning | role |
+|---|---|---|
+| `outer_loop` | PARALLEL axis (e.g. GEMM m, n) | independent output tiles -> distributable across cores |
+| `accumulation_loop` | REDUCTION axis (e.g. GEMM k) | partial sums into one output tile -> ordered dependency |
+| `inner_loop` | tile micro-loop | within one tile |
+
+This matches what legacy TOGSim already does with `torchsim_loop_type`
+(`TileGraphParser`: PARALLEL -> `outer_loop_idx` selects a core; ACCUMULATION ->
+`accum_tag` groups dependent partials). The current gap is only that
+`lower_to_emitc` (P2) *drops* these attributes when it lowers `affine.for` to
+`emitc.for`, producing a flat single-stream producer.
+
+### 9.2 Principle: bake intrinsic, parameterize extrinsic
+
+Two different kinds of hardware dependence must be treated differently:
+
+- **Intrinsic** (vlane / vector width, `TILE_M/N/K`, systolic size) — defines the
+  *content and cost of each instruction*. Already baked into the IR; correct.
+- **Extrinsic** (`num_cores`) — defines only the *distribution* of an otherwise
+  fixed set of work-items. The tile set, the per-tile cost table
+  (`tile_id -> cycle`), and the DMA tile shapes are all `num_cores`-invariant.
+
+Therefore `num_cores` is **not** baked into the producer. The producer is
+**core-count transparent**: it knows nothing about how many cores exist.
+
+### 9.3 Model: core-transparent work function + dispatch hook
+
+The producer is two functions, split at the PARALLEL/ACCUMULATION boundary:
+
+```c
+// WORK: trace for ONE independent output tile. Core-transparent: takes the
+// PARALLEL indices directly, names no core. Reduction (k) is program order ->
+// the dependency is implicit (the accumulator is core-local). An async load is
+// synced to its consumer by an explicit memory_barrier on the same tag slot.
+void togsim_emit_tile(EmitCtx* ctx, int64_t mi, int64_t ni, int64_t* shape) {
+  togsim_core_alloc(ctx);                // first line: new work-item + pick core
+  togsim_compute(ctx, /*tile_id=*/0, ...);            // acc init
+  for (size_t ki = 0; ki < KT; ++ki) {                // REDUCTION = program order
+    togsim_dma(ctx, LOAD, A, offA(mi,ki), ..., /*is_async=*/1, /*tag_id=*/0, ki%D, ...);
+    togsim_dma(ctx, LOAD, B, offB(ki,ni), ..., /*is_async=*/1, /*tag_id=*/1, ki%D, ...);
+    togsim_memory_barrier(ctx, /*tag_id=*/1, ki%D, ...); togsim_compute(ctx, 1, ...);
+    togsim_memory_barrier(ctx, /*tag_id=*/0, ki%D, ...); togsim_compute(ctx, 2, ...);
+  }
+  togsim_dma(ctx, STORE, C, offC(mi,ni), ...);
+}
+
+// DISPATCH: enumerate the PARALLEL domain, one call per work-item.
+extern "C" void togsim_emit(EmitCtx* ctx, int64_t* shape, int32_t n) {
+  size_t MT = shape[0]/256, NT = shape[1]/256;
+  for (size_t mi = 0; mi < MT; ++mi)
+    for (size_t ni = 0; ni < NT; ++ni)
+      togsim_emit_tile(ctx, mi, ni, shape);
+}
+```
+
+Reduced to three orthogonal concepts:
+
+- **Parallel** = each `togsim_emit_tile` call is an independent work-item (no
+  tags shared across calls). TOGSim is free to place it on any core.
+- **Reduction** = ordering *inside* one work-item: program order on its core
+  (no explicit barrier). The `memory_barrier`/tag-slot mechanism is only the
+  async-DMA → consumer data sync.
+- **Core assignment** = `togsim_core_alloc(ctx)` (a runtime callback, body in
+  TOGSim) marks the work-item boundary and binds the following ops to a chosen
+  core. The producer never sees `core_id`/`num_cores`; those live only in
+  TOGSim's dispatch policy (round-robin / blocked / cost-aware via the cycle
+  table).
+
+The boundary callback lives at the start of each work-item; it cannot be folded
+away because TOGSim cannot intercept the producer-internal work-function call --
+only `togsim_*` callbacks are visible across the `dlopen` boundary.
+
+> FINAL API (supersedes the `togsim_dispatch` naming used below): the boundary +
+> core binding is **`int32_t togsim_core_alloc(EmitCtx*)`** (header v6). The
+> producer calls it at each work-item start; the **runtime owns the core pool**
+> and round-robins -- `num_cores` is NEVER baked into the producer (it is purely
+> a runtime quantity). There is **no free**: a core is an assignment, not a held
+> resource; the next `togsim_core_alloc` starts the next work-item. The returned
+> id is discarded by the producer. This keeps the producer core-count transparent
+> while making the core mapping an explicit runtime allocation. Wherever the text
+> below says `togsim_dispatch`, read `togsim_core_alloc`.
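+
+The round-robin policy described above could be sketched as follows. `ToyCtx`
+and `toy_core_alloc` are hypothetical stand-ins: the real `EmitCtx` layout and
+the body of `togsim_core_alloc` are not shown in this document, so treat this
+as an illustration of the contract (no `num_cores` in the producer, no free),
+not as the implementation.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+
+// Hypothetical runtime-side state; the producer never reads any of it.
+struct ToyCtx {
+  int32_t num_cores = 1;      // runtime configuration
+  int32_t next_core = 0;      // round-robin cursor
+  int32_t current_core = -1;  // core the following ops are bound to
+  int64_t work_items = 0;     // number of work-item boundaries seen
+};
+
+// Marks a work-item boundary and binds the following ops to the next core.
+// There is no matching free: the next call simply starts the next work-item.
+int32_t toy_core_alloc(ToyCtx* ctx) {
+  assert(ctx->num_cores > 0);
+  ctx->current_core = ctx->next_core;
+  ctx->next_core = (ctx->next_core + 1) % ctx->num_cores;
+  ++ctx->work_items;
+  return ctx->current_core;
+}
+```
+
+Because the cursor lives in the runtime context, changing `num_cores` changes
+only the binding of work-items to cores; the sequence of producer calls stays
+the same.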
+
+### 9.4 Codegen (lower_to_emitc) and ABI deltas
+
+- `lower_to_emitc` splits the loop nest at the PARALLEL/ACCUMULATION boundary
+  into two `emitc.func`: the PARALLEL loops become `togsim_emit` (dispatcher,
+  passing the loop indices as args); the ACCUMULATION+INNER body becomes
+  `togsim_emit_tile`, with `togsim_core_alloc(ctx)` inserted at its entry.
+- ABI additions in `togsim_runtime.h`: `int32_t togsim_core_alloc(EmitCtx*)`
+  (runtime owns the core pool; no `num_cores` in the producer; no free).
+  `togsim_emit_tile` may stay internal (`static`) for now; export it only if a
+  future loader wants to own the parallel enumeration (which would also need a
+  `num_tiles`-style count — not required now).
+- `tile_id -> cycle` table unchanged (num_cores-invariant).
+
+> Implementation status (P3): `lower_to_emitc` inserts the `togsim_core_alloc`
+> marker at the innermost PARALLEL-loop body inside the single `togsim_emit`
+> function — the emitted *trace* is identical to the two-function form (one
+> core_alloc per work-item, then the work ops). Address arithmetic is wired
+> (approach A): each `togsim_dma` passes `(arg_id, element offset)` with the
+> offset computed from the loop IVs (lowered by `convert-arith-to-emitc`, cast-
+> free thanks to the size_t IV retype); the runtime adds the tensor base.
+> Outlining the work body into a separate `togsim_emit_tile` is now *meaningful*
+> (the body uses the parallel IVs in the offset) but still deferred — the
+> single-function trace is identical, so the split is cosmetic until needed.
+
+### 9.5 Stance and the split-K exception
+
+This refines the design's "not a dynamic scheduler / static control flow":
+**per-work-item trace is static and deterministic; only the work-item -> core
+binding is dynamic** (decided by `togsim_core_alloc`). That is independent-task
+distribution, not data-dependent control flow, and it matches a real tile
+scheduler more closely.
+
+The transparent model holds while work-items are independent (data-parallel over
+output tiles). **Split-K** (a reduction split *across* cores) breaks
+independence: the producer must emit `c` partials (`c` being the number of cores the reduction is split over) + a combine, so the
+instruction stream then depends on `num_cores`, and the cross-core dependency
+must be a real dataflow edge (not program order). Treat split-K as a deliberate,
+scoped exception — start P3 with data-parallel only.
+
+### 9.6 Work-items form a DAG (barriers, cross-parallel reduction)
+
+Work-items are not always a flat independent set. When there is a computation
+*between* parallel loops (e.g. an op at the m-level after the inner n parallel
+loop), it can only run once the inner parallel region completes — a join /
+barrier:
+
+```
+parallel for m:
+  parallel for n: A(m,n)     # leaf work-items
+  B(m)                       # join: needs all n of this m
+```
+
+This needs **no new primitive**: it is the same dataflow-edge mechanism the trace
+already uses (§10), just at work-item granularity. The join op declares the
+leaves' output buffers as its inputs, so the bridge makes it depend on every leaf
+through the last-writer-per-buffer analysis:
+
+```
+parallel for m:
+  parallel for n: A(m,n)   // each writes a tile of m's intermediate buffer
+  B(m)                     // reads that buffer -> depends on all n of this m
+```
+
+So the general picture: **work-items form a DAG; edges are buffer producer →
+consumer dependencies.** The independent data-parallel case is the degenerate
+edge-less DAG; barriers, reduction-across-a-parallel-axis, and split-K are the
+same DAG with real dataflow edges. (Async-DMA data arrival is the one edge that
+needs an explicit `memory_barrier` on the tag slot, because the buffer write
+completes only at DMA response-complete, later than the producing op's
+issue — see §10.7.4.)
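+
+A minimal sketch of a last-writer-per-buffer analysis, assuming each trace op
+lists the buffer ids it reads and writes and that only read-after-write edges
+are tracked. `Op` and `last_writer_edges` are hypothetical names; the bridge's
+actual data structures are not shown in this section.
+
+```cpp
+#include <cstdint>
+#include <unordered_map>
+#include <vector>
+
+// Hypothetical trace record: the buffers an op reads and writes.
+struct Op {
+  std::vector<int64_t> read_bufs;
+  std::vector<int64_t> write_bufs;
+};
+
+// For each op, collect the indices of the earlier ops it depends on: the most
+// recent writer of every buffer it reads.
+std::vector<std::vector<int>> last_writer_edges(const std::vector<Op>& ops) {
+  std::unordered_map<int64_t, int> last_writer;  // buffer id -> op index
+  std::vector<std::vector<int>> deps(ops.size());
+  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
+    for (int64_t buf : ops[i].read_bufs) {
+      auto it = last_writer.find(buf);
+      if (it != last_writer.end()) deps[i].push_back(it->second);
+    }
+    // Record writes after reads so an op never depends on itself.
+    for (int64_t buf : ops[i].write_bufs) last_writer[buf] = i;
+  }
+  return deps;
+}
+```
+
+With three leaves that each write their own tile buffer and a join that reads
+all three, the join picks up one edge per leaf, which is the barrier behavior
+described above without a dedicated primitive.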
+
+> **Superseded.** An earlier version expressed these joins with a per-leaf
+> completion `event` plus `togsim_wait_all`. Those primitives were removed; joins
+> are now ordinary buffer dependencies in the dataflow DAG (§10).
+
+### 9.6.1 How a barrier finds its DMA: runtime tag-slot pairing (locked)
+
+How the explicit `togsim.memory_barrier` (lowered from `memref.dma_wait`) finds
+*which* `togsim.dma` instance's data it must wait for. The hard case is a
+reduction loop: one static `togsim.dma` op executes once per iteration, each
+iteration loading a different tile into a different runtime tag slot. The pairing
+must therefore key on a *runtime* value, not a compile-time one.
+
+The locked model: pair by the **runtime tag slot**, using the existing TOGSim
+Core tag table.
+
+- **A DMA carries `(tag_id, tag_slot)`.** `tag_id` is the compile-time identity
+  of the DMA's tag memref (which logical channel — e.g. A-load vs B-load).
+  `tag_slot` is the *runtime* tag index `%tag[%idx]`, i.e. the SRAM tile slot
+  the loaded tile occupies this iteration. Together they uniquely name this
+  iteration's load.
+- **An async DMA signals; the barrier waits.** At DMA response-complete (the
+  moment data has actually arrived in SRAM), the runtime calls
+  `set_tag_finish(tag_id, tag_slot)`. The matching `togsim.memory_barrier`
+  carries the same `(tag_id, tag_slot)`; it calls `register_tag_waiter` and is
+  woken at that signal. The barrier then becomes the **last-writer** of the
+  loaded SRAM buffer (`write_bufs`), so every consumer that reads the buffer
+  gates on data arrival through the ordinary dataflow-edge analysis (§10).
+- **A synchronous DMA needs no barrier.** It is blocking — it finishes at data
+  arrival itself, and consumers depend on it directly.
+- **Reduction iterations do not collide.** Because `tag_slot` is the runtime
+  index, iteration `i`'s DMA and iteration `i`'s barrier share a slot that is
+  distinct from (or correctly reused after) other iterations — exactly the
+  per-iteration pairing a compile-time id could not express. The
+  double-buffer/pipeline depth is the slot's lifetime, owned by the Core's tag
+  table.
+
+**What this drops vs legacy `tag_table`:** no `calc_tag` content-hash, no magic
+values (`0`/`1`/`-1`/`>1`), no FIFO, no in-order assumption. The pairing key is
+`(tag_id, tag_slot)`, both carried explicitly on the trace ops.
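+
+The pairing can be illustrated with a toy table keyed by `(tag_id, tag_slot)`.
+This is a sketch under assumptions: `ToyTagTable` and its `prepare`/`wait`/
+`finish` methods are hypothetical and only loosely mirror
+`prepare_tag_key`/`register_tag_waiter`/`set_tag_finish`, whose real signatures
+are not given in this document.
+
+```cpp
+#include <cstdint>
+#include <functional>
+#include <map>
+#include <utility>
+#include <vector>
+
+// Toy stand-in for the Core tag table (not the TOGSim class).
+class ToyTagTable {
+ public:
+  using Key = std::pair<int32_t, uint64_t>;  // (tag_id, tag_slot)
+
+  // An async DMA is issued: (re)arm the slot as pending.
+  void prepare(int32_t tag_id, uint64_t tag_slot) {
+    entries_[{tag_id, tag_slot}].done = false;
+  }
+
+  // A memory_barrier waits: run now if data already arrived, else park it.
+  void wait(int32_t tag_id, uint64_t tag_slot, std::function<void()> on_ready) {
+    Entry& e = entries_[{tag_id, tag_slot}];
+    if (e.done) on_ready();
+    else e.waiters.push_back(std::move(on_ready));
+  }
+
+  // DMA response-complete: mark the slot done and wake every parked waiter.
+  void finish(int32_t tag_id, uint64_t tag_slot) {
+    Entry& e = entries_[{tag_id, tag_slot}];
+    e.done = true;
+    std::vector<std::function<void()>> woken;
+    woken.swap(e.waiters);
+    for (auto& fn : woken) fn();
+  }
+
+ private:
+  struct Entry {
+    bool done = false;
+    std::vector<std::function<void()>> waiters;
+  };
+  std::map<Key, Entry> entries_;
+};
+```
+
+A waiter parked on `(0, 0)` is not woken by a finish on `(1, 0)`, and re-arming
+`(0, 0)` in a later iteration pairs the next barrier with the next load on that
+slot, which is the per-iteration behavior a single compile-time id cannot give.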
+
+> Status: IMPLEMENTED (ABI v11). `build_skeleton` maps `memref.dma_wait` to
+> `togsim.memory_barrier` and tags `togsim.dma` with `tag_id` + the runtime
+> tag-index operand; `lower_to_emitc` lowers both; the runtime pairs them via
+> `prepare_tag_key`/`set_tag_finish`/`register_tag_waiter`. Verified bad=0 on the
+> 256^3 GEMM. (All current fixtures have tag memref size 1, i.e. single-buffer;
+> deeper double-buffer pipelines exercise more slots but use the same key.)
+>
+> **Superseded.** ABI v5–v10 used a dynamically minted `togsim_event` handle
+> parked in a heap "event buffer" (`togsim_event_alloc`/`togsim_event_free`),
+> with `togsim_dma` returning the handle and `togsim_wait(handle)` consuming it.
+> That mechanism — and the earlier static `event_id` it replaced — could not
+> represent per-iteration reduction pairing and was removed in v11 in favor of
+> the runtime tag slot above.
+
+### 9.7 Execution / simulation model: trace generation (not co-execution)
+
+The producer is a **pure trace (DAG) generator**: running its loops *emits* the
+ordered op stream + dependency edges. It never computes cycles, models hardware,
+or schedules. Two consequences pin the model:
+
+- **What is an edge vs. what blocks.** Data dependencies (buffer producer →
+  consumer edges, plus the async-DMA `memory_barrier` on its tag slot) are
+  recorded *edges* — the producer does not block on them. The only thing that
+  ever blocks the producer is *resource backpressure* (finite cores,
+  double-buffer / SRAM slots, DMA-queue depth), and that is pure flow control,
+  not timing semantics.
+- **Cores, double-buffering, DRAM/NoC are the timing core's job — reused, not
+  reimplemented.** TOGSim's timing core already models all of this when it
+  consumes the legacy TOG (Appendix A: `tag_table` double-buffer sync,
+  `num_cores`). The producer stays oblivious; depths/counts are consumer-side
+  config.
+
+Consumption is staged via a swappable **sink** behind the callbacks, so the
+choice does not touch the producer or the ABI:
+
+| milestone | sink | threads | when |
+|---|---|---|---|
+| **P3** | *materializing* — callbacks append to the timing core's input; reuse its existing scheduler/timing | none | static shape; like-for-like cycle-equivalence vs `build_tog` |
+| **P4+** | *streaming* — callbacks push to a bounded queue; the producer runs as a fiber/coroutine and blocks on backpressure; the DES loop advances time, frees resources, resumes it | producer fiber | only when dynamic-shape trace size makes full materialization impractical |
+
+This is **not** timing co-execution: even the streaming sink only blocks the
+producer on resource flow-control, never on timing-resolved data events. It is
+the lazy/streamed realization of the same trace model. Decision: **do P3 with
+the materializing sink (no threads); defer streaming to P4 as a sink swap.** The
+single forward-compat requirement is that the callback sink is an interface.
+
+### 9.8 P3 task list
+
+1. DONE. `togsim_runtime.h` + `togsim_runtime.cc`/`togsim_loader.h`: C6 runtime
+   (`EmitCtx`) + `dlopen` loader (`run_producer`), materializing sink. Callees:
+   `togsim_core_alloc` (runtime core pool), `togsim_dma` (records a tile load/
+   store, signals its tag at data arrival), `togsim_compute` (cycle-table lookup),
+   `togsim_memory_barrier` (waits on the matching `(tag_id, tag_slot)`),
+   `togsim_compute_barrier`.
+2. DONE (single-buffer). `lower_to_emitc`: inserts `togsim_core_alloc` at the
+   work-item boundary, lowers `togsim.memory_barrier`, and reads `loop_type`.
+   (Two-function outline still deferred; trace identical.)
+3. DONE. Real tile addresses wired (approach A): build_skeleton keeps the DRAM
+   index operand on `togsim.dma`; lower_to_emitc passes `(arg_id, offset)` and
+   `convert-arith-to-emitc` lowers the offset (size_t IV retype makes it
+   cast-free). Verified on 1024^3 GEMM: per-tile offsets are correct
+   (A[m,k]=m*1024+k, B[k,n]=k*1024+n).
+4. PARTIAL. C3 cycle table: `cycle_table.py` builds `tile_id -> (cycle,
+   overlapping_cycle)` from a per-tile `cycle_list`, with `overlapping_cycle =
+   max(cycle - offset[type], 0)` (the legacy formula) and a JSON sidecar dump.
+   Remaining (folds into task 5): feed it the gem5 sample-mode `cycle_list`
+   already computed in `extension_codecache` (reused -> both paths stay
+   cycle-consistent), and have `togsim_compute` set BOTH cycle and
+   overlapping_cycle on the Instruction.
+5. DONE. C6 runtime + loader: `TOGSim/src/togsim_runtime.cc` +
+   `togsim_loader.h` implement the producer ABI and `run_producer` -- dlopen the
+   `.so`, run `togsim_emit` against an `EmitCtx`, and record a `TraceRec` stream
+   (the materializing sink): each dma resolves `base[arg_id] + offset*elem_bytes`
+   and signals its tag at data arrival, each compute looks up the cycle table,
+   core_alloc round-robins the core. Verified standalone on the 256^3 GEMM:
+   addresses/cycles resolved correctly. The recorded stream is also fed into
+   the existing timing core (Core/Simulator; sec 10, 10.7.4) -- TraceRec maps
+   to `Instruction` (compute_cycle + overlapping_cycle, dataflow-buffer deps +
+   runtime-tag barriers).
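+
+As a worked instance of the address resolution in tasks 3 and 5 (a sketch: the
+tile origin `(m, k) = (128, 256)` and `elem_bytes = 4` are illustrative
+assumptions, not values taken from the fixture), the row-major offset verified
+on the 1024^3 GEMM and the runtime's `base[arg_id] + offset*elem_bytes` combine
+as:
+
+```math
+\begin{aligned}
+\mathrm{offset}_A(m,k) &= 1024\,m + k = 1024 \cdot 128 + 256 = 131328 \\
+\mathrm{addr} &= \mathrm{base}[\mathrm{arg\_id}] + 131328 \cdot 4
+              = \mathrm{base}[\mathrm{arg\_id}] + 525312
+\end{aligned}
+```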
+
+Legacy path: the ONNX-TOG producer (`run_tog` -> `tog_generator` -> ONNX ->
+C++ `TileGraphParser`) is marked DEPRECATED in place (comments in
+`extension_codecache.py` and `tog_generator.py`) but kept live -- it must not
+break during the transition. It is retired only once this trace pipeline is
+stable. The cycle measurement (`cycle_list`, `x_offset`/`w_offset`) is shared,
+so the two paths stay cycle-consistent meanwhile.
+
+### 9.9 Task-5 completion roadmap: TraceRec -> Core (DONE; see §10)
+
+> **Status: implemented.** This roadmap is retained for context. The dependency
+> model it sketches (a per-`togsim_wait`-handle RAW edge) was *superseded* during
+> implementation by the explicit dataflow-DAG model in §10: edges come from SRAM
+> last-writer-per-buffer plus the vcix preload/matmul FSM, and async-DMA data
+> arrival is gated by an explicit `togsim.memory_barrier` paired on the runtime
+> `(tag_id, tag_slot)` (§10.7.4) — not by a returned event handle. Read the
+> bullets below as the original target shape, with that one substitution.
+
+Grounded by reading `Instruction.h`, `Core.cc`, `TileGraphParser.h/.cc`,
+`Simulator.cc`.
+
+**Target architecture (legacy, reused):** `ONNX -> TileGraphParser -> TileGraph
+(TileLoopNode / TileMemoryNode / TileMemoryWaitNode / TileComputeNode) ->
+Simulator distributes Tiles to Cores -> Core runs Instructions`. We replace only
+the front: build the same `TileGraph` / `Instruction`s from the recorded
+`TraceRec` stream, then hand it to the existing `Simulator`.
+
+**Mapping (TraceRec -> Instruction):** `Instruction(opcode, compute_cycle,
+num_parents, dram_addr, tile_size, tile_stride, elem_bits, tag_idx_list,
+tag_stride_list, accum_tag_idx_list)`; `ready_counter = num_parents`.
+- DMA load/store -> `MOVIN`/`MOVOUT`: `dram_addr = TraceRec.addr`, `tile_size`/
+  `tile_stride`/`elem_bits` from the dma, `tag_idx_list = {tag_slot}` (the
+  SRAM-slot key), `is_async` set. compute_cycle 0.
+- COMPUTE -> `COMP`: `compute_cycle = TraceRec.cycle`,
+  `set_overlapping_cycle(TraceRec.overlapping)`, `set_compute_type(...)`.
+- Dependency (RAW): a compute depends on its loads through the SRAM
+  last-writer-per-buffer analysis (§10); for an async load the last-writer is the
+  `togsim.memory_barrier` paired on the load's runtime `(tag_id, tag_slot)`, so
+  the compute's `ready_counter` only clears once the data has arrived (§10.7.4).
+- SRAM double-buffer / capacity (WAR): the existing Core enforces it through the
+  tag mechanism (`register_tag`/`set_tag_finish`/`mark_tag_used`, DMA.h) keyed by
+  `tag_idx_list`; our `(arg_id, tag_slot)` is that key. Reduction grouping ->
+  `accum_tag_idx_list` (the accumulation-loop index).
+
+**Build/wiring:** compile the bridge into TOGSim (it needs the conan deps;
+include flags are in `TOGSim/build/compile_commands.json`, notably
+`-D_GLIBCXX_USE_CXX11_ABI=0` and the `/root/.conan/data/{robin-hood,spdlog,fmt,
+yaml-cpp,boost}` include dirs). Add `togsim_runtime.cc` + the bridge to
+`TOGSim/CMakeLists.txt`. Either (a) build `TileGraph`/`Tile` nodes from TraceRec
+(maximal reuse of `Simulator`'s tile distribution + Core), or (b) build the
+`Instruction` DAG directly and drive a single Core. (a) is closer to legacy and
+gives multi-core for free.
+
+**Cycle-table feed:** reuse the gem5 `cycle_list` already computed in
+`extension_codecache` (so both paths stay cycle-consistent); pass it +
+`x_offset`/`w_offset` to `cycle_table.build_cycle_table`, dump the sidecar, and
+have the loader populate `EmitCtx.cyc/ovl`.
+
+**Validation:** same post-vcix fixture through both paths; compare the
+`Simulator`'s total cycles / DRAM traffic. Start with the 256^3 GEMM (static
+shape, single-buffer), then multi-tile / double-buffer kernels.
+
+This is a focused C++ integration (TOGSim build + TileGraph construction), not a
+small increment -- best executed as its own push; all the producer-side inputs
+(addresses, cycles, handles, core, tag_slot) are already in the trace.
+
+## Appendix A: current-state references
+
+- `TOGSim/include/DMA.h:27-115` — `tag_table` (overloaded `0/1/-1/>1`) +
+  `waiters`; `register_tag` / `set_tag_finish` / `register_tag_waiter` /
+  `mark_tag_used` (= init / signal / wait / consume).
+- `TOGSim/src/Core.cc:118-140, 214-324` — async-DMA signal path and the `BAR`
+  wait/consume path over the tag table.
+- `TOGSim/include/Instruction.h:40-48, 104-117` — `ready_counter` / `child_inst`
+  (the second, separate dependency mechanism) and the tag fields.
+- `PyTorchSimFrontend/mlir/passes/build_tog.py` — `TogBuilder.print_operation`
+  dispatch (`affine.for` / `memref.dma_start` / `memref.dma_wait` / `vcix.*`);
+  `_affine_for_bounds` (constant-bound resolution → static shape).
+- `PyTorchSimFrontend/mlir/passes/__init__.py`,
+  `PyTorchSimFrontend/mlir/passes/lower_to_llvm.py` — in-process Python MLIR pass
+  orchestration via the bindings; the functional Spike/LLVM path (unchanged).
+- `PyTorchSimFrontend/mlir/mlir_gemm_template.py` — kernel template emitting the
+  `affine.for` nest + `linalg.matmul` + `togsim.transfer` DMA ops.
+
+## 10. Explicit dependency-edge trace (revised dependency model)
+
+Supersedes the in-order / runtime-tag approach for expressing dependencies (the
+runtime tag slot survives only as the async-DMA arrival gate, §10.7.4). The
+trace is an explicit dataflow DAG: every op declares the producers of the data it
+consumes; the consumer (Core) does all resource scheduling. This model was
+reached after finding that (a) a flat in-order chain over-serializes parallel
+tiles, (b) the current TOG pass does NO dependency analysis (it emits a lexical
+loop tree + tags resolved at runtime by the C++ tag_table), and (c) compute I/O
+is collapsed away by build_skeleton, so dependencies must be recovered before
+the collapse.
+
+### 10.1 Representation
+
+The dependency edge is "consumer reads the buffer that producer wrote". As
+landed (ABI v9 onward; see STATUS "sec 10 explicit-edge bridge"), each op
+declares the **SRAM buffer ids** it reads and writes (`read_bufs` / `write_bufs`);
+the bridge builds the Instruction DAG by **last-writer per buffer**, scoped per
+work-item. There is no SSA event token threaded by the producer and no event
+handle returned by an op.
+
+- The edge source is data, not order: an op that reads buffer `b` gets an edge
+  from whatever op most recently wrote `b`.
+- No in-order chain, no runtime tag content-hash, no op-pattern heuristics.
+- Resource scheduling -- SA round-robin, double-buffer (<=N in flight), SRAM --
+  stays entirely in the Core. The trace never reasons about SRAM occupancy or
+  timing; it only states producer->consumer order.
+- One exception: an **async** DMA's write completes only at data arrival (DMA
+  response-complete), later than its issue, so its last-writer edge is routed
+  through an explicit `togsim.memory_barrier` that waits on the load's runtime
+  `(tag_id, tag_slot)` (§10.7.4). A synchronous DMA is blocking and needs no
+  barrier.
+
+> The sketch below uses an `out_ev = op(ctx, in_events[])` SSA notation to
+> *illustrate* the edges; it predates the landed `read_bufs`/`write_bufs` form
+> and is no longer the literal ABI. Read `in={…}` as "reads these buffers".
+
+Producer C++ form (events threaded like SSA; loop-carried = a reassigned var):
+
+    for mi, ni:                                  // PARALLEL: independent tiles
+      ev acc = compute(ctx, INIT, in={});
+      for ki:                                    // REDUCTION: loop-carried acc
+        ev a = dma_load(ctx, A[mi,ki], in={});
+        ev b = dma_load(ctx, B[ki,ni], in={});
+        ev w = compute(ctx, PRELOAD, in={b});
+        acc  = compute(ctx, MATMUL,  in={a,w,acc});  // new acc event each iter
+      dma_store(ctx, C[mi,ni], in={acc});
+
+The INIT dependency reaches every accumulate transitively through the acc chain
+(INIT -> mm_k0 -> mm_k1 -> store); each node only needs edges to its immediate
+producers. Different (mi,ni) -> separate acc chains -> independent -> parallel.
+
+### 10.2 Two dependency sources (both available pre-collapse in the TOG pass)
+
+A single "SRAM access" analysis is necessary but NOT sufficient -- verified on the
+GEMM post-vcix:
+
+| dependency | source | visible in SRAM? |
+|---|---|---|
+| load -> compute (DMA writes X_spad/W_spad, preload/matmul read) | SRAM last-writer per (buffer, slot) | yes |
+| accumulator chain (INIT writes Y_spad; the drain/epilogue read-modify-writes Y_spad; store reads it) | SRAM last-writer on Y_spad | yes |
+| **preload -> matmul** (preload loads weights into the systolic-array registers; matmul consumes them) | **vcix opcode FSM** (op1=preload pairs with the following op0=matmul; build_tog already tracks this via `current_preload_node`) | **no -- SA-internal, not a memref access** |
+
+So the analysis derives edges from (1) SRAM (buffer, slot) last-writer for loads
+and the accumulator, and (2) the vcix preload/matmul pairing for the SA-weight
+dependency. The slot is a concrete value at run time (the producer runs the
+loops), so matching is by value -- no static affine-overlap math.
+
+Key facts (256^3 GEMM, post-vcix): SRAM buffers are %0=X_spad(A), %1=W_spad(B),
+%2=Y_spad(acc/out). matmul (vcix op0) reads %0 only; preload (vcix op1) reads %1;
+the matmul does NOT read %1 (weights come from the SA), which is exactly why a
+memref-only analysis lets it run before the weight load -- the preload->matmul
+edge must come from the FSM. The accumulation is the epilogue's `transfer_read
+%2 + addf + transfer_write %2`, which IS SRAM-visible.
+
+### 10.3 Components changed (as landed)
+
+- TOG pass (`build_skeleton` + `dep_analysis`, on post-vcix before collapse): per
+  op, the read/write SRAM buffer ids + the preload->matmul pairing (folded as a
+  virtual `SA_WEIGHTS` buffer) -> the read/write buffer sets.
+- ABI (`togsim_runtime.h`): `togsim_dma`/`togsim_compute` carry
+  `read_bufs`/`write_bufs`; an async DMA also carries `(tag_id, tag_slot)` for the
+  `togsim.memory_barrier` pairing. No `in_events[]`, no returned event, no
+  `event_id`/handle-buffer mechanism.
+- `lower_to_emitc`: emits the buffer-id arrays on each op (and lowers
+  `togsim.memory_barrier`).
+- bridge: builds the Instruction DAG by last-writer per buffer (`add_child`);
+  no in-order chain, no runtime tag content-hash.
+- Core: unchanged (ready_counter DAG + SA pipeline + double-buffer already exist).
+
+### 10.4 Open decisions
+
+- Reduction timing: RESOLVED via the occupancy/latency split (§10.7). The choice
+  was between modeling the acc chain as completion-serial (conservative, simple)
+  and SA-pipelined (matches legacy's overlap); the split gives the pipelined form.
+- Buffer-id lifetime: the last-writer map is scoped per work-item (reset at each
+  `togsim_core_alloc`).
+
+### 10.5 Known issue: preload concurrency not bounded by #systolic-arrays
+
+Observed in the --trace_so run (256^3 GEMM): 4 PRELOADs execute concurrently
+(issue ~1028, finish ~1119-1122), but with num_systolic_array_per_core = 2 at
+most 2 should overlap, and two preloads on the same SA should serialize (one
+weight register file per array). Cause: a preload's overlapping_cycle equals its
+compute_cycle (91 == 91), so its occupancy (compute - overlapping) is ~0 and the
+Core's SA compute pipeline accepts unbounded back-to-back preloads.
+
+This is a PRE-EXISTING Core SA-model property, NOT introduced by the trace
+pipeline: the legacy build_tog path shows the same -- its 4 preloads issue at
+1215-1218 and finish 1306-1309 (4 concurrent). So it is not a trace-vs-legacy
+regression, but it is a real hardware-fidelity gap: the model should cap
+concurrent preloads at the systolic-array count and serialize same-SA preloads on
+the single weight buffer. Track separately from the trace work (affects both
+paths equally).
+
+### 10.6 Known issue: accumulator dependency over-serializes the reduction
+
+Observed in the --trace_so run: consecutive matmuls run 396 cycles apart (fully
+serial: issue 1120, 1516, 1912, ...), but physically matmuls that accumulate into
+the same output should PIPELINE on the systolic array (the partial sums stream
+through; consecutive matmuls overlap by overlapping_cycle, leaving ~128 cycles
+between them). They should NOT wait for the previous matmul to complete.
+
+Cause: the explicit-edge bridge builds a hard completion edge (add_child) for the
+Y_spad accumulator read-modify-write, so matmul_k1 waits matmul_k0's
+finish_instruction -> when it issues, k0 is already done -> the overlapping_cycle
+window is empty -> no pipeline. This is the mechanism behind the 4888 vs legacy
+2095 gap (legacy has NO inter-matmul edges, so its matmuls pipeline on 2 SAs:
+finishes 1704,1707 | 1832,1835 = +128 within an SA, +3 across SAs).
+
+So the accumulator (Y_spad) dependency is a PIPELINED/ordering dependency, not a
+completion barrier. add_child cannot express that. Fix direction: do not create a
+matmul->matmul completion edge through the accumulator -- the accumulation order
+is preserved implicitly by same-SA issue order + the SA pipeline
+(overlapping_cycle), exactly as legacy does. Keep the real barriers:
+load->compute, and last-matmul->store (the store needs the final accumulator).
+The asymmetry (a matmul consuming Y pipelines; the store consuming Y waits) is
+the crux to model -- likely "do not barrier when the consumer is a same-unit
+pipelined compute".
+
+Related to the same root as 10.5 (the SA/compute-pipeline occupancy model): both
+are about modeling the systolic array's streaming/pipelined execution rather than
+treating each compute as an atomic completion.
+
+### 10.7 Occupancy/latency split for pipelined computes (design + prototype)
+
+Idea (keeps add_child uniform): give each compute two completion points instead of
+one. A systolic-array op occupies its unit for occupancy = compute_cycle -
+overlapping_cycle (the initiation interval, ~128 for the matmul) and its result is
+ready at latency = compute_cycle (~395). Then add_child releases:
+  - a same-unit pipelined successor (next matmul, accumulator RMW) at OCCUPANCY
+    -> it starts ~128 later -> pipeline;
+  - a result consumer (the store reads the drained accumulator) at LATENCY
+    -> it waits for the full drain (tail).
+So a single add_child mechanism stays, but the release point depends on whether
+the edge is an occupancy-dependency (same-unit pipeline) or a latency-dependency
+(reads the result). This was also expected to fix 10.5 (a preload would occupy
+its SA for its occupancy, capping concurrent preloads at the SA count); 10.7.2
+shows it does not on its own, because the preload's occupancy is 0.
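+
+Combining this definition with the legacy `overlapping_cycle` formula from §9.8
+task 4 gives a compact form for the occupancy. This is a derivation sketch: `c`
+stands for compute_cycle, `v` for overlapping_cycle and `o` for `offset[type]`,
+and it assumes the cycle table in use was built with that formula.
+
+```math
+\begin{aligned}
+v &= \max(c - o,\ 0) \\
+\mathrm{occupancy} &= c - v = c - \max(c - o,\ 0) = \min(c,\ o)
+\end{aligned}
+```
+
+Under that assumption the matmul figures quoted above (`c` ~395, occupancy ~128)
+correspond to `o` ~128, and an op with `v = c` (the preload, 91 == 91) has
+occupancy 0.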
+
+Prototype (bridge stopgap, committed): skip the matmul->matmul accumulator edge
+(treat it as pipelined, not a barrier); keep every other edge. Result on 256^3
+GEMM: matmuls now issue back-to-back (1120-1127) and finish pipelined on 2 SAs
+(1515,1516 | 1643,1644 | 1771,1772 | 1899,1900 = +128 within an SA, +1 across),
+exactly like legacy. Total 4888 -> 2501 (vs legacy 2095 / 2608-incl-store; our
+matmuls finish at 1900 vs legacy 2091 -- our load chain is shorter). This
+confirms the accumulator dependency is pipelined. The clean replacement is the
+occupancy/latency split above in the Core so add_child stays uniform and the
+bridge needs no matmul-specific skip.
+
+#### 10.7.1 preload->matmul is also an occupancy dependency (preload fully overlaps)
+
+The preload->matmul edge is the SAME kind as matmul->matmul: a same-SA pipeline
+(occupancy) dependency, not a latency barrier. A preload's overlapping_cycle
+equals its compute_cycle (91 == 91), so its occupancy = compute - overlapping = 0
+-- it fully overlaps. With the occupancy/latency split, the matmul (successor)
+released at the preload's OCCUPANCY (= preload issue + 0) starts immediately, so
+the preload's 91-cycle latency is entirely hidden under the matmul.
+
+In the current prototype the preload->matmul edge is still an add_child barrier
+(only matmul->matmul was skipped), so the matmul issues at 1120 -- right after the
+preload finishes at ~1119 -- paying the full 91. The bridge cannot cleanly skip
+preload->matmul (skipping it outright loses the ordering: the matmul could be
+ready before the preload and reach the SA without weights). So preload-overlap is
+another reason the proper fix is the Core occupancy/latency split (10.7), which
+releases the matmul at the preload's occupancy (0) while keeping the issue order.
+
+Net: the Core occupancy/latency split targets three notes at once -- 10.5
+(concurrent preloads capped at SA count via preload occupancy; not achieved by
+the split alone, see 10.7.2), 10.6 (matmuls pipeline), 10.7.1 (preload fully
+overlaps) -- all instances of "model the SA as a pipeline (occupancy + latency)
+instead of atomic completion".
+
+#### 10.7.2 Occupancy/latency split: implemented + POC result
+
+Implemented uniformly: Instruction gains add_pipeline_child /
+release_pipeline_children; the Core releases an op's pipeline children when it
+ISSUES (enters the SA pipeline), and its normal children at finish. The bridge
+classifies edges: a preload/matmul -> matmul edge is occupancy
+(add_pipeline_child), everything else is latency (add_child). No matmul-specific
+skip heuristic.
+
+256^3 GEMM result: preloads issue 1028-1031, matmuls issue 1032-1039 (right after
+the preloads ISSUE, not after they finish at ~1119 -> preload fully overlaps), and
+matmuls finish pipelined on 2 SAs (1427,1428 | 1555,1556 | 1683,1684 | 1811,1812
+= +128 within an SA, +1 across). Total 4888 -> 2501 (matmul-skip) -> 2413
+(occupancy/latency). Legacy is 2095 (matmul completion; our matmuls finish at 1812
+vs legacy 2091 -- shorter load chain -- and our 2413 includes the store).
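+
+The reported finish times fit a simple pipeline timeline. As a reading of the
+numbers above (assuming latency `L = 395` and initiation interval `II = 128`,
+the approximate values from 10.7, and taking `t0 = 1032` as the first matmul's
+issue), the `j`-th matmul on the first SA finishes at:
+
+```math
+\begin{aligned}
+t_{\mathrm{finish}}(j) &= t_0 + L + j \cdot II \\
+t_{\mathrm{finish}}(0) &= 1032 + 395 = 1427 \\
+t_{\mathrm{finish}}(3) &= 1032 + 395 + 3 \cdot 128 = 1811
+\end{aligned}
+```
+
+The second SA follows the same sequence one cycle later (1428 ... 1812).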
+
+Note on 10.5 (preload concurrency): NOT fixed by this alone. A preload's
+overlapping_cycle == compute_cycle, so its occupancy is 0 -> it does not hold the
+SA -> 4 preloads still issue concurrently (1028-1031). Capping concurrent preloads
+at the SA count needs the preload to have a non-zero occupancy reflecting the
+weight-load time (a cycle-model input), separate from this edge-release change.
+
+#### 10.7.3 Explicit compute fence: implemented (COMPUTE_BAR), BAR -> MEMORY_BAR
+
+The compute fence is now a first-class trace entity, not a bridge-internal edge:
+  - togsim_ops: `togsim.compute_barrier`; ABI v10 adds `togsim_compute_barrier(ctx)`.
+  - build_skeleton emits a `togsim.compute_barrier` before each store DMA; lower_to_emitc
+    lowers it; the runtime records a COMPUTE_BAR TraceRec.
+  - The two barrier kinds are now named distinctly: Opcode::BAR -> Opcode::MEMORY_BAR
+    (the DMA/tag memory barrier, unchanged) and a new Opcode::COMPUTE_BAR.
+  - Core: COMPUTE_BAR finishes only once ALL compute pipelines drain (every systolic
+    array + the VPU empty); until then it stays in the ready queue (re-checked each
+    cycle). Its ready_counter is gated (pipeline-child of the outstanding async
+    computes) so it is only evaluated after they have ISSUED into the pipeline.
+  - bridge: a COMPUTE_BAR record -> a COMPUTE_BAR Instruction (pipeline-child of the
+    outstanding async matmuls); the following store add_child's the fence.
+
+256^3 GEMM: trace shows `... matmul x N -> COMPUTE_BAR -> STORE`; the COMPUTE_BAR
+instruction finishes at 1813 (after the SAs drain, last matmul ~1812), the store
+issues at 1814. Total 2414 (matches the implicit-flush 2413 + the 1-cycle fence).
+Multiple SAs handled (drains all _sa_compute_pipeline[*]). 7 python tests pass.
+
+#### 10.7.4 load->compute uses MEMORY_BAR (async DMA data wait); fixes a real bug
+
+Bug found: a consumer reading an async-loaded buffer ran BEFORE the data arrived
+(preload issued @1028 but its weight load W finished @1131). Cause: a raw
+add_child on an async DMA fires at the load's ISSUE-complete (program flow), not
+its DATA-ready (resp-complete) -- the async DMA signals data only via the tag
+table (set_tag_finish at resp-complete). So the buffer-edge model alone cannot
+gate compute on async-loaded data.
+
+Fix (symmetric with COMPUTE_BAR): route async load -> compute through a MEMORY_BAR
+that carries the load's tag. The load registers the tag at issue; the MEMORY_BAR
+(made ready after the load issues, via add_child) parks on the tag and is woken at
+resp-complete; consumers depend on the MEMORY_BAR (last_writer[buf] = bar). So the
+memory-arrival notification (set_tag_finish) connects to compute via the existing
+tag mechanism -- now explicit in the trace as a MEMORY_BAR instruction.
+
+256^3 GEMM: preload now issues @1132 (after W resp-done @1131), correct. Total
+2414 (buggy/optimistic) -> 2518 (correct: compute waits for the slow weight load).
+Both barriers are explicit and symmetric: MEMORY_BAR (DMA tag, resp-complete) for
+load->compute, COMPUTE_BAR (SA pipeline drain) for compute->store.
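+
+A quick consistency check on these figures (an interpretation, not a statement
+from the run logs): the total grows by exactly the preload's issue delay, which
+is what one would expect if the weight load sits on the critical path.
+
+```math
+\begin{aligned}
+\Delta t_{\mathrm{preload}} &= 1132 - 1028 = 104 \\
+2414 + 104 &= 2518
+\end{aligned}
+```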
+
+## 11. Remaining work + next-session handoff
+
+### 11.1 Status
+
+PR #267 (feature/togsim-cpp-trace -> develop). The trace pipeline runs end-to-end
+through the REAL Simulator/Core on a 256^3 GEMM via `--trace_so`, with an explicit
+dataflow dependency model (SRAM last-writer + vcix FSM) and two explicit barriers:
+MEMORY_BAR for async load->compute data (paired to its DMA by the runtime tag
+slot `(tag_id, tag_slot)`) and COMPUTE_BAR for the SA drain before a store.
+The async-DMA sync is the runtime tag slot, NOT a compile-time event-id (ABI
+bumped to v11; the event-id / event-handle / wait/signal design was removed).
+Legacy ONNX-TOG path kept + DEPRECATED. All togsim python tests pass; TOGSim
+builds.
+
+**Validation (256^3 GEMM, real gem5 cycle table):** through the real Core the
+trace path totals **2518 cycles** vs the legacy path's **2698** on the same
+table. The earlier 10.x notes (with a stub table) report different absolute
+numbers; 2518-vs-2698 is the current real-table figure.
+
+### 11.2 Remaining work (priority order)
+
+1. **Cycle-equivalence closure.** Characterize/close the trace-vs-legacy gap on the
+   256^3 GEMM with the SAME gem5 cycle_list. Sub-items 2-3 are the main drivers.
+2. **Preload concurrency cap (sec 10.5).** 4 preloads run concurrently though there
+   are 2 SAs, because a preload's occupancy is 0 (overlapping_cycle == compute).
+   Give the preload a non-zero occupancy (the weight-load time) so concurrent
+   preloads are capped at the SA count. Pre-existing in BOTH paths.
+3. **Robust gem5 cycle_list wiring.** The extension_codecache `TORCHSIM_DUMP_TRACE_SO=1`
+   hook dumps trace.so + trace_cycles.tsv from the real cycle_list, but is flaky
+   under concurrent compiles (saw cycle_list==[] once). Make it robust (or force a
+   single-thread compile), so `--trace_so --cycle_table` uses real per-tile cycles.
+4. **Parallel output tiles / multi-core.** One dispatch per work-item today; for
+   distributing independent output tiles across cores, emit a dispatch per parallel
+   (m_sub, n_sub) tile. The inner sub-tile loops are currently unlabeled (only the
+   macro loops carry subtile/accumulation), so the axis role must be recovered.
+5. **Cleanup.** Already done in v11: the obsolete WAIT/SIGNAL trace records and
+   the event-handle buffer are dropped. Still open: COMPUTE_BAR logs finish twice
+   (cosmetic), and the preload node mis-attributes an X_spad read (build_tog
+   `_steal_leading_transfer_read`) -> a harmless extra edge.
+6. **P5 op coverage.** Only GEMM is exercised. Extend to conv / SDPA / vector / pool.
+7. **P4.** Symbolic/dynamic shape; streaming sink (coroutine, alloc-blocks).
+8. **Two-function outline** (togsim_emit_tile) -- deferred (trace identical).
+9. **Retire the legacy ONNX-TOG path** once the trace path is stable.
+
+### 11.3 Next-session context
+
+- Worktree `/workspace/PyTorchSim-cpptrace`, branch `feature/togsim-cpp-trace`,
+  PR #267 -> develop. The branch is rebased ONTO develop (the retire-floormod base
+  was dropped -- develop already has it). `source .envrc` in the worktree.
+- Build TOGSim: submodules are init'd; `cd TOGSim/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make -j$(nproc)`.
+  The Simulator target has ENABLE_EXPORTS (so a dlopen'd .so resolves the togsim_*
+  callbacks); togsim_runtime.cc + togsim_trace_bridge.cc are picked up by the src glob.
+- Run the trace path:
+  `python -m PyTorchSimFrontend.mlir.passes.lower_to_emitc <postvcix.mlir> --so trace.so [--emit-cpp x.cpp]`
+  then `bin/Simulator --config <yml> --trace_so trace.so [--cycle_table cyc.tsv] [--log_level trace]`.
+- Get a post-vcix fixture: a real torch.compile GEMM with `TORCHSIM_DUMP_MLIR_IR=1
+  pytorchsim_functional_mode=False` writes `outputs/<hash>/..._sample_postvcix.mlir`.
+  Real cycle data + legacy reference: add `TORCHSIM_DUMP_TRACE_SO=1` to also dump
+  trace.so + trace_cycles.tsv in `outputs/<hash>/` (see 11.2 #3). (Prior /tmp
+  fixtures are ephemeral -- regenerate.)
+- Env (.envrc): gem5 `/gem5/release/gem5.opt`, spike `/release/bin/spike`,
+  LLVM `/riscv-llvm/bin`.
+- Tests: `TOGSIM_SKELETON_FIXTURE=<postvcix.mlir> pytest tests/test_togsim_{skeleton,emitc,runtime}.py`.
+  These are NOT in the CI allowlist (`.github/workflows/pytorchsim_test.yml`) -- register them to gate CI.
+- Key files: passes `build_skeleton.py`, `lower_to_emitc.py`, `dep_analysis.py`,
+  `cycle_table.py`, `togsim_ops.py`; `TOGSim/include/{togsim_runtime.h, togsim_loader.h, togsim_trace_bridge.h}`,
+  `TOGSim/src/{togsim_runtime.cc, togsim_trace_bridge.cc}`; `Core.cc`/`Instruction.{h,cc}`
+  (COMPUTE_BAR + MEMORY_BAR rename); `main.cc` (--trace_so); `extension_codecache.py`
+  (TORCHSIM_DUMP_TRACE_SO hook).
+- Local-only backups of the pre-squash/pre-rebase 28-commit history: tag
+  `pr-backup-ccfea43e`, branch `backup-presquash-3cfd4a3f` (NOT pushed).
diff --git a/docs/design/togsim_cpp_trace_HANDOFF.md b/docs/design/togsim_cpp_trace_HANDOFF.md
new file mode 100644
index 00000000..2c5d5bd6
--- /dev/null
+++ b/docs/design/togsim_cpp_trace_HANDOFF.md
@@ -0,0 +1,191 @@
+# Handoff — TOGSim C++ Trace Generation
+
+Continuation notes for picking this work up in a fresh session. Read alongside
+the full design: [`togsim_cpp_trace.md`](./togsim_cpp_trace.md) and the snapshot
+[`togsim_cpp_trace_STATUS.md`](./togsim_cpp_trace_STATUS.md).
+
+## Goal (one line)
+
+Replace the timing-path TOG producer (MLIR -> Python-dict -> ONNX -> C++ parser)
+with a compiled, shape-parametric trace producer (MLIR -> EmitC -> C++ -> `.so`);
+TOGSim's timing core is preserved.
+
+## Current state (one paragraph)
+
+The trace pipeline is implemented end-to-end and runs through the REAL
+Simulator/Core on a 256^3 GEMM (`--trace_so`). Dependencies are an explicit
+dataflow DAG (SRAM last-writer per buffer + the vcix preload/matmul FSM). An
+asynchronous DMA is synced to the consumer of its data by the **runtime tag
+slot** `(tag_id, tag_slot)` through an explicit `togsim.memory_barrier` (lowered
+from the source `memref.dma_wait`); a sync DMA is blocking. ABI is **v11**. An
+earlier design used a compile-time `event_id` / heap event handle with
+`wait`/`signal`; it was removed because one static DMA op runs once per loop
+iteration into a different tag slot, which a compile-time id cannot pair per
+iteration. **Validation:** on the 256^3 GEMM with the real gem5 cycle table, the
+trace path totals **2518 cycles** vs the legacy path's **2698** through the real
+Core; all togsim python tests pass; TOGSim builds.
+
+## Branch
+
+- Work branch: `feature/togsim-cpp-trace` (PR #267 -> develop)
+
+## Status
+
+| Milestone | State |
+|---|---|
+| P0 — ABI header + op vocabulary | DONE (ABI evolved to v11) |
+| P1 — `build_skeleton` pass | DONE, verified — runs on a real GEMM fixture, module verifies, compute grouping + dma/barrier counts match the legacy `build_tog` TOG. |
+| P2 — togsim -> emitc -> cpp -> .so | DONE — `lower_to_emitc.py` builds EmitC, `mlir-translate` -> C++, `g++ -shared` -> `.so`; validated by build/symbol checks and a dlopen run harness. |
+| P3 — TOGSim loader + runtime + cycle table; real-Core run | DONE — runs end-to-end through the real Simulator/Core (256^3 GEMM, `--trace_so`). Runtime tag-slot pairing (ABI v11, `togsim.memory_barrier`), explicit dataflow DAG (read/write_bufs last-writer + vcix FSM), real tile addresses, cycle_table. `togsim_runtime.cc`/`togsim_loader.h`/`togsim_trace_bridge.cc` feed TraceRec into the real Core. Cycle comparison vs legacy on the real gem5 table: trace 2518 vs legacy 2698. Legacy ONNX-TOG path DEPRECATED in place, kept live. |
+| P4 — symbolic-bound dynamic shape, streaming sink | not started |
+| P5 — op-family migration (conv/SDPA/vector) | not started |
+
+### Async-DMA sync: runtime tag slot (current), event-id (removed)
+
+The original P1 threaded the dma->wait dependency as an SSA `!togsim.event`
+value, which fails `module.verify()` on a software-pipelined kernel (the
+`togsim.dma` sits in the prefetch loop nest, its consumer in a sibling compute
+nest, so the value does not dominate its use). An intermediate fix used a
+compile-time `event_id` attribute (later a heap-allocated event handle). Both
+were **removed**: one static `togsim.dma` op executes once per loop iteration
+into a *different* runtime tag slot, so a compile-time id (one per static op)
+cannot pair iteration i's DMA with iteration i's wait.
+
+Current mechanism (ABI v11): `togsim.dma` carries `tag_id` (its tag-memref
+identity) plus the runtime tag-index operand `%tag[%idx]` and returns void. The
+source `memref.dma_wait` is mapped through to an explicit
+`togsim.memory_barrier {tag_id, write_bufs}` carrying the runtime tag index. At
+runtime an async DMA and its barrier are paired by `(tag_id, tag_slot)` through
+the existing Core tag table (`prepare_tag_key`/`set_tag_finish`/
+`register_tag_waiter`): the DMA signals at data arrival, the barrier waits, and
+the barrier becomes the loaded buffer's last-writer so consumers gate on
+arrival. (The one remaining auto-inserted barrier is `togsim.compute_barrier`,
+the compute fence before a store — marked FIXME to become explicit later.)
+
+### P2 decisions
+
+* **ABI v11 (runtime tag slot).** `togsim_dma` returns void and carries
+  `(is_async, tag_id, tag_slot, read_bufs, write_bufs)`. The
+  `togsim_memory_barrier(tag_id, tag_slot, write_bufs)` is the explicit
+  async-DMA sync. No `event_id`, no event handle, no `wait`/`signal`.
+* **C4 drives the upstream EmitC conversion passes** (it does not hand-build
+  EmitC). It only does the parts upstream cannot: rewrite the *unregistered*
+  `togsim.*` ops to `emitc.call_opaque` and rewrite the kernel signature to the
+  ABI form. Then it runs, in-process (`mlir.passmanager`),
+  `func.func(lower-affine), convert-scf-to-emitc, convert-arith-to-emitc,
+  convert-func-to-emitc`. One local fixup: in this LLVM 20 build
+  `convert-scf-to-emitc` emits `emitc.for` with `index` bounds, so
+  `convert-arith-to-emitc` (constants -> `!emitc.size_t`) leaves
+  `unrealized_conversion_cast` on the bounds that nothing folds and
+  `mlir-to-cpp` can't print (design sec 8 risk). `_fold_for_bound_casts`
+  rewrites those bound constants to `index`-typed `emitc.constant`, clearing
+  the casts. (`emitc.for` *does* accept `size_t` bounds with an explicit
+  `: !emitc.size_t`, but keeping the bounds `index` avoids retyping the IV.)
+* **Addresses (wired in P3, approach A):** `togsim_dma` passes `(arg_id, element
+  offset)` with the offset computed from the loop IVs; the runtime adds the
+  tensor base. `togsim.compute` is keyed by `tile_id` for cost.
+
+## Files (key)
+
+- `TOGSim/include/togsim_runtime.h` — extern "C" ABI v11 (`togsim_dma`,
+  `togsim_memory_barrier`, `togsim_compute`, `togsim_compute_barrier`,
+  `togsim_core_alloc`, `togsim_emit` entry, `TOGSIM_ABI_VERSION`, opaque
+  `EmitCtx`).
+- `PyTorchSimFrontend/mlir/passes/togsim_ops.py` — single source of truth for the
+  skeleton+API MLIR vocabulary (op names, attr keys, op->callee map).
+- `PyTorchSimFrontend/mlir/passes/build_skeleton.py` + `dep_analysis.py` — the P1
+  pass + dependency analysis (reuse build_tog's `TogBuilder`/`_build`; map
+  dma_start->togsim.dma, dma_wait->togsim.memory_barrier, attach read/write_bufs;
+  use-based DCE).
+- `TOGSim/src/togsim_runtime.cc`, `TOGSim/include/togsim_loader.h`,
+  `TOGSim/src/togsim_trace_bridge.cc` — C6 runtime, dlopen loader, and the bridge
+  that feeds the recorded TraceRec stream into the real Core.
+- `tests/test_togsim_skeleton.py` — `test_togsim_ops_contract` (runs anywhere) +
+  `test_build_skeleton_on_fixture` (gated on bindings + a fixture).
+- `PyTorchSimFrontend/mlir/passes/lower_to_emitc.py` — the P2/C4 pass: skeleton
+  module -> EmitC `togsim_emit` -> C++ (`mlir-translate`) -> `.so` (`g++`).
+  Entry points: `lower_to_emitc(module)`, `build_trace_so(postvcix_path, so)`,
+  and a `__main__` CLI (`--so`, `--emit-cpp`, `--include-dir`).
+- `tests/test_togsim_emitc.py` — `test_build_trace_so` (EmitC + symbol checks) +
+  `test_trace_so_runs` (dlopen the `.so` against a stub runtime, run it). Gated
+  on bindings + `mlir-translate` + a C++ compiler + the fixture.
+
+## Reproduce P1 + P2 (one GEMM kernel)
+
+```bash
+# 1. post-vcix fixture: compile a GEMM (needs the built PyTorchSimDevice .so).
+export pytorchsim_functional_mode=False
+python tests/ops/gemm/test_matmul.py
+FIX=$(find "${TORCHSIM_DUMP_PATH:-.}" -name '*_postvcix.mlir' | head -1)
+# build_skeleton/lower_to_emitc only need the .mlir + bindings, not torch, so a
+# fixture compiled in any worktree is fine.
+
+# 2. P1: skeleton+API MLIR.
+python -m PyTorchSimFrontend.mlir.passes.build_skeleton "$FIX" --out /tmp/skel.mlir
+#   stderr: "skeleton: compute=.. dma=.. memory_barrier=.."
+
+# 3. P2: skeleton -> EmitC -> C++ -> .so (reads skel from $FIX via build_skeleton).
+python -m PyTorchSimFrontend.mlir.passes.lower_to_emitc "$FIX" \
+    --so /tmp/trace.so --emit-cpp /tmp/trace.cpp
+nm -D /tmp/trace.so | grep togsim     # togsim_emit = T; togsim_dma/memory_barrier/compute = U
+
+# 4. tests
+TOGSIM_SKELETON_FIXTURE="$FIX" python -m pytest \
+    tests/test_togsim_skeleton.py tests/test_togsim_emitc.py -q
+```
+
+Note: `mlir-opt`/`mlir-translate` live in `$TORCHSIM_LLVM_PATH` but are not on
+`$PATH`; `lower_to_emitc` resolves `mlir-translate` from `TORCHSIM_LLVM_PATH`.
+
+## Next steps (P3 is done; remaining work)
+
+The producer is wired into TOGSim and runs through the real Core (trace 2518 vs
+legacy 2698 on the 256^3 GEMM). The parallelism / reduction / core-dispatch
+design is in `togsim_cpp_trace.md` §9. Summary: the producer is core-transparent
+(knows nothing about `num_cores`); it enumerates parallel output-tile work-items
+and calls `togsim_core_alloc` at each work-item boundary. Parallel = independent
+work-items; reduction = program order inside one work-item; core binding = the
+`togsim_core_alloc` runtime callback (policy lives in TOGSim). Async-DMA data
+sync = the runtime `(tag_id, tag_slot)` via `togsim.memory_barrier`. `num_cores`
+is extrinsic so it is never baked; vlane/tile sizes are intrinsic and stay baked.
+Split-K is a deferred exception.
+
+Remaining (priority order; full list in STATUS §7 and design §11.2):
+
+- **SRAM tile lifecycle (double-buffer throttle).** `togsim.dma` carries
+  `tag_slot` (the SRAM slot key); the consumer must use it to throttle in-flight
+  loads to the buffer depth on multi-tile / double-buffered kernels.
+- **Preload concurrency cap (design §10.5).** Give a preload a non-zero occupancy
+  (its weight-load time) so concurrent preloads are capped at the SA count.
+  Pre-existing in BOTH paths.
+- **Per-output-tile dispatch / multi-core.** One `togsim_core_alloc` per
+  work-item today; distribute independent output tiles across cores.
+- **Robust gem5 cycle_list wiring.** The extension_codecache
+  `TORCHSIM_DUMP_TRACE_SO=1` hook is flaky under concurrent compiles.
+- **P5 op coverage** (conv/SDPA/vector) and **P4** (symbolic shape, streaming
+  sink), then **retire the legacy ONNX-TOG path**.
+
+Full design: `togsim_cpp_trace.md` §5-11.
+
+## Environment requirements (for the new session)
+
+- MLIR Python bindings importable (`import mlir.ir`). They ship with the LLVM
+  build at `${TORCHSIM_LLVM_PATH%/bin}/python_packages/mlir_core`; the CI docker
+  image `ghcr.io/psal-postech/torchsim-ci` has them. `passes/__init__` also
+  derives the path from `TORCHSIM_LLVM_PATH`.
+- `pytest` to run the test files directly (`pip install pytest` if absent).
+- `mlir-translate` (in `$TORCHSIM_LLVM_PATH`) and a host C++ compiler (`g++`/
+  `$CXX`) for the P2 `.so` path.
+- TOGSim build (for `--trace_so`): `cd TOGSim/build && cmake ..
+  -DCMAKE_BUILD_TYPE=Release && make -j$(nproc)`. The Simulator target has
+  ENABLE_EXPORTS so a dlopen'd `.so` resolves the `togsim_*` callbacks.
+- When iterating on passes, clear the codegen caches (`$TORCHSIM_DUMP_PATH`,
+  default `outputs/`) between runs — see CLAUDE.md "Codegen changes are sticky".
+
+## Verification that already passes anywhere (sanity)
+
+```bash
+python -m py_compile PyTorchSimFrontend/mlir/passes/build_skeleton.py \
+    PyTorchSimFrontend/mlir/passes/togsim_ops.py tests/test_togsim_skeleton.py
+# contract test (no bindings needed): see test_togsim_ops_contract
+```
diff --git a/docs/design/togsim_cpp_trace_STATUS.md b/docs/design/togsim_cpp_trace_STATUS.md
new file mode 100644
index 00000000..1fb0bf1c
--- /dev/null
+++ b/docs/design/togsim_cpp_trace_STATUS.md
@@ -0,0 +1,219 @@
+# TOGSim C++ Trace Generation — Status Report
+
+Branch: `feature/togsim-cpp-trace`. Design of record: `togsim_cpp_trace.md` (esp.
+§9); continuation notes: `togsim_cpp_trace_HANDOFF.md`. This file is a snapshot of
+progress.
+
+## 1. Goal
+
+Replace the timing-path TOG producer (`MLIR -> Python dict -> ONNX -> C++
+TileGraphParser`) with a compiled, shape-parametric trace producer
+(`MLIR -> skeleton -> EmitC -> C++ -> .so`). TOGSim's timing core is preserved;
+only the producer of its input changes. The key idea: do not flatten the TOG;
+instead **run** a compiled C++ producer that emits the trace as a stream of API
+calls.
+
+Each API call emits one trace record = one modeled instruction, fed to the
+existing timing Core. Dependencies are an explicit dataflow DAG (SRAM
+last-writer per buffer + the vcix preload/matmul FSM). An asynchronous DMA is
+synced to the consumer of its data by the **runtime tag slot** `(tag_id,
+tag_slot)` through an explicit `togsim.memory_barrier` (ABI v11). An earlier
+design used a compile-time `event_id` / event handle with `wait`/`signal`; that
+was removed because one static DMA op runs once per loop iteration into a
+different tag slot, which a single compile-time id cannot pair per iteration.
+
+## 2. Pipeline
+
+```
+post-vcix .mlir (torch.compile output)
+  | build_skeleton.py + dep_analysis.py (P1)  keep loops;
+  |   memref.dma_start -> togsim.dma(tag_id, %tag[%idx], is_async, read/write_bufs);
+  |   memref.dma_wait  -> togsim.memory_barrier(tag_id, tag_slot, write_bufs);
+  |   compute block    -> togsim.compute; DCE the rest
+  v
+skeleton+API MLIR
+  | lower_to_emitc.py (P2/C4)  togsim.* -> emitc.call_opaque; ABI signature; drive upstream
+  |                            lower-affine/convert-*-to-emitc; _retype_for_to_size_t fixups
+  v
+EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
+                                                 | TOGSim loader (C6): dlopen + EmitCtx callbacks
+                                                 v
+                                       TraceRec stream (materializing sink)
+                                                 | togsim_trace_bridge.cc -> existing Core timing
+                                                 v
+                                       cycles / DRAM traffic (real Core)
+```
+
+Side artifact: cycle table `tile_id -> (cycle, overlapping_cycle)` (cycle_table.py).
+
+## 3. Milestones
+
+| | State |
+|---|---|
+| P0 ABI header + togsim vocabulary | DONE (ABI evolved to v11) |
+| P1 build_skeleton | DONE, verified (compute/dma/barrier match legacy TOG) |
+| P2 lower_to_emitc -> .so | DONE (real GEMM .so built and run) |
+| P3 loader/runtime + cycle table + real-Core run | DONE (runs end-to-end through the real Simulator/Core; below) |
+| P4 symbolic/dynamic shape, streaming sink | TODO |
+| P5 op-family migration (conv/SDPA/vector) | TODO |
+
+P3 detail:
+
+| | State |
+|---|---|
+| ABI (core_alloc, runtime tag pairing, dma address) | DONE (v11) |
+| work-item boundary (togsim_core_alloc) | DONE |
+| real tile DRAM addresses (approach A) | DONE, verified on 1024^3 |
+| cycle_table builder (cycle + overlapping) | DONE |
+| async DMA <-> consumer sync (runtime tag slot, memory_barrier) | DONE |
+| explicit dataflow DAG (read/write_bufs last-writer) | DONE |
+| C6 runtime + dlopen loader (materializing) | DONE |
+| TraceRec -> existing Core timing feed | DONE (runs end-to-end through real Core) |
+| cycle comparison vs build_tog (real gem5 table) | DONE: trace 2518 vs legacy 2698 |
+| SRAM tile lifecycle / preload-occupancy refinements | partial (see §7) |
+
+### TraceRec -> Core: now running end-to-end
+
+`TOGSim/src/togsim_trace_bridge.cc` (`trace_to_tilegraph`) + a `--trace_so` mode
+in `main.cc` feed the recorded trace into the REAL Simulator/Core. The producer
+`.so` is `dlopen`'d (the Simulator is built with ENABLE_EXPORTS so the `.so`
+resolves the `togsim_*` callbacks back into the binary), its trace recorded, then
+bridged to a `TileGraph`: one `TileSubGraph` per work-item (core_alloc marker)
+bound to its core, one `Tile` of MOVIN/MOVOUT/COMP/MEMORY_BAR/COMPUTE_BAR
+`Instruction`s. Dependency edges are built by **last-writer per SRAM buffer**
+(`read_bufs`/`write_bufs`); an async load's last-writer is the MEMORY_BAR paired
+to it by the runtime `(tag_id, tag_slot)` (so a consumer waits for actual data
+arrival), and a COMPUTE_BAR drains the systolic-array pipeline before a store.
+Build it (`cd TOGSim/build && cmake .. && make`) and run:
+`bin/Simulator --config <yml> --trace_so gemm_trace.so`.
+
+### Cycle comparison vs legacy build_tog (256^3 GEMM, real gem5 table)
+
+Ran the same kernel through the legacy path (torch.compile -> gem5 -> build_tog
+-> Simulator) and the trace path (the same post-vcix IR -> trace .so + the SAME
+gem5 cycle_list -> --trace_so), both through the REAL Core. extension_codecache
+has an opt-in TORCHSIM_DUMP_TRACE_SO=1 hook that dumps trace.so + trace_cycles.tsv
+from the same cycle_list/offsets (best-effort, never breaks the legacy path).
+Compute-unit routing uses compute_type, and the tag key uses a per-tensor addr_id
+(set_addr_name(arg_id) + prepare_tag_key) so A and B don't collide on tag_slot 0.
+
+**Result: the trace path totals 2518 cycles vs the legacy path's 2698 on the
+same gem5 cycle table.** All togsim python tests pass; TOGSim builds. Compute
+work and DRAM traffic match; the remaining difference is scheduling (the
+explicit dataflow DAG plus the occupancy/latency SA-pipeline model overlap
+differently than legacy's per-iteration BARs).
+
+## 4. Components
+
+- `build_skeleton.py` + `dep_analysis.py` — in-place reduction of post-vcix to
+  "loop skeleton + togsim.* API"; `memref.dma_wait` mapped through to an explicit
+  `togsim.memory_barrier`; read/write SRAM buffer ids attached; reuses legacy
+  `TogBuilder` traversal.
+- `lower_to_emitc.py` — skeleton -> EmitC by driving the upstream conversion
+  passes plus `_retype_for_to_size_t` (clears residual index<->size_t casts).
+  `togsim_dma` carries `(tag_id, runtime tag-index, is_async, read/write_bufs)`
+  and returns void; `togsim_memory_barrier` carries `(tag_id, tag_slot,
+  write_bufs)`; `togsim_core_alloc` inserted at the work-item boundary.
+- `cycle_table.py` — `tile_id -> (cycle, overlapping)`, overlapping
+  `= max(cycle - offset[type], 0)` (legacy formula); JSON sidecar.
+- `TOGSim/src/togsim_runtime.cc` + `TOGSim/include/togsim_loader.h` — C6 runtime
+  and `run_producer` (dlopen -> togsim_emit -> records TraceRec). dma resolves
+  `base[arg] + offset*elem_bytes` and signals its tag at data arrival; the
+  matching memory_barrier waits on the `(tag_id, tag_slot)`; compute looks up the
+  cycle table; core_alloc round-robins a runtime core pool.
+- `TOGSim/src/togsim_trace_bridge.cc` — bridges the recorded TraceRec stream into
+  the existing `TileGraph`/`Instruction` form for the real Core.
+- `TOGSim/include/togsim_runtime.h` — producer ABI v11.
+
+## 5. Locked design decisions
+
+1. **Trace is a DAG, not a time order.** The consumer (existing Core) schedules
+   per-core timelines from: op kind -> hardware unit, SRAM-buffer last-writer ->
+   data dependency, same-core -> serial (reduction accumulate), SRAM slot ->
+   capacity. Emission order != execution order.
+2. **Async-DMA sync = runtime tag slot.** A `togsim.dma` carries `(tag_id,
+   tag_slot)`; the matching `togsim.memory_barrier` (lowered from the source
+   `memref.dma_wait`) waits on the same pair through the existing Core tag table
+   (`prepare_tag_key`/`set_tag_finish`/`register_tag_waiter`). The DMA signals at
+   data arrival; the barrier becomes the loaded buffer's last-writer so consumers
+   gate on arrival. A sync DMA is blocking (no barrier). This replaced an earlier
+   `event_id` / heap event-handle design, which could not pair a DMA op with its
+   wait per loop iteration (one static op, a different tag slot each iteration).
+   No `calc_tag` content-hash, no magic values, no FIFO.
+3. **Core = runtime allocation.** `togsim_core_alloc` returns a core id (no free).
+   `num_cores` is never baked into the producer -- it is the runtime pool size.
+   A work-item's reduction stays on one core (sticky); different work-items get
+   different cores -> multi-core.
+4. **Intrinsic baked / extrinsic parametric.** vlane / tile sizes / the systolic
+   array define the instructions (baked); num_cores only distributes them (runtime).
+5. **Execution model:** P3 materializing (run producer to completion -> record ->
+   feed existing Core); P4 streaming (coroutine, alloc-blocks on resources).
+6. **Double-buffer = resource constraint.** Producer emits everything (no skew);
+   capacity is the consumer's throttle. Requires SRAM tile lifecycle
+   (alloc/free) in the trace -- the currently missing piece.
+
+## 6. Verification (reproducible)
+
+- togsim python tests pass: skeleton (contract + fixture), emitc (build + dlopen
+  run), cycle_table, runtime. TOGSim builds.
+- 256^3 GEMM: core_alloc -> dma(tag_id, tag_slot) -> memory_barrier(tag_id,
+  tag_slot) -> compute; addresses A/B/C resolved (offset 0, single tile).
+- 1024^3 GEMM: per-tile addresses correct (A[m,k]=m*1024+k -> 0,256,512;
+  B[k,n]=k*1024+n -> 0,262144,524288).
+- End-to-end through the real Core (256^3 GEMM, real gem5 table): trace 2518
+  cycles vs legacy 2698.
+- Legacy ONNX-TOG path untouched (comment-only diff), marked DEPRECATED, kept as
+  the comparison reference.
+
+## 6b. Reference timer (early sanity check; superseded by the real Core feed)
+
+`togsim::simulate(RunResult, TimingParams)` (togsim_runtime.cc) was an early
+standalone scheduler that timed the recorded TraceRec to prove the stream is
+sufficient to be timed: per core a DMA-engine timeline (DMAs serialize, overlap
+compute), a compute timeline (serial = reduction accumulate, with the `finish =
+prev.finish + cycle - overlapped` pipeline overlap of Core.cc), and data deps.
+It is NOT the production Core (no DRAM/NoC/L2 contention). It has since been
+superseded: the recorded stream is now bridged into the real Tile/TileGraph ->
+Core (see §3, and the 2518-vs-2698 result above). Retained here as context.
+
+## 7. Remaining work (priority order)
+
+1. DONE. Map TraceRec -> existing TOGSim Core Instructions (Tile/TileGraph,
+   compute_cycle+overlapping, dataflow-buffer deps + runtime-tag barriers) and
+   run through the real Core. Result: trace 2518 vs legacy 2698 on the same gem5
+   table.
+2. SRAM tile lifecycle in the trace (double-buffer throttle). togsim_dma carries
+   `tag_slot` (the lowered SRAM tag index = the slot key the existing Core's
+   Instruction.tag_idx needs); 0 for single-buffer kernels. Remaining: the
+   consumer must use it to throttle in-flight loads to the buffer depth. The
+   SRAM-buffer key is effectively (arg_id, tag_slot) since each load's DRAM
+   tensor maps to its spad.
+3. Preload concurrency cap / preload occupancy (design doc §10.5): give a preload
+   a non-zero occupancy so concurrent preloads are capped at the SA count.
+   Pre-existing in BOTH paths.
+4. (later) deeper double-buffer pipelines (more tag slots), two-function outline,
+   P4 streaming, symbolic shape, P5 op coverage (conv/SDPA/vector).
+
+## 8. Risks / open
+
+- SRAM lifecycle (double-buffer throttle) not yet implemented -- central to
+  double-buffer/capacity accuracy on multi-tile kernels.
+- LLVM 20 emitc constraints absorbed: emitc.for index bounds; old
+  subscript-returns-element model; arith.divui/remui not lowerable -> core id is
+  a runtime allocation (which became a design improvement).
+
+### Explicit dataflow-edge dependency model: implemented
+
+The dependency model is an explicit dataflow DAG, not in-order or runtime-tag
+content-hashing. `togsim_dma`/`togsim_compute` carry read_bufs/write_bufs (SRAM
+buffer ids; a virtual SA_WEIGHTS buffer folds the preload->matmul edge).
+dep_analysis + build_skeleton attach them; lower_to_emitc emits them; the runtime
+records them; the bridge builds the Instruction DAG by last-writer per buffer,
+scoped per work-item. The one runtime-paired edge is the async-DMA data wait,
+routed through an explicit `togsim.memory_barrier` keyed on `(tag_id, tag_slot)`
+(see design doc §10.7.4). The systolic-array pipeline uses the occupancy/latency
+split (§10.7), so accumulating matmuls pipeline rather than serialize.
+
+Net (256^3 GEMM, real gem5 table, real Core): trace 2518 vs legacy 2698.
+Per-output-tile dispatch for multi-core distribution is the next refinement
+(today one dispatch per work-item).

From be967cf1645376c75ecff2970194f653a7fd7260 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 02/24] [TOGSim] C++ trace pipeline: front end, runtime,
 loader, bridge, Core feed

Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime,
loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and
the pipeline tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 AsmParser/tog_generator.py                    |   6 +
 PyTorchSimFrontend/extension_codecache.py     |  41 ++
 PyTorchSimFrontend/mlir/passes/_mlir_util.py  |  87 ++++
 .../mlir/passes/build_skeleton.py             | 426 ++++++++++++++++
 PyTorchSimFrontend/mlir/passes/cycle_table.py | 103 ++++
 .../mlir/passes/decompose_transfer.py         |  10 +-
 .../mlir/passes/dep_analysis.py               | 194 ++++++++
 .../mlir/passes/dma_fine_grained.py           |  24 +-
 .../mlir/passes/lower_dma_to_gemmini.py       |  11 +-
 .../mlir/passes/lower_to_emitc.py             | 454 ++++++++++++++++++
 .../mlir/passes/lower_to_vcix.py              |  51 +-
 .../mlir/passes/lower_vlane_idx.py            |  10 +-
 PyTorchSimFrontend/mlir/passes/togsim_ops.py  | 101 ++++
 TOGSim/include/Instruction.h                  |  12 +-
 TOGSim/include/togsim_loader.h                |  76 +++
 TOGSim/include/togsim_runtime.h               | 157 ++++++
 TOGSim/include/togsim_trace_bridge.h          |  18 +
 TOGSim/src/CMakeLists.txt                     |   5 +
 TOGSim/src/Core.cc                            |  29 +-
 TOGSim/src/CoreTraceLog.cc                    |   2 +-
 TOGSim/src/Instruction.cc                     |  13 +-
 TOGSim/src/TileGraphParser.cc                 |   2 +-
 TOGSim/src/main.cc                            |  47 ++
 TOGSim/src/togsim_runtime.cc                  | 194 ++++++++
 TOGSim/src/togsim_trace_bridge.cc             | 192 ++++++++
 docs/design/togsim_cpp_trace.md               |  30 +-
 docs/design/togsim_cpp_trace_HANDOFF.md       |   6 +-
 docs/design/togsim_cpp_trace_STATUS.md        |   2 +-
 tests/test_togsim_emitc.py                    | 151 ++++++
 tests/test_togsim_runtime.py                  | 181 +++++++
 tests/test_togsim_skeleton.py                 | 119 +++++
 31 files changed, 2653 insertions(+), 101 deletions(-)
 create mode 100644 PyTorchSimFrontend/mlir/passes/_mlir_util.py
 create mode 100644 PyTorchSimFrontend/mlir/passes/build_skeleton.py
 create mode 100644 PyTorchSimFrontend/mlir/passes/cycle_table.py
 create mode 100644 PyTorchSimFrontend/mlir/passes/dep_analysis.py
 create mode 100644 PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
 create mode 100644 PyTorchSimFrontend/mlir/passes/togsim_ops.py
 create mode 100644 TOGSim/include/togsim_loader.h
 create mode 100644 TOGSim/include/togsim_runtime.h
 create mode 100644 TOGSim/include/togsim_trace_bridge.h
 create mode 100644 TOGSim/src/togsim_runtime.cc
 create mode 100644 TOGSim/src/togsim_trace_bridge.cc
 create mode 100644 tests/test_togsim_emitc.py
 create mode 100644 tests/test_togsim_runtime.py
 create mode 100644 tests/test_togsim_skeleton.py

diff --git a/AsmParser/tog_generator.py b/AsmParser/tog_generator.py
index a12460e3..0de76246 100644
--- a/AsmParser/tog_generator.py
+++ b/AsmParser/tog_generator.py
@@ -1,3 +1,9 @@
+# DEPRECATED (timing path): legacy ONNX Tile-Operation-Graph producer. Builds
+# the TOG and serializes it to ONNX for the C++ TileGraphParser. Superseded by
+# the C++ trace pipeline (PyTorchSimFrontend/mlir/passes/build_skeleton.py +
+# lower_to_emitc.py + cycle_table.py -> a compiled trace .so). Kept live so the
+# current pipeline does not break; to be retired once the trace pipeline (P3+)
+# stabilizes. See docs/design/togsim_cpp_trace.md.
 import os
 import sys
 import importlib.util
diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index 492133a3..247c07d5 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -241,8 +241,19 @@ def load(cls, source_code,
             # Run cyclesim
             cyclesim = CycleSimulator()
             cycle_list = cyclesim.compile_and_simulate(os.path.join(write_path, cycle_binary_name), vectorlane_size, silent_mode=silent_mode)
+            # Snapshot for the P3-trace hook below: generate_tile_graph consumes
+            # cycle_list in place (cycle_list.pop(0) per tile), leaving it empty.
+            cycle_list_for_trace = list(cycle_list)
 
             # Create TOG
+            # DEPRECATED (timing path): this ONNX-TOG producer -- run_tog ->
+            # tog_generator.generate_tile_graph -> ONNX -> C++ TileGraphParser --
+            # is being superseded by the C++ trace pipeline (build_skeleton +
+            # lower_to_emitc -> compiled .so, + the cycle_table sidecar). The
+            # per-tile cycle_list / x_offset / w_offset computed here are exactly
+            # what cycle_table.build_cycle_table will reuse, so both paths stay
+            # cycle-consistent during the transition. Kept live (pipeline must not
+            # break); to be retired once the trace pipeline (P3+) stabilizes.
             w_offset, x_offset = vectorlane_size, vectorlane_size
             if kwargs['loop_size'] is not None and kwargs['loop_size'][-3] < vectorlane_size:
                 x_offset = kwargs['loop_size'][-3]
@@ -258,6 +269,36 @@ def load(cls, source_code,
                 w_offset=w_offset, # FIXME.
                 vector_lane=vectorlane_size
             )
+
+            # P3 trace pipeline (opt-in, TORCHSIM_DUMP_TRACE_SO=1): also emit the
+            # compiled trace producer .so + the cycle-table TSV from the SAME
+            # post-vcix IR and gem5 cycle_list/offsets, so the trace path can be
+            # run and compared cycle-consistently against this legacy path.
+            # Best-effort: never breaks the legacy compile.
+            if os.environ.get("TORCHSIM_DUMP_TRACE_SO") == "1":
+                try:
+                    import mlir.ir as ir
+                    from PyTorchSimFrontend.mlir.passes import (
+                        build_skeleton as _bs, cycle_table as _ct, lower_to_emitc as _l2e)
+                    pv = sample_mlir_path + "_postvcix.mlir"
+                    _ctx = ir.Context(); _ctx.allow_unregistered_dialects = True
+                    with _ctx:
+                        _mod = ir.Module.parse(open(pv).read(), _ctx)
+                        _bs.build_skeleton(_mod)
+                        _ntiles = len(_ct._compute_types(_mod))
+                        # align lengths: gem5 gives one numCycles per compute node;
+                        # pad with the last value / truncate if it disagrees.
+                        _cl = list(cycle_list_for_trace)
+                        if _cl and len(_cl) != _ntiles:
+                            _cl = (_cl + [_cl[-1]] * _ntiles)[:_ntiles]
+                        logger.info(f"[P3-trace] cycle_list={cycle_list_for_trace} -> {_cl} "
+                                    f"(#tiles={_ntiles}, x_off={x_offset}, w_off={w_offset})")
+                        _tbl = _ct.build_cycle_table(_mod, _cl, x_offset, w_offset)
+                    _ct.dump_cycle_table_tsv(_tbl, os.path.join(write_path, "trace_cycles.tsv"))
+                    _l2e.build_trace_so(pv, os.path.join(write_path, "trace.so"))
+                    logger.info(f"[P3-trace] wrote trace.so + trace_cycles.tsv in {write_path}")
+                except Exception as e:
+                    logger.warning(f"[P3-trace] trace .so/sidecar dump skipped: {e}")
         return key
 
 class CustomAsyncCompile(AsyncCompile):
diff --git a/PyTorchSimFrontend/mlir/passes/_mlir_util.py b/PyTorchSimFrontend/mlir/passes/_mlir_util.py
new file mode 100644
index 00000000..e39f9d6f
--- /dev/null
+++ b/PyTorchSimFrontend/mlir/passes/_mlir_util.py
@@ -0,0 +1,87 @@
+"""Small, dependency-light helpers shared across the MLIR passes.
+
+Every pass had its own copy of the same op-walk generator (named variously
+`_iter_ops` / `_walk` / `_walk_ops`) and the same one-line attribute builders
+(`_i32` / `_i64` / ...). This module is the single source for both.
+
+Import-safety: `walk_ops` is pure block/op attribute access and needs no MLIR
+bindings, so this module does NOT import `mlir.ir` at top level -- some passes
+(e.g. lower_vlane_idx, decompose_transfer) are deliberately importable without
+the bindings present and only touch `mlir.ir` inside their run functions. The
+attribute builders therefore import `mlir.ir` lazily; they require an active
+MLIR context (the caller's `with ctx:`), exactly as the per-pass copies did.
+"""
+
+
+def walk_ops(block):
+    """Yield every op under `block` in program order, recursing into regions.
+
+    Snapshots each block's operation list, so a caller may erase ops while
+    iterating (the strictest of the former copies; a superset of the rest)."""
+    for op in list(block.operations):
+        yield op
+        for region in op.operation.regions:
+            for b in region.blocks:
+                yield from walk_ops(b)
+
+
+def _ir():
+    import mlir.ir as ir
+    return ir
+
+
+def i32(v):
+    """`i32` IntegerAttr for `v` (uses the active MLIR context)."""
+    ir = _ir()
+    return ir.IntegerAttr.get(ir.IntegerType.get_signless(32), int(v))
+
+
+def i64(v):
+    """`i64` IntegerAttr for `v`."""
+    ir = _ir()
+    return ir.IntegerAttr.get(ir.IntegerType.get_signless(64), int(v))
+
+
+def i64_array(vals):
+    """ArrayAttr of `i64` IntegerAttrs for `vals`."""
+    ir = _ir()
+    i = ir.IntegerType.get_signless(64)
+    return ir.ArrayAttr.get([ir.IntegerAttr.get(i, int(v)) for v in vals])
+
+
+def str_attr(v):
+    """StringAttr of `str(v)`."""
+    ir = _ir()
+    return ir.StringAttr.get(str(v))
+
+
+# ---------------------------------------------------------------------------
+# attribute readers -- accept an OpView or an Operation; `default` is returned
+# when `key` is absent (callers that want the strict "must be present" behaviour
+# simply never pass an absent key).
+# ---------------------------------------------------------------------------
+def _attrs(op):
+    return getattr(op, "operation", op).attributes
+
+
+def attr_int(op, key, default=None):
+    """Integer value of `op`'s `key` attribute, or `default` if absent."""
+    ir = _ir()
+    a = _attrs(op)
+    return ir.IntegerAttr(a[key]).value if key in a else default
+
+
+def attr_bool(op, key, default=False):
+    """Bool value of `op`'s `key` attribute, or `default` if absent."""
+    ir = _ir()
+    a = _attrs(op)
+    return bool(ir.BoolAttr(a[key]).value) if key in a else default
+
+
+def attr_i64_array(op, key, default=None):
+    """`op`'s `key` ArrayAttr of integers as a Python list, or `default` if
+    absent (pass `default=[]` for the "missing -> empty" convention)."""
+    ir = _ir()
+    a = _attrs(op)
+    return ([ir.IntegerAttr(x).value for x in ir.ArrayAttr(a[key])]
+            if key in a else default)
diff --git a/PyTorchSimFrontend/mlir/passes/build_skeleton.py b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
new file mode 100644
index 00000000..deb7d9ae
--- /dev/null
+++ b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
@@ -0,0 +1,426 @@
+"""build_skeleton pass (C2): reduce a kernel's post-vcix MLIR to the
+*skeleton + API* form, in place.
+
+The trace pipeline (docs/design/togsim_cpp_trace.md) compiles a kernel to a
+shape-parametric C++ trace producer. The producer is just the kernel's loop
+skeleton with the data computation replaced by calls to the `togsim_*` runtime
+API. This pass performs that reduction at the MLIR level:
+
+  * `memref.dma_start`  -> `togsim.dma(...) {tag_id, is_async, ...}` carrying the
+                            runtime tag index operand (`%tag[%idx]`).
+  * `memref.dma_wait`   -> `togsim.memory_barrier(tag_idx) {tag_id, write_bufs}`,
+                            the explicit async-DMA sync. It pairs with its dma by
+                            the RUNTIME tag slot (tag_id + the tag index), not a
+                            compile-time id: one static dma op runs once per loop
+                            iteration with a different `%tag[%idx]`, so only the
+                            runtime slot can pair iteration i's dma with its wait.
+  * each compute node   -> a single `togsim.compute {tile_id, compute_type}`
+  * everything else      -> removed by a use-based DCE, keeping the loops and the
+                            index/address arithmetic the survivors depend on.
+
+It reuses build_tog's traversal (`TogBuilder` / `_build`): loops, DMAs and
+compute blocks are already identified there, each with a back-pointer to its
+MLIR op(s), so this pass only adds the *rewrite*. Keeping a single traversal
+guarantees the skeleton and the legacy TOG see the same structure.
+
+Counterpart to `build_tog.build_tog_and_mutate`.
+
+The DCE is safe by construction: it never erases an op whose results still have
+uses, so at worst it leaves extra ops in the dump (visible for diagnosis) rather
+than producing invalid IR.
+
+Requires the MLIR Python bindings (importing `build_tog` pulls in `mlir.ir`).
+"""
+
+from . import togsim_ops as ts
+from ._mlir_util import walk_ops, i32, i64, i64_array, str_attr
+from .build_tog import (
+    ir,
+    TogBuilder,
+    _build,
+    _reset_ids,
+    _find_kernel,
+    _value_key,
+    TOGDMANode,
+    TOGDMAWaitNode,
+    _COMPUTE_TYPE_NAME,
+)
+
+#: Marker op names for the passes/__init__ fast-path (skip parsing if absent).
+MARKERS = ("memref.dma_start", "memref.dma_wait")
+
+#: Ops the DCE must never remove (loops, terminators, our API ops).
+_KEEP = {
+    "affine.for", "scf.for", "scf.while",
+    "affine.yield", "scf.yield", "func.return",
+    ts.DMA, ts.COMPUTE, ts.COMPUTE_BAR, ts.MEMORY_BAR,
+}
+
+
+def _kernel_block(module):
+    func_op = _find_kernel(module)
+    if func_op is None:
+        return None
+    return func_op.regions[0].blocks[0]
+
+
+# ---------------------------------------------------------------------------
+# op construction
+# ---------------------------------------------------------------------------
+def _arg_id_of(base_addr):
+    """Tensor func-arg ordinal from a build_tog base name ("arg3" -> 3); -1 if
+    it is not a plain block-arg base."""
+    s = str(base_addr)
+    return int(s[3:]) if s.startswith("arg") and s[3:].isdigit() else -1
+
+
+def _emit_dma(ctx, dma_node, tag_id, dram_index, tag_index, read_bufs, write_bufs):
+    """Insert a `togsim.dma` before the original `memref.dma_start`.
+
+    `tag_id` is the identity of this DMA's tag memref. An async DMA pairs with
+    its `togsim.memory_barrier` (the original dma_wait) by the RUNTIME tag slot
+    -- (tag_id, tag_index) -- not a compile-time identifier: one static dma op runs
+    once per loop iteration, each with a different runtime `%tag[%idx]` slot, so
+    only a runtime key can pair iteration i's dma with iteration i's wait.
+
+    `dram_index` is the original linear DRAM index Value (the `affine.apply`
+    result that indexed the tensor in the `memref.dma_start`) -- carried as an
+    operand so the DCE keeps the address arithmetic live and the C4 lowering can
+    compute the real `base_addr = base[arg_id] + index*elem` (P3, approach A).
+
+    `tag_index` is the original SRAM tag index Value (`%tag[%idx]`), carried as a
+    second operand: the runtime tag slot, used both to pair with the barrier and
+    for the double-buffer / SRAM-capacity (WAR) model.
+    Operand order: [dram_index, tag_index] (each omitted if absent)."""
+    op = dma_node.op
+    attrs = {
+        ts.ATTR_DIR: i32(ts.DIR_STORE if dma_node.is_write else ts.DIR_LOAD),
+        ts.ATTR_DIMS: i64_array(dma_node.tile_size),
+        ts.ATTR_STRIDES: i64_array(dma_node.tile_stride),
+        ts.ATTR_ELEM_BITS: i32(dma_node.element_size),
+        ts.ATTR_IS_ASYNC: ir.BoolAttr.get(bool(dma_node.is_async)),
+        ts.ATTR_TAG_ID: i32(tag_id),
+        ts.ATTR_ARG_ID: i32(_arg_id_of(dma_node.base_addr)),
+        "base": str_attr(dma_node.base_addr),
+        # SRAM spad this DMA touches (load writes it, store reads it) -- sec 10.
+        ts.ATTR_READ_BUFS: i64_array(read_bufs),
+        ts.ATTR_WRITE_BUFS: i64_array(write_bufs),
+    }
+    operands = [v for v in (dram_index, tag_index) if v is not None]
+    ir.Operation.create(
+        ts.DMA,
+        results=[],
+        operands=operands,
+        attributes=attrs,
+        loc=ir.Location.unknown(ctx),
+        ip=ir.InsertionPoint(op),
+    )
+
+
+def _emit_compute_bar(ctx, anchor_op):
+    """Insert a `togsim.compute_barrier` before `anchor_op` -- the fence that
+    drains in-flight async compute (the systolic-array matmuls) before a store
+    consumes their result (sec 10.7).
+
+    FIXME: this is the one barrier still synthesized here rather than read from
+    the IR. The async-load memory barrier is now mapped 1:1 from the explicit
+    dma_wait; the compute fence should likewise appear explicitly in the input
+    MLIR and be mapped through instead of being auto-inserted."""
+    ir.Operation.create(
+        ts.COMPUTE_BAR, results=[], operands=[], attributes={},
+        loc=ir.Location.unknown(ctx), ip=ir.InsertionPoint(anchor_op))
+
+
+def _emit_memory_bar(ctx, anchor_op, tag_id, tag_index, write_bufs):
+    """Insert a `togsim.memory_barrier` before `anchor_op` -- the explicit
+    async-DMA sync that was the original `memref.dma_wait`. It pairs with its
+    async `togsim.dma` by the RUNTIME tag slot (tag_id + tag_index) and carries
+    the SRAM buffer that dma loaded, so consumers gate on data arrival rather
+    than on the async dma's issue completion."""
+    attrs = {
+        ts.ATTR_TAG_ID: i32(tag_id),
+        ts.ATTR_WRITE_BUFS: i64_array(write_bufs),
+    }
+    operands = [tag_index] if tag_index is not None else []
+    ir.Operation.create(
+        ts.MEMORY_BAR, results=[], operands=operands, attributes=attrs,
+        loc=ir.Location.unknown(ctx), ip=ir.InsertionPoint(anchor_op))
+
+
+def _emit_compute(ctx, compute_node, tile_id, read_bufs, write_bufs):
+    front = compute_node.operations[0]
+    attrs = {
+        ts.ATTR_TILE_ID: i64(tile_id),
+        # int code (0 vector / 1 matmul / 2 preload) consumed by the C4 lowering;
+        # maps directly to the Core compute-unit enum. Keep the readable name too.
+        ts.ATTR_COMPUTE_TYPE: i32(int(compute_node.compute_type)),
+        "compute_type_name": str_attr(_COMPUTE_TYPE_NAME[compute_node.compute_type]),
+        # SRAM buffer ids read/written (sec 10 dataflow); the bridge builds the
+        # dependency DAG by last-writer per buffer.
+        ts.ATTR_READ_BUFS: i64_array(read_bufs),
+        ts.ATTR_WRITE_BUFS: i64_array(write_bufs),
+    }
+    ir.Operation.create(
+        ts.COMPUTE,
+        results=[],
+        operands=[],
+        attributes=attrs,
+        loc=ir.Location.unknown(ctx),
+        ip=ir.InsertionPoint(front),
+    )
+
+
+# ---------------------------------------------------------------------------
+# DCE
+# ---------------------------------------------------------------------------
+def _has_nonempty_region(op):
+    for region in op.operation.regions:
+        for b in region.blocks:
+            if len(list(b.operations)) > 0:
+                return True
+    return False
+
+
+def _results_unused(op):
+    for r in op.operation.results:
+        if len(list(r.uses)) > 0:
+            return False
+    return True
+
+
+def _dce(block):
+    """Erase non-kept ops with no used results, to a fixed point. Safe: an op
+    with live SSA uses is never touched."""
+    changed = True
+    while changed:
+        changed = False
+        victims = []
+        for op in walk_ops(block):
+            name = op.operation.name
+            if name in _KEEP:
+                continue
+            if _has_nonempty_region(op):
+                continue
+            if _results_unused(op):
+                victims.append(op)
+        for op in victims:
+            try:
+                op.operation.erase()
+                changed = True
+            except Exception:
+                # Erase failed: keep the op; retried only if another erase succeeds.
+                pass
+
+
+# ---------------------------------------------------------------------------
+# driver
+# ---------------------------------------------------------------------------
+def _collect_dma_nodes(builder):
+    """Map op-identity -> DMA/DMAWait node, by walking the built tree."""
+    by_op = {}
+    seen = set()
+
+    def visit(n):
+        if id(n) in seen:
+            return
+        seen.add(id(n))
+        if isinstance(n, (TOGDMANode, TOGDMAWaitNode)) and n.op is not None:
+            by_op[id(n.op.operation)] = n
+        for c in n.children:
+            visit(c)
+
+    for ln in builder.loop_nodes:
+        visit(ln)
+    return by_op
+
+
+class _BufferIds:
+    """Assigns each SRAM buffer name a stable small int id, shared by DMA and
+    compute so the bridge can match a reader to its buffer's writer (sec 10).
+    The virtual SA_WEIGHTS buffer (preload -> matmul) is numbered here too, on
+    first sight. `None` (a non-buffer base) is -1."""
+
+    def __init__(self):
+        self._ids = {}
+
+    def of(self, name):
+        if name is None:
+            return -1
+        return self._ids.setdefault(name, len(self._ids))
+
+
+class _TagIds:
+    """Identity of a DMA's tag memref -> stable small int, plus the SRAM buffer
+    that tag's async DMA loads. An async dma and its memory_barrier (the original
+    dma_wait) share a tag memref; this assigns it a tag_id (so the runtime can
+    pair them by the runtime tag slot) and remembers the loaded buffer so the
+    barrier can release it to consumers. Pairing is by tag, never a static id."""
+
+    def __init__(self):
+        self._ids = {}   # tag value-key -> tag_id
+        self._buf = {}   # tag value-key -> SRAM buffer id the dma loads
+
+    def bind(self, key, buf):
+        tag_id = self._ids.setdefault(key, len(self._ids))
+        self._buf[key] = buf
+        return tag_id
+
+    def lookup(self, key):
+        """(tag_id, buffer) for a tag memref, or None if no dma used it."""
+        if key not in self._ids:
+            return None
+        return self._ids[key], self._buf[key]
+
+
+def _emit_computes(ctx, builder, bufs):
+    """Step 1: each compute node -> one togsim.compute carrying its tile_id and
+    the ids of the SRAM buffers it reads/writes. Returns the count."""
+    from . import dep_analysis as dep  # lazy: dep_analysis imports build_skeleton
+    n = 0
+    for tile_id, cn in enumerate(builder.compute_nodes):
+        if not cn.operations:
+            continue
+        reads, writes = dep.compute_buffers(cn)
+        _emit_compute(ctx, cn, tile_id,
+                      sorted(bufs.of(b) for b in reads),
+                      sorted(bufs.of(b) for b in writes))
+        n += 1
+    return n
+
+
+def _emit_one_dma(ctx, op, node, builder, bufs, tags):
+    """Rewrite one memref.dma_start as togsim.dma. A load reads DRAM and writes
+    its SRAM spad; a store reads the spad and writes DRAM -- which sets the
+    read/write buffer that drives the dependency edge (sec 10). The tag memref is
+    bound to a tag_id (with its loaded buffer) so the paired memory_barrier finds
+    it by the runtime tag slot."""
+    from . import dep_analysis as dep  # lazy: dep_analysis imports build_skeleton
+    f = builder._dma_start_fields(op)
+    dram_indices = f["dst_indices"] if node.is_write else f["src_indices"]
+    dram_index = dram_indices[0] if dram_indices else None
+    tag_indices = f["tag_indices"]
+    tag_index = tag_indices[0] if tag_indices else None
+    # the spad is the SRAM side of the copy: dst for a load, src for a store.
+    spad_id = bufs.of(dep._global_of(f["src"] if node.is_write else f["dst"]))
+    read_bufs = [spad_id] if node.is_write else []
+    write_bufs = [] if node.is_write else [spad_id]
+    tag_id = tags.bind(_value_key(f["tag"]), spad_id)
+    if node.is_write:
+        _emit_compute_bar(ctx, op)   # FIXME(sec10.7): auto-inserted; should be explicit in the IR.
+    _emit_dma(ctx, node, tag_id, dram_index, tag_index, read_bufs, write_bufs)
+
+
+def _emit_one_wait(ctx, op, tags):
+    """Rewrite one memref.dma_wait as togsim.memory_barrier -- the explicit
+    async-DMA sync already in the IR. Paired with its dma by the tag memref
+    (tag_id) and the runtime tag index; carries the buffer the dma loaded.
+    Returns True iff emitted (a wait whose tag no dma used is dropped)."""
+    operands = list(op.operation.operands)
+    tag = operands[0]
+    tag_index = operands[1] if len(operands) >= 2 else None
+    binding = tags.lookup(_value_key(tag))
+    if binding is None:
+        return False
+    tag_id, buf = binding
+    _emit_memory_bar(ctx, op, tag_id, tag_index, [buf])
+    return True
+
+
+def _emit_dmas_and_waits(ctx, block, builder, dma_by_op, bufs):
+    """Step 2: rewrite memref.dma_start -> togsim.dma and memref.dma_wait ->
+    togsim.memory_barrier in program order. An async dma and its barrier are
+    paired by the RUNTIME tag slot (tag_id + tag index), not a compile-time id:
+    one static dma op runs per loop iteration with a different `%tag[%idx]`, so
+    only the runtime slot can pair iteration i's dma with iteration i's wait.
+    Returns the original ops to erase and the (dma, wait) counts."""
+    tags = _TagIds()
+    originals = []
+    n_dma = n_wait = 0
+    for op in list(walk_ops(block)):
+        name = op.operation.name
+        if name == "memref.dma_start":
+            node = dma_by_op.get(id(op.operation))
+            if node is None:
+                continue
+            _emit_one_dma(ctx, op, node, builder, bufs, tags)
+            originals.append(op)
+            n_dma += 1
+        elif name == "memref.dma_wait":
+            if _emit_one_wait(ctx, op, tags):
+                n_wait += 1
+            originals.append(op)
+    return originals, n_dma, n_wait
+
+
+def build_skeleton(module):
+    """Reduce `func.func @kernel` in `module` to the skeleton+API form, in place.
+
+    Four steps: analyze the kernel into loop/compute/DMA nodes, emit one
+    togsim.compute per compute node, rewrite DMAs/waits to togsim.dma /
+    togsim.memory_barrier, then DCE the leftovers. Returns a text report (counts).
+    """
+    _reset_ids()
+    builder = TogBuilder()
+    _build(module, builder)  # populates loop/compute nodes + op back-pointers
+
+    block = _kernel_block(module)
+    if block is None:
+        return "no @kernel found"
+    ctx = module.context
+    dma_by_op = _collect_dma_nodes(builder)
+    bufs = _BufferIds()
+
+    n_compute = _emit_computes(ctx, builder, bufs)
+    originals, n_dma, n_wait = _emit_dmas_and_waits(ctx, block, builder, dma_by_op, bufs)
+
+    # erase the now-replaced originals (result-less -> safe), then strip the
+    # leftover data computation.
+    for op in originals:
+        try:
+            op.operation.erase()
+        except Exception:
+            pass
+    _dce(block)
+
+    return ("skeleton: compute=%d dma=%d wait=%d (unpaired waits dropped)"
+            % (n_compute, n_dma, n_wait))
+
+
+def run(module, vectorlane=128, **_):
+    """passes/__init__ pass protocol entry (vectorlane unused; kept for parity)."""
+    build_skeleton(module)
+
+
+def run_skeleton(in_path, out_path=None):
+    """Read post-vcix MLIR at `in_path`, reduce to skeleton+API, write it out.
+
+    Requires the MLIR bindings.
+    """
+    if out_path is None:
+        out_path = in_path
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx:
+        module = ir.Module.parse(open(in_path).read(), ctx)
+        report = build_skeleton(module)
+        with open(out_path, "w") as fh:
+            fh.write(str(module))
+    return report
+
+
+def main(argv):
+    import argparse
+
+    parser = argparse.ArgumentParser(prog="build_skeleton.py")
+    parser.add_argument("input")
+    parser.add_argument("--out", default=None)
+    args = parser.parse_args(argv[1:])
+    report = run_skeleton(args.input, args.out)
+    import sys
+    sys.stderr.write(report + "\n")
+    return 0
+
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(main(sys.argv))
diff --git a/PyTorchSimFrontend/mlir/passes/cycle_table.py b/PyTorchSimFrontend/mlir/passes/cycle_table.py
new file mode 100644
index 00000000..40dd3459
--- /dev/null
+++ b/PyTorchSimFrontend/mlir/passes/cycle_table.py
@@ -0,0 +1,103 @@
+"""cycle_table (C3): the precomputed tile_id -> (cycle, overlapping_cycle) table
+the C++ trace pipeline looks up at runtime (docs/design/togsim_cpp_trace.md sec
+6, sec 9.8 task 4).
+
+A `togsim.compute(tile_id=...)` in the trace says *which* tile to compute, not
+how long it takes. Because tiles are fixed size, each tile's cost is invariant
+(only the trip count varies with shape), so it is sampled once and stored here,
+keyed by `tile_id`. Two numbers per tile, mirroring the legacy TOG:
+
+  * `cycle`            -- full compute latency, sampled by gem5 sample-mode
+                          (the existing measurement: `_rewrite_loop_steps` +
+                          `_insert_compute_markers` in build_tog, run through
+                          CycleSimulator -> the per-tile `cycle_list`).
+  * `overlapping_cycle` -- the portion that overlaps the previous instruction in
+                          the systolic pipeline; the timing core uses it as
+                          `finish = prev.finish + cycle - overlapped` (Core.cc).
+                          Derived exactly as the legacy path does
+                          (tog_generator.generate_tile_graph):
+                              type 0 (VectorCompute)  -> 0
+                              type 1 (MatmulCompute)  -> max(cycle - x_offset, 0)
+                              type 2 (MatmulPreload)  -> max(cycle - w_offset, 0)
+
+This module only *builds/serializes* the table from a cycle_list; obtaining the
+cycle_list reuses the existing sample-mode + gem5 path (wired in P3 task 5). The
+`tile_id` order matches build_skeleton's `compute_nodes` order, which matches the
+legacy TOG, so the same sampling keys both paths.
+
+Requires the MLIR Python bindings (to read the skeleton's togsim.compute ops).
+"""
+
+import json
+
+from . import togsim_ops as ts
+from ._mlir_util import walk_ops
+from .build_tog import (
+    ir,
+    VECTOR_COMPUTE,
+    MATMUL_COMPUTE,   # noqa: F401 (documents the type enum used by the formula)
+    MATMUL_PRELOAD,
+)
+
+
+def overlapping_cycle(cycle, compute_type, x_offset, w_offset):
+    """Hideable (pipeline-overlapped) portion of `cycle`. Mirrors
+    tog_generator.generate_tile_graph."""
+    if compute_type <= VECTOR_COMPUTE:           # VectorCompute: no systolic overlap
+        return 0
+    offset = w_offset if compute_type == MATMUL_PRELOAD else x_offset
+    return max(int(cycle) - int(offset), 0)
+
+
+def _compute_types(skeleton_module):
+    """tile_id-ordered list of compute_type ints, from the skeleton's
+    togsim.compute ops."""
+    items = []
+    for op in walk_ops(skeleton_module.body):
+        if op.operation.name != ts.COMPUTE:
+            continue
+        tid = ir.IntegerAttr(op.operation.attributes[ts.ATTR_TILE_ID]).value
+        ct = ir.IntegerAttr(op.operation.attributes[ts.ATTR_COMPUTE_TYPE]).value
+        items.append((tid, ct))
+    items.sort()
+    return [t for _, t in items]
+
+
+def build_cycle_table(skeleton_module, cycle_list, x_offset, w_offset):
+    """Return `[(cycle, overlapping_cycle), ...]` indexed by tile_id.
+
+    `cycle_list` is the per-tile gem5 measurement (compute_nodes order ==
+    tile_id order). `x_offset`/`w_offset` are the systolic-fill offsets the
+    legacy path computes from the vector-lane size / loop size."""
+    types = _compute_types(skeleton_module)
+    if len(cycle_list) != len(types):
+        raise ValueError(
+            "cycle_list (%d) does not match #compute tiles (%d)"
+            % (len(cycle_list), len(types)))
+    return [(int(c), overlapping_cycle(c, t, x_offset, w_offset))
+            for c, t in zip(cycle_list, types)]
+
+
+def dump_cycle_table(table, path, x_offset=None, w_offset=None):
+    """Serialize the table as a sidecar JSON next to the trace `.so`. The P3 C6
+    loader reads it and sets compute_cycle + overlapping_cycle on each emitted
+    Instruction."""
+    with open(path, "w") as fh:
+        json.dump({"x_offset": x_offset, "w_offset": w_offset,
+                   "table": [list(e) for e in table]}, fh)
+    return path
+
+
+def load_cycle_table(path):
+    with open(path) as fh:
+        return json.load(fh)
+
+
+def dump_cycle_table_tsv(table, path):
+    """Plain `cycle<TAB>overlapping` per line, in tile_id order -- the trivial
+    format the C++ `--cycle_table` loader (main.cc, P3 trace pipeline) reads with
+    ifstream (no JSON dependency in TOGSim)."""
+    with open(path, "w") as fh:
+        for cycle, overlapping in table:
+            fh.write("%d\t%d\n" % (int(cycle), int(overlapping)))
+    return path
diff --git a/PyTorchSimFrontend/mlir/passes/decompose_transfer.py b/PyTorchSimFrontend/mlir/passes/decompose_transfer.py
index c0e82b66..10b2edfb 100644
--- a/PyTorchSimFrontend/mlir/passes/decompose_transfer.py
+++ b/PyTorchSimFrontend/mlir/passes/decompose_transfer.py
@@ -32,13 +32,7 @@
 OP_NAME = "togsim.transfer"
 MARKERS = (OP_NAME,)
 
-
-def _iter_ops(block):
-    for op in list(block.operations):
-        yield op
-        for region in op.operation.regions:
-            for b in region.blocks:
-                yield from _iter_ops(b)
+from ._mlir_util import walk_ops
 
 
 def _int_array(attr):
@@ -92,7 +86,7 @@ def run(module, vectorlane=128, **_):
     targets = []
     for region in module.operation.regions:
         for b in region.blocks:
-            for op in _iter_ops(b):
+            for op in walk_ops(b):
                 if op.operation.name == OP_NAME:
                     targets.append(op.operation)
 
diff --git a/PyTorchSimFrontend/mlir/passes/dep_analysis.py b/PyTorchSimFrontend/mlir/passes/dep_analysis.py
new file mode 100644
index 00000000..bc53bfc9
--- /dev/null
+++ b/PyTorchSimFrontend/mlir/passes/dep_analysis.py
@@ -0,0 +1,194 @@
+"""dep_analysis.py -- dependency-edge analysis for the C++ trace pipeline (P3, sec 10).
+
+The current TOG pass does NO dependency analysis (it emits a lexical loop tree +
+runtime tags). This module derives the producer->consumer edges that the explicit
+dataflow trace needs, from two sources available on the post-vcix IR (before
+build_skeleton collapses the compute regions):
+
+  1. SRAM access: each DMA/compute's read/write SRAM buffer(s), recovered by
+     following SSA (a vcix.iv's input vector -> its vector.transfer_read -> the
+     memref -> @global), and the DMA's spad operand. Edge: a reader depends on
+     the last node that wrote the same buffer.
+  2. vcix preload/matmul pairing: a matmul (vcix opcode 0) consumes the weights a
+     preceding preload (opcode 1) loaded into the systolic array -- an SA-internal
+     dependency NOT visible as a memref access, so it comes from the opcode order.
+
+This is a node-level analysis (one node per build_tog compute/DMA node); the loops
+replay the nodes, so loop-carried edges (the Y_spad accumulator) are materialized
+per iteration downstream. First cut: buffer granularity (slot-level value matching
+is a later refinement). Output is an edge list for validation / to drive emit.
+"""
+import sys
+import os
+
+from .build_tog import TogBuilder, ir, _reset_ids
+from . import build_skeleton as _bs
+
+
+def _global_of(memref_val):
+    """memref SSA value -> @global symbol name (e.g. 'X_spad'), or None."""
+    owner = memref_val.owner
+    op = owner if isinstance(owner, ir.Operation) else getattr(owner, "operation", None)
+    if op is None:
+        return None
+    if op.name == "memref.get_global":
+        return str(op.attributes["name"]).strip('@" ')
+    # walk through view-like ops (subview/cast) to their source
+    if op.operands:
+        try:
+            return _global_of(op.operands[0])
+        except Exception:
+            return None
+    return None
+
+
+def _read_buffers_of_compute(cn):
+    """SRAM buffers a compute node reads: (a) each vcix.iv input traced to its
+    vector.transfer_read source (activations/weights streamed into the SA), and
+    (b) any direct vector.transfer_read in the node (the epilogue's accumulator
+    read-modify-write of Y_spad)."""
+    bufs = set()
+    for op in cn.operations:
+        if op.name == "vector.transfer_read" and list(op.operands):
+            b = _global_of(op.operands[0])
+            if b:
+                bufs.add(b)
+        elif op.name == "vcix.iv" and list(op.operands):
+            v = op.operands[0]
+            defop = v.owner if isinstance(v.owner, ir.Operation) else getattr(v.owner, "operation", None)
+            if defop is not None and defop.name == "vector.transfer_read" and list(defop.operands):
+                b = _global_of(defop.operands[0])
+                if b:
+                    bufs.add(b)
+    return bufs
+
+
+def _write_buffers_of_compute(cn):
+    """SRAM buffers a compute node writes: vector.transfer_write / vector_store target."""
+    bufs = set()
+    for op in cn.operations:
+        if op.name in ("vector.transfer_write", "affine.vector_store", "vector.store"):
+            # collect every memref-typed operand (the store target)
+            for v in op.operands:
+                try:
+                    if ir.MemRefType.isinstance(v.type):
+                        b = _global_of(v)
+                        if b:
+                            bufs.add(b)
+                except Exception:
+                    pass
+    return bufs
+
+
+def _dma_buffer(builder, dma_node):
+    """The SRAM spad buffer a DMA touches (dst for load, src for store)."""
+    try:
+        f = builder._dma_start_fields(dma_node.op)
+    except Exception:
+        return None
+    val = f["dst"] if not dma_node.is_write else f["src"]
+    return _global_of(val)
+
+
+# Virtual buffer for the systolic-array weight registers: a preload writes it,
+# the following matmul reads it. This folds the SA-internal preload->matmul
+# dependency (not a memref access) into the uniform "last-writer per buffer" rule.
+SA_WEIGHTS = "__SA_WEIGHTS__"
+
+
+def compute_buffers(cn):
+    """(read_buffers, write_buffers) for one compute node, including the virtual
+    SA_WEIGHTS buffer (preload writes it, matmul reads it)."""
+    reads = set(_read_buffers_of_compute(cn))
+    writes = set(_write_buffers_of_compute(cn))
+    if cn.compute_type == 1:      # MATMUL consumes the preloaded weights
+        reads.add(SA_WEIGHTS)
+    elif cn.compute_type == 2:    # PRELOAD loads them
+        writes.add(SA_WEIGHTS)
+    return reads, writes
+
+
+def analyze(module):
+    """Return (nodes, edges). nodes: list of dicts; edges: list of (consumer_idx,
+    producer_idx, reason)."""
+    _reset_ids()
+    builder = TogBuilder()
+    _bs._build(module, builder)
+
+    nodes = []
+    # DMA nodes only (the map also contains TOGDMAWaitNode; keep real DMAs).
+    dma_nodes = [dn for dn in dict.fromkeys(_bs._collect_dma_nodes(builder).values())
+                 if hasattr(dn, "is_write")]
+    for dn in dma_nodes:
+        buf = _dma_buffer(builder, dn)
+        nodes.append({
+            "kind": "STORE" if dn.is_write else "LOAD",
+            "buf": buf, "arg": str(dn.base_addr),
+            "reads": {buf} if dn.is_write else set(),
+            "writes": {buf} if not dn.is_write else set(),
+            "node": dn,
+        })
+    for cn in builder.compute_nodes:
+        if not cn.operations:
+            continue
+        ct = {0: "VECTOR", 1: "MATMUL", 2: "PRELOAD"}.get(cn.compute_type, f"c{cn.compute_type}")
+        nodes.append({
+            "kind": ct,
+            "reads": _read_buffers_of_compute(cn),
+            "writes": _write_buffers_of_compute(cn),
+            "node": cn,
+            "compute_type": cn.compute_type,
+        })
+
+    # Order nodes by program position (last-writer needs program order: e.g. the
+    # store reads Y_spad written by the matmul, which lexically precedes it).
+    pos = {}
+
+    def _index(op):
+        pos[op] = len(pos)  # pre-order position of each op
+        for r in op.regions:
+            for b in r.blocks:
+                for o in b.operations:
+                    _index(o)
+    _index(module.operation)
+    def _key(n):
+        node = n["node"]
+        op = getattr(node, "op", None) or (node.operations[0] if getattr(node, "operations", None) else None)
+        return pos.get(op, 1 << 30)
+    nodes.sort(key=_key)
+
+    # Edges: (1) buffer last-writer, (2) preload->matmul.
+    edges = []
+    last_writer = {}  # buffer -> node idx
+    prev_preload = None
+    for i, n in enumerate(nodes):
+        for b in sorted(n["reads"]):
+            if b in last_writer:
+                edges.append((i, last_writer[b], f"reads {b}"))
+        if n["kind"] == "MATMUL" and prev_preload is not None:
+            edges.append((i, prev_preload, "uses preloaded weights (vcix op1->op0)"))
+        for b in n["writes"]:
+            last_writer[b] = i
+        if n["kind"] == "PRELOAD":
+            prev_preload = i
+    return nodes, edges
+
+
+def _main():
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx, open(sys.argv[1]) as fh:
+        module = ir.Module.parse(fh.read(), ctx)
+        nodes, edges = analyze(module)
+    print("=== nodes ===")
+    for i, n in enumerate(nodes):
+        r = ",".join(sorted(n["reads"])) or "-"
+        w = ",".join(sorted(n["writes"])) or "-"
+        print(f"  #{i:<2} {n['kind']:<8} reads[{r}] writes[{w}]")
+    print("=== edges (consumer -> producer) ===")
+    for c, p, why in edges:
+        print(f"  #{c} ({nodes[c]['kind']}) -> #{p} ({nodes[p]['kind']})   [{why}]")
+
+
+if __name__ == "__main__":
+    _main()
diff --git a/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py b/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
index 3f583ef2..994f1f1f 100644
--- a/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
+++ b/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
@@ -30,6 +30,8 @@
 
 import mlir.ir as ir  # noqa: E402
 
+from ._mlir_util import walk_ops, attr_i64_array
+
 MARKERS = ("subtile_size",)   # only subtile DMAs are split
 
 MVIN, MVIN2, MVIN3, MVOUT = 2, 1, 14, 3
@@ -54,12 +56,6 @@ def _const_int(value, default=-1):
         return default
 
 
-def _int_array_attr(op, key):
-    if key not in op.attributes:
-        return []
-    return [ir.IntegerAttr(a).value for a in ir.ArrayAttr(op.attributes[key])]
-
-
 def _is_block_arg(v):
     return isinstance(v, ir.BlockArgument)
 
@@ -106,13 +102,13 @@ def tile_shape(self):
         return list(mt.shape)
 
     def subtile_size(self):
-        return _int_array_attr(self.op, "subtile_size")
+        return attr_i64_array(self.op, "subtile_size", default=[])
 
     def sram_stride(self):
-        return _int_array_attr(self.op, "sram_stride")
+        return attr_i64_array(self.op, "sram_stride", default=[])
 
     def dram_stride(self):
-        return _int_array_attr(self.op, "dram_stride")
+        return attr_i64_array(self.op, "dram_stride", default=[])
 
     def is_async(self):
         a = self.op.attributes
@@ -293,20 +289,12 @@ def _reaches(value, target):
 # ---------------------------------------------------------------------------
 # Pass driver
 # ---------------------------------------------------------------------------
-def _iter_ops(block):
-    for op in list(block.operations):
-        yield op
-        for region in op.operation.regions:
-            for b in region.blocks:
-                yield from _iter_ops(b)
-
-
 def _run_func(func, vectorlane):
     from mlir.dialects import linalg
     # First matmul only.
     matmul = None
     dmas = []
-    for op in _iter_ops(func.regions[0].blocks[0]):
+    for op in walk_ops(func.regions[0].blocks[0]):
         name = op.operation.name
         if name == "linalg.matmul" and matmul is None:
             matmul = op
diff --git a/PyTorchSimFrontend/mlir/passes/lower_dma_to_gemmini.py b/PyTorchSimFrontend/mlir/passes/lower_dma_to_gemmini.py
index f5b841bb..998a6db5 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_dma_to_gemmini.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_dma_to_gemmini.py
@@ -22,6 +22,8 @@
 WAIT_NAME = "memref.dma_wait"
 MARKERS = (OP_NAME, WAIT_NAME)
 
+from ._mlir_util import attr_i64_array
+
 # func7 instruction codes (CustomDMAAttribute.h)
 CONFIG, CONFIG2, CONFIG3, CONFIG4 = 0, 4, 5, 6
 MVIN, MVIN2, MVIN3, MVOUT = 2, 1, 14, 3
@@ -124,8 +126,8 @@ def elem_addr_i64(memref_val, indices, mtype, elem_bytes):
         tile_shape = _subtile(op)
         if tile_shape is None:
             tile_shape = list(dst_ty.shape) if is_mvin else list(src_ty.shape)
-        dram_strides = _int_array(op, "dram_stride")
-        spad_strides = _int_array(op, "sram_stride")
+        dram_strides = attr_i64_array(op, "dram_stride")
+        spad_strides = attr_i64_array(op, "sram_stride")
         assert len(tile_shape) == len(dram_strides) == len(spad_strides), \
             f"shape/stride rank mismatch: {tile_shape} {dram_strides} {spad_strides}"
 
@@ -180,11 +182,6 @@ def _subtile(op):
     return [IntegerAttr(a).value for a in ArrayAttr(op.attributes["subtile_size"])]
 
 
-def _int_array(op, name):
-    from mlir.ir import ArrayAttr, IntegerAttr
-    return [IntegerAttr(a).value for a in ArrayAttr(op.attributes[name])]
-
-
 def _elem_bytes(elem_type):
     from mlir.ir import IntegerType, FloatType
     bits = (IntegerType(elem_type).width if IntegerType.isinstance(elem_type)
diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
new file mode 100644
index 00000000..4ba40e16
--- /dev/null
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
@@ -0,0 +1,454 @@
+"""lower_to_emitc pass (C4): skeleton+API MLIR -> EmitC -> C++ -> trace `.so`.
+
+Second stage of the C++ trace pipeline (docs/design/togsim_cpp_trace.md, sec
+5-7). Takes the skeleton+API module from `build_skeleton` (loop nest +
+`togsim.*` ops) and produces an EmitC module whose single entry function
+
+    extern "C" void togsim_kernel(EmitCtx* ctx, int64_t* shape_args, int32_t n)
+
+mirrors the loop skeleton, with every `togsim.*` op as an `emitc.call_opaque`
+to the matching `togsim_runtime.h` free function (`togsim_ops.EMITC_CALLEE`).
+`mlir-translate --mlir-to-cpp` renders it to C++, compiled to a `.so` that
+exports `togsim_kernel` and leaves the `togsim_*` runtime functions it calls
+undefined for the TOGSim loader to resolve at `dlopen`.
+
+How the lowering is done -- it drives the *upstream* EmitC conversion passes and
+adds only the glue they cannot do:
+
+  1. (python) Rewrite the unregistered `togsim.*` ops to `emitc.call_opaque`.
+     Unregistered ops have no registered conversion patterns, so this must be a
+     custom rewrite (design sec 8). Also rewrite the kernel's signature to the
+     ABI form (drop the memref tensor args -- the trace producer never touches
+     tensor data; base addresses are deferred to P3) and drop the aux
+     globals / wrapper func.
+  2. (upstream passes, in-process PassManager)
+        func.func(lower-affine) -> convert-scf-to-emitc
+        -> convert-arith-to-emitc -> convert-func-to-emitc
+     This is the EmitC infrastructure: it lowers the affine/scf loop nest to
+     `emitc.for`, the index/arith (loop bounds, and in P3 the address
+     arithmetic) to EmitC, and the func to `emitc.func`.
+  3. (python) Two small fixups for residue the passes leave in this LLVM 20 build:
+       * `convert-scf-to-emitc` emits `emitc.for` with `index`-typed bounds, so
+         `convert-arith-to-emitc` (which makes constants `!emitc.size_t`) leaves
+         `builtin.unrealized_conversion_cast` on the bounds that nothing folds
+         and `mlir-to-cpp` cannot print (design sec 8 "EmitC coverage" risk).
+         `_retype_for_to_size_t` retypes the loops and folds those casts away.
+       * add the `extern "C"` specifier so `dlsym` finds the entry unmangled.
+
+Requires the MLIR Python bindings (incl. `mlir.passmanager`); the .cpp/.so
+steps additionally require `mlir-translate` (TORCHSIM_LLVM_PATH) and a host C++
+compiler.
+"""
+
+import os
+import subprocess
+
+from mlir.passmanager import PassManager
+
+from . import togsim_ops as ts
+from ._mlir_util import walk_ops, i32, i64, attr_int, attr_i64_array
+from .build_tog import ir, _find_kernel
+
+#: emitted entry symbol (== ts.ENTRY_SYMBOL == "togsim_kernel").
+ENTRY = ts.ENTRY_SYMBOL
+
+#: EmitC type of the opaque EmitCtx* threaded through every call.
+CTX_TYPE = '!emitc.ptr<!emitc.opaque<"EmitCtx">>'
+
+#: upstream EmitC conversion pipeline (the infrastructure this pass drives).
+_PIPELINE = ("builtin.module("
+             "func.func(lower-affine),"
+             "convert-scf-to-emitc,"
+             "convert-arith-to-emitc,"
+             "convert-func-to-emitc)")
+
+#: prepended to the mlir-to-cpp output; pulls in size_t/intN_t and the ABI.
+_PRELUDE = (
+    "#include <cstddef>\n"
+    "#include <cstdint>\n"
+    "using std::size_t;\n"
+    '#include "togsim_runtime.h"\n'
+)
+
+
+# ---------------------------------------------------------------------------
+# attribute builders / readers
+# ---------------------------------------------------------------------------
+def _idx(v):
+    return ir.IntegerAttr.get(ir.IndexType.get(), int(v))
+
+
+def _opaque(ctx, text):
+    return ir.Attribute.parse('#emitc.opaque<"%s">' % text, ctx)
+
+
+def _arr(ctx, vals):
+    """A C compound-literal `(const int64_t[]){...}` arg, or `nullptr` if empty
+    (the call site decays it to a `const int64_t*`)."""
+    vals = list(vals)
+    if not vals:
+        return _opaque(ctx, "nullptr")
+    return _opaque(ctx, "(const int64_t[]){%s}" % ", ".join(str(int(v)) for v in vals))
+
+
+def _attr_bool(op, key):
+    return 1 if ir.BoolAttr(op.operation.attributes[key]).value else 0
+
+
+# ---------------------------------------------------------------------------
+# step 1: rewrite signature + togsim.* ops (the unregistered-op glue)
+# ---------------------------------------------------------------------------
+def _strip_aux(module):
+    """Erase memref.global decls and every func other than @kernel (e.g. the wrapper)."""
+    victims = []
+    for op in module.body.operations:
+        name = op.operation.name
+        if name == "memref.global":
+            victims.append(op)
+        elif name == "func.func":
+            if ir.StringAttr(op.operation.attributes["sym_name"]).value != "kernel":
+                victims.append(op)
+    for op in victims:
+        op.operation.erase()
+
+
+def _rewrite_signature(kernel, ctx):
+    """Replace @kernel's memref tensor args with the ABI args
+    (EmitCtx*, int64_t* shape_args, int32_t n) and rename it to togsim_kernel.
+    Returns the ctx Value."""
+    block = kernel.regions[0].blocks[0]
+    for arg in block.arguments:
+        if len(list(arg.uses)) > 0:
+            raise ValueError(
+                "kernel arg still used after build_skeleton; cannot drop it "
+                "(expected the DCE to have removed all tensor-data ops)")
+    # erase existing (memref) args high-to-low, then append the ABI args.
+    for i in reversed(range(len(block.arguments))):
+        block.erase_argument(i)
+    ptr = ir.Type.parse(CTX_TYPE, ctx)
+    i64ptr = ir.Type.parse("!emitc.ptr<i64>", ctx)
+    i32_ty = ir.IntegerType.get_signless(32)
+    loc = ir.Location.unknown(ctx)
+    block.add_argument(ptr, loc)
+    block.add_argument(i64ptr, loc)
+    block.add_argument(i32_ty, loc)
+    kernel.operation.attributes["function_type"] = ir.TypeAttr.get(
+        ir.FunctionType.get([ptr, i64ptr, i32_ty], []))
+    kernel.operation.attributes["sym_name"] = ir.StringAttr.get(ENTRY)
+    return block.arguments[0]
+
+
+def _call(ctx, ctx_val, op, callee, arg_attrs):
+    """Insert emitc.call_opaque <callee>(ctx) {args=[0:index, ...]} before `op`.
+    The leading `0 : index` references operand 0 (ctx); other entries are
+    literal C args (integer attr -> literal, #emitc.opaque -> verbatim)."""
+    ir.Operation.create(
+        "emitc.call_opaque", results=[], operands=[ctx_val],
+        attributes={"callee": ir.StringAttr.get(callee),
+                    "args": ir.ArrayAttr.get([_idx(0)] + arg_attrs)},
+        loc=ir.Location.unknown(ctx), ip=ir.InsertionPoint(op))
+
+
+def _innermost_outer_loop(block):
+    """Last `affine.for {outer_loop=true}` in pre-order (the deepest for a single
+    nest; the PARALLEL/ACCUMULATION boundary), or None if there is no such loop."""
+    found = [None]
+
+    def is_outer(op):
+        a = op.operation.attributes
+        return "outer_loop" in a and ir.BoolAttr(a["outer_loop"]).value
+
+    def walk(b):
+        for op in b.operations:
+            if op.operation.name == "affine.for" and is_outer(op):
+                found[0] = op   # nested outer loops overwrite -> deepest wins
+            for r in op.operation.regions:
+                for bb in r.blocks:
+                    walk(bb)
+
+    walk(block)
+    return found[0]
+
+
+def _insert_core_alloc(ctx, kernel, ctx_val):
+    """Insert `togsim_core_alloc(ctx)` at the start of each parallel work-item:
+    the first op of the innermost PARALLEL loop body (or the kernel entry if the
+    kernel has no parallel loop -> a single work-item). The runtime binds the
+    following ops to the returned core (sec 9.3); the producer never names
+    num_cores. The return value is discarded (no free -- a core is an assignment,
+    not a held resource)."""
+    block = kernel.regions[0].blocks[0]
+    target = _innermost_outer_loop(block)
+    body = target.operation.regions[0].blocks[0] if target is not None else block
+    ir.Operation.create(
+        "emitc.call_opaque", results=[], operands=[ctx_val],
+        attributes={"callee": ir.StringAttr.get(ts.CORE_ALLOC_CALLEE),
+                    "args": ir.ArrayAttr.get([_idx(0)])},
+        loc=ir.Location.unknown(ctx),
+        ip=ir.InsertionPoint.at_block_begin(body))
+
+
+def _rewrite_togsim_ops(ctx, kernel, ctx_val):
+    block = kernel.regions[0].blocks[0]
+    victims = []
+    for op in walk_ops(block):
+        name = op.operation.name
+        ipo = ir.InsertionPoint(op)
+        if name == ts.DMA:
+            dims = attr_i64_array(op, ts.ATTR_DIMS)
+            # Operands carried by build_skeleton: [dram_index, tag_index] (each
+            # optional). dram_index is the DRAM element offset (the original
+            # affine index, kept live by build_skeleton); tag_index is the
+            # runtime tag slot. Pass each as a call operand so
+            # convert-arith-to-emitc lowers its arithmetic into the producer
+            # (P3 approach A), and reference it from `args` by operand position.
+            # offset -> DRAM byte address (computed by the runtime, which adds
+            # the tensor base); tag_slot -> the SRAM tile slot (the runtime uses
+            # it for the double-buffer / SRAM-capacity model).
+            ins = list(op.operation.operands)
+            dram_operand = ins[0] if len(ins) >= 1 else None
+            tag_operand = ins[1] if len(ins) >= 2 else None
+            operands = [ctx_val]
+            offset_arg = i64(0)
+            tag_arg = i64(0)
+            if dram_operand is not None:
+                operands.append(dram_operand)
+                offset_arg = _idx(len(operands) - 1)
+            if tag_operand is not None:
+                operands.append(tag_operand)
+                tag_arg = _idx(len(operands) - 1)
+            args = [_idx(0),
+                    i32(attr_int(op, ts.ATTR_DIR)),
+                    i32(attr_int(op, ts.ATTR_ARG_ID)),
+                    offset_arg,
+                    i32(len(dims)),
+                    _arr(ctx, dims),
+                    _arr(ctx, attr_i64_array(op, ts.ATTR_STRIDES)),
+                    i32(attr_int(op, ts.ATTR_ELEM_BITS)),
+                    i32(_attr_bool(op, ts.ATTR_IS_ASYNC)),
+                    i32(attr_int(op, ts.ATTR_TAG_ID)),
+                    tag_arg]
+            _rb = attr_i64_array(op, ts.ATTR_READ_BUFS)
+            _wb = attr_i64_array(op, ts.ATTR_WRITE_BUFS)
+            args += [_arr(ctx, _rb), i32(len(_rb)), _arr(ctx, _wb), i32(len(_wb))]
+            # togsim_dma is void: the dma is paired with its barrier by the runtime
+            # (tag_id, tag_slot), not a returned handle.
+            ir.Operation.create(
+                "emitc.call_opaque", results=[], operands=operands,
+                attributes={"callee": ir.StringAttr.get(ts.EMITC_CALLEE[ts.DMA]),
+                            "args": ir.ArrayAttr.get(args)},
+                loc=ir.Location.unknown(ctx), ip=ipo)
+            victims.append(op)
+        elif name == ts.MEMORY_BAR:
+            # explicit async-DMA sync (the original dma_wait) ->
+            # togsim_memory_barrier(ctx, tag_id, tag_slot, write_bufs). The tag
+            # index operand (if any) is the runtime tag slot.
+            ins = list(op.operation.operands)
+            operands = [ctx_val]
+            tag_arg = i64(0)
+            if ins:
+                operands.append(ins[0])
+                tag_arg = _idx(len(operands) - 1)
+            _wb = attr_i64_array(op, ts.ATTR_WRITE_BUFS)
+            ir.Operation.create(
+                "emitc.call_opaque", results=[], operands=operands,
+                attributes={"callee": ir.StringAttr.get(ts.EMITC_CALLEE[ts.MEMORY_BAR]),
+                            "args": ir.ArrayAttr.get(
+                                [_idx(0), i32(attr_int(op, ts.ATTR_TAG_ID)), tag_arg,
+                                 _arr(ctx, _wb), i32(len(_wb))])},
+                loc=ir.Location.unknown(ctx), ip=ipo)
+            victims.append(op)
+        elif name == ts.COMPUTE:
+            # skeleton compute carries no dims (cost is keyed by tile_id) -> 0/null.
+            _rb = attr_i64_array(op, ts.ATTR_READ_BUFS)
+            _wb = attr_i64_array(op, ts.ATTR_WRITE_BUFS)
+            _call(ctx, ctx_val, op, ts.EMITC_CALLEE[ts.COMPUTE],
+                  [i64(attr_int(op, ts.ATTR_TILE_ID)),
+                   i32(attr_int(op, ts.ATTR_COMPUTE_TYPE)),
+                   i32(0), _opaque(ctx, "nullptr"),
+                   _arr(ctx, _rb), i32(len(_rb)), _arr(ctx, _wb), i32(len(_wb))])
+            victims.append(op)
+        elif name == ts.COMPUTE_BAR:
+            # explicit compute fence -> togsim_compute_barrier(ctx) (sec 10.7).
+            ir.Operation.create(
+                "emitc.call_opaque", results=[], operands=[ctx_val],
+                attributes={"callee": ir.StringAttr.get(ts.EMITC_CALLEE[ts.COMPUTE_BAR]),
+                            "args": ir.ArrayAttr.get([_idx(0)])},
+                loc=ir.Location.unknown(ctx), ip=ipo)
+            victims.append(op)
+    for op in victims:
+        op.operation.erase()
+
+
+# ---------------------------------------------------------------------------
+# step 3: post-conversion fixups
+# ---------------------------------------------------------------------------
+def _retype_for_to_size_t(module):
+    """Make every `emitc.for` use `!emitc.size_t` bounds + induction variable,
+    then drop the `index`<->`!emitc.size_t` `unrealized_conversion_cast` ops that
+    `convert-scf-to-emitc` / `convert-arith-to-emitc` leave behind (mlir-to-cpp
+    cannot print them; `reconcile-unrealized-casts` cannot fold them).
+
+    `emitc.for` accepts `size_t` bounds with the explicit type, and a `size_t` IV
+    makes the lowered address arithmetic (`convert-arith-to-emitc`, which works
+    in `size_t`) cast-free. So: set each IV to size_t, then for every
+    index<->size_t cast replace its result with its source (every consumer here
+    -- `emitc.for` bounds, `emitc.call_opaque` operands, `emitc` arith -- accepts
+    either, and after the IV retype each such cast bridges equal types)."""
+    idx = ir.IndexType.get()
+    st = ir.Type.parse("!emitc.size_t", module.context)
+
+    for op in list(walk_ops(module.body)):
+        if op.operation.name == "emitc.for":
+            op.operation.regions[0].blocks[0].arguments[0].set_type(st)
+
+    dead = []
+    for op in list(walk_ops(module.body)):
+        if op.operation.name != "builtin.unrealized_conversion_cast":
+            continue
+        res = op.results[0]
+        src = list(op.operation.operands)[0]
+        # idx<->size_t bridges (incl. the size_t->size_t identities left after
+        # the IV retype): every consumer here accepts either, so fold to source.
+        if src.type in (idx, st) and res.type in (idx, st):
+            res.replace_all_uses_with(src)
+            dead.append(op)
+    for d in dead:
+        try:
+            d.operation.erase()
+        except Exception:
+            pass
+
+
+def _add_extern_c(module, ctx):
+    for op in module.body.operations:
+        if (op.operation.name == "emitc.func"
+                and ir.StringAttr(op.operation.attributes["sym_name"]).value == ENTRY):
+            op.operation.attributes["specifiers"] = ir.ArrayAttr.get(
+                [ir.StringAttr.get('extern "C"')])
+            return
+    raise ValueError("emitc.func @%s not found after conversion" % ENTRY)
+
+
+# ---------------------------------------------------------------------------
+# driver
+# ---------------------------------------------------------------------------
+def lower_to_emitc(skeleton_module):
+    """Lower a skeleton+API module (in place) to an EmitC module with the
+    `togsim_kernel` entry function. Returns the same module."""
+    ctx = skeleton_module.context
+    kernel = _find_kernel(skeleton_module)
+    if kernel is None:
+        raise ValueError("no @kernel found in skeleton module")
+
+    _strip_aux(skeleton_module)
+    ctx_val = _rewrite_signature(kernel, ctx)
+    _insert_core_alloc(ctx, kernel, ctx_val)          # core_alloc per work-item
+    _rewrite_togsim_ops(ctx, kernel, ctx_val)
+
+    PassManager.parse(_PIPELINE, ctx).run(skeleton_module.operation)
+
+    _retype_for_to_size_t(skeleton_module)
+    _add_extern_c(skeleton_module, ctx)
+    return skeleton_module
+
+
+# ---------------------------------------------------------------------------
+# C++ / .so backend
+# ---------------------------------------------------------------------------
+def _mlir_translate_bin():
+    return os.path.join(os.environ.get("TORCHSIM_LLVM_PATH", "/usr/bin"),
+                        "mlir-translate")
+
+
+def emitc_to_cpp(emitc_module, mlir_translate=None):
+    """Render `emitc_module` to C++ source (prelude + mlir-to-cpp body)."""
+    mlir_translate = mlir_translate or _mlir_translate_bin()
+    proc = subprocess.run(
+        [mlir_translate, "--mlir-to-cpp"],
+        input=str(emitc_module), capture_output=True, text=True)
+    if proc.returncode != 0:
+        raise RuntimeError("mlir-translate --mlir-to-cpp failed:\n" + proc.stderr)
+    return _PRELUDE + proc.stdout
+
+
+def compile_so(cpp_text, so_path, include_dir, cxx=None):
+    """Compile producer C++ to `so_path`. `include_dir` must hold
+    togsim_runtime.h. togsim_* symbols are left undefined (resolved at dlopen)."""
+    cxx = cxx or os.environ.get("CXX", "g++")
+    cpp_path = os.path.splitext(so_path)[0] + ".cpp"
+    with open(cpp_path, "w") as fh:
+        fh.write(cpp_text)
+    proc = subprocess.run(
+        [cxx, "-shared", "-fPIC", "-std=gnu++17", "-O2",
+         "-I", include_dir, cpp_path, "-o", so_path],
+        capture_output=True, text=True)
+    if proc.returncode != 0:
+        raise RuntimeError("%s failed:\n%s" % (cxx, proc.stderr))
+    return so_path
+
+
+def _default_include_dir():
+    root = os.environ.get("TORCHSIM_DIR")
+    if not root:
+        root = os.path.dirname(os.path.dirname(os.path.dirname(
+            os.path.dirname(os.path.abspath(__file__)))))
+    return os.path.join(root, "TOGSim", "include")
+
+
+def skeleton_to_so(skeleton_module, so_path, include_dir=None):
+    """skeleton module -> EmitC -> C++ -> compiled trace `.so`. Returns the
+    EmitC module text (for inspection / caching)."""
+    emitc = lower_to_emitc(skeleton_module)
+    cpp = emitc_to_cpp(emitc)
+    compile_so(cpp, so_path, include_dir or _default_include_dir())
+    return str(emitc)
+
+
+def build_trace_so(postvcix_path, so_path, include_dir=None):
+    """Full P2 path from a post-vcix kernel .mlir to a trace `.so`."""
+    from . import build_skeleton as bs
+
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx:
+        module = ir.Module.parse(open(postvcix_path).read(), ctx)
+        bs.build_skeleton(module)
+        return skeleton_to_so(module, so_path, include_dir)
+
+
+def main(argv):
+    import argparse
+
+    parser = argparse.ArgumentParser(prog="lower_to_emitc.py")
+    parser.add_argument("input", help="post-vcix kernel .mlir")
+    parser.add_argument("--so", required=True, help="output .so path")
+    parser.add_argument("--include-dir", default=None,
+                        help="dir holding togsim_runtime.h (default: TOGSim/include)")
+    parser.add_argument("--emit-cpp", default=None,
+                        help="also write the generated C++ here")
+    parser.add_argument("--emit-mlir", default=None,
+                        help="also write the EmitC MLIR here")
+    args = parser.parse_args(argv[1:])
+
+    from . import build_skeleton as bs
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx:
+        module = ir.Module.parse(open(args.input).read(), ctx)
+        bs.build_skeleton(module)
+        emitc = lower_to_emitc(module)
+        if args.emit_mlir:
+            open(args.emit_mlir, "w").write(str(emitc))
+        cpp = emitc_to_cpp(emitc)
+        if args.emit_cpp:
+            open(args.emit_cpp, "w").write(cpp)
+        compile_so(cpp, args.so, args.include_dir or _default_include_dir())
+    import sys
+    sys.stderr.write("wrote %s\n" % args.so)
+    return 0
+
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(main(sys.argv))
diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py b/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py
index ac93ebc8..9635db68 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py
@@ -29,6 +29,8 @@
 
 import mlir.ir as ir  # noqa: E402
 
+from ._mlir_util import walk_ops, i32, i64, attr_bool
+
 MARKERS = ("linalg.matmul", "math.exp", "math.erf", "math.tanh", "math.sin", "math.cos")
 
 # math op name -> (opcode, imm) for the vcix.v.iv lowering (mirror Math*ToVCIX).
@@ -80,20 +82,12 @@ def _legalize_vector_type(vt, vlen):
     return n, ir.VectorType.get([elt_count >> (n - 1)], elt_ty, scalable=[True])
 
 
-def _i64(v):
-    return ir.IntegerAttr.get(ir.IntegerType.get_signless(64), v)
-
-
-def _i32(v):
-    return ir.IntegerAttr.get(ir.IntegerType.get_signless(32), v)
-
-
 def _viv(operand, result_ty, opcode, imm, rvl=None):
     """Create an unregistered vcix.v.iv (vcix::BinaryImmOp) op at the current IP."""
     operands = [operand] if rvl is None else [operand, rvl]
     return ir.Operation.create(
         "vcix.v.iv", results=[result_ty], operands=operands,
-        attributes={"opcode": _i64(opcode), "imm": _i32(imm)}).results[0]
+        attributes={"opcode": i64(opcode), "imm": i32(imm)}).results[0]
 
 
 def _make_sf_vc_v_iv(vec, op_vt, n, legal_ty, opcode, imm):
@@ -104,7 +98,7 @@ def _make_sf_vc_v_iv(vec, op_vt, n, legal_ty, opcode, imm):
     scalable = legal_ty.scalable
     rvl = None
     if scalable:
-        rvl = arith.ConstantOp(ir.IntegerType.get_signless(64), _i64(9)).result
+        rvl = arith.ConstantOp(ir.IntegerType.get_signless(64), i64(9)).result
     if n == 1:
         return _viv(vec, legal_ty, opcode, imm, rvl)
     elt_ty = legal_ty.element_type
@@ -119,24 +113,16 @@ def _make_sf_vc_v_iv(vec, op_vt, n, legal_ty, opcode, imm):
         for i in range(total // elt_count):
             ext = vector.ExtractStridedSliceOp(
                 legal_ty, vec,
-                ir.ArrayAttr.get([_i64(i * elt_count)]),
-                ir.ArrayAttr.get([_i64(elt_count)]),
-                ir.ArrayAttr.get([_i64(1)])).result
+                ir.ArrayAttr.get([i64(i * elt_count)]),
+                ir.ArrayAttr.get([i64(elt_count)]),
+                ir.ArrayAttr.get([i64(1)])).result
             v = _viv(ext, legal_ty, opcode, imm, rvl)
             res = vector.InsertStridedSliceOp(
-                v, res, ir.ArrayAttr.get([_i64(i * elt_count)]),
-                ir.ArrayAttr.get([_i64(1)])).result
+                v, res, ir.ArrayAttr.get([i64(i * elt_count)]),
+                ir.ArrayAttr.get([i64(1)])).result
     return res
 
 
-def _iter_ops(block):
-    for op in list(block.operations):
-        yield op
-        for region in op.operation.regions:
-            for b in region.blocks:
-                yield from _iter_ops(b)
-
-
 # ---------------------------------------------------------------------------
 # matmul lowering helpers (mirror MatmulOpLowering)
 # ---------------------------------------------------------------------------
@@ -146,11 +132,6 @@ def _elt_bits(elt_ty):
     return ir.FloatType(elt_ty).width
 
 
-def _bool_attr_true(op, key):
-    a = op.attributes
-    return key in a and ir.BoolAttr(a[key]).value
-
-
 def _enclosing_loops(op):
     """Walk ancestor ops; return (accumulation, outer, inner) affine.for lists,
     outermost-first (mirror the C++ insert-at-begin)."""
@@ -158,11 +139,11 @@ def _enclosing_loops(op):
     parent = op.operation.parent
     while parent is not None:
         if parent.name == "affine.for":
-            if _bool_attr_true(parent, "accumulation_loop"):
+            if attr_bool(parent, "accumulation_loop"):
                 acc.insert(0, parent)
-            if _bool_attr_true(parent, "outer_loop"):
+            if attr_bool(parent, "outer_loop"):
                 outer.insert(0, parent)
-            if _bool_attr_true(parent, "inner_loop"):
+            if attr_bool(parent, "inner_loop"):
                 inner.insert(0, parent)
         parent = parent.parent
     return acc, outer, inner
@@ -200,7 +181,7 @@ def _scan_conv_offsets(ow_loop, o_h, k_h, o_w, k_w):
     """Mirror the heuristic offset scan: find affine.apply(o_h,k_h)/(o_w,k_w) in the
     o_w loop and read the constant in its map (default 1)."""
     offset_h = offset_w = 1
-    for o in _iter_ops(ow_loop.regions[0].blocks[0]):
+    for o in walk_ops(ow_loop.regions[0].blocks[0]):
         if o.operation.name != "affine.apply":
             continue
         ops = list(o.operation.operands)
@@ -391,7 +372,7 @@ def _root(v):
                 return owner.operands[0]
         return v
     rootA, rootB = _root(A), _root(B)
-    for o in _iter_ops(outer[-1].regions[0].blocks[0]):
+    for o in walk_ops(outer[-1].regions[0].blocks[0]):
         if o.operation.name == "affine.vector_store":
             dest = _root(o.operation.operands[1])
             if dest == rootA:
@@ -617,7 +598,7 @@ def run(module, vectorlane=128, vlen=128, **_):
     mms = []
     for region in module.operation.regions:
         for b in region.blocks:
-            for o in _iter_ops(b):
+            for o in walk_ops(b):
                 if o.operation.name == "linalg.matmul":
                     mms.append(o.operation)
     for o in mms:
@@ -625,7 +606,7 @@ def run(module, vectorlane=128, vlen=128, **_):
     targets = []
     for region in module.operation.regions:
         for b in region.blocks:
-            for op in _iter_ops(b):
+            for op in walk_ops(b):
                 if op.operation.name in _MATH_VIV:
                     targets.append(op.operation)
     for op in targets:
diff --git a/PyTorchSimFrontend/mlir/passes/lower_vlane_idx.py b/PyTorchSimFrontend/mlir/passes/lower_vlane_idx.py
index 76e30cb3..3ed0a394 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_vlane_idx.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_vlane_idx.py
@@ -24,13 +24,7 @@
 OP_NAME = "torchsim.vlane_idx"
 MARKERS = (OP_NAME,)
 
-
-def _iter_ops(block):
-    for op in list(block.operations):
-        yield op
-        for region in op.operation.regions:
-            for b in region.blocks:
-                yield from _iter_ops(b)
+from ._mlir_util import walk_ops
 
 
 def run(module, **_):
@@ -46,7 +40,7 @@ def run(module, **_):
     targets = []
     for region in module.operation.regions:
         for b in region.blocks:
-            for op in _iter_ops(b):
+            for op in walk_ops(b):
                 if op.operation.name == OP_NAME:
                     targets.append(op.operation)
 
diff --git a/PyTorchSimFrontend/mlir/passes/togsim_ops.py b/PyTorchSimFrontend/mlir/passes/togsim_ops.py
new file mode 100644
index 00000000..c2be060d
--- /dev/null
+++ b/PyTorchSimFrontend/mlir/passes/togsim_ops.py
@@ -0,0 +1,101 @@
+"""Shared vocabulary for the skeleton+API MLIR form (C1).
+
+The trace pipeline (docs/design/togsim_cpp_trace.md) reduces a kernel's MLIR to
+a *loop skeleton + API calls*: native `affine.for`/`scf.for` loops (bounds kept
+as-is, symbolic preserved) plus a handful of `togsim.*` ops that stand for the
+runtime API. This module is the single source of truth for those op names and
+attribute keys, shared by:
+
+  * build_skeleton (C2) -- produces the skeleton+API MLIR, and
+  * togsim->emitc lowering (C4) -- rewrites each op to an `emitc.call_opaque`.
+
+The ops are kept *unregistered* (like the existing `togsim.transfer`), so there
+is no C++ dialect to register; C4 is a custom rewrite, not a registered
+ConversionPass.
+
+Grammar (each op lowers 1:1 to a `togsim_runtime.h` free function):
+
+    "togsim.dma"(%dram_idx, %tag_idx) {         -> togsim_dma(ctx, dir, arg_id,
+            dir = 0 | 1,            # LOAD|STORE      offset, ndim, dims, strides,
+            dims = [..], strides = [..],                elem_bits, is_async,
+            elem_bits = i32, is_async = bool,           tag_id, tag_slot,
+            tag_id = i32, arg_id = i32,                 read_bufs, write_bufs)
+            read_bufs = [..], write_bufs = [..]
+         } : (index, index) -> ()
+
+    "togsim.compute"() {                        -> togsim_compute(ctx, tile_id,
+            tile_id = i64, compute_type = i32,          compute_type, ndim, dims,
+            read_bufs = [..], write_bufs = [..]         read_bufs, write_bufs)
+         } : () -> ()
+
+    "togsim.memory_barrier"(%tag_idx) {         -> togsim_memory_barrier(ctx,
+            tag_id = i32, write_bufs = [..]             tag_id, tag_slot, write_bufs)
+         } : (index) -> ()
+
+    "togsim.compute_barrier"() : () -> ()       -> togsim_compute_barrier(ctx)
+
+How an async dma pairs with its sync point: NOT by a compile-time id. One static
+`togsim.dma` op runs once per loop iteration, each with a different RUNTIME tag
+slot `%tag[%idx]`, so the pairing must be a runtime key. `togsim.dma` carries a
+`tag_id` (its tag memref identity) and the runtime `%tag[%idx]` operand; the
+original `memref.dma_wait` becomes an explicit `togsim.memory_barrier` carrying
+the same `tag_id` + tag index. They pair at runtime by `(tag_id, tag_slot)` via
+the Core's tag table (the dma signals the tag at data arrival; the barrier waits
+on it). `tag_id` (which tag memref) is distinct from `tag_slot` (the SRAM tile slot,
+used for the double-buffer / capacity model). A sync (non-async) dma is blocking,
+so it needs no barrier. (Supersedes the earlier static `event_id` + `togsim.wait`
+design, which could not express per-iteration pairing.)
+
+Keep this in lockstep with TOGSim/include/togsim_runtime.h (TOGSIM_ABI_VERSION).
+"""
+
+# ---- op names -------------------------------------------------------------
+DMA    = "togsim.dma"
+COMPUTE = "togsim.compute"
+COMPUTE_BAR = "togsim.compute_barrier"  # fence: drain async compute before a consumer (sec 10.7)
+MEMORY_BAR = "togsim.memory_barrier"    # explicit async-DMA sync (the original dma_wait); tag-keyed
+
+#: every op this module owns (for matchers / DCE roots in C2).
+OP_NAMES = (DMA, COMPUTE, COMPUTE_BAR, MEMORY_BAR)
+
+#: op name -> the togsim_runtime.h symbol C4 lowers it to.
+EMITC_CALLEE = {
+    DMA:     "togsim_dma",
+    COMPUTE: "togsim_compute",
+    COMPUTE_BAR: "togsim_compute_barrier",
+    MEMORY_BAR: "togsim_memory_barrier",
+}
+
+#: producer entry-point symbol the TOGSim loader resolves (see togsim_runtime.h).
+ENTRY_SYMBOL = "togsim_kernel"
+
+#: runtime callee emitted directly by lower_to_emitc (not a skeleton op): the
+#: per-work-item core allocation. See togsim_cpp_trace.md sec 9.3. Kept in
+#: lockstep with togsim_runtime.h.
+CORE_ALLOC_CALLEE = "togsim_core_alloc"
+
+# ---- attribute keys -------------------------------------------------------
+ATTR_DIR       = "dir"        # i32: DIR_LOAD | DIR_STORE
+ATTR_DIMS      = "dims"       # i64 array: tile extents
+ATTR_STRIDES   = "strides"    # i64 array: tile strides
+ATTR_ELEM_BITS = "elem_bits"  # i32
+ATTR_IS_ASYNC  = "is_async"   # bool
+ATTR_TILE_ID   = "tile_id"    # i64: key into the precomputed tile_id->cycle table
+ATTR_COMPUTE_TYPE = "compute_type"  # i32: 0 vector / 1 matmul / 2 preload (Core enum)
+ATTR_READ_BUFS  = "read_bufs"   # i64 array: SRAM buffer ids this op reads  (sec 10 dataflow)
+ATTR_WRITE_BUFS = "write_bufs"  # i64 array: SRAM buffer ids this op writes (sec 10 dataflow)
+ATTR_TAG_ID    = "tag_id"     # i32: identity of the DMA's tag memref; pairs an async dma with
+                              #      its memory_barrier by the RUNTIME tag slot (tag_id + tag index)
+ATTR_ARG_ID    = "arg_id"     # i32: which tensor (func arg) this DMA's base is
+
+# Must match togsim_dma_dir in togsim_runtime.h.
+DIR_LOAD  = 0
+DIR_STORE = 1
+
+
+def is_togsim_op(op):
+    """True if `op` (an Operation or a wrapping view) is one of ours."""
+    name = getattr(op, "name", None)
+    if name is None:
+        name = getattr(getattr(op, "operation", None), "name", None)
+    return name in OP_NAMES
diff --git a/TOGSim/include/Instruction.h b/TOGSim/include/Instruction.h
index bb62a440..b513ee5c 100644
--- a/TOGSim/include/Instruction.h
+++ b/TOGSim/include/Instruction.h
@@ -12,7 +12,11 @@
 #include <memory>
 #include <vector>
 
-enum class Opcode { MOVIN, MOVOUT, COMP, BAR, COUNT};
+// MEMORY_BAR: the DMA/memory barrier (waits a DMA tag in the tag table).
+// COMPUTE_BAR: the compute barrier -- waits for every compute pipeline (all SAs
+//              and the VPU) to drain, then finishes. Used as the explicit
+//              fence before a store consumes async matmul results (sec 10.7).
+enum class Opcode { MOVIN, MOVOUT, COMP, MEMORY_BAR, COMPUTE_BAR, COUNT};
 
 typedef uint64_t addr_type;
 typedef uint64_t cycle_type;
@@ -29,6 +33,11 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   Instruction(Opcode opcode);
   void finish_instruction();
   void add_child(std::shared_ptr<Instruction> child);
+  // Occupancy (SA-pipeline) dependency: the child is released when THIS op is
+  // ISSUED (enters the pipeline), not when it finishes -- so a preload/matmul
+  // successor overlaps it instead of waiting its full latency (sec 10.7).
+  void add_pipeline_child(std::shared_ptr<Instruction> child);
+  void release_pipeline_children();
   bool check_ready() { return ready_counter == 0; }
   const Opcode get_opcode() { return opcode; }
   bool is_dma_read() { return opcode == Opcode::MOVIN; }
@@ -103,6 +112,7 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   cycle_type overlapping_cycle;
   size_t ready_counter;
   std::set<std::shared_ptr<Instruction>> child_inst;
+  std::set<std::shared_ptr<Instruction>> _pipeline_children;  // released at issue (sec 10.7)
   std::vector<size_t> tile_size;
   std::vector<int> tile_stride;
   size_t _tile_numel;
diff --git a/TOGSim/include/togsim_loader.h b/TOGSim/include/togsim_loader.h
new file mode 100644
index 00000000..5df94b75
--- /dev/null
+++ b/TOGSim/include/togsim_loader.h
@@ -0,0 +1,76 @@
+#pragma once
+// togsim_loader.h
+// -----------------------------------------------------------------------------
+// TOGSim-side loader for the compiled trace producer (C6, P3 task 5). NOT part
+// of the producer ABI (togsim_runtime.h) -- this is the TOGSim half that
+// `dlopen`s a producer `.so`, runs its `togsim_kernel`, and records the emitted
+// instruction stream. See docs/design/togsim_cpp_trace.md sec 5.3 / 9.7.
+//
+// This first cut is the "materializing sink": the callbacks resolve each tile's
+// DRAM address (base[arg_id] + offset*elem_bytes) and per-tile compute cost
+// (the cycle table), record the runtime (tag_id, tag_slot) pairing key, and
+// append a TraceRec per modeled instruction. The recorded stream is bridged
+// into the existing timing core (Core/Simulator) by trace_to_tilegraph
+// (togsim_trace_bridge.h) for cycle-equivalence vs the build_tog path.
+// -----------------------------------------------------------------------------
+
+#include <cstdint>
+#include <vector>
+
+#include "togsim_runtime.h"
+
+namespace togsim {
+
+// One modeled instruction recorded by the runtime callbacks.
+struct TraceRec {
+  enum Kind { DISPATCH, DMA, COMPUTE, MEMORY_BAR, COMPUTE_BAR } kind;
+  int32_t  core;          // work-item -> core binding (set by togsim_core_alloc)
+  // DMA / MEMORY_BAR
+  int32_t  dir;           // togsim_dma_dir
+  int32_t  arg_id;        // tensor
+  int32_t  elem_bits;
+  int32_t  is_async;
+  uint64_t addr;          // resolved DRAM byte address = base[arg_id] + off*bytes
+  int32_t  tag_id;        // DMA/MEMORY_BAR: tag memref identity; with tag_slot the
+                          // runtime pairing key (an async dma <-> its memory_barrier)
+  uint64_t tag_slot;      // SRAM tile slot (double-buffer / capacity model)
+  std::vector<int64_t> dims;     // tile extents (DMA)
+  std::vector<int64_t> strides;  // tile strides (DMA)
+  std::vector<int64_t> read_bufs;   // SRAM buffer ids read  (sec 10 dataflow DAG)
+  std::vector<int64_t> write_bufs;  // SRAM buffer ids written (MEMORY_BAR: released bufs)
+  // COMPUTE
+  uint64_t tile_id;
+  int32_t  compute_type;  // 0 vector / 1 matmul / 2 preload (Core unit enum)
+  int64_t  cycle;         // looked up from the cycle table
+  int64_t  overlapping;   // looked up from the cycle table
+};
+
+struct RunResult {
+  bool ok = false;
+  std::vector<TraceRec> trace;
+};
+
+// Load `so_path`, run its `togsim_kernel(ctx, shape_args, n_shape)` against a
+// freshly built EmitCtx, and return the recorded trace.
+//   tensor_base[arg_id] : DRAM base address of each kernel tensor argument
+//   cyc[tile_id] / ovl[tile_id] : the cycle table (cycle, overlapping_cycle)
+//   num_cores : dispatch round-robins work-items across this many cores
+RunResult run_producer(const char* so_path,
+                       const int64_t* shape_args, int32_t n_shape,
+                       const uint64_t* tensor_base, int32_t n_tensors,
+                       const int64_t* cyc, const int64_t* ovl, int32_t n_tiles,
+                       int32_t num_cores);
+
+// First-order reference timing over a recorded trace, to validate that the
+// stream carries enough to be scheduled (it is NOT the production Core -- no
+// DRAM/NoC/L2 contention; the real cycle-equivalence path feeds Tile/TileGraph
+// into Core). Models, per core: a DMA-engine timeline (DMAs serialize, overlap
+// compute), a compute timeline (serial = reduction accumulate, with the
+// finish = prev.finish + cycle - overlapped pipeline overlap of Core.cc), and
+// data dependencies (a compute waits the async dmas that its preceding
+// togsim_memory_barrier()s pair with by (tag_id, tag_slot)).
+struct TimingParams { uint64_t dma_latency = 100; };
+struct SimResult { uint64_t total_cycle = 0; int n_compute = 0, n_dma = 0; };
+SimResult simulate(const RunResult& run, const TimingParams& params);
+
+}  // namespace togsim
diff --git a/TOGSim/include/togsim_runtime.h b/TOGSim/include/togsim_runtime.h
new file mode 100644
index 00000000..6748b9e6
--- /dev/null
+++ b/TOGSim/include/togsim_runtime.h
@@ -0,0 +1,157 @@
+#pragma once
+// togsim_runtime.h
+// -----------------------------------------------------------------------------
+// Shared C ABI between a compiled, shape-parametric trace producer (`.so`,
+// generated MLIR -> EmitC -> C++) and TOGSim. See docs/design/togsim_cpp_trace.md.
+//
+// The producer keeps loops as native loops (symbolic bounds become function
+// parameters) and calls the functions below; each call emits one trace record =
+// one modeled instruction. TOGSim `dlopen`s the producer, constructs an
+// `EmitCtx`, calls the entry point, records the emitted stream, and feeds it to
+// the existing timing core. The producer carries NO timing model and NO
+// functional compute -- it is a deterministic trace generator only.
+//
+// ABI shape rationale: `mlir-translate --mlir-to-cpp` lowers our `togsim.*` ops
+// (via `emitc.call_opaque`) to *free function* calls, so the contract is a set
+// of `extern "C"` free functions taking an opaque `EmitCtx*` as the first
+// argument. Implementations live in TOGSim and may dispatch internally; the
+// `EmitCtx` is opaque to the producer. `togsim_abi_version()` guards against a
+// producer `.so` built against a stale header.
+//
+// STATUS: firmed up in P2. The signatures below match what the C4
+// togsim->emitc lowering (PyTorchSimFrontend/mlir/passes/lower_to_emitc.py)
+// emits as `emitc.call_opaque` targets and what `mlir-translate --mlir-to-cpp`
+// renders. Synchronization is runtime-tag based (ABI v11): an async togsim_dma
+// and its togsim_memory_barrier pair by the runtime (tag_id, tag_slot) key,
+// not by a compile-time `event_id`. Tile addresses are passed as (arg_id,
+// element offset); the runtime adds the tensor base address (since v4).
+// -----------------------------------------------------------------------------
+
+#include <cstdint>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Bump whenever the signatures below change incompatibly. TOGSim refuses to load
+// a producer whose embedded version (a `togsim_producer_abi_version` symbol, or
+// a value passed at the entry point) does not match.
+//   v1 -> v2 (P2): dma takes an event_id and returns void (was: returns a
+//                  handle); togsim_kernel shape_args is non-const to match the
+//                  emitc/mlir-to-cpp output.
+//   v2 -> v3 (P3): add togsim_dispatch (work-item boundary + core binding) and
+//                  togsim_wait_all (join / barrier).
+//   v3 -> v4 (P3): togsim_dma takes (arg_id, element offset) instead of a
+//                  precomputed base_addr; the producer lowers the address
+//                  arithmetic and the runtime adds the tensor base.
+//   v4 -> v5 (P3): event handles. togsim_dma RETURNS a fresh handle (drops the
+//                  event_id arg); the producer parks it in a heap event buffer
+//                  (togsim_event_alloc/free) and togsim_wait takes the handle.
+//   v5 -> v6 (P3): replace togsim_dispatch with togsim_core_alloc (returns a
+//                  core id; no free) -- the runtime owns the core pool, num_cores
+//                  is never baked into the producer.
+//   v6 -> v7 (P3): togsim_dma takes a tag_slot (SRAM tile slot) for the runtime's
+//                  double-buffer / SRAM-capacity model.
+//   v7 -> v8 (P3): togsim_compute takes a compute_type (vector/matmul/preload) so
+//                  the Core routes it to the right compute unit.
+//   v8 -> v9 (P3 sec10): togsim_dma/compute take read_bufs/write_bufs (SRAM buffer
+//                  ids); the loader builds an explicit dependency DAG by
+//                  last-writer per buffer (replaces in-order/tag dependencies).
+//   v9 -> v10 (P3 sec10.7): add togsim_compute_barrier (the explicit compute fence
+//                  before a store; loader -> COMPUTE_BAR instruction).
+//   v10 -> v11 (P3 sec10): replace the static event-id pairing with the RUNTIME
+//                  tag slot. togsim_dma takes a tag_id (its tag memref identity)
+//                  and returns void; the original dma_wait becomes an explicit
+//                  togsim_memory_barrier(tag_id, tag_slot, write_bufs) that pairs
+//                  with its async dma by the runtime (tag_id, tag_slot) -- one
+//                  static dma op runs once per loop iteration with a different
+//                  %tag[%idx], so only a runtime key can pair them. Drops
+//                  togsim_wait/signal/wait_all/event_alloc/event_free + the
+//                  togsim_event handle (no compile-time pairing token).
+#define TOGSIM_ABI_VERSION 11
+int32_t togsim_abi_version(void);
+
+// Opaque per-invocation context owned by TOGSim. Holds the record sink and the
+// tile_id->cycle lookup. Never dereferenced by the producer.
+typedef struct EmitCtx EmitCtx;
+
+// Direction for togsim_dma.
+typedef enum {
+  TOGSIM_DMA_LOAD  = 0,  // DRAM -> SRAM (MOVIN)
+  TOGSIM_DMA_STORE = 1,  // SRAM -> DRAM (MOVOUT)
+} togsim_dma_dir;
+
+// Emit a DMA.
+//   dir       : load/store
+//   arg_id    : which tensor (kernel func arg) this tile lives in
+//   offset    : ELEMENT offset of this tile within that tensor, computed by the
+//               producer from the loop indices (the affine address arithmetic is
+//               lowered into the producer -- P3 approach A). The runtime forms
+//               the DRAM address as base[arg_id] + offset*elem_bytes (only the
+//               runtime knows the tensors' allocation base addresses).
+//   ndim      : rank of the tile
+//   dims      : ndim tile extents
+//   strides   : ndim tile strides (may be null => contiguous)
+//   elem_bits : element width in bits
+//   is_async  : non-zero => issue-complete is the finish; the consumer must be
+//               gated by an explicit togsim_memory_barrier (data arrives later).
+//               Zero => blocking: the dma finishes at data-arrival.
+//   tag_id    : identity of this dma's tag memref. With tag_slot it forms the
+//               RUNTIME pairing key (tag_id, tag_slot) the matching
+//               togsim_memory_barrier waits on -- not a compile-time id, since
+//               one static dma op runs once per loop iteration.
+//   tag_slot  : the SRAM tile slot this tile occupies (the producer's lowered
+//               tag index, evaluated at runtime). Also the double-buffer /
+//               SRAM-capacity slot. Single-buffer kernels pass 0.
+//   read_bufs/n_read, write_bufs/n_write : SRAM buffer ids this op reads/writes
+//   (sec 10 dataflow). The loader builds the dependency DAG by last-writer per
+//   buffer.
+void togsim_dma(EmitCtx* ctx, int32_t dir, int32_t arg_id,
+                uint64_t offset, int32_t ndim, const int64_t* dims,
+                const int64_t* strides, int32_t elem_bits,
+                int32_t is_async, int32_t tag_id, uint64_t tag_slot,
+                const int64_t* read_bufs, int32_t n_read,
+                const int64_t* write_bufs, int32_t n_write);
+
+// Emit a fixed-size tile compute. Cost is looked up from the precomputed
+// tile_id->cycle table (annotation pass / sample-mode); `dims` are passed for
+// logging and future remainder-tile handling, not to compute cost here.
+//   compute_type : 0 vector / 1 matmul / 2 preload (maps to the Core unit enum;
+//                  routes the op to the VPU vs the systolic array).
+void togsim_compute(EmitCtx* ctx, uint64_t tile_id, int32_t compute_type,
+                    int32_t ndim, const int64_t* dims,
+                    const int64_t* read_bufs, int32_t n_read,
+                    const int64_t* write_bufs, int32_t n_write);
+
+// Explicit async-DMA sync -- the original memref.dma_wait. Pairs with its async
+// togsim_dma by the RUNTIME tag slot (tag_id, tag_slot) and gates consumers on
+// data-arrival (resp-complete), since an async dma's own finish is only
+// issue-complete. `write_bufs` is the SRAM buffer(s) that dma loaded; the loader
+// makes the barrier the last writer of them so consumers depend on it. Sync DMAs
+// need no barrier (they block to data-arrival themselves).
+void togsim_memory_barrier(EmitCtx* ctx, int32_t tag_id, uint64_t tag_slot,
+                           const int64_t* write_bufs, int32_t n_write);
+
+// Core allocation (sec 9.3). The producer calls togsim_core_alloc at the start
+// of each parallel work-item (output tile); the ops that follow are bound to the
+// returned core until the next togsim_core_alloc. No free: a core is just an
+// assignment, not a held resource. The producer is core-count transparent: it
+// NEVER names num_cores or a physical core -- the runtime owns the pool and the
+// policy (round-robin etc.). A work-item's whole reduction sits after one alloc,
+// so it stays on the same core; different work-items get different cores ->
+// multi-core.
+int32_t togsim_core_alloc(EmitCtx* ctx);
+
+// Compute fence: drain in-flight async compute (the systolic-array matmuls)
+// before the following op (a store) consumes their result. Explicit barrier in
+// the trace; the loader turns it into a COMPUTE_BAR instruction (sec 10.7).
+void togsim_compute_barrier(EmitCtx* ctx);
+
+// Entry point the loader resolves in the producer `.so`. `shape_args` carries
+// the runtime values for the kernel's symbolic dimensions (in a kernel-specific
+// order recorded alongside the cached `.so`); `n_shape_args` is their count.
+void togsim_kernel(EmitCtx* ctx, int64_t* shape_args, int32_t n_shape_args);
+
+#ifdef __cplusplus
+}  // extern "C"
+#endif
diff --git a/TOGSim/include/togsim_trace_bridge.h b/TOGSim/include/togsim_trace_bridge.h
new file mode 100644
index 00000000..f0213ef5
--- /dev/null
+++ b/TOGSim/include/togsim_trace_bridge.h
@@ -0,0 +1,18 @@
+#pragma once
+// togsim_trace_bridge.h
+// -----------------------------------------------------------------------------
+// Bridge from the recorded trace (togsim_loader.h RunResult) to a TileGraph the
+// existing Simulator/Core can run, for production cycle-equivalence (P3 task 5;
+// see togsim_cpp_trace.md sec 9.9). First cut: one Tile per work-item (the span
+// between two togsim_core_alloc markers), bound to that work-item's core; the
+// DMA/compute/barrier records become MOVIN/MOVOUT/COMP/MEMORY_BAR/COMPUTE_BAR
+// Instructions with RAW dependency edges (last writer per SRAM buffer, sec 10).
+// -----------------------------------------------------------------------------
+#include <memory>
+
+#include "TileGraph.h"
+#include "togsim_loader.h"
+
+// Build a TileGraph from a recorded trace. `name` labels the graph.
+std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
+                                              const std::string& name);
diff --git a/TOGSim/src/CMakeLists.txt b/TOGSim/src/CMakeLists.txt
index 65cd4dd4..d782d4d1 100644
--- a/TOGSim/src/CMakeLists.txt
+++ b/TOGSim/src/CMakeLists.txt
@@ -12,3 +12,8 @@ file(GLOB_RECURSE SRC_FILES
 
 # build
 add_executable(${LIB_NAME} ${SRC_FILES})
+
+# Export the executable's dynamic symbols (-rdynamic) so a dlopen'd trace
+# producer .so resolves the togsim_* runtime callbacks back into this binary
+# (P3 trace pipeline).
+set_target_properties(${LIB_NAME} PROPERTIES ENABLE_EXPORTS ON)
diff --git a/TOGSim/src/Core.cc b/TOGSim/src/Core.cc
index 9dad8597..6ec63b22 100644
--- a/TOGSim/src/Core.cc
+++ b/TOGSim/src/Core.cc
@@ -154,7 +154,7 @@ void Core::dma_cycle() {
       } else if(!finished_inst->is_dma_read()) {
         core_trace_log::log_error_dma_instruction_invalid(_core_cycle, _id);
         exit(EXIT_FAILURE);
-      } else if (finished_inst->get_opcode() == Opcode::BAR) {
+      } else if (finished_inst->get_opcode() == Opcode::MEMORY_BAR) {
         core_trace_log::trace_instruction_line(_core_cycle,
                                                _id,
                                                TraceLogTag::pad15(TraceLogTag::kInstructionFinished),
@@ -265,6 +265,10 @@ void Core::cycle() {
           break;
         case Opcode::COMP:
           {
+            // sec 10.7: this op is now entering the pipeline -> release its
+            // occupancy (pipeline) dependents so a preload/matmul successor
+            // overlaps it instead of waiting its full latency.
+            inst->release_pipeline_children();
             auto& target_pipeline = get_compute_pipeline(inst->get_compute_type());
             if (target_pipeline.empty()) {
               inst->finish_cycle = _core_cycle + inst->get_compute_cycle();
@@ -297,7 +301,7 @@ void Core::cycle() {
             }
           }
           break;
-        case Opcode::BAR:
+        case Opcode::MEMORY_BAR:
           {
             auto& key = inst->get_tag_id();
             uint32_t finished = _dma.get_tag_finish(inst->subgraph_id, key);
@@ -324,6 +328,27 @@ void Core::cycle() {
             issued = true;
           }
           break;
+        case Opcode::COMPUTE_BAR:
+          {
+            // Compute fence (sec 10.7): finish only once ALL compute pipelines
+            // have drained (every systolic array + the VPU empty). Its
+            // ready_counter already gated it until the async computes ISSUED
+            // (pipeline-child release), so by now they are in the pipeline; we
+            // wait for them to drain. If not yet drained, do not issue -- it
+            // stays in the ready queue and is re-checked next cycle.
+            bool drained = _vu_compute_pipeline.empty();
+            for (int s = 0; s < _num_systolic_array_per_core; s++)
+              drained = drained && _sa_compute_pipeline.at(s).empty();
+            if (drained) {
+              core_trace_log::trace_instruction_line(_core_cycle, _id,
+                  TraceLogTag::pad15(TraceLogTag::kInstructionFinished),
+                  inst->get_global_inst_id(),
+                  core_trace_log::format_instruction_detail_line(*inst));
+              finish_instruction(inst);
+              issued = true;
+            }
+          }
+          break;
         default:
           core_trace_log::log_error_undefined_opcode();
           exit(EXIT_FAILURE);
diff --git a/TOGSim/src/CoreTraceLog.cc b/TOGSim/src/CoreTraceLog.cc
index ebc31de0..9761f9ec 100644
--- a/TOGSim/src/CoreTraceLog.cc
+++ b/TOGSim/src/CoreTraceLog.cc
@@ -70,7 +70,7 @@ std::string format_instruction_detail_line(Instruction& inst) {
   if (op == Opcode::MOVIN || op == Opcode::MOVOUT) {
     return fmt::format("{} (addr_name={})", opname, inst.get_addr_name());
   }
-  if (op == Opcode::BAR) {
+  if (op == Opcode::MEMORY_BAR) {
     return fmt::format("{} (addr_name={} tag_id=[{}] tag_idx=[{}] tag_stride=[{}])",
                        opname,
                        inst.get_addr_name(),
diff --git a/TOGSim/src/Instruction.cc b/TOGSim/src/Instruction.cc
index f236d160..54e50511 100644
--- a/TOGSim/src/Instruction.cc
+++ b/TOGSim/src/Instruction.cc
@@ -23,7 +23,8 @@ std::string opcode_to_string(Opcode opcode) {
         case Opcode::MOVIN:        return "MOVIN";
         case Opcode::MOVOUT:       return "MOVOUT";
         case Opcode::COMP:         return "COMP";
-        case Opcode::BAR:          return "BAR";
+        case Opcode::MEMORY_BAR:   return "MEMORY_BAR";
+        case Opcode::COMPUTE_BAR:  return "COMPUTE_BAR";
         default:                   return "Unknown";
     }
 }
@@ -60,6 +61,16 @@ void Instruction::add_child(std::shared_ptr<Instruction> child) {
   child_inst.insert(child);
 }
 
+void Instruction::add_pipeline_child(std::shared_ptr<Instruction> child) {
+  child->inc_ready_counter();
+  _pipeline_children.insert(child);
+}
+
+void Instruction::release_pipeline_children() {
+  for (auto& c : _pipeline_children) c->dec_ready_counter();
+  _pipeline_children.clear();
+}
+
 void Instruction::inc_waiting_request() {
   _nr_waiting_request++;
 }
diff --git a/TOGSim/src/TileGraphParser.cc b/TOGSim/src/TileGraphParser.cc
index 5060d336..572062e0 100644
--- a/TOGSim/src/TileGraphParser.cc
+++ b/TOGSim/src/TileGraphParser.cc
@@ -543,7 +543,7 @@ std::vector<std::shared_ptr<Tile>> TileLoopNode::get_tiles_from_iter(TileGraphPa
              fmt::join(new_tag_stride_list, ", "));
 
       std::shared_ptr<Instruction> inst = std::make_shared<Instruction>(
-        Opcode::BAR, 0,
+        Opcode::MEMORY_BAR, 0,
         0, base_addr,
         std::vector<size_t>(), std::vector<int>(), 0,
         tag_list, new_tag_stride_list, accum_tag_list
diff --git a/TOGSim/src/main.cc b/TOGSim/src/main.cc
index 010826ef..8726cfdf 100644
--- a/TOGSim/src/main.cc
+++ b/TOGSim/src/main.cc
@@ -8,6 +8,8 @@
 #include "Simulator.h"
 #include "TileGraphParser.h"
 #include "helper/CommandLineParser.h"
+#include "togsim_loader.h"        // P3 trace pipeline: run a compiled producer .so
+#include "togsim_trace_bridge.h"  // ... and bridge its trace to a TileGraph
 
 namespace fs = std::filesystem;
 namespace po = boost::program_options;
@@ -104,6 +106,11 @@ int main(int argc, char** argv) {
       "models_list", "Path for the trace file (.trace)");
   cmd_parser.add_command_line_option<std::string>(
       "log_level", "Set for log level [trace, debug, info], default = info");
+  cmd_parser.add_command_line_option<std::string>(
+      "trace_so", "Path to a compiled trace producer .so (P3 trace pipeline)");
+  cmd_parser.add_command_line_option<std::string>(
+      "cycle_table", "Path to a 'cycle<TAB>overlapping' per-tile_id sidecar (TSV) "
+                     "for --trace_so; falls back to a flat stub if omitted");
   try {
     cmd_parser.parse(argc, argv);
   } catch (const CommandLineParser::ParsingError& e) {
@@ -147,6 +154,46 @@ int main(int argc, char** argv) {
     exit(1);
   }
 
+  // P3 trace pipeline: if a compiled producer .so is given, run it, bridge the
+  // recorded trace to a TileGraph, and run the existing Simulator on it.
+  std::string trace_so_path;
+  cmd_parser.set_if_defined("trace_so", &trace_so_path);
+  if (!trace_so_path.empty()) {
+    const auto& cfg = simulator->get_hardware_config_yaml();
+    int num_cores = cfg["num_cores"] ? cfg["num_cores"].as<int>() : 1;
+    // First cut: stub tensor bases (real per-tensor addresses come later).
+    std::vector<uint64_t> bases(16);
+    for (size_t i = 0; i < bases.size(); ++i) bases[i] = 0x100000ull * (i + 1);
+    // Cycle table: load the per-tile_id TSV sidecar if given, else a flat stub.
+    std::vector<int64_t> cyc, ovl;
+    std::string cycle_table_path;
+    cmd_parser.set_if_defined("cycle_table", &cycle_table_path);
+    if (!cycle_table_path.empty()) {
+      std::ifstream ct(cycle_table_path);
+      if (!ct.is_open()) { spdlog::error("[TOGSim] cannot open cycle_table {}", cycle_table_path); exit(1); }
+      int64_t c, o;
+      while (ct >> c >> o) { cyc.push_back(c); ovl.push_back(o); }
+      spdlog::info("[TOGSim-trace] loaded cycle table: {} tiles from {}", cyc.size(), cycle_table_path);
+    } else {
+      cyc.assign(256, 128);
+      ovl.assign(256, 0);
+    }
+    auto run = togsim::run_producer(trace_so_path.c_str(), nullptr, 0,
+                                    bases.data(), (int)bases.size(),
+                                    cyc.data(), ovl.data(), (int)cyc.size(),
+                                    num_cores);
+    if (!run.ok) { spdlog::error("[TOGSim] trace producer run failed"); exit(1); }
+    spdlog::info("[TOGSim-trace] recorded {} instructions", run.trace.size());
+    auto tg = trace_to_tilegraph(run, "trace_kernel");
+    tg->set_arrival_time(simulator->get_core_cycle());
+    tg->set_kernel_id(0);
+    simulator->enqueue_graph(0, std::move(tg));
+    simulator->run_simulator();
+    spdlog::info("[TOGSim-trace] Total cycles: {}", simulator->get_core_cycle());
+    simulator->print_core_stat();
+    return 0;
+  }
+
   // Get trace file path
   cmd_parser.set_if_defined("models_list", &trace_file_path);
 
diff --git a/TOGSim/src/togsim_runtime.cc b/TOGSim/src/togsim_runtime.cc
new file mode 100644
index 00000000..ca18fd29
--- /dev/null
+++ b/TOGSim/src/togsim_runtime.cc
@@ -0,0 +1,194 @@
+// togsim_runtime.cc
+// -----------------------------------------------------------------------------
+// C6 runtime + loader for the compiled trace producer (P3 task 5). Implements
+// the producer ABI (togsim_runtime.h) and the TOGSim-side loader
+// (togsim_loader.h). See docs/design/togsim_cpp_trace.md sec 5.3 / 9.6.1 / 9.7.
+//
+// The producer `.so` calls the extern "C" togsim_* functions below; each one
+// records a TraceRec on the EmitCtx. EmitCtx is the opaque type the producer
+// only ever passes back to us. This is the "materializing sink": it resolves
+// addresses and per-tile cycles into a recorded instruction stream, which
+// trace_to_tilegraph (togsim_trace_bridge.h) bridges into the timing core.
+// -----------------------------------------------------------------------------
+
+#include "togsim_loader.h"
+
+#include <algorithm>
+#include <cstdio>
+#include <cstdlib>
+#include <dlfcn.h>
+#include <map>
+#include <unordered_map>
+#include <utility>
+#include <vector>
+
+// Full definition of the opaque handle from togsim_runtime.h. The producer holds
+// only EmitCtx* and never dereferences it.
+struct EmitCtx {
+  // inputs supplied by the loader
+  const uint64_t* tensor_base = nullptr;
+  int32_t         n_tensors = 0;
+  const int64_t*  cyc = nullptr;   // tile_id -> cycle
+  const int64_t*  ovl = nullptr;   // tile_id -> overlapping_cycle
+  int32_t         n_tiles = 0;
+  int32_t         num_cores = 1;
+  // mutable run state
+  int32_t  rr = 0;            // round-robin core cursor
+  int32_t  cur_core = -1;     // current work-item's core
+  std::vector<togsim::TraceRec> trace;
+};
+
+namespace {
+inline togsim::TraceRec blank(togsim::TraceRec::Kind k, int32_t core) {
+  togsim::TraceRec r{};
+  r.kind = k;
+  r.core = core;
+  return r;
+}
+}  // namespace
+
+extern "C" {
+
+int32_t togsim_abi_version(void) { return TOGSIM_ABI_VERSION; }
+
+int32_t togsim_core_alloc(EmitCtx* ctx) {
+  // Round-robin a core from the pool; the producer never sees num_cores. Binds
+  // it as the current core for the ops that follow (the work-item's reduction).
+  ctx->cur_core = ctx->num_cores > 0 ? (ctx->rr++ % ctx->num_cores) : 0;
+  ctx->trace.push_back(blank(togsim::TraceRec::DISPATCH, ctx->cur_core));
+  return ctx->cur_core;
+}
+
+void togsim_dma(EmitCtx* ctx, int32_t dir, int32_t arg_id,
+                uint64_t offset, int32_t ndim, const int64_t* dims,
+                const int64_t* strides, int32_t elem_bits,
+                int32_t is_async, int32_t tag_id, uint64_t tag_slot,
+                const int64_t* read_bufs, int32_t n_read,
+                const int64_t* write_bufs, int32_t n_write) {
+  uint64_t base = (arg_id >= 0 && arg_id < ctx->n_tensors)
+                      ? ctx->tensor_base[arg_id] : 0;
+  uint64_t addr = base + offset * (uint64_t)(elem_bits / 8);
+  togsim::TraceRec r = blank(togsim::TraceRec::DMA, ctx->cur_core);
+  r.dir = dir; r.arg_id = arg_id; r.elem_bits = elem_bits;
+  r.is_async = is_async; r.addr = addr; r.tag_id = tag_id; r.tag_slot = tag_slot;
+  for (int32_t i = 0; i < ndim; ++i) {
+    if (dims) r.dims.push_back(dims[i]);
+    if (strides) r.strides.push_back(strides[i]);
+  }
+  for (int32_t i = 0; i < n_read; ++i) r.read_bufs.push_back(read_bufs[i]);
+  for (int32_t i = 0; i < n_write; ++i) r.write_bufs.push_back(write_bufs[i]);
+  ctx->trace.push_back(r);
+}
+
+void togsim_compute(EmitCtx* ctx, uint64_t tile_id, int32_t compute_type,
+                    int32_t ndim, const int64_t* dims,
+                    const int64_t* read_bufs, int32_t n_read,
+                    const int64_t* write_bufs, int32_t n_write) {
+  (void)ndim; (void)dims;
+  togsim::TraceRec r = blank(togsim::TraceRec::COMPUTE, ctx->cur_core);
+  r.tile_id = tile_id;
+  r.compute_type = compute_type;
+  for (int32_t i = 0; i < n_read; ++i) r.read_bufs.push_back(read_bufs[i]);
+  for (int32_t i = 0; i < n_write; ++i) r.write_bufs.push_back(write_bufs[i]);
+  if (ctx->cyc && ctx->n_tiles > 0 && tile_id < (uint64_t)ctx->n_tiles) r.cycle = ctx->cyc[tile_id];
+  if (ctx->ovl && ctx->n_tiles > 0 && tile_id < (uint64_t)ctx->n_tiles) r.overlapping = ctx->ovl[tile_id];
+  ctx->trace.push_back(r);
+}
+
+void togsim_memory_barrier(EmitCtx* ctx, int32_t tag_id, uint64_t tag_slot,
+                           const int64_t* write_bufs, int32_t n_write) {
+  togsim::TraceRec r = blank(togsim::TraceRec::MEMORY_BAR, ctx->cur_core);
+  r.tag_id = tag_id; r.tag_slot = tag_slot;
+  for (int32_t i = 0; i < n_write; ++i) r.write_bufs.push_back(write_bufs[i]);
+  ctx->trace.push_back(r);
+}
+
+void togsim_compute_barrier(EmitCtx* ctx) {
+  ctx->trace.push_back(blank(togsim::TraceRec::COMPUTE_BAR, ctx->cur_core));
+}
+
+}  // extern "C"
+
+namespace togsim {
+
+RunResult run_producer(const char* so_path,
+                       const int64_t* shape_args, int32_t n_shape,
+                       const uint64_t* tensor_base, int32_t n_tensors,
+                       const int64_t* cyc, const int64_t* ovl, int32_t n_tiles,
+                       int32_t num_cores) {
+  RunResult res;
+  void* lib = dlopen(so_path, RTLD_NOW | RTLD_GLOBAL);
+  if (!lib) { fprintf(stderr, "togsim: dlopen failed: %s\n", dlerror()); return res; }
+  auto emit = (void (*)(EmitCtx*, int64_t*, int32_t))dlsym(lib, "togsim_kernel");
+  if (!emit) { fprintf(stderr, "togsim: dlsym togsim_kernel failed: %s\n", dlerror()); return res; }
+
+  EmitCtx ctx;
+  ctx.tensor_base = tensor_base; ctx.n_tensors = n_tensors;
+  ctx.cyc = cyc; ctx.ovl = ovl; ctx.n_tiles = n_tiles;
+  ctx.num_cores = num_cores > 0 ? num_cores : 1;
+  emit(&ctx, (int64_t*)shape_args, n_shape);
+
+  res.ok = true;
+  res.trace = std::move(ctx.trace);
+  return res;
+}
+
+SimResult simulate(const RunResult& run, const TimingParams& params) {
+  SimResult out;
+  std::unordered_map<int, uint64_t> dma_free;     // DMA-engine free time, per core
+  std::unordered_map<int, uint64_t> comp_free;    // compute free time, per core
+  std::unordered_map<int, uint64_t> prev_comp;    // prev compute finish (overlap), per core
+  std::map<std::pair<int32_t, uint64_t>, uint64_t> tag_finish;  // (tag_id,tag_slot) -> finish
+  std::vector<uint64_t> pending;                    // barrier-resolved deps since last compute
+
+  for (const auto& t : run.trace) {
+    const int c = t.core;
+    switch (t.kind) {
+      case TraceRec::DMA: {
+        // DMAs serialize on the core's DMA engine, which has its own timeline so
+        // they overlap compute. finish = start + latency, recorded under the runtime tag.
+        uint64_t start = dma_free[c];
+        uint64_t fin = start + params.dma_latency;
+        dma_free[c] = fin;
+        tag_finish[{t.tag_id, t.tag_slot}] = fin;
+        out.n_dma++;
+        break;
+      }
+      case TraceRec::MEMORY_BAR: {
+        // the explicit async-DMA sync: gate the next compute on the paired dma's
+        // data-arrival, found by the runtime tag (tag_id, tag_slot).
+        auto it = tag_finish.find({t.tag_id, t.tag_slot});
+        if (it != tag_finish.end()) pending.push_back(it->second);
+        break;
+      }
+      case TraceRec::COMPUTE: {
+        uint64_t deps = 0;
+        for (uint64_t f : pending) deps = std::max(deps, f);
+        pending.clear();
+        uint64_t start = std::max(comp_free[c], deps);
+        uint64_t fin;
+        auto pit = prev_comp.find(c);
+        if (pit != prev_comp.end()) {
+          uint64_t prev = pit->second;
+          uint64_t tail = prev > start ? prev - start : 0;     // prev still running
+          uint64_t overlapped = std::min<uint64_t>(tail, (uint64_t)t.overlapping);
+          fin = std::max(start, prev) + (uint64_t)t.cycle - overlapped;
+        } else {
+          fin = start + (uint64_t)t.cycle;
+        }
+        comp_free[c] = fin;
+        prev_comp[c] = fin;
+        out.n_compute++;
+        break;
+      }
+      case TraceRec::DISPATCH:
+      case TraceRec::COMPUTE_BAR:
+        break;  // work-item boundary / compute fence: no cost in this reference timer
+    }
+  }
+  for (auto& kv : dma_free) out.total_cycle = std::max(out.total_cycle, kv.second);
+  for (auto& kv : comp_free) out.total_cycle = std::max(out.total_cycle, kv.second);
+  return out;
+}
+
+}  // namespace togsim
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
new file mode 100644
index 00000000..bbf6252c
--- /dev/null
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -0,0 +1,192 @@
+// togsim_trace_bridge.cc -- see togsim_trace_bridge.h
+#include "togsim_trace_bridge.h"
+
+#include <map>
+#include <utility>
+#include <vector>
+
+#include "Tile.h"
+#include "Instruction.h"
+
+namespace {
+
+std::shared_ptr<Instruction> make_dma(const togsim::TraceRec& t) {
+  Opcode op = (t.dir == 1) ? Opcode::MOVOUT : Opcode::MOVIN;
+  std::vector<size_t> tile_size(t.dims.begin(), t.dims.end());
+  std::vector<int> tile_stride(t.strides.begin(), t.strides.end());
+  // tag_idx_list / tag_stride_list must match in size; one slot key per dma.
+  std::vector<int64_t> tag_idx{(int64_t)t.tag_slot};
+  std::vector<int64_t> tag_stride{1};
+  auto inst = std::make_shared<Instruction>(
+      op, /*compute_cycle=*/0, /*num_parents=*/0, /*dram_addr=*/t.addr,
+      tile_size, tile_stride, (size_t)t.elem_bits, tag_idx, tag_stride,
+      /*accum_tag_idx_list=*/std::vector<int64_t>{});
+  inst->set_is_async(t.is_async != 0);
+  // The tag key is [addr_id, ..., sum(tag_idx*tag_stride)]. addr_id is the tag
+  // memref identity (tag_id): an async dma and its memory_barrier share a tag
+  // memref, so the same (tag_id, tag_slot) keys both and the Core's tag table
+  // pairs them. (Distinct tag memrefs -> distinct tag_id, so no false collision.)
+  inst->set_addr_name("tag" + std::to_string(t.tag_id), t.tag_id);
+  inst->prepare_tag_key();
+  return inst;
+}
+
+// A MEMORY_BAR carrying the SAME tag key as the async dma it gates -- the Core's
+// tag table signals it at the dma's DATA-ready (resp-complete), unlike a raw
+// add_child which the async dma releases at issue-complete. Tag inputs match
+// make_dma (tag_idx={tag_slot}, stride={1}, addr_id=tag_id) so the two keys match.
+std::shared_ptr<Instruction> make_mem_bar(const togsim::TraceRec& t) {
+  auto bar = std::make_shared<Instruction>(
+      Opcode::MEMORY_BAR, 0, 0, 0,
+      std::vector<size_t>{}, std::vector<int>{}, 0,
+      std::vector<int64_t>{(int64_t)t.tag_slot}, std::vector<int64_t>{1},
+      std::vector<int64_t>{});
+  bar->set_addr_name("tag" + std::to_string(t.tag_id), t.tag_id);
+  bar->prepare_tag_key();
+  return bar;
+}
+
+std::shared_ptr<Instruction> make_compute(const togsim::TraceRec& t) {
+  auto inst = std::make_shared<Instruction>(
+      Opcode::COMP, /*compute_cycle=*/(cycle_type)t.cycle, /*num_parents=*/0,
+      /*dram_addr=*/0, std::vector<size_t>{}, std::vector<int>{}, /*elem_bits=*/0,
+      std::vector<int64_t>{}, std::vector<int64_t>{}, std::vector<int64_t>{});
+  inst->set_overlapping_cycle((cycle_type)t.overlapping);
+  inst->set_compute_type(t.compute_type);  // route to VPU vs systolic array
+  return inst;
+}
+
+}  // namespace
+
+std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
+                                              const std::string& name) {
+  using togsim::TraceRec;
+  auto tg = std::make_unique<TileGraph>(name, name);
+  // Empty cache plan (no L2/CMEM persistence) -- append_subgraph propagates it
+  // to each subgraph, and DMA::is_cacheable dereferences it, so it must be a
+  // valid (if empty) IntervalTree rather than null.
+  tg->init_cache_plan({});
+
+  std::shared_ptr<TileSubGraph> sg;
+  std::shared_ptr<Tile> tile;
+  // Explicit dependency DAG (sec 10): a reader depends on the last writer of each
+  // SRAM buffer it reads. Scoped per work-item (reset at each dispatch) -- buffers
+  // are work-item-local, so distinct work-items are independent (-> parallel).
+  std::map<int64_t, std::shared_ptr<Instruction>> last_writer;  // buffer id -> producer
+  // An async dma is paired with its explicit memory_barrier by the RUNTIME tag
+  // (tag_id, tag_slot): the dma records itself here so the later barrier can find
+  // it and depend on it. Scoped per work-item (the tag table is per subgraph).
+  std::map<std::pair<int32_t, uint64_t>, std::shared_ptr<Instruction>> tag_to_dma;
+  // Async compute ops (matmul/preload) are issued and pipelined on the systolic
+  // array; they do not block each other. A store needs the drained result, so it
+  // FLUSHes -- waits for all outstanding async compute before running (like a
+  // fence after async ops). No per-op completion events; one barrier before the store.
+  std::vector<std::shared_ptr<Instruction>> outstanding_async;
+  std::shared_ptr<Instruction> pending_bar;   // last COMPUTE_BAR fence, awaited by the next store
+  auto is_async_compute = [](int ct) { return ct == 1 || ct == 2; };  // matmul / preload
+
+  auto flush = [&]() {
+    if (sg && tile) {
+      sg->add_tile(tile);
+      tile->set_owner(sg);
+      tg->append_subgraph(sg);
+    }
+    sg.reset();
+    tile.reset();
+    last_writer.clear();
+    tag_to_dma.clear();
+    outstanding_async.clear();
+    pending_bar.reset();
+  };
+
+  // Build edges from the recorded read/write buffer sets: reader <- last writer of
+  // each buffer it reads (the virtual SA_WEIGHTS buffer carries preload->matmul;
+  // the Y_spad accumulator carries the reduction chain; the spads carry load->
+  // compute). No in-order chain, no tag matching, no op heuristics.
+  // sec 10.7 occupancy/latency split. An edge from a systolic-array producer
+  // (preload=2 or matmul=1) to a matmul (1) is an OCCUPANCY dependency: the
+  // successor overlaps the producer on the SA pipeline, so use add_pipeline_child
+  // (released when the producer ISSUES). Every other edge is a LATENCY
+  // dependency (the consumer needs the producer's result): load->compute,
+  // init->matmul, matmul->store -> add_child (released at the producer's finish).
+  const int MATMUL_CT = 1, PRELOAD_CT = 2;
+  auto link = [&](std::shared_ptr<Instruction> inst,
+                  const std::vector<int64_t>& reads,
+                  const std::vector<int64_t>& writes) {
+    for (int64_t b : reads) {
+      auto it = last_writer.find(b);
+      if (it == last_writer.end()) continue;
+      int pct = it->second->get_compute_type();
+      if (inst->get_compute_type() == MATMUL_CT && (pct == MATMUL_CT || pct == PRELOAD_CT))
+        it->second->add_pipeline_child(inst);  // SA pipeline -> occupancy (overlap)
+      else
+        it->second->add_child(inst);           // data/result -> latency (full wait)
+    }
+    for (int64_t b : writes) last_writer[b] = inst;
+    tile->append_instuction(inst);
+  };
+
+  for (const auto& t : run.trace) {
+    if (t.kind == TraceRec::DISPATCH) {
+      // new work-item -> new subgraph (bound to its core) + tile.
+      flush();
+      sg = std::make_shared<TileSubGraph>();
+      sg->set_core_id(t.core);
+      tile = std::make_shared<Tile>(Tile::Status::INITIALIZED);
+      continue;
+    }
+    if (!tile) continue;  // defensive: ops before the first core_alloc
+
+    if (t.kind == TraceRec::DMA) {
+      auto inst = make_dma(t);
+      size_t numel = 1;                                  // SRAM footprint (ready-tile ordering)
+      for (auto d : t.dims) numel *= (size_t)d;
+      tile->inc_required_sram_size(numel * (t.elem_bits / 8));
+      if (t.dir == 1) {                                  // STORE
+        if (pending_bar) {
+          // After a compute fence: wait on it (it drains the async matmuls). This
+          // covers the accumulator read, so no per-buffer read edge is needed.
+          pending_bar->add_child(inst);
+          pending_bar.reset();
+          for (int64_t b : t.write_bufs) last_writer[b] = inst;
+          tile->append_instuction(inst);
+        } else {
+          link(inst, t.read_bufs, t.write_bufs);
+        }
+      } else {                                           // LOAD
+        tile->append_instuction(inst);
+        // Record the dma under its runtime tag so the explicit memory_barrier
+        // (the original dma_wait) can pair with it. last_writer = the dma for now:
+        // for an async load the barrier later overwrites it (consumers gate on
+        // data arrival); a sync load itself blocks until the data arrives.
+        if (t.is_async) tag_to_dma[{t.tag_id, t.tag_slot}] = inst;
+        for (int64_t b : t.write_bufs) last_writer[b] = inst;
+      }
+    } else if (t.kind == TraceRec::MEMORY_BAR) {
+      // the explicit async-DMA sync (the original dma_wait). Pair with its dma by
+      // the runtime tag; the dma releases the bar at issue-complete (add_child),
+      // then the bar parks on the tag table until the data arrives (resp-complete,
+      // set_tag_finish). Consumers of the loaded buffer then gate on the bar.
+      auto bar = make_mem_bar(t);
+      auto it = tag_to_dma.find({t.tag_id, t.tag_slot});
+      if (it != tag_to_dma.end()) it->second->add_child(bar);
+      tile->append_instuction(bar);
+      for (int64_t b : t.write_bufs) last_writer[b] = bar;
+    } else if (t.kind == TraceRec::COMPUTE) {
+      auto inst = make_compute(t);
+      link(inst, t.read_bufs, t.write_bufs);
+      if (is_async_compute(t.compute_type)) outstanding_async.push_back(inst);
+    } else if (t.kind == TraceRec::COMPUTE_BAR) {
+      // explicit compute fence: ready once all outstanding async compute have
+      // ISSUED (pipeline-child release); the Core then waits for the SA pipelines
+      // to drain before the fence finishes (and releases the store it gates).
+      auto bar = std::make_shared<Instruction>(Opcode::COMPUTE_BAR);
+      for (auto& a : outstanding_async) a->add_pipeline_child(bar);
+      outstanding_async.clear();
+      tile->append_instuction(bar);
+      pending_bar = bar;
+    }
+  }
+  flush();
+  return tg;
+}
diff --git a/docs/design/togsim_cpp_trace.md b/docs/design/togsim_cpp_trace.md
index b67662aa..710c1b23 100644
--- a/docs/design/togsim_cpp_trace.md
+++ b/docs/design/togsim_cpp_trace.md
@@ -167,9 +167,9 @@ given MLIR (affine/scf.for + memref.dma_start/dma_wait + vcix/vector compute)
         • run through existing sample-mode → tile_id → cycle table
 
 TOGSim (C6):
-  dlopen(trace.so) → resolve togsim_emit
+  dlopen(trace.so) → resolve togsim_kernel
   inject EmitCtx { tag table; record sink; cost = cycle_table[tile_id] }
-  togsim_emit(ctx, runtime_shape_args...)   // producer runs, emits stream
+  togsim_kernel(ctx, runtime_shape_args...)   // producer runs, emits stream
   → existing timing core consumes the recorded Instruction stream
 ```
 
@@ -197,7 +197,7 @@ TOGSim (C6):
 - **C6 — TOGSim runtime + loader.** `togsim_runtime.h/.cc`: `EmitCtx` and the
   `togsim_dma/compute/memory_barrier/compute_barrier/core_alloc`
   implementations (compute looks up the cycle table). Loader `dlopen`s the
-  `.so`, calls `togsim_emit` with runtime shape args, records the stream, feeds
+  `.so`, calls `togsim_kernel` with runtime shape args, records the stream, feeds
   the existing timing core. An async DMA and its `memory_barrier` are paired at
   runtime by `(tag_id, tag_slot)` through the existing Core tag table.
 
@@ -222,7 +222,7 @@ void togsim_compute_barrier(EmitCtx*);
 int32_t togsim_core_alloc(EmitCtx*);
 
 // entry point the loader resolves:
-void togsim_emit(EmitCtx*, int64_t* shape_args, int32_t n_shape_args);
+void togsim_kernel(EmitCtx*, int64_t* shape_args, int32_t n_shape_args);
 ```
 
 `togsim_dma` returns void (no handle). An async DMA carries `(tag_id, tag_slot)`;
@@ -342,7 +342,7 @@ The producer is two functions, split at the PARALLEL/ACCUMULATION boundary:
 // PARALLEL indices directly, names no core. Reduction (k) is program order ->
 // the dependency is implicit (the accumulator is core-local). An async load is
 // synced to its consumer by an explicit memory_barrier on the same tag slot.
-void togsim_emit_tile(EmitCtx* ctx, int64_t mi, int64_t ni, int64_t* shape) {
+void togsim_kernel_tile(EmitCtx* ctx, int64_t mi, int64_t ni, int64_t* shape) {
   togsim_core_alloc(ctx);                // first line: new work-item + pick core
   togsim_compute(ctx, /*tile_id=*/0, ...);            // acc init
   for (size_t ki = 0; ki < KT; ++ki) {                // REDUCTION = program order
@@ -355,17 +355,17 @@ void togsim_emit_tile(EmitCtx* ctx, int64_t mi, int64_t ni, int64_t* shape) {
 }
 
 // DISPATCH: enumerate the PARALLEL domain, one call per work-item.
-extern "C" void togsim_emit(EmitCtx* ctx, int64_t* shape, int32_t n) {
+extern "C" void togsim_kernel(EmitCtx* ctx, int64_t* shape, int32_t n) {
   size_t MT = shape[0]/256, NT = shape[1]/256;
   for (size_t mi = 0; mi < MT; ++mi)
     for (size_t ni = 0; ni < NT; ++ni)
-      togsim_emit_tile(ctx, mi, ni, shape);
+      togsim_kernel_tile(ctx, mi, ni, shape);
 }
 ```
 
 Reduced to two orthogonal concepts:
 
-- **Parallel** = each `togsim_emit_tile` call is an independent work-item (no
+- **Parallel** = each `togsim_kernel_tile` call is an independent work-item (no
   tags shared across calls). TOGSim is free to place it on any core.
 - **Reduction** = ordering *inside* one work-item: program order on its core
   (no explicit barrier). The `memory_barrier`/tag-slot mechanism is only the
@@ -393,24 +393,24 @@ only `togsim_*` callbacks are visible across the `dlopen` boundary.
 ### 9.4 Codegen (lower_to_emitc) and ABI deltas
 
 - `lower_to_emitc` splits the loop nest at the PARALLEL/ACCUMULATION boundary
-  into two `emitc.func`: the PARALLEL loops become `togsim_emit` (dispatcher,
+  into two `emitc.func`: the PARALLEL loops become `togsim_kernel` (dispatcher,
   passing the loop indices as args); the ACCUMULATION+INNER body becomes
-  `togsim_emit_tile`, with `togsim_core_alloc(ctx)` inserted at its entry.
+  `togsim_kernel_tile`, with `togsim_core_alloc(ctx)` inserted at its entry.
 - ABI additions in `togsim_runtime.h`: `int32_t togsim_core_alloc(EmitCtx*)`
   (runtime owns the core pool; no `num_cores` in the producer; no free).
-  `togsim_emit_tile` may stay internal (`static`) for now; export it only if a
+  `togsim_kernel_tile` may stay internal (`static`) for now; export it only if a
   future loader wants to own the parallel enumeration (which would also need a
   `num_tiles`-style count — not required now).
 - `tile_id -> cycle` table unchanged (num_cores-invariant).
 
 > Implementation status (P3): `lower_to_emitc` inserts the `togsim_core_alloc`
-> marker at the innermost PARALLEL-loop body inside the single `togsim_emit`
+> marker at the innermost PARALLEL-loop body inside the single `togsim_kernel`
 > function — the emitted *trace* is identical to the two-function form (one
 > core_alloc per work-item, then the work ops). Address arithmetic is wired
 > (approach A): each `togsim_dma` passes `(arg_id, element offset)` with the
 > offset computed from the loop IVs (lowered by `convert-arith-to-emitc`, cast-
 > free thanks to the size_t IV retype); the runtime adds the tensor base.
-> Outlining the work body into a separate `togsim_emit_tile` is now *meaningful*
+> Outlining the work body into a separate `togsim_kernel_tile` is now *meaningful*
 > (the body uses the parallel IVs in the offset) but still deferred — the
 > single-function trace is identical, so the split is cosmetic until needed.
 
@@ -572,7 +572,7 @@ single forward-compat requirement is that the callback sink is an interface.
    overlapping_cycle on the Instruction.
 5. PARTIAL. C6 runtime + loader: `TOGSim/src/togsim_runtime.cc` +
    `togsim_loader.h` implement the producer ABI and `run_producer` -- dlopen the
-   `.so`, run `togsim_emit` against an `EmitCtx`, and record a `TraceRec` stream
+   `.so`, run `togsim_kernel` against an `EmitCtx`, and record a `TraceRec` stream
    (the materializing sink): each dma resolves `base[arg_id] + offset*elem_bytes`
    and signals its tag at data arrival, each compute looks up the cycle table,
    core_alloc round-robins the core. Verified standalone on the 256^3 GEMM:
@@ -955,7 +955,7 @@ numbers; 2518-vs-2698 is the current real-table figure.
    -> a harmless extra edge.
 6. **P5 op coverage.** Only GEMM is exercised. Extend to conv / SDPA / vector / pool.
 7. **P4.** Symbolic/dynamic shape; streaming sink (coroutine, alloc-blocks).
-8. **Two-function outline** (togsim_emit_tile) -- deferred (trace identical).
+8. **Two-function outline** (togsim_kernel_tile) -- deferred (trace identical).
 9. **Retire the legacy ONNX-TOG path** once the trace path is stable.
 
 ### 11.3 Next-session context
diff --git a/docs/design/togsim_cpp_trace_HANDOFF.md b/docs/design/togsim_cpp_trace_HANDOFF.md
index 2c5d5bd6..23f642bb 100644
--- a/docs/design/togsim_cpp_trace_HANDOFF.md
+++ b/docs/design/togsim_cpp_trace_HANDOFF.md
@@ -89,7 +89,7 @@ the compute fence before a store — marked FIXME to become explicit later.)
 
 - `TOGSim/include/togsim_runtime.h` — extern "C" ABI v11 (`togsim_dma`,
   `togsim_memory_barrier`, `togsim_compute`, `togsim_compute_barrier`,
-  `togsim_core_alloc`, `togsim_emit` entry, `TOGSIM_ABI_VERSION`, opaque
+  `togsim_core_alloc`, `togsim_kernel` entry, `TOGSIM_ABI_VERSION`, opaque
   `EmitCtx`).
 - `PyTorchSimFrontend/mlir/passes/togsim_ops.py` — single source of truth for the
   skeleton+API MLIR vocabulary (op names, attr keys, op->callee map).
@@ -103,7 +103,7 @@ the compute fence before a store — marked FIXME to become explicit later.)
 - `tests/test_togsim_skeleton.py` — `test_togsim_ops_contract` (runs anywhere) +
   `test_build_skeleton_on_fixture` (gated on bindings + a fixture).
 - `PyTorchSimFrontend/mlir/passes/lower_to_emitc.py` — the P2/C4 pass: skeleton
-  module -> EmitC `togsim_emit` -> C++ (`mlir-translate`) -> `.so` (`g++`).
+  module -> EmitC `togsim_kernel` -> C++ (`mlir-translate`) -> `.so` (`g++`).
   Entry points: `lower_to_emitc(module)`, `build_trace_so(postvcix_path, so)`,
   and a `__main__` CLI (`--so`, `--emit-cpp`, `--include-dir`).
 - `tests/test_togsim_emitc.py` — `test_build_trace_so` (EmitC + symbol checks) +
@@ -127,7 +127,7 @@ python -m PyTorchSimFrontend.mlir.passes.build_skeleton "$FIX" --out /tmp/skel.m
 # 3. P2: skeleton -> EmitC -> C++ -> .so (reads skel from $FIX via build_skeleton).
 python -m PyTorchSimFrontend.mlir.passes.lower_to_emitc "$FIX" \
     --so /tmp/trace.so --emit-cpp /tmp/trace.cpp
-nm -D /tmp/trace.so | grep togsim     # togsim_emit = T; togsim_dma/memory_barrier/compute = U
+nm -D /tmp/trace.so | grep togsim     # togsim_kernel = T; togsim_dma/memory_barrier/compute = U
 
 # 4. tests
 TOGSIM_SKELETON_FIXTURE="$FIX" python -m pytest \
diff --git a/docs/design/togsim_cpp_trace_STATUS.md b/docs/design/togsim_cpp_trace_STATUS.md
index 1fb0bf1c..d13043af 100644
--- a/docs/design/togsim_cpp_trace_STATUS.md
+++ b/docs/design/togsim_cpp_trace_STATUS.md
@@ -117,7 +117,7 @@ differently than legacy's per-iteration BARs).
 - `cycle_table.py` — `tile_id -> (cycle, overlapping)`, overlapping
   `= max(cycle - offset[type], 0)` (legacy formula); JSON sidecar.
 - `TOGSim/src/togsim_runtime.cc` + `TOGSim/include/togsim_loader.h` — C6 runtime
-  and `run_producer` (dlopen -> togsim_emit -> records TraceRec). dma resolves
+  and `run_producer` (dlopen -> togsim_kernel -> records TraceRec). dma resolves
   `base[arg] + offset*elem_bytes` and signals its tag at data arrival; the
   matching memory_barrier waits the `(tag_id, tag_slot)`; compute looks up the
   cycle table; core_alloc round-robins a runtime core pool.
diff --git a/tests/test_togsim_emitc.py b/tests/test_togsim_emitc.py
new file mode 100644
index 00000000..285b9512
--- /dev/null
+++ b/tests/test_togsim_emitc.py
@@ -0,0 +1,151 @@
+"""Tests for the C4 emitc lowering + compiled .so trace producer (P2).
+
+The pipeline under test (docs/design/togsim_cpp_trace.md, sec 5-7):
+
+    post-vcix .mlir --build_skeleton--> skeleton+API
+                    --lower_to_emitc--> EmitC module
+                    --mlir-translate--> C++
+                    --g++ -shared----> trace .so  (exports togsim_kernel;
+                                                    togsim_* left undefined)
+
+`test_build_trace_so` builds the .so and checks the EmitC/symbol-table shape.
+`test_trace_so_runs` additionally dlopens it against a stub runtime and confirms
+the producer executes and emits a non-empty trace (checked via callback counts).
+
+Both are skipped unless the MLIR bindings, `mlir-translate` (from
+TORCHSIM_LLVM_PATH), a host C++ compiler, AND a post-vcix `.mlir` fixture (via
+`TOGSIM_SKELETON_FIXTURE`) are available -- the same fixture used by
+test_togsim_skeleton.py.
+"""
+import importlib.util
+import os
+import pathlib
+import shutil
+import subprocess
+import sys
+import tempfile
+
+import pytest
+
+_ROOT = pathlib.Path(__file__).resolve().parents[1]
+_CXX = os.environ.get("CXX", "g++")
+_INCLUDE = _ROOT / "TOGSim" / "include"
+
+
+def _mlir_translate():
+    return os.path.join(os.environ.get("TORCHSIM_LLVM_PATH", "/usr/bin"),
+                        "mlir-translate")
+
+
+def _tools_ready():
+    return (importlib.util.find_spec("mlir") is not None
+            and os.path.isfile(_mlir_translate())
+            and shutil.which(_CXX) is not None)
+
+
+def _fixture():
+    fix = os.environ.get("TOGSIM_SKELETON_FIXTURE")
+    if not fix or not os.path.isfile(fix):
+        pytest.skip("set TOGSIM_SKELETON_FIXTURE to a post-vcix kernel .mlir")
+    return fix
+
+
+_HARNESS = r'''
+#include <cstdio>
+#include <cstdint>
+#include <cstdlib>
+#include <dlfcn.h>
+#include "togsim_runtime.h"
+static int n_dma=0, n_membar=0, n_compute=0, n_core=0, bad=0;
+extern "C" {
+void togsim_dma(EmitCtx*, int32_t, int32_t, uint64_t, int32_t,
+                const int64_t*, const int64_t*, int32_t, int32_t,
+                int32_t, uint64_t, const int64_t*, int32_t,
+                const int64_t*, int32_t){ ++n_dma; }
+void togsim_compute(EmitCtx*, uint64_t, int32_t, int32_t, const int64_t*,
+                    const int64_t*, int32_t, const int64_t*, int32_t){ ++n_compute; }
+void togsim_memory_barrier(EmitCtx*, int32_t tag_id, uint64_t, const int64_t*, int32_t){
+  ++n_membar; if(tag_id<0) ++bad; }   // tag_id pairs it with its async dma
+int32_t togsim_core_alloc(EmitCtx*){ return n_core++; }   // count + assign a core
+void togsim_compute_barrier(EmitCtx*){}
+}
+int main(int argc, char** argv){
+  void* h = dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL);
+  if(!h){ printf("dlopen failed: %s\n", dlerror()); return 2; }
+  auto emit = (void(*)(EmitCtx*, int64_t*, int32_t))dlsym(h, "togsim_kernel");
+  if(!emit){ printf("dlsym failed: %s\n", dlerror()); return 3; }
+  emit(nullptr, nullptr, 0);
+  printf("TRACE core=%d dma=%d membar=%d compute=%d bad=%d\n",
+         n_core, n_dma, n_membar, n_compute, bad);
+  return 0;
+}
+'''
+
+
+@pytest.mark.skipif(not _tools_ready(),
+                    reason="need mlir bindings + mlir-translate + C++ compiler")
+def test_build_trace_so():
+    fix = _fixture()
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import lower_to_emitc as c4
+
+    with tempfile.TemporaryDirectory() as d:
+        so = os.path.join(d, "trace.so")
+        emitc_text = c4.build_trace_so(fix, so)
+        assert os.path.isfile(so)
+
+        # EmitC form: one entry func, dma/memory_barrier/compute as call_opaque targets.
+        assert "emitc.func" in emitc_text
+        assert ("@%s" % c4.ENTRY) in emitc_text
+        assert 'emitc.call_opaque "togsim_dma"' in emitc_text
+        assert 'emitc.call_opaque "togsim_memory_barrier"' in emitc_text
+        assert 'emitc.call_opaque "togsim_compute"' in emitc_text
+
+        # Symbol table: entry exported (defined, text), runtime hooks undefined
+        # so the TOGSim loader resolves them at dlopen.
+        nm = subprocess.run(["nm", "-D", so], capture_output=True, text=True).stdout
+        syms = {parts[-1]: parts[-2] for parts in
+                (ln.split() for ln in nm.splitlines()) if len(parts) >= 2}
+        assert syms.get("togsim_kernel") == "T", nm
+        assert syms.get("togsim_dma") == "U", nm
+        assert syms.get("togsim_core_alloc") == "U", nm
+        assert syms.get("togsim_memory_barrier") == "U", nm
+        # The per-work-item core alloc is emitted.
+        assert 'emitc.call_opaque "togsim_core_alloc"' in emitc_text
+
+
+@pytest.mark.skipif(not _tools_ready(),
+                    reason="need mlir bindings + mlir-translate + C++ compiler")
+def test_trace_so_runs():
+    fix = _fixture()
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import lower_to_emitc as c4
+
+    with tempfile.TemporaryDirectory() as d:
+        so = os.path.join(d, "trace.so")
+        c4.build_trace_so(fix, so)
+
+        harness_cpp = os.path.join(d, "harness.cpp")
+        harness_bin = os.path.join(d, "harness")
+        with open(harness_cpp, "w") as fh:
+            fh.write(_HARNESS)
+        # -rdynamic so the harness's togsim_* are visible to the dlopened .so.
+        build = subprocess.run(
+            [_CXX, "-std=gnu++17", "-O2", "-rdynamic", "-I", str(_INCLUDE),
+             harness_cpp, "-o", harness_bin, "-ldl"],
+            capture_output=True, text=True)
+        assert build.returncode == 0, build.stderr
+
+        run = subprocess.run([harness_bin, so], capture_output=True, text=True)
+        assert run.returncode == 0, run.stdout + run.stderr
+        out = run.stdout.strip()
+        assert out.startswith("TRACE "), out
+        counts = dict(kv.split("=") for kv in out.split()[1:])
+        # The producer ran and emitted a real trace, with >=1 work-item (core alloc).
+        assert int(counts["core"]) >= 1
+        assert int(counts["dma"]) >= 1
+        assert int(counts["compute"]) >= 1
+        # Async loads are synced by explicit memory barriers, each carrying a
+        # valid (non-negative) tag_id that pairs it with its dma.
+        assert int(counts["membar"]) >= 1, out
+        assert int(counts["bad"]) == 0, out
diff --git a/tests/test_togsim_runtime.py b/tests/test_togsim_runtime.py
new file mode 100644
index 00000000..8a0ae82e
--- /dev/null
+++ b/tests/test_togsim_runtime.py
@@ -0,0 +1,181 @@
+"""P3 task 5: the TOGSim C6 runtime + loader (togsim_runtime.cc / togsim_loader.h).
+
+Builds a producer `.so` from a post-vcix fixture, links the real C6 runtime, runs
+the loader (`run_producer`) against the `.so`, and checks the recorded trace:
+DRAM addresses are resolved (base[arg_id] + offset*elem_bytes), compute cycles
+come from the cycle table, and each memory barrier's tag pairs with an async dma.
+
+Skipped unless the MLIR bindings, `mlir-translate`, a C++ compiler, and a
+post-vcix `.mlir` fixture (`TOGSIM_SKELETON_FIXTURE`) are available.
+"""
+import importlib.util
+import os
+import pathlib
+import shutil
+import subprocess
+import sys
+import tempfile
+
+import pytest
+
+_ROOT = pathlib.Path(__file__).resolve().parents[1]
+_CXX = os.environ.get("CXX", "g++")
+_INCLUDE = _ROOT / "TOGSim" / "include"
+_RUNTIME = _ROOT / "TOGSim" / "src" / "togsim_runtime.cc"
+
+
+def _mlir_translate():
+    return os.path.join(os.environ.get("TORCHSIM_LLVM_PATH", "/usr/bin"),
+                        "mlir-translate")
+
+
+def _tools_ready():
+    return (importlib.util.find_spec("mlir") is not None
+            and os.path.isfile(_mlir_translate())
+            and shutil.which(_CXX) is not None
+            and _RUNTIME.is_file())
+
+
+def _fixture():
+    fix = os.environ.get("TOGSIM_SKELETON_FIXTURE")
+    if not fix or not os.path.isfile(fix):
+        pytest.skip("set TOGSIM_SKELETON_FIXTURE to a post-vcix kernel .mlir")
+    return fix
+
+
+# Drives the loader with known tensor bases + a synthetic cycle table, then
+# checks the recorded trace. Tailored to a single-output-tile GEMM (256^3):
+# 3 dmas A/B/C at offset 0 -> addr == base; args 0/1/2; dirs load/load/store.
+_MAIN = r'''
+#include <cstdio>
+#include <cstdint>
+#include <utility>
+#include <vector>
+#include "togsim_loader.h"
+using namespace togsim;
+int main(int argc, char** argv) {
+  uint64_t bases[3] = {0x1000, 0x2000, 0x3000};
+  int64_t  cyc[3]   = {100, 200, 300};
+  int64_t  ovl[3]   = {0, 200, 172};
+  RunResult r = run_producer(argv[1], nullptr, 0, bases, 3, cyc, ovl, 3, 1);
+  if (!r.ok) { printf("run failed\n"); return 2; }
+  int ndisp=0, nd=0, nc=0, nm=0, fail=0;
+  std::vector<uint64_t> dma_a; std::vector<int> dma_arg, dma_dir;
+  std::vector<std::pair<int,uint64_t>> async_tags;  // (tag_id, tag_slot) of async dmas
+  for (auto& t : r.trace) {
+    if (t.kind == TraceRec::DISPATCH) ndisp++;
+    else if (t.kind == TraceRec::DMA) {
+      nd++; dma_a.push_back(t.addr);
+      dma_arg.push_back(t.arg_id); dma_dir.push_back(t.dir);
+      if (t.is_async) async_tags.push_back({t.tag_id, t.tag_slot});
+    } else if (t.kind == TraceRec::COMPUTE) {
+      nc++;
+      int64_t want = (t.tile_id < 3) ? cyc[t.tile_id] : -1;
+      if (t.cycle != want) { printf("compute %lu cyc %ld!=%ld\n",
+          (unsigned long)t.tile_id, (long)t.cycle, (long)want); fail++; }
+    } else if (t.kind == TraceRec::MEMORY_BAR) {
+      nm++; bool ok=false;
+      for (auto& k : async_tags) if (k.first==t.tag_id && k.second==t.tag_slot) ok=true;
+      if (!ok) { printf("membar tag (%d,%lu) pairs no async dma\n",
+          t.tag_id, (unsigned long)t.tag_slot); fail++; }
+    }
+  }
+  const uint64_t exp[3] = {0x1000, 0x2000, 0x3000};
+  const int ea[3] = {0,1,2}, ed[3] = {0,0,1};
+  for (int i = 0; i < nd && i < 3; ++i)
+    if (dma_a[i]!=exp[i] || dma_arg[i]!=ea[i] || dma_dir[i]!=ed[i]) {
+      printf("dma[%d] addr=%#lx arg=%d dir=%d\n", i,
+             (unsigned long)dma_a[i], dma_arg[i], dma_dir[i]); fail++;
+    }
+  printf("dispatch=%d dma=%d compute=%d membar=%d fail=%d\n", ndisp, nd, nc, nm, fail);
+  printf(fail ? "RESULT FAIL\n" : "RESULT PASS\n");
+  return fail ? 1 : 0;
+}
+'''
+
+
+@pytest.mark.skipif(not _tools_ready(),
+                    reason="need mlir bindings + mlir-translate + C++ compiler + runtime")
+def test_runtime_loads_and_records():
+    fix = _fixture()
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import lower_to_emitc as c4
+
+    with tempfile.TemporaryDirectory() as d:
+        so = os.path.join(d, "trace.so")
+        c4.build_trace_so(fix, so)
+
+        main_cpp = os.path.join(d, "main.cpp")
+        binp = os.path.join(d, "runtime_test")
+        with open(main_cpp, "w") as fh:
+            fh.write(_MAIN)
+        build = subprocess.run(
+            [_CXX, "-std=gnu++17", "-O2", "-rdynamic", "-I", str(_INCLUDE),
+             main_cpp, str(_RUNTIME), "-o", binp, "-ldl"],
+            capture_output=True, text=True)
+        assert build.returncode == 0, build.stderr
+
+        run = subprocess.run([binp, so], capture_output=True, text=True)
+        out = run.stdout
+        assert "RESULT PASS" in out, out + run.stderr
+        assert run.returncode == 0, out
+        # at least one dma and one compute were recorded (addresses checked in _MAIN).
+        line = [l for l in out.splitlines() if l.startswith("dispatch=")][0]
+        counts = dict(kv.split("=") for kv in line.split())
+        assert int(counts["dma"]) >= 1
+        assert int(counts["compute"]) >= 1
+        assert int(counts["fail"]) == 0
+
+
+_SIM_MAIN = r'''
+#include <cstdio>
+#include <cstdint>
+#include "togsim_loader.h"
+using namespace togsim;
+int main(int argc, char** argv) {
+  uint64_t bases[3] = {0x1000, 0x2000, 0x3000};
+  int64_t  cyc[3]   = {100, 200, 300};
+  int64_t  ovl[3]   = {0, 200, 172};
+  RunResult r = run_producer(argv[1], nullptr, 0, bases, 3, cyc, ovl, 3, 1);
+  if (!r.ok) { printf("run failed\n"); return 2; }
+  TimingParams p; p.dma_latency = 100;
+  SimResult s = simulate(r, p);
+  // serial baseline: no overlap at all.
+  uint64_t serial = 0;
+  for (auto& t : r.trace) {
+    if (t.kind == TraceRec::DMA) serial += p.dma_latency;
+    else if (t.kind == TraceRec::COMPUTE) serial += (uint64_t)t.cycle;
+  }
+  printf("SIM total=%lu compute=%d dma=%d serial=%lu\n",
+         (unsigned long)s.total_cycle, s.n_compute, s.n_dma, (unsigned long)serial);
+  // The trace is schedulable into cycles; overlap (dma||compute, compute
+  // pipelining) makes it no worse than the fully-serial baseline.
+  bool ok = s.total_cycle > 0 && s.n_compute > 0 && s.total_cycle <= serial;
+  printf(ok ? "RESULT PASS\n" : "RESULT FAIL\n");
+  return ok ? 0 : 1;
+}
+'''
+
+
+@pytest.mark.skipif(not _tools_ready(),
+                    reason="need mlir bindings + mlir-translate + C++ compiler + runtime")
+def test_simulate_produces_cycles():
+    fix = _fixture()
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import lower_to_emitc as c4
+
+    with tempfile.TemporaryDirectory() as d:
+        so = os.path.join(d, "trace.so")
+        c4.build_trace_so(fix, so)
+        main_cpp = os.path.join(d, "sim.cpp")
+        binp = os.path.join(d, "sim_test")
+        with open(main_cpp, "w") as fh:
+            fh.write(_SIM_MAIN)
+        build = subprocess.run(
+            [_CXX, "-std=gnu++17", "-O2", "-rdynamic", "-I", str(_INCLUDE),
+             main_cpp, str(_RUNTIME), "-o", binp, "-ldl"],
+            capture_output=True, text=True)
+        assert build.returncode == 0, build.stderr
+        run = subprocess.run([binp, so], capture_output=True, text=True)
+        assert "RESULT PASS" in run.stdout, run.stdout + run.stderr
+        assert run.returncode == 0, run.stdout
diff --git a/tests/test_togsim_skeleton.py b/tests/test_togsim_skeleton.py
new file mode 100644
index 00000000..c3644a4b
--- /dev/null
+++ b/tests/test_togsim_skeleton.py
@@ -0,0 +1,119 @@
+"""Tests for the C++ trace-generation front-end pieces (docs/design/togsim_cpp_trace.md).
+
+Two layers:
+
+* `test_togsim_ops_contract` runs anywhere (no MLIR bindings, no torch). It pins
+  the skeleton+API vocabulary (`togsim_ops.py`) and checks it stays in lockstep
+  with the runtime ABI header (`togsim_runtime.h`) -- the single thing most
+  likely to silently drift.
+* `test_build_skeleton_on_fixture` exercises the real `build_skeleton` pass, and
+  is skipped unless the MLIR bindings are importable AND a post-vcix `.mlir`
+  fixture is supplied via the `TOGSIM_SKELETON_FIXTURE` env var. (A valid
+  build_tog-compatible fixture is hard to hand-write reliably; point this at a
+  kernel dump from a real run.)
+"""
+import os
+import importlib.util
+import pathlib
+
+import pytest
+
+_ROOT = pathlib.Path(__file__).resolve().parents[1]
+_OPS_PY = _ROOT / "PyTorchSimFrontend" / "mlir" / "passes" / "togsim_ops.py"
+_HEADER = _ROOT / "TOGSim" / "include" / "togsim_runtime.h"
+
+
+def _load_togsim_ops():
+    spec = importlib.util.spec_from_file_location("togsim_ops", _OPS_PY)
+    mod = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(mod)
+    return mod
+
+
+def test_togsim_ops_contract():
+    ts = _load_togsim_ops()
+    header = _HEADER.read_text()
+
+    # Every op maps to a callee, and every callee name appears in the header.
+    assert set(ts.EMITC_CALLEE) == set(ts.OP_NAMES)
+    for callee in ts.EMITC_CALLEE.values():
+        assert callee in header, f"{callee} missing from togsim_runtime.h"
+
+    # Entry point symbol agrees with the header.
+    assert ts.ENTRY_SYMBOL == "togsim_kernel"
+    assert ts.ENTRY_SYMBOL in header
+
+    # Runtime callee emitted directly by lower_to_emitc (core alloc).
+    assert ts.CORE_ALLOC_CALLEE in header
+
+    # Direction enum agrees with the header's togsim_dma_dir.
+    assert (ts.DIR_LOAD, ts.DIR_STORE) == (0, 1)
+    assert "TOGSIM_DMA_LOAD  = 0" in header
+    assert "TOGSIM_DMA_STORE = 1" in header
+
+
+def _mlir_available():
+    return importlib.util.find_spec("mlir") is not None
+
+
+@pytest.mark.skipif(not _mlir_available(), reason="MLIR Python bindings not installed")
+def test_build_skeleton_on_fixture():
+    fixture = os.environ.get("TOGSIM_SKELETON_FIXTURE")
+    if not fixture or not os.path.isfile(fixture):
+        pytest.skip("set TOGSIM_SKELETON_FIXTURE to a post-vcix kernel .mlir")
+
+    import sys
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import build_skeleton
+
+    import mlir.ir as ir
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx:
+        module = ir.Module.parse(pathlib.Path(fixture).read_text(), ctx)
+        report = build_skeleton.build_skeleton(module)
+        out = str(module)
+
+    # The data-movement ops are gone; the API ops took their place.
+    assert "memref.dma_start" not in out
+    assert "memref.dma_wait" not in out
+    assert "togsim.dma" in out
+    assert "togsim.memory_barrier" in out   # the explicit async-DMA sync (was dma_wait)
+    assert "event_id" not in out            # static pairing replaced by the runtime tag
+    # Loop skeleton is preserved.
+    assert ("affine.for" in out) or ("scf.for" in out)
+    assert module.operation.verify()
+    print(report)
+
+
+@pytest.mark.skipif(not _mlir_available(), reason="MLIR Python bindings not installed")
+def test_cycle_table_on_fixture():
+    fixture = os.environ.get("TOGSIM_SKELETON_FIXTURE")
+    if not fixture or not os.path.isfile(fixture):
+        pytest.skip("set TOGSIM_SKELETON_FIXTURE to a post-vcix kernel .mlir")
+
+    import sys
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import build_skeleton, cycle_table
+
+    import mlir.ir as ir
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx:
+        module = ir.Module.parse(pathlib.Path(fixture).read_text(), ctx)
+        build_skeleton.build_skeleton(module)
+        types = cycle_table._compute_types(module)
+        # synthetic per-tile cycles (gem5 sample-mode is reused at P3 task 5).
+        cyc = [10 * (i + 1) for i in range(len(types))]
+        x_off, w_off = 4, 0
+        table = cycle_table.build_cycle_table(module, cyc, x_off, w_off)
+
+    assert len(table) == len(types) >= 1
+    # cycle is carried verbatim; overlapping_cycle follows the legacy formula.
+    for (cy, ov), t, raw in zip(table, types, cyc):
+        assert cy == raw
+        if t == cycle_table.VECTOR_COMPUTE:
+            assert ov == 0
+        else:
+            off = w_off if t == cycle_table.MATMUL_PRELOAD else x_off
+            assert ov == max(raw - off, 0)

From 78f77bcfb84e649793cf4655c537db0335d36903 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 03/24] [TOGSim] Per-iteration tag pairing for multi-tile-K and
 conv

Mint a per-record tag key in the bridge and allocate a per-iteration tag in
dma_fine_grained so multi-tile-K and conv loads do not collide; strip the
reduction accumulation marker from the memory_barrier slot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
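Reviewer note (not part of the commit): a minimal Python sketch of the
"current load" pairing that trace_to_tilegraph now performs, assuming every dma
record is an async load. The names assign_keys and Record are hypothetical and
exist only for this illustration; the real bridge works on togsim::TraceRec and
Instruction objects.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, int, int]  # (kind, tag_id, tag_slot); kind is "dma" or "bar"


def assign_keys(trace: List[Record]) -> List[Tuple[str, int, Optional[int]]]:
    """Return (kind, uniq, paired_dma_index) for each record, in order."""
    # (tag_id, tag_slot) -> (uniq, index of the most recent load)
    current: Dict[Tuple[int, int], Tuple[int, int]] = {}
    next_tag = 0
    out: List[Tuple[str, int, Optional[int]]] = []
    for i, (kind, tag_id, slot) in enumerate(trace):
        if kind == "dma":
            uniq = next_tag          # fresh key per dma record
            next_tag += 1
            current[(tag_id, slot)] = (uniq, i)
            out.append((kind, uniq, None))
        else:
            uniq = next_tag          # fallback key, used only if unpaired
            next_tag += 1
            paired = None
            hit = current.get((tag_id, slot))
            if hit is not None:
                uniq, paired = hit   # reuse the current load's key
            out.append((kind, uniq, paired))
    return out


# Two reduction iterations of one static load, each followed by two barriers.
trace = [("dma", 0, 0), ("bar", 0, 0), ("bar", 0, 0)] * 2
keys = assign_keys(trace)
```

Both barriers of an iteration share their load's key, and the second
iteration's load gets a different key from the first, which is the property the
Core tag table relies on.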
 .../mlir/passes/build_skeleton.py             | 86 ++++++++++++++++++
 .../mlir/passes/dma_fine_grained.py           | 27 ++++++
 .../mlir/passes/lower_to_vcix.py              | 12 +++
 TOGSim/src/togsim_trace_bridge.cc             | 90 ++++++++++++-------
 docs/design/togsim_cpp_trace.md               | 10 ++-
 docs/design/togsim_cpp_trace_STATUS.md        |  7 ++
 tests/test_togsim_skeleton.py                 | 64 +++++++++++++
 7 files changed, 265 insertions(+), 31 deletions(-)
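Reviewer note (not part of the commit): a minimal sketch of what
_strip_accum_terms does, with MLIR left out. It assumes a hypothetical
stand-in for an affine expression, a dict from dim position to coefficient;
the real pass works on mlir.ir.AffineExpr and rebuilds an affine.apply.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for one affine map result: {dim position: coefficient}.
Expr = Dict[int, int]


def strip_accum_terms(expr: Expr, n_dims: int) -> Tuple[Expr, List[int]]:
    """Drop dims with a negative coefficient and renumber the rest 0..k-1.

    Returns the rewritten expression and the original positions of the kept
    dims (the operands a rebuilt affine.apply would take).
    """
    dropped = {d for d, c in expr.items() if c < 0}
    kept = [d for d in range(n_dims) if d not in dropped]
    new_pos = {old: new for new, old in enumerate(kept)}
    stripped = {new_pos[d]: c for d, c in expr.items() if d not in dropped}
    return stripped, kept


def evaluate(expr: Expr, operands: List[int]) -> int:
    """Evaluate the expression on concrete loop induction values."""
    return sum(c * operands[d] for d, c in expr.items())


# Wait index in the shape the commit describes: -d0 (reduction iv) + d1 (subtile).
wait_expr: Expr = {0: -1, 1: 1}
# Load index: subtile only.
load_expr: Expr = {0: 1}

stripped, kept = strip_accum_terms(wait_expr, n_dims=2)

# After stripping, the barrier slot equals the load slot on every iteration.
for acc_iv in range(3):
    for subtile in range(4):
        ivs = [acc_iv, subtile]
        bar_slot = evaluate(stripped, [ivs[i] for i in kept])
        assert bar_slot == evaluate(load_expr, [subtile])
```

Evaluating wait_expr unstripped at acc_iv=1, subtile=2 gives slot 1 instead of
2, which is the divergence the commit fixes.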

diff --git a/PyTorchSimFrontend/mlir/passes/build_skeleton.py b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
index deb7d9ae..df4c6046 100644
--- a/PyTorchSimFrontend/mlir/passes/build_skeleton.py
+++ b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
@@ -147,6 +147,89 @@ def _emit_memory_bar(ctx, anchor_op, tag_id, tag_index, write_bufs):
         loc=ir.Location.unknown(ctx), ip=ir.InsertionPoint(anchor_op))
 
 
+def _flatten_add(expr):
+    """Top-level additive summands of an AffineExpr (`.lhs`/`.rhs` come back typed
+    as the base AffineExpr, so test with `ir.AffineAddExpr.isinstance(...)` and
+    cast via the constructor, not with Python's built-in isinstance())."""
+    if ir.AffineAddExpr.isinstance(expr):
+        a = ir.AffineAddExpr(expr)
+        return _flatten_add(a.lhs) + _flatten_add(a.rhs)
+    return [expr]
+
+
+def _neg_coeff_dim(summand):
+    """If `summand` is `dim * c` with a negative constant `c`, return that dim's
+    position; else None. lower_to_vcix tags each accumulation (reduction) loop var
+    with coefficient -1 in the dma_wait tag index -- a SENTINEL marking the
+    reduction axis, not an arithmetic offset (legacy TileGraphParser skips stride
+    -1 for the same reason)."""
+    if not ir.AffineMulExpr.isinstance(summand):
+        return None
+    mul = ir.AffineMulExpr(summand)
+    l, r = mul.lhs, mul.rhs
+    dim = l if ir.AffineDimExpr.isinstance(l) else (r if ir.AffineDimExpr.isinstance(r) else None)
+    con = l if ir.AffineConstantExpr.isinstance(l) else (r if ir.AffineConstantExpr.isinstance(r) else None)
+    if dim is None or con is None or ir.AffineConstantExpr(con).value >= 0:
+        return None
+    return ir.AffineDimExpr(dim).position
+
+
+def _strip_accum_terms(ctx, tag_index, anchor_op):
+    """Return a tag-index Value with the accumulation-marked (-1 coefficient) terms
+    dropped, so a memory_barrier waits on the SAME subtile slot its async load
+    wrote.
+
+    The wait tag index built by lower_to_vcix carries `-acc_iv` for each reduction
+    loop var; the matching load index (dma_fine_grained) is subtile-only. Without
+    this, at reduction iteration > 0 the producer EVALUATES `-acc_iv` to a negative
+    slot, so the recorded barrier slot diverges from the load slot and the runtime
+    tag pairing fails (TOGSim aborts with "Key does not exist in ... tag table").
+    Dropping the -1 terms mirrors legacy TileGraphParser.cc, which skips stride -1
+    and routes the reduction axis to a separate accum tag component; here the
+    per-iteration tag alloc (dma_fine_grained) already separates the reductions, so
+    the barrier only needs the subtile slot.
+
+    Falls through (returns `tag_index` unchanged) for anything that is not an
+    affine.apply whose single result carries such a term -- e.g. the single-tile
+    case, whose index has no reduction term."""
+    if tag_index is None:
+        return None
+    try:
+        apply_op = tag_index.owner
+        if apply_op.name != "affine.apply":
+            return tag_index
+        amap = ir.AffineMapAttr(apply_op.attributes["map"]).value
+    except Exception:
+        return tag_index
+    if amap.n_dims == 0 or amap.n_symbols != 0 or len(amap.results) != 1:
+        return tag_index
+    expr = amap.results[0]
+    dropped = sorted({p for p in (_neg_coeff_dim(s) for s in _flatten_add(expr))
+                      if p is not None})
+    if not dropped:
+        return tag_index
+    n = amap.n_dims
+    kept = [i for i in range(n) if i not in dropped]
+    new_pos = {old: i for i, old in enumerate(kept)}
+    # compose the original expr with a selector that sends each dropped dim to 0
+    # and renumbers the kept dims 0..k-1.
+    sel = [ir.AffineConstantExpr.get(0) if i in dropped
+           else ir.AffineDimExpr.get(new_pos[i]) for i in range(n)]
+    new_expr = expr.compose(ir.AffineMap.get(len(kept), 0, sel))
+    new_map = ir.AffineMap.get(len(kept), 0, [new_expr])
+    operands = list(apply_op.operands)
+    new_operands = [operands[i] for i in kept]
+    new_apply = ir.Operation.create(
+        "affine.apply",
+        results=[ir.IndexType.get(ctx)],
+        operands=new_operands,
+        attributes={"map": ir.AffineMapAttr.get(new_map)},
+        loc=ir.Location.unknown(ctx),
+        ip=ir.InsertionPoint(anchor_op),
+    )
+    return new_apply.results[0]
+
+
 def _emit_compute(ctx, compute_node, tile_id, read_bufs, write_bufs):
     front = compute_node.operations[0]
     attrs = {
@@ -322,6 +405,9 @@ def _emit_one_wait(ctx, op, tags):
     if binding is None:
         return False
     tag_id, buf = binding
+    # honor lower_to_vcix's -1 accumulation marker: strip the reduction terms so
+    # the barrier slot equals the subtile slot the paired async load wrote.
+    tag_index = _strip_accum_terms(ctx, tag_index, op)
     _emit_memory_bar(ctx, op, tag_id, tag_index, [buf])
     return True
 
diff --git a/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py b/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
index 994f1f1f..f1872dca 100644
--- a/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
+++ b/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
@@ -240,6 +240,27 @@ def _const_index(v, ip):
                             ir.IntegerAttr.get(ir.IndexType.get(), v), ip=ip).result
 
 
+def _fresh_tag(dma):
+    """Give this DMA a fresh tag memref.alloc right BEFORE the (pre-split) coarse
+    dma_start, and rewire every use of the old tag -- the dma_start re-emitted
+    below AND its dma_wait -- to it. The coarse dma sits at the reduction-loop body
+    level (it has not been wrapped in a subtile load nest yet), so the alloc there
+    dominates both the load nest fine-grained is about to build and the sibling
+    wait nest. Each reduction iteration thus allocates its own tag -> successive
+    iterations are distinct (multi-tile-K / conv) and the per-iteration tag
+    semantics are carried in the IR, not reconstructed downstream. The old alloc is dead."""
+    old = dma.tag
+    new_tag = ir.Operation.create("memref.alloc", results=[old.type],
+                                  operands=[], ip=ir.InsertionPoint(dma.op)).results[0]
+    old.replace_all_uses_with(new_tag)
+    dma.tag = new_tag
+    # the old (func-entry, per-tensor unique) alloc is now dead -- erase it.
+    try:
+        old.owner.erase()
+    except Exception:
+        pass
+
+
 # ---------------------------------------------------------------------------
 # Loop-nest construction
 # ---------------------------------------------------------------------------
@@ -351,6 +372,12 @@ def _run_func(func, vectorlane):
     for d, f in enumerate(fuse["w_to_fused"]):
         bounds[f] = w_counts[d]
 
+    # Give each load a fresh per-iteration tag alloc just before its coarse dma
+    # (rewiring its dma_wait via the old tag's uses), so the tag is distinct per
+    # reduction iteration -- positioned to match the per-iteration tag semantics.
+    _fresh_tag(mvin_input)
+    _fresh_tag(mvin_weight)
+
     # Insert the fused nest at the weight DMA (the later of the two): both DMAs'
     # original DRAM base indices (src_idx[0], computed in the enclosing loops) must
     # dominate the nest. Codegen emits input before weight, matching the C++ pass
diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py b/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py
index 9635db68..df124d00 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_vcix.py
@@ -469,6 +469,14 @@ def _root(v):
         # --- B dma_wait ---
         nacc = len(acc)
         acc_ivs = [_loop_iv(l) for l in acc]
+        # LEGACY behavior: coefficient -1 on each accumulation (reduction) loop var
+        # is a SENTINEL marking "this tag dim is the reduction axis", not an
+        # arithmetic offset. The legacy TOG path (TileGraphParser.cc) honors it by
+        # routing those vars to a separate accum tag component and skipping stride
+        # -1. The C++ trace path does NOT honor it: build_skeleton._strip_accum_terms
+        # drops these -1 terms so the memory_barrier slot stays subtile-only and
+        # pairs with its async load. Kept here for byte-identity with the C++
+        # -test-pytorchsim-to-vcix pass; remove (do not flag) once legacy retires.
         bexpr = ir.AffineDimExpr.get(0) * -1
         for i in range(1, nacc):
             bexpr = bexpr + ir.AffineDimExpr.get(i) * -1
@@ -525,6 +533,10 @@ def _root(v):
 
     with body_ip:
         # --- A dma_wait ---
+        # LEGACY behavior (see the B dma_wait above): the -1 coefficients mark the
+        # reduction axis for the legacy TOG path; the trace path strips them in
+        # build_skeleton._strip_accum_terms. Kept for byte-identity with the C++
+        # -test-pytorchsim-to-vcix pass; remove once legacy retires.
         aexpr = ir.AffineDimExpr.get(0) * -1
         for i in range(1, nacc):
             aexpr = aexpr + ir.AffineDimExpr.get(i) * -1
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
index bbf6252c..8d0dc721 100644
--- a/TOGSim/src/togsim_trace_bridge.cc
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -10,11 +10,30 @@
 
 namespace {
 
-std::shared_ptr<Instruction> make_dma(const togsim::TraceRec& t) {
+// `uniq` is a per-DMA-record unique tag-key id minted by the caller. The Core
+// tag table keys completion on [addr_id, ..., sum(tag_idx*stride)]; using `uniq`
+// as addr_id makes every reduction iteration of one static dma get a DISTINCT
+// key -- so multi-tile-K (and conv, whose reduction is the kh*kw*C nest) do not
+// collide, with no coordinate enumeration. The matching memory_barrier reuses
+// the same `uniq` (current-load map per (tag_id, tag_slot), see
+// trace_to_tilegraph), so the table still pairs them. This works because the
+// recorded stream is already per-iteration (the producer ran the loops) --
+// unlike a compile-time event_id. `tag_idx` (the subtile slot) is retained for
+// the SRAM double-buffer model.
+//
+// FIXME(semantics): the per-iteration tag is still reconstructed HERE from the
+// record order. The producer IR now DOES carry a per-iteration tag:
+// dma_fine_grained emits a fresh tag memref.alloc just before each coarse load
+// (rewiring its dma_wait), so successive reduction iterations allocate distinct
+// tags. But build_skeleton collapses that to one static tag_id (it DCEs the alloc
+// and keys togsim.dma by the alloc's static identity), so this bridge still needs
+// `uniq` to tell iterations apart at runtime. The faithful finish is to thread the
+// per-iteration alloc identity through build_skeleton as an SSA tag handle on the
+// togsim.dma / togsim.memory_barrier (then `uniq` here is unnecessary).
+std::shared_ptr<Instruction> make_dma(const togsim::TraceRec& t, int64_t uniq) {
   Opcode op = (t.dir == 1) ? Opcode::MOVOUT : Opcode::MOVIN;
   std::vector<size_t> tile_size(t.dims.begin(), t.dims.end());
   std::vector<int> tile_stride(t.strides.begin(), t.strides.end());
-  // tag_idx_list / tag_stride_list must match in size; one slot key per dma.
   std::vector<int64_t> tag_idx{(int64_t)t.tag_slot};
   std::vector<int64_t> tag_stride{1};
   auto inst = std::make_shared<Instruction>(
@@ -22,26 +41,21 @@ std::shared_ptr<Instruction> make_dma(const togsim::TraceRec& t) {
       tile_size, tile_stride, (size_t)t.elem_bits, tag_idx, tag_stride,
       /*accum_tag_idx_list=*/std::vector<int64_t>{});
   inst->set_is_async(t.is_async != 0);
-  // The tag key is [addr_id, ..., sum(tag_idx*tag_stride)]. addr_id is the tag
-  // memref identity (tag_id): an async dma and its memory_barrier share a tag
-  // memref, so the same (tag_id, tag_slot) keys both and the Core's tag table
-  // pairs them. (Distinct tag memrefs -> distinct tag_id, so no false collision.)
-  inst->set_addr_name("tag" + std::to_string(t.tag_id), t.tag_id);
+  inst->set_addr_name("tag" + std::to_string(uniq), uniq);
   inst->prepare_tag_key();
   return inst;
 }
 
-// A MEMORY_BAR carrying the SAME tag key as the async dma it gates -- the Core's
-// tag table signals it at the dma's DATA-ready (resp-complete), unlike a raw
-// add_child which the async dma releases at issue-complete. Tag inputs match
-// make_dma (tag_idx={tag_slot}, stride={1}, addr_id=tag_id) so the keys collide.
-std::shared_ptr<Instruction> make_mem_bar(const togsim::TraceRec& t) {
+// A MEMORY_BAR carrying the SAME `uniq` tag key as the async dma it gates -- the
+// Core's tag table signals it at the dma's DATA-ready (resp-complete), unlike a
+// raw add_child which the async dma releases at issue-complete.
+std::shared_ptr<Instruction> make_mem_bar(const togsim::TraceRec& t, int64_t uniq) {
   auto bar = std::make_shared<Instruction>(
       Opcode::MEMORY_BAR, 0, 0, 0,
       std::vector<size_t>{}, std::vector<int>{}, 0,
       std::vector<int64_t>{(int64_t)t.tag_slot}, std::vector<int64_t>{1},
       std::vector<int64_t>{});
-  bar->set_addr_name("tag" + std::to_string(t.tag_id), t.tag_id);
+  bar->set_addr_name("tag" + std::to_string(uniq), uniq);
   bar->prepare_tag_key();
   return bar;
 }
@@ -73,10 +87,18 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
   // SRAM buffer it reads. Scoped per work-item (reset at each dispatch) -- buffers
   // are work-item-local, so distinct work-items are independent (-> parallel).
   std::map<int64_t, std::shared_ptr<Instruction>> last_writer;  // buffer id -> producer
-  // An async dma is paired with its explicit memory_barrier by the RUNTIME tag
-  // (tag_id, tag_slot): the dma records itself here so the later barrier can find
-  // it and depend on it. Scoped per work-item (the tag table is per subgraph).
-  std::map<std::pair<int32_t, uint64_t>, std::shared_ptr<Instruction>> tag_to_dma;
+  // An async dma is paired with its explicit memory_barrier(s) by the runtime tag
+  // (tag_id, tag_slot). It is 1 load : N barriers (the load happens once per
+  // reduction iteration; each consumer in that iteration is preceded by a wait on
+  // the same tag), so we track the CURRENT (most recent) load per (tag_id,
+  // tag_slot) -- like last_writer for a buffer -- not a FIFO. Each load gets a
+  // fresh `uniq` Core key, so successive reduction iterations (multi-tile-K, conv)
+  // never collide in the tag table; the iteration's barriers reuse that load's
+  // uniq. Correct because the load nest and its consumer nest run in order within
+  // the reduction body (no cross-iteration prefetch). Scoped per work-item.
+  std::map<std::pair<int32_t, uint64_t>,
+           std::pair<int64_t, std::shared_ptr<Instruction>>> current_dma;
+  int64_t next_tag = 0;   // mints a unique Core tag key per dma record
   // Async compute (matmul/preload): issued and pipelined on the systolic array;
   // they do not block each other. A store then needs the drained result, so it
   // FLUSHes -- waits all outstanding async compute before running (like a fence
@@ -94,7 +116,8 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     sg.reset();
     tile.reset();
     last_writer.clear();
-    tag_to_dma.clear();
+    current_dma.clear();
+    next_tag = 0;
     outstanding_async.clear();
     pending_bar.reset();
   };
@@ -138,7 +161,8 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     if (!tile) continue;  // defensive: ops before the first core_alloc
 
     if (t.kind == TraceRec::DMA) {
-      auto inst = make_dma(t);
+      int64_t uniq = next_tag++;                         // fresh Core tag key per dma record
+      auto inst = make_dma(t, uniq);
       size_t numel = 1;                                  // SRAM footprint (ready-tile ordering)
       for (auto d : t.dims) numel *= (size_t)d;
       tile->inc_required_sram_size(numel * (t.elem_bits / 8));
@@ -155,21 +179,27 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
         }
       } else {                                           // LOAD
         tile->append_instuction(inst);
-        // Record the dma under its runtime tag so the explicit memory_barrier
-        // (the original dma_wait) can pair with it. last_writer = the dma for now;
-        // for an async load the barrier overwrites it (consumers gate on data
-        // arrival), for a sync load the dma itself blocks to data arrival.
-        if (t.is_async) tag_to_dma[{t.tag_id, t.tag_slot}] = inst;
+        // async load: record it as the CURRENT load for this (tag_id, tag_slot)
+        // with its fresh uniq; the barriers in this reduction iteration reuse that
+        // uniq (1 load : N barriers). A new iteration's load overwrites it with a
+        // new uniq -> distinct tag key, no collision. last_writer = the dma for now;
+        // the barrier overwrites it so consumers gate on data arrival. A sync load
+        // has no barrier and blocks to arrival itself.
+        if (t.is_async) current_dma[{t.tag_id, t.tag_slot}] = {uniq, inst};
         for (int64_t b : t.write_bufs) last_writer[b] = inst;
       }
     } else if (t.kind == TraceRec::MEMORY_BAR) {
-      // the explicit async-DMA sync (the original dma_wait). Pair with its dma by
-      // the runtime tag; the dma releases the bar at issue-complete (add_child),
-      // then the bar parks on the tag table until the data arrives (resp-complete,
+      // the explicit async-DMA sync (the original dma_wait). Pair with the CURRENT
+      // load for this (tag_id, tag_slot), reusing its uniq Core key so the dma and
+      // bar pair in the tag table; the dma releases the bar at issue-complete
+      // (add_child), then the bar parks on the tag until data-ready (resp-complete,
       // set_tag_finish). Consumers of the loaded buffer then gate on the bar.
-      auto bar = make_mem_bar(t);
-      auto it = tag_to_dma.find({t.tag_id, t.tag_slot});
-      if (it != tag_to_dma.end()) it->second->add_child(bar);
+      auto it = current_dma.find({t.tag_id, t.tag_slot});
+      int64_t uniq = next_tag++;                         // fallback if unpaired
+      std::shared_ptr<Instruction> dma_inst;
+      if (it != current_dma.end()) { uniq = it->second.first; dma_inst = it->second.second; }
+      auto bar = make_mem_bar(t, uniq);
+      if (dma_inst) dma_inst->add_child(bar);
       tile->append_instuction(bar);
       for (int64_t b : t.write_bufs) last_writer[b] = bar;
     } else if (t.kind == TraceRec::COMPUTE) {
diff --git a/docs/design/togsim_cpp_trace.md b/docs/design/togsim_cpp_trace.md
index 710c1b23..6adc609c 100644
--- a/docs/design/togsim_cpp_trace.md
+++ b/docs/design/togsim_cpp_trace.md
@@ -113,7 +113,15 @@ as the dma↔barrier pairing key:
   `tag_slot` index identifies which async DMA a `memory_barrier` is waiting on.
 - **`tag_slot`** — the SRAM tile slot the loaded tile occupies (the
   double-buffer / SRAM-capacity index). It is *also* part of the pairing key
-  because each load's tile maps to its own slot.
+  because each load's tile maps to its own slot. The slot is **subtile-only**:
+  `lower_to_vcix` writes the dma_wait tag index with a `-acc_iv` term for each
+  accumulation (reduction) loop var — a sentinel marking the reduction axis, not
+  an arithmetic offset — and `build_skeleton` strips those terms so a
+  `memory_barrier` waits on the same slot its async load wrote. (Mirrors legacy
+  `TileGraphParser`, which skips stride -1; reduction iterations are told apart
+  by the per-iteration tag alloc + a fresh per-record Core key in the bridge, not
+  by the slot.) Without the strip, the producer evaluates `-acc_iv` to a negative
+  slot at reduction iteration > 0 and the pairing fails on subtile + multi-tile-K.
 
 Pairing is done at runtime by the existing TOGSim Core tag table: the async DMA
 calls `prepare_tag_key` and `set_tag_finish` (signal at data arrival), the
diff --git a/docs/design/togsim_cpp_trace_STATUS.md b/docs/design/togsim_cpp_trace_STATUS.md
index d13043af..ebf05701 100644
--- a/docs/design/togsim_cpp_trace_STATUS.md
+++ b/docs/design/togsim_cpp_trace_STATUS.md
@@ -103,6 +103,13 @@ work and DRAM traffic match; the remaining difference is scheduling (the
 explicit dataflow DAG plus the occupancy/latency SA-pipeline model overlap
 differently than legacy's per-iteration BARs).
 
+**Subtile + multi-tile-K now runs** (256x512x256 forced to 128x128 subtiles, 2
+K-tiles: 5774 cycles, no crash). This needed `build_skeleton` to strip the
+`-acc_iv` accumulation marker from the dma_wait tag index so the memory_barrier
+slot stays subtile-only and pairs with its load (see §3, `tag_slot`); before the
+strip the producer evaluated `-acc_iv` to a negative slot at the 2nd K-tile and
+TOGSim aborted with "Key does not exist in ... tag table".
+
 ## 4. Components
 
 - `build_skeleton.py` + `dep_analysis.py` — in-place reduction of post-vcix to
diff --git a/tests/test_togsim_skeleton.py b/tests/test_togsim_skeleton.py
index c3644a4b..7bf35eb9 100644
--- a/tests/test_togsim_skeleton.py
+++ b/tests/test_togsim_skeleton.py
@@ -86,6 +86,70 @@ def test_build_skeleton_on_fixture():
     print(report)
 
 
+@pytest.mark.skipif(not _mlir_available(), reason="MLIR Python bindings not installed")
+def test_strip_accum_terms_drops_reduction_marker():
+    """Regression: the dma_wait tag index built by lower_to_vcix carries a `-d_i`
+    term for each accumulation (reduction) loop var -- a sentinel marker, not an
+    offset. build_skeleton must drop those so a memory_barrier waits on the same
+    subtile slot the async load wrote; otherwise the producer evaluates `-acc_iv`
+    to a negative slot at reduction iteration > 0, the recorded barrier slot
+    diverges from the load slot, and TOGSim aborts with "Key does not exist in ...
+    tag table" on subtile + multi-tile-K. See docs/design/togsim_cpp_trace.md and
+    legacy TileGraphParser.cc (which skips stride -1 for the same reason)."""
+    import sys
+    sys.path.insert(0, str(_ROOT))
+    from PyTorchSimFrontend.mlir.passes import build_skeleton as bs
+
+    import mlir.ir as ir
+    ctx = ir.Context()
+    ctx.allow_unregistered_dialects = True
+    with ctx, ir.Location.unknown(ctx):
+        module = ir.Module.parse(
+            "func.func @k() {\n"
+            "  %r = arith.constant 1 : index\n"   # stand-in reduction iv
+            "  %a = arith.constant 0 : index\n"   # subtile dim 1
+            "  %b = arith.constant 0 : index\n"   # subtile dim 2
+            "  return\n"
+            "}", ctx)
+        block = module.body.operations[0].regions[0].blocks[0]
+        consts = [op.results[0] for op in block.operations if op.name == "arith.constant"]
+        anchor = [op for op in block.operations if op.name == "func.return"][0]
+        r, a, b = consts
+
+        def neg_dims(val):
+            amap = ir.AffineMapAttr(val.owner.attributes["map"]).value
+            return [p for p in (bs._neg_coeff_dim(s) for s in bs._flatten_add(amap.results[0]))
+                    if p is not None]
+
+        # #map8-style: -d0 (reduction) + d1 + d2 floordiv 2.
+        d0, d1, d2 = (ir.AffineDimExpr.get(i) for i in range(3))
+        expr = d0 * -1 + d1 + ir.AffineExpr.get_floor_div(d2, 2)
+        with ir.InsertionPoint(anchor):
+            apply = ir.Operation.create(
+                "affine.apply", results=[ir.IndexType.get()], operands=[r, a, b],
+                attributes={"map": ir.AffineMapAttr.get(ir.AffineMap.get(3, 0, [expr]))})
+        tag_in = apply.results[0]
+        assert neg_dims(tag_in) == [0]                       # the reduction marker is present
+
+        tag_out = bs._strip_accum_terms(ctx, tag_in, anchor)
+        assert tag_out is not tag_in                         # a new, reduced apply was emitted
+        out_map = ir.AffineMapAttr(tag_out.owner.attributes["map"]).value
+        assert out_map.n_dims == 2                           # the reduction dim was dropped
+        assert neg_dims(tag_out) == []                       # no reduction marker remains
+        assert list(tag_out.owner.operands) == [a, b]        # only the subtile operands survive
+
+        # No-op: an index with no reduction marker is returned unchanged.
+        plain = d0 + d1
+        with ir.InsertionPoint(anchor):
+            papply = ir.Operation.create(
+                "affine.apply", results=[ir.IndexType.get()], operands=[a, b],
+                attributes={"map": ir.AffineMapAttr.get(ir.AffineMap.get(2, 0, [plain]))})
+        pin = papply.results[0]
+        assert bs._strip_accum_terms(ctx, pin, anchor) is pin
+
+        assert module.operation.verify()
+
+
 @pytest.mark.skipif(not _mlir_available(), reason="MLIR Python bindings not installed")
 def test_cycle_table_on_fixture():
     fixture = os.environ.get("TOGSIM_SKELETON_FIXTURE")

From b189df4b32393e61a55641ea0c7aa2eb6681cd69 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 04/24] [TOGSim] Work-item outlining and ABI v12 dispatch

Replace the bare togsim_core_alloc marker with togsim_dispatch, which brackets
each work-item with TILE_BEGIN/TILE_END, and outline each work-item body into
togsim_kernel_tile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 .../mlir/passes/lower_to_emitc.py             | 195 ++++++++++++++++--
 PyTorchSimFrontend/mlir/passes/togsim_ops.py  |  13 +-
 TOGSim/include/togsim_loader.h                |   4 +-
 TOGSim/include/togsim_runtime.h               |  40 +++-
 TOGSim/src/togsim_runtime.cc                  |  17 +-
 TOGSim/src/togsim_trace_bridge.cc             |  12 +-
 docs/design/togsim_cpp_trace.md               |  39 ++--
 tests/test_togsim_emitc.py                    |   9 +-
 tests/test_togsim_runtime.py                  |   2 +-
 tests/test_togsim_skeleton.py                 |   5 +-
 10 files changed, 272 insertions(+), 64 deletions(-)
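
Reviewer note: a minimal Python model of the v12 dispatch contract described
above, to make the round-robin and TILE_BEGIN/TILE_END bracketing concrete. It
is a sketch, not the runtime: the `EmitCtx` dataclass, the tuple-shaped trace
records, and the `tiles_m`/`tiles_n` parameters are assumptions made for
illustration (the real `togsim_kernel` takes only the context, and the real
records are `TraceRec` structs).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EmitCtx:
    """Toy stand-in for the runtime context (hypothetical fields)."""
    num_cores: int
    rr: int = 0            # round-robin counter
    cur_core: int = 0      # core bound to the work-item being emitted
    trace: List[Tuple[str, int, tuple]] = field(default_factory=list)


TileFn = Callable[[EmitCtx, List[int], int], None]


def togsim_dispatch(ctx: EmitCtx, fn: TileFn, iv: List[int], n_iv: int) -> None:
    """Pick a core round-robin, then bracket the tile fn call with begin/end."""
    if ctx.num_cores > 0:
        ctx.cur_core = ctx.rr % ctx.num_cores
        ctx.rr += 1
    else:
        ctx.cur_core = 0
    ctx.trace.append(("TILE_BEGIN", ctx.cur_core, ()))
    fn(ctx, iv, n_iv)      # the work-item scope is exactly this call
    ctx.trace.append(("TILE_END", ctx.cur_core, ()))


def togsim_kernel_tile(ctx: EmitCtx, iv: List[int], n_iv: int) -> None:
    """Outlined work-item body: recovers the parallel indices from iv."""
    m, n = iv[0], iv[1]
    ctx.trace.append(("COMPUTE", ctx.cur_core, (m, n)))


def togsim_kernel(ctx: EmitCtx, tiles_m: int, tiles_n: int) -> None:
    """Dispatcher loop: packs the parallel IVs and hands the tile fn over."""
    for m in range(tiles_m):
        for n in range(tiles_n):
            iv = [m, n]
            togsim_dispatch(ctx, togsim_kernel_tile, iv, len(iv))


ctx = EmitCtx(num_cores=2)
togsim_kernel(ctx, 2, 2)
for rec in ctx.trace:
    print(rec)
```

In this model every record emitted by the tile fn sits between one
TILE_BEGIN and its matching TILE_END on the same core, which is the property
the bridge relies on when it closes a work-item at TILE_END.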

diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
index 4ba40e16..6ade7442 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
@@ -170,22 +170,181 @@ def walk(b):
     return found[0]
 
 
-def _insert_core_alloc(ctx, kernel, ctx_val):
-    """Insert `togsim_core_alloc(ctx)` at the start of each parallel work-item:
-    the first op of the innermost PARALLEL loop body (or the kernel entry if the
-    kernel has no parallel loop -> a single work-item). The runtime binds the
-    following ops to the returned core (sec 9.3); the producer never names
-    num_cores. The return value is discarded (no free -- a core is an assignment,
-    not a held resource)."""
-    block = kernel.regions[0].blocks[0]
-    target = _innermost_outer_loop(block)
-    body = target.operation.regions[0].blocks[0] if target is not None else block
-    ir.Operation.create(
-        "emitc.call_opaque", results=[], operands=[ctx_val],
-        attributes={"callee": ir.StringAttr.get(ts.CORE_ALLOC_CALLEE),
-                    "args": ir.ArrayAttr.get([_idx(0)])},
-        loc=ir.Location.unknown(ctx),
-        ip=ir.InsertionPoint.at_block_begin(body))
+def _is_outer(forop):
+    a = forop.operation.attributes
+    return "outer_loop" in a and ir.BoolAttr(a["outer_loop"]).value
+
+
+def _parallel_loop_chain(block):
+    """The nested chain of `affine.for {outer_loop}` from `block` inward (one
+    work-item's parallel indices). Empty if the kernel has no parallel loop."""
+    chain = []
+    cur = block
+    while True:
+        nxt = None
+        for op in cur.operations:
+            if op.operation.name == "affine.for" and _is_outer(op):
+                nxt = op
+                break
+        if nxt is None:
+            break
+        chain.append(nxt)
+        cur = nxt.operation.regions[0].blocks[0]
+    return chain
+
+
+def _const_op(value):
+    """The defining arith/emitc constant Operation if `value` is a constant
+    result, else None (block args / other ops)."""
+    owner = value.owner
+    if isinstance(owner, ir.Block):
+        return None
+    return owner if owner.name in ("arith.constant", "emitc.constant") else None
+
+
+def _outline_work_item(ctx, kernel, ctx_val):
+    """Outline the innermost parallel work-item body into a uniform
+    `togsim_kernel_tile(ctx, iv, n)` func, replacing it with a
+    `togsim_dispatch(ctx, togsim_kernel_tile, iv, n)` call (sec 9.3). The
+    work-item SCOPE becomes the function body; the runtime wrapper owns the
+    core-alloc + the TILE_BEGIN/TILE_END boundary (a decorator). One uniform tile
+    signature -> a single general dispatcher serves every kernel.
+
+    Runs after `_rewrite_togsim_ops`, so the moved body holds emitc.call_opaque
+    (not togsim.* ops). The only values captured from outside the body are ctx,
+    the enclosing parallel induction vars, and constants -- threaded via the iv
+    array (parallel IVs) / cloned (constants); anything else is unsupported
+    (dynamic shape -> P4)."""
+    kblk = kernel.regions[0].blocks[0]
+    chain = _parallel_loop_chain(kblk)
+    if chain:
+        L = chain[-1]
+        Lbody = L.operation.regions[0].blocks[0]
+        ivs = [c.operation.regions[0].blocks[0].arguments[0] for c in chain]
+    else:                       # no parallel loop -> the whole kernel body is one work-item
+        L = None
+        Lbody = kblk
+        ivs = []
+
+    i64 = ir.IntegerType.get_signless(64)
+    i32 = ir.IntegerType.get_signless(32)
+    idxty = ir.IndexType.get()
+    ctxty = ir.Type.parse(CTX_TYPE, ctx)
+    i64ptr = ir.Type.parse("!emitc.ptr<i64>", ctx)
+    loc = ir.Location.unknown(ctx)
+
+    # --- the outlined tile function (before the kernel so C defines it first) ---
+    tile = ir.Operation.create(
+        "func.func", results=[], regions=1,
+        attributes={
+            "function_type": ir.TypeAttr.get(ir.FunctionType.get([ctxty, i64ptr, i32], [])),
+            "sym_name": ir.StringAttr.get(ts.TILE_SYMBOL),
+            "sym_visibility": ir.StringAttr.get("private")},
+        loc=loc, ip=ir.InsertionPoint(kernel))
+    with loc:
+        tblk = tile.regions[0].blocks.append(ctxty, i64ptr, i32)
+    ctx2, iv2, _n2 = tblk.arguments
+    with ir.InsertionPoint(tblk):
+        tret = ir.Operation.create("func.return", results=[], operands=[], loc=loc)
+
+    # in the tile fn: recover each parallel index = index_cast(iv[k]).
+    idx_vals = []
+    with ir.InsertionPoint(tret):
+        for k in range(len(ivs)):
+            kc = ir.Operation.create("emitc.constant", results=[i64],
+                    attributes={"value": ir.IntegerAttr.get(i64, k)}, loc=loc).results[0]
+            elem = ir.Operation.create("emitc.subscript", results=[i64],
+                    operands=[iv2, kc], loc=loc).results[0]
+            idx_vals.append(ir.Operation.create("arith.index_cast", results=[idxty],
+                    operands=[elem], loc=loc).results[0])
+
+    # move the work-item body into the tile fn (terminators stay behind).
+    for op in [o for o in Lbody.operations
+               if o.operation.name not in ("affine.yield", "func.return")]:
+        op.operation.move_before(tret)
+
+    # remap captures (Value `==` is identity): ctx -> ctx2, each parallel IV ->
+    # its index_cast, each external constant -> a clone inside the tile fn. A
+    # constant defined inside the tile fn (moved/read) is internal -> left alone.
+    caps = [(ctx_val, ctx2)] + list(zip(ivs, idx_vals))
+    internal_consts = []
+    def _collect_internal(block):
+        for op in block.operations:
+            c = _const_op(op.operation.results[0]) if len(op.operation.results) == 1 else None
+            if c is not None:
+                internal_consts.append(op.operation.results[0])
+            for rg in op.operation.regions:
+                for b in rg.blocks:
+                    _collect_internal(b)
+    _collect_internal(tblk)
+    const_clones = []
+    ext_consts = []
+    def _find_ext_consts(block):
+        for op in block.operations:
+            for opnd in op.operation.operands:
+                if _const_op(opnd) is None:
+                    continue
+                if any(opnd == ic for ic in internal_consts):
+                    continue
+                if any(opnd == e for e in ext_consts):
+                    continue
+                ext_consts.append(opnd)
+            for rg in op.operation.regions:
+                for b in rg.blocks:
+                    _find_ext_consts(b)
+    _find_ext_consts(tblk)
+    top = ir.InsertionPoint(tblk.operations[0])
+    for e in ext_consts:
+        c = _const_op(e)
+        clone = ir.Operation.create(c.name, results=[e.type],
+                    attributes={"value": c.attributes["value"]}, loc=loc, ip=top).results[0]
+        const_clones.append((e, clone))
+
+    allcaps = caps + const_clones
+    def _remap(block):
+        for op in block.operations:
+            for i in range(len(op.operation.operands)):
+                cur = op.operation.operands[i]
+                for orig, new in allcaps:
+                    if cur == orig:
+                        op.operation.operands[i] = new
+                        break
+            for rg in op.operation.regions:
+                for b in rg.blocks:
+                    _remap(b)
+    _remap(tblk)
+
+    # --- the dispatcher: marshal the IVs and hand the tile fn to togsim_dispatch ---
+    term = [o for o in Lbody.operations
+            if o.operation.name in ("affine.yield", "func.return")][0]
+    fn_ref = _opaque(ctx, ts.TILE_SYMBOL)   # function name -> verbatim pointer in C
+    with ir.InsertionPoint(term):
+        if ivs:
+            arrty = ir.Type.parse("!emitc.array<%dxi64>" % len(ivs), ctx)
+            arr = ir.Operation.create("emitc.variable", results=[arrty],
+                    attributes={"value": _opaque(ctx, "")}, loc=loc).results[0]
+            for k, iv in enumerate(ivs):
+                kc = ir.Operation.create("emitc.constant", results=[i64],
+                        attributes={"value": ir.IntegerAttr.get(i64, k)}, loc=loc).results[0]
+                v64 = ir.Operation.create("arith.index_cast", results=[i64],
+                        operands=[iv], loc=loc).results[0]
+                sub = ir.Operation.create("emitc.subscript", results=[i64],
+                        operands=[arr, kc], loc=loc).results[0]
+                # emitc.assign operands are (lvalue dest, value).
+                ir.Operation.create("emitc.assign", results=[], operands=[sub, v64], loc=loc)
+            ir.Operation.create(
+                "emitc.call_opaque", results=[], operands=[ctx_val, arr],
+                attributes={"callee": ir.StringAttr.get(ts.DISPATCH_CALLEE),
+                            "args": ir.ArrayAttr.get(
+                                [_idx(0), fn_ref, _idx(1), ir.IntegerAttr.get(i32, len(ivs))])},
+                loc=loc)
+        else:
+            ir.Operation.create(
+                "emitc.call_opaque", results=[], operands=[ctx_val],
+                attributes={"callee": ir.StringAttr.get(ts.DISPATCH_CALLEE),
+                            "args": ir.ArrayAttr.get(
+                                [_idx(0), fn_ref, _opaque(ctx, "nullptr"), ir.IntegerAttr.get(i32, 0)])},
+                loc=loc)
 
 
 def _rewrite_togsim_ops(ctx, kernel, ctx_val):
@@ -343,8 +502,8 @@ def lower_to_emitc(skeleton_module):
 
     _strip_aux(skeleton_module)
     ctx_val = _rewrite_signature(kernel, ctx)
-    _insert_core_alloc(ctx, kernel, ctx_val)          # core_alloc per work-item
-    _rewrite_togsim_ops(ctx, kernel, ctx_val)
+    _rewrite_togsim_ops(ctx, kernel, ctx_val)         # togsim.* -> emitc.call_opaque
+    _outline_work_item(ctx, kernel, ctx_val)          # work-item body -> togsim_kernel_tile + dispatch
 
     PassManager.parse(_PIPELINE, ctx).run(skeleton_module.operation)
 
diff --git a/PyTorchSimFrontend/mlir/passes/togsim_ops.py b/PyTorchSimFrontend/mlir/passes/togsim_ops.py
index c2be060d..740a8f2f 100644
--- a/PyTorchSimFrontend/mlir/passes/togsim_ops.py
+++ b/PyTorchSimFrontend/mlir/passes/togsim_ops.py
@@ -69,10 +69,15 @@
 #: producer entry-point symbol the TOGSim loader resolves (see togsim_runtime.h).
 ENTRY_SYMBOL = "togsim_kernel"
 
-#: runtime callee emitted directly by lower_to_emitc (not a skeleton op): the
-#: per-work-item core allocation. See togsim_cpp_trace.md sec 9.3. Kept in
-#: lockstep with togsim_runtime.h.
-CORE_ALLOC_CALLEE = "togsim_core_alloc"
+#: outlined per-work-item function the dispatcher hands to togsim_dispatch
+#: (uniform signature (ctx, int64* iv, i32 n); see togsim_cpp_trace.md sec 9.3).
+TILE_SYMBOL = "togsim_kernel_tile"
+
+#: runtime callees emitted directly by lower_to_emitc (not skeleton ops), kept in
+#: lockstep with togsim_runtime.h. DISPATCH_CALLEE is the higher-order wrapper the
+#: dispatcher loop calls per work-item (round-robins a core + TILE_BEGIN/END);
+#: TILE_SYMBOL is passed to it as the function pointer.
+DISPATCH_CALLEE = "togsim_dispatch"
 
 # ---- attribute keys -------------------------------------------------------
 ATTR_DIR       = "dir"        # i32: DIR_LOAD | DIR_STORE
diff --git a/TOGSim/include/togsim_loader.h b/TOGSim/include/togsim_loader.h
index 5df94b75..6c1273ee 100644
--- a/TOGSim/include/togsim_loader.h
+++ b/TOGSim/include/togsim_loader.h
@@ -23,8 +23,8 @@ namespace togsim {
 
 // One modeled instruction recorded by the runtime callbacks.
 struct TraceRec {
-  enum Kind { DISPATCH, DMA, COMPUTE, MEMORY_BAR, COMPUTE_BAR } kind;
-  int32_t  core;          // work-item -> core binding (set by togsim_core_alloc)
+  enum Kind { TILE_BEGIN, TILE_END, DMA, COMPUTE, MEMORY_BAR, COMPUTE_BAR } kind;
+  int32_t  core;          // work-item -> core binding (set by togsim_dispatch)
   // DMA / MEMORY_BAR
   int32_t  dir;           // togsim_dma_dir
   int32_t  arg_id;        // tensor
diff --git a/TOGSim/include/togsim_runtime.h b/TOGSim/include/togsim_runtime.h
index 6748b9e6..e8fd6b84 100644
--- a/TOGSim/include/togsim_runtime.h
+++ b/TOGSim/include/togsim_runtime.h
@@ -68,7 +68,17 @@ extern "C" {
 //                  %tag[%idx], so only a runtime key can pair them. Drops
 //                  togsim_wait/signal/wait_all/event_alloc/event_free + the
 //                  togsim_event handle (no compile-time pairing token).
-#define TOGSIM_ABI_VERSION 11
+//   v11 -> v12 (P3 sec9.3): replace the bare togsim_core_alloc marker with a
+//                  higher-order togsim_dispatch(ctx, tile_fn, iv, n_iv) wrapper.
+//                  The producer outlines each parallel work-item into a uniform
+//                  togsim_kernel_tile(ctx, iv, n) and the dispatcher loop hands it
+//                  to togsim_dispatch, which round-robins a core and brackets the
+//                  call with TILE_BEGIN/TILE_END. The work-item scope is now the
+//                  function call itself (no implicit "until the next core_alloc"
+//                  range); one general dispatcher serves every kernel (uniform
+//                  iv-array ABI). Core alloc + the begin/end boundary are
+//                  runtime-owned.
+#define TOGSIM_ABI_VERSION 12
 int32_t togsim_abi_version(void);
 
 // Opaque per-invocation context owned by TOGSim. Holds the record sink and the
@@ -132,15 +142,25 @@ void togsim_compute(EmitCtx* ctx, uint64_t tile_id, int32_t compute_type,
 void togsim_memory_barrier(EmitCtx* ctx, int32_t tag_id, uint64_t tag_slot,
                            const int64_t* write_bufs, int32_t n_write);
 
-// Core allocation (sec 9.3). The producer calls togsim_core_alloc at the start
-// of each parallel work-item (output tile); the ops that follow are bound to the
-// returned core until the next togsim_core_alloc. No free: a core is just an
-// assignment, not a held resource. The producer is core-count transparent: it
-// NEVER names num_cores or a physical core -- the runtime owns the pool and the
-// policy (round-robin etc.). A work-item's whole reduction sits after one alloc,
-// so it stays on the same core; different work-items get different cores ->
-// multi-core.
-int32_t togsim_core_alloc(EmitCtx* ctx);
+// A parallel work-item body, outlined by the producer (sec 9.3). Uniform across
+// kernels: it takes the EmitCtx, the packed parallel loop indices `iv` (the
+// half-open range iv[0, n_iv), e.g. the (m,n) output-tile indices) and their
+// count. The body emits the work-item's ops (init / reduction / store). One
+// signature => one general dispatcher serves every kernel.
+// (iv is non-const to match the `int64_t*` the EmitC producer emits; the runtime
+// only reads it.)
+typedef void (*togsim_tile_fn)(EmitCtx* ctx, int64_t* iv, int32_t n_iv);
+
+// Dispatch one work-item (sec 9.3). The runtime round-robins a core from the
+// pool, brackets the call with TILE_BEGIN/TILE_END (the work-item boundary), and
+// invokes `fn(ctx, iv, n_iv)` -- so the work-item SCOPE is exactly the function
+// call, not an implicit "ops until the next alloc" range. Core alloc + boundary
+// are runtime-owned; the producer is core-count transparent (never names
+// num_cores or a physical core). Independent work-items land on different cores
+// -> multi-core. A general (kernel-independent) wrapper: it only forwards the
+// opaque iv array to fn.
+void togsim_dispatch(EmitCtx* ctx, togsim_tile_fn fn,
+                     int64_t* iv, int32_t n_iv);
 
 // Compute fence: drain in-flight async compute (the systolic-array matmuls)
 // before the following op (a store) consumes their result. Explicit barrier in
diff --git a/TOGSim/src/togsim_runtime.cc b/TOGSim/src/togsim_runtime.cc
index ca18fd29..86de081e 100644
--- a/TOGSim/src/togsim_runtime.cc
+++ b/TOGSim/src/togsim_runtime.cc
@@ -51,12 +51,16 @@ extern "C" {
 
 int32_t togsim_abi_version(void) { return TOGSIM_ABI_VERSION; }
 
-int32_t togsim_core_alloc(EmitCtx* ctx) {
-  // Round-robin a core from the pool; the producer never sees num_cores. Binds
-  // it as the current core for the ops that follow (the work-item's reduction).
+void togsim_dispatch(EmitCtx* ctx, togsim_tile_fn fn, int64_t* iv, int32_t n_iv) {
+  // Higher-order work-item wrapper (sec 9.3): round-robin a core (the producer
+  // never sees num_cores), bracket the work-item with TILE_BEGIN/TILE_END, and
+  // run its body. The work-item SCOPE is exactly this fn call -- the begin/end
+  // are runtime-owned, so the producer never relies on an implicit "ops until
+  // the next alloc" boundary. The ops fn emits record under ctx->cur_core.
   ctx->cur_core = ctx->num_cores > 0 ? (ctx->rr++ % ctx->num_cores) : 0;
-  ctx->trace.push_back(blank(togsim::TraceRec::DISPATCH, ctx->cur_core));
-  return ctx->cur_core;
+  ctx->trace.push_back(blank(togsim::TraceRec::TILE_BEGIN, ctx->cur_core));
+  fn(ctx, iv, n_iv);
+  ctx->trace.push_back(blank(togsim::TraceRec::TILE_END, ctx->cur_core));
 }
 
 void togsim_dma(EmitCtx* ctx, int32_t dir, int32_t arg_id,
@@ -181,7 +185,8 @@ SimResult simulate(const RunResult& run, const TimingParams& params) {
         out.n_compute++;
         break;
       }
-      case TraceRec::DISPATCH:
+      case TraceRec::TILE_BEGIN:
+      case TraceRec::TILE_END:
       case TraceRec::COMPUTE_BAR:
         break;  // work-item boundary / compute fence: no cost in this reference timer
     }
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
index 8d0dc721..164b04c0 100644
--- a/TOGSim/src/togsim_trace_bridge.cc
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -150,15 +150,21 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
   };
 
   for (const auto& t : run.trace) {
-    if (t.kind == TraceRec::DISPATCH) {
-      // new work-item -> new subgraph (bound to its core) + tile.
+    if (t.kind == TraceRec::TILE_BEGIN) {
+      // togsim_dispatch opened a work-item -> new subgraph (bound to its core) +
+      // tile. The scope runs until the matching TILE_END (the dispatch wrapper
+      // brackets the tile fn call), not until the next begin.
       flush();
       sg = std::make_shared<TileSubGraph>();
       sg->set_core_id(t.core);
       tile = std::make_shared<Tile>(Tile::Status::INITIALIZED);
       continue;
     }
-    if (!tile) continue;  // defensive: ops before the first core_alloc
+    if (t.kind == TraceRec::TILE_END) {
+      flush();   // close the work-item explicitly (scope = the tile fn call)
+      continue;
+    }
+    if (!tile) continue;  // defensive: ops before the first TILE_BEGIN
 
     if (t.kind == TraceRec::DMA) {
       int64_t uniq = next_tag++;                         // fresh Core tag key per dma record
diff --git a/docs/design/togsim_cpp_trace.md b/docs/design/togsim_cpp_trace.md
index 6adc609c..9565bdfb 100644
--- a/docs/design/togsim_cpp_trace.md
+++ b/docs/design/togsim_cpp_trace.md
@@ -411,16 +411,20 @@ only `togsim_*` callbacks are visible across the `dlopen` boundary.
   `num_tiles`-style count — not required now).
 - `tile_id -> cycle` table unchanged (num_cores-invariant).
 
-> Implementation status (P3): `lower_to_emitc` inserts the `togsim_core_alloc`
-> marker at the innermost PARALLEL-loop body inside the single `togsim_kernel`
-> function — the emitted *trace* is identical to the two-function form (one
-> core_alloc per work-item, then the work ops). Address arithmetic is wired
-> (approach A): each `togsim_dma` passes `(arg_id, element offset)` with the
-> offset computed from the loop IVs (lowered by `convert-arith-to-emitc`, cast-
-> free thanks to the size_t IV retype); the runtime adds the tensor base.
-> Outlining the work body into a separate `togsim_kernel_tile` is now *meaningful*
-> (the body uses the parallel IVs in the offset) but still deferred — the
-> single-function trace is identical, so the split is cosmetic until needed.
+> Implementation status (P3, ABI v12): `lower_to_emitc` OUTLINES the innermost
+> PARALLEL-loop body into a uniform `togsim_kernel_tile(ctx, iv, n)` func and the
+> dispatcher loop hands it to `togsim_dispatch(ctx, fn, iv, n)` -- a higher-order
+> runtime wrapper that round-robins a core and brackets the call with
+> TILE_BEGIN/TILE_END. The work-item SCOPE is now the function call itself (not an
+> implicit "ops until the next core_alloc" range), and one general dispatcher
+> serves every kernel (uniform iv-array ABI). Earlier this was a single
+> `togsim_kernel` with a bare `togsim_core_alloc` marker; the traced work ops are
+> identical (only DISPATCH became a TILE_BEGIN/TILE_END pair), so cycles are
+> unchanged; the outline only makes the boundary explicit. Address
+> arithmetic is wired (approach A): each `togsim_dma` passes `(arg_id, element
+> offset)` with the offset computed from the loop IVs (lowered by
+> `convert-arith-to-emitc`, cast-free thanks to the size_t IV retype); the runtime
+> adds the tensor base. The parallel IVs reach the tile fn through the iv array.
 
 ### 9.5 Stance and the split-K exception
 
@@ -563,9 +567,11 @@ single forward-compat requirement is that the callback sink is an interface.
    store, signals its tag at data arrival), `togsim_compute` (cycle-table lookup),
    `togsim_memory_barrier` (waits the matching `(tag_id, tag_slot)`),
    `togsim_compute_barrier`.
-2. DONE (single-buffer). `lower_to_emitc`: inserts `togsim_core_alloc` at the
-   work-item boundary, lowers `togsim.memory_barrier`, and reads `loop_type`.
-   (Two-function outline still deferred; trace identical.)
+2. DONE (single-buffer). `lower_to_emitc`: OUTLINES the work-item body into
+   `togsim_kernel_tile(ctx, iv, n)` + a `togsim_dispatch` call at the work-item
+   boundary (ABI v12; was a bare `togsim_core_alloc` marker), lowers
+   `togsim.memory_barrier`, and reads `loop_type`. (Two-function outline DONE;
+   trace identical.)
 3. DONE. Real tile addresses wired (approach A): build_skeleton keeps the DRAM
    index operand on `togsim.dma`; lower_to_emitc passes `(arg_id, offset)` and
    `convert-arith-to-emitc` lowers the offset (size_t IV retype makes it
@@ -963,7 +969,12 @@ numbers; 2518-vs-2698 is the current real-table figure.
    -> a harmless extra edge.
 6. **P5 op coverage.** Only GEMM is exercised. Extend to conv / SDPA / vector / pool.
 7. **P4.** Symbolic/dynamic shape; streaming sink (coroutine, alloc-blocks).
-8. **Two-function outline** (togsim_kernel_tile) -- deferred (trace identical).
+8. **Two-function outline** (togsim_kernel_tile) -- DONE (ABI v12). The work-item
+   body is outlined into a uniform `togsim_kernel_tile(ctx, iv, n)` and run via the
+   higher-order `togsim_dispatch` wrapper (round-robin core + TILE_BEGIN/TILE_END);
+   the work-item scope is now the function call. Trace/cycles identical to the old
+   single-function `togsim_core_alloc` form. One general dispatcher serves every
+   kernel.
 9. **Retire the legacy ONNX-TOG path** once the trace path is stable.
 
 ### 11.3 Next-session context
diff --git a/tests/test_togsim_emitc.py b/tests/test_togsim_emitc.py
index 285b9512..b0bd2d8e 100644
--- a/tests/test_togsim_emitc.py
+++ b/tests/test_togsim_emitc.py
@@ -66,7 +66,8 @@ def _fixture():
                     const int64_t*, int32_t, const int64_t*, int32_t){ ++n_compute; }
 void togsim_memory_barrier(EmitCtx*, int32_t tag_id, uint64_t, const int64_t*, int32_t){
   ++n_membar; if(tag_id<0) ++bad; }   // tag_id pairs it with its async dma
-int32_t togsim_core_alloc(EmitCtx*){ return n_core++; }   // count + assign a core
+void togsim_dispatch(EmitCtx* ctx, togsim_tile_fn fn, int64_t* iv, int32_t n){
+  ++n_core; fn(ctx, iv, n); }   // count a work-item + run its (outlined) body
 void togsim_compute_barrier(EmitCtx*){}
 }
 int main(int argc, char** argv){
@@ -108,10 +109,10 @@ def test_build_trace_so():
                 (ln.split() for ln in nm.splitlines()) if len(parts) >= 2}
         assert syms.get("togsim_kernel") == "T", nm
         assert syms.get("togsim_dma") == "U", nm
-        assert syms.get("togsim_core_alloc") == "U", nm
+        assert syms.get("togsim_dispatch") == "U", nm
         assert syms.get("togsim_memory_barrier") == "U", nm
-        # The per-work-item core alloc is emitted.
-        assert 'emitc.call_opaque "togsim_core_alloc"' in emitc_text
+        # The per-work-item dispatch wrapper is emitted (outlined tile fn).
+        assert 'emitc.call_opaque "togsim_dispatch"' in emitc_text
 
 
 @pytest.mark.skipif(not _tools_ready(),
diff --git a/tests/test_togsim_runtime.py b/tests/test_togsim_runtime.py
index 8a0ae82e..f17bccef 100644
--- a/tests/test_togsim_runtime.py
+++ b/tests/test_togsim_runtime.py
@@ -63,7 +63,7 @@ def _fixture():
   std::vector<uint64_t> dma_a; std::vector<int> dma_arg, dma_dir;
   std::vector<std::pair<int,uint64_t>> async_tags;  // (tag_id, tag_slot) of async dmas
   for (auto& t : r.trace) {
-    if (t.kind == TraceRec::DISPATCH) ndisp++;
+    if (t.kind == TraceRec::TILE_BEGIN) ndisp++;   // one per work-item
     else if (t.kind == TraceRec::DMA) {
       nd++; dma_a.push_back(t.addr);
       dma_arg.push_back(t.arg_id); dma_dir.push_back(t.dir);
diff --git a/tests/test_togsim_skeleton.py b/tests/test_togsim_skeleton.py
index 7bf35eb9..56601966 100644
--- a/tests/test_togsim_skeleton.py
+++ b/tests/test_togsim_skeleton.py
@@ -43,8 +43,9 @@ def test_togsim_ops_contract():
     assert ts.ENTRY_SYMBOL == "togsim_kernel"
     assert ts.ENTRY_SYMBOL in header
 
-    # Runtime callee emitted directly by lower_to_emitc (core alloc).
-    assert ts.CORE_ALLOC_CALLEE in header
+    # Runtime callee emitted directly by lower_to_emitc: the work-item dispatch
+    # wrapper. (The outlined tile fn TILE_SYMBOL is producer-generated.)
+    assert ts.DISPATCH_CALLEE in header
 
     # Direction enum agrees with the header's togsim_dma_dir.
     assert (ts.DIR_LOAD, ts.DIR_STORE) == (0, 1)

From 03b7f11e1b512377176a4612f4fbd6c4e7ce04da Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 05/24] [TOGSim] SRAM-capacity and SA weight-buffer throttle
 for the trace path

Add an SRAM-capacity throttle for trace-path DMA loads with a frozen-state
guard, set the per-core VMEM size in the configs, and add the SA weight-buffer
throttle.
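
Both throttles are driven by config keys. A minimal sketch of the relevant
lines, using the value this patch adds to the TPU configs and the default from
SimulationConfig.h (setting either key to 0 disables that throttle):

    core_spad_size_kb: 16384      # per-core spad capacity in KB (16 MB)
    sa_weight_buffer_depth: 2     # weight tiles held per systolic array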

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 TOGSim/include/Core.h                         |  31 +++
 TOGSim/include/Instruction.h                  |  32 +++
 TOGSim/include/SimulationConfig.h             |  10 +
 TOGSim/include/Simulator.h                    |   3 +
 TOGSim/src/Common.cc                          |   4 +
 TOGSim/src/Core.cc                            | 115 +++++++++-
 TOGSim/src/Simulator.cc                       |  34 +++
 TOGSim/src/togsim_trace_bridge.cc             |  44 ++++
 .../systolic_ws_128x128_c1_booksim_tpuv2.yml  |   3 +
 .../systolic_ws_128x128_c1_booksim_tpuv3.yml  |   3 +
 ...ystolic_ws_128x128_c1_simple_noc_tpuv2.yml |   3 +
 ...ystolic_ws_128x128_c1_simple_noc_tpuv3.yml |   3 +
 ...ic_ws_128x128_c1_simple_noc_tpuv3_half.yml |   3 +
 ...28x128_c1_simple_noc_tpuv3_timing_only.yml |   3 +
 ...ystolic_ws_128x128_c1_simple_noc_tpuv4.yml |   3 +
 .../systolic_ws_128x128_c2_booksim_tpuv3.yml  |   3 +
 ...ws_128x128_c2_booksim_tpuv3_bw_quarter.yml |   3 +
 .../systolic_ws_128x128_c2_chiplet_tpuv3.yml  |   3 +
 ...olic_ws_128x128_c2_chiplet_tpuv3_xnuma.yml |   3 +
 ...ystolic_ws_128x128_c2_simple_noc_tpuv2.yml |   3 +
 ...ystolic_ws_128x128_c2_simple_noc_tpuv3.yml |   3 +
 ...lic_ws_128x128_c2_simple_noc_tpuv3_ils.yml |   3 +
 ..._128x128_c2_simple_noc_tpuv3_partition.yml |   3 +
 ...ystolic_ws_128x128_c2_simple_noc_tpuv4.yml |   3 +
 configs/systolic_ws_8x8_c1_booksim.yml        |   3 +
 configs/systolic_ws_8x8_c1_simple_noc.yml     |   3 +
 scripts/trace_timeline.py                     | 202 ++++++++++++++++++
 27 files changed, 528 insertions(+), 1 deletion(-)
 create mode 100644 scripts/trace_timeline.py

diff --git a/TOGSim/include/Core.h b/TOGSim/include/Core.h
index 286feb5f..0b6f8595 100644
--- a/TOGSim/include/Core.h
+++ b/TOGSim/include/Core.h
@@ -1,6 +1,7 @@
 #pragma once
 #include <robin_hood.h>
 #include <unordered_set>
+#include <map>
 #include <memory>
 #include <vector>
 #include <fmt/core.h>
@@ -24,6 +25,10 @@ class Core {
   Core(uint32_t id, SimulationConfig config);
   ~Core()=default;
   virtual bool running();
+  // True if this core has work actively in flight (DMA / compute pipeline / queues)
+  // that will produce a future finish event -- i.e. running() minus "tiles waiting".
+  // Used by the frozen-state (spad-too-small) guard.
+  bool has_inflight();
   virtual bool can_issue(const std::shared_ptr<Tile>& op);
   virtual void issue(std::shared_ptr<Tile> tile);
   virtual std::shared_ptr<Tile> pop_finished_tile();
@@ -55,6 +60,16 @@ class Core {
   void sa_cycle();
   bool can_issue_compute(std::shared_ptr<Instruction>& inst);
   void update_stats();
+  // SRAM-capacity throttle (sec 10.x): frees the buffer-versions for which `inst`
+  // is the tagged last reader, returning their spad bytes. Called when COMP/MOVOUT issue.
+  void release_sram(const std::shared_ptr<Instruction>& inst);
+  // SA weight-buffer throttle (sec 10.x): pick a systolic array that has a free
+  // weight slot (round-robin among free); -1 if all full -> the preload stalls.
+  int pick_free_weight_sa();
+  // Free weight slots due this cycle: a matmul releases its slot at its
+  // streaming-end (finish - overlapping, when it stops reading the weight),
+  // scheduled at issue in _weight_release_q. Last consumer frees it.
+  void process_weight_releases();
 
   /* Core id & config file */
   const uint32_t _id;
@@ -103,4 +118,20 @@ class Core {
   std::queue<mem_fetch*> _request_queue;
   std::queue<mem_fetch*> _response_queue;
   uint32_t _waiting_write_reqs;
+
+  // SRAM-capacity throttle (sec 10.x). _sram_used = current per-core spad bytes;
+  // _sram_capacity = limit (0 = disabled); _sram_allocs maps a buffer-version id
+  // to its accumulated footprint bytes (freed when its last reader issues).
+  size_t _sram_used = 0;
+  size_t _sram_capacity = 0;
+  std::unordered_map<int64_t, size_t> _sram_allocs;
+
+  // SA weight-buffer throttle (sec 10.x). _weight_slots_used[s] = weights resident
+  // on SA s (loaded by a preload, not yet freed by their last matmul);
+  // _weight_slot_depth = per-SA capacity (0 = disabled -> plain round-robin).
+  std::vector<int> _weight_slots_used;
+  uint32_t _weight_slot_depth = 0;
+  // Pending weight-slot releases keyed by cycle (each matmul's streaming-end);
+  // process_weight_releases() drains those due and decrements the token.
+  std::multimap<cycle_type, std::shared_ptr<WeightToken>> _weight_release_q;
 };
\ No newline at end of file
diff --git a/TOGSim/include/Instruction.h b/TOGSim/include/Instruction.h
index b513ee5c..e5a8dc4a 100644
--- a/TOGSim/include/Instruction.h
+++ b/TOGSim/include/Instruction.h
@@ -18,6 +18,11 @@
 //              fence before a store consumes async matmul results (sec 10.7).
 enum class Opcode { MOVIN, MOVOUT, COMP, MEMORY_BAR, COMPUTE_BAR, COUNT};
 
+// One weight slot on systolic array `sa` (sec 10.x). A preload sets refcount to the
+// number of matmuls reusing the weight; each decrements it at its streaming-end and
+// the last one releases the slot. Shared (shared_ptr) by the preload's matmul consumers.
+struct WeightToken { int sa; int refcount; };
+
 typedef uint64_t addr_type;
 typedef uint64_t cycle_type;
 
@@ -38,6 +43,13 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   // successor overlaps it instead of waiting its full latency (sec 10.7).
   void add_pipeline_child(std::shared_ptr<Instruction> child);
   void release_pipeline_children();
+  // SA weight-buffer model: the SA this op is pinned to (a preload picks it, its
+  // matmul consumers inherit it) and the shared weight slot the matmuls release.
+  const std::set<std::shared_ptr<Instruction>>& get_pipeline_children() { return _pipeline_children; }
+  void set_assigned_sa(int s) { _assigned_sa = s; }
+  int get_assigned_sa() const { return _assigned_sa; }
+  void set_weight_token(const std::shared_ptr<WeightToken>& t) { _weight_token = t; }
+  const std::shared_ptr<WeightToken>& get_weight_token() const { return _weight_token; }
   bool check_ready() { return ready_counter == 0; }
   const Opcode get_opcode() { return opcode; }
   bool is_dma_read() { return opcode == Opcode::MOVIN; }
@@ -95,6 +107,20 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   std::set<std::shared_ptr<Instruction>>& get_child_inst() { return child_inst; }
   uint64_t get_global_inst_id() const { return _global_inst_id; }
 
+  // SRAM-capacity model (sec 10.x). A load contributes its footprint to a
+  // buffer-version allocation; the version is freed when its LAST consumer (the
+  // program-order-last reader, tagged by the bridge) issues. The bridge fills
+  // these; Core enforces them.
+  //   _sram_alloc_id      : which buffer-version this load fills (-1 = untracked)
+  //   _sram_release_allocs: versions this consumer frees on issue (tagged only on
+  //                         each version's last reader)
+  void set_sram_alloc(int64_t id) { _sram_alloc_id = id; }
+  int64_t get_sram_alloc() const { return _sram_alloc_id; }
+  void add_sram_release(int64_t id) { _sram_release_allocs.push_back(id); }
+  const std::vector<int64_t>& get_sram_release() const { return _sram_release_allocs; }
+  // bytes this load occupies in the spad (from the tile it moves in).
+  size_t sram_footprint() const { return _tile_numel * (_elem_bits / 8); }
+
   cycle_type start_cycle;
   cycle_type finish_cycle;
   cycle_type bubble_cycle=0;
@@ -133,4 +159,10 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   bool _is_indirect_mode=false;
   bool _is_sparse_inst=false;
   std::string _indirect_index_path="";
+  // SRAM-capacity model (see the setters above).
+  int64_t _sram_alloc_id = -1;
+  std::vector<int64_t> _sram_release_allocs;
+  // SA weight-buffer model (see the setters above).
+  int _assigned_sa = -1;
+  std::shared_ptr<WeightToken> _weight_token;
 };
\ No newline at end of file
diff --git a/TOGSim/include/SimulationConfig.h b/TOGSim/include/SimulationConfig.h
index 2ef08618..7785ff7a 100644
--- a/TOGSim/include/SimulationConfig.h
+++ b/TOGSim/include/SimulationConfig.h
@@ -27,6 +27,16 @@ struct SimulationConfig {
   uint32_t num_systolic_array_per_core = 1;
   uint32_t num_stonne_per_core = 1;
   uint32_t num_stonne_port = 1;
+  // Per-core VMEM/spad capacity (KB) for the trace-path DMA throttle (sec 10.x):
+  // a load that would overflow the spad does not issue until a consumer frees a
+  // tile. Provided by the config (the TPU configs set 16384 = 16 MB VMEM). 0 =
+  // unset -> gate disabled (unlimited). Only affects trace-path instructions
+  // (legacy TileGraphParser insts have alloc id -1 -> never gated).
+  uint32_t core_spad_size_kb = 0;
+  // SA weight-buffer depth (sec 10.x): weight tiles a systolic array holds; a
+  // preload stalls until a slot frees (its matmuls reached streaming-end). 2 = weight
+  // double-buffer (convention default, tunable). 0 = disabled.
+  uint32_t sa_weight_buffer_depth = 2;
 
   /* DRAM config */
   DramType dram_type;
diff --git a/TOGSim/include/Simulator.h b/TOGSim/include/Simulator.h
index e3542d51..91baf5b5 100644
--- a/TOGSim/include/Simulator.h
+++ b/TOGSim/include/Simulator.h
@@ -48,6 +48,9 @@ class Simulator {
   void dram_cycle();
   void icnt_cycle();
   bool running();
+  // Spad-too-small guard: if the sim stays frozen (running() but nothing in
+  // flight) past kWedgeThreshold cycles, error out and exit. Called each cycle.
+  void check_frozen();
   void set_cycle_mask();
   uint32_t get_dest_node(mem_fetch *access);
   SimulationConfig _config;
diff --git a/TOGSim/src/Common.cc b/TOGSim/src/Common.cc
index 3f84d885..6f9a74d7 100644
--- a/TOGSim/src/Common.cc
+++ b/TOGSim/src/Common.cc
@@ -64,6 +64,10 @@ SimulationConfig initialize_config(const YAML::Node& config,
   parsed_config.core_freq_mhz = get_config_value<uint32_t>(config, "core_freq_mhz");
   if (config["num_systolic_array_per_core"])
     parsed_config.num_systolic_array_per_core = config["num_systolic_array_per_core"].as<uint32_t>();
+  if (config["core_spad_size_kb"])
+    parsed_config.core_spad_size_kb = config["core_spad_size_kb"].as<uint32_t>();
+  if (config["sa_weight_buffer_depth"])
+    parsed_config.sa_weight_buffer_depth = config["sa_weight_buffer_depth"].as<uint32_t>();
   if (config["num_stonne_per_core"])
     parsed_config.num_stonne_per_core = config["num_stonne_per_core"].as<uint32_t>();
   if (config["num_stonne_port"])
diff --git a/TOGSim/src/Core.cc b/TOGSim/src/Core.cc
index 6ec63b22..1afbbb52 100644
--- a/TOGSim/src/Core.cc
+++ b/TOGSim/src/Core.cc
@@ -17,6 +17,42 @@ Core::Core(uint32_t id, SimulationConfig config)
   _stat_sa_compute_idle_cycle.resize(_num_systolic_array_per_core);
   _stat_inst_count.resize(static_cast<size_t>(Opcode::COUNT), 0);
   _stat_tot_skipped_inst.resize(static_cast<size_t>(Opcode::COUNT), 0);
+  _sram_capacity = (size_t)config.core_spad_size_kb * 1024;  // 0 = throttle disabled
+  _weight_slot_depth = config.sa_weight_buffer_depth;        // 0 = disabled (plain rr)
+  _weight_slots_used.resize(_num_systolic_array_per_core, 0);
+}
+
+// Pick (round-robin) a systolic array that still has a free weight slot; -1 if all full
+// (the preload must stall). Advances _systolic_array_rr past the chosen SA.
+int Core::pick_free_weight_sa() {
+  for (uint32_t i = 0; i < _num_systolic_array_per_core; i++) {
+    uint32_t s = (_systolic_array_rr + i) % _num_systolic_array_per_core;
+    if (_weight_slots_used[s] < (int)_weight_slot_depth) {
+      _systolic_array_rr = (s + 1) % _num_systolic_array_per_core;
+      return (int)s;
+    }
+  }
+  return -1;
+}
+
+void Core::process_weight_releases() {
+  while (!_weight_release_q.empty() && _weight_release_q.begin()->first <= _core_cycle) {
+    auto tok = _weight_release_q.begin()->second;
+    _weight_release_q.erase(_weight_release_q.begin());
+    if (--tok->refcount <= 0) _weight_slots_used[tok->sa]--;  // last reader frees the slot
+  }
+}
+
+// The LAST reader of a buffer-version issued (bridge tags only that consumer):
+// free the version's bytes back to the per-core spad.
+void Core::release_sram(const std::shared_ptr<Instruction>& inst) {
+  if (!_sram_capacity) return;
+  for (int64_t id : inst->get_sram_release()) {
+    auto it = _sram_allocs.find(id);
+    if (it == _sram_allocs.end()) continue;
+    _sram_used -= it->second;
+    _sram_allocs.erase(it);
+  }
 }
 
 bool Core::can_issue(const std::shared_ptr<Tile>& op) {
@@ -200,6 +236,8 @@ void Core::cycle() {
   /* Increase core cycle counter */
   _core_cycle++;
 
+  process_weight_releases();  // free weight slots due this cycle before dispatch
+
   /* Iterate tile while an instruction is issued */
   bool issued = false;
 
@@ -240,6 +278,22 @@ void Core::cycle() {
               _stat_tot_skipped_inst.at(static_cast<size_t>(inst->get_opcode()))++;
               break;
             } else {
+              // SRAM-capacity gate (sec 10.x): a load that would overflow the
+              // per-core spad does not issue this cycle -- leave it in the ready
+              // queue (it++ retries next cycle) until a consumer frees a tile. On
+              // issue, occupy its bytes under its buffer-version allocation.
+              if (_sram_capacity && inst->get_sram_alloc() >= 0) {
+                size_t F = inst->sram_footprint();
+                // Stall if the tile does not fit in the free spad right now. If
+                // it can never fit (the kernel's working set exceeds the whole
+                // spad), the sim wedges -- Simulator::check_frozen() detects that frozen
+                // state and exits with a "spad too small" error rather than
+                // looping forever.
+                if (_sram_used + F > _sram_capacity)
+                  break;                                       // not issued -> retry next cycle
+                _sram_used += F;
+                _sram_allocs[inst->get_sram_alloc()] += F;     // accumulate version footprint
+              }
               core_trace_log::trace_instruction_line(_core_cycle,
                                                        _id,
                                                        TraceLogTag::pad15(
@@ -254,6 +308,7 @@ void Core::cycle() {
             }
           }
         case Opcode::MOVOUT:
+          release_sram(inst);   // store issued -> free the tiles it drained
           core_trace_log::trace_instruction_line(_core_cycle,
                                                    _id,
                                                    TraceLogTag::pad15(TraceLogTag::kInstructionIssued),
@@ -265,11 +320,48 @@ void Core::cycle() {
           break;
         case Opcode::COMP:
           {
+            const int ct = inst->get_compute_type();
+            // --- SA selection + weight-buffer gate (sec 10.x) ---
+            // A preload picks a systolic array with a free weight slot and pins
+            // its matmul consumers to that SA (they free the slot at streaming-end). A
+            // matmul runs on the SA its weight was preloaded into. This both
+            // bounds preload run-ahead and keeps matmuls on their weight's SA.
+            int sa_idx = -1;
+            if (ct == MATMUL || ct == PRELOAD) {
+              if (ct == PRELOAD) {
+                int n_consumers = 0;   // matmuls reusing this weight
+                for (auto& c : inst->get_pipeline_children())
+                  if (c->get_compute_type() == MATMUL) n_consumers++;
+                if (_weight_slot_depth > 0 && n_consumers > 0) {
+                  sa_idx = pick_free_weight_sa();
+                  if (sa_idx < 0) break;            // all weight slots full -> stall (retry)
+                  _weight_slots_used[sa_idx]++;
+                  auto tok = std::make_shared<WeightToken>(WeightToken{sa_idx, n_consumers});
+                  for (auto& c : inst->get_pipeline_children())
+                    if (c->get_compute_type() == MATMUL) {
+                      c->set_assigned_sa(sa_idx);
+                      c->set_weight_token(tok);
+                    }
+                } else {                            // disabled / no consumers -> plain rr
+                  sa_idx = _systolic_array_rr;
+                  _systolic_array_rr = (_systolic_array_rr + 1) % _num_systolic_array_per_core;
+                }
+              } else {                              // MATMUL
+                sa_idx = inst->get_assigned_sa();
+                if (sa_idx < 0) {                   // no preload pinned it -> rr fallback
+                  sa_idx = _systolic_array_rr;
+                  _systolic_array_rr = (_systolic_array_rr + 1) % _num_systolic_array_per_core;
+                }
+              }
+              inst->set_assigned_sa(sa_idx);         // record the SA actually used (for the trace)
+            }
+            release_sram(inst);   // consumer issued -> free the tiles it read
             // sec 10.7: this op is now entering the pipeline -> release its
             // occupancy (pipeline) dependents so a preload/matmul successor
             // overlaps it instead of waiting its full latency.
             inst->release_pipeline_children();
-            auto& target_pipeline = get_compute_pipeline(inst->get_compute_type());
+            auto& target_pipeline = (ct == VECTOR_UNIT) ? _vu_compute_pipeline
+                                                        : _sa_compute_pipeline.at(sa_idx);
             if (target_pipeline.empty()) {
               inst->finish_cycle = _core_cycle + inst->get_compute_cycle();
               inst->bubble_cycle = inst->get_overlapping_cycle();
@@ -280,6 +372,14 @@ void Core::cycle() {
               inst->bubble_cycle = bubble_cycle;
             }
 
+            // Release this matmul's weight slot at its streaming-end (finish -
+            // overlapping), not at full finish (the drain tail does not read it).
+            if (ct == MATMUL && inst->get_weight_token()) {
+              cycle_type rel = inst->finish_cycle > inst->get_overlapping_cycle()
+                                 ? inst->finish_cycle - inst->get_overlapping_cycle() : _core_cycle;
+              _weight_release_q.emplace(rel, inst->get_weight_token());
+            }
+
             if (inst->get_compute_cycle() == 0) {
               inst->finish_instruction();
               static_cast<Tile*>(inst->get_owner())->inc_finished_inst();
@@ -412,6 +512,19 @@ void Core::finish_instruction(std::shared_ptr<Instruction>& inst, InstFinishTrac
                                            core_trace_log::format_instruction_detail_line(*inst));
 }
 
+bool Core::has_inflight() {
+  // running() without the "_tiles.size() > 0" term: work that will produce a
+  // finish event on its own (so the sim is NOT frozen). If this is false but
+  // tiles remain, only stalled ready instructions are left.
+  if (!_vu_compute_pipeline.empty()) return true;
+  for (int i = 0; i < _num_systolic_array_per_core; i++)
+    if (!_sa_compute_pipeline.at(i).empty()) return true;
+  if (!_dma_waiting_queue.empty() || !_dma_finished_queue.empty()) return true;
+  if (!_dma.empty()) return true;
+  if (!_ld_inst_queue.empty() || !_st_inst_queue.empty()) return true;
+  return false;
+}
+
 bool Core::running() {
   bool running = false;
   running = running || _tiles.size() > 0;
diff --git a/TOGSim/src/Simulator.cc b/TOGSim/src/Simulator.cc
index d987d787..03dd7bf9 100644
--- a/TOGSim/src/Simulator.cc
+++ b/TOGSim/src/Simulator.cc
@@ -184,6 +184,38 @@ void Simulator::icnt_cycle() {
   _icnt->cycle();
 }
 
+// Consecutive frozen cycles tolerated before declaring the sim wedged (spad too
+// small). Generous so transient idle never false-fires; a true freeze is constant.
+static constexpr uint64_t kWedgeThreshold = 5000;
+
+// Frozen-state guard: work remains (running()) but nothing is in flight to
+// advance it -- the SRAM throttle can never satisfy a load because the kernel's
+// working set exceeds the whole per-core spad (core_spad_size_kb too small). The
+// state repeats every cycle, so after a margin error out instead of looping
+// forever. `stuck` is function-local-static (one running sim at a time; it resets
+// on any progress).
+void Simulator::check_frozen() {
+  static uint64_t stuck = 0;
+  // In flight = anything that will produce a future state change: icnt/dram busy,
+  // a core with DMA/compute pending, or a tile still schedulable.
+  bool inflight = _icnt->running() || _dram->running();
+  for (int id = 0; id < _n_cores && !inflight; id++) {
+    if (_cores[id]->has_inflight()) inflight = true;
+    else if (!get_partition_scheduler(id)->empty(id)) inflight = true;
+  }
+  if (running() && !inflight) {
+    if (++stuck > kWedgeThreshold) {
+      spdlog::error("[Simulator] simulation wedged at cycle {}: work remains but "
+                    "nothing is in flight -- the per-core spad (core_spad_size_kb) "
+                    "is too small to hold a kernel's working set. Increase it.",
+                    _core_cycles);
+      exit(EXIT_FAILURE);
+    }
+  } else {
+    stuck = 0;
+  }
+}
+
 void Simulator::cycle() {
   while (running() || _core_cycles < 1) {
     set_cycle_mask();
@@ -198,6 +230,8 @@ void Simulator::cycle() {
     // Interconnect cycle
     if (IS_ICNT_CYCLE(_cycle_mask))
       icnt_cycle();
+
+    check_frozen();   // spad-too-small guard (errors out if wedged)
   }
   for (auto &core: _cores) {
     core->check_tag();
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
index 164b04c0..01406793 100644
--- a/TOGSim/src/togsim_trace_bridge.cc
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -149,6 +149,46 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     tile->append_instuction(inst);
   };
 
+  // --- SRAM-capacity tracking (buffer-version allocations, sec 10.x) ---
+  // A coarse tile = one version of its buffer; the fine DMAs that fill it share
+  // one allocation, freed when the version's program-order-last reader issues.
+  // NOT reset in flush(): the spad is one physical per-core resource, so a
+  // buffer reused by the next reduction iter / work-item is a NEW version that
+  // must wait for the old one to free (WAR / double-buffer). Tracked buffers are
+  // the DMA-loaded ones; the accumulator / virtual SA-weights are never written
+  // by a load, so cur_alloc has no entry and they are skipped. (v1: single-core;
+  // multi-core would key cur_alloc/vers by (core, buf).)
+  int64_t next_alloc = 0;
+  std::map<int64_t, int64_t> cur_alloc;   // buf -> current version id
+  std::map<int64_t, bool> open_ver;       // buf -> version still accepting loads
+  struct Ver { std::vector<std::shared_ptr<Instruction>> loads, readers; };
+  std::map<int64_t, Ver> vers;
+  auto sram_on_load = [&](int64_t b, const std::shared_ptr<Instruction>& ld) {
+    if (!cur_alloc.count(b) || !open_ver[b]) {   // a read closed it -> new version
+      cur_alloc[b] = next_alloc++;
+      open_ver[b] = true;
+      vers[cur_alloc[b]] = {};
+    }
+    ld->set_sram_alloc(cur_alloc[b]);
+    vers[cur_alloc[b]].loads.push_back(ld);
+  };
+  auto sram_on_read = [&](int64_t b, const std::shared_ptr<Instruction>& rd) {
+    auto it = cur_alloc.find(b);
+    if (it == cur_alloc.end()) return;           // not a load buffer -> untracked
+    vers[it->second].readers.push_back(rd);
+    open_ver[b] = false;                          // next write starts a new version
+  };
+  auto sram_finalize = [&]() {                    // tag only each version's LAST reader
+    for (auto& kv : vers) {
+      auto& v = kv.second;
+      if (v.readers.empty()) {                    // no consumer -> never freed: untrack
+        for (auto& ld : v.loads) ld->set_sram_alloc(-1);
+        continue;
+      }
+      v.readers.back()->add_sram_release(kv.first);  // it frees the whole version on issue
+    }
+  };
+
   for (const auto& t : run.trace) {
     if (t.kind == TraceRec::TILE_BEGIN) {
       // togsim_dispatch opened a work-item -> new subgraph (bound to its core) +
@@ -183,6 +223,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
         } else {
           link(inst, t.read_bufs, t.write_bufs);
         }
+        for (int64_t b : t.read_bufs) sram_on_read(b, inst);  // store frees what it drains
       } else {                                           // LOAD
         tile->append_instuction(inst);
         // async load: record it as the CURRENT load for this (tag_id, tag_slot)
@@ -193,6 +234,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
         // has no barrier and blocks to arrival itself.
         if (t.is_async) current_dma[{t.tag_id, t.tag_slot}] = {uniq, inst};
         for (int64_t b : t.write_bufs) last_writer[b] = inst;
+        for (int64_t b : t.write_bufs) sram_on_load(b, inst);   // occupy spad
       }
     } else if (t.kind == TraceRec::MEMORY_BAR) {
       // the explicit async-DMA sync (the original dma_wait). Pair with the CURRENT
@@ -211,6 +253,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     } else if (t.kind == TraceRec::COMPUTE) {
       auto inst = make_compute(t);
       link(inst, t.read_bufs, t.write_bufs);
+      for (int64_t b : t.read_bufs) sram_on_read(b, inst);     // frees the tiles it consumes
       if (is_async_compute(t.compute_type)) outstanding_async.push_back(inst);
     } else if (t.kind == TraceRec::COMPUTE_BAR) {
       // explicit compute fence: ready once all outstanding async compute have
@@ -224,5 +267,6 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     }
   }
   flush();
+  sram_finalize();   // readers per version are now final -> tag each version's last reader
   return tg;
 }
diff --git a/configs/systolic_ws_128x128_c1_booksim_tpuv2.yml b/configs/systolic_ws_128x128_c1_booksim_tpuv2.yml
index 6d2537d9..7fea374b 100644
--- a/configs/systolic_ws_128x128_c1_booksim_tpuv2.yml
+++ b/configs/systolic_ws_128x128_c1_booksim_tpuv2.yml
@@ -22,3 +22,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c1_booksim_tpuv3.yml b/configs/systolic_ws_128x128_c1_booksim_tpuv3.yml
index f830419b..3a96b588 100644
--- a/configs/systolic_ws_128x128_c1_booksim_tpuv3.yml
+++ b/configs/systolic_ws_128x128_c1_booksim_tpuv3.yml
@@ -26,3 +26,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c1_simple_noc_tpuv2.yml b/configs/systolic_ws_128x128_c1_simple_noc_tpuv2.yml
index 1a8c60f6..41e267b6 100644
--- a/configs/systolic_ws_128x128_c1_simple_noc_tpuv2.yml
+++ b/configs/systolic_ws_128x128_c1_simple_noc_tpuv2.yml
@@ -25,3 +25,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c1_simple_noc_tpuv3.yml b/configs/systolic_ws_128x128_c1_simple_noc_tpuv3.yml
index ff976784..397f0fb7 100644
--- a/configs/systolic_ws_128x128_c1_simple_noc_tpuv3.yml
+++ b/configs/systolic_ws_128x128_c1_simple_noc_tpuv3.yml
@@ -26,3 +26,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_half.yml b/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_half.yml
index 2ed1bb12..f080fc69 100644
--- a/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_half.yml
+++ b/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_half.yml
@@ -26,3 +26,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_timing_only.yml b/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_timing_only.yml
index 1bcc9bb3..f89661b8 100644
--- a/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_timing_only.yml
+++ b/configs/systolic_ws_128x128_c1_simple_noc_tpuv3_timing_only.yml
@@ -26,3 +26,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 8
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c1_simple_noc_tpuv4.yml b/configs/systolic_ws_128x128_c1_simple_noc_tpuv4.yml
index 39d195b0..ca69d930 100644
--- a/configs/systolic_ws_128x128_c1_simple_noc_tpuv4.yml
+++ b/configs/systolic_ws_128x128_c1_simple_noc_tpuv4.yml
@@ -28,3 +28,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_booksim_tpuv3.yml b/configs/systolic_ws_128x128_c2_booksim_tpuv3.yml
index bf01913b..b7b03e7a 100644
--- a/configs/systolic_ws_128x128_c2_booksim_tpuv3.yml
+++ b/configs/systolic_ws_128x128_c2_booksim_tpuv3.yml
@@ -26,3 +26,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_booksim_tpuv3_bw_quarter.yml b/configs/systolic_ws_128x128_c2_booksim_tpuv3_bw_quarter.yml
index 8c71c528..903ffcbc 100644
--- a/configs/systolic_ws_128x128_c2_booksim_tpuv3_bw_quarter.yml
+++ b/configs/systolic_ws_128x128_c2_booksim_tpuv3_bw_quarter.yml
@@ -34,3 +34,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_chiplet_tpuv3.yml b/configs/systolic_ws_128x128_c2_chiplet_tpuv3.yml
index d058f188..6a234017 100644
--- a/configs/systolic_ws_128x128_c2_chiplet_tpuv3.yml
+++ b/configs/systolic_ws_128x128_c2_chiplet_tpuv3.yml
@@ -28,3 +28,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_chiplet_tpuv3_xnuma.yml b/configs/systolic_ws_128x128_c2_chiplet_tpuv3_xnuma.yml
index 019a0f0f..f0546e56 100644
--- a/configs/systolic_ws_128x128_c2_chiplet_tpuv3_xnuma.yml
+++ b/configs/systolic_ws_128x128_c2_chiplet_tpuv3_xnuma.yml
@@ -27,3 +27,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_simple_noc_tpuv2.yml b/configs/systolic_ws_128x128_c2_simple_noc_tpuv2.yml
index 348babae..08ec26ac 100644
--- a/configs/systolic_ws_128x128_c2_simple_noc_tpuv2.yml
+++ b/configs/systolic_ws_128x128_c2_simple_noc_tpuv2.yml
@@ -25,3 +25,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_simple_noc_tpuv3.yml b/configs/systolic_ws_128x128_c2_simple_noc_tpuv3.yml
index a0985aec..a6e073e9 100644
--- a/configs/systolic_ws_128x128_c2_simple_noc_tpuv3.yml
+++ b/configs/systolic_ws_128x128_c2_simple_noc_tpuv3.yml
@@ -26,3 +26,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_ils.yml b/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_ils.yml
index 166e2e25..5436b3e8 100644
--- a/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_ils.yml
+++ b/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_ils.yml
@@ -29,3 +29,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_partition.yml b/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_partition.yml
index 6119e83d..d928f9d3 100644
--- a/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_partition.yml
+++ b/configs/systolic_ws_128x128_c2_simple_noc_tpuv3_partition.yml
@@ -30,3 +30,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_128x128_c2_simple_noc_tpuv4.yml b/configs/systolic_ws_128x128_c2_simple_noc_tpuv4.yml
index 9100c22a..dd9dfac7 100644
--- a/configs/systolic_ws_128x128_c2_simple_noc_tpuv4.yml
+++ b/configs/systolic_ws_128x128_c2_simple_noc_tpuv4.yml
@@ -28,3 +28,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core VMEM (vector/scratchpad) size: TPUv2/v3/v4 = 16 MB.
+core_spad_size_kb: 16384
diff --git a/configs/systolic_ws_8x8_c1_booksim.yml b/configs/systolic_ws_8x8_c1_booksim.yml
index f46d380e..1593e148 100644
--- a/configs/systolic_ws_8x8_c1_booksim.yml
+++ b/configs/systolic_ws_8x8_c1_booksim.yml
@@ -23,3 +23,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core spad: 8x8 array, 128 KB x 8 = 1 MB.
+core_spad_size_kb: 1024
diff --git a/configs/systolic_ws_8x8_c1_simple_noc.yml b/configs/systolic_ws_8x8_c1_simple_noc.yml
index 1be24b85..b2d16c6a 100644
--- a/configs/systolic_ws_8x8_c1_simple_noc.yml
+++ b/configs/systolic_ws_8x8_c1_simple_noc.yml
@@ -24,3 +24,6 @@ codegen_external_mapping_file: ''
 codegen_autotune_max_retry: 10
 codegen_autotune_template_topk: 4
 codegen_compiler_optimization: all
+
+# Per-core spad: 8x8 array, 128 KB x 8 = 1 MB.
+core_spad_size_kb: 1024
diff --git a/scripts/trace_timeline.py b/scripts/trace_timeline.py
new file mode 100644
index 00000000..7f9ade1e
--- /dev/null
+++ b/scripts/trace_timeline.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+"""Convert a TOGSim `--log_level trace` log into a Chrome Trace Event JSON that
+opens in Perfetto (https://ui.perfetto.dev) or chrome://tracing as an interactive
+timeline (Gantt).
+
+Each instruction becomes one duration slice on one of 3 per-core lanes:
+  dma     -- MOVIN / MOVOUT
+  sa      -- COMP compute_type 1 (matmul) / 2 (preload)
+  vector  -- COMP compute_type 0 (vector)
+grouped per core (pid). Time unit = core cycles. Barriers (MEMORY_BAR/COMPUTE_BAR)
+are not drawn. A compute slice's width is its compute_cycle (the op's own latency),
+not issue->finish (which balloons under pipeline backlog); a DMA slice is the
+actual transfer ASYNC_DMA_ISSUE -> data-ready.
+
+Usage:
+  bin/Simulator --config <yml> --trace_so <so> --cycle_table <tsv> --log_level trace \
+      2>&1 | python scripts/trace_timeline.py -o timeline.json
+  # or
+  python scripts/trace_timeline.py trace.log -o timeline.json
+Then drag timeline.json into https://ui.perfetto.dev .
+"""
+import argparse
+import json
+import re
+import sys
+
+# [cycle][Core C][TAG ][INST_ID=N] OPCODE (detail...)
+_LINE = re.compile(
+    r"\[(\d+)\]\[Core (\d+)\]\[([A-Z_]+)\s*\](?:\[INST_ID=(-?\d+)\])?\s*(\w+)?(.*)")
+
+# Only 3 lanes per core. Barriers are dropped (see _HIDE).
+_LANE = {"MOVIN": "dma", "MOVOUT": "dma"}
+_HIDE = {"MEMORY_BAR", "COMPUTE_BAR", "TILE_BEGIN", "TILE_END"}
+_CT_NAME = {0: "vector", 1: "matmul", 2: "preload"}
+
+
+def _label(opcode, detail):
+    if opcode == "COMP":
+        m = re.search(r"compute_type=(\d+)", detail)
+        ct = int(m.group(1)) if m else -1
+        return _CT_NAME.get(ct, "comp")
+    m = re.search(r"addr_name=(\w+)", detail)
+    return f"{opcode} {m.group(1)}" if m else opcode
+
+
+def _lane(opcode, detail):
+    if opcode == "COMP":
+        m = re.search(r"compute_type=(\d+)", detail)
+        ct = int(m.group(1)) if m else -1
+        return "vector" if ct == 0 else "sa"
+    return _LANE.get(opcode, "dma")
+
+
+def parse(lines):
+    # key = (core, inst_id) -> record
+    insts = {}
+    for ln in lines:
+        m = _LINE.search(ln)
+        if not m:
+            continue
+        cyc, core, tag, iid, opcode, detail = m.groups()
+        if iid is None or opcode is None:
+            continue
+        cyc, core, iid = int(cyc), int(core), int(iid)
+        key = (core, iid)
+        r = insts.setdefault(key, {
+            "core": core, "iid": iid, "opcode": opcode, "detail": detail,
+            "issued": None, "finished": None, "resp": None, "dma_issue": None})
+        if not r["opcode"] or r["opcode"] == opcode:
+            r["opcode"] = opcode
+            if detail.strip():
+                r["detail"] = detail
+        if tag == "INST_ISSUED" and r["issued"] is None:
+            r["issued"] = cyc
+        elif tag == "INST_FINISHED":
+            r["finished"] = cyc
+        elif tag == "DRAM_RESP_DONE":
+            r["resp"] = cyc
+        elif tag == "ASYNC_DMA_ISSUE":   # actual transfer start (DMA engine busy)
+            r["dma_issue"] = cyc
+    return insts
+
+
+def _occ(detail):
+    """(compute_cycle, overlapping_cycle) from a COMP detail string."""
+    cc = re.search(r"compute_cycle=(\d+)", detail)
+    ov = re.search(r"overlapping_cycle=(\d+)", detail)
+    return (int(cc.group(1)) if cc else 0, int(ov.group(1)) if ov else 0)
+
+
+def to_chrome(insts, num_sa=1):
+    """Model each hardware unit as a server and replay its ops in issue order, so
+    real idle gaps (bubbles) show and slices don't nest:
+      dma    : MOVIN/MOVOUT -- 1 DMA engine; slice = actual transfer
+               (ASYNC_DMA_ISSUE -> data-ready).
+      vector : COMP type 0  -- 1 VPU.
+      sa     : COMP type 1/2 -- num_sa systolic arrays, round-robin by issue order.
+    A compute slice's width is compute_cycle - overlapping_cycle (its occupancy =
+    latency minus the tail that overlaps the next op), starting when the unit
+    actually picks it up: start = max(issue, unit_free). num_sa>1 -> lanes sa0.. ."""
+    by_core = {}
+    for r in insts.values():
+        op, detail, core = r["opcode"], r["detail"], r["core"]
+        if op in _HIDE:
+            continue
+        u = by_core.setdefault(core, {"dma": [], "vector": [], "sa": []})
+        if op == "COMP":
+            m = re.search(r"compute_type=(\d+)", detail)
+            ct = int(m.group(1)) if m else -1
+            u["vector" if ct == 0 else "sa"].append(r)
+        else:
+            u["dma"].append(r)
+
+    events, lanes, cores = [], set(), set()
+
+    def add(core, lane, ts, dur, name, r):
+        lanes.add((core, lane))
+        cores.add(core)
+        events.append({"name": name, "cat": lane, "ph": "X", "ts": ts,
+                       "dur": max(dur, 1), "pid": core, "tid": lane,
+                       "args": {"inst_id": r["iid"], "issued": r["issued"],
+                                "finished": r["finished"], "data_ready": r["resp"]}})
+
+    def issue_key(r):
+        return r["issued"] if r["issued"] is not None else 0
+
+    nsa = max(num_sa, 1)
+    for core, u in sorted(by_core.items()):
+        # DMA engine: one server, serialized. A load occupies the engine while it
+        # INJECTS its requests -- [INST_ISSUED, ASYNC_DMA_ISSUE] -- not the response
+        # tail [ASYNC_DMA_ISSUE, resp] (engine is free during that) and not up to
+        # data-ready (which would mask gaps). When a load is blocked from issuing
+        # (spad full), its INST_ISSUED is delayed past the engine-free time, so a
+        # real idle gap appears = the SRAM throttle stalling the DMA.
+        free = 0
+        for r in sorted(u["dma"], key=issue_key):
+            start = r["issued"] if r["issued"] is not None else r["dma_issue"]
+            end = r["dma_issue"]
+            if end is None:  # sync dma / store: no async-issue marker
+                end = r["resp"] if r["resp"] is not None else r["finished"]
+            if start is None:
+                continue
+            if end is None or end < start:
+                end = start + 1
+            start = max(start, free)
+            free = max(end, start + 1)
+            add(core, "dma", start, free - start, _label(r["opcode"], r["detail"]), r)
+        # VPU: one server; slice = occupancy (compute_cycle - overlapping_cycle).
+        free = 0
+        for r in sorted(u["vector"], key=issue_key):
+            if r["issued"] is None:
+                continue
+            cc, ov = _occ(r["detail"])
+            dur = max(cc - ov, 1)
+            start = max(r["issued"], free)
+            free = start + dur
+            add(core, "vector", start, dur, "vector", r)
+        # SA: num_sa servers, round-robin in issue order (mirrors the Core's rr).
+        sa_free = [0] * nsa
+        for i, r in enumerate(sorted(u["sa"], key=issue_key)):
+            if r["issued"] is None:
+                continue
+            s = i % nsa
+            cc, ov = _occ(r["detail"])
+            dur = max(cc - ov, 1)
+            start = max(r["issued"], sa_free[s])
+            sa_free[s] = start + dur
+            lane = "sa" if nsa == 1 else f"sa{s}"
+            add(core, lane, start, dur, _label(r["opcode"], r["detail"]), r)
+
+    for c in sorted(cores):
+        events.append({"name": "process_name", "ph": "M", "pid": c, "tid": 0,
+                       "args": {"name": f"Core {c}"}})
+    order = {"dma": 0, "sa": 1, "sa0": 1, "sa1": 2, "sa2": 3, "sa3": 4, "vector": 8}
+    for c, lane in sorted(lanes, key=lambda x: (x[0], order.get(x[1], 5))):
+        events.append({"name": "thread_name", "ph": "M", "pid": c, "tid": lane,
+                       "args": {"name": lane}})
+        events.append({"name": "thread_sort_index", "ph": "M", "pid": c, "tid": lane,
+                       "args": {"sort_index": order.get(lane, 5)}})
+    return {"traceEvents": events, "displayTimeUnit": "ns"}
+
+
+def main(argv):
+    ap = argparse.ArgumentParser()
+    ap.add_argument("input", nargs="?", help="trace log file (default: stdin)")
+    ap.add_argument("-o", "--out", default="timeline.json")
+    ap.add_argument("-s", "--num-sa", type=int, default=1,
+                    help="systolic arrays per core (num_systolic_array_per_core); "
+                         ">1 splits into sa0..saN-1 lanes")
+    a = ap.parse_args(argv[1:])
+    src = open(a.input) if a.input else sys.stdin
+    insts = parse(src)
+    trace = to_chrome(insts, a.num_sa)
+    with open(a.out, "w") as fh:
+        json.dump(trace, fh)
+    n = sum(1 for e in trace["traceEvents"] if e["ph"] == "X")
+    sys.stderr.write(f"wrote {a.out}: {n} slices -> open in https://ui.perfetto.dev\n")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv))

From 05770cb5b16f2ef7f12208b4f267387ee4d89892 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 06/24] [Tooling] TOGSim trace timeline (Perfetto) and the
 trace emits it needs

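trace_timeline.py groups and colors slices per work-item (tile) and draws
resource-centric DMA lanes (dram-rd / dram-wr). To support this, the trace now
logs the first DRAM response of a DMA instruction (DRAM_RESP_FIRST) and the
assigned systolic array (sa=), and COMPUTE_BAR is scoped to its own dispatch
instead of waiting for every compute pipeline to drain.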

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 TOGSim/include/Instruction.h      |  14 ++++
 TOGSim/include/TraceLogTags.h     |   1 +
 TOGSim/src/Core.cc                |  32 ++++----
 TOGSim/src/CoreTraceLog.cc        |  23 ++++--
 TOGSim/src/Instruction.cc         |   7 +-
 TOGSim/src/togsim_trace_bridge.cc |   6 ++
 scripts/trace_timeline.py         | 131 ++++++++++++++++++++----------
 7 files changed, 148 insertions(+), 66 deletions(-)
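
Reviewer note (not part of the commit): the COMPUTE_BAR change in Core.cc is
the one behavioral change in this patch, so a small model of it may help. The
sketch below is a simplified illustration in Python, not the simulator code.
ComputeBar, can_finish and old_can_finish are hypothetical names, and the cycle
numbers are made up. It only mirrors the two rules visible in the diff: the
fence keeps the maximum finish_cycle fed to it by the computes it gates, and it
finishes once the core cycle reaches that maximum.

```python
class ComputeBar:
    """Hypothetical stand-in for a COMPUTE_BAR instruction."""

    def __init__(self):
        # Max finish_cycle of the async computes this bar gates.
        self.fence_finish = 0

    def update_fence_finish(self, finish_cycle):
        # Same rule as Instruction::update_fence_finish: keep the running max.
        if finish_cycle > self.fence_finish:
            self.fence_finish = finish_cycle

    def can_finish(self, core_cycle):
        # New check: done once the current cycle reaches the fence target.
        return core_cycle >= self.fence_finish


def old_can_finish(pipelines):
    # Previous check: every compute pipeline on the core had to be empty.
    return all(len(p) == 0 for p in pipelines)


bar_a = ComputeBar()
for finish in (120, 150):          # dispatch A's two matmuls (made-up cycles)
    bar_a.update_fence_finish(finish)

# A matmul from an unrelated dispatch B is still in the SA pipeline.
sa_pipelines = [["B.matmul"]]
vu_pipeline = []

scoped = bar_a.can_finish(150)                          # per-dispatch fence
global_drain = old_can_finish(sa_pipelines + [vu_pipeline])
print(scoped, global_drain)
```

At cycle 150 the scoped fence lets dispatch A's bar finish, while the old
global-drain check would still hold it back because of dispatch B's matmul.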

diff --git a/TOGSim/include/Instruction.h b/TOGSim/include/Instruction.h
index e5a8dc4a..d5e193d4 100644
--- a/TOGSim/include/Instruction.h
+++ b/TOGSim/include/Instruction.h
@@ -50,6 +50,14 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   int get_assigned_sa() const { return _assigned_sa; }
   void set_weight_token(const std::shared_ptr<WeightToken>& t) { _weight_token = t; }
   const std::shared_ptr<WeightToken>& get_weight_token() const { return _weight_token; }
+  // Trace-only: which work-item (togsim_dispatch tile) this op belongs to, for
+  // grouping/coloring in the timeline. Set by the bridge per TILE_BEGIN.
+  void set_tile_group(int g) { _tile_group = g; }
+  int get_tile_group() const { return _tile_group; }
+  // COMPUTE_BAR fence: the max finish_cycle of the async computes it gates (its
+  // own dispatch only), so it drains those instead of every SA pipeline.
+  void update_fence_finish(cycle_type c) { if (c > _fence_finish) _fence_finish = c; }
+  cycle_type get_fence_finish() const { return _fence_finish; }
   bool check_ready() { return ready_counter == 0; }
   const Opcode get_opcode() { return opcode; }
   bool is_dma_read() { return opcode == Opcode::MOVIN; }
@@ -72,6 +80,9 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   void inc_waiting_request();
   void dec_waiting_request();
   size_t get_waiting_request() { return _nr_waiting_request; }
+  // trace: log only the FIRST DRAM response of a load (when data starts arriving).
+  bool got_first_response() const { return _got_first_response; }
+  void mark_first_response() { _got_first_response = true; }
   std::vector<size_t>& get_tile_size() { return tile_size; }
   std::vector<int>& get_tile_stride() { return tile_stride; }
   void set_overlapping_cycle(cycle_type cycle) { overlapping_cycle = cycle; }
@@ -143,6 +154,7 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   std::vector<int> tile_stride;
   size_t _tile_numel;
   size_t _nr_waiting_request=0;
+  bool _got_first_response=false;
   size_t _elem_bits = 0;
   addr_type dram_addr;
   uint32_t _numa_id = 0; // For DMA instruction
@@ -165,4 +177,6 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   // SA weight-buffer model (see the setters above).
   int _assigned_sa = -1;
   std::shared_ptr<WeightToken> _weight_token;
+  int _tile_group = -1;   // trace-only work-item id (see set_tile_group)
+  cycle_type _fence_finish = 0;   // COMPUTE_BAR: drain target (see update_fence_finish)
 };
\ No newline at end of file
diff --git a/TOGSim/include/TraceLogTags.h b/TOGSim/include/TraceLogTags.h
index 6c158099..759a4fdb 100644
--- a/TOGSim/include/TraceLogTags.h
+++ b/TOGSim/include/TraceLogTags.h
@@ -24,6 +24,7 @@ inline constexpr const char* kInstructionFinished = "INST_FINISHED";
 inline constexpr const char* kInstructionSkipped = "INST_SKIP";
 
 inline constexpr const char* kAsyncDmaAllRequestsIssued = "ASYNC_DMA_ISSUE";
+inline constexpr const char* kFirstDramResponse = "DRAM_RESP_FIRST";
 inline constexpr const char* kAllDramResponsesReceived = "DRAM_RESP_DONE";
 
 inline constexpr const char* kL2CacheableStatusForAddress = "L2CACHE_STAT";
diff --git a/TOGSim/src/Core.cc b/TOGSim/src/Core.cc
index 1afbbb52..25335c9c 100644
--- a/TOGSim/src/Core.cc
+++ b/TOGSim/src/Core.cc
@@ -356,10 +356,6 @@ void Core::cycle() {
               inst->set_assigned_sa(sa_idx);         // record the SA actually used (for the trace)
             }
             release_sram(inst);   // consumer issued -> free the tiles it read
-            // sec 10.7: this op is now entering the pipeline -> release its
-            // occupancy (pipeline) dependents so a preload/matmul successor
-            // overlaps it instead of waiting its full latency.
-            inst->release_pipeline_children();
             auto& target_pipeline = (ct == VECTOR_UNIT) ? _vu_compute_pipeline
                                                         : _sa_compute_pipeline.at(sa_idx);
             if (target_pipeline.empty()) {
@@ -371,6 +367,10 @@ void Core::cycle() {
               inst->finish_cycle = target_pipeline.back()->finish_cycle + inst->get_compute_cycle() - overlapped_cycle;
               inst->bubble_cycle = bubble_cycle;
             }
+            // sec 10.7: release the occupancy (pipeline) dependents so a successor
+            // overlaps this op. finish_cycle is set first so release can feed it to
+            // a COMPUTE_BAR child's per-dispatch fence (see release_pipeline_children).
+            inst->release_pipeline_children();
 
             // Release this matmul's weight slot at its streaming-end (finish -
             // overlapping), not at full finish (the drain tail does not read it).
@@ -430,16 +430,13 @@ void Core::cycle() {
           break;
         case Opcode::COMPUTE_BAR:
           {
-            // Compute fence (sec 10.7): finish only once ALL compute pipelines
-            // have drained (every systolic array + the VPU empty). Its
-            // ready_counter already gated it until the async computes ISSUED
-            // (pipeline-child release), so by now they are in the pipeline; we
-            // wait for them to drain. If not yet drained, do not issue -- it
-            // stays in the ready queue and is re-checked next cycle.
-            bool drained = _vu_compute_pipeline.empty();
-            for (int s = 0; s < _num_systolic_array_per_core; s++)
-              drained = drained && _sa_compute_pipeline.at(s).empty();
-            if (drained) {
+            // Compute fence (sec 10.7): finish once THIS dispatch's async computes
+            // have drained -- i.e. the current cycle has reached the max finish of
+            // the computes it gates (fed in via update_fence_finish when each
+            // issued). Scoped to its own dispatch, so an unrelated tile's matmuls
+            // sharing the SA pipelines do not delay it (no cross-dispatch
+            // serialization). Not yet drained -> stays in the ready queue.
+            if (_core_cycle >= inst->get_fence_finish()) {
               core_trace_log::trace_instruction_line(_core_cycle, _id,
                   TraceLogTag::pad15(TraceLogTag::kInstructionFinished),
                   inst->get_global_inst_id(),
@@ -550,6 +547,13 @@ void Core::push_memory_response(mem_fetch* response) {
   Instruction* owner_inst = static_cast<Instruction*>(response->get_custom_data());
   assert(owner_inst->get_waiting_request());
 
+  if (!owner_inst->got_first_response()) {   // first data of this load arrived
+    owner_inst->mark_first_response();
+    core_trace_log::trace_instruction_line(_core_cycle, _id,
+        TraceLogTag::pad15(TraceLogTag::kFirstDramResponse),
+        owner_inst->get_global_inst_id(),
+        core_trace_log::format_instruction_detail_line(*owner_inst));
+  }
   owner_inst->dec_waiting_request();
   if (!owner_inst->get_waiting_request()) {
     auto it = _dma_waiting_queue.find(owner_inst);
diff --git a/TOGSim/src/CoreTraceLog.cc b/TOGSim/src/CoreTraceLog.cc
index 9761f9ec..7086893e 100644
--- a/TOGSim/src/CoreTraceLog.cc
+++ b/TOGSim/src/CoreTraceLog.cc
@@ -31,7 +31,7 @@ std::string format_dma_inst_issued_detail(Instruction& inst) {
   }
   return fmt::format(
       "addr_name={} dram=0x{:016x} rank={} elem_bits={} async={} indirect={} tag=0x{:016x} stride=[{}] size=[{}] "
-      "tag_idx=[{}]",
+      "tag_idx=[{}] tile={}",
       inst.get_addr_name(),
       static_cast<uint64_t>(inst.get_base_dram_address()),
       rank,
@@ -41,7 +41,8 @@ std::string format_dma_inst_issued_detail(Instruction& inst) {
       tag_hex,
       fmt::join(inst.get_tile_stride(), ","),
       fmt::join(ts, ","),
-      fmt::join(tidx, ","));
+      fmt::join(tidx, ","),
+      inst.get_tile_group());
 }
 
 std::string format_dma_inst_issued_trace_line(Instruction& inst) {
@@ -52,31 +53,35 @@ std::string format_instruction_detail_line(Instruction& inst) {
   const Opcode op = inst.get_opcode();
   const std::string opname = opcode_to_string(op);
   if (op == Opcode::COMP) {
-    return fmt::format("{} (compute_type={} compute_cycle={} overlapping_cycle={})",
+    return fmt::format("{} (compute_type={} compute_cycle={} overlapping_cycle={} sa={} tile={})",
                        opname,
                        inst.get_compute_type(),
                        inst.get_compute_cycle(),
-                       inst.get_overlapping_cycle());
+                       inst.get_overlapping_cycle(),
+                       inst.get_assigned_sa(),
+                       inst.get_tile_group());
   }
   if ((op == Opcode::MOVIN || op == Opcode::MOVOUT) && inst.is_async_dma()) {
-    return fmt::format("{} (ASYNC subgraph_id={} addr_name={} tag_id=[{}] tag_idx=[{}] tag_stride=[{}])",
+    return fmt::format("{} (ASYNC subgraph_id={} addr_name={} tag_id=[{}] tag_idx=[{}] tag_stride=[{}] tile={})",
                        opname,
                        inst.subgraph_id,
                        inst.get_addr_name(),
                        format_tag_key_list_hex(inst.get_tag_id()),
                        fmt::join(inst.get_tag_idx_list(), ","),
-                       fmt::join(inst.get_tag_stride_list(), ","));
+                       fmt::join(inst.get_tag_stride_list(), ","),
+                       inst.get_tile_group());
   }
   if (op == Opcode::MOVIN || op == Opcode::MOVOUT) {
-    return fmt::format("{} (addr_name={})", opname, inst.get_addr_name());
+    return fmt::format("{} (addr_name={} tile={})", opname, inst.get_addr_name(), inst.get_tile_group());
   }
   if (op == Opcode::MEMORY_BAR) {
-    return fmt::format("{} (addr_name={} tag_id=[{}] tag_idx=[{}] tag_stride=[{}])",
+    return fmt::format("{} (addr_name={} tag_id=[{}] tag_idx=[{}] tag_stride=[{}] tile={})",
                        opname,
                        inst.get_addr_name(),
                        format_tag_key_list_hex(inst.get_tag_id()),
                        fmt::join(inst.get_tag_idx_list(), ","),
-                       fmt::join(inst.get_tag_stride_list(), ","));
+                       fmt::join(inst.get_tag_stride_list(), ","),
+                       inst.get_tile_group());
   }
   return opname;
 }
diff --git a/TOGSim/src/Instruction.cc b/TOGSim/src/Instruction.cc
index 54e50511..d0471226 100644
--- a/TOGSim/src/Instruction.cc
+++ b/TOGSim/src/Instruction.cc
@@ -67,7 +67,12 @@ void Instruction::add_pipeline_child(std::shared_ptr<Instruction> child) {
 }
 
 void Instruction::release_pipeline_children() {
-  for (auto& c : _pipeline_children) c->dec_ready_counter();
+  for (auto& c : _pipeline_children) {
+    // a COMPUTE_BAR child fences only its own dispatch -> it drains the max
+    // finish of the computes it gates, fed here as each one issues.
+    if (c->get_opcode() == Opcode::COMPUTE_BAR) c->update_fence_finish(finish_cycle);
+    c->dec_ready_counter();
+  }
   _pipeline_children.clear();
 }
 
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
index 01406793..e13af2d7 100644
--- a/TOGSim/src/togsim_trace_bridge.cc
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -99,6 +99,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
   std::map<std::pair<int32_t, uint64_t>,
            std::pair<int64_t, std::shared_ptr<Instruction>>> current_dma;
   int64_t next_tag = 0;   // mints a unique Core tag key per dma record
+  int cur_tile_group = -1;   // work-item index, bumped per TILE_BEGIN (trace grouping)
   // Async compute (matmul/preload): issued and pipelined on the systolic array;
   // they do not block each other. A store then needs the drained result, so it
   // FLUSHes -- waits all outstanding async compute before running (like a fence
@@ -198,6 +199,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
       sg = std::make_shared<TileSubGraph>();
       sg->set_core_id(t.core);
       tile = std::make_shared<Tile>(Tile::Status::INITIALIZED);
+      cur_tile_group++;
       continue;
     }
     if (t.kind == TraceRec::TILE_END) {
@@ -209,6 +211,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     if (t.kind == TraceRec::DMA) {
       int64_t uniq = next_tag++;                         // fresh Core tag key per dma record
       auto inst = make_dma(t, uniq);
+      inst->set_tile_group(cur_tile_group);
       size_t numel = 1;                                  // SRAM footprint (ready-tile ordering)
       for (auto d : t.dims) numel *= (size_t)d;
       tile->inc_required_sram_size(numel * (t.elem_bits / 8));
@@ -247,11 +250,13 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
       std::shared_ptr<Instruction> dma_inst;
       if (it != current_dma.end()) { uniq = it->second.first; dma_inst = it->second.second; }
       auto bar = make_mem_bar(t, uniq);
+      bar->set_tile_group(cur_tile_group);
       if (dma_inst) dma_inst->add_child(bar);
       tile->append_instuction(bar);
       for (int64_t b : t.write_bufs) last_writer[b] = bar;
     } else if (t.kind == TraceRec::COMPUTE) {
       auto inst = make_compute(t);
+      inst->set_tile_group(cur_tile_group);
       link(inst, t.read_bufs, t.write_bufs);
       for (int64_t b : t.read_bufs) sram_on_read(b, inst);     // frees the tiles it consumes
       if (is_async_compute(t.compute_type)) outstanding_async.push_back(inst);
@@ -260,6 +265,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
       // ISSUED (pipeline-child release); the Core then waits the SA pipelines to
       // drain before it finishes (-> the store it gates).
       auto bar = std::make_shared<Instruction>(Opcode::COMPUTE_BAR);
+      bar->set_tile_group(cur_tile_group);
       for (auto& a : outstanding_async) a->add_pipeline_child(bar);
       outstanding_async.clear();
       tile->append_instuction(bar);
diff --git a/scripts/trace_timeline.py b/scripts/trace_timeline.py
index 7f9ade1e..5cf9608b 100644
--- a/scripts/trace_timeline.py
+++ b/scripts/trace_timeline.py
@@ -3,14 +3,16 @@
 opens in Perfetto (https://ui.perfetto.dev) or chrome://tracing as an interactive
 timeline (Gantt).
 
-Each instruction becomes one duration slice on one of 3 per-core lanes:
-  dma     -- MOVIN / MOVOUT
-  sa      -- COMP compute_type 1 (matmul) / 2 (preload)
+Each instruction becomes one duration slice, grouped per core (pid). Lanes:
+  dram-rd -- loads crossing the DRAM bus (read bandwidth)
+  dram-wr -- stores crossing the DRAM bus (write bandwidth)
+  sa / sa0.. -- COMP compute_type 1 (matmul) / 2 (preload)
   vector  -- COMP compute_type 0 (vector)
-grouped per core (pid). Time unit = core cycles. Barriers (MEMORY_BAR/COMPUTE_BAR)
-are not drawn. A compute slice's width is its compute_cycle (the op's own latency),
-not issue->finish (which balloons under pipeline backlog); a DMA slice is the
-actual transfer ASYNC_DMA_ISSUE -> data-ready.
+Time unit = core cycles. Barriers (MEMORY_BAR/COMPUTE_BAR) are not drawn. A DMA bar
+runs from the op's first DRAM response (DRAM_RESP_FIRST, logged by the Core -- so it
+captures data moving even while still injecting) to its completion (load: data-ready;
+store: finished), serialized per direction so each is one visible bar (packed row =
+saturated bus). A compute slice's width is its occupancy (compute_cycle - overlapping).
 
 Usage:
   bin/Simulator --config <yml> --trace_so <so> --cycle_table <tsv> --log_level trace \
@@ -33,14 +35,41 @@
 _HIDE = {"MEMORY_BAR", "COMPUTE_BAR", "TILE_BEGIN", "TILE_END"}
 _CT_NAME = {0: "vector", 1: "matmul", 2: "preload"}
 
+# Perfetto/catapult reserved color names; slices are tinted by tile (= the
+# togsim_dispatch work-item / output tile) so one tile's ops share a color across
+# lanes/cores. 16 names so a core's tiles (which stride by num_cores) stay
+# distinct -- an 8-name palette collapsed to 4 colors per core under 2-core
+# even/odd assignment.
+_TILE_PALETTE = ["good", "bad", "terrible", "yellow", "olive", "rail_response",
+                 "rail_load", "rail_animation", "rail_idle", "thread_state_running",
+                 "thread_state_runnable", "thread_state_iowait",
+                 "thread_state_uninterruptible", "generic_work", "startup",
+                 "vsync_highlight_color"]
+
+
+def _tile_color(detail):
+    m = re.search(r"\btile=(\d+)", detail or "")
+    return _TILE_PALETTE[int(m.group(1)) % len(_TILE_PALETTE)] if m else None
+
+
+_DMA_SHORT = {"MOVIN": "MVIN", "MOVOUT": "MVOUT"}
+
+
+def _tile_of(detail):
+    m = re.search(r"\btile=(-?\d+)", detail or "")
+    return m.group(1) if m else "?"
+
 
 def _label(opcode, detail):
     if opcode == "COMP":
         m = re.search(r"compute_type=(\d+)", detail)
         ct = int(m.group(1)) if m else -1
-        return _CT_NAME.get(ct, "comp")
-    m = re.search(r"addr_name=(\w+)", detail)
-    return f"{opcode} {m.group(1)}" if m else opcode
+        return f"T{_tile_of(detail)} {_CT_NAME.get(ct, 'comp')}"
+    # DMA: keep each load's OWN identity (addr_name) so the input/weight/K-panel
+    # loads stay distinct; tile is conveyed by color (and args), not the name.
+    m = re.search(r"addr_name=(\w+)", detail or "")
+    who = m.group(1) if m else "?"
+    return f"{who} (T{_tile_of(detail)} {_DMA_SHORT.get(opcode, opcode)})"
 
 
 def _lane(opcode, detail):
@@ -65,7 +94,8 @@ def parse(lines):
         key = (core, iid)
         r = insts.setdefault(key, {
             "core": core, "iid": iid, "opcode": opcode, "detail": detail,
-            "issued": None, "finished": None, "resp": None, "dma_issue": None})
+            "issued": None, "finished": None, "resp": None, "dma_issue": None,
+            "first_resp": None})
         if not r["opcode"] or r["opcode"] == opcode:
             r["opcode"] = opcode
             if detail.strip():
@@ -76,7 +106,9 @@ def parse(lines):
             r["finished"] = cyc
         elif tag == "DRAM_RESP_DONE":
             r["resp"] = cyc
-        elif tag == "ASYNC_DMA_ISSUE":   # actual transfer start (DMA engine busy)
+        elif tag == "DRAM_RESP_FIRST" and r["first_resp"] is None:  # first data arrived
+            r["first_resp"] = cyc
+        elif tag == "ASYNC_DMA_ISSUE":   # all requests injected (engine done)
             r["dma_issue"] = cyc
     return insts
 
@@ -94,7 +126,8 @@ def to_chrome(insts, num_sa=1):
       dma    : MOVIN/MOVOUT -- 1 DMA engine; slice = actual transfer
                (ASYNC_DMA_ISSUE -> data-ready).
       vector : COMP type 0  -- 1 VPU.
-      sa     : COMP type 1/2 -- num_sa systolic arrays, round-robin by issue order.
+      sa     : COMP type 1/2 -- each op on the SA the Core reports (`sa=` field;
+               weight-pinned), so lanes auto-split sa0..; rr fallback if absent.
     A compute slice's width is compute_cycle - overlapping_cycle (its occupancy =
     latency minus the tail that overlaps the next op), starting when the unit
     actually picks it up: start = max(issue, unit_free). num_sa>1 -> lanes sa0.. ."""
@@ -116,35 +149,39 @@ def to_chrome(insts, num_sa=1):
     def add(core, lane, ts, dur, name, r):
         lanes.add((core, lane))
         cores.add(core)
-        events.append({"name": name, "cat": lane, "ph": "X", "ts": ts,
-                       "dur": max(dur, 1), "pid": core, "tid": lane,
-                       "args": {"inst_id": r["iid"], "issued": r["issued"],
-                                "finished": r["finished"], "data_ready": r["resp"]}})
+        args = {"inst_id": r["iid"], "tile": _tile_of(r["detail"]),
+                "issued": r["issued"], "first_data": r["first_resp"],
+                "finished": r["finished"], "data_ready": r["resp"]}
+        am = re.search(r"addr_name=(\w+)", r["detail"] or "")
+        if am:
+            args["addr"] = am.group(1)
+        ev = {"name": name, "cat": lane, "ph": "X", "ts": ts,
+              "dur": max(dur, 1), "pid": core, "tid": lane, "args": args}
+        cname = _tile_color(r["detail"])
+        if cname:
+            ev["cname"] = cname
+        events.append(ev)
 
     def issue_key(r):
         return r["issued"] if r["issued"] is not None else 0
 
     nsa = max(num_sa, 1)
     for core, u in sorted(by_core.items()):
-        # DMA engine: one server, serialized. A load occupies the engine while it
-        # INJECTS its requests -- [INST_ISSUED, ASYNC_DMA_ISSUE] -- not the response
-        # tail [ASYNC_DMA_ISSUE, resp] (engine is free during that) and not up to
-        # data-ready (which would mask gaps). When a load is blocked from issuing
-        # (spad full), its INST_ISSUED is delayed past the engine-free time, so a
-        # real idle gap appears = the SRAM throttle stalling the DMA.
-        free = 0
-        for r in sorted(u["dma"], key=issue_key):
-            start = r["issued"] if r["issued"] is not None else r["dma_issue"]
-            end = r["dma_issue"]
-            if end is None:  # sync dma / store: no async-issue marker
-                end = r["resp"] if r["resp"] is not None else r["finished"]
-            if start is None:
-                continue
-            if end is None or end < start:
-                end = start + 1
-            start = max(start, free)
-            free = max(end, start + 1)
-            add(core, "dma", start, free - start, _label(r["opcode"], r["detail"]), r)
+        # DMA data crossing the DRAM bus, split by direction (reads and writes are
+        # asymmetric). A LOAD's data comes back on the response, so its bar runs
+        # [first DRAM response, data-ready]. A STORE's data goes out with the
+        # request (fire-and-forget; its acks arrive after it has finished), so its
+        # bar runs [issued, finished]. Serialized per direction so each op is one
+        # visible bar: a packed row = the bus is saturated, gaps = it is idle.
+        for lane, op, sk, ek in (("dram-rd", "MOVIN", "first_resp", "resp"),
+                                 ("dram-wr", "MOVOUT", "issued", "finished")):
+            free = 0
+            rows = [r for r in u["dma"] if r["opcode"] == op
+                    and r[sk] is not None and r[ek] is not None and r[ek] > r[sk]]
+            for r in sorted(rows, key=lambda r: r[ek]):
+                start = max(r[sk], free)
+                free = max(r[ek], start + 1)
+                add(core, lane, start, free - start, _label(r["opcode"], r["detail"]), r)
         # VPU: one server; slice = occupancy (compute_cycle - overlapping_cycle).
         free = 0
         for r in sorted(u["vector"], key=issue_key):
@@ -155,23 +192,33 @@ def issue_key(r):
             start = max(r["issued"], free)
             free = start + dur
             add(core, "vector", start, dur, "vector", r)
-        # SA: num_sa servers, round-robin in issue order (mirrors the Core's rr).
-        sa_free = [0] * nsa
-        for i, r in enumerate(sorted(u["sa"], key=issue_key)):
+        # SA: each op runs on the systolic array the Core reports (the `sa=` field
+        # = its weight-pinned / round-robin assignment); fall back to round-robin
+        # by issue order for older logs without the field. Each SA is one server.
+        rows = sorted(u["sa"], key=issue_key)
+
+        def _sa_of(r, i):
+            m = re.search(r"\bsa=(-?\d+)", r["detail"] or "")
+            return int(m.group(1)) if (m and int(m.group(1)) >= 0) else (i % nsa)
+
+        max_sa = max([nsa] + [_sa_of(r, i) + 1 for i, r in enumerate(rows)])
+        sa_free = [0] * max_sa
+        for i, r in enumerate(rows):
             if r["issued"] is None:
                 continue
-            s = i % nsa
+            s = _sa_of(r, i)
             cc, ov = _occ(r["detail"])
             dur = max(cc - ov, 1)
             start = max(r["issued"], sa_free[s])
             sa_free[s] = start + dur
-            lane = "sa" if nsa == 1 else f"sa{s}"
+            lane = "sa" if max_sa == 1 else f"sa{s}"
             add(core, lane, start, dur, _label(r["opcode"], r["detail"]), r)
 
     for c in sorted(cores):
         events.append({"name": "process_name", "ph": "M", "pid": c, "tid": 0,
                        "args": {"name": f"Core {c}"}})
-    order = {"dma": 0, "sa": 1, "sa0": 1, "sa1": 2, "sa2": 3, "sa3": 4, "vector": 8}
+    order = {"dram-rd": 0, "dram-wr": 1,
+             "sa": 2, "sa0": 2, "sa1": 3, "sa2": 4, "sa3": 5, "vector": 7}
     for c, lane in sorted(lanes, key=lambda x: (x[0], order.get(x[1], 5))):
         events.append({"name": "thread_name", "ph": "M", "pid": c, "tid": lane,
                        "args": {"name": lane}})

From a23690a4deacd6d48f2d0e03bec9f8886a24c581 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 07/24] [TOGSim] Make the C++ trace path the default and
 stabilize it

Default to the trace path. Fix uninitialized Instruction fields, the matmul
accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and
dma_wait fusion. Bound concurrent dispatches so their combined working set
fits the spad, round-robin work-items within a partition, run autotune
benchmarking and the multi-tenant scheduler through the trace path, and emit
trace.so for pooling/reduction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
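Note on the add_child change: child_inst is a set, so each child is released
once at finish, and the ready counter has to be bumped once per distinct edge.
The Python model below is only a sketch of that invariant. The `Inst` class and
its method names are hypothetical stand-ins for TOGSim's Instruction, not the
actual C++ code.

```python
class Inst:
    """Hypothetical stand-in for TOGSim's Instruction (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.ready_counter = 0   # parents not yet finished
        self.children = set()    # each child is released once at finish

    def add_child(self, child):
        # Bump the counter only when the edge is new, mirroring the fix.
        if child not in self.children:
            self.children.add(child)
            child.ready_counter += 1

    def finish(self):
        # Release every distinct child exactly once.
        for child in self.children:
            child.ready_counter -= 1


producer, consumer = Inst("producer"), Inst("consumer")
producer.add_child(consumer)   # edge via the value buffer
producer.add_child(consumer)   # same pair again via the index buffer
producer.finish()
print(consumer.ready_counter)  # 0: the consumer can issue
```

With an unconditional increment the counter would end at 1 after the single
release, which is the stuck-above-zero state the commit describes.
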
 PyTorchSimFrontend/extension_codecache.py     |  12 +-
 .../mlir/mlir_codegen_backend.py              |   8 +-
 .../mlir/passes/build_skeleton.py             |  72 ++++++++++-
 .../mlir/passes/dep_analysis.py               | 120 ++++++++++++------
 .../mlir/passes/dma_fine_grained.py           |  19 ++-
 .../mlir/passes/lower_to_emitc.py             |   2 +
 Simulator/simulator.py                        |  16 ++-
 TOGSim/include/Instruction.h                  |  17 +--
 TOGSim/include/togsim_loader.h                |   7 +-
 TOGSim/src/Core.cc                            |   7 +-
 TOGSim/src/Instruction.cc                     |  13 +-
 TOGSim/src/Simulator.cc                       |   2 +-
 TOGSim/src/main.cc                            |  88 +++++++++----
 TOGSim/src/togsim_runtime.cc                  |  22 ++--
 TOGSim/src/togsim_trace_bridge.cc             |  28 ++++
 15 files changed, 324 insertions(+), 109 deletions(-)
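
Note on path selection: both Simulator/simulator.py and launchKernel take the
trace path unless TORCHSIM_LEGACY_TOG=1 is set or trace.so is missing. A minimal
sketch of that decision follows. `choose_togsim_args` is a hypothetical helper
written for this note, and only the flag names and file names come from the
patch.

```python
import os
import tempfile


def choose_togsim_args(model_dir, models_list, env=None):
    """Hypothetical helper: return trace-path or legacy CLI arguments."""
    env = os.environ if env is None else env
    trace_so = os.path.join(model_dir, "trace.so")
    cycle_tsv = os.path.join(model_dir, "trace_cycles.tsv")
    # Trace path is the default; legacy is opt-in or used when the .so is absent.
    use_trace = env.get("TORCHSIM_LEGACY_TOG") != "1" and os.path.exists(trace_so)
    if use_trace:
        return ["--trace_so", trace_so, "--cycle_table", cycle_tsv]
    return ["--models_list", models_list]


with tempfile.TemporaryDirectory() as d:
    before = choose_togsim_args(d, "models.txt", env={})      # no trace.so yet
    open(os.path.join(d, "trace.so"), "w").close()             # pretend it was emitted
    after = choose_togsim_args(d, "models.txt", env={})
    legacy = choose_togsim_args(d, "models.txt", env={"TORCHSIM_LEGACY_TOG": "1"})
print(before[0], after[0], legacy[0])
```
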

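Note on the fused-subtile dedup in dma_fine_grained.py: a load depends only on
its own operand's dims, so walking the fused cross-product re-emits identical
loads. The sketch below mirrors the dedup on plain tuples. `distinct_subtiles`
and the (M, K, N) dimension mapping are assumptions made for illustration; the
real pass emits DMA ops through `_emit_dma`.

```python
import itertools


def distinct_subtiles(bounds, in_to_fused, w_to_fused):
    """Return (operand, coords) loads in fused issue order, each emitted once."""
    seen_in, seen_w = set(), set()
    emitted = []
    for it in itertools.product(*[range(b) for b in bounds]):
        in_key = tuple(it[d] for d in in_to_fused)   # input's own coords
        if in_key not in seen_in:
            seen_in.add(in_key)
            emitted.append(("input", in_key))
        w_key = tuple(it[d] for d in w_to_fused)     # weight's own coords
        if w_key not in seen_w:
            seen_w.add(w_key)
            emitted.append(("weight", w_key))
    return emitted


# Assumed fused nest (M, K, N) = (2, 2, 2): input uses (M, K), weight uses (K, N).
loads = distinct_subtiles([2, 2, 2], in_to_fused=[0, 1], w_to_fused=[1, 2])
print(len(loads))  # 8 loads instead of the 16 a plain cross-product would emit
```
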
diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index 247c07d5..e573d1a5 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -270,12 +270,12 @@ def load(cls, source_code,
                 vector_lane=vectorlane_size
             )
 
-            # P3 trace pipeline (opt-in, TORCHSIM_DUMP_TRACE_SO=1): also emit the
-            # compiled trace producer .so + the cycle-table TSV from the SAME
-            # post-vcix IR and gem5 cycle_list/offsets, so the trace path can be
-            # run and compared cycle-consistently against this legacy path.
-            # Best-effort: never breaks the legacy compile.
-            if os.environ.get("TORCHSIM_DUMP_TRACE_SO") == "1":
+            # Trace pipeline (DEFAULT): emit the compiled trace producer .so + the
+            # cycle-table TSV from the post-vcix IR and gem5 cycle_list/offsets. This
+            # is the default simulation path (the C++ TOG). The legacy ONNX TOG is the
+            # opt-in fallback via TORCHSIM_LEGACY_TOG=1; the .so is unused there, so
+            # emission is skipped. Best-effort: a failure never breaks the compile.
+            if os.environ.get("TORCHSIM_LEGACY_TOG") != "1":
                 try:
                     import mlir.ir as ir
                     from PyTorchSimFrontend.mlir.passes import (
diff --git a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
index 725e0dc6..1ee62f36 100644
--- a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
+++ b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
@@ -1120,9 +1120,13 @@ def codegen_nodes(self, nodes, kernel_name):
         src_code, meta_code = super().codegen_nodes(nodes, kernel_name)
         self._prepare_simulator_headers(src_code)
         if "autotune" in extension_config.codegen_mapping_strategy and extension_config.pytorchsim_timing_mode:
-            optimal_src_code, meta_code = self.autotune(nodes, kernel_name)[:2]
+            # Use temporaries: autotune returns [None, None, None] when it cannot
+            # autotune (e.g. a size-1 pointwise kernel with ranges == [1]), and
+            # unpacking into meta_code would clobber the valid arg_attributes that
+            # the fall-through below returns.
+            optimal_src_code, optimal_meta_code = self.autotune(nodes, kernel_name)[:2]
             if optimal_src_code is not None:
-                return optimal_src_code, meta_code
+                return optimal_src_code, optimal_meta_code
         return src_code, meta_code
 
     def _prepare_simulator_headers(self, src_code):
diff --git a/PyTorchSimFrontend/mlir/passes/build_skeleton.py b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
index df4c6046..6e27cf45 100644
--- a/PyTorchSimFrontend/mlir/passes/build_skeleton.py
+++ b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
@@ -271,6 +271,74 @@ def _results_unused(op):
     return True
 
 
+def _strip_loop_iter_args(block):
+    """Drop loop-carried values (iter_args) from every affine.for/scf.for.
+
+    The skeleton only needs the loop STRUCTURE (iteration counts) and the
+    togsim.* markers -- not the data flowing through the loop. Reduction kernels
+    carry a *vector* accumulator as an iter_arg; EmitC/C++ cannot represent a
+    loop carrying a vector, so the trace .so emission fails. Since the trace is
+    timing-only (values come from the recorded run), we rebuild each loop without
+    iter_args: body uses of an iter_arg become its init value, the loop result
+    becomes its init, and the now-orphaned accumulate ops are removed by _dce.
+    """
+    # Only strip a loop whose RESULTS are unused (dead for the trace): the carried
+    # value goes nowhere live, so dropping it is safe. A loop whose result still
+    # feeds a kept op (e.g. an index accumulator consumed by a togsim.dma address)
+    # is left untouched. Run after _dce so the result store is already gone; then
+    # nested reductions free up inner results round by round (outer stripped first).
+    while True:
+        tgt = None
+        for op in walk_ops(block):
+            n = op.operation.name
+            if (n in ("affine.for", "scf.for") and len(op.operation.results) > 0
+                    and _results_unused(op)):
+                tgt = op
+                break
+        if tgt is None:
+            return
+        _rebuild_loop_no_iter(tgt)
+
+
+def _rebuild_loop_no_iter(op):
+    o = op.operation
+    nres = len(o.results)
+    n_in = len(o.operands)
+    inits = [o.operands[n_in - nres + i] for i in range(nres)]
+    keep_operands = [o.operands[i] for i in range(n_in - nres)]  # bound operands only
+    old_block = o.regions[0].blocks[0]
+    oargs = list(old_block.arguments)  # [iv, *iter_args]
+
+    attrs = {na.name: na.attr for na in o.attributes}
+    # affine.for tags its operand groups; zero the iter-arg group (last entry).
+    if "operandSegmentSizes" in attrs:
+        seg = [int(x) for x in str(attrs["operandSegmentSizes"]).split(":")[1].strip(" >").split(",")]
+        seg[-1] = 0
+        attrs["operandSegmentSizes"] = ir.Attribute.parse(
+            "array<i32: " + ", ".join(str(s) for s in seg) + ">")
+
+    loc = ir.Location.unknown(o.context)
+    with loc:                                                  # default loc for new block args
+        new = ir.Operation.create(o.name, results=[], operands=keep_operands,
+                                  attributes=attrs, regions=1, loc=loc,
+                                  ip=ir.InsertionPoint(o))
+        nb = new.regions[0].blocks.append(oargs[0].type)      # block with the iv arg only
+
+        oargs[0].replace_all_uses_with(nb.arguments[0])       # iv
+        for ba, ini in zip(oargs[1:], inits):                 # iter-arg uses -> init
+            ba.replace_all_uses_with(ini)
+        for res, ini in zip(o.results, inits):                # loop result -> init
+            res.replace_all_uses_with(ini)
+
+        term_name = "affine.yield" if o.name == "affine.for" else "scf.yield"
+        with ir.InsertionPoint(nb):
+            ir.Operation.create(term_name, results=[], operands=[], loc=loc)
+        new_term = list(nb.operations)[0]
+        for bop in list(old_block.operations)[:-1]:           # move body (drop old yield)
+            bop.operation.move_before(new_term)
+        o.erase()
+
+
 def _dce(block):
     """Erase non-kept ops with no used results, to a fixed point. Safe: an op
     with live SSA uses is never touched."""
@@ -466,7 +534,9 @@ def build_skeleton(module):
             op.operation.erase()
         except Exception:
             pass
-    _dce(block)
+    _dce(block)                    # drop dead consumers (e.g. the result store) first,
+    _strip_loop_iter_args(block)   # so a now-unused loop result lets us strip its iter_args
+    _dce(block)                    # then clean the orphaned accumulate ops
 
     return ("skeleton: compute=%d dma=%d wait=%d (unpaired waits dropped)"
             % (n_compute, n_dma, n_wait))
diff --git a/PyTorchSimFrontend/mlir/passes/dep_analysis.py b/PyTorchSimFrontend/mlir/passes/dep_analysis.py
index bc53bfc9..06d8270d 100644
--- a/PyTorchSimFrontend/mlir/passes/dep_analysis.py
+++ b/PyTorchSimFrontend/mlir/passes/dep_analysis.py
@@ -42,42 +42,82 @@ def _global_of(memref_val):
     return None
 
 
-def _read_buffers_of_compute(cn):
-    """SRAM buffers a compute node reads: (a) each vcix.iv input traced to its
-    vector.transfer_read source (activations/weights streamed into the SA), and
-    (b) any direct vector.transfer_read in the node (the epilogue's accumulator
-    read-modify-write of Y_spad)."""
-    bufs = set()
-    for op in cn.operations:
-        if op.name == "vector.transfer_read" and list(op.operands):
-            b = _global_of(op.operands[0])
-            if b:
-                bufs.add(b)
-        elif op.name == "vcix.iv" and list(op.operands):
-            v = op.operands[0]
-            defop = v.owner if isinstance(v.owner, ir.Operation) else getattr(v.owner, "operation", None)
-            if defop is not None and defop.name == "vector.transfer_read" and list(defop.operands):
-                b = _global_of(defop.operands[0])
-                if b:
-                    bufs.add(b)
-    return bufs
-
-
-def _write_buffers_of_compute(cn):
-    """SRAM buffers a compute node writes: vector.transfer_write / vector_store target."""
-    bufs = set()
-    for op in cn.operations:
-        if op.name in ("vector.transfer_write", "affine.vector_store", "vector.store"):
-            # target memref is the last memref operand
-            for v in op.operands:
-                try:
-                    if ir.MemRefType.isinstance(v.type):
-                        b = _global_of(v)
-                        if b:
-                            bufs.add(b)
-                except Exception:
-                    pass
-    return bufs
+# Ops that touch SRAM-buffer DATA, by category. A view op (subview/reinterpret_cast)
+# instead PRODUCES a memref -- pure address computation, skipped here; the real access
+# is the load/store using it, whose memref operand _global_of traces back through the
+# view to the @global. Anything else carrying a memref operand raises, so a NEW fusion
+# pattern is caught at compile time rather than as a silent runtime deadlock.
+_LOAD_OPS = {"vector.transfer_read", "affine.vector_load", "vector.load",
+             "memref.load", "affine.load"}
+_STORE_OPS = {"vector.transfer_write", "affine.vector_store", "vector.store",
+              "memref.store", "affine.store"}
+_IGNORE_OPS = {"memref.dealloc"}   # lifetime, not a data access
+
+
+def _is_memref(v):
+    try:
+        return ir.MemRefType.isinstance(v.type)
+    except Exception:
+        return False
+
+
+def _walk_compute_ops(cn):
+    """Every op in the compute node, recursing into nested regions (loop bodies). A
+    fused epilogue (BatchNorm/ReLU) keeps its ops inside an un-unrolled affine.for, so
+    a top-level-only scan (cn.operations) sees just the loop and misses every access."""
+    for top in cn.operations:
+        stack = [top]
+        while stack:
+            op = stack.pop()
+            yield op
+            for region in op.operation.regions:
+                for block in region.blocks:
+                    stack.extend(block.operations)
+
+
+def _rw_buffers_of_compute(cn):
+    """(reads, writes): the @global SRAM buffers a compute node reads/writes, walking
+    nested regions and classifying each op that touches a memref."""
+    reads, writes = set(), set()
+    def rd(v):
+        b = _global_of(v)
+        if b:
+            reads.add(b)
+    def wr(v):
+        b = _global_of(v)
+        if b:
+            writes.add(b)
+    for op in _walk_compute_ops(cn):
+        if any(_is_memref(r) for r in op.results):
+            continue                                   # view/cast/alloc -- address only
+        mrefs = [v for v in op.operands if _is_memref(v)]
+        if not mrefs:
+            continue
+        name = op.name
+        if name in _LOAD_OPS:
+            for v in mrefs:
+                rd(v)
+        elif name in _STORE_OPS:
+            for v in mrefs:
+                wr(v)                                  # the store target memref
+        elif name == "memref.copy":
+            rd(mrefs[0])
+            wr(mrefs[-1])
+        elif name.startswith("linalg."):               # DPS: ins read, outs read+write
+            for v in op.inputs:
+                if _is_memref(v):
+                    rd(v)
+            for v in op.outputs:
+                if _is_memref(v):
+                    rd(v)
+                    wr(v)
+        elif name in _IGNORE_OPS:
+            continue
+        else:
+            raise RuntimeError(
+                f"dep_analysis: unclassified memref op '{name}' in a compute node -- "
+                f"it touches an SRAM buffer; classify it in _LOAD_OPS/_STORE_OPS/_IGNORE_OPS")
+    return reads, writes
 
 
 def _dma_buffer(builder, dma_node):
@@ -99,8 +139,7 @@ def _dma_buffer(builder, dma_node):
 def compute_buffers(cn):
     """(read_buffers, write_buffers) for one compute node, including the virtual
     SA_WEIGHTS edge (preload writes it, matmul reads it)."""
-    reads = set(_read_buffers_of_compute(cn))
-    writes = set(_write_buffers_of_compute(cn))
+    reads, writes = _rw_buffers_of_compute(cn)
     if cn.compute_type == 1:      # MATMUL consumes the preloaded weights
         reads.add(SA_WEIGHTS)
     elif cn.compute_type == 2:    # PRELOAD loads them
@@ -132,10 +171,11 @@ def analyze(module):
         if not cn.operations:
             continue
         ct = {0: "VECTOR", 1: "MATMUL", 2: "PRELOAD"}.get(cn.compute_type, f"c{cn.compute_type}")
+        creads, cwrites = _rw_buffers_of_compute(cn)
         nodes.append({
             "kind": ct,
-            "reads": _read_buffers_of_compute(cn),
-            "writes": _write_buffers_of_compute(cn),
+            "reads": creads,
+            "writes": cwrites,
             "node": cn,
             "compute_type": cn.compute_type,
         })
diff --git a/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py b/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
index f1872dca..d7571d2b 100644
--- a/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
+++ b/PyTorchSimFrontend/mlir/passes/dma_fine_grained.py
@@ -21,6 +21,7 @@
 
 Pipeline entry point: run_fine_grained(in_path, out_path, vectorlane).
 """
+import itertools
 import os
 import sys
 
@@ -383,11 +384,19 @@ def _run_func(func, vectorlane):
     # dominate the nest. Codegen emits input before weight, matching the C++ pass
     # which fuses after the weight subtile loop.
     ip = ir.InsertionPoint(mvin_weight.op)
-    fused_ivs, body_ip = _build_for_nest(bounds, ip)
-    in_ivs = [fused_ivs[fuse["in_to_fused"][d]] for d in range(rank)]
-    w_ivs = [fused_ivs[fuse["w_to_fused"][d]] for d in range(rank)]
-    _emit_dma(mvin_input, in_ivs, vectorlane, body_ip)
-    _emit_dma(mvin_weight, w_ivs, vectorlane, body_ip)
+    # Unroll the fused nest, emitting each distinct input/weight subtile ONCE (a load
+    # is invariant to the other operand's dims, so the cross-product re-emits it
+    # identically). Dedup by the operand's own coords; keep the fused issue order.
+    seen_in, seen_w = set(), set()
+    for it in itertools.product(*[range(b) for b in bounds]):
+        in_key = tuple(it[fuse["in_to_fused"][d]] for d in range(rank))
+        if in_key not in seen_in:
+            seen_in.add(in_key)
+            _emit_dma(mvin_input, [_const_index(c, ip) for c in in_key], vectorlane, ip)
+        w_key = tuple(it[fuse["w_to_fused"][d]] for d in range(rank))
+        if w_key not in seen_w:
+            seen_w.add(w_key)
+            _emit_dma(mvin_weight, [_const_index(c, ip) for c in w_key], vectorlane, ip)
     mvin_input.op.erase()
     mvin_weight.op.erase()
 
diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
index 6ade7442..6bf7d559 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
@@ -57,7 +57,9 @@
 
 #: upstream EmitC conversion pipeline (the infrastructure this pass drives).
 _PIPELINE = ("builtin.module("
+             "convert-vector-to-scf{full-unroll=true},"
              "func.func(lower-affine),"
+             "func.func(lower-vector-multi-reduction),"
              "convert-scf-to-emitc,"
              "convert-arith-to-emitc,"
              "convert-func-to-emitc)")
diff --git a/Simulator/simulator.py b/Simulator/simulator.py
index 2b9f05be..1c0c5cbe 100644
--- a/Simulator/simulator.py
+++ b/Simulator/simulator.py
@@ -560,7 +560,21 @@ def run_standalone(
             os.fsync(trace_file.fileno())
 
         try:
-            cmd = f"{TOGSimulator.get_togsim_command(config_path, togsim_path)} --models_list {trace_file_path}"
+            # The C++ TOG (trace) path is the DEFAULT: drive the simulation from the
+            # emitted trace.so. The legacy ONNX TOG is the opt-in fallback via
+            # TORCHSIM_LEGACY_TOG=1. Each autotune candidate compiles to its own
+            # write_path (keyed by its retiled source), so its trace.so/cycle_table sit
+            # next to its tile_graph.onnx -- benchmark it through the trace path too.
+            # Fall back to legacy only if the .so was not emitted.
+            trace_so = os.path.join(os.path.dirname(str(model_path)), "trace.so")
+            cycle_tsv = os.path.join(os.path.dirname(str(model_path)), "trace_cycles.tsv")
+            base_cmd = TOGSimulator.get_togsim_command(config_path, togsim_path)
+            use_trace = (os.environ.get("TORCHSIM_LEGACY_TOG") != "1"
+                         and os.path.exists(trace_so))
+            if use_trace:
+                cmd = f"{base_cmd} --trace_so {trace_so} --cycle_table {cycle_tsv}"
+            else:
+                cmd = f"{base_cmd} --models_list {trace_file_path}"
             if extension_config.CONFIG_TOGSIM_DEBUG_LEVEL:
                 cmd += f" --log_level {extension_config.CONFIG_TOGSIM_DEBUG_LEVEL}"
 
diff --git a/TOGSim/include/Instruction.h b/TOGSim/include/Instruction.h
index d5e193d4..fa5d4ca1 100644
--- a/TOGSim/include/Instruction.h
+++ b/TOGSim/include/Instruction.h
@@ -132,12 +132,12 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   // bytes this load occupies in the spad (from the tile it moves in).
   size_t sram_footprint() const { return _tile_numel * (_elem_bits / 8); }
 
-  cycle_type start_cycle;
-  cycle_type finish_cycle;
+  cycle_type start_cycle = 0;
+  cycle_type finish_cycle = 0;
   cycle_type bubble_cycle=0;
 
   bool finished=false;
-  int subgraph_id;
+  int subgraph_id = 0;
  private:
   uint64_t _global_inst_id = 0;
   static uint64_t _next_global_inst_id;
@@ -145,18 +145,19 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   void *_owner = nullptr;
   std::list<std::shared_ptr<Instruction>>* _owner_ready_queue_ref = nullptr;
   Opcode opcode;
-  cycle_type compute_cycle;
-  cycle_type overlapping_cycle;
-  size_t ready_counter;
+  cycle_type compute_cycle = 0;
+  cycle_type overlapping_cycle = 0;
+  size_t ready_counter = 0;   // parents not yet finished; the minimal Instruction(Opcode)
+                              // ctor (barriers) relies on this default + inc_ready_counter
   std::set<std::shared_ptr<Instruction>> child_inst;
   std::set<std::shared_ptr<Instruction>> _pipeline_children;  // released at issue (sec 10.7)
   std::vector<size_t> tile_size;
   std::vector<int> tile_stride;
-  size_t _tile_numel;
+  size_t _tile_numel = 0;
   size_t _nr_waiting_request=0;
   bool _got_first_response=false;
   size_t _elem_bits = 0;
-  addr_type dram_addr;
+  addr_type dram_addr = 0;
   uint32_t _numa_id = 0; // For DMA instruction
   int _compute_type = 0;
   std::vector<int64_t> _tag_idx_list;
diff --git a/TOGSim/include/togsim_loader.h b/TOGSim/include/togsim_loader.h
index 6c1273ee..9b326da6 100644
--- a/TOGSim/include/togsim_loader.h
+++ b/TOGSim/include/togsim_loader.h
@@ -54,12 +54,15 @@ struct RunResult {
 // built EmitCtx, and return the recorded trace.
 //   tensor_base[arg_id] : DRAM base address of each kernel tensor argument
 //   cyc[tile_id] / ovl[tile_id] : the cycle table (cycle, overlapping_cycle)
-//   num_cores : dispatch round-robins work-items across this many cores
+//   partition_cores : the core ids of the partition this kernel is enqueued to;
+//                     dispatch round-robins work-items only over THESE cores (a
+//                     kernel stays within its partition -- other partitions are
+//                     independent). Empty/null -> core 0.
 RunResult run_producer(const char* so_path,
                        const int64_t* shape_args, int32_t n_shape,
                        const uint64_t* tensor_base, int32_t n_tensors,
                        const int64_t* cyc, const int64_t* ovl, int32_t n_tiles,
-                       int32_t num_cores);
+                       const int32_t* partition_cores, int32_t n_partition_cores);
 
 // First-order reference timing over a recorded trace, to validate that the
 // stream carries enough to be scheduled (it is NOT the production Core -- no
diff --git a/TOGSim/src/Core.cc b/TOGSim/src/Core.cc
index 25335c9c..980b4fa6 100644
--- a/TOGSim/src/Core.cc
+++ b/TOGSim/src/Core.cc
@@ -56,8 +56,11 @@ void Core::release_sram(const std::shared_ptr<Instruction>& inst) {
 }
 
 bool Core::can_issue(const std::shared_ptr<Tile>& op) {
-  /* Check SRAM is enough to run tile */
-  return _tiles.size() < 4  && !op->is_stonne_tile();
+  /* Bound concurrent dispatches so their combined spad working set fits: with the
+   * global @buffers, each in-flight dispatch piles up its own load versions, and
+   * too many at once overflow the spad (versions never free -> wedge). A limit of
+   * 2 keeps double-buffering overlap while leaving headroom. */
+  return _tiles.size() < 2  && !op->is_stonne_tile();
 }
 
 void Core::issue(std::shared_ptr<Tile> op) {
diff --git a/TOGSim/src/Instruction.cc b/TOGSim/src/Instruction.cc
index d0471226..c5778a28 100644
--- a/TOGSim/src/Instruction.cc
+++ b/TOGSim/src/Instruction.cc
@@ -57,13 +57,18 @@ void Instruction::finish_instruction() {
 }
 
 void Instruction::add_child(std::shared_ptr<Instruction> child) {
-  child->inc_ready_counter();
-  child_inst.insert(child);
+  // child_inst is a set (each child released exactly once at finish), so the
+  // ready_counter must be bumped only when the edge is NEW -- a producer that
+  // writes several buffers a single consumer reads (e.g. a sort tile reading the
+  // value+index buffers its predecessor wrote) links the same pair once per shared
+  // buffer; double-counting would leave ready_counter stuck above 0 -> deadlock.
+  if (child_inst.insert(child).second)
+    child->inc_ready_counter();
 }
 
 void Instruction::add_pipeline_child(std::shared_ptr<Instruction> child) {
-  child->inc_ready_counter();
-  _pipeline_children.insert(child);
+  if (_pipeline_children.insert(child).second)
+    child->inc_ready_counter();
 }
 
 void Instruction::release_pipeline_children() {
diff --git a/TOGSim/src/Simulator.cc b/TOGSim/src/Simulator.cc
index 03dd7bf9..366528ec 100644
--- a/TOGSim/src/Simulator.cc
+++ b/TOGSim/src/Simulator.cc
@@ -186,7 +186,7 @@ void Simulator::icnt_cycle() {
 
 // Consecutive frozen cycles tolerated before declaring the sim wedged (spad too
 // small). Generous so transient idle never false-fires; a true freeze is constant.
-static constexpr uint64_t kWedgeThreshold = 5000;
+static constexpr uint64_t kWedgeThreshold = 100000;
 
 // Frozen-state guard: work remains (running()) but nothing is in flight to
 // advance it -- the SRAM throttle can never satisfy a load because the kernel's
diff --git a/TOGSim/src/main.cc b/TOGSim/src/main.cc
index 8726cfdf..bd570188 100644
--- a/TOGSim/src/main.cc
+++ b/TOGSim/src/main.cc
@@ -1,3 +1,4 @@
+#include <cstdlib>
 #include <fstream>
 #include <chrono>
 #include <filesystem>
@@ -15,13 +16,65 @@ namespace fs = std::filesystem;
 namespace po = boost::program_options;
 
 
+// Run a kernel's compiled trace producer (.so) and bridge it to a TileGraph,
+// targeting `partition_id` (its work-items round-robin only over that partition's
+// cores -- partitions are independent schedulers). The cycle-table TSV gives the
+// per-tile compute latency; a flat stub is used if it is missing. Returns nullptr
+// if the producer run fails. Shared by the standalone --trace_so path and the
+// multi-tenant launchKernel below.
+std::unique_ptr<TileGraph> build_trace_tilegraph(Simulator* simulator,
+                                                 const std::string& trace_so_path,
+                                                 const std::string& cycle_table_path,
+                                                 int partition_id) {
+  const auto& cfg = simulator->get_hardware_config_yaml();
+  int num_cores = cfg["num_cores"] ? cfg["num_cores"].as<int>() : 1;
+  std::vector<int32_t> partition_cores;
+  for (int c = 0; c < num_cores; c++)
+    if (simulator->get_partition_id(c) == partition_id) partition_cores.push_back(c);
+  if (partition_cores.empty()) partition_cores.push_back(0);
+  // First cut: stub tensor bases (real per-tensor addresses come later).
+  std::vector<uint64_t> bases(16);
+  for (size_t i = 0; i < bases.size(); ++i) bases[i] = 0x100000ull * (i + 1);
+  // Cycle table: load the per-tile_id TSV sidecar if present, else a flat stub.
+  std::vector<int64_t> cyc, ovl;
+  std::ifstream ct(cycle_table_path);
+  if (ct.is_open()) {
+    int64_t c, o;
+    while (ct >> c >> o) { cyc.push_back(c); ovl.push_back(o); }
+  }
+  if (cyc.empty()) { cyc.assign(256, 128); ovl.assign(256, 0); }
+  auto run = togsim::run_producer(trace_so_path.c_str(), nullptr, 0,
+                                  bases.data(), (int)bases.size(),
+                                  cyc.data(), ovl.data(), (int)cyc.size(),
+                                  partition_cores.data(), (int32_t)partition_cores.size());
+  if (!run.ok) return nullptr;
+  return trace_to_tilegraph(run, "trace_kernel");
+}
+
 void launchKernel(Simulator* simulator, unsigned int kernel_id, std::string onnx_path, std::string attribute_path, const YAML::Node& config_yaml, cycle_type request_time=0, int partition_id=0, int device_id=0) {
-  auto graph_praser = TileGraphParser(onnx_path, attribute_path, config_yaml);
-  std::unique_ptr<TileGraph>& tile_graph = graph_praser.get_tile_graph();
+  std::unique_ptr<TileGraph> tile_graph;
+  std::string tog_path = onnx_path;  // for the log line
+  // Prefer the C++ trace path: the kernel's trace.so / trace_cycles.tsv sit next to
+  // its tile_graph.onnx (same write_path). This brings the multi-tenant scheduler
+  // onto the new TOG too; opt out with TORCHSIM_LEGACY_TOG=1, and fall back to the
+  // legacy ONNX parser when the .so is absent or fails to run.
+  const char* legacy = std::getenv("TORCHSIM_LEGACY_TOG");
+  std::string dir = fs::path(onnx_path).parent_path().string();
+  std::string trace_so = dir + "/trace.so";
+  std::string cycle_tsv = dir + "/trace_cycles.tsv";
+  if ((!legacy || std::string(legacy) != "1") && fs::exists(trace_so)) {
+    tile_graph = build_trace_tilegraph(simulator, trace_so, cycle_tsv, partition_id);
+    if (tile_graph) tog_path = trace_so;
+    else spdlog::warn("[TOGSim] trace.so run failed for {}; falling back to ONNX", trace_so);
+  }
+  if (!tile_graph) {
+    auto graph_praser = TileGraphParser(onnx_path, attribute_path, config_yaml);
+    tile_graph = std::move(graph_praser.get_tile_graph());
+  }
   tile_graph->set_arrival_time(request_time ? request_time : simulator->get_core_cycle());
   tile_graph->set_kernel_id(kernel_id);
   spdlog::info("[Scheduler {}] Enqueued kernel_id: {}, tog_path: {}, operation: {}, request_time_cycles: {}",
-               partition_id, kernel_id, onnx_path, tile_graph->get_name(), request_time);
+               partition_id, kernel_id, tog_path, tile_graph->get_name(), request_time);
   simulator->enqueue_graph(partition_id, std::move(tile_graph));
 }
 
@@ -159,37 +212,18 @@ int main(int argc, char** argv) {
   std::string trace_so_path;
   cmd_parser.set_if_defined("trace_so", &trace_so_path);
   if (!trace_so_path.empty()) {
-    const auto& cfg = simulator->get_hardware_config_yaml();
-    int num_cores = cfg["num_cores"] ? cfg["num_cores"].as<int>() : 1;
-    // First cut: stub tensor bases (real per-tensor addresses come later).
-    std::vector<uint64_t> bases(16);
-    for (size_t i = 0; i < bases.size(); ++i) bases[i] = 0x100000ull * (i + 1);
-    // Cycle table: load the per-tile_id TSV sidecar if given, else a flat stub.
-    std::vector<int64_t> cyc, ovl;
+    // Standalone single-kernel trace run: enqueue to partition 0 (its work-items
+    // round-robin over partition 0's cores only; see build_trace_tilegraph).
     std::string cycle_table_path;
     cmd_parser.set_if_defined("cycle_table", &cycle_table_path);
-    if (!cycle_table_path.empty()) {
-      std::ifstream ct(cycle_table_path);
-      if (!ct.is_open()) { spdlog::error("[TOGSim] cannot open cycle_table {}", cycle_table_path); exit(1); }
-      int64_t c, o;
-      while (ct >> c >> o) { cyc.push_back(c); ovl.push_back(o); }
-      spdlog::info("[TOGSim-trace] loaded cycle table: {} tiles from {}", cyc.size(), cycle_table_path);
-    } else {
-      cyc.assign(256, 128);
-      ovl.assign(256, 0);
-    }
-    auto run = togsim::run_producer(trace_so_path.c_str(), nullptr, 0,
-                                    bases.data(), (int)bases.size(),
-                                    cyc.data(), ovl.data(), (int)cyc.size(),
-                                    num_cores);
-    if (!run.ok) { spdlog::error("[TOGSim] trace producer run failed"); exit(1); }
-    spdlog::info("[TOGSim-trace] recorded {} instructions", run.trace.size());
-    auto tg = trace_to_tilegraph(run, "trace_kernel");
+    auto tg = build_trace_tilegraph(simulator, trace_so_path, cycle_table_path, 0);
+    if (!tg) { spdlog::error("[TOGSim] trace producer run failed"); exit(1); }
     tg->set_arrival_time(simulator->get_core_cycle());
     tg->set_kernel_id(0);
     simulator->enqueue_graph(0, std::move(tg));
     simulator->run_simulator();
     spdlog::info("[TOGSim-trace] Total cycles: {}", simulator->get_core_cycle());
+    spdlog::info("Simulation finished");
     simulator->print_core_stat();
     return 0;
   }
diff --git a/TOGSim/src/togsim_runtime.cc b/TOGSim/src/togsim_runtime.cc
index 86de081e..d15d92b9 100644
--- a/TOGSim/src/togsim_runtime.cc
+++ b/TOGSim/src/togsim_runtime.cc
@@ -31,9 +31,9 @@ struct EmitCtx {
   const int64_t*  cyc = nullptr;   // tile_id -> cycle
   const int64_t*  ovl = nullptr;   // tile_id -> overlapping_cycle
   int32_t         n_tiles = 0;
-  int32_t         num_cores = 1;
+  std::vector<int32_t> cores{0};   // the partition's core ids; dispatch round-robins over these
   // mutable run state
-  int32_t  rr = 0;            // round-robin core cursor
+  int32_t  rr = 0;            // round-robin cursor into `cores`
   int32_t  cur_core = -1;     // current work-item's core
   std::vector<togsim::TraceRec> trace;
 };
@@ -52,12 +52,13 @@ extern "C" {
 int32_t togsim_abi_version(void) { return TOGSIM_ABI_VERSION; }
 
 void togsim_dispatch(EmitCtx* ctx, togsim_tile_fn fn, int64_t* iv, int32_t n_iv) {
-  // Higher-order work-item wrapper (sec 9.3): round-robin a core (the producer
-  // never sees num_cores), bracket the work-item with TILE_BEGIN/TILE_END, and
-  // run its body. The work-item SCOPE is exactly this fn call -- the begin/end
-  // are runtime-owned, so the producer never relies on an implicit "ops until
-  // the next alloc" boundary. The ops fn emits record under ctx->cur_core.
-  ctx->cur_core = ctx->num_cores > 0 ? (ctx->rr++ % ctx->num_cores) : 0;
+  // Higher-order work-item wrapper (sec 9.3): round-robin over THIS partition's
+  // cores (a kernel is enqueued to one partition; partitions are independent, so
+  // a work-item must never land on another partition's core -- that subgraph would
+  // sit in this partition's scheduler forever). Bracket the work-item with
+  // TILE_BEGIN/TILE_END; the ops fn emits records under ctx->cur_core.
+  ctx->cur_core = ctx->cores.empty() ? 0
+                : ctx->cores[ctx->rr++ % (int32_t)ctx->cores.size()];
   ctx->trace.push_back(blank(togsim::TraceRec::TILE_BEGIN, ctx->cur_core));
   fn(ctx, iv, n_iv);
   ctx->trace.push_back(blank(togsim::TraceRec::TILE_END, ctx->cur_core));
@@ -119,7 +120,7 @@ RunResult run_producer(const char* so_path,
                        const int64_t* shape_args, int32_t n_shape,
                        const uint64_t* tensor_base, int32_t n_tensors,
                        const int64_t* cyc, const int64_t* ovl, int32_t n_tiles,
-                       int32_t num_cores) {
+                       const int32_t* partition_cores, int32_t n_partition_cores) {
   RunResult res;
   void* lib = dlopen(so_path, RTLD_NOW | RTLD_GLOBAL);
   if (!lib) { fprintf(stderr, "togsim: dlopen failed: %s\n", dlerror()); return res; }
@@ -129,7 +130,8 @@ RunResult run_producer(const char* so_path,
   EmitCtx ctx;
   ctx.tensor_base = tensor_base; ctx.n_tensors = n_tensors;
   ctx.cyc = cyc; ctx.ovl = ovl; ctx.n_tiles = n_tiles;
-  ctx.num_cores = num_cores > 0 ? num_cores : 1;
+  if (partition_cores && n_partition_cores > 0) ctx.cores.assign(partition_cores, partition_cores + n_partition_cores);
+  if (ctx.cores.empty()) ctx.cores.push_back(0);
   emit(&ctx, (int64_t*)shape_args, n_shape);
 
   res.ok = true;
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
index e13af2d7..3a75945c 100644
--- a/TOGSim/src/togsim_trace_bridge.cc
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -98,6 +98,14 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
   // the reduction body (no cross-iteration prefetch). Scoped per work-item.
   std::map<std::pair<int32_t, uint64_t>,
            std::pair<int64_t, std::shared_ptr<Instruction>>> current_dma;
+  // Dedup identical dma_waits: the barrier already built for the CURRENT load of a
+  // (tag_id, tag_slot). A later memory_barrier on the SAME load instance reuses it
+  // (its consumers gate on the existing bar) instead of re-emitting -- a conv reads one
+  // loaded subtile from many matmuls, so the fine-grained per-consumer waits collapse to
+  // one per load. A new load (next reduction iter) bumps uniq, so a genuine new wait
+  // still gets its own bar; the first wait stays at its consumer, so overlap is kept.
+  std::map<std::pair<int32_t, uint64_t>,
+           std::pair<int64_t, std::shared_ptr<Instruction>>> bar_for_load;
   int64_t next_tag = 0;   // mints a unique Core tag key per dma record
   int cur_tile_group = -1;   // work-item index, bumped per TILE_BEGIN (trace grouping)
   // Async compute (matmul/preload): issued and pipelined on the systolic array;
@@ -118,6 +126,7 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     tile.reset();
     last_writer.clear();
     current_dma.clear();
+    bar_for_load.clear();
     next_tag = 0;
     outstanding_async.clear();
     pending_bar.reset();
@@ -138,6 +147,17 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
                   const std::vector<int64_t>& reads,
                   const std::vector<int64_t>& writes) {
     for (int64_t b : reads) {
+      // A matmul reading its own accumulator (a buffer it also WRITES) imposes NO
+      // producer order: the partial sums of Y += X@W can accumulate in any order.
+      // Chaining matmuls through the accumulator (M_k <- M_{k-1}) needlessly
+      // serializes them and DEADLOCKS the SA weight-slot pipeline -- a later
+      // iteration's preload can grab the last weight slot while the in-order head
+      // matmul is starved of one, so that head can never run to release a slot. The
+      // store still waits on every matmul via the COMPUTE_BAR fence, so dropping this
+      // edge is safe (TOGSim is timing-only; values come from the recorded trace).
+      bool is_accum = false;
+      for (int64_t w : writes) if (w == b) { is_accum = true; break; }
+      if (inst->get_compute_type() == MATMUL_CT && is_accum) continue;
       auto it = last_writer.find(b);
       if (it == last_writer.end()) continue;
       int pct = it->second->get_compute_type();
@@ -249,11 +269,19 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
       int64_t uniq = next_tag++;                         // fallback if unpaired
       std::shared_ptr<Instruction> dma_inst;
       if (it != current_dma.end()) { uniq = it->second.first; dma_inst = it->second.second; }
+      // Identical wait (same slot, same load instance) already has a barrier -> reuse it
+      // so the buffer's consumers gate on it, instead of emitting a redundant barrier.
+      auto bf = bar_for_load.find({t.tag_id, t.tag_slot});
+      if (bf != bar_for_load.end() && bf->second.first == uniq) {
+        for (int64_t b : t.write_bufs) last_writer[b] = bf->second.second;
+        continue;
+      }
       auto bar = make_mem_bar(t, uniq);
       bar->set_tile_group(cur_tile_group);
       if (dma_inst) dma_inst->add_child(bar);
       tile->append_instuction(bar);
       for (int64_t b : t.write_bufs) last_writer[b] = bar;
+      bar_for_load[{t.tag_id, t.tag_slot}] = {uniq, bar};
     } else if (t.kind == TraceRec::COMPUTE) {
       auto inst = make_compute(t);
       inst->set_tile_group(cur_tile_group);

From 76a2862bc11533846c14e0891dab19ecbc17ccd4 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 08/24] [TOGSim] Make the trace runtime test self-contained

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 tests/fixtures/gemm256_postvcix.mlir | 419 +++++++++++++++++++++++++++
 tests/test_togsim_runtime.py         |  24 +-
 2 files changed, 435 insertions(+), 8 deletions(-)
 create mode 100644 tests/fixtures/gemm256_postvcix.mlir
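
Reviewer note on the fixture's indexing, offered as a minimal sketch rather than code from the repository: the inner loops address the 256x256 scratchpads by building a linear offset with `#map5` (`d0 * s0 + d1 * s1 + d2`, called with `s0 = 256`, `s1 = 128`) and splitting it into a row and a column with `#map6` (`floordiv`) and `#map7` (`mod`). The names `SpadIndex`, `linear_offset`, and `split_offset` below are hypothetical, chosen for illustration only; just the arithmetic mirrors the fixture.

```cpp
#include <cstdint>

// Hypothetical names; only the arithmetic mirrors the fixture's affine maps.
struct SpadIndex {
  int64_t row;
  int64_t col;
};

// #map5: (d0, d1, d2)[s0, s1] -> d0 * s0 + d1 * s1 + d2
constexpr int64_t linear_offset(int64_t d0, int64_t d1, int64_t d2,
                                int64_t s0, int64_t s1) {
  return d0 * s0 + d1 * s1 + d2;
}

// #map6 / #map7: split a linear offset into (off floordiv len, off mod len).
// Assumes off >= 0, which holds for every offset the fixture produces.
constexpr SpadIndex split_offset(int64_t off, int64_t row_len) {
  return SpadIndex{off / row_len, off % row_len};
}

// Outer iv = 1, inner iv = 1, chunk offset = 120: the last 8-wide read of a block.
constexpr SpadIndex kLast = split_offset(linear_offset(1, 1, 120, 256, 128), 256);
static_assert(kLast.row == 1 && kLast.col == 248, "unexpected scratchpad index");
```

For the non-negative offsets in this fixture, C++ `/` and `%` agree with affine `floordiv` and `mod`; negative offsets would need explicit floor semantics instead.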

diff --git a/tests/fixtures/gemm256_postvcix.mlir b/tests/fixtures/gemm256_postvcix.mlir
new file mode 100644
index 00000000..740e8ab2
--- /dev/null
+++ b/tests/fixtures/gemm256_postvcix.mlir
@@ -0,0 +1,419 @@
+#map = affine_map<(d0, d1) -> (d0 * 256 + d1)>
+#map1 = affine_map<(d0, d1) -> (d0 * 65536 + d1 * 256)>
+#map2 = affine_map<(d0, d1) -> (d0 + d1)>
+#map3 = affine_map<(d0, d1) -> (d0 * 256 + d1 * 512)>
+#map4 = affine_map<(d0, d1, d2) -> (-d0 + d1 + d2 floordiv 2)>
+#map5 = affine_map<(d0, d1, d2)[s0, s1] -> (d0 * s0 + d1 * s1 + d2)>
+#map6 = affine_map<(d0)[s0] -> (d0 floordiv s0)>
+#map7 = affine_map<(d0)[s0] -> (d0 mod s0)>
+#map8 = affine_map<(d0, d1, d2) -> (-d0 + d1 * 2 + d2)>
+module {
+  memref.global @X_spad : memref<256x256xf32, 1>
+  memref.global @W_spad : memref<256x256xf32, 1>
+  memref.global @Y_spad : memref<256x256xf32, 1>
+  func.func @kernel(%arg0: memref<65536xf32>, %arg1: memref<65536xf32>, %arg2: memref<65536xf32>) {
+    %0 = memref.get_global @X_spad : memref<256x256xf32, 1>
+    %1 = memref.get_global @W_spad : memref<256x256xf32, 1>
+    %2 = memref.get_global @Y_spad : memref<256x256xf32, 1>
+    %cst = arith.constant dense<0.000000e+00> : vector<512xf32>
+    %c0 = arith.constant 0 : index
+    %c1 = arith.constant 1 : index
+    %c3 = arith.constant 3 : index
+    %c2 = arith.constant 2 : index
+    %alloc = memref.alloc() : memref<1xi32>
+    affine.for %arg3 = 0 to 256 step 256 {
+      affine.for %arg4 = 0 to 256 step 256 {
+        affine.vector_store %cst, %2[0, 0] : memref<256x256xf32, 1>, vector<512xf32>
+        affine.for %arg5 = 0 to 256 step 256 {
+          %4 = affine.apply #map(%arg3, %arg5)
+          %c1_1 = arith.constant 1 : index
+          %alloc_2 = memref.alloc() : memref<1xi32>
+          %5 = affine.apply #map(%arg5, %arg4)
+          %c1_3 = arith.constant 1 : index
+          %alloc_4 = memref.alloc() : memref<1xi32>
+          %c0_5 = arith.constant 0 : index
+          %c0_6 = arith.constant 0 : index
+          %c0_7 = arith.constant 0 : index
+          %6 = affine.apply #map1(%c0_5, %c0_6)
+          %7 = affine.apply #map2(%6, %4)
+          %8 = affine.apply #map3(%c0_5, %c0_6)
+          %9 = affine.apply #map2(%c0_5, %c0_6)
+          memref.dma_start %arg0[%7], %0[%c0_7, %8], %c2, %alloc_2[%9], %c1_1, %c1 : memref<65536xf32>, memref<256x256xf32, 1>, memref<1xi32> {async = true, dram_stride = [256, 1], fine_grained = true, sram_stride = [1, 256], subtile_size = [256, 256]}
+          %c0_8 = arith.constant 0 : index
+          %c0_9 = arith.constant 0 : index
+          %c0_10 = arith.constant 0 : index
+          %10 = affine.apply #map1(%c0_8, %c0_9)
+          %11 = affine.apply #map2(%10, %5)
+          %12 = affine.apply #map3(%c0_8, %c0_9)
+          %13 = affine.apply #map2(%c0_8, %c0_9)
+          memref.dma_start %arg1[%11], %1[%c0_10, %12], %c2, %alloc_4[%13], %c1_3, %c1 : memref<65536xf32>, memref<256x256xf32, 1>, memref<1xi32> {async = true, dram_stride = [256, 1], fine_grained = true, sram_stride = [1, 256], subtile_size = [256, 256]}
+          %c0_11 = arith.constant 0 : index
+          %c8_i64 = arith.constant 8 : i64
+          %c256 = arith.constant 256 : index
+          %c256_12 = arith.constant 256 : index
+          %c256_13 = arith.constant 256 : index
+          %c128 = arith.constant 128 : index
+          %c1_14 = arith.constant 1 : index
+          %cst_15 = arith.constant 0.000000e+00 : f32
+          affine.for %arg6 = 0 to 2 {
+            affine.for %arg7 = 0 to 2 {
+              %14 = affine.apply #map4(%arg5, %c0_11, %c0_11)
+              memref.dma_wait %alloc_4[%14], %c1_14 : memref<1xi32>
+              %c0_16 = arith.constant 0 : index
+              %c128_17 = arith.constant 128 : index
+              %15 = affine.apply #map5(%arg6, %arg7, %c0_16)[%c256, %c128_17]
+              %16 = affine.apply #map6(%15)[%c256_12]
+              %17 = affine.apply #map7(%15)[%c256_12]
+              %18 = vector.transfer_read %1[%16, %17], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%18, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c8 = arith.constant 8 : index
+              %c128_18 = arith.constant 128 : index
+              %19 = affine.apply #map5(%arg6, %arg7, %c8)[%c256, %c128_18]
+              %20 = affine.apply #map6(%19)[%c256_12]
+              %21 = affine.apply #map7(%19)[%c256_12]
+              %22 = vector.transfer_read %1[%20, %21], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%22, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c16 = arith.constant 16 : index
+              %c128_19 = arith.constant 128 : index
+              %23 = affine.apply #map5(%arg6, %arg7, %c16)[%c256, %c128_19]
+              %24 = affine.apply #map6(%23)[%c256_12]
+              %25 = affine.apply #map7(%23)[%c256_12]
+              %26 = vector.transfer_read %1[%24, %25], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%26, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c24 = arith.constant 24 : index
+              %c128_20 = arith.constant 128 : index
+              %27 = affine.apply #map5(%arg6, %arg7, %c24)[%c256, %c128_20]
+              %28 = affine.apply #map6(%27)[%c256_12]
+              %29 = affine.apply #map7(%27)[%c256_12]
+              %30 = vector.transfer_read %1[%28, %29], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%30, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c32 = arith.constant 32 : index
+              %c128_21 = arith.constant 128 : index
+              %31 = affine.apply #map5(%arg6, %arg7, %c32)[%c256, %c128_21]
+              %32 = affine.apply #map6(%31)[%c256_12]
+              %33 = affine.apply #map7(%31)[%c256_12]
+              %34 = vector.transfer_read %1[%32, %33], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%34, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c40 = arith.constant 40 : index
+              %c128_22 = arith.constant 128 : index
+              %35 = affine.apply #map5(%arg6, %arg7, %c40)[%c256, %c128_22]
+              %36 = affine.apply #map6(%35)[%c256_12]
+              %37 = affine.apply #map7(%35)[%c256_12]
+              %38 = vector.transfer_read %1[%36, %37], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%38, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c48 = arith.constant 48 : index
+              %c128_23 = arith.constant 128 : index
+              %39 = affine.apply #map5(%arg6, %arg7, %c48)[%c256, %c128_23]
+              %40 = affine.apply #map6(%39)[%c256_12]
+              %41 = affine.apply #map7(%39)[%c256_12]
+              %42 = vector.transfer_read %1[%40, %41], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%42, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c56 = arith.constant 56 : index
+              %c128_24 = arith.constant 128 : index
+              %43 = affine.apply #map5(%arg6, %arg7, %c56)[%c256, %c128_24]
+              %44 = affine.apply #map6(%43)[%c256_12]
+              %45 = affine.apply #map7(%43)[%c256_12]
+              %46 = vector.transfer_read %1[%44, %45], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%46, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c64 = arith.constant 64 : index
+              %c128_25 = arith.constant 128 : index
+              %47 = affine.apply #map5(%arg6, %arg7, %c64)[%c256, %c128_25]
+              %48 = affine.apply #map6(%47)[%c256_12]
+              %49 = affine.apply #map7(%47)[%c256_12]
+              %50 = vector.transfer_read %1[%48, %49], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%50, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c72 = arith.constant 72 : index
+              %c128_26 = arith.constant 128 : index
+              %51 = affine.apply #map5(%arg6, %arg7, %c72)[%c256, %c128_26]
+              %52 = affine.apply #map6(%51)[%c256_12]
+              %53 = affine.apply #map7(%51)[%c256_12]
+              %54 = vector.transfer_read %1[%52, %53], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%54, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c80 = arith.constant 80 : index
+              %c128_27 = arith.constant 128 : index
+              %55 = affine.apply #map5(%arg6, %arg7, %c80)[%c256, %c128_27]
+              %56 = affine.apply #map6(%55)[%c256_12]
+              %57 = affine.apply #map7(%55)[%c256_12]
+              %58 = vector.transfer_read %1[%56, %57], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%58, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c88 = arith.constant 88 : index
+              %c128_28 = arith.constant 128 : index
+              %59 = affine.apply #map5(%arg6, %arg7, %c88)[%c256, %c128_28]
+              %60 = affine.apply #map6(%59)[%c256_12]
+              %61 = affine.apply #map7(%59)[%c256_12]
+              %62 = vector.transfer_read %1[%60, %61], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%62, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c96 = arith.constant 96 : index
+              %c128_29 = arith.constant 128 : index
+              %63 = affine.apply #map5(%arg6, %arg7, %c96)[%c256, %c128_29]
+              %64 = affine.apply #map6(%63)[%c256_12]
+              %65 = affine.apply #map7(%63)[%c256_12]
+              %66 = vector.transfer_read %1[%64, %65], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%66, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c104 = arith.constant 104 : index
+              %c128_30 = arith.constant 128 : index
+              %67 = affine.apply #map5(%arg6, %arg7, %c104)[%c256, %c128_30]
+              %68 = affine.apply #map6(%67)[%c256_12]
+              %69 = affine.apply #map7(%67)[%c256_12]
+              %70 = vector.transfer_read %1[%68, %69], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%70, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c112 = arith.constant 112 : index
+              %c128_31 = arith.constant 128 : index
+              %71 = affine.apply #map5(%arg6, %arg7, %c112)[%c256, %c128_31]
+              %72 = affine.apply #map6(%71)[%c256_12]
+              %73 = affine.apply #map7(%71)[%c256_12]
+              %74 = vector.transfer_read %1[%72, %73], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%74, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              %c120 = arith.constant 120 : index
+              %c128_32 = arith.constant 128 : index
+              %75 = affine.apply #map5(%arg6, %arg7, %c120)[%c256, %c128_32]
+              %76 = affine.apply #map6(%75)[%c256_12]
+              %77 = affine.apply #map7(%75)[%c256_12]
+              %78 = vector.transfer_read %1[%76, %77], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+              "vcix.iv"(%78, %c8_i64) {imm = 0 : i64, opcode = 1 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+              affine.for %arg8 = 0 to 2 {
+                %79 = affine.apply #map8(%arg5, %c0_11, %c0_11)
+                memref.dma_wait %alloc_2[%79], %c1_14 : memref<1xi32>
+                %c0_33 = arith.constant 0 : index
+                %80 = affine.apply #map5(%arg7, %arg8, %c0_33)[%c256_13, %c128]
+                %81 = affine.apply #map6(%80)[%c256]
+                %82 = affine.apply #map7(%80)[%c256]
+                %83 = vector.transfer_read %0[%81, %82], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%83, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c8_34 = arith.constant 8 : index
+                %84 = affine.apply #map5(%arg7, %arg8, %c8_34)[%c256_13, %c128]
+                %85 = affine.apply #map6(%84)[%c256]
+                %86 = affine.apply #map7(%84)[%c256]
+                %87 = vector.transfer_read %0[%85, %86], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%87, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c16_35 = arith.constant 16 : index
+                %88 = affine.apply #map5(%arg7, %arg8, %c16_35)[%c256_13, %c128]
+                %89 = affine.apply #map6(%88)[%c256]
+                %90 = affine.apply #map7(%88)[%c256]
+                %91 = vector.transfer_read %0[%89, %90], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%91, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c24_36 = arith.constant 24 : index
+                %92 = affine.apply #map5(%arg7, %arg8, %c24_36)[%c256_13, %c128]
+                %93 = affine.apply #map6(%92)[%c256]
+                %94 = affine.apply #map7(%92)[%c256]
+                %95 = vector.transfer_read %0[%93, %94], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%95, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c32_37 = arith.constant 32 : index
+                %96 = affine.apply #map5(%arg7, %arg8, %c32_37)[%c256_13, %c128]
+                %97 = affine.apply #map6(%96)[%c256]
+                %98 = affine.apply #map7(%96)[%c256]
+                %99 = vector.transfer_read %0[%97, %98], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%99, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c40_38 = arith.constant 40 : index
+                %100 = affine.apply #map5(%arg7, %arg8, %c40_38)[%c256_13, %c128]
+                %101 = affine.apply #map6(%100)[%c256]
+                %102 = affine.apply #map7(%100)[%c256]
+                %103 = vector.transfer_read %0[%101, %102], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%103, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c48_39 = arith.constant 48 : index
+                %104 = affine.apply #map5(%arg7, %arg8, %c48_39)[%c256_13, %c128]
+                %105 = affine.apply #map6(%104)[%c256]
+                %106 = affine.apply #map7(%104)[%c256]
+                %107 = vector.transfer_read %0[%105, %106], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%107, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c56_40 = arith.constant 56 : index
+                %108 = affine.apply #map5(%arg7, %arg8, %c56_40)[%c256_13, %c128]
+                %109 = affine.apply #map6(%108)[%c256]
+                %110 = affine.apply #map7(%108)[%c256]
+                %111 = vector.transfer_read %0[%109, %110], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%111, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c64_41 = arith.constant 64 : index
+                %112 = affine.apply #map5(%arg7, %arg8, %c64_41)[%c256_13, %c128]
+                %113 = affine.apply #map6(%112)[%c256]
+                %114 = affine.apply #map7(%112)[%c256]
+                %115 = vector.transfer_read %0[%113, %114], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%115, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c72_42 = arith.constant 72 : index
+                %116 = affine.apply #map5(%arg7, %arg8, %c72_42)[%c256_13, %c128]
+                %117 = affine.apply #map6(%116)[%c256]
+                %118 = affine.apply #map7(%116)[%c256]
+                %119 = vector.transfer_read %0[%117, %118], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%119, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c80_43 = arith.constant 80 : index
+                %120 = affine.apply #map5(%arg7, %arg8, %c80_43)[%c256_13, %c128]
+                %121 = affine.apply #map6(%120)[%c256]
+                %122 = affine.apply #map7(%120)[%c256]
+                %123 = vector.transfer_read %0[%121, %122], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%123, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c88_44 = arith.constant 88 : index
+                %124 = affine.apply #map5(%arg7, %arg8, %c88_44)[%c256_13, %c128]
+                %125 = affine.apply #map6(%124)[%c256]
+                %126 = affine.apply #map7(%124)[%c256]
+                %127 = vector.transfer_read %0[%125, %126], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%127, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c96_45 = arith.constant 96 : index
+                %128 = affine.apply #map5(%arg7, %arg8, %c96_45)[%c256_13, %c128]
+                %129 = affine.apply #map6(%128)[%c256]
+                %130 = affine.apply #map7(%128)[%c256]
+                %131 = vector.transfer_read %0[%129, %130], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%131, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c104_46 = arith.constant 104 : index
+                %132 = affine.apply #map5(%arg7, %arg8, %c104_46)[%c256_13, %c128]
+                %133 = affine.apply #map6(%132)[%c256]
+                %134 = affine.apply #map7(%132)[%c256]
+                %135 = vector.transfer_read %0[%133, %134], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%135, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c112_47 = arith.constant 112 : index
+                %136 = affine.apply #map5(%arg7, %arg8, %c112_47)[%c256_13, %c128]
+                %137 = affine.apply #map6(%136)[%c256]
+                %138 = affine.apply #map7(%136)[%c256]
+                %139 = vector.transfer_read %0[%137, %138], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%139, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                %c120_48 = arith.constant 120 : index
+                %140 = affine.apply #map5(%arg7, %arg8, %c120_48)[%c256_13, %c128]
+                %141 = affine.apply #map6(%140)[%c256]
+                %142 = affine.apply #map7(%140)[%c256]
+                %143 = vector.transfer_read %0[%141, %142], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                "vcix.iv"(%143, %c8_i64) {imm = 0 : i64, opcode = 0 : i64, rd = 0 : i64} : (vector<8xf32>, i64) -> ()
+                "vcix.i"(%c8_i64) {imm = 4 : i64, lmul = 0 : i64, opcode = 1 : i64, rd = 0 : i64, rs2 = 0 : i64, sew = 32 : i64} : (i64) -> ()
+                %c0_49 = arith.constant 0 : index
+                %144 = affine.apply #map5(%arg6, %arg8, %c0_49)[%c256_13, %c128]
+                %145 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %146 = affine.apply #map6(%144)[%c256_12]
+                %147 = affine.apply #map7(%144)[%c256_12]
+                %148 = vector.transfer_read %2[%146, %147], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %149 = arith.addf %148, %145 : vector<8xf32>
+                vector.transfer_write %149, %2[%146, %147] : vector<8xf32>, memref<256x256xf32, 1>
+                %c8_50 = arith.constant 8 : index
+                %150 = affine.apply #map5(%arg6, %arg8, %c8_50)[%c256_13, %c128]
+                %151 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %152 = affine.apply #map6(%150)[%c256_12]
+                %153 = affine.apply #map7(%150)[%c256_12]
+                %154 = vector.transfer_read %2[%152, %153], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %155 = arith.addf %154, %151 : vector<8xf32>
+                vector.transfer_write %155, %2[%152, %153] : vector<8xf32>, memref<256x256xf32, 1>
+                %c16_51 = arith.constant 16 : index
+                %156 = affine.apply #map5(%arg6, %arg8, %c16_51)[%c256_13, %c128]
+                %157 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %158 = affine.apply #map6(%156)[%c256_12]
+                %159 = affine.apply #map7(%156)[%c256_12]
+                %160 = vector.transfer_read %2[%158, %159], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %161 = arith.addf %160, %157 : vector<8xf32>
+                vector.transfer_write %161, %2[%158, %159] : vector<8xf32>, memref<256x256xf32, 1>
+                %c24_52 = arith.constant 24 : index
+                %162 = affine.apply #map5(%arg6, %arg8, %c24_52)[%c256_13, %c128]
+                %163 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %164 = affine.apply #map6(%162)[%c256_12]
+                %165 = affine.apply #map7(%162)[%c256_12]
+                %166 = vector.transfer_read %2[%164, %165], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %167 = arith.addf %166, %163 : vector<8xf32>
+                vector.transfer_write %167, %2[%164, %165] : vector<8xf32>, memref<256x256xf32, 1>
+                %c32_53 = arith.constant 32 : index
+                %168 = affine.apply #map5(%arg6, %arg8, %c32_53)[%c256_13, %c128]
+                %169 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %170 = affine.apply #map6(%168)[%c256_12]
+                %171 = affine.apply #map7(%168)[%c256_12]
+                %172 = vector.transfer_read %2[%170, %171], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %173 = arith.addf %172, %169 : vector<8xf32>
+                vector.transfer_write %173, %2[%170, %171] : vector<8xf32>, memref<256x256xf32, 1>
+                %c40_54 = arith.constant 40 : index
+                %174 = affine.apply #map5(%arg6, %arg8, %c40_54)[%c256_13, %c128]
+                %175 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %176 = affine.apply #map6(%174)[%c256_12]
+                %177 = affine.apply #map7(%174)[%c256_12]
+                %178 = vector.transfer_read %2[%176, %177], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %179 = arith.addf %178, %175 : vector<8xf32>
+                vector.transfer_write %179, %2[%176, %177] : vector<8xf32>, memref<256x256xf32, 1>
+                %c48_55 = arith.constant 48 : index
+                %180 = affine.apply #map5(%arg6, %arg8, %c48_55)[%c256_13, %c128]
+                %181 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %182 = affine.apply #map6(%180)[%c256_12]
+                %183 = affine.apply #map7(%180)[%c256_12]
+                %184 = vector.transfer_read %2[%182, %183], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %185 = arith.addf %184, %181 : vector<8xf32>
+                vector.transfer_write %185, %2[%182, %183] : vector<8xf32>, memref<256x256xf32, 1>
+                %c56_56 = arith.constant 56 : index
+                %186 = affine.apply #map5(%arg6, %arg8, %c56_56)[%c256_13, %c128]
+                %187 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %188 = affine.apply #map6(%186)[%c256_12]
+                %189 = affine.apply #map7(%186)[%c256_12]
+                %190 = vector.transfer_read %2[%188, %189], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %191 = arith.addf %190, %187 : vector<8xf32>
+                vector.transfer_write %191, %2[%188, %189] : vector<8xf32>, memref<256x256xf32, 1>
+                %c64_57 = arith.constant 64 : index
+                %192 = affine.apply #map5(%arg6, %arg8, %c64_57)[%c256_13, %c128]
+                %193 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %194 = affine.apply #map6(%192)[%c256_12]
+                %195 = affine.apply #map7(%192)[%c256_12]
+                %196 = vector.transfer_read %2[%194, %195], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %197 = arith.addf %196, %193 : vector<8xf32>
+                vector.transfer_write %197, %2[%194, %195] : vector<8xf32>, memref<256x256xf32, 1>
+                %c72_58 = arith.constant 72 : index
+                %198 = affine.apply #map5(%arg6, %arg8, %c72_58)[%c256_13, %c128]
+                %199 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %200 = affine.apply #map6(%198)[%c256_12]
+                %201 = affine.apply #map7(%198)[%c256_12]
+                %202 = vector.transfer_read %2[%200, %201], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %203 = arith.addf %202, %199 : vector<8xf32>
+                vector.transfer_write %203, %2[%200, %201] : vector<8xf32>, memref<256x256xf32, 1>
+                %c80_59 = arith.constant 80 : index
+                %204 = affine.apply #map5(%arg6, %arg8, %c80_59)[%c256_13, %c128]
+                %205 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %206 = affine.apply #map6(%204)[%c256_12]
+                %207 = affine.apply #map7(%204)[%c256_12]
+                %208 = vector.transfer_read %2[%206, %207], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %209 = arith.addf %208, %205 : vector<8xf32>
+                vector.transfer_write %209, %2[%206, %207] : vector<8xf32>, memref<256x256xf32, 1>
+                %c88_60 = arith.constant 88 : index
+                %210 = affine.apply #map5(%arg6, %arg8, %c88_60)[%c256_13, %c128]
+                %211 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %212 = affine.apply #map6(%210)[%c256_12]
+                %213 = affine.apply #map7(%210)[%c256_12]
+                %214 = vector.transfer_read %2[%212, %213], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %215 = arith.addf %214, %211 : vector<8xf32>
+                vector.transfer_write %215, %2[%212, %213] : vector<8xf32>, memref<256x256xf32, 1>
+                %c96_61 = arith.constant 96 : index
+                %216 = affine.apply #map5(%arg6, %arg8, %c96_61)[%c256_13, %c128]
+                %217 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %218 = affine.apply #map6(%216)[%c256_12]
+                %219 = affine.apply #map7(%216)[%c256_12]
+                %220 = vector.transfer_read %2[%218, %219], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %221 = arith.addf %220, %217 : vector<8xf32>
+                vector.transfer_write %221, %2[%218, %219] : vector<8xf32>, memref<256x256xf32, 1>
+                %c104_62 = arith.constant 104 : index
+                %222 = affine.apply #map5(%arg6, %arg8, %c104_62)[%c256_13, %c128]
+                %223 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %224 = affine.apply #map6(%222)[%c256_12]
+                %225 = affine.apply #map7(%222)[%c256_12]
+                %226 = vector.transfer_read %2[%224, %225], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %227 = arith.addf %226, %223 : vector<8xf32>
+                vector.transfer_write %227, %2[%224, %225] : vector<8xf32>, memref<256x256xf32, 1>
+                %c112_63 = arith.constant 112 : index
+                %228 = affine.apply #map5(%arg6, %arg8, %c112_63)[%c256_13, %c128]
+                %229 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %230 = affine.apply #map6(%228)[%c256_12]
+                %231 = affine.apply #map7(%228)[%c256_12]
+                %232 = vector.transfer_read %2[%230, %231], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %233 = arith.addf %232, %229 : vector<8xf32>
+                vector.transfer_write %233, %2[%230, %231] : vector<8xf32>, memref<256x256xf32, 1>
+                %c120_64 = arith.constant 120 : index
+                %234 = affine.apply #map5(%arg6, %arg8, %c120_64)[%c256_13, %c128]
+                %235 = "vcix.v.i"(%c8_i64) {imm = 0 : i64, opcode = 2 : i64, rs2 = 0 : i64} : (i64) -> vector<8xf32>
+                %236 = affine.apply #map6(%234)[%c256_12]
+                %237 = affine.apply #map7(%234)[%c256_12]
+                %238 = vector.transfer_read %2[%236, %237], %cst_15 : memref<256x256xf32, 1>, vector<8xf32>
+                %239 = arith.addf %238, %235 : vector<8xf32>
+                vector.transfer_write %239, %2[%236, %237] : vector<8xf32>, memref<256x256xf32, 1>
+              } {inner_loop = true}
+            } {inner_loop = true}
+          } {inner_loop = true}
+        } {accumulation_loop = true, subtile_loop = "k"}
+        affine.for %arg5 = 0 to 1 {
+        } {inner_loop = false}
+        %3 = affine.apply #map(%arg3, %arg4)
+        %c1_0 = arith.constant 1 : index
+        memref.dma_start %2[%c0, %c0], %arg2[%3], %c3, %alloc[%c0], %c1_0, %c1 : memref<256x256xf32, 1>, memref<65536xf32>, memref<1xi32> {dram_stride = [256, 1], padding = 0 : i64, sram_stride = [1, 256]}
+      } {outer_loop = true, subtile_loop = "n"}
+    } {outer_loop = true, subtile_loop = "m"}
+    return
+  }
+  func.func @wrapper_kernel(%arg0: memref<65536xf32>, %arg1: memref<65536xf32>, %arg2: memref<65536xf32>) {
+    call @kernel(%arg0, %arg1, %arg2) : (memref<65536xf32>, memref<65536xf32>, memref<65536xf32>) -> ()
+    return
+  }
+}
diff --git a/tests/test_togsim_runtime.py b/tests/test_togsim_runtime.py
index f17bccef..a5d6cb3d 100644
--- a/tests/test_togsim_runtime.py
+++ b/tests/test_togsim_runtime.py
@@ -5,8 +5,9 @@
 DRAM addresses are resolved (base[arg_id] + offset*elem_bytes), compute cycles
 are looked up from the cycle table, and every wait gets a handle a dma minted.
 
-Skipped unless the MLIR bindings, `mlir-translate`, a C++ compiler, and a
-post-vcix `.mlir` fixture (`TOGSIM_SKELETON_FIXTURE`) are available.
+Uses a checked-in post-vcix `.mlir` fixture (tests/fixtures/), so it is
+self-contained; skipped only when the MLIR bindings, `mlir-translate`, or a C++
+compiler are missing.
 """
 import importlib.util
 import os
@@ -36,11 +37,16 @@ def _tools_ready():
             and _RUNTIME.is_file())
 
 
+# Checked-in post-vcix kernel: a 256^3 single-output-tile GEMM (X/W/Y_spad
+# 256x256), matching the trace assertions below. Self-contained so the test
+# runs wherever the tools are present -- no setup/env needed.
+_FIXTURE = pathlib.Path(__file__).resolve().parent / "fixtures" / "gemm256_postvcix.mlir"
+
+
 def _fixture():
-    fix = os.environ.get("TOGSIM_SKELETON_FIXTURE")
-    if not fix or not os.path.isfile(fix):
-        pytest.skip("set TOGSIM_SKELETON_FIXTURE to a post-vcix kernel .mlir")
-    return fix
+    if not _FIXTURE.is_file():
+        pytest.skip(f"missing checked-in fixture {_FIXTURE}")
+    return str(_FIXTURE)
 
 
 # Drives the loader with known tensor bases + a synthetic cycle table, then
@@ -57,7 +63,8 @@ def _fixture():
   uint64_t bases[3] = {0x1000, 0x2000, 0x3000};
   int64_t  cyc[3]   = {100, 200, 300};
   int64_t  ovl[3]   = {0, 200, 172};
-  RunResult r = run_producer(argv[1], nullptr, 0, bases, 3, cyc, ovl, 3, 1);
+  int32_t  pcores[1] = {0};  // round-robin work-items over core 0 (single-core harness)
+  RunResult r = run_producer(argv[1], nullptr, 0, bases, 3, cyc, ovl, 3, pcores, 1);
   if (!r.ok) { printf("run failed\n"); return 2; }
   int ndisp=0, nd=0, nc=0, nm=0, fail=0;
   std::vector<uint64_t> dma_a; std::vector<int> dma_arg, dma_dir;
@@ -136,7 +143,8 @@ def test_runtime_loads_and_records():
   uint64_t bases[3] = {0x1000, 0x2000, 0x3000};
   int64_t  cyc[3]   = {100, 200, 300};
   int64_t  ovl[3]   = {0, 200, 172};
-  RunResult r = run_producer(argv[1], nullptr, 0, bases, 3, cyc, ovl, 3, 1);
+  int32_t  pcores[1] = {0};  // round-robin work-items over core 0 (single-core harness)
+  RunResult r = run_producer(argv[1], nullptr, 0, bases, 3, cyc, ovl, 3, pcores, 1);
   if (!r.ok) { printf("run failed\n"); return 2; }
   TimingParams p; p.dma_latency = 100;
   SimResult s = simulate(r, p);

From 4558e65226d2c62b7857d4513400368e7ccd3cc5 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 22:35:51 +0900
Subject: [PATCH 09/24] [Frontend] Trace cache-safe replay and compile-race
 fixes

Carry simulator headers through the wrapper for cache-safe replay; drop verbose
[P3-trace] logs; fix the key.mlir compile race in load().
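
For reference, the atomic-write idea behind the race fix can be sketched with the
standard library alone. This is an illustration under assumptions, not the
implementation of torch's write_atomic: the helper name write_atomic_sketch is
hypothetical, and it only shows the write-to-temp-then-rename pattern that keeps
a concurrent reader from seeing a truncated file.

```python
import os
import tempfile


def write_atomic_sketch(path, text):
    """Write `text` to `path` so readers see either the old or the new file.

    Hypothetical helper for illustration; not torch's write_atomic.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the final rename stays
    # on one filesystem (os.replace is only atomic within a filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        # Atomically swap the finished file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens the path at any moment gets a complete file, which is the
property mlir-opt needs when two compiles share a write_path.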

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 PyTorchSimFrontend/extension_codecache.py     | 94 ++++++++++---------
 PyTorchSimFrontend/mlir/mlir_autotune.py      |  7 +-
 .../mlir/mlir_codegen_backend.py              | 16 +---
 PyTorchSimFrontend/mlir/mlir_scheduling.py    |  5 +
 PyTorchSimFrontend/mlir/mlir_template.py      | 18 +---
 PyTorchSimFrontend/mlir/passes/__init__.py    |  8 +-
 6 files changed, 74 insertions(+), 74 deletions(-)

diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index e573d1a5..c57f9c51 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -5,7 +5,7 @@
 import torch
 
 from PyTorchSimFrontend import extension_config
-from torch._inductor.codecache import get_hash, write
+from torch._inductor.codecache import get_hash, write, write_atomic
 from torch._inductor.async_compile import AsyncCompile
 from AsmParser.tog_generator import tog_generator
 from PyTorchSimFrontend.mlir.mlir_caller_codegen import MLIRKernelCallerCodeGen
@@ -23,6 +23,13 @@ def get_write_path(src_code):
     return os.path.join(extension_config.get_dump_path(), hash_prefix(get_hash(src_code.strip())))
 
 
+_HEADER_BY_HASH = {}
+def store_header(src_code, spike_header, gem5_header):
+    _HEADER_BY_HASH[get_hash(src_code.strip())] = (spike_header, gem5_header)
+def get_header(src_code):
+    return _HEADER_BY_HASH.get(get_hash(src_code.strip()))
+
+
 def get_lock_path(write_path):
     """Return lock file path for the given write_path (per-source_code lock)."""
     return os.path.join(write_path, ".compile.lock")
@@ -128,40 +135,52 @@ def load(cls, source_code,
         vlen = kwargs['vlen']
         vlenb = vlen // 8
         write_path = get_write_path(source_code)
-        key, input_path = write(source_code, "mlir", specified_dir=write_path)
-        # Run the Python out-of-line MLIR passes (MLIR bindings) on the kernel
-        # .mlir in place, before mlir-opt. Currently lowers torchsim.vlane_idx
-        # (replaces the old C++ -global-idx pass); add more in passes/__init__.py.
+        os.makedirs(write_path, exist_ok=True)
+        global_var_header = kwargs.get("global_var_header")
+        if global_var_header is not None:
+            write_atomic(os.path.join(write_path, "global_var.h"), global_var_header)
+        gem5_global_var_header = kwargs.get("gem5_global_var_header")
+        if gem5_global_var_header is not None:
+            write_atomic(os.path.join(write_path, "gem5_global_var.h"), gem5_global_var_header)
+        # The compile rewrites the kernel .mlir in place (run_python_passes) and reads
+        # it back (mlir-opt). Two compiles of the same source -- the autotune's chosen
+        # candidate and the final kernel -- share a write_path, so hold the per-path
+        # lock across the whole build to keep them from interleaving, and skip the
+        # rebuild when a prior build already finished (its tile_graph.onnx exists).
+        from filelock import FileLock
         from PyTorchSimFrontend.mlir.passes import (
             run_python_passes, run_module_passes, POST_OPT_PASSES,
             run_standard_lowering, run_tog,
         )
-        run_python_passes(input_path, vectorlane=vectorlane_size)
-        new_input_path = os.path.splitext(input_path)[0]
-        raw_tog_path = new_input_path + "_tog.py"
         tog_path = os.path.join(write_path, "tile_graph.onnx")
-        sample_mlir_path = new_input_path + "_sample"
-        validation_binary_path = os.path.join(write_path, validation_binary_name)
-        gem5_cmds = mlir_gem5_compile_command(new_input_path, sample_mlir_path, raw_tog_path, vectorlane_size)
-
-        from filelock import FileLock
-        os.makedirs(write_path, exist_ok=True)
         lock = FileLock(get_lock_path(write_path), timeout=LOCK_TIMEOUT)
-
-        if spad_info is not None:
-            link_option = f"-Wl,--section-start=.spad=0x{spad_info['spad_vaddr']:x}"
-        else:
-            link_option = ""
-        # Generate LLVM kernel calller and binary for validation
-        if extension_config.pytorchsim_functional_mode:
-            # Use custom malloc to avoid size error
-            new_link_option = link_option + " -Wl,--wrap=malloc -Wl,--wrap=free"
-            cmds = mlir_compile_command(new_input_path, vectorlane_size, vlen=vlen)
-            opt_pad_cmd = shlex.split(cmds[0])
-            translate_cmd = shlex.split(cmds[1])
-            llc_cmd = shlex.split(cmds[2])
-            llc_asm_cmd = shlex.split(cmds[3])
-            with lock:
+        with lock:
+            key, input_path = write(source_code, "mlir", specified_dir=write_path)
+            if os.path.isfile(tog_path):
+                return key
+            # Run the Python out-of-line MLIR passes (MLIR bindings) on the kernel
+            # .mlir in place, before mlir-opt. Currently lowers torchsim.vlane_idx
+            # (replaces the old C++ -global-idx pass); add more in passes/__init__.py.
+            run_python_passes(input_path, vectorlane=vectorlane_size)
+            new_input_path = os.path.splitext(input_path)[0]
+            raw_tog_path = new_input_path + "_tog.py"
+            sample_mlir_path = new_input_path + "_sample"
+            validation_binary_path = os.path.join(write_path, validation_binary_name)
+            gem5_cmds = mlir_gem5_compile_command(new_input_path, sample_mlir_path, raw_tog_path, vectorlane_size)
+
+            if spad_info is not None:
+                link_option = f"-Wl,--section-start=.spad=0x{spad_info['spad_vaddr']:x}"
+            else:
+                link_option = ""
+            # Generate LLVM kernel caller and binary for validation
+            if extension_config.pytorchsim_functional_mode:
+                # Use custom malloc to avoid size error
+                new_link_option = link_option + " -Wl,--wrap=malloc -Wl,--wrap=free"
+                cmds = mlir_compile_command(new_input_path, vectorlane_size, vlen=vlen)
+                opt_pad_cmd = shlex.split(cmds[0])
+                translate_cmd = shlex.split(cmds[1])
+                llc_cmd = shlex.split(cmds[2])
+                llc_asm_cmd = shlex.split(cmds[3])
                 try:
                     # loop-padding (mlir-opt) -> Python fine-grained + vcix (one parse/print)
                     subprocess.check_call(opt_pad_cmd)
@@ -195,17 +214,11 @@ def load(cls, source_code,
                     )
                     raise SpadOverflowError()
 
-        # Skip if TOG file already exists
-        if os.path.isfile(tog_path):
-            return key
+            # Launch tile graph generator
+            gem5_pad_cmd = shlex.split(gem5_cmds[0])
+            gem5_translate_cmd = shlex.split(gem5_cmds[1])
+            gem5_llc_cmd = shlex.split(gem5_cmds[2])
 
-        # Launch tile graph generator
-        gem5_pad_cmd = shlex.split(gem5_cmds[0])
-        gem5_translate_cmd = shlex.split(gem5_cmds[1])
-        gem5_llc_cmd = shlex.split(gem5_cmds[2])
-
-        lock = FileLock(get_lock_path(write_path), timeout=LOCK_TIMEOUT)
-        with lock:
             try:
                 # mlir-opt now runs only loop-padding/dma-fine-grained/pytorchsim-to-vcix
                 # and writes the post-vcix IR. The tile-operation-graph pass is ported
@@ -291,12 +304,9 @@ def load(cls, source_code,
                         _cl = list(cycle_list_for_trace)
                         if _cl and len(_cl) != _ntiles:
                             _cl = (_cl + [_cl[-1]] * _ntiles)[:_ntiles]
-                        logger.info(f"[P3-trace] cycle_list={cycle_list_for_trace} -> {_cl} "
-                                    f"(#tiles={_ntiles}, x_off={x_offset}, w_off={w_offset})")
                         _tbl = _ct.build_cycle_table(_mod, _cl, x_offset, w_offset)
                     _ct.dump_cycle_table_tsv(_tbl, os.path.join(write_path, "trace_cycles.tsv"))
                     _l2e.build_trace_so(pv, os.path.join(write_path, "trace.so"))
-                    logger.info(f"[P3-trace] wrote trace.so + trace_cycles.tsv in {write_path}")
                 except Exception as e:
                     logger.warning(f"[P3-trace] trace .so/sidecar dump skipped: {e}")
         return key
diff --git a/PyTorchSimFrontend/mlir/mlir_autotune.py b/PyTorchSimFrontend/mlir/mlir_autotune.py
index 396396f3..e4876b5b 100644
--- a/PyTorchSimFrontend/mlir/mlir_autotune.py
+++ b/PyTorchSimFrontend/mlir/mlir_autotune.py
@@ -54,7 +54,7 @@ def __str__(self) -> str:
     def make_run_fn(
         self, input_tensors: torch.Tensor, output_tensors: torch.Tensor
     ) -> Callable[[], None]:
-        from PyTorchSimFrontend.extension_codecache import CustomAsyncCompile
+        from PyTorchSimFrontend.extension_codecache import CustomAsyncCompile, get_header
         custom_async_compile = CustomAsyncCompile()
 
         # Check already cached result.
@@ -80,12 +80,15 @@ def cached_run_fn(*args, autotune_subprocess_timeout_sec=None, **kwargs):
                 return cached_run_fn
 
         # Run a candidate code
+        _headers = get_header(self.source_code)
+        _header_kwargs = {} if _headers is None else {
+            "global_var_header": _headers[0], "gem5_global_var_header": _headers[1]}
         run_method = custom_async_compile.mlir(
             self.source_code, vectorlane_size=self.extra_args["vector_lane"],
             loop_size=self.extra_args["loop_size"], spad_info=self.extra_args["spad_info"],
             vlen=self.extra_args["vlen"], arg_attributes=self.extra_args["arg_attributes"],
             origins=self.extra_args["origins"], silent_mode=True,
-            autotune=self.extra_args['autotune'])
+            autotune=self.extra_args['autotune'], **_header_kwargs)
 
         args = [
             tensor
diff --git a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
index 1ee62f36..8f695395 100644
--- a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
+++ b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
@@ -17,7 +17,6 @@
 from torch._inductor.codegen import cpp, wrapper, common, memory_planning
 from torch._inductor.ir import GraphPartitionSignature
 from torch._inductor.virtualized import V, _ops as ops
-from torch._inductor.codecache import write_atomic
 from torch._inductor.utils import (
     IndentedBuffer,
     is_welford_reduction,
@@ -1130,22 +1129,13 @@ def codegen_nodes(self, nodes, kernel_name):
         return src_code, meta_code
 
     def _prepare_simulator_headers(self, src_code):
-        from filelock import FileLock
-
-        write_path = extension_codecache.get_write_path(src_code)
-        os.makedirs(write_path, exist_ok=True)
-
-        spike_write_path = os.path.join(write_path, "global_var.h")
-        gem5_write_path = os.path.join(write_path, "gem5_global_var.h")
-
         spad_end_symbol = "int spad_end[0] __attribute__ ((section(\".spad\")));\n"
         spad_section_end_symbol = (
             f"int spad_section_end[0] __attribute__ ((section(\".spad\"), aligned({self.spad_info['spad_size']*self.vector_lane})));"
         )
-        lock = FileLock(extension_codecache.get_lock_path(write_path), timeout=extension_codecache.LOCK_TIMEOUT)
-        with lock:
-            write_atomic(spike_write_path, self.header.getvalue() + spad_end_symbol + spad_section_end_symbol)
-            write_atomic(gem5_write_path, self.gem5_header.getvalue())
+        spike_content = self.header.getvalue() + spad_end_symbol + spad_section_end_symbol
+        gem5_content = self.gem5_header.getvalue()
+        extension_codecache.store_header(src_code, spike_content, gem5_content)
 
     def get_arg_info(self, name):
         arg_info = dict()
diff --git a/PyTorchSimFrontend/mlir/mlir_scheduling.py b/PyTorchSimFrontend/mlir/mlir_scheduling.py
index 41ec61af..8520596c 100644
--- a/PyTorchSimFrontend/mlir/mlir_scheduling.py
+++ b/PyTorchSimFrontend/mlir/mlir_scheduling.py
@@ -5,6 +5,7 @@
 import operator
 from sympy import symbols, sympify
 from PyTorchSimFrontend import extension_config
+from PyTorchSimFrontend import extension_codecache
 from PyTorchSimFrontend.mlir.mlir_codegen_backend import MLIRKernel
 
 from torch.utils._ordered_set import OrderedSet
@@ -333,6 +334,10 @@ def define_kernel(self, src_code, meta_code, kernel_name, vector_lane, spad_info
             codecache_def.writeline(f"spad_info={spad_info},")
             codecache_def.writeline(f"origins={origins},")
             codecache_def.writeline(f"arg_attributes={meta_code},")
+            headers = extension_codecache.get_header(src_code)
+            if headers is not None:
+                codecache_def.writeline(f"global_var_header='''{headers[0]}''',")
+                codecache_def.writeline(f"gem5_global_var_header='''{headers[1]}''',")
             codecache_def.writeline(f"vlen={extension_config.vpu_vector_length_bits})")
             wrapper.define_kernel(kernel_name, codecache_def.getvalue(), gpu=False)
         return kernel_name
diff --git a/PyTorchSimFrontend/mlir/mlir_template.py b/PyTorchSimFrontend/mlir/mlir_template.py
index 529a49b5..2b8a0676 100644
--- a/PyTorchSimFrontend/mlir/mlir_template.py
+++ b/PyTorchSimFrontend/mlir/mlir_template.py
@@ -21,7 +21,6 @@
 from torch._inductor.autotune_process import TensorMeta
 from torch._inductor.virtualized import V, NullHandler, _ops as ops
 from torch._inductor.utils import IndentedBuffer
-from torch._inductor.codecache import write_atomic
 
 import PyTorchSimFrontend.extension_codecache as extension_codecache
 from PyTorchSimFrontend.mlir.mlir_autotune import MLIRBenchmarkRequest
@@ -613,22 +612,11 @@ def codegen_nodes(self, tile_candidates, render, template_node, prologue_nodes,
         return src_code, meta_code
 
     def _prepare_simulator_headers(self, src_code):
-        from filelock import FileLock
-
         spad_end_symbol = f"int spad_end[0] __attribute__ ((section(\".spad\")));\n"
         spad_section_end_symbol = f"int spad_section_end[0] __attribute__ ((section(\".spad\"), aligned({self.spad_info['spad_size']*self.vector_lane})));"
-
-        write_path = extension_codecache.get_write_path(src_code)
-        os.makedirs(write_path, exist_ok=True)
-        spike_write_path = os.path.join(write_path, "global_var.h")
-        gem5_write_path = os.path.join(write_path, "gem5_global_var.h")
-
-        lock = FileLock(extension_codecache.get_lock_path(write_path), timeout=extension_codecache.LOCK_TIMEOUT)
-        with lock:
-            if not os.path.exists(spike_write_path):
-                write_atomic(spike_write_path, self.header.getvalue()+spad_end_symbol+spad_section_end_symbol)
-            if not os.path.exists(gem5_write_path):
-                write_atomic(gem5_write_path, self.gem5_header.getvalue())
+        spike_content = self.header.getvalue()+spad_end_symbol+spad_section_end_symbol
+        gem5_content = self.gem5_header.getvalue()
+        extension_codecache.store_header(src_code, spike_content, gem5_content)
 
     def codegen_prologue_body(self):
         body = IndentedBuffer()
diff --git a/PyTorchSimFrontend/mlir/passes/__init__.py b/PyTorchSimFrontend/mlir/passes/__init__.py
index 82cadc2f..ab3cdcd3 100644
--- a/PyTorchSimFrontend/mlir/passes/__init__.py
+++ b/PyTorchSimFrontend/mlir/passes/__init__.py
@@ -76,8 +76,12 @@ def run_module_passes(in_path, out_path, passes, **opts):
             p.run(module, **opts)
         out = str(module)
 
-    with open(out_path, "w") as f:
-        f.write(out)
+    # Atomic write: run_python_passes rewrites the kernel .mlir in place outside
+    # load()'s FileLock, so a concurrent compile of the same source must never see a
+    # truncated file -- mlir-opt would parse it to an empty module and silently drop
+    # the kernel (-> undefined reference to wrapper_kernel at link).
+    from torch._inductor.codecache import write_atomic
+    write_atomic(out_path, out)
     return True
 
 

From 90339455bda2907371dbb026451ca7af3cb47e1d Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 20:36:13 +0900
Subject: [PATCH 10/24] [TOGSim] Redesign trace-bridge dependency, barrier,
 SRAM-version, and runtime model

Replace the trace bridge's accumulated special cases with one dataflow rule and
clean up the runtime that consumes it.

Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all
current writers (occupancy=ISSUE when both are systolic-array ops, else
latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a
matmul that reads and writes the same buffer = a commutative accumulator): skip
its read edge and UNION its write, waiting only on the non-matmul init seed and
not ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked
the SA weight-slot pipeline while keeping the init->matmul edge, and lets a
vector epilogue or the store wait on every K matmul (fixes the pure-vector store
that an empty COMPUTE_BAR let slip).
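
The rule above can be sketched as follows. This is a minimal model under
assumptions, not the bridge's actual code: the Op record, the kind labels
("matmul", "preload", "vector", "dma"), SA_KINDS, and build_edges are all
hypothetical names, and the sketch assumes the init seed stays in the writers
set after a UNION.

```python
from dataclasses import dataclass

ISSUE, DONE = "ISSUE", "DONE"
SA_KINDS = {"matmul", "preload"}  # assumed set of systolic-array op kinds


@dataclass(frozen=True)
class Op:
    name: str
    kind: str          # hypothetical labels: "matmul", "preload", "vector", "dma"
    reads: tuple = ()
    writes: tuple = ()


def build_edges(ops):
    """Return (consumer, producer, kind) edges from per-buffer writer sets."""
    writers = {}  # buffer -> set of op indices that currently write it
    edges = []
    for i, op in enumerate(ops):
        # is_mm_accum: a matmul that reads and writes the same buffer.
        accum = set(op.reads) & set(op.writes) if op.kind == "matmul" else set()
        for buf in op.reads:
            for w in sorted(writers.get(buf, ())):
                if buf in accum:
                    if ops[w].kind == "matmul":
                        continue  # co-matmuls are not ordered
                    edges.append((i, w, DONE))  # wait on the init seed only
                else:
                    both_sa = op.kind in SA_KINDS and ops[w].kind in SA_KINDS
                    edges.append((i, w, ISSUE if both_sa else DONE))
        for buf in op.writes:
            if buf in accum:
                writers.setdefault(buf, set()).add(i)  # UNION
            else:
                writers[buf] = {i}                     # REPLACE
    return edges


ops = [
    Op("init", "vector", writes=("Y",)),
    Op("preload", "preload", writes=("Wslot",)),
    Op("mm0", "matmul", reads=("X", "Wslot", "Y"), writes=("Y",)),
    Op("mm1", "matmul", reads=("X", "Wslot", "Y"), writes=("Y",)),
    Op("store", "dma", reads=("Y",)),
]
edges = build_edges(ops)
```

In this example both matmuls wait on init (DONE) and on the preload (ISSUE) but
not on each other, and the store joins init, mm0, and mm1.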

Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA
drain), so the store JOINs the matmul writers directly. The whole emit/loader
chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the
runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a
stale producer fails loudly instead of emitting records the bridge would drop.
Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue).

Model compute-output spad footprint in the SRAM version/capacity machinery so
buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes
from the DMA records that touch the same buffer (a buf_bytes pre-pass); an
in-place buffer (accumulator, relu) is version-transparent so footprint is not
double-counted. The occupy gate and version release sit at the MOVIN/MOVOUT/COMP
issue points (the release runs before the COMP skip path, so a skipped matmul
still frees the tiles it read).
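
A rough model of the occupy gate, for illustration. `SpadModel` is a
hypothetical stand-in for the Core's spad bookkeeping. It assumes a version is
freed in one step, whereas the real release is refcounted per consumer, and it
treats capacity 0 as "throttle disabled" as the Core does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

class SpadModel {
 public:
  explicit SpadModel(size_t capacity_bytes) : capacity_(capacity_bytes) {}

  // Occupy `bytes` under buffer-version `alloc` at issue time.
  // Returns false when the bytes do not fit; the caller stalls and retries.
  bool try_occupy(int64_t alloc, size_t bytes) {
    if (capacity_ == 0 || alloc < 0) return true;  // untracked instruction
    if (used_ + bytes > capacity_) return false;   // would overflow -> stall
    used_ += bytes;
    allocs_[alloc] += bytes;                       // accumulate the version
    return true;
  }

  // Free a whole version (assumption: called once its last reader issued).
  void release(int64_t alloc) {
    auto it = allocs_.find(alloc);
    if (it == allocs_.end()) return;
    used_ -= it->second;
    allocs_.erase(it);
  }

  size_t used() const { return used_; }

 private:
  size_t capacity_;
  size_t used_ = 0;
  std::map<int64_t, size_t> allocs_;  // version id -> bytes held
};
```

A second 60-byte version in a 100-byte spad stalls until the first is released,
which is the sense in which buffer reuse is capacity-modeled rather than a hard
edge.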

Runtime: collapse child_inst / _pipeline_children into one event-indexed
_deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot
release queue and the async-load wakeup into one _due_events timed-effect table
drained by process_due_events. Both are behavior-preserving (byte-identical).
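
The timed-effect table can be pictured roughly as below. This is a simplified
sketch, not the Core's code: `TimedEffects` is a hypothetical name, and it
stores callbacks, whereas the real table stores a tagged DueAction
(FreeWeightSlot or WakeBar). Unlike the real loop, it erases an entry before
running it, so an effect could schedule further effects safely.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

using cycle_type = uint64_t;

class TimedEffects {
 public:
  // Register `effect` to run once the core cycle reaches `when`.
  void schedule(cycle_type when, std::function<void()> effect) {
    due_.emplace(when, std::move(effect));
  }

  // Run and drop every effect due at or before `now`; returns how many ran.
  int drain(cycle_type now) {
    int ran = 0;
    while (!due_.empty() && due_.begin()->first <= now) {
      auto effect = std::move(due_.begin()->second);
      due_.erase(due_.begin());
      effect();
      ++ran;
    }
    return ran;
  }

 private:
  // Ordered by cycle; equal cycles keep insertion order (std::multimap).
  std::multimap<cycle_type, std::function<void()>> due_;
};
```

Calling `drain(_core_cycle)` once per cycle before dispatch would play the role
of process_due_events: a slot release scheduled for a matmul's streaming-end and
a barrier wakeup scheduled for the current cycle both go through the same table.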

Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at
init), and the round-robin disable mode is removed. Degenerate traces (a
consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather
than asserts that vanish under NDEBUG.

Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so
TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in
warns at command build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
---
 CLAUDE.md                                     |   1 +
 PyTorchSimFrontend/extension_codecache.py     |   6 +-
 .../mlir/passes/build_skeleton.py             |  18 +-
 .../mlir/passes/lower_to_emitc.py             |   8 -
 PyTorchSimFrontend/mlir/passes/togsim_ops.py  |   6 +-
 README.md                                     |   3 +-
 Simulator/simulator.py                        |   4 +-
 TOGSim/include/Core.h                         |  22 +-
 TOGSim/include/Instruction.h                  |  59 +++--
 TOGSim/include/togsim_loader.h                |   2 +-
 TOGSim/include/togsim_runtime.h               |   5 -
 TOGSim/src/Core.cc                            | 137 ++++++------
 TOGSim/src/Instruction.cc                     |  29 +--
 TOGSim/src/TileGraphParser.cc                 |   6 +-
 TOGSim/src/main.cc                            |   8 +-
 TOGSim/src/togsim_runtime.cc                  |   7 +-
 TOGSim/src/togsim_trace_bridge.cc             | 210 ++++++++++--------
 17 files changed, 255 insertions(+), 276 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 12d48082..5a3a47cd 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -85,6 +85,7 @@ Note: `TOGSIM_CONFIG` is **overwritten** while inside a `with TOGSimulator(confi
 Located under `configs/*.yml`:
 
 - `num_cores`, `core_freq_mhz`, `num_systolic_array_per_core`
+- `sa_weight_buffer_depth` (per-SA resident weight slots; **must be > 0** — the simulator errors on 0. Raise it to effectively disable the preload run-ahead throttle. Defaults to 2 if the key is absent.)
 - `vpu_num_lanes`, `vpu_spad_size_kb_per_lane`, `vpu_vector_length_bits`
 - `dram_type` (`ramulator2` | `simple`), `dram_channels`, `dram_freq_mhz`, `ramulator_config_path`
 - `icnt_type` (`simple` | `booksim`), `icnt_latency_cycles`, `icnt_freq_mhz`, `icnt_config_path`
diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index c57f9c51..785a3d95 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -285,9 +285,9 @@ def load(cls, source_code,
 
             # Trace pipeline (DEFAULT): emit the compiled trace producer .so + the
             # cycle-table TSV from the post-vcix IR and gem5 cycle_list/offsets. This
-            # is the default simulation path (the C++ TOG); the legacy ONNX TOG is the
-            # opt-in fallback via TORCHSIM_LEGACY_TOG=1, in which case the .so is unused
-            # so skip emitting it. Best-effort: never breaks the compile.
+            # is the default simulation path (the C++ TOG); the legacy ONNX TOG is
+            # DEPRECATED, an opt-in fallback via TORCHSIM_LEGACY_TOG=1, in which case the
+            # .so is unused so skip emitting it. Best-effort: never breaks the compile.
             if os.environ.get("TORCHSIM_LEGACY_TOG") != "1":
                 try:
                     import mlir.ir as ir
diff --git a/PyTorchSimFrontend/mlir/passes/build_skeleton.py b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
index 6e27cf45..4c3d89cb 100644
--- a/PyTorchSimFrontend/mlir/passes/build_skeleton.py
+++ b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
@@ -53,7 +53,7 @@
 _KEEP = {
     "affine.for", "scf.for", "scf.while",
     "affine.yield", "scf.yield", "func.return",
-    ts.DMA, ts.COMPUTE, ts.COMPUTE_BAR, ts.MEMORY_BAR,
+    ts.DMA, ts.COMPUTE, ts.MEMORY_BAR,
 }
 
 
@@ -117,20 +117,6 @@ def _emit_dma(ctx, dma_node, tag_id, dram_index, tag_index, read_bufs, write_buf
     )
 
 
-def _emit_compute_bar(ctx, anchor_op):
-    """Insert a `togsim.compute_barrier` before `anchor_op` -- the fence that
-    drains in-flight async compute (the systolic-array matmuls) before a store
-    consumes their result (sec 10.7).
-
-    FIXME: this is the one barrier still synthesized here rather than read from
-    the IR. Like the async-load memory barrier (now mapped 1:1 from the explicit
-    dma_wait), the compute fence should eventually appear explicitly in the input
-    MLIR and be mapped through, not auto-inserted -- no surprising insertion."""
-    ir.Operation.create(
-        ts.COMPUTE_BAR, results=[], operands=[], attributes={},
-        loc=ir.Location.unknown(ctx), ip=ir.InsertionPoint(anchor_op))
-
-
 def _emit_memory_bar(ctx, anchor_op, tag_id, tag_index, write_bufs):
     """Insert a `togsim.memory_barrier` before `anchor_op` -- the explicit
     async-DMA sync that was the original `memref.dma_wait`. It pairs with its
@@ -456,8 +442,6 @@ def _emit_one_dma(ctx, op, node, builder, bufs, tags):
     read_bufs = [spad_id] if node.is_write else []
     write_bufs = [] if node.is_write else [spad_id]
     tag_id = tags.bind(_value_key(f["tag"]), spad_id)
-    if node.is_write:
-        _emit_compute_bar(ctx, op)   # FIXME(sec10.7): auto-inserted; should be explicit in the IR.
     _emit_dma(ctx, node, tag_id, dram_index, tag_index, read_bufs, write_bufs)
 
 
diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
index 6bf7d559..3d1f7cde 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
@@ -429,14 +429,6 @@ def _rewrite_togsim_ops(ctx, kernel, ctx_val):
                    i32(0), _opaque(ctx, "nullptr"),
                    _arr(ctx, _rb), i32(len(_rb)), _arr(ctx, _wb), i32(len(_wb))])
             victims.append(op)
-        elif name == ts.COMPUTE_BAR:
-            # explicit compute fence -> togsim_compute_barrier(ctx) (sec 10.7).
-            ir.Operation.create(
-                "emitc.call_opaque", results=[], operands=[ctx_val],
-                attributes={"callee": ir.StringAttr.get(ts.EMITC_CALLEE[ts.COMPUTE_BAR]),
-                            "args": ir.ArrayAttr.get([_idx(0)])},
-                loc=ir.Location.unknown(ctx), ip=ipo)
-            victims.append(op)
     for op in victims:
         op.operation.erase()
 
diff --git a/PyTorchSimFrontend/mlir/passes/togsim_ops.py b/PyTorchSimFrontend/mlir/passes/togsim_ops.py
index 740a8f2f..21983da0 100644
--- a/PyTorchSimFrontend/mlir/passes/togsim_ops.py
+++ b/PyTorchSimFrontend/mlir/passes/togsim_ops.py
@@ -32,8 +32,6 @@
             tag_id = i32, write_bufs = [..]             tag_id, tag_slot, write_bufs)
          } : (index) -> ()
 
-    "togsim.compute_barrier"() : () -> ()       -> togsim_compute_barrier(ctx)
-
 How an async dma pairs with its sync point: NOT by a compile-time id. One static
 `togsim.dma` op runs once per loop iteration, each with a different RUNTIME tag
 slot `%tag[%idx]`, so the pairing must be a runtime key. `togsim.dma` carries a
@@ -52,17 +50,15 @@
 # ---- op names -------------------------------------------------------------
 DMA    = "togsim.dma"
 COMPUTE = "togsim.compute"
-COMPUTE_BAR = "togsim.compute_barrier"  # fence: drain async compute before a consumer (sec 10.7)
 MEMORY_BAR = "togsim.memory_barrier"    # explicit async-DMA sync (the original dma_wait); tag-keyed
 
 #: every op this module owns (for matchers / DCE roots in C2).
-OP_NAMES = (DMA, COMPUTE, COMPUTE_BAR, MEMORY_BAR)
+OP_NAMES = (DMA, COMPUTE, MEMORY_BAR)
 
 #: op name -> the togsim_runtime.h symbol C4 lowers it to.
 EMITC_CALLEE = {
     DMA:     "togsim_dma",
     COMPUTE: "togsim_compute",
-    COMPUTE_BAR: "togsim_compute_barrier",
     MEMORY_BAR: "togsim_memory_barrier",
 }
 
diff --git a/README.md b/README.md
index f0bdc772..c2298376 100644
--- a/README.md
+++ b/README.md
@@ -385,6 +385,7 @@ num_cores: 1
 core_freq_mhz: 940
 core_stats_print_period_cycles: 10000
 num_systolic_array_per_core: 2
+sa_weight_buffer_depth: 2   # per-SA resident weight slots; must be > 0 (default 2). Raise to loosen the preload throttle.
 # Optional: one entry per core, default ws_mesh
 # core_type: [ws_mesh, ws_mesh]
 # Optional STONNE cores: stonne_config_path, num_stonne_per_core, num_stonne_port
@@ -453,7 +454,7 @@ codegen_compiler_optimization: all    # all | none | list of option names
 
 One-line meaning for each group (details in the YAML block above).
 
-- **Core (`num_cores`, `core_freq_mhz`, `core_stats_print_period_cycles`, `num_systolic_array_per_core`, optional `core_type`, STONNE keys)**: how many cores, their clock, stats cadence, systolic count per core, and optional non-default mesh vs STONNE mix.
+- **Core (`num_cores`, `core_freq_mhz`, `core_stats_print_period_cycles`, `num_systolic_array_per_core`, `sa_weight_buffer_depth`, optional `core_type`, STONNE keys)**: how many cores, their clock, stats cadence, systolic count per core, the per-SA resident weight-slot count (must be > 0; bounds preload run-ahead—raise it to loosen the throttle), and optional non-default mesh vs STONNE mix.
 - **VPU (`vpu_*`)**: vector lane count, per-lane scratchpad (KB), and vector register width—**compiler** uses these for tiling/codegen.
 - **DRAM (`dram_type`, `dram_channels`, …)**: `ramulator2` uses `ramulator_config_path`; `simple` uses fixed latency and optional bandwidth caps (`dram_bandwidth_gbps_*`, `dram_freq_mhz` when capped). `dram_num_partitions` splits channels for NUMA-style addressing.
 - **Interconnect (`icnt_*`, `booksim_config_path`)**: `simple` adds fixed hop latency (`icnt_latency_cycles`); `booksim2` points at a BookSim2 topology file.
diff --git a/Simulator/simulator.py b/Simulator/simulator.py
index 1c0c5cbe..a4517285 100644
--- a/Simulator/simulator.py
+++ b/Simulator/simulator.py
@@ -571,9 +571,11 @@ def run_standalone(
             base_cmd = TOGSimulator.get_togsim_command(config_path, togsim_path)
             use_trace = (os.environ.get("TORCHSIM_LEGACY_TOG") != "1"
                          and os.path.exists(trace_so))
+            if os.environ.get("TORCHSIM_LEGACY_TOG") == "1":
+                logger.warning("TORCHSIM_LEGACY_TOG=1 selects the DEPRECATED legacy ONNX TOG path")
             if use_trace:
                 cmd = f"{base_cmd} --trace_so {trace_so} --cycle_table {cycle_tsv}"
-            else:
+            else:  # DEPRECATED: legacy ONNX TOG path
                 cmd = f"{base_cmd} --models_list {trace_file_path}"
             if extension_config.CONFIG_TOGSIM_DEBUG_LEVEL:
                 cmd += f" --log_level {extension_config.CONFIG_TOGSIM_DEBUG_LEVEL}"
diff --git a/TOGSim/include/Core.h b/TOGSim/include/Core.h
index 0b6f8595..75ad9cf4 100644
--- a/TOGSim/include/Core.h
+++ b/TOGSim/include/Core.h
@@ -20,6 +20,13 @@ enum class InstFinishTraceTag {
   DmaRespComplete,
 };
 
+// A timed effect due at a cycle: free a weight slot, or wake a MEMORY_BAR.
+struct DueAction {
+  enum Kind { FreeWeightSlot, WakeBar } kind;
+  std::shared_ptr<WeightToken> token;
+  std::shared_ptr<Instruction> bar;
+};
+
 class Core {
  public:
   Core(uint32_t id, SimulationConfig config);
@@ -63,13 +70,14 @@ class Core {
   // SRAM-capacity throttle (sec 10.x): a consumer frees the buffer-versions it
   // read (refcount -> 0 releases the spad bytes). Called when COMP/MOVOUT issue.
   void release_sram(const std::shared_ptr<Instruction>& inst);
+  // Occupy inst's buffer-version footprint on issue; false if it would overflow
+  // the spad this cycle (the caller stalls it). True for untracked insts.
+  bool try_occupy_sram(const std::shared_ptr<Instruction>& inst);
   // SA weight-buffer throttle (sec 10.x): pick a systolic array that has a free
   // weight slot (round-robin among free); -1 if all full -> the preload stalls.
   int pick_free_weight_sa();
-  // Free weight slots due this cycle: a matmul releases its slot at its
-  // streaming-end (finish - overlapping, when it stops reading the weight),
-  // scheduled at issue in _weight_release_q. Last consumer frees it.
-  void process_weight_releases();
+  void process_due_events();   // drain _due_events due this cycle
+  void apply_due(const DueAction& a);
 
   /* Core id & config file */
   const uint32_t _id;
@@ -128,10 +136,8 @@ class Core {
 
   // SA weight-buffer throttle (sec 10.x). _weight_slots_used[s] = weights resident
   // on SA s (loaded by a preload, not yet freed by their last matmul);
-  // _weight_slot_depth = per-SA capacity (0 = disabled -> plain round-robin).
+  // _weight_slot_depth = per-SA weight-slot capacity (must be > 0).
   std::vector<int> _weight_slots_used;
   uint32_t _weight_slot_depth = 0;
-  // Pending weight-slot releases keyed by cycle (each matmul's streaming-end);
-  // process_weight_releases() drains those due and decrements the token.
-  std::multimap<cycle_type, std::shared_ptr<WeightToken>> _weight_release_q;
+  std::multimap<cycle_type, DueAction> _due_events;
 };
\ No newline at end of file
diff --git a/TOGSim/include/Instruction.h b/TOGSim/include/Instruction.h
index fa5d4ca1..24659791 100644
--- a/TOGSim/include/Instruction.h
+++ b/TOGSim/include/Instruction.h
@@ -6,6 +6,7 @@
 #include <list>
 #include <numeric>
 
+#include <array>
 #include <set>
 #include <cassert>
 #include <cstdint>
@@ -13,10 +14,12 @@
 #include <vector>
 
 // MEMORY_BAR: the DMA/memory barrier (waits a DMA tag in the tag table).
-// COMPUTE_BAR: the compute barrier -- waits the systolic-array compute pipeline(s)
-//              to drain (all SAs empty), then finishes. Used as the explicit
-//              fence before a store consumes async matmul results (sec 10.7).
-enum class Opcode { MOVIN, MOVOUT, COMP, MEMORY_BAR, COMPUTE_BAR, COUNT};
+enum class Opcode { MOVIN, MOVOUT, COMP, MEMORY_BAR, COUNT};
+
+// A dependency edge releases its consumer on one of the producer's lifecycle
+// events: ISSUE (occupancy -- the consumer overlaps the producer on the SA
+// pipeline) or DONE (latency -- the consumer needs the producer's result).
+enum class DepEvent : uint8_t { ISSUE = 0, DONE = 1, COUNT = 2 };
 
 // One weight slot on systolic array `sa` (sec 10.x). A preload sets refcount =
 // the matmuls reusing the weight; each frees it at its streaming-end, the last
@@ -37,15 +40,20 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
               std::vector<int64_t> accum_tag_idx_list);
   Instruction(Opcode opcode);
   void finish_instruction();
-  void add_child(std::shared_ptr<Instruction> child);
-  // Occupancy (SA-pipeline) dependency: the child is released when THIS op is
-  // ISSUED (enters the pipeline), not when it finishes -- so a preload/matmul
-  // successor overlaps it instead of waiting its full latency (sec 10.7).
-  void add_pipeline_child(std::shared_ptr<Instruction> child);
-  void release_pipeline_children();
-  // SA weight-buffer model: the SA this op is pinned to (a preload picks it, its
-  // matmul consumers inherit it) and the shared weight slot the matmuls release.
-  const std::set<std::shared_ptr<Instruction>>& get_pipeline_children() { return _pipeline_children; }
+  // Subscribe `c` to this op's `on` event (ISSUE=occupancy, DONE=latency). The set
+  // dedups, so ready_counter is bumped only on a new edge (a producer writing
+  // several buffers one consumer reads links the pair once per buffer).
+  void add_dep(std::shared_ptr<Instruction> c, DepEvent on) {
+    if (_deps[static_cast<size_t>(on)].insert(c).second) c->inc_ready_counter();
+  }
+  // Release every subscriber of `e` (decrement its ready_counter) and clear.
+  void fire(DepEvent e) {
+    for (auto& c : _deps[static_cast<size_t>(e)]) c->dec_ready_counter();
+    _deps[static_cast<size_t>(e)].clear();
+  }
+  const std::set<std::shared_ptr<Instruction>>& get_deps(DepEvent e) {
+    return _deps[static_cast<size_t>(e)];
+  }
   void set_assigned_sa(int s) { _assigned_sa = s; }
   int get_assigned_sa() const { return _assigned_sa; }
   void set_weight_token(const std::shared_ptr<WeightToken>& t) { _weight_token = t; }
@@ -54,10 +62,6 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   // grouping/coloring in the timeline. Set by the bridge per TILE_BEGIN.
   void set_tile_group(int g) { _tile_group = g; }
   int get_tile_group() const { return _tile_group; }
-  // COMPUTE_BAR fence: the max finish_cycle of the async computes it gates (its
-  // own dispatch only), so it drains those instead of every SA pipeline.
-  void update_fence_finish(cycle_type c) { if (c > _fence_finish) _fence_finish = c; }
-  cycle_type get_fence_finish() const { return _fence_finish; }
   bool check_ready() { return ready_counter == 0; }
   const Opcode get_opcode() { return opcode; }
   bool is_dma_read() { return opcode == Opcode::MOVIN; }
@@ -115,7 +119,6 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   void prepare_tag_key();
   bool is_sparse_inst() { return _is_sparse_inst; }
   void set_sparse_state(bool state) { _is_sparse_inst = state; }
-  std::set<std::shared_ptr<Instruction>>& get_child_inst() { return child_inst; }
   uint64_t get_global_inst_id() const { return _global_inst_id; }
 
   // SRAM-capacity model (sec 10.x). A load contributes its footprint to a
@@ -129,10 +132,15 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   int64_t get_sram_alloc() const { return _sram_alloc_id; }
   void add_sram_release(int64_t id) { _sram_release_allocs.push_back(id); }
   const std::vector<int64_t>& get_sram_release() const { return _sram_release_allocs; }
-  // bytes this load occupies in the spad (from the tile it moves in).
-  size_t sram_footprint() const { return _tile_numel * (_elem_bits / 8); }
+  // bytes this instruction's buffer occupies in the spad. A DMA derives it from
+  // the tile it moves; a compute output gets it set explicitly by the bridge (the
+  // buffer's size is known from the DMA records that touch the same buffer).
+  void set_sram_footprint(size_t b) { _sram_footprint_override = b; }
+  size_t sram_footprint() const {
+    return _sram_footprint_override ? _sram_footprint_override
+                                    : _tile_numel * (_elem_bits / 8);
+  }
 
-  cycle_type start_cycle = 0;
   cycle_type finish_cycle = 0;
   cycle_type bubble_cycle=0;
 
@@ -149,8 +157,11 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   cycle_type overlapping_cycle = 0;
   size_t ready_counter = 0;   // parents not yet finished; the minimal Instruction(Opcode)
                               // ctor (barriers) relies on this default + inc_ready_counter
-  std::set<std::shared_ptr<Instruction>> child_inst;
-  std::set<std::shared_ptr<Instruction>> _pipeline_children;  // released at issue (sec 10.7)
+  // Per-event subscriber sets: _deps[ISSUE] released at issue (occupancy),
+  // _deps[DONE] released at finish (latency). std::set dedups + keeps a stable
+  // iteration order (byte-identical release order).
+  std::array<std::set<std::shared_ptr<Instruction>>,
+             static_cast<size_t>(DepEvent::COUNT)> _deps;
   std::vector<size_t> tile_size;
   std::vector<int> tile_stride;
   size_t _tile_numel = 0;
@@ -175,9 +186,9 @@ class Instruction : public std::enable_shared_from_this<Instruction> {
   // SRAM-capacity model (see the setters above).
   int64_t _sram_alloc_id = -1;
   std::vector<int64_t> _sram_release_allocs;
+  size_t _sram_footprint_override = 0;
   // SA weight-buffer model (see the setters above).
   int _assigned_sa = -1;
   std::shared_ptr<WeightToken> _weight_token;
   int _tile_group = -1;   // trace-only work-item id (see set_tile_group)
-  cycle_type _fence_finish = 0;   // COMPUTE_BAR: drain target (see update_fence_finish)
 };
\ No newline at end of file
diff --git a/TOGSim/include/togsim_loader.h b/TOGSim/include/togsim_loader.h
index 9b326da6..17e10b34 100644
--- a/TOGSim/include/togsim_loader.h
+++ b/TOGSim/include/togsim_loader.h
@@ -23,7 +23,7 @@ namespace togsim {
 
 // One modeled instruction recorded by the runtime callbacks.
 struct TraceRec {
-  enum Kind { TILE_BEGIN, TILE_END, DMA, COMPUTE, MEMORY_BAR, COMPUTE_BAR } kind;
+  enum Kind { TILE_BEGIN, TILE_END, DMA, COMPUTE, MEMORY_BAR } kind;
   int32_t  core;          // work-item -> core binding (set by togsim_dispatch)
   // DMA / MEMORY_BAR
   int32_t  dir;           // togsim_dma_dir
diff --git a/TOGSim/include/togsim_runtime.h b/TOGSim/include/togsim_runtime.h
index e8fd6b84..d87c61d5 100644
--- a/TOGSim/include/togsim_runtime.h
+++ b/TOGSim/include/togsim_runtime.h
@@ -162,11 +162,6 @@ typedef void (*togsim_tile_fn)(EmitCtx* ctx, int64_t* iv, int32_t n_iv);
 void togsim_dispatch(EmitCtx* ctx, togsim_tile_fn fn,
                      int64_t* iv, int32_t n_iv);
 
-// Compute fence: drain in-flight async compute (the systolic-array matmuls)
-// before the following op (a store) consumes their result. Explicit barrier in
-// the trace; the loader turns it into a COMPUTE_BAR instruction (sec 10.7).
-void togsim_compute_barrier(EmitCtx* ctx);
-
 // Entry point the loader resolves in the producer `.so`. `shape_args` carries
 // the runtime values for the kernel's symbolic dimensions (in a kernel-specific
 // order recorded alongside the cached `.so`); `n_shape_args` is their count.
diff --git a/TOGSim/src/Core.cc b/TOGSim/src/Core.cc
index 980b4fa6..915988ce 100644
--- a/TOGSim/src/Core.cc
+++ b/TOGSim/src/Core.cc
@@ -18,7 +18,11 @@ Core::Core(uint32_t id, SimulationConfig config)
   _stat_inst_count.resize(static_cast<size_t>(Opcode::COUNT), 0);
   _stat_tot_skipped_inst.resize(static_cast<size_t>(Opcode::COUNT), 0);
   _sram_capacity = (size_t)config.core_spad_size_kb * 1024;  // 0 = throttle disabled
-  _weight_slot_depth = config.sa_weight_buffer_depth;        // 0 = disabled (plain rr)
+  _weight_slot_depth = config.sa_weight_buffer_depth;        // per-SA weight slots (>0)
+  if (_weight_slot_depth == 0) {
+    spdlog::error("sa_weight_buffer_depth must be > 0 (raise it to loosen the preload throttle)");
+    exit(EXIT_FAILURE);
+  }
   _weight_slots_used.resize(_num_systolic_array_per_core, 0);
 }
 
@@ -35,11 +39,23 @@ int Core::pick_free_weight_sa() {
   return -1;
 }
 
-void Core::process_weight_releases() {
-  while (!_weight_release_q.empty() && _weight_release_q.begin()->first <= _core_cycle) {
-    auto tok = _weight_release_q.begin()->second;
-    _weight_release_q.erase(_weight_release_q.begin());
-    if (--tok->refcount <= 0) _weight_slots_used[tok->sa]--;  // last reader frees the slot
+void Core::apply_due(const DueAction& a) {
+  switch (a.kind) {
+    case DueAction::FreeWeightSlot:
+      if (--a.token->refcount <= 0) _weight_slots_used[a.token->sa]--;  // last reader frees the slot
+      break;
+    case DueAction::WakeBar: {
+      auto bar = a.bar;            // async load data arrived -> fire its MEMORY_BAR
+      finish_instruction(bar);
+      break;
+    }
+  }
+}
+
+void Core::process_due_events() {
+  while (!_due_events.empty() && _due_events.begin()->first <= _core_cycle) {
+    apply_due(_due_events.begin()->second);
+    _due_events.erase(_due_events.begin());
   }
 }
 
@@ -55,6 +71,15 @@ void Core::release_sram(const std::shared_ptr<Instruction>& inst) {
   }
 }
 
+bool Core::try_occupy_sram(const std::shared_ptr<Instruction>& inst) {
+  if (!_sram_capacity || inst->get_sram_alloc() < 0) return true;   // untracked
+  size_t F = inst->sram_footprint();
+  if (_sram_used + F > _sram_capacity) return false;                // would overflow -> stall
+  _sram_used += F;
+  _sram_allocs[inst->get_sram_alloc()] += F;                        // accumulate version footprint
+  return true;
+}
+
 bool Core::can_issue(const std::shared_ptr<Tile>& op) {
   /* Bound concurrent dispatches so their combined spad working set fits: with the
    * global @buffers each in-flight dispatch piles its own load versions, and too
@@ -174,7 +199,7 @@ void Core::dma_cycle() {
       finish_instruction(instruction, InstFinishTraceTag::DmaRespComplete);
       for (auto & wait_inst : _dma.get_tag_waiter(instruction->subgraph_id, key)) {
         _dma.mark_tag_used(instruction->subgraph_id, key);
-        finish_instruction(wait_inst);
+        _due_events.emplace(_core_cycle, DueAction{DueAction::WakeBar, nullptr, wait_inst});
       }
     }
     _dma_finished_queue.erase(_dma_finished_queue.begin());
@@ -239,7 +264,7 @@ void Core::cycle() {
   /* Increase core cycle counter */
   _core_cycle++;
 
-  process_weight_releases();  // free weight slots due this cycle before dispatch
+  process_due_events();  // weight-slot frees + DMA-arrival wakeups due this cycle
 
   /* Iterate tile while an instruction is issued */
   bool issued = false;
@@ -248,9 +273,6 @@ void Core::cycle() {
     auto& instructions = _tiles[i]->get_ready_instructions();
     for (auto it=instructions.begin(); it!=instructions.end();) {
       auto& inst = *it;
-      /* Skip instruction is not ready  */
-      //if (!inst->is_ready())
-      //  continue;
 
       switch (inst->get_opcode()) {
         case Opcode::MOVIN:
@@ -281,22 +303,8 @@ void Core::cycle() {
               _stat_tot_skipped_inst.at(static_cast<size_t>(inst->get_opcode()))++;
               break;
             } else {
-              // SRAM-capacity gate (sec 10.x): a load that would overflow the
-              // per-core spad does not issue this cycle -- leave it in the ready
-              // queue (it++ retries next cycle) until a consumer frees a tile. On
-              // issue, occupy its bytes under its buffer-version allocation.
-              if (_sram_capacity && inst->get_sram_alloc() >= 0) {
-                size_t F = inst->sram_footprint();
-                // Stall if the tile does not fit in the free spad right now. If
-                // it can never fit (the kernel's working set exceeds the whole
-                // spad), the sim wedges -- Simulator::cycle() detects that frozen
-                // state and exits with a "spad too small" error rather than
-                // looping forever.
-                if (_sram_used + F > _sram_capacity)
-                  break;                                       // not issued -> retry next cycle
-                _sram_used += F;
-                _sram_allocs[inst->get_sram_alloc()] += F;     // accumulate version footprint
-              }
+              // load occupies its spad bytes on issue; stall (retry next cycle) if full.
+              if (!try_occupy_sram(inst)) break;
               core_trace_log::trace_instruction_line(_core_cycle,
                                                        _id,
                                                        TraceLogTag::pad15(
@@ -324,41 +332,38 @@ void Core::cycle() {
         case Opcode::COMP:
           {
             const int ct = inst->get_compute_type();
-            // --- SA selection + weight-buffer gate (sec 10.x) ---
-            // A preload picks a systolic array with a free weight slot and pins
-            // its matmul consumers to that SA (they free the slot on finish). A
-            // matmul runs on the SA its weight was preloaded into. This both
-            // bounds preload run-ahead and keeps matmuls on their weight's SA.
+            // a fresh-output compute occupies its spad bytes on issue; stall if full.
+            if (!try_occupy_sram(inst)) break;
+            // SA selection (sec 10.x): a preload picks an SA with a free weight slot
+            // and pins its matmul consumers there; a matmul runs on its pinned SA.
             int sa_idx = -1;
             if (ct == MATMUL || ct == PRELOAD) {
               if (ct == PRELOAD) {
                 int n_consumers = 0;   // matmuls reusing this weight
-                for (auto& c : inst->get_pipeline_children())
+                for (auto& c : inst->get_deps(DepEvent::ISSUE))
                   if (c->get_compute_type() == MATMUL) n_consumers++;
-                if (_weight_slot_depth > 0 && n_consumers > 0) {
-                  sa_idx = pick_free_weight_sa();
-                  if (sa_idx < 0) break;            // all weight slots full -> stall (retry)
-                  _weight_slots_used[sa_idx]++;
-                  auto tok = std::make_shared<WeightToken>(WeightToken{sa_idx, n_consumers});
-                  for (auto& c : inst->get_pipeline_children())
-                    if (c->get_compute_type() == MATMUL) {
-                      c->set_assigned_sa(sa_idx);
-                      c->set_weight_token(tok);
-                    }
-                } else {                            // disabled / no consumers -> plain rr
-                  sa_idx = _systolic_array_rr;
-                  _systolic_array_rr = (_systolic_array_rr + 1) % _num_systolic_array_per_core;
+                if (n_consumers == 0) {            // weight-slot model needs >=1 consumer
+                  spdlog::error("preload has no matmul consumer (weight-slot model invariant)");
+                  exit(EXIT_FAILURE);
                 }
+                sa_idx = pick_free_weight_sa();
+                if (sa_idx < 0) break;              // all weight slots full -> stall (retry)
+                _weight_slots_used[sa_idx]++;
+                auto tok = std::make_shared<WeightToken>(WeightToken{sa_idx, n_consumers});
+                for (auto& c : inst->get_deps(DepEvent::ISSUE))
+                  if (c->get_compute_type() == MATMUL) {
+                    c->set_assigned_sa(sa_idx);
+                    c->set_weight_token(tok);
+                  }
               } else {                              // MATMUL
-                sa_idx = inst->get_assigned_sa();
-                if (sa_idx < 0) {                   // no preload pinned it -> rr fallback
-                  sa_idx = _systolic_array_rr;
-                  _systolic_array_rr = (_systolic_array_rr + 1) % _num_systolic_array_per_core;
+                sa_idx = inst->get_assigned_sa();   // pinned by its preload
+                if (sa_idx < 0) {                   // unpinned -> no preload set its SA
+                  spdlog::error("matmul was not pinned to an SA by a preload (weight-slot model invariant)");
+                  exit(EXIT_FAILURE);
                 }
               }
               inst->set_assigned_sa(sa_idx);         // record the SA actually used (for the trace)
             }
-            release_sram(inst);   // consumer issued -> free the tiles it read
             auto& target_pipeline = (ct == VECTOR_UNIT) ? _vu_compute_pipeline
                                                         : _sa_compute_pipeline.at(sa_idx);
             if (target_pipeline.empty()) {
@@ -370,19 +375,19 @@ void Core::cycle() {
               inst->finish_cycle = target_pipeline.back()->finish_cycle + inst->get_compute_cycle() - overlapped_cycle;
               inst->bubble_cycle = bubble_cycle;
             }
-            // sec 10.7: release the occupancy (pipeline) dependents so a successor
-            // overlaps this op. finish_cycle is set first so release can feed it to
-            // a COMPUTE_BAR child's per-dispatch fence (see release_pipeline_children).
-            inst->release_pipeline_children();
+            // release the occupancy (ISSUE) dependents so a successor overlaps this op.
+            inst->fire(DepEvent::ISSUE);
 
             // Release this matmul's weight slot at its streaming-end (finish -
             // overlapping), not at full finish (the drain tail does not read it).
             if (ct == MATMUL && inst->get_weight_token()) {
               cycle_type rel = inst->finish_cycle > inst->get_overlapping_cycle()
                                  ? inst->finish_cycle - inst->get_overlapping_cycle() : _core_cycle;
-              _weight_release_q.emplace(rel, inst->get_weight_token());
+              _due_events.emplace(rel, DueAction{DueAction::FreeWeightSlot,
+                                                 inst->get_weight_token(), nullptr});
             }
 
+            release_sram(inst);   // free the tiles it read (before the skip path)
             if (inst->get_compute_cycle() == 0) {
               inst->finish_instruction();
               static_cast<Tile*>(inst->get_owner())->inc_finished_inst();
@@ -409,7 +414,7 @@ void Core::cycle() {
             auto& key = inst->get_tag_id();
             uint32_t finished = _dma.get_tag_finish(inst->subgraph_id, key);
             if (finished == -1) {
-              for (auto child_inst : inst->get_child_inst()) {
+              for (auto child_inst : inst->get_deps(DepEvent::DONE)) {
                 if (child_inst->get_opcode() == Opcode::COMP && child_inst->get_compute_type() == MATMUL) {
                   child_inst->set_compute_cycle(0);
                 }
@@ -431,24 +436,6 @@ void Core::cycle() {
             issued = true;
           }
           break;
-        case Opcode::COMPUTE_BAR:
-          {
-            // Compute fence (sec 10.7): finish once THIS dispatch's async computes
-            // have drained -- i.e. the current cycle has reached the max finish of
-            // the computes it gates (fed in via update_fence_finish when each
-            // issued). Scoped to its own dispatch, so an unrelated tile's matmuls
-            // sharing the SA pipelines do not delay it (no cross-dispatch
-            // serialization). Not yet drained -> stays in the ready queue.
-            if (_core_cycle >= inst->get_fence_finish()) {
-              core_trace_log::trace_instruction_line(_core_cycle, _id,
-                  TraceLogTag::pad15(TraceLogTag::kInstructionFinished),
-                  inst->get_global_inst_id(),
-                  core_trace_log::format_instruction_detail_line(*inst));
-              finish_instruction(inst);
-              issued = true;
-            }
-          }
-          break;
         default:
           core_trace_log::log_error_undefined_opcode();
           exit(EXIT_FAILURE);
diff --git a/TOGSim/src/Instruction.cc b/TOGSim/src/Instruction.cc
index c5778a28..ee184a1a 100644
--- a/TOGSim/src/Instruction.cc
+++ b/TOGSim/src/Instruction.cc
@@ -24,7 +24,6 @@ std::string opcode_to_string(Opcode opcode) {
         case Opcode::MOVOUT:       return "MOVOUT";
         case Opcode::COMP:         return "COMP";
         case Opcode::MEMORY_BAR:   return "MEMORY_BAR";
-        case Opcode::COMPUTE_BAR:  return "COMPUTE_BAR";
         default:                   return "Unknown";
     }
 }
@@ -51,36 +50,10 @@ Instruction::Instruction(Opcode opcode)
 }
 
 void Instruction::finish_instruction() {
-  for (auto& counter : child_inst)
-    counter->dec_ready_counter();
+  fire(DepEvent::DONE);   // latency consumers
   finished = true;
 }
 
-void Instruction::add_child(std::shared_ptr<Instruction> child) {
-  // child_inst is a set (each child released exactly once at finish), so the
-  // ready_counter must be bumped only when the edge is NEW -- a producer that
-  // writes several buffers a single consumer reads (e.g. a sort tile reading the
-  // value+index buffers its predecessor wrote) links the same pair once per shared
-  // buffer; double-counting would leave ready_counter stuck above 0 -> deadlock.
-  if (child_inst.insert(child).second)
-    child->inc_ready_counter();
-}
-
-void Instruction::add_pipeline_child(std::shared_ptr<Instruction> child) {
-  if (_pipeline_children.insert(child).second)
-    child->inc_ready_counter();
-}
-
-void Instruction::release_pipeline_children() {
-  for (auto& c : _pipeline_children) {
-    // a COMPUTE_BAR child fences only its own dispatch -> it drains the max
-    // finish of the computes it gates, fed here as each one issues.
-    if (c->get_opcode() == Opcode::COMPUTE_BAR) c->update_fence_finish(finish_cycle);
-    c->dec_ready_counter();
-  }
-  _pipeline_children.clear();
-}
-
 void Instruction::inc_waiting_request() {
   _nr_waiting_request++;
 }
diff --git a/TOGSim/src/TileGraphParser.cc b/TOGSim/src/TileGraphParser.cc
index 572062e0..c252258e 100644
--- a/TOGSim/src/TileGraphParser.cc
+++ b/TOGSim/src/TileGraphParser.cc
@@ -584,7 +584,7 @@ std::vector<std::shared_ptr<Tile>> TileLoopNode::get_tiles_from_iter(TileGraphPa
         for (const auto& child_node: node->get_child()) {
           if (link_map.find(child_node) != link_map.end()) {
             std::shared_ptr<Instruction> child_inst = link_map[child_node];
-            inst->add_child(child_inst);
+            inst->add_dep(child_inst, DepEvent::DONE);
           }
         }
       }
@@ -606,7 +606,7 @@ std::vector<std::shared_ptr<Tile>> TileLoopNode::get_tiles_from_iter(TileGraphPa
             for (auto& inner_inst : inner_tile->get_instructions()) {
               tile_vec.back()->append_instuction(inner_inst);
               if (nr_inst) {
-                last_instruction->add_child(inner_inst);
+                last_instruction->add_dep(inner_inst, DepEvent::DONE);
               }
             }
           }
@@ -662,7 +662,7 @@ std::vector<std::shared_ptr<Tile>> TileLoopNode::get_tiles_from_iter(TileGraphPa
     for (const auto& child_node: node->get_child()) {
       if (link_map.find(child_node) != link_map.end()) {
         std::shared_ptr<Instruction> child_inst = link_map[child_node];
-        inst->add_child(child_inst);
+        inst->add_dep(child_inst, DepEvent::DONE);
       }
     }
   }
diff --git a/TOGSim/src/main.cc b/TOGSim/src/main.cc
index bd570188..274d63da 100644
--- a/TOGSim/src/main.cc
+++ b/TOGSim/src/main.cc
@@ -54,10 +54,9 @@ std::unique_ptr<TileGraph> build_trace_tilegraph(Simulator* simulator,
 void launchKernel(Simulator* simulator, unsigned int kernel_id, std::string onnx_path, std::string attribute_path, const YAML::Node& config_yaml, cycle_type request_time=0, int partition_id=0, int device_id=0) {
   std::unique_ptr<TileGraph> tile_graph;
   std::string tog_path = onnx_path;  // for the log line
-  // Prefer the C++ trace path: the kernel's trace.so / trace_cycles.tsv sit next to
-  // its tile_graph.onnx (same write_path). This brings the multi-tenant scheduler
-  // onto the new TOG too; opt out with TORCHSIM_LEGACY_TOG=1, and fall back to the
-  // legacy ONNX parser when the .so is absent or fails to run.
+  // The C++ trace path is the supported one: the kernel's trace.so / trace_cycles.tsv
+  // sit next to its tile_graph.onnx (same write_path). The legacy ONNX parser below is
+  // DEPRECATED -- only used via TORCHSIM_LEGACY_TOG=1 or when the .so is absent / fails.
   const char* legacy = std::getenv("TORCHSIM_LEGACY_TOG");
   std::string dir = fs::path(onnx_path).parent_path().string();
   std::string trace_so = dir + "/trace.so";
@@ -68,6 +67,7 @@ void launchKernel(Simulator* simulator, unsigned int kernel_id, std::string onnx
     else spdlog::warn("[TOGSim] trace.so run failed for {}; falling back to ONNX", trace_so);
   }
   if (!tile_graph) {
+    spdlog::warn("[TOGSim] using the DEPRECATED legacy ONNX TOG path for {}", onnx_path);
     auto graph_praser = TileGraphParser(onnx_path, attribute_path, config_yaml);
     tile_graph = std::move(graph_praser.get_tile_graph());
   }
diff --git a/TOGSim/src/togsim_runtime.cc b/TOGSim/src/togsim_runtime.cc
index d15d92b9..a83b8541 100644
--- a/TOGSim/src/togsim_runtime.cc
+++ b/TOGSim/src/togsim_runtime.cc
@@ -108,10 +108,6 @@ void togsim_memory_barrier(EmitCtx* ctx, int32_t tag_id, uint64_t tag_slot,
   ctx->trace.push_back(r);
 }
 
-void togsim_compute_barrier(EmitCtx* ctx) {
-  ctx->trace.push_back(blank(togsim::TraceRec::COMPUTE_BAR, ctx->cur_core));
-}
-
 }  // extern "C"
 
 namespace togsim {
@@ -189,8 +185,7 @@ SimResult simulate(const RunResult& run, const TimingParams& params) {
       }
       case TraceRec::TILE_BEGIN:
       case TraceRec::TILE_END:
-      case TraceRec::COMPUTE_BAR:
-        break;  // work-item boundary / compute fence: no cost in this reference timer
+        break;  // work-item boundary: no cost in this reference timer
     }
   }
   for (auto& kv : dma_free) out.total_cycle = std::max(out.total_cycle, kv.second);
diff --git a/TOGSim/src/togsim_trace_bridge.cc b/TOGSim/src/togsim_trace_bridge.cc
index 3a75945c..351e313e 100644
--- a/TOGSim/src/togsim_trace_bridge.cc
+++ b/TOGSim/src/togsim_trace_bridge.cc
@@ -1,6 +1,7 @@
 // togsim_trace_bridge.cc -- see togsim_trace_bridge.h
 #include "togsim_trace_bridge.h"
 
+#include <algorithm>
 #include <map>
 #include <utility>
 #include <vector>
@@ -48,7 +49,7 @@ std::shared_ptr<Instruction> make_dma(const togsim::TraceRec& t, int64_t uniq) {
 
 // A MEMORY_BAR carrying the SAME `uniq` tag key as the async dma it gates -- the
 // Core's tag table signals it at the dma's DATA-ready (resp-complete), unlike a
-// raw add_child which the async dma releases at issue-complete.
+// raw DONE edge that the async dma releases at issue-complete.
 std::shared_ptr<Instruction> make_mem_bar(const togsim::TraceRec& t, int64_t uniq) {
   auto bar = std::make_shared<Instruction>(
       Opcode::MEMORY_BAR, 0, 0, 0,
@@ -83,19 +84,21 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
 
   std::shared_ptr<TileSubGraph> sg;
   std::shared_ptr<Tile> tile;
-  // Explicit dependency DAG (sec 10): a reader depends on the last writer of each
-  // SRAM buffer it reads. Scoped per work-item (reset at each dispatch) -- buffers
-  // are work-item-local, so distinct work-items are independent (-> parallel).
-  std::map<int64_t, std::shared_ptr<Instruction>> last_writer;  // buffer id -> producer
+  // Explicit dependency DAG (sec 10), built by one dataflow rule (see `link`).
+  // Per SRAM buffer b we keep writers(b): the set of its current producers'
+  // DONE-handles (stored as a small vector). Scoped per work-item (reset at each
+  // dispatch): buffers are work-item-local, so distinct work-items are
+  // independent (-> parallel).
+  std::map<int64_t, std::vector<std::shared_ptr<Instruction>>> writers;       // buffer id -> current producers (DONE-handles)
   // An async dma is paired with its explicit memory_barrier(s) by the runtime tag
   // (tag_id, tag_slot). It is 1 load : N barriers (the load happens once per
   // reduction iteration; each consumer in that iteration is preceded by a wait on
   // the same tag), so we track the CURRENT (most recent) load per (tag_id,
-  // tag_slot) -- like last_writer for a buffer -- not a FIFO. Each load gets a
-  // fresh `uniq` Core key, so successive reduction iterations (multi-tile-K, conv)
-  // never collide in the tag table; the iteration's barriers reuse that load's
-  // uniq. Correct because the load nest and its consumer nest run in order within
-  // the reduction body (no cross-iteration prefetch). Scoped per work-item.
+  // tag_slot) -- not a FIFO. Each load gets a fresh `uniq` Core key, so successive
+  // reduction iterations (multi-tile-K, conv) never collide in the tag table; the
+  // iteration's barriers reuse that load's uniq. Correct because the load nest and
+  // its consumer nest run in order within the reduction body (no cross-iteration
+  // prefetch). Scoped per work-item.
   std::map<std::pair<int32_t, uint64_t>,
            std::pair<int64_t, std::shared_ptr<Instruction>>> current_dma;
   // Dedup identical dma_waits: the barrier already built for the CURRENT load of a
@@ -108,13 +111,6 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
            std::pair<int64_t, std::shared_ptr<Instruction>>> bar_for_load;
   int64_t next_tag = 0;   // mints a unique Core tag key per dma record
   int cur_tile_group = -1;   // work-item index, bumped per TILE_BEGIN (trace grouping)
-  // Async compute (matmul/preload): issued and pipelined on the systolic array;
-  // they do not block each other. A store then needs the drained result, so it
-  // FLUSHes -- waits all outstanding async compute before running (like a fence
-  // after async ops). No per-op completion events; one barrier before the store.
-  std::vector<std::shared_ptr<Instruction>> outstanding_async;
-  std::shared_ptr<Instruction> pending_bar;   // last COMPUTE_BAR fence, awaited by the next store
-  auto is_async_compute = [](int ct) { return ct == 1 || ct == 2; };  // matmul / preload
 
   auto flush = [&]() {
     if (sg && tile) {
@@ -124,49 +120,59 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     }
     sg.reset();
     tile.reset();
-    last_writer.clear();
+    writers.clear();
     current_dma.clear();
     bar_for_load.clear();
     next_tag = 0;
-    outstanding_async.clear();
-    pending_bar.reset();
   };
 
-  // Build edges from the recorded read/write buffer sets: reader <- last writer of
-  // each buffer it reads (the virtual SA_WEIGHTS buffer carries preload->matmul;
-  // the Y_spad accumulator carries the reduction chain; the spads carry load->
-  // compute). No in-order chain, no tag matching, no op heuristics.
-  // sec 10.7 occupancy/latency split. An edge from a systolic-array producer
-  // (preload=2 or matmul=1) to a matmul (1) is an OCCUPANCY dependency: the
-  // successor overlaps the producer on the SA pipeline, so use add_pipeline_child
-  // (released when the producer ISSUES). Every other edge is a LATENCY
-  // dependency (the consumer needs the producer's result): load->compute,
-  // init->matmul, matmul->store -> add_child (released at the producer's finish).
+  // Single dataflow rule (sec 10). Per buffer b, writers(b) is a SET of the
+  // current producers' DONE-handles.
+  //  - READ b: depend on ALL writers(b) -- occupancy (ISSUE) when both are SA ops
+  //    (preload/matmul overlap on the pipeline), else latency (DONE).
+  //  - WRITE b: REPLACE -- reset writers(b)={inst}.
+  //  - Exception is_mm_accum (a MATMUL that reads AND writes b, i.e. a commutative
+  //    accumulator, Y += X@W): skip the read edge and UNION the write -- wait only on
+  //    the non-matmul seed (init/bias) and join writers(b) without resetting it or
+  //    ordering against co-matmuls, so the K matmuls do not chain through it and a later
+  //    reader joins all of them. TOGSim is timing-only (values come from the trace).
+  // Buffer-reuse (WAR) ordering is modeled by the resource models, not edges: the SRAM
+  // version/capacity machinery for spad buffers, the weight-slot machinery for weights.
   const int MATMUL_CT = 1, PRELOAD_CT = 2;
+  auto is_mm_accum = [&](const std::shared_ptr<Instruction>& inst, int64_t b,
+                         const std::vector<int64_t>& writes) {
+    if (inst->get_compute_type() != MATMUL_CT) return false;
+    for (int64_t w : writes) if (w == b) return true;
+    return false;
+  };
   auto link = [&](std::shared_ptr<Instruction> inst,
                   const std::vector<int64_t>& reads,
                   const std::vector<int64_t>& writes) {
     for (int64_t b : reads) {
-      // A matmul reading its own accumulator (a buffer it also WRITES) imposes NO
-      // producer order: Y += X@W is commutative. Chaining matmuls through the
-      // accumulator (M_k <- M_{k-1}) needlessly serializes them and DEADLOCKS the SA
-      // weight-slot pipeline -- a later iteration's preload can grab the last weight
-      // slot while the in-order head matmul is starved of one, and that head can never
-      // run to release a slot. The store still waits every matmul via the COMPUTE_BAR
-      // fence, so dropping this edge is safe (TOGSim is timing-only; values come from
-      // the recorded trace).
-      bool is_accum = false;
-      for (int64_t w : writes) if (w == b) { is_accum = true; break; }
-      if (inst->get_compute_type() == MATMUL_CT && is_accum) continue;
-      auto it = last_writer.find(b);
-      if (it == last_writer.end()) continue;
-      int pct = it->second->get_compute_type();
-      if (inst->get_compute_type() == MATMUL_CT && (pct == MATMUL_CT || pct == PRELOAD_CT))
-        it->second->add_pipeline_child(inst);  // SA pipeline -> occupancy (overlap)
-      else
-        it->second->add_child(inst);           // data/result -> latency (full wait)
+      if (is_mm_accum(inst, b, writes)) continue;   // accumulator read -> handled in WRITE (UNION)
+      auto it = writers.find(b);
+      if (it != writers.end())
+        for (auto& w : it->second) {
+          int pct = w->get_compute_type();
+          // both SA ops -> occupancy (overlap on the SA pipeline); else latency.
+          DepEvent on = (inst->get_compute_type() == MATMUL_CT &&
+                         (pct == MATMUL_CT || pct == PRELOAD_CT))
+                            ? DepEvent::ISSUE : DepEvent::DONE;
+          w->add_dep(inst, on);
+        }
+    }
+    for (int64_t b : writes) {
+      if (is_mm_accum(inst, b, writes)) {            // UNION (commutative accumulate)
+        auto it = writers.find(b);
+        if (it != writers.end())
+          for (auto& s : it->second)
+            if (s->get_compute_type() != MATMUL_CT)
+              s->add_dep(inst, DepEvent::DONE);   // wait on the init/bias seed only
+        writers[b].push_back(inst);        // join; no reset, no co-matmul edge
+      } else {                             // REPLACE (normal output; resets the producer set)
+        writers[b] = { inst };
+      }
     }
-    for (int64_t b : writes) last_writer[b] = inst;
     tile->append_instuction(inst);
   };
 
@@ -175,15 +181,30 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
   // one allocation, freed once all the version's consumers have issued (refcount
   // -> 0). NOT reset in flush(): the spad is one physical per-core resource, so a
   // buffer reused by the next reduction iter / work-item is a NEW version that
-  // must wait for the old one to free (WAR / double-buffer). Tracked buffers are
-  // the DMA-loaded ones; the accumulator / virtual SA-weights are never written
-  // by a load, so cur_alloc has no entry and they are skipped. (v1: single-core;
-  // multi-core would key cur_alloc/vers by (core, buf).)
+  // must wait for the old one to free (WAR / double-buffer). Both DMA-loaded
+  // buffers AND compute outputs (the accumulator, vector epilogue results) are
+  // tracked; the virtual SA-weights are not (weight slots model them). (v1:
+  // single-core; multi-core would key cur_alloc/vers by (core, buf).)
   int64_t next_alloc = 0;
   std::map<int64_t, int64_t> cur_alloc;   // buf -> current version id
-  std::map<int64_t, bool> open_ver;       // buf -> version still accepting loads
+  std::map<int64_t, bool> open_ver;       // buf -> version still accepting writes
   struct Ver { std::vector<std::shared_ptr<Instruction>> loads, readers; };
   std::map<int64_t, Ver> vers;
+  // Spad bytes per buffer id, taken from the DMA records that touch it (load fills
+  // its dst, store drains its src) -- the authoritative tile size. A compute output
+  // (never DMA-loaded but stored) gets its footprint from its store record. Built
+  // in a pre-pass so it is known before the producing compute is processed.
+  auto rec_bytes = [](const TraceRec& t) {        // single source of the tile footprint
+    size_t numel = 1;
+    for (auto d : t.dims) numel *= (size_t)d;
+    return numel * (t.elem_bits / 8);
+  };
+  std::map<int64_t, size_t> buf_bytes;
+  for (const auto& t : run.trace) {
+    if (t.kind != TraceRec::DMA) continue;
+    const auto& bs = (t.dir == 1) ? t.read_bufs : t.write_bufs;  // store reads spad, load writes spad
+    for (int64_t b : bs) buf_bytes[b] = rec_bytes(t);
+  }
   auto sram_on_load = [&](int64_t b, const std::shared_ptr<Instruction>& ld) {
     if (!cur_alloc.count(b) || !open_ver[b]) {   // a read closed it -> new version
       cur_alloc[b] = next_alloc++;
@@ -193,6 +214,23 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
     ld->set_sram_alloc(cur_alloc[b]);
     vers[cur_alloc[b]].loads.push_back(ld);
   };
+  // A compute that freshly produces buffer b (b not read-and-written in place) opens
+  // a version like a load; the opener carries b's footprint (from buf_bytes). A
+  // version continues across the producing writes until a consuming read closes it,
+  // and its last reader frees it (sram_finalize) -- identical lifecycle to a load.
+  auto sram_on_write = [&](int64_t b, const std::shared_ptr<Instruction>& w) {
+    auto bb = buf_bytes.find(b);
+    if (bb == buf_bytes.end()) return;           // size unknown (never DMA'd) -> untracked
+    if (!cur_alloc.count(b) || !open_ver[b]) {   // a consuming read closed it -> new version
+      cur_alloc[b] = next_alloc++;
+      open_ver[b] = true;
+      vers[cur_alloc[b]] = {};
+      w->set_sram_alloc(cur_alloc[b]);
+      w->set_sram_footprint(bb->second);
+      vers[cur_alloc[b]].loads.push_back(w);
+    }
+    // already-open version (further producing writes): same physical bytes, no re-add.
+  };
   auto sram_on_read = [&](int64_t b, const std::shared_ptr<Instruction>& rd) {
     auto it = cur_alloc.find(b);
     if (it == cur_alloc.end()) return;           // not a load buffer -> untracked
@@ -232,39 +270,39 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
       int64_t uniq = next_tag++;                         // fresh Core tag key per dma record
       auto inst = make_dma(t, uniq);
       inst->set_tile_group(cur_tile_group);
-      size_t numel = 1;                                  // SRAM footprint (ready-tile ordering)
-      for (auto d : t.dims) numel *= (size_t)d;
-      tile->inc_required_sram_size(numel * (t.elem_bits / 8));
+      tile->inc_required_sram_size(rec_bytes(t));         // SRAM footprint (ready-tile ordering)
       if (t.dir == 1) {                                  // STORE
-        if (pending_bar) {
-          // after a compute fence: wait it (drains the async matmuls) -- covers
-          // the accumulator read, so no per-buffer read edge.
-          pending_bar->add_child(inst);
-          pending_bar.reset();
-          for (int64_t b : t.write_bufs) last_writer[b] = inst;
-          tile->append_instuction(inst);
-        } else {
-          link(inst, t.read_bufs, t.write_bufs);
-        }
+        // store reads the result buffer(s) -> link() JOINs all their writers.
+        link(inst, t.read_bufs, t.write_bufs);
         for (int64_t b : t.read_bufs) sram_on_read(b, inst);  // store frees what it drains
       } else {                                           // LOAD
         tile->append_instuction(inst);
         // async load: record it as the CURRENT load for this (tag_id, tag_slot)
         // with its fresh uniq; the barriers in this reduction iteration reuse that
         // uniq (1 load : N barriers). A new iteration's load overwrites it with a
-        // new uniq -> distinct tag key, no collision. last_writer = the dma for now;
+        // new uniq -> distinct tag key, no collision. writers = the dma for now;
         // the barrier overwrites it so consumers gate on data arrival. A sync load
         // has no barrier and blocks to arrival itself.
         if (t.is_async) current_dma[{t.tag_id, t.tag_slot}] = {uniq, inst};
-        for (int64_t b : t.write_bufs) last_writer[b] = inst;
-        for (int64_t b : t.write_bufs) sram_on_load(b, inst);   // occupy spad
+        for (int64_t b : t.write_bufs) {
+          // No hard WAR edge here: load-buffer reuse (double-buffering, X_spad/
+          // W_spad reloaded each reduction iter) is modeled by the SRAM
+          // version/capacity machinery (sram_on_load), which sizes how many
+          // versions physically coexist. A latency WAR edge would force
+          // single-buffering and kill the overlap the spad permits. (The
+          // accumulator Y is NOT a load buffer -> its cross-tile WAR is handled by
+          // the REPLACE branch of link() when the next tile's init overwrites it.)
+          writers[b] = { inst };
+          sram_on_load(b, inst);                         // occupy spad
+        }
       }
     } else if (t.kind == TraceRec::MEMORY_BAR) {
       // the explicit async-DMA sync (the original dma_wait). Pair with the CURRENT
       // load for this (tag_id, tag_slot), reusing its uniq Core key so the dma and
       // bar pair in the tag table; the dma releases the bar at issue-complete
-      // (add_child), then the bar parks on the tag until data-ready (resp-complete,
-      // set_tag_finish). Consumers of the loaded buffer then gate on the bar.
+      // (a DONE edge), then the bar parks on the tag until data-ready (resp-complete,
+      // set_tag_finish). Consumers of the loaded buffer then gate on the bar, so
+      // the bar (not the load) is the load's DONE-handle in writers(b).
       auto it = current_dma.find({t.tag_id, t.tag_slot});
       int64_t uniq = next_tag++;                         // fallback if unpaired
       std::shared_ptr<Instruction> dma_inst;
@@ -273,31 +311,29 @@ std::unique_ptr<TileGraph> trace_to_tilegraph(const togsim::RunResult& run,
       // so the buffer's consumers gate on it, instead of emitting a redundant barrier.
       auto bf = bar_for_load.find({t.tag_id, t.tag_slot});
       if (bf != bar_for_load.end() && bf->second.first == uniq) {
-        for (int64_t b : t.write_bufs) last_writer[b] = bf->second.second;
+        for (int64_t b : t.write_bufs) writers[b] = { bf->second.second };
         continue;
       }
       auto bar = make_mem_bar(t, uniq);
       bar->set_tile_group(cur_tile_group);
-      if (dma_inst) dma_inst->add_child(bar);
+      if (dma_inst) dma_inst->add_dep(bar, DepEvent::DONE);
       tile->append_instuction(bar);
-      for (int64_t b : t.write_bufs) last_writer[b] = bar;
+      // the bar is the load's DONE-handle: REPLACE writers(b) with it (no WAR edge --
+      // buffer reuse was already handled by the SRAM version machinery at the load).
+      for (int64_t b : t.write_bufs) writers[b] = { bar };
       bar_for_load[{t.tag_id, t.tag_slot}] = {uniq, bar};
     } else if (t.kind == TraceRec::COMPUTE) {
       auto inst = make_compute(t);
       inst->set_tile_group(cur_tile_group);
       link(inst, t.read_bufs, t.write_bufs);
-      for (int64_t b : t.read_bufs) sram_on_read(b, inst);     // frees the tiles it consumes
-      if (is_async_compute(t.compute_type)) outstanding_async.push_back(inst);
-    } else if (t.kind == TraceRec::COMPUTE_BAR) {
-      // explicit compute fence: ready once all outstanding async compute have
-      // ISSUED (pipeline-child release); the Core then waits the SA pipelines to
-      // drain before it finishes (-> the store it gates).
-      auto bar = std::make_shared<Instruction>(Opcode::COMPUTE_BAR);
-      bar->set_tile_group(cur_tile_group);
-      for (auto& a : outstanding_async) a->add_pipeline_child(bar);
-      outstanding_async.clear();
-      tile->append_instuction(bar);
-      pending_bar = bar;
+      // in-place buffers (read AND written) are version-transparent (accumulator,
+      // in-place vector): skip the self-read and the self-write so footprint is not
+      // double-counted. read_bufs/write_bufs are tiny, so a linear scan beats a set.
+      auto in = [](const std::vector<int64_t>& v, int64_t b) {
+        return std::find(v.begin(), v.end(), b) != v.end();
+      };
+      for (int64_t b : t.read_bufs)  if (!in(t.write_bufs, b)) sram_on_read(b, inst);   // consuming reads
+      for (int64_t b : t.write_bufs) if (!in(t.read_bufs, b))  sram_on_write(b, inst);  // fresh outputs
     }
   }
   flush();

From 2146ee50a65a9078ca39190125f6fadb3a46b2de Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Mon, 22 Jun 2026 20:45:34 +0900
Subject: [PATCH 11/24] [Frontend] Guard MLIR tile sizing against symbolic dims

Under torch.compile(dynamic=True) the Inductor loop ranges carry sympy
symbols (e.g. ks0/s52) instead of concrete ints. The tile-size
heuristics did concrete-int arithmetic on those ranges and crashed with
sympy "cannot determine truth value" before any MLIR was emitted.

Neutralize the tile-fit heuristics for symbolic dims: they only shave a
tile to a known dim to minimize the wasted tail, which is meaningless
when the dim is unknown at compile time. Skip them, keep the fixed init
tile, and let the tail become a runtime remainder (masked).

- trim_large_tail: skip a dim whose range is symbolic
- get_padding_ratio: report zero padding for a symbolic dim/tile
- is_dim_dividable: raise a clear NotImplementedError for symbolic dims
  (the recompile-to-divisible path has no symbolic equivalent and would
  loop forever; index_expr/indirect indexing under dynamic shape is a
  later step)
- make_choices: drop a symbolic axis from the tile-grow candidates

All guards are isinstance(sympy.Expr)-gated, so the concrete-shape path
is unchanged.
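
The guard pattern described above can be sketched in isolation. This is a
minimal illustration under stated assumptions, not the frontend's code:
`tail_waste` and the symbol name `s0` are hypothetical, and only the
isinstance(sympy.Expr) check mirrors the guards in this patch. It shows that
branching on a comparison against a symbol raises, while the guarded helper
returns a neutral value and leaves concrete ints on the usual path.

```python
import sympy


def tail_waste(tile: int, dim) -> int:
    """Hypothetical helper: elements padded in the last tile along one dim."""
    if isinstance(dim, sympy.Expr):
        # Symbolic dim: the tail is unknown at compile time, so report no waste
        # and leave the remainder to a runtime (masked) tile.
        return 0
    rem = dim % tile
    return 0 if rem == 0 else tile - rem


# Stand-in for an Inductor range symbol such as ks0 (name is an assumption).
s0 = sympy.Symbol("s0", integer=True, positive=True)

# Unguarded: branching on a symbolic comparison raises TypeError, because the
# relational 2*s0 <= 8 has no definite truth value.
unguarded_raised = False
try:
    if 8 >= s0 * 2:
        pass
except TypeError:
    unguarded_raised = True

print(unguarded_raised, tail_waste(8, 20), tail_waste(8, 16), tail_waste(8, s0))
```

Returning a neutral value (rather than raising) fits heuristics that only tune
tile size; a path with no symbolic equivalent, such as is_dim_dividable, is
better served by an explicit NotImplementedError, as this patch does.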

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 .../mlir/mlir_codegen_backend.py              |  8 +++++--
 PyTorchSimFrontend/mlir/mlir_common.py        | 21 +++++++++++++++++++
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
index 8f695395..7b0c0aeb 100644
--- a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
+++ b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
@@ -980,8 +980,12 @@ def make_choices(self, nodes, kernel_name):
                 for axis in list(candidate_axes):
                     prev_tile_sz = self.kernel_group.tile_desc.get_tile_size()
 
-                    # If tile size is maximized for this axis, remove from candidate axes
-                    if prev_tile_sz[axis] >= prev_ranges[axis] * 2 or prev_tile_sz[axis] >= 2 ** 13:
+                    # If tile size is maximized for this axis, remove from candidate axes.
+                    # Dynamic shape: a symbolic dim has no compile-time bound to grow the
+                    # tile toward, so drop the axis (keep the fixed tile) rather than
+                    # comparing tile >= sympy*2 (cannot determine truth value).
+                    if isinstance(prev_ranges[axis], sympy.Expr) or \
+                            prev_tile_sz[axis] >= prev_ranges[axis] * 2 or prev_tile_sz[axis] >= 2 ** 13:
                         candidate_axes.remove(axis)
                         self.reset(None)
                         continue
diff --git a/PyTorchSimFrontend/mlir/mlir_common.py b/PyTorchSimFrontend/mlir/mlir_common.py
index a70d1c7d..6614b1ca 100644
--- a/PyTorchSimFrontend/mlir/mlir_common.py
+++ b/PyTorchSimFrontend/mlir/mlir_common.py
@@ -328,6 +328,16 @@ def is_dim_dividable(self, dim_sizes: list[int]) -> bool:
         if len(dim_sizes) != len(self._tile_size):
             raise ValueError("dim_sizes must match the tile size dimensions")
 
+        # Dynamic shape: divisibility cannot be proven at compile time, and the
+        # recompile-to-divisible path (adjust_tile_to_divisible -> RecompileSignal)
+        # has no symbolic equivalent -- it would loop forever shrinking the tile to 1.
+        # index_expr / indirect indexing under dynamic shape is Step 2 (B3); fail
+        # clearly here instead of a sympy "cannot determine truth value" crash.
+        if any(isinstance(d, sympy.Expr) for d in dim_sizes):
+            raise NotImplementedError(
+                "index_expr/indirect indexing under dynamic shape is not supported "
+                "yet (symbolic dim reached is_dim_dividable)")
+
         dim_sizes_cpy = list(dim_sizes)
         axis, stride = self.vmap.vlane_split_axis, self.vmap.vlane_stride
         remain = dim_sizes_cpy[axis] % stride
@@ -395,6 +405,13 @@ def trim_large_tail(self, ranges: list[int]):
             constraint = self.tile_constraint[i]
             if constraint.fixed:
                 continue
+            # Dynamic shape: the tail-padding heuristic exists only to shave the tile
+            # to a KNOWN dim and minimize wasted tail. With a symbolic dim the tail
+            # extent is unknown, so keep the fixed init tile and let the tail become a
+            # runtime remainder tile (masked). Skipping also avoids %/comparison on a
+            # sympy symbol (cannot determine truth value).
+            if isinstance(dim_range, sympy.Expr):
+                continue
             elif constraint.must_divide_dim:
                 BETA = 0
 
@@ -460,6 +477,10 @@ def init_tile_size(ranges, vlane_stride, vector_lane):
 
     @staticmethod
     def get_padding_ratio(tile_range: int, dim_range: int) -> float:
+        # Dynamic shape: a symbolic dim has no compile-time tail, so report zero
+        # padding waste ("nothing to trim") rather than doing %/<= on a sympy symbol.
+        if isinstance(dim_range, sympy.Expr) or isinstance(tile_range, sympy.Expr):
+            return 0.0
         if tile_range <= 0 or dim_range <= 0:
             raise ValueError("tile_range and dim_range must be positive integers")
         tail = dim_range % tile_range

From cf2950cef80fa37d89d802ec5ee3d5effd11dc36 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Mon, 22 Jun 2026 21:22:51 +0900
Subject: [PATCH 12/24] [Frontend] Emit symbolic loop bounds and dynamic memref
 dims

Make the MLIR backend emit valid IR for torch.compile(dynamic=True). A
size symbol (e.g. ks0) now becomes a usable kernel argument and the loop
over the dynamic dim carries the symbol as a runtime bound:

- mlir_argdefs: a size-symbol arg had no buffer_types entry (it is not a
  buffer/graph_input/constant), so the lookup raised KeyError. Key it by
  name (which is also the host-side SymInt the wrapper passes) and
  describe it as a scalar int.
- get_mlir_shape: a symbolic numel becomes a dynamic memref dim ("?")
  instead of being stringified into an invalid type.
- LoopLevel: a symbolic upper bound is emitted as an index SSA value
  (%<name>_bound); a non-symbol symbolic expr raises NotImplementedError.
- codegen_loops: a prologue at the function top level loads each size arg
  (memref<1xi64>) and index_casts it to %<name>_bound, a valid affine
  symbol usable as the loop bound.

The emitted IR parses and lowers through the whole standard pipeline
(decompose/vlane -> fine-grained/vcix -> standard lowering) for a dynamic
elementwise add. Static kernels are unchanged (every path gates on
isinstance(.., sympy.Expr)).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 .../mlir/mlir_codegen_backend.py              | 13 ++++++++
 PyTorchSimFrontend/mlir/mlir_common.py        | 33 ++++++++++++++++---
 2 files changed, 42 insertions(+), 4 deletions(-)
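
Reviewer note: a minimal, stdlib-only sketch of the emission described
above. It is not the backend code. A plain str stands in for a sympy.Symbol
and an int for a static extent; emit_prologue and loop_header are
hypothetical helper names used only for illustration.

```python
def emit_prologue(loop_sizes):
    """Return MLIR lines that materialize each distinct symbolic size as
    %<name>_bound. A str models a size symbol; an int is a static extent."""
    names = []
    for size in loop_sizes:
        if isinstance(size, str) and size not in names:
            names.append(size)          # keep first-seen order, no duplicates
    lines = []
    if names:
        lines.append("%dyn_zero = arith.constant 0 : index")
    for nm in names:
        lines.append(f"%{nm}_val = memref.load %{nm}[%dyn_zero] : memref<1xi64>")
        lines.append(f"%{nm}_bound = arith.index_cast %{nm}_val : i64 to index")
    return lines


def loop_header(var, start, size, step):
    """Symbolic size -> SSA bound name; static size -> integer literal."""
    bound = f"%{size}_bound" if isinstance(size, str) else str(size)
    return f"affine.for %{var} = {start} to {bound} step {step}"


prologue = emit_prologue(["s52", 128, "s52"])
header = loop_header("index0", 0, "s52", 512)
print("\n".join(prologue + [header]))
```

The symbol is loaded once even when several loops share it, and a static
loop keeps its literal bound, which is why static kernels are unaffected.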

diff --git a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
index 7b0c0aeb..9aa64caa 100644
--- a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
+++ b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
@@ -903,6 +903,19 @@ def codegen_loops(self):
         code.splice(self.const_buffer)
         code.splice(self.alloc_buffer)
         code.splice(self.spad_buffer)
+        # Dynamic shape: materialize each symbolic loop bound as an index SSA at the
+        # function top level (a valid affine symbol). The extent arrives as a
+        # memref<1xi64> arg named after the symbol (mlir_argdefs sizevars); load and
+        # cast it once before the loop nest. LoopLevel._bound_str emits %<name>_bound.
+        dyn_syms = []
+        for lp in loops.loops + reductions.loops:
+            if isinstance(lp.size, sympy.Symbol) and lp.size.name not in dyn_syms:
+                dyn_syms.append(lp.size.name)
+        if dyn_syms:
+            code.writeline("%dyn_zero = arith.constant 0 : index")
+            for nm in dyn_syms:
+                code.writeline(f"%{nm}_val = memref.load %{nm}[%dyn_zero] : memref<1xi64>")
+                code.writeline(f"%{nm}_bound = arith.index_cast %{nm}_val : i64 to index")
         # Outerloop
         with contextlib.ExitStack() as stack:
             for loop in loops.loops:
diff --git a/PyTorchSimFrontend/mlir/mlir_common.py b/PyTorchSimFrontend/mlir/mlir_common.py
index 6614b1ca..6b8d905f 100644
--- a/PyTorchSimFrontend/mlir/mlir_common.py
+++ b/PyTorchSimFrontend/mlir/mlir_common.py
@@ -177,7 +177,12 @@ def is_mlir_arg_inout(value):
     @staticmethod
     def get_mlir_shape(info):
         tensor_type = DTYPE_TO_MLIR[info[0]]
-        return f"memref<{info[1]}x{tensor_type}>"
+        numel = info[1]
+        # Dynamic shape: a symbolic numel becomes a dynamic memref dim ("?"); the
+        # actual extent arrives at runtime via the size-symbol arg (mlir_argdefs
+        # sizevars) and is materialized as the loop bound (codegen_loops).
+        dim = "?" if isinstance(numel, sympy.Expr) else numel
+        return f"memref<{dim}x{tensor_type}>"
 
     def mlir_argdefs(self, extra_node=dict()):
         buffer_types = {}
@@ -224,7 +229,15 @@ def set_info(outer, inner, arg_type):
                 continue
             set_info(outer, inner, self.MLIR_ARGS_OUT)
         for outer, inner in self.sizevars.items():
-            set_info(outer, inner, self.MLIR_ARGS_VAR)
+            # Dynamic shape: a size symbol (e.g. s52) is not a buffer/graph_input/
+            # constant, so buffer_types has no entry for it. Key it by its NAME (str)
+            # like a buffer -- the symbol's name is also the host-side SymInt variable
+            # the wrapper passes at the call site -- and describe it as a scalar int
+            # (-> memref<1x i64>), mirroring the sympy graph_input case above.
+            name = str(outer)
+            if name not in buffer_types:
+                buffer_types[name] = [get_sympy_Expr_dtype(outer), 1, [1], [1]]
+            set_info(name, inner, self.MLIR_ARGS_VAR)
         return arg_defs, call_args, arg_attributes, buffer_types
 
 class VectorLaneMapping():
@@ -1040,14 +1053,26 @@ class LoopLevel:
     reduction_vars: Dict[str, str] = dataclasses.field(default_factory=dict)
     affine_yield: Dict[str, str] = dataclasses.field(default_factory=dict)
 
+    def _bound_str(self):
+        # Dynamic shape: a symbolic upper bound is emitted as an index SSA value
+        # (%<name>_bound, materialized at the function top level by codegen_loops),
+        # which is a valid affine symbol; a concrete bound stays an integer literal.
+        if isinstance(self.size, sympy.Expr) and not self.size.is_number:
+            if not isinstance(self.size, sympy.Symbol):
+                raise NotImplementedError(
+                    f"dynamic loop bound must be a single size symbol, got {self.size}")
+            return f"%{self.size.name}_bound"
+        return f"{self.size}"
+
     def lines(self):
+        bound = self._bound_str()
         if len(self.reduction_vars):
             acc = ', '.join([f"%{acc.name}" for acc in self.reduction_vars.keys()])
             args = ', '.join([f"%{iter.name} = %{init.name}" for (_, iter, init, _) in self.reduction_vars.values()])
             dtype = ', '.join([f"{dtype}" for (_, _, _, dtype) in self.reduction_vars.values()])
-            line = f"{acc} = affine.for %{self.var} = {self.start} to {self.size} step {self.step} iter_args({args}) -> ({dtype})"
+            line = f"{acc} = affine.for %{self.var} = {self.start} to {bound} step {self.step} iter_args({args}) -> ({dtype})"
         else:
-            line = f"affine.for %{self.var} = {self.start} to {self.size} step {self.step}"
+            line = f"affine.for %{self.var} = {self.start} to {bound} step {self.step}"
 
         return [line]
 

From 5743a20b702b526c90c45b571f71571b0ca8e194 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Mon, 22 Jun 2026 21:37:47 +0900
Subject: [PATCH 13/24] [Frontend] Make the kernel meta import-safe under
 dynamic shape

torch.compile(dynamic=True) puts sympy size symbols (e.g. s52) in the
arg_attributes shape/stride fields. define_kernel emitted that list as a
module-scope Python literal in the generated wrapper, so a bare s52 was
undefined at import time and raised NameError before call() ran.

Recursively stringify sympy expressions in the meta before emitting it
('s52'). The real extent already reaches the kernel as a runtime arg (the
wrapper's call() computes s52 from the input tensor shape and passes it),
so the compile-time descriptor only needs to be import-safe and
shape-agnostic. No-op for static kernels (their meta has no sympy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/mlir/mlir_scheduling.py | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
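
Reviewer note: a small stdlib-only reproduction of the failure mode and the
fix. Sym is a hypothetical stand-in for a sympy symbol (its repr is the bare
name, as with sympy); literalize mirrors the recursion in _literalize_meta
but is not the patched method itself.

```python
class Sym:
    """Hypothetical stand-in for a sympy symbol: repr() is the bare name."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def literalize(obj):
    """Recursively replace symbols with their string name."""
    if isinstance(obj, Sym):
        return str(obj)                 # str() falls back to __repr__
    if isinstance(obj, (list, tuple)):
        return type(obj)(literalize(x) for x in obj)
    return obj


def imports_cleanly(src):
    """Execute src as module-scope code; report whether it raised NameError."""
    try:
        exec(src, {})
        return True
    except NameError:
        return False


# Layout assumed from the patch: [name, [atype, dtype, numel, sizes, stride]]
meta = [["arg0", [1, "float32", Sym("s52"), [Sym("s52")], [1]]]]
bare = f"arg_attributes={meta}"
safe = f"arg_attributes={literalize(meta)}"

print(bare, imports_cleanly(bare))
print(safe, imports_cleanly(safe))
```

The bare form embeds s52 as an identifier, so evaluating it at module scope
fails; the stringified form is an ordinary literal.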

diff --git a/PyTorchSimFrontend/mlir/mlir_scheduling.py b/PyTorchSimFrontend/mlir/mlir_scheduling.py
index 8520596c..cb73c23e 100644
--- a/PyTorchSimFrontend/mlir/mlir_scheduling.py
+++ b/PyTorchSimFrontend/mlir/mlir_scheduling.py
@@ -321,6 +321,20 @@ def define_function(self, kernel):
                 wrapper.header.writeline(code)
                 self.outer_function.add(function_name)
 
+    @staticmethod
+    def _literalize_meta(obj):
+        """Render meta (arg_attributes) as a valid Python literal for the generated
+        wrapper. Dynamic shapes put sympy symbols (e.g. s52) in the shape/stride
+        fields; emitted bare they are undefined at module scope -> NameError on
+        import. Stringify them ('s52'); the real extent arrives as a runtime kernel
+        arg (see the wrapper's call() body), so the compile-time descriptor only
+        needs to be import-safe and shape-agnostic."""
+        if isinstance(obj, sympy.Expr):
+            return str(obj)
+        if isinstance(obj, (list, tuple)):
+            return type(obj)(MLIRScheduling._literalize_meta(x) for x in obj)
+        return obj
+
     def define_kernel(self, src_code, meta_code, kernel_name, vector_lane, spad_info, loop_size=None, origins={}):
         wrapper = V.graph.wrapper_code
         if src_code in wrapper.src_to_kernel:
@@ -333,7 +347,7 @@ def define_kernel(self, src_code, meta_code, kernel_name, vector_lane, spad_info
             codecache_def.writeline(f"loop_size={loop_size},")
             codecache_def.writeline(f"spad_info={spad_info},")
             codecache_def.writeline(f"origins={origins},")
-            codecache_def.writeline(f"arg_attributes={meta_code},")
+            codecache_def.writeline(f"arg_attributes={self._literalize_meta(meta_code)},")
             headers = extension_codecache.get_header(src_code)
             if headers is not None:
                 codecache_def.writeline(f"global_var_header='''{headers[0]}''',")

From a3a8c575e4b2b29a60c7d7d6ff6e7422936b89d3 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Mon, 22 Jun 2026 21:44:13 +0900
Subject: [PATCH 14/24] [Frontend] Skip compile-time Spike validation for
 dynamic-shape kernels

The functional (Spike) validation binary is generated in MLIRCodeCache.load
at compile time with the tensor extent baked into the host buffer sizes
(mlir_caller_codegen allocates each buffer from arg_size). Under
torch.compile(dynamic=True) the extent is a runtime value (memref<?>), so
there is no concrete size to instantiate the fixed-shape validation binary
-- generate_args_define would size a buffer from the symbol and fail.

Skip the functional-validation block when the kernel MLIR carries a dynamic
memref dim (same effect as pytorchsim_functional_mode=off). The kernel is
still compiled shape-agnostically and timed via the gem5/TOG + trace path;
correctness of a dynamic kernel is validated at its concrete instantiation,
not at compile time.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/extension_codecache.py | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
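
Reviewer note: the gate is a plain substring test on the kernel source. The
sketch below only illustrates what it does and does not match; the two
signatures are made-up examples, not output captured from the backend.

```python
def is_dynamic_shape(source_code):
    """Same substring test as the patch: a memref whose FIRST dim is '?'."""
    return "memref<?" in source_code


static_sig = "func.func @kernel(%arg0: memref<1024xf32>)"
dynamic_sig = "func.func @kernel(%arg0: memref<?xf32>, %s52: memref<1xi64>)"
inner_dyn = "func.func @kernel(%arg0: memref<4x?xf32>)"

print(is_dynamic_shape(static_sig))   # no dynamic dim
print(is_dynamic_shape(dynamic_sig))  # leading dynamic dim
print(is_dynamic_shape(inner_dyn))    # dynamic dim that is not leading
```

The test misses a dynamic dim that is not the leading one. As far as this
series shows, get_mlir_shape emits rank-1 memrefs, so the leading dim is the
only one; that is an assumption worth re-checking if multi-dim memrefs are
ever emitted.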

diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index 785a3d95..6e9415b5 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -172,8 +172,16 @@ def load(cls, source_code,
                 link_option = f"-Wl,--section-start=.spad=0x{spad_info['spad_vaddr']:x}"
             else:
                 link_option = ""
-            # Generate LLVM kernel calller and binary for validation
-            if extension_config.pytorchsim_functional_mode:
+            # Generate LLVM kernel caller and binary for validation.
+            # Dynamic shape: the functional (Spike) validation binary is built here at
+            # compile time with the tensor extent baked into the host buffer sizes
+            # (mlir_caller_codegen allocates from arg_size). A runtime-determined extent
+            # (memref<?>) has no concrete size at compile time, so the fixed-shape
+            # validation cannot be instantiated -- skip it (same effect as
+            # pytorchsim_functional_mode=off). The kernel is still compiled
+            # shape-agnostically and timed via the gem5/TOG + trace path below.
+            is_dynamic_shape = "memref<?" in source_code
+            if extension_config.pytorchsim_functional_mode and not is_dynamic_shape:
                 # Use custom malloc to avoid size error
                 new_link_option = link_option + " -Wl,--wrap=malloc -Wl,--wrap=free"
                 cmds = mlir_compile_command(new_input_path, vectorlane_size, vlen=vlen)

From 5164d86b636e94d8db879977a6d1c16b1f180c11 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Mon, 22 Jun 2026 23:36:22 +0900
Subject: [PATCH 15/24] [Frontend] Sample per-tile cycles on a one-tile copy
 (dynamic shape)

gem5 measures per-tile compute cost, which is shape-invariant. Add
pin_loops_to_one_tile (cycle_table.py): a general MLIR-bindings rewrite
that forces every affine.for which would iterate more than once to run a
single tile (upper bound -> the loop step). It handles both a constant
multi-iteration bound and a symbolic (runtime-extent) bound, so the cpp
TOG cycle sampling can use it for static and dynamic kernels alike.

Wire it into MLIRCodeCache.load for dynamic shape: run the legacy cycle
machinery (run_tog -> _custom.mlir -> cycle binary -> gem5) on a one-tile
COPY of the post-vcix IR, while the symbolic _postvcix.mlir is kept for
the producer .so / cycle_table. The sampling host buffers are sized to
one tile (_concretize_attrs_for_sampling), and the legacy ONNX TOG output
(generate_tile_graph) is skipped for dynamic kernels (it enumerates tiles
statically and is unused when the trace path is the default sim path).
dump_metadata now also tolerates a scalar size argument.

Static kernels are unchanged (every new branch gates on a dynamic memref
dim). Wiring the static cycle sampling through pin_loops_to_one_tile too
is the intended next step but needs the sampling decoupled from run_tog
(which also builds the legacy full TOG).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/extension_codecache.py     | 75 +++++++++++++++----
 PyTorchSimFrontend/mlir/passes/cycle_table.py | 49 ++++++++++++
 2 files changed, 111 insertions(+), 13 deletions(-)
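
Reviewer note: a stdlib-only model of the two steps above, assuming each
loop can be summarized as (upper_bound, step) with a str for a symbolic
bound. pin_loops and concretize are hypothetical names; the real code
rewrites affine.for ops through the MLIR bindings. The atype values 1 and 3
in the example are placeholders.

```python
def pin_loops(loops):
    """loops: list of (upper_bound, step); upper_bound is an int or a symbol
    name (str). Returns (pinned_loops, tile), tile = largest pinned step."""
    tile = 1
    pinned = []
    for ub, step in loops:
        if isinstance(ub, int) and ub <= step:
            pinned.append((ub, step))   # already a single iteration
            continue
        pinned.append((step, step))     # bound forced to the step
        tile = max(tile, step)
    return pinned, tile


def concretize(arg_attributes, tile):
    """Replace stringified symbolic extents with the one-tile size."""
    def cz(v):
        return tile if isinstance(v, str) else v

    out = []
    for name, (atype, dtype, numel, sizes, stride) in arg_attributes:
        out.append([name, [atype, dtype, cz(numel), [cz(s) for s in sizes], stride]])
    return out


pinned, tile = pin_loops([("s52_bound", 512), (4096, 512), (16, 16)])
attrs = [["arg0", [1, "float32", "s52", ["s52"], [1]]],
         ["s52", [3, "int64", 1, [1], [1]]]]
sample_attrs = concretize(attrs, tile)
print(pinned, tile)
print(sample_attrs)
```

Both the symbolic and the multi-iteration constant loop collapse to one
iteration, and the tensor buffer is sized to that tile while the scalar size
arg keeps numel 1.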

diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index 6e9415b5..de585a6c 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -41,9 +41,26 @@ def dump_metadata(args, arg_attributes, path):
 
     with open(meta_path, "a") as file:
         for (arg_name, arg_attribute), arg in zip(arg_attributes, args):
-            file.write(f'{arg_name}=({arg_attribute[0]}, {arg.dtype}, {arg.shape})\n')
+            if isinstance(arg, torch.Tensor):
+                file.write(f'{arg_name}=({arg_attribute[0]}, {arg.dtype}, {arg.shape})\n')
+            else:
+                # Dynamic shape: a scalar size argument (e.g. s52) -- not a tensor.
+                file.write(f'{arg_name}=({arg_attribute[0]}, {type(arg).__name__}, {arg})\n')
     return
 
+def _concretize_attrs_for_sampling(arg_attributes, tile):
+    """Size the cycle-sampling host buffers to one tile. Under dynamic shape the
+    arg_attributes carry stringified symbolic extents (e.g. 's52'); the one-tile
+    sampling kernel only touches [0, tile) of each tensor, so replace any symbolic
+    numel/size with `tile` (a static int). Non-symbolic entries (e.g. the size
+    arg, numel 1) are left as is."""
+    cz = lambda v: tile if isinstance(v, str) else v
+    out = []
+    for name, (atype, dtype, numel, sizes, stride) in arg_attributes:
+        out.append([name, [atype, dtype, cz(numel), [cz(s) for s in sizes], stride]])
+    return out
+
+
 def mlir_compile_command(filename, vectorlane_size, vlen=256):
     # The C++ -dma-fine-grained and -test-pytorchsim-to-vcix passes are ported to
     # Python (passes/dma_fine_grained.py, lower_to_vcix.py), run in-process between
@@ -238,7 +255,29 @@ def load(cls, source_code,
                 run_module_passes(sample_mlir_path + "_padded.mlir",
                                   sample_mlir_path + "_postvcix.mlir",
                                   POST_OPT_PASSES, vectorlane=vectorlane_size, vlen=vlen)
-                run_tog(sample_mlir_path + "_postvcix.mlir", raw_tog_path,
+                # Dynamic shape: gem5 measures per-tile compute cost, which is
+                # shape-invariant. Sample it on a one-tile copy (each symbolic loop
+                # bound pinned to its step) so the legacy cycle machinery runs on a
+                # concrete kernel, while the symbolic _postvcix.mlir is kept for the
+                # producer .so / cycle_table below.
+                # pin_loops_to_one_tile is general (static + dynamic); today it is
+                # wired only for dynamic, where the legacy full TOG cannot be built
+                # (symbolic trip count) and is skipped anyway. Driving the trace
+                # path's cycle sampling through it for STATIC too is the intended
+                # direction, but needs the sampling decoupled from run_tog first
+                # (run_tog also builds the legacy full TOG, which needs full loops).
+                tog_input = sample_mlir_path + "_postvcix.mlir"
+                sample_tile = None
+                if is_dynamic_shape:
+                    import mlir.ir as _ir
+                    from PyTorchSimFrontend.mlir.passes.cycle_table import pin_loops_to_one_tile
+                    _ctx = _ir.Context(); _ctx.allow_unregistered_dialects = True
+                    with _ctx:
+                        _pm = _ir.Module.parse(open(tog_input).read(), _ctx)
+                        sample_tile = pin_loops_to_one_tile(_pm)
+                        tog_input = sample_mlir_path + "_pinned.mlir"
+                        open(tog_input, "w").write(str(_pm))
+                run_tog(tog_input, raw_tog_path,
                         sample_mlir_path + "_custom.mlir",
                         sample_mode=extension_config.CONFIG_TLS_MODE,
                         vectorlane=vectorlane_size)
@@ -254,8 +293,13 @@ def load(cls, source_code,
             if not extension_config.pytorchsim_timing_mode:
                 return key
 
-            # Generate MLIR kernel calller and binary for cycle calculation
-            cycle_llvm_caller = MLIRKernelCallerCodeGen(False, arg_attributes, cycle_sim=True)
+            # Generate MLIR kernel caller and binary for cycle calculation.
+            # Dynamic shape: size the host buffers to one tile (the sampling kernel
+            # was pinned to a single tile above); arg_attributes carry symbolic
+            # extents that cannot size a buffer.
+            sample_attrs = (_concretize_attrs_for_sampling(arg_attributes, sample_tile)
+                            if is_dynamic_shape else arg_attributes)
+            cycle_llvm_caller = MLIRKernelCallerCodeGen(False, sample_attrs, cycle_sim=True)
             cycle_llvm_caller.generate_wrapper_file(write_path, cycle_wrapper_name)
             cycle_llvm_caller.compile_wih_kernel(write_path, key + "_sample", cycle_wrapper_name, cycle_binary_name, link_option)
 
@@ -281,15 +325,20 @@ def load(cls, source_code,
             if kwargs['loop_size'] is not None and kwargs['loop_size'][-1] < vectorlane_size:
                 w_offset = kwargs['loop_size'][-1]
             w_offset = 0 # max(w_offset - x_offset, 0)
-            tile_graph_generator = tog_generator(origins)
-            tile_graph_generator.load_file(raw_tog_path)
-            tile_graph_generator.generate_tile_graph(
-                tog_path,
-                cycle_list=cycle_list,
-                x_offset=x_offset, # FIXME.
-                w_offset=w_offset, # FIXME.
-                vector_lane=vectorlane_size
-            )
+            # DEPRECATED legacy ONNX-TOG output (tile_graph.onnx); unused when the
+            # trace pipeline is the default sim path. It enumerates tiles statically,
+            # so it cannot be built for a dynamic (runtime-extent) kernel -- skip it.
+            # x_offset/w_offset above are still needed by the trace cycle_table.
+            if not is_dynamic_shape:
+                tile_graph_generator = tog_generator(origins)
+                tile_graph_generator.load_file(raw_tog_path)
+                tile_graph_generator.generate_tile_graph(
+                    tog_path,
+                    cycle_list=cycle_list,
+                    x_offset=x_offset, # FIXME.
+                    w_offset=w_offset, # FIXME.
+                    vector_lane=vectorlane_size
+                )
 
             # Trace pipeline (DEFAULT): emit the compiled trace producer .so + the
             # cycle-table TSV from the post-vcix IR and gem5 cycle_list/offsets. This
diff --git a/PyTorchSimFrontend/mlir/passes/cycle_table.py b/PyTorchSimFrontend/mlir/passes/cycle_table.py
index 40dd3459..2cd99daf 100644
--- a/PyTorchSimFrontend/mlir/passes/cycle_table.py
+++ b/PyTorchSimFrontend/mlir/passes/cycle_table.py
@@ -49,6 +49,55 @@ def overlapping_cycle(cycle, compute_type, x_offset, w_offset):
     return max(int(cycle) - int(offset), 0)
 
 
+def pin_loops_to_one_tile(module):
+    """Pin every affine.for that would run more than once to a SINGLE tile, by
+    forcing its upper bound to the loop's step (one iteration). The cpp-TOG cycle
+    sampling needs only per-tile compute cost, which is shape-invariant -- one tile
+    is enough -- so this is the general sampling reduction for BOTH static and
+    dynamic kernels (it replaces the legacy sample-mode step rewrite for the trace
+    path):
+
+      * static bound C > step S  -> set bound = S (was ceil(C/S) iterations).
+      * symbolic bound (%..._bound, dynamic dim) -> set bound = S (runtime extent
+        unknown; one tile suffices and avoids needing the extent at all).
+      * bound already <= step (e.g. the innermost compute loop) -> left as is.
+
+    Run this on a COPY used only for gem5 sampling; the original module is kept for
+    the producer .so / cycle_table (both stay shape-agnostic). Mutates `module` in
+    place. Returns the largest pinned step (tile element count) for sizing the
+    sampling host buffers.
+    """
+    tile = 1
+    idx_t = ir.IndexType.get()
+    for op in list(walk_ops(module.body)):
+        o = op.operation
+        if o.name != "affine.for":
+            continue
+        step = ir.IntegerAttr(o.attributes["step"]).value
+        ub_map = ir.AffineMapAttr(o.attributes["upperBoundMap"]).value
+        const_ub = (len(ub_map.results) == 1
+                    and ir.AffineConstantExpr.isinstance(ub_map.results[0]))
+        if const_ub:
+            ub = ir.AffineConstantExpr(ub_map.results[0]).value
+            if ub <= step:
+                continue                           # already a single iteration
+            # constant, multi-iteration: rewrite the bound map to the step
+            o.attributes["upperBoundMap"] = ir.AffineMapAttr.get(
+                ir.AffineMap.get_constant(step))
+        else:
+            # symbolic bound: replace its SSA upper-bound operand with a constant=step
+            seg = o.attributes["operandSegmentSizes"]
+            n_lb = seg[0]                           # [lb operands, ub operands, iter operands]
+            ub_val = o.operands[n_lb]
+            cst = ir.Operation.create(
+                "arith.constant", results=[idx_t],
+                attributes={"value": ir.IntegerAttr.get(idx_t, step)},
+                ip=ir.InsertionPoint(op), loc=ir.Location.unknown())
+            ub_val.replace_all_uses_with(cst.results[0])
+        tile = max(tile, step)
+    return tile
+
+
 def _compute_types(skeleton_module):
     """tile_id-ordered list of compute_type ints, from the skeleton's
     togsim.compute ops."""

From 62010312ad34c99d2d5e1bab3a4a118a77934ce7 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Tue, 23 Jun 2026 12:54:31 +0900
Subject: [PATCH 16/24] [Frontend] Emit a dynamic-shape trace producer
 (shape_args loop bounds)

Make the C++ trace producer .so build for a dynamic (runtime-extent)
kernel, so its loop bounds are read at runtime from shape_args.

- build_tog._build gains serialize=False: build_skeleton only needs the
  builder side effects (loop/compute/DMA nodes), not the serialized TOG
  string, whose display() formats a constant loop_end -- None for a
  dynamic loop. The bound stays on the affine.for in the IR.
- lower_to_emitc._rewrite_signature: an original kernel arg still used
  after build_skeleton's DCE is a size symbol (its memref.load feeds a
  loop bound; tensors are referenced by name in togsim.dma attrs and DCE
  to unused). Re-source each such load from shape_args[k] via
  emitc.subscript (k = the size arg's order), then drop the arg. The
  producer's loop then reads the runtime extent: for (iv=0; iv<shape_args[k]; ...).

Verified: a dynamic elementwise add builds one trace.so whose recorded
trace scales with shape_args (1024 -> 14 insts, 2048 -> 28).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 .../mlir/passes/build_skeleton.py             |  5 +-
 PyTorchSimFrontend/mlir/passes/build_tog.py   | 14 +++--
 .../mlir/passes/lower_to_emitc.py             | 54 +++++++++++++++----
 3 files changed, 59 insertions(+), 14 deletions(-)
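
Reviewer note: a Python model of the producer loop shape after this patch,
to show why the recorded trace grows with shape_args. The tile size of 512
and the four records per tile are illustrative assumptions, so the counts
here do not reproduce the 14/28 figures above; only the linear scaling is
the point.

```python
def togsim_kernel(shape_args, tile=512):
    """Model of the emitted producer: the loop bound is shape_args[0], read
    at run time, rather than a constant baked in at compile time."""
    trace = []
    for iv in range(0, shape_args[0], tile):
        trace.append(("dma_load", "arg0", iv))
        trace.append(("dma_load", "arg1", iv))
        trace.append(("compute", iv))
        trace.append(("dma_store", "out", iv))
    return trace


small = togsim_kernel([1024])
large = togsim_kernel([2048])
print(len(small), len(large))
```

One compiled producer serves every extent: doubling shape_args[0] doubles
the number of tiles visited, and with it the number of recorded entries.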

diff --git a/PyTorchSimFrontend/mlir/passes/build_skeleton.py b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
index 4c3d89cb..cb011137 100644
--- a/PyTorchSimFrontend/mlir/passes/build_skeleton.py
+++ b/PyTorchSimFrontend/mlir/passes/build_skeleton.py
@@ -499,7 +499,10 @@ def build_skeleton(module):
     """
     _reset_ids()
     builder = TogBuilder()
-    _build(module, builder)  # populates loop/compute nodes + op back-pointers
+    # serialize=False: we only need the builder side effects (loop/compute/DMA
+    # nodes), not the TOG string -- and display() needs a constant loop_end, which
+    # is None for a dynamic loop. The loop bound stays on the affine.for in the IR.
+    _build(module, builder, serialize=False)
 
     block = _kernel_block(module)
     if block is None:
diff --git a/PyTorchSimFrontend/mlir/passes/build_tog.py b/PyTorchSimFrontend/mlir/passes/build_tog.py
index ae515010..11fe6843 100644
--- a/PyTorchSimFrontend/mlir/passes/build_tog.py
+++ b/PyTorchSimFrontend/mlir/passes/build_tog.py
@@ -1047,8 +1047,15 @@ def _find_kernel(module):
     return None
 
 
-def _build(module, builder):
-    """Build the graph and return its display string, populating `builder`."""
+def _build(module, builder, serialize=True):
+    """Build the graph, populating `builder`; return its display string.
+
+    `serialize=False` skips the bfs/display string pass and returns "". The
+    skeleton path (build_skeleton) only needs the builder side effects (loop /
+    compute / DMA nodes), not the serialized TOG, and display() formats a constant
+    `loop_end` -- which is None for a dynamic (runtime-extent) loop. The loop bound
+    itself is carried by the affine.for op in the IR (lowered to a runtime-bounded
+    loop downstream), so the skeleton does not need it serialized here."""
     func_op = _find_kernel(module)
     if func_op is None:
         return ""
@@ -1063,7 +1070,8 @@ def _build(module, builder):
         root = TOGNode("root")
         builder._reset_matmul_fsm()
         builder.print_operation(op, root)
-        root.bfs(out)
+        if serialize:
+            root.bfs(out)
     return "".join(out)
 
 
diff --git a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
index 3d1f7cde..a6c3b1a8 100644
--- a/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
+++ b/PyTorchSimFrontend/mlir/passes/lower_to_emitc.py
@@ -117,23 +117,57 @@ def _strip_aux(module):
 def _rewrite_signature(kernel, ctx):
     """Replace @kernel's memref tensor args with the ABI args
     (EmitCtx*, int64_t* shape_args, int32_t n) and rename it to togsim_kernel.
-    Returns the ctx Value."""
+    Returns the ctx Value.
+
+    Dynamic shape: any original arg still USED after build_skeleton's DCE is a size
+    symbol (memref<1xi64>) whose load feeds a loop bound -- tensor args are
+    referenced by name in the togsim.dma attrs, not by SSA value, so they DCE to
+    unused. Re-source each such `memref.load %argSize[..]` from `shape_args[k]`
+    (k = the size arg's order; the runtime fills shape_args in the same order), so
+    the producer's loop bound reads the runtime extent and the arg can be dropped.
+    """
     block = kernel.regions[0].blocks[0]
-    for arg in block.arguments:
-        if len(list(arg.uses)) > 0:
-            raise ValueError(
-                "kernel arg still used after build_skeleton; cannot drop it "
-                "(expected the DCE to have removed all tensor-data ops)")
-    # erase existing (memref) args high-to-low, then append the ABI args.
-    for i in reversed(range(len(block.arguments))):
-        block.erase_argument(i)
+    orig_args = list(block.arguments)
+    loc = ir.Location.unknown(ctx)
     ptr = ir.Type.parse(CTX_TYPE, ctx)
     i64ptr = ir.Type.parse("!emitc.ptr<i64>", ctx)
     i32 = ir.IntegerType.get_signless(32)
-    loc = ir.Location.unknown(ctx)
+    # Append the ABI args first so shape_args exists to re-source size reads from.
     block.add_argument(ptr, loc)
     block.add_argument(i64ptr, loc)
     block.add_argument(i32, loc)
+    shape_args = block.arguments[len(orig_args) + 1]
+
+    idx_t = ir.IndexType.get()
+    i64_t = ir.IntegerType.get_signless(64)
+    k = 0
+    for a in orig_args:
+        if not list(a.uses):
+            continue
+        for use in list(a.uses):
+            ld = use.owner
+            if ld.name != "memref.load":
+                raise ValueError(
+                    "kernel arg still used after build_skeleton by %s; only a size "
+                    "load (memref.load) is expected under dynamic shape" % ld.name)
+            ip = ir.InsertionPoint(ld)
+            kc = ir.Operation.create(
+                "arith.constant", results=[idx_t],
+                attributes={"value": ir.IntegerAttr.get(idx_t, k)}, ip=ip, loc=loc)
+            sub = ir.Operation.create(
+                "emitc.subscript", results=[i64_t],
+                operands=[shape_args, kc.results[0]], ip=ip, loc=loc)
+            ld.results[0].replace_all_uses_with(sub.results[0])
+            ld.erase()
+        k += 1
+
+    # every original arg is unused now -> drop them, leaving only the ABI args.
+    for a in orig_args:
+        if len(list(a.uses)) > 0:
+            raise ValueError(
+                "kernel arg still used after the shape rewrite; cannot drop it")
+    for i in reversed(range(len(orig_args))):
+        block.erase_argument(i)
     kernel.operation.attributes["function_type"] = ir.TypeAttr.get(
         ir.FunctionType.get([ptr, i64ptr, i32], []))
     kernel.operation.attributes["sym_name"] = ir.StringAttr.get(ENTRY)

From 7d985f1f9afe2f586c5a553f7cbab4a552236c46 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Tue, 23 Jun 2026 12:54:45 +0900
Subject: [PATCH 17/24] [TOGSim] Pass the runtime shape to the trace producer
 via the attribute file

The dynamic trace producer reads its loop bounds from shape_args; feed
them at simulation time through the existing per-kernel attribute YAML
(the file that already carries address_info), not a bespoke channel.

- write_kernel_attribute_file: a scalar input (a dynamic size arg, e.g.
  s52) is not a tensor address -- collect such scalars into a shape_args
  sequence in the YAML, in arg order (== the producer's shape_args[k]).
- run_standalone: pass --attribute <yaml> alongside --trace_so so the
  trace path receives it, the same file the legacy path passes via the
  models_list command.
- main.cc: add --attribute; in the trace branch load the YAML and fill
  shape_args from its shape_args sequence, passed to run_producer (was
  nullptr,0).
- run_kernel_simulation: skip the Spike functional run for a dynamic
  kernel (its fixed-shape validation binary is intentionally not built).

Verified end to end: one compiled add runs at 1024 (183 cycles) and 2048
(261 cycles) from the same trace.so, driven by shape_args.
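
As a rough illustration of the write_kernel_attribute_file change above, the
split between tensor addresses and scalar size args can be sketched as follows.
This is a minimal sketch, not the code in Simulator/simulator.py: FakeTensor and
split_kernel_inputs are hypothetical names, and FakeTensor stands in for
torch.Tensor so the example runs without PyTorch.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor; only data_ptr() is modeled."""

    def __init__(self, addr):
        self._addr = addr

    def data_ptr(self):
        return self._addr


def split_kernel_inputs(inputs):
    """Return the YAML-ready dict: tensors go to address_info (keyed by arg
    position), scalars go to shape_args in arg order."""
    address_info = {}
    shape_args = []
    for idx, arg in enumerate(inputs):
        if isinstance(arg, FakeTensor):
            address_info[f"arg{idx}"] = arg.data_ptr()
        else:
            # A scalar is a runtime extent (e.g. the value bound to s52).
            shape_args.append(int(arg))
    content = {"address_info": address_info}
    if shape_args:
        # Static kernels carry no shape_args key at all.
        content["shape_args"] = shape_args
    return content


dynamic = split_kernel_inputs([FakeTensor(0x1000), FakeTensor(0x2000), 1024])
static = split_kernel_inputs([FakeTensor(0x1000)])
```

Because the sequence is built in arg order, entry k is the value the producer
reads as shape_args[k]; a static kernel's YAML is unchanged.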

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/extension_codecache.py |  7 +++++-
 Simulator/simulator.py                    | 16 +++++++++++-
 TOGSim/src/main.cc                        | 30 ++++++++++++++++++++---
 3 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index de585a6c..1c1706ce 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -398,7 +398,12 @@ def run_kernel_simulation(*args, autotune_subprocess_timeout_sec=None, **kwargs)
                 # Dump arguments and meta data
                 dump_metadata(args, arg_attributes, result_path)
                 runtime_path = FunctionalSimulator.get_runtime_dump_path(result_path)
-                if extension_config.pytorchsim_functional_mode and not autotune:
+                # Dynamic shape: the kernel is called with scalar size argument(s)
+                # (e.g. s52) after its tensors -- the runtime extents, in size-arg
+                # order (== the producer's shape_args[k]). They also mark the run as
+                # dynamic, where the fixed-shape Spike validation binary was not built.
+                shape_args = [int(a) for a in args if not isinstance(a, torch.Tensor)]
+                if extension_config.pytorchsim_functional_mode and not autotune and not shape_args:
                     funcsim = FunctionalSimulator(result_path, key)
                     funcsim.run_spike(args, arg_attributes,
                                     runtime_path, self.validation_binary_name,
diff --git a/Simulator/simulator.py b/Simulator/simulator.py
index a4517285..c3239905 100644
--- a/Simulator/simulator.py
+++ b/Simulator/simulator.py
@@ -467,9 +467,18 @@ def write_kernel_attribute_file(attribute_dir, inputs, alloc_pool=None):
         index = str(len(os.listdir(attribute_dir)))
         attribute_file = os.path.join(attribute_dir, index)
 
+        # Tensors carry an address; a scalar (e.g. a dynamic-shape size arg s52)
+        # carries a runtime extent -- collect those into shape_args, in arg order,
+        # which is the order the trace producer reads shape_args[k].
+        shape_args = []
         for idx, tensor in enumerate(inputs):
-            address_info[f"arg{idx}"] = tensor.data_ptr()
+            if isinstance(tensor, torch.Tensor):
+                address_info[f"arg{idx}"] = tensor.data_ptr()
+            else:
+                shape_args.append(int(tensor))
         yaml_content["address_info"] = address_info
+        if shape_args:
+            yaml_content["shape_args"] = shape_args
 
         for buf_name, range in alloc_pool.items():
             sram_buffer[buf_name] = range
@@ -575,6 +584,11 @@ def run_standalone(
                 logger.warning("TORCHSIM_LEGACY_TOG=1 selects the DEPRECATED legacy ONNX TOG path")
             if use_trace:
                 cmd = f"{base_cmd} --trace_so {trace_so} --cycle_table {cycle_tsv}"
+                # Carry the per-kernel attribute YAML (address_info + a dynamic
+                # kernel's shape_args) into the trace path, the same file the legacy
+                # path passes via the models_list command.
+                if attribute_path:
+                    cmd += f" --attribute {attribute_path}"
             else:  # DEPRECATED: legacy ONNX TOG path
                 cmd = f"{base_cmd} --models_list {trace_file_path}"
             if extension_config.CONFIG_TOGSIM_DEBUG_LEVEL:
diff --git a/TOGSim/src/main.cc b/TOGSim/src/main.cc
index 274d63da..d0bf9a9f 100644
--- a/TOGSim/src/main.cc
+++ b/TOGSim/src/main.cc
@@ -25,6 +25,7 @@ namespace po = boost::program_options;
 std::unique_ptr<TileGraph> build_trace_tilegraph(Simulator* simulator,
                                                  const std::string& trace_so_path,
                                                  const std::string& cycle_table_path,
+                                                 const std::string& attribute_path,
                                                  int partition_id) {
   const auto& cfg = simulator->get_hardware_config_yaml();
   int num_cores = cfg["num_cores"] ? cfg["num_cores"].as<int>() : 1;
@@ -43,7 +44,21 @@ std::unique_ptr<TileGraph> build_trace_tilegraph(Simulator* simulator,
     while (ct >> c >> o) { cyc.push_back(c); ovl.push_back(o); }
   }
   if (cyc.empty()) { cyc.assign(256, 128); ovl.assign(256, 0); }
-  auto run = togsim::run_producer(trace_so_path.c_str(), nullptr, 0,
+  // Dynamic shape: the producer reads its loop bounds from shape_args[k]. Read
+  // them from the per-kernel attribute YAML (the same file that carries
+  // address_info for the legacy path), under the `shape_args` sequence.
+  std::vector<int64_t> shape_args;
+  if (!attribute_path.empty()) {
+    YAML::Node attr = YAML::LoadFile(attribute_path);
+    if (attr["shape_args"]) {
+      for (const auto& v : attr["shape_args"]) shape_args.push_back(v.as<int64_t>());
+      spdlog::info("[TOGSim-trace] shape_args: {} values from {}",
+                   shape_args.size(), attribute_path);
+    }
+  }
+  auto run = togsim::run_producer(trace_so_path.c_str(),
+                                  shape_args.empty() ? nullptr : shape_args.data(),
+                                  (int)shape_args.size(),
                                   bases.data(), (int)bases.size(),
                                   cyc.data(), ovl.data(), (int)cyc.size(),
                                   partition_cores.data(), (int32_t)partition_cores.size());
@@ -62,7 +77,7 @@ void launchKernel(Simulator* simulator, unsigned int kernel_id, std::string onnx
   std::string trace_so = dir + "/trace.so";
   std::string cycle_tsv = dir + "/trace_cycles.tsv";
   if ((!legacy || std::string(legacy) != "1") && fs::exists(trace_so)) {
-    tile_graph = build_trace_tilegraph(simulator, trace_so, cycle_tsv, partition_id);
+    tile_graph = build_trace_tilegraph(simulator, trace_so, cycle_tsv, attribute_path, partition_id);
     if (tile_graph) tog_path = trace_so;
     else spdlog::warn("[TOGSim] trace.so run failed for {}; falling back to ONNX", trace_so);
   }
@@ -164,6 +179,10 @@ int main(int argc, char** argv) {
   cmd_parser.add_command_line_option<std::string>(
       "cycle_table", "Path to a 'cycle<TAB>overlapping' per-tile_id sidecar (TSV) "
                      "for --trace_so; falls back to a flat stub if omitted");
+  cmd_parser.add_command_line_option<std::string>(
+      "attribute", "Path to the per-kernel attribute YAML (address_info, "
+                   "shape_args) for --trace_so; carries a dynamic kernel's runtime "
+                   "shape the same way the legacy path carries address_info");
   try {
     cmd_parser.parse(argc, argv);
   } catch (const CommandLineParser::ParsingError& e) {
@@ -216,7 +235,12 @@ int main(int argc, char** argv) {
     // round-robin over partition 0's cores only; see build_trace_tilegraph).
     std::string cycle_table_path;
     cmd_parser.set_if_defined("cycle_table", &cycle_table_path);
-    auto tg = build_trace_tilegraph(simulator, trace_so_path, cycle_table_path, 0);
+    // Dynamic shape: the producer reads its loop bounds from shape_args[k], which
+    // build_trace_tilegraph loads from the per-kernel attribute YAML (the same
+    // file that carries address_info for the legacy path).
+    std::string attribute_path;
+    cmd_parser.set_if_defined("attribute", &attribute_path);
+    auto tg = build_trace_tilegraph(simulator, trace_so_path, cycle_table_path, attribute_path, 0);
     if (!tg) { spdlog::error("[TOGSim] trace producer run failed"); exit(1); }
     tg->set_arrival_time(simulator->get_core_cycle());
     tg->set_kernel_id(0);

From 479d40708de39445d742ba0690db44172d99b143 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Tue, 23 Jun 2026 15:00:39 +0900
Subject: [PATCH 18/24] [TOGSim] Functional output for dynamic shape
 (shape-agnostic Spike binary)

Produce correct output VALUES for a dynamic kernel: the Spike validation
binary is now shape-agnostic and reads the runtime extent from the
size-arg buffer, the same way the trace producer reads shape_args.

- Simulator.dump_args/write_arg: a size symbol arg (MLIR_ARGS_VAR) is a
  kernel input -- write its runtime value (int64) to a .raw so the kernel
  can load its loop bound. This is Spike's existing per-arg .raw channel
  (used for tensors); the size arg was just being skipped.
- mlir_caller_codegen: the validation binary loads each size arg first
  into N_<sym>, then mallocs the tensor buffers and builds the memref
  descriptors from N at runtime (not the compile-time extent). argv slots
  are assigned in arg order (matching dump_args). A numel that is a size
  SYMBOL becomes N_<sym>; a concrete numel (including a stringified
  sympy.Integer like '128') stays a literal.
- extension_codecache: build + run the validation binary for dynamic too.

Verified: one compiled add returns correct values at 1024 / 2048 / 1536
and a 1D tail size 1000 from the same binary. Tail/lane padding for >1D
shapes is a separate follow-up.
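
The symbol-versus-literal rule above can be sketched in isolation. This is a
simplified sketch of the mlir_caller_codegen logic with hypothetical free
functions (is_symbol, numel_c_expr, buffer_size_c_expr) in place of the class
methods; the emitted C strings follow the expressions in this patch.

```python
import math


def is_symbol(numel):
    """True for a size symbol such as 's52'; numeric strings like '128' are
    concrete sizes that the metadata happened to stringify."""
    return isinstance(numel, str) and not numel.isdigit()


def numel_c_expr(numel):
    """C expression for an element count: N_<sym> at runtime, else a literal."""
    return f"N_{numel}" if is_symbol(numel) else str(numel)


def buffer_size_c_expr(numel, bits):
    """C expression for the host buffer size: bytes rounded up to 64, doubled."""
    if is_symbol(numel):
        # Runtime extent: the rounding is emitted as C arithmetic.
        nbytes = f"(N_{numel} * {bits} / 8)"
        return f"((({nbytes} + 63) / 64) * 64) * 2"
    # Concrete extent: fold the rounding at codegen time.
    return f"{int(math.ceil(int(numel) * bits // 8 / 64) * 64) * 2}ULL"


symbolic = buffer_size_c_expr("s52", 32)
concrete = buffer_size_c_expr("128", 32)
```

The point of the isdigit() check is that a stringified sympy.Integer must not
be mistaken for a symbol, or the generated C would reference an undeclared
N_128.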

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/extension_codecache.py     | 16 ++--
 .../mlir/mlir_caller_codegen.py               | 88 ++++++++++++++-----
 Simulator/simulator.py                        | 11 ++-
 3 files changed, 82 insertions(+), 33 deletions(-)

diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index 1c1706ce..a59e9b2e 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -189,16 +189,12 @@ def load(cls, source_code,
                 link_option = f"-Wl,--section-start=.spad=0x{spad_info['spad_vaddr']:x}"
             else:
                 link_option = ""
-            # Generate LLVM kernel calller and binary for validation.
-            # Dynamic shape: the functional (Spike) validation binary is built here at
-            # compile time with the tensor extent baked into the host buffer sizes
-            # (mlir_caller_codegen allocates from arg_size). A runtime-determined extent
-            # (memref<?>) has no concrete size at compile time, so the fixed-shape
-            # validation cannot be instantiated -- skip it (same effect as
-            # pytorchsim_functional_mode=off). The kernel is still compiled
-            # shape-agnostically and timed via the gem5/TOG + trace path below.
+            # Generate LLVM kernel caller and binary for validation. The validation
+            # binary is shape-agnostic: under dynamic shape it reads the runtime extent
+            # from the size-arg buffer and sizes its host buffers from it
+            # (mlir_caller_codegen), so one binary serves any size -- like the producer.
             is_dynamic_shape = "memref<?" in source_code
-            if extension_config.pytorchsim_functional_mode and not is_dynamic_shape:
+            if extension_config.pytorchsim_functional_mode:
                 # Use custom malloc to avoid size error
                 new_link_option = link_option + " -Wl,--wrap=malloc -Wl,--wrap=free"
                 cmds = mlir_compile_command(new_input_path, vectorlane_size, vlen=vlen)
@@ -403,7 +399,7 @@ def run_kernel_simulation(*args, autotune_subprocess_timeout_sec=None, **kwargs)
                 # order (== the producer's shape_args[k]). They also mark the run as
                 # dynamic, where the fixed-shape Spike validation binary was not built.
                 shape_args = [int(a) for a in args if not isinstance(a, torch.Tensor)]
-                if extension_config.pytorchsim_functional_mode and not autotune and not shape_args:
+                if extension_config.pytorchsim_functional_mode and not autotune:
                     funcsim = FunctionalSimulator(result_path, key)
                     funcsim.run_spike(args, arg_attributes,
                                     runtime_path, self.validation_binary_name,
diff --git a/PyTorchSimFrontend/mlir/mlir_caller_codegen.py b/PyTorchSimFrontend/mlir/mlir_caller_codegen.py
index 7c842272..fea1c945 100644
--- a/PyTorchSimFrontend/mlir/mlir_caller_codegen.py
+++ b/PyTorchSimFrontend/mlir/mlir_caller_codegen.py
@@ -34,6 +34,32 @@ def get_argv_idx(self):
         self.arg_use_count += 1
         return self.arg_use_count-1
 
+    def _is_var(self, flag):
+        return bool(flag & MLIRKernelArgs.MLIR_ARGS_VAR)
+
+    @staticmethod
+    def _is_symbol(numel):
+        """A numel that is a size SYMBOL (e.g. 's52'), not a concrete value. Concrete
+        sizes may also be strings here (the meta stringifies sympy.Integer, e.g.
+        '128'); those are numeric, a symbol is not."""
+        return isinstance(numel, str) and not numel.isdigit()
+
+    def _numel_c_expr(self, numel):
+        """C expression for an arg's element count. Dynamic shape: a size SYMBOL is
+        the runtime extent, read into `N_<symbol>` from its size buffer (see
+        generate_args_define); a concrete numel (int or numeric string) is a literal."""
+        return f"N_{numel}" if self._is_symbol(numel) else str(numel)
+
+    def _assign_argv_indices(self):
+        """Assign each loaded/dumped arg an argv slot in arg_attributes order, the
+        same order Simulator.dump_args writes the .raw paths. Size (VAR) args get a
+        slot too (they are kernel inputs)."""
+        for arg_name, arg_attribute in self.arg_attributes:
+            flag = arg_attribute[0]
+            if (self.is_in_arg(flag) or self.is_out_arg(flag) or self._is_var(flag)) \
+                    and arg_name not in self.load_args:
+                self.load_args[arg_name] = self.get_argv_idx()
+
     def write_header(self):
         self.writeline('#include <stdio.h>')
         self.writeline('#include <stdlib.h>')
@@ -56,12 +82,12 @@ def is_inout_arg(self, value):
 
     def load_arg(self):
         for arg_name, arg_attribute in self.arg_attributes:
-            if self.is_in_arg(arg_attribute[0]):
-                argv_idx = self.get_argv_idx() if arg_name not in self.load_args else self.load_args[arg_name]
-                self.load_args[arg_name] = argv_idx
+            # VAR (size) args are loaded in generate_args_define (before the tensor
+            # buffers they size); skip them here.
+            if self.is_in_arg(arg_attribute[0]) and not self._is_var(arg_attribute[0]):
+                argv_idx = self.load_args[arg_name]
                 ctype = DTYPE_TO_C[arg_attribute[1]]
-                elem_count = arg_attribute[2]
-                size_expr = f'({elem_count}ULL * sizeof({ctype}))'
+                size_expr = f'((uint64_t)({self._numel_c_expr(arg_attribute[2])}) * sizeof({ctype}))'
 
                 self.writeline(f'if(load_arg(c_{arg_name}, {size_expr}, argv[{argv_idx}]) == -1){self.open_bracket}')
                 with self.code.indent():
@@ -71,10 +97,9 @@ def load_arg(self):
     def dump_arg(self):
         for arg_name, arg_attribute in self.arg_attributes:
             if self.is_out_arg(arg_attribute[0]):
-                argv_idx = self.get_argv_idx() if not self.is_inout_arg(arg_attribute[0]) else self.load_args[arg_name]
+                argv_idx = self.load_args[arg_name]
                 ctype = DTYPE_TO_C[arg_attribute[1]]
-                elem_count = arg_attribute[2]
-                size_expr = f'({elem_count}ULL * sizeof({ctype}))'
+                size_expr = f'((uint64_t)({self._numel_c_expr(arg_attribute[2])}) * sizeof({ctype}))'
                 self.writeline(f'if(dump_arg(c_{arg_name}, {size_expr}, argv[{argv_idx}]) == -1){self.open_bracket}')
                 with self.code.indent():
                     self.writeline(f'return -1{self.ending}')
@@ -93,30 +118,53 @@ def generate_args_define(self):
         name_set = set()
         if self.validation:
             self.writeline(f"int* padding = malloc(0x100000ULL * sizeof(int)){self.ending}")
-        for arg_name, (_, arg_type, arg_size, arg_sizes, arg_stride) in self.arg_attributes:
-            if not arg_name in name_set:
-                if torch.is_floating_point(torch.tensor([], dtype=arg_type)):
-                    bits = torch.finfo(arg_type).bits
-                elif arg_type == torch.bool:
-                    bits = 8
-                else:
-                    bits = torch.iinfo(arg_type).bits
-                buffer_size = int(math.ceil(arg_size * bits // 8 / 64) * 64) * 2 # Round up to 64 bytes + Add some padding for safety
-                self.writeline(f'{DTYPE_TO_C[arg_type]}* c_{arg_name} = malloc({buffer_size}ULL){self.ending}')
-                name_set.add(arg_name)
+        # Dynamic shape: handle size (VAR) args first -- malloc, load from argv, and
+        # read the runtime extent into N_<name>, BEFORE the tensor buffers, which are
+        # sized from it.
+        for arg_name, (flag, arg_type, arg_size, _, _) in self.arg_attributes:
+            if not self._is_var(flag) or arg_name in name_set:
+                continue
+            ctype = DTYPE_TO_C[arg_type]
+            self.writeline(f'{ctype}* c_{arg_name} = malloc(64ULL){self.ending}')
+            if self.validation:
+                self.writeline(f'if(load_arg(c_{arg_name}, sizeof(int64_t), argv[{self.load_args[arg_name]}]) == -1){self.open_bracket}')
+                with self.code.indent():
+                    self.writeline(f'return -1{self.ending}')
+                self.writeline(self.closed_bracket)
+            self.writeline(f'int64_t N_{arg_name} = ((int64_t*)c_{arg_name})[0]{self.ending}')
+            name_set.add(arg_name)
+        for arg_name, (flag, arg_type, arg_size, arg_sizes, arg_stride) in self.arg_attributes:
+            if self._is_var(flag) or arg_name in name_set:
+                continue
+            if torch.is_floating_point(torch.tensor([], dtype=arg_type)):
+                bits = torch.finfo(arg_type).bits
+            elif arg_type == torch.bool:
+                bits = 8
+            else:
+                bits = torch.iinfo(arg_type).bits
+            ctype = DTYPE_TO_C[arg_type]
+            if self._is_symbol(arg_size):
+                # runtime extent: round bytes up to 64 and double, computed in C.
+                nbytes = f"(N_{arg_size} * {bits} / 8)"
+                buffer_size = f"((({nbytes} + 63) / 64) * 64) * 2"
+            else:
+                buffer_size = f"{int(math.ceil(int(arg_size) * bits // 8 / 64) * 64) * 2}ULL"  # round up to 64 bytes + safety pad
+            self.writeline(f'{ctype}* c_{arg_name} = malloc({buffer_size}){self.ending}')
+            name_set.add(arg_name)
         self.writeline(self.newline)
 
     def generate_main(self):
         self.writeline(f'{self.newline}int main(int argc, char *argv[]) {self.open_bracket}{self.newline}')
         with self.code.indent():
             if self.validation:
+                self._assign_argv_indices()   # argv slots in arg order (incl. size args)
                 self.generate_args_define()
                 self.load_arg()
                 self.writeline(self.newline)
             else:
                 self.generate_args_define()
 
-            func_arguments = [f"c_{arg_name}, c_{arg_name}, 0, {arg_shape}, 1" for arg_name, (_, arg_type, arg_shape, _, _) in self.arg_attributes]
+            func_arguments = [f"c_{arg_name}, c_{arg_name}, 0, {self._numel_c_expr(arg_shape)}, 1" for arg_name, (_, arg_type, arg_shape, _, _) in self.arg_attributes]
             self.writeline(f"wrapper_{self.kernel_name}({', '.join(func_arguments)}){self.ending}{self.newline}")
 
             if self.validation:
diff --git a/Simulator/simulator.py b/Simulator/simulator.py
index c3239905..06fa694b 100644
--- a/Simulator/simulator.py
+++ b/Simulator/simulator.py
@@ -91,15 +91,17 @@ def write_arg(self, arg, path, name):
         os.makedirs(dump_path, exist_ok=True)
         index = self.get_biggest_filename(dump_path)
 
+        data_path = os.path.join(dump_path, f'{index}.raw')
         if (isinstance(arg, torch.Tensor)):
-            data_path = os.path.join(dump_path, f'{index}.raw')
             tensor = arg.cpu().detach()
             buffer_size = tensor.untyped_storage().size()
             buffer = (ctypes.c_char * buffer_size).from_address(tensor.data_ptr())
             t_arr = np.frombuffer(buffer, dtype=TORCH_TO_NUMPY[tensor.dtype], count=buffer_size // tensor.element_size())
             t_arr.tofile(data_path)
         else:
-            assert(0)
+            # Dynamic shape: a scalar size argument (a runtime extent, e.g. s52).
+            # The kernel reads it from a memref<1xi64> buffer, so write one int64.
+            np.array([int(arg)], dtype=np.int64).tofile(data_path)
         return index
 
     def dump_args(self, args, arg_attributes, load_path, dump_path):
@@ -108,7 +110,10 @@ def dump_args(self, args, arg_attributes, load_path, dump_path):
         for (arg_name, arg_attribute), arg in zip(arg_attributes, args):
             size = arg_attribute[2] if arg_attribute[1] != torch.bool else (arg_attribute[2] + 7) // 8
             array_size.append(size)
-            if MLIRKernelArgs.is_mlir_arg_in(arg_attribute[0]):
+            # A size symbol arg (MLIR_ARGS_VAR, e.g. a dynamic extent s52) is a kernel
+            # INPUT: the kernel loads it for its loop bound, so dump it like an input.
+            is_var = bool(arg_attribute[0] & MLIRKernelArgs.MLIR_ARGS_VAR)
+            if MLIRKernelArgs.is_mlir_arg_in(arg_attribute[0]) or is_var:
                 index = self.write_arg(arg, load_path, arg_name)
                 file_path.append(os.path.join(load_path, arg_name, f'{index}.raw'))
             elif MLIRKernelArgs.is_mlir_arg_out(arg_attribute[0]):

From e2841c27a3d6ba874bd583a16fce5e55e875bb06 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Tue, 23 Jun 2026 15:00:39 +0900
Subject: [PATCH 19/24] [Test] Dynamic-shape elementwise add on the trace path

One torch.compile(dynamic=True) add, run at 1024 and 2048 from a single
compiled trace producer .so, checking the output values (allclose) at
each size. Sizes are tile multiples so no tail padding is needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 tests/ops/elementwise/test_dynamic_add.py | 41 +++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 tests/ops/elementwise/test_dynamic_add.py

diff --git a/tests/ops/elementwise/test_dynamic_add.py b/tests/ops/elementwise/test_dynamic_add.py
new file mode 100644
index 00000000..6e6783c3
--- /dev/null
+++ b/tests/ops/elementwise/test_dynamic_add.py
@@ -0,0 +1,41 @@
+"""Dynamic-shape elementwise add on the C++ trace path.
+
+A single torch.compile(dynamic=True) kernel compiles to one trace producer .so
+and is simulated at several input sizes -- the producer reads its loop bound from
+shape_args at runtime, so the same .so serves any size. This exercises the
+dynamic-shape pipeline end to end (symbolic tiling -> symbolic MLIR loop bound ->
+shape_args producer -> per-tile cost table -> runtime shape via the attribute
+file, plus a shape-agnostic Spike validation binary for the output values).
+
+Sizes are multiples of the tile so no tail padding is needed (padding-shape
+correctness is a separate follow-up).
+"""
+import os
+import sys
+
+import torch
+import torch._dynamo
+
+sys.path.insert(0, os.path.join(os.environ.get("TORCHSIM_DIR", default="/workspace/PyTorchSim"), "tests"))
+from _pytorchsim_utils import test_result
+
+
+def test_dynamic_add(device, sizes=(1024, 2048)):
+    def add(a, b):
+        return a + b
+
+    # Compile once with a symbolic shape; run at every size from the same .so.
+    opt_fn = torch.compile(dynamic=True)(add)
+    for n in sizes:
+        x = torch.randn(n).to(device=device)
+        y = torch.randn(n).to(device=device)
+        torch._dynamo.mark_dynamic(x, 0)
+        torch._dynamo.mark_dynamic(y, 0)
+        res = opt_fn(x, y)
+        out = add(x.cpu(), y.cpu())
+        test_result(f"DynamicAdd(N={n})", res, out)
+
+
+if __name__ == "__main__":
+    device = torch.device("npu:0")
+    test_dynamic_add(device, (1024, 2048))

From c2c0db4dc6ee6379dc2e30d7398f6c1d8e6edb41 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Tue, 23 Jun 2026 20:41:16 +0900
Subject: [PATCH 20/24] [Frontend] Tidy dynamic-shape detection and drop dead
 code

Small robustness cleanups from the PR review (no behavior change):

- Add MLIRKernelArgs.is_mlir_arg_var and use it where the MLIR_ARGS_VAR
  mask was open-coded (mlir_caller_codegen._is_var, Simulator.dump_args).
- Detect a dynamic kernel in MLIRCodeCache.load via that flag
  (any size-symbol arg) instead of sniffing "memref<?" in the IR text.
- Drop a dead shape_args local in run_kernel_simulation: it was left over
  from an earlier run_spike gate; the runtime extents reach the simulator
  via the attribute YAML (write_kernel_attribute_file), not from there.
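
A minimal sketch of the flag-based detection, for illustration only. The flag
bit values below are hypothetical (the real ones live in MLIRKernelArgs), and
the assumption that a size arg carries only the VAR bit is made for the example.

```python
# Hypothetical flag bits; the real values are defined in MLIRKernelArgs.
MLIR_ARGS_IN = 0x01
MLIR_ARGS_OUT = 0x02
MLIR_ARGS_VAR = 0x04


def is_mlir_arg_var(flag):
    """A size-symbol arg (a dynamic extent passed as a scalar), not a tensor."""
    return bool(MLIR_ARGS_VAR & flag)


def is_dynamic_kernel(arg_attributes):
    """A kernel is dynamic iff any of its args is a size-symbol arg."""
    return any(is_mlir_arg_var(attr[0]) for _, attr in arg_attributes)


# arg_attributes entries are (name, (flag, dtype, numel, ...)); only the flag
# is inspected here, so the remaining fields are placeholders.
static_args = [("arg0", (MLIR_ARGS_IN, "f32", 128)),
               ("arg1", (MLIR_ARGS_OUT, "f32", 128))]
dynamic_args = [("arg0", (MLIR_ARGS_IN, "f32", "s52")),
                ("s52", (MLIR_ARGS_VAR, "i64", 1))]
```

Compared with searching the IR text for "memref<?", the flag does not depend on
how the MLIR happens to be printed.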

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/extension_codecache.py      | 13 +++++++------
 PyTorchSimFrontend/mlir/mlir_caller_codegen.py |  2 +-
 PyTorchSimFrontend/mlir/mlir_common.py         |  5 +++++
 Simulator/simulator.py                         |  3 +--
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/PyTorchSimFrontend/extension_codecache.py b/PyTorchSimFrontend/extension_codecache.py
index a59e9b2e..c3395ec2 100644
--- a/PyTorchSimFrontend/extension_codecache.py
+++ b/PyTorchSimFrontend/extension_codecache.py
@@ -193,7 +193,11 @@ def load(cls, source_code,
             # binary is shape-agnostic: under dynamic shape it reads the runtime extent
             # from the size-arg buffer and sizes its host buffers from it
             # (mlir_caller_codegen), so one binary serves any size -- like the producer.
-            is_dynamic_shape = "memref<?" in source_code
+            # Dynamic shape: a kernel has a size-symbol arg (MLIR_ARGS_VAR) iff some dim
+            # is a runtime extent. Use that flag (authoritative) rather than sniffing the
+            # IR text.
+            from PyTorchSimFrontend.mlir.mlir_common import MLIRKernelArgs
+            is_dynamic_shape = any(MLIRKernelArgs.is_mlir_arg_var(attr[0]) for _, attr in arg_attributes)
             if extension_config.pytorchsim_functional_mode:
                 # Use custom malloc to avoid size error
                 new_link_option = link_option + " -Wl,--wrap=malloc -Wl,--wrap=free"
@@ -394,11 +398,8 @@ def run_kernel_simulation(*args, autotune_subprocess_timeout_sec=None, **kwargs)
                 # Dump arguments and meta data
                 dump_metadata(args, arg_attributes, result_path)
                 runtime_path = FunctionalSimulator.get_runtime_dump_path(result_path)
-                # Dynamic shape: the kernel is called with scalar size argument(s)
-                # (e.g. s52) after its tensors -- the runtime extents, in size-arg
-                # order (== the producer's shape_args[k]). They also mark the run as
-                # dynamic, where the fixed-shape Spike validation binary was not built.
-                shape_args = [int(a) for a in args if not isinstance(a, torch.Tensor)]
+                # The runtime extents reach the simulator via the attribute YAML
+                # (write_kernel_attribute_file -> shape_args), not from here.
                 if extension_config.pytorchsim_functional_mode and not autotune:
                     funcsim = FunctionalSimulator(result_path, key)
                     funcsim.run_spike(args, arg_attributes,
diff --git a/PyTorchSimFrontend/mlir/mlir_caller_codegen.py b/PyTorchSimFrontend/mlir/mlir_caller_codegen.py
index fea1c945..bdb71be5 100644
--- a/PyTorchSimFrontend/mlir/mlir_caller_codegen.py
+++ b/PyTorchSimFrontend/mlir/mlir_caller_codegen.py
@@ -35,7 +35,7 @@ def get_argv_idx(self):
         return self.arg_use_count-1
 
     def _is_var(self, flag):
-        return bool(flag & MLIRKernelArgs.MLIR_ARGS_VAR)
+        return MLIRKernelArgs.is_mlir_arg_var(flag)
 
     @staticmethod
     def _is_symbol(numel):
diff --git a/PyTorchSimFrontend/mlir/mlir_common.py b/PyTorchSimFrontend/mlir/mlir_common.py
index 6b8d905f..062b35e1 100644
--- a/PyTorchSimFrontend/mlir/mlir_common.py
+++ b/PyTorchSimFrontend/mlir/mlir_common.py
@@ -174,6 +174,11 @@ def is_mlir_arg_out(value):
     def is_mlir_arg_inout(value):
         return MLIRKernelArgs.MLIR_ARGS_INOUT & value
 
+    @staticmethod
+    def is_mlir_arg_var(value):
+        # A size-symbol arg (a dynamic extent passed as a scalar), not a tensor.
+        return bool(MLIRKernelArgs.MLIR_ARGS_VAR & value)
+
     @staticmethod
     def get_mlir_shape(info):
         tensor_type = DTYPE_TO_MLIR[info[0]]
diff --git a/Simulator/simulator.py b/Simulator/simulator.py
index 06fa694b..75bc0205 100644
--- a/Simulator/simulator.py
+++ b/Simulator/simulator.py
@@ -112,8 +112,7 @@ def dump_args(self, args, arg_attributes, load_path, dump_path):
             array_size.append(size)
             # A size symbol arg (MLIR_ARGS_VAR, e.g. a dynamic extent s52) is a kernel
             # INPUT: the kernel loads it for its loop bound, so dump it like an input.
-            is_var = bool(arg_attribute[0] & MLIRKernelArgs.MLIR_ARGS_VAR)
-            if MLIRKernelArgs.is_mlir_arg_in(arg_attribute[0]) or is_var:
+            if MLIRKernelArgs.is_mlir_arg_in(arg_attribute[0]) or MLIRKernelArgs.is_mlir_arg_var(arg_attribute[0]):
                 index = self.write_arg(arg, load_path, arg_name)
                 file_path.append(os.path.join(load_path, arg_name, f'{index}.raw'))
             elif MLIRKernelArgs.is_mlir_arg_out(arg_attribute[0]):

From d86e9cd0603b987565d87c2d0e6728a889842a31 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Tue, 23 Jun 2026 20:43:45 +0900
Subject: [PATCH 21/24] [Frontend] Consolidate symbolic-dim guards into
 is_symbolic_dim

The dynamic-shape tile/bound paths each had their own ad hoc guard for a
symbolic dimension (isinstance sympy.Expr / and-not-is_number variants).
Add one predicate, mlir_common.is_symbolic_dim(x) = a sympy.Expr that is
not a compile-time constant, and use it at every site: is_dim_dividable,
trim_large_tail, get_padding_ratio, LoopLevel._bound_str, and make_choices.
No behavior change (verified static 128/512 + dynamic add still pass); it
just gives one place to get the rule right when adding new dim arithmetic.
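
As a minimal sketch of the rule (the predicate body mirrors the one added to
mlir_common.py in this patch; the sample symbol `ks0` and the surrounding
checks are illustrative assumptions, not part of the test suite), the
difference from a bare isinstance(x, sympy.Expr) guard shows up on a concrete
sympy.Integer:

```python
import sympy


def is_symbolic_dim(x):
    # Same rule as the predicate in this patch: a sympy expression that is
    # not a compile-time constant.
    return isinstance(x, sympy.Expr) and not x.is_number


# Illustrative inputs (assumed for this sketch).
ks0 = sympy.Symbol("ks0", integer=True, positive=True)
static_dim = sympy.Integer(128)

assert is_symbolic_dim(ks0)                 # runtime size symbol
assert is_symbolic_dim(2 * ks0 + 1)         # expression over a size symbol
assert not is_symbolic_dim(static_dim)      # concrete sympy.Integer
assert not is_symbolic_dim(128)             # plain Python int
assert isinstance(static_dim, sympy.Expr)   # a bare isinstance guard matches this
```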

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/mlir/mlir_codegen_backend.py |  2 +-
 PyTorchSimFrontend/mlir/mlir_common.py          | 16 ++++++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
index 9aa64caa..48513249 100644
--- a/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
+++ b/PyTorchSimFrontend/mlir/mlir_codegen_backend.py
@@ -997,7 +997,7 @@ def make_choices(self, nodes, kernel_name):
                     # Dynamic shape: a symbolic dim has no compile-time bound to grow the
                     # tile toward, so drop the axis (keep the fixed tile) rather than
                     # comparing tile >= sympy*2 (cannot determine truth value).
-                    if isinstance(prev_ranges[axis], sympy.Expr) or \
+                    if mlir_common.is_symbolic_dim(prev_ranges[axis]) or \
                             prev_tile_sz[axis] >= prev_ranges[axis] * 2 or prev_tile_sz[axis] >= 2 ** 13:
                         candidate_axes.remove(axis)
                         self.reset(None)
diff --git a/PyTorchSimFrontend/mlir/mlir_common.py b/PyTorchSimFrontend/mlir/mlir_common.py
index 062b35e1..61f0058e 100644
--- a/PyTorchSimFrontend/mlir/mlir_common.py
+++ b/PyTorchSimFrontend/mlir/mlir_common.py
@@ -96,6 +96,14 @@ def get_dtype_nbytes(dtype):
         raise NotImplementedError(f"Unsupported dtype for precision calculation: {dtype}")
     return MLIR_TO_BIT[mlir_dtype] // 8
 
+def is_symbolic_dim(x):
+    """True if `x` is a runtime (symbolic) dimension -- a sympy expression that is
+    not a compile-time constant. Dynamic shape (torch.compile(dynamic=True)) makes a
+    loop range / dim such a symbol (e.g. ks0); the tiling and bound-emission paths
+    must skip their concrete-int arithmetic for it. Single predicate for every such
+    guard (a concrete sympy.Integer is NOT symbolic)."""
+    return isinstance(x, sympy.Expr) and not x.is_number
+
 DTYPE_LOWP_FP = [
     torch.bfloat16,
     torch.float16,
@@ -351,7 +359,7 @@ def is_dim_dividable(self, dim_sizes: list[int]) -> bool:
         # has no symbolic equivalent -- it would loop forever shrinking the tile to 1.
         # index_expr / indirect indexing under dynamic shape is Step 2 (B3); fail
         # clearly here instead of a sympy "cannot determine truth value" crash.
-        if any(isinstance(d, sympy.Expr) for d in dim_sizes):
+        if any(is_symbolic_dim(d) for d in dim_sizes):
             raise NotImplementedError(
                 "index_expr/indirect indexing under dynamic shape is not supported "
                 "yet (symbolic dim reached is_dim_dividable)")
@@ -428,7 +436,7 @@ def trim_large_tail(self, ranges: list[int]):
             # extent is unknown, so keep the fixed init tile and let the tail become a
             # runtime remainder tile (masked). Skipping also avoids %/comparison on a
             # sympy symbol (cannot determine truth value).
-            if isinstance(dim_range, sympy.Expr):
+            if is_symbolic_dim(dim_range):
                 continue
             elif constraint.must_divide_dim:
                 BETA = 0
@@ -497,7 +505,7 @@ def init_tile_size(ranges, vlane_stride, vector_lane):
     def get_padding_ratio(tile_range: int, dim_range: int) -> float:
         # Dynamic shape: a symbolic dim has no compile-time tail, so report zero
         # padding waste ("nothing to trim") rather than doing %/<= on a sympy symbol.
-        if isinstance(dim_range, sympy.Expr) or isinstance(tile_range, sympy.Expr):
+        if is_symbolic_dim(dim_range) or is_symbolic_dim(tile_range):
             return 0.0
         if tile_range <= 0 or dim_range <= 0:
             raise ValueError("tile_range and dim_range must be positive integers")
@@ -1062,7 +1070,7 @@ def _bound_str(self):
         # Dynamic shape: a symbolic upper bound is emitted as an index SSA value
         # (%<name>_bound, materialized at the function top level by codegen_loops),
         # which is a valid affine symbol; a concrete bound stays an integer literal.
-        if isinstance(self.size, sympy.Expr) and not self.size.is_number:
+        if is_symbolic_dim(self.size):
             if not isinstance(self.size, sympy.Symbol):
                 raise NotImplementedError(
                     f"dynamic loop bound must be a single size symbol, got {self.size}")

From e8cb0d26675eb5d079db089b407173f1eb587b50 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 20:56:01 +0900
Subject: [PATCH 22/24] [Docs] Dynamic-shape implementation plan (storage; drop
 before merge)

Full roadmap for extending the C++ trace path to general dynamic shape:
the runtime DMA stack already carries runtime dims/strides, so the work is
codegen (general symbolic index lowering + runtime togsim.dma descriptors);
7-phase build order, cross-cutting contracts, test matrix, risks. Notes
that dynamic floor/mod belongs in axis_split (symbolic-aware), not the
legacy convert_index affine path. Planning artifact -- remove before
merging the feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 docs/dynamic-shape-plan.md | 231 +++++++++++++++++++++++++++++++++++++
 1 file changed, 231 insertions(+)
 create mode 100644 docs/dynamic-shape-plan.md

diff --git a/docs/dynamic-shape-plan.md b/docs/dynamic-shape-plan.md
new file mode 100644
index 00000000..7582f6c7
--- /dev/null
+++ b/docs/dynamic-shape-plan.md
@@ -0,0 +1,231 @@
+# Dynamic shape on the C++ trace path: implementation plan (full)
+
+Goal: extend the trace path (C++ TOG) so that `torch.compile(dynamic=True)` works for
+**every op**, not just 1D contiguous elementwise. This is the complete roadmap from
+the current state to general dynamic-shape support.
+
+---
+
+## 0. Current state (done, PR #269)
+
+A 1D single-symbol **elementwise add** works e2e from **one compiled `trace.so`** at any
+size, for both **timing and functional output**:
+
+- tile-sizing guard for symbolic dims (`is_symbolic_dim`)
+- symbolic MLIR loop bound (`affine.for ... to %<sym>_bound`, `memref<?>`)
+- import-safe wrapper meta
+- one-tile cycle sampling (`pin_loops_to_one_tile`)
+- the producer `.so` reads the loop bound from `shape_args[k]` (`emitc.subscript`)
+- the runtime shape is passed via the attribute YAML (`--attribute` → `run_producer`)
+- shape-agnostic Spike validation binary (reads the runtime extent from the size buffer)
+
+Verified: 1024/2048/1536 plus the 1D tail size 1000 give correct values from one binary.
+
+**Why add works and the rest does not:** add is *contiguous*, so the DRAM access stride is the
+constant `[1]` and the tile dims are the constant `[512]`; the symbol appears only in the loop
+**trip count** (`to %s_bound`), never in *address arithmetic* or a *stride*. Every remaining op breaks this assumption.
+
+---
+
+## 1. The one key insight (what makes the rest tractable)
+
+**The runtime is already general. All of the work is in codegen.**
+
+The trace DMA ABI/runtime already carries runtime, multi-dimensional, strided descriptors:
+
+```
+togsim_dma(ctx, dir, arg_id, offset, ndim, dims[], strides[], elem_bits, ...)   // runtime int64*
+   → TraceRec {addr, dims, strides, elem_bits}                                   // recorded at runtime
+   → make_dma → Instruction(dram_addr, tile_size=dims, tile_stride=strides, ...)
+   → DMA / DRAM (Ramulator) model: strided address stream + cost from dims/strides
+   → SRAM throttle footprint = prod(dims) * elem_bytes
+```
+
+So `offset`, `dims`, and `strides` *already* flow through the whole stack as runtime int64 values.
+**No rewrite of TOGSim, the runtime, or the DRAM model is needed.** The only reason add is limited is
+that codegen *freezes* dims/strides into compile-time constant attrs and handles only trivial address forms.
+
+Everything below therefore reduces to two codegen capabilities:
+
+- **(C1) General symbolic index-expression lowering:** lower any affine/sympy index to runtime C.
+  Resolve leaves as `itervar → loop variable`, `size symbol → shape_args[k]`, `integer → literal`,
+  and emit every operator (`+`, `*` (including symbol×symbol), `//`, `%`).
+- **(C2) Runtime `togsim.dma` descriptors:** let `dims`/`strides` be carried as **runtime
+  operands** as well as constant attrs, with `lower_to_emitc` filling the arrays from (C1) values.
+
+A dynamic **offset** (a full index expression), a dynamic **stride** (one coefficient), and a
+**tail-trim dim** (a `min` expression) then all become special cases of (C1)+(C2).
+
+---
+
+## 2. Build order (by phase)
+
+Each phase lists goal / touch points / change / verification / risk. Phases are ordered so each builds on the previous one.
+
+### Phase 1: general symbolic index-expression lowering  [foundation, P0]
+
+- **Goal:** `_index_expr` lowers any index expression made of itervars, size symbols, and integers.
+- **Touch:** `mlir_codegen_backend.py:798-837` (`_index_expr`), `:883` (`index_expr` rename),
+  `mlir_common.py` (leaf classifier + shared `shape_args[k]` index map).
+- **Change:** replace the `const_coeff * itervar` pattern match with a **recursive sympy walk**:
+  - leaf `itervar(indexN)` → loop induction value (existing `dim_list`/`itervar_cses`);
+  - leaf `size symbol (ks/s)` → `emitc.subscript(shape_args, k)` (reuse the `_rewrite_signature`
+    mechanism; the symbol→`shape_args` index map must be shared, §3);
+  - leaf `Integer` → literal;
+  - operators: `Add`/`Mul` (including symbol×symbol)/`FloorDiv`/`Mod` → runtime vector/scalar ops;
+  - remove or generalize the current `int(str(arg)[1:])` (assumes every symbol is `indexN`) and
+    `renamed = {s: "d"+str(s)[5:]}`, which crash on `ks0`.
+- **Verify:** a strided 2D access (transpose, or the matmul-tile address `i*K + j`) lowers to valid
+  MLIR and the producer records the correct addresses; 1D add is unchanged.
+- **Risk:** medium-high. This is a central rewrite with a large blast radius (every load/store
+  index). For static shapes, preserve the constant-coefficient fast path byte for byte.
+
+#### Phase 1 addendum: FloorDiv/Mod is axis-split's job (not convert_index's)
+After tracing the floor/mod path end to end, a correction: **codegen's floor/mod handling
+(`convert_index`) is to be retired, not extended.**
+
+- The design intent (`docs/axis-split-scheduling.md`) is an **"affine-only contract"**: `axis_split.py`
+  removes floor/mod upstream by turning it into multi-dimensional strided accesses, so that codegen
+  only ever receives affine indices with **no** FloorDiv/ModularIndexing. The way `convert_index`
+  (`:342`) / `_convert_sympy_to_mlir_expr` (`:370`) emit `(x floordiv y) mod z` as an affine map
+  is the older **legacy, codegen-internal handling**, and it goes away once the move to axis-split
+  (currently a prototype) is complete. → **Do not generalize the floor/mod branch of convert_index for dynamic shape.**
+- **But axis-split cannot handle the dynamic case yet either.** `collect_boundaries`
+  (`axis_split.py:44-54`) **requires both** the divisor `k` and the extent `E` **to be concrete ints**
+  (`_as_int(div)`, `_as_int(var_ranges[base])`, `E % k == 0`). With a symbolic divisor/extent, `_as_int` → None →
+  no split → the floor/mod survives and ends in a codegen reject (`:1200` "Unlinearized floor/mod") or
+  a convert_index raise (two free symbols / invalid affine `floordiv s0`).
+- **So the dynamic floor/mod work is making `axis_split.py` symbolic-aware**: when the divisor is a
+  *genuine dim* of the original shape, it **divides the symbolic extent by construction** (e.g. flattening
+  `[M,N]` with divisor N → `E = M*N`, `E % N == 0`). Recognizing this "symbolic alignment" and emitting a
+  **symbolic split** feeds the result into Phase 2's dynamic strided access. (Misaligned views are graph-copy
+  territory and out of scope.) **The convert_index/affine-divisor path is left untouched.**
+- Note: affine `floordiv`/`mod` is only valid with a constant divisor anyway (an MLIR rule), so there
+  was never a way to express a dynamic divisor in affine form. The answer is therefore not "make it
+  affine" but "let axis-split eliminate it ahead of time as strided accesses".
+
+### Phase 2: runtime `togsim.dma` dims/strides  [P0, requires Phase 1]
+
+- **Goal:** a DMA whose dim/stride depends on a symbol carries that value at runtime.
+- **Touch:**
+  - `passes/togsim_ops.py`, `togsim.dma` op: allow runtime operands for dims/strides
+    (an attr entry holding a sentinel, e.g. `-1`, means "runtime: see operand m").
+  - `passes/build_skeleton.py:98-99` (`_emit_dma` emits dims/strides as `i64_array`),
+    `:204` (`n_symbols != 0` bail): when a dim/stride is a symbol, emit it as a runtime operand
+    (a (C1) index value) and put the sentinel in the attr; relax the symbol bail.
+  - `passes/lower_to_emitc.py:418-419` (`_arr(ctx, dims/strides)`): runtime-aware array fill, where a
+    sentinel entry → `dims[i] = <operand via emitc.subscript shape_args>;` and anything else is a literal.
+- **Change:** as above. The `offset` operand path is already runtime (add proves it); Phase 1 makes
+  its *coefficients* runtime as well.
+- **Verify:** a matmul tile's row stride, the dynamic K, is recorded as `strides=[K_runtime, 1]`;
+  Ramulator costs the strided access; the SRAM footprint is correct.
+- **Risk:** medium. An op schema change plus two passes; the all-constant path stays identical.
+
+### Phase 3: tail-trim DMA (replaces padding/masking)  [TODO A, P0/P1]
+
+- **Goal:** a boundary (partial) tile transfers only the valid remainder, with no padded/masked full
+  tile. This fixes dynamic >1D/tail correctness and the static odd-size failures.
+- **Touch:** `passes/decompose_transfer.py`, `passes/dma_fine_grained.py`,
+  `togsim.transfer`/dma emission; the loop-padding pass drops out of this path (legacy mechanism,
+  see `docs/loop-padding-elimination.md`).
+- **Change:** along each tiled dim, the last tile emits its DMA `dims` as `min(tile, extent - offset)`,
+  a (C1) runtime expression passed as a runtime dim operand via (C2). The masked-compute tail can
+  stay for COMPUTE; the DMA moves only the valid bytes.
+- **Timing:** correct automatically. The producer records the trimmed `dims`, so the trace cost reflects
+  the trimmed transfer (the legacy "full-tile DMA cost" concern applies to the old model).
+- **Verify:** tail sizes (e.g. 1000, 2D 47×10) give correct values, and the trace shows a last-tile
+  DMA whose size is the remainder.
+- **Risk:** medium. Interacts with the masked-compute path; confirm compute masks and DMA trims.
+
+### Phase 4: dynamic-shape op templates  [TODO B, P1]
+
+- **Goal:** matmul / conv / bmm / sdpa compile with `dynamic=True`.
+- **Touch:** `mlir_gemm_template.py`, `mlir_conv_template.py`, `mlir_bmm_template.py`,
+  `mlir_sdpa_template.py`, `mlir_template.py` (`gemmini_gemm_mapping`,
+  `gemm_combination_mapping`: `math.ceil(M/...)`, `sympy.divisors`, divisor loops).
+- **Change:** emit symbolic-aware MLIR as elementwise does: symbolic loop bounds (`%<sym>_bound`),
+  `memref<?>`, and strided accesses whose stride is a dynamic dim (fed by Phase 1/2). Guard the
+  constant-int tiling math (`math.ceil` etc.) with `is_symbolic_dim`.
+- **Verify:** a dynamic matmul (M dynamic, then K/N dynamic) gives correct values and a trace that
+  scales from one `.so`; a decode-style kernel with a runtime seq_len.
+- **Risk:** high. Per-op work; each template has its own concrete-shape assumptions.
+
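+### Phase 5: multi-symbol correctness + contract  [P0 within Phase 4]
+
+The size-arg ↔ `shape_args[k]` ordering must be a single e2e contract:
+
+- **A-1** `lower_to_emitc._rewrite_signature` assigns `k` by *uses* (skipping unused size symbols),
+  while the runtime fills `shape_args` in *arg order* (every non-tensor). Put both on the **same
+  basis** (arg-attributes order; unused symbols either keep their slot or are dropped on both sides).
+  Single contract: size-arg position == `shape_args[k]`.
+- **A-2** A compound numel `'128*s52'` → `_is_symbol(isdigit)` generates the invalid C `N_128*s52`.
+  With Phase 1, numel is lowered as an expression rather than by a string heuristic. Until then, raise
+  a loud `NotImplementedError` (never silently emit something wrong).
+- **A-3** In `_concretize_attrs_for_sampling`, `cz = isinstance(str)` converts the stringified static
+  `'128'` to `tile`; unify on the same predicate as `_is_symbol`/`is_symbolic_dim` (a numeric string is
+  concrete).
+- **Verify:** a kernel with two distinct dynamic dims (e.g. M, N) gives correct values and the trace sees each extent correctly.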
+
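+### Phase 6: loud-fail guards (interim safety net)  [P1, early]
+
+Until Phases 1-5 land, make every still-unsupported dynamic case **fail loudly** (a clear
+`NotImplementedError`) and never be silently wrong: compound numel (A-2), multi-symbol mismatch
+(A-1), shared-bound pin (A-4), bool dynamic dump (A-6). This guards against the "works only by
+single-symbol coincidence" class of bugs while the general path is being built.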
+
+### Phase 7: infrastructure / verification / cleanup
+
+- **Loop-padding deployment:** the symbolic-skip in `TestLoopPadding.cpp` exists only in the LLVM
+  fork. Decide on deployment (rebuild + roll into the toolchain); this gates the two items below.
+- **CI (C-1):** register `tests/ops/elementwise/test_dynamic_add.py` (plus new dynamic tests) in
+  `pytorchsim_test.yml`, but only **after the loop-padding fix is in the CI toolchain**; otherwise
+  CI fails (loop-padding clobbers the symbolic bound to 2^32).
+- **Result parsing:** `TOGSimulator.get_result_from_file` does not parse trace-path logs
+  ("Unable to parse the output file"); the cycle count in the log is correct, so wire up the parsing.
+- **Static cost sampling:** route the static path through `pin_loops_to_one_tile` too (split out of
+  `run_tog`, since that also builds the legacy full TOG).
+- **A-4** `pin_loops_to_one_tile`: use a per-loop constant and replace only that loop's bound operand
+  (no global `replace_all_uses_with`); handle the case of >1 ub operands.
+- **A-5** `write_kernel_attribute_file`: number `arg{idx}` with a tensor-only counter → preceding scalars
+  leave no holes in the `address_info` indices (matters once `main.cc` reads real bases instead of stubs).
+- **A-6** `dump_args` bool branch: skip the `+7//8` arithmetic when numel is symbolic.
+
+---
+
+## 3. Cross-cutting contracts (keep each one single)
+
+- **Size symbol ↔ `shape_args[k]`:** one ordering, shared by `_rewrite_signature` (producer),
+  `write_kernel_attribute_file` (Spike + attribute YAML), and `main.cc`. `k` = the size arg's
+  position in arg-attributes order; the runtime fills in the same order.
+- **`is_symbolic_dim(x)`** (`mlir_common`): the single predicate for a "runtime dim" (a sympy.Expr that
+  is not `is_number`). Every tiling/bound/dma guard uses it. A numeric string is concrete.
+- **Runtime DMA ABI** (`togsim_runtime.h`): `offset`, `dims[]`, `strides[]` are the contract; codegen
+  fills them and the model consumes them. Do not add a parallel channel.
+
+---
+
+## 4. Test matrix (added per phase; registered in CI in Phase 7)
+
+| Test | What it verifies |
+|---|---|
+| 1D add, multiple sizes (done) | trip count, functional, one .so |
+| 2D add, one dynamic dim | strided access (Phase 1), tail-trim (Phase 3) |
+| 2D add, two dynamic dims | multi-symbol contract (Phase 5) |
+| matmul, M dynamic | templates (Phase 4), dynamic stride (Phase 2) |
+| matmul, M+K+N dynamic | multi-symbol + strided |
+| decode (runtime seq_len) | the real case that motivated this work |
+| tail / non-multiple sizes | remainder correctness (Phase 3) |
+
+---
+
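+## 5. Risks / open questions
+
+- Phase 1 blast radius: every memref index goes through `_index_expr`. The static constant path
+  must stay identical (static matmul/conv regression tests).
+- FloorDiv/Mod by a symbol (view/reshape/broadcast): check whether it actually appears in the index
+  expressions Inductor produces, and whether the runtime `//`/`%` cost in the producer is acceptable.
+- Cost-table validity under dynamic stride: the per-tile COMPUTE cost is shape-invariant (fixed tile
+  size), so the table stays valid; the DMA cost comes from the trace addresses (Phase 2). Confirm
+  that the compute cost does not depend on stride.
+- loop-padding: Phase 3 removes the dynamic dependency on it; decide whether to retire the pass entirely
+  (`docs/loop-padding-elimination.md`) or keep it for static shapes.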

From 82e92555d240773fd6e04ce69549e7e36d2b0685 Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Wed, 24 Jun 2026 23:12:29 +0900
Subject: [PATCH 23/24] [Frontend] Make aligned axis-split symbolic-aware
 (detection layer)

Generalise axis_split boundary detection and divisibility-chain construction to
accept symbolic size expressions, as a strict superset of the integer case:
concrete-int reshapes produce identical split plans, and a dynamic reshape whose
flattened extent E is a product of dims (divisor a genuine factor, e.g.
FloorDiv(v, N) / ModularIndexing(v, 1, N) with extent M*N) is now detected.

- _divides/_eq/_gt1/_proper/_quotient/_as_size: boundary arithmetic that reduces
  exactly to int ops when operands are concrete and otherwise uses sympy (Mod
  simplifies to 0, cancel gives the quotient) under the symbols' integer/positive
  assumptions.
- _ordered_chain replaces _is_chain + numeric sort: orders boundaries by the
  divisibility partial order (b_i precedes b_j iff b_i | b_j) instead of numeric
  value, so symbolic suffix-product boundaries (N | M*N) chain; returns None on a
  non-total chain (incompatible radices) exactly as before.
- collect_boundaries / find_split_plan keep symbolic divisors and extents.
- build_split_body sizes sub-vars with _quotient/_as_size (symbolic seg extents).

Detection layer only: the residual-floor/mod folding (_fold_with_ranges) for
symbolic divisors and the runtime dynamic-stride DMA needed for end-to-end
symbolic reshape are follow-ups. Verified by tests/test_axis_split_symbolic.py
(static cases match legacy, symbolic cases detected, misaligned/non-divisor bail)
and confirmed behaviour-neutral on the static view suite.
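
A minimal sketch of the sympy facts the symbolic path relies on, assuming the
dim symbols carry integer/positive assumptions as described above (the names
`M`, `N`, and `extent` are illustrative, not taken from the implementation):

```python
import sympy

# Dynamic dims modelled as positive integer symbols (assumption of this sketch).
M = sympy.Symbol("M", integer=True, positive=True)
N = sympy.Symbol("N", integer=True, positive=True)

extent = M * N  # flattened extent of an [M, N] reshape

# Divisibility check in the style of _divides: Mod(M*N, N) reduces to 0.
remainder = sympy.simplify(sympy.Mod(extent, N))

# Segment extent in the style of _quotient: cancel(M*N / N) reduces to M.
quotient = sympy.cancel(extent / N)

assert remainder == 0
assert quotient == M
```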

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
 PyTorchSimFrontend/mlir/axis_split.py | 147 +++++++++++++++++++++-----
 tests/test_axis_split_symbolic.py     |  89 ++++++++++++++++
 2 files changed, 208 insertions(+), 28 deletions(-)
 create mode 100644 tests/test_axis_split_symbolic.py

diff --git a/PyTorchSimFrontend/mlir/axis_split.py b/PyTorchSimFrontend/mlir/axis_split.py
index 71ec4809..15404bd0 100644
--- a/PyTorchSimFrontend/mlir/axis_split.py
+++ b/PyTorchSimFrontend/mlir/axis_split.py
@@ -29,43 +29,130 @@ def _as_int(x):
         return None
 
 
+# --- symbolic-aware boundary arithmetic ------------------------------------
+# These reduce EXACTLY to the integer case when their operands are concrete, so
+# static axis splitting is unchanged; they additionally accept symbolic size
+# expressions (e.g. a flattened reshape extent E = M*N with divisor N), where a
+# boundary that is a genuine product of dims divides the extent by construction.
+# A dynamic dim symbol is created integer/positive, so sympy proves the
+# divisibility (Mod(M*N, N) -> 0) and the quotient (cancel(M*N/N) -> M).
+
+def _divides(d, E):
+    """True iff d divides E. For concrete ints this is `E % d == 0`."""
+    di, Ei = _as_int(d), _as_int(E)
+    if di is not None and Ei is not None:
+        return di != 0 and Ei % di == 0
+    try:
+        return bool(sympy.simplify(sympy.Mod(E, d)) == 0)
+    except Exception:
+        return False
+
+
+def _eq(a, b):
+    """Provable equality of two size exprs (structural for ints)."""
+    ai, bi = _as_int(a), _as_int(b)
+    if ai is not None and bi is not None:
+        return ai == bi
+    try:
+        return bool(sympy.simplify(a - b) == 0)
+    except Exception:
+        return a == b
+
+
+def _gt1(x):
+    """True iff x is a non-trivial boundary (> 1). A symbolic dim is assumed > 1."""
+    xi = _as_int(x)
+    if xi is not None:
+        return xi > 1
+    return not _eq(x, sympy.Integer(1))
+
+
+def _proper(b, E):
+    """True iff b is a proper interior divisor of E: 1 < b < E and b | E."""
+    bi, Ei = _as_int(b), _as_int(E)
+    if bi is not None and Ei is not None:
+        return 1 < bi < Ei and Ei % bi == 0
+    return _gt1(b) and not _eq(b, E) and _divides(b, E)
+
+
+def _quotient(a, b):
+    """a / b as an exact int (concrete) or simplified sympy expr (symbolic)."""
+    ai, bi = _as_int(a), _as_int(b)
+    if ai is not None and bi is not None:
+        return ai // bi
+    return sympy.cancel(a / b)
+
+
+def _as_size(x):
+    """Wrap a concrete int as sympy.Integer; pass a sympy expr through unchanged
+    (preserving its integer/positive assumptions)."""
+    xi = _as_int(x)
+    return sympy.Integer(xi) if xi is not None else x
+
+
+def _ordered_chain(boundaries, E):
+    """Order the proper divisors of E into a divisibility chain [1, ..., E], else None.
+
+    Generalises the old `_is_chain` + numeric `sorted`: orders by the divisibility
+    partial order (b_i precedes b_j iff b_i | b_j) rather than by numeric value, so
+    symbolic boundaries (suffix-products of dims, e.g. N | M*N) chain correctly. For
+    concrete ints this yields exactly the old ascending divisibility chain. Returns
+    None when the boundaries do not form a TOTAL divisibility chain (the
+    incompatible-radix / misaligned case), so the axis is left unsplit.
+    """
+    bs = []
+    for b in boundaries:
+        if _proper(b, E) and not any(_eq(b, x) for x in bs):
+            bs.append(b)
+    ordered = []
+    remaining = list(bs)
+    while remaining:
+        # the divisibility-minimum is the unique element that divides all others.
+        mins = [b for b in remaining
+                if all(_divides(b, o) for o in remaining if not _eq(b, o))]
+        if len(mins) != 1:
+            return None  # no unique minimum -> incomparable -> not a chain
+        ordered.append(mins[0])
+        remaining = [o for o in remaining if not _eq(o, mins[0])]
+    chain = [sympy.Integer(1)] + ordered + [_as_size(E)]
+    for i in range(len(chain) - 1):
+        if not _divides(chain[i], chain[i + 1]):
+            return None
+    return chain
+
+
 def collect_boundaries(exprs, var_to_axis, var_ranges):
     """{axis_index: set(boundary cut points)} for the given index expressions.
 
     A FloorDiv(v, k) contributes boundary k; ModularIndexing(v, k, m) contributes
     k and k*m. Only aligned terms count (boundary divides the var extent). Shared
-    by find_split_plan (fused LoopBody) and graph_copy (operand loaders).
+    by find_split_plan (fused LoopBody) and graph_copy (operand loaders). Boundaries
+    and extents may be symbolic (dynamic reshape); divisibility is checked via
+    `_divides`, so a symbolic divisor that is a genuine factor of the extent counts.
     """
     import collections
     bset = collections.defaultdict(set)
     for expr in exprs:
         for fd in expr.atoms(FloorDiv):
             base, div = fd.args
-            k = _as_int(div)
-            if base in var_to_axis and k and k > 1:
-                E = _as_int(var_ranges.get(base))
-                if E and E % k == 0:
-                    bset[var_to_axis[base]].add(k)
+            if base in var_to_axis and _gt1(div):
+                E = var_ranges.get(base)
+                if E is not None and _divides(div, E):
+                    bset[var_to_axis[base]].add(div)
         for mi in expr.atoms(ModularIndexing):
             base, div, mod = mi.args
-            k, m = _as_int(div), _as_int(mod)
-            if base in var_to_axis and k and m:
-                E = _as_int(var_ranges.get(base))
-                if E and E % (k * m) == 0:
+            if base in var_to_axis:
+                E = var_ranges.get(base)
+                km = div * mod
+                if E is not None and _divides(km, E):
                     ax = var_to_axis[base]
-                    if k > 1:
-                        bset[ax].add(k)
-                    if k * m < E:
-                        bset[ax].add(k * m)
+                    if _gt1(div):
+                        bset[ax].add(div)
+                    if _proper(km, E):
+                        bset[ax].add(km)
     return bset
 
 
-def _is_chain(boundaries, E):
-    """True iff [1, sorted(boundaries in (1,E)), E] is a divisibility chain."""
-    chain = [1] + sorted(b for b in boundaries if 1 < b < E) + [E]
-    return all(chain[i + 1] % chain[i] == 0 for i in range(len(chain) - 1))
-
-
 def find_split_plan(nodes):
     """Inspect a group of scheduler nodes and return {axis_index: boundaries}.
 
@@ -80,13 +167,14 @@ def find_split_plan(nodes):
     collected boundaries for an axis do NOT form a divisibility chain (e.g.
     floor-by-2 and mod-by-3 on extent 6), the radices are incompatible -> the axis
     is left unsplit (its floor/mod stays for the misaligned/recompile path).
+    Boundaries/extents may be symbolic (see _ordered_chain).
 
     axis_index is positional in the group's iteration space, so the same plan
     applies to every fused node sharing that space.
     """
     import collections
     bset = collections.defaultdict(set)     # axis -> set of boundary cut points
-    ext_of = {}                             # axis -> extent
+    ext_of = {}                             # axis -> extent (int or symbolic)
     for n in nodes:
         body = getattr(n, "_body", None)
         if body is None:
@@ -95,14 +183,17 @@ def find_split_plan(nodes):
         nb = collect_boundaries(body.indexing_exprs.values(), var_to_axis, body.var_ranges)
         for ax, bs in nb.items():
             bset[ax] |= bs
-            ext_of[ax] = _as_int(body.var_ranges[body.iter_vars[ax]])
+            ext_of[ax] = body.var_ranges[body.iter_vars[ax]]
 
     plan = {}
     for ax, bs in bset.items():
-        E = ext_of[ax]
+        E = ext_of.get(ax)
+        if E is None:
+            continue
         # require a real, divisibility-chain split (incompatible radices -> skip).
-        if E and any(1 < b < E for b in bs) and _is_chain(bs, E):
-            plan[ax] = [1] + sorted(b for b in bs if 1 < b < E) + [E]
+        chain = _ordered_chain(bs, E)
+        if chain is not None and len(chain) > 2:
+            plan[ax] = chain
 
     # A split may push the per-axis index rank past 4. The resulting >4D logical tile
     # is peeled into <=4D physical descriptors by the decompose-transfer pass (an
@@ -143,15 +234,15 @@ def build_split_body(node, plan, prefix="z"):
             subs = []                         # (symbol, extent, significance) low->high
             expr = sympy.Integer(0)
             for i in range(len(bounds) - 1):
-                seg_ext = bounds[i + 1] // bounds[i]
+                seg_ext = _quotient(bounds[i + 1], bounds[i])
                 nv = sympy_index_symbol(f"{prefix}{ctr}"); ctr += 1
                 subs.append((nv, seg_ext, bounds[i]))
                 expr = expr + nv * bounds[i]
             # iteration nest: most-significant (outermost) dim first.
             for nv, seg_ext, _sig in reversed(subs):
                 iter_vars.append(nv)
-                var_ranges[nv] = sympy.Integer(seg_ext)
-                index_size.append(sympy.Integer(seg_ext))
+                var_ranges[nv] = _as_size(seg_ext)
+                index_size.append(_as_size(seg_ext))
             index_args.append(expr)
         else:
             nv = sympy_index_symbol(f"{prefix}{ctr}"); ctr += 1
diff --git a/tests/test_axis_split_symbolic.py b/tests/test_axis_split_symbolic.py
new file mode 100644
index 00000000..7f8623b2
--- /dev/null
+++ b/tests/test_axis_split_symbolic.py
@@ -0,0 +1,89 @@
+"""Unit test for symbolic-aware aligned axis splitting (axis_split.py).
+
+Pure sympy/Inductor test (no simulator): verifies the boundary-detection and
+divisibility-chain layer is a strict SUPERSET -- concrete-int reshapes behave
+exactly as before, and symbolic reshapes (flattened extent E = product of dims,
+divisor a genuine factor) are detected and chained correctly. The incompatible
+(misaligned) and non-divisor cases must bail (no split), for both int and symbol.
+
+Not in CI's simulator allowlist; run directly: python tests/test_axis_split_symbolic.py
+"""
+import sympy
+from torch._inductor.utils import sympy_index_symbol
+from torch.utils._sympy.functions import FloorDiv, ModularIndexing
+import PyTorchSimFrontend.mlir.axis_split as ax
+
+v = sympy_index_symbol("v")
+
+
+def I(x):
+    return sympy.Integer(x)
+
+
+def _chain_vals(chain):
+    if chain is None:
+        return None
+    if all(c.is_number for c in chain):
+        return [int(c) for c in chain]
+    return [str(c) for c in chain]
+
+
+def _boundaries(exprs, E):
+    return ax.collect_boundaries(exprs, {v: 0}, {v: E}).get(0, set())
+
+
+_failures = []
+
+
+def check(name, got, exp):
+    if got != exp:
+        _failures.append(f"{name}: got {got}, expected {exp}")
+        print("FAIL", name, "->", got, f"(expected {exp})")
+    else:
+        print("PASS", name, "->", got)
+
+
+def main():
+    # ---- static (must match legacy behaviour) ----
+    b = _boundaries([FloorDiv(v, I(3)), ModularIndexing(v, I(1), I(3))], I(12))
+    check("static reshape [4,3] boundaries", {int(x) for x in b}, {3})
+    check("static reshape [4,3] chain", _chain_vals(ax._ordered_chain(b, I(12))), [1, 3, 12])
+
+    check("static incompatible {2,3} E=6", _chain_vals(ax._ordered_chain({I(2), I(3)}, I(6))), None)
+
+    b = _boundaries(
+        [FloorDiv(v, I(12)), ModularIndexing(v, I(4), I(3)), ModularIndexing(v, I(1), I(4))],
+        I(24),
+    )
+    check("static 3-level boundaries", {int(x) for x in b}, {4, 12})
+    check("static 3-level chain", _chain_vals(ax._ordered_chain(b, I(24))), [1, 4, 12, 24])
+
+    # ---- symbolic (new) ----
+    M = sympy.Symbol("M", integer=True, positive=True)
+    N = sympy.Symbol("N", integer=True, positive=True)
+    A = sympy.Symbol("A", integer=True, positive=True)
+    B = sympy.Symbol("B", integer=True, positive=True)
+    C = sympy.Symbol("C", integer=True, positive=True)
+    P = sympy.Symbol("P", integer=True, positive=True)
+
+    b = _boundaries([FloorDiv(v, N), ModularIndexing(v, I(1), N)], M * N)
+    check("sym reshape [M,N] boundaries", {str(x) for x in b}, {"N"})
+    check("sym reshape [M,N] chain", _chain_vals(ax._ordered_chain(b, M * N)), ["1", "N", "M*N"])
+    check("sym seg_ext M*N/N", str(ax._quotient(M * N, N)), "M")
+
+    b = _boundaries([FloorDiv(v, B * C), ModularIndexing(v, C, B), ModularIndexing(v, I(1), C)], A * B * C)
+    check("sym 3-level boundaries", {str(x) for x in b}, {"C", "B*C"})
+    check("sym 3-level chain", _chain_vals(ax._ordered_chain(b, A * B * C)), ["1", "C", "B*C", "A*B*C"])
+
+    # incomparable symbolic divisors -> bail (misaligned)
+    check("sym incomparable {N,P} E=N*P", _chain_vals(ax._ordered_chain({N, P}, N * P)), None)
+    # non-divisor symbolic -> no boundary collected
+    check("sym non-divisor E=M*N+1", dict(ax.collect_boundaries([FloorDiv(v, N)], {v: 0}, {v: M * N + 1})), {})
+
+    if _failures:
+        raise SystemExit("Axis-split symbolic unit test FAILED:\n  " + "\n  ".join(_failures))
+    print("\nAxis-split symbolic unit test: ALL PASS")
+
+
+if __name__ == "__main__":
+    main()

From 50f6550a0c14eb61c34ea0ddccb79165f1e1828c Mon Sep 17 00:00:00 2001
From: Wonhyuk Yang <wonhyuk@postech.ac.kr>
Date: Thu, 25 Jun 2026 15:36:20 +0900
Subject: [PATCH 24/24] [Frontend] Modularize buffer-to-memref type
 construction

Extract the per-dimension and per-type memref rendering into mlir_dim /
mlir_memref_type so the symbolic-vs-concrete decision lives in one place.
get_mlir_shape now delegates to mlir_memref_type.

This fixes a gating leak: get_mlir_shape chose a dynamic memref dim by testing
isinstance(numel, sympy.Expr), which is also true for a concrete sympy.Integer.
A static buffer could therefore be emitted as memref<?xf32> in one place and
memref<128xf32> in another, an MLIR type mismatch that broke the static view
suite (test_cat, transpose2D/3D). mlir_dim gates on is_symbolic_dim instead, so
a concrete sympy.Integer renders as its value and only a true symbol becomes
"?". mlir_memref_type also takes a dim list, ready for multi-axis symbolic
memrefs.

Verified: static view suite recovered (cat 10, transpose2D 4, transpose3D 6,
view3D_2D 3) and test_dynamic_add still passes at N=1024 and N=2048.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
---
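Reviewer note (not part of the commit): the gating leak described in the
message can be reproduced in isolation. The sketch below is a minimal
illustration, not the frontend code. is_symbolic_dim, mlir_dim and
mlir_memref_type mirror the definitions in this patch, with one simplifying
assumption: the element type is passed as a ready-made MLIR string ("f32")
instead of being looked up in DTYPE_TO_MLIR, so the example needs only sympy.
old_gate is a hypothetical name for the check removed from get_mlir_shape.

```python
import sympy


def is_symbolic_dim(x):
    # Same guard as in mlir_common.py: a concrete sympy.Integer is NOT symbolic.
    return isinstance(x, sympy.Expr) and not x.is_number


def old_gate(numel):
    # Hypothetical name for the removed check: any sympy.Expr became "?".
    return "?" if isinstance(numel, sympy.Expr) else numel


def mlir_dim(d):
    return "?" if is_symbolic_dim(d) else str(d)


def mlir_memref_type(dims, elem_type, memspace=None):
    # Assumption: elem_type is already an MLIR type string such as "f32".
    body = "x".join(mlir_dim(d) for d in dims)
    suffix = f", {memspace}" if memspace is not None else ""
    return f"memref<{body}x{elem_type}{suffix}>"


concrete = sympy.Integer(128)  # static extent that happens to be a sympy object
symbol = sympy.Symbol("N", integer=True, positive=True)  # truly dynamic extent

leaked = old_gate(concrete)        # "?"   (the leak: a static dim looks dynamic)
fixed = mlir_dim(concrete)         # "128"
dynamic = mlir_dim(symbol)         # "?"

static_type = mlir_memref_type([concrete], "f32")         # memref<128xf32>
dynamic_type = mlir_memref_type([symbol], "f32")          # memref<?xf32>
mixed_type = mlir_memref_type([symbol, concrete], "f32")  # memref<?x128xf32>

print(leaked, fixed, dynamic, static_type, dynamic_type, mixed_type)
```

The last call shows why taking a dim list matters: each axis is gated
independently, so a memref can mix dynamic and static dimensions.
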
 PyTorchSimFrontend/mlir/mlir_common.py | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/PyTorchSimFrontend/mlir/mlir_common.py b/PyTorchSimFrontend/mlir/mlir_common.py
index 61f0058e..8d342e19 100644
--- a/PyTorchSimFrontend/mlir/mlir_common.py
+++ b/PyTorchSimFrontend/mlir/mlir_common.py
@@ -104,6 +104,17 @@ def is_symbolic_dim(x):
     guard (a concrete sympy.Integer is NOT symbolic)."""
     return isinstance(x, sympy.Expr) and not x.is_number
 
+
+def mlir_dim(d):
+    return "?" if is_symbolic_dim(d) else str(d)
+
+
+def mlir_memref_type(dims, dtype, memspace=None):
+    body = "x".join(mlir_dim(d) for d in dims)
+    suffix = f", {memspace}" if memspace is not None else ""
+    return f"memref<{body}x{DTYPE_TO_MLIR[dtype]}{suffix}>"
+
+
 DTYPE_LOWP_FP = [
     torch.bfloat16,
     torch.float16,
@@ -189,13 +200,7 @@ def is_mlir_arg_var(value):
 
     @staticmethod
     def get_mlir_shape(info):
-        tensor_type = DTYPE_TO_MLIR[info[0]]
-        numel = info[1]
-        # Dynamic shape: a symbolic numel becomes a dynamic memref dim ("?"); the
-        # actual extent arrives at runtime via the size-symbol arg (mlir_argdefs
-        # sizevars) and is materialized as the loop bound (codegen_loops).
-        dim = "?" if isinstance(numel, sympy.Expr) else numel
-        return f"memref<{dim}x{tensor_type}>"
+        return mlir_memref_type([info[1]], info[0])
 
     def mlir_argdefs(self, extra_node=dict()):
         buffer_types = {}