TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers by YWHyuk · Pull Request #267 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-19T04:22:35Z

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

1. SRAM last-writer per buffer (load->compute, the Y_spad accumulator chain), recovered via SSA plus a virtual SA_WEIGHTS buffer that folds preload->matmul.
2. The systolic array, modeled as a pipeline (occupancy/latency split) with two explicit, distinctly named barriers:
   - MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; on an async load -> compute edge, the compute waits for the data's resp-complete.
   - COMPUTE_BAR (new): the compute fence; a store waits for all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.

Status

256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
Manual --trace_so GEMM through TOGSim.
Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

… feed Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime, loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and the pipeline tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Per-record tag key in the bridge plus per-iteration tag alloc in dma-fine-grained so multi-tile-K and conv loads do not collide; strip the reduction accum marker from the memory_barrier slot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into togsim_kernel_tile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

DMA-capacity throttle and frozen-state guard, per-core VMEM in the configs, and the SA weight-buffer throttle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

trace_timeline.py with per-work-item grouping and resource-centric DMA lanes; the trace logs the first DRAM response and the assigned systolic array, and scopes the compute barrier to its dispatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Default to the trace path; fix uninitialized Instruction fields, the matmul accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and dma_wait fusion; bound concurrent dispatches to the spad, round-robin work-items within a partition, benchmark autotune and run the multi-tenant scheduler through the trace path, and emit trace.so for pooling/reduction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Carry simulator headers through the wrapper for cache-safe replay; drop verbose [P3-trace] logs; fix the key.mlir compile race in load(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

… runtime model

Replace the trace bridge's accumulated special cases with one dataflow rule and clean up the runtime that consumes it.

Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all current writers (occupancy=ISSUE when both are systolic-array ops, else latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a matmul that reads and writes the same buffer = a commutative accumulator): skip its read edge and UNION its write, waiting only the non-matmul init seed and not ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked the SA weight-slot pipeline while keeping the init->matmul edge, and lets a vector epilogue or the store wait every K matmul (fixes the pure-vector store that an empty COMPUTE_BAR let slip).

Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA drain), so the store JOINs the matmul writers directly. The whole emit/loader chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a stale producer fails loudly instead of emitting records the bridge would drop. Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue).

Model compute-output spad footprint in the SRAM version/capacity machinery so buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes from the DMA records that touch the same buffer (a buf_bytes pre-pass); an in-place buffer (accumulator, relu) is version-transparent so footprint is not double-counted. The occupy gate and version release sit in the MOVIN/MOVOUT/COMP issue points (release before the COMP skip path so a skipped matmul still frees).

Runtime: collapse child_inst / _pipeline_children into one event-indexed _deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot release queue and the async-load wakeup into one _due_events timed-effect table drained by process_due_events. Both are behavior-preserving (byte-identical).

Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at init), and the round-robin disable mode is removed. Degenerate traces (a consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather than asserts that vanish under NDEBUG.

Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in warns at command build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
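The writers-set rule described in this commit can be illustrated with a small Python sketch. This is not the bridge's C++ code: the `Op` record, `derive_deps`, and the buffer names other than SA_WEIGHTS and Y_spad are hypothetical, and leaving the init seed in the writers set after the first matmul is an assumption of this sketch.

```python
from dataclasses import dataclass

ISSUE, DONE = "ISSUE", "DONE"


@dataclass(frozen=True)
class Op:
    name: str
    reads: tuple = ()
    writes: tuple = ()
    on_sa: bool = False      # runs on the systolic array (preload or matmul)
    is_matmul: bool = False

    @property
    def accum_bufs(self):
        # Buffers a matmul both reads and writes: commutative accumulators.
        return set(self.reads) & set(self.writes) if self.is_matmul else set()


def derive_deps(ops):
    """Return a set of (consumer, producer, kind) edges in program order."""
    writers = {}   # buffer name -> set of ops currently recorded as writers
    edges = set()
    for op in ops:
        accum = op.accum_bufs
        for buf in op.reads:
            for w in writers.get(buf, ()):
                if buf in accum and w.is_matmul:
                    continue  # accumulator: do not order co-matmuls
                kind = ISSUE if (op.on_sa and w.on_sa) else DONE
                edges.add((op.name, w.name, kind))
        for buf in op.writes:
            if buf in accum:
                writers.setdefault(buf, set()).add(op)  # UNION
            else:
                writers[buf] = {op}                     # REPLACE
    return edges


mm_io = dict(reads=("X_spad", "SA_WEIGHTS", "Y_spad"), writes=("Y_spad",),
             on_sa=True, is_matmul=True)
ops = [
    Op("load_x", writes=("X_spad",)),
    Op("load_w", writes=("W_spad",)),
    Op("init", writes=("Y_spad",)),
    Op("preload0", reads=("W_spad",), writes=("SA_WEIGHTS",), on_sa=True),
    Op("mm0", **mm_io),
    Op("preload1", reads=("W_spad",), writes=("SA_WEIGHTS",), on_sa=True),
    Op("mm1", **mm_io),
    Op("store", reads=("Y_spad",)),
]
edges = derive_deps(ops)
for edge in sorted(edges):
    print(edge)
```

In this toy trace the two matmuls are not ordered against each other, each still waits on the init seed, and the store joins every matmul writer with a DONE edge.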

… spad/2 The validation-binary spad-overflow check sat inside `if functional_mode:`, so in timing-only / autotune (non-functional) runs an over-spad tiling was never rejected and reached TOGSim, which wedged ("spad too small") and crashed the compile via assert 0. Move the compile + check out of the functional gate (the Spike execution itself stays gated, run_spike below) and budget per dispatch at spad/2 -- two work-items run concurrently (double buffer), so each must fit half the spad or they deadlock competing for it. This matches the GEMM tiling gate (max_spad_size = spad/2), which pointwise ops lacked. Fixes the resnet / test_scheduler wedge where a fused BatchNorm+ReLU tile exceeded spad/2. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Every generated trace.cpp now opens with a "GENERATED ... DO NOT EDIT" banner carrying the TOGSim ABI version and the togsim_* call formats, so a dumped trace is self-documenting. Both are read from togsim_runtime.h at codegen time -- the version from the TOGSIM_ABI_VERSION define, the call-format text from a marked block next to the declarations -- so they never drift when the ABI changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
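A minimal sketch of reading the banner contents from the header at codegen time might look like the following. Only the TOGSIM_ABI_VERSION define and the idea of a marked block come from the commit message; the marker strings, the banner wording, `build_banner`, and the placeholder call format are all hypothetical.

```python
import re

# Stand-in for the text of togsim_runtime.h. The marker comments and the
# call-format line are placeholders, not the real header contents.
SAMPLE_HEADER = """\
#define TOGSIM_ABI_VERSION 12
// BEGIN CALL FORMATS
// togsim_example_call(<args>)
// END CALL FORMATS
void togsim_example_call(void);
"""


def build_banner(header_text):
    """Build a banner from the ABI version define and the marked block."""
    version = re.search(
        r"^[ \t]*#[ \t]*define[ \t]+TOGSIM_ABI_VERSION[ \t]+(\d+)",
        header_text, re.M)
    block = re.search(
        r"// BEGIN CALL FORMATS\n(.*?)// END CALL FORMATS",
        header_text, re.S)
    if version is None or block is None:
        # Fail loudly rather than emit a banner that could drift from the ABI.
        raise ValueError("header lacks the version define or the marked block")
    lines = [
        "// GENERATED FILE -- DO NOT EDIT",  # illustrative wording only
        f"// TOGSim ABI version: {version.group(1)}",
        "// Call formats:",
    ]
    for raw in block.group(1).splitlines():
        lines.append("//   " + raw.lstrip("/ ").rstrip())
    return "\n".join(lines) + "\n"


banner = build_banner(SAMPLE_HEADER)
print(banner)
```

Because both pieces are parsed from the same header that declares the ABI, a version bump or a changed call format shows up in the next generated file without a second place to update.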

…rint The bridge sums each work-item's distinct-buffer footprint (each buffer once, so a reduction's reloads of the same section do not inflate it) onto its Tile. can_issue then admits two concurrent dispatches only when each fits half the spad; a dispatch larger than spad/2 runs alone with the whole spad, so two work-items never compete for the shared spad and deadlock -- a runtime safety net beneath the codegen spad/2 gate. The footprint and resulting max_dispatch are logged on the TILE_SCHEDULED trace line for debugging. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
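The admission rule above can be sketched in a few lines of Python. The function names mirror the commit's terms (footprint, max_dispatch, can_issue), but their signatures, the record layout, and the byte sizes are assumptions for illustration, not the simulator's C++ interface.

```python
def work_item_footprint(dma_records):
    """Sum each distinct buffer once; reloads of a buffer do not add to it."""
    sizes = {}
    for buf, nbytes in dma_records:
        sizes[buf] = max(sizes.get(buf, 0), nbytes)
    return sum(sizes.values())


def max_dispatch(footprint, spad_bytes):
    """Two dispatches may overlap only if this one fits half the spad."""
    return 2 if footprint <= spad_bytes // 2 else 1


def can_issue(running, candidate, spad_bytes):
    """running: footprints of dispatches already occupying the spad."""
    if not running:
        return candidate <= spad_bytes  # an oversized dispatch runs alone
    half = spad_bytes // 2
    return (len(running) < 2
            and candidate <= half
            and all(f <= half for f in running))


SPAD = 131072  # illustrative spad size in bytes
records = [("X", 16384), ("W", 16384), ("X", 16384), ("Y", 8192)]
footprint = work_item_footprint(records)
print(footprint, max_dispatch(footprint, SPAD))
print(can_issue([footprint], footprint, SPAD))  # both fit half the spad
print(can_issue([footprint], 70000, SPAD))      # newcomer exceeds spad/2
print(can_issue([70000], 1000, SPAD))           # a large dispatch runs alone
```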

…election select_tile fed the epilogue buffer count to gemm_combination_mapping as max(n_extra_read - 2, 0), which optimistically assumed the X and W DMA buffers were already freed and reusable by the epilogue. The codegen lays every buffer out as a disjoint .spad global and never reuses freed space, so the estimate undercounted the real footprint: for a fused matmul+relu kernel the budget came to 64512 B/lane (treated as a bare GEMM) while the emitted kernel used 89088 B/lane including the relu output buffer. Under the full-spad guard this was harmless (89088 < 131072), but the spad/2 guard rejected it and crashed test_transformer_fusion with SpadOverflowError, since the heuristic path has no tile-shrink fallback. Pass n_extra_node + n_extra_read instead: one output-tile-sized buffer per epilogue node plus one per extra read operand, matching what the codegen emits. For the matmul+relu kernel the budget now equals the actual footprint exactly, and tile selection picks TILE_M=128 (62976 B/lane) which fits spad/2. Liveness-based spad reuse and broadcast over-allocation are tracked separately as optimizations in issue #275. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
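The figures in this commit message can be checked with a short worked example. It assumes the budget is the bare-GEMM footprint plus one output-tile-sized buffer per counted epilogue buffer, and that the relu epilogue contributes one node and no extra read operand; neither assumption is stated in the source, and the helper names are hypothetical.

```python
SPAD_PER_LANE = 131072            # full per-lane spad (bytes)
HALF_SPAD = SPAD_PER_LANE // 2    # the spad/2 guard
BARE_GEMM = 64512                 # budget when the kernel is treated as a bare GEMM
EMITTED = 89088                   # footprint of the emitted matmul+relu kernel
OUT_TILE = EMITTED - BARE_GEMM    # inferred size of the relu output buffer


def old_count(n_extra_node, n_extra_read):
    # Old estimate: assumed the X and W DMA buffers were reusable.
    return max(n_extra_read - 2, 0)


def new_count(n_extra_node, n_extra_read):
    # New estimate: one buffer per epilogue node plus one per extra read.
    return n_extra_node + n_extra_read


# Assumption: relu is one epilogue node with no extra read operand.
n_extra_node, n_extra_read = 1, 0
old_budget = BARE_GEMM + old_count(n_extra_node, n_extra_read) * OUT_TILE
new_budget = BARE_GEMM + new_count(n_extra_node, n_extra_read) * OUT_TILE

print(old_budget, old_budget <= HALF_SPAD)  # undercount passes the gate
print(new_budget, new_budget <= HALF_SPAD)  # true footprint fails spad/2
print(62976 <= HALF_SPAD)                   # the TILE_M=128 choice fits
```

Under these assumptions the old estimate stays at the bare-GEMM figure and slips under spad/2, while the new count reproduces the emitted footprint, which is why selection has to fall back to a smaller tile.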

The spad-overflow check summed the kernel stack frame into the per-lane scratchpad usage (spad_usage = stack_size + spad_size) and rejected the tiling when that exceeded the spad/2 budget. But only the .spad section actually lives in the scratchpad -- it is pinned there by the --section-start=.spad link option. The kernel stack is in main memory (sp is set up by pk in the -m region, not at the scratchpad vaddr), so it does not consume the scratchpad and must not be charged against it. The scalar frame is also shared across lanes, not per-lane, so adding it double-counted. On small configs (8x8) this falsely rejected feasible tilings: the wrapper3 conv2d/resnet/mistral kernels fit the 32 KB spad with room to spare but were tripped over the spad/2 gate purely by the ~200-800 B stack term, crashing compile with SpadOverflowError. Drop the stack term; the .spad-only check still correctly rejects real buffer overflows (e.g. sparsity, which is fixed separately by f05ac8a). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P5qH5qvM4STHKYA7eMAfmx

Spike v1.0.3 zero-inits the MVIN/MVOUT DMA address buffer so ROUNDUP padding entries are skipped instead of dereferencing uninitialized garbage, fixing the host store segfault on 8-lane (wrapper3) configs where a tile's split axis is not a multiple of vlane_stride*n_vu.

Add pytorchsim_functional_verify_per_kernel, a sub-option of pytorchsim_functional_mode that localizes the first compiled kernel whose Spike output diverges from a CPU reference.

When enabled, the generated wrapper compares every realized buffer (the output of each fused kernel) against a CPU "golden" computed once per call by running the original aten graph (V.graph.module) on CPU with the same inputs, via an fx.Interpreter that records each node's output by name. A buffer is mapped to its originating fx node through V.graph.get_buffer().origin_node, so the check reports the kernel, the originating op, the offending indices and the max abs diff, then raises at the first divergence (stop-at-first). Comparison granularity is the fused-cluster output, the finest observable in a fused pipeline.

Auto-disabled when functional mode is off (no Spike values to verify); the config accessor AND-gates the key with functional mode. Codegen bakes the verify_init/verify_check calls into the wrapper only when enabled at compile time, so clear the codegen cache when toggling. Tolerances via TORCHSIM_FUNCTIONAL_VERIFY_RTOL / _ATOL (default 1e-4).

extension_functional_verify.py holds the graph registry and the runtime golden/compare logic; mlir_codegen_backend injects the calls at the wrapper level; extension_config reads the new YAML key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
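The stop-at-first comparison can be sketched without PyTorch, using flat lists of floats in place of tensors. The function names and the report layout below are hypothetical, and the allclose-style test `|a - g| <= atol + rtol * |g|` is an assumption about how the two tolerances combine; the real option raises at the first divergence, whereas this sketch returns a report.

```python
import os


def _tol(name, default="1e-4"):
    # Tolerances come from the environment, defaulting to 1e-4.
    return float(os.environ.get(name, default))


def compare(actual, golden, rtol, atol):
    """Return (offending indices, max abs diff) for two flat float lists."""
    diffs = [abs(a - g) for a, g in zip(actual, golden)]
    bad = [i for i, (d, g) in enumerate(zip(diffs, golden))
           if d > atol + rtol * abs(g)]
    return bad, max(diffs, default=0.0)


def first_divergent_kernel(kernel_outputs, golden_by_node, rtol=None, atol=None):
    """kernel_outputs: (kernel, origin_node, values) tuples in run order."""
    rtol = _tol("TORCHSIM_FUNCTIONAL_VERIFY_RTOL") if rtol is None else rtol
    atol = _tol("TORCHSIM_FUNCTIONAL_VERIFY_ATOL") if atol is None else atol
    for kernel, origin_node, values in kernel_outputs:
        bad, max_abs = compare(values, golden_by_node[origin_node], rtol, atol)
        if bad:
            # Stop at the first divergence; later kernels are not checked.
            return {"kernel": kernel, "origin": origin_node,
                    "indices": bad, "max_abs_diff": max_abs}
    return None


golden = {"relu": [0.0, 1.0, 2.0], "add": [1.0, 2.0, 3.0]}
outputs = [
    ("kernel0", "relu", [0.0, 1.0, 2.0]),
    ("kernel1", "add", [1.0, 2.5, 3.0]),   # diverges at index 1
    ("kernel2", "relu", [9.0, 9.0, 9.0]),  # never reached
]
report = first_divergent_kernel(outputs, golden, rtol=1e-4, atol=1e-4)
print(report)
```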

Add docs/per-kernel-functional-verify.md (usage, options, mechanism, limitations, code map) and point to it from CLAUDE.md: a one-line entry in the debugging section and in the YAML knobs list. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Keep only a one-line note on what TOGSIM_ABI_VERSION guards; the per-bump v1..v12 history was noise nobody reads.

The BMM tile selector fed only n_extra_node to gemm_combination_mapping and dropped the prologue's extra-read operands entirely, so a softmax-fused attention matmul (value^T @ softmax(scores)) was tiled as a bare BMM. The codegen lays the softmax max/sum operands out as their own disjoint weight-tile-sized .spad globals (buf3/buf4) and never reuses freed space, so the estimate undercounted the real footprint: on the 32x32 wrapper2 config (16 KB/lane spad/2 budget) tile selection picked TILE_N=512, whose emitted kernel used 16896 B/lane and overflowed the budget by 512 B, crashing test_transformer with SpadOverflowError. Under the 128x128 full-budget config the slack hid it. This mirrors the epilogue n_extra_read gap fixed for the GEMM template (commit f05ac8a), now on the BMM prologue path.

Add an n_prologue_extra_read knob to gemm_combination_mapping that charges each extra prologue-read operand as one weight-tile-sized (TILE_K x TILE_N) .spad buffer, matching what the codegen emits, and have the BMM template count those operands the same way codegen does (the one numel-matching main input reuses the matmul-operand buffer; every other read gets its own global).

Tile selection now rejects TILE_N=512 (16896 B/lane) and picks a fitting tile (8704 B/lane), so wrapper2 test_transformer passes the full Spike + allclose check. The new parameter defaults to 0, leaving the GEMM and conv callers unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P5qH5qvM4STHKYA7eMAfmx

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P5qH5qvM4STHKYA7eMAfmx

[Frontend] Budget fused-prologue spad buffers in BMM tile selection

…s off format_dma_inst_issued_trace_line and format_instruction_detail_line are only ever used as the message argument to trace_instruction_line (a spdlog::trace sink). Because the argument is evaluated eagerly at the call site, the kernel builds a formatted std::string for every issued/finished instruction and then spdlog drops it whenever the level is above trace (the default). Bail out of both formatters when trace logging is disabled so the fmt::format work is skipped entirely. This removes wasted work but is not the simulation-speed bottleneck (that is the per-cycle ready-queue rescan in Core::cycle, addressed separately); the conv kernel wall time is unchanged within noise.
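The eager-argument cost and the early bail-out can be shown with Python's standard logging module standing in for spdlog. This is an analogy only: the logger name, the counter, and the Python versions of the formatter and sink are hypothetical, and DEBUG plays the role of spdlog's trace level.

```python
import logging

log = logging.getLogger("togsim_sketch")
log.setLevel(logging.INFO)  # "trace" output is off, as in the default case
format_calls = 0            # counts how often the string is actually built


def format_instruction_detail_line(inst, logger):
    """Build the detail string only when the level would let it through."""
    global format_calls
    if not logger.isEnabledFor(logging.DEBUG):
        return ""  # bail out before any formatting work
    format_calls += 1
    return f"inst={inst['id']} op={inst['op']} cycles={inst['cycles']}"


def trace_instruction_line(logger, message):
    # The argument was already evaluated by the caller, so only a guard
    # inside the formatter can avoid the formatting cost.
    logger.debug(message)


for i in range(1000):
    inst = {"id": i, "op": "COMP", "cycles": 4}
    trace_instruction_line(log, format_instruction_detail_line(inst, log))
calls_when_disabled = format_calls

log.setLevel(logging.DEBUG)
trace_instruction_line(
    log, format_instruction_detail_line({"id": 0, "op": "COMP", "cycles": 4}, log))
calls_when_enabled = format_calls
print(calls_when_disabled, calls_when_enabled)
```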

Core::cycle() walked every tile's ready-instruction list on every simulated cycle to find one instruction to issue. Profiling an 8x8 conv kernel showed this scan is ~96% of simulation time and, within it, the walk is >99%: the ready list holds ~2000 instructions (mostly blocked on the SRAM-capacity / weight-slot throttle), only one issues per cycle, and the blocked ones are never removed, so the same ~2000 are re-examined every cycle (~2.6k list iterations/cycle, ~1e9 over 400k cycles). At small arrays this dominates because tiny tiles inflate both the ready-list length and the number of DMA-wait stall cycles.

Gate the scan behind a per-Core _issue_dirty flag. The scan's outcome can only change when the ready set grows or a resource frees, so set the flag on exactly those events:
- ready set grows: Instruction::dec_ready_counter, on enqueueing a now-ready instruction, sets the OWNING core's _issue_dirty via _owner_dirty_ref (wired in Core::issue when the tile is dispatched) -- per-core so a remote core's enqueue does not force every core to rescan;
- SRAM freed (release_sram) and weight slot freed (apply_due) set _issue_dirty;
- a new dispatch (issue) and a successful issue keep it set.

On a cycle with none of these the scan would re-walk the same blocked instructions and issue nothing, so it is skipped. Instructions are never moved between queues, so there is no re-admission churn.

Issue-identical: it never changes which instruction issues or when, only whether the (side-effect-free under those conditions) scan runs. The 8x8 conv kernel produces the same 403026 cycles as before, with its TOGSim wall dropping from 214.5s to 41.9s (~5x, 71% of scans skipped). A forced-scan invariant check (no skipped cycle ever issues or inline-finishes a zero-cycle COMP) found 0 violations across 39 kernels spanning GEMM, BMM, softmax and conv. DMA-bound kernels (the ones that hit the 6h CI cap) stall more and gain more.
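The dirty-flag gate can be illustrated with a toy single-core model in Python. It is a sketch under simplifying assumptions, not the Core::cycle() code: every instruction starts ready, the only resource is one capacity counter, latencies are at least one cycle, and the function and variable names are hypothetical.

```python
def simulate(instrs, capacity, gated, max_cycles=10_000):
    """instrs: (name, size, latency) tuples. Returns (issue_log, scans)."""
    ready = list(instrs)   # ready but possibly blocked on capacity
    free = capacity
    releases = {}          # cycle -> capacity returned at that cycle
    dirty = True           # a fresh dispatch always requests a scan
    log, scans, cycle = [], 0, 0
    while (ready or releases) and cycle < max_cycles:
        if cycle in releases:
            free += releases.pop(cycle)
            dirty = True   # a resource freed: the scan outcome may change
        if dirty or not gated:
            scans += 1
            for i, (name, size, latency) in enumerate(ready):
                if size <= free:
                    free -= size
                    done = cycle + latency
                    releases[done] = releases.get(done, 0) + size
                    log.append((cycle, name))
                    del ready[i]
                    dirty = True  # a successful issue keeps the flag set
                    break
            else:
                dirty = False     # nothing issued; a rescan would see the same
        cycle += 1
    return log, scans


kernel = [("a", 4, 10), ("b", 4, 10), ("c", 2, 5), ("d", 2, 5)]
plain_log, plain_scans = simulate(kernel, capacity=4, gated=False)
gated_log, gated_scans = simulate(kernel, capacity=4, gated=True)
print(gated_log == plain_log, plain_scans, gated_scans)
```

In this toy run the gated and ungated versions issue the same instructions on the same cycles, and the gated version performs fewer scans because stalled cycles with no release are skipped.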

In Core::cycle()'s issue scan, a COMP with compute_cycle == 0 finishes inline and calls instructions.erase(it), but does not set `issued` or break -- so control falls through to the `it++` at the end of the loop body, incrementing the std::list iterator that erase() just invalidated. That reads a freed node and is undefined behavior; it usually limps along because the freed node's next pointer is briefly intact, but under a different allocator / sanitizer / build it can crash or skip-or-duplicate the remaining ready instructions in that tile, corrupting issue order. Use the iterator erase() returns and continue, the standard erase-in-loop idiom. No behavior change on the path that happened to work; it just removes the UB.

…X TOG The main compile/sim path no longer generates or selects the legacy ONNX Tile-Operation-Graph. extension_codecache emits only trace.so + trace_cycles.tsv (the build-skip now keys on trace.so), and TOGSimulator.run_standalone always drives TOGSim with --trace_so. The TORCHSIM_LEGACY_TOG opt-in is removed from the frontend. The ONNX --models_list branch is kept solely for the STONNE sparse path (extension_op.py); TOGSim's C++ ONNX parser is untouched (separate PR). origins (which FX nodes a kernel came from) is preserved: logged per kernel run and recorded as a trailing "# origins:" line in trace_cycles.tsv -- the legacy ONNX TOG carried this as node metadata, and the C++ cycle-table loader stops at the comment so the current parser is unaffected. Also drop the dead tog_file param from mlir_gem5_compile_command, migrate scripts/chiplet.sh to --trace_so/--cycle_table (the trace path stubs per-tensor addresses and --attributes_list is no longer a Simulator option), and refresh the CLAUDE.md TOG-generation notes.

YWHyuk force-pushed the feature/togsim-cpp-trace branch 3 times, most recently from 1151f6a to 7f70bbb, June 22, 2026 12:13

YWHyuk mentioned this pull request Jun 23, 2026

Dynamic shape on the C++ trace path (elementwise add) #269

Open

YWHyuk force-pushed the feature/togsim-cpp-trace branch 4 times, most recently from b7c1ec4 to 9033945, June 24, 2026 13:37

YWHyuk and others added 11 commits June 25, 2026 16:29

[Docs] C++ trace pipeline design (runtime-tag pairing, ABI)

c437316

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

[TOGSim] Work-item outlining and ABI v12 dispatch

551b2cb

[TOGSim] Make the trace runtime test self-contained

63c6d24

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

YWHyuk force-pushed the feature/togsim-cpp-trace branch from c166abd to ed5c747, June 25, 2026 07:29

YWHyuk and others added 10 commits June 25, 2026 16:49

[TOGSim] Drop the ABI version changelog comment

618e4fc

[Frontend] Condense BMM prologue spad-budget comments to one line

c9ad81f

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P5qH5qvM4STHKYA7eMAfmx

YWHyuk and others added 5 commits June 26, 2026 00:35

Merge pull request #278 from PSAL-POSTECH/fix/bmm-prologue-spad-budget

90b5560

[Frontend] Budget fused-prologue spad buffers in BMM tile selection

YWHyuk wants to merge 26 commits into develop from feature/togsim-cpp-trace
