TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers#267

Open
YWHyuk wants to merge 26 commits into develop from feature/togsim-cpp-trace

YWHyuk (Collaborator) commented Jun 19, 2026

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic
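
To make "shape-parametric trace producer" concrete, here is a minimal Python sketch of what running the compiled producer amounts to: a loop nest that depends only on the problem shape and emits a flat record stream. It is an illustration, not the generated C++. The `TraceRec` fields, the opcode strings, the `X_spad`/`W_spad` buffer names, and the 128-wide tiling are all assumptions; only `Y_spad` and the 256^3 GEMM come from this PR.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TraceRec:
    """One recorded trace entry (hypothetical layout, for illustration only)."""
    op: str                 # "LOAD", "MATMUL" or "STORE" (illustrative opcodes)
    buf: str                # SRAM buffer the op touches
    tile: Tuple[int, ...]   # tile origin in the iteration space


def produce_gemm_trace(M: int, N: int, K: int,
                       tm: int, tn: int, tk: int) -> List[TraceRec]:
    """Walk the tiled GEMM loop nest and record one entry per skeleton op."""
    trace: List[TraceRec] = []
    for m in range(0, M, tm):
        for n in range(0, N, tn):
            for k in range(0, K, tk):
                trace.append(TraceRec("LOAD", "X_spad", (m, k)))
                trace.append(TraceRec("LOAD", "W_spad", (k, n)))
                trace.append(TraceRec("MATMUL", "Y_spad", (m, n, k)))
            # One store per output tile, after the K reduction.
            trace.append(TraceRec("STORE", "Y_spad", (m, n)))
    return trace


# Same producer, different shapes: only the arguments change.
trace = produce_gemm_trace(256, 256, 256, 128, 128, 128)
n_matmul = sum(1 for r in trace if r.op == "MATMUL")
n_store = sum(1 for r in trace if r.op == "STORE")
```

In the real pipeline the loop nest is compiled into the .so and the records are delivered through EmitCtx callbacks; the sketch only shows why no per-shape graph file is needed.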

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

  • SRAM last-writer per buffer (load->compute, the Y_spad accumulator chain), recovered via SSA + a virtual SA_WEIGHTS buffer that folds preload->matmul.
  • The systolic array modeled as a pipeline (occupancy/latency split) with two explicit, distinctly-named barriers:
    • MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; compute that follows an async load waits for the data's response to complete.
    • COMPUTE_BAR (new): the compute fence; a store waits for all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.

Status

  • 256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
  • Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
  • Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

  • tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
  • Manual --trace_so GEMM through TOGSim.
  • Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

YWHyuk force-pushed the feature/togsim-cpp-trace branch 3 times, most recently from 1151f6a to 7f70bbb (June 22, 2026 12:13)
YWHyuk force-pushed the feature/togsim-cpp-trace branch 4 times, most recently from b7c1ec4 to 9033945 (June 24, 2026 13:37)
YWHyuk and others added 11 commits June 25, 2026 16:29
… feed

Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime,
loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and
the pipeline tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Per-record tag key in the bridge plus per-iteration tag alloc in
dma-fine-grained so multi-tile-K and conv loads do not collide; strip the
reduction accum marker from the memory_barrier slot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into
togsim_kernel_tile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
DMA-capacity throttle and frozen-state guard, per-core VMEM in the configs,
and the SA weight-buffer throttle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
trace_timeline.py with per-work-item grouping and resource-centric DMA lanes;
the trace logs the first DRAM response and the assigned systolic array, and
scopes the compute barrier to its dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Default to the trace path; fix uninitialized Instruction fields, the matmul
accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and
dma_wait fusion; bound concurrent dispatches to the spad, round-robin
work-items within a partition, benchmark autotune and run the multi-tenant
scheduler through the trace path, and emit trace.so for pooling/reduction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Carry simulator headers through the wrapper for cache-safe replay; drop verbose
[P3-trace] logs; fix the key.mlir compile race in load().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
… runtime model

Replace the trace bridge's accumulated special cases with one dataflow rule and
clean up the runtime that consumes it.

Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all
current writers (occupancy=ISSUE when both are systolic-array ops, else
latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a
matmul that reads and writes the same buffer, i.e. a commutative accumulator): skip
its read edge and UNION its write, waiting only on the non-matmul init seed and not
ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked the
SA weight-slot pipeline while keeping the init->matmul edge, and lets a vector
epilogue or the store wait on every K matmul (fixes the pure-vector store that an
empty COMPUTE_BAR let slip).
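
The dependency rule above can be sketched in a few lines of Python. This is a hedged model of the rule as stated, not the bridge code: the `Op` record, the `kind` strings, and the buffer names other than `SA_WEIGHTS` are assumptions made for the example, and the sketch leaves the init seed in the writers set after a UNION.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

SA_KINDS = {"preload", "matmul"}   # ops assumed to run on the systolic array


@dataclass
class Op:
    name: str
    kind: str                      # "dma", "vector", "preload" or "matmul"
    reads: Tuple[str, ...] = ()
    writes: Tuple[str, ...] = ()


def build_edges(ops: List[Op]) -> Set[Tuple[str, str, str]]:
    """Return (producer, consumer, event) edges, event in {"ISSUE", "DONE"}."""
    writers: Dict[str, Set[str]] = {}
    kinds = {op.name: op.kind for op in ops}
    edges: Set[Tuple[str, str, str]] = set()
    for op in ops:
        # is_mm_accum: a matmul that reads and writes the same buffer.
        accum = set(op.reads) & set(op.writes) if op.kind == "matmul" else set()
        for buf in op.reads:
            if buf in accum:
                continue                       # skip the accumulator read edge
            for w in writers.get(buf, ()):
                both_sa = kinds[w] in SA_KINDS and op.kind in SA_KINDS
                edges.add((w, op.name, "ISSUE" if both_sa else "DONE"))
        for buf in op.writes:
            if buf in accum:
                for w in writers.get(buf, ()):
                    if kinds[w] != "matmul":   # wait only on the init seed
                        edges.add((w, op.name, "DONE"))
                writers.setdefault(buf, set()).add(op.name)   # UNION
            else:
                writers[buf] = {op.name}                       # REPLACE
    return edges


ops = [
    Op("init", "vector", writes=("Y",)),
    Op("load_w0", "dma", writes=("W",)),
    Op("preload0", "preload", reads=("W",), writes=("SA_WEIGHTS",)),
    Op("mm0", "matmul", reads=("SA_WEIGHTS", "Y"), writes=("Y",)),
    Op("load_w1", "dma", writes=("W",)),
    Op("preload1", "preload", reads=("W",), writes=("SA_WEIGHTS",)),
    Op("mm1", "matmul", reads=("SA_WEIGHTS", "Y"), writes=("Y",)),
    Op("store", "dma", reads=("Y",)),
]
edges = build_edges(ops)
```

In this example the two matmuls are unordered with respect to each other, both wait on `init`, and the store joins every matmul writer of `Y` with a DONE edge, which is the behavior the rule describes.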

Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA
drain), so the store JOINs the matmul writers directly. The whole emit/loader
chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the
runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a
stale producer fails loudly instead of emitting records the bridge would drop.
Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue).

Model compute-output spad footprint in the SRAM version/capacity machinery so
buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes
from the DMA records that touch the same buffer (a buf_bytes pre-pass); an
in-place buffer (accumulator, relu) is version-transparent so footprint is not
double-counted. The occupy gate and version release sit in the MOVIN/MOVOUT/COMP
issue points (release before the COMP skip path so a skipped matmul still frees).

Runtime: collapse child_inst / _pipeline_children into one event-indexed
_deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot
release queue and the async-load wakeup into one _due_events timed-effect table
drained by process_due_events. Both are behavior-preserving (byte-identical).
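
As a rough picture of the collapsed runtime structures, the following Python sketch models an event-indexed dependency table and a timed-effect queue. Only the names `add_dep`, `fire`, `ISSUE`, `DONE` and `process_due_events` come from the commit message; the class layout, the heap, and the `ready` list are assumptions for illustration.

```python
import heapq
from typing import Callable, List, Tuple

ISSUE, DONE = "ISSUE", "DONE"


class Inst:
    def __init__(self, name: str) -> None:
        self.name = name
        self.ready_counter = 0
        self.deps = {ISSUE: [], DONE: []}   # children keyed by the event they wait on

    def add_dep(self, child: "Inst", on: str) -> None:
        """Make `child` wait until this instruction fires event `on`."""
        self.deps[on].append(child)
        child.ready_counter += 1

    def fire(self, event: str, ready: List[str]) -> None:
        """Release every child waiting on `event`; enqueue those now ready."""
        for child in self.deps[event]:
            child.ready_counter -= 1
            if child.ready_counter == 0:
                ready.append(child.name)
        self.deps[event].clear()


class DueEvents:
    """One table of timed effects (weight-slot release, async-load wakeup, ...)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Callable[[], None]]] = []
        self._seq = 0   # tie-breaker so callables are never compared

    def schedule(self, cycle: int, effect: Callable[[], None]) -> None:
        heapq.heappush(self._heap, (cycle, self._seq, effect))
        self._seq += 1

    def process_due_events(self, now: int) -> None:
        while self._heap and self._heap[0][0] <= now:
            _, _, effect = heapq.heappop(self._heap)
            effect()


preload, matmul, store = Inst("preload"), Inst("matmul"), Inst("store")
preload.add_dep(matmul, ISSUE)   # occupancy edge between systolic-array ops
matmul.add_dep(store, DONE)      # latency edge: the store waits for the drain

ready: List[str] = []
due = DueEvents()
preload.fire(ISSUE, ready)
ready_after_issue = list(ready)
due.schedule(10, lambda: matmul.fire(DONE, ready))
due.process_due_events(9)
ready_at_9 = list(ready)
due.process_due_events(10)
ready_at_10 = list(ready)
```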

Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at
init), and the round-robin disable mode is removed. Degenerate traces (a
consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather
than asserts that vanish under NDEBUG.

Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so
TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in
warns at command build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
… spad/2

The validation-binary spad-overflow check sat inside `if functional_mode:`, so in
timing-only / autotune (non-functional) runs an over-spad tiling was never
rejected and reached TOGSim, which wedged ("spad too small") and crashed the
compile via assert 0. Move the compile + check out of the functional gate (the
Spike execution itself stays gated, run_spike below) and budget per dispatch at
spad/2 -- two work-items run concurrently (double buffer), so each must fit half
the spad or they deadlock competing for it. This matches the GEMM tiling gate
(max_spad_size = spad/2), which pointwise ops lacked. Fixes the resnet /
test_scheduler wedge where a fused BatchNorm+ReLU tile exceeded spad/2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
YWHyuk force-pushed the feature/togsim-cpp-trace branch from c166abd to ed5c747 (June 25, 2026 07:29)
YWHyuk and others added 10 commits June 25, 2026 16:49
Every generated trace.cpp now opens with a "GENERATED ... DO NOT EDIT" banner
carrying the TOGSim ABI version and the togsim_* call formats, so a dumped trace
is self-documenting. Both are read from togsim_runtime.h at codegen time -- the
version from the TOGSIM_ABI_VERSION define, the call-format text from a marked
block next to the declarations -- so they never drift when the ABI changes.
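
A minimal sketch of how such a banner could be derived from the header at codegen time, so the two cannot drift. The header text, the `BEGIN-CALL-FORMATS`/`END-CALL-FORMATS` marker names, the version number, and the call signatures shown are all hypothetical; only `TOGSIM_ABI_VERSION` and the idea of a marked block come from the commit message.

```python
import re

# Stand-in for the contents of togsim_runtime.h (hypothetical text).
HEADER_TEXT = """\
#define TOGSIM_ABI_VERSION 12
// BEGIN-CALL-FORMATS
// togsim_dma_start(ctx, tag, addr, bytes)
// togsim_memory_barrier(ctx, tag)
// END-CALL-FORMATS
void togsim_dma_start(void *ctx, int tag, long addr, long bytes);
"""


def make_banner(header_text: str) -> str:
    """Build the generated-file banner from the header's own text."""
    version = re.search(r"^#define\s+TOGSIM_ABI_VERSION\s+(\d+)",
                        header_text, re.MULTILINE)
    if version is None:
        raise ValueError("TOGSIM_ABI_VERSION not found in header")
    block = re.search(r"// BEGIN-CALL-FORMATS\n(.*?)// END-CALL-FORMATS",
                      header_text, re.DOTALL)
    if block is None:
        raise ValueError("call-format block not found in header")
    lines = [
        "// GENERATED by the trace producer. DO NOT EDIT.",
        f"// TOGSim ABI version: {version.group(1)}",
    ]
    lines += block.group(1).splitlines()
    return "\n".join(lines) + "\n"


banner = make_banner(HEADER_TEXT)
```

Failing loudly when either marker is missing keeps a header refactor from silently producing an empty banner.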

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
…rint

The bridge sums each work-item's distinct-buffer footprint (each buffer once, so
a reduction's reloads of the same section do not inflate it) onto its Tile.
can_issue then admits two concurrent dispatches only when each fits half the
spad; a dispatch larger than spad/2 runs alone with the whole spad, so two
work-items never compete for the shared spad and deadlock -- a runtime safety net
beneath the codegen spad/2 gate. The footprint and resulting max_dispatch are
logged on the TILE_SCHEDULED trace line for debugging.
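
The admission rule can be sketched as follows. This is an illustrative Python model under stated assumptions: the function signatures and the way running tiles are tracked are invented for the example, and only the names `can_issue` and `max_dispatch` and the spad/2 rule come from the commit message. The 131072 B figure is the full per-lane spad size quoted elsewhere in this PR.

```python
from typing import Iterable, List, Tuple


def tile_footprint(records: Iterable[Tuple[str, int]]) -> int:
    """Sum each distinct buffer once, so reloads of a buffer do not inflate it."""
    sizes = {}
    for buf, nbytes in records:
        sizes[buf] = max(sizes.get(buf, 0), nbytes)
    return sum(sizes.values())


def max_dispatch(footprint: int, spad_bytes: int) -> int:
    """2 if the tile fits half the spad, else 1 (it runs alone)."""
    if footprint > spad_bytes:
        raise ValueError("tile does not fit the spad at all")
    return 2 if footprint <= spad_bytes // 2 else 1


def can_issue(new_fp: int, running_fps: List[int], spad_bytes: int) -> bool:
    """Admit a new dispatch only if no tile would have to share with an oversized one."""
    if any(max_dispatch(fp, spad_bytes) == 1 for fp in running_fps):
        return False                       # a solo tile owns the whole spad
    return len(running_fps) < max_dispatch(new_fp, spad_bytes)


SPAD = 131072
fp = tile_footprint([("X", 1024), ("W", 2048), ("X", 1024)])   # X counted once
```

With a 131072 B spad, a 62976 B tile may run alongside another of the same size, while an 89088 B tile is admitted only when nothing else is running.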

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
…election

select_tile fed the epilogue buffer count to gemm_combination_mapping as
max(n_extra_read - 2, 0), which optimistically assumed the X and W DMA buffers
were already freed and reusable by the epilogue. The codegen lays every buffer
out as a disjoint .spad global and never reuses freed space, so the estimate
undercounted the real footprint: for a fused matmul+relu kernel the budget came
to 64512 B/lane (treated as a bare GEMM) while the emitted kernel used 89088
B/lane including the relu output buffer. Under the full-spad guard this was
harmless (89088 < 131072), but the spad/2 guard rejected it and crashed
test_transformer_fusion with SpadOverflowError, since the heuristic path has no
tile-shrink fallback.

Pass n_extra_node + n_extra_read instead: one output-tile-sized buffer per
epilogue node plus one per extra read operand, matching what the codegen emits.
For the matmul+relu kernel the budget now equals the actual footprint exactly,
and tile selection picks TILE_M=128 (62976 B/lane) which fits spad/2.
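
The arithmetic behind this fix can be checked with a short sketch. The byte figures are the ones quoted above; everything else is an assumption for illustration: that the matmul+relu kernel has `n_extra_node = 1` and `n_extra_read = 0`, and that the 24576 B/lane difference between the two quoted footprints is exactly one output-tile-sized buffer. The real `gemm_combination_mapping` is not reproduced here.

```python
SPAD_PER_LANE = 131072            # full-spad guard, B/lane (from the text above)
HALF_SPAD = SPAD_PER_LANE // 2    # spad/2 guard

GEMM_BASE = 64512                 # bare-GEMM budget, B/lane (from the text above)
ACTUAL_FOOTPRINT = 89088          # emitted matmul+relu kernel, B/lane
EPILOGUE_BUF = ACTUAL_FOOTPRINT - GEMM_BASE   # assumed size of one epilogue buffer


def old_extra_buffers(n_extra_node: int, n_extra_read: int) -> int:
    """Old rule: assumes the X and W DMA buffers are freed and reused."""
    return max(n_extra_read - 2, 0)


def new_extra_buffers(n_extra_node: int, n_extra_read: int) -> int:
    """New rule: one buffer per epilogue node plus one per extra read operand."""
    return n_extra_node + n_extra_read


def budget(n_bufs: int) -> int:
    return GEMM_BASE + n_bufs * EPILOGUE_BUF


# Assumed operand counts for the fused matmul+relu kernel.
old_budget = budget(old_extra_buffers(1, 0))
new_budget = budget(new_extra_buffers(1, 0))

old_passes_half_guard = old_budget <= HALF_SPAD   # estimate slips under spad/2
new_passes_half_guard = new_budget <= HALF_SPAD   # real footprint does not
smaller_tile_fits = 62976 <= HALF_SPAD            # TILE_M=128 choice
```

Under these assumptions the old estimate passes the spad/2 gate while the emitted kernel does not, which is why the overflow surfaced only after code generation.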

Liveness-based spad reuse and broadcast over-allocation are tracked separately
as optimizations in issue #275.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The spad-overflow check summed the kernel stack frame into the per-lane
scratchpad usage (spad_usage = stack_size + spad_size) and rejected the
tiling when that exceeded the spad/2 budget. But only the .spad section
actually lives in the scratchpad -- it is pinned there by the
--section-start=.spad link option. The kernel stack is in main memory
(sp is set up by pk in the -m region, not at the scratchpad vaddr), so it
does not consume the scratchpad and must not be charged against it. The
scalar frame is also shared across lanes, not per-lane, so adding it
double-counted.

On small configs (8x8) this falsely rejected feasible tilings: the
wrapper3 conv2d/resnet/mistral kernels fit the 32 KB spad with room to
spare but were tripped over the spad/2 gate purely by the ~200-800 B
stack term, crashing compile with SpadOverflowError. Drop the stack
term; the .spad-only check still correctly rejects real buffer overflows
(e.g. sparsity, which is fixed separately by f05ac8a).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5qH5qvM4STHKYA7eMAfmx
Spike v1.0.3 zero-inits the MVIN/MVOUT DMA address buffer so ROUNDUP
padding entries are skipped instead of dereferencing uninitialized
garbage, fixing the host store segfault on 8-lane (wrapper3) configs
where a tile's split axis is not a multiple of vlane_stride*n_vu.

Add pytorchsim_functional_verify_per_kernel, a sub-option of
pytorchsim_functional_mode that localizes the first compiled kernel whose Spike
output diverges from a CPU reference.

When enabled, the generated wrapper compares every realized buffer (the output of
each fused kernel) against a CPU "golden" computed once per call by running the
original aten graph (V.graph.module) on CPU with the same inputs, via an
fx.Interpreter that records each node's output by name. A buffer is mapped to its
originating fx node through V.graph.get_buffer().origin_node, so the check reports
the kernel, the originating op, the offending indices and the max abs diff, then
raises at the first divergence (stop-at-first). Comparison granularity is the
fused-cluster output, the finest observable in a fused pipeline.

Auto-disabled when functional mode is off (no Spike values to verify); the config
accessor AND-gates the key with functional mode. Codegen bakes the
verify_init/verify_check calls into the wrapper only when enabled at compile time,
so clear the codegen cache when toggling. Tolerances via
TORCHSIM_FUNCTIONAL_VERIFY_RTOL / _ATOL (default 1e-4).
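
The stop-at-first comparison can be sketched without any framework dependencies. This is not the code in extension_functional_verify.py: the function signatures, the exception type, and the use of plain Python lists in place of tensors are assumptions. The environment variable names and the 1e-4 default are taken from the paragraph above, and the closeness test is the usual allclose rule `|a - g| <= atol + rtol * |g|`.

```python
import os
from typing import Dict, List, Sequence, Tuple


def tolerances(env=os.environ) -> Tuple[float, float]:
    """Read rtol/atol from the environment, defaulting to 1e-4."""
    rtol = float(env.get("TORCHSIM_FUNCTIONAL_VERIFY_RTOL", "1e-4"))
    atol = float(env.get("TORCHSIM_FUNCTIONAL_VERIFY_ATOL", "1e-4"))
    return rtol, atol


class KernelDivergence(AssertionError):
    """Raised at the first kernel whose output leaves the tolerance band."""


def verify_check(kernel: str, origin_op: str, actual: Sequence[float],
                 golden: Sequence[float], rtol: float, atol: float) -> None:
    bad = [i for i, (a, g) in enumerate(zip(actual, golden))
           if abs(a - g) > atol + rtol * abs(g)]
    if bad:
        max_diff = max(abs(a - g) for a, g in zip(actual, golden))
        raise KernelDivergence(
            f"{kernel} (origin op {origin_op}): {len(bad)} mismatches, "
            f"first indices {bad[:8]}, max abs diff {max_diff:.3e}")


def verify_all(buffers: List[Tuple[str, str, List[float]]],
               golden: Dict[str, List[float]], rtol: float, atol: float) -> None:
    """Check buffers in execution order; the first divergence raises."""
    for kernel, origin_op, actual in buffers:
        verify_check(kernel, origin_op, actual, golden[origin_op], rtol, atol)


buffers = [
    ("kernel0", "mm", [1.0, 2.0]),
    ("kernel1", "relu", [0.0, 3.5]),    # first divergence
    ("kernel2", "add", [9.0, 9.0]),     # also wrong, but never reached
]
golden = {"mm": [1.0, 2.0], "relu": [0.0, 3.0], "add": [0.0, 0.0]}

try:
    verify_all(buffers, golden, rtol=1e-4, atol=1e-4)
    msg = ""
except KernelDivergence as exc:
    msg = str(exc)
```

Stopping at the first divergence matters because later kernels consume the bad buffer, so their mismatches would only be noise.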

extension_functional_verify.py holds the graph registry and the runtime
golden/compare logic; mlir_codegen_backend injects the calls at the wrapper level;
extension_config reads the new YAML key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add docs/per-kernel-functional-verify.md (usage, options, mechanism,
limitations, code map) and point to it from CLAUDE.md: a one-line entry in the
debugging section and in the YAML knobs list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keep only a one-line note on what TOGSIM_ABI_VERSION guards; the per-bump
v1..v12 history was noise nobody reads.

The BMM tile selector fed only n_extra_node to gemm_combination_mapping and
dropped the prologue's extra-read operands entirely, so a softmax-fused
attention matmul (value^T @ softmax(scores)) was tiled as a bare BMM. The
codegen lays the softmax max/sum operands out as their own disjoint
weight-tile-sized .spad globals (buf3/buf4) and never reuses freed space, so
the estimate undercounted the real footprint: on the 32x32 wrapper2 config
(16 KB/lane spad/2 budget) tile selection picked TILE_N=512, whose emitted
kernel used 16896 B/lane and overflowed the budget by 512 B, crashing
test_transformer with SpadOverflowError. Under the 128x128 full-budget config
the slack hid it. This mirrors the epilogue n_extra_read gap fixed for the
GEMM template (commit f05ac8a), now on the BMM prologue path.

Add an n_prologue_extra_read knob to gemm_combination_mapping that charges
each extra prologue-read operand as one weight-tile-sized (TILE_K x TILE_N)
.spad buffer, matching what the codegen emits, and have the BMM template count
those operands the same way codegen does (the one numel-matching main input
reuses the matmul-operand buffer; every other read gets its own global). Tile
selection now rejects TILE_N=512 (16896 B/lane) and picks a fitting tile
(8704 B/lane), so wrapper2 test_transformer passes the full Spike + allclose
check. The new parameter defaults to 0, leaving the GEMM and conv callers
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5qH5qvM4STHKYA7eMAfmx
YWHyuk and others added 5 commits June 26, 2026 00:35
[Frontend] Budget fused-prologue spad buffers in BMM tile selection
…s off

format_dma_inst_issued_trace_line and format_instruction_detail_line are only
ever used as the message argument to trace_instruction_line (a spdlog::trace
sink). Because the argument is evaluated eagerly at the call site, the kernel
builds a formatted std::string for every issued/finished instruction and then
spdlog drops it whenever the level is above trace (the default). Bail out of
both formatters when trace logging is disabled so the fmt::format work is
skipped entirely.
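
The same eager-argument pitfall exists in any language where the message is built before the logger sees it. The sketch below is a Python analogue using the standard `logging` module, not the spdlog code in this commit: DEBUG stands in for spdlog's trace level, and the formatter names and instruction fields are hypothetical.

```python
import logging

logger = logging.getLogger("togsim_sketch")
logger.setLevel(logging.INFO)            # DEBUG (our stand-in for trace) is off

work = {"eager": 0, "guarded": 0}        # counts how often a string is built


def format_eager(inst: dict) -> str:
    """Always formats, even when the sink will drop the message."""
    work["eager"] += 1
    return f"opcode={inst['opcode']} addr={inst['addr']:#x}"


def format_guarded(inst: dict) -> str:
    """Bails out before formatting when the level is disabled."""
    if not logger.isEnabledFor(logging.DEBUG):
        return ""
    work["guarded"] += 1
    return f"opcode={inst['opcode']} addr={inst['addr']:#x}"


for i in range(1000):
    inst = {"opcode": "MOVIN", "addr": 0x1000 + i}
    logger.debug(format_eager(inst))     # argument evaluated before the call
    logger.debug(format_guarded(inst))   # formatter skips the work itself
```

Putting the guard inside the formatter, as the commit does, means no call site has to change.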

This removes wasted work but is not the simulation-speed bottleneck (that is the
per-cycle ready-queue rescan in Core::cycle, addressed separately); the conv
kernel wall time is unchanged within noise.

Core::cycle() walked every tile's ready-instruction list on every simulated
cycle to find one instruction to issue. Profiling an 8x8 conv kernel showed this
scan is ~96% of simulation time and, within it, the walk is >99%: the ready list
holds ~2000 instructions (mostly blocked on the SRAM-capacity / weight-slot
throttle), only one issues per cycle, and the blocked ones are never removed, so
the same ~2000 are re-examined every cycle (~2.6k list iterations/cycle, ~1e9
over 400k cycles). At small arrays this dominates because tiny tiles inflate both
the ready-list length and the number of DMA-wait stall cycles.

Gate the scan behind a per-Core _issue_dirty flag. The scan's outcome can only
change when the ready set grows or a resource frees, so set the flag on exactly
those events:
 - ready set grows: Instruction::dec_ready_counter, on enqueueing a now-ready
   instruction, sets the OWNING core's _issue_dirty via _owner_dirty_ref (wired
   in Core::issue when the tile is dispatched) -- per-core so a remote core's
   enqueue does not force every core to rescan;
 - SRAM freed (release_sram) and weight slot freed (apply_due) set _issue_dirty;
 - a new dispatch (issue) and a successful issue keep it set.
On a cycle with none of these the scan would re-walk the same blocked
instructions and issue nothing, so it is skipped. Instructions are never moved
between queues, so there is no re-admission churn.
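
A toy model makes the gating argument easy to check. The Python sketch below is not the Core::cycle implementation: the two resource kinds, the event schedule, and the function signature are assumptions chosen so that the gated and ungated runs can be compared directly. Only the `_issue_dirty` idea (set the flag when the ready set or a resource changes, and keep it set after a successful issue) is taken from the commit message.

```python
from typing import Dict, List, Tuple


def simulate(ready: List[Tuple[str, str]], free_events: Dict[int, str],
             cycles: int, gated: bool):
    """Issue at most one instruction per cycle; return (issue log, scan count).

    ready:       (name, resource kind needed) in queue order
    free_events: cycle -> resource kind that gains one free unit
    """
    ready = list(ready)
    units = {"sram": 0, "slot": 0}
    dirty = True                       # a new dispatch starts dirty
    issued: List[Tuple[int, str]] = []
    scans = 0
    for cycle in range(cycles):
        if cycle in free_events:
            units[free_events[cycle]] += 1
            dirty = True               # a resource freed: outcome may change
        if gated and not dirty:
            continue                   # the scan would find nothing new
        scans += 1
        dirty = False
        for i, (name, kind) in enumerate(ready):
            if units[kind] > 0:
                units[kind] -= 1
                issued.append((cycle, name))
                del ready[i]
                dirty = True           # a successful issue keeps the flag set
                break
    return issued, scans


ready = [("i0", "sram"), ("i1", "slot"), ("i2", "sram"), ("i3", "slot")]
free_events = {3: "slot", 10: "sram", 11: "sram", 20: "slot"}

full_issued, full_scans = simulate(ready, free_events, 30, gated=False)
gated_issued, gated_scans = simulate(ready, free_events, 30, gated=True)
```

Both runs issue the same instructions on the same cycles; the gated run simply performs far fewer scans.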

Issue-identical: it never changes which instruction issues or when, only whether
the (side-effect-free under those conditions) scan runs. The 8x8 conv kernel
produces the same 403026 cycles as before, with its TOGSim wall time dropping from
214.5s to 41.9s (~5x, 71% of scans skipped). A forced-scan invariant check (no
skipped cycle ever issues or inline-finishes a zero-cycle COMP) found 0
violations across 39 kernels spanning GEMM, BMM, softmax and conv. DMA-bound
kernels (the ones that hit the 6h CI cap) stall more and gain more.

In Core::cycle()'s issue scan, a COMP with compute_cycle == 0 finishes inline and
calls instructions.erase(it), but does not set `issued` or break -- so control
falls through to the `it++` at the end of the loop body, incrementing the
std::list iterator that erase() just invalidated. That reads a freed node and is
undefined behavior; it usually limps along because the freed node's next pointer
is briefly intact, but under a different allocator / sanitizer / build it can
crash or skip-or-duplicate the remaining ready instructions in that tile,
corrupting issue order.

Use the iterator erase() returns and continue, the standard erase-in-loop idiom.
No behavior change on the path that happened to work; it just removes the UB.
…X TOG

The main compile/sim path no longer generates or selects the legacy ONNX
Tile-Operation-Graph. extension_codecache emits only trace.so + trace_cycles.tsv
(the build-skip now keys on trace.so), and TOGSimulator.run_standalone always
drives TOGSim with --trace_so. The TORCHSIM_LEGACY_TOG opt-in is removed from the
frontend. The ONNX --models_list branch is kept solely for the STONNE sparse path
(extension_op.py); TOGSim's C++ ONNX parser is untouched (separate PR).

origins (which FX nodes a kernel came from) is preserved: logged per kernel run
and recorded as a trailing "# origins:" line in trace_cycles.tsv -- the legacy
ONNX TOG carried this as node metadata, and the C++ cycle-table loader stops at
the comment so the current parser is unaffected.
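
A sketch of why the trailing comment is safe for the existing loader. The real trace_cycles.tsv column layout is not shown in this PR, so the two-column `key<TAB>cycles` rows and the comma-separated origins list below are assumptions; only the `# origins:` prefix and the stop-at-comment behavior come from the text above.

```python
from typing import List, Tuple

# Hypothetical file contents, for illustration only.
TSV_TEXT = "matmul_0\t128\nrelu_0\t16\n# origins: mm, relu\n"


def load_cycle_table(text: str) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Parse rows until the first comment line; pick up origins if present."""
    rows: List[Tuple[str, int]] = []
    origins: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            prefix = "# origins:"
            if line.startswith(prefix):
                origins = [o.strip() for o in line[len(prefix):].split(",")
                           if o.strip()]
            break                      # the table ends at the comment
        if not line.strip():
            continue
        key, cycles = line.split("\t")
        rows.append((key, int(cycles)))
    return rows, origins


rows, origins = load_cycle_table(TSV_TEXT)
```

A loader that ignores the comment entirely gets the same `rows`, so consumers that do not care about origins are unaffected.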

Also drop the dead tog_file param from mlir_gem5_compile_command, migrate
scripts/chiplet.sh to --trace_so/--cycle_table (the trace path stubs per-tensor
addresses and --attributes_list is no longer a Simulator option), and refresh
the CLAUDE.md TOG-generation notes.