26 commits
c437316
[Docs] C++ trace pipeline design (runtime-tag pairing, ABI)
YWHyuk Jun 24, 2026
29a4c34
[TOGSim] C++ trace pipeline: front end, runtime, loader, bridge, Core…
YWHyuk Jun 24, 2026
b2d73da
[TOGSim] Per-iteration tag pairing for multi-tile-K and conv
YWHyuk Jun 24, 2026
551b2cb
[TOGSim] Work-item outlining and ABI v12 dispatch
YWHyuk Jun 24, 2026
11716ec
[TOGSim] SRAM-capacity and SA weight-buffer throttle for the trace path
YWHyuk Jun 24, 2026
f1f5ec0
[Tooling] TOGSim trace timeline (Perfetto) and the trace emits it needs
YWHyuk Jun 24, 2026
83d0f2f
[TOGSim] Make the C++ trace path the default and stabilize it
YWHyuk Jun 24, 2026
63c6d24
[TOGSim] Make the trace runtime test self-contained
YWHyuk Jun 24, 2026
1512705
[Frontend] Trace cache-safe replay and compile-race fixes
YWHyuk Jun 24, 2026
064bb27
[TOGSim] Redesign trace-bridge dependency, barrier, SRAM-version, and…
YWHyuk Jun 24, 2026
ed5c747
[Frontend] Run the spad-overflow check in timing-only mode, budget at…
YWHyuk Jun 25, 2026
4e459b9
[Frontend] Generate the trace.cpp ABI/API banner from togsim_runtime.h
YWHyuk Jun 25, 2026
df61f76
[TOGSim] Pick 1- vs 2-dispatch concurrency by per-dispatch spad footp…
YWHyuk Jun 25, 2026
f05ac8a
[Frontend] Budget fused-epilogue spad buffers honestly in GEMM tile s…
YWHyuk Jun 25, 2026
7148a48
[Frontend] Stop charging the kernel stack frame against the spad budget
YWHyuk Jun 25, 2026
4f9018b
[CI] Bump spike pin to v1.0.3
YWHyuk Jun 25, 2026
e2d5608
[Frontend] Add per-kernel CPU functional verify sub-option
YWHyuk Jun 25, 2026
8b36968
[Docs] Document per-kernel functional verify
YWHyuk Jun 25, 2026
618e4fc
[TOGSim] Drop the ABI version changelog comment
YWHyuk Jun 25, 2026
5c65b78
[Frontend] Budget fused-prologue spad buffers in BMM tile selection
YWHyuk Jun 25, 2026
c9ad81f
[Frontend] Condense BMM prologue spad-budget comments to one line
YWHyuk Jun 25, 2026
90b5560
Merge pull request #278 from PSAL-POSTECH/fix/bmm-prologue-spad-budget
YWHyuk Jun 25, 2026
c790fa2
[TOGSim] Skip per-instruction trace string build when trace logging i…
YWHyuk Jun 26, 2026
f9732e6
[TOGSim] Skip the per-cycle issue scan when nothing can newly issue
YWHyuk Jun 26, 2026
2c193b6
[TOGSim] Fix use-after-erase in the zero-cycle COMP skip path
YWHyuk Jun 26, 2026
6160411
[Frontend] Make the C++ trace the sole main TOG path; drop legacy ONN…
YWHyuk Jun 25, 2026
6 changes: 6 additions & 0 deletions AsmParser/tog_generator.py
@@ -1,3 +1,9 @@
+# DEPRECATED (timing path): legacy ONNX Tile-Operation-Graph producer. Builds
+# the TOG and serializes it to ONNX for the C++ TileGraphParser. Superseded by
+# the C++ trace pipeline (PyTorchSimFrontend/mlir/passes/build_skeleton.py +
+# lower_to_emitc.py + cycle_table.py -> a compiled trace .so). Kept live so the
+# current pipeline does not break; to be retired once the trace pipeline (P3+)
+# stabilizes. See docs/design/togsim_cpp_trace.md.
import os
import sys
import importlib.util
12 changes: 10 additions & 2 deletions CLAUDE.md
@@ -21,7 +21,7 @@ The pipeline runs in that order on every `torch.compile` invocation; you'll see
| `Simulator/simulator.py` | Python drivers: `FunctionalSimulator` (Spike), `CycleSimulator` (Gem5), `TOGSimulator` (the cycle-accurate one + multi-tenant context manager) |
| `Scheduler/scheduler.py` | Poisson arrival generator + scheduling utilities for multi-tenant runs |
| `TOGSim/` | C++ TOGSim source. `src/Simulator.cc`, `Core.cc`, `Dram.cc`, `Interconnect.cc`, `L2Cache.cc`, `Tile.cc`, `TileGraph.cc` are the core models. Externals: ramulator2, booksim, stonneCore, onnx, protobuf, spdlog, yaml-cpp |
-| `AsmParser/` | `tog_generator.py`, `onnx_utility.py` — TOG generation from ONNX/ASM |
+| `AsmParser/` | `tog_generator.py`, `onnx_utility.py` — legacy ONNX TOG generation; now used only by the STONNE sparse path (the main path emits a C++ `trace.so` instead) |
| `configs/` | TOGSim hardware configs (YAML). The default is `systolic_ws_128x128_c1_simple_noc_tpuv3.yml`. Naming pattern: `systolic_ws_<size>_c<cores>_<noc>_<target>.yml` |
| `tests/` | Op- and model-level tests organized under `ops/<family>/` (elementwise, reduce, gemm, conv, attention, view, sort, sparsity, misc, fusion), `models/<name>/` (Llama, Mixtral8x7B, DeepSeek, Diffusion, MoE, MLP, MobileNet, Yolov5) plus single-file model tests (test_resnet, test_transformer, test_vit, test_mlp, test_single_perceptron), and `system/` (scheduler, eager, hetro, stonne, vectorops). Shared helper: `tests/_utils.py` |
| `experiments/artifact/` | Paper reproduction scripts (`cycle_validation/run_cycle.sh`, `speedup/run_speedup.sh`) |
@@ -58,6 +58,12 @@ export TORCHSIM_DUMP_MLIR_IR=1
export TORCHSIM_DUMP_LLVM_IR=1
```

+**To find which op a wrong result first diverges at** (per-kernel CPU cross-check;
+sub-option of functional mode). Set `pytorchsim_functional_verify_per_kernel: 1`
+in the config YAML, clear the codegen cache, and re-run: each compiled kernel's
+output is compared to a CPU golden and the run stops at the first divergent
+kernel, naming the op and offending indices. See `docs/per-kernel-functional-verify.md`.
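
The cross-check described above can be pictured with a minimal sketch. It is not the project's implementation: `first_divergence`, its tolerance arguments, and the flat-list tensor representation are hypothetical names chosen purely for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def first_divergence(
    kernels: Sequence[Tuple[str, List[float], List[float]]],
    atol: float = 1e-5,
    max_reported: int = 8,
) -> Optional[Dict[str, object]]:
    """Return a report for the first kernel whose output differs from its CPU golden.

    Each entry is (op_name, simulated_output, cpu_golden), in execution order.
    Hypothetical helper: the real check lives in the frontend, not here.
    """
    for op_name, simulated, golden in kernels:
        if len(simulated) != len(golden):
            return {"op": op_name, "indices": [], "reason": "length mismatch"}
        # Collect the positions whose absolute error exceeds the tolerance.
        bad = [i for i, (s, g) in enumerate(zip(simulated, golden)) if abs(s - g) > atol]
        if bad:
            # Stop at the first divergent kernel; later kernels are not inspected.
            return {"op": op_name, "indices": bad[:max_reported], "reason": "value mismatch"}
    return None


run = [
    ("add", [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ("gemm", [4.0, 5.5, 6.0, 7.5], [4.0, 5.0, 6.0, 7.0]),
    ("relu", [0.0, 9.0], [0.0, 1.0]),  # never reached: the scan stops at "gemm"
]
report = first_divergence(run)
print(report)
```

Stopping at the first divergent kernel matters because every later kernel consumes the bad output, so its mismatches would only be noise.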

## Key environment variables

Read in `PyTorchSimFrontend/extension_config.py`:
@@ -85,11 +91,13 @@ Note: `TOGSIM_CONFIG` is **overwritten** while inside a `with TOGSimulator(confi
Located under `configs/*.yml`:

- `num_cores`, `core_freq_mhz`, `num_systolic_array_per_core`
+- `sa_weight_buffer_depth` (per-SA resident weight slots; **must be > 0** — the simulator errors on 0. Raise it to effectively disable the preload run-ahead throttle. Defaults to 2 if the key is absent.)
- `vpu_num_lanes`, `vpu_spad_size_kb_per_lane`, `vpu_vector_length_bits`
- `dram_type` (`ramulator2` | `simple`), `dram_channels`, `dram_freq_mhz`, `ramulator_config_path`
- `icnt_type` (`simple` | `booksim`), `icnt_latency_cycles`, `icnt_freq_mhz`, `icnt_config_path`
- `l2d_type` (e.g., `datacache`), `l2d_config` (AccelSim-format cache config string)
- `pytorchsim_functional_mode` (Spike on/off), `pytorchsim_timing_mode`
+- `pytorchsim_functional_verify_per_kernel` (debug: per-kernel CPU cross-check; see `docs/per-kernel-functional-verify.md`)
- `codegen_mapping_strategy`: `heuristic` | `autotune` | `external-then-heuristic` | `external-then-autotune`
- `codegen_external_mapping_file` (key `"M_N_K"` → `{TILE_M, TILE_K, TILE_N}` JSON)
- `codegen_compiler_optimization`: `"all"` | `"none"` | a list from `{fusion, reduction_epilogue, reduction_reduction, prologue, single_batch_conv, multi_tile_conv, subtile}`
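
Two of the keys above carry rules that are easy to get wrong, so here is a hedged sketch of both. The helper names are hypothetical, and two things are assumptions rather than facts from the source: that the `"M_N_K"` key is the three GEMM dimensions joined by underscores, and that the check applies to a config already loaded into a plain dict.

```python
import json

# Default stated in the list above for when the key is absent.
DEFAULT_SA_WEIGHT_BUFFER_DEPTH = 2


def resolve_sa_weight_buffer_depth(config: dict) -> int:
    """Mirror the documented rule: default to 2, reject values <= 0.

    Hypothetical helper; the real check is done by the simulator itself.
    """
    depth = config.get("sa_weight_buffer_depth", DEFAULT_SA_WEIGHT_BUFFER_DEPTH)
    if not isinstance(depth, int) or depth <= 0:
        raise ValueError(f"sa_weight_buffer_depth must be > 0, got {depth!r}")
    return depth


def mapping_entry(m: int, n: int, k: int, tile_m: int, tile_n: int, tile_k: int):
    """Build one external-mapping entry.

    Assumption: the key is literally f"{M}_{N}_{K}".
    """
    return f"{m}_{n}_{k}", {"TILE_M": tile_m, "TILE_K": tile_k, "TILE_N": tile_n}


key, tiles = mapping_entry(512, 512, 1024, tile_m=128, tile_n=128, tile_k=256)
mapping_json = json.dumps({key: tiles}, indent=2)
print(mapping_json)
print(resolve_sa_weight_buffer_depth({}))  # key absent, so the default applies
```

The resulting JSON string is what a file named by `codegen_external_mapping_file` might contain under these assumptions.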
@@ -122,7 +130,7 @@ Conan deps for TOGSim: `boost/1.79.0`, `robin-hood-hashing/3.11.5`, `spdlog/1.11
- **Adding a new op (Inductor lowering):** `PyTorchSimFrontend/mlir/mlir_ops.py`, `mlir_lowering.py`, plus a new `mlir_<op>_template.py` if it needs its own MLIR template. Decomposition rules: `mlir_decomposition.py`. Scheduling: `mlir_scheduling.py`. Autotune: `mlir_autotune.py`.
- **Adding a PyTorch device op:** `PyTorchSimDevice/csrc/aten/native/*` (Minimal/Extra split mirrors `torch_openreg`).
- **TOGSim hardware model changes:** `TOGSim/src/{Core,Dram,Interconnect,L2Cache,Tile,TileGraph}.cc` + matching `include/*.h`.
-- **TOG generation:** `AsmParser/tog_generator.py` builds the raw graph and serializes it via `AsmParser/onnx_utility.py` to **ONNX, which is the on-disk TOG format** consumed by TOGSim.
+- **TOG generation:** the main path compiles each kernel to a C++ **`trace.so`** (`mlir/passes/build_skeleton.py` + `lower_to_emitc.py`) plus a `trace_cycles.tsv` cycle table, which TOGSim turns into a TileGraph via `trace_to_tilegraph`. `AsmParser/tog_generator.py` + `onnx_utility.py` (the legacy ONNX TOG) remain only for the **STONNE sparse path** (`extension_op.py`).
- **Eager fallback registration:** `torch.npu.register_eager_to_compile([...])` — see `tests/system/test_eager.py`.
- **Per-run results:** `togsim_results/<YYYYMMDD_HHMMSS_<hash>>.log` (stats) and `.trace` (instruction trace). The path is also printed at the end of every run.
- **Wrapper codegen path:** printed as `Wrapper Codegen Path = /tmp/torchinductor_<user>/<hash>/...py` — useful for inspecting generated kernel code and tensor names for `SRAM_BUFFER_PLAN_PATH`.
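
Because the per-run result names start with a zero-padded `YYYYMMDD_HHMMSS` timestamp, sorting them by name also sorts them by time. The sketch below uses that to pick the newest `.log` and its sibling `.trace`. `latest_run` is a hypothetical helper, not part of the repository, and it assumes both files share the same stem.

```python
from pathlib import Path
from typing import Optional, Tuple


def latest_run(results_dir: str = "togsim_results") -> Optional[Tuple[Path, Path]]:
    """Return (stats log, instruction trace) paths for the most recent run.

    Relies on the timestamp prefix making lexicographic order chronological.
    Returns None when the directory holds no .log files.
    """
    logs = sorted(Path(results_dir).glob("*.log"), key=lambda p: p.name)
    if not logs:
        return None
    log = logs[-1]
    # Assumption: the trace sits next to the log with the same stem.
    return log, log.with_suffix(".trace")


if __name__ == "__main__":
    found = latest_run()
    if found is None:
        print("no runs found")
    else:
        print(f"log: {found[0]}\ntrace: {found[1]}")
```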