Skip to content

[WIP] Fix: clean up "version" fields in L2 swimlane / dep_gen JSON#862

Open
indigo1973 wants to merge 1 commit into
hw-native-sys:mainfrom
indigo1973:dep_0525_v2
Open

[WIP] Fix: clean up "version" fields in L2 swimlane / dep_gen JSON#862
indigo1973 wants to merge 1 commit into
hw-native-sys:mainfrom
indigo1973:dep_0525_v2

Conversation

@indigo1973
Copy link
Copy Markdown
Contributor

deps.json and l2_perf_records.json both carried a "version" field that consumers were getting wrong:

  • deps.json bumped v2 → v3 in Refactor: a2a3 + a5 Tensor to strided (stride + start_offset) model #808 but swimlane_converter still guarded on version != 2, silently rejected every fresh capture, and fell back to L2PerfRecord::fanout[] — losing the race-window edges dep_gen replay exists to recover.

  • l2_perf_records.json's "version" was never a schema version — the producer writes L2PerfLevel (1..4) there. Misreading it caused swimlane_converter._print_verbose_data_info and sched_overhead_analysis to short-circuit on version != 2 / < 2, while phase blocks only exist at level >= 3.

deps.json — drop the "version" field
Producer no longer emits it; deps_to_graph drops its version guard (same release as the producer; KeyError is a clearer failure than a synthetic guard); test_dep_gen + test_dep_gen_chain drop the version assertion; dep_gen_replay.{cpp,h}, docs/dfx/dep_gen.md, and tools/README.md drop the v2/v3 schema labels. dep_gen.md §4 example JSON, fields table, and §5 arg-row description are rewritten against the current strided-Tensor producer (buffer_numel replaces raw_shapes; start_offset + strides[] replace multi-dim offset[]).

l2_perf_records.json — rename "version" → "l2_perf_level"
Producer (a2a3 + a5) writes the new name; swimlane_converter.
read_perf_data + verbose print + _swimlane_validate.py follow.
Both misaligned short-circuits removed: load_deps_json's guard
outright (only reads edges[].pred / .succ, stable across every
schema), _print_verbose_data_info's version != 2, and
parse_scheduler_from_json_phases's version < 2 (the
if not phases_by_thread check below was already correct).

Doc / comment fallout — keep code, comments, and docs in sync per .claude/rules/doc-consistency.md:

  • swimlane_converter / sched_overhead_analysis / profiling_levels.md (a2a3 + a5): "(version 2)" / "v2 JSON" wording → explicit l2_perf_level >= N.
  • 6 scheduler comments (a2a3 + a5: scheduler_dispatch.cpp, scheduler_cold_path.cpp, scheduler_types.h) describing "v2 JSON dispatch records" now point at dispatch-phase records in aicpu_scheduler_phases[].
  • tools/README.md Input File Format example is now a real two-task slice; the stale stub had wrong field names and missed ring_id / dispatch_time_us / finish_time_us. Troubleshooting "Unsupported version" entry renamed to "Unsupported l2_perf_level".
  • docs/dfx/l2-swimlane-profiling.md: level table, per-task table, phase-record table, and prose all corrected — start_time → us, core_type 0/1 → "aic"/"aiv", phase_id SCHED* → lowercase phase strings, plus ring_id / pop_hit / loop_iter vs submit_idx / core_to_thread[] additions.

deps.json and l2_perf_records.json both carried a "version" field
that consumers were getting wrong:

- deps.json bumped v2 → v3 in hw-native-sys#808 but swimlane_converter still
  guarded on `version != 2`, silently rejected every fresh capture,
  and fell back to L2PerfRecord::fanout[] — losing the race-window
  edges dep_gen replay exists to recover.

- l2_perf_records.json's "version" was never a schema version — the
  producer writes L2PerfLevel (1..4) there. Misreading it caused
  swimlane_converter._print_verbose_data_info and
  sched_overhead_analysis to short-circuit on `version != 2` /
  `< 2`, while phase blocks only exist at level >= 3.

deps.json — drop the "version" field
  Producer no longer emits it; deps_to_graph drops its version guard
  (same release as the producer; KeyError is a clearer failure than
  a synthetic guard); test_dep_gen + test_dep_gen_chain drop the
  version assertion; dep_gen_replay.{cpp,h}, docs/dfx/dep_gen.md,
  and tools/README.md drop the v2/v3 schema labels. dep_gen.md §4
  example JSON, fields table, and §5 arg-row description are
  rewritten against the current strided-Tensor producer (buffer_numel
  replaces raw_shapes; start_offset + strides[] replace multi-dim
  offset[]).

l2_perf_records.json — rename "version" → "l2_perf_level"
  Producer (a2a3 + a5) writes the new name; swimlane_converter.
  read_perf_data + verbose print + _swimlane_validate.py follow.
  Both misaligned short-circuits removed: load_deps_json's guard
  outright (only reads edges[].pred / .succ, stable across every
  schema), _print_verbose_data_info's `version != 2`, and
  parse_scheduler_from_json_phases's `version < 2` (the
  `if not phases_by_thread` check below was already correct).

Doc / comment fallout — keep code, comments, and docs in sync per
.claude/rules/doc-consistency.md:
- swimlane_converter / sched_overhead_analysis / profiling_levels.md
  (a2a3 + a5): "(version 2)" / "v2 JSON" wording → explicit
  l2_perf_level >= N.
- 6 scheduler comments (a2a3 + a5: scheduler_dispatch.cpp,
  scheduler_cold_path.cpp, scheduler_types.h) describing "v2 JSON
  dispatch records" now point at dispatch-phase records in
  aicpu_scheduler_phases[].
- tools/README.md Input File Format example is now a real two-task
  slice; the stale stub had wrong field names and missed ring_id /
  dispatch_time_us / finish_time_us. Troubleshooting "Unsupported
  version" entry renamed to "Unsupported l2_perf_level".
- docs/dfx/l2-swimlane-profiling.md: level table, per-task table,
  phase-record table, and prose all corrected — start_time → _us,
  core_type 0/1 → "aic"/"aiv", phase_id SCHED_* → lowercase phase
  strings, plus ring_id / pop_hit / loop_iter vs submit_idx /
  core_to_thread[] additions.
@indigo1973 indigo1973 changed the title Fix: clean up "version" fields in L2 swimlane / dep_gen JSON [WIP] Fix: clean up "version" fields in L2 swimlane / dep_gen JSON May 26, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the profiling and dependency graph schemas, replacing the legacy "version" fields in deps.json and l2_perf_records.json with a descriptive l2_perf_level and a strided-tensor representation. The changes update the documentation, downstream analysis tools, tests, and C++ collectors/replay logic to use buffer_numel, start_offset, and strides instead of raw shapes and offsets. One issue was identified in dep_gen_replay.cpp where strides are implemented and serialized as unsigned integers (uint32_t) despite being documented as signed int32 arrays, which could cause serialization issues for negative strides.

Comment on lines +312 to +314
// Strided tensor representation. tensors[].buffer_numel is the underlying
// storage element count; tasks[].args[] and edges[] carry per-slice
// geometry as (start_offset uint64, strides[] int32).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment states that strides[] is an int32 array, and the documentation in dep_gen.md also specifies int32 array for strides. However, the implementation in TaskArgEntry and EdgeAnnot defines strides as uint32_t arrays, and they are serialized using write_uint_array. If any negative strides are present (e.g., for reversed dimensions), they will be serialized as large unsigned integers (e.g., 4294967295), causing incorrect behavior in downstream consumers.

Consider changing the strides fields in TaskArgEntry and EdgeAnnot to int32_t and implementing/using a signed write_int_array helper to serialize them correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant