Refactor: AICore-as-producer for L2 swimlane (skip per-task staging read) by hw-native-sys-bot · Pull Request #878 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-27T14:27:00Z

Summary

AICore-as-producer with rotation for the L2 swimlane profiler:

AICore writes a slim 32 B L2PerfAicoreRecord (start, end, reg dispatch token) into a small per-core L2PerfAicoreBuffer (1024 records / 32 KB).
AICPU rotates the buffer at every PLATFORM_AICORE_BUFFER_SIZE-th dispatch — immediately before write_reg(DATA_MAIN_BASE). The runtime's completion-before-dispatch invariant guarantees AICore has finished writing all prior records into the old buffer before AICPU enqueues it to the per-thread ready queue (kind is_phase=2).
AICore picks up rotations by dcci-ing a per-core AicoreRotation cache line per task and comparing the published generation to its locally cached copy. Negligible cost relative to the baseline dcci(payload, ENTIRE_DATA_CACHE) AICore already pays per task.
AICPU's complete_record path drops the per-task rmb() + staging cache-line read entirely; it only writes AICPU-owned fields (task_id, reg_task_id, func_id, core_type, dispatch_time, finish_time), leaving start/end zero on-device.
Host's join_aicore_records() builds a reg_task_id → (start, end) map per core from the rotated multi-buffer AICore stream and patches L2PerfRecords at flush time before export_swimlane_json().
Session length is unbounded by per-core buffer sum — BufferPoolManager recycles AICore buffers via the ready queue while the session runs, same as the existing AICPU pool.
L2PerfRecord gains a reg_task_id field in existing pad space (no struct size change). TypedBuffer<Record, N> templates L2PerfBuffer and the new L2PerfAicoreBuffer. Drops PLATFORM_L2_AICORE_RING_SIZE; legacy aicore_ring_ptr / mismatch_record_count fields kept for ABI continuity.
Scoped to a2a3 only — a5 has its own l2_perf_profiling.h and retains the legacy staging-ring design; mirror can land in a follow-up.

Hardware bench

paged_attention_unroll Case1, swimlane=4, device 0, n=15 (rotation) / 10 (baselines), matched conditions:

Build	n	sched (µs)	orch (µs)	total (µs)
upstream/main, swimlane=4	10	1178.14 ± 6.42	947.06	1178.25
upstream/main, swimlane=0	10	1164.37 ± 5.84	815.30	1164.46
this PR (rotation), swimlane=4	15	1173.80 ± 6.64	927.81	1173.85

Δ vs upstream/main swimlane=4: sched −4.34 µs, orch −19.25 µs, total −4.40 µs. Profiling overhead vs swimlane=0 baseline drops from ~150 µs to ~121 µs.

The rotation design pays ~3 µs more sched than the no-rotation predecessor (which had this PR's earlier commit) — that's the price for unbounded session length and no wrap loss.

Testing

tests/st/a2a3/tensormap_and_ringbuffer/dfx/ — all 7 tests pass (dep_gen, dep_gen_chain, l2_swimlane, l2_swimlane_mixed, pmu, tensor_dump × 2)
paged_attention_unroll Case1 PASSED with swimlane=4 (1024 tasks, all start_time_us > 0 after join, durations 1.2–67 µs)
CI on hardware

coderabbitai · 2026-05-27T14:27:16Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5d34ba7f-5f6f-4c1c-b0da-9b071fbc5f36

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR refactors the L2 swimlane performance profiling buffer architecture from ring-based staging to buffer-based staging. AICore now writes slim timing records into per-core buffers indexed by task ID. AICPU writes only metadata and leaves timing zero. The host post-processor joins records by matching register task IDs to patch in timing data.

Changes

L2 Profiling Buffer Architecture Refactor

Layer / File(s)	Summary
Data Contracts and Type Definitions `src/a2a3/platform/include/common/l2_perf_profiling.h`, `src/a2a3/platform/include/common/platform_config.h`, `src/a2a3/platform/include/common/kernel_args.h`	`L2PerfRecord` gains `reg_task_id` join key field. `L2PerfAicoreRing` is replaced with slim `L2PerfAicoreRecord` and new `TypedBuffer<Record, N>` template defining `L2PerfAicoreBuffer` and `L2PerfBuffer` aliases. `PLATFORM_AICORE_BUFFER_SIZE` constant added; `PLATFORM_L2_AICORE_RING_SIZE` removed. `L2PerfBufferState` documentation clarifies `mismatch_record_count` as legacy ABI-retained, no longer written.
AICore Producer Path `src/a2a3/platform/include/aicore/aicore_profiling_state.h`, `src/a2a3/platform/include/aicore/l2_perf_collector_aicore.h`, `src/a2a3/platform/onboard/aicore/kernel.cpp`, `src/a2a3/platform/sim/aicore/kernel.cpp`, `src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`	Profiling state getter/setter signatures updated to accept/return `L2PerfAicoreBuffer` instead of `L2PerfAicoreRing`. `l2_perf_aicore_record_task()` refactored to index `buf->records[task_id % PLATFORM_AICORE_BUFFER_SIZE]` and write task_id directly. Kernel entry and simulation reinterpret ring-table pointers as `L2PerfAicoreBuffer*` before passing to profiling state setters. Executor local profiling cache variables updated to buffer type.
AICPU Consumer Simplification `src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h`, `src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp`	AICPU no longer reads AICore buffer on hot path. `l2_perf_aicpu_complete_record()` simplified to initialize records with zero timing/duration while writing only metadata fields (task_id, reg_task_id, func_id, core_type). Ring mismatch invariant checks and associated error path removed. Documentation updated to describe AICore and AICPU responsibility split.
Host-side Join and Reconciliation `src/a2a3/platform/include/host/l2_perf_collector.h`, `src/a2a3/platform/src/host/l2_perf_collector.cpp`	New `join_aicore_records()` method patches collected records' timing fields by indexing per-core `L2PerfAicoreBuffer` with `reg_task_id % PLATFORM_AICORE_BUFFER_SIZE`, logs unmatched records, and is wired into `export_swimlane_json()`. Buffer allocation updated to `L2PerfAicoreBuffer` type. Counter reconciliation and mismatch logging updated to reflect that `mismatch_record_count` is no longer written post AICore-as-producer.
Protocol Documentation `docs/dfx/l2-swimlane-profiling.md`	Documentation revised to describe stable per-core `aicore_ring_ptr`, explicit per-core `L2PerfAicoreBuffer` containing records indexed by `reg_task_id % SIZE`, AICore-as-producer slim writes, AICPU metadata-only writes leaving timing zero, and host post-processor join semantics. Overflow/wrap behavior and unmatched record semantics described. `mismatch_record_count` marked as legacy.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

hw-native-sys/simpler#873: Both PRs modify l2_perf_collector_aicpu.cpp to change how AICPU reads (or no longer reads) AICore-published L2 perf staging data, including removal of cache-invalidate operations on that path.

🐰 A profiling plot twist, oh what fun,
AICore writes fast, the join waits for none,
Slim records and buffers dance in the light,
While AICPU steps back to let hosts join right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main refactoring: AICore now writes directly as a producer, eliminating the per-task staging read that was previously required.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description clearly relates to the changeset by detailing the AICore-as-producer refactor, buffer rotation mechanism, and removal of per-task staging reads.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces an "AICore-as-producer" design for L2 swimlane profiling. Instead of AICore writing to a small staging ring that AICPU reads on the hot path, AICore now writes slim records directly to a larger per-core buffer (L2PerfAicoreBuffer). AICPU only writes its own metadata fields, and the host post-processor joins the AICore-written timestamps with the AICPU-written records at flush time. This optimization eliminates the per-task rmb() and staging cache-line read overhead on the device. The review feedback suggests changing PLATFORM_AICORE_BUFFER_SIZE to a power of two (e.g., 8192 instead of 8000) to allow the compiler to optimize the modulo operation on the AICore hot path into a fast bitwise AND.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/onboard/aicore/kernel.cpp`:
- Line 95: The code calls set_aicore_l2_perf_ring(reinterpret_cast<__gm__
L2PerfAicoreBuffer *>(ring_table[block_idx])) without guarding that the ring
table exists; add a null guard that checks k_args->aicore_ring_addr != 0 (and/or
ring_table != nullptr) before indexing ring_table and calling
set_aicore_l2_perf_ring, and skip that call when the address/table is null to
avoid an invalid memory read; update the logic around ring_table, block_idx and
set_aicore_l2_perf_ring to only compute/index the table and invoke the setter
when the guard passes.

In `@src/a2a3/platform/src/host/l2_perf_collector.cpp`:
- Around line 536-540: When computing base_time_cycles inside
L2PerfCollector::join_aicore_records, ignore any AICore records whose start
(and/or end) is zero so they don't become the chosen base; change the base-time
selection logic (the code around base_time_cycles / the loop around lines
~597-600) to skip records with start==0 (or both start==0 and end==0) when
scanning for the minimum timestamp, leaving unmatched records as start=end=0 but
excluding them from the base_time_cycles calculation and any subsequent
timestamp anchoring/export.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8615e8fe-929a-4eed-9dcd-e9cf72f7adf9

📥 Commits

Reviewing files that changed from the base of the PR and between d873e48 and dee1bd3.

📒 Files selected for processing (14)

docs/dfx/l2-swimlane-profiling.md
src/a2a3/platform/include/aicore/aicore_profiling_state.h
src/a2a3/platform/include/aicore/l2_perf_collector_aicore.h
src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h
src/a2a3/platform/include/common/kernel_args.h
src/a2a3/platform/include/common/l2_perf_profiling.h
src/a2a3/platform/include/common/platform_config.h
src/a2a3/platform/include/host/l2_perf_collector.h
src/a2a3/platform/onboard/aicore/kernel.cpp
src/a2a3/platform/sim/aicore/kernel.cpp
src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
src/a2a3/platform/src/host/l2_perf_collector.cpp
src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp

💤 Files with no reviewable changes (1)

src/a2a3/platform/include/common/platform_config.h

…ead) AICore writes a slim 32B L2PerfAicoreRecord (start, end, task_id) into a small per-core L2PerfAicoreBuffer (1024 records, 32 KB). The buffer is rotated by AICPU in the completion path: each call to l2_perf_aicpu_complete_record bumps a per-core completion counter and, when that counter crosses a PLATFORM_AICORE_BUFFER_SIZE boundary, hands the just-filled AICore buffer to the per-thread ready queue (kind is_phase=2) and pops a fresh one from L2PerfAicoreBufferState::free_queue. Race-free design: the just-completed task's FIN signal (the one we are handling now) guarantees AICore has already finished writing (with dcci + dsb) all records 0..N-1 into the old buffer. Earlier versions of this PR placed rotation in the dispatch path, which left a dispatch-boundary race where AICore could still be running task K-1 while AICPU rotated for task K, causing record K-1 to land in the new buffer instead of the old one. AICore picks up the rotation by dcci'ing a per-core AicoreRotation cache line per task (cheap relative to the baseline dcci(payload, ENTIRE_DATA_CACHE) it already pays) and comparing the published generation to its locally cached copy. AICPU's complete_record path drops the rmb() + staging cache-line read it used to do per task; AICPU now writes only AICPU-owned fields (task_id, reg_task_id, func_id, core_type, dispatch_time, finish_time) into the rotating L2PerfBuffer, with start/end zero on-device. Host's join_aicore_records() builds a reg_task_id -> (start, end) map per core from the multi-buffer AICore stream and patches L2PerfRecords at flush time, before export_swimlane_json(). Other review fixes folded in: - Flush stamps buf->count using complete_count - rotations_done * BUFFER_SIZE so a previously failed rotation no longer truncates the full buffer it stranded. - export_swimlane_json filters tasks with unmatched AICore timing (start_time == 0) before the per-record loop, so the underflow that previously produced huge bogus timestamps is gone. - PLATFORM_PROF_READYQUEUE_SIZE bumped to include the new AICore pool's worst-case backlog (MAX_CORES * AICORE_BUFFERS_PER_CORE). L2PerfRecord gains a reg_task_id field in existing pad space (no struct size change). TypedBuffer<Record, N> templates L2PerfBuffer and L2PerfAicoreBuffer. Drops PLATFORM_L2_AICORE_RING_SIZE; legacy aicore_ring_ptr / mismatch_record_count fields kept for ABI continuity. Hardware bench (paged_attention_unroll Case1, swimlane=4, device 0, clean iters after outlier filter): - sched_cost: 1178.14 -> ~1175 us (-3 us) - orch_cost: 947.06 -> ~915 us (-32 us) Profiling overhead vs swimlane=0 baseline drops from ~150 us to ~125 us. Scoped to a2a3 only; a5 retains the legacy staging-ring design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

Comment thread src/a2a3/platform/include/common/l2_perf_profiling.h Outdated

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread src/a2a3/platform/onboard/aicore/kernel.cpp Outdated

Comment thread src/a2a3/platform/src/host/l2_perf_collector.cpp Outdated

hw-native-sys-bot force-pushed the refactor/aicore-as-producer branch 3 times, most recently from 66d6d90 to 0f5e9c3 Compare May 28, 2026 07:36

hw-native-sys-bot force-pushed the refactor/aicore-as-producer branch from 0f5e9c3 to a5f8f27 Compare May 28, 2026 08:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: AICore-as-producer for L2 swimlane (skip per-task staging read)#878

Refactor: AICore-as-producer for L2 swimlane (skip per-task staging read)#878
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicore-as-producer

hw-native-sys-bot commented May 27, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Possibly Related PRs

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hw-native-sys-bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Hardware bench

Testing

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Possibly Related PRs

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hw-native-sys-bot commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading