Skip to content

Refactor: AICore-as-producer for L2 swimlane (skip per-task staging read)#878

Open
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicore-as-producer
Open

Refactor: AICore-as-producer for L2 swimlane (skip per-task staging read)#878
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicore-as-producer

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

@hw-native-sys-bot hw-native-sys-bot commented May 27, 2026

Summary

AICore-as-producer with rotation for the L2 swimlane profiler:

  • AICore writes a slim 32 B L2PerfAicoreRecord (start, end, reg dispatch token) into a small per-core L2PerfAicoreBuffer (1024 records / 32 KB).
  • AICPU rotates the buffer at every PLATFORM_AICORE_BUFFER_SIZE-th dispatch — immediately before write_reg(DATA_MAIN_BASE). The runtime's completion-before-dispatch invariant guarantees AICore has finished writing all prior records into the old buffer before AICPU enqueues it to the per-thread ready queue (kind is_phase=2).
  • AICore picks up rotations by dcci-ing a per-core AicoreRotation cache line per task and comparing the published generation to its locally cached copy. Negligible cost relative to the baseline dcci(payload, ENTIRE_DATA_CACHE) AICore already pays per task.
  • AICPU's complete_record path drops the per-task rmb() + staging cache-line read entirely; it only writes AICPU-owned fields (task_id, reg_task_id, func_id, core_type, dispatch_time, finish_time), leaving start/end zero on-device.
  • Host's join_aicore_records() builds a reg_task_id → (start, end) map per core from the rotated multi-buffer AICore stream and patches L2PerfRecords at flush time before export_swimlane_json().
  • Session length is unbounded by per-core buffer sumBufferPoolManager recycles AICore buffers via the ready queue while the session runs, same as the existing AICPU pool.
  • L2PerfRecord gains a reg_task_id field in existing pad space (no struct size change). TypedBuffer<Record, N> templates L2PerfBuffer and the new L2PerfAicoreBuffer. Drops PLATFORM_L2_AICORE_RING_SIZE; legacy aicore_ring_ptr / mismatch_record_count fields kept for ABI continuity.
  • Scoped to a2a3 only — a5 has its own l2_perf_profiling.h and retains the legacy staging-ring design; mirror can land in a follow-up.

Hardware bench

paged_attention_unroll Case1, swimlane=4, device 0, n=15 (rotation) / 10 (baselines), matched conditions:

Build n sched (µs) orch (µs) total (µs)
upstream/main, swimlane=4 10 1178.14 ± 6.42 947.06 1178.25
upstream/main, swimlane=0 10 1164.37 ± 5.84 815.30 1164.46
this PR (rotation), swimlane=4 15 1173.80 ± 6.64 927.81 1173.85

Δ vs upstream/main swimlane=4: sched −4.34 µs, orch −19.25 µs, total −4.40 µs. Profiling overhead vs swimlane=0 baseline drops from ~150 µs to ~121 µs.

The rotation design pays ~3 µs more sched than the no-rotation predecessor (which had this PR's earlier commit) — that's the price for unbounded session length and no wrap loss.

Testing

  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/ — all 7 tests pass (dep_gen, dep_gen_chain, l2_swimlane, l2_swimlane_mixed, pmu, tensor_dump × 2)
  • paged_attention_unroll Case1 PASSED with swimlane=4 (1024 tasks, all start_time_us > 0 after join, durations 1.2–67 µs)
  • CI on hardware

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5d34ba7f-5f6f-4c1c-b0da-9b071fbc5f36

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR refactors the L2 swimlane performance profiling buffer architecture from ring-based staging to buffer-based staging. AICore now writes slim timing records into per-core buffers indexed by task ID. AICPU writes only metadata and leaves timing zero. The host post-processor joins records by matching register task IDs to patch in timing data.

Changes

L2 Profiling Buffer Architecture Refactor

Layer / File(s) Summary
Data Contracts and Type Definitions
src/a2a3/platform/include/common/l2_perf_profiling.h, src/a2a3/platform/include/common/platform_config.h, src/a2a3/platform/include/common/kernel_args.h
L2PerfRecord gains reg_task_id join key field. L2PerfAicoreRing is replaced with slim L2PerfAicoreRecord and new TypedBuffer<Record, N> template defining L2PerfAicoreBuffer and L2PerfBuffer aliases. PLATFORM_AICORE_BUFFER_SIZE constant added; PLATFORM_L2_AICORE_RING_SIZE removed. L2PerfBufferState documentation clarifies mismatch_record_count as legacy ABI-retained, no longer written.
AICore Producer Path
src/a2a3/platform/include/aicore/aicore_profiling_state.h, src/a2a3/platform/include/aicore/l2_perf_collector_aicore.h, src/a2a3/platform/onboard/aicore/kernel.cpp, src/a2a3/platform/sim/aicore/kernel.cpp, src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
Profiling state getter/setter signatures updated to accept/return L2PerfAicoreBuffer* instead of L2PerfAicoreRing*. l2_perf_aicore_record_task() refactored to index buf->records[task_id % PLATFORM_AICORE_BUFFER_SIZE] and write task_id directly. Kernel entry and simulation reinterpret ring-table pointers as L2PerfAicoreBuffer* before passing to profiling state setters. Executor local profiling cache variables updated to buffer type.
AICPU Consumer Simplification
src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h, src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
AICPU no longer reads AICore buffer on hot path. l2_perf_aicpu_complete_record() simplified to initialize records with zero timing/duration while writing only metadata fields (task_id, reg_task_id, func_id, core_type). Ring mismatch invariant checks and associated error path removed. Documentation updated to describe AICore and AICPU responsibility split.
Host-side Join and Reconciliation
src/a2a3/platform/include/host/l2_perf_collector.h, src/a2a3/platform/src/host/l2_perf_collector.cpp
New join_aicore_records() method patches collected records' timing fields by indexing per-core L2PerfAicoreBuffer with reg_task_id % PLATFORM_AICORE_BUFFER_SIZE, logs unmatched records, and is wired into export_swimlane_json(). Buffer allocation updated to L2PerfAicoreBuffer type. Counter reconciliation and mismatch logging updated to reflect that mismatch_record_count is no longer written post AICore-as-producer.
Protocol Documentation
docs/dfx/l2-swimlane-profiling.md
Documentation revised to describe stable per-core aicore_ring_ptr, explicit per-core L2PerfAicoreBuffer containing records indexed by reg_task_id % SIZE, AICore-as-producer slim writes, AICPU metadata-only writes leaving timing zero, and host post-processor join semantics. Overflow/wrap behavior and unmatched record semantics described. mismatch_record_count marked as legacy.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

  • hw-native-sys/simpler#873: Both PRs modify l2_perf_collector_aicpu.cpp to change how AICPU reads (or no longer reads) AICore-published L2 perf staging data, including removal of cache-invalidate operations on that path.

🐰 A profiling plot twist, oh what fun,
AICore writes fast, the join waits for none,
Slim records and buffers dance in the light,
While AICPU steps back to let hosts join right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main refactoring: AICore now writes directly as a producer, eliminating the per-task staging read that was previously required.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description clearly relates to the changeset by detailing the AICore-as-producer refactor, buffer rotation mechanism, and removal of per-task staging reads.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an "AICore-as-producer" design for L2 swimlane profiling. Instead of AICore writing to a small staging ring that AICPU reads on the hot path, AICore now writes slim records directly to a larger per-core buffer (L2PerfAicoreBuffer). AICPU only writes its own metadata fields, and the host post-processor joins the AICore-written timestamps with the AICPU-written records at flush time. This optimization eliminates the per-task rmb() and staging cache-line read overhead on the device. The review feedback suggests changing PLATFORM_AICORE_BUFFER_SIZE to a power of two (e.g., 8192 instead of 8000) to allow the compiler to optimize the modulo operation on the AICore hot path into a fast bitwise AND.

Comment thread src/a2a3/platform/include/common/l2_perf_profiling.h Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/onboard/aicore/kernel.cpp`:
- Line 95: The code calls set_aicore_l2_perf_ring(reinterpret_cast<__gm__
L2PerfAicoreBuffer *>(ring_table[block_idx])) without guarding that the ring
table exists; add a null guard that checks k_args->aicore_ring_addr != 0 (and/or
ring_table != nullptr) before indexing ring_table and calling
set_aicore_l2_perf_ring, and skip that call when the address/table is null to
avoid an invalid memory read; update the logic around ring_table, block_idx and
set_aicore_l2_perf_ring to only compute/index the table and invoke the setter
when the guard passes.

In `@src/a2a3/platform/src/host/l2_perf_collector.cpp`:
- Around line 536-540: When computing base_time_cycles inside
L2PerfCollector::join_aicore_records, ignore any AICore records whose start
(and/or end) is zero so they don't become the chosen base; change the base-time
selection logic (the code around base_time_cycles / the loop around lines
~597-600) to skip records with start==0 (or both start==0 and end==0) when
scanning for the minimum timestamp, leaving unmatched records as start=end=0 but
excluding them from the base_time_cycles calculation and any subsequent
timestamp anchoring/export.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8615e8fe-929a-4eed-9dcd-e9cf72f7adf9

📥 Commits

Reviewing files that changed from the base of the PR and between d873e48 and dee1bd3.

📒 Files selected for processing (14)
  • docs/dfx/l2-swimlane-profiling.md
  • src/a2a3/platform/include/aicore/aicore_profiling_state.h
  • src/a2a3/platform/include/aicore/l2_perf_collector_aicore.h
  • src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h
  • src/a2a3/platform/include/common/kernel_args.h
  • src/a2a3/platform/include/common/l2_perf_profiling.h
  • src/a2a3/platform/include/common/platform_config.h
  • src/a2a3/platform/include/host/l2_perf_collector.h
  • src/a2a3/platform/onboard/aicore/kernel.cpp
  • src/a2a3/platform/sim/aicore/kernel.cpp
  • src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
  • src/a2a3/platform/src/host/l2_perf_collector.cpp
  • src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
💤 Files with no reviewable changes (1)
  • src/a2a3/platform/include/common/platform_config.h

Comment thread src/a2a3/platform/onboard/aicore/kernel.cpp Outdated
Comment thread src/a2a3/platform/src/host/l2_perf_collector.cpp Outdated
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/aicore-as-producer branch 3 times, most recently from 66d6d90 to 0f5e9c3 Compare May 28, 2026 07:36
…ead)

AICore writes a slim 32B L2PerfAicoreRecord (start, end, task_id) into
a small per-core L2PerfAicoreBuffer (1024 records, 32 KB). The buffer
is rotated by AICPU in the completion path: each call to
l2_perf_aicpu_complete_record bumps a per-core completion counter and,
when that counter crosses a PLATFORM_AICORE_BUFFER_SIZE boundary,
hands the just-filled AICore buffer to the per-thread ready queue
(kind is_phase=2) and pops a fresh one from
L2PerfAicoreBufferState::free_queue.

Race-free design: the just-completed task's FIN signal (the one we are
handling now) guarantees AICore has already finished writing (with
dcci + dsb) all records 0..N-1 into the old buffer. Earlier versions
of this PR placed rotation in the dispatch path, which left a
dispatch-boundary race where AICore could still be running task K-1
while AICPU rotated for task K, causing record K-1 to land in the new
buffer instead of the old one.

AICore picks up the rotation by dcci'ing a per-core AicoreRotation
cache line per task (cheap relative to the baseline
dcci(payload, ENTIRE_DATA_CACHE) it already pays) and comparing the
published generation to its locally cached copy.

AICPU's complete_record path drops the rmb() + staging cache-line
read it used to do per task; AICPU now writes only AICPU-owned
fields (task_id, reg_task_id, func_id, core_type, dispatch_time,
finish_time) into the rotating L2PerfBuffer, with start/end zero
on-device. Host's join_aicore_records() builds a reg_task_id ->
(start, end) map per core from the multi-buffer AICore stream and
patches L2PerfRecords at flush time, before export_swimlane_json().

Other review fixes folded in:
  - Flush stamps buf->count using complete_count -
    rotations_done * BUFFER_SIZE so a previously failed rotation no
    longer truncates the full buffer it stranded.
  - export_swimlane_json filters tasks with unmatched AICore timing
    (start_time == 0) before the per-record loop, so the underflow
    that previously produced huge bogus timestamps is gone.
  - PLATFORM_PROF_READYQUEUE_SIZE bumped to include the new AICore
    pool's worst-case backlog (MAX_CORES * AICORE_BUFFERS_PER_CORE).

L2PerfRecord gains a reg_task_id field in existing pad space (no
struct size change). TypedBuffer<Record, N> templates L2PerfBuffer
and L2PerfAicoreBuffer. Drops PLATFORM_L2_AICORE_RING_SIZE; legacy
aicore_ring_ptr / mismatch_record_count fields kept for ABI
continuity.

Hardware bench (paged_attention_unroll Case1, swimlane=4, device 0,
clean iters after outlier filter):
  - sched_cost: 1178.14 -> ~1175 us  (-3 us)
  - orch_cost:   947.06 ->  ~915 us  (-32 us)
Profiling overhead vs swimlane=0 baseline drops from ~150 us to ~125 us.

Scoped to a2a3 only; a5 retains the legacy staging-ring design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/aicore-as-producer branch from 0f5e9c3 to a5f8f27 Compare May 28, 2026 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants