Refactor: AICore-as-producer for L2 swimlane (skip per-task staging read)#878
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR refactors the L2 swimlane performance profiling buffer architecture from ring-based staging to buffer-based staging. AICore now writes slim timing records into per-core buffers indexed by task ID. AICPU writes only metadata and leaves timing zero. The host post-processor joins records by matching register task IDs to patch in timing data. ChangesL2 Profiling Buffer Architecture Refactor
🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces an "AICore-as-producer" design for L2 swimlane profiling. Instead of AICore writing to a small staging ring that AICPU reads on the hot path, AICore now writes slim records directly to a larger per-core buffer (L2PerfAicoreBuffer). AICPU only writes its own metadata fields, and the host post-processor joins the AICore-written timestamps with the AICPU-written records at flush time. This optimization eliminates the per-task rmb() and staging cache-line read overhead on the device. The review feedback suggests changing PLATFORM_AICORE_BUFFER_SIZE to a power of two (e.g., 8192 instead of 8000) to allow the compiler to optimize the modulo operation on the AICore hot path into a fast bitwise AND.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/onboard/aicore/kernel.cpp`:
- Line 95: The code calls set_aicore_l2_perf_ring(reinterpret_cast<__gm__
L2PerfAicoreBuffer *>(ring_table[block_idx])) without guarding that the ring
table exists; add a null guard that checks k_args->aicore_ring_addr != 0 (and/or
ring_table != nullptr) before indexing ring_table and calling
set_aicore_l2_perf_ring, and skip that call when the address/table is null to
avoid an invalid memory read; update the logic around ring_table, block_idx and
set_aicore_l2_perf_ring to only compute/index the table and invoke the setter
when the guard passes.
In `@src/a2a3/platform/src/host/l2_perf_collector.cpp`:
- Around line 536-540: When computing base_time_cycles inside
L2PerfCollector::join_aicore_records, ignore any AICore records whose start
(and/or end) is zero so they don't become the chosen base; change the base-time
selection logic (the code around base_time_cycles / the loop around lines
~597-600) to skip records with start==0 (or both start==0 and end==0) when
scanning for the minimum timestamp, leaving unmatched records as start=end=0 but
excluding them from the base_time_cycles calculation and any subsequent
timestamp anchoring/export.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8615e8fe-929a-4eed-9dcd-e9cf72f7adf9
📒 Files selected for processing (14)
docs/dfx/l2-swimlane-profiling.mdsrc/a2a3/platform/include/aicore/aicore_profiling_state.hsrc/a2a3/platform/include/aicore/l2_perf_collector_aicore.hsrc/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.hsrc/a2a3/platform/include/common/kernel_args.hsrc/a2a3/platform/include/common/l2_perf_profiling.hsrc/a2a3/platform/include/common/platform_config.hsrc/a2a3/platform/include/host/l2_perf_collector.hsrc/a2a3/platform/onboard/aicore/kernel.cppsrc/a2a3/platform/sim/aicore/kernel.cppsrc/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cppsrc/a2a3/platform/src/host/l2_perf_collector.cppsrc/a2a3/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
💤 Files with no reviewable changes (1)
- src/a2a3/platform/include/common/platform_config.h
66d6d90 to
0f5e9c3
Compare
…ead)
AICore writes a slim 32B L2PerfAicoreRecord (start, end, task_id) into
a small per-core L2PerfAicoreBuffer (1024 records, 32 KB). The buffer
is rotated by AICPU in the completion path: each call to
l2_perf_aicpu_complete_record bumps a per-core completion counter and,
when that counter crosses a PLATFORM_AICORE_BUFFER_SIZE boundary,
hands the just-filled AICore buffer to the per-thread ready queue
(kind is_phase=2) and pops a fresh one from
L2PerfAicoreBufferState::free_queue.
Race-free design: the just-completed task's FIN signal (the one we are
handling now) guarantees AICore has already finished writing (with
dcci + dsb) all records 0..N-1 into the old buffer. Earlier versions
of this PR placed rotation in the dispatch path, which left a
dispatch-boundary race where AICore could still be running task K-1
while AICPU rotated for task K, causing record K-1 to land in the new
buffer instead of the old one.
AICore picks up the rotation by dcci'ing a per-core AicoreRotation
cache line per task (cheap relative to the baseline
dcci(payload, ENTIRE_DATA_CACHE) it already pays) and comparing the
published generation to its locally cached copy.
AICPU's complete_record path drops the rmb() + staging cache-line
read it used to do per task; AICPU now writes only AICPU-owned
fields (task_id, reg_task_id, func_id, core_type, dispatch_time,
finish_time) into the rotating L2PerfBuffer, with start/end zero
on-device. Host's join_aicore_records() builds a reg_task_id ->
(start, end) map per core from the multi-buffer AICore stream and
patches L2PerfRecords at flush time, before export_swimlane_json().
Other review fixes folded in:
- Flush stamps buf->count using complete_count -
rotations_done * BUFFER_SIZE so a previously failed rotation no
longer truncates the full buffer it stranded.
- export_swimlane_json filters tasks with unmatched AICore timing
(start_time == 0) before the per-record loop, so the underflow
that previously produced huge bogus timestamps is gone.
- PLATFORM_PROF_READYQUEUE_SIZE bumped to include the new AICore
pool's worst-case backlog (MAX_CORES * AICORE_BUFFERS_PER_CORE).
L2PerfRecord gains a reg_task_id field in existing pad space (no
struct size change). TypedBuffer<Record, N> templates L2PerfBuffer
and L2PerfAicoreBuffer. Drops PLATFORM_L2_AICORE_RING_SIZE; legacy
aicore_ring_ptr / mismatch_record_count fields kept for ABI
continuity.
Hardware bench (paged_attention_unroll Case1, swimlane=4, device 0,
clean iters after outlier filter):
- sched_cost: 1178.14 -> ~1175 us (-3 us)
- orch_cost: 947.06 -> ~915 us (-32 us)
Profiling overhead vs swimlane=0 baseline drops from ~150 us to ~125 us.
Scoped to a2a3 only; a5 retains the legacy staging-ring design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0f5e9c3 to
a5f8f27
Compare
Summary
AICore-as-producer with rotation for the L2 swimlane profiler:
L2PerfAicoreRecord(start, end, reg dispatch token) into a small per-coreL2PerfAicoreBuffer(1024 records / 32 KB).PLATFORM_AICORE_BUFFER_SIZE-th dispatch — immediately beforewrite_reg(DATA_MAIN_BASE). The runtime's completion-before-dispatch invariant guarantees AICore has finished writing all prior records into the old buffer before AICPU enqueues it to the per-thread ready queue (kindis_phase=2).dcci-ing a per-coreAicoreRotationcache line per task and comparing the published generation to its locally cached copy. Negligible cost relative to the baselinedcci(payload, ENTIRE_DATA_CACHE)AICore already pays per task.complete_recordpath drops the per-taskrmb()+ staging cache-line read entirely; it only writes AICPU-owned fields (task_id,reg_task_id,func_id,core_type,dispatch_time,finish_time), leaving start/end zero on-device.join_aicore_records()builds areg_task_id → (start, end)map per core from the rotated multi-buffer AICore stream and patchesL2PerfRecords at flush time beforeexport_swimlane_json().BufferPoolManagerrecycles AICore buffers via the ready queue while the session runs, same as the existing AICPU pool.L2PerfRecordgains areg_task_idfield in existing pad space (no struct size change).TypedBuffer<Record, N>templatesL2PerfBufferand the newL2PerfAicoreBuffer. DropsPLATFORM_L2_AICORE_RING_SIZE; legacyaicore_ring_ptr/mismatch_record_countfields kept for ABI continuity.l2_perf_profiling.hand retains the legacy staging-ring design; mirror can land in a follow-up.Hardware bench
paged_attention_unroll Case1, swimlane=4, device 0, n=15 (rotation) / 10 (baselines), matched conditions:Δ vs upstream/main swimlane=4: sched −4.34 µs, orch −19.25 µs, total −4.40 µs. Profiling overhead vs swimlane=0 baseline drops from ~150 µs to ~121 µs.
The rotation design pays ~3 µs more sched than the no-rotation predecessor (which had this PR's earlier commit) — that's the price for unbounded session length and no wrap loss.
Testing
tests/st/a2a3/tensormap_and_ringbuffer/dfx/— all 7 tests pass (dep_gen,dep_gen_chain,l2_swimlane,l2_swimlane_mixed,pmu,tensor_dump× 2)paged_attention_unroll Case1PASSED with swimlane=4 (1024 tasks, allstart_time_us> 0 after join, durations 1.2–67 µs)