Platform
a5 (Ascend 950 hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
On a5, AICore reads its own PMU MMIO directly via ld_dev. The sequence per task is roughly:
write_reg(CTRL, CTRL & ~PMU_ENABLE_BIT); // pmu_aicore_end(): disable PMU
... fetch reg_base from somewhere ...
ld_dev(reg_base + CNT0_OFFSET); // event counters cnt[0..9]
...
ld_dev(reg_base + CNT_TOTAL0_OFFSET); // 64-bit cycle counter
ld_dev(reg_base + CNT_TOTAL1_OFFSET);
Empirically, the latency of the "fetch reg_base" step decides whether the CNT_TOTAL0/1 reads return valid values or 0. The event counters (cnt[0..9]) are unaffected: they hold their values after PMU disable.
Reporting this as a hardware-mechanism question because the three reg_base-fetch shapes we've tried map cleanly to three outcomes:
| Reg-base fetch shape | Where the value lives | Typical latency | CNT_TOTAL result |
|---|---|---|---|
| Read a volatile uint64_t field from a GM struct that AICore already accesses every task (so its cache line is L1-hot) | GM (cached, hot) | ~1–2 cycles (L1 hit) | Always non-zero |
| Read table[block_idx] from a separate device-memory table that AICore touches only here (cold cache line) | GM (uncached / cold) | dozens to hundreds of cycles (L1 miss → DDR) | ~25% of records read 0 |
| Read a [[block_local]] uint64_t value resolved once at kernel entry | AICore per-block private storage | ~1 cycle (scalar register access) | Always non-zero |
Same ld_dev sequence in all three cases; only the operation immediately before it changes.
Steps to Reproduce
- Build the kernel so AICore fetches reg_base from a cold GM table per task, i.e. [[block_local]] static __gm__ uint64_t *table; and get_reg_base() { return table[block_idx]; }, where table points to a per-core device-memory array AICore otherwise never touches.
- Run any PMU-profiling test on real a5 hardware with enough tasks to populate outputs/<run>/pmu.csv (we used examples/paged_attention_unroll, ~1024 tasks).
- Count the zero rows: awk -F, 'NR>1 && $6=="0"' outputs/<run>/pmu.csv | wc -l
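For environments without awk, the same count can be sketched in C++. This is a minimal illustration, not part of the repo: the function name count_zero_rows is hypothetical, and it assumes, as the awk command does, that the first line is a header and that the column of interest is the 6th comma-separated field.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: counts data rows (header skipped) whose 1-based
// `column` field is exactly "0".
// Intended to mirror: awk -F, 'NR>1 && $6=="0"' pmu.csv | wc -l
std::size_t count_zero_rows(const std::string& csv_text, std::size_t column = 6) {
    assert(column >= 1);  // columns are 1-based, like awk's $1, $2, ...
    std::istringstream lines(csv_text);
    std::string line;
    std::size_t zero_rows = 0;
    bool is_header = true;
    while (std::getline(lines, line)) {
        if (is_header) {  // NR>1: skip the first line
            is_header = false;
            continue;
        }
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        std::istringstream fields(line);
        std::string field;
        std::size_t index = 0;
        while (std::getline(fields, field, ',')) {
            if (++index == column) {
                if (field == "0") ++zero_rows;  // string compare, like $6=="0"
                break;
            }
        }
    }
    return zero_rows;
}
```

Feeding it the contents of outputs/<run>/pmu.csv should give the same number as the awk pipeline, provided no field contains a quoted comma.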
Expected Behavior
CNT_TOTAL returns a valid cycle count whenever the kernel actually executed, i.e. it should behave the same way the event counters do and stay sticky after PMU disable.
Actual Behavior
Cold-GM-table reg-base fetch:
| log level | total rows | rows with pmu_total_cycles == 0 |
|---|---|---|
| debug | 1024 | 482 (≈47%) |
| warn | 1024 | 265 (≈26%) |
Sample row (event counters valid, total cycles zero):
0,0,0x00000001000001e1,0,0,0,0,274,38,171,0,0,6,0,0,2
After switching to the block-local fetch shape, the same test yields 0 / 1024 zero rows.
The dependency on log level is informative: AICPU log throughput changes dispatch timing, which changes per-core task density, which changes how often the cold cache line gets evicted between reads. More eviction → more CNT_TOTAL == 0 rows. This suggests the failure is driven by cache-miss rate, not by any deterministic counter-clear behavior.
Git Commit ID
N/A — the broken intermediate state is no longer on main. The pattern is reproducible by deliberately introducing a cold per-record GM read between pmu_aicore_end() and ld_dev(CNT_TOTAL0).
CANN Version
N/A.
Driver Version
N/A.
Host Platform
Linux (aarch64)
Additional Context
This issue exists as a hardware-behavior record, not an open repo bug. The software-side fix is already in place (resolve reg_base into block-local storage at kernel entry).
What we'd like the hardware team to confirm or correct:
- Is CNT_TOTAL0/1 expected to remain readable indefinitely after PMU disable (CTRL bit 0 = 0), or is there a defined valid-read window after disable?
- If a window exists: is it specified in cycles, or in terms of "next access on the MMIO interface after disable"?
- Is the cycle counter's post-disable behavior expected to differ from event counters' (which are clearly sticky)?
If this is expected hardware behavior, then software has a hard constraint: after pmu_aicore_end(), nothing slow (cache miss, long scalar dependency, etc.) is allowed before ld_dev(CNT_TOTAL). The current fix relies on that constraint informally; a documented spec would let us assert it.
If this is unexpected / a hardware bug, please advise on a hardware-side guard so software does not have to manage this timing window.
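For reference, the ordering constraint described above can be sketched as host-side C++ against a mock register file. Everything here is an assumption made for illustration: the register indices, the 32-bit register width, CNT_TOTAL0 being the low half and CNT_TOTAL1 the high half, and the plain-load stand-in for ld_dev. The mock does not model the hardware behavior in question; it only shows the shape of the read path, with reg_base passed in already resolved so that no memory load sits between the disable and the counter reads.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical register indices, NOT the real a5 PMU map.
constexpr std::size_t kCtrl = 0;              // control register
constexpr std::size_t kCnt0 = 1;              // first of 10 event counters
constexpr std::size_t kCntTotal0 = 11;        // assumed low 32 bits of cycle counter
constexpr std::size_t kCntTotal1 = 12;        // assumed high 32 bits of cycle counter
constexpr std::uint32_t kPmuEnableBit = 1u;   // CTRL bit 0

using MockPmu = std::array<std::uint32_t, 13>;  // host-side stand-in for PMU MMIO

// Stand-in for the device intrinsic ld_dev: here just a volatile load.
inline std::uint32_t ld_dev(const volatile std::uint32_t* addr) { return *addr; }

struct PmuRecord {
    std::array<std::uint32_t, 10> cnt;
    std::uint64_t total_cycles;
};

// reg_base must already be resolved by the caller (e.g. once at kernel entry),
// so nothing slow happens between the disable and the CNT_TOTAL reads.
inline PmuRecord pmu_end_and_read(volatile std::uint32_t* reg_base) {
    assert(reg_base != nullptr);
    reg_base[kCtrl] = reg_base[kCtrl] & ~kPmuEnableBit;  // disable PMU

    PmuRecord rec{};
    for (std::size_t i = 0; i < rec.cnt.size(); ++i) {
        rec.cnt[i] = ld_dev(reg_base + kCnt0 + i);        // event counters cnt[0..9]
    }
    const std::uint32_t lo = ld_dev(reg_base + kCntTotal0);
    const std::uint32_t hi = ld_dev(reg_base + kCntTotal1);
    rec.total_cycles = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return rec;
}
```

The point of taking reg_base as a parameter is that the caller decides where it lives; in the fixed runtime that is block-local storage, which is what keeps the post-disable path free of cache misses.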