[Bug] a5 PMU CNT_TOTAL returns 0 when reg_base read is slow between pmu_disable and ld_dev

### Platform

a5 (Ascend 950 hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

On a5, AICore reads its own PMU MMIO directly via `ld_dev`. The sequence per task is roughly:

```
write_reg(CTRL, CTRL & ~PMU_ENABLE_BIT);   // pmu_aicore_end(): disable PMU
... fetch reg_base from somewhere ...
ld_dev(reg_base + CNT0_OFFSET);            // event counters cnt[0..9]
...
ld_dev(reg_base + CNT_TOTAL0_OFFSET);      // 64-bit cycle counter
ld_dev(reg_base + CNT_TOTAL1_OFFSET);
```

Empirically, the **latency of the `reg_base` fetch** determines whether `CNT_TOTAL0/1` read back valid values or 0. The event counters (`cnt[0..9]`) are unaffected: they hold their values after PMU disable.

We are reporting this as a hardware-mechanism question because the three `reg_base` fetch shapes we have tried map cleanly to three outcomes:

| Reg-base fetch shape                                                 | Where the value lives                | Typical latency                                             | `CNT_TOTAL` result   |
| -------------------------------------------------------------------- | ------------------------------------ | ----------------------------------------------------------- | -------------------- |
| Read a `volatile uint64_t` field from a GM struct that AICore already accesses every task (so its cache line is L1-hot) | GM (cached, hot) | ~1–2 cycles (L1 hit) | Always non-zero |
| Read `table[block_idx]` from a separate device-memory table that AICore touches only here (cold cache line) | GM (uncached / cold) | dozens to hundreds of cycles (L1 miss → DDR) | ≈26–47% of records read **0**, depending on log level |
| Read a `[[block_local]]` `uint64_t` value resolved once at kernel entry | AICore per-block private storage | ~1 cycle (scalar register access) | Always non-zero |

Same `ld_dev` sequence in all three cases; only the operation immediately before it changes.
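The block-local shape from the last table row can be sketched roughly as below. This is a host-compilable illustration, not the repository's code: the offset values, the `BlockState` struct, `kernel_entry`, `pmu_end_and_read_total`, and the `ld_dev` / `write_reg` signatures are stand-ins invented for the sketch, and treating `CNT_TOTAL0` as the low word is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register layout; the real a5 offsets are not given in this issue.
constexpr uintptr_t CTRL_OFFSET       = 0x00;
constexpr uintptr_t CNT_TOTAL0_OFFSET = 0x08;  // assumed: low 32 bits of the cycle counter
constexpr uintptr_t CNT_TOTAL1_OFFSET = 0x0C;  // assumed: high 32 bits
constexpr uint32_t  PMU_ENABLE_BIT    = 1u;    // CTRL bit 0

// Host-side stand-ins for the device MMIO accessors named in the issue.
inline uint32_t ld_dev(uintptr_t addr) {
    return *reinterpret_cast<volatile uint32_t*>(addr);
}
inline void write_reg(uintptr_t addr, uint32_t value) {
    *reinterpret_cast<volatile uint32_t*>(addr) = value;
}

// Models the [[block_local]] value: resolved once, reused by every task.
struct BlockState {
    uintptr_t reg_base;
};

// Kernel entry: the only (possibly cold) table read happens here,
// before any PMU disable, so its latency cannot fall inside the window.
inline void kernel_entry(BlockState& st, const uintptr_t* table, uint32_t block_idx) {
    assert(table != nullptr);
    st.reg_base = table[block_idx];
}

// Per-task epilogue: nothing slow sits between the disable and the reads.
// The event-counter reads (cnt[0..9]) are omitted for brevity.
inline uint64_t pmu_end_and_read_total(const BlockState& st) {
    const uintptr_t base = st.reg_base;  // already resolved at kernel entry
    write_reg(base + CTRL_OFFSET, ld_dev(base + CTRL_OFFSET) & ~PMU_ENABLE_BIT);
    const uint32_t lo = ld_dev(base + CNT_TOTAL0_OFFSET);
    const uint32_t hi = ld_dev(base + CNT_TOTAL1_OFFSET);
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

The point of the shape is only the ordering: the table lookup moves out of the per-task path, so the instruction stream between the CTRL write and the `CNT_TOTAL` loads contains no memory access that can miss.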

### Steps to Reproduce

1. Build the kernel so AICore fetches `reg_base` from a cold GM table per task — i.e. `[[block_local]] static __gm__ uint64_t *table;` and `get_reg_base() { return table[block_idx]; }`, where `table` points to a per-core device-memory array AICore otherwise never touches.
2. Run any PMU-profiling test on real a5 hardware with enough tasks to populate `outputs/<run>/pmu.csv` (we used `examples/paged_attention_unroll`, ~1024 tasks).
3. Count the rows whose sixth column (`pmu_total_cycles`) is zero: `awk -F, 'NR>1 && $6=="0"' outputs/<run>/pmu.csv | wc -l`.

### Expected Behavior

`CNT_TOTAL` returns a valid cycle count whenever the kernel actually executed. That is, it should behave like the event counters and hold its value after PMU disable.

### Actual Behavior

Cold-GM-table reg-base fetch:

| log level | total rows | rows with `pmu_total_cycles == 0` |
| --------- | ---------- | --------------------------------- |
| debug     | 1024       | 482 (≈47%)                        |
| warn      | 1024       | 265 (≈26%)                        |

Sample row (event counters valid, total cycles zero):

```
0,0,0x00000001000001e1,0,0,0,0,274,38,171,0,0,6,0,0,2
```
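The signature seen in the sample row (event counters populated, total cycles zero) can be written as a small predicate for flagging affected records. This is a sketch only: `PmuRecord`, its field names, and `looks_like_lost_total` are hypothetical and do not reflect the repository's actual record layout.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical in-memory form of one PMU record.
struct PmuRecord {
    std::array<uint32_t, 10> cnt;  // event counters cnt[0..9]
    uint64_t total_cycles;         // combined CNT_TOTAL0/1 value
};

// True when the record shows the failure signature described above:
// at least one event counter is non-zero, yet the cycle total reads 0.
// A record with all counters at zero is not flagged, because it cannot
// be told apart from a task that never ran with the PMU enabled.
inline bool looks_like_lost_total(const PmuRecord& r) {
    const bool any_event = std::any_of(
        r.cnt.begin(), r.cnt.end(),
        [](uint32_t c) { return c != 0; });
    return r.total_cycles == 0 && any_event;
}
```

Requiring a non-zero event counter keeps the check conservative: it only flags rows where the PMU demonstrably counted something during the task.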

After switching to the block-local fetch shape, the same test yields **0 / 1024** zero rows.

The dependency on log level is informative: AICPU log throughput changes dispatch timing, which changes per-core task density, which in turn changes how often the cold cache line is evicted between reads. More evictions mean more `CNT_TOTAL == 0` rows. This suggests the failure is driven by the cache-miss rate, not by any deterministic counter-clear behavior.

### Git Commit ID

N/A — the broken intermediate state is no longer on `main`. The pattern is reproducible by deliberately introducing a cold per-record GM read between `pmu_aicore_end()` and `ld_dev(CNT_TOTAL0)`.

### CANN Version

N/A.

### Driver Version

N/A.

### Host Platform

Linux (aarch64)

### Additional Context

This issue exists as a **hardware-behavior record**, not an open repo bug. The software-side fix is already in place (resolve `reg_base` into block-local storage at kernel entry).

What we'd like the hardware team to confirm or correct:

- Is `CNT_TOTAL0/1` expected to remain readable indefinitely after PMU disable (CTRL bit 0 = 0), or is there a defined valid-read window after disable?
- If a window exists: is it specified in cycles, or in terms of "next access on the MMIO interface after disable"?
- Is the cycle counter's post-disable behavior expected to differ from event counters' (which are clearly sticky)?

If this is **expected hardware behavior**, then software has a hard constraint: after `pmu_aicore_end()`, nothing slow (cache miss, long scalar dependency, etc.) is allowed before `ld_dev(CNT_TOTAL)`. The current fix relies on that constraint informally; a documented spec would let us assert it.

If this is **unexpected / a hardware bug**, please advise on a hardware-side guard so software does not have to manage this timing window.
