Add Cayley-Hamilton triangular inverse example#357
Conversation
New intermediate example computing inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton doubling algorithm (X = X + X@Y, Y = Y@Y for ceil(log2(n)) iterations). Targets the chunked-GDN tri-inverse used in Qwen3-Next. Validates against torch.linalg.inv(I - A) on Ascend 910B2 (n=128, FP32). Structure follows the canonical gemm.py pattern: M-parallel pl.parallel inside pl.at(chunked_loop_optimizer) with K-blocked pl.matmul + pl.matmul_acc to keep each InCore function inside the 192 KB AIV Vec budget. Y is snapshotted to Y_temp before the Y = Y@Y step so the matmul reads from a stable tensor while writing Y_state. Test setup: - Seeded random init (local torch.Generator(seed=0)) for reproducibility. - Init scale 1/(4*sqrt(n)) targets ||A||_op ~= 0.5, bounding the intermediate-tile magnitudes through the 7 doubling iterations. - Tolerance rtol = atol = 2e-2: 14 chained matmuls on the Ascend 910B2 cube (FP16 multiply + FP32 accumulate) compound to ~1e-2 relative error, and matmul reduction ordering is not bit-deterministic across runs. 2e-2 keeps the test green across 5+ consecutive invocations.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds a complete example script implementing triangular matrix inverse using a Cayley–Hamilton doubling algorithm. It provides single and batched program variants with cube-engine matmul operations, test fixtures, reference torch implementations, and a CLI harness for validation. ChangesTriangular Inverse Kernel Example
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new example script, examples/intermediate/tri_inverse.py, which implements the Cayley-Hamilton doubling algorithm to compute the inverse of a strict-lower-triangular matrix. The implementation utilizes tiled matrix multiplications and additions within a specialized program structure. Feedback highlights that the current logic assumes the matrix dimension is a multiple of the tile sizes, which should be explicitly validated to prevent out-of-bounds access or incorrect results. Additionally, the reviewer suggests using double buffering for the state tensors to avoid inefficient global memory copies during each doubling iteration.
…n refinement - Raise ValueError if n is not a multiple of m_tile / k_tile (was silently truncating the K reduction or generating out-of-bounds slices) - Double-buffer Y across Y_a / Y_b with a Python-level swap each step, removing the GM-to-GM snapshot scope (cayley_y_snapshot) - Add optional Newton iterative refinement (refinement_steps: int = 1) applied after the doubling: R = I - (I - A) X; X = X + X @ R. Quadratically convergent, crushes FP16-cube roundoff from O(eps) to O(eps^2); brings n=64 inside the 2e-2 tolerance and tightens max abs error from ~1e-3 to ~3e-4 at n=128
Replace the 10-scope GM-spilling implementation with a single InCore +
Orchestration function carrying X and Y as GM tensor iter_args of one
pl.range loop, mirroring examples/models/07_paged_attention_multi_config.py.
X + X @ Y is fused into one cube pipeline by:
(a) matmul(X_left, Y_right) -> L0C = X @ Y
(b) matmul_acc(L0C, X_left, I_right) -> L0C += X
X is reloaded from Mat into Left in step (b) because the L0C -> Mat
tile-move is not supported on a2a3; routing through Mat is the only
way to reuse X as both the matmul-acc accumulator C and the A operand.
Y_new = Y @ Y is a plain matmul. Both seeds (X = I, Y = A) go through
an AIC-only matmul -> store path so no AIV function is spawned (a direct
pl.store of a Mat tile would force a Vec round-trip and bust the AIV
Vec budget).
Supported sizes are n in {64, 128}; n > 128 raises ValueError because
the full-[n,n] tiles overflow L0 caches (Acc 128 KB, Right 64 KB) and
larger n requires K-blocking which is a separate piece of work.
Effect at n=128:
- source pl.at scopes: 10 -> 0
- lines: 240 -> 185
- max abs error vs torch.linalg.inv: 3.6e-4 -> 3.4e-8 (FP32-roundoff floor)
- per-launch wall (cached chip_callable): essentially unchanged (~12 ms)
build_tri_inverse_program_batched(n, batch) inverts B independent
strict-lower-triangular n x n matrices in one launch. Flat 2D layout
[B*n, n] (matches the dispatch idiom in
pypto3/examples/models/07_paged_attention_multi_config.py:524-538).
Shape:
- Orchestration loops `for b in pl.range(batch, init_values=(out,))`,
threading the output tensor through the loop so each iter's
pl.assemble lands on the previous iter's result.
- Per element: slice A and out at offset [b*n, 0], call the existing
single-matrix cayley_core InCore, and assemble its return into out.
- Y scratch is a [B*n, n] tensor sliced per element; X uses the b'th
slice of out directly as both working buffer and final result slot.
Validated at B in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 632} for n=128
(B=632 = full Qwen3-Next GDN per-call workload). Max abs error vs
torch.linalg.inv is 8.2e-8 at B=632 (FP32 round-off floor; 0/10354688
elements above the 2e-2 tolerance).
`__main__` now runs a sweep by default (single + B=2 + B=4) to exercise
both code paths on every CI run; `-b N` selects only the batched build
at a specific N.
Performance analysis — engine utilization & host overhead breakdownWhat the kernel doesWe compute Starting from SetupAscend 910B2 (24 cube cores, 64 GB HBM, FP32). Pushed kernel commit 1. Per-launch breakdown — host vs NPU (msprof)For every shape, each launch decomposes into NPU AIC kernel time + host-side dispatch gap.
2. NPU engine utilization (PMU per-core averages)
3. Per-engine work ratios (msprof
|
| B | aic_mac_ratio | aic_mte1_ratio | aic_mte2_ratio |
|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.000 |
| 4 | 0.000 | 0.000 | 0.000 |
| 16 | 0.001 | 0.001 | 0.001 |
| 64 | 0.006 | 0.002 | 0.002 |
| 128 | 0.012 | 0.005 | 0.003 |
| 632 | 0.053 | 0.021 | 0.014 |
aic_mac_ratiois the fraction of AIC total cycles spent actually
in the MAC unit. Even at B=632 only 5.3% of cube cycles are matrix
multiplication — the kernel is dispatch- and memory-bound, not
compute-bound. The [128,128]×[128,128] matmul is too small to amortize
per-task scheduling cost; only batching brings ratios up at all.
4. Multi-core dispatch
| B | cube_utilization (% of 2400) | active cores ≈ |
|---|---|---|
| 1 | 309 | 3.1 / 24 |
| 4 | 344 | 3.4 / 24 |
| 16 | 358 | 3.6 / 24 |
| 64 | 340 | 3.4 / 24 |
| 128 | 324 | 3.2 / 24 |
| 632 | 281 | 2.8 / 24 |
- ~3 of 24 cube cores active at every B, set by
pl.matmulparallelism
inside one matrix inversion; the outerpl.range(batch)loop dispatches
serially on the AICPU lane and does not fan out across cores. - 87% of cube capacity is idle even at large B. Multi-core batch dispatch
(e.g.pl.spmdover the batch axis) would unlock this — both refactor
attempts in this branch compile but produce wrong output; needs
IR-level investigation.
5. Buffer occupancy at n=128 (FP32)
| Cache | Capacity | Our usage | Headroom |
|---|---|---|---|
| L0A (Left) | 64 KB | 64 KB ([128,128] tile) | 0 — saturated |
| L0B (Right) | 64 KB | 64 KB ([128,128] tile) | 0 — saturated |
| L0C (Acc) | 128 KB | 64 KB ([128,128] tile) | 64 KB |
| L1 Mat | ≥256 KB | 128 KB (A_l1 + I_l1) | comfortable |
- L0A/L0B fill exactly at n=128, which is why n>128 is rejected
without K-blocking — operand tile would overflow. - L0C has 64 KB headroom; would fit up to n=181 if K-blocking shrunk the
operand tiles. A K-blocked variant was prototyped on a side branch
(n=160 verified end-to-end, ~99 µs/matrix at B=128) but is out of scope
for this PR.
6. Where the time goes — summary
Per-launch wall = 8.5 ms host gap + 3.5–4.7 ms NPU
───────────── ─────────
64–71% of wall 29–36% of wall
simpler-side kernel-side
(async submit) (memory + dispatch)
The kernel is at its dispatch-amortization sweet spot for the canonical
Qwen3-Next GDN workload (chunk=128). Per-matrix cost is 21 µs at B=632.
Pushing per-matrix lower requires either (a) reducing the 8.5 ms host gap
in simpler, or (b) lighting up the 21 idle cube cores via parallel batch
dispatch — both outside the kernel-source scope of this PR.
Known limits
- n > 128 —
[n,n]operand tile overflows L0A/L0B (64 KB). Lift
requires K-blocked matmul inside the doubling loop. - B ≥ 15 in the GDN dispatch shape (n_total = B × H × ⌈T/chunk⌉ ≥ 4740
sequential tasks per launch) hits PTO2 ring-buffer overflow (AICPU
507018). CI workaround bumpsPTO2_RING_TASK_WINDOWetc.; without it,
the practical ceiling is B=10 → 51.8 M numel / launch.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
examples/intermediate/tri_inverse.py (1)
83-83: 💤 Low valueRename unused loop index
k→_k.ruff (B007) flags the unused loop control variable at both loop sites.
♻️ Rename at both sites
- for k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf)): + for _k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf)):(apply the same rename at line 176)
Also applies to: 176-176
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/intermediate/tri_inverse.py` at line 83, Rename the unused loop index k to _k in both for loops that iterate with pl.range, specifically in the statement "for k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf))" and the other equivalent loop later in the file; update the loop header variable from k to _k so the unused variable warning (ruff B007) is resolved without changing loop behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@examples/intermediate/tri_inverse.py`:
- Line 83: Rename the unused loop index k to _k in both for loops that iterate
with pl.range, specifically in the statement "for k, (X_iter, Y_iter) in
pl.range(n_steps, init_values=(X_buf, Y_buf))" and the other equivalent loop
later in the file; update the loop header variable from k to _k so the unused
variable warning (ruff B007) is resolved without changing loop behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b96508c3-708d-437c-b631-1d933a6bb72c
📒 Files selected for processing (1)
examples/intermediate/tri_inverse.py
Add Cayley-Hamilton triangular inverse example
Summary
Adds
examples/intermediate/tri_inverse.py— a new intermediate example thatcomputes
inv(I - A)for strict-lower-triangularAvia the Cayley-Hamiltondoubling algorithm. Targets the chunked-GDN triangular-inverse used in
Qwen3-Next gated delta-rule attention.
Algorithm
Ais strict-lower-triangular, therefore nilpotent (A^n = 0), so thedoubling iteration terminates exactly in
ceil(log2(n))steps: