Add Cayley-Hamilton triangular inverse example by ChristosMatzoros · Pull Request #357 · hw-native-sys/pypto-lib

ChristosMatzoros · 2026-05-22T09:47:12Z

Add Cayley-Hamilton triangular inverse example

Summary

Adds examples/intermediate/tri_inverse.py — a new intermediate example that
computes inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton
doubling algorithm. Targets the chunked-GDN triangular-inverse used in
Qwen3-Next gated delta-rule attention.

Algorithm

A is strict-lower-triangular, therefore nilpotent (A^n = 0), so the
doubling iteration terminates exactly in ceil(log2(n)) steps:

X_0 = I,  Y_0 = A
for k in range(ceil(log2(n))):
    X_{k+1} = X_k + X_k @ Y_k     # equivalently X_k (I + Y_k)
    Y_{k+1} = Y_k @ Y_k
# Result: X = inv(I - A)

New intermediate example computing inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton doubling algorithm (X = X + X@Y, Y = Y@Y for ceil(log2(n)) iterations). Targets the chunked-GDN tri-inverse used in Qwen3-Next. Validates against torch.linalg.inv(I - A) on Ascend 910B2 (n=128, FP32). Structure follows the canonical gemm.py pattern: M-parallel pl.parallel inside pl.at(chunked_loop_optimizer) with K-blocked pl.matmul + pl.matmul_acc to keep each InCore function inside the 192 KB AIV Vec budget. Y is snapshotted to Y_temp before the Y = Y@Y step so the matmul reads from a stable tensor while writing Y_state. Test setup: - Seeded random init (local torch.Generator(seed=0)) for reproducibility. - Init scale 1/(4*sqrt(n)) targets ||A||_op ~= 0.5, bounding the intermediate-tile magnitudes through the 7 doubling iterations. - Tolerance rtol = atol = 2e-2: 14 chained matmuls on the Ascend 910B2 cube (FP16 multiply + FP32 accumulate) compound to ~1e-2 relative error, and matmul reduction ordering is not bit-deterministic across runs. 2e-2 keeps the test green across 5+ consecutive invocations.

coderabbitai · 2026-05-22T09:47:20Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4fc35db7-03be-43e5-8fc1-417a78b3aaf3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds a complete example script implementing triangular matrix inverse using a Cayley–Hamilton doubling algorithm. It provides single and batched program variants with cube-engine matmul operations, test fixtures, reference torch implementations, and a CLI harness for validation.

Changes

Triangular Inverse Kernel Example

Layer / File(s)	Summary
Single-matrix program `examples/intermediate/tri_inverse.py`	Module constants and `build_tri_inverse_program` define the single-matrix variant. The InCore `cayley_core` loads `A` and identity, seeds `(X, Y)` scratch tensors via matmul, then iterates `n_steps` times updating both tensors with matmul and matmul_acc until `Y` converges to zero and `X` becomes the inverse. Orchestration wrapper allocates per-call scratch and invokes the core.
Single-matrix test fixtures `examples/intermediate/tri_inverse.py`	`build_tensor_specs` creates TensorSpec entries for strictly-lower random `A`, identity matrix, and output tensor with deterministic seeding. `golden_tri_inverse` computes reference result using `torch.linalg.inv(I - A)` in FP32.
Batched program `examples/intermediate/tri_inverse.py`	`build_tri_inverse_program_batched` extends the single design to handle batched inputs. InCore `cayley_core` respecified for batch slicing; Orchestration loops over batch indices, slices flat `[batch*n, n]` input/output and per-element scratch, invokes the core per element, and assembles results.
Batched test fixtures `examples/intermediate/tri_inverse.py`	`build_batched_tensor_specs` generates flat `[batch*n, n]` tensors by packing independent strict-lower random matrices per batch element. `golden_tri_inverse_batched` loops over batch slices and computes per-element inverse reference.
CLI and test harness `examples/intermediate/tri_inverse.py`	`__main__` parses platform, device, matrix size, and batch options; selects single, batched, or default sweep builds; runs `golden.run` with compilation dump passes and FP32 tolerances; records failures and exits with error if any case fails.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A triangular scheme so fine,
Cayley–Hamilton makes matrices divine,
Single then batched, each step aligned,
Doublings and matmuls—inverse designed!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the main change: adding a Cayley-Hamilton triangular inverse example as a new intermediate example script.
Description check	✅ Passed	The description is directly related to the changeset, providing algorithm details, use case context, and implementation scope for the new triangular inverse example.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a new example script, examples/intermediate/tri_inverse.py, which implements the Cayley-Hamilton doubling algorithm to compute the inverse of a strict-lower-triangular matrix. The implementation utilizes tiled matrix multiplications and additions within a specialized program structure. Feedback highlights that the current logic assumes the matrix dimension is a multiple of the tile sizes, which should be explicitly validated to prevent out-of-bounds access or incorrect results. Additionally, the reviewer suggests using double buffering for the state tensors to avoid inefficient global memory copies during each doubling iteration.

…n refinement - Raise ValueError if n is not a multiple of m_tile / k_tile (was silently truncating the K reduction or generating out-of-bounds slices) - Double-buffer Y across Y_a / Y_b with a Python-level swap each step, removing the GM-to-GM snapshot scope (cayley_y_snapshot) - Add optional Newton iterative refinement (refinement_steps: int = 1) applied after the doubling: R = I - (I - A) X; X = X + X @ R. Quadratically convergent, crushes FP16-cube roundoff from O(eps) to O(eps^2); brings n=64 inside the 2e-2 tolerance and tightens max abs error from ~1e-3 to ~3e-4 at n=128

Replace the 10-scope GM-spilling implementation with a single InCore + Orchestration function carrying X and Y as GM tensor iter_args of one pl.range loop, mirroring examples/models/07_paged_attention_multi_config.py. X + X @ Y is fused into one cube pipeline by: (a) matmul(X_left, Y_right) -> L0C = X @ Y (b) matmul_acc(L0C, X_left, I_right) -> L0C += X X is reloaded from Mat into Left in step (b) because the L0C -> Mat tile-move is not supported on a2a3; routing through Mat is the only way to reuse X as both the matmul-acc accumulator C and the A operand. Y_new = Y @ Y is a plain matmul. Both seeds (X = I, Y = A) go through an AIC-only matmul -> store path so no AIV function is spawned (a direct pl.store of a Mat tile would force a Vec round-trip and bust the AIV Vec budget). Supported sizes are n in {64, 128}; n > 128 raises ValueError because the full-[n,n] tiles overflow L0 caches (Acc 128 KB, Right 64 KB) and larger n requires K-blocking which is a separate piece of work. Effect at n=128: - source pl.at scopes: 10 -> 0 - lines: 240 -> 185 - max abs error vs torch.linalg.inv: 3.6e-4 -> 3.4e-8 (FP32-roundoff floor) - per-launch wall (cached chip_callable): essentially unchanged (~12 ms)

build_tri_inverse_program_batched(n, batch) inverts B independent strict-lower-triangular n x n matrices in one launch. Flat 2D layout [B*n, n] (matches the dispatch idiom in pypto3/examples/models/07_paged_attention_multi_config.py:524-538). Shape: - Orchestration loops `for b in pl.range(batch, init_values=(out,))`, threading the output tensor through the loop so each iter's pl.assemble lands on the previous iter's result. - Per element: slice A and out at offset [b*n, 0], call the existing single-matrix cayley_core InCore, and assemble its return into out. - Y scratch is a [B*n, n] tensor sliced per element; X uses the b'th slice of out directly as both working buffer and final result slot. Validated at B in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 632} for n=128 (B=632 = full Qwen3-Next GDN per-call workload). Max abs error vs torch.linalg.inv is 8.2e-8 at B=632 (FP32 round-off floor; 0/10354688 elements above the 2e-2 tolerance). `__main__` now runs a sweep by default (single + B=2 + B=4) to exercise both code paths on every CI run; `-b N` selects only the batched build at a specific N.

ChristosMatzoros · 2026-05-29T07:27:32Z

Performance analysis — engine utilization & host overhead breakdown

What the kernel does

We compute inv(I − A) for strict-lower-triangular A (a nilpotent matrix,
so A^n = 0) via the Cayley–Hamilton doubling identity:

X_{k+1} = X_k + X_k @ Y_k        # accumulate Σ Y^k into X
Y_{k+1} = Y_k @ Y_k              # square Y each step

Starting from X_0 = I, Y_0 = A, after ⌈log₂(n)⌉ = 7 iterations (at
n=128) Y → 0 and X → inv(I − A). All three multiplications per iter
run on the cube engine via pl.matmul (X@Y, Y@Y) and a fused
pl.matmul_acc for the +X term: L0C += X @ I, where reloading X as
the Left operand into L0A and using I as the Right operand into L0B
lets us fold +X into the existing X@Y accumulator in a single L0C
slot — avoiding the unsupported L0C→Mat tile-move on a2/a3 (which is
also why X and Y cannot stay in L0 across iterations and must round-trip
to GM at the end of every doubling step). The memory flow per iter is
GM → L1 Mat → L0A/L0B → cube → L0C → GM, with pl.range(n_steps, init_values=(X, Y)) threading the GM tensor handles for X and Y across
iterations as iter_args. The seeds X_buf ← I and Y_buf ← A are also
laid down via AIC-only matmul detours (I@I and A@I) so the data path
stays on the cube engine — a direct Mat → GM store would route through
the Vec engine and spawn a 512 KB cube-to-vec pipe buffer that exceeds
the 192 KB AIV budget. The batched program tiles this exact InCore over
the batch axis via pl.range(batch) with per-element GM slicing.

Setup

Ascend 910B2 (24 cube cores, 64 GB HBM, FP32). Pushed kernel commit
eaf6eb1. Inputs pre-uploaded to NPU memory (per-call wall = launch +
execute only, no PCIe transfer). 3 warmup + 10 measured iters per shape;
median reported. Tools: msprof (--task-time=on --aic-metrics=PipeUtilization)
and PMU (enable_pmu=2).

1. Per-launch breakdown — host vs NPU (msprof)

For every shape, each launch decomposes into NPU AIC kernel time + host-side dispatch gap.

B	NPU dur (ms)	host gap (ms)	total wall (ms)	per-matrix
1	4.12	8.58	12.70	12 696 µs
4	3.56	8.53	12.09	3 023 µs
16	3.56	9.00	12.57	785 µs
64	3.58	8.57	12.15	190 µs
128	3.80	8.58	12.38	97 µs
632	4.73	8.65	13.37	21 µs

Host gap is flat at ~8.5 ms regardless of B — AICPU dispatch + sync
round-trip in simpler.task_interface.run_prepared (C++ binding;
cProfile confirms 99.5% of host time is here, Python overhead is 0.5%).
NPU duration starts at 4.1 ms even for B=1 — fixed kernel floor; only
grows to 4.7 ms at B=632 (15% more NPU for 632× more matrices).
Per-matrix cost amortizes 600× from B=1 → B=632 for only 5% more wall.

2. NPU engine utilization (PMU per-core averages)

shape	cube_busy	mte1_busy	mte2_busy	scalar_busy	(100 − cube)
n=128 B=1	49.5 %	20.2 %	19.2 %	4.0 %	50.5 %
n=128 B=4	47.3 %	19.2 %	22.5 %	3.7 %	52.7 %
n=128 B=16	46.6 %	19.0 %	23.2 %	3.6 %	53.4 %
n=128 B=64	53.5 %	21.8 %	18.7 %	2.8 %	46.5 %
n=128 B=128	55.0 %	22.4 %	17.6 %	2.3 %	45.0 %
n=64 B=1	22.8 %	20.4 %	24.3 %	13.7 %	77.2 %
n=64 B=64	25.9 %	23.2 %	25.3 %	10.7 %	74.1 %

Cube engine 49–55% busy across n=128 cases — best in class compared
to canonical pypto-lib examples (gemm.py 15.6%, rms_norm.py 12.2%,
softmax.py 18.4% under the same tools), i.e. ~3× more cube-active
than the reference set.
Memory pipeline (mte1 + mte2) stalls account for ~40% of NPU cycles.
This is the unavoidable cost of the L0C→GM→L1 round-trip per doubling
iteration — Acc→Mat tile-move is unsupported on a2/a3 so X / Y cannot
stay in L0 across iters.
n=64 drops to ~23% cube — the [64,64] tile underfills the MAC unit
and overhead ratios dominate; n=128 is the right operating point.

3. Per-engine work ratios (msprof `aic_*_ratio`)

B	aic_mac_ratio	aic_mte1_ratio	aic_mte2_ratio
1	0.000	0.000	0.000
4	0.000	0.000	0.000
16	0.001	0.001	0.001
64	0.006	0.002	0.002
128	0.012	0.005	0.003
632	0.053	0.021	0.014

aic_mac_ratio is the fraction of AIC total cycles spent actually
in the MAC unit. Even at B=632 only 5.3% of cube cycles are matrix
multiplication — the kernel is dispatch- and memory-bound, not
compute-bound. The [128,128]×[128,128] matmul is too small to amortize
per-task scheduling cost; only batching brings ratios up at all.

4. Multi-core dispatch

B	cube_utilization (% of 2400)	active cores ≈
1	309	3.1 / 24
4	344	3.4 / 24
16	358	3.6 / 24
64	340	3.4 / 24
128	324	3.2 / 24
632	281	2.8 / 24

~3 of 24 cube cores active at every B, set by pl.matmul parallelism
inside one matrix inversion; the outer pl.range(batch) loop dispatches
serially on the AICPU lane and does not fan out across cores.
87% of cube capacity is idle even at large B. Multi-core batch dispatch
(e.g. pl.spmd over the batch axis) would unlock this — both refactor
attempts in this branch compile but produce wrong output; needs
IR-level investigation.

5. Buffer occupancy at n=128 (FP32)

Cache	Capacity	Our usage	Headroom
L0A (Left)	64 KB	64 KB ([128,128] tile)	0 — saturated
L0B (Right)	64 KB	64 KB ([128,128] tile)	0 — saturated
L0C (Acc)	128 KB	64 KB ([128,128] tile)	64 KB
L1 Mat	≥256 KB	128 KB (A_l1 + I_l1)	comfortable

L0A/L0B fill exactly at n=128, which is why n>128 is rejected
without K-blocking — operand tile would overflow.
L0C has 64 KB headroom; would fit up to n=181 if K-blocking shrunk the
operand tiles. A K-blocked variant was prototyped on a side branch
(n=160 verified end-to-end, ~99 µs/matrix at B=128) but is out of scope
for this PR.

6. Where the time goes — summary

Per-launch wall = 8.5 ms host gap + 3.5–4.7 ms NPU
                  ─────────────                  ─────────
                  64–71% of wall                  29–36% of wall
                  simpler-side                    kernel-side
                  (async submit)                  (memory + dispatch)

The kernel is at its dispatch-amortization sweet spot for the canonical
Qwen3-Next GDN workload (chunk=128). Per-matrix cost is 21 µs at B=632.
Pushing per-matrix lower requires either (a) reducing the 8.5 ms host gap
in simpler, or (b) lighting up the 21 idle cube cores via parallel batch
dispatch — both outside the kernel-source scope of this PR.

Known limits

n > 128 — [n,n] operand tile overflows L0A/L0B (64 KB). Lift
requires K-blocked matmul inside the doubling loop.
B ≥ 15 in the GDN dispatch shape (n_total = B × H × ⌈T/chunk⌉ ≥ 4740
sequential tasks per launch) hits PTO2 ring-buffer overflow (AICPU
507018). CI workaround bumps PTO2_RING_TASK_WINDOW etc.; without it,
the practical ceiling is B=10 → 51.8 M numel / launch.

coderabbitai

🧹 Nitpick comments (1)

examples/intermediate/tri_inverse.py (1)

83-83: 💤 Low value

Rename unused loop index k → _k.

ruff (B007) flags the unused loop control variable at both loop sites.

♻️ Rename at both sites

-            for k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf)):
+            for _k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf)):

(apply the same rename at line 176)

Also applies to: 176-176

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/intermediate/tri_inverse.py` at line 83, Rename the unused loop
index k to _k in both for loops that iterate with pl.range, specifically in the
statement "for k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf,
Y_buf))" and the other equivalent loop later in the file; update the loop header
variable from k to _k so the unused variable warning (ruff B007) is resolved
without changing loop behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/intermediate/tri_inverse.py`:
- Line 83: Rename the unused loop index k to _k in both for loops that iterate
with pl.range, specifically in the statement "for k, (X_iter, Y_iter) in
pl.range(n_steps, init_values=(X_buf, Y_buf))" and the other equivalent loop
later in the file; update the loop header variable from k to _k so the unused
variable warning (ruff B007) is resolved without changing loop behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b96508c3-708d-437c-b631-1d933a6bb72c

📥 Commits

Reviewing files that changed from the base of the PR and between be3c794 and eaf6eb1.

📒 Files selected for processing (1)

examples/intermediate/tri_inverse.py

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

Comment thread examples/intermediate/tri_inverse.py Outdated

Comment thread examples/intermediate/tri_inverse.py Outdated

ChristosMatzoros added 3 commits May 26, 2026 09:21

ChristosMatzoros marked this pull request as ready for review May 29, 2026 07:49

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Rename unused loop index k -> _k in tri_inverse (ruff B007)

ecd66f1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Cayley-Hamilton triangular inverse example#357

Add Cayley-Hamilton triangular inverse example#357
ChristosMatzoros wants to merge 5 commits into
hw-native-sys:mainfrom
ChristosMatzoros:triangular_inverse

ChristosMatzoros commented May 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 22, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

ChristosMatzoros commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChristosMatzoros commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add Cayley-Hamilton triangular inverse example

Summary

Algorithm

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

ChristosMatzoros commented May 29, 2026

Performance analysis — engine utilization & host overhead breakdown

What the kernel does

Setup

1. Per-launch breakdown — host vs NPU (msprof)

2. NPU engine utilization (PMU per-core averages)

3. Per-engine work ratios (msprof aic_*_ratio)

4. Multi-core dispatch

5. Buffer occupancy at n=128 (FP32)

6. Where the time goes — summary

Known limits

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChristosMatzoros commented May 22, 2026 •

edited

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading

3. Per-engine work ratios (msprof `aic_*_ratio`)