Skip to content

Add Cayley-Hamilton triangular inverse example#357

Open
ChristosMatzoros wants to merge 5 commits into
hw-native-sys:mainfrom
ChristosMatzoros:triangular_inverse
Open

Add Cayley-Hamilton triangular inverse example#357
ChristosMatzoros wants to merge 5 commits into
hw-native-sys:mainfrom
ChristosMatzoros:triangular_inverse

Conversation

@ChristosMatzoros
Copy link
Copy Markdown

@ChristosMatzoros ChristosMatzoros commented May 22, 2026

Add Cayley-Hamilton triangular inverse example

Summary

Adds examples/intermediate/tri_inverse.py — a new intermediate example that
computes inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton
doubling algorithm. Targets the chunked-GDN triangular-inverse used in
Qwen3-Next gated delta-rule attention.

Algorithm

A is strict-lower-triangular, therefore nilpotent (A^n = 0), so the
doubling iteration terminates exactly in ceil(log2(n)) steps:

X_0 = I,  Y_0 = A
for k in range(ceil(log2(n))):
    X_{k+1} = X_k + X_k @ Y_k     # equivalently X_k (I + Y_k)
    Y_{k+1} = Y_k @ Y_k
# Result: X = inv(I - A)

New intermediate example computing inv(I - A) for strict-lower-triangular
A via the Cayley-Hamilton doubling algorithm (X = X + X@Y, Y = Y@Y for
ceil(log2(n)) iterations).  Targets the chunked-GDN tri-inverse used in
Qwen3-Next.

Validates against torch.linalg.inv(I - A) on Ascend 910B2 (n=128, FP32).
Structure follows the canonical gemm.py pattern: M-parallel pl.parallel
inside pl.at(chunked_loop_optimizer) with K-blocked pl.matmul +
pl.matmul_acc to keep each InCore function inside the 192 KB AIV Vec
budget.  Y is snapshotted to Y_temp before the Y = Y@Y step so the
matmul reads from a stable tensor while writing Y_state.

Test setup:
- Seeded random init (local torch.Generator(seed=0)) for reproducibility.
- Init scale 1/(4*sqrt(n)) targets ||A||_op ~= 0.5, bounding the
  intermediate-tile magnitudes through the 7 doubling iterations.
- Tolerance rtol = atol = 2e-2: 14 chained matmuls on the Ascend 910B2
  cube (FP16 multiply + FP32 accumulate) compound to ~1e-2 relative
  error, and matmul reduction ordering is not bit-deterministic across
  runs.  2e-2 keeps the test green across 5+ consecutive invocations.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4fc35db7-03be-43e5-8fc1-417a78b3aaf3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a complete example script implementing triangular matrix inverse using a Cayley–Hamilton doubling algorithm. It provides single and batched program variants with cube-engine matmul operations, test fixtures, reference torch implementations, and a CLI harness for validation.

Changes

Triangular Inverse Kernel Example

Layer / File(s) Summary
Single-matrix program
examples/intermediate/tri_inverse.py
Module constants and build_tri_inverse_program define the single-matrix variant. The InCore cayley_core loads A and identity, seeds (X, Y) scratch tensors via matmul, then iterates n_steps times updating both tensors with matmul and matmul_acc until Y converges to zero and X becomes the inverse. Orchestration wrapper allocates per-call scratch and invokes the core.
Single-matrix test fixtures
examples/intermediate/tri_inverse.py
build_tensor_specs creates TensorSpec entries for strictly-lower random A, identity matrix, and output tensor with deterministic seeding. golden_tri_inverse computes reference result using torch.linalg.inv(I - A) in FP32.
Batched program
examples/intermediate/tri_inverse.py
build_tri_inverse_program_batched extends the single design to handle batched inputs. InCore cayley_core respecified for batch slicing; Orchestration loops over batch indices, slices flat [batch*n, n] input/output and per-element scratch, invokes the core per element, and assembles results.
Batched test fixtures
examples/intermediate/tri_inverse.py
build_batched_tensor_specs generates flat [batch*n, n] tensors by packing independent strict-lower random matrices per batch element. golden_tri_inverse_batched loops over batch slices and computes per-element inverse reference.
CLI and test harness
examples/intermediate/tri_inverse.py
__main__ parses platform, device, matrix size, and batch options; selects single, batched, or default sweep builds; runs golden.run with compilation dump passes and FP32 tolerances; records failures and exits with error if any case fails.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A triangular scheme so fine,
Cayley–Hamilton makes matrices divine,
Single then batched, each step aligned,
Doublings and matmuls—inverse designed!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main change: adding a Cayley-Hamilton triangular inverse example as a new intermediate example script.
Description check ✅ Passed The description is directly related to the changeset, providing algorithm details, use case context, and implementation scope for the new triangular inverse example.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new example script, examples/intermediate/tri_inverse.py, which implements the Cayley-Hamilton doubling algorithm to compute the inverse of a strict-lower-triangular matrix. The implementation utilizes tiled matrix multiplications and additions within a specialized program structure. Feedback highlights that the current logic assumes the matrix dimension is a multiple of the tile sizes, which should be explicitly validated to prevent out-of-bounds access or incorrect results. Additionally, the reviewer suggests using double buffering for the state tensors to avoid inefficient global memory copies during each doubling iteration.

Comment thread examples/intermediate/tri_inverse.py Outdated
Comment thread examples/intermediate/tri_inverse.py Outdated
…n refinement

- Raise ValueError if n is not a multiple of m_tile / k_tile (was silently
  truncating the K reduction or generating out-of-bounds slices)
- Double-buffer Y across Y_a / Y_b with a Python-level swap each step,
  removing the GM-to-GM snapshot scope (cayley_y_snapshot)
- Add optional Newton iterative refinement (refinement_steps: int = 1)
  applied after the doubling: R = I - (I - A) X; X = X + X @ R.
  Quadratically convergent, crushes FP16-cube roundoff from O(eps) to
  O(eps^2); brings n=64 inside the 2e-2 tolerance and tightens max abs
  error from ~1e-3 to ~3e-4 at n=128
Replace the 10-scope GM-spilling implementation with a single InCore +
Orchestration function carrying X and Y as GM tensor iter_args of one
pl.range loop, mirroring examples/models/07_paged_attention_multi_config.py.

X + X @ Y is fused into one cube pipeline by:
  (a) matmul(X_left, Y_right) -> L0C = X @ Y
  (b) matmul_acc(L0C, X_left, I_right) -> L0C += X
X is reloaded from Mat into Left in step (b) because the L0C -> Mat
tile-move is not supported on a2a3; routing through Mat is the only
way to reuse X as both the matmul-acc accumulator C and the A operand.

Y_new = Y @ Y is a plain matmul. Both seeds (X = I, Y = A) go through
an AIC-only matmul -> store path so no AIV function is spawned (a direct
pl.store of a Mat tile would force a Vec round-trip and bust the AIV
Vec budget).

Supported sizes are n in {64, 128}; n > 128 raises ValueError because
the full-[n,n] tiles overflow L0 caches (Acc 128 KB, Right 64 KB) and
larger n requires K-blocking which is a separate piece of work.

Effect at n=128:
  - source pl.at scopes: 10 -> 0
  - lines: 240 -> 185
  - max abs error vs torch.linalg.inv: 3.6e-4 -> 3.4e-8 (FP32-roundoff floor)
  - per-launch wall (cached chip_callable): essentially unchanged (~12 ms)
build_tri_inverse_program_batched(n, batch) inverts B independent
strict-lower-triangular n x n matrices in one launch. Flat 2D layout
[B*n, n] (matches the dispatch idiom in
pypto3/examples/models/07_paged_attention_multi_config.py:524-538).

Shape:
- Orchestration loops `for b in pl.range(batch, init_values=(out,))`,
  threading the output tensor through the loop so each iter's
  pl.assemble lands on the previous iter's result.
- Per element: slice A and out at offset [b*n, 0], call the existing
  single-matrix cayley_core InCore, and assemble its return into out.
- Y scratch is a [B*n, n] tensor sliced per element; X uses the b'th
  slice of out directly as both working buffer and final result slot.

Validated at B in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 632} for n=128
(B=632 = full Qwen3-Next GDN per-call workload). Max abs error vs
torch.linalg.inv is 8.2e-8 at B=632 (FP32 round-off floor; 0/10354688
elements above the 2e-2 tolerance).

`__main__` now runs a sweep by default (single + B=2 + B=4) to exercise
both code paths on every CI run; `-b N` selects only the batched build
at a specific N.
@ChristosMatzoros
Copy link
Copy Markdown
Author

Performance analysis — engine utilization & host overhead breakdown

What the kernel does

We compute inv(I − A) for strict-lower-triangular A (a nilpotent matrix,
so A^n = 0) via the Cayley–Hamilton doubling identity:

X_{k+1} = X_k + X_k @ Y_k        # accumulate Σ Y^k into X
Y_{k+1} = Y_k @ Y_k              # square Y each step

Starting from X_0 = I, Y_0 = A, after ⌈log₂(n)⌉ = 7 iterations (at
n=128) Y → 0 and X → inv(I − A). All three multiplications per iter
run on the cube engine via pl.matmul (X@Y, Y@Y) and a fused
pl.matmul_acc for the +X term: L0C += X @ I, where reloading X as
the Left operand into L0A and using I as the Right operand into L0B
lets us fold +X into the existing X@Y accumulator in a single L0C
slot — avoiding the unsupported L0C→Mat tile-move on a2/a3 (which is
also why X and Y cannot stay in L0 across iterations and must round-trip
to GM at the end of every doubling step). The memory flow per iter is
GM → L1 Mat → L0A/L0B → cube → L0C → GM, with pl.range(n_steps, init_values=(X, Y)) threading the GM tensor handles for X and Y across
iterations as iter_args. The seeds X_buf ← I and Y_buf ← A are also
laid down via AIC-only matmul detours (I@I and A@I) so the data path
stays on the cube engine — a direct Mat → GM store would route through
the Vec engine and spawn a 512 KB cube-to-vec pipe buffer that exceeds
the 192 KB AIV budget. The batched program tiles this exact InCore over
the batch axis via pl.range(batch) with per-element GM slicing.

Setup

Ascend 910B2 (24 cube cores, 64 GB HBM, FP32). Pushed kernel commit
eaf6eb1. Inputs pre-uploaded to NPU memory (per-call wall = launch +
execute only, no PCIe transfer). 3 warmup + 10 measured iters per shape;
median reported. Tools: msprof (--task-time=on --aic-metrics=PipeUtilization)
and PMU (enable_pmu=2).

1. Per-launch breakdown — host vs NPU (msprof)

For every shape, each launch decomposes into NPU AIC kernel time + host-side dispatch gap.

B NPU dur (ms) host gap (ms) total wall (ms) per-matrix
1 4.12 8.58 12.70 12 696 µs
4 3.56 8.53 12.09 3 023 µs
16 3.56 9.00 12.57 785 µs
64 3.58 8.57 12.15 190 µs
128 3.80 8.58 12.38 97 µs
632 4.73 8.65 13.37 21 µs
  • Host gap is flat at ~8.5 ms regardless of B — AICPU dispatch + sync
    round-trip in simpler.task_interface.run_prepared (C++ binding;
    cProfile confirms 99.5% of host time is here, Python overhead is 0.5%).
  • NPU duration starts at 4.1 ms even for B=1 — fixed kernel floor; only
    grows to 4.7 ms at B=632 (15% more NPU for 632× more matrices).
  • Per-matrix cost amortizes 600× from B=1 → B=632 for only 5% more wall.

2. NPU engine utilization (PMU per-core averages)

shape cube_busy mte1_busy mte2_busy scalar_busy (100 − cube)
n=128 B=1 49.5 % 20.2 % 19.2 % 4.0 % 50.5 %
n=128 B=4 47.3 % 19.2 % 22.5 % 3.7 % 52.7 %
n=128 B=16 46.6 % 19.0 % 23.2 % 3.6 % 53.4 %
n=128 B=64 53.5 % 21.8 % 18.7 % 2.8 % 46.5 %
n=128 B=128 55.0 % 22.4 % 17.6 % 2.3 % 45.0 %
n=64 B=1 22.8 % 20.4 % 24.3 % 13.7 % 77.2 %
n=64 B=64 25.9 % 23.2 % 25.3 % 10.7 % 74.1 %
  • Cube engine 49–55% busy across n=128 cases — best in class compared
    to canonical pypto-lib examples (gemm.py 15.6%, rms_norm.py 12.2%,
    softmax.py 18.4% under the same tools), i.e. ~3× more cube-active
    than the reference set
    .
  • Memory pipeline (mte1 + mte2) stalls account for ~40% of NPU cycles.
    This is the unavoidable cost of the L0C→GM→L1 round-trip per doubling
    iteration — Acc→Mat tile-move is unsupported on a2/a3 so X / Y cannot
    stay in L0 across iters.
  • n=64 drops to ~23% cube — the [64,64] tile underfills the MAC unit
    and overhead ratios dominate; n=128 is the right operating point.

3. Per-engine work ratios (msprof aic_*_ratio)

B aic_mac_ratio aic_mte1_ratio aic_mte2_ratio
1 0.000 0.000 0.000
4 0.000 0.000 0.000
16 0.001 0.001 0.001
64 0.006 0.002 0.002
128 0.012 0.005 0.003
632 0.053 0.021 0.014
  • aic_mac_ratio is the fraction of AIC total cycles spent actually
    in the MAC unit. Even at B=632 only 5.3% of cube cycles are matrix
    multiplication
    — the kernel is dispatch- and memory-bound, not
    compute-bound. The [128,128]×[128,128] matmul is too small to amortize
    per-task scheduling cost; only batching brings ratios up at all.

4. Multi-core dispatch

B cube_utilization (% of 2400) active cores ≈
1 309 3.1 / 24
4 344 3.4 / 24
16 358 3.6 / 24
64 340 3.4 / 24
128 324 3.2 / 24
632 281 2.8 / 24
  • ~3 of 24 cube cores active at every B, set by pl.matmul parallelism
    inside one matrix inversion; the outer pl.range(batch) loop dispatches
    serially on the AICPU lane
    and does not fan out across cores.
  • 87% of cube capacity is idle even at large B. Multi-core batch dispatch
    (e.g. pl.spmd over the batch axis) would unlock this — both refactor
    attempts in this branch compile but produce wrong output; needs
    IR-level investigation.

5. Buffer occupancy at n=128 (FP32)

Cache Capacity Our usage Headroom
L0A (Left) 64 KB 64 KB ([128,128] tile) 0 — saturated
L0B (Right) 64 KB 64 KB ([128,128] tile) 0 — saturated
L0C (Acc) 128 KB 64 KB ([128,128] tile) 64 KB
L1 Mat ≥256 KB 128 KB (A_l1 + I_l1) comfortable
  • L0A/L0B fill exactly at n=128, which is why n>128 is rejected
    without K-blocking — operand tile would overflow.
  • L0C has 64 KB headroom; would fit up to n=181 if K-blocking shrunk the
    operand tiles. A K-blocked variant was prototyped on a side branch
    (n=160 verified end-to-end, ~99 µs/matrix at B=128) but is out of scope
    for this PR.

6. Where the time goes — summary

Per-launch wall = 8.5 ms host gap + 3.5–4.7 ms NPU
                  ─────────────                  ─────────
                  64–71% of wall                  29–36% of wall
                  simpler-side                    kernel-side
                  (async submit)                  (memory + dispatch)

The kernel is at its dispatch-amortization sweet spot for the canonical
Qwen3-Next GDN workload (chunk=128). Per-matrix cost is 21 µs at B=632.
Pushing per-matrix lower requires either (a) reducing the 8.5 ms host gap
in simpler, or (b) lighting up the 21 idle cube cores via parallel batch
dispatch — both outside the kernel-source scope of this PR.

Known limits

  • n > 128[n,n] operand tile overflows L0A/L0B (64 KB). Lift
    requires K-blocked matmul inside the doubling loop.
  • B ≥ 15 in the GDN dispatch shape (n_total = B × H × ⌈T/chunk⌉ ≥ 4740
    sequential tasks per launch) hits PTO2 ring-buffer overflow (AICPU
    507018). CI workaround bumps PTO2_RING_TASK_WINDOW etc.; without it,
    the practical ceiling is B=10 → 51.8 M numel / launch.

@ChristosMatzoros ChristosMatzoros marked this pull request as ready for review May 29, 2026 07:49
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
examples/intermediate/tri_inverse.py (1)

83-83: 💤 Low value

Rename unused loop index k_k.

ruff (B007) flags the unused loop control variable at both loop sites.

♻️ Rename at both sites
-            for k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf)):
+            for _k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf, Y_buf)):

(apply the same rename at line 176)

Also applies to: 176-176

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/intermediate/tri_inverse.py` at line 83, Rename the unused loop
index k to _k in both for loops that iterate with pl.range, specifically in the
statement "for k, (X_iter, Y_iter) in pl.range(n_steps, init_values=(X_buf,
Y_buf))" and the other equivalent loop later in the file; update the loop header
variable from k to _k so the unused variable warning (ruff B007) is resolved
without changing loop behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/intermediate/tri_inverse.py`:
- Line 83: Rename the unused loop index k to _k in both for loops that iterate
with pl.range, specifically in the statement "for k, (X_iter, Y_iter) in
pl.range(n_steps, init_values=(X_buf, Y_buf))" and the other equivalent loop
later in the file; update the loop header variable from k to _k so the unused
variable warning (ruff B007) is resolved without changing loop behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b96508c3-708d-437c-b631-1d933a6bb72c

📥 Commits

Reviewing files that changed from the base of the PR and between be3c794 and eaf6eb1.

📒 Files selected for processing (1)
  • examples/intermediate/tri_inverse.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant