Skip to content

Symmetry-aware einsum cost rewrite + de-vendor opt_einsum#91

Open
spMohanty wants to merge 135 commits into
mainfrom
feat/symmetry-aware-einsum-cost
Open

Symmetry-aware einsum cost rewrite + de-vendor opt_einsum#91
spMohanty wants to merge 135 commits into
mainfrom
feat/symmetry-aware-einsum-cost

Conversation

@spMohanty
Copy link
Copy Markdown
Collaborator

@spMohanty spMohanty commented May 13, 2026

  1. Rewrite the symmetry-aware cost model to mirror the JS explorer's α/M direct-event model. Cost is now path-independent: (k−1)·∏Mₐ + ∏αₐ summed across connected components, with a 5-regime ladder (trivial → functionalProjection → singleton → young → partitionCount). Applies to:

    • fnp.einsum / einsum_path (symmetry-agnostic path search; α/M total via the new model)
    • np.ufunc.reduce family (sum, prod, max, min, ...) - Tier-1 reductions charge (n_input − n_output_orbits) accumulations; mean adds a n_output_orbits divide
    • np.median / np.percentile / np.quantile - Tier-2 discount (orbit-count selection)
  2. De-vendor opt_einsum. After (1) the fork wasn't doing anything custom for path search, so opt_einsum>=3.3.0,<4.0.0 becomes a runtime dep. flopscope._opt_einsum/ is now a 1.3k-line shim (was 4.5k) that adapts upstream's PathInfo and recomputes per-step FLOPs under flopscope's FMA convention.

Public surface

  • New: flopscope.einsum_accumulation_cost(...) returns an AccumulationCost with per-component breakdown, regime trace, and describe() for LaTeX.
  • New: flopscope.reduction_accumulation_cost(input_shape, axes_summed, symmetry=None) - same model applied to ufunc reductions.
  • New: path_info.accumulation field on einsum_path() results.
  • New settings: fma_cost (default 1; set to 2 for textbook convention - uniform across all flopscope cost surfaces), partition_budget (100k), dimino_budget (500k).

Breaking

  • path_info.optimized_cost changes for expressions with declared symmetry - now the whole-expression α/M total, not the old per-step dense · unique/total sum.
  • Per-step flop_count reverts to dense; symmetry-aware total lives on path_info.accumulation.total.
  • np.ufunc.reduce family (sum, prod, max, min, ...) now charges (n_input − n_output_orbits) accumulations instead of n_input. For unsymmetric inputs this is the issue-Off-by-one in sum/mean reductions #56 off-by-one fix; for symmetric inputs the orbit count gives further savings.
  • np.mean charges sum-cost + n_output_orbits divides (one per unique output element).
  • np.median / np.percentile / np.quantile charge a Tier-2 discount: n_output_orbits selections instead of dense.
  • _flops.analytical_reduction_cost and flopscope.accounting.reduction_cost route through the new model - same call signatures, different (lower) numbers for symmetric cases.
  • FMA_COST constant gone → fma_cost() function.
  • use_inner_symmetry setting removed.
  • Internal deletions (no public re-exports): _opt_einsum/_paths.py, _path_random.py, _parser.py, _blas.py, _testing.py, _typing.py, _subgraph_symmetry.py, _symmetry.py.

Migration notes in CHANGELOG.md.

Issues fully addressed

  • Closes Symmetric einsum FLOP counting: only count multiplications + use symmetry group of unsummed tensor #32 - Symmetric einsum FLOP counting. The α/M direct-event model is the architectural answer. It uses the full pointwise symmetry group G_pt (visible and summed side; declared + identical-operand-swap-induced) and counts both unique multiplications and accumulation events. The model is path-independent by construction.

    Numerical expectations in the original issue reflect an earlier framing ("only count multiplications, use symmetry group of unsummed tensor"). The new model is strictly more comprehensive - for example, the issue's einsum('ij,k->ik', B_sym2, C) at n=10 expects 550, the new model gives 1,550 (because it counts accumulation events on top of the 550 unique multiplications, and tracks symmetry on the full pointwise group). The shift supersedes the issue's specific numerical examples.

  • Closes Off-by-one in sum/mean reductions #56 - Off-by-one in sum/mean reductions. sum(A) for A.shape = (10,) now charges 9 flops (the n − 1 accumulations the issue asked for), not 10. mean charges sum-cost + 1 divide. Falls out of applying the α/M direct-event model to the reduction path: the first input element is a free copy, only the remaining n − 1 accumulations cost.

Issues partially addressed

  • Short-circuit einsum pre-cache symmetry/identity work before path lookup #26 - Short-circuit einsum pre-cache work.

    • landed: _path_cache key is now (subscripts, shapes, optimize, fma_cost) only - no per-op symmetry fingerprint, no identity-pattern grouping in the key. The pre-cache work for path search has been eliminated as a side effect of the symmetry-oracle removal.
    • not landed: the same fingerprinting/identity work still runs on the new _accumulation_cache path inside einsum(). The "no SymmetricTensor operands present" fast path inside _get_accumulation_cost could still skip fingerprint materialization entirely.
  • Repeated-operand outer should use symmetry-aware FLOP counting #65 - Repeated-operand outer FLOP counting.

    • landed: when callers go through fnp.einsum("i,j->ij", v, v), the new α/M model returns a symmetry-aware count via the operand-swap-induced S₂{i,j} on the output.
    • not landed: fnp.outer(v, v) itself is unchanged. The remaining work is to alias outer repeated-operand cases to the einsum cost path.

Explicitly out of scope

  • Deal with within-tensor summed indices #55 - Within-tensor summed indices. The single-tensor lowering trick this issue proposes (einsum("ik,jl->ij", A, B)einsum("i,j->ij", A.sum(axis=1), B.sum(axis=1))) is orthogonal to the α/M model. The model rigorously counts direct events on the full group action; the lowering captures algebraic equivalence on single-tensor reductions. The two could compose in a future pass.
  • Make symmetrize cost 1 flop per unique output element #34 - symmetrize cost model. Different op, not touched by this rewrite.

Test plan

  • CI green
  • uv run pytest tests/accumulation/ -q → 376 reduction + earlier accumulation tests pass (verified locally in 5.9s)
  • uv run pytest tests/accumulation/test_js_parity.py -v → 22/22 JS preset parity (verified)
  • uv run pytest tests/test_reduction_integration.py tests/test_method_tracking.py -q → reduction integration green
  • Cold-call benchmark ≤ 0.5 s budget (verified: 1.3 ms worst-case across the 22 examples; partition-counter S_6 ramp tops out at 5.7 s as a separate stress test)

spMohanty added 30 commits May 7, 2026 20:59
Adds empty stub modules for the new symmetry-aware einsum cost machinery
that mirrors website/components/symmetry-aware-einsum-contractions/engine/.
Subsequent tasks fill in each module.
H = Stab_G(V)|_V helpers and canonical tuple operations. Direct port of
website/components/symmetry-aware-einsum-contractions/engine/outputOrbit.js
with Python idioms (string key dedup, Sequence types, no JS Map/Set).
…perms

The previous test passed two identical permutations, only proving hash dedup
on identical inputs. Replace with two globally distinct permutations whose
restrictions to V both collapse to the local identity (kernel of restriction
is non-trivial). This actually validates the invariant claimed in
restrict_stabilizer_to_positions's docstring: |H| <= |Stab_G(V)| because
distinct g, g' in Stab_G(V) can yield the same g|_V.
Direct port of sizeAware/burnside.js. Validates cycle-size invariants
(all labels in a cycle share a size) and Burnside-sum divisibility.
Direct port of shapeLayer.js. Four shapes: trivial / allVisible /
allSummed / mixed. Diagnostic label that accompanies the regime ID.
…n.py

Falling factorial, partition normalization, typed-partition enumeration with
domain-class restriction (only same-sized positions can merge), labeling counts.
Cached by sizes tuple via lru_cache(maxsize=256).

Orbit-rep / induced-block-action utilities follow in the next commit.
Completes the port of partition/typedPartitions.js. Partition-orbit reps,
induced-block-permutation (uses IMAGE on blocks, not raw stabilizer order),
prefix-map dedup, output-side orbit count under H. These are the pieces the
partitionCount regime calls per typed partition.
partition_budget (default 100_000): caps typed-partition enumeration in the
partitionCount regime. Components exceeding the budget fall back to dense cost.

dimino_budget (default 500_000): caps whole-expression G_pt closure to bound
pathological declared-symmetry inputs.

Also adds set_setting() as a thin public wrapper around configure() and a
minimal _VALIDATORS map (used only for the two new budget keys) that rejects
negative integers at set time.
RegimeContext, Verdict, RegimeOutput, Regime, RegimeStep, AccumulationResult.
All frozen dataclasses; Literal types for regime_id and shape match the
JS engine's string vocabulary.
Direct port of regimes/functionalProjection.js. Fires when every g preserves
V as a set; computes alpha = M via size-aware Burnside. Covers JS appendix
B.2 (allVisible), B.3 (allSummed), and B.4 (mixed-but-functional).
Direct port of regimes/singleton.js. Closed-form weighted Burnside + inclusion-
exclusion over the visible label's G-orbit. Used for the |V|=1 case after
functional-projection refuses (i.e. when projection branches on the single
output coordinate's orbit).
Direct port of regimes/young.js. Closed-form multiset formula
alpha = C(n+|V|-1, |V|) * C(n+|W|-1, |W|) when G is the full symmetric
group on L_c, both V and W are nonempty, |V| >= 2, and all sizes agree.
Direct port of regimes/partitionCount.js. Iterates typed equality patterns
up to G-equivalence; per-pattern contribution is
  (typed_labelings / |Ḡ_x̃|) * |A_x̃ / H|.

Sub-trace records per-partition counts for diagnostic display + parity tests.
This is the general fallback that handles mixed-shape cases the closed-form
regimes (singleton, young, functionalProjection) refuse.
Direct port of accumulationCount.js. Three-stage dispatch:
  1. trivial short-circuit (|G| <= 1)
  2a. functionalProjection priority (covers allVisible/allSummed/mixed-functional)
  2b. mixed ladder: singleton -> young -> partitionCount
Fallthrough returns regime_id='unavailable' (brute-force disabled by policy).

Trace captures every refused regime with its reason for debugging.
Direct port of algorithm.js#buildBipartite + buildIncidenceMatrix. One
U-vertex per axis of each operand (no axis merging — per-operand symmetry
handled by the wreath enumeration in the next task). Column fingerprints
and fp_to_labels reverse map for derive_pi_canonical.
Direct port of wreath.js. Enumerates Pi_i (H_i wr S_{m_i}) where H_i is
operand i's declared axis symmetry and m_i is its multiplicity. Produces
row permutations on U-vertices for the sigma-loop. Supports None / symmetric
/ cyclic / dihedral / SymmetryGroup-typed declarations.
Direct port of algorithm.js#runSigmaLoop and derivePi. For each wreath
element sigma, applies sigma to the incidence matrix, derives a label
permutation pi via column-fingerprint matching, classifies pi as identity
/ v-only / w-only / cross-v-w. Cross-v-w actions are valid (a deliberate
deviation from the deprecated partition-preserving rejection).
Direct port of fullGroup.js#buildFullGroup and algorithm.js#classifyGroupName.
Collects valid pi's, dedupes by array form, greedy minimal-generating-set
selection, dimino closure, classifies the resulting group name (S_n{...},
C_n{...}, D_n{...}, Z2{...}, S2{...}, or PermGroup<...>).
Direct port of componentDecomposition.js. Builds the label-interaction graph
from G_pt's generators (labels coupled by any single generator), unions via
union-find, restricts each generator to each component, runs dimino on the
restriction, classifies the per-component group name. Each Component carries
its own labels, va/wa split, sizes, visible_positions, generators, and elements
ready for the regime ladder.
ComponentCost wraps a per-component AccumulationResult with M-side Burnside,
dense_count, and group metadata. run_ladder_per_component is the pure
transformation that both einsum and future reduction code paths reuse.
aggregate_einsum implements total = (k-1) * prod(M) + prod(alpha). When any
component is unavailable, total falls back to k * dense_baseline (no-symmetry
direct-event count) and a CostFallbackWarning fires. Gaming-resistant:
exceeding partition budget never lowers the charge.
Wires the whole detection + decomposition + ladder + aggregation pipeline.
Inputs: subscripts, shapes, per-operand symmetries, identity pattern, output
subscript. Output: AccumulationCost with per-component breakdown and total.
Body raises NotImplementedError pointing to a future sprint. Locking the
signature now lets us reuse run_ladder_per_component and decompose_into_components
unchanged for ufunc.reduce when that sprint lands.
LaTeX strings for each regime + shape are stored as module-level dicts and
looked up by describe(). Keeps dataclass instances small while preserving
diagnostic strings for users / IDE completion.
Pure inspection function: extracts per-operand SymmetryGroup from
SymmetricTensor inputs, builds an identity_pattern from id() groupings,
delegates to compute_accumulation_cost. Re-exported from flopscope as
einsum_accumulation_cost, AccumulationCost, ComponentCost, RegimeStep.
Adds an accumulation field plus a property-based optimized_cost that returns
the accumulation total when attached. Falls back to the inner PathInfo's
legacy optimized_cost otherwise. __getattr__ forwards all other field accesses.
Path search now uses stock opt_einsum behavior (no SubgraphSymmetryOracle).
The path cache keys only on (subscripts, shapes, optimize). Symmetry-aware
cost computation moves to a separate accumulation cache wired in the next task.

Test churn for tests/test_einsum*.py is handled in Tasks 34-37.
Mirrors _path_cache but caches AccumulationCost objects keyed by
(canonical_subscripts, shapes, sym_fingerprint, identity_pattern).
Decision Q1 in the spec: two separate caches so path-cache misses don't
trigger accumulation recomputation and vice versa.
…imized_cost

Path search remains stock opt_einsum (Task 26). Now the BudgetContext deduction
uses the new whole-expression direct-event count from _get_accumulation_cost.
PathInfo wraps in FlopscopePathInfo so .accumulation surfaces to callers.
…used)

Path search no longer threads a symmetry oracle (Task 26). The whole
SubgraphSymmetryOracle module + its test file are dead code. Deletion-
safety test asserts the module and its public name can no longer be imported.
spMohanty added 29 commits May 19, 2026 20:56
Adds _walk_path_and_aggregate which decomposes k>=3 einsums into binary
contractions via opt_einsum.contract_path, computing per-step
AccumulationCost by calling compute_accumulation_cost recursively for
each binary step (k=2 path). Fixes Wilson's bug where a 3-operand chain
ij,jk,kl->il was charging 29900 (fictitious 3-way cost) instead of 3800
(two binary matmuls at 2*n^3 - n^2 each). Full-expression per_component
is preserved in the returned AccumulationCost so JS-parity tests remain
unaffected.
Route _walk_path_and_aggregate binary sub-steps through
get_accumulation_cost_cached so shared steps (e.g. "ij,jk->ik")
hit the LRU cache across different top-level expressions.
Add test_per_step_cache_hits_across_expressions to assert >=1 hit
when two 3-operand chains share a binary sub-step.
…ulation_cost

Per-step costs in build_path_info now call get_accumulation_cost_cached via
symmetric_flop_count, making info.steps[i].flop_cost == info.accumulation.per_step[i].total
by construction. Updated test_build_path_info expected values (60→105, 120→105)
to reflect the accumulation formula (fma_cost-independent) and added parity
test confirming the two layers agree.
…term

The Task 7 refactor exposed a latent bug: aggregate_einsum never applied
fma_cost() to the mu = (k-1)·M multiplication term, so configure(fma_cost=2)
had no effect on accumulation-based costs.

Fix: multiply mu by _fma_cost() in aggregate_einsum. The alpha − num_output_orbits
accumulation term is intentionally NOT multiplied — accumulation adds are 1 op
regardless of FMA convention.

Also add fma_cost to the _accumulation_cache key so that calls under different
fma_cost settings produce distinct cache entries instead of returning stale results.

Updated tests:
- test_build_path_info_uses_fma_two_when_configured: 105 → 165 (correct for fma=2)
- test_fma_cost_in_path_cache_key: 8/16 → 12/20 (correct with alpha term)
- test_fma_cost_affects_multiplication_term_only: new regression test (12 and 20)
…k 7 cascade)

Fix real bug in _walk_path_and_aggregate: m_total was computed as the product
of per-step intermediate m_total values, which always exceeds dense_baseline and
makes _has_savings() return False for all multi-operand expressions. Fix uses
prod(c.m for c in full_expression_component_costs) instead, which correctly
reflects the unique output count of the full k-ary expression.

Update 4 test assertions that hardcoded pre-path-aware single-step formula values
(speedup 5x→2.778x, savings 80%→64%, optimized_cost 6380→20000 for triple S3 case).
…kes it unnecessary)

inner.optimized_cost == accumulation.total by construction after §6.4
reconciliation; simple delegation to fmt() is sufficient. Update test to
remove the now-invalid assertion that naive_cost must not appear in the header
(naive_cost is not reconciled, only optimized_cost is).
Compute dense_flop_cost (helpers.flop_count baseline) and
symmetry_savings (clamped to [0,1]) in the build_path_info per-step
loop; add test asserting all steps have non-zero dense_flop_cost and
in-range savings.
Adds _try_named_group, _fmt_generators, _fmt_sym, _fmt_step_sym, and
_fmt_unique_dense helpers; replaces the stripped format_table body with
main's full version including dense_flops, savings %, and symmetry columns.
Add missing _RICH_SYMMETRY_STYLES module-level dict and the
_rich_symmetry_token_text / _rich_step_sym_text helpers, then
replace the branch's stripped _rich_step_table with main's
symmetry-aware version (dense_flops, savings, unique/total, and
symmetry columns). Smoke test added for info.print(verbose=False/True).
…m main

Also restores three helper files required by _paths.py:
_subgraph_symmetry.py (529 lines), _symmetry.py (134 lines), _typing.py (37 lines).
Updates test_deletion_safety.py to reflect that these modules are now present again.
…edy) from main

Update test_deletion_safety.py: flip is-gone assertion to is-importable for _path_random.
…rent branch

- Cherry-pick 625 lines from origin/main:tests/test_opt_einsum_paths.py
- Fix 3 API-drift import errors:
    * _parser.get_symbol  -> opt_einsum.parser.get_symbol (deleted in 9c44177)
    * _testing.build_shapes/rand_equation -> opt_einsum.testing (deleted same)
    * _typing.* unchanged (still present on branch)
- Add PEP-562 __getattr__ hook to _opt_einsum/__init__.py exposing
  oe._helpers / oe._paths / oe._path_random without shadowing the local
  _helpers submodule (shadowing broke naive_cost calculation in 3 unrelated tests)
- Adapt 10 stale-assertion failures:
    * test_custom_dp_can_set_minimize: update 7 expected FMA-2 costs to FMA-1 values
    * test_custom_random_greedy / test_custom_branchbound / test_parallel_random_greedy:
      remove opt_cost == optimizer.best["flops"] assertions with TODO comments
      (FMA convention mismatch; path correctness is still verified)

125 passed, 2 skipped — full suite 0 failed.
…ggregate

- Build SubgraphSymmetryOracle once per k>=3 einsum from per_op_symmetries and identity_pattern
- For each binary step, query the oracle per input subset to derive sym_fingerprint and step_identity_pattern
- Propagate step_identity_pattern (restricted from original expression's identity groups) to per-step cache calls
- Sync inner.steps[i].flop_cost to acc_step.total in FlopscopePathInfo.from_inner to maintain reconciliation invariant
- Regression test: ij,jk,ki->ijk with S2 symmetric A gives cost.total=104 < 128 (was 128 with dense intermediates)
- Update tests for Task 17b tighter values: ij,ik,il->jkl 20000→11000, ijk,ai,bj->abk sym<dense
Insert a regime column (between subscript and flops) into PathInfo.format_table
and the matching FlopscopePathInfo.__str__ renderer. The column shows the
per-step regime id (trivial, functionalProjection, singleton, young,
partitionCount, unavailable) drawn from accumulation.per_step[i].per_component[0].
_fmt_step_regime returns '-' defensively when _regime is absent.
Mirror format_table's regime column in the Rich variant.  The column is
shown conditionally (only when any step carries _regime, mirroring the
any_unique / any_regime pattern) so the verbose-detail layout is not
compressed on narrow terminals.
Attaches _acc_step to each StepInfo in FlopscopePathInfo.__str__, then
reads m_total/alpha/num_output_orbits from it in both the plain-text
format_table verbose branch and the Rich _rich_verbose_detail_text helper.
…attern

Symmetric and dense operands sharing the same subscripts/shapes now get
distinct _path_cache entries, preventing a dense-optimal path from being
silently reused for symmetric inputs once symmetry-aware path search is
enabled. The cache key is extended with a per-operand symmetry fingerprint
(tuple of SymmetryGroup-or-None) and the identity_pattern. Test added to
verify the symmetric call is always a cache miss relative to the dense call.
…tributes

All accessed via __getattr__ hook returning upstream opt_einsum modules;
pyright resolves them as object. Also add explicit importlib.util import
in test_path_info_renderer.py to satisfy reportAttributeAccessIssue.
@spMohanty spMohanty force-pushed the feat/symmetry-aware-einsum-cost branch from 86f7952 to feecd54 Compare May 19, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Off-by-one in sum/mean reductions Symmetric einsum FLOP counting: only count multiplications + use symmetry group of unsummed tensor

2 participants