
Add a deterministic mutation-testing gate for the model and engine core #215

Merged
astrogilda merged 17 commits into main from chore/mutation-config
Jun 24, 2026

Conversation

@astrogilda

Owner

What this adds

A deterministic mutation-testing gate for the model and engine core. It runs every
generated mutant in its own subprocess (the in-process runner deadlocks the compiled
parallel kernels), maps each mutant to only the unit tests that cover its function, and
fails only when a surviving mutant appears whose identity is not on the accepted-equivalent
allowlist.
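
The pass/fail rule described above can be sketched as a set difference. This is a minimal illustration, not the gate's actual code: the function names and the plain-string identities are assumptions made for the example.

```python
from typing import Iterable


def new_survivors(survivor_ids: Iterable[str], allowlist: Iterable[str]) -> list[str]:
    """Return survivor identities that are not accepted equivalents, sorted."""
    return sorted(set(survivor_ids) - set(allowlist))


def gate_exit_code(survivor_ids: Iterable[str], allowlist: Iterable[str]) -> int:
    """Return 0 when every survivor is allowlisted, 1 when a new one appears."""
    return 1 if new_survivors(survivor_ids, allowlist) else 0
```

Under this rule an allowlisted survivor never fails the run, while a single unknown identity does.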

Why

A passing test suite does not measure whether the tests actually detect wrong behavior.
The gate closes that blind spot for the parts of the library where a silent regression is
hardest to notice: the recursive AR, VAR, and ARIMA paths, the batched simulation engines,
and the fitting and stability code.

Results from a clean run

  • 1024 of 1064 mutants killed.
  • The 40 survivors are each a genuine behavioral equivalent (no test can distinguish the
    mutant from the original), verified one by one and catalogued with a proof.
  • Full run on a 16-core machine in about 18 minutes.

What is in the change

  • 16 deterministic tests that pin the contracts the property suite previously covered only
    by chance: error type, code, message, and context payloads; slice and lag orientation;
    residual centering; and the float32 simulation dtype.
  • A function-precise coverage map so each mutant runs against only its covering unit tests.
  • The accepted-equivalent allowlist (40 refactor-stable identities) plus a written proof for
    every entry.
  • A nightly workflow that regenerates and runs the gate.

Notes

The gate relies on the mutation-ratchet-core package for survivor identity and the
new-survivor diff. Identities are derived from the enclosing statement and the mutated
tokens, so they survive reformatting and line moves and carry no dependence on the
generator's mutant numbering.
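
One way such an identity could be built is sketched below. It is an assumption-laden illustration, not the mutation-ratchet-core implementation: `statement_hash` and `mutant_identity` are hypothetical names, and the real package may normalize statements and diffs differently.

```python
import ast
import hashlib


def statement_hash(stmt_source: str) -> str:
    """Hash a statement's AST so whitespace and comments do not matter."""
    tree = ast.parse(stmt_source)
    # ast.dump omits line and column attributes by default, so moving or
    # reformatting the statement leaves the dump unchanged.
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]


def mutant_identity(stmt_source: str, original_token: str, mutated_token: str) -> str:
    """Combine the enclosing-statement hash with a digest of the token change."""
    diff = f"{original_token}->{mutated_token}".encode()
    return f"{statement_hash(stmt_source)}:{hashlib.sha256(diff).hexdigest()[:12]}"
```

With this construction, `x = a + b` and `x  =  a+b  # comment` yield the same identity for the same token change, and nothing in the identity refers to a mutant number.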

Make the mutation-score ratchet correct and fast without the interpreted
numba penalty.

- mutate_only_covered_lines: skip lines the test suite never runs (an
  uncovered line can only ever produce a guaranteed survivor).
- do_not_mutate_patterns: suppress the systemic non-behavioral equivalents
  at the source-line level (error code= and context= kwargs, dtype=np.*
  defaults; dunder and logging patterns kept for generality). In mutmut 3.6
  these are Python regexes matched against each source line.
- pytest timeout (--timeout=60, thread method) so a non-terminating mutant
  is killed instead of stalling the run; add pytest-timeout to the dev extra.
- Two mutmut-profile-gated test fixtures: an ephemeral per-session
  NUMBA_CACHE_DIR so the compiled VAR kernel recompiles its mutated source
  fresh at native speed (replacing the slow JIT-disable workaround), and a
  clamp on the statsmodels MLE solver maxiter so a convergence-degrading
  mutant fails fast. Both are inert outside the mutation-testing profile.
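
The profile-gated cache directory could look roughly like the sketch below. It is written as a plain context manager so it runs without pytest; in a real conftest it would presumably be wrapped in a session-scoped fixture. The function name and the choice of HYPOTHESIS_PROFILE as the gate variable are assumptions, and numba reads NUMBA_CACHE_DIR when it is imported, so the variable has to be set before that point.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def ephemeral_numba_cache(profile_var: str = "HYPOTHESIS_PROFILE",
                          profile: str = "mutmut") -> Iterator[Optional[str]]:
    """Point NUMBA_CACHE_DIR at a throwaway directory under the mutmut profile."""
    if os.environ.get(profile_var) != profile:
        yield None  # inert outside the mutation-testing profile
        return
    previous = os.environ.get("NUMBA_CACHE_DIR")
    with tempfile.TemporaryDirectory(prefix="numba-cache-") as cache_dir:
        os.environ["NUMBA_CACHE_DIR"] = cache_dir
        try:
            yield cache_dir
        finally:
            # Restore the caller's setting before the directory is removed.
            if previous is None:
                os.environ.pop("NUMBA_CACHE_DIR", None)
            else:
                os.environ["NUMBA_CACHE_DIR"] = previous
```

A fresh directory per session means no stale compiled kernel from an earlier mutant can be picked up from the on-disk cache.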

The run command becomes HYPOTHESIS_PROFILE=mutmut mutmut run (JIT stays on).
mutmut 3.x runs pytest in-process and, when mutate_only_covered_lines is on,
unloads every module imported during its coverage pass so the source reloads
fresh per mutant. That unload is fatal to single-phase-init C extensions: numpy
raises 'cannot load module more than once per process' and numba raises
'duplicate registration for PolynomialType', aborting the run before any mutant
is tested. Add tools/mutmut_sitecustomize, put on PYTHONPATH for the run, which
patches the unload to keep numpy/numba/scipy/statsmodels resident while still
reloading tsbootstrap. It prefers the shared mutation-ratchet-core helper when
installed and falls back to an inline copy for standalone runs.
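
The idea behind the patched unload can be sketched as a filter over the module table. This is not the contents of tools/mutmut_sitecustomize; the function names are hypothetical, and how the real patch hooks into mutmut is not shown here.

```python
import sys
from typing import Iterable, MutableMapping

# Assumption: packages whose C extensions cannot be imported twice per process.
RESIDENT_PREFIXES = ("numpy", "numba", "scipy", "statsmodels")


def is_resident(name: str, prefixes: Iterable[str] = RESIDENT_PREFIXES) -> bool:
    """True for a protected package or any of its submodules."""
    return any(name == p or name.startswith(p + ".") for p in prefixes)


def unload_except_resident(names: Iterable[str],
                           modules: MutableMapping[str, object] = sys.modules) -> list[str]:
    """Drop the given modules from the table unless they must stay resident."""
    dropped = []
    for name in list(names):
        if name in modules and not is_resident(name):
            del modules[name]
            dropped.append(name)
    return dropped
```

Matching on `prefix + "."` rather than a bare prefix keeps an unrelated package whose name merely starts with `numpy` from being treated as resident.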

Also drop the statsmodels solver-maxiter clamp fixture: capping maxiter changed
the numerics on the unmutated code and broke the clean-baseline ARIMA golden,
which mutmut requires to pass. Spinning mutants are bounded by pytest-timeout
instead. Document that the full run needs an idle or dedicated machine, since the
in-process stats pass can hit the per-test timeout under heavy CPU contention.

A full mutation run over the engine and model core surfaced 41 surviving mutants
concentrated in select_ar_order, simulate_ar, the ARIMA recursion helpers, and
the exogenous-regression beta path. The existing tests asserted only loose bounds
(order ranges, penalty inequalities) and never exercised the select_ar_order
default upper-bound path, so order-selection, information-criterion routing, and
filter-state orientation were unguarded.

Add targeted tests that pin exact behavior: the exact selected order per criterion
on known series, the exact information-criterion penalty coefficients, the default
and clamped search bounds, the design-matrix lag columns, the sim-dtype cast, and
the AR initial-state lag orientation. Each kill was confirmed by applying the
mutant diff to the source and observing the matching test fail.
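
A toy illustration of why loose bounds let mutants survive while exact pins kill them. The penalty here is the textbook BIC term k·ln(n), used only as an example; the library's own criterion code and the mutant shown are assumptions.

```python
import math


def bic_penalty(k: int, n: int) -> float:
    """Textbook BIC penalty term: k * ln(n)."""
    return k * math.log(n)


def bic_penalty_mutant(k: int, n: int) -> float:
    """Hypothetical mutant: the multiplication became an addition."""
    return k + math.log(n)


def loose_check(penalty) -> bool:
    # Loose bound: the penalty grows with the number of parameters.
    return penalty(3, 100) > penalty(2, 100)


def exact_check(penalty) -> bool:
    # Exact pin: the coefficient on k is ln(n).
    return math.isclose(penalty(3, 100), 3 * math.log(100))
```

Both versions satisfy `loose_check`, so the mutant survives a test built on it; only `exact_check` separates the original from the mutant.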

Genuine algorithmic equivalents that no test can kill (constant information-
criterion offsets, argmin-invariant changes, default-argument routing, numpy and
scipy aliasing) are catalogued in tests/mutation_equivalents.md as the ratchet
baseline rather than tested.

In-process mutmut (3.x) cannot run this numba parallel-JIT codebase: the
parallel-kernel LLVM finalize deadlocks inside mutmut's repeated in-process
pytest model (reproduced on a clean 16-core box, so it is a deadlock, not CPU
contention). Document the root cause in [tool.mutmut] and point to the locked
path: a subprocess-per-mutant runner that keeps mutmut for AST generation only
and executes each mutant in an isolated process, run nightly and non-blocking.
The durable value already landed is the killing tests and the equivalents
registry, not the in-process runner.

Drives the reliable mutation ratchet: generate mutants with mutmut, map each to
its covering test files by source module, execute every mutant in an isolated
subprocess via mutation_ratchet_core.subprocess_runner, and report
killed/survived/timeout. Replaces the in-process mutmut run that deadlocks the
numba parallel kernels. Runs on a remote/dedicated box, never the local machine.
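
A minimal sketch of the subprocess-per-mutant idea, using only the standard library. It is not the mutation_ratchet_core.subprocess_runner API: `pytest_command` and `run_mutant` are hypothetical names, and how a given mutant is activated inside the child process is left out.

```python
import subprocess
import sys
from typing import Sequence


def pytest_command(test_ids: Sequence[str]) -> list[str]:
    """Build a pytest invocation that stops at the first failing test."""
    return [sys.executable, "-m", "pytest", "-x", "-q", *test_ids]


def run_mutant(cmd: Sequence[str], timeout: float) -> str:
    """Run one mutant's test command in its own process and classify it."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    # A failing test command means some test detected the mutation.
    return "survived" if result.returncode == 0 else "killed"
```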

mutmut results lists nothing on a freshly generated (untested) store, so the gate
saw zero mutants and falsely reported a clean run. Enumerate every mutant name
statically from the trampolined mutants/src tree instead (independent of result
status): 925 mutants across the engine+model scope.

mutate_only_covered_lines=true makes mutmut run an in-process coverage pass (the
suite under coverage.py) before generating, which hits the same numba parallel
LLVM deadlock as the in-process executor, yields empty coverage, and generates
ZERO mutants. Set it false: generation becomes pure libcst (no suite run, no
deadlock) over every in-scope line; the subprocess gate runs them all. Also make
the gate driver exit nonzero when zero mutants are enumerated, so a generation
failure can never again masquerade as a clean pass.
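
Static enumeration plus the zero-mutant guard might be sketched as follows. The naming pattern for trampolined mutant functions (`<name>__mutmut_<n>`) is an assumption about the generated source, and both function names are hypothetical.

```python
import re
import sys

# Assumption: trampolined mutant functions are named like `x_func__mutmut_3`.
MUTANT_DEF = re.compile(r"^\s*def\s+(\w+__mutmut_\d+)\s*\(", re.MULTILINE)


def enumerate_mutants(source_text: str) -> list[str]:
    """List mutant function names found in one trampolined source file."""
    return MUTANT_DEF.findall(source_text)


def check_enumerated(names: list[str]) -> None:
    """Abort with a nonzero status when generation produced nothing."""
    if not names:
        # sys.exit with a message prints it to stderr and exits with status 1.
        sys.exit("no mutants enumerated: generation failed, not a clean pass")
```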

…htly CI

Driver writes a full per-mutant outcomes JSON (survivors AND timeouts itemized)
and maps each survivor to a refactor-stable AST identity via mutation-ratchet-core,
gating against a committed allowlist of accepted equivalents so the gate fails
only on NEW survivors. Add a nightly + on-demand mutation workflow that runs the
subprocess gate on the GitHub runner (a dedicated CI box, never a developer
machine). Allowlist generated post-triage via --update-allowlist.

Close ~46 surviving-mutant test gaps from the full subprocess mutation run with
~25 deterministic killing tests across the AR/VAR/ARIMA/batched-engine clusters:
stability guards (exact spectral-radius threshold, error code+context payloads),
order-selection and order-too-large guards, residual centering, innovation and
initial-block resample ranges, dtype honoring, and exact engine recurrences.
~29 further survivors are genuine algorithmic equivalents catalogued for the
new-survivor allowlist.

The equivariance property test could draw (via Hypothesis) a series whose AR
design is collinear; the fit then correctly raises TSB_PERFECT_COLLINEARITY,
which is not an equivariance violation. Catch InputDataError and reject the
example via assume(False), matching the existing pattern used by the other
property tests in this file. Surfaced by the default (random) Hypothesis profile;
the derandomized mutmut profile did not hit it.

The broad module->test map ran the whole property suite per mutant, so many
mutants hit the per-mutant timeout and timeouts masked real survivors. Generate a
function-level test-impact map from coverage.py per-test line contexts
(tools/gen_test_impact.py -> tools/mutation_test_impact.json) and have the gate
run each mutant against ONLY the tests covering its function (median 21 tests per
mutant vs ~150 before). Deterministic, no timeout-masking. Raise the per-mutant
timeout to 300s.
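
The function-level map can be sketched as a join between function line ranges and per-line test contexts. This is an illustration, not tools/gen_test_impact.py: the `line_contexts` input is assumed to be a line-number-to-test-names mapping of the kind per-test coverage contexts provide, and nested or same-named functions are not handled.

```python
import ast
from collections import defaultdict


def function_ranges(source: str) -> dict[str, range]:
    """Map each function name to the source lines it spans."""
    ranges = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            ranges[node.name] = range(node.lineno, node.end_lineno + 1)
    return ranges


def impact_map(source: str, line_contexts: dict[int, set[str]]) -> dict[str, list[str]]:
    """Map each function to the tests that executed any of its lines."""
    impact = defaultdict(set)
    for name, lines in function_ranges(source).items():
        for lineno in lines:
            impact[name] |= line_contexts.get(lineno, set())
    # Functions no test touched still appear, with an empty list.
    return {name: sorted(tests) for name, tests in impact.items()}
```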

A full remote gate run exceeded the 1h cap because high-coverage functions pulled
the slow Hypothesis property suite into every mutant's covering set, and the 300s
per-mutant timeout let slow mutants hold ProcessPool workers. Build the impact map
from tests/unit/ contexts only (fast deterministic tests do the structural
killing; the property layer runs once as a coarse net in normal CI), drop the
property suite from the module fallback, and set the per-mutant timeout to 120s.
Targets a full run under ~15 min.

…engine contracts

Add 16 unit tests pinning behavioral contracts that the property suite previously
covered only stochastically, so the unit-level guard catches per-mutation regressions:
error type/code/message/context payloads (rank-deficient OLS, VAR order and multivariate
guards, backend-absent BackendError via import monkeypatch, exog incompatibility on the
random-block and burn-in branches), slice and lag orientation (arma_initial_state
reversed initials, batched AR initial-lag reversal, fit_ar most-recent-first lag order,
VARX exog-coefficient slice), residual centering (AR and VAR contexts subtract the mean),
and the float32 simulation-dtype cast on the VAR path.

Error-message tests assert the exact rendered string rather than a substring so a mutation
that only rewraps the surrounding text cannot slip through. Refresh the function-to-test
impact map so each new test routes to the functions it covers; the backend-absent test
newly covers the previously untested import-guard branch.
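
A toy example of the exact-string rule. The message text, the function names, and the mutant are all made up for illustration; they are not the library's error messages.

```python
def render_error(order: int, n_obs: int) -> str:
    """Toy stand-in for a library error message."""
    return f"order {order} is too large for {n_obs} observations"


def render_error_mutant(order: int, n_obs: int) -> str:
    """Hypothetical mutant that only rewraps the surrounding text."""
    return f"XXorder {order} is too large for {n_obs} observationsXX"


EXPECTED = "order 5 is too large for 4 observations"


def substring_check(render) -> bool:
    # Passes for both versions: the fragment is still present in the mutant.
    return "too large" in render(5, 4)


def exact_message_check(render) -> bool:
    # Passes only for the original: any rewrapping changes the full string.
    return render(5, 4) == EXPECTED
```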

Record surviving mutants proven behaviorally indistinguishable from the original on
every reachable input, with the proof for each: numpy and scipy semantics (omitted
lstsq rcond equals rcond=None, out-of-range slice clipping, lfilter default axis on 2D
input, integers(high) equals integers(0, high), reshape with a single negative
dimension), runtime no-ops (typing.cast), dead writes and always-equal boolean operands,
redundant error codes that fall back to the same class default, and cosmetic warning
stacklevel changes. These form the accepted-survivor baseline for the mutation ratchet.
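
Two of these equivalence classes can be checked with the standard library alone. The sketch uses a plain list; basic slices on numpy arrays clip out-of-range bounds in the same way. The variable names are illustrative only.

```python
from typing import cast

values = [1.0, 2.0, 3.0]

# typing.cast is a runtime no-op: it returns its argument unchanged, so a
# mutant that removes or alters the cast cannot change behavior.
same_object = cast("list[float]", values) is values

# Slice bounds beyond the sequence length are clipped, so a mutated upper
# bound that stays out of range selects the same elements.
clipped_equal = values[1:10] == values[1:11] == values[1:]
```

Because no input can separate the two forms, no test can kill such a mutant, which is why it belongs on the baseline instead of in the test suite.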

Drop the stale simulate_ar lfiltic entry: lfiltic does not treat a None numerator as the
all-pole default (that is lfilter behavior), so the mutation is a real difference and is
now killed by a test, not an equivalent.

…t allowlist

Add the accepted-survivor allowlist: 40 stable identities, each a mutant triaged as a
genuine behavioral equivalent (unkillable by any test) on a clean run that killed 1024 of
1064 mutants. Every identity is refactor-stable (enclosing-statement hash plus a
diff-content digest, no generation-number dependence). The nightly gate now fails only on
a survivor whose identity is not in this list.

@mergify

mergify Bot commented Jun 24, 2026


Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

…rtion

Exclude tools/ (the mutation-gate helper scripts) from SonarCloud, matching the existing
exclusion of scripts/, docs/, benchmarks/, and examples/; these are internal CI tooling,
not shipped library code. Replace an exact float equality in an ARIMA resample test with a
tolerance-based check (the asserted spike propagates exactly, but exact float comparison
trips the analyzer).

Resolve the outcomes and allowlist CLI paths and reject any that escape the repo
before reading or writing, so a faulty argument cannot touch the wider filesystem.
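
The containment check might look like the sketch below. `resolve_inside` is a hypothetical name, not the driver's actual function, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def resolve_inside(repo_root: Path, candidate: str) -> Path:
    """Resolve a CLI path and refuse anything outside the repository."""
    root = repo_root.resolve()
    # Joining with an absolute candidate discards the root, so absolute
    # paths outside the repo are rejected by the same check as `..` escapes.
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):
        raise ValueError(f"path escapes the repository: {candidate}")
    return resolved
```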

astrogilda merged commit 4b75921 into main on Jun 24, 2026
24 checks passed
astrogilda deleted the chore/mutation-config branch on June 24, 2026 at 20:06