Add a deterministic mutation-testing gate for the model and engine core #215
Merged
Make the mutation-score ratchet correct and fast without the interpreted numba penalty.

- mutate_only_covered_lines: skip lines the test suite never runs (an uncovered line can only ever produce a guaranteed survivor).
- do_not_mutate_patterns: suppress the systemic non-behavioral equivalents at the source-line level (error code= and context= kwargs, dtype=np.* defaults; dunder and logging patterns kept for generality). In mutmut 3.6 these are Python regexes matched against each source line.
- pytest timeout (--timeout=60, thread method) so a non-terminating mutant is killed instead of stalling the run; add pytest-timeout to the dev extra.
- Two mutmut-profile-gated test fixtures: an ephemeral per-session NUMBA_CACHE_DIR so the compiled VAR kernel recompiles its mutated source fresh at native speed (replacing the slow JIT-disable workaround), and a clamp on the statsmodels MLE solver maxiter so a convergence-degrading mutant fails fast. Both are inert outside the mutation-testing profile.

The run command becomes HYPOTHESIS_PROFILE=mutmut mutmut run (JIT stays on).
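To make the line-level suppression concrete, here is a minimal sketch of regexes applied to individual source lines. The three patterns, the `is_suppressed` helper, and the sample lines (including the error code string) are hypothetical stand-ins; they are not the project's configured values, and whether mutmut uses search or match semantics internally is an assumption here.

```python
import re

# Hypothetical patterns in the spirit of the ones described above; the
# project's real do_not_mutate_patterns values are not reproduced here.
PATTERNS = [
    re.compile(r"\bcode\s*="),          # error code= kwargs
    re.compile(r"\bcontext\s*="),       # error context= kwargs
    re.compile(r"\bdtype\s*=\s*np\."),  # dtype=np.* defaults
]


def is_suppressed(source_line: str) -> bool:
    """Return True when any pattern matches somewhere in the line (assumed search semantics)."""
    return any(p.search(source_line) for p in PATTERNS)


# Illustrative source lines only; the error code below is made up.
lines = [
    'raise InputDataError("bad order", code="EXAMPLE_CODE")',
    "def simulate(n, dtype=np.float64):",
    "phi = coefs[::-1]",
]
suppressed = [is_suppressed(line) for line in lines]
```

The first two lines would be skipped as non-behavioral, while the slice on the third line stays mutable, which is the kind of orientation bug the gate is meant to catch.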
mutmut 3.x runs pytest in-process and, when mutate_only_covered_lines is on, unloads every module imported during its coverage pass so the source reloads fresh per mutant. That unload is fatal to single-phase-init C extensions: numpy raises 'cannot load module more than once per process' and numba raises 'duplicate registration for PolynomialType', aborting the run before any mutant is tested. Add tools/mutmut_sitecustomize, put on PYTHONPATH for the run, which patches the unload to keep numpy/numba/scipy/statsmodels resident while still reloading tsbootstrap. It prefers the shared mutation-ratchet-core helper when installed and falls back to an inline copy for standalone runs. Also drop the statsmodels solver-maxiter clamp fixture: capping maxiter changed the numerics on the unmutated code and broke the clean-baseline ARIMA golden, which mutmut requires to pass. Spinning mutants are bounded by pytest-timeout instead. Document that the full run needs an idle or dedicated machine, since the in-process stats pass can hit the per-test timeout under heavy CPU contention.
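The selective unload can be pictured as a filter over the modules imported during the coverage pass. The sketch below is not the actual tools/mutmut_sitecustomize code or a mutmut API; `modules_to_unload` is a hypothetical helper and `tsbootstrap.engine` is an illustrative submodule name. It only shows the rule: drop what was newly imported, except the packages that cannot be loaded twice.

```python
# Packages that must stay resident, taken from the description above.
KEEP_RESIDENT = ("numpy", "numba", "scipy", "statsmodels")


def modules_to_unload(
    before: set[str],
    after: set[str],
    keep: tuple[str, ...] = KEEP_RESIDENT,
) -> list[str]:
    """Names imported between two snapshots, minus the resident packages."""

    def is_resident(name: str) -> bool:
        # Match the package itself and any of its submodules.
        return any(name == pkg or name.startswith(pkg + ".") for pkg in keep)

    return sorted(name for name in after - before if not is_resident(name))


# Snapshots would normally come from set(sys.modules); literals keep this runnable.
before = {"sys", "os"}
after = before | {
    "numpy",
    "numpy.linalg",
    "numba.core",
    "tsbootstrap",
    "tsbootstrap.engine",  # illustrative submodule name
}
to_unload = modules_to_unload(before, after)
```

Matching on `pkg + "."` rather than a bare prefix avoids keeping an unrelated module that merely starts with the same letters.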
A full mutation run over the engine and model core surfaced 41 surviving mutants concentrated in select_ar_order, simulate_ar, the ARIMA recursion helpers, and the exogenous-regression beta path. The existing tests asserted only loose bounds (order ranges, penalty inequalities) and never exercised the select_ar_order default upper-bound path, so order-selection, information-criterion routing, and filter-state orientation were unguarded. Add targeted tests that pin exact behavior: the exact selected order per criterion on known series, the exact information-criterion penalty coefficients, the default and clamped search bounds, the design-matrix lag columns, the sim-dtype cast, and the AR initial-state lag orientation. Each kill was confirmed by applying the mutant diff to the source and observing the matching test fail. Genuine algorithmic equivalents that no test can kill (constant information-criterion offsets, argmin-invariant changes, default-argument routing, numpy and scipy aliasing) are catalogued in tests/mutation_equivalents.md as the ratchet baseline rather than tested.
In-process mutmut (3.x) cannot run this numba parallel-JIT codebase: the parallel-kernel LLVM finalize deadlocks inside mutmut's repeated in-process pytest model (reproduced on a clean 16-core box, so it is a deadlock, not CPU contention). Document the root cause in [tool.mutmut] and point to the locked path: a subprocess-per-mutant runner that keeps mutmut for AST generation only and executes each mutant in an isolated process, run nightly and non-blocking. The durable value already landed is the killing tests and the equivalents registry, not the in-process runner.
Drives the reliable mutation ratchet: generate mutants with mutmut, map each to its covering test files by source module, execute every mutant in an isolated subprocess via mutation_ratchet_core.subprocess_runner, and report killed/survived/timeout. Replaces the in-process mutmut run that deadlocks the numba parallel kernels. Runs on a remote/dedicated box, never the local machine.
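The killed/survived/timeout classification can be sketched with the standard library alone. This is not the mutation_ratchet_core.subprocess_runner implementation; `run_isolated` is a hypothetical helper, and it runs a short snippet where the real gate would run the covering tests against one activated mutant.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float) -> str:
    """Run a snippet in a fresh interpreter and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    # A nonzero exit means the check failed, i.e. the mutant was detected.
    return "survived" if proc.returncode == 0 else "killed"


outcomes = {
    "passes": run_isolated("assert 1 + 1 == 2", timeout_s=30),
    "fails": run_isolated("assert 1 + 1 == 3", timeout_s=30),
    "spins": run_isolated("while True: pass", timeout_s=1),
}
```

Because each run gets its own interpreter, compiled-kernel state cannot leak between mutants. A real runner would also need to treat "no tests collected" (pytest exit code 5) separately, since counting it as a kill would overstate the score.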
mutmut results lists nothing on a freshly generated (untested) store, so the gate saw zero mutants and falsely reported a clean run. Enumerate every mutant name statically from the trampolined mutants/src tree instead (independent of result status): 925 mutants across the engine+model scope.
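A static enumeration along these lines could look like the sketch below. It assumes the trampolined files define one function per mutant whose name ends in `__mutmut_<n>`, next to a `__mutmut_orig` copy; that naming convention, the `enumerate_mutants` helper, and the sample file are assumptions for illustration, not the gate's actual code.

```python
import ast
import re
import tempfile
from pathlib import Path

# Assumed naming convention for trampolined mutant functions.
MUTANT_RE = re.compile(r"__mutmut_\d+$")


def enumerate_mutants(root: Path) -> list[str]:
    """List 'module.function' names for every mutant function under root."""
    names = []
    for path in sorted(root.rglob("*.py")):
        module = ".".join(path.relative_to(root).with_suffix("").parts)
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            if is_func and MUTANT_RE.search(node.name):
                names.append(f"{module}.{node.name}")
    return sorted(names)


# Build a tiny stand-in for the mutants/src tree and enumerate it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "ar.py").write_text(
        "def x_fit__mutmut_orig(y): return y\n"
        "def x_fit__mutmut_1(y): return None\n"
        "def x_fit__mutmut_2(y): return -y\n",
        encoding="utf-8",
    )
    found = enumerate_mutants(root)
```

Parsing the generated source means the count does not depend on any result store having been populated.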
mutate_only_covered_lines=true makes mutmut run an in-process coverage pass (the suite under coverage.py) before generating, which hits the same numba parallel LLVM deadlock as the in-process executor, yields empty coverage, and generates ZERO mutants. Set it false: generation becomes pure libcst (no suite run, no deadlock) over every in-scope line; the subprocess gate runs them all. Also make the gate driver exit nonzero when zero mutants are enumerated, so a generation failure can never again masquerade as a clean pass.
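The zero-mutant guard amounts to checking the enumeration before looking at survivors. A minimal sketch, with a hypothetical `gate_exit_code` helper and exit-code values chosen only for illustration:

```python
def gate_exit_code(mutant_names: list[str], new_survivors: list[str]) -> int:
    """Map a gate run to a process exit code (values are illustrative)."""
    if not mutant_names:
        # Nothing was generated: treat as a failure, never as a clean pass.
        return 2
    return 1 if new_survivors else 0


empty_run = gate_exit_code([], [])
clean_run = gate_exit_code(["m1", "m2"], [])
failing_run = gate_exit_code(["m1", "m2"], ["m2"])
```

Using a distinct code for the empty case lets CI logs separate "generation broke" from "a new survivor appeared".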
…htly CI
Driver writes a full per-mutant outcomes JSON (survivors AND timeouts itemized) and maps each survivor to a refactor-stable AST identity via mutation-ratchet-core, gating against a committed allowlist of accepted equivalents so the gate fails only on NEW survivors. Add a nightly + on-demand mutation workflow that runs the subprocess gate on the GitHub runner (a dedicated CI box, never a developer machine). Allowlist generated post-triage via --update-allowlist.
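The new-survivor check reduces to a set difference between surviving identities and the allowlist. The outcomes layout below (a `status` and an `identity` per mutant) and the `new_survivors` helper are assumptions for illustration; the real JSON schema is not shown in this PR text.

```python
import json


def new_survivors(outcomes: dict[str, dict], allowlist: set[str]) -> list[str]:
    """Identities of survivors that the allowlist does not already accept."""
    survived = {
        record["identity"]
        for record in outcomes.values()
        if record["status"] == "survived"
    }
    return sorted(survived - allowlist)


# Hypothetical outcomes file content; timeouts are itemized but not survivors.
outcomes_json = json.dumps(
    {
        "m1": {"status": "killed", "identity": "id-a"},
        "m2": {"status": "survived", "identity": "id-b"},
        "m3": {"status": "survived", "identity": "id-c"},
        "m4": {"status": "timeout", "identity": "id-d"},
    }
)
allowlist = {"id-b"}
fresh = new_survivors(json.loads(outcomes_json), allowlist)
```

Here only `id-c` would fail the gate: `id-b` is an accepted equivalent, and the timeout is reported separately rather than counted as a survivor.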
Close ~46 surviving-mutant test gaps from the full subprocess mutation run with ~25 deterministic killing tests across the AR/VAR/ARIMA/batched-engine clusters: stability guards (exact spectral-radius threshold, error code+context payloads), order-selection and order-too-large guards, residual centering, innovation and initial-block resample ranges, dtype honoring, and exact engine recurrences. ~29 further survivors are genuine algorithmic equivalents catalogued for the new-survivor allowlist.
The equivariance property test could draw (via Hypothesis) a series whose AR design is collinear; the fit then correctly raises TSB_PERFECT_COLLINEARITY, which is not an equivariance violation. Catch InputDataError and reject the example via assume(False), matching the existing pattern used by the other property tests in this file. Surfaced by the default (random) Hypothesis profile; the derandomized mutmut profile did not hit it.
The broad module->test map ran the whole property suite per mutant, so many mutants hit the per-mutant timeout and timeouts masked real survivors. Generate a function-level test-impact map from coverage.py per-test line contexts (tools/gen_test_impact.py -> tools/mutation_test_impact.json) and have the gate run each mutant against ONLY the tests covering its function (median 21 vs ~150 before). Deterministic, no timeout-masking. Raise the per-mutant timeout to 300s.
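A function-level impact map can be derived by intersecting each test's covered lines with each function's line span. The sketch below is not tools/gen_test_impact.py; the input shapes, the `build_impact_map` helper, the file name, and the line numbers are all assumed for illustration, with per-test line sets standing in for coverage.py's per-test contexts.

```python
from collections import defaultdict

# function name -> (file, first line, last line)
FunctionSpans = dict[str, tuple[str, int, int]]
# test id -> set of (file, line) pairs the test executed
TestLines = dict[str, set[tuple[str, int]]]


def build_impact_map(spans: FunctionSpans, test_lines: TestLines) -> dict[str, list[str]]:
    """Map each function to the tests that executed at least one of its lines."""
    impact: dict[str, set[str]] = defaultdict(set)
    for test_id, covered in test_lines.items():
        for func, (file, first, last) in spans.items():
            if any(f == file and first <= line <= last for f, line in covered):
                impact[func].add(test_id)
    # Sorted lists keep the generated JSON deterministic across runs.
    return {func: sorted(tests) for func, tests in impact.items()}


spans = {"fit_ar": ("ar.py", 10, 40), "simulate_ar": ("ar.py", 50, 90)}
test_lines = {
    "test_fit_order": {("ar.py", 12), ("ar.py", 20)},
    "test_sim_dtype": {("ar.py", 55)},
    "test_roundtrip": {("ar.py", 15), ("ar.py", 60)},
}
impact = build_impact_map(spans, test_lines)
```

A mutant inside `fit_ar` would then run two tests instead of all three, which is the same narrowing the gate applies at a much larger scale.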
A full remote gate run exceeded the 1h cap because high-coverage functions pulled the slow Hypothesis property suite into every mutant's covering set, and the 300s per-mutant timeout let slow mutants hold ProcessPool workers. Build the impact map from tests/unit/ contexts only (fast deterministic tests do the structural killing; the property layer runs once as a coarse net in normal CI), drop the property suite from the module fallback, and set the per-mutant timeout to 120s. Targets a full run under ~15 min.
…engine contracts
Add 16 unit tests pinning behavioral contracts that the property suite previously covered only stochastically, so the unit-level guard catches per-mutation regressions: error type/code/message/context payloads (rank-deficient OLS, VAR order and multivariate guards, backend-absent BackendError via import monkeypatch, exog incompatibility on the random-block and burn-in branches), slice and lag orientation (arma_initial_state reversed initials, batched AR initial-lag reversal, fit_ar most-recent-first lag order, VARX exog-coefficient slice), residual centering (AR and VAR contexts subtract the mean), and the float32 simulation-dtype cast on the VAR path. Error-message tests assert the exact rendered string rather than a substring so a mutation that only rewraps the surrounding text cannot slip through. Refresh the function-to-test impact map so each new test routes to the functions it covers; the backend-absent test newly covers the previously untested import-guard branch.
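The difference between a substring assertion and an exact-string assertion shows up as soon as a mutation rewraps the message. In this sketch the message wording and both render functions are hypothetical, and the mutant is simulated by hand rather than produced by mutmut.

```python
def render_original(order: int, n_obs: int) -> str:
    # Hypothetical error message; not the library's actual wording.
    return f"AR order {order} is too large for {n_obs} observations"


def render_mutant(order: int, n_obs: int) -> str:
    # Simulated mutation: the surrounding text changes, the values do not.
    return f"XXAR order {order} is too large for {n_obs} observationsXX"


expected = "AR order 12 is too large for 10 observations"

# A substring check still passes on the mutant, so the mutant survives.
substring_passes_on_mutant = "order 12" in render_mutant(12, 10)
# An exact comparison fails on the mutant, so the mutant is killed.
exact_passes_on_mutant = render_mutant(12, 10) == expected
exact_passes_on_original = render_original(12, 10) == expected
```

The exact comparison costs a little maintenance when wording changes, in exchange for leaving no string-rewrapping survivors to triage.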
Record surviving mutants proven behaviorally indistinguishable from the original on every reachable input, with the proof for each: numpy and scipy semantics (omitted lstsq rcond equals rcond=None, out-of-range slice clipping, lfilter default axis on 2D input, integers(high) equals integers(0, high), reshape with a single negative dimension), runtime no-ops (typing.cast), dead writes and always-equal boolean operands, redundant error codes that fall back to the same class default, and cosmetic warning stacklevel changes. These form the accepted-survivor baseline for the mutation ratchet. Drop the stale simulate_ar lfiltic entry: lfiltic does not treat a None numerator as the all-pole default (that is lfilter behavior), so the mutation is a real difference and is now killed by a test, not an equivalent.
…t allowlist
Add the accepted-survivor allowlist: 40 stable identities, each a mutant triaged as a genuine behavioral equivalent (unkillable by any test) on a clean run that killed 1024 of 1064 mutants. Every identity is refactor-stable (enclosing-statement hash plus a diff-content digest, no generation-number dependence). The nightly gate now fails only on a survivor whose identity is not in this list.
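One way to get an identity that ignores formatting and line position is to hash the parsed statement rather than its text. The sketch below is an assumption about the general approach, not mutation-ratchet-core's algorithm: `stable_identity` is hypothetical, and it digests the whole mutated statement as a simplified stand-in for a diff-content digest.

```python
import ast
import hashlib


def _digest(text: str) -> str:
    """Short hex digest; the 12-character length is arbitrary."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def stable_identity(original_stmt: str, mutated_stmt: str) -> str:
    # ast.dump omits line numbers by default, and parsing discards
    # whitespace and comments, so reformatting does not change the hash.
    stmt_hash = _digest(ast.dump(ast.parse(original_stmt)))
    # Simplification: digest the mutated statement instead of a token diff.
    change_digest = _digest(ast.dump(ast.parse(mutated_stmt)))
    return f"{stmt_hash}:{change_digest}"


same_a = stable_identity("x = a + b", "x = a - b")
same_b = stable_identity("x   =   a+b  # note", "x = a-b")
different = stable_identity("x = a + b", "x = a * b")
```

The first two identities match despite the spacing and comment differences, while a different mutation of the same statement gets its own identity, and no mutant number appears anywhere in the key.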
…rtion
Exclude tools/ (the mutation-gate helper scripts) from SonarCloud, matching the existing exclusion of scripts/, docs/, benchmarks/, and examples/; these are internal CI tooling, not shipped library code. Replace an exact float equality in an ARIMA resample test with a tolerance-based check (the asserted spike propagates exactly, but exact float comparison trips the analyzer).
Resolve the outcomes and allowlist CLI paths and reject any that escape the repo before reading or writing, so a faulty argument cannot touch the wider filesystem.
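The containment check can be sketched with pathlib: resolve the candidate against the repository root, then refuse anything that lands outside it. `resolve_inside` is a hypothetical helper name and the file names are illustrative; `Path.is_relative_to` needs Python 3.9 or newer.

```python
import tempfile
from pathlib import Path


def resolve_inside(repo_root: Path, candidate: str) -> Path:
    """Resolve candidate under repo_root, rejecting paths that escape it."""
    root = repo_root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):
        raise ValueError(f"path escapes the repository: {candidate}")
    return resolved


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    accepted = resolve_inside(repo, "tools/outcomes.json")
    stays_inside = accepted.is_relative_to(repo.resolve())
    try:
        resolve_inside(repo, "../outside.json")
        rejected = False
    except ValueError:
        rejected = True
```

Checking after resolution matters: a purely textual prefix test would accept `tools/../../outside.json`, and an absolute argument replaces the root entirely when joined, so both cases are caught only once the path is fully resolved.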



What this adds
A deterministic mutation-testing gate for the model and engine core. It runs every
generated mutant in its own subprocess (the in-process runner deadlocks the compiled
parallel kernels), maps each mutant to only the unit tests that cover its function, and
fails only when a surviving mutant appears whose identity is not on the accepted-equivalent
allowlist.
Why
A passing test suite does not measure whether the tests actually detect wrong behavior.
The gate closes that blind spot for the parts of the library where a silent regression is
hardest to notice: the recursive AR, VAR, and ARIMA paths, the batched simulation engines,
and the fitting and stability code.
Results from a clean run
mutant from the original), verified one by one and catalogued with a proof.
What is in the change
by chance: error type, code, message, and context payloads; slice and lag orientation;
residual centering; and the float32 simulation dtype.
every entry.
Notes
The gate relies on the mutation-ratchet-core package for survivor identity and the
new-survivor diff. Identities are derived from the enclosing statement and the mutated
tokens, so they survive reformatting and line moves and carry no dependence on the
generator's mutant numbering.