From e2d9b141dcbfa6df75e9979ab90f7dedb343c8a7 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 26 Jun 2026 12:11:53 +0000
Subject: [PATCH] ci(mutants-core): cap per-process address space at ~48 G via ulimit -v
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds the optional repo-side defense-in-depth from #590 (comment by
@avrabe): RLIMIT_AS=~48 G before cargo mutants in the rivet-core shard.

A runaway mutation can allocate ~100 G in seconds — faster than the
30 s per-mutant timeout — so the kernel OOM-killer fires first and can
take down neighboring jobs on the lean-mem pool. With this cap, the
runaway aborts inside its own process (ENOMEM); the shard records it as
timeout/error and continue-on-error keeps the gate green.

Primary fix is still the infra MemoryMax cgroup cap; the acceptance
criterion ("zero kernel OOM kills attributable to rivet") can only be
observed by the nightly mutants-core fan-out after this and the cgroup
cap both land.

Refs: #590
Refs: #509
---
 .github/workflows/ci.yml | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 523f9cd5..1a252b35 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -587,7 +587,19 @@ jobs:
         # each shard takes ~2x as long (was 12-20 min, now 20-40 min),
         # but the lean-mem pool stops needing emergency cgroup-ceiling
         # bumps every quarter.
-        run: cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 2 --output mutants-out -- --lib || true
+        # Defense-in-depth for #590: cap per-process virtual address space
+        # at ~48 G (RLIMIT_AS) so a runaway mutant aborts inside its own
+        # process with ENOMEM instead of OOM-killing the lean-mem host.
+        # A memory-runaway mutation can allocate ~100 G in seconds — faster
+        # than the 30 s cargo-mutants timeout — so the kernel OOM-killer
+        # fires first and can take down neighboring jobs. Primary fix is
+        # the infra MemoryMax cgroup cap; this is the optional repo-side
+        # guard that works now. `continue-on-error: true` + `|| true` mean
+        # a clipped mutant is still recorded as timeout/error, not a gate
+        # failure.
+        run: |
+          ulimit -v 50331648
+          cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 2 --output mutants-out -- --lib || true
       - name: Check surviving mutants
         run: |
           MISSED=0
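
Reviewer note on the magic number: bash's `ulimit -v` takes its argument in KiB, so the ~48 G cap in the patch corresponds to 48 * 1024 * 1024 = 50331648. The sketch below is only an illustration of that arithmetic and of how a subshell can scope the limit. It assumes bash (the default `run:` shell on Linux runners), and the helper names `gib_to_kib` and `run_capped` are hypothetical, not part of the patch or of cargo-mutants.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the patch): convert GiB to the KiB units
# that bash's `ulimit -v` expects.
gib_to_kib() {
  echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical helper: run a command under an address-space cap.
# The parentheses start a subshell, so the lowered limit applies only to
# the command and its children, not to the calling shell.
run_capped() {
  local gib=$1
  shift
  (
    ulimit -v "$(gib_to_kib "$gib")" || exit 1
    "$@"
  )
}

# Print the value used in the workflow step for a 48 GiB cap.
gib_to_kib 48   # prints 50331648
```

Under these assumptions, `run_capped 48 cargo mutants ...` would apply the same limit as the patched step while leaving later commands in the same script uncapped. The patch sets the limit for the whole `run:` block instead, which is equivalent here because `cargo mutants` is the only command that follows.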