Summary
The rivet_core test binary repeatedly grows to 70–100 GB resident and trips the system-wide Linux OOM-killer on the self-hosted lean-mem runners. Because it's a kernel (not cgroup-local) OOM, it can kill neighboring jobs, not just the offending one. On a 125 GB host, a single test using ~100 GB is fatal to the whole box.
Evidence (kernel OOM log, pulseengine-ci-01)
Jun 19 06:28 Killed process (rivet_core-715e) total-vm:168.9 GB anon-rss:94.8 GB UID:1102
Jun 22 11:11 Killed process (rivet_core-98fe) total-vm:172.1 GB anon-rss:99.9 GB UID:1104
Jun 23 10:24 Killed process (rivet_core-98fe) total-vm:112.3 GB anon-rss:72.4 GB UID:1103
- rivet_core-<hash> is the rivet-core crate's compiled test binary (a native cargo test/nextest binary, not Miri).
- UIDs 1102–1104 = self-hosted runners 2/3/4, which are the lean-mem class. These runners exercise the rivet-core test suite via the RAM-heavy jobs pinned there (Miri / mutants-core). The OOMs cluster on high-load days (host ldavg ~60 on 32 cores Jun 22–23).
- 3 kernel OOM kills in 10 days; every one is rivet_core. No other process OOMs.
Why this is serious
anon-rss of 72–100 GB (95–100 GB in two of the three kills) means the test process itself is allocating that much heap; it is not build/link memory and not page cache. That points to a single test (or a fan-out of parallel test threads each holding large state) accumulating enormous memory. With MemoryHigh=32 GiB (soft) and no hard cap, the process sails past 32 GiB, thrashes swap, and the kernel eventually OOM-kills system-wide.
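To make the fan-out case concrete, here is a minimal sketch of the arithmetic. The per-test figure (12 GiB) and the thread count (8) are hypothetical and were not measured on rivet-core; only the ~48 GiB cap comes from the infra plan described in this issue.

```rust
/// Worst-case combined heap if `threads` tests each hold `per_test_gib` at once.
pub fn worst_case_gib(threads: u64, per_test_gib: u64) -> u64 {
    threads.saturating_mul(per_test_gib)
}

/// Largest thread count whose worst case stays at or under `cap_gib`.
/// Returns None when the per-test figure is zero (no meaningful bound)
/// or when a single test already exceeds the cap.
pub fn max_threads_under_cap(cap_gib: u64, per_test_gib: u64) -> Option<u64> {
    if per_test_gib == 0 || per_test_gib > cap_gib {
        return None;
    }
    Some(cap_gib / per_test_gib)
}
```

With these assumed numbers, 8 threads at 12 GiB each reach 96 GiB, roughly what the OOM log shows, while a 48 GiB cap would allow at most 4 such threads. If one test alone needs ~100 GiB, no thread count helps and the test itself has to shrink.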
What infra is doing (and why it changes this job's behavior)
We're adding a hard MemoryMax guardrail (~48 G) on the lean-mem cgroup so a runaway is killed cleanly inside its own cgroup instead of taking down the host/neighbors. After that lands, this test will FAIL fast at ~48 G instead of intermittently OOMing the box — i.e., it becomes a deterministic red until the memory use is fixed. So this needs a real fix on the rivet side, not just more RAM.
Requested
- Identify the offending test in rivet-core: something is allocating ~100 GB. Likely candidates: a property/fuzz-style test with an unbounded input, a test that materializes a huge graph/corpus, or a proptest/quickcheck case with a large generator.
- Bound it: cap allocation, shrink the input, or mark it #[ignore]/move it behind a feature so it doesn't run in the default test suite that Miri/mutants-core invoke.
- Cap test concurrency as a stopgap: --test-threads=N / nextest test-threads so N parallel copies of a memory-heavy test can't sum to 100 G.
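One possible way to approach the first two items, sketched with the standard library only: a counting global allocator that records peak heap use, plus a bounded input and an #[ignore]d test. Every name here (PeakAlloc, measure_peak, bounded_input, MAX_ITEMS) is hypothetical and not taken from rivet-core, and the 64 MiB budget is an arbitrary placeholder.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and tracks live and peak heap bytes.
/// Hypothetical helper for test builds; not part of rivet-core.
pub struct PeakAlloc;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Runs `f` and returns its result plus the peak heap growth (bytes) seen
/// while it ran. Only attributable to `f` when nothing else allocates
/// concurrently, so run the suite single-threaded while measuring.
pub fn measure_peak<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let base = LIVE.load(Ordering::Relaxed);
    PEAK.store(base, Ordering::Relaxed);
    let out = f();
    (out, PEAK.load(Ordering::Relaxed).saturating_sub(base))
}

/// Hypothetical cap on generated input size; tune to the real test.
pub const MAX_ITEMS: usize = 100_000;

/// Stand-in for a generator that previously had no upper bound.
pub fn bounded_input(requested: usize) -> Vec<u64> {
    (0..requested.min(MAX_ITEMS) as u64).collect()
}

#[test]
#[ignore = "memory-heavy; run with `cargo test -- --ignored`"]
fn heavy_case_stays_within_budget() {
    let (input, peak) = measure_peak(|| bounded_input(usize::MAX));
    assert_eq!(input.len(), MAX_ITEMS);
    assert!(peak < 64 * 1024 * 1024, "peak heap was {peak} bytes");
}
```

Running with cargo test -- --test-threads=1 while measuring keeps each reported peak tied to one test. Gating the heavy test behind a Cargo feature instead (for example #[cfg(feature = "heavy-tests")], a made-up feature name) would keep it out of the default suite entirely rather than merely skipping it.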
Acceptance
- No rivet_core process exceeds the lean-mem MemoryMax (~48 G); zero kernel OOM kills attributable to rivet.
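A rough self-check for the acceptance criterion, assuming a Linux runner: the VmHWM field of /proc/<pid>/status reports the process's peak resident set size in kB. The helper names below are hypothetical. Note that the cgroup limit also counts memory that VmHWM does not (page cache, for instance), so this is an approximation and not a substitute for the cgroup's own accounting.

```rust
/// Extracts the peak resident set size (VmHWM, in KiB) from the text of
/// /proc/<pid>/status. Returns None if the field is missing or malformed.
pub fn peak_rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmHWM:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// True if the recorded peak RSS is at or under `cap_gib` GiB.
pub fn within_cap(status: &str, cap_gib: u64) -> Option<bool> {
    peak_rss_kib(status).map(|kib| kib <= cap_gib * 1024 * 1024)
}
```

A test binary could call within_cap(&std::fs::read_to_string("/proc/self/status")?, 48) at the end of a run and log or fail on a false result. Because VmHWM is per process, it covers all test threads in the binary combined.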
cc #509 (CI reliability theme). Related infra: a 14-day audit also flagged lean-mem as the contended pool (see rivet#523).
Filed from a host performance investigation (OOM logs + cgroup config). Suggested labels: ci, bug.