
rivet_core test binary OOMs CI: up to 100 GB RSS triggers system-wide OOM-killer on lean-mem runners #590

Description

@avrabe

Summary

The rivet_core test binary repeatedly grows to 70–100 GB resident and trips the system-wide Linux OOM-killer on the self-hosted lean-mem runners. Because it's a kernel (not cgroup-local) OOM, it can kill neighboring jobs, not just the offending one. On a 125 GB host, a single test using ~100 GB is fatal to the whole box.

Evidence (kernel OOM log, pulseengine-ci-01)

Jun 19 06:28  Killed process (rivet_core-715e)  total-vm:168.9 GB  anon-rss:94.8 GB  UID:1102
Jun 22 11:11  Killed process (rivet_core-98fe)  total-vm:172.1 GB  anon-rss:99.9 GB  UID:1104
Jun 23 10:24  Killed process (rivet_core-98fe)  total-vm:112.3 GB  anon-rss:72.4 GB  UID:1103
  • rivet_core-<hash> is the rivet-core crate's compiled test binary (a native cargo test/nextest binary, not Miri).
  • UIDs 1102–1104 = self-hosted runners 2/3/4, which are the lean-mem class. These runners exercise the rivet-core test suite via the RAM-heavy jobs pinned there (Miri / mutants-core). The OOMs cluster on high-load days (host ldavg ~60 on 32 cores Jun 22–23).
  • 3 kernel OOM kills in 10 days — every one is rivet_core. No other process OOMs.

Why this is serious

anon-rss of 72–100 GB means the test process itself is holding that much anonymous memory (heap), not build/link output and not page cache. That points to a single test, or a fan-out of parallel test threads each holding large state, accumulating enormous memory. With MemoryHigh=32 GiB (a soft limit that only throttles) and no hard cap, the process sails past 32 GiB, thrashes swap, and the kernel eventually OOM-kills system-wide.
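One way to attribute heap growth like this to a specific test is to wrap the system allocator in a counting allocator and print the peak after each suspect test. The following is a minimal sketch using only the standard library; `PeakAlloc`, `peak_bytes`, and `reset_peak` are hypothetical names for illustration, not existing rivet-core items.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper around the system allocator that tracks
/// live and peak heap bytes for the whole process.
pub struct PeakAlloc;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Record the new live total and raise the peak if it was exceeded.
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Install the counting allocator for this binary (e.g. the test binary).
#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Highest number of live heap bytes seen since the last reset.
pub fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}

/// Restart peak tracking from the current live total.
pub fn reset_peak() {
    PEAK.store(LIVE.load(Ordering::Relaxed), Ordering::Relaxed);
}
```

Calling `reset_peak()` at the start of a test and logging `peak_bytes()` at the end gives a per-test figure only when tests run one at a time, since the counters are process-wide.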

What infra is doing (and why it changes this job's behavior)

We're adding a hard MemoryMax guardrail (~48 G) on the lean-mem cgroup so a runaway is killed cleanly inside its own cgroup instead of taking down the host/neighbors. After that lands, this test will FAIL fast at ~48 G instead of intermittently OOMing the box — i.e., it becomes a deterministic red until the memory use is fixed. So this needs a real fix on the rivet side, not just more RAM.

Requested

  1. Identify the offending test in rivet-core — something is allocating ~100 GB. Likely candidates: a property/fuzz-style test with an unbounded input, a test that materializes a huge graph/corpus, or a proptest/quickcheck case with a large generator.
  2. Bound it — cap allocation, shrink the input, or mark it #[ignore]/move it behind a feature so it doesn't run in the default test suite that Miri/mutants-core invoke.
  3. Cap test concurrency as a stopgap, via `cargo test -- --test-threads=N` or nextest's test-threads setting, so N parallel copies of a memory-heavy test can't sum to 100 GB.
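Items 2 and 3 above could look roughly like the sketch below. It is illustrative only: `MAX_NODES`, `build_ring`, and `max_test_threads` are hypothetical stand-ins, since the actual offending test and its data structures have not been identified yet.

```rust
/// Hypothetical hard cap on generated input size (not a rivet-core value).
const MAX_NODES: usize = 10_000;

/// Clamp a requested size so a generator can never ask for unbounded input.
pub fn bounded_len(requested: usize) -> usize {
    requested.min(MAX_NODES)
}

/// Build a stand-in graph (adjacency list forming a ring) of capped size.
pub fn build_ring(requested_nodes: usize) -> Vec<Vec<usize>> {
    let n = bounded_len(requested_nodes);
    (0..n).map(|i| vec![(i + 1) % n]).collect()
}

/// Largest thread count whose combined peak fits the budget (never below 1).
/// Both arguments must use the same unit, e.g. GiB.
pub fn max_test_threads(budget: u64, per_test_peak: u64) -> u64 {
    (budget / per_test_peak.max(1)).max(1)
}

#[cfg(test)]
mod tests {
    use super::*;

    // Runs in the default suite: input size is capped regardless of the request.
    #[test]
    fn ring_is_capped() {
        assert_eq!(build_ring(usize::MAX).len(), MAX_NODES);
    }

    // Skipped by default; run explicitly with `cargo test -- --ignored`.
    #[test]
    #[ignore = "memory-heavy; run explicitly"]
    fn ring_at_full_cap_is_connected() {
        let g = build_ring(MAX_NODES);
        assert!(g.iter().all(|edges| edges.len() == 1));
    }
}
```

For the thread cap, `max_test_threads(48, 20)` yields 2: with a 48 GiB budget and a test peaking near 20 GiB, at most two copies can run side by side.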

Acceptance

  • No rivet_core process exceeds the lean-mem MemoryMax (~48 G); zero kernel OOM kills attributable to rivet.
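A test binary could self-check this criterion on Linux by reading its peak resident set size (`VmHWM`) from `/proc/self/status`. This is a sketch under assumptions: the function names are hypothetical, the 48 GiB figure is the approximate MemoryMax quoted above, and `VmHWM` counts all resident pages, so it is an upper bound on the anon-rss figure in the OOM log.

```rust
use std::fs;

/// Extract the `VmHWM` value (peak RSS, in KiB) from the text of
/// a /proc/<pid>/status file. Returns None if the field is missing.
pub fn parse_vm_hwm_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmHWM:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// Peak RSS of the current process in KiB (Linux only; None elsewhere).
pub fn peak_rss_kib() -> Option<u64> {
    fs::read_to_string("/proc/self/status")
        .ok()
        .as_deref()
        .and_then(parse_vm_hwm_kib)
}

/// True if a peak given in KiB fits within a budget given in GiB.
pub fn within_budget(peak_kib: u64, budget_gib: u64) -> bool {
    peak_kib <= budget_gib * 1024 * 1024
}
```

A final test or teardown hook could then assert `within_budget(peak, 48)` whenever `peak_rss_kib()` returns a value, which turns a regression into an ordinary test failure rather than a cgroup kill.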

cc #509 (CI reliability theme). Related infra: a 14-day audit also flagged lean-mem as the contended pool (see rivet#523).


Filed from a host performance investigation (OOM logs + cgroup config). Suggested labels: ci, bug.
