rivet_core test binary OOMs CI: up to 100 GB RSS triggers system-wide OOM-killer on lean-mem runners

## Summary

The `rivet_core` **test binary** repeatedly grows to **70–100 GB resident** and trips the **system-wide** Linux OOM-killer on the self-hosted `lean-mem` runners. Because it's a kernel (not cgroup-local) OOM, it can kill **neighboring jobs**, not just the offending one. On a 125 GB host, a single test using ~100 GB is fatal to the whole box.

## Evidence (kernel OOM log, `pulseengine-ci-01`)

```
Jun 19 06:28  Killed process (rivet_core-715e)  total-vm:168.9 GB  anon-rss:94.8 GB  UID:1102
Jun 22 11:11  Killed process (rivet_core-98fe)  total-vm:172.1 GB  anon-rss:99.9 GB  UID:1104
Jun 23 10:24  Killed process (rivet_core-98fe)  total-vm:112.3 GB  anon-rss:72.4 GB  UID:1103
```

- `rivet_core-<hash>` is the **rivet-core crate's compiled test binary** (a native `cargo test`/nextest binary, not Miri).
- UIDs 1102–1104 = self-hosted runners **2/3/4**, which are the **`lean-mem`** class. These runners exercise the rivet-core test suite via the RAM-heavy jobs pinned there (Miri / `mutants-core`). The OOMs cluster on high-load days (host ldavg ~60 on 32 cores Jun 22–23).
- 3 kernel OOM kills in 10 days — every one is `rivet_core`. No other process OOMs.

## Why this is serious

`anon-rss` of 72–100 GB means the **test process itself** is holding that much anonymous memory (heap), not build/link output and not page cache. That points to a single test, or a fan-out of parallel test threads each holding large state, accumulating enormous memory. With `MemoryHigh=32 GiB` (soft) and no hard cap, the process sails past 32 GiB, thrashes swap, and the kernel eventually OOM-kills system-wide.
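One way to find which test is responsible is to record the process's peak RSS when each test finishes. The sketch below is a hypothetical helper, not existing `rivet-core` code: it reads the `VmHWM` (peak resident set size) field from `/proc/self/status`, which is Linux-only. The names `parse_status_kib` and `peak_rss_kib` are assumptions made for illustration.

```rust
use std::fs;

/// Extract a `kB`-valued field such as `VmHWM` from the text of
/// `/proc/<pid>/status`. Returns the value in KiB, or None if the
/// field is missing or malformed.
fn parse_status_kib(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        // Lines look like "VmHWM:\t  123456 kB".
        let rest = line.strip_prefix(field)?.strip_prefix(':')?;
        rest.trim().strip_suffix("kB")?.trim().parse().ok()
    })
}

/// Peak RSS of the current process in KiB (None where /proc is unavailable).
fn peak_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_status_kib(&status, "VmHWM")
}
```

`VmHWM` is a per-process figure, so it only attributes memory to one test if tests do not share a process. That holds under nextest, which runs each test in its own process, or under `cargo test -- --test-threads=1` if the value is logged after every test and the first large jump is noted.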

## What infra is doing (and why it changes this job's behavior)

We're adding a **hard `MemoryMax` guardrail** (~48 G) on the `lean-mem` cgroup so a runaway is killed **cleanly inside its own cgroup** instead of taking down the host/neighbors. **After that lands, this test will FAIL fast at ~48 G instead of intermittently OOMing the box** — i.e., it becomes a deterministic red until the memory use is fixed. So this needs a real fix on the rivet side, not just more RAM.

## Requested

1. **Identify the offending test** in `rivet-core` — something is allocating ~100 GB. Likely candidates: a property/fuzz-style test with an unbounded input, a test that materializes a huge graph/corpus, or a `proptest`/`quickcheck` case with a large generator.
2. **Bound it** — cap allocation, shrink the input, or mark it `#[ignore]`/move it behind a feature so it doesn't run in the default test suite that Miri/`mutants-core` invoke.
3. **Cap test concurrency** as a stopgap — `--test-threads=N` / nextest `test-threads` so N parallel copies of a memory-heavy test can't sum to 100 G.
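For item 3, the arithmetic behind picking `N` can be sketched as follows. This is a worst-case estimate that assumes every test thread can reach the same peak at the same moment. The per-test peak is not known yet (it is what item 1 should measure), so the 12 GiB figure in the comment is purely hypothetical, as is the function name `max_test_threads`.

```rust
const GIB: u64 = 1 << 30;

/// Largest thread count whose combined peak stays within `budget_bytes`,
/// assuming all threads may hit `per_test_peak_bytes` simultaneously.
/// Returns None if the per-test peak is zero (nothing to bound), and
/// Some(0) if even a single test does not fit in the budget.
fn max_test_threads(budget_bytes: u64, per_test_peak_bytes: u64) -> Option<u64> {
    budget_bytes.checked_div(per_test_peak_bytes)
}

// Hypothetical example: a ~48 GiB budget and a measured 12 GiB per-test
// peak give max_test_threads(48 * GIB, 12 * GIB) == Some(4).
```

A result of `Some(0)` means concurrency limits alone cannot help, and the test has to be bounded as described in item 2. The resulting `N` would be passed via `--test-threads=N` or the nextest `test-threads` setting.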

## Acceptance

- No `rivet_core` process exceeds the lean-mem `MemoryMax` (~48 G); zero kernel OOM kills attributable to rivet.
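The acceptance criterion could be checked mechanically from the cgroup v2 `memory.events` file, whose `oom_kill` counter records processes in that cgroup killed by any OOM killer. The sketch below is only a parser for that file's contents. The function name is an assumption, and the path of the `lean-mem` cgroup is not given in this issue, so it is left to the caller.

```rust
/// Return the `oom_kill` counter from the contents of a cgroup v2
/// `memory.events` file, or None if the key is absent or malformed.
fn oom_kill_count(memory_events: &str) -> Option<u64> {
    memory_events.lines().find_map(|line| {
        // Each line is "<key> <count>", e.g. "oom_kill 3".
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("oom_kill"), Some(n)) => n.parse().ok(),
            _ => None,
        }
    })
}
```

A CI step could read the file before and after the test run and treat any increase in the counter as a failure of this criterion.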

cc #509 (CI reliability theme). Related infra: a 14-day audit also flagged lean-mem as the contended pool (see rivet#523).

---
*Filed from a host performance investigation (OOM logs + cgroup config). Suggested labels: `ci`, `bug`.*

