
Drive gradual reclaim by memory pressure instead of CPU idle #40778

Closed

benhillis wants to merge 2 commits into microsoft:master from benhillis:benhill/mem-reclaim-2-pressure


@benhillis (Member)

Stacked PR 2 of 4, building on #40777. The diff shown is cumulative (the helpers rework plus this change).

What this does

A CPU-bound workload can sit on gigabytes of cold page cache that a CPU-idle check would never reclaim. This change drives Gradual reclaim by memory pressure instead of CPU idleness:

  • Reads the PSI "some avg10" memory pressure value from /proc/pressure/memory.
  • Reclaims cold cache toward the fixed floor whenever pressure is low, even while the VM is busy, backing off once the workload starts stalling on memory.
  • A busy interval reclaims at most a bounded step (c_gradualStepBusyBytes) so a large backlog is drained gently; an idle interval drains the full excess.
  • When PSI is unavailable (kernel built without CONFIG_PSI), gradual reclaim falls back to gating on CPU idle.
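For reference, /proc/pressure/memory exposes two lines, `some` and `full`, each carrying `avg10`, `avg60`, `avg300` and `total` fields; `some avg10` is a 10-second running average of the percentage of time in which at least one task was stalled on memory. The following is a minimal sketch of pulling that value out of the file's text, assuming the caller has already read the file. `ParsePsiSomeAvg10` is a hypothetical name, not the helper this PR adds.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: returns the "some avg10" value from the text of /proc/pressure/memory,
// or std::nullopt if the "some" line or its avg10 field is missing or malformed.
std::optional<double> ParsePsiSomeAvg10(const std::string& Contents)
{
    std::istringstream Lines(Contents);
    std::string Line;
    while (std::getline(Lines, Line))
    {
        std::istringstream Fields(Line);
        std::string Kind;
        if (!(Fields >> Kind) || Kind != "some")
        {
            continue; // skip the "full" line
        }

        const std::string Prefix = "avg10=";
        std::string Field;
        while (Fields >> Field)
        {
            if (Field.compare(0, Prefix.size(), Prefix) != 0)
            {
                continue; // avg60=, avg300=, total=
            }

            const std::string Value = Field.substr(Prefix.size());
            char* End = nullptr;
            const double Parsed = std::strtod(Value.c_str(), &End);
            if (End == Value.c_str() || *End != '\0')
            {
                return std::nullopt; // not a number
            }

            return Parsed;
        }

        return std::nullopt; // "some" line without avg10
    }

    return std::nullopt;
}
```

Returning `std::nullopt` rather than a sentinel gives the caller a natural signal for the no-PSI fallback: if the file cannot be read or parsed, gradual reclaim can revert to the CPU-idle gate.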

Stack

  1. #40777: rework the memory reduction thread around explicit reclaim helpers
  2. This PR: drive gradual reclaim by memory pressure (PSI)
  3. adaptive working-set floor via refaults
  4. make gradual the default

Copilot AI review requested due to automatic review settings June 11, 2026 17:22

Copilot AI (Contributor) left a comment

Pull request overview

Updates the WSL2 guest mini-init memory reduction thread so Gradual cache reclaim is driven by memory pressure (PSI) rather than CPU idleness, enabling cold page-cache reclaim even during CPU-bound workloads while backing off under pressure.

Changes:

  • Add procfs helpers to read full procfs snapshots and to compute aggregate busy vs idle CPU time from /proc/stat.
  • Implement PSI-based gating for gradual reclaim using /proc/pressure/memory (some avg10), with bounded reclaim steps while busy and full drain while idle.
  • Refactor the reduction thread into clear per-tick helpers for Gradual reclaim, DropCache, and Compaction, with best-effort procfs writes.
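The busy vs. idle split can be sketched from the aggregate `cpu` line of /proc/stat, whose fields are, in order, user, nice, system, idle, iowait, irq, softirq and steal, followed by guest counters that the kernel already folds into user. This sketch counts as busy exactly the fields the first commit message lists (user, system, irq, softirq, steal) and, as an assumption, treats idle plus iowait as idle. `CpuTimes`, `ParseAggregateCpuTimes` and `BusyFraction` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

struct CpuTimes
{
    std::uint64_t Busy = 0; // user + system + irq + softirq + steal
    std::uint64_t Idle = 0; // idle + iowait (assumption)
};

// Hypothetical helper: parses the aggregate "cpu" line of /proc/stat (values in USER_HZ ticks).
std::optional<CpuTimes> ParseAggregateCpuTimes(const std::string& Contents)
{
    std::istringstream Lines(Contents);
    std::string Line;
    while (std::getline(Lines, Line))
    {
        std::istringstream Fields(Line);
        std::string Label;
        if (!(Fields >> Label) || Label != "cpu")
        {
            continue; // skip per-CPU "cpuN" lines and the other counters
        }

        // Field order: user nice system idle iowait irq softirq steal [guest guest_nice]
        std::uint64_t User = 0, Nice = 0, System = 0, Idle = 0, IoWait = 0, Irq = 0, SoftIrq = 0, Steal = 0;
        if (!(Fields >> User >> Nice >> System >> Idle >> IoWait >> Irq >> SoftIrq >> Steal))
        {
            return std::nullopt;
        }

        // nice is read only to keep field positions right; mirroring the commit's list, it is not counted.
        return CpuTimes{User + System + Irq + SoftIrq + Steal, Idle + IoWait};
    }

    return std::nullopt;
}

// Difference that treats a counter going backwards as zero instead of wrapping around.
std::uint64_t ClampedDelta(std::uint64_t Before, std::uint64_t After)
{
    return After > Before ? After - Before : 0;
}

// Fraction of elapsed CPU time that was busy between two samples, in [0, 1].
double BusyFraction(const CpuTimes& Previous, const CpuTimes& Current)
{
    const std::uint64_t BusyDelta = ClampedDelta(Previous.Busy, Current.Busy);
    const std::uint64_t IdleDelta = ClampedDelta(Previous.Idle, Current.Idle);
    const std::uint64_t Total = BusyDelta + IdleDelta;
    return Total == 0 ? 0.0 : static_cast<double>(BusyDelta) / static_cast<double>(Total);
}
```

The clamped delta is a defensive choice: proc(5) notes that iowait is unreliable and can decrease, so a raw unsigned subtraction could wrap to a huge value.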

Comment thread src/linux/init/util.cpp (outdated)
Comment on lines +3553 to +3569

```cpp
const int Result = TEMP_FAILURE_RETRY(read(Fd.get(), Buffer + Total, (Size - 1) - Total));
if (Result < 0)
{
    if (LogErrors)
    {
        LOG_ERROR("read({}) failed {}", Path, errno);
    }

    return false;
}

if (Result == 0)
{
    break;
}

Total += Result;
```
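The loop quoted above keeps reading until `read` returns 0 because procfs files report a size of zero, so the file size cannot be used to size a single read. Below is a self-contained sketch of the same partial-read pattern using plain POSIX calls, assuming the semantics described in the first commit message (full snapshot into a caller buffer, close-on-exec). `ReadProcFileSketch` is a hypothetical stand-in for the PR's `ReadProcFile`, whose full signature is not shown here; unlike the hunk, it stores `read`'s result in `ssize_t` and retries `EINTR` by hand instead of using `TEMP_FAILURE_RETRY`.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical stand-in for ReadProcFile: reads at most Size - 1 bytes of Path into Buffer
// and always NUL-terminates it. Loops until read() returns 0 (end of file) or the buffer is full.
bool ReadProcFileSketch(const char* Path, char* Buffer, std::size_t Size, bool LogErrors = true)
{
    if (Size == 0)
    {
        return false;
    }

    Buffer[0] = '\0';
    const int Fd = open(Path, O_RDONLY | O_CLOEXEC); // close-on-exec: no leak into child processes
    if (Fd < 0)
    {
        if (LogErrors)
        {
            std::perror(Path);
        }

        return false;
    }

    std::size_t Total = 0;
    bool Success = true;
    while (Total < Size - 1)
    {
        const ssize_t Result = read(Fd, Buffer + Total, (Size - 1) - Total);
        if (Result < 0)
        {
            if (errno == EINTR)
            {
                continue; // interrupted by a signal: retry, as TEMP_FAILURE_RETRY would
            }

            if (LogErrors)
            {
                std::perror(Path);
            }

            Success = false;
            break;
        }

        if (Result == 0)
        {
            break; // end of file
        }

        Total += static_cast<std::size_t>(Result);
    }

    close(Fd);
    Buffer[Total] = '\0';
    return Success;
}
```

As written, a file larger than the buffer is silently truncated; a caller that needs the whole snapshot could treat a completely filled buffer as an error.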
Comment thread src/linux/init/util.cpp (outdated)
Comment on lines +3895 to +3899

```cpp
// Best-effort: WriteToFile logs internally on failure. EAGAIN merely means the kernel could not evict
// the full amount this pass (pages were still freed), so it counts as reclaim. Never throw on a
// transient write error and tear down the long-lived reduction thread.
const int Status = WriteToFile(RECLAIM_PATH, Bytes.c_str());
return (Status == 0) || (errno == EAGAIN);
```
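A minimal sketch of a best-effort `memory.reclaim` write, assuming direct POSIX calls rather than the PR's `WriteToFile` helper. The cgroup v2 interface fails the write with `EAGAIN` when fewer bytes were reclaimed than requested, so that case is treated as success and not logged. One detail worth keeping in any version: `errno` is captured immediately after the failing `write`, because later calls such as `close` or a logging routine may overwrite it. `RequestCgroupReclaimSketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical sketch: asks the kernel to reclaim Bytes from the cgroup whose memory.reclaim
// file is ReclaimPath. Returns true on full success or on EAGAIN (partial reclaim, expected and
// not logged). Other failures are logged to stderr and reported as false; nothing throws.
bool RequestCgroupReclaimSketch(const char* ReclaimPath, std::uint64_t Bytes)
{
    const int Fd = open(ReclaimPath, O_WRONLY | O_CLOEXEC);
    if (Fd < 0)
    {
        const int OpenErrno = errno;
        std::fprintf(stderr, "open(%s) failed: %s\n", ReclaimPath, std::strerror(OpenErrno));
        return false;
    }

    const std::string Value = std::to_string(Bytes);
    ssize_t Written;
    do
    {
        Written = write(Fd, Value.c_str(), Value.size());
    } while (Written < 0 && errno == EINTR);

    // Capture errno before close() or logging has a chance to overwrite it.
    const int WriteErrno = (Written < 0) ? errno : 0;
    close(Fd);

    if (Written >= 0 || WriteErrno == EAGAIN)
    {
        return true;
    }

    std::fprintf(stderr, "write(%s) failed: %s\n", ReclaimPath, std::strerror(WriteErrno));
    return false;
}
```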
Ben Hillis and others added 2 commits June 11, 2026 11:06
Replace the ring-buffer idle detector and user-CPU-only sampling in the
mini-init memory reduction thread with a clearer, helper-based design:

- Sample aggregate non-idle CPU time (user, system, irq, softirq, steal)
  so kernel-bound work keeps the VM out of the idle state, instead of
  looking at user time alone.
- ReadProcFile reads a full procfs snapshot into a caller buffer
  (close-on-exec, partial-read safe); GetReclaimableCacheBytes /
  GetFreeMemoryBytes read the relevant counters through it.
- Gradual mode reclaims cold page cache (cgroup memory.reclaim) above a
  fixed floor while CPU-idle, with a hysteresis margin so it does not
  churn near the floor.
- DropCache mode stays gated on sustained CPU idle, drops once, and
  re-drops only after the reclaimable cache grows meaningfully.
- Compaction is gated on free-memory growth so it runs only when there
  are newly-freed pages worth coalescing.

RequestCgroupReclaim performs the memory.reclaim write best-effort: it
treats the kernel's expected EAGAIN (some, but not all, pages evicted)
as success without logging, and never throws so a transient write error
cannot tear down the long-lived reduction thread.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
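The three gates this commit message describes can be sketched as small pure functions. Everything here is an assumption for illustration: the names, the floor and margin values, and the exact comparisons are hypothetical, not the PR's constants or code. The gradual check starts reclaiming only once cache exceeds the floor plus a margin but then targets the floor itself, which is one common way to get the "no churn near the floor" hysteresis the message mentions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tuning values, for illustration only.
constexpr std::uint64_t c_floorBytes = 512ull << 20;     // cache to leave in place
constexpr std::uint64_t c_hysteresisBytes = 64ull << 20; // excess tolerated before acting

// Gradual mode: bytes to reclaim this tick. Nothing happens until cache exceeds
// floor + margin; once it does, the target is the floor itself, so cache hovering
// just above the floor does not trigger a reclaim on every tick.
std::uint64_t GradualExcessBytes(std::uint64_t CacheBytes)
{
    if (CacheBytes <= c_floorBytes + c_hysteresisBytes)
    {
        return 0;
    }

    return CacheBytes - c_floorBytes;
}

// DropCache mode: drop once after sustained idle, then again only if the reclaimable
// cache has grown by at least GrowthBytes since the previous drop.
struct DropCacheState
{
    bool Dropped = false;
    std::uint64_t CacheAfterDrop = 0;
};

bool ShouldDropCache(const DropCacheState& State, bool SustainedIdle, std::uint64_t CacheBytes, std::uint64_t GrowthBytes)
{
    if (!SustainedIdle)
    {
        return false;
    }

    return !State.Dropped || CacheBytes >= State.CacheAfterDrop + GrowthBytes;
}

// Compaction: only worth running when free memory grew since the last check,
// i.e. there are newly freed pages to coalesce.
bool ShouldCompact(std::uint64_t PreviousFreeBytes, std::uint64_t CurrentFreeBytes, std::uint64_t MinGrowthBytes)
{
    return CurrentFreeBytes >= PreviousFreeBytes + MinGrowthBytes;
}
```

Keeping the decisions free of I/O makes them easy to unit-test; the reduction thread would feed them values read from procfs each tick and update `DropCacheState` after a successful drop.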
A CPU-bound workload can sit on gigabytes of cold page cache that a
CPU-idle check would never reclaim. Read the PSI "some avg10" memory
pressure from /proc/pressure/memory and reclaim cold cache toward the
fixed floor whenever pressure is low, even while the VM is busy, backing
off once the workload starts stalling on memory.

A busy interval reclaims at most a bounded step (c_gradualStepBusyBytes)
so a large backlog is drained gently; an idle interval drains the full
excess at once. When PSI is unavailable (kernel built without
CONFIG_PSI), gradual reclaim falls back to gating on CPU idle.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
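Putting the second commit's rules together, a per-tick decision might look like the sketch below. `c_gradualStepBusyBytes` is the PR's name, but the value used here is made up; the low-pressure threshold, the function name and its parameters are likewise hypothetical. When the PSI reading is absent, the function reduces to the CPU-idle gate, matching the described fallback.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical values; the PR's actual constants are not shown in this thread.
constexpr std::uint64_t c_gradualStepBusyBytes = 64ull << 20;
constexpr double c_lowPressureThreshold = 1.0; // "some avg10" percent

// Bytes gradual mode should request from memory.reclaim this tick.
// ExcessBytes:  reclaimable cache above the floor (0 if at or below it).
// CpuIdle:      whether the last interval was CPU-idle.
// PsiSomeAvg10: std::nullopt when /proc/pressure/memory is unavailable (no CONFIG_PSI).
std::uint64_t GradualReclaimBytes(std::uint64_t ExcessBytes, bool CpuIdle, std::optional<double> PsiSomeAvg10)
{
    if (ExcessBytes == 0)
    {
        return 0;
    }

    if (!PsiSomeAvg10.has_value())
    {
        return CpuIdle ? ExcessBytes : 0; // fallback: gate on CPU idle only
    }

    if (*PsiSomeAvg10 >= c_lowPressureThreshold)
    {
        return 0; // the workload is already stalling on memory: back off
    }

    // Low pressure: drain the full excess when idle, a bounded step when busy.
    return CpuIdle ? ExcessBytes : std::min(ExcessBytes, c_gradualStepBusyBytes);
}
```

With these example numbers, a busy VM holding 1 GiB of excess cold cache would be trimmed in 64 MiB steps across successive low-pressure ticks rather than in one large reclaim.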
benhillis force-pushed the benhill/mem-reclaim-2-pressure branch from e0d5682 to 386378f on June 11, 2026 18:10
@benhillis (Member, Author)

Closing for now. These changes are being submitted one PR at a time; this will be reopened against master once the preceding PR in the series has merged, so it shows only its own incremental change. The branch remains pushed.

benhillis closed this on Jun 11, 2026
