Adapt the gradual reclaim floor to the working set using refaults #40779

Closed
benhillis wants to merge 3 commits into microsoft:master from benhillis:benhill/mem-reclaim-3-adaptive

Conversation

@benhillis

Member

Stacked PR 3 of 4 — builds on #40777 and the PSI change. Diff shown is cumulative.

What this does

Reclaiming toward a single fixed floor either gives back too little on large VMs or evicts pages a larger working set immediately faults back in. This makes the floor adaptive and self-regulating:

  • Tracks file refaults (/proc/vmstat workingset_refault*) as a signal that reclaim is cutting into the working set. When refaults spike (or PSI pressure rises into the backoff band), the floor is raised to protect what the workload is actually using and reclaim stops.
  • After sustained calm, the floor decays back toward the base so a shrunken working set is eventually re-probed downward.
  • Scales the floor's upper bound to a fraction of guest RAM so large working sets on large VMs can be fully protected, falling back to a fixed cap when total RAM is unavailable.
  • The PSI-unavailable path keeps the same refault brake and floor decay so behavior degrades gracefully on kernels without CONFIG_PSI.
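The raise, decay, and cap rules above can be sketched as a small state update. This is only an illustration under stated assumptions: `FloorPolicy`, `FloorState`, `MaxFloorBytes`, `UpdateFloor`, and every threshold or step size are hypothetical names and parameters, not the values or identifiers used in the PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tuning knobs; the PR does not state its actual values.
struct FloorPolicy
{
    uint64_t BaseFloorBytes;            // floor the policy decays back toward
    uint64_t FixedCapBytes;             // upper bound used when total RAM is unknown
    uint64_t RaiseStepBytes;            // amount added to the floor on a backoff signal
    uint64_t DecayStepBytes;            // amount removed from the floor per decay
    uint64_t RefaultSpikeThreshold;     // refaults per interval that count as a spike
    unsigned CalmIntervalsBeforeDecay;  // calm intervals required before each decay
    unsigned RamFractionDivisor;        // max floor = total RAM / divisor
};

struct FloorState
{
    uint64_t FloorBytes;
    unsigned CalmIntervals;
};

// Upper bound for the floor: a fraction of guest RAM, or the fixed cap when
// total RAM is unavailable (passed as 0). Never below the base floor.
uint64_t MaxFloorBytes(const FloorPolicy& Policy, uint64_t TotalRamBytes)
{
    const uint64_t Cap = (TotalRamBytes == 0) ? Policy.FixedCapBytes : TotalRamBytes / Policy.RamFractionDivisor;
    return std::max(Policy.BaseFloorBytes, Cap);
}

// Updates the floor for one sampling interval. Returns true if reclaim may
// proceed this interval, false if the policy is backing off.
bool UpdateFloor(const FloorPolicy& Policy, FloorState& State, uint64_t RefaultDelta, bool PressureInBackoffBand, uint64_t TotalRamBytes)
{
    if (RefaultDelta >= Policy.RefaultSpikeThreshold || PressureInBackoffBand)
    {
        // Reclaim is cutting into the working set: raise the floor and stop.
        const uint64_t Max = MaxFloorBytes(Policy, TotalRamBytes);
        State.FloorBytes = std::min(Max, State.FloorBytes + Policy.RaiseStepBytes);
        State.CalmIntervals = 0;
        return false;
    }

    if (++State.CalmIntervals >= Policy.CalmIntervalsBeforeDecay)
    {
        // Sustained calm: decay toward the base so a smaller working set is re-probed.
        State.CalmIntervals = 0;
        const uint64_t Lowered = (State.FloorBytes > Policy.DecayStepBytes) ? State.FloorBytes - Policy.DecayStepBytes : 0;
        State.FloorBytes = std::max(Policy.BaseFloorBytes, Lowered);
    }

    return true;
}
```

In this sketch the floor rises in one step per bad interval and falls only after several calm ones, which gives the "raise quickly, decay slowly" asymmetry the description implies.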

Stack

  1. Rework the memory reduction thread around explicit reclaim helpers (#40777)
  2. drive gradual reclaim by memory pressure (PSI)
  3. this PR — adaptive working-set floor via refaults
  4. make gradual the default

Copilot AI review requested due to automatic review settings June 11, 2026 17:22

Copilot AI left a comment

Contributor

Pull request overview

This PR (stacked PR 3/4) makes the reclaim “floor” of the Gradual reclaim policy in the WSL2 mini-init memory reduction thread adaptive to the guest’s working set. It uses file refaults (/proc/vmstat workingset_refault_file) and PSI memory pressure (/proc/pressure/memory) as feedback signals to back off reclaim when it starts evicting actively used pages.

Changes:

  • Adds a robust procfs reader (ReadProcFile) and refactors CPU-idle detection to use aggregate busy vs idle jiffies from /proc/stat.
  • Implements an adaptive Gradual reclaim floor that raises quickly on refault/pressure backoff and decays after sustained calm; scales the max floor to a fraction of total guest RAM.
  • Updates DropCache + compaction behavior to run as best-effort, with compaction gated by free-memory growth.
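For the busy-versus-idle jiffies idea in the first bullet, a minimal sketch might look like the following. It assumes the standard field order of the aggregate `cpu` line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal). `CpuSample`, `ParseCpuSample`, and `BusyPercent` are hypothetical names, and counting iowait as idle time is an assumption of this sketch rather than something the PR states.

```cpp
#include <cstdint>
#include <cstdio>

struct CpuSample
{
    uint64_t Busy; // user + system + irq + softirq + steal jiffies
    uint64_t Idle; // idle + iowait jiffies (iowait-as-idle is an assumption)
};

// Parses the aggregate "cpu" line at the start of a /proc/stat snapshot.
// Returns false if the line does not have the expected eight leading fields.
bool ParseCpuSample(const char* StatText, CpuSample& Sample)
{
    unsigned long long User, Nice, System, Idle, IoWait, Irq, SoftIrq, Steal;
    const int Fields = std::sscanf(
        StatText, "cpu %llu %llu %llu %llu %llu %llu %llu %llu", &User, &Nice, &System, &Idle, &IoWait, &Irq, &SoftIrq, &Steal);
    if (Fields != 8)
    {
        return false;
    }

    // Nice time is parsed but left out of Busy, following the list in the commit message.
    Sample.Busy = User + System + Irq + SoftIrq + Steal;
    Sample.Idle = Idle + IoWait;
    return true;
}

// Share of busy jiffies between two samples, as an integer percentage.
// Returns -1 if no time elapsed or a counter went backwards.
int BusyPercent(const CpuSample& Previous, const CpuSample& Current)
{
    if (Current.Busy < Previous.Busy || Current.Idle < Previous.Idle)
    {
        return -1;
    }

    const uint64_t Busy = Current.Busy - Previous.Busy;
    const uint64_t Total = Busy + (Current.Idle - Previous.Idle);
    if (Total == 0)
    {
        return -1;
    }

    return static_cast<int>(Busy * 100 / Total);
}
```

Comparing deltas between two samples, rather than absolute counters, is what lets kernel-bound work (system, irq, softirq) keep the VM out of the idle state.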

Comment thread src/linux/init/util.cpp Outdated
Comment on lines +3538 to +3539
 {
-    wil::unique_fd Fd{open("/proc/stat", O_RDONLY)};
+    wil::unique_fd Fd{open(Path, O_RDONLY)};
Comment thread src/linux/init/util.cpp Outdated
Comment on lines +3845 to +3851
const char* Marker = strstr(Buffer, "workingset_refault_file ");
if (Marker == nullptr)
{
return -1;
}

return strtoll(Marker + (sizeof("workingset_refault_file ") - 1), nullptr, 10);
Ben Hillis and others added 3 commits June 11, 2026 11:06
Replace the ring-buffer idle detector and user-CPU-only sampling in the
mini-init memory reduction thread with a clearer, helper-based design:

- Sample aggregate non-idle CPU time (user, system, irq, softirq, steal)
  so kernel-bound work keeps the VM out of the idle state, instead of
  looking at user time alone.
- ReadProcFile reads a full procfs snapshot into a caller buffer
  (close-on-exec, partial-read safe); GetReclaimableCacheBytes /
  GetFreeMemoryBytes read the relevant counters through it.
- Gradual mode reclaims cold page cache (cgroup memory.reclaim) above a
  fixed floor while CPU-idle, with a hysteresis margin so it does not
  churn near the floor.
- DropCache mode stays gated on sustained CPU idle, drops once, and
  re-drops only after the reclaimable cache grows meaningfully.
- Compaction is gated on free-memory growth so it runs only when there
  are newly-freed pages worth coalescing.

RequestCgroupReclaim performs the memory.reclaim write best-effort: it
treats the kernel's expected EAGAIN (some, but not all, pages evicted)
as success without logging, and never throws so a transient write error
cannot tear down the long-lived reduction thread.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A CPU-bound workload can sit on gigabytes of cold page cache that a
CPU-idle check would never reclaim. Read the PSI "some avg10" memory
pressure from /proc/pressure/memory and reclaim cold cache toward the
fixed floor whenever pressure is low, even while the VM is busy, backing
off once the workload starts stalling on memory.

A busy interval reclaims at most a bounded step (c_gradualStepBusyBytes)
so a large backlog is drained gently; an idle interval drains the full
excess at once. When PSI is unavailable (kernel built without
CONFIG_PSI), gradual reclaim falls back to gating on CPU idle.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
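As a rough illustration of the PSI-driven policy in the commit message above, the sketch below parses the "some avg10" value from a /proc/pressure/memory snapshot and picks a reclaim size for one interval. `ParsePsiSomeAvg10`, `ReclaimStepBytes`, and the `LowPressureThreshold` parameter are hypothetical; only the idea of a bounded busy step (`c_gradualStepBusyBytes` in the PR) comes from the source.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Returns the "some avg10" percentage from a PSI snapshot, or -1.0 if the
// field is missing (for example when PSI is unavailable).
double ParsePsiSomeAvg10(const char* PsiText)
{
    static const char Marker[] = "some avg10=";
    const char* Start = std::strstr(PsiText, Marker);
    if (Start == nullptr)
    {
        return -1.0;
    }

    const char* ValueStart = Start + sizeof(Marker) - 1;
    char* End = nullptr;
    const double Value = std::strtod(ValueStart, &End);
    return (End == ValueStart) ? -1.0 : Value;
}

// Bytes of cold cache to reclaim this interval (0 means "do nothing").
// A negative PsiSomeAvg10 means PSI is unavailable.
uint64_t ReclaimStepBytes(
    uint64_t CacheBytes, uint64_t FloorBytes, bool CpuIdle, double PsiSomeAvg10, double LowPressureThreshold, uint64_t StepBusyBytes)
{
    if (CacheBytes <= FloorBytes)
    {
        return 0;
    }

    const uint64_t Excess = CacheBytes - FloorBytes;
    if (PsiSomeAvg10 < 0.0)
    {
        // No PSI: fall back to gating on CPU idle.
        return CpuIdle ? Excess : 0;
    }

    if (PsiSomeAvg10 >= LowPressureThreshold)
    {
        // The workload is stalling on memory: back off.
        return 0;
    }

    // Low pressure: drain everything when idle, a bounded step when busy.
    return CpuIdle ? Excess : std::min(Excess, StepBusyBytes);
}
```

Bounding the busy step keeps a large backlog from being evicted in one write while the workload is running.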
Reclaiming toward a single fixed floor either gives back too little on
large VMs or evicts pages a larger working set immediately faults back
in. Make the floor adaptive and self-regulating:

- Track file refaults (/proc/vmstat workingset_refault_file) as a signal
  that reclaim is cutting into the working set. When refaults spike (or
  PSI pressure rises into the backoff band), raise the floor to protect
  what the workload is actually using and stop reclaiming.
- After sustained calm, decay the floor back toward the base so a
  shrunken working set is eventually re-probed downward.
- Scale the floor's upper bound to a fraction of guest RAM so large
  working sets on large VMs can be fully protected, falling back to a
  fixed cap when total RAM is unavailable.

The refault counter is parsed unsigned and clamped so it cannot wrap
negative and disable the brake. The PSI-unavailable path keeps the same
refault brake and floor decay so behavior degrades gracefully on kernels
without CONFIG_PSI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
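The "parsed unsigned and clamped" behavior mentioned in the commit message above might be sketched as follows. `ParseRefaultFileCount` is a hypothetical name, and the choice to clamp to the largest `int64_t` while keeping -1 for "counter not found" is an assumption about the return convention.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <limits>

// Extracts workingset_refault_file from a /proc/vmstat snapshot.
// Returns -1 if the counter is absent or malformed; otherwise a non-negative
// value clamped to INT64_MAX so a huge counter cannot wrap negative.
int64_t ParseRefaultFileCount(const char* VmstatText)
{
    static const char Marker[] = "workingset_refault_file ";
    const char* Start = std::strstr(VmstatText, Marker);
    if (Start == nullptr)
    {
        return -1;
    }

    const char* Digits = Start + sizeof(Marker) - 1;
    if (*Digits < '0' || *Digits > '9')
    {
        // strtoull would accept a leading '-' or '+', so require a digit.
        return -1;
    }

    errno = 0;
    char* End = nullptr;
    const unsigned long long Value = std::strtoull(Digits, &End, 10);
    if (End == Digits)
    {
        return -1;
    }

    constexpr unsigned long long Max = static_cast<unsigned long long>(std::numeric_limits<int64_t>::max());
    if (errno == ERANGE || Value > Max)
    {
        return std::numeric_limits<int64_t>::max();
    }

    return static_cast<int64_t>(Value);
}
```

With this shape, a negative result always means "no reading", so a very large counter cannot be mistaken for a missing one and silently disable the refault brake.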
@benhillis benhillis force-pushed the benhill/mem-reclaim-3-adaptive branch from 3c2691b to 4e60f35 Compare June 11, 2026 18:10
@benhillis

Member Author

Closing for now. These changes are being submitted one PR at a time; this will be reopened against master once the preceding PR in the series has merged, so it shows only its own incremental change. The branch remains pushed.

@benhillis benhillis closed this Jun 11, 2026
