Back system distro overlay with a per-instance scratch vhd #40739

Draft
benhillis wants to merge 5 commits into microsoft:master from benhillis:benhillis/scratch-vhd-overlay

benhillis (Member) commented Jun 7, 2026

Summary of the Pull Request

The WSLg system distro runs from a read-only VHD with a writable overlay on top. That overlay's read/write layer was backed by tmpfs, so everything written into it (logs, temp files, copied-up files, build output) consumed guest memory and could spill into swap. Heavy writes (e.g. compiling the Linux kernel inside the system distro) could exhaust RAM and swap and trigger the OOM killer VM-wide.

This change backs the overlay read/write layer with a per-instance temporary ext4 "scratch" VHD (dynamically expanding, 64 GB cap) instead, mirroring the existing swap VHD. Overlay writes now land on reclaimable disk page cache rather than pinned guest memory, and a runaway write gets a clean ENOSPC instead of an OOM kill.

PR Checklist

  • Closes: Link to issue #xxx
  • Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
  • Tests: Added/updated if needed and all pass
  • Localization: All end user facing strings can be localized
  • Dev docs: Added/updated if needed
  • Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx

Detailed Description of the Pull Request / Additional comments

Host (src/windows/service/exe/)

  • When GUI apps are enabled (LaunchInit message), WslCoreVm::CreateInstance creates a sparse, dynamically-expanding scratch-<InstanceId>.vhdx (c_scratchVhdSizeBytes = 64 GB) under user impersonation, attaches it via AttachDiskLockHeld to get a SCSI LUN, derives the VHD path deterministically from the runtime instance GUID (GetInstanceScratchPath; no tracking map is kept), and passes the LUN to the guest in LX_MINI_INIT_MESSAGE.ScratchLun (default ULONG_MAX = none).
  • On terminate, LxssUserSession::_TerminateInstanceInternal captures the instance id and calls WslCoreVm::CleanupInstanceScratch, which recomputes the deterministic path, ejects the disk, and deletes the VHDX (idempotent; nothing is tracked). The per-VM temp directory teardown on VM shutdown is the backstop for any leak.
  • scope_exit guards cover every failure path (create-but-attach-fails, attach-but-track-fails, and post-registration startup failures), so a failed launch never leaks an attached disk, LUN, or file, and never advertises a torn-down LUN to the guest.
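
As a minimal sketch of why a deterministic path makes cleanup idempotent: the names `ScratchPathFor` and `CleanupScratch` below are hypothetical stand-ins, not the PR's `GetInstanceScratchPath` or `CleanupInstanceScratch`, and the disk-eject step is omitted. Because the path is a pure function of the temp directory and the instance id, cleanup can be called any number of times without a tracking map.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: the scratch path depends only on the per-VM temp
// directory and the instance id, so it can be recomputed wherever needed.
std::filesystem::path ScratchPathFor(const std::filesystem::path& vmTempDir, const std::string& instanceId)
{
    assert(!instanceId.empty());
    return vmTempDir / ("scratch-" + instanceId + ".vhdx");
}

// Hypothetical cleanup: recompute the path and delete the file if it exists.
// Removing a file that is already gone is not an error, so repeated calls
// succeed. Returns true when no error occurred.
bool CleanupScratch(const std::filesystem::path& vmTempDir, const std::string& instanceId)
{
    std::error_code error;
    std::filesystem::remove(ScratchPathFor(vmTempDir, instanceId), error);
    return !error;
}
```

Calling `CleanupScratch` from both the failure path and the terminate path is safe under this design, since the second call finds nothing to delete.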

Message (src/shared/inc/lxinitshared.h)

  • ScratchLun moved from LX_MINI_INIT_EARLY_CONFIG_MESSAGE to LX_MINI_INIT_MESSAGE (it is now per-instance, not per-VM).

Guest (src/linux/init/)

  • CreateOverlayScratch(Lun) formats the scratch device as ext4 (no journal, lazy inode init, since the data is disposable) and mounts it at /scratch.
  • UtilMountOverlayFs gained an optional scratch-root parameter: when present, the overlay rw layer is a unique subdirectory bind-mounted from /scratch (disk-backed, reclaimable); otherwise it falls back to tmpfs. The scratch-backed rw layer is mounted with the overlayfs volatile option (skipping upper-fs syncs, since the data is disposable); a kernel that rejects volatile with EINVAL is retried without it.
  • If the scratch VHD cannot be created, attached, formatted, or mounted, or if the overlay rw setup fails (e.g. backing disk full), the overlay transparently falls back to the previous tmpfs behavior so the distro still launches.
  • Each instance runs in a private mount namespace, so the /scratch mount and the overlay are torn down automatically when the instance exits.
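
The volatile retry described above can be sketched as follows. This is an illustration under stated assumptions, not the code in UtilMountOverlayFs: `MountFn`, `OverlayOptions`, and `MountScratchOverlay` are hypothetical names, the mount call is injected as a callback so the logic can be exercised without root, and the lower directory path is made up.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <string>

// Hypothetical mount callback: returns 0 on success or an errno value.
using MountFn = std::function<int(const std::string& options)>;

// Build an overlayfs option string; the upper/work layout is illustrative.
std::string OverlayOptions(const std::string& lower, const std::string& rwRoot, bool useVolatile)
{
    std::string options = "lowerdir=" + lower + ",upperdir=" + rwRoot + "/upper,workdir=" + rwRoot + "/work";
    if (useVolatile)
    {
        options += ",volatile";
    }
    return options;
}

// Try the mount with volatile first. Retry without it only when the kernel
// answers EINVAL; any other error is returned unchanged so the caller can
// fall back to a tmpfs read/write layer.
int MountScratchOverlay(const std::string& lower, const std::string& rwRoot, const MountFn& mountFn)
{
    assert(static_cast<bool>(mountFn));
    int error = mountFn(OverlayOptions(lower, rwRoot, true));
    if (error == EINVAL)
    {
        error = mountFn(OverlayOptions(lower, rwRoot, false));
    }
    return error;
}
```

Limiting the retry to EINVAL keeps a real failure, such as a full backing disk, from being masked by a second attempt.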

No new user-facing strings; no IDL/ABI changes (the ScratchLun field moved within the internal mini_init wire protocol, which is versioned together).

Validation Steps Performed

Built, deployed to host, and validated end-to-end inside the system distro (wsl --system -d <distro>):

  • ext4-backed overlay: root overlay (df -h /) reports 64 GB (the scratch size); mount shows upperdir=/system/rw/upper on the bind-mounted ext4 scratch.
  • RAM not pinned: a 4 GiB write lands in reclaimable buff/cache (freed by echo 3 > /proc/sys/vm/drop_caches) and not in shared/shmem, unlike the previous tmpfs behavior.
  • Host disk grows on demand: an 8 GiB random write grows the host scratch-<id>.vhdx from ~0.07 GB to ~8.07 GB; only the VHD for the instance written to grows (per-instance isolation confirmed with two running distros).
  • Cleanup: wsl --terminate <distro> ejects and deletes that instance's scratch-*.vhdx while the VM stays running; the per-VM temp directory is removed on wsl --shutdown.
  • Fallback: verified the overlay still launches via tmpfs when no scratch device is available.

Constrained-memory OOM A/B (the headline win): capped the VM at memory=4GB, swap=0 and wrote 6 GiB into the system-distro overlay on the same build, comparing the ext4 scratch against the old tmpfs behavior (mount -t tmpfs):

| | ext4 scratch (this change) | tmpfs (old behavior) |
|---|---|---|
| 6 GiB overlay write | succeeds (/ 10% used, data on disk) | fails |
| Global OOM killer | none | fires repeatedly |
| Processes killed | none | Xwayland, WSLGd/init(), pulseaudio, GnsEngine, … |
| vmmemWSL (host) | bounded under the 4 GB cap | pinned at the cap, then OOM |

With ext4 the kernel writes overlay data back to the scratch VHD and frees the pages, so 6 GiB fits in a 4 GiB VM. With tmpfs every byte is unevictable shmem, so at the memory ceiling the kernel OOM-kills the WSLg stack and the write never completes, which is the exact failure this change eliminates.

Memory reclaimability A/B (unconstrained VM): same 8 GiB write — ext4 keeps guest Shmem flat and buff/cache fully reclaimable (host vmmemWSL working set drops ~9.7 GB → ~5.4 GB after drop_caches), whereas tmpfs parks the full 8 GiB in Shmem that drop_caches cannot reclaim (working set stays pinned at ~9.5 GB).

The change was additionally reviewed across several rounds of multi-model code review; all findings were addressed.

Copilot AI review requested due to automatic review settings June 7, 2026 20:05
@benhillis benhillis force-pushed the benhillis/scratch-vhd-overlay branch from 9f44d5b to 7c4a74e Compare June 7, 2026 20:08

Copilot AI (Contributor) left a comment

Pull request overview

This PR changes how the WSLg system distro’s writable overlay is backed: instead of using tmpfs (guest memory), it introduces a per-instance, dynamically-expanding scratch VHD that is formatted/mounted in the guest and used as the overlayfs upper/work backing store. This aims to prevent heavy write workloads in the system distro from exhausting VM RAM/swap and triggering VM-wide OOM.

Changes:

  • Windows host: create/attach a per-instance scratch-<InstanceId>.vhdx, track it per runtime instance, pass its SCSI LUN to the guest, and clean it up on failed startup and on termination.
  • Wire protocol: add ScratchLun to LX_MINI_INIT_MESSAGE so it can be specified per instance.
  • Linux guest init: format/mount the scratch device as ext4 at /scratch, and teach overlay mounting to optionally use a scratch-backed upper/work layer with fallback to tmpfs on failure.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/windows/service/exe/WslCoreVm.h | Adds APIs and per-instance tracking for scratch VHD paths. |
| src/windows/service/exe/WslCoreVm.cpp | Implements scratch VHD creation/attach, LUN passing, and cleanup logic. |
| src/windows/service/exe/WslCoreInstance.h | Exposes instance id for cleanup coordination. |
| src/windows/service/exe/WslCoreInstance.cpp | Implements GetInstanceId(). |
| src/windows/service/exe/LxssUserSession.cpp | Ensures scratch cleanup on post-create startup failure and on terminate. |
| src/shared/inc/lxinitshared.h | Extends mini-init message with ScratchLun. |
| src/linux/init/util.h | Extends overlay mount helper signature; adds temp dir helper decl. |
| src/linux/init/util.cpp | Adds temp dir helper and implements scratch-backed overlay upper/work setup. |
| src/linux/init/main.cpp | Formats/mounts scratch ext4 and uses it for system distro overlay with tmpfs fallback. |

Comment thread src/windows/service/exe/WslCoreVm.cpp Outdated
Copilot AI review requested due to automatic review settings June 8, 2026 16:29

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread src/windows/service/exe/WslCoreVm.cpp Outdated
Comment thread test/windows/Common.cpp Outdated
@benhillis benhillis force-pushed the benhillis/scratch-vhd-overlay branch from 064342b to 66e49cc Compare June 10, 2026 05:21
Copilot AI review requested due to automatic review settings June 10, 2026 05:21

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread test/windows/Common.cpp
Comment thread src/linux/init/util.cpp Outdated
@benhillis benhillis force-pushed the benhillis/scratch-vhd-overlay branch from 66e49cc to 49ba5e1 Compare June 10, 2026 05:35
Ben Hillis and others added 4 commits June 10, 2026 09:10
The WSLg system distro runs from a read-only vhd with a writable overlay on
top. That overlay's read/write layer was backed by tmpfs, so everything
written into it (logs, temp files, copied-up files, build output) consumed
guest memory and could spill into swap. Heavy writes -- e.g. compiling the
Linux kernel in the system distro -- could exhaust RAM and swap and trigger
the OOM killer VM-wide.

Back the overlay read/write layer with a per-instance temporary ext4 "scratch"
vhd (dynamically expanding, 64 GB cap) instead, mirroring the swap vhd. Writes
now land on reclaimable disk page cache rather than pinned guest memory, and a
runaway write gets a clean ENOSPC instead of an OOM kill.

The host creates and attaches a scratch-<InstanceId>.vhdx per instance when
GUI apps are enabled, passes its LUN to the guest in LX_MINI_INIT_MESSAGE, and
ejects + deletes it when the instance terminates (with the per-VM temp dir
teardown as a backstop). The guest formats the device as ext4, mounts it, and
bind-mounts a unique subdirectory as the overlay rw layer. If the scratch vhd
cannot be created, attached, or mounted, the overlay transparently falls back
to the previous tmpfs behavior so the distro still launches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The test hard-coded mkfs.ext4 /dev/sde for its bare-mounted 20MB disk. The
per-instance system distro overlay scratch vhd now occupies an earlier
/dev/sd* node, shifting the bare disk and causing the test to format the
wrong device. Detect the disk by size instead, and promote MountTests'
GetBlockDeviceInWsl helper to a shared Common.h/Common.cpp function.
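
A rough sketch of size-based detection, assuming a hypothetical `SizeProbe` callback in place of whatever GetBlockDeviceInWsl actually runs inside the guest (the real helper's signature is not shown in this PR page):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Hypothetical probe: returns the device size in bytes, or nullopt when the
// device node does not exist.
using SizeProbe = std::function<std::optional<std::uint64_t>(const std::string& device)>;

// Scan /dev/sda through /dev/sdz and return the first device whose size
// matches, instead of assuming a fixed node such as /dev/sde.
std::optional<std::string> FindDiskBySize(std::uint64_t wantedBytes, const SizeProbe& probe)
{
    assert(static_cast<bool>(probe));
    for (char letter = 'a'; letter <= 'z'; ++letter) // inclusive bound covers /dev/sdz
    {
        const std::string device = std::string("/dev/sd") + letter;
        const std::optional<std::uint64_t> size = probe(device);
        if (size && *size == wantedBytes)
        {
            return device;
        }
    }
    return std::nullopt;
}
```

Matching on size only works while the test disk's size is unique among attached disks, which holds here because the scratch disk is far larger than the 20MB test disk.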

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The per-instance scratch vhd path is deterministic from the instance id, so
remove the m_instanceScratchVhds map and derive the path via GetInstanceScratchPath
wherever it is needed (create, failure cleanup, terminate). This makes cleanup
idempotent and removes the untracked-leak and const-std::move review findings.

Also fix the block-device scan in the test helper to include /dev/sdz.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The per-instance scratch vhd backing the system distro overlay's read/write
layer is reformatted on every launch and discarded on teardown, so the upper
dir is disposable. Mount the overlay with the 'volatile' option (overlayfs
>= 5.10) to skip syncs to the upper filesystem, avoiding writeback stalls on
a layer whose contents never need to survive a crash.

Volatile is only applied to the scratch-backed layer (a tmpfs rw layer gains
nothing). If the running kernel rejects the option the mount is retried
without it, so a disk-backed overlay is still used rather than falling back
to tmpfs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 17:04
@benhillis benhillis force-pushed the benhillis/scratch-vhd-overlay branch from 49ba5e1 to 86ca0ce Compare June 10, 2026 17:04

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/linux/init/util.cpp
Comment thread src/windows/service/exe/WslCoreVm.cpp
Comment thread src/windows/service/exe/WslCoreVm.cpp
…temp dir

Only retry the overlay mount without the volatile option when the kernel
rejects it with EINVAL; surface any other failure so the caller falls back
to a tmpfs read/write layer instead of masking a real error.

Throttle GetBlockDeviceInWsl's rescan loop with a short sleep so it does not
spin launching wsl.exe while the scratch disk attaches.

Set UtilCreateTempDirectory's path explicitly so the result does not depend
on prior buffer contents when ParentPath is null.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2 participants