
wslc: idle-terminate per-user session VMs when inactive#40781

Draft
benhillis wants to merge 20 commits into master from user/benhill/wslc-idle-terminate-vm

Conversation

@benhillis

Member

Summary

Idle-terminates a per-user WSLC session's backing VM when it has been inactive, freeing memory while the session object (and its persistent storage) lives on. The VM is transparently recreated on the next operation.

Builds on #40770 (IWSLCVirtualMachineFactory).

Behavior

  • Only sessions with persistent storage (StoragePath set) idle-terminate.
  • An idle worker thread tears the VM down after a grace period (currently 30s) once there is no in-flight activity and no active container lock.
  • In-flight work holds an activity reference so the VM cannot be torn down mid-operation:
    • VmLease wraps CLI/container operations.
    • BeginContainerOperation hands clients an activity token (IFastRundown so a client crash reclaims it promptly).
    • Long-lived root-namespace processes (e.g. plugin hosts) created via CreateRootNamespaceProcess hold a keep-alive token for their lifetime.
  • Activity bookkeeping (count + wake event) lives in a shared IdleState held via shared_ptr, decoupled from the session's lifetime. A held token therefore suppresses idle teardown without extending the session object's lifetime, which preserves the explicit-reset-invalidates-held-processes invariant from #40366 (Add WSLC (WSL Containers) feature).
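
A minimal sketch of the token/IdleState split described in the last bullet, using only the C++ standard library. The names `SessionSketch`, `BeginOperation`, and the field layout of `IdleState` are assumptions made for illustration; the PR's actual types are COM objects and are not shown in this conversation.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the shared bookkeeping: an activity count plus a
// wake signal, owned through shared_ptr rather than by the session itself.
struct IdleState
{
    std::mutex lock;
    std::condition_variable wake;
    int activityCount = 0;
    bool recheck = false; // set when the idle worker should re-evaluate
};

// RAII token: keeps the activity count above zero for its lifetime.
class ActivityToken
{
public:
    explicit ActivityToken(std::shared_ptr<IdleState> state) : m_state(std::move(state))
    {
        std::lock_guard<std::mutex> guard(m_state->lock);
        ++m_state->activityCount;
    }

    ~ActivityToken()
    {
        {
            std::lock_guard<std::mutex> guard(m_state->lock);
            --m_state->activityCount;
            m_state->recheck = true;
        }
        m_state->wake.notify_all(); // wake the idle worker to re-check
    }

    ActivityToken(const ActivityToken&) = delete;
    ActivityToken& operator=(const ActivityToken&) = delete;

private:
    std::shared_ptr<IdleState> m_state; // references the state, not the session
};

// Hypothetical session: it owns one reference to the state and nothing more,
// so an outstanding token never extends the session's own lifetime.
class SessionSketch
{
public:
    std::unique_ptr<ActivityToken> BeginOperation() const
    {
        return std::make_unique<ActivityToken>(m_idle);
    }

    bool IsIdle() const
    {
        std::lock_guard<std::mutex> guard(m_idle->lock);
        return m_idle->activityCount == 0;
    }

    std::weak_ptr<IdleState> State() const { return m_idle; }

private:
    std::shared_ptr<IdleState> m_idle = std::make_shared<IdleState>();
};
```

In this sketch, destroying the session while a token is outstanding leaves the `IdleState` alive until the token is released, but the session object itself is gone, which is the decoupling the bullet describes.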

Testing

  • New WSLCE2EVmIdleTests E2E suite (5 tests) including WSLCE2E_VmIdle_RootProcessKeepsVmAlive.
  • WSLCTests::CreateRootNamespaceProcess still passes.
  • Full x64 Debug build clean.

Notes / follow-ups (deferred)

  • Grace period is a hardcoded constexpr; making it injectable would enable deterministic race tests.
  • No crash-path (client dies holding token) automated coverage yet.

Note

Draft for early review.

Copilot AI review requested due to automatic review settings June 11, 2026 19:50

Copilot AI left a comment

Contributor


Pull request overview

This PR adds on-demand creation and idle-termination of per-user WSLC session VMs (for sessions with persistent storage), so memory can be reclaimed while keeping the session object and storage intact. It also introduces VM-liveness/activity bookkeeping to prevent teardown during in-flight operations and adds new E2E coverage around VM lifecycle behavior.

Changes:

  • Implement lazy VM bring-up and idle shutdown in wslcsession via an idle worker, activity counting/tokens, and a VmLease used by VM-requiring operations.
  • Add client-side “operation keep-alive” usage in wslc.exe container operations to prevent VM teardown between OpenContainer and subsequent calls/streaming.
  • Add a new E2E test suite validating lazy start, idle stop, persistence across restarts, keep-alive for root-namespace processes, and teardown/recreate races.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp | New E2E tests covering lazy VM start, idle stop, persistence, keep-alive, and race scenarios. |
| test/windows/wslc/e2e/WSLCE2EHelpers.h | Exposes the underlying IWSLCSession* for diagnostics/test-only calls. |
| src/windows/wslcsession/WSLCSession.h | Adds VM lifecycle state, idle worker/tokens/lease declarations, and new session methods. |
| src/windows/wslcsession/WSLCSession.cpp | Implements lazy VM creation, idle teardown, activity tokens, and VM diagnostics reporting. |
| src/windows/wslcsession/WSLCProcessControl.cpp | Preserves a real exit code when signaling container release, only synthesizing SIGKILL when needed. |
| src/windows/wslcsession/WSLCProcess.h | Stores a keep-alive token on root-namespace processes to keep the VM alive for their lifetime. |
| src/windows/wslcsession/WSLCContainer.cpp | Signals idle re-checks on terminal container transitions; holds a VM lease during delete. |
| src/windows/wslcsession/IORelay.h | Adds IsRelayThread() to safely avoid destroying the relay on its own thread. |
| src/windows/wslcsession/IORelay.cpp | Co-initializes the relay thread into the MTA; implements IsRelayThread(). |
| src/windows/wslc/services/SessionModel.h | Adds a helper to acquire/hold a keep-alive token for client-side container operations. |
| src/windows/wslc/services/ContainerService.cpp | Uses the keep-alive token across container operations (attach/start/stop/kill/delete/exec/etc.). |
| src/windows/service/inc/wslc.idl | Adds VM diagnostics type + new session methods for diagnostics and operation keep-alive. |
| src/windows/service/exe/WSLCSessionManager.cpp | Updates comments to reflect on-demand VM creation and recreation after idle termination. |

Comment thread src/windows/wslcsession/WSLCSession.h
Comment thread src/windows/service/inc/wslc.idl
Copilot AI review requested due to automatic review settings June 11, 2026 20:10

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment on lines +529 to +534
std::lock_guard containersLock(m_containersLock);
std::lock_guard networksLock(m_networksLock);

m_containers.clear();
m_volumes.reset();
m_networks.clear();
Comment on lines +343 to +346
// The VM is created lazily on the first operation that requires it (see EnsureVmRunning)
// and torn down when the session becomes idle. Start the worker that performs idle teardown.
m_idleThread = std::thread([this]() { IdleWorker(); });

kvega005 and others added 14 commits June 11, 2026 14:02
Summary
-------
Decouples a wslc session's VM lifetime from the wslcsession.exe process so the
VM can be torn down when nothing needs it and recreated on demand later, while
the per-user session and its bookkeeping survive across VM restarts.

Previously a VM was created 1:1 with a session: the SYSTEM service eagerly built
an HcsVirtualMachine and handed the COM pointer to wslcsession.exe, and any VM
exit permanently terminated the (single-shot) session.

Detailed description
--------------------
* New service-side IWSLCVirtualMachineFactory lets the per-user process mint a
  fresh VM at any time. IWSLCSessionFactory::CreateSession / IWSLCSession::Initialize
  now take the factory instead of an eager IWSLCVirtualMachine. WSLCVirtualMachineFactory
  deep-copies the settings and duplicates the dmesg handle per creation.
* WSLCSession Initialize is now lightweight (persists settings, starts the idle
  worker). VM bring-up/teardown is split into re-runnable StartVmLockHeld /
  StopVmLockHeld / TearDownVmLockHeld driven by a VmState machine
  (None/Starting/Running/Stopping).
* On-demand bring-up + idle teardown: every VM-requiring operation takes a VmLease
  RAII (EnsureVmRunning + activity count); when activity drops to zero and no
  container is in the Created or Running state, the idle worker tears the VM down
  immediately. Idle termination is enabled only for persistent-storage sessions.
* VM-exit disambiguation: intentional stops (m_vmStopRequested) keep the session
  alive; unexpected exits still Terminate() permanently.
* New IWSLCSession::GetVmDiagnostics (Running + StartCount) exposes VM lifecycle
  for tests/diagnostics without bringing the VM up or counting as activity.
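
The lease and state machine described above can be sketched roughly as follows. This is a single-process illustration under stated assumptions: `LazyVmSession`, `TryIdleStop`, and the private `Acquire`/`Release` helpers are hypothetical names, and the real bring-up and teardown work (StartVmLockHeld and friends) is replaced by plain state assignments.

```cpp
#include <mutex>

enum class VmState { None, Starting, Running, Stopping };

// Hypothetical session that creates its "VM" on first use and lets an idle
// check tear it down only when no lease is outstanding.
class LazyVmSession
{
public:
    // RAII lease: ensures the VM is running and counts as activity.
    class VmLease
    {
    public:
        explicit VmLease(LazyVmSession& session) : m_session(session) { m_session.Acquire(); }
        ~VmLease() { m_session.Release(); }
        VmLease(const VmLease&) = delete;
        VmLease& operator=(const VmLease&) = delete;

    private:
        LazyVmSession& m_session;
    };

    // Called by the idle worker; returns true if the VM was torn down.
    bool TryIdleStop()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_state != VmState::Running || m_activity > 0)
        {
            return false;
        }
        m_state = VmState::Stopping; // real teardown would happen here
        m_state = VmState::None;
        return true;
    }

    VmState State() const
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_state;
    }

    // Mirrors the StartCount diagnostic: how many times the VM was created.
    int StartCount() const
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_startCount;
    }

private:
    void Acquire()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_state == VmState::None)
        {
            m_state = VmState::Starting; // real bring-up would happen here
            ++m_startCount;
            m_state = VmState::Running;
        }
        ++m_activity;
    }

    void Release()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        --m_activity;
    }

    mutable std::mutex m_lock;
    VmState m_state = VmState::None;
    int m_activity = 0;
    int m_startCount = 0;
};
```

Reading `State()` or `StartCount()` in this sketch neither starts the VM nor bumps the activity count, matching the intent stated for GetVmDiagnostics.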

Concurrency fixes folded in (compile-validated; flagged for runtime stress):
* IORelay self-join: TearDownVmLockHeld no longer destroys the IO relay from its
  own thread (added IORelay::IsRelayThread); the stopped relay is left for
  ~WSLCSession on a non-relay thread.
* Lease-vs-idle-stop race: VmLease retries instead of throwing ERROR_INVALID_STATE
  when the idle worker tears down in the bring-up window.
* Idle-worker-vs-crash deadlock: IdleWorker bails when the VM exit event is already
  signaled, letting the relay-thread Terminate path own teardown.
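
The self-join fix in the first bullet can be illustrated with standard threads. This is a sketch, not the IORelay implementation: `RelaySketch` and `TryJoin` are hypothetical, and only the idea of checking "am I the relay thread?" before joining is taken from the commit message.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical relay that refuses to join itself. Joining a thread from that
// same thread would deadlock (or throw), so teardown is deferred instead.
class RelaySketch
{
public:
    explicit RelaySketch(std::function<void(RelaySketch&)> body)
    {
        m_thread = std::thread([this, body = std::move(body)] {
            m_id.store(std::this_thread::get_id()); // published by the relay thread itself
            body(*this);
        });
    }

    // Must be destroyed from a non-relay thread, as in the commit message.
    ~RelaySketch() { TryJoin(); }

    bool IsRelayThread() const { return m_id.load() == std::this_thread::get_id(); }

    // Returns false when called on the relay thread: the caller must leave
    // the join to some other thread.
    bool TryJoin()
    {
        if (IsRelayThread())
        {
            return false;
        }
        if (m_thread.joinable())
        {
            m_thread.join();
        }
        return true;
    }

private:
    std::atomic<std::thread::id> m_id{std::thread::id{}};
    std::thread m_thread;
};
```

A teardown path running on the relay thread sees `TryJoin()` return false and leaves the stopped relay in place, and the owner's destructor later joins it from a different thread.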

Validation steps
-----------------
* Full solution build (x64 Debug) green, including wsltests.dll.
* Copyright-header validation: no new violations.
* Added E2E tests (test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp): lazy start +
  idle stop, recreate-on-demand + state persistence, Created container keeps VM
  alive, and concurrent recreate stress (lease/idle race).
* NOT runtime-validated here (requires deploy + Administrator + container runtime);
  run bin\x64\Debug\test.bat /name:*VmIdle* and stress the two race fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…deadlock)

Runtime fixes for the idle-terminate feature so all four WSLCE2EVmIdleTests pass:

- Register the IWSLCVirtualMachineFactory proxy in the process Global Interface
  Table and re-fetch it per VM creation, and MTA-init the IdleWorker / IORelay
  threads, so on-demand VM creation no longer fails with RPC_E_WRONG_THREAD.
- Register the factory IID in the MSIX so cross-proc marshalling resolves.
- Preserve a container's real exit code in DockerContainerProcessControl::
  OnContainerReleased: only synthesize 128+SIGKILL when no exit code was ever
  recorded, so --rm 'container run' returns 0 instead of 137 when the VM
  idle-terminates immediately after the container exits.
- Fix an AB-BA deadlock between 'container rm' and idle teardown: hold a VmLease
  across WSLCContainer::Delete (keeps the VM up and blocks teardown) and drop the
  now-redundant shared-lock re-acquire in OnContainerDeleted (which would deadlock
  behind the idle worker's pending exclusive lock).
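
The exit-code rule from the third bullet reduces to a small decision, sketched here with a hypothetical free function. The helper name and the `std::optional` representation of "no exit code was ever recorded" are assumptions; the value 9 is the conventional Linux signal number for SIGKILL.

```cpp
#include <optional>

constexpr int c_sigKill = 9; // Linux SIGKILL

// Hypothetical helper: report the container's real exit code if one was
// recorded, and fall back to 128 + SIGKILL (137) only when none exists.
int ExitCodeOnRelease(const std::optional<int>& recordedExitCode)
{
    return recordedExitCode.value_or(128 + c_sigKill);
}
```

With this rule a container that exited cleanly just before the VM idle-terminated still reports 0, while one that never recorded an exit code reports 137.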

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Operation

The wslc CLI performs each container mutation as two COM round-trips
(OpenContainer to resolve a wrapper, then the operation), and may stream
output afterwards. With on-demand VM idle-termination enabled (any
persistent-storage session), the VM could idle-stop in the gap between the
calls when the target container is not Created/Running: TearDownVmLockHeld
clears m_containers, disconnecting the client-held wrapper, so the second
call failed with RPC_E_DISCONNECTED. This regressed the container CRUD E2E
suite (rm of stopped containers, and cleanup helpers).

Add IWSLCSession::BeginContainerOperation, returning an activity token that
holds m_activityCount > 0 for as long as the client holds it. The CLI now
holds the token across the whole operation (resolve + operate + streamed
relay), so the idle worker cannot tear the VM down mid-operation. Releasing
the token (or the client exiting, via fast rundown) lets the VM idle again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tearing the VM down the instant it went idle could thrash it (repeated
teardown/recreate) when containers are created and destroyed, or operations
issued, in quick succession. Keep an otherwise-idle VM running for a short
grace period and only tear it down once it has stayed idle for the whole
window.

The idle worker now waits with a timeout derived from a grace deadline. The
deadline is armed when the VM is first observed idle and reset on any non-idle
observation or explicit idle-check signal (raised on every lease/token release
and terminal container state change), so teardown occurs a full grace period
after the last activity. A WAIT_TIMEOUT wake means the VM has been continuously
idle for the grace period and is torn down.
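
The deadline logic in this commit message can be modelled as a small state holder with the clock injected, which also makes it deterministic to test. `GraceTimer` and its method names are hypothetical; the real idle worker waits on an event with a timeout rather than being polled.

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Hypothetical model of the grace deadline: armed on the first idle
// observation, cleared on activity, and expired only after a full window.
class GraceTimer
{
public:
    explicit GraceTimer(Clock::duration grace) : m_grace(grace) {}

    // Records one observation. Returns true when the VM has stayed idle for
    // the whole grace period and may be torn down.
    bool Observe(Clock::time_point now, bool idle)
    {
        if (!idle)
        {
            m_deadline.reset(); // non-idle observation clears the deadline
            return false;
        }
        if (!m_deadline)
        {
            m_deadline = now + m_grace; // first idle observation arms it
            return false;
        }
        return now >= *m_deadline;
    }

    // Explicit idle-check signal (lease/token release, terminal container
    // transition): the next idle observation starts a fresh window.
    void OnIdleCheckSignal() { m_deadline.reset(); }

    // Timeout for the worker's next wait; empty means wait indefinitely.
    std::optional<Clock::duration> WaitTimeout(Clock::time_point now) const
    {
        if (!m_deadline)
        {
            return std::nullopt;
        }
        return *m_deadline > now ? *m_deadline - now : Clock::duration::zero();
    }

private:
    Clock::duration m_grace;
    std::optional<Clock::time_point> m_deadline;
};
```

Passing `now` in explicitly is a design choice for the sketch only; it relates to the follow-up noted in the PR description about making the grace period injectable.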

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A root-namespace process created via CreateRootNamespaceProcess is not
tracked as a container, so it did not contribute to the session activity
count or HasActiveContainerLockHeld(). The VmLease taken during creation
was released when the call returned, leaving the process eligible for
idle teardown: once the grace period elapsed the idle worker could stop
the VM and kill a long-lived root process (e.g. a plugin host) out from
under the client.

Bind an activity token to the returned WSLCProcess so the VM stays alive
for as long as the client holds the process proxy. Factor the existing
BeginContainerOperation token logic into CreateActivityToken() and reuse
it here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…VM lifecycle

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #40767 (terminateEvent) removed ITerminationCallback / WSLCSessionSettings.TerminationCallback from the IDL, but WSLCVirtualMachineFactory (introduced later by factory PR #40770, whose branch merged from master) still referenced them, so the branch did not compile. Drop the dead member and assignments; the VM now caches the termination reason for WSLCSession to pull.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Kevin's PR #40767 (event-model termination) moved m_vmExitEvent.SetEvent()
in OnExit() to after a new std::lock_guard(m_lock) that caches the
termination reason. ~HcsVirtualMachine already holds m_lock across the 5s
exit-event wait and HcsCloseComputeSystem (which drains in-flight HCS
callbacks). An in-flight OnExit() therefore blocks acquiring m_lock, so it
never signals the exit event nor drains, and the close never completes:
a hard deadlock that StuckVmTermination reliably reproduces.

Drop the broad lock from the dtor. By the time the compute system is
closed no further callbacks can run, so the remaining teardown is safe
unguarded. Flag to Kevin to fold into #40767.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In the lazy-VM model, container/volume/network recovery runs at first VM
start rather than during CreateSession, so the create-time WarningCallback
was out of scope by the time recovery emitted warnings.

Park the session's WarningCallback in the GIT and have WSLCExecutionContext
fall back to it when an operation has no explicit callback, so lazy-recovery
warnings still reach the user. CLI-side, keep the create/enter callback alive
for the whole command by storing it in the Session model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConfigureStorage validates the storage path lazily (via AttachDisk at first
VM start), so with a lazily-created VM a WSLCSessionStorageFlagsNoCreate
session pointing at a missing path no longer failed at CreateSession.
Validate the storage VHD existence eagerly in Initialize so misconfiguration
is reported up front.

The idle worker acquired m_lock exclusively (blocking) on every wake. Because
SRW locks favor a waiting writer, that pending acquire stalled all new
shared-lock operations behind it, so a long-running operation holding its
shared VmLease (e.g. a blocking SaveImage/Export) serialized every concurrent
operation until it completed. Use try_lock_exclusive and treat contention as
activity, re-evaluating on the next idle-check signal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@benhillis benhillis force-pushed the user/benhill/wslc-idle-terminate-vm branch from c12d7e1 to fa2eb47 Compare June 12, 2026 01:28

3 participants