wslc: idle-terminate per-user session VMs when inactive #40781
Draft
benhillis wants to merge 20 commits into
Pull request overview
This PR adds on-demand creation and idle-termination of per-user WSLC session VMs (for sessions with persistent storage), so memory can be reclaimed while keeping the session object and storage intact. It also introduces VM-liveness/activity bookkeeping to prevent teardown during in-flight operations and adds new E2E coverage around VM lifecycle behavior.
Changes:
- Implement lazy VM bring-up and idle shutdown in wslcsession via an idle worker, activity counting/tokens, and a VmLease used by VM-requiring operations.
- Add client-side “operation keep-alive” usage in wslc.exe container operations to prevent VM teardown between OpenContainer and subsequent calls/streaming.
- Add a new E2E test suite validating lazy start, idle stop, persistence across restarts, keep-alive for root-namespace processes, and teardown/recreate races.
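The lease idea in the changes above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `SketchSession`, `TryIdleStop`, `IsRunning`, and `StartCount` are hypothetical names, the VM is reduced to a boolean, and the only behavior taken from the description is that a lease brings the VM up on demand and counts as activity until it is released.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a session that owns a lazily created VM.
class SketchSession {
public:
    // RAII lease: starts the VM if needed and counts as activity until destroyed.
    class VmLease {
    public:
        explicit VmLease(SketchSession& session) : m_session(session) {
            std::lock_guard<std::mutex> lock(m_session.m_lock);
            m_session.EnsureVmRunningLockHeld();
            ++m_session.m_activityCount;
        }
        ~VmLease() {
            std::lock_guard<std::mutex> lock(m_session.m_lock);
            --m_session.m_activityCount;
        }
        VmLease(const VmLease&) = delete;
        VmLease& operator=(const VmLease&) = delete;

    private:
        SketchSession& m_session;
    };

    // What an idle worker would attempt: stop the VM only if no lease is held.
    bool TryIdleStop() {
        std::lock_guard<std::mutex> lock(m_lock);
        if (!m_running || m_activityCount > 0) {
            return false;
        }
        m_running = false;
        return true;
    }

    bool IsRunning() {
        std::lock_guard<std::mutex> lock(m_lock);
        return m_running;
    }

    int StartCount() {
        std::lock_guard<std::mutex> lock(m_lock);
        return m_startCount;
    }

private:
    // Lazy bring-up: a no-op when the VM is already running.
    void EnsureVmRunningLockHeld() {
        if (!m_running) {
            m_running = true;
            ++m_startCount;
        }
    }

    std::mutex m_lock;
    bool m_running = false;
    int m_activityCount = 0;
    int m_startCount = 0;
};
```

While a `SketchSession::VmLease` is in scope, `TryIdleStop()` returns false; once it goes out of scope the stop succeeds, and the next lease starts the VM again and bumps the start count.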
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp | New E2E tests covering lazy VM start, idle stop, persistence, keep-alive, and race scenarios. |
| test/windows/wslc/e2e/WSLCE2EHelpers.h | Exposes the underlying IWSLCSession* for diagnostics/test-only calls. |
| src/windows/wslcsession/WSLCSession.h | Adds VM lifecycle state, idle worker/tokens/lease declarations, and new session methods. |
| src/windows/wslcsession/WSLCSession.cpp | Implements lazy VM creation, idle teardown, activity tokens, and VM diagnostics reporting. |
| src/windows/wslcsession/WSLCProcessControl.cpp | Preserves a real exit code when signaling container release, only synthesizing SIGKILL when needed. |
| src/windows/wslcsession/WSLCProcess.h | Stores a keep-alive token on root-namespace processes to keep the VM alive for their lifetime. |
| src/windows/wslcsession/WSLCContainer.cpp | Signals idle re-checks on terminal container transitions; holds a VM lease during delete. |
| src/windows/wslcsession/IORelay.h | Adds IsRelayThread() to safely avoid destroying the relay on its own thread. |
| src/windows/wslcsession/IORelay.cpp | Co-initializes the relay thread into the MTA; implements IsRelayThread(). |
| src/windows/wslc/services/SessionModel.h | Adds a helper to acquire/hold a keep-alive token for client-side container operations. |
| src/windows/wslc/services/ContainerService.cpp | Uses the keep-alive token across container operations (attach/start/stop/kill/delete/exec/etc.). |
| src/windows/service/inc/wslc.idl | Adds VM diagnostics type + new session methods for diagnostics and operation keep-alive. |
| src/windows/service/exe/WSLCSessionManager.cpp | Updates comments to reflect on-demand VM creation and recreation after idle termination. |
Comment on lines +529 to +534

    std::lock_guard containersLock(m_containersLock);
    std::lock_guard networksLock(m_networksLock);

    m_containers.clear();
    m_volumes.reset();
    m_networks.clear();
Comment on lines +343 to +346

    // The VM is created lazily on the first operation that requires it (see EnsureVmRunning)
    // and torn down when the session becomes idle. Start the worker that performs idle teardown.
    m_idleThread = std::thread([this]() { IdleWorker(); });
Summary
-------
Decouples a wslc session's VM lifetime from the wslcsession.exe process so the VM can be torn down when nothing needs it and recreated on demand later, while the per-user session and its bookkeeping survive across VM restarts. Previously a VM was created 1:1 with a session: the SYSTEM service eagerly built an HcsVirtualMachine and handed the COM pointer to wslcsession.exe, and any VM exit permanently terminated the (single-shot) session.

Detailed description
--------------------
* New service-side IWSLCVirtualMachineFactory lets the per-user process mint a fresh VM at any time. IWSLCSessionFactory::CreateSession / IWSLCSession::Initialize now take the factory instead of an eager IWSLCVirtualMachine. WSLCVirtualMachineFactory deep-copies the settings and duplicates the dmesg handle per creation.
* WSLCSession Initialize is now lightweight (persists settings, starts the idle worker). VM bring-up/teardown is split into re-runnable StartVmLockHeld / StopVmLockHeld / TearDownVmLockHeld driven by a VmState machine (None/Starting/Running/Stopping).
* On-demand bring-up + idle teardown: every VM-requiring operation takes a VmLease RAII (EnsureVmRunning + activity count); when activity drops to zero and no container is in the Created or Running state, the idle worker tears the VM down immediately. Idle termination is enabled only for persistent-storage sessions.
* VM-exit disambiguation: intentional stops (m_vmStopRequested) keep the session alive; unexpected exits still Terminate() permanently.
* New IWSLCSession::GetVmDiagnostics (Running + StartCount) exposes VM lifecycle for tests/diagnostics without bringing the VM up or counting as activity.

Concurrency fixes folded in (compile-validated; flagged for runtime stress):
* IORelay self-join: TearDownVmLockHeld no longer destroys the IO relay from its own thread (added IORelay::IsRelayThread); the stopped relay is left for ~WSLCSession on a non-relay thread.
* Lease-vs-idle-stop race: VmLease retries instead of throwing ERROR_INVALID_STATE when the idle worker tears down in the bring-up window.
* Idle-worker-vs-crash deadlock: IdleWorker bails when the VM exit event is already signaled, letting the relay-thread Terminate path own teardown.

Validation steps
-----------------
* Full solution build (x64 Debug) green, including wsltests.dll.
* Copyright-header validation: no new violations.
* Added E2E tests (test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp): lazy start + idle stop, recreate-on-demand + state persistence, Created container keeps VM alive, and concurrent recreate stress (lease/idle race).
* NOT runtime-validated here (requires deploy + Administrator + container runtime); run bin\x64\Debug\test.bat /name:*VmIdle* and stress the two race fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…deadlock)

Runtime fixes for the idle-terminate feature so all four WSLCE2EVmIdleTests pass:
- Register the IWSLCVirtualMachineFactory proxy in the process Global Interface Table and re-fetch it per VM creation, and MTA-init the IdleWorker / IORelay threads, so on-demand VM creation no longer fails with RPC_E_WRONG_THREAD.
- Register the factory IID in the MSIX so cross-proc marshalling resolves.
- Preserve a container's real exit code in DockerContainerProcessControl::OnContainerReleased: only synthesize 128+SIGKILL when no exit code was ever recorded, so --rm 'container run' returns 0 instead of 137 when the VM idle-terminates immediately after the container exits.
- Fix an AB-BA deadlock between 'container rm' and idle teardown: hold a VmLease across WSLCContainer::Delete (keeps the VM up and blocks teardown) and drop the now-redundant shared-lock re-acquire in OnContainerDeleted (which would deadlock behind the idle worker's pending exclusive lock).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
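The exit-code rule from this commit, reduced to its decision, might look like the sketch below. `RecordedExit` and `ExitCodeOnRelease` are hypothetical names, not the PR's types; the only facts taken from the commit message are that a recorded exit code wins and that 128+SIGKILL (137) is synthesized only when none was ever recorded.

```cpp
#include <cassert>

// SIGKILL is signal 9 on Linux; defined locally because the Windows CRT's
// <csignal> does not provide it.
constexpr int kSigKill = 9;

// Hypothetical record of what a container process reported before release.
struct RecordedExit {
    bool hasExitCode = false;
    int exitCode = 0;
};

// Returns the code to report when the container is released: the real exit
// code if one was recorded, otherwise the conventional 128 + signal encoding.
int ExitCodeOnRelease(const RecordedExit& recorded) {
    if (recorded.hasExitCode) {
        return recorded.exitCode;
    }
    return 128 + kSigKill;
}
```

A container that exited cleanly before the VM was torn down reports 0 under this rule; only a process with no recorded exit is reported as 137.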
…Operation

The wslc CLI performs each container mutation as two COM round-trips (OpenContainer to resolve a wrapper, then the operation), and may stream output afterwards. With on-demand VM idle-termination enabled (any persistent-storage session), the VM could idle-stop in the gap between the calls when the target container is not Created/Running: TearDownVmLockHeld clears m_containers, disconnecting the client-held wrapper, so the second call failed with RPC_E_DISCONNECTED. This regressed the container CRUD E2E suite (rm of stopped containers, and cleanup helpers).

Add IWSLCSession::BeginContainerOperation, returning an activity token that holds m_activityCount > 0 for as long as the client holds it. The CLI now holds the token across the whole operation (resolve + operate + streamed relay), so the idle worker cannot tear the VM down mid-operation. Releasing the token (or the client exiting, via fast rundown) lets the VM idle again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tearing the VM down the instant it went idle could thrash it (repeated teardown/recreate) when containers are created and destroyed, or operations issued, in quick succession. Keep an otherwise-idle VM running for a short grace period and only tear it down once it has stayed idle for the whole window.

The idle worker now waits with a timeout derived from a grace deadline. The deadline is armed when the VM is first observed idle and reset on any non-idle observation or explicit idle-check signal (raised on every lease/token release and terminal container state change), so teardown occurs a full grace period after the last activity. A WAIT_TIMEOUT wake means the VM has been continuously idle for the grace period and is torn down.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
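The grace-deadline bookkeeping described in this commit can be modeled without any threads by passing the clock in explicitly. The sketch below is an illustration under assumptions: `IdleGraceTimer`, `OnSignal`, and `OnTimeout` are hypothetical names, and the real worker's wait primitive is replaced by the caller deciding which of the two methods to invoke.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical model of the idle worker's grace deadline.
class IdleGraceTimer {
public:
    explicit IdleGraceTimer(Clock::duration grace) : m_grace(grace) {}

    // An explicit idle-check signal (lease/token release, terminal container
    // state change). Restarts the window if idle, disarms it if busy.
    void OnSignal(bool idle, Clock::time_point now) {
        m_armed = idle;
        if (idle) {
            m_deadline = now + m_grace;
        }
    }

    // The timed wait expired. Returns true when the VM should be torn down.
    bool OnTimeout(bool idle, Clock::time_point now) {
        if (!idle) {
            m_armed = false;  // non-idle observation resets the window
            return false;
        }
        if (!m_armed) {
            m_armed = true;   // first idle observation arms the deadline
            m_deadline = now + m_grace;
            return false;
        }
        if (now < m_deadline) {
            return false;
        }
        m_armed = false;      // teardown consumes the window
        return true;
    }

private:
    Clock::duration m_grace;
    Clock::time_point m_deadline{};
    bool m_armed = false;
};
```

With a 5-second grace period, a signal at t=4s moves the deadline to t=9s, so a timeout at t=8s does nothing and one at t=9s reports teardown. That is the "full grace period after the last activity" behavior from the commit message.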
A root-namespace process created via CreateRootNamespaceProcess is not tracked as a container, so it did not contribute to the session activity count or HasActiveContainerLockHeld(). The VmLease taken during creation was released when the call returned, leaving the process eligible for idle teardown: once the grace period elapsed the idle worker could stop the VM and kill a long-lived root process (e.g. a plugin host) out from under the client.

Bind an activity token to the returned WSLCProcess so the VM stays alive for as long as the client holds the process proxy. Factor the existing BeginContainerOperation token logic into CreateActivityToken() and reuse it here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
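One way to picture binding a token to an object's lifetime is a reference-counted handle whose deleter decrements the activity count. The sketch below assumes plain C++ ownership in place of COM proxies; `ActivityCounter`, `CreateToken`, and `RootProcess` are hypothetical names and do not reflect the actual CreateActivityToken() implementation.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Opaque token: the session counts as active while any copy is alive.
using ActivityToken = std::shared_ptr<void>;

// Hypothetical activity counter. It must outlive every token it hands out.
class ActivityCounter {
public:
    ActivityToken CreateToken() {
        ++m_count;
        // The deleter runs when the last copy of the token is released.
        return ActivityToken(this, [](ActivityCounter* self) { --self->m_count; });
    }

    int Count() const { return m_count.load(); }

private:
    std::atomic<int> m_count{0};
};

// Hypothetical process object: holding it keeps the token, and so the VM, alive.
struct RootProcess {
    int pid;
    ActivityToken keepAlive;
};
```

The count stays at one for as long as a `RootProcess` built with `counter.CreateToken()` exists and returns to zero when it is destroyed, mirroring the "VM stays alive while the client holds the process proxy" behavior described above.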
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…VM lifecycle

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #40767 (terminateEvent) removed ITerminationCallback / WSLCSessionSettings.TerminationCallback from the IDL but left the WSLCVirtualMachineFactory (introduced later by factory PR #40770, which its branch merged from master) still referencing them, so the branch did not compile. Drop the dead member and assignments; the VM now caches the reason for WSLCSession to pull.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Kevin's PR #40767 (event-model termination) moved m_vmExitEvent.SetEvent() in OnExit() to after a new std::lock_guard(m_lock) that caches the termination reason. ~HcsVirtualMachine already holds m_lock across the 5s exit-event wait and HcsCloseComputeSystem (which drains in-flight HCS callbacks). An in-flight OnExit() therefore blocks acquiring m_lock, so it never signals the exit event nor drains, and the close never completes: a hard deadlock that StuckVmTermination reliably reproduces.

Drop the broad lock from the dtor. By the time the compute system is closed no further callbacks can run, so the remaining teardown is safe unguarded. Flag to Kevin to fold into #40767.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In the lazy-VM model, container/volume/network recovery runs at first VM start rather than during CreateSession, so the create-time WarningCallback was out of scope by the time recovery emitted warnings. Park the session's WarningCallback in the GIT and have WSLCExecutionContext fall back to it when an operation has no explicit callback, so lazy-recovery warnings still reach the user. CLI-side, keep the create/enter callback alive for the whole command by storing it in the Session model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConfigureStorage validates the storage path lazily (via AttachDisk at first VM start), so with a lazily-created VM a WSLCSessionStorageFlagsNoCreate session pointing at a missing path no longer failed at CreateSession. Validate the storage VHD existence eagerly in Initialize so misconfiguration is reported up front.

The idle worker acquired m_lock exclusively (blocking) on every wake. Because SRW locks favor a waiting writer, that pending acquire stalled all new shared-lock operations behind it, so a long-running operation holding its shared VmLease (e.g. a blocking SaveImage/Export) serialized every concurrent operation until it completed. Use try_lock_exclusive and treat contention as activity, re-evaluating on the next idle-check signal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
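The "try the exclusive lock and treat contention as activity" pattern from this commit can be illustrated with standard C++ primitives. This is a sketch under assumptions: `std::shared_mutex` stands in for the SRW lock used in the PR, and `IdleStep`, `TryIdleTeardown`, and `RunOnWorkerThread` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>

enum class IdleStep { Busy, TornDown };

// One wake of a hypothetical idle worker. If any operation holds the lock, the
// attempt fails immediately and is reported as Busy, so the worker never sits
// in the queue as a pending writer that stalls new shared lockers.
IdleStep TryIdleTeardown(std::shared_mutex& lock, bool& vmRunning) {
    std::unique_lock<std::shared_mutex> exclusive(lock, std::try_to_lock);
    if (!exclusive.owns_lock()) {
        return IdleStep::Busy;  // contention counts as activity; retry on next signal
    }
    vmRunning = false;
    return IdleStep::TornDown;
}

// Runs the step on a separate thread, since calling try_lock from a thread
// that already owns the mutex is undefined behavior.
IdleStep RunOnWorkerThread(std::shared_mutex& lock, bool& vmRunning) {
    IdleStep result = IdleStep::Busy;
    std::thread worker([&] { result = TryIdleTeardown(lock, vmRunning); });
    worker.join();
    return result;
}
```

While another thread holds a `std::shared_lock` on the mutex, the worker step returns `IdleStep::Busy` and leaves the VM flag untouched; after the shared lock is released, a later attempt can take the exclusive lock and tear down. `try_lock` is permitted to fail spuriously, so a caller should treat `Busy` as "try again later" and not as proof that an operation is in flight.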
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c12d7e1 to fa2eb47
Summary
Idle-terminates a per-user WSLC session's backing VM when it has been inactive, freeing memory while the session object (and its persistent storage) lives on. The VM is transparently recreated on the next operation.
Builds on #40770 (IWSLCVirtualMachineFactory).
Behavior
Testing
- WSLCE2EVmIdleTests E2E suite (5 tests) including WSLCE2E_VmIdle_RootProcessKeepsVmAlive.
- WSLCTests::CreateRootNamespaceProcess still passes.
Notes / follow-ups (deferred)
Note
Draft for early review.