Serialize Linux test runs and capture crash dumps as a dedicated artifact #1728
Merged
…fact
The Linux CI leg keeps failing with 'Catastrophic failure: Test process crashed with exit code 137' (SIGKILL = kernel OOM-killer).
The previous attempt to fix this (#1722) set `XUNIT_MAX_PARALLEL_THREADS=1`, `-p:BuildInParallel=false`, and `RunConfiguration.MaxCpuCount=1`, but those options only control parallelism *within* a single test assembly. They do not stop `dotnet test <slnFile>` from launching vstest hosts for multiple test projects concurrently, which is what causes two multi-GB generator test hosts (`CsWin32Generator.Tests` and `Microsoft.Windows.CsWin32.Tests`) to run at the same time and OOM the agent. The build logs confirm both hosts were live concurrently after that change.
This change:
* Removes the ineffective parallelism env vars from #1722.
* On non-Windows agents, enumerates `test/*.Tests` projects and invokes `dotnet test` once per project, mirroring the working pattern used by the GitHub Actions Linux job in `.github/workflows/build.yml`.
* Captures `free -h` and `/proc/meminfo` before each project, and dumps `dmesg` (where the OOM-killer logs) plus a memory snapshot after any failed run, so future OOMs are diagnosable without guessing.
* Enables the .NET runtime mini-dump fallback (heap dumps) in case a managed abort precedes the kill.
* Sweeps any captured `*.dmp` / `core.*` / `coredump.*` files into a new `crashDumps` artifact registered in build.yml, so they are easy to download without grabbing the 7+ GB testResults bundle.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
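The dump handling in the bullets above could look roughly like the sketch below. It is not the repository's script: the helper name `sweep_crash_dumps` and the directory arguments are hypothetical, and only the file patterns and the two runtime settings come from the commit message (`DOTNET_DbgMiniDumpType` is assumed to be the full name of the `DbgMiniDumpType` setting).

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy crash dumps found under the given search
# directories into one destination folder and print how many it holds.
# Files with the same name overwrite each other in this simple sketch.
sweep_crash_dumps() {
  local dest=$1
  shift
  mkdir -p "$dest"
  local dir
  for dir in "$@"; do
    # Skip search roots that do not exist on this agent.
    [ -d "$dir" ] || continue
    find "$dir" -type f \
      \( -name '*.dmp' -o -name 'core.*' -o -name 'coredump.*' \) \
      ! -path "$dest/*" -exec cp -- {} "$dest"/ \;
  done
  find "$dest" -type f | wc -l | tr -d ' '
}

# Ask the .NET runtime to write a dump when a managed process crashes.
export DOTNET_DbgEnableMiniDump=1
export DOTNET_DbgMiniDumpType=2   # 2 = Heap
```

A pipeline step might then call something like `sweep_crash_dumps "$ARTIFACT_DIR/crashDumps" "$REPO_ROOT" /tmp`, where both variables are placeholders for whatever the pipeline actually defines.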
…ws filter)
The per-csproj enumeration in 20ba5b5 bypassed the solution's `NonWindows` configuration, which filters out Windows-only test projects on Linux/macOS. It also passed each csproj to `dotnet test` directly, which on the .NET 10 SDK caused vstest to reject the test exe path with 'argument is invalid' for the small projects (`CsWin32Generator.BuildTasks.Tests`, `GenerationSandbox.*`).
Switch back to the solution-level invocation and instead serialize via MSBuild `-m:1` (single worker node), which causes `dotnet test` to run the VSTest target one project at a time and prevents the OOM-killer from terminating concurrent heavy test hosts.
Also tighten the dmesg diagnostic to surface OOM-related lines only.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
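As a rough illustration of the serialized solution-level invocation this commit describes, the sketch below adds `-m:1` only on Linux and macOS. The function names and the `DOTNET` override variable are hypothetical and exist only to make the sketch easy to dry-run. The real pipeline script may be structured quite differently.

```shell
#!/usr/bin/env bash
# Print the extra MSBuild switch for a given `uname -s` value.
# Linux and macOS (Darwin) get a single worker node; others get nothing.
serial_switch() {
  case "$1" in
    Linux|Darwin) echo "-m:1" ;;
    *)            echo "" ;;
  esac
}

# Run the tests from the repository root so the solution's own
# configuration still decides which projects take part.
# DOTNET is a hypothetical override (e.g. DOTNET=echo for a dry run).
run_tests() {
  local repo_root=$1
  local switch
  switch=$(serial_switch "$(uname -s)")
  # ${switch:+...} expands to nothing when the switch is empty.
  "${DOTNET:-dotnet}" test "$repo_root" ${switch:+"$switch"}
}
```

Running `DOTNET=echo run_tests .` prints the command line that would be used without starting any test host.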
Sergio0694
approved these changes
Jun 12, 2026
Summary
Resolves the recurring Linux CI failure where tests pass on the PR gate but crash with exit code 137 (SIGKILL by the Linux OOM-killer) on main.
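For readers unfamiliar with the number: shells report a process terminated by a signal as 128 plus the signal number, and SIGKILL is signal 9, hence 137. The short sketch below shows that arithmetic. `describe_exit` is a hypothetical helper written for this illustration, not something in the repository.

```shell
#!/usr/bin/env bash
# Decode an exit status as a shell reports it: values above 128 mean
# "terminated by signal (status - 128)".
describe_exit() {
  local status=$1
  if [ "$status" -gt 128 ]; then
    local sig=$((status - 128))
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with status $status"
  fi
}

# Demonstration: a child shell that sends itself SIGKILL.
status=0
sh -c 'kill -KILL $$' || status=$?
describe_exit "$status"   # prints: killed by signal 9 (SIGKILL)
```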
Root cause
`dotnet test <slnFile>` on Linux uses MSBuild to schedule the `VSTest` target across multiple test projects in parallel. The heavy generator test projects (`CsWin32Generator.Tests`, `Microsoft.Windows.CsWin32.Tests`) each consume several GB of RAM. On the 8 GB ADO Linux agents, running two of these concurrently triggers the kernel OOM-killer.
PR #1722 tried `XUNIT_MAX_PARALLEL_THREADS=1`, `-p:BuildInParallel=false`, and `RunConfiguration.MaxCpuCount=1`. None of those serialize cross-project test execution (they only limit parallelism within a single test assembly), so the OOM continued.
Fix
Add `-m:1` (single MSBuild worker node) on Linux/macOS so the VSTest target runs one project at a time. We keep the solution-level invocation (`dotnet test <repoRoot>`) so that the sln's `NonWindows` configuration continues to filter out Windows-only test projects.
Diagnostics
To avoid future guesswork:
* A pre-test memory state snapshot (`free -h` + `/proc/meminfo`) is printed before tests start.
* After a failed run, a post-failure memory state snapshot is printed, along with OOM-related kernel log lines from `(sudo -n) dmesg --ctime`.
* `DOTNET_DbgEnableMiniDump=1` with `DbgMiniDumpType=2` (Heap) writes managed crash dumps into a dedicated `crashDumps-<OS>` pipeline artifact. (SIGKILL never reaches the runtime, so dumps won't be produced for pure OOM, but they will be for any other crash mode.)
* `crashDumps-<OS>` artifacts are added to the 1ESPT outputs in `azure-pipelines/build.yml`.
Verification
Manually queued ADO build https://dev.azure.com/devdiv/DevDiv/_build/results?buildId=14363141 against `validate/fix-linux-test-oom`.
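For reference, the memory and kernel-log diagnostics listed under Diagnostics might be sketched in shell as below. The function names and the grep pattern are assumptions made for illustration, and the pipeline's actual filter may differ. Only `free -h`, `/proc/meminfo`, and `(sudo -n) dmesg --ctime` come from the description.

```shell
#!/usr/bin/env bash
# Print a labelled memory snapshot. Each source is skipped where it is
# unavailable (macOS, for example, has neither free nor /proc/meminfo).
memory_snapshot() {
  echo "=== memory snapshot: $1 ==="
  if command -v free >/dev/null 2>&1; then free -h; fi
  if [ -r /proc/meminfo ]; then cat /proc/meminfo; fi
}

# Keep only kernel log lines that look OOM-related (reads stdin).
# The pattern is an assumption, not the repository's filter.
filter_oom_lines() {
  grep -iE 'out of memory|oom[-_ ]?kill|killed process' || true
}

# Try non-interactive sudo first, then plain dmesg, and never fail the
# step just because the kernel log is unreadable.
oom_report() {
  { sudo -n dmesg --ctime 2>/dev/null || dmesg --ctime 2>/dev/null || true; } \
    | filter_oom_lines
}

# Intended use around a test run (commented out so sourcing is harmless):
#   memory_snapshot "pre-test"
#   dotnet test "$REPO_ROOT" -m:1 || { memory_snapshot "post-failure"; oom_report; }
```

`REPO_ROOT` in the usage comment is a placeholder. Because a SIGKILLed test host cannot report anything itself, the kernel log is the one place where the OOM-killer's decision is recorded.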