
Serialize Linux test runs and capture crash dumps as a dedicated artifact#1728

Merged
jevansaks merged 2 commits into main from user/jevansa/fix-linux-test-oom
Jun 12, 2026


@jevansaks jevansaks commented Jun 12, 2026

Member

Summary

Resolves the recurring Linux CI failure where tests pass on the PR gate but crash with exit code 137 (SIGKILL by the Linux OOM-killer) on main.
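As a quick illustration of why exit code 137 points at SIGKILL: shells report a process killed by signal N as exit status 128 + N, so 137 - 128 = 9, which is SIGKILL. The helper below is a minimal sketch; `describe_exit_code` is a hypothetical name and is not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): decode a shell exit status.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
describe_exit_code() {
  if [ "$1" -gt 128 ]; then
    sig=$(( $1 - 128 ))
    # `kill -l <number>` prints the name of that signal.
    echo "signal $sig ($(kill -l "$sig"))"
  else
    echo "exit $1"
  fi
}

describe_exit_code 137
```

SIGKILL cannot be caught or handled by the target process, which is why the test host dies without writing any diagnostics of its own.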

Root cause

On Linux, dotnet test <slnFile> uses MSBuild to schedule the VSTest target across multiple test projects in parallel. The heavy generator test projects (CsWin32Generator.Tests, Microsoft.Windows.CsWin32.Tests) each consume several GB of RAM. On the 8 GB ADO Linux agents, running two of them concurrently triggers the kernel OOM-killer.

PR #1722 tried XUNIT_MAX_PARALLEL_THREADS=1, -p:BuildInParallel=false, and RunConfiguration.MaxCpuCount=1. None of those serialize cross-project test execution — they only limit parallelism within a single test assembly — so the OOM continued.

Fix

Add -m:1 (single MSBuild worker node) on Linux/macOS so the VSTest target runs one project at a time. We keep the solution-level invocation (dotnet test <repoRoot>) so that the sln's NonWindows configuration continues to filter out Windows-only test projects.
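The OS-conditional switch described above might look roughly like the sketch below. This is an illustration only: the function names `extra_test_args` and `run_tests` are hypothetical, and the actual pipeline script may be structured differently (for example, in PowerShell).

```shell
#!/bin/sh
# Hypothetical sketch (names are illustrative, not the PR's actual script).

# Print the extra MSBuild argument for `dotnet test`, given an OS name
# as reported by `uname -s`.
extra_test_args() {
  case "$1" in
    Linux|Darwin) echo "-m:1" ;;  # single MSBuild node: one test project at a time
    *)            echo "" ;;      # elsewhere, keep the default parallel scheduling
  esac
}

# Run the solution-level test invocation from the given repo root.
run_tests() {
  # The command substitution is deliberately unquoted so that an empty
  # result adds no argument at all.
  dotnet test "$1" $(extra_test_args "$(uname -s)")
}
```

Keeping the argument selection in its own function makes it easy to check the per-OS behavior without actually launching a test run.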

Diagnostics

To avoid future guesswork:

  • A Pre-test memory state snapshot (free -h + /proc/meminfo) is printed before tests start.
  • On any test-host failure, a Post-failure memory state snapshot is printed.
  • On any test-host failure, the dmesg tail is scanned for OOM-killer entries (dmesg --ctime, optionally prefixed with sudo -n).
  • DOTNET_DbgEnableMiniDump=1 with DOTNET_DbgMiniDumpType=2 (Heap) writes managed crash dumps into a dedicated crashDumps-<OS> pipeline artifact. (SIGKILL never reaches the runtime, so a pure OOM kill produces no dump, but any other crash mode will.)
  • Per-OS crashDumps-<OS> artifacts are added to the 1ESPT outputs in azure-pipelines/build.yml.
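The diagnostics listed above could be sketched as follows. All function names here are hypothetical, the grep pattern is an assumed approximation of typical kernel OOM messages, and truncating /proc/meminfo to its first lines is a choice made only to keep the example short.

```shell
#!/bin/sh
# Hypothetical sketch of the diagnostics (names are illustrative).

# Print a labelled memory snapshot; the label is e.g. "Pre-test" or "Post-failure".
print_memory_state() {
  echo "=== $1 memory state ==="
  free -h 2>/dev/null || true
  if [ -r /proc/meminfo ]; then head -n 5 /proc/meminfo; fi
}

# Keep only OOM-related lines from kernel log text supplied on stdin.
# The pattern is an assumption about common kernel wording.
filter_oom_lines() {
  grep -iE 'out of memory|oom-kill|oom_reaper|killed process' || true
}

# Read the tail of the kernel log, falling back to non-interactive sudo
# when plain dmesg is not permitted.
scan_dmesg_for_oom() {
  { dmesg --ctime 2>/dev/null || sudo -n dmesg --ctime 2>/dev/null; } \
    | tail -n 200 | filter_oom_lines
}

# Ask the .NET runtime to write a dump when a managed process crashes.
export DOTNET_DbgEnableMiniDump=1
export DOTNET_DbgMiniDumpType=2  # 2 = heap dump
```

A caller would print the pre-test snapshot before starting tests, then call `print_memory_state "Post-failure"` and `scan_dmesg_for_oom` only when the test run exits non-zero.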

Verification

Manually queued ADO build https://dev.azure.com/devdiv/DevDiv/_build/results?buildId=14363141 against validate/fix-linux-test-oom.

jevansaks and others added 2 commits June 12, 2026 11:54
…fact

The Linux CI leg keeps failing with 'Catastrophic failure: Test process
crashed with exit code 137' (SIGKILL = kernel OOM-killer). The previous
attempt to fix this (#1722) set XUNIT_MAX_PARALLEL_THREADS=1,
-p:BuildInParallel=false, and RunConfiguration.MaxCpuCount=1, but those
options only control parallelism *within* a single test assembly. They
do not stop 'dotnet test <slnFile>' from launching vstest hosts for
multiple test projects concurrently, which is what causes two
multi-GB generator test hosts (CsWin32Generator.Tests and
Microsoft.Windows.CsWin32.Tests) to run at the same time and OOM the
agent. The build logs confirm both hosts were live concurrently after
that change.

This change:

* Removes the ineffective parallelism env vars from #1722.
* On non-Windows agents, enumerates 'test/*.Tests' projects and invokes
  'dotnet test' once per project, mirroring the working pattern used by
  the GitHub Actions Linux job in '.github/workflows/build.yml'.
* Captures 'free -h' and '/proc/meminfo' before each project, and dumps
  'dmesg' (where the OOM-killer logs) plus a memory snapshot after any
  failed run, so future OOMs are diagnosable without guessing.
* Enables the .NET runtime mini-dump fallback (heap dumps) in case a
  managed abort precedes the kill.
* Sweeps any captured '*.dmp' / 'core.*' / 'coredump.*' files into a
  new 'crashDumps' artifact registered in build.yml, so they are easy
  to download without grabbing the 7+ GB testResults bundle.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ws filter)

The per-csproj enumeration in 20ba5b5 bypassed the solution's NonWindows
configuration, which filters out Windows-only test projects on Linux/macOS.
It also passed each csproj to dotnet test directly, which on .NET 10 SDK
caused vstest to reject the test exe path with 'argument is invalid' for
the small projects (CsWin32Generator.BuildTasks.Tests, GenerationSandbox.*).

Switch back to the solution-level invocation and instead serialize via
MSBuild -m:1 (single worker node), which causes dotnet test to run the
VSTest target one project at a time and prevents the OOM-killer from
terminating concurrent heavy test hosts.

Also tighten the dmesg diagnostic to surface OOM-related lines only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jevansaks jevansaks marked this pull request as ready for review June 12, 2026 20:12
@jevansaks jevansaks enabled auto-merge (squash) June 12, 2026 20:15
@jevansaks jevansaks merged commit b4d701a into main Jun 12, 2026
30 of 31 checks passed
@jevansaks jevansaks deleted the user/jevansa/fix-linux-test-oom branch June 12, 2026 20:19