Mount ReFS Dev Drive on Windows to speed up small-file I/O #5065

Draft
pietern wants to merge 8 commits into main from test-time-windows-devdrive

pietern commented Apr 22, 2026

Summary

Mount a ReFS Dev Drive on Windows in the shared setup-build-environment composite and redirect TEMP/TMP + Go cache dirs onto it. All Windows test jobs inherit the speedup.

Why

windows/terraform is ~32 min wall-clock, dominated by small-file I/O on C: — per-test .terraform/providers/ extraction, go build cache writes, test tmpdir churn. Public benchmarks put GH Windows C: at ~4.3k IOPS vs ~127k on a ReFS Dev Drive.

Design

  • Inlined PowerShell, no third-party action. Uses built-in New-VHD / Mount-VHD / Format-Volume.
  • In the composite, gated on runner.os == 'Windows', so every caller (test, test-exp-aitools, test-exp-ssh, test-pipelines) picks it up.
  • -DevDrive flag inside a try/catch; older hosts fall back to plain ReFS.
  • Checkout left on C:. t.TempDir() is the hot path and lives under $TEMP, so redirecting TEMP/TMP captures the dominant I/O.
  • Env vars written to $GITHUB_ENV so actions/setup-go's cache save/restore lands on the drive.
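The $GITHUB_ENV mechanism from the last bullet can be sketched as follows. The PR itself does this in inline PowerShell; the Python below is only an illustration of the file format (one `NAME=value` line per variable, appended to the file that GitHub Actions exposes via `$GITHUB_ENV`). The helper name `export_env` and the directory names under Z: are hypothetical and do not come from the PR.

```python
import os
import tempfile


def export_env(env_file, variables):
    """Append NAME=value lines to a GitHub Actions environment file.

    Later steps in the same job see these variables. This sketch only
    handles single-line values.
    """
    with open(env_file, "a", encoding="utf-8") as f:
        for name, value in variables.items():
            f.write(f"{name}={value}\n")


# Hypothetical layout on the Dev Drive; the PR only states that these
# variables are redirected onto Z:.
DEV_DRIVE_ENV = {
    "TEMP": r"Z:\tmp",
    "TMP": r"Z:\tmp",
    "GOCACHE": r"Z:\go-build",
    "GOMODCACHE": r"Z:\go-mod",
    "GOTMPDIR": r"Z:\go-tmp",
}

if __name__ == "__main__":
    # Write to a scratch file so the sketch can run outside of CI.
    with tempfile.TemporaryDirectory() as scratch:
        env_file = os.path.join(scratch, "github_env")
        export_env(env_file, DEV_DRIVE_ENV)
        with open(env_file, encoding="utf-8") as f:
            print(f.read(), end="")
```

Because the variables are persisted through the environment file rather than set in the step's own process, every later step in the job, including the cache save/restore in actions/setup-go, resolves the Go cache paths to the drive.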

Expected impact

windows/terraform: ~32 min → ~12–15 min. Getting under 10 min may need a follow-up (e.g. moving build/ via TEST_BUILD_DIR).
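
As a rough sanity check on that range, here is a back-of-the-envelope Amdahl-style estimate. It rests on two assumptions that the PR does not state: that the I/O-bound part of the job speeds up in proportion to the quoted IOPS ratio (optimistic), and that some fixed fraction of the 32 minutes is I/O-bound. The fractions tried below are hypothetical, not measured.

```python
def estimated_wall_clock(baseline_min, io_fraction, io_speedup):
    """Amdahl-style estimate: only the I/O-bound fraction gets faster."""
    return baseline_min * ((1.0 - io_fraction) + io_fraction / io_speedup)


# Ratio of the quoted benchmark figures: ~127k IOPS vs ~4.3k IOPS.
IOPS_RATIO = 127_000 / 4_300

if __name__ == "__main__":
    # Hypothetical I/O-bound fractions of the 32 min baseline.
    for fraction in (0.55, 0.60, 0.65):
        minutes = estimated_wall_clock(32.0, fraction, IOPS_RATIO)
        print(f"I/O-bound fraction {fraction:.2f}: ~{minutes:.1f} min")
```

Under this model, the 12–15 min target corresponds to roughly 55–65% of the baseline being small-file I/O. Getting under 10 min would require an even larger I/O-bound share, which is consistent with needing a follow-up.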

Risks

  • Mount requires admin on the runner; if databricks-protected-runner-group-large lacks it, the first step fails loudly.
  • Z: collision (unlikely) — reassign.

This pull request and its description were written by Isaac.

Acceptance tests on Windows spend most of their wall-clock on small-file
writes: each terraform init copies providers into a per-test `.terraform/`
under `$TEMP`, and the go build/module caches see similar churn. The
default C: drive on GitHub-hosted and Databricks-protected Windows
runners is backed by remote block storage (~4.3k IOPS); a ReFS Dev Drive
is ~127k IOPS on comparable benchmarks.

This step creates a 20GB dynamic VHDX, mounts it as Z:, formats it ReFS
(with the `-DevDrive` flag where the host supports it, falling back to
plain ReFS otherwise), and redirects TEMP/TMP + GOCACHE/GOMODCACHE/
GOTMPDIR onto it. Checkout stays on C: -- moving it would be invasive
(acceptance test output normalization) for little further gain.

Placed at the top of the composite action so it applies to every caller
(test, test-exp-aitools, test-exp-ssh, test-pipelines). No-op on
non-Windows runners via `runner.os == 'Windows'`.

Co-authored-by: Isaac
pietern changed the title from "ci: mount ReFS Dev Drive on Windows to speed up small-file I/O" to "Mount ReFS Dev Drive on Windows to speed up small-file I/O" on Apr 22, 2026
pietern added 7 commits April 22, 2026 16:32
Redirecting TEMP to Z: puts `t.TempDir()` (and therefore each test's
bundle cwd) on Z:, while the checkout and uv's Python package cache stay
on C:. Under `bundle/python/*` tests with older databricks-bundles
versions (e.g. PYDAB_VERSION=0.266.0), the Python mutator calls
`os.path.commonpath([os.getcwd(), path])` which raises
`ValueError: Paths don't have the same drive`. Six tests regressed:
experimental-compatibility{,-both-equal}, resource-loading,
unicode-support, restricted-execution, resolve-variable.
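
The failure can be reproduced on any OS with `ntpath`, the standard-library module that implements Windows path semantics behind `os.path` on Windows. This is a minimal sketch: the concrete paths are hypothetical stand-ins for a test cwd on Z: and a uv-cached module on C:, and the helper `try_commonpath` is made up for illustration.

```python
import ntpath  # Windows path rules, importable on any platform


def try_commonpath(cwd, other):
    """Return the common path, or the error text if the drives differ."""
    try:
        return ntpath.commonpath([cwd, other])
    except ValueError as exc:
        return f"ValueError: {exc}"


# Hypothetical paths: test cwd under the redirected TEMP vs. a package
# cache that stayed on C:.
CWD_ON_Z = r"Z:\tmp\TestBundle\001"
CWD_ON_C = r"C:\a\tmp\TestBundle\001"
CACHE_ON_C = r"C:\Users\runner\AppData\Local\uv\cache\pkg"

if __name__ == "__main__":
    print(try_commonpath(CWD_ON_C, CACHE_ON_C))  # same drive: succeeds
    print(try_commonpath(CWD_ON_Z, CACHE_ON_C))  # different drives: raises
```

The check is purely lexical: `commonpath` compares the drive components of the path strings and never touches the filesystem, so only the drive letter that appears in each string matters.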

Keep only GOCACHE/GOMODCACHE/GOTMPDIR on the Dev Drive -- those benefit
Go compilation I/O without spanning drive boundaries at the Python
level. Per-test `.terraform/` speedup is lost; recover it in a follow-up
by plumbing a test-framework-specific tmpdir that callers can keep on
the same drive as the checkout.

Co-authored-by: Isaac
Rather than redirecting individual env vars (TEMP/TMP/GO*) and running
into cross-drive path issues (Python mutator's
`os.path.commonpath([cwd, path])` fails when cwd and the uv-cached
module live on different drives), we can just relocate the entire
workspace.

Create a directory junction from `$GITHUB_WORKSPACE` (still on C:) to
`Z:\workspace`. From every tool's point of view the path is unchanged --
it starts with `C:\` and `commonpath` works. Physically, all reads and
writes go to the ReFS volume.

Flow:
- Mount VHDX at Z:
- Wipe the workflow's prior checkout at $GITHUB_WORKSPACE
- Create junction to Z:\workspace
- The composite's own `actions/checkout` step re-populates the
  workspace via the junction (so setup-jfrog etc. find their files)

No Go / TEMP env gymnastics needed. bundle/python/* tests stay happy.

Co-authored-by: Isaac
Go-caches-only on Z: left the big Windows test jobs effectively flat
(windows/terraform 32m34s vs 32m33s baseline) because the dominant cost
is per-test `.terraform/` churn under TEMP, not Go compilation. Moving
TEMP onto the Dev Drive was the missing piece.

The first TEMP-on-Z: attempt broke `bundle/python/*` tests (older
databricks-bundles calls `os.path.commonpath([cwd, uv_cache_path])` and
chokes when the two live on different drives). Fix: create a directory
junction at `C:\a\_fast` (sibling to `C:\a\cli\cli`, not inside the
repo) pointing at `Z:\fast`. Path strings stay `C:\...`, so
`commonpath` is happy; I/O physically lands on Z:.

Junction is outside the checkout to avoid `git status` pollution,
`git clean` interactions after `actions/checkout`, and unintended
traversal by repo-walking tools.

Co-authored-by: Isaac
`bundle.TestRootLookup` (unit test) called `filepath.EvalSymlinks` on
its `t.TempDir()` path, which landed under the `C:\a\_fast` directory
junction. Go's stdlib EvalSymlinks on Windows returns
"cannot find the path specified" for `IO_REPARSE_TAG_MOUNT_POINT`
(junctions) rooted at a newly mounted ReFS VHDX, but handles
`IO_REPARSE_TAG_SYMLINK` (directory symlinks) correctly.

Switch from `New-Item -ItemType Junction` to `cmd /c mklink /D`
(directory symbolic link) to dodge the quirk. Creating symlinks without
elevation requires Developer Mode, which is the default on GitHub-hosted
Windows runners.
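
The property this relies on (the path string stays under the link location while the data physically lands on the target) can be sketched with a directory symlink in Python. This is an illustration only: the directory names are hypothetical stand-ins for `C:\a\_fast` and `Z:\fast`, `os.path.realpath` plays the role that `filepath.EvalSymlinks` plays in Go, and the sketch does not reproduce the junction-specific Go behavior described above. On Windows, `os.symlink` needs the same privileges as `mklink`.

```python
import os
import tempfile


def link_and_write(root):
    """Create target dir + directory symlink under root, write via the link.

    Returns (path as seen through the link, resolved physical path, target dir).
    """
    target = os.path.join(root, "fast_volume")  # stand-in for Z:\fast
    link = os.path.join(root, "_fast")          # stand-in for C:\a\_fast
    os.mkdir(target)
    os.symlink(target, link, target_is_directory=True)

    path_via_link = os.path.join(link, "state.txt")
    with open(path_via_link, "w", encoding="utf-8") as f:
        f.write("data")

    return path_via_link, os.path.realpath(path_via_link), target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        via_link, resolved, target = link_and_write(root)
        print("path used by tools:", via_link)
        print("physical location: ", resolved)
        print("file exists in target:",
              os.path.exists(os.path.join(target, "state.txt")))
```

Tools that only compare path strings keep seeing the link-side prefix; only code that explicitly resolves links observes the target location.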

Co-authored-by: Isaac