Mount ReFS Dev Drive on Windows to speed up small-file I/O #5065
Draft
Acceptance tests on Windows spend most of their wall-clock time on small-file writes: each `terraform init` copies providers into a per-test `.terraform/` under `$TEMP`, and the Go build/module caches see similar churn.

The default C: drive on GitHub-hosted and Databricks-protected Windows runners is backed by remote block storage (~4.3k IOPS); a ReFS Dev Drive is ~127k IOPS on comparable benchmarks.

This step creates a 20GB dynamic VHDX, mounts it as Z:, formats it ReFS (with the `-DevDrive` flag where the host supports it, falling back to plain ReFS otherwise), and redirects TEMP/TMP + GOCACHE/GOMODCACHE/GOTMPDIR onto it. Checkout stays on C: -- moving it would be invasive (acceptance test output normalization) for little further gain.

Placed at the top of the composite action so it applies to every caller (test, test-exp-aitools, test-exp-ssh, test-pipelines). No-op on non-Windows runners via `runner.os == 'Windows'`.

Co-authored-by: Isaac
Redirecting TEMP to Z: puts `t.TempDir()` (and therefore each test's
bundle cwd) on Z:, while the checkout and uv's Python package cache stay
on C:. Under `bundle/python/*` tests with older databricks-bundles
versions (e.g. PYDAB_VERSION=0.266.0), the Python mutator calls
`os.path.commonpath([os.getcwd(), path])` which raises
`ValueError: Paths don't have the same drive`. Six tests regressed:
experimental-compatibility{,-both-equal}, resource-loading,
unicode-support, restricted-execution, resolve-variable.
Keep only GOCACHE/GOMODCACHE/GOTMPDIR on the Dev Drive -- those benefit
Go compilation I/O without spanning drive boundaries at the Python
level. Per-test `.terraform/` speedup is lost; recover it in a follow-up
by plumbing a test-framework-specific tmpdir that callers can keep on
the same drive as the checkout.
Co-authored-by: Isaac
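
The cross-drive failure described above can be reproduced off Windows with `ntpath`, the module behind `os.path` on Windows. This is a minimal sketch, not the mutator's actual code: the helper name and both paths are hypothetical, chosen only to put the test cwd on Z: and the package cache on C:.

```python
import ntpath  # Windows path semantics, importable on any OS


def try_commonpath(cwd: str, path: str) -> str:
    """Return the common path, or the ValueError text if the call raises."""
    try:
        return ntpath.commonpath([cwd, path])
    except ValueError as exc:
        return f"ValueError: {exc}"


# Hypothetical paths: test cwd redirected to Z:, package cache still on C:.
print(try_commonpath(r"Z:\Temp\TestBundle001", r"C:\uv-cache\databricks_bundles"))

# Same comparison with both paths on C:, which does not raise.
print(try_commonpath(r"C:\Temp\TestBundle001", r"C:\uv-cache\databricks_bundles"))
```

The first call raises because two paths on different drives have no common ancestor; the second returns a `C:\` prefix.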
Rather than redirecting individual env vars (TEMP/TMP/GO*) and running into cross-drive path issues (Python mutator's `os.path.commonpath([cwd, path])` fails when cwd and the uv-cached module live on different drives), we can just relocate the entire workspace.

Create a directory junction from `$GITHUB_WORKSPACE` (still on C:) to `Z:\workspace`. From every tool's point of view the path is unchanged -- it starts with `C:\` and `commonpath` works. Physically, all reads and writes go to the ReFS volume.

Flow:
- Mount VHDX at Z:
- Wipe the workflow's prior checkout at $GITHUB_WORKSPACE
- Create junction to Z:\workspace
- The composite's own `actions/checkout` step re-populates the workspace via the junction (so setup-jfrog etc. find their files)

No Go / TEMP env gymnastics needed. bundle/python/* tests stay happy.

Co-authored-by: Isaac
This reverts commit 931c8a9.
Go-caches-only on Z: left the big Windows test jobs effectively flat (windows/terraform 32m34s vs 32m33s baseline) because the dominant cost is per-test `.terraform/` churn under TEMP, not Go compilation. Moving TEMP onto the Dev Drive was the missing piece.

The first TEMP-on-Z: attempt broke `bundle/python/*` tests (older databricks-bundles calls `os.path.commonpath([cwd, uv_cache_path])` and chokes when the two live on different drives).

Fix: create a directory junction at `C:\a\_fast` (sibling to `C:\a\cli\cli`, not inside the repo) pointing at `Z:\fast`. Path strings stay `C:\...`, so `commonpath` is happy; I/O physically lands on Z:.

Junction is outside the checkout to avoid `git status` pollution, `git clean` interactions after `actions/checkout`, and unintended traversal by repo-walking tools.

Co-authored-by: Isaac
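
Why the junction helps: `commonpath` compares path strings only and never resolves reparse points, so a path spelled under `C:\a\_fast` counts as being on C: wherever the bytes land. A small sketch using `ntpath` (runnable on any OS); the helper name, the uv cache location, and the test directory names are assumptions for illustration.

```python
import ntpath  # Windows path rules; importable on any OS


def common_or_none(*paths: str):
    """Lexical common ancestor of the paths, or None when drives differ."""
    try:
        return ntpath.commonpath(paths)
    except ValueError:
        return None


uv_cache = r"C:\Users\runner\AppData\Local\uv\cache"  # hypothetical location
via_junction = r"C:\a\_fast\TestFoo\001"  # path string that tools see
physical = r"Z:\fast\TestFoo\001"         # where the I/O physically lands

print(common_or_none(via_junction, uv_cache))  # both strings are on C:
print(common_or_none(physical, uv_cache))      # drives differ, so None
```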
`bundle.TestRootLookup` (unit test) called `filepath.EvalSymlinks` on its `t.TempDir()` path, which landed under the `C:\a\_fast` directory junction. Go's stdlib EvalSymlinks on Windows returns "cannot find the path specified" for `IO_REPARSE_TAG_MOUNT_POINT` (junctions) rooted at a newly mounted ReFS VHDX, but handles `IO_REPARSE_TAG_SYMLINK` (directory symlinks) correctly.

Switch from `New-Item -ItemType Junction` to `cmd /c mklink /D` (directory symbolic link) to dodge the quirk. Symlinks require Developer Mode, which is the default on GitHub-hosted Windows runners.

Co-authored-by: Isaac
This reverts commit fce8d2d.
This reverts commit 9908fe7.
Summary
Mount a ReFS Dev Drive on Windows in the shared `setup-build-environment` composite and redirect `TEMP`/`TMP` + Go cache dirs onto it. All Windows test jobs inherit the speedup.

Why
windows/terraform is ~32 min wall-clock, dominated by small-file I/O on `C:` -- per-test `.terraform/providers/` extraction, go build cache writes, test tmpdir churn. Public benchmarks put GH Windows `C:` at ~4.3k IOPS vs ~127k on a ReFS Dev Drive.

Design
- The volume is created, mounted, and formatted with `New-VHD` / `Mount-VHD` / `Format-Volume`.
- The step lives in the shared composite, gated on `runner.os == 'Windows'`, so every caller (`test`, `test-exp-aitools`, `test-exp-ssh`, `test-pipelines`) picks it up.
- Formatting tries the `-DevDrive` flag inside a `try`/`catch`; older hosts fall back to plain ReFS.
- Checkout stays on `C:` -- `t.TempDir()` is the hot path and lives under `$TEMP`, so redirecting `TEMP`/`TMP` captures the dominant I/O.
- Go cache env vars are written to `$GITHUB_ENV` so `actions/setup-go`'s cache save/restore lands on the drive.

Expected impact
windows/terraform: ~32 min → ~12–15 min. Getting under 10 min may need a follow-up (e.g. moving `build/` via `TEST_BUILD_DIR`).

Risks
- If `databricks-protected-runner-group-large` lacks the required support, the first step fails loudly.
- `Z:` collision (unlikely): reassign.

This pull request and its description were written by Isaac.