[SPARK-57278][INFRA] Install zstd in CI container images to fix GitHub Actions cache #56324
Draft
zhengruifeng wants to merge 4 commits into
The pyspark, sparkr, lint, and docs CI jobs run inside Docker containers where the runner's HOME is /github/home (bind-mounted from /home/runner/work/_temp/_github_home), so ~/.cache/coursier inside the container resolves to a different physical path than the /home/runner/.cache/coursier that the host-runner precompile job writes to. This mismatch caused every container job's Coursier cache step to report "Path Validation Error: ... no cache is being saved" and to neither find nor populate the shared cache.
Fix by adding a volume mount that binds the host's Coursier cache directory into the container's $HOME:
/home/runner/.cache/coursier → /github/home/.cache/coursier
With this mount, the restore step extracts the precompile-written Linux-coursier-<hash> cache directly into the path SBT reads from inside the container, and the Path Validation Error is gone.
All four jobs remain restore-only (actions/cache/restore). pyspark and sparkr depend on precompile, so they always hit the precompile-written cache. lint and docs run concurrently with precompile, so keeping them restore-only avoids a race where a partial closure could be saved before precompile finishes writing the full superset.
Generated-by: Claude Code (claude-sonnet-4-6)
Install zstd in all CI container images so the @actions/cache toolkit
uses the same compression algorithm (zstd) as host-runner jobs.
Root cause: @actions/cache computes a cache "version" as
SHA256(path + compression_method). Host-runner jobs (including
precompile) have zstd available and save caches with zstd. Container
images (pyspark, sparkr, lint, docs) lacked zstd, so the toolkit fell
back to gzip, producing a different version hash. The cache lookup URL
therefore differed and every restore reported "Cache not found" even
though the key string matched an existing entry - confirmed by the fork's
cache API showing Linux-coursier-<hash> present but all container jobs
missing it despite looking up 2+ minutes after it was saved.
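The mismatch described above can be illustrated with a minimal sketch. It assumes the toolkit selects zstd whenever a zstd binary is found and otherwise falls back to gzip, and that the version is a SHA-256 over the cache paths combined with the compression method. The function names, the `|` separator, and the example path are illustrative assumptions, not the toolkit's actual code.

```python
import hashlib
import shutil


def pick_compression_method(which=shutil.which):
    """Return "zstd" when a zstd binary is found, else "gzip".

    Assumption: mirrors the fallback described in this commit message;
    the real toolkit's detection logic may involve more checks.
    """
    return "zstd" if which("zstd") else "gzip"


def cache_version(paths, compression_method):
    """Hash the cache paths together with the compression method.

    Assumption: the "|" separator is illustrative only.
    """
    components = list(paths) + [compression_method]
    return hashlib.sha256("|".join(components).encode("utf-8")).hexdigest()


# Hypothetical lookups standing in for real PATH searches.
host_method = pick_compression_method(lambda name: "/usr/bin/zstd")
container_method = pick_compression_method(lambda name: None)

paths = ["~/.cache/coursier"]  # example path, same on both sides
host_version = cache_version(paths, host_method)
container_version = cache_version(paths, container_method)

# Same key and same paths, but the versions differ because the
# compression method is part of the hashed input.
print(host_method, container_method)
print(host_version == container_version)
```

Under these assumptions, the only input that changes between the host and the container is the compression method, which is enough to make the lookup miss.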
Add `zstd` to the apt-get install block of every CI Dockerfile:
- dev/infra/Dockerfile (branch-3.5 and base)
- dev/spark-test-image/python-{311,312,312-classic-only,312-pandas-3,313,314,314-nogil,minimum}/Dockerfile (pyspark variants)
- dev/spark-test-image/docs/Dockerfile
- dev/spark-test-image/lint/Dockerfile
- dev/spark-test-image/sparkr/Dockerfile
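A quick way to sanity-check that a Dockerfile from the list above installs zstd is sketched below. The helper name `installs_zstd` and the sample Dockerfile text are hypothetical, and the parsing is deliberately naive: it only looks at `apt-get install` commands after joining backslash-continued lines.

```python
import re


def installs_zstd(dockerfile_text: str) -> bool:
    """Return True if any apt-get install command lists the zstd package."""
    # Join backslash-continued lines so a multi-line install block
    # becomes one logical line.
    joined = re.sub(r"\\\s*\n", " ", dockerfile_text)
    for line in joined.splitlines():
        if "apt-get install" in line:
            # Everything after "apt-get install" is split into tokens;
            # flags and shell operators are harmless extra tokens here.
            tokens = line.split("apt-get install", 1)[1].split()
            if "zstd" in tokens:
                return True
    return False


# Hypothetical Dockerfile fragments, not copied from the Spark repository.
with_zstd = """RUN apt-get update && apt-get install -y \\
    build-essential \\
    zstd \\
    && rm -rf /var/lib/apt/lists/*
"""
without_zstd = """RUN apt-get update && apt-get install -y \\
    build-essential \\
    && rm -rf /var/lib/apt/lists/*
"""

print(installs_zstd(with_zstd), installs_zstd(without_zstd))
```

In practice the same function could be applied to the text of each Dockerfile listed above; reading the files from disk is left out to keep the sketch self-contained.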
Remove the volume mounts added in the previous attempt
(/home/runner/.cache/coursier:/github/home/.cache/coursier), which were
the wrong fix: the cache action already handles the path correctly by
extracting to the configured path inside the container. The real issue
was the version mismatch preventing lookup.
Generated-by: Claude Code (claude-sonnet-4-6)
…mages Generated-by: Claude Code (claude-sonnet-4-6)
The explanation is in the PR description instead. Generated-by: Claude Code (claude-sonnet-4-6)
Why
This is also why Ref: |
What changes were proposed in this pull request?
Install `zstd` in all CI container image Dockerfiles (`dev/infra/Dockerfile` and the `python-*`, `docs`, `lint`, `sparkr` images under `dev/spark-test-image/`).
Why are the changes needed?
`actions/cache` has never successfully restored a cache in any container-based CI job, as confirmed by `apache/spark`'s cache history, which has no `pyspark-coursier-*` / `sparkr-coursier-*` / `docs-coursier-*` entry. This is a long-standing issue, present since container jobs were introduced.
`actions/cache` computes a cache version = `SHA256(paths + compression_method)` and includes it in the lookup URL. Host runners have `zstd` and use it; container images lack `zstd` and fall back to `gzip`. The version then differs, so caches saved by host jobs (e.g. the Coursier cache written by `precompile`) are invisible to container jobs even when the key matches. Installing `zstd` aligns the compression method.
(The `build-` cache happened to work because it is written by both host and container jobs, so a gzip-version entry also existed; host-only caches had no gzip entry.)
Does this PR introduce any user-facing change?
No. CI-only.
How was this patch tested?
Before (run 26956300346):
`precompile` saved `Linux-coursier-<hash>`, but all container jobs reported `Cache not found` for the same key minutes later.
After (run 26996424034):
`pyspark`, `sparkr`, `lint`, `docs` all `Cache restored from key: Linux-coursier-<hash>`.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-sonnet-4-6)