
[SPARK-57278][INFRA] Install zstd in CI container images to fix GitHub Actions cache#56324

Draft
zhengruifeng wants to merge 4 commits into apache:master from zhengruifeng:fix-container-coursier-cache-ci-cache-opt-dev6

Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Jun 4, 2026

What changes were proposed in this pull request?

Install zstd in all CI container image Dockerfiles (dev/infra/Dockerfile and the python-*, docs, lint, sparkr images under dev/spark-test-image/).

Why are the changes needed?

actions/cache has never successfully restored a cache in any container-based CI job — confirmed by apache/spark's cache history, which has no pyspark-coursier-* / sparkr-coursier-* / docs-coursier-* entry. This is a long-standing issue, present since container jobs were introduced.

actions/cache computes a cache version = SHA256(paths + compression_method) and includes it in the lookup URL. Host runners have zstd and use it; container images lack zstd and fall back to gzip. The version then differs, so caches saved by host jobs (e.g. the Coursier cache written by precompile) are invisible to container jobs even when the key matches. Installing zstd aligns the compression method.
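The version mismatch described above can be illustrated with a minimal sketch. It assumes the version is a SHA-256 digest over the cached paths plus the compression method, as stated in this description. The `"|"` separator, the example path, and the function name `cache_version` are assumptions made for illustration. The real toolkit may mix in further components, so these digests are not expected to match real cache versions.

```python
import hashlib


def cache_version(paths, compression_method):
    """Hypothetical helper: hash the cached paths together with the compression method."""
    # Assumption: components are joined with "|" before hashing.
    components = list(paths) + [compression_method]
    return hashlib.sha256("|".join(components).encode("utf-8")).hexdigest()


# Illustrative path only; the workflow's actual `path:` value may differ.
paths = ["~/.cache/coursier"]

host_version = cache_version(paths, "zstd")       # host runner: zstd available
container_version = cache_version(paths, "gzip")  # container: falls back to gzip

# Same paths, different compression method, so the two versions differ.
print(host_version == container_version)
```

Because the version is part of the lookup, an identical key is not enough for a hit: both sides must also agree on the compression method.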

(The build- cache happened to work because it is written by both host and container jobs, so a gzip-version entry also existed; host-only caches had no gzip entry.)

Does this PR introduce any user-facing change?

No. CI-only.

How was this patch tested?

Before (run 26956300346): precompile saved Linux-coursier-<hash>, but all container jobs reported Cache not found for the same key minutes later.

After (run 26996424034): the pyspark, sparkr, lint, and docs jobs all reported Cache restored from key: Linux-coursier-<hash>.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-6)

The pyspark, sparkr, lint, and docs CI jobs run inside Docker containers
where the runner's HOME is /github/home (bind-mounted from
/home/runner/work/_temp/_github_home), so ~/.cache/coursier inside the
container resolves to a different physical path than the
/home/runner/.cache/coursier that the host-runner precompile job writes
to. This mismatch caused every container job's Coursier cache step to
report "Path Validation Error: ... no cache is being saved" and to
neither find nor populate the shared cache.

Fix by adding a volume mount that binds the host's Coursier cache
directory into the container's $HOME:

  /home/runner/.cache/coursier  →  /github/home/.cache/coursier

With this mount, the restore step extracts the precompile-written
Linux-coursier-<hash> cache directly into the path SBT reads from
inside the container, and the Path Validation Error is gone.

All four jobs remain restore-only (actions/cache/restore). pyspark and
sparkr depend on precompile so they always hit the precompile-written
cache. lint and docs run concurrently with precompile so keeping them
restore-only avoids a race where a partial closure could be saved before
precompile finishes writing the full superset.

Generated-by: Claude Code (claude-sonnet-4-6)
Install zstd in all CI container images so the @actions/cache toolkit
uses the same compression algorithm (zstd) as host-runner jobs.

Root cause: @actions/cache computes a cache "version" as
SHA256(paths + compression_method). Host-runner jobs (including
precompile) have zstd available and save caches with zstd. Container
images (pyspark, sparkr, lint, docs) lacked zstd, so the toolkit fell
back to gzip, producing a different version hash. The cache lookup URL
therefore differed and every restore reported "Cache not found" even
though the key string matched an existing entry - confirmed by the fork's
cache API showing Linux-coursier-<hash> present but all container jobs
missing it despite looking up 2+ minutes after it was saved.
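The fallback described above can be sketched as a small selection function. This is an illustration only, not the toolkit's actual detection logic: the function name `pick_compression_method` and the use of a PATH lookup are assumptions, and the toolkit may probe for zstd differently.

```python
import shutil


def pick_compression_method(which=shutil.which):
    """Hypothetical helper: choose zstd when the binary is on PATH, else gzip."""
    # Assumption: availability is decided by a simple PATH lookup.
    return "zstd" if which("zstd") else "gzip"


# Simulate an image without zstd and one with it by injecting the lookup.
print(pick_compression_method(lambda name: None))             # image lacking zstd
print(pick_compression_method(lambda name: "/usr/bin/zstd"))  # image with zstd installed
```

Calling `pick_compression_method()` with no argument inside an image would report which method that image would select under this simplified model.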

Add `zstd` to the apt-get install block of every CI Dockerfile:
  - dev/infra/Dockerfile (branch-3.5 and base)
  - dev/spark-test-image/python-{311,312,312-classic-only,312-pandas-3,313,314,314-nogil,minimum}/Dockerfile (pyspark variants)
  - dev/spark-test-image/docs/Dockerfile
  - dev/spark-test-image/lint/Dockerfile
  - dev/spark-test-image/sparkr/Dockerfile

Remove the volume mounts added in the previous attempt
(/home/runner/.cache/coursier:/github/home/.cache/coursier), which were
the wrong fix. The cache action already extracts to the configured path
inside the container, so the path was never the problem; the real issue
was the version mismatch preventing lookup.

Generated-by: Claude Code (claude-sonnet-4-6)
@zhengruifeng zhengruifeng changed the title [INFRA] Fix Coursier cache for container CI jobs via volume mount [INFRA] Install zstd in CI container images to fix Coursier cache Jun 5, 2026
…mages

Generated-by: Claude Code (claude-sonnet-4-6)
The explanation is in the PR description instead.

Generated-by: Claude Code (claude-sonnet-4-6)
@zhengruifeng
Contributor Author

zhengruifeng commented Jun 5, 2026

Why zstd is required

actions/cache keys entries by (key, version), where version = SHA256(paths + compression_method). Host runners use zstd; container images without zstd fall back to gzip, so the version differs and a host-saved entry is a miss in the container even when the key matches.

This is also why build- worked but the Coursier cache didn't: build- is saved by both host and container jobs, so both a zstd and a gzip entry exist under the same key. The Coursier cache is saved only by the host precompile job, so no gzip entry exists for containers to find.
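The difference between the two caches can be modelled with a toy in-memory store keyed by (key, version). This is a sketch under stated assumptions, not the cache service's behaviour: the `save`/`restore` helpers, the `"|"` separator, and the example paths are hypothetical, and the key strings are placeholders in the style used in this PR.

```python
import hashlib


def cache_version(paths, compression_method):
    """Hypothetical helper: hash the cached paths together with the compression method."""
    components = list(paths) + [compression_method]
    return hashlib.sha256("|".join(components).encode("utf-8")).hexdigest()


store = set()  # toy stand-in for the cache service: a set of (key, version) pairs


def save(key, paths, method):
    store.add((key, cache_version(paths, method)))


def restore(key, paths, method):
    # A hit requires both the key and the version to match.
    return (key, cache_version(paths, method)) in store


build_paths = ["build-output"]            # illustrative path
coursier_paths = ["~/.cache/coursier"]    # illustrative path

# build- is saved by both host (zstd) and container (gzip) jobs.
save("build-<hash>", build_paths, "zstd")
save("build-<hash>", build_paths, "gzip")
# The Coursier cache is saved only by the host precompile job (zstd).
save("Linux-coursier-<hash>", coursier_paths, "zstd")

build_hit_before = restore("build-<hash>", build_paths, "gzip")
coursier_hit_before = restore("Linux-coursier-<hash>", coursier_paths, "gzip")
coursier_hit_after = restore("Linux-coursier-<hash>", coursier_paths, "zstd")

print(build_hit_before, coursier_hit_before, coursier_hit_after)
```

In this model a container without zstd finds the build- entry but misses the Coursier entry, and the miss disappears once the container also selects zstd.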

Ref: cacheUtils.ts

@zhengruifeng zhengruifeng changed the title [INFRA] Install zstd in CI container images to fix Coursier cache [INFRA] Install zstd in CI container images to fix GitHub Actions cache Jun 5, 2026
@zhengruifeng zhengruifeng changed the title [INFRA] Install zstd in CI container images to fix GitHub Actions cache [SPARK-57278][INFRA] Install zstd in CI container images to fix GitHub Actions cache Jun 5, 2026