CI: Retry Trivy scanner image pull to absorb transient Docker Hub timeouts #16660
…eouts Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gentle ping on this CI fix. The Trivy image-pull flake it addresses keeps hitting fresh PRs: it took down #16669 for the first time earlier today (job [...]). The change is intentionally minimal and self-contained: +19/-0 in [...]. @kevinjqliu you set up and own the CVE scan (#16291, #16287) - would you be able to take a quick look when you get a chance? @stevenzwu pulling you in as a backup in case Kevin is tied up.
Another hit, this time on #16771 (an unrelated CI-infrastructure PR of mine): job [...]. Same signature as #16657 / #16652 / #16669. Note the log shows [...]. @kevinjqliu since you own the CVE scan (#16291, #16287), could you take a quick look when you get a chance? @stevenzwu as backup. It's +19/-0 in [...].
kevinjqliu
left a comment
LGTM
thank you for addressing this. one small suggestion: i think we can utilize ghcr instead of docker hub, ghcr should work better in github environments
# Trivy scanner image, pinned by digest (matches lhotari/sandboxed-trivy-action's
# default at the pinned ref). Pre-pulled with retry below to absorb transient Docker
# Hub (registry-1.docker.io) timeouts that otherwise fail the job with exit code 125.
TRIVY_IMAGE: aquasec/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c
Suggested change:
- # Trivy scanner image, pinned by digest (matches lhotari/sandboxed-trivy-action's
- # default at the pinned ref). Pre-pulled with retry below to absorb transient Docker
- # Hub (registry-1.docker.io) timeouts that otherwise fail the job with exit code 125.
- TRIVY_IMAGE: aquasec/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c
+ # Trivy scanner image. Use Aqua's official GHCR image instead of Docker Hub
+ # to avoid transient registry-1.docker.io pull timeouts on GitHub-hosted runners.
+ TRIVY_IMAGE: ghcr.io/aquasecurity/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c
it looks like trivy also publishes to ghcr, with the same digest. this can also help with the timeout issue
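A minimal sketch of how the "same digest" claim can be sanity-checked on the two pinned references quoted above. It only compares the digest suffix of each reference string and does not contact either registry; the helper names `image_digest` and `same_digest` are hypothetical and not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare the digest pins of two image references.
# This is a string comparison only; it does not query any registry.

# Print the digest part (everything after the last '@') of a pinned reference.
# Returns non-zero when the reference carries no digest.
image_digest() {
  local ref="$1"
  case "${ref}" in
    *@*) printf '%s\n' "${ref##*@}" ;;
    *)   return 1 ;;
  esac
}

# Succeed only when both references are pinned to the same digest.
same_digest() {
  local a b
  a="$(image_digest "$1")" || return 1
  b="$(image_digest "$2")" || return 1
  [ "${a}" = "${b}" ]
}

# The two references quoted in the review thread.
HUB_IMAGE='aquasec/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c'
GHCR_IMAGE='ghcr.io/aquasecurity/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c'

if same_digest "${HUB_IMAGE}" "${GHCR_IMAGE}"; then
  echo "digest pin unchanged"
else
  echo "digest pin differs" >&2
fi
```

Because the digest identifies the image content, a matching pin means the registry switch changes where the image is fetched from, not what is scanned.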
Thanks for the review. I'd considered ghcr as well while working on this fix, but was hesitant to propose switching the registry myself. Glad you raised it, and I fully support the switch.
# Pre-pull the scanner image so the action's docker run finds it locally and never hits
# the registry. Retrying with backoff absorbs transient Docker Hub timeouts (exit 125).
run: |
  for attempt in 1 2 3 4 5; do
nit: the loop sleeps after the 5th failed pull even though there is no 6th retry, and the log says "retrying" on the final failed attempt.
Could we avoid the final sleep / misleading message? For example:
for attempt in 1 2 3 4 5; do
  if docker pull "${TRIVY_IMAGE}"; then
    exit 0
  fi
  if [ "${attempt}" = "5" ]; then
    break
  fi
  echo "docker pull failed (attempt ${attempt}/5); retrying in $((attempt * 10))s..." >&2
  sleep "$((attempt * 10))"
done
echo "Failed to pull ${TRIVY_IMAGE} after 5 attempts" >&2
exit 1
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thank you @wombatu-kun hope this will make the CI more stable!
we run CVE scans more than necessary right now, which exacerbates the docker hub networking issue. #16513 should help fix it
Problem
The CVE Scan workflow intermittently fails while pulling the Trivy scanner image. Recent examples are #16657 (job flink-runtime-1.20) and #16652, #16669 (job open-api-test-fixtures-runtime), all of which failed the same way within hours of each other.
lhotari/sandboxed-trivy-action runs Trivy inside a Docker container. The scanner image is not cached on the runner, so Docker pulls it from Docker Hub, and that pull occasionally times out (context deadline exceeded, exit code 125), failing the job and blocking unrelated PRs. It hits different matrix entries on different PRs, which marks it as transient infrastructure flakiness rather than a code issue.
This is a transient Docker Hub availability blip, not a rate limit: the error is a network timeout rather than an HTTP 429, and GitHub-hosted runners are exempt from Docker Hub's anonymous pull limits for public images.
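The timeout-versus-rate-limit distinction can be sketched as a small classifier over the error text of a failed pull. This is only an illustration: the function name `classify_pull_error` is hypothetical, the patterns are an assumption about typical Docker error wording rather than an exhaustive list, and the sample messages in the comments are not copied from the failing jobs' logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the error text of a failed `docker pull`.
# The patterns are assumptions about common Docker error wording.
classify_pull_error() {
  local msg="$1"
  case "${msg}" in
    # Rate limiting is typically reported as HTTP 429 / "toomanyrequests".
    *toomanyrequests*|*"429 Too Many Requests"*)
      echo "rate-limit" ;;
    # Network timeouts, such as the "context deadline exceeded" seen in this PR.
    *"context deadline exceeded"*|*"i/o timeout"*|*"Client.Timeout exceeded"*)
      echo "timeout" ;;
    *)
      echo "other" ;;
  esac
}

# Illustrative messages only (not taken from the job logs):
classify_pull_error 'Get "https://registry-1.docker.io/v2/": context deadline exceeded'
classify_pull_error 'toomanyrequests: You have reached your pull rate limit.'
```

A retry with backoff is a reasonable response to the "timeout" class; a rate limit would more likely call for authentication or a different registry.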
Change
Pull the scanner image from GHCR (ghcr.io/aquasecurity/trivy) instead of Docker Hub, and pre-pull it before the scan with a bounded retry and backoff.
Aqua publishes Trivy to GHCR under the same manifest digest as on Docker Hub (sha256:bcc376...), so the scanned image content is unchanged and the digest pin is preserved, but ghcr.io is not subject to the registry-1.docker.io timeouts that triggered the failures above. The bounded retry is kept as defense-in-depth in case GHCR has its own transient blip.
The action's docker run uses Docker's default --pull=missing, so once the image is present locally it is reused and the registry is not contacted again. The image is defined once as a job-level TRIVY_IMAGE env var and passed to the action via its trivy-image input, so the pre-pulled image and the scanned image are guaranteed identical. The retry is bounded to 5 attempts with linear backoff, and skips the sleep after the final attempt so it fails cleanly if the registry is genuinely down.
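The bounded retry described above can be sketched as a reusable function. This is a generalization for illustration, not the exact step in the workflow: the name `retry_with_backoff` and its parameters are hypothetical, and the attempt count and base delay are passed in rather than hard-coded to 5 and 10.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with a bounded number of attempts and
# linear backoff (delay = attempt * base delay), skipping the sleep after
# the final failed attempt.
#
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt=1 delay
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if "$@"; then
      return 0
    fi
    # Only announce a retry and sleep when another attempt will follow.
    if [ "${attempt}" -lt "${max_attempts}" ]; then
      delay=$((attempt * base_delay))
      echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s..." >&2
      sleep "${delay}"
    fi
    attempt=$((attempt + 1))
  done
  echo "giving up after ${max_attempts} attempts: $*" >&2
  return 1
}
```

Under these assumptions, the pre-pull step would correspond to `retry_with_backoff 5 10 docker pull "${TRIVY_IMAGE}"`, which waits 10, 20, 30 and 40 seconds between attempts and returns non-zero right after the fifth failure.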