CI: Retry Trivy scanner image pull to absorb transient Docker Hub timeouts #16660

Merged
kevinjqliu merged 2 commits into apache:main from wombatu-kun:ci-trivy-pull-retry
Jun 12, 2026

@wombatu-kun wombatu-kun commented Jun 2, 2026

Contributor

Problem

The CVE Scan workflow intermittently fails while pulling the Trivy scanner image. Recent examples are #16657 (job flink-runtime-1.20) and #16652 and #16669 (job open-api-test-fixtures-runtime), all of which failed the same way within hours of each other:

Running Trivy in sandboxed container (aquasec/trivy:0.69.3@sha256:bcc376...)...
Unable to find image 'aquasec/trivy:...' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 125

lhotari/sandboxed-trivy-action runs Trivy inside a Docker container. The scanner image is not cached on the runner, so Docker pulls it from Docker Hub, and that pull occasionally times out (context deadline exceeded, exit code 125), failing the job and blocking unrelated PRs. It hits different matrix entries on different PRs, which marks it as transient infrastructure flakiness rather than a code issue.

This is a transient Docker Hub availability blip, not a rate limit: the error is a network timeout rather than an HTTP 429, and GitHub-hosted runners are exempt from Docker Hub's anonymous pull limits for public images.
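
To make the timeout-versus-rate-limit distinction concrete, here is a minimal shell sketch that classifies a Docker pull error line. The helper name `classify_pull_error` is hypothetical and not part of the workflow. The timeout substrings are taken from the failure logs in this thread, and `toomanyrequests` is the marker Docker Hub uses in its rate-limit responses; treat the pattern list as an assumption rather than an exhaustive one.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cve-scan.yml): classify a docker pull error message.
classify_pull_error() {
  local msg="$1"
  case "${msg}" in
    # Docker Hub rate limiting surfaces as HTTP 429 / "toomanyrequests".
    *"toomanyrequests"*|*"429 Too Many Requests"*)
      echo "rate-limit" ;;
    # Network-level timeouts, as seen in the failing CVE Scan jobs.
    *"context deadline exceeded"*|*"Client.Timeout exceeded"*)
      echo "timeout" ;;
    *)
      echo "other" ;;
  esac
}

# Prints: timeout
classify_pull_error 'docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded'
```

Both error lines reported in this PR fall into the `timeout` branch, which is consistent with a registry availability problem rather than a pull limit.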

Change

Pull the scanner image from GHCR (ghcr.io/aquasecurity/trivy) instead of Docker Hub, and pre-pull it before the scan with a bounded retry and backoff.

Aqua publishes Trivy to GHCR at the same manifest digest as on Docker Hub (sha256:bcc376...), so the scanner image content is unchanged and the digest pin is preserved, but ghcr.io is not subject to the registry-1.docker.io timeouts that triggered the failures above. The bounded retry is kept as defense-in-depth in case GHCR has its own transient blip.
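
As a quick sanity check on the pin, the sketch below compares the digest portion of the two image references that appear in the review thread. The helper `image_digest` and the two variable names are illustrative assumptions. It only compares the pinned strings and does not contact either registry; confirming what the registries actually serve would need a network call such as `docker buildx imagetools inspect <ref>`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the digest part (after '@') of a pinned image reference.
image_digest() {
  local ref="$1"
  case "${ref}" in
    *@*) echo "${ref##*@}" ;;
    *)   return 1 ;;  # reference is not digest-pinned
  esac
}

# References as quoted in the review thread (Docker Hub before, GHCR after).
DOCKER_HUB_REF="aquasec/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c"
GHCR_REF="ghcr.io/aquasecurity/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c"

if [ "$(image_digest "${DOCKER_HUB_REF}")" = "$(image_digest "${GHCR_REF}")" ]; then
  echo "digest pin unchanged"
else
  echo "digest pin differs" >&2
  exit 1
fi
```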

The action's docker run uses Docker's default --pull=missing, so once the image is present locally it is reused and the registry is not contacted again. The image is defined once as a job-level TRIVY_IMAGE env var and passed to the action via its trivy-image input, so the pre-pulled image and the image the action runs are guaranteed to be identical. The retry is bounded to 5 attempts with linear backoff and skips the sleep after the final attempt, so it fails promptly if the registry is genuinely down.
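
The retry behavior described above can be sketched as a small reusable shell function. This is a generalized illustration, not the exact step in cve-scan.yml: the function name `retry_linear` and the `MAX_ATTEMPTS` and `BASE_DELAY_SECONDS` variables are assumptions introduced here, with defaults chosen to match the 5 attempts and 10-second step used in this PR.

```shell
#!/usr/bin/env bash
# Sketch of a bounded retry with linear backoff (names and defaults are illustrative).
# Usage: retry_linear <command> [args...]
# MAX_ATTEMPTS and BASE_DELAY_SECONDS may be overridden before calling.
retry_linear() {
  local max="${MAX_ATTEMPTS:-5}"
  local base="${BASE_DELAY_SECONDS:-10}"
  local attempt=1
  while [ "${attempt}" -le "${max}" ]; do
    if "$@"; then
      return 0
    fi
    # No sleep after the final attempt: fall through and fail right away.
    if [ "${attempt}" -eq "${max}" ]; then
      break
    fi
    echo "attempt ${attempt}/${max} failed; retrying in $((attempt * base))s..." >&2
    sleep "$((attempt * base))"
    attempt=$((attempt + 1))
  done
  echo "command failed after ${max} attempts: $*" >&2
  return 1
}
```

In the workflow this would be invoked as `retry_linear docker pull "${TRIVY_IMAGE}"` before the scan step. Because the scan step receives the same `TRIVY_IMAGE` value, a successful pre-pull means the later docker run finds the image locally.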

…eouts

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the INFRA label Jun 2, 2026
@wombatu-kun

Contributor Author

Gentle ping on this CI fix. The Trivy image-pull flake it addresses keeps hitting fresh PRs: it took down #16669 for the first time earlier today (job open-api-test-fixtures-runtime, Docker Hub pull timing out), on top of the earlier #16657 / #16652 cases. Since the CVE scan is a blocking check on PRs, each hit red-marks an otherwise-green, unrelated PR and forces a committer to manually re-run the job.

The change is intentionally minimal and self-contained: +19/-0 in cve-scan.yml, a bounded pre-pull retry (5 attempts, linear backoff) that reuses the digest-pinned image, so it stays polite to the registry and touches nothing else.

@kevinjqliu you set up and own the CVE scan (#16291, #16287) - would you be able to take a quick look when you get a chance? @stevenzwu pulling you in as a backup in case Kevin is tied up.

@wombatu-kun

Contributor Author

Another hit, this time on #16771 (an unrelated CI-infrastructure PR of mine): job flink-runtime-2.1 in the CVE Scan workflow failed pulling the Trivy image - https://github.com/apache/iceberg/actions/runs/27341442261/job/80779013238

Running Trivy in sandboxed container (aquasec/trivy:0.69.3@sha256:bcc376...)...
Unable to find image 'aquasec/trivy:...' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
##[error]Process completed with exit code 125.

Same signature as #16657 / #16652 / #16669. Note the log shows Cache hit for: cache-trivy-2026-06-11 right before the failure - that restored cache is Trivy's vulnerability DB, not the scanner image, so docker run still pulls aquasec/trivy from Docker Hub and that pull times out. This PR's bounded pre-pull retry is exactly what absorbs it.

@kevinjqliu since you own the CVE scan (#16291, #16287), could you take a quick look when you get a chance? @stevenzwu as backup. It's +19/-0 in cve-scan.yml and keeps red-marking otherwise-green, unrelated PRs.

@kevinjqliu kevinjqliu left a comment

Contributor

LGTM

thank you for addressing this. one small suggestion: i think we can utilize ghcr instead of docker hub, ghcr should work better in github environments

Comment thread .github/workflows/cve-scan.yml Outdated
Comment on lines +55 to +58
# Trivy scanner image, pinned by digest (matches lhotari/sandboxed-trivy-action's
# default at the pinned ref). Pre-pulled with retry below to absorb transient Docker
# Hub (registry-1.docker.io) timeouts that otherwise fail the job with exit code 125.
TRIVY_IMAGE: aquasec/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c

Contributor

Suggested change
- # Trivy scanner image, pinned by digest (matches lhotari/sandboxed-trivy-action's
- # default at the pinned ref). Pre-pulled with retry below to absorb transient Docker
- # Hub (registry-1.docker.io) timeouts that otherwise fail the job with exit code 125.
- TRIVY_IMAGE: aquasec/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c
+ # Trivy scanner image. Use Aqua's official GHCR image instead of Docker Hub
+ # to avoid transient registry-1.docker.io pull timeouts on GitHub-hosted runners.
+ TRIVY_IMAGE: ghcr.io/aquasecurity/trivy:0.69.3@sha256:bcc376de8d77cfe086a917230e818dc9f8528e3c852f7b1aff648949b6258d1c

it looks like trivy also publishes to ghcr, with the same digest. this can also help with the timeout issue

Contributor Author

Done f146fb2

Contributor Author

Thanks for the review. I'd considered ghcr as well while working on this fix, but was hesitant to propose switching the registry myself. Glad you raised it, and I fully support the switch.

# Pre-pull the scanner image so the action's docker run finds it locally and never hits
# the registry. Retrying with backoff absorbs transient Docker Hub timeouts (exit 125).
run: |
  for attempt in 1 2 3 4 5; do

Contributor

nit: the loop sleeps after the 5th failed pull even though there is no 6th retry, and the log says "retrying" on the final failed attempt.

Could we avoid the final sleep / misleading message? For example:

for attempt in 1 2 3 4 5; do
  if docker pull "${TRIVY_IMAGE}"; then
    exit 0
  fi

  if [ "${attempt}" = "5" ]; then
    break
  fi

  echo "docker pull failed (attempt ${attempt}/5); retrying in $((attempt * 10))s..." >&2
  sleep "$((attempt * 10))"
done

echo "Failed to pull ${TRIVY_IMAGE} after 5 attempts" >&2
exit 1
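
For a sense of the cost, the worst-case wait added by the loop above can be worked out directly. The sketch below sums the sleeps performed when every pull fails. It counts only the backoff sleeps and leaves out the time each failed `docker pull` itself takes, which depends on the registry timeout.

```shell
#!/usr/bin/env bash
# Sum the sleeps the loop performs when all 5 pulls fail.
# It sleeps attempt * 10 seconds after attempts 1..4 and not after attempt 5.
total=0
attempt=1
while [ "${attempt}" -le 4 ]; do
  total=$((total + attempt * 10))
  attempt=$((attempt + 1))
done

# 10 + 20 + 30 + 40 = 100
echo "worst-case backoff: ${total}s"
```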

Contributor Author

Done f146fb2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kevinjqliu kevinjqliu merged commit 40d3e65 into apache:main Jun 12, 2026
18 checks passed
@kevinjqliu

Contributor

Thank you @wombatu-kun hope this will make the CI more stable!

@kevinjqliu

Contributor

we run CVE scans more than necessary right now, which exacerbates the docker hub networking issue. #16513 should help fix it
