Skip to content

OCPBUGS-94089: Fix helm backend test setup reliability#16687

Open
jhadvig wants to merge 5 commits into
openshift:mainfrom
jhadvig:OCPBUGS-94089
Open

OCPBUGS-94089: Fix helm backend test setup reliability#16687
jhadvig wants to merge 5 commits into
openshift:mainfrom
jhadvig:OCPBUGS-94089

Conversation

@jhadvig

@jhadvig jhadvig commented Jun 30, 2026

Copy link
Copy Markdown
Member

Analysis / Root cause

The pkg/helm/actions backend test suite was failing consistently in CI with chartmuseum not starting on port 9443. The root cause was a curl flag conflict in downloadChartmuseum.sh: the script used both -o chartmuseum.tar.gz and -O, which in newer curl versions causes the archive to be saved with the remote filename instead of chartmuseum.tar.gz. The subsequent tar xf chartmuseum.tar.gz silently extracted nothing, leaving no binary for chartmuseum.sh to run.

This blocked ci/prow/backend for all PRs — including ones with zero backend changes (e.g., #16240, #16351).

Solution description

  1. Fix curl download — remove conflicting -O flag so the archive is saved as chartmuseum.tar.gz, add -fL for proper error handling and redirects, and chmod +x the extracted binary
  2. Upgrade chartmuseum from v0.14.0 to v0.16.5 (latest)
  3. Replace blind time.Sleep calls with TCP readiness checks (waitForTCP) that verify each server is actually listening before proceeding, with clear error messages on timeout
  4. Add diagnostics — binary existence checks, startup logging, pre-created storage directories, and exec for cleaner process management in all chartmuseum scripts

Test setup

CI backend job — no cluster needed.

Test cases

  • ci/prow/backend passes

Additional info

Reviewers and assignees:

/assign @jhadvig

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved Helm-related integration startup reliability by replacing fixed waits with TCP readiness checks for required services.
    • Added clearer timeout errors and more actionable diagnostics when test services don’t come up.
    • Strengthened local chartmuseum startup scripts by validating the binary is present/executable, creating chart storage directories, and ensuring correct executable permissions after downloads.

Chartmuseum v0.14.0 (from Jan 2022) crashes silently on startup in the
current CI builder image, causing cacertCreate.sh to fail with
connection refused on localhost:9443. This blocks all PRs' backend CI.

- Upgrade chartmuseum from v0.14.0 to v0.16.5
- Replace blind time.Sleep calls with TCP readiness checks that verify
  each server is actually listening before proceeding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 30, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jhadvig: This pull request references Jira Issue OCPBUGS-94089, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Analysis / Root cause

The pkg/helm/actions backend test suite is failing consistently in CI. setup_test.go spawns chartmuseum v0.14.0 (from January 2022) via ExecuteScript with waitForCompletion=false. If chartmuseum crashes on startup (due to binary incompatibility with the updated CI builder image), the test doesn't detect it. After a blind 5-second sleep, cacertCreate.sh tries openssl s_client -connect localhost:9443 and gets connection refused (errno=111), causing a panic.

This blocks ci/prow/backend for all PRs — including ones with zero backend changes (e.g., #16240, #16351).

Solution description

  1. Upgrade chartmuseum from v0.14.0 to v0.16.5 (latest) to fix binary compatibility with the current CI image
  2. Replace blind time.Sleep calls with TCP readiness checks (waitForTCP) that verify each server (chartmuseum on 9443/9181/8181, zot on 5443/5000/5001) is actually listening before proceeding. Timeout is 30 seconds with 1-second retries, producing a clear error message instead of a panic.

Test setup

CI backend job — no cluster needed.

Test cases

  • ci/prow/backend passes (was previously failing on every run)
  • Helm test suite (pkg/helm/actions) completes without panics

Additional info

Reviewers and assignees:

/assign @jhadvig

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 86743688-91c6-4476-95c7-4aa62aa6472a

📥 Commits

Reviewing files that changed from the base of the PR and between a1bbc05 and db2b1be.

📒 Files selected for processing (1)
  • pkg/helm/actions/setup_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/helm/actions/setup_test.go

Walkthrough

Helm test setup now waits for ChartMuseum and Zot TCP ports to become reachable instead of sleeping. The ChartMuseum fixture scripts validate binaries, create storage directories, and start servers with logging. The download script now fetches the tarball with fail-fast options and marks the binary executable.

Changes

Helm test readiness and fixture startup

Layer / File(s) Summary
TCP readiness checks and helper
pkg/helm/actions/setup_test.go
Adds waitForTCP and uses it in TLS, non-TLS, basic-auth, and OCI setup paths to block until ChartMuseum or Zot accepts TCP connections, with timeout errors and optional log output.
ChartMuseum fixture startup scripts
pkg/helm/actions/testdata/chartmuseum.sh, pkg/helm/actions/testdata/chartmuseumWithBasicAuth.sh, pkg/helm/actions/testdata/chartmuseumWithoutTls.sh
Expands the fixture scripts to verify the binary, prepare storage directories, and launch ChartMuseum with startup diagnostics and logging.
ChartMuseum download fixture
pkg/helm/actions/testdata/downloadChartmuseum.sh
Updates the tarball download and extraction step to use fail-fast curl flags and set execute permissions on the extracted binary.

Estimated code review effort: 2 (Simple) | ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 11

❌ Failed checks (1 warning, 10 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Stable And Deterministic Test Names ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Test Structure And Quality ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Microshift Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Single Node Openshift (Sno) Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Topology-Aware Scheduling Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ote Binary Stdout Contract ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ipv6 And Disconnected Network Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the PR’s main goal: improving Helm backend test setup reliability.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


Comment @coderabbitai help to get the list of available commands.

@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jhadvig: This pull request references Jira Issue OCPBUGS-94089, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Analysis / Root cause

The pkg/helm/actions backend test suite is failing consistently in CI. setup_test.go spawns chartmuseum v0.14.0 (from January 2022) via ExecuteScript with waitForCompletion=false. If chartmuseum crashes on startup (due to binary incompatibility with the updated CI builder image), the test doesn't detect it. After a blind 5-second sleep, cacertCreate.sh tries openssl s_client -connect localhost:9443 and gets connection refused (errno=111), causing a panic.

This blocks ci/prow/backend for all PRs — including ones with zero backend changes (e.g., #16240, #16351).

Solution description

  1. Upgrade chartmuseum from v0.14.0 to v0.16.5 (latest) to fix binary compatibility with the current CI image
  2. Replace blind time.Sleep calls with TCP readiness checks (waitForTCP) that verify each server (chartmuseum on 9443/9181/8181, zot on 5443/5000/5001) is actually listening before proceeding. Timeout is 30 seconds with 1-second retries, producing a clear error message instead of a panic.

Test setup

CI backend job — no cluster needed.

Test cases

  • ci/prow/backend passes (was previously failing on every run)
  • Helm test suite (pkg/helm/actions) completes without panics

Additional info

Reviewers and assignees:

/assign @stefanonardo

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from martinszuc and webbnh June 30, 2026 11:09
@openshift-ci openshift-ci Bot added component/backend Related to backend approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 30, 2026
- Fix curl flags conflict (-o/-O) in downloadChartmuseum.sh that may
  save the archive with wrong filename
- Add explicit chmod +x on extracted binary
- Create storage directories before starting chartmuseum
- Add binary existence check and startup logging to all chartmuseum
  scripts
- Use exec to replace shell with chartmuseum process

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ls -la "./$GOOS-$GOARCH/" >&2
exit 1
fi
echo "Starting chartmuseum TLS on port 9443..." >&2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that ChartMuseum already prints this line.

@stefanonardo

Copy link
Copy Markdown
Contributor

/test backend

v0.16.5 starts but silently exits before binding the port. The root
cause was the curl -o/-O flag conflict in the download script, not
the chartmuseum version. Revert to v0.14.0 which works correctly
with the fixed download script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stefanonardo

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026
@jhadvig

jhadvig commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/test ci/prow/backend

@jhadvig

jhadvig commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/test backend

@jhadvig

jhadvig commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/retest

Chartmuseum starts (logs "Starting ChartMuseum") but silently exits
before binding to port 9443. Run it in background with PID file
(matching zot's pattern), redirect output to a log file, and verify
the process is alive after 1 second. On readiness timeout, print the
captured log to reveal the actual exit reason.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/helm/actions/testdata/chartmuseum.sh (1)

17-25: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Background start + liveness check pattern not applied to sibling scripts.

This script backgrounds chartmuseum, redirects logs, and does a quick post-start liveness check (kill -0) — a good diagnostic addition per the PR's goal of surfacing startup failures early. However, chartmuseumWithBasicAuth.sh and chartmuseumWithoutTls.sh still use blocking exec with no log capture or liveness check, so crash-on-start failures for those variants will only surface after the full 30s waitForTCP timeout instead of failing fast with diagnostics.

Consider applying the same background+log+liveness pattern to the other two scripts for consistency and faster CI failure feedback.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helm/actions/testdata/chartmuseum.sh` around lines 17 - 25, Apply the
same background-start diagnostic pattern used in chartmuseum.sh to
chartmuseumWithBasicAuth.sh and chartmuseumWithoutTls.sh: replace the blocking
exec launch with a background process, redirect stdout/stderr to a log file,
store the PID, and add a short sleep plus kill -0 liveness check. If startup
fails, print the captured log and exit early so failures surface before
waitForTCP times out; use the existing chartmuseum startup script structure and
its PID/log handling as the reference.
pkg/helm/actions/testdata/chartmuseumWithBasicAuth.sh (1)

4-11: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

No post-start failure diagnostics, unlike the TLS variant.

chartmuseum.sh now backgrounds the process, captures logs, and performs a quick liveness check to fail fast on startup errors. This script instead execs the binary directly with no log file and no liveness check, so an immediate crash (e.g., port conflict) will only be detected via the 30s waitForTCP timeout in setup_test.go, with less diagnostic output available.

Consider aligning with the TLS script's pattern for consistency and quicker failure detection.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helm/actions/testdata/chartmuseumWithBasicAuth.sh` around lines 4 - 11,
The basic-auth ChartMuseum startup script is missing the same post-start failure
diagnostics used by the TLS variant, so startup crashes are only caught later
with little context. Update chartmuseumWithBasicAuth.sh to follow the
chartmuseum.sh pattern by launching the binary in the background, capturing its
logs to a file, and performing an immediate liveness check before continuing.
Keep the behavior aligned around the chartmuseum startup flow so failures like
port conflicts surface quickly with useful diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/helm/actions/testdata/downloadChartmuseum.sh`:
- Line 10: The curl invocation in downloadChartmuseum.sh uses the
CHARTMUSEUM_ARTIFACT_URL variable unquoted, which can trigger word-splitting or
globbing; update the curl command to quote that variable in the
downloadChartmuseum.sh script so the URL is passed as a single argument.

---

Nitpick comments:
In `@pkg/helm/actions/testdata/chartmuseum.sh`:
- Around line 17-25: Apply the same background-start diagnostic pattern used in
chartmuseum.sh to chartmuseumWithBasicAuth.sh and chartmuseumWithoutTls.sh:
replace the blocking exec launch with a background process, redirect
stdout/stderr to a log file, store the PID, and add a short sleep plus kill -0
liveness check. If startup fails, print the captured log and exit early so
failures surface before waitForTCP times out; use the existing chartmuseum
startup script structure and its PID/log handling as the reference.

In `@pkg/helm/actions/testdata/chartmuseumWithBasicAuth.sh`:
- Around line 4-11: The basic-auth ChartMuseum startup script is missing the
same post-start failure diagnostics used by the TLS variant, so startup crashes
are only caught later with little context. Update chartmuseumWithBasicAuth.sh to
follow the chartmuseum.sh pattern by launching the binary in the background,
capturing its logs to a file, and performing an immediate liveness check before
continuing. Keep the behavior aligned around the chartmuseum startup flow so
failures like port conflicts surface quickly with useful diagnostics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f197ee5c-c97c-4bce-b016-ed915226759f

📥 Commits

Reviewing files that changed from the base of the PR and between 59bb9f8 and a1bbc05.

📒 Files selected for processing (5)
  • pkg/helm/actions/setup_test.go
  • pkg/helm/actions/testdata/chartmuseum.sh
  • pkg/helm/actions/testdata/chartmuseumWithBasicAuth.sh
  • pkg/helm/actions/testdata/chartmuseumWithoutTls.sh
  • pkg/helm/actions/testdata/downloadChartmuseum.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/helm/actions/setup_test.go


if [ ! -f "$GOOS-$GOARCH/chartmuseum" ]; then
curl -o chartmuseum.tar.gz -O $CHARTMUSEUM_ARTIFACT_URL
curl -fL -o chartmuseum.tar.gz $CHARTMUSEUM_ARTIFACT_URL

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Quote the URL variable to avoid globbing/word-splitting.

Shellcheck (SC2086) flags the unquoted $CHARTMUSEUM_ARTIFACT_URL.

🛠️ Proposed fix
-curl -fL -o chartmuseum.tar.gz $CHARTMUSEUM_ARTIFACT_URL
+curl -fL -o chartmuseum.tar.gz "$CHARTMUSEUM_ARTIFACT_URL"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
curl -fL -o chartmuseum.tar.gz $CHARTMUSEUM_ARTIFACT_URL
curl -fL -o chartmuseum.tar.gz "$CHARTMUSEUM_ARTIFACT_URL"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 10-10: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helm/actions/testdata/downloadChartmuseum.sh` at line 10, The curl
invocation in downloadChartmuseum.sh uses the CHARTMUSEUM_ARTIFACT_URL variable
unquoted, which can trigger word-splitting or globbing; update the curl command
to quote that variable in the downloadChartmuseum.sh script so the URL is passed
as a single argument.

Source: Linters/SAST tools

Chartmuseum is alive but never binds to port 9443. Add ss -tlnp output
and /proc PID check on timeout to determine if the process is alive,
what ports are listening, and whether chartmuseum is blocking or dead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jhadvig

jhadvig commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/test backend

1 similar comment
@jhadvig

jhadvig commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/test backend

@stefanonardo

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@jhadvig: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

conn.Close()
return nil
}
time.Sleep(time.Second)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker. Should we increase the time in each loop to 3-5 seconds

@sowmya-sl

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhadvig, sowmya-sl, stefanonardo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jhadvig jhadvig added the plugin-api-approved Indicates a PR with plugin API changes has been approved by an API reviewer label Jul 1, 2026
@jhadvig

jhadvig commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jul 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jhadvig: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. component/backend Related to backend jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. plugin-api-approved Indicates a PR with plugin API changes has been approved by an API reviewer verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants