SREP-4345: Fix ROSA CI stability and add daily status reporting #77602

Open
dustman9000 wants to merge 1 commit into openshift:main from dustman9000:fix-cluster-switch-ordering

@dustman9000
Member

@dustman9000 dustman9000 commented Apr 9, 2026

Summary

Four improvements for ROSA CI job stability and visibility:

1. Fix CLUSTER_SWITCH unbound variable (provisioning bug)

PR #77500 introduced a version fallback block that references ${CLUSTER_SWITCH} before it's defined. With set -o nounset, this crashes rosa-sts-account-roles-create for any version using the nightly channel (4.22+).
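
A minimal sketch of the failure mode described above, not the actual step script. Under set -o nounset, expanding a variable before it is assigned aborts the shell. The function names and the flag value are hypothetical; only CLUSTER_SWITCH and the nounset behavior come from this PR.

```shell
#!/usr/bin/env bash
# Illustrative only: function names and the flag value are hypothetical.

# Buggy ordering: the fallback reads CLUSTER_SWITCH before it is assigned.
# The body runs in a subshell so the abort does not kill the caller.
buggy_order() (
  set -o nounset
  unset CLUSTER_SWITCH
  echo "fallback uses: ${CLUSTER_SWITCH}"   # aborts here: unbound variable
  CLUSTER_SWITCH="--example-flag"
)

# Fixed ordering: assign first, then run the fallback.
fixed_order() (
  set -o nounset
  CLUSTER_SWITCH="--example-flag"
  echo "fallback uses: ${CLUSTER_SWITCH}"
)

if buggy_order 2>/dev/null; then echo "buggy: ok"; else echo "buggy: failed"; fi
if fixed_order; then echo "fixed: ok"; else echo "fixed: failed"; fi
```

The first call reports a failure because the subshell exits on the unbound expansion; the second succeeds once the assignment precedes the read.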

2. Skip flaky conformance tests

HCP conformance:

  • NetworkPolicy ingress from updated pod (4.20 flake)
  • In-tree Volumes local blockfs subPath (4.21 flake)

Classic STS conformance:

  • CPU Partitioning cluster infrastructure (not applicable to non-partitioned clusters)
  • Managed cluster should set requests but not limits
  • Managed cluster should ensure platform components have system-* priority class
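
As a rough illustration of what skipping by name means, the sketch below drops any test whose name contains a listed fragment. It is an assumption-laden example, not the mechanism the workflows use: SKIP_PATTERNS, filter_skipped, and the third test name are hypothetical, and the two fragments are taken from the full HCP test names.

```shell
#!/usr/bin/env bash
set -o nounset

# Hypothetical skip list: one fixed-string fragment per line.
SKIP_PATTERNS='should allow ingress access from updated pod
should support file as subpath'

# Read test names on stdin and drop every line that contains a skip fragment.
# grep exits non-zero when nothing is left, so "|| true" keeps callers safe.
filter_skipped() {
  grep -v -F -e "$SKIP_PATTERNS" || true
}

printf '%s\n' \
  'NetworkPolicy between server and client should allow ingress access from updated pod' \
  'In-tree Volumes local blockfs subPath should support file as subpath' \
  'hypothetical test that should still run' | filter_skipped
```

Fixed-string matching (-F) is used here so that characters such as * in a name like system-* would not be interpreted as regex syntax.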

3. Redirect Slack notifications to #wg-rosa-ci-enhancement

Update rosa-e2e repo-level slack_reporter_config from #ocm-fvt-prow to #wg-rosa-ci-enhancement with improved formatting and Sippy link. Covers rosa-e2e nightly and OCM FVT periodic.

4. Add daily CI status aggregation job

New rosa-ci-daily-status periodic runs at 14:00 UTC, queries the Sippy API for all rosa-stage job pass rates, and reports healthy/failing via Prow's Slack reporter. One message per day with a link to the full breakdown in logs and Sippy.
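
The healthy/failing split could be sketched as below. The PR does not state the threshold or the Sippy response format, so both are assumptions: the input is a hypothetical "job_name pass_percentage" listing already extracted from the API, the 80% cutoff is illustrative, and the job names are made up.

```shell
#!/usr/bin/env bash
set -o nounset

# Hypothetical threshold: a job at or above this pass rate counts as healthy.
HEALTHY_THRESHOLD=80

# Read "job_name pass_percentage" lines on stdin and print a one-line summary.
summarize_pass_rates() {
  awk -v threshold="$HEALTHY_THRESHOLD" '
    NF < 2 { next }                              # ignore blank or short lines
    $2 + 0 >= threshold { healthy++; next }      # healthy job
    { failing++; names = names (names ? ", " : "") $1 }
    END {
      printf "healthy=%d failing=%d", healthy, failing
      if (failing) printf " (%s)", names
      printf "\n"
    }'
}

printf '%s\n' \
  'rosa-stage-job-a 95.0' \
  'rosa-stage-job-b 42.5' \
  'rosa-stage-job-c 80' | summarize_pass_rates
```

Emitting a single summary line mirrors the "one message per day" design, with the per-job detail left to the logs.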

Jira: https://redhat.atlassian.net/browse/SREP-4345

@openshift-ci openshift-ci bot requested review from joshbranham and jtaleric April 9, 2026 15:55
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 9, 2026
@dustman9000 dustman9000 force-pushed the fix-cluster-switch-ordering branch 2 times, most recently from 47fd265 to 2621007 Compare April 9, 2026 15:57
@dustman9000 dustman9000 changed the title Fix CLUSTER_SWITCH unbound variable in account-roles version fallback SREP-4345: Fix account-roles bug and skip flaky conformance tests for ROSA CI Apr 9, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 9, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 9, 2026

@dustman9000: This pull request references SREP-4345 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Three fixes for ROSA CI job stability across HCP and Classic STS conformance:

1. Fix CLUSTER_SWITCH unbound variable (provisioning bug)

PR #77500 introduced a version fallback block that references ${CLUSTER_SWITCH} before it's defined. With set -o nounset, this crashes the rosa-sts-account-roles-create step for any version using the nightly channel. Affects 4.19+ HCP and Classic STS jobs.

Fix: move CLUSTER_SWITCH assignment before the fallback block.

2. Skip flaky tests in HCP conformance

  • NetworkPolicy between server and client should allow ingress access from updated pod (4.20 flake)
  • In-tree Volumes local blockfs subPath should support file as subpath (4.21 flake)

3. Skip flaky/non-applicable tests in Classic STS conformance

  • CPU Partitioning cluster infrastructure should be configured correctly (not applicable to non-partitioned clusters)
  • Managed cluster should set requests but not limits (arch validation)
  • Managed cluster should ensure platform components have system-* priority class associated (arch validation)

Impact

These fixes address failures across all released ROSA versions (4.18-4.21) for both HCP and Classic STS conformance jobs.

Jira: https://redhat.atlassian.net/browse/SREP-4345

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dustman9000
Member Author

/pj-rehearse periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-hcp-ovn periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-sts-ovn

@openshift-merge-bot
Contributor

@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@dustman9000 dustman9000 force-pushed the fix-cluster-switch-ordering branch 3 times, most recently from 8331ab0 to 5de9f21 Compare April 9, 2026 17:26
@dustman9000 dustman9000 changed the title SREP-4345: Fix account-roles bug and skip flaky conformance tests for ROSA CI SREP-4345: Fix ROSA CI stability and add daily status reporting Apr 9, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1. Fix CLUSTER_SWITCH unbound variable in account-roles version
   fallback (breaks 4.22+ nightly channels).

2. Skip flaky conformance tests in HCP and Classic STS workflows:
   - NetworkPolicy ingress from updated pod (HCP 4.20 flake)
   - In-tree Volumes local blockfs subPath (HCP 4.21 flake)
   - CPU Partitioning cluster infrastructure (Classic 4.18 flake)
@dustman9000 dustman9000 force-pushed the fix-cluster-switch-ordering branch from 5de9f21 to 865ffc4 Compare April 9, 2026 17:44
@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@dustman9000: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa-hog | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa-infra | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.19-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.19-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa-infra | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.21-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.21-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-rosa-hcp-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-rosa-hcp-hog | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.19-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.19-nightly-krkn-rosa-hcp-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-rosa-hcp-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.21-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.21-nightly-krkn-rosa-hcp-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |

A total of 291 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

