Skip to content

USHIFT-6967: Use deterministic workflow for CI Doctor analysis parallelization#192

Draft
ggiguash wants to merge 7 commits into
openshift-eng:mainfrom
ggiguash:ci-doctor-analyse-workflow
Draft

USHIFT-6967: Use deterministic workflow for CI Doctor analysis parallelization#192
ggiguash wants to merge 7 commits into
openshift-eng:mainfrom
ggiguash:ci-doctor-analyse-workflow

Conversation

@ggiguash

@ggiguash ggiguash commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Testing the changes in openshift/release#80503 for Prow.

Summary by CodeRabbit

  • Refactor

    • Doctor now performs per-job analysis using Workflow-based parallel execution, improving throughput and consistency versus launching many concurrent agents.
  • Documentation

    • Updated doctor skill instructions to reflect Workflow orchestration, including stricter “no fallback” behavior when workflow execution fails.
  • Chores

    • Consolidated the shared doctor-analyze implementation and updated CI scripts to use it.
    • The prepare step now generates a workflow job list JSON to drive the analysis phase.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 13, 2026
@openshift-ci

openshift-ci Bot commented Jun 13, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 13, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggiguash

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2026
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR introduces a shared doctor-analyze.js script and ${WORKDIR}/workflows/analyze-jobs.json generation to orchestrate parallel per-job analysis, replaces plugin-local scripts with symlinks to the shared implementation, and updates lvms-ci and microshift-ci skill definitions to call the shared analyzer via the Workflow tool instead of spawning concurrent Agent instances.

Changes

Workflow-based job analysis and plugin integration

Layer / File(s) Summary
Generate analyze-jobs.json with job metadata
plugins/shared/scripts/doctor.sh
Adds logic to prepare command to create ${WORKDIR}/workflows/ and write analyze-jobs.json containing all release job entries and filtered failure-status PR job entries, each with artifacts_dir, deterministic output_path, and label.
Shared doctor-analyze script for parallel job processing
plugins/shared/scripts/doctor-analyze.js
New script exports meta definition and implements Analyze phase: defensively normalizes args (object or JSON string), runs per-job parallel agent(...) invocations with configured prow_job_skill, aggregates results into {analyzed, failed, total}.
lvms-ci plugin script symlink and Workflow-based orchestration
plugins/lvms-ci/scripts/doctor-analyze.js, plugins/lvms-ci/skills/doctor/SKILL.md
Plugin script becomes symlink to shared implementation. Skill front-matter switches allowed-tools from Agent to Workflow. Step 2 reads analyze-jobs.json and invokes Workflow(doctor-analyze.js) with jobs and prow_job_skill path, waits for completion, and hard-fails on workflow errors. Notes updated to reflect parallel execution guarantee.
microshift-ci plugin script symlink and Workflow-based orchestration
plugins/microshift-ci/scripts/doctor-analyze.js, plugins/microshift-ci/skills/doctor/SKILL.md
Plugin script becomes symlink to shared implementation. Skill front-matter enables Workflow. Step 2 rewritten to read analyze-jobs.json, invoke Workflow(doctor-analyze.js), wait for completion, surface {analyzed, failed, total}, and introduce critical no-fallback stop-on-failure rule. Step 3 adjusted to launch single microshift-ci:create-bugs foreground agent. Related Skills and Notes updated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • openshift-eng/edge-tooling#67: Both PRs modify plugins/microshift-ci/skills/doctor/SKILL.md control flow across Step 2–Step 4 phases to change how doctor workflow transitions.
  • openshift-eng/edge-tooling#112: Both PRs update plugins/microshift-ci/skills/doctor/SKILL.md Step 2/3 workflow and agent orchestration details.
  • openshift-eng/edge-tooling#147: Both PRs wire lvms-ci plugin to use the shared doctor workflow for per-job parallel analysis via Workflow tool.
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Ai-Attribution ⚠️ Warning Raw summary indicates AI tool usage ("AI-generated summary of changes"), but commit b8a335b lacks required Red Hat attribution trailers (Assisted-by/Generated-by). No proper attribution found. Add appropriate attribution trailer (Assisted-by or Generated-by) to commit message when AI tools are used, per Red Hat standards.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: converting CI Doctor analysis to use a deterministic workflow for parallelization instead of sequential agent spawning.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in any modified files.
Container-Privileges ✅ Passed The PR contains only JavaScript scripts, bash scripts, and markdown docs—no K8s manifests or container configs where privilege escalation settings could be present.
No-Sensitive-Data-In-Logs ✅ Passed All logging statements in the modified code log only non-sensitive data: job counts (integers), release versions, PR numbers, and public build IDs. No passwords, tokens, API keys, PII, or sensitive...
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets detected in any modified files. Code contains only standard configuration parameters, file paths, bot account references, and legitimate string literals.
No-Injection-Vectors ✅ Passed No injection vectors found. Shell scripts use safe jq --arg passing, JSON.parse only parses workflow config (not executed as code), and all user input is safely concatenated into prompts/paths with...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/microshift-ci/skills/doctor/SKILL.md`:
- Around line 88-126: Add a new "## Edge Case" section in both orchestrator
SKILL docs to explicitly document workflow failure modes and the no-fallback
policy: in plugins/microshift-ci/skills/doctor/SKILL.md (lines 88-126) insert an
"## Edge Case" block that states the Workflow tool call must be treated as
blocking on error (if the Workflow invocation fails for any reason — script
error, timeout, API error — stop the skill and report the error to the user; do
not fall back to sequential agent spawning), documents that individual
job-analysis agents may fail without aborting the workflow (the workflow
continues), and instructs to use the returned {analyzed, failed, total} values
for summaries; in plugins/lvms-ci/skills/doctor/SKILL.md (lines 68-100) add the
same "## Edge Case" block with identical rules (state Workflow failure =
stop+report, per-job failures allowed, use returned summary values, and
explicitly prohibit sequential fallback). Ensure wording is concise and matches
the existing doc style.

In `@plugins/shared/scripts/doctor-analyze.js`:
- Around line 46-50: The top-level illegal `return { analyzed: analyzed, failed:
failed, total: results.length }` must be replaced with a proper module export or
function return; change it to `module.exports = { analyzed, failed, total:
results.length }` (or wrap the surrounding logic in a function like `analyze()`
and return from that function) so the file no longer uses a module-scope return;
reference the variables `analyzed`, `failed`, and `results.length` (or `total`)
when exporting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2d2b9be2-4b4f-475d-b858-b749df2fe1d1

📥 Commits

Reviewing files that changed from the base of the PR and between 17c108d and 1ac3d17.

📒 Files selected for processing (5)
  • plugins/lvms-ci/scripts/doctor-analyze.js
  • plugins/lvms-ci/skills/doctor/SKILL.md
  • plugins/microshift-ci/scripts/doctor-analyze.js
  • plugins/microshift-ci/skills/doctor/SKILL.md
  • plugins/shared/scripts/doctor-analyze.js

Comment thread plugins/microshift-ci/skills/doctor/SKILL.md
Comment thread plugins/shared/scripts/doctor-analyze.js

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
plugins/shared/scripts/doctor.sh (1)

238-244: ⚡ Quick win

Sort the job arrays before deriving index-based filenames.

Lines 242-254 derive output_path and label from .key + 1, so the filenames are only as deterministic as the incoming JSON order. That makes the PR’s deterministic-workflow guarantee depend on download-jobs.sh preserving order after parallel downloads. Sort on a stable key before to_entries[] so reruns keep the same filenames for the same jobs.

♻️ Proposed change
-        workflow_jobs=$(echo "${workflow_jobs}" | jq --arg r "${release}" --arg w "${WORKDIR}" \
+        workflow_jobs=$(echo "${workflow_jobs}" | jq --arg r "${release}" --arg w "${WORKDIR}" \
             --slurpfile jobs "${jobs_file}" \
-            '. + [$jobs[0] | to_entries[] | {
+            '. + [($jobs[0] | sort_by(.build_id)) | to_entries[] | {
                 artifacts_dir: .value.artifacts_dir,
                 output_path:   ($w + "/jobs/release-" + $r + "-job-" + ((.key+1)|tostring) + "-" + .value.build_id + ".txt"),
                 label:         ($r + "-job-" + ((.key+1)|tostring))
             }]')
...
-        workflow_jobs=$(echo "${workflow_jobs}" | jq --arg w "${WORKDIR}" \
+        workflow_jobs=$(echo "${workflow_jobs}" | jq --arg w "${WORKDIR}" \
             --slurpfile jobs "${WORKDIR}/jobs/prs-jobs.json" \
-            '. + [$jobs[0] | to_entries[] | {
+            '. + [($jobs[0] | sort_by(.pr_number, .build_id)) | to_entries[] | {
                 artifacts_dir: .value.artifacts_dir,
                 output_path:   ($w + "/jobs/prs-job-" + ((.key+1)|tostring) + "-pr" + (.value.pr_number|tostring) + "-" + .value.build_id + ".txt"),
                 label:         ("pr" + (.value.pr_number|tostring) + "-job-" + ((.key+1)|tostring)),
                 status:        .value.status
             }]')

Also applies to: 249-256

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/doctor.sh` around lines 238 - 244, The jq pipeline
building workflow_jobs uses $jobs[0] | to_entries[] and then .key to form
index-based filenames, which is nondeterministic; change the pipeline to sort
the jobs before to_entries (e.g., $jobs[0] | sort_by(.value.build_id) |
to_entries[]) so indices are stable (use a stable key like value.build_id or
another unique stable field), and apply the same change to the other similar jq
expression mentioned in the comment so both places produce deterministic
filenames/labels.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@plugins/shared/scripts/doctor.sh`:
- Around line 238-244: The jq pipeline building workflow_jobs uses $jobs[0] |
to_entries[] and then .key to form index-based filenames, which is
nondeterministic; change the pipeline to sort the jobs before to_entries (e.g.,
$jobs[0] | sort_by(.value.build_id) | to_entries[]) so indices are stable (use a
stable key like value.build_id or another unique stable field), and apply the
same change to the other similar jq expression mentioned in the comment so both
places produce deterministic filenames/labels.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4dadb3fc-dd94-4b5c-8152-f599a4b79abc

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac3d17 and 93c4df0.

📒 Files selected for processing (3)
  • plugins/lvms-ci/skills/doctor/SKILL.md
  • plugins/microshift-ci/skills/doctor/SKILL.md
  • plugins/shared/scripts/doctor.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugins/lvms-ci/skills/doctor/SKILL.md
  • plugins/microshift-ci/skills/doctor/SKILL.md

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 13, 2026
@coderabbitai coderabbitai Bot removed the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 13, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/shared/scripts/doctor-analyze.js`:
- Around line 14-18: After parsing the args into variable `a`, add validation to
ensure required properties exist before accessing them. Specifically, validate
that `a.jobs` exists and is an array (before accessing `a.jobs.length`), and
that `a.prow_job_skill` exists. If either validation fails, log a clear error
message and exit the script gracefully to prevent runtime crashes from malformed
or missing arguments.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f21d0e26-2329-4871-8095-91290d4a87dd

📥 Commits

Reviewing files that changed from the base of the PR and between 3491f0a and b8a335b.

📒 Files selected for processing (2)
  • plugins/shared/scripts/doctor-analyze.js
  • plugins/shared/scripts/doctor.sh

Comment thread plugins/shared/scripts/doctor-analyze.js
@ggiguash ggiguash force-pushed the ci-doctor-analyse-workflow branch from b8a335b to 14c86b4 Compare June 14, 2026 09:26
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 14, 2026
@ggiguash ggiguash marked this pull request as ready for review June 15, 2026 05:17
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2026
@ggiguash ggiguash marked this pull request as draft June 15, 2026 06:43
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This essentially locks your skills to Claude Code which may be fine if its significantly improving your results.

This doesn't strike me as a substantial workflow, though -- what is this getting you that driving Claude via -p wouldn't? We do multi-stage workflows in other Prow jobs with just multiple Cluade invocations and it works fine to produce deterministic actions, and has the benefit of being portable to OpenCode or Pi later on.

openshift-eng/ai-helpers#545 should make the unstable version of Claude available in ai-helpers if you'd like to use it (I'd like to keep stable because unstable Claude Code has broken the payload agent several times....)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants