Skip to content

USHIFT-7198: CI Doctor Root Cause Analysis improvements#193

Open
pmtk wants to merge 13 commits into
openshift-eng:mainfrom
pmtk:ci-doctor-rca
Open

USHIFT-7198: CI Doctor Root Cause Analysis improvements#193
pmtk wants to merge 13 commits into
openshift-eng:mainfrom
pmtk:ci-doctor-rca

Conversation

@pmtk

@pmtk pmtk commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

Release Notes

  • New Features

    • Expanded bug/issue investigation details with confidence, evidence-backed causal chains (with quotes), analysis gaps, and scenario metadata.
    • Added confidence badges and richer investigation sections to generated reports.
    • Introduced shared tooling to extract sosreport artifacts into an indexed JSON and to list git commits for a time window.
    • Enhanced CI “doctor” preparation with an optional repository input to create best-effort per-release worktrees.
  • Documentation

    • Updated MicroShift CI skill guides and added a new primer for interpreting Prow job artifacts.
  • Chores / Refactor

    • Consolidated duplicate helper scripts and improved infrastructure failure handling consistency across processing and aggregation.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2026
@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2026
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds two shared Bash scripts (extract-sosreport.sh, repo-log.sh) and symlinks them across plugin directories. Introduces tri-state infrastructure_failure classification in the shared pipeline and propagates new investigation fields (confidence, causal_chain, analysis_gaps, scenarios) through parse → aggregate → search-bugs → report generation. Extends doctor.sh with optional --repo source-checkout support and sosreport extraction. Comprehensively rewrites prow-job skill with evidence-backed causal-chain analysis workflow, adds MicroShift CI artifact reference primer, enhances HTML report rendering with confidence badges, and updates supporting skills (doctor, fix-test-bugs, create-bugs).

Changes

Causal-chain analysis pipeline and investigation metadata propagation

Layer / File(s) Summary
Shared script utilities and plugin symlinks
plugins/shared/scripts/extract-sosreport.sh, plugins/shared/scripts/repo-log.sh, plugins/lvms-ci/scripts/extract-sosreport.sh, plugins/lvms-ci/scripts/repo-log.sh, plugins/microshift-ci/scripts/extract-sosreport.sh, plugins/microshift-ci/scripts/repo-log.sh
New extract-sosreport.sh scans Prow artifacts for sosreport-*.tar.xz, extracts idempotently with .extracted cache marker, indexes journals and highlights into JSON, and writes ${dest}/index.json. New repo-log.sh wraps git log with --since/--until/--paths/--limit flags for flexible commit listing. Both scripts symlinked from lvms-ci and microshift-ci plugin directories for code reuse.
Tri-state infrastructure classification
plugins/shared/scripts/classify.py
classify_breakdown() default for infrastructure_failure changes from binary False to tri-state None; INFRA_STEP_PATTERNS overrides apply only when None (explicit False skips override, explicit True forces "infrastructure" immediately). New combine_infrastructure_flags() aggregates tri-state values across entry groups with precedence: True > False > None.
Investigation metadata through parse-aggregate-search pipeline
plugins/shared/scripts/parse.py, plugins/shared/scripts/aggregate.py, plugins/microshift-ci/scripts/search-bugs.py
parse.py treats infrastructure_failure as tri-state (absent → None) and adds confidence, causal_chain, analysis_gaps, scenarios fields from report data with defensive defaults and type filtering. aggregate.py uses combine_infrastructure_flags() for grouped failure_type and exports investigation fields collected/de-duplicated from group members. search-bugs.py uses combined flags in candidate classification and extends both non-merge and merge-mode outputs with investigation metadata sourced from representative/merged groups.
Doctor repository and sosreport support
plugins/shared/scripts/doctor.sh
cmd_prepare adds optional --repo ORG/NAME; clones the repo into ${WORKDIR}/src/<repo_name> and creates per-release git worktree directories for non-main branches when available. Pre-processes artifacts to extract all sosreports via extract-sosreport.sh. Checkout failures are non-fatal and recorded in final JSON source field with error, repository path, or worktree mapping.
HTML confidence badge and investigation rendering
plugins/shared/scripts/create-report.py
CSS styling added for confidence badges (high/medium/low) and investigation elements (causal-chain entries, analysis gaps, scenario chips). New helpers _render_confidence_badge() and _render_investigation() render confidence field and investigation details (scenarios, causal-chain with optional evidence/quotes, analysis gaps). Release and PR issue tables append confidence badge to title cell and extend expanded rows with rendered investigation content.
Prow-job evidence-backed causal-chain analysis
plugins/microshift-ci/skills/prow-job/SKILL.md
Redefines Goal to require evidence-backed causal chains ending in actionable root causes or recorded gaps, explicit product-vs-test attribution via product-defect mechanisms. SOSreport guidance rewritten: use extract-sosreport.sh for indexed access, prefer on-failure captures matched by timestamp, mandate tarball extraction + previous.log for container restarts, correlate journals with characterization-phase failure timestamps. Workflow: localize failed step (anchor-error for deduplication), characterize scenarios/tests, pre-hypothesis sosreport consultation, iterative hypothesis→evidence with product-vs-test rules and restart timeline reconstruction from journal/pod logs, corroborate via commit listing and job history. Output includes finished, enriched causal_chain with evidence/quotes, confidence, analysis_gaps, scenarios; RAW_ERROR redefined as deterministic deduplication anchor.
MicroShift artifact reference guide
plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
New reference document explaining scenario-based e2e vs direct-test job types, Robot Framework setup, scenario naming (image/version/suite tokens), per-scenario artifact locations and contents, sosreport selection/extraction guidance, journal-based component failure reconstruction (pod lifecycle, restart interpretation, latency signals), common failure patterns (timeout cascades, greenboot masking, resource contention), and external search tool limitations (Sippy, Search.CI).
Skill documentation updates: doctor, fix-test-bugs, create-bugs
plugins/microshift-ci/skills/doctor/SKILL.md, plugins/microshift-ci/skills/fix-test-bugs/SKILL.md, plugins/microshift-ci/skills/create-bugs/SKILL.md
doctor/SKILL.md: invokes doctor.sh prepare --repo openshift/microshift; expanded agent prompts with job metadata (name/URL/PR number), performance graphs, optional source paths, causal-chain instructions; documents non-fatal source clone. fix-test-bugs/SKILL.md: adds Gate 4 (confidence != low); fix-application prioritizes causal_chain evidence. create-bugs/SKILL.md: extends Jira template with analysis confidence, affected scenarios, root-cause chain sections with quoted evidence; includes conditional omission rules for empty older outputs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • openshift-eng/edge-tooling#185: Modifies the same shared pipeline (parse.py, aggregate.py, search-bugs.py) to introduce structured entities with enriched failure metadata and tri-state infrastructure classification.
  • openshift-eng/edge-tooling#148: Introduces the fix-test-bugs skill that this PR updates with new confidence/causal_chain eligibility and fix-instruction behavior.
  • openshift-eng/edge-tooling#109: Modifies plugins/microshift-ci/scripts/search-bugs.py to compute/propagate failure_type via classify_breakdown (with this PR further adding combine_infrastructure_flags-based tri-state infrastructure handling).

Suggested reviewers

  • ggiguash
🚥 Pre-merge checks | ✅ 9 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Ai-Attribution ⚠️ Warning Multiple commits show AI involvement but lack required Red Hat attribution trailers (Assisted-by/Generated-by). Some commits use incorrect Co-Authored-By for AI tools instead of proper trailers. Add Assisted-by or Generated-by trailers to commits with AI-involved work, and replace Co-Authored-By trailers with correct Red Hat attribution format for AI tools.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Weak-Crypto ✅ Passed No weak cryptographic patterns detected. Files contain shell scripts and Python utilities for CI analysis with no MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB usage or custom crypto implementations.
Container-Privileges ✅ Passed No Kubernetes or container manifests with privileged settings found. PR contains only CI analysis scripts, documentation, and Python utilities with no container configurations.
No-Sensitive-Data-In-Logs ✅ Passed PR does not introduce active logging of sensitive data (passwords, tokens, API keys, PII, session IDs). New scripts extract sosreport highlights via safe regex, include git commit subjects and stru...
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets found: comprehensive scan of Python/shell scripts and documentation detected no API keys, tokens, passwords, private keys, or embedded credentials.
No-Injection-Vectors ✅ Passed Code review of PR #193 found no injection vulnerabilities. HTML rendering properly uses html.escape(). Git commands use quoted variables. No SQL, eval, pickle, yaml.load, or os.system vulnerabiliti...
Title check ✅ Passed The title clearly and specifically summarizes the main objective of the changeset: improving root cause analysis capabilities in the CI Doctor system.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/shared/scripts/classify.py (1)

20-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Explicit False can still be reclassified as infrastructure.

combine_infrastructure_flags() defines explicit False as “infrastructure ruled out”, but classify_breakdown() can still return "infrastructure" via stack_layer matching. That breaks the tri-state precedence and can incorrectly suppress product-bug candidates downstream.

Suggested fix
-    lower = stack_layer.lower()
-    if lower in INFRA_LAYERS:
+    lower = stack_layer.lower()
+    if infrastructure_failure is None and lower in INFRA_LAYERS:
         return "infrastructure"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/classify.py` around lines 20 - 49, The function
classify_breakdown has a tri-state logic issue where explicit False (indicating
infrastructure was ruled out) is not properly enforced throughout the
classification chain. Currently, when infrastructure_failure is explicitly
False, the function can still return "infrastructure" based on stack_layer
matching against INFRA_LAYERS. To fix this, add a guard condition before the
stack_layer checks (the section checking INFRA_LAYERS and BUILD_LAYERS) to skip
the INFRA_LAYERS classification when infrastructure_failure is explicitly False,
ensuring that explicit False truly prevents infrastructure classification as
intended by the tri-state design.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/microshift-ci/skills/create-bugs/SKILL.md`:
- Around line 175-177: In the Search D section ("Known-regression lookup"),
modify the instruction that adds linked Jira issues to the candidate's
`duplicates` array to include an explicit status filter. The current text adds
all linked issues without checking their status, which allows closed or verified
issues to reach the `duplicates` array and incorrectly suppress
create/regression handling in the auto-decision logic. Update the instruction to
add only non-closed, non-verified issues to `duplicates`, and either route
closed/verified linked issues to `regressions` instead or explicitly state they
should be ignored for update decision purposes.

In `@plugins/microshift-ci/skills/fix-test-bugs/SKILL.md`:
- Around line 94-99: The documentation adds Gate 4 (confidence check) to the
gates applied to candidate groups, but the Step 2 summary template does not
include a corresponding skip bucket for low-confidence candidates in its skip
breakdown schema. Update the Step 2 summary template section in the
documentation to explicitly include a low-confidence skip category in the skip
breakdown breakdown structure, ensuring the documented output schema accounts
for all skip reasons introduced by the gates (Gates 2, 3, and 4). This makes the
output structure consistent with the applied gates as required by coding
guidelines.

In `@plugins/shared/scripts/extract-sosreport.sh`:
- Line 64: The find command on line 64 that searches for sosreport-*.tar.xz
files will match directories and symlinks in addition to regular files, causing
tar -xf to fail when processing non-file objects. Add the -type f flag to the
find command to restrict the discovery to regular files only, preventing the
script from attempting to extract non-file filesystem objects and failing under
the set -e directive.

---

Outside diff comments:
In `@plugins/shared/scripts/classify.py`:
- Around line 20-49: The function classify_breakdown has a tri-state logic issue
where explicit False (indicating infrastructure was ruled out) is not properly
enforced throughout the classification chain. Currently, when
infrastructure_failure is explicitly False, the function can still return
"infrastructure" based on stack_layer matching against INFRA_LAYERS. To fix
this, add a guard condition before the stack_layer checks (the section checking
INFRA_LAYERS and BUILD_LAYERS) to skip the INFRA_LAYERS classification when
infrastructure_failure is explicitly False, ensuring that explicit False truly
prevents infrastructure classification as intended by the tri-state design.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a2591fc2-e19a-41f4-8c0b-0aba8cd473e2

📥 Commits

Reviewing files that changed from the base of the PR and between 0a7ac01 and ed8163c.

📒 Files selected for processing (17)
  • plugins/lvms-ci/scripts/extract-sosreport.sh
  • plugins/lvms-ci/scripts/repo-log.sh
  • plugins/microshift-ci/scripts/extract-sosreport.sh
  • plugins/microshift-ci/scripts/repo-log.sh
  • plugins/microshift-ci/scripts/search-bugs.py
  • plugins/microshift-ci/skills/create-bugs/SKILL.md
  • plugins/microshift-ci/skills/doctor/SKILL.md
  • plugins/microshift-ci/skills/fix-test-bugs/SKILL.md
  • plugins/microshift-ci/skills/prow-job/SKILL.md
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
  • plugins/shared/scripts/aggregate.py
  • plugins/shared/scripts/classify.py
  • plugins/shared/scripts/create-report.py
  • plugins/shared/scripts/doctor.sh
  • plugins/shared/scripts/extract-sosreport.sh
  • plugins/shared/scripts/parse.py
  • plugins/shared/scripts/repo-log.sh

Comment thread plugins/microshift-ci/skills/create-bugs/SKILL.md Outdated
Comment thread plugins/microshift-ci/skills/fix-test-bugs/SKILL.md
Comment thread plugins/shared/scripts/extract-sosreport.sh
@pmtk pmtk closed this Jun 15, 2026
@pmtk pmtk reopened this Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/microshift-ci/skills/prow-job/SKILL.md (1)

6-6: ⚠️ Potential issue | 🟠 Major

Guidance references undeclared tools: allowed-tools mismatch.

Line 209 directs users to use mcp__openshift-ci__get_releases and mcp__openshift-ci__get_payload_status for release-controller queries (as alternatives to WebFetch), but neither tool is declared in allowed-tools on line 6. The declared tools are only mcp__openshift-ci__get_job_runs, mcp__openshift-ci__get_job_report, and mcp__openshift-ci__search_ci_logs.

Either add these tools to allowed-tools on line 6, or clarify/remove the reference in the guidance if they are not intended for this workflow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` at line 6, The allowed-tools
declaration on line 6 is missing two MCP tools (mcp__openshift-ci__get_releases
and mcp__openshift-ci__get_payload_status) that are referenced in the guidance
documentation to users. Add these two missing tools to the allowed-tools list in
the SKILL.md file so that the guidance and the declared tools match, or
alternatively remove the references to these tools from the guidance if they are
not intended for this workflow.
🧹 Nitpick comments (1)
plugins/microshift-ci/skills/prow-job/SKILL.md (1)

200-205: ⚡ Quick win

Consider restructuring critical error-handling guidance per established learning.

Lines 200–205 contain critical rules about product-vs-test attribution and container restart detection. These are core to the investigation workflow and are properly co-located inline (not in a separate Edge Cases section), which aligns with your learning from PR 192. However, the learning also suggests using structured "CRITICAL" blocks and "Error Handling" bullet lists for hard stops, rather than inline narrative with bold emphasis.

For example, line 204's "Always check for container restarts." would be clearer as:

CRITICAL: Always inspect pod restart events in the journal.
- Grep for repeated `Created container`/`Started container` (crio) and `RemoveContainer`/PLEG events (kubelet).
- Two container instances for one pod means the first one DIED — a single startup story is the wrong narrative.
- Read the dead container's `previous.log` at `sos_commands/microshift/namespaces/<namespace>/pods/<pod>/<container>/<container>/logs/previous.log`.

This restructuring improves clarity for LLM and human readers by separating the policy from the narrative, reducing the chance of misinterpretation during execution.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 200 - 205, The
critical error-handling guidance in this section is currently presented as
inline narrative with bold emphasis, which can be ambiguous for both LLM and
human readers. Restructure the guidance by converting hard-stop rules into
distinct "CRITICAL:" blocks followed by organized bullet lists that separate the
policy statement from supporting details and examples. Specifically, restructure
the "A test-layer fix is never the bottom" paragraph and the "Always check for
container restarts" paragraph using this format: begin with a CRITICAL header
stating the rule clearly, then provide implementation steps as bullets with
supporting context. Apply this pattern consistently across all critical guidance
sections to reduce misinterpretation and improve clarity of the investigation
workflow.

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Line 6: The allowed-tools declaration on line 6 is missing two MCP tools
(mcp__openshift-ci__get_releases and mcp__openshift-ci__get_payload_status) that
are referenced in the guidance documentation to users. Add these two missing
tools to the allowed-tools list in the SKILL.md file so that the guidance and
the declared tools match, or alternatively remove the references to these tools
from the guidance if they are not intended for this workflow.

---

Nitpick comments:
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 200-205: The critical error-handling guidance in this section is
currently presented as inline narrative with bold emphasis, which can be
ambiguous for both LLM and human readers. Restructure the guidance by converting
hard-stop rules into distinct "CRITICAL:" blocks followed by organized bullet
lists that separate the policy statement from supporting details and examples.
Specifically, restructure the "A test-layer fix is never the bottom" paragraph
and the "Always check for container restarts" paragraph using this format: begin
with a CRITICAL header stating the rule clearly, then provide implementation
steps as bullets with supporting context. Apply this pattern consistently across
all critical guidance sections to reduce misinterpretation and improve clarity
of the investigation workflow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ad7c1c04-30cb-439c-94f0-b757ca72d0a4

📥 Commits

Reviewing files that changed from the base of the PR and between ed8163c and 1e46e39.

📒 Files selected for processing (1)
  • plugins/microshift-ci/skills/prow-job/SKILL.md

@coderabbitai coderabbitai Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
plugins/microshift-ci/skills/prow-job/SKILL.md (4)

271-273: 💤 Low value

Mark the product-vs-test attribution rule as CRITICAL.

The rule at line 271 ("A test-layer fix is never the bottom...") is a hard-stop constraint that fundamentally shapes the causal-chain analysis. Per retrieved learnings, mark it as **CRITICAL**: to signal non-negotiable importance.

Reformat from:

- **A test-layer fix is never the bottom when a product component misbehaved.** When the failure involves...

To:

- **CRITICAL — A test-layer fix is never the bottom when a product component misbehaved.** When the failure involves...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 271 - 273, Add a
"CRITICAL — " prefix to the rule about test-layer fixes starting with "A
test-layer fix is never the bottom when a product component misbehaved." in the
SKILL.md file. Change the line from "- **A test-layer fix is never the
bottom..." to "- **CRITICAL — A test-layer fix is never the bottom..." to
clearly signal that this is a non-negotiable constraint that fundamentally
shapes causal-chain analysis and attribution decisions.

Source: Learnings


260-260: 💤 Low value

Restructure Phase 3b error handling as a distinct block.

Per retrieved learnings, error-handling rules should be co-located inline but formatted as distinct Error Handling: blocks (not embedded in prose). Line 260 contains the rule but as a clause within the narrative.

Reformat to:

After all sub-agents return, proceed to **Phase 4b (Cross-Scenario Synthesis)**.

**Error Handling**: if a sub-agent returns no `--- SCENARIO ANALYSIS ---` block or returns malformed JSON, record that scenario with `confidence: "low"`, `analysis_gaps: ["sub-agent analysis failed for this scenario"]`, and use whatever information you gathered during the Localize phase (step name, junit failure count).

This makes the error recovery explicit and easier for the agent to reference during execution.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` at line 260, The error
handling rule for Phase 3b (around line 260) that describes how to handle
sub-agent failures returning no SCENARIO ANALYSIS block or malformed JSON is
currently embedded within narrative prose. Extract this rule and reformat it as
a distinct, separate "**Error Handling**:" block that appears after the main
Phase 3b instructions but before transitioning to Phase 4b. This makes the error
recovery procedure explicit and easier for the agent to reference during
execution, following the established pattern of co-locating error-handling rules
inline as formatted distinct blocks rather than as embedded clauses.

Source: Learnings


296-296: 💤 Low value

Mark the WebFetch prohibition as a hard-stop CRITICAL block.

Per retrieved learnings, hard-stop failure policies should be marked with **CRITICAL**: and positioned immediately before the relevant tool invocation. The WebFetch prohibition (line 296) is a critical safety rule but is currently embedded in prose.

Reformat to:

- **CRITICAL**: Never use WebFetch to query the release controller API (e.g., `amd64.ocp.releases.ci.openshift.org`). Use the openshift-ci MCP tools instead: `mcp__openshift-ci__get_releases` to list release streams, `mcp__openshift-ci__get_payload_status` for payload acceptance status. WebFetch is not permitted in CI and will cause the job to fail.

This ensures the agent reads and respects the safety rule at the moment it considers tool choices.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` at line 296, The WebFetch
prohibition statement needs to be reformatted to clearly mark it as a critical
hard-stop rule. Find the bullet point containing "Never use WebFetch to query
the release controller API" and add the prefix `**CRITICAL**: ` immediately
before the word "Never" to ensure the safety rule is prominently displayed and
reads as a mandatory constraint when the agent considers tool choices.

Source: Learnings


1-7: ⚡ Quick win

Reduce skill file to under 500 lines by relocating output schema to separate reference.

Per SKILL-GUIDELINES.md, keep SKILL.md files under 500 lines; move reference material into references/ or split into sub-skills. The core skill content (lines 1–330) is necessary for the multi-phase orchestration, but the Output Template section (lines 331–437) adds ~110 lines of JSON schema documentation that can be cleanly separated.

  • plugins/microshift-ci/skills/prow-job/SKILL.md#L1-L330: Keep the core skill workflow and output template format (lines 331–342, the human-readable report template itself)
  • plugins/microshift-ci/skills/prow-job/SKILL.md#L344-L437: Move to plugins/microshift-ci/skills/prow-job/references/output-schema.md (severity rubric, field descriptions, confidence rules, RAW_ERROR rules, ROOT_CAUSE rules, examples)
  • plugins/microshift-ci/skills/prow-job/SKILL.md#L331-L342: Update output template to reference the schema file: "See references/output-schema.md for field definitions, confidence rules, and examples."

This achieves the 500-line limit while keeping reference material in the references/ directory as intended by the guideline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 1 - 7, The
SKILL.md file needs to be reduced below 500 lines by extracting the detailed
output schema documentation. In plugins/microshift-ci/skills/prow-job/SKILL.md,
keep the core skill workflow (lines 1–330) and the output template format header
(lines 331–342), then delete the detailed schema content (lines 344–437
containing severity rubric, field descriptions, confidence rules, RAW_ERROR
rules, ROOT_CAUSE rules, and examples). Create a new file at
plugins/microshift-ci/skills/prow-job/references/output-schema.md and move the
deleted schema content there. Finally, update the output template section in
SKILL.md (around line 331–342) to add a note referencing the new schema file
with the message "See references/output-schema.md for field definitions,
confidence rules, and examples." This follows the same pattern as the existing
microshift-ci-primer.md reference file in the references directory.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 271-273: Add a "CRITICAL — " prefix to the rule about test-layer
fixes starting with "A test-layer fix is never the bottom when a product
component misbehaved." in the SKILL.md file. Change the line from "- **A
test-layer fix is never the bottom..." to "- **CRITICAL — A test-layer fix is
never the bottom..." to clearly signal that this is a non-negotiable constraint
that fundamentally shapes causal-chain analysis and attribution decisions.
- Line 260: The error handling rule for Phase 3b (around line 260) that
describes how to handle sub-agent failures returning no SCENARIO ANALYSIS block
or malformed JSON is currently embedded within narrative prose. Extract this
rule and reformat it as a distinct, separate "**Error Handling**:" block that
appears after the main Phase 3b instructions but before transitioning to Phase
4b. This makes the error recovery procedure explicit and easier for the agent to
reference during execution, following the established pattern of co-locating
error-handling rules inline as formatted distinct blocks rather than as embedded
clauses.
- Line 296: The WebFetch prohibition statement needs to be reformatted to
clearly mark it as a critical hard-stop rule. Find the bullet point containing
"Never use WebFetch to query the release controller API" and add the prefix
`**CRITICAL**: ` immediately before the word "Never" to ensure the safety rule
is prominently displayed and reads as a mandatory constraint when the agent
considers tool choices.
- Around line 1-7: The SKILL.md file needs to be reduced below 500 lines by
extracting the detailed output schema documentation. In
plugins/microshift-ci/skills/prow-job/SKILL.md, keep the core skill workflow
(lines 1–330) and the output template format header (lines 331–342), then delete
the detailed schema content (lines 344–437 containing severity rubric, field
descriptions, confidence rules, RAW_ERROR rules, ROOT_CAUSE rules, and
examples). Create a new file at
plugins/microshift-ci/skills/prow-job/references/output-schema.md and move the
deleted schema content there. Finally, update the output template section in
SKILL.md (around line 331–342) to add a note referencing the new schema file
with the message "See references/output-schema.md for field definitions,
confidence rules, and examples." This follows the same pattern as the existing
microshift-ci-primer.md reference file in the references directory.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dc75c525-4ab8-4b05-bb2d-195a75ecd029

📥 Commits

Reviewing files that changed from the base of the PR and between 1e46e39 and b94044d.

📒 Files selected for processing (2)
  • plugins/microshift-ci/skills/prow-job/SKILL.md
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md

pmtk added 8 commits June 16, 2026 10:30
The prow-job workflow prescribed surface-level analysis: scan the top
build log, read context around the first error, report. Agents stopped
at the symptom because nothing required them to go deeper, and the
richest artifacts (junit/scenario logs, sosreports, pre-generated PCP
graphs, job history) were never consulted.

Replace the linear workflow with a phased one (Localize, Characterize,
Drill down, Corroborate) that iterates hypothesis -> evidence until the
cause is actionable, and extend the STRUCTURED SUMMARY with additive
fields: causal_chain (evidence-cited links), confidence, analysis_gaps,
and history. RAW_ERROR/ROOT_CAUSE rules are unchanged, so deterministic
grouping and older report files keep working. Also add a severity
rubric, document the per-test-failure sosreports produced by the
sos-on-failure listener (they retain the test's namespaces and pod
logs, unlike the end-of-scenario report), and point agents at the PCP
graphs that the doctor flow already generates before analysis.

The doctor skill's per-job agent prompts now pass job metadata and the
graphs directory, instruct agents to pursue the underlying cause rather
than the first error, and require a one-line DONE reply to protect the
parent session's turn budget.

Commands not allowlisted in CI (tar, gsutil) degrade explicitly: the
skill records the gap in analysis_gaps instead of silently skipping,
until wrapper scripts land in a follow-up.
Analysis agents in CI cannot run tar (the permission allowlist only
permits bash/python3 on plugin scripts), which in practice meant
sosreports were never consulted during root cause analysis. Wrap
extraction in a shared script that finds all sosreport tarballs under a
directory, extracts them idempotently (.extracted marker), and prints a
JSON index: journal files, the per-namespace pod log tree, and
pre-grepped high-signal lines (panics, OOM kills, leader election lost,
dead-container previous.log content).

The prow-job skill now invokes the script instead of instructing raw
tar usage. Symlinked into both microshift-ci and lvms-ci plugins like
the other shared scripts.
Analyses could not distinguish test bugs from product bugs because the
source was only cloned by fix-test-bugs (main branch, after analysis),
and could not correlate a failure window with code changes because
agents cannot run git in CI.

doctor.sh prepare gains an optional --repo org/name flag: clone to
WORKDIR/src/<name> (deliberately separate from WORKDIR/<name>, which
fix-test-bugs owns and configures with fork remotes) plus a worktree
per analyzed release branch. Clone or worktree failures are non-fatal
and reported in the JSON summary's "source" field.

Add repo-log.sh, the permitted git-log mechanism for agents, to list
commits in a failure window with optional path filters.

The prow-job skill reads the failing test's source in the Characterize
phase (test/suites/, test/scenarios*/) and runs repo-log.sh in the
Corroborate phase when history yields a last-pass date; the doctor
skill passes --repo and the source path in agent prompts; fix-test-bugs
bases fixes on the causal chain when present.
The prow-job skill now emits confidence, causal_chain, analysis_gaps,
history, and scenarios in the STRUCTURED SUMMARY. Carry them through
parse.py (defensive defaults so older report files keep parsing),
aggregate.py issues (scenarios as the union across the grouped jobs),
and search-bugs.py candidates including the cross-source merge path.

Grouping and similarity logic are unchanged.
Per-issue details now show the evidence-cited causal chain (with file
references and verbatim quotes), an analysis-confidence badge next to
the issue title so humans know which analyses to double-check first,
evidence gaps, and scenario chips identifying which of the job's
scenarios failed. All additions tolerate summary files produced before
these fields existed.
create-bugs: Jira descriptions gain the analysis confidence, affected
scenarios, and the evidence-cited root cause chain (omitted for legacy
candidates without these fields). A new optional Search D consults the
openshift-ci MCP's known-regression data once per session and feeds
linked Jira issues into the candidates' duplicates so existing triaged
bugs are updated instead of duplicated; it degrades silently when the
MCP is unavailable.

fix-test-bugs: new eligibility Gate 4 skips automated fixes for
candidates whose root cause confidence is low — a fix generated from an
unverified analysis is more likely to mask the problem than solve it.
Missing confidence (legacy candidates) still passes.
classify.py force-classified anything failing in an infra-* step as
infrastructure, and create-bugs unconditionally skips infrastructure
failures — so a product bug surfacing during an infra step was silently
dropped from bug creation.

Make the analysis flag tri-state: True still forces infrastructure;
an explicit False now skips the INFRA_STEP_PATTERNS override (the
analysis ruled infrastructure out); None — reports produced before the
field existed — keeps the pattern-based behavior, so legacy data
classifies exactly as before. Group classification combines flags with
the same precedence (any True, else any False, else None) via the new
combine_infrastructure_flags() used by aggregate.py and search-bugs.py.
Consolidate the artifact-anatomy knowledge that was scattered across
skills (and tribal): scenario naming, which log answers which question,
plain-text journal exports next to sosreport tarballs, on-failure vs
end-of-scenario sosreports, previous.log of restarted containers,
journal patterns for reconstructing component timelines, the per-
scenario 30m timeout cascade signature, and the job-level-only coverage
of Sippy/Search.CI. Referenced from the prow-job skill.
@coderabbitai coderabbitai Bot removed the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 16, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/microshift-ci/skills/create-bugs/SKILL.md (1)

128-154: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Search A must filter for open status to match Search B.

Search A examples (lines 134–140) lack AND status not in (Closed, Verified), while Search B examples (lines 148–151) include it. This inconsistency allows closed bugs to enter the duplicates array, triggering an incorrect "update" decision at line 274.

The instruction at line 181 says "duplicates array must contain ALL unique open bugs returned by searches A and B," but Search A has no status filter in the JQL examples, creating a source-of-truth gap between the search definition and the expected output. Per CONTRIBUTING.md, checks should be placed at the right lifecycle stage — the filter belongs in the Search A JQL, not in downstream post-filtering.

Add AND status not in (Closed, Verified) to Search A examples to ensure consistency:

🔧 Proposed fix for Search A examples
 **Example: candidate.keywords = ["invalidclienttokenid", "cloudformation", "createstack", "aws-2"]**
-# Search A1: most distinctive keyword
-mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "invalidclienttokenid" ...', limit=5)
-# Search A2: second keyword
-mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "cloudformation" ...', limit=5)
+# Search A1: most distinctive keyword
+mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "invalidclienttokenid" AND status not in (Closed, Verified) ...', limit=5)
+# Search A2: second keyword
+mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "cloudformation" AND status not in (Closed, Verified) ...', limit=5)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/create-bugs/SKILL.md` around lines 128 - 154,
The Search A examples in the documentation are missing the status filter that
Search B examples include, which causes closed bugs to be included in the
duplicates array and triggers incorrect update decisions. Add the status filter
`AND status not in (Closed, Verified)` to both Search A JQL example queries (the
mcp__jira__jira_search calls showing searches for "invalidclienttokenid" and
"cloudformation") to match the status filtering already present in the Search B
examples. This ensures consistency between Search A and Search B and guarantees
that only open bugs are returned in both cases.
plugins/shared/scripts/parse.py (1)

26-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve tri-state semantics for infrastructure_failure instead of coercing unknown values to False.

Line 26 currently coerces non-string values with bool(value), so None (or malformed values) can be interpreted as explicit False, which changes downstream classification behavior from “unknown” to “ruled out infrastructure.”

Suggested fix
 def _parse_bool(value):
+    if value is None:
+        return None
+    if isinstance(value, bool):
+        return value
     if isinstance(value, str):
-        return value.lower() == "true"
-    return bool(value)
+        normalized = value.strip().lower()
+        if normalized == "true":
+            return True
+        if normalized == "false":
+            return False
+        return None
+    return None

Also applies to: 77-80

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/parse.py` around lines 26 - 29, The _parse_bool
function at lines 26-29 uses bool(value) to coerce non-string values, which
converts None and other unknown/malformed values to False, breaking tri-state
semantics needed for infrastructure_failure classification. Instead of using
bool(value), preserve None and unknown states by returning the value as-is or
explicitly returning None for non-string inputs, rather than coercing to False.
This same fix needs to be applied at lines 77-80 where identical coercion logic
appears.
🧹 Nitpick comments (2)
plugins/microshift-ci/skills/create-bugs/SKILL.md (1)

1-606: ⚖️ Poor tradeoff

Refactor to keep SKILL.md under 500 lines by extracting Jira search reference material.

The file is 606 lines, exceeding the SKILL-GUIDELINES max of 500 lines (by 106). Step 2 (Jira searches) spans lines 128–196 and is reference/pattern material that could move to a separate references/jira-search-patterns.md file. This improves maintainability and allows Step 2 instructions in SKILL.md to remain concise (pointing to the reference) while preserving the search logic for readers.

Suggested extraction:

  • Create plugins/microshift-ci/skills/create-bugs/references/jira-search-patterns.md
  • Move lines 128–196 (Search A, B, C detailed examples) to the reference file
  • Replace Step 2 with a brief pointer: See [Jira Search Patterns](references/jira-search-patterns.md) for detailed Search A/B/C examples and implementation guidance
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/create-bugs/SKILL.md` around lines 1 - 606, The
SKILL.md file exceeds the 500-line guideline (currently 606 lines). Extract the
detailed Jira search reference material from Step 2 (currently spanning the
Search A, B, and C sections with their examples and implementation guidance)
into a new file at
`plugins/microshift-ci/skills/create-bugs/references/jira-search-patterns.md`.
Create this reference file with the full search logic, query examples, and
implementation notes. Then replace the extracted content in SKILL.md's Step 2
with a brief pointer like "See [Jira Search
Patterns](references/jira-search-patterns.md) for detailed Search A/B/C examples
and implementation guidance." This preserves the search logic while reducing
SKILL.md to under 500 lines and improving maintainability.
plugins/microshift-ci/skills/prow-job/SKILL.md (1)

152-154: 💤 Low value

Quote the ${TMP} variable in Bash examples.

The Bash code blocks at lines 153 and 169 contain unquoted ${TMP} in the gsutil command. While the work directory path is controlled and unlikely to contain spaces, quoting all variables is a shell scripting best practice and aligns with the repository's shell script guidelines. The LLM agent will execute these examples, so they should demonstrate correct quoting.

📝 Proposed fix

Line 153:

-gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
+gsutil -q -m cp -r "${GCS_PATH}/" "${TMP}"/

Line 169:

-   gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
+   gsutil -q -m cp -r "${GCS_PATH}/" "${TMP}"/

Also applies to: 168-169

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 152 - 154, The
Bash code examples in the SKILL.md file contain unquoted variables in the gsutil
commands. At line 153 in the first gsutil command and at line 169 in the second
gsutil command, wrap the ${TMP} variable with double quotes to change ${TMP}/ to
"${TMP}"/ in both locations. This follows shell scripting best practices for
variable quoting to prevent potential issues with whitespace or special
characters, and ensures the examples demonstrate correct Bash syntax for LLM
agents executing them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/shared/scripts/parse.py`:
- Around line 74-99: Add comprehensive test cases for the parsing logic in
parse_structured_summary() covering all new validation branches. Create tests
for infrastructure_failure with explicit True/False values, string variants
("true"/"false"), and missing field (expecting None); for causal_chain with
valid dicts containing "cause" key, invalid entries (non-dict or missing "cause"
key), and enforce the 5-item limit; for analysis_gaps and scenarios with mixed
valid/invalid elements including non-strings and null values; and edge cases
with empty lists, null values, and malformed JSON inputs. These tests should
verify both the filtering logic and the default values applied when fields are
absent, ensuring the tri-state and list comprehension behavior is correctly
validated.

---

Outside diff comments:
In `@plugins/microshift-ci/skills/create-bugs/SKILL.md`:
- Around line 128-154: The Search A examples in the documentation are missing
the status filter that Search B examples include, which causes closed bugs to be
included in the duplicates array and triggers incorrect update decisions. Add
the status filter `AND status not in (Closed, Verified)` to both Search A JQL
example queries (the mcp__jira__jira_search calls showing searches for
"invalidclienttokenid" and "cloudformation") to match the status filtering
already present in the Search B examples. This ensures consistency between
Search A and Search B and guarantees that only open bugs are returned in both
cases.

In `@plugins/shared/scripts/parse.py`:
- Around line 26-29: The _parse_bool function at lines 26-29 uses bool(value) to
coerce non-string values, which converts None and other unknown/malformed values
to False, breaking tri-state semantics needed for infrastructure_failure
classification. Instead of using bool(value), preserve None and unknown states
by returning the value as-is or explicitly returning None for non-string inputs,
rather than coercing to False. This same fix needs to be applied at lines 77-80
where identical coercion logic appears.

---

Nitpick comments:
In `@plugins/microshift-ci/skills/create-bugs/SKILL.md`:
- Around line 1-606: The SKILL.md file exceeds the 500-line guideline (currently
606 lines). Extract the detailed Jira search reference material from Step 2
(currently spanning the Search A, B, and C sections with their examples and
implementation guidance) into a new file at
`plugins/microshift-ci/skills/create-bugs/references/jira-search-patterns.md`.
Create this reference file with the full search logic, query examples, and
implementation notes. Then replace the extracted content in SKILL.md's Step 2
with a brief pointer like "See [Jira Search
Patterns](references/jira-search-patterns.md) for detailed Search A/B/C examples
and implementation guidance." This preserves the search logic while reducing
SKILL.md to under 500 lines and improving maintainability.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 152-154: The Bash code examples in the SKILL.md file contain
unquoted variables in the gsutil commands. At line 153 in the first gsutil
command and at line 169 in the second gsutil command, wrap the ${TMP} variable
with double quotes to change ${TMP}/ to "${TMP}"/ in both locations. This
follows shell scripting best practices for variable quoting to prevent potential
issues with whitespace or special characters, and ensures the examples
demonstrate correct Bash syntax for LLM agents executing them.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 544c4539-5f64-4347-8797-f0df7fd7dfdf

📥 Commits

Reviewing files that changed from the base of the PR and between b94044d and 535229a.

📒 Files selected for processing (17)
  • plugins/lvms-ci/scripts/extract-sosreport.sh
  • plugins/lvms-ci/scripts/repo-log.sh
  • plugins/microshift-ci/scripts/extract-sosreport.sh
  • plugins/microshift-ci/scripts/repo-log.sh
  • plugins/microshift-ci/scripts/search-bugs.py
  • plugins/microshift-ci/skills/create-bugs/SKILL.md
  • plugins/microshift-ci/skills/doctor/SKILL.md
  • plugins/microshift-ci/skills/fix-test-bugs/SKILL.md
  • plugins/microshift-ci/skills/prow-job/SKILL.md
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
  • plugins/shared/scripts/aggregate.py
  • plugins/shared/scripts/classify.py
  • plugins/shared/scripts/create-report.py
  • plugins/shared/scripts/doctor.sh
  • plugins/shared/scripts/extract-sosreport.sh
  • plugins/shared/scripts/parse.py
  • plugins/shared/scripts/repo-log.sh
✅ Files skipped from review due to trivial changes (3)
  • plugins/lvms-ci/scripts/repo-log.sh
  • plugins/microshift-ci/scripts/extract-sosreport.sh
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • plugins/lvms-ci/scripts/extract-sosreport.sh
  • plugins/microshift-ci/scripts/search-bugs.py
  • plugins/microshift-ci/skills/fix-test-bugs/SKILL.md
  • plugins/shared/scripts/create-report.py
  • plugins/microshift-ci/scripts/repo-log.sh

Comment thread plugins/shared/scripts/parse.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/shared/scripts/doctor.sh (1)

211-229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate reused source directories before publishing them.

This reuses ${WORKDIR}/src/${repo_name} and existing worktree paths based only on directory presence. Because repo_name drops the org, rerunning with another ORG/<same-name> or a stale partial worktree can publish incorrect source.repo_dir / source.worktrees to downstream analysis.

Proposed fix
         if [[ -d "${src_dir}/.git" ]]; then
-            echo "  Already cloned: ${src_dir}" >&2
+            local expected_origin="https://github.com/${repo}.git"
+            local actual_origin
+            actual_origin=$(git -C "${src_dir}" remote get-url origin 2>/dev/null || true)
+            if [[ "${actual_origin}" != "${expected_origin}" ]]; then
+                src_error="existing checkout at ${src_dir} points to ${actual_origin:-unknown origin}, not ${expected_origin}"
+                echo "  WARNING: ${src_error}" >&2
+                src_dir=""
+            else
+                echo "  Already cloned: ${src_dir}" >&2
+            fi
         else
             mkdir -p "${WORKDIR}/src"
             if ! git clone --quiet "https://github.com/${repo}.git" "${src_dir}" >&2; then
                 src_error="git clone failed for ${repo}"
@@
                 wt="${WORKDIR}/src/${repo_name}-release-${release}"
                 if [[ -d "${wt}" ]]; then
-                    src_worktrees["${release}"]="${wt}"
-                    continue
+                    if git -C "${wt}" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
+                        src_worktrees["${release}"]="${wt}"
+                        continue
+                    fi
+                    echo "  WARNING: existing worktree path is not a git worktree: ${wt}" >&2
+                    continue
                 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/doctor.sh` around lines 211 - 229, The code reuses
source directories and worktree paths based solely on directory existence
checks, which can lead to incorrect source information being published when
running with different organizations or stale partial worktrees. To fix this,
validate the cloned repository at src_dir by verifying its git remote matches
the expected repo URL before reusing it, and similarly validate worktree
directories at wt by checking they are valid git worktrees before adding them to
src_worktrees. This ensures that directory presence alone is not sufficient for
reuse decisions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@plugins/shared/scripts/doctor.sh`:
- Around line 211-229: The code reuses source directories and worktree paths
based solely on directory existence checks, which can lead to incorrect source
information being published when running with different organizations or stale
partial worktrees. To fix this, validate the cloned repository at src_dir by
verifying its git remote matches the expected repo URL before reusing it, and
similarly validate worktree directories at wt by checking they are valid git
worktrees before adding them to src_worktrees. This ensures that directory
presence alone is not sufficient for reuse decisions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dd85dcc9-4aa1-41bf-abdb-f70f1546956a

📥 Commits

Reviewing files that changed from the base of the PR and between 535229a and af83174.

📒 Files selected for processing (8)
  • plugins/microshift-ci/scripts/search-bugs.py
  • plugins/microshift-ci/skills/create-bugs/SKILL.md
  • plugins/microshift-ci/skills/prow-job/SKILL.md
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
  • plugins/shared/scripts/aggregate.py
  • plugins/shared/scripts/doctor.sh
  • plugins/shared/scripts/extract-sosreport.sh
  • plugins/shared/scripts/parse.py
💤 Files with no reviewable changes (3)
  • plugins/shared/scripts/parse.py
  • plugins/shared/scripts/aggregate.py
  • plugins/microshift-ci/scripts/search-bugs.py
✅ Files skipped from review due to trivial changes (1)
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugins/shared/scripts/extract-sosreport.sh
  • plugins/microshift-ci/skills/prow-job/SKILL.md

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 16, 2026
@pmtk

pmtk commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Addressed CodeRabbit review-body findings:

  • classify.py:44-46 — explicit False overridden by stack_layer: Applied — fixed in cf36b49.
  • parse.py:26-29 — _parse_bool coerces None to False: Applied — fixed in cf36b49.
  • create-bugs/SKILL.md:128-154 — Search A missing status filter: Won't fix — by design, Search A is intentionally broader; downstream logic separates open/closed results into duplicates vs regressions arrays.
  • prow-job/SKILL.md:6 — undeclared tools in allowed-tools: Won't fix — MCP tools are discovered at runtime, not gated by allowed-tools.
  • prow-job/SKILL.md:200-205, 260, 271, 296 — restructure as CRITICAL blocks: Won't fix — formatting nits.
  • prow-job/SKILL.md:1-7 — split file to under 500 lines: Won't fix — file is already 374 lines.
  • create-bugs/SKILL.md:1-606 — extract Jira search patterns: Won't fix — refactoring advice, not a bug.
  • prow-job/SKILL.md:152-154 — quote ${TMP}: Won't fix — TMP is set by mktemp -d, never contains spaces.

@pmtk pmtk marked this pull request as ready for review June 16, 2026 17:24
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@pmtk pmtk changed the title CI Doctor Root Cause Analysis improvements USHIFT-7198: CI Doctor Root Cause Analysis improvements Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant