USHIFT-7198: CI Doctor Root Cause Analysis improvements#193
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pmtk The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds two shared Bash scripts ( ChangesCausal-chain analysis pipeline and investigation metadata propagation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 9 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/shared/scripts/classify.py (1)
20-49:⚠️ Potential issue | 🟠 Major | ⚡ Quick winExplicit
Falsecan still be reclassified as infrastructure.
combine_infrastructure_flags()defines explicitFalseas “infrastructure ruled out”, butclassify_breakdown()can still return"infrastructure"viastack_layermatching. That breaks the tri-state precedence and can incorrectly suppress product-bug candidates downstream.Suggested fix
- lower = stack_layer.lower() - if lower in INFRA_LAYERS: + lower = stack_layer.lower() + if infrastructure_failure is None and lower in INFRA_LAYERS: return "infrastructure"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/classify.py` around lines 20 - 49, The function classify_breakdown has a tri-state logic issue where explicit False (indicating infrastructure was ruled out) is not properly enforced throughout the classification chain. Currently, when infrastructure_failure is explicitly False, the function can still return "infrastructure" based on stack_layer matching against INFRA_LAYERS. To fix this, add a guard condition before the stack_layer checks (the section checking INFRA_LAYERS and BUILD_LAYERS) to skip the INFRA_LAYERS classification when infrastructure_failure is explicitly False, ensuring that explicit False truly prevents infrastructure classification as intended by the tri-state design.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/microshift-ci/skills/create-bugs/SKILL.md`:
- Around line 175-177: In the Search D section ("Known-regression lookup"),
modify the instruction that adds linked Jira issues to the candidate's
`duplicates` array to include an explicit status filter. The current text adds
all linked issues without checking their status, which allows closed or verified
issues to reach the `duplicates` array and incorrectly suppress
create/regression handling in the auto-decision logic. Update the instruction to
add only non-closed, non-verified issues to `duplicates`, and either route
closed/verified linked issues to `regressions` instead or explicitly state they
should be ignored for update decision purposes.
In `@plugins/microshift-ci/skills/fix-test-bugs/SKILL.md`:
- Around line 94-99: The documentation adds Gate 4 (confidence check) to the
gates applied to candidate groups, but the Step 2 summary template does not
include a corresponding skip bucket for low-confidence candidates in its skip
breakdown schema. Update the Step 2 summary template section in the
documentation to explicitly include a low-confidence skip category in the skip
breakdown breakdown structure, ensuring the documented output schema accounts
for all skip reasons introduced by the gates (Gates 2, 3, and 4). This makes the
output structure consistent with the applied gates as required by coding
guidelines.
In `@plugins/shared/scripts/extract-sosreport.sh`:
- Line 64: The find command on line 64 that searches for sosreport-*.tar.xz
files will match directories and symlinks in addition to regular files, causing
tar -xf to fail when processing non-file objects. Add the -type f flag to the
find command to restrict the discovery to regular files only, preventing the
script from attempting to extract non-file filesystem objects and failing under
the set -e directive.
---
Outside diff comments:
In `@plugins/shared/scripts/classify.py`:
- Around line 20-49: The function classify_breakdown has a tri-state logic issue
where explicit False (indicating infrastructure was ruled out) is not properly
enforced throughout the classification chain. Currently, when
infrastructure_failure is explicitly False, the function can still return
"infrastructure" based on stack_layer matching against INFRA_LAYERS. To fix
this, add a guard condition before the stack_layer checks (the section checking
INFRA_LAYERS and BUILD_LAYERS) to skip the INFRA_LAYERS classification when
infrastructure_failure is explicitly False, ensuring that explicit False truly
prevents infrastructure classification as intended by the tri-state design.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a2591fc2-e19a-41f4-8c0b-0aba8cd473e2
📒 Files selected for processing (17)
plugins/lvms-ci/scripts/extract-sosreport.shplugins/lvms-ci/scripts/repo-log.shplugins/microshift-ci/scripts/extract-sosreport.shplugins/microshift-ci/scripts/repo-log.shplugins/microshift-ci/scripts/search-bugs.pyplugins/microshift-ci/skills/create-bugs/SKILL.mdplugins/microshift-ci/skills/doctor/SKILL.mdplugins/microshift-ci/skills/fix-test-bugs/SKILL.mdplugins/microshift-ci/skills/prow-job/SKILL.mdplugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.mdplugins/shared/scripts/aggregate.pyplugins/shared/scripts/classify.pyplugins/shared/scripts/create-report.pyplugins/shared/scripts/doctor.shplugins/shared/scripts/extract-sosreport.shplugins/shared/scripts/parse.pyplugins/shared/scripts/repo-log.sh
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/microshift-ci/skills/prow-job/SKILL.md (1)
6-6:⚠️ Potential issue | 🟠 MajorGuidance references undeclared tools:
allowed-toolsmismatch.Line 209 directs users to use
mcp__openshift-ci__get_releasesandmcp__openshift-ci__get_payload_statusfor release-controller queries (as alternatives to WebFetch), but neither tool is declared inallowed-toolson line 6. The declared tools are onlymcp__openshift-ci__get_job_runs,mcp__openshift-ci__get_job_report, andmcp__openshift-ci__search_ci_logs.Either add these tools to
allowed-toolson line 6, or clarify/remove the reference in the guidance if they are not intended for this workflow.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` at line 6, The allowed-tools declaration on line 6 is missing two MCP tools (mcp__openshift-ci__get_releases and mcp__openshift-ci__get_payload_status) that are referenced in the guidance documentation to users. Add these two missing tools to the allowed-tools list in the SKILL.md file so that the guidance and the declared tools match, or alternatively remove the references to these tools from the guidance if they are not intended for this workflow.
🧹 Nitpick comments (1)
plugins/microshift-ci/skills/prow-job/SKILL.md (1)
200-205: ⚡ Quick winConsider restructuring critical error-handling guidance per established learning.
Lines 200–205 contain critical rules about product-vs-test attribution and container restart detection. These are core to the investigation workflow and are properly co-located inline (not in a separate Edge Cases section), which aligns with your learning from PR 192. However, the learning also suggests using structured "CRITICAL" blocks and "Error Handling" bullet lists for hard stops, rather than inline narrative with bold emphasis.
For example, line 204's "Always check for container restarts." would be clearer as:
CRITICAL: Always inspect pod restart events in the journal. - Grep for repeated `Created container`/`Started container` (crio) and `RemoveContainer`/PLEG events (kubelet). - Two container instances for one pod means the first one DIED — a single startup story is the wrong narrative. - Read the dead container's `previous.log` at `sos_commands/microshift/namespaces/<namespace>/pods/<pod>/<container>/<container>/logs/previous.log`.This restructuring improves clarity for LLM and human readers by separating the policy from the narrative, reducing the chance of misinterpretation during execution.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 200 - 205, The critical error-handling guidance in this section is currently presented as inline narrative with bold emphasis, which can be ambiguous for both LLM and human readers. Restructure the guidance by converting hard-stop rules into distinct "CRITICAL:" blocks followed by organized bullet lists that separate the policy statement from supporting details and examples. Specifically, restructure the "A test-layer fix is never the bottom" paragraph and the "Always check for container restarts" paragraph using this format: begin with a CRITICAL header stating the rule clearly, then provide implementation steps as bullets with supporting context. Apply this pattern consistently across all critical guidance sections to reduce misinterpretation and improve clarity of the investigation workflow.Source: Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Line 6: The allowed-tools declaration on line 6 is missing two MCP tools
(mcp__openshift-ci__get_releases and mcp__openshift-ci__get_payload_status) that
are referenced in the guidance documentation to users. Add these two missing
tools to the allowed-tools list in the SKILL.md file so that the guidance and
the declared tools match, or alternatively remove the references to these tools
from the guidance if they are not intended for this workflow.
---
Nitpick comments:
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 200-205: The critical error-handling guidance in this section is
currently presented as inline narrative with bold emphasis, which can be
ambiguous for both LLM and human readers. Restructure the guidance by converting
hard-stop rules into distinct "CRITICAL:" blocks followed by organized bullet
lists that separate the policy statement from supporting details and examples.
Specifically, restructure the "A test-layer fix is never the bottom" paragraph
and the "Always check for container restarts" paragraph using this format: begin
with a CRITICAL header stating the rule clearly, then provide implementation
steps as bullets with supporting context. Apply this pattern consistently across
all critical guidance sections to reduce misinterpretation and improve clarity
of the investigation workflow.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ad7c1c04-30cb-439c-94f0-b757ca72d0a4
📒 Files selected for processing (1)
plugins/microshift-ci/skills/prow-job/SKILL.md
There was a problem hiding this comment.
🧹 Nitpick comments (4)
plugins/microshift-ci/skills/prow-job/SKILL.md (4)
271-273: 💤 Low valueMark the product-vs-test attribution rule as CRITICAL.
The rule at line 271 ("A test-layer fix is never the bottom...") is a hard-stop constraint that fundamentally shapes the causal-chain analysis. Per retrieved learnings, mark it as
**CRITICAL**:to signal non-negotiable importance.Reformat from:
- **A test-layer fix is never the bottom when a product component misbehaved.** When the failure involves...To:
- **CRITICAL — A test-layer fix is never the bottom when a product component misbehaved.** When the failure involves...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 271 - 273, Add a "CRITICAL — " prefix to the rule about test-layer fixes starting with "A test-layer fix is never the bottom when a product component misbehaved." in the SKILL.md file. Change the line from "- **A test-layer fix is never the bottom..." to "- **CRITICAL — A test-layer fix is never the bottom..." to clearly signal that this is a non-negotiable constraint that fundamentally shapes causal-chain analysis and attribution decisions.Source: Learnings
260-260: 💤 Low valueRestructure Phase 3b error handling as a distinct block.
Per retrieved learnings, error-handling rules should be co-located inline but formatted as distinct
Error Handling:blocks (not embedded in prose). Line 260 contains the rule but as a clause within the narrative.Reformat to:
After all sub-agents return, proceed to **Phase 4b (Cross-Scenario Synthesis)**. **Error Handling**: if a sub-agent returns no `--- SCENARIO ANALYSIS ---` block or returns malformed JSON, record that scenario with `confidence: "low"`, `analysis_gaps: ["sub-agent analysis failed for this scenario"]`, and use whatever information you gathered during the Localize phase (step name, junit failure count).This makes the error recovery explicit and easier for the agent to reference during execution.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` at line 260, The error handling rule for Phase 3b (around line 260) that describes how to handle sub-agent failures returning no SCENARIO ANALYSIS block or malformed JSON is currently embedded within narrative prose. Extract this rule and reformat it as a distinct, separate "**Error Handling**:" block that appears after the main Phase 3b instructions but before transitioning to Phase 4b. This makes the error recovery procedure explicit and easier for the agent to reference during execution, following the established pattern of co-locating error-handling rules inline as formatted distinct blocks rather than as embedded clauses.Source: Learnings
296-296: 💤 Low valueMark the WebFetch prohibition as a hard-stop CRITICAL block.
Per retrieved learnings, hard-stop failure policies should be marked with
**CRITICAL**:and positioned immediately before the relevant tool invocation. The WebFetch prohibition (line 296) is a critical safety rule but is currently embedded in prose.Reformat to:
- **CRITICAL**: Never use WebFetch to query the release controller API (e.g., `amd64.ocp.releases.ci.openshift.org`). Use the openshift-ci MCP tools instead: `mcp__openshift-ci__get_releases` to list release streams, `mcp__openshift-ci__get_payload_status` for payload acceptance status. WebFetch is not permitted in CI and will cause the job to fail.This ensures the agent reads and respects the safety rule at the moment it considers tool choices.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` at line 296, The WebFetch prohibition statement needs to be reformatted to clearly mark it as a critical hard-stop rule. Find the bullet point containing "Never use WebFetch to query the release controller API" and add the prefix `**CRITICAL**: ` immediately before the word "Never" to ensure the safety rule is prominently displayed and reads as a mandatory constraint when the agent considers tool choices.Source: Learnings
1-7: ⚡ Quick winReduce skill file to under 500 lines by relocating output schema to separate reference.
Per SKILL-GUIDELINES.md, keep SKILL.md files under 500 lines; move reference material into
references/or split into sub-skills. The core skill content (lines 1–330) is necessary for the multi-phase orchestration, but the Output Template section (lines 331–437) adds ~110 lines of JSON schema documentation that can be cleanly separated.
plugins/microshift-ci/skills/prow-job/SKILL.md#L1-L330: Keep the core skill workflow and output template format (lines 331–342, the human-readable report template itself)plugins/microshift-ci/skills/prow-job/SKILL.md#L344-L437: Move toplugins/microshift-ci/skills/prow-job/references/output-schema.md(severity rubric, field descriptions, confidence rules, RAW_ERROR rules, ROOT_CAUSE rules, examples)plugins/microshift-ci/skills/prow-job/SKILL.md#L331-L342: Update output template to reference the schema file: "Seereferences/output-schema.mdfor field definitions, confidence rules, and examples."This achieves the 500-line limit while keeping reference material in the
references/directory as intended by the guideline.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 1 - 7, The SKILL.md file needs to be reduced below 500 lines by extracting the detailed output schema documentation. In plugins/microshift-ci/skills/prow-job/SKILL.md, keep the core skill workflow (lines 1–330) and the output template format header (lines 331–342), then delete the detailed schema content (lines 344–437 containing severity rubric, field descriptions, confidence rules, RAW_ERROR rules, ROOT_CAUSE rules, and examples). Create a new file at plugins/microshift-ci/skills/prow-job/references/output-schema.md and move the deleted schema content there. Finally, update the output template section in SKILL.md (around line 331–342) to add a note referencing the new schema file with the message "See references/output-schema.md for field definitions, confidence rules, and examples." This follows the same pattern as the existing microshift-ci-primer.md reference file in the references directory.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 271-273: Add a "CRITICAL — " prefix to the rule about test-layer
fixes starting with "A test-layer fix is never the bottom when a product
component misbehaved." in the SKILL.md file. Change the line from "- **A
test-layer fix is never the bottom..." to "- **CRITICAL — A test-layer fix is
never the bottom..." to clearly signal that this is a non-negotiable constraint
that fundamentally shapes causal-chain analysis and attribution decisions.
- Line 260: The error handling rule for Phase 3b (around line 260) that
describes how to handle sub-agent failures returning no SCENARIO ANALYSIS block
or malformed JSON is currently embedded within narrative prose. Extract this
rule and reformat it as a distinct, separate "**Error Handling**:" block that
appears after the main Phase 3b instructions but before transitioning to Phase
4b. This makes the error recovery procedure explicit and easier for the agent to
reference during execution, following the established pattern of co-locating
error-handling rules inline as formatted distinct blocks rather than as embedded
clauses.
- Line 296: The WebFetch prohibition statement needs to be reformatted to
clearly mark it as a critical hard-stop rule. Find the bullet point containing
"Never use WebFetch to query the release controller API" and add the prefix
`**CRITICAL**: ` immediately before the word "Never" to ensure the safety rule
is prominently displayed and reads as a mandatory constraint when the agent
considers tool choices.
- Around line 1-7: The SKILL.md file needs to be reduced below 500 lines by
extracting the detailed output schema documentation. In
plugins/microshift-ci/skills/prow-job/SKILL.md, keep the core skill workflow
(lines 1–330) and the output template format header (lines 331–342), then delete
the detailed schema content (lines 344–437 containing severity rubric, field
descriptions, confidence rules, RAW_ERROR rules, ROOT_CAUSE rules, and
examples). Create a new file at
plugins/microshift-ci/skills/prow-job/references/output-schema.md and move the
deleted schema content there. Finally, update the output template section in
SKILL.md (around line 331–342) to add a note referencing the new schema file
with the message "See references/output-schema.md for field definitions,
confidence rules, and examples." This follows the same pattern as the existing
microshift-ci-primer.md reference file in the references directory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dc75c525-4ab8-4b05-bb2d-195a75ecd029
📒 Files selected for processing (2)
plugins/microshift-ci/skills/prow-job/SKILL.mdplugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
The prow-job workflow prescribed surface-level analysis: scan the top build log, read context around the first error, report. Agents stopped at the symptom because nothing required them to go deeper, and the richest artifacts (junit/scenario logs, sosreports, pre-generated PCP graphs, job history) were never consulted. Replace the linear workflow with a phased one (Localize, Characterize, Drill down, Corroborate) that iterates hypothesis -> evidence until the cause is actionable, and extend the STRUCTURED SUMMARY with additive fields: causal_chain (evidence-cited links), confidence, analysis_gaps, and history. RAW_ERROR/ROOT_CAUSE rules are unchanged, so deterministic grouping and older report files keep working. Also add a severity rubric, document the per-test-failure sosreports produced by the sos-on-failure listener (they retain the test's namespaces and pod logs, unlike the end-of-scenario report), and point agents at the PCP graphs that the doctor flow already generates before analysis. The doctor skill's per-job agent prompts now pass job metadata and the graphs directory, instruct agents to pursue the underlying cause rather than the first error, and require a one-line DONE reply to protect the parent session's turn budget. Commands not allowlisted in CI (tar, gsutil) degrade explicitly: the skill records the gap in analysis_gaps instead of silently skipping, until wrapper scripts land in a follow-up.
Analysis agents in CI cannot run tar (the permission allowlist only permits bash/python3 on plugin scripts), which in practice meant sosreports were never consulted during root cause analysis. Wrap extraction in a shared script that finds all sosreport tarballs under a directory, extracts them idempotently (.extracted marker), and prints a JSON index: journal files, the per-namespace pod log tree, and pre-grepped high-signal lines (panics, OOM kills, leader election lost, dead-container previous.log content). The prow-job skill now invokes the script instead of instructing raw tar usage. Symlinked into both microshift-ci and lvms-ci plugins like the other shared scripts.
Analyses could not distinguish test bugs from product bugs because the source was only cloned by fix-test-bugs (main branch, after analysis), and could not correlate a failure window with code changes because agents cannot run git in CI. doctor.sh prepare gains an optional --repo org/name flag: clone to WORKDIR/src/<name> (deliberately separate from WORKDIR/<name>, which fix-test-bugs owns and configures with fork remotes) plus a worktree per analyzed release branch. Clone or worktree failures are non-fatal and reported in the JSON summary's "source" field. Add repo-log.sh, the permitted git-log mechanism for agents, to list commits in a failure window with optional path filters. The prow-job skill reads the failing test's source in the Characterize phase (test/suites/, test/scenarios*/) and runs repo-log.sh in the Corroborate phase when history yields a last-pass date; the doctor skill passes --repo and the source path in agent prompts; fix-test-bugs bases fixes on the causal chain when present.
The prow-job skill now emits confidence, causal_chain, analysis_gaps, history, and scenarios in the STRUCTURED SUMMARY. Carry them through parse.py (defensive defaults so older report files keep parsing), aggregate.py issues (scenarios as the union across the grouped jobs), and search-bugs.py candidates including the cross-source merge path. Grouping and similarity logic are unchanged.
Per-issue details now show the evidence-cited causal chain (with file references and verbatim quotes), an analysis-confidence badge next to the issue title so humans know which analyses to double-check first, evidence gaps, and scenario chips identifying which of the job's scenarios failed. All additions tolerate summary files produced before these fields existed.
create-bugs: Jira descriptions gain the analysis confidence, affected scenarios, and the evidence-cited root cause chain (omitted for legacy candidates without these fields). A new optional Search D consults the openshift-ci MCP's known-regression data once per session and feeds linked Jira issues into the candidates' duplicates so existing triaged bugs are updated instead of duplicated; it degrades silently when the MCP is unavailable. fix-test-bugs: new eligibility Gate 4 skips automated fixes for candidates whose root cause confidence is low — a fix generated from an unverified analysis is more likely to mask the problem than solve it. Missing confidence (legacy candidates) still passes.
classify.py force-classified anything failing in an infra-* step as infrastructure, and create-bugs unconditionally skips infrastructure failures — so a product bug surfacing during an infra step was silently dropped from bug creation. Make the analysis flag tri-state: True still forces infrastructure; an explicit False now skips the INFRA_STEP_PATTERNS override (the analysis ruled infrastructure out); None — reports produced before the field existed — keeps the pattern-based behavior, so legacy data classifies exactly as before. Group classification combines flags with the same precedence (any True, else any False, else None) via the new combine_infrastructure_flags() used by aggregate.py and search-bugs.py.
Consolidate the artifact-anatomy knowledge that was scattered across skills (and tribal): scenario naming, which log answers which question, plain-text journal exports next to sosreport tarballs, on-failure vs end-of-scenario sosreports, previous.log of restarted containers, journal patterns for reconstructing component timelines, the per- scenario 30m timeout cascade signature, and the job-level-only coverage of Sippy/Search.CI. Referenced from the prow-job skill.
Update agent instructions to read pre-built sosreport index
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
plugins/microshift-ci/skills/create-bugs/SKILL.md (1)
128-154:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSearch A must filter for open status to match Search B.
Search A examples (lines 134–140) lack
AND status not in (Closed, Verified), while Search B examples (lines 148–151) include it. This inconsistency allows closed bugs to enter theduplicatesarray, triggering an incorrect "update" decision at line 274.The instruction at line 181 says "
duplicatesarray must contain ALL unique open bugs returned by searches A and B," but Search A has no status filter in the JQL examples, creating a source-of-truth gap between the search definition and the expected output. Per CONTRIBUTING.md, checks should be placed at the right lifecycle stage — the filter belongs in the Search A JQL, not in downstream post-filtering.Add
AND status not in (Closed, Verified)to Search A examples to ensure consistency:🔧 Proposed fix for Search A examples
**Example: candidate.keywords = ["invalidclienttokenid", "cloudformation", "createstack", "aws-2"]** -# Search A1: most distinctive keyword -mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "invalidclienttokenid" ...', limit=5) -# Search A2: second keyword -mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "cloudformation" ...', limit=5) +# Search A1: most distinctive keyword +mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "invalidclienttokenid" AND status not in (Closed, Verified) ...', limit=5) +# Search A2: second keyword +mcp__jira__jira_search(jql='... AND issuetype = Bug AND text ~ "cloudformation" AND status not in (Closed, Verified) ...', limit=5)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/create-bugs/SKILL.md` around lines 128 - 154, The Search A examples in the documentation are missing the status filter that Search B examples include, which causes closed bugs to be included in the duplicates array and triggers incorrect update decisions. Add the status filter `AND status not in (Closed, Verified)` to both Search A JQL example queries (the mcp__jira__jira_search calls showing searches for "invalidclienttokenid" and "cloudformation") to match the status filtering already present in the Search B examples. This ensures consistency between Search A and Search B and guarantees that only open bugs are returned in both cases.plugins/shared/scripts/parse.py (1)
26-29:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve tri-state semantics for
infrastructure_failureinstead of coercing unknown values toFalse.Line 26 currently coerces non-string values with
bool(value), soNone(or malformed values) can be interpreted as explicitFalse, which changes downstream classification behavior from “unknown” to “ruled out infrastructure.”Suggested fix
def _parse_bool(value): + if value is None: + return None + if isinstance(value, bool): + return value if isinstance(value, str): - return value.lower() == "true" - return bool(value) + normalized = value.strip().lower() + if normalized == "true": + return True + if normalized == "false": + return False + return None + return NoneAlso applies to: 77-80
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/parse.py` around lines 26 - 29, The _parse_bool function at lines 26-29 uses bool(value) to coerce non-string values, which converts None and other unknown/malformed values to False, breaking tri-state semantics needed for infrastructure_failure classification. Instead of using bool(value), preserve None and unknown states by returning the value as-is or explicitly returning None for non-string inputs, rather than coercing to False. This same fix needs to be applied at lines 77-80 where identical coercion logic appears.
🧹 Nitpick comments (2)
plugins/microshift-ci/skills/create-bugs/SKILL.md (1)
1-606: ⚖️ Poor tradeoffRefactor to keep SKILL.md under 500 lines by extracting Jira search reference material.
The file is 606 lines, exceeding the SKILL-GUIDELINES max of 500 lines (by 106). Step 2 (Jira searches) spans lines 128–196 and is reference/pattern material that could move to a separate
references/jira-search-patterns.mdfile. This improves maintainability and allows Step 2 instructions in SKILL.md to remain concise (pointing to the reference) while preserving the search logic for readers.Suggested extraction:
- Create
plugins/microshift-ci/skills/create-bugs/references/jira-search-patterns.md- Move lines 128–196 (Search A, B, C detailed examples) to the reference file
- Replace Step 2 with a brief pointer:
See [Jira Search Patterns](references/jira-search-patterns.md) for detailed Search A/B/C examples and implementation guidance🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/create-bugs/SKILL.md` around lines 1 - 606, The SKILL.md file exceeds the 500-line guideline (currently 606 lines). Extract the detailed Jira search reference material from Step 2 (currently spanning the Search A, B, and C sections with their examples and implementation guidance) into a new file at `plugins/microshift-ci/skills/create-bugs/references/jira-search-patterns.md`. Create this reference file with the full search logic, query examples, and implementation notes. Then replace the extracted content in SKILL.md's Step 2 with a brief pointer like "See [Jira Search Patterns](references/jira-search-patterns.md) for detailed Search A/B/C examples and implementation guidance." This preserves the search logic while reducing SKILL.md to under 500 lines and improving maintainability.plugins/microshift-ci/skills/prow-job/SKILL.md (1)
152-154: 💤 Low valueQuote the
${TMP}variable in Bash examples.The Bash code blocks at lines 153 and 169 contain unquoted
${TMP}in thegsutilcommand. While the work directory path is controlled and unlikely to contain spaces, quoting all variables is a shell scripting best practice and aligns with the repository's shell script guidelines. The LLM agent will execute these examples, so they should demonstrate correct quoting.📝 Proposed fix
Line 153:
-gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/ +gsutil -q -m cp -r "${GCS_PATH}/" "${TMP}"/Line 169:
- gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/ + gsutil -q -m cp -r "${GCS_PATH}/" "${TMP}"/Also applies to: 168-169
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 152 - 154, The Bash code examples in the SKILL.md file contain unquoted variables in the gsutil commands. At line 153 in the first gsutil command and at line 169 in the second gsutil command, wrap the ${TMP} variable with double quotes to change ${TMP}/ to "${TMP}"/ in both locations. This follows shell scripting best practices for variable quoting to prevent potential issues with whitespace or special characters, and ensures the examples demonstrate correct Bash syntax for LLM agents executing them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/shared/scripts/parse.py`:
- Around line 74-99: Add comprehensive test cases for the parsing logic in
parse_structured_summary() covering all new validation branches. Create tests
for infrastructure_failure with explicit True/False values, string variants
("true"/"false"), and missing field (expecting None); for causal_chain with
valid dicts containing "cause" key, invalid entries (non-dict or missing "cause"
key), and enforce the 5-item limit; for analysis_gaps and scenarios with mixed
valid/invalid elements including non-strings and null values; and edge cases
with empty lists, null values, and malformed JSON inputs. These tests should
verify both the filtering logic and the default values applied when fields are
absent, ensuring the tri-state and list comprehension behavior is correctly
validated.
---
Outside diff comments:
In `@plugins/microshift-ci/skills/create-bugs/SKILL.md`:
- Around line 128-154: The Search A examples in the documentation are missing
the status filter that Search B examples include, which causes closed bugs to be
included in the duplicates array and triggers incorrect update decisions. Add
the status filter `AND status not in (Closed, Verified)` to both Search A JQL
example queries (the mcp__jira__jira_search calls showing searches for
"invalidclienttokenid" and "cloudformation") to match the status filtering
already present in the Search B examples. This ensures consistency between
Search A and Search B and guarantees that only open bugs are returned in both
cases.
In `@plugins/shared/scripts/parse.py`:
- Around line 26-29: The _parse_bool function at lines 26-29 uses bool(value) to
coerce non-string values, which converts None and other unknown/malformed values
to False, breaking tri-state semantics needed for infrastructure_failure
classification. Instead of using bool(value), preserve None and unknown states
by returning the value as-is or explicitly returning None for non-string inputs,
rather than coercing to False. This same fix needs to be applied at lines 77-80
where identical coercion logic appears.
---
Nitpick comments:
In `@plugins/microshift-ci/skills/create-bugs/SKILL.md`:
- Around line 1-606: The SKILL.md file exceeds the 500-line guideline (currently
606 lines). Extract the detailed Jira search reference material from Step 2
(currently spanning the Search A, B, and C sections with their examples and
implementation guidance) into a new file at
`plugins/microshift-ci/skills/create-bugs/references/jira-search-patterns.md`.
Create this reference file with the full search logic, query examples, and
implementation notes. Then replace the extracted content in SKILL.md's Step 2
with a brief pointer like "See [Jira Search
Patterns](references/jira-search-patterns.md) for detailed Search A/B/C examples
and implementation guidance." This preserves the search logic while reducing
SKILL.md to under 500 lines and improving maintainability.
In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 152-154: The Bash code examples in the SKILL.md file contain
unquoted variables in the gsutil commands. At line 153 in the first gsutil
command and at line 169 in the second gsutil command, wrap the ${TMP} variable
with double quotes to change ${TMP}/ to "${TMP}"/ in both locations. This
follows shell scripting best practices for variable quoting to prevent potential
issues with whitespace or special characters, and ensures the examples
demonstrate correct Bash syntax for LLM agents executing them.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 544c4539-5f64-4347-8797-f0df7fd7dfdf
📒 Files selected for processing (17)
plugins/lvms-ci/scripts/extract-sosreport.shplugins/lvms-ci/scripts/repo-log.shplugins/microshift-ci/scripts/extract-sosreport.shplugins/microshift-ci/scripts/repo-log.shplugins/microshift-ci/scripts/search-bugs.pyplugins/microshift-ci/skills/create-bugs/SKILL.mdplugins/microshift-ci/skills/doctor/SKILL.mdplugins/microshift-ci/skills/fix-test-bugs/SKILL.mdplugins/microshift-ci/skills/prow-job/SKILL.mdplugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.mdplugins/shared/scripts/aggregate.pyplugins/shared/scripts/classify.pyplugins/shared/scripts/create-report.pyplugins/shared/scripts/doctor.shplugins/shared/scripts/extract-sosreport.shplugins/shared/scripts/parse.pyplugins/shared/scripts/repo-log.sh
✅ Files skipped from review due to trivial changes (3)
- plugins/lvms-ci/scripts/repo-log.sh
- plugins/microshift-ci/scripts/extract-sosreport.sh
- plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
🚧 Files skipped from review as they are similar to previous changes (5)
- plugins/lvms-ci/scripts/extract-sosreport.sh
- plugins/microshift-ci/scripts/search-bugs.py
- plugins/microshift-ci/skills/fix-test-bugs/SKILL.md
- plugins/shared/scripts/create-report.py
- plugins/microshift-ci/scripts/repo-log.sh
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/shared/scripts/doctor.sh (1)
211-229:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate reused source directories before publishing them.
This reuses
${WORKDIR}/src/${repo_name}and existing worktree paths based only on directory presence. Becauserepo_namedrops the org, rerunning with anotherORG/<same-name>or a stale partial worktree can publish incorrectsource.repo_dir/source.worktreesto downstream analysis.Proposed fix
if [[ -d "${src_dir}/.git" ]]; then - echo " Already cloned: ${src_dir}" >&2 + local expected_origin="https://github.com/${repo}.git" + local actual_origin + actual_origin=$(git -C "${src_dir}" remote get-url origin 2>/dev/null || true) + if [[ "${actual_origin}" != "${expected_origin}" ]]; then + src_error="existing checkout at ${src_dir} points to ${actual_origin:-unknown origin}, not ${expected_origin}" + echo " WARNING: ${src_error}" >&2 + src_dir="" + else + echo " Already cloned: ${src_dir}" >&2 + fi else mkdir -p "${WORKDIR}/src" if ! git clone --quiet "https://github.com/${repo}.git" "${src_dir}" >&2; then src_error="git clone failed for ${repo}" @@ wt="${WORKDIR}/src/${repo_name}-release-${release}" if [[ -d "${wt}" ]]; then - src_worktrees["${release}"]="${wt}" - continue + if git -C "${wt}" rev-parse --is-inside-work-tree >/dev/null 2>&1; then + src_worktrees["${release}"]="${wt}" + continue + fi + echo " WARNING: existing worktree path is not a git worktree: ${wt}" >&2 + continue fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/doctor.sh` around lines 211 - 229, The code reuses source directories and worktree paths based solely on directory existence checks, which can lead to incorrect source information being published when running with different organizations or stale partial worktrees. To fix this, validate the cloned repository at src_dir by verifying its git remote matches the expected repo URL before reusing it, and similarly validate worktree directories at wt by checking they are valid git worktrees before adding them to src_worktrees. This ensures that directory presence alone is not sufficient for reuse decisions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@plugins/shared/scripts/doctor.sh`:
- Around line 211-229: The code reuses source directories and worktree paths
based solely on directory existence checks, which can lead to incorrect source
information being published when running with different organizations or stale
partial worktrees. To fix this, validate the cloned repository at src_dir by
verifying its git remote matches the expected repo URL before reusing it, and
similarly validate worktree directories at wt by checking they are valid git
worktrees before adding them to src_worktrees. This ensures that directory
presence alone is not sufficient for reuse decisions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dd85dcc9-4aa1-41bf-abdb-f70f1546956a
📒 Files selected for processing (8)
plugins/microshift-ci/scripts/search-bugs.pyplugins/microshift-ci/skills/create-bugs/SKILL.mdplugins/microshift-ci/skills/prow-job/SKILL.mdplugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.mdplugins/shared/scripts/aggregate.pyplugins/shared/scripts/doctor.shplugins/shared/scripts/extract-sosreport.shplugins/shared/scripts/parse.py
💤 Files with no reviewable changes (3)
- plugins/shared/scripts/parse.py
- plugins/shared/scripts/aggregate.py
- plugins/microshift-ci/scripts/search-bugs.py
✅ Files skipped from review due to trivial changes (1)
- plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
🚧 Files skipped from review as they are similar to previous changes (2)
- plugins/shared/scripts/extract-sosreport.sh
- plugins/microshift-ci/skills/prow-job/SKILL.md
|
Addressed CodeRabbit review-body findings:
|
Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores / Refactor