Add LVMS analyze skill for storage troubleshooting#194
Conversation
Move the LVMS analyze skill from openshift-eng/ai-helpers into the existing LVMS plugin. The skill diagnoses LVMS storage issues on live clusters (via oc commands) and offline must-gather data (via Python analysis script). Compared to the original, this version: - Consolidates the command + skill into a single SKILL.md (~130 lines vs ~1300 lines combined) - Improves the Python script to handle individual PV/SC files from must-gather alongside list-format YAML - Adds early PyYAML import check with a clear error message - Simplifies deduplication and resource extraction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: suleymanakbas91 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe LVMS plugin is updated to version 1.1.0 with the addition of a new ChangesLVMS v1.1.0: Analyze Skill Addition
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (8 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
plugins/lvms/skills/analyze/SKILL.md (1)
163-168: ⚡ Quick winInline common failure patterns with relevant workflow steps.
Per retrieved learning (PR 192: microshift-ci), failure policies and edge-case rules should be "co-located inline with the specific step that needs them" rather than in a separate reference section. The LLM reads linearly and may not reliably re-consult separate sections mid-execution.
Suggestion: Move each root-cause pattern (device filesystem conflict, insufficient capacity, duplicate VG names, node-specific failure) inline with the analysis step where it would be detected. For example, the "device filesystem conflict" pattern should live with Step 4's VG analysis block (lines 102–109), and the "insufficient capacity" pattern should follow the thin pool checks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/lvms/skills/analyze/SKILL.md` around lines 163 - 168, Move the common root cause patterns from the separate "Common Root Cause Patterns" section (currently at lines 163-168) inline with the relevant workflow steps earlier in the document. Specifically: embed the "Device filesystem conflict" pattern directly within Step 4's VG analysis block (lines 102-109), place the "Insufficient capacity" pattern inline with the thin pool capacity checks, and co-locate each remaining pattern (duplicate VG names, node-specific failure) with the specific analysis step where it would be detected. This ensures the LLM encounters failure patterns and edge-case rules in the same context where it performs the corresponding diagnostic check, rather than relying on separate reference sections that may not be consulted during linear execution.Source: Learnings
plugins/lvms/skills/analyze/scripts/analyze_lvms.py (1)
56-64: ⚡ Quick winMissing explicit file encoding in file operations. Both
load_yaml()and_parse_log()open files without specifying encoding, which defaults to the system locale and can cause issues with UTF-8 content on some platforms.
plugins/lvms/skills/analyze/scripts/analyze_lvms.py#L56-L64: Addencoding='utf-8'toopen(path)inload_yaml().plugins/lvms/skills/analyze/scripts/analyze_lvms.py#L213-L240: Addencoding='utf-8'toopen(path)in_parse_log().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py` around lines 56 - 64, File operations are missing explicit UTF-8 encoding specifications which can cause platform-specific issues. In plugins/lvms/skills/analyze/scripts/analyze_lvms.py, locate the `load_yaml()` function (lines 56-64) and add `encoding='utf-8'` parameter to the `open(path)` call. Then locate the `_parse_log()` function (lines 213-240) and add the same `encoding='utf-8'` parameter to its `open(path)` call. This ensures consistent UTF-8 file handling across all platforms.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py`:
- Around line 152-174: The code for loading PersistentVolumes and StorageClasses
processes both individual files and list files, which may contain the same
resources causing duplicates in self.pvs and self.storage_classes. To fix this,
deduplicate by resource UID: before appending a PV in the PV loading block
(where you check for topolvm.io driver) or a StorageClass in the StorageClass
loading block (where you check for topolvm.io provisioner), first check if a
resource with the same UID has already been added; use a set to track seen UIDs
across both the individual file and list file loops to prevent duplicates from
being appended.
- Around line 332-339: The if reason block (lines 332-339) that appends to
self.issues['critical'] is currently executing for all nodes regardless of
status, causing Ready nodes with informational reasons to be incorrectly flagged
as critical issues. Move the entire if reason block (including the print
statements and the critical append operation) inside the conditional that
handles non-Ready node statuses, so that only nodes without a Ready status will
have their reasons logged and flagged as critical issues.
---
Nitpick comments:
In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py`:
- Around line 56-64: File operations are missing explicit UTF-8 encoding
specifications which can cause platform-specific issues. In
plugins/lvms/skills/analyze/scripts/analyze_lvms.py, locate the `load_yaml()`
function (lines 56-64) and add `encoding='utf-8'` parameter to the `open(path)`
call. Then locate the `_parse_log()` function (lines 213-240) and add the same
`encoding='utf-8'` parameter to its `open(path)` call. This ensures consistent
UTF-8 file handling across all platforms.
In `@plugins/lvms/skills/analyze/SKILL.md`:
- Around line 163-168: Move the common root cause patterns from the separate
"Common Root Cause Patterns" section (currently at lines 163-168) inline with
the relevant workflow steps earlier in the document. Specifically: embed the
"Device filesystem conflict" pattern directly within Step 4's VG analysis block
(lines 102-109), place the "Insufficient capacity" pattern inline with the thin
pool capacity checks, and co-locate each remaining pattern (duplicate VG names,
node-specific failure) with the specific analysis step where it would be
detected. This ensures the LLM encounters failure patterns and edge-case rules
in the same context where it performs the corresponding diagnostic check, rather
than relying on separate reference sections that may not be consulted during
linear execution.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 5a67ca2a-3f8a-446e-8d1f-9c48aad30782
📒 Files selected for processing (5)
.claude-plugin/marketplace.jsonplugins/lvms/.claude-plugin/plugin.jsonplugins/lvms/README.mdplugins/lvms/skills/analyze/SKILL.mdplugins/lvms/skills/analyze/scripts/analyze_lvms.py
- Fix PV/SC duplicate loading by deduplicating by UID/name - Fix VG reason wrongly flagged as critical for Ready nodes - Add explicit encoding='utf-8' to file open calls - Move root cause patterns inline with relevant analysis steps Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
General OCP must-gathers don't include LVMS CRDs (LVMCluster), only pods/deployments/daemonsets. Downgrade missing LVMCluster from CRITICAL to WARNING when other LVMS components are present, and suggest using the LVMS-specific must-gather image for full CRD data. Also add a fallback search for individual lvmclusters/*.yaml files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
plugins/lvms/skills/analyze/scripts/analyze_lvms.py (2)
345-359:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winProgressing nodes with reasons are added to both warning and critical lists.
When a node has status
'Progressing'with a reason, line 347 adds it to warnings, but line 359 also adds it to critical issues. This double-counts the issue.Proposed fix - skip critical append for Progressing status
if reason and st != 'Ready': print(f"\n {BOLD}Reason:{END}") lines = reason.split('\n') for line in lines[:5]: print(f" {line}") if len(lines) > 5: print(f" ... (truncated, {len(lines) - 5} more lines)") - self.issues['critical'].append(f"VG {vg_name} on {node}: {reason[:200]}") + if st != 'Progressing': + self.issues['critical'].append(f"VG {vg_name} on {node}: {reason[:200]}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py` around lines 345 - 359, When a VG node has status 'Progressing' with a reason, it is being added to both the 'warning' list (at line 347) and the 'critical' list (at line 359), causing double-counting of the same issue. Modify the condition that controls whether to append to self.issues['critical'] on line 359 to exclude 'Progressing' status. The condition currently checks `if reason and st != 'Ready':` but should also exclude 'Progressing' by adding an additional check `and st != 'Progressing'` so that 'Progressing' status items are only added to warnings, not critical issues.
499-518:⚠️ Potential issue | 🟠 Major | ⚡ Quick winError-level log entries without an
errorfield are not added to issue tracking.The issue tracking logic (lines 514-517) is nested inside
if e['error']:, so log entries at error level that lack anerrorfield are printed but not accumulated inself.issues. This may cause the summary to undercount errors.Proposed fix - track issues regardless of error field
for e in sorted(entries, key=lambda x: x['ts']): (err if e['level'] == 'error' else warn)(f"{e['ts']}: {e['msg']}") if e['controller']: print(f" Controller: {e['controller']}") if e['error']: lines = e['error'].split('\n') if len(lines) > 1: print(f" {BOLD}Error Details:{END}") for line in lines[:10]: if line.strip(): print(f" {line}") if len(lines) > 10: print(f" ... ({len(lines) - 10} more lines)") else: print(f" Error: {e['error']}") - if e['level'] == 'error': - self.issues['critical'].append(f"Pod {pod}: {e['msg']}") - else: - self.issues['warning'].append(f"Pod {pod}: {e['msg']}") + if e['level'] == 'error': + self.issues['critical'].append(f"Pod {pod}: {e['msg']}") + else: + self.issues['warning'].append(f"Pod {pod}: {e['msg']}") print()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py` around lines 499 - 518, The issue tracking logic that appends to self.issues['critical'] and self.issues['warning'] is currently nested inside the if e['error']: condition, which means error-level entries without an error field are not tracked in the issues summary. Move the if/else block that checks e['level'] == 'error' and appends to self.issues outside of the if e['error']: block so that all error and warning level entries are tracked regardless of whether the error field is present. The error field display logic can remain nested, but the issue tracking must execute for every entry that has an error or warning level.
🧹 Nitpick comments (1)
plugins/lvms/skills/analyze/scripts/analyze_lvms.py (1)
217-223: 💤 Low valueGlobal deduplication may hide multi-pod failures.
The deduplication key is only the message text (
e['msg']), which means if the same error occurs in multiple pods, only one entry is retained. This could mask systemic issues affecting multiple pods.Consider including the pod name in the deduplication key if you want to preserve per-pod visibility:
seen = {} for e in raw_entries: - key = e['msg'] + key = (e['pod'], e['msg']) if key not in seen or e['ts'] < seen[key]['ts']: seen[key] = e🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py` around lines 217 - 223, The deduplication logic in the analyze_lvms function uses only the message text as the deduplication key, which causes identical errors occurring in multiple pods to be collapsed into a single entry, masking systemic failures. Modify the key construction from key = e['msg'] to include both the pod name and message (such as creating a composite key with e['pod'] and e['msg'] or using a tuple of both values) to ensure that the same error message is tracked separately per pod while still deduplicating duplicate entries within the same pod.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py`:
- Around line 345-359: When a VG node has status 'Progressing' with a reason, it
is being added to both the 'warning' list (at line 347) and the 'critical' list
(at line 359), causing double-counting of the same issue. Modify the condition
that controls whether to append to self.issues['critical'] on line 359 to
exclude 'Progressing' status. The condition currently checks `if reason and st
!= 'Ready':` but should also exclude 'Progressing' by adding an additional check
`and st != 'Progressing'` so that 'Progressing' status items are only added to
warnings, not critical issues.
- Around line 499-518: The issue tracking logic that appends to
self.issues['critical'] and self.issues['warning'] is currently nested inside
the if e['error']: condition, which means error-level entries without an error
field are not tracked in the issues summary. Move the if/else block that checks
e['level'] == 'error' and appends to self.issues outside of the if e['error']:
block so that all error and warning level entries are tracked regardless of
whether the error field is present. The error field display logic can remain
nested, but the issue tracking must execute for every entry that has an error or
warning level.
---
Nitpick comments:
In `@plugins/lvms/skills/analyze/scripts/analyze_lvms.py`:
- Around line 217-223: The deduplication logic in the analyze_lvms function uses
only the message text as the deduplication key, which causes identical errors
occurring in multiple pods to be collapsed into a single entry, masking systemic
failures. Modify the key construction from key = e['msg'] to include both the
pod name and message (such as creating a composite key with e['pod'] and
e['msg'] or using a tuple of both values) to ensure that the same error message
is tracked separately per pod while still deduplicating duplicate entries within
the same pod.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b7939640-caa4-44fb-9fc0-6def8b0b7470
📒 Files selected for processing (1)
plugins/lvms/skills/analyze/scripts/analyze_lvms.py
- Fix Progressing VG nodes being double-counted in both warning and critical issue lists — only append reason to critical for non-Ready, non-Progressing statuses - Un-nest issue tracking from error field check so error-level log entries without an explicit error field are still tracked in summary - Include pod name in log dedup key so identical errors across different pods are preserved separately Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/lgtm |
Summary
openshift-eng/ai-helpersinto the existing LVMS pluginoccommands) and offline must-gather data (via Python analysis script)Companion PR
The removal of the LVMS plugin from
openshift-eng/ai-helperswill follow in a separate PR to that repo.Test plan
/lvms:analyze --helpshows correct usagemake lint-all-skills)🤖 Generated with Claude Code
Summary by CodeRabbit
/lvms:analyzeskill for troubleshooting LVMS on OpenShift./lvms:analyzeusage, examples, and requirements.1.1.0and expanded public plugin metadata to include release/QE workflows and troubleshooting.