OCPEDGE-2727: Add agent-eval-harness integration config for cluster-diagnostic skill #178
dhensel-rh wants to merge 5 commits into
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: dhensel-rh
…eat-model skills
Add evaluation configs, test cases, and README for two skills:
- cluster-diagnostic: 5 cases covering validate and recovery-guide modes
- threat-model-tnf: 5 cases covering PR security analysis
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval harness resolves dataset.path from the repo root, not relative to the config file. Both configs were using short relative paths that broke when running from different working directories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
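To illustrate the path issue described in this commit, here is a minimal sketch of resolving a config's dataset.path against the repo root instead of the config file's directory. The function name `resolve_dataset_path` and the example paths are hypothetical; the harness's actual resolution code is not shown in this PR.

```python
from pathlib import PurePosixPath


def resolve_dataset_path(dataset_path: str, repo_root: str) -> PurePosixPath:
    """Resolve dataset.path against the repo root (assumed harness behavior).

    Absolute paths are returned unchanged; relative paths are joined to the
    repo root, never to the directory that holds the config file.
    """
    candidate = PurePosixPath(dataset_path)
    if candidate.is_absolute():
        return candidate
    return PurePosixPath(repo_root) / candidate


# Hypothetical layout: a short path that only works relative to the config
# directory resolves to the wrong place from the repo root.
short = resolve_dataset_path("cases", "/repo")
full = resolve_dataset_path("evals/cluster-diagnostic/cases", "/repo")
print(short)  # /repo/cases
print(full)   # /repo/evals/cluster-diagnostic/cases
```

Writing the path in full from the repo root gives the same result regardless of the working directory the eval is launched from.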
inputs:
  tools:
    - match: Questions asked to the user via AskUserQuestion (game mode)
The inputs.tools section defines AskUserQuestion interception for game mode, but the analysis doc explicitly says game mode is excluded from eval scope. None of the 5 cases test game mode.
This is unused config that could confuse future contributors.
models:
  skill: claude-opus-4-6
  judge: claude-opus-4-6
  hook: claude-sonnet-4-6
Can we drop both skill and hook here?

skill can be set here or on the command line:
/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml
The hook is useful if I choose to exercise game mode.
What is the motivation for dropping them?
  hook: claude-sonnet-4-6

permissions:
  allow: []
Do we want to list specific tools, or allow all?
    if expected_blockers and not found_blockers:
        return (False, f"Expected blockers {expected_blockers} not found in output")
Warnings are never validated. The expected_warnings field is collected but never used by any judge; either add a check or remove the field from annotations to avoid false confidence.
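One way the missing check could look, as a rough sketch only: the function name `check_expected_warnings` and the substring-matching approach are assumptions, and the `(bool, str)` return shape simply mirrors the blocker check quoted above.

```python
def check_expected_warnings(output: str, expected_warnings: list[str]) -> tuple[bool, str]:
    """Hypothetical judge: pass only if every expected warning appears in the output.

    Matching is a case-insensitive substring test, which is an assumption;
    the real judges in this PR may match findings differently.
    """
    if not expected_warnings:
        return (True, "No warnings expected")

    out_lower = output.lower()
    missing = [w for w in expected_warnings if w.lower() not in out_lower]
    if missing:
        return (False, f"Expected warnings {missing} not found in output")
    return (True, f"All {len(expected_warnings)} expected warnings found")


# Usage with made-up skill output and annotations.
ok, reason = check_expected_warnings(
    "WARNING: fencing disabled during maintenance", ["fencing disabled"]
)
print(ok, reason)
```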
sec_lower = section.lower()
if "pcs node standby" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:
    forbidden.append("pcs node standby recommended")
if "shutdown -h 1" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:
The judge checks for "shutdown -h 1" (with the 1), but case-001's expected blocker is shutdown -h (without the 1). If the skill outputs "shutdown -h" without the trailing argument, the forbidden check won't catch it. The judge should match "shutdown -h" instead.
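A sketch of the suggested change, assuming the same substring-plus-negation logic as the quoted judge. The helper name `find_forbidden` and the label "shutdown -h recommended" are hypothetical; only the "pcs node standby recommended" label appears in the diff.

```python
# Command substring -> label appended when the command is recommended.
# The shutdown label is an assumption; the diff does not show it.
FORBIDDEN_COMMANDS = {
    "pcs node standby": "pcs node standby recommended",
    "shutdown -h": "shutdown -h recommended",  # also matches "shutdown -h 1"
}
NEGATIONS = ("never", "do not")


def find_forbidden(section: str) -> list[str]:
    """Return labels for forbidden commands that a section recommends.

    A section that contains a negation ("never", "do not") is treated as a
    warning against the command, as in the original judge.
    """
    sec_lower = section.lower()
    if any(neg in sec_lower for neg in NEGATIONS):
        return []
    return [label for cmd, label in FORBIDDEN_COMMANDS.items() if cmd in sec_lower]


print(find_forbidden("Run shutdown -h on both nodes"))    # ['shutdown -h recommended']
print(find_forbidden("Run shutdown -h 1 on both nodes"))  # ['shutdown -h recommended']
print(find_forbidden("Never run shutdown -h 1"))          # []
```

Matching the shorter string covers both spellings, since "shutdown -h" is a prefix of "shutdown -h 1".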
Return a JSON object: {"score": <1-5>, "rationale": "<explanation>"}

thresholds:
How are these defined? Are they from the eval-analyze command?

Yes, eval-analyze generated this.
A threshold sets the bar for a judge across all cases in a run.
- 1.0 = all 5 must pass (5/5)
- 0.8 = at least 4 must pass (4/5)
- 0.6 = at least 3 must pass (3/5)
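The arithmetic behind these thresholds can be sketched as a pass-rate comparison. The function name `meets_threshold` is hypothetical, and this assumes the harness compares the fraction of passing cases to the threshold; the harness's own implementation is not shown here.

```python
def meets_threshold(passed: int, total: int, threshold: float) -> bool:
    """Return True if the fraction of passing cases reaches the threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total >= threshold


# With 5 cases, the three thresholds from the comment behave as described.
for threshold in (1.0, 0.8, 0.6):
    needed = min(p for p in range(6) if meets_threshold(p, 5, threshold))
    print(f"threshold {threshold}: at least {needed}/5 must pass")
```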
cluster issues across 4 modes: diagnose (live SSH), validate (check proposed
procedures), recovery-guide (return correct procedures), and game (interactive
training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant
e920t, OCP 4.22.0-rc.3) into a knowledge base.
Should we mention the bare metal system and OCP details? This should work against any bare metal system and OCP version, right?

This file (cluster-diagnostic.md) is generated by /eval-analyze. The cluster-diagnostic tool should work with any OCP version, on bare metal or VMs. The referenced ./cluster-knowledge-base.md skill would need updating, not this markdown file. This can probably be addressed for accuracy, but it does not necessarily affect the test output.
- Add case-006-game-quiz with quiz mode test case and answers
- Add warning_classification judge for expected WARNING findings
- Add game_mode_scoring judge for rating/score validation
- Fix forbidden_recommendations to check 'shutdown -h' (not 'shutdown -h 1')
- Update severity_classification description for clarity
- Drop models.skill default (let CLI --model flag control it)
- Simplify schema note to only exclude diagnose mode
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evals score skill quality on a spectrum (1-5), not pass/fail. Update terminology to reflect this: testing→scoring, test cases→scenarios, test input→scenario input. Add game mode to cluster-diagnostic case count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- … cluster-diagnostic skill to enable automated quality testing via /test eval-cluster-diagnostic in CI
- … validate mode (severity classification of dangerous procedures) and recovery-guide mode (correct shutdown/recovery procedures)
- … recommendations, and cost budget; one LLM judge evaluates knowledge base accuracy