OCPEDGE-2727: Add agent-eval-harness integration config for cluster-diagnostic skill #178

Draft
dhensel-rh wants to merge 5 commits into openshift-eng:main from dhensel-rh:OCPEDGE-2727

Conversation

@dhensel-rh

Contributor

Summary

  • Integrate the agent-eval-harness with the cluster-diagnostic skill to enable automated quality
    testing via /test eval-cluster-diagnostic in CI
  • Five test cases exercise both validate mode (severity classification of dangerous procedures)
    and recovery-guide mode (correct shutdown/recovery procedures)
  • Four deterministic judges check severity accuracy, procedure completeness, forbidden
    recommendations, and cost budget; one LLM judge evaluates knowledge base accuracy
  • Companion CI job defined in openshift/release#80177

@openshift-ci

openshift-ci Bot commented Jun 6, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2026
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Contributor

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 41852670-003c-4d9b-a99c-f307ff651771

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 6, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhensel-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 6, 2026
…eat-model skills

Add evaluation configs, test cases, and README for two skills:
- cluster-diagnostic: 5 cases covering validate and recovery-guide modes
- threat-model-tnf: 5 cases covering PR security analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dhensel-rh and others added 2 commits June 7, 2026 16:56
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval harness resolves dataset.path from the repo root, not relative
to the config file. Both configs were using short relative paths that
broke when running from different working directories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
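
The path fix described in this commit can be illustrated with a minimal sketch. It is not the harness's actual resolution code, which is not shown in this PR; the function name `resolve_dataset_path`, its parameters, and the example paths are all hypothetical. It only contrasts resolving `dataset.path` against the repo root with resolving it against the config file's directory.

```python
from pathlib import PurePosixPath


def resolve_dataset_path(dataset_path, repo_root, config_file, relative_to_config=False):
    """Hypothetical helper: join dataset_path onto the chosen base directory.

    By default the base is the repo root, which is the behavior the commit
    message describes. Set relative_to_config=True to see what a
    config-relative lookup would have produced instead.
    """
    if relative_to_config:
        base = PurePosixPath(config_file).parent
    else:
        base = PurePosixPath(repo_root)
    return base / dataset_path


# Assumed layout for illustration only.
repo_root = "/repo"
config_file = "/repo/evals/cluster-diagnostic.yaml"

# A short relative path resolves to the wrong place under repo-root rules.
print(resolve_dataset_path("cases", repo_root, config_file))
# A repo-root-qualified path resolves to the same place from any working directory.
print(resolve_dataset_path("evals/cases", repo_root, config_file))
```

Under these assumptions, only the repo-root-qualified path points inside `evals/`, which is why short relative paths broke.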

inputs:
  tools:
    - match: Questions asked to the user via AskUserQuestion (game mode)

Contributor

The inputs.tools section defines AskUserQuestion interception for game mode, but the analysis doc explicitly says game mode is excluded from eval scope. None of the 5 cases test game mode.
This is unused config that could confuse future contributors.

models:
  skill: claude-opus-4-6
  judge: claude-opus-4-6
  hook: claude-sonnet-4-6

Contributor

Can we drop both skill and hook here?

Contributor Author

skill can be set here or on the command line:

/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml

The hook is useful if I choose to exercise game mode.

What is the motivation for dropping them?

  hook: claude-sonnet-4-6

permissions:
  allow: []

Contributor

Do we want to list any specific tools, or allow all?


if expected_blockers and not found_blockers:
    return (False, f"Expected blockers {expected_blockers} not found in output")

Contributor

Warnings are never validated. The expected_warnings field is collected but never used by any judge. Either add a check or remove the field from annotations to avoid false confidence.
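
A minimal sketch of what such a check could look like follows. It is an assumption, not the harness's actual judge API: the function name `judge_expected_warnings`, its arguments, and the case-insensitive substring matching are hypothetical. Only the `(passed, message)` return shape mirrors the blocker check quoted above.

```python
def judge_expected_warnings(expected_warnings, output_text):
    """Hypothetical judge: return (passed, message) like the blocker check.

    Each expected warning is matched as a case-insensitive substring of
    the skill output; any that are absent cause the judge to fail.
    """
    text = output_text.lower()
    missing = [w for w in expected_warnings if w.lower() not in text]
    if missing:
        return (False, f"Expected warnings {missing} not found in output")
    return (True, "All expected warnings found")
```

Substring matching is the simplest option here. A real judge might instead parse structured findings out of the output, if the skill emits them.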

sec_lower = section.lower()
if "pcs node standby" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:
    forbidden.append("pcs node standby recommended")
if "shutdown -h 1" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:

Contributor

The judge checks for "shutdown -h 1" (with the 1), but case-001's expected blocker is shutdown -h (without the 1). If the skill outputs "shutdown -h" without that trailing argument, or with a different one, the forbidden check won't catch it. The judge should match "shutdown -h" instead.
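
The gap can be reproduced with a small standalone sketch. The helper `find_forbidden` is hypothetical; it is written to follow the substring-plus-negation logic of the hunk above and is not the judge's actual code.

```python
def find_forbidden(section, needle):
    """Flag `needle` unless the section contains "never" or "do not".

    Follows the substring logic of the quoted hunk; the function itself
    is a hypothetical stand-in for the real judge.
    """
    sec_lower = section.lower()
    negated = "never" in sec_lower or "do not" in sec_lower
    return needle in sec_lower and not negated


output = "Step 3: run shutdown -h now on both nodes."
print(find_forbidden(output, "shutdown -h 1"))  # False: the narrow needle misses it
print(find_forbidden(output, "shutdown -h"))    # True: the broader needle catches it
```

The broader needle still matches "shutdown -h 1", so nothing the original check caught is lost.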


Return a JSON object: {"score": <1-5>, "rationale": "<explanation>"}

thresholds:

Contributor

How are these defined? Are they from the eval-analyze command?

Contributor Author

Yes, eval-analyze generated these.

A threshold sets the bar for a judge across all cases in a run.

  - 1.0 = all 5 must pass (5/5)
  - 0.8 = at least 4 must pass (4/5)
  - 0.6 = at least 3 must pass (3/5)
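
The threshold arithmetic above can be sketched as follows. This assumes the threshold is compared against the fraction of passing cases. The function name `judge_meets_threshold` and the boolean per-case results are hypothetical, and the harness may compute this differently.

```python
def judge_meets_threshold(case_results, threshold):
    """Hypothetical check: does the pass rate across cases reach the threshold?

    case_results is a list of booleans, one per case, for a single judge.
    A tiny tolerance guards against floating-point rounding at the boundary.
    """
    passed = sum(1 for ok in case_results if ok)
    pass_rate = passed / len(case_results)
    return pass_rate + 1e-9 >= threshold


four_of_five = [True, True, True, True, False]
print(judge_meets_threshold(four_of_five, 0.8))  # True: 4/5 meets the 0.8 bar
print(judge_meets_threshold(four_of_five, 1.0))  # False: 1.0 requires 5/5
```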

cluster issues across 4 modes: diagnose (live SSH), validate (check proposed
procedures), recovery-guide (return correct procedures), and game (interactive
training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant
e920t, OCP 4.22.0-rc.3) into a knowledge base.

Contributor

Should we mention the bare metal system and OCP details? This should work against any bare metal system and OCP version, right?

Contributor Author

This file (cluster-diagnostic.md) is generated by /eval-analyze. The cluster-diagnostic tool should work with any OCP version, on bare metal or VMs. The skill's reference file, ./cluster-knowledge-base.md, is what would need updating, not this markdown file. This can probably be addressed for accuracy, but it does not necessarily affect the test output.

- Add case-006-game-quiz with quiz mode test case and answers
- Add warning_classification judge for expected WARNING findings
- Add game_mode_scoring judge for rating/score validation
- Fix forbidden_recommendations to check 'shutdown -h' (not 'shutdown -h 1')
- Update severity_classification description for clarity
- Drop models.skill default (let CLI --model flag control it)
- Simplify schema note to only exclude diagnose mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhensel-rh

Contributor Author

Superseded by #187 (skill plugin), #188 (REPORT_DIR override), and #189 (eval configs).

@dhensel-rh dhensel-rh closed this Jun 11, 2026
@dhensel-rh dhensel-rh reopened this Jun 11, 2026
Evals score skill quality on a spectrum (1-5), not pass/fail. Update
terminology to reflect this: testing→scoring, test cases→scenarios,
test input→scenario input. Add game mode to cluster-diagnostic case count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

2 participants