OCPEDGE-2727: Add agent-eval-harness integration config for cluster-diagnostic skill by dhensel-rh · Pull Request #178 · openshift-eng/edge-tooling

dhensel-rh · 2026-06-06T20:10:09Z

Summary

- Integrate the agent-eval-harness with the cluster-diagnostic skill to enable automated quality testing via /test eval-cluster-diagnostic in CI
- Five test cases exercise both validate mode (severity classification of dangerous procedures) and recovery-guide mode (correct shutdown/recovery procedures)
- Four deterministic judges check severity accuracy, procedure completeness, forbidden recommendations, and cost budget; one LLM judge evaluates knowledge base accuracy
- Companion CI job defined in openshift/release#80177

openshift-ci · 2026-06-06T20:10:12Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-06T20:10:15Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)

do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 41852670-003c-4d9b-a99c-f307ff651771

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


openshift-ci · 2026-06-06T20:10:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhensel-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dhensel-rh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…eat-model skills

Add evaluation configs, test cases, and README for two skills:

- cluster-diagnostic: 5 cases covering validate and recovery-guide modes
- threat-model-tnf: 5 cases covering PR security analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The eval harness resolves dataset.path from the repo root, not relative to the config file. Both configs were using short relative paths that broke when running from different working directories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
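
The two resolution rules mentioned in this commit message can be sketched as follows. This is a minimal illustration, not the harness's code: `resolve_dataset_path`, its `relative_to` switch, and the `/repo` layout and `cases/cluster-diagnostic` path are hypothetical names chosen for the example.

```python
from pathlib import PurePosixPath


def resolve_dataset_path(dataset_path: str,
                         repo_root: PurePosixPath,
                         config_file: PurePosixPath,
                         relative_to: str = "repo_root") -> PurePosixPath:
    """Resolve dataset_path against the repo root or the config file's directory."""
    # Hypothetical switch: "repo_root" mirrors the behavior described in the
    # commit message; any other value resolves next to the config file.
    base = repo_root if relative_to == "repo_root" else config_file.parent
    return base / dataset_path


# Hypothetical repository layout, for illustration only.
repo_root = PurePosixPath("/repo")
config_file = repo_root / "evals" / "cluster-diagnostic.yaml"
short_path = "cases/cluster-diagnostic"

from_root = resolve_dataset_path(short_path, repo_root, config_file)
from_config = resolve_dataset_path(short_path, repo_root, config_file, "config")
print(from_root)    # /repo/cases/cluster-diagnostic
print(from_config)  # /repo/evals/cases/cluster-diagnostic
```

A short path written with the config file's directory in mind points somewhere else once it is resolved from the repo root, which is consistent with the breakage the commit describes.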

kasturinarra · 2026-06-09T09:22:06Z

+
+inputs:
+  tools:
+    - match: Questions asked to the user via AskUserQuestion (game mode)


The inputs.tools section defines AskUserQuestion interception for game mode, but the analysis doc explicitly says game mode is excluded from eval scope. None of the 5 cases test game mode.
This is unused config that could confuse future contributors.

kasturinarra · 2026-06-09T10:46:02Z

+models:
+  skill: claude-opus-4-6
+  judge: claude-opus-4-6
+  hook: claude-sonnet-4-6


Can we drop both skill and hook here?

skill can be set here or on the command line:

/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml

The hook is useful if I choose to exercise the game mode.

What is the motivation for dropping them?
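
One way to read "here or on the command line" is as a simple precedence rule. The sketch below assumes the CLI value wins over the config default; that precedence is an assumption, not confirmed harness behavior, and `resolve_skill_model` is a hypothetical helper.

```python
from typing import Optional


def resolve_skill_model(cli_model: Optional[str], config: dict) -> Optional[str]:
    """Prefer the --model value; otherwise fall back to models.skill in the config."""
    # Assumed precedence: an explicit CLI flag overrides the config default.
    return cli_model or config.get("models", {}).get("skill")


# Config with models.skill omitted, so only the CLI flag can supply it.
config = {"models": {"judge": "claude-opus-4-6", "hook": "claude-sonnet-4-6"}}

print(resolve_skill_model("claude-opus-4-6", config))  # claude-opus-4-6
print(resolve_skill_model(None, config))               # None
```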

kasturinarra · 2026-06-09T10:46:20Z

+  hook: claude-sonnet-4-6
+
+permissions:
+  allow: []


Do we want to list any specific tools, or allow all?

kasturinarra · 2026-06-09T11:03:19Z

+
+      if expected_blockers and not found_blockers:
+          return (False, f"Expected blockers {expected_blockers} not found in output")
+


Warnings are never validated. The expected_warnings field is collected but never used by any judge. Either add a check or remove the field from annotations to avoid false confidence.
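
A warning check could mirror the blocker check quoted above. The sketch below is only an illustration under stated assumptions: `check_expected` is a hypothetical helper, matching is a plain case-insensitive substring test, and the annotation values are invented for the example.

```python
def check_expected(expected: list[str], output: str, label: str) -> tuple[bool, str]:
    """Fail when findings were expected but none of them appear in the output."""
    out_lower = output.lower()
    found = [item for item in expected if item.lower() in out_lower]
    if expected and not found:
        return (False, f"Expected {label} {expected} not found in output")
    return (True, f"Found {label}: {found}")


# Hypothetical annotations and skill output, for illustration only.
annotations = {
    "expected_blockers": ["shutdown -h"],
    "expected_warnings": ["quorum"],
}
output = "BLOCKER: do not run shutdown -h on both nodes."

blocker_result = check_expected(annotations["expected_blockers"], output, "blockers")
warning_result = check_expected(annotations["expected_warnings"], output, "warnings")
print(blocker_result)
print(warning_result)
```

Running the same function over both fields means expected_warnings can no longer be collected without being checked.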

kasturinarra · 2026-06-09T11:10:30Z

+          sec_lower = section.lower()
+          if "pcs node standby" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:
+              forbidden.append("pcs node standby recommended")
+          if "shutdown -h 1" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:


The judge checks for "shutdown -h 1" (with the 1), but case-001's expected blocker is shutdown -h (without the 1). If the skill outputs "shutdown -h" without the trailing argument, the forbidden check won't catch it. The judge should match "shutdown -h" instead.
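
The gap is easy to reproduce with the same substring logic the judge uses. This is a minimal sketch: `forbidden_in_section` is a hypothetical helper that follows the quoted condition, and the sample section text is invented.

```python
NEGATIONS = ("never", "do not")


def forbidden_in_section(section: str, pattern: str) -> bool:
    """True when pattern appears in a section that contains no negation phrase."""
    sec_lower = section.lower()
    return pattern in sec_lower and not any(neg in sec_lower for neg in NEGATIONS)


# Invented sample output that recommends the command without the trailing "1".
section = "Run shutdown -h now on the surviving node."

narrow = forbidden_in_section(section, "shutdown -h 1")
broad = forbidden_in_section(section, "shutdown -h")
print(narrow)  # False: the longer pattern misses this recommendation
print(broad)   # True: the shorter pattern catches it
```

Because "shutdown -h" is a prefix of "shutdown -h 1", the shorter pattern still matches outputs that include the argument.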

kasturinarra · 2026-06-09T11:12:50Z

+
+      Return a JSON object: {"score": <1-5>, "rationale": "<explanation>"}
+
+thresholds:


How are these defined? Are they from the eval-analyze command?

Yes, eval-analyze generated these.

A threshold sets the bar for a judge across all cases in a run:

- 1.0 = all 5 must pass (5/5)
- 0.8 = at least 4 must pass (4/5)
- 0.6 = at least 3 must pass (3/5)
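
The arithmetic above can be sketched as a pass-rate comparison. This is an illustration of the rule as described in the reply, not the harness's implementation; `meets_threshold` is a hypothetical name.

```python
def meets_threshold(results: list[bool], threshold: float) -> bool:
    """True when the fraction of passing cases is at least the threshold."""
    if not results:
        return False  # assumption: a run with no cases does not meet any bar
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold


four_of_five = [True, True, True, True, False]
three_of_five = [True, True, True, False, False]

print(meets_threshold(four_of_five, 0.8))   # True
print(meets_threshold(three_of_five, 0.8))  # False
print(meets_threshold(three_of_five, 0.6))  # True
print(meets_threshold(four_of_five, 1.0))   # False
```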

kasturinarra · 2026-06-09T11:18:36Z

+cluster issues across 4 modes: diagnose (live SSH), validate (check proposed
+procedures), recovery-guide (return correct procedures), and game (interactive
+training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant
+e920t, OCP 4.22.0-rc.3) into a knowledge base.


Should we mention the bare metal system and OCP details? This should work against any bare metal system and OCP version, right?

This file (cluster-diagnostic.md) is generated by /eval-analyze. The cluster-diagnostic tool should work with any OCP version, on bare metal or in a VM. The reference ./cluster-knowledge-base.md skill would need updating, not this markdown file. This can probably be addressed for accuracy, but it does not necessarily affect the test output.

- Add case-006-game-quiz with quiz mode test case and answers
- Add warning_classification judge for expected WARNING findings
- Add game_mode_scoring judge for rating/score validation
- Fix forbidden_recommendations to check 'shutdown -h' (not 'shutdown -h 1')
- Update severity_classification description for clarity
- Drop models.skill default (let CLI --model flag control it)
- Simplify schema note to only exclude diagnose mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dhensel-rh · 2026-06-11T18:41:52Z

Superseded by #187 (skill plugin), #188 (REPORT_DIR override), and #189 (eval configs).

Evals score skill quality on a spectrum (1-5), not pass/fail. Update terminology to reflect this: testing→scoring, test cases→scenarios, test input→scenario input. Add game mode to cluster-diagnostic case count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jun 6, 2026

openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jun 6, 2026

dhensel-rh force-pushed the OCPEDGE-2727 branch from 7a057d1 to f089a0d June 6, 2026 21:21

dhensel-rh and others added 2 commits June 7, 2026 16:56

Update evals README with detailed pipeline steps

72871f4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kasturinarra reviewed Jun 9, 2026


dhensel-rh mentioned this pull request Jun 11, 2026

OCPEDGE-2727: Add eval harness configs for cluster-diagnostic and threat-model skills #189

Closed

4 tasks

dhensel-rh closed this Jun 11, 2026

dhensel-rh reopened this Jun 11, 2026
