test-anti-patterns / test-smell-detection: narrow-specialist re-pitch (Variant B)#671

Merged
Evangelink merged 3 commits into main from dev/amauryleve/test-audit-narrow-specialist on May 25, 2026
@Evangelink
Member

What

Variant B of a 3-PR experiment to fix plugin-mode activation failures for test-anti-patterns and test-smell-detection. Narrow-specialist re-pitch.

Diagnosis

(Same as Variants A and C; see PR #670 for the full diagnosis.) Plugin-mode activation fails because the two skills overlap heavily, the baseline already scores 4.0/5, and each skill's previous description told the agent to use the other skill.

Change (Variant B)

  • test-anti-patterns becomes the umbrella audit skill for all pragmatic test-quality reviews (anti-patterns and smell-style prompts alike). Its description explicitly absorbs smell-audit triggers (test smell audit, formal smell audit, rank by severity, etc.).
  • test-smell-detection is repositioned as a narrow niche — invoked only when the user explicitly asks for the testsmells.org / academic 19-smell catalog with citable smell names from the research literature.
  • Eval prompts for test-smell-detection are rewritten so they actually exercise the narrow niche (explicit testsmells.org / 19-smell catalog requests).
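
The intended split can be pictured with a toy router. This is only a sketch of the routing intent: real activation is decided by the model reading each skill's description, not by keyword matching, and the trigger phrases and the `pick_skill` helper below are hypothetical names chosen for illustration.

```python
# Toy illustration of the Variant B routing intent for audit prompts.
# Assumption: the trigger phrases are illustrative, not the actual
# wording of either SKILL.md description.
NARROW_TRIGGERS = ("testsmells.org", "19-smell")


def pick_skill(prompt: str) -> str:
    """Return the skill an audit prompt is meant to land on under Variant B."""
    text = prompt.lower()
    # The narrow specialist fires only on explicit catalog requests.
    if any(trigger in text for trigger in NARROW_TRIGGERS):
        return "test-smell-detection"
    # Every other audit prompt, smell-style phrasing included,
    # goes to the umbrella skill.
    return "test-anti-patterns"


if __name__ == "__main__":
    print(pick_skill("Do a formal test smell audit and rank by severity"))
    print(pick_skill("Classify these tests using the testsmells.org catalog"))
```

The sketch only covers audit prompts. Requests that neither skill should handle, such as writing new tests from scratch, are out of scope here.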

Companion variants

Compare /evaluate results across all three to pick the winner.

Validation

  • skill-validator check --plugin plugins/dotnet-test passes for both skills.

… (Variant B)

Both skills suffer plugin-mode activation failures: in plugin runs the model
bypasses the skill tool entirely for ~5/6 of test-anti-patterns audit prompts
and ~2/3 of test-smell-detection prompts, answering directly from training.

Variant B repositions the two skills as umbrella + narrow specialist:

* test-anti-patterns becomes THE umbrella audit skill for all pragmatic
  test-quality reviews (anti-patterns and smell-style prompts alike). Its
  description explicitly absorbs smell-audit triggers.
* test-smell-detection is repositioned as a narrow niche that only fires
  when the user explicitly asks for the testsmells.org / academic 19-smell
  catalog with citable smell names from the research literature.

Eval prompts for test-smell-detection are rewritten to match the narrow
niche (explicit testsmells.org / 19-smell catalog requests) so we test
specialist activation rather than the umbrella scenarios.

Variant B of a 3-PR experiment to fix activation. Companion PRs:

* Variant A = activation-only rewrite, both skills keep overlap;
* Variant C = full merge into test-anti-patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 19, 2026 09:55
@Evangelink
Member Author

/evaluate

Contributor

Copilot AI left a comment

Pull request overview

This PR is Variant B of a 3-PR experiment to fix plugin-mode activation issues between two overlapping skills. It repositions test-anti-patterns as the umbrella audit skill for all general/pragmatic test-quality reviews, and narrows test-smell-detection to only fire on explicit requests for the testsmells.org/academic 19-smell catalog. Eval prompts for test-smell-detection are rewritten so they actually exercise the narrow niche.

Changes:

  • Rewrite test-anti-patterns description to claim the umbrella audit role (absorbing smell-audit phrasing), with INVOKE THIS SKILL mandate language.
  • Narrow test-smell-detection description to invoke ONLY on explicit testsmells.org / 19-smell academic catalog requests.
  • Update both eval.yaml and eval.vally.yaml scenarios for test-smell-detection to use the narrow-niche trigger phrasing.

| File | Description |
|---|---|
| plugins/dotnet-test/skills/test-anti-patterns/SKILL.md | Reframes skill as the umbrella audit, absorbing smell-audit triggers |
| plugins/dotnet-test/skills/test-smell-detection/SKILL.md | Restricts skill to explicit academic-catalog requests; defers general audits to test-anti-patterns |
| tests/dotnet-test/test-smell-detection/eval.yaml | Updates 3 prompts to explicitly request the 19-smell academic taxonomy |
| tests/dotnet-test/test-smell-detection/eval.vally.yaml | Mirrors eval.yaml prompt changes for the vally adapter |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 0

github-actions Bot added a commit that referenced this pull request May 19, 2026
github-actions Bot added a commit that referenced this pull request May 19, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| test-smell-detection | Detect multiple test smells in order processing test suite | 5.0/5 → 5.0/5 | ✅ test-smell-detection; tools: skill | 🟡 0.44 | [1] |
| test-smell-detection | Recognize well-written tests with no significant smells | 4.0/5 → 5.0/5 🟢 | ✅ test-smell-detection; tools: skill | 🟡 0.44 | [2] |
| test-smell-detection | Recognize integration tests and avoid false positives for external resources | 5.0/5 → 5.0/5 | ✅ test-smell-detection; tools: skill / ✅ test-smell-detection; tools: skill, glob | 🟡 0.44 | [3] |
| test-smell-detection | Decline request to write new tests from scratch | 4.7/5 → 4.7/5 | ℹ️ not activated (expected) | 🟡 0.44 | [4] |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.25 | [5] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.25 | [6] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.25 | [7] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.0/5 → 4.7/5 🟢 | ✅ test-anti-patterns; tools: skill | 🟡 0.25 | [8] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.25 | [9] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 4.3/5 → 4.3/5 | ✅ test-anti-patterns; tools: skill / ✅ assertion-quality; test-anti-patterns; tools: skill | 🟡 0.25 | [10] |

[1] ⚠️ High run-to-run variance (CV=0.62) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=2.09) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -5.3% due to: tokens (47205 → 75194), time (44.3s → 69.6s), quality
[3] ⚠️ High run-to-run variance (CV=0.54) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.7% due to: tokens (41545 → 102308), tool calls (4 → 8), time (48.6s → 90.7s)
[4] ⚠️ High run-to-run variance (CV=3.53) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -3.9% due to: quality, tool calls (3 → 4), time (36.0s → 43.8s)
[5] (Plugin) Quality unchanged but weighted score is -2.8% due to: tokens (28197 → 37502), time (34.0s → 50.1s)
[6] ⚠️ High run-to-run variance (CV=0.60) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.5% due to: tokens (40586 → 55013), tool calls (3 → 4)
[7] ⚠️ High run-to-run variance (CV=1.08) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=1.52) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=1.03) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.7% due to: tokens (28935 → 38440)
[10] ⚠️ High run-to-run variance (CV=5.79) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.7% due to: tokens (42479 → 85645), time (43.2s → 87.8s), tool calls (4 → 5)
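
For readers unfamiliar with the CV figures in these footnotes: assuming CV is the usual coefficient of variation (standard deviation divided by the mean), which the report itself does not spell out, it can be sketched as below. The sample values are hypothetical, and the choice of sample standard deviation is also an assumption, since the validator's exact formula is not shown in this thread.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the absolute mean."""
    m = mean(samples)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(m)


if __name__ == "__main__":
    # Hypothetical per-run values; not taken from the evaluation above.
    runs = [-1.0, 0.5, 1.1]
    cv = coefficient_of_variation(runs)
    # The same quantity can be printed as a ratio or as a percentage.
    print(f"CV={cv:.2f} ({cv:.0%})")
```

Under this definition a CV well above 1 is possible whenever the mean of the per-run values sits close to zero relative to their spread, so a very large CV does not by itself mean the individual runs differ wildly.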

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

@copilot resolve the merge conflicts in this pull request

Contributor

Copilot AI commented May 25, 2026

> @copilot resolve the merge conflicts in this pull request

Resolved the PR merge conflicts by merging origin/main into this branch and finalizing conflict resolution in the 4 conflicting files. Addressed in commit 71e33f4.

@Evangelink
Member Author

@copilot resolve the merge conflicts in this pull request

Copilot AI review requested due to automatic review settings May 25, 2026 12:57
@Evangelink Evangelink review requested due to automatic review settings May 25, 2026 12:57
Contributor

Copilot AI commented May 25, 2026

> @copilot resolve the merge conflicts in this pull request

Done — I merged the latest origin/main into this branch and resolved the remaining conflict. Addressed in commit e57ff5c.

@Evangelink Evangelink enabled auto-merge (squash) May 25, 2026 13:07
@Evangelink
Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request May 25, 2026
github-actions Bot added a commit that referenced this pull request May 25, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| test-smell-detection | Detect multiple test smells in order processing test suite | 4.3/5 → 5.0/5 🟢 | ✅ test-smell-detection; tools: skill / ✅ test-anti-patterns; test-smell-detection; tools: skill | 🟡 0.41 | [1] |
| test-smell-detection | Recognize well-written tests with no significant smells | 4.7/5 → 5.0/5 🟢 | ✅ test-smell-detection; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.41 | |
| test-smell-detection | Recognize integration tests and avoid false positives for external resources | 5.0/5 → 5.0/5 | ✅ test-smell-detection; tools: skill / ✅ test-anti-patterns; test-smell-detection; tools: skill | 🟡 0.41 | [2] |
| test-smell-detection | Decline request to write new tests from scratch | 5.0/5 → 4.7/5 🔴 | ℹ️ not activated (expected) | 🟡 0.41 | [3] |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill | 🟡 0.24 | [4] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.24 | [5] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill | 🟡 0.24 | [6] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.3/5 → 4.3/5 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; tools: skill, glob | 🟡 0.24 | [7] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill | 🟡 0.24 | [8] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 4.3/5 → 4.3/5 | ✅ test-anti-patterns; tools: skill | 🟡 0.24 | [9] |

[1] ⚠️ High run-to-run variance (CV=2317%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.6% due to: tokens (41420 → 78343), time (33.0s → 63.0s), tool calls (4 → 6)
[2] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (45903 → 138834), tool calls (4 → 9), time (45.0s → 89.4s)
[3] ⚠️ High run-to-run variance (CV=558%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +0.3% due to: tokens (87145 → 43008), tool calls (6 → 4), time (45.9s → 35.4s)
[4] (Isolated) Quality unchanged but weighted score is -4.9% due to: tokens (28078 → 48020), tool calls (3 → 4)
[5] ⚠️ High run-to-run variance (CV=80%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.3% due to: tokens (27409 → 43158)
[6] ⚠️ High run-to-run variance (CV=690%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -9.8% due to: judgment, tokens (40555 → 47510)
[7] ⚠️ High run-to-run variance (CV=82%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -5.7% due to: tokens (26590 → 46450), tool calls (2 → 3), time (18.3s → 23.4s)
[8] ⚠️ High run-to-run variance (CV=51%) — consider re-running with --runs 5
[9] (Plugin) Quality unchanged but weighted score is -4.8% due to: tokens (42391 → 84590), time (33.6s → 44.6s), tool calls (4 → 5)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink Evangelink merged commit 5810d49 into main May 25, 2026
36 checks passed
@Evangelink Evangelink deleted the dev/amauryleve/test-audit-narrow-specialist branch May 25, 2026 13:18
@github-actions
Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
|---|---|---|---|
| dotnet-test | test-anti-patterns | 16/18 | 88.9% |
| ⚠️ dotnet-test | test-smell-detection | 15/19 | 78.9% |

Uncovered: dotnet-test/test-anti-patterns
  • [Validation] Every finding includes a specific location (not just a general warning) (line 141)
  • [Validation] Recommendations are prioritized by severity (line 145)
Uncovered: dotnet-test/test-smell-detection
  • [Validation] Every finding includes a concrete fix example (not just "fix this") (line 197)
  • [Validation] Contextually obvious numbers are not flagged as magic numbers (line 200)
  • [Validation] Severity levels are justified, not arbitrary (line 202)
  • [Pitfall] Treating skip annotations with reasons same as bare skips (line 213)
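
The percentages in the report follow directly from the Covered column. A minimal sketch of the arithmetic is below; the 80% warning threshold is an assumption made only so the example reproduces the ⚠️ marker, since the report does not state where the real cutoff lies.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Covered items as a percentage of the total, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * covered / total, 1)


# Covered/total pairs copied from the coverage report above.
ROWS = {
    "test-anti-patterns": (16, 18),
    "test-smell-detection": (15, 19),
}

# Assumption: hypothetical warning threshold, not taken from the tool.
WARN_BELOW = 80.0

if __name__ == "__main__":
    for skill, (covered, total) in ROWS.items():
        pct = coverage_percent(covered, total)
        flag = "⚠️ " if pct < WARN_BELOW else ""
        print(f"{flag}{skill}: {covered}/{total} = {pct}%")
```

The uncovered counts match the lists above: 18 - 16 = 2 items for test-anti-patterns and 19 - 15 = 4 items for test-smell-detection.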
