test-anti-patterns / test-smell-detection: narrow-specialist re-pitch (Variant B) by Evangelink · Pull Request #671 · dotnet/skills

Evangelink · 2026-05-19T09:55:55Z

What

Variant B of a 3-PR experiment to fix plugin-mode activation failures for test-anti-patterns and test-smell-detection. Narrow-specialist re-pitch.

Diagnosis

(Same as Variants A and C — see PR #670 for full diagnosis.) Plugin-mode activation fails because the two skills overlap heavily, the baseline is already 4.0/5, and the previous descriptions told the agent to use the other one.

Change (Variant B)

test-anti-patterns becomes the umbrella audit skill for all pragmatic test-quality reviews (anti-patterns and smell-style prompts alike). Its description explicitly absorbs smell-audit triggers (test smell audit, formal smell audit, rank by severity, etc.).
test-smell-detection is repositioned as a narrow niche — invoked only when the user explicitly asks for the testsmells.org / academic 19-smell catalog with citable smell names from the research literature.
Eval prompts for test-smell-detection are rewritten so they actually exercise the narrow niche (explicit testsmells.org / 19-smell catalog requests).
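
As a rough illustration of the intended split, the umbrella/specialist routing can be sketched as a priority check: the narrow specialist is considered first and only matches explicit catalog requests, while everything else that looks like a test-quality audit falls through to the umbrella skill. This is only a sketch. Real activation is decided by the model reading the SKILL.md descriptions, not by keyword matching, and the pattern lists and the `route` function below are hypothetical.

```python
import re
from typing import Optional

# Hypothetical trigger patterns, loosely based on the phrases named in this PR.
# The real skill descriptions are free text interpreted by the model.
SPECIALIST_PATTERNS = [
    r"testsmells\.org",
    r"\b19[- ]smell\b",
]
UMBRELLA_PATTERNS = [
    r"\banti-?patterns?\b",
    r"\bsmells?\b",
    r"\baudit\b",
    r"\brank by severity\b",
]


def route(prompt: str) -> Optional[str]:
    """Return the skill a prompt would map to under the Variant B split."""
    text = prompt.lower()
    # The specialist wins only on an explicit catalog request.
    if any(re.search(p, text) for p in SPECIALIST_PATTERNS):
        return "test-smell-detection"
    # Any other audit-style prompt, smell wording included, goes to the umbrella.
    if any(re.search(p, text) for p in UMBRELLA_PATTERNS):
        return "test-anti-patterns"
    # Neither skill applies (for example, a request to write new tests).
    return None


print(route("Run a formal smell audit on these tests and rank by severity"))
print(route("Check this suite against the testsmells.org 19-smell catalog"))
print(route("Write new unit tests for OrderService"))
```

Checking the specialist first is what keeps a prompt that mentions both "smell" and "testsmells.org" from being absorbed by the umbrella skill.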

Companion variants

Variant A (#670): activation-only rewrite; both skills keep their overlap and INVOKE-THIS wording.
Variant C (#TBD-merge): full merge — deletes test-smell-detection entirely.

Compare /evaluate results across all three to pick the winner.

Validation

skill-validator check --plugin plugins/dotnet-test passes for both skills.

… (Variant B)

Both skills suffer plugin-mode activation failures: in plugin runs the model bypasses the skill tool entirely for ~5/6 of test-anti-patterns audit prompts and ~2/3 of test-smell-detection prompts, answering directly from training.

Variant B repositions the two skills as umbrella + narrow specialist:

* test-anti-patterns becomes THE umbrella audit skill for all pragmatic test-quality reviews (anti-patterns and smell-style prompts alike). Its description explicitly absorbs smell-audit triggers.
* test-smell-detection is repositioned as a narrow niche that only fires when the user explicitly asks for the testsmells.org / academic 19-smell catalog with citable smell names from the research literature.

Eval prompts for test-smell-detection are rewritten to match the narrow niche (explicit testsmells.org / 19-smell catalog requests) so we test specialist activation rather than the umbrella scenarios.

Variant B of a 3-PR experiment to fix activation. Companion PRs:

* Variant A = activation-only rewrite, both skills keep overlap;
* Variant C = full merge into test-anti-patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-05-19T09:56:22Z

/evaluate

Copilot

Pull request overview

This PR is Variant B of a 3-PR experiment to fix plugin-mode activation issues between two overlapping skills. It repositions test-anti-patterns as the umbrella audit skill for all general/pragmatic test-quality reviews, and narrows test-smell-detection to only fire on explicit requests for the testsmells.org/academic 19-smell catalog. Eval prompts for test-smell-detection are rewritten so they actually exercise the narrow niche.

Changes:

Rewrite test-anti-patterns description to claim the umbrella audit role (absorbing smell-audit phrasing), with INVOKE THIS SKILL mandate language.
Narrow test-smell-detection description to invoke ONLY on explicit testsmells.org / 19-smell academic catalog requests.
Update both eval.yaml and eval.vally.yaml scenarios for test-smell-detection to use the narrow-niche trigger phrasing.

File	Description
plugins/dotnet-test/skills/test-anti-patterns/SKILL.md	Reframes skill as the umbrella audit, absorbing smell-audit triggers
plugins/dotnet-test/skills/test-smell-detection/SKILL.md	Restricts skill to explicit academic-catalog requests; defers general audits to test-anti-patterns
tests/dotnet-test/test-smell-detection/eval.yaml	Updates 3 prompts to explicitly request the 19-smell academic taxonomy
tests/dotnet-test/test-smell-detection/eval.vally.yaml	Mirrors eval.yaml prompt changes for the vally adapter

Copilot's findings

Files reviewed: 4/4 changed files
Comments generated: 0

github-actions · 2026-05-19T10:06:30Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
test-smell-detection	Detect multiple test smells in order processing test suite	5.0/5 → 5.0/5	✅ test-smell-detection; tools: skill	🟡 0.44	❌ [1]
test-smell-detection	Recognize well-written tests with no significant smells	4.0/5 → 5.0/5 🟢	✅ test-smell-detection; tools: skill	🟡 0.44	❌ [2]
test-smell-detection	Recognize integration tests and avoid false positives for external resources	5.0/5 → 5.0/5	✅ test-smell-detection; tools: skill / ✅ test-smell-detection; tools: skill, glob	🟡 0.44	❌ [3]
test-smell-detection	Decline request to write new tests from scratch	4.7/5 → 4.7/5	ℹ️ not activated (expected)	🟡 0.44	❌ [4]
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.25	❌ [5]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.25	❌ [6]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.25	✅ [7]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.0/5 → 4.7/5 🟢	✅ test-anti-patterns; tools: skill	🟡 0.25	✅ [8]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.25	❌ [9]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	4.3/5 → 4.3/5	✅ test-anti-patterns; tools: skill / ✅ assertion-quality; test-anti-patterns; tools: skill	🟡 0.25	❌ [10]

[1] ⚠️ High run-to-run variance (CV=0.62) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=2.09) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -5.3% due to: tokens (47205 → 75194), time (44.3s → 69.6s), quality
[3] ⚠️ High run-to-run variance (CV=0.54) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.7% due to: tokens (41545 → 102308), tool calls (4 → 8), time (48.6s → 90.7s)
[4] ⚠️ High run-to-run variance (CV=3.53) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -3.9% due to: quality, tool calls (3 → 4), time (36.0s → 43.8s)
[5] (Plugin) Quality unchanged but weighted score is -2.8% due to: tokens (28197 → 37502), time (34.0s → 50.1s)
[6] ⚠️ High run-to-run variance (CV=0.60) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.5% due to: tokens (40586 → 55013), tool calls (3 → 4)
[7] ⚠️ High run-to-run variance (CV=1.08) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=1.52) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=1.03) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.7% due to: tokens (28935 → 38440)
[10] ⚠️ High run-to-run variance (CV=5.79) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.7% due to: tokens (42479 → 85645), time (43.2s → 87.8s), tool calls (4 → 5)

Model: claude-opus-4.6 | Judge: claude-opus-4.6
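
For readers unfamiliar with the CV figure in the footnotes above, the coefficient of variation is conventionally the standard deviation divided by the absolute mean. The sketch below shows that textbook definition only. Which per-run quantity skill-validator feeds into it, and whether it uses the sample or population standard deviation, is not stated in this thread, so the input values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    if len(values) < 2:
        raise ValueError("need at least two runs")
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; CV is undefined")
    return stdev(values) / abs(m)


# Hypothetical per-run measurements, not taken from this PR.
runs = [0.02, -0.01, 0.05]
cv = coefficient_of_variation(runs)

# The same value can be shown as a ratio or as a percentage.
print(f"CV={cv:.2f}")
print(f"CV={cv:.0%}")
```

Because the mean sits in the denominator, a quantity whose runs average close to zero can produce a very large CV even when the absolute spread is small, which is one possible reason a CV well above 1 appears with only a few runs.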

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

Evangelink · 2026-05-25T11:43:04Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-05-25T11:46:11Z

@copilot resolve the merge conflicts in this pull request

Resolved the PR merge conflicts by merging origin/main into this branch and finalizing conflict resolution in the 4 conflicting files. Addressed in commit 71e33f4.

Evangelink · 2026-05-25T12:55:02Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-05-25T12:57:39Z

@copilot resolve the merge conflicts in this pull request

Done — I merged the latest origin/main into this branch and resolved the remaining conflict. Addressed in commit e57ff5c.

Evangelink · 2026-05-25T13:07:04Z

/evaluate

github-actions · 2026-05-25T13:17:15Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
test-smell-detection	Detect multiple test smells in order processing test suite	4.3/5 → 5.0/5 🟢	✅ test-smell-detection; tools: skill / ✅ test-anti-patterns; test-smell-detection; tools: skill	🟡 0.41	❌ [1]
test-smell-detection	Recognize well-written tests with no significant smells	4.7/5 → 5.0/5 🟢	✅ test-smell-detection; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.41	❌
test-smell-detection	Recognize integration tests and avoid false positives for external resources	5.0/5 → 5.0/5	✅ test-smell-detection; tools: skill / ✅ test-anti-patterns; test-smell-detection; tools: skill	🟡 0.41	❌ [2]
test-smell-detection	Decline request to write new tests from scratch	5.0/5 → 4.7/5 🔴	ℹ️ not activated (expected)	🟡 0.41	✅ [3]
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill	🟡 0.24	❌ [4]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.24	❌ [5]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill	🟡 0.24	❌ [6]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.3/5 → 4.3/5	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; tools: skill, glob	🟡 0.24	❌ [7]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill	🟡 0.24	❌ [8]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	4.3/5 → 4.3/5	✅ test-anti-patterns; tools: skill	🟡 0.24	❌ [9]

[1] ⚠️ High run-to-run variance (CV=2317%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.6% due to: tokens (41420 → 78343), time (33.0s → 63.0s), tool calls (4 → 6)
[2] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (45903 → 138834), tool calls (4 → 9), time (45.0s → 89.4s)
[3] ⚠️ High run-to-run variance (CV=558%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +0.3% due to: tokens (87145 → 43008), tool calls (6 → 4), time (45.9s → 35.4s)
[4] (Isolated) Quality unchanged but weighted score is -4.9% due to: tokens (28078 → 48020), tool calls (3 → 4)
[5] ⚠️ High run-to-run variance (CV=80%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.3% due to: tokens (27409 → 43158)
[6] ⚠️ High run-to-run variance (CV=690%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -9.8% due to: judgment, tokens (40555 → 47510)
[7] ⚠️ High run-to-run variance (CV=82%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -5.7% due to: tokens (26590 → 46450), tool calls (2 → 3), time (18.3s → 23.4s)
[8] ⚠️ High run-to-run variance (CV=51%) — consider re-running with --runs 5
[9] (Plugin) Quality unchanged but weighted score is -4.8% due to: tokens (42391 → 84590), time (33.6s → 44.6s), tool calls (4 → 5)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

github-actions · 2026-05-25T13:21:38Z

Skill Coverage Report

	Plugin	Skill	Covered	Coverage
✅	`dotnet-test`	`test-anti-patterns`	16/18	88.9%
⚠️	`dotnet-test`	`test-smell-detection`	15/19	78.9%

Uncovered: dotnet-test/test-anti-patterns

[Validation] Every finding includes a specific location (not just a general warning) (line 141)
[Validation] Recommendations are prioritized by severity (line 145)

Uncovered: dotnet-test/test-smell-detection

[Validation] Every finding includes a concrete fix example (not just "fix this") (line 197)
[Validation] Contextually obvious numbers are not flagged as magic numbers (line 200)
[Validation] Severity levels are justified, not arbitrary (line 202)
[Pitfall] Treating skip annotations with reasons same as bare skips (line 213)
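
The percentages in the coverage table follow directly from the covered/total counts, and the number of uncovered items listed under each skill should equal the difference. A minimal arithmetic check, assuming coverage is simply covered divided by total and rounded to one decimal (the helper name `coverage_percent` is hypothetical):

```python
def coverage_percent(covered: int, total: int) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * covered / total, 1)


# (covered, total) pairs copied from the Skill Coverage Report above.
report = {
    "test-anti-patterns": (16, 18),
    "test-smell-detection": (15, 19),
}

for skill, (covered, total) in report.items():
    uncovered = total - covered
    print(f"{skill}: {coverage_percent(covered, total)}% ({uncovered} uncovered)")
```

The uncovered counts this yields (2 and 4) match the number of items listed under each "Uncovered" heading.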

Copilot AI review requested due to automatic review settings May 19, 2026 09:55

Evangelink mentioned this pull request May 19, 2026

test-anti-patterns: merge test-smell-detection in (Variant C) #672

Closed

Copilot started reviewing on behalf of Evangelink May 19, 2026 09:56

Copilot AI reviewed May 19, 2026

github-actions Bot added a commit that referenced this pull request May 19, 2026

Update PR token usage data (PR #671)

2388c49

github-actions Bot added a commit that referenced this pull request May 19, 2026

Update session data (PR #671)

63f3f19

AbhitejJohn added the area-test label May 22, 2026

Copilot started work on behalf of Evangelink May 25, 2026 11:43

Merge origin/main and resolve conflicts

71e33f4

Copilot finished work on behalf of Evangelink May 25, 2026 11:46

Copilot started work on behalf of Evangelink May 25, 2026 12:55

Merge latest origin/main and resolve conflicts

e57ff5c

Copilot AI review requested due to automatic review settings May 25, 2026 12:57

Evangelink review requested due to automatic review settings May 25, 2026 12:57

Copilot finished work on behalf of Evangelink May 25, 2026 12:58

Evangelink enabled auto-merge (squash) May 25, 2026 13:07

JanKrivanek approved these changes May 25, 2026

github-actions Bot added a commit that referenced this pull request May 25, 2026

Update PR token usage data (PR #671)

ee7d2b2

github-actions Bot added a commit that referenced this pull request May 25, 2026

Update session data (PR #671)

5269ffa

Evangelink merged commit 5810d49 into main May 25, 2026
36 checks passed

Evangelink deleted the dev/amauryleve/test-audit-narrow-specialist branch May 25, 2026 13:18

Evangelink mentioned this pull request May 25, 2026

test-anti-patterns / test-smell-detection: activation-only rewrite (Variant A) #670

Closed

Evangelink merged 3 commits into main from dev/amauryleve/test-audit-narrow-specialist


5 participants
