Test engineer plugin current coverage by nthompson-bitwarden · Pull Request #152 · bitwarden/ai-plugins

nthompson-bitwarden · 2026-06-25T20:55:27Z

🎟️ Tracking

https://bitwarden.atlassian.net/browse/QA-1983

📔 Objective

Slimmed down version of #150 containing only files necessary to get a "current test coverage" report

…rcing trophy

# Conflicts: # .cspell.json

github-actions · 2026-06-25T20:57:50Z

Plugin Validation Summary — `bitwarden-test-toolkit`

Result: ✅ PASS — all three validation stages passed. No critical or major issues. A few optional minor polish items are noted below; none block merge.

Validated against PR #152 changed files: plugin.json, CHANGELOG.md, README.md, the assessing-test-coverage skill (SKILL.md), its four references/*.md files, and the evals/ harness (README.md, baseline.json, run_real_eval.py, trigger-eval.json).

1. Plugin Structure (plugin-validator) — ✅ PASS

Check	Result
`plugin.json` valid JSON, `name` matches directory, `version` valid semver (`1.0.0`)	✅
Required manifest fields (name, version, description, author, homepage, repository, keywords)	✅
Version consistency: `plugin.json` `1.0.0` == `CHANGELOG.md` `[1.0.0]` == `marketplace.json` entry	✅
`CHANGELOG.md` present, Keep a Changelog format	✅
`README.md` present and comprehensive (overview, features, usage, installation)	✅
Skill frontmatter: `name` + `description` present; `name` matches directory	✅
Auto-discovery layout (`skills/<name>/SKILL.md`)	✅
No hardcoded credentials in any plugin file	✅
`evals/` placement	✅ Appropriate — matches existing repo convention (`bitwarden-delivery-tools/.../evals`); runtime scratch output (`evals/runs/`) is git-ignored

2. Skill Review (skill-reviewer) — ✅ PASS

Check	Result
YAML frontmatter (`name`, `description`)	✅
Description quality — third-person, specific quoted trigger phrases, explicit out-of-scope carve-outs	✅ Exemplary
SKILL.md word count (~864 words)	✅ Deliberately lean; detail correctly pushed to `references/`
Writing style — imperative/infinitive throughout	✅
Progressive disclosure — lean core + 4 focused reference files (~2,650 words)	✅ Textbook
All referenced files and section anchors resolve	✅ Every path/anchor verified, no broken links
Trigger regression harness with committed baseline	✅ 10/10 should-trigger, 10/10 should-not-trigger

3. Security Validation (reviewing-claude-config) — ✅ PASS

Check	Result
No committed `settings.local.json`	✅ None present
No hardcoded secrets/API keys/tokens/passwords	✅ Grep sweep clean — only benign `TARGET_SKILL_TOKEN` variable name and "token spend" (LLM tokens)
`allowed-tools` scoping (SKILL.md:5)	✅ Least-privilege — narrow Bash sub-patterns (e.g. `Bash(gh pr view:)`, `Bash(git -C rev-parse:*)`) instead of blanket shell access
Dangerous command auto-approvals	✅ None
Overly broad file access	✅ None
`run_real_eval.py` credential handling	✅ No secrets; shells out to authenticated `claude` CLI on PATH; env passthrough only
Eval JSON files (`baseline.json`, `trigger-eval.json`)	✅ Query strings + results only, no sensitive data
Prompt-injection posture	✅ Positive — SKILL.md:48 and `references/input-sources.md:6-10` treat all ingested Jira/Confluence/PR/CSV content as untrusted data and cite CWE-1427

Minor Observations (optional — non-blocking)

evals/run_real_eval.py:173 — argparse --model default is claude-opus-4-7, but the eval README/baseline were recorded with --model claude-sonnet-4-6 (evals/README.md:13,25). Cosmetic drift; documented invocations always pass --model explicitly. Consider aligning the default or making --model required. (warning)
SKILL.md:4 — argument-hint names the specific vendor "Testmo" while the skill body/references describe the input generically as a "test-case CSV export." Optionally generalize to Test-case CSV for vendor neutrality. (warning)
SKILL.md:35 — the ## Output section heading conceptually overlaps with the report's ## Overview subsection. Optionally rename to ## Artifacts to reduce ambiguity. (warning)
CHANGELOG.md — no [Unreleased] section. Fully compliant for an initial release; adding one eases future edits. (optional)

Positive Highlights

Skill description is best-in-class: concrete quoted triggers plus explicit negative scoping, validated by a reproducible eval harness with a committed regression baseline.
Progressive disclosure done well — lean orchestration spine, all detail in references/, every anchor resolves.
Least-privilege allowed-tools and built-in prompt-injection defenses (CWE-1427).

Recommendation: APPROVE. No errors requiring changes; the four minor items are cosmetic.

theMickster

A few things to change, all feel minor other that how we approach Tech Breakdowns. Great start @nthompson-bitwarden!

theMickster · 2026-06-26T15:44:44Z

+
+| Plugin                      | How It's Used                                                                                                                                                                                                                                                                |
+| --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `bitwarden-atlassian-tools` | Optional but recommended. Provides the `mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__*` server used to read Jira tickets and linked Confluence requirements. If absent, the plugin degrades gracefully — paste requirements or rely on the PR/CSV/description. |


❓ Do we have any reason to make our Atlassian tooling required? I would really love to see our Claude Code + testing efforts driven from directions in our Jira + Testmo tickets (if possible) 🤞🏼.

Good question - I've kept the Atlassian tooling optional, but reframed it as the recommended driver in the README, and is already steered towards that direction in the reference files.

Ideally the skill is used early on in the SDLC - which would support making atlassian tools required - but it was purposefully designed to be used at any point in the process, even far after feature work is completed.

It could be also be used for assessing test coverage for Community PRs (which supports adding verbiage to resolve the other comment re: untrusted input boundaries).

That said, I fully agree with the direction. Perhaps the additive input sources are too broad/open-ended, which leads to variance in use, which might produce suboptimal determinism in its output.

For example, if QA has already written test cases for a feature - adding those test cases to the input along with the Jira Epic, will likely influence the output compared to if the skill was run before the test cases were written. 🤔

theMickster · 2026-06-26T16:26:51Z

+## Technical breakdown document
+
+A Bitwarden **Tech Breakdown** — the Confluence artifact a team produces before implementation,
+authored with the `bitwarden-delivery-tools:writing-tech-breakdowns` skill. It is the richest


❌ This skill has been removed so we need to refactor how we tackle referencing our tech breakdowns.

@trmartin4 do you think you could give us a hand here with some wording/phrasing? We have a challenge because we have the goal of tech breakdowns in a GitHub repo, but also still have them in Confluence that our new bitwarden-test-engineer plugin is going to need to work with.

I think we also want to be careful on how much we rely upon the tech breakdowns as a source of truth. While they are a source of truth for a period of time, once Jira tickets are crafted and code is created then we need to recognize and steer Claude more towards those and less at the tech breakdown. Correct? Or am I missing something?

📓 I think the reference concern is moot in this PR. regardless of what the new name or entry point is for the process that generates a tech breakdown, it can be removed entirely since it's not a true invocation. It's not really necessary to elaborate on where the tech breakdown came from.

Correct on both points.

We'd want to point this skill at the template in https://github.com/bitwarden/tech-breakdowns/tree/main/templates for contents, and the corresponding breakdown documents for each team, versus the Confluence space.

To the point about timing, I agree and I think it depends on when in the SDLC this skill is intended to be used. The intent is for completed breakdowns (i.e. the code has been implemented and is now the source of truth) will be moved to /complete subfolders for each team, so you could pretty confidently add open PRs and merged breakdowns not complete into your input sources, but guide Claude to preference the implementation over the breakdown if they conflict.

SaintPatrck

Great start! In addition to the inline comments, I think it would be beneficial for everyone if we go ahead and introduce trigger evals, at a minimum, so we can objectively measure effectiveness of these additions and future changes. #126 is an example of basic evals for a design based skill. The creating-pull-requests skill in delivery-tools plugin also has some basic evals you can use as a reference. Happy to sync up outside of this PR and talk through setting them up using /skill-creator if you'd like.

SaintPatrck · 2026-06-29T14:30:32Z

+      "name": "bitwarden-test-engineer",
+      "source": "./plugins/bitwarden-test-engineer",
+      "version": "1.0.0",
+      "description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report. Designed to grow additional testing capabilities over time."


💭 Since this will eventually turn into a collection of SDET tools, I'm wondering if the description and plugin name should be more generic so we don't have to constantly update it when new tools or capabilities are added. bitwarden-sdet-tools, maybe?

I like the idea of making the description/plugin name more generic. But I feel like SDET is too scoped. Software engineer should feel free to make use of the plugin as well as management, etc... 🤔

SaintPatrck · 2026-06-29T14:40:50Z

 | [bitwarden-product-analyst](plugins/bitwarden-product-analyst/)     | 0.1.5   | Product analyst agent for creating comprehensive Bitwarden requirements documents from multiple sources                                                     |
 | [bitwarden-security-engineer](plugins/bitwarden-security-engineer/) | 1.2.0   | Application security engineering: vulnerability triage, threat modeling, and secure code analysis                                                           |
 | [bitwarden-software-engineer](plugins/bitwarden-software-engineer/) | 1.0.0   | Software engineer agent for a Bitwarden product team. Implements stories, tasks, and bugs with code quality, performance, security, and team comms in mind. |
+| [bitwarden-test-engineer](plugins/bitwarden-test-engineer/)         | 1.0.0   | Test engineering toolkit: role-specific testing agents spanning the test lifecycle, starting with risk-weighted test strategy and coverage planning.        |


⛏️ This could also be more generic to avoid future maintenance.

SaintPatrck · 2026-06-29T14:53:54Z

+      "name": "bitwarden-test-engineer",
+      "source": "./plugins/bitwarden-test-engineer",
+      "version": "1.0.0",
+      "description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report. Designed to grow additional testing capabilities over time."


⛏️ I don't think this trailer statement is necessary.

Designed to grow additional testing capabilities over time.

Suggested change

"description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report. Designed to grow additional testing capabilities over time."

"description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report."

SaintPatrck · 2026-06-29T14:54:32Z

+{
+  "name": "bitwarden-test-engineer",
+  "version": "1.0.0",
+  "description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report. Designed to grow additional testing capabilities over time.",


⛏️ Same unnecessary trailer statement repeated here.

Designed to grow additional testing capabilities over time.

Suggested change

"description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report. Designed to grow additional testing capabilities over time.",

"description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report.",

SaintPatrck · 2026-06-29T14:59:45Z

@@ -0,0 +1,46 @@
+---
+name: assessing-test-coverage
+description: Use when determining what test coverage ALREADY exists for a change — citing each covering test as a GitHub permalink bucketed by test layer, and flagging behaviors with no observed test as gaps. Triggers on "what's already tested", "does this PR have tests", "what coverage exists for", or "is this component covered". This is a backward-looking inventory of existing coverage — it does NOT recommend new tests or assign cheapest-sufficient layers.


🎨

"citing each covering test as a GitHub permalink bucketed by test layer, and flagging behaviors with no observed test as gaps"

This is describing output, not when the skill should be used. Unless it was added to cover missed triggers you encountered, consider stripping it.

Trigger evals would help determine if it is adding value or not. Consider having /skill-creator build trigger evals so we can measure effectiveness now and going forward.

SaintPatrck · 2026-06-29T17:56:36Z

+name: assessing-test-coverage
+description: Use when determining what test coverage ALREADY exists for a change — citing each covering test as a GitHub permalink bucketed by test layer, and flagging behaviors with no observed test as gaps. Triggers on "what's already tested", "does this PR have tests", "what coverage exists for", or "is this component covered". This is a backward-looking inventory of existing coverage — it does NOT recommend new tests or assign cheapest-sufficient layers.
+allowed-tools: "Read, Write, Grep, Glob, AskUserQuestion, Bash(gh pr view:*), Bash(gh pr diff:*), Bash(git rev-parse:*), Bash(git remote get-url:*), Bash(git -C * rev-parse:*), Bash(git -C * remote get-url:*), Skill(bitwarden-atlassian-tools:researching-jira-issues), mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__get_issue, mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__get_issue_comments, mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__get_issue_remote_links, mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__get_confluence_page, mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__search_issues, mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__search_confluence, mcp__plugin_bitwarden-atlassian-tools_bitwarden-atlassian__search_confluence_cql"
+---


🎨 Can use arguments and argument-hint to provide cues for users about what they can pass in when invoking the skill.

It also allows reference within the skill using $ARGUMENTS[i] or $argName.

See https://code.claude.com/docs/en/skills#frontmatter-reference for details on arguments and argument-hint usage.

SaintPatrck · 2026-06-29T18:06:40Z

+
+## Output file
+
+Write to `test-engineer-report-<slug>-<date>/coverage.md`:


💭 This can result in files saved to different locations, depending on where the Claude session is started and its CWD. What do you think about having a single, consistent output location that doesn't run the risk of being picked up by Git? $CLAUDE_PLUGIN_DATA points to the plugin installation directory (~/.claude/plugin-data/<plugin-name>/), so maybe something like ${CLAUDE_PLUGIN_DATA}/<slug>-<date>-coverage.md? Thoughts?

SaintPatrck · 2026-06-29T18:14:03Z

+Write to `test-engineer-report-<slug>-<date>/coverage.md`:
+
+- `<slug>` — a kebab-case slug for the change (from the ticket key, PR number, or feature name).
+- `<date>` — today's date as `YYYY-MM-DD`, **supplied by the caller** (skills can't read the clock).


❓ Where does the assertion that "skills can't read the clock" come from? I use date format in slugs for generated files and regularly see Claude add date and time stamps without issue. My gut says this is not accurate and should be removed, but worth double checking.

good catch! fixed

SaintPatrck · 2026-06-29T21:02:46Z

+   are scope-only and any coverage is prospective. Surface this in the report's Evidence.
+5. **Preferred path.** The `researching-jira-issues` skill (preferred at the top of this file) does
+   this hierarchical discovery and depth-controlled traversal in one synthesized read — run it on the
+   epic key; the direct MCP calls above are the fallback.


🎨 This is re-iterating the steps defined in researching-jira-issues, and we already noted that the Skill and MCP server are bundled together. What do you think about rephrasing it to something along the lines of...

### Epic intake When `issuetype` is `Epic` or `Feature`, the testable behaviors live on the children and their PRs, not on the epic body. Run `Skill(bitwarden-atlassian-tools:researching-jira-issues)` on the epic key; it does child discovery, depth-controlled fan-out, and PR-link traversal in one read. The direct MCP calls (`get_issue` per child, `get_issue_remote_links` for PRs) serve targeted lookups when you do not need the full synthesized read. Carry three things forward that are specific to this analysis: - **Source key per behavior.** A behavior from a child links to that child, not the epic. - **PRs are the coverage backbone.** Each child's linked PRs carry the tests that shipped, permalink-ready via the head SHA. If `gh` cannot reach one, record it as evidence-not-inspected rather than dropping it. - **Epic status bounds expectations.** `Done` children likely have tests-in-PR to audit; `To Do` children are scope-only. Note it in Evidence.

theMickster · 2026-06-30T15:07:04Z

♻️ A local multi-agent code review raised the following finding.

16 of 21 added cspell words are dead entries — present in no shipped file

.cspell.json
Caught by: Code quality agent / Architecture agent
Surfaced by: both (opus + sonnet)
cmon-bro verdict: CONFIRMED — 5 words present in new files, 16 with 0 hits in new files and 0 on main

Details

This PR adds 21 words to the repo-wide cspell dictionary; only 5 (mockall, Robolectric, stylesheet, unfound, unlinkable) appear in any shipped file. The other 16 appear in no shipped file and — verified by whole-word grep across main — in no pre-existing repo file either:

actioned, automatable, Consolas, detekt, getline, inlines, ktlint, Menlo, nextest, SDET, Segoe, subdirs, tablist, tabpanel, tnum, XCUI.

These are residue from the larger PR #150 (Android/iOS/Rust tooling and UI/CSS terms from dropped content); the dictionary was not slimmed alongside the files. Repo CLAUDE.md guidance is to add domain terms only when they trip cspell on shipped content. Dead entries pollute the shared allowlist and silently accept future misspellings that collide with these tokens.

Suggested fix: drop the 16 unused words; keep only the 5 the shipped files use.

Co-authored-by: Patrick Honkonen <1883101+SaintPatrck@users.noreply.github.com>

…erage' into test-engineer-plugin-current-coverage

nthompson-bitwarden added 14 commits June 16, 2026 10:39

add test-engineer-plugin with test stack analyst skills

25eea80

Merge branch 'main' into test-engineer-plugin

bebf863

split stack analysis from coverage report

471fd43

html report changes, defer to current test stack shape rather than fo…

be2db8d

…rcing trophy

clean up consistency issues

736b8a5

deduplication pass

0ca9c42

refine scopes, multiple debloat passes

f605de3

Merge branch 'main' into test-engineer-plugin

29ef658

update atlassian tool mcp refs

94d5698

make examples more generic

3e4a51a

update README.md

1fbffb8

resolve should-fix review comments

6336832

Merge branch 'main' into test-engineer-plugin

4c946b8

# Conflicts: # .cspell.json

focus and slim down test engineer plugin

8c66434

theMickster requested changes Jun 26, 2026

View reviewed changes

resolve PR comments

8edd3dc

nthompson-bitwarden mentioned this pull request Jun 27, 2026

Test engineer plugin #150

Closed

remove confluence dep for tech breakdowns, other cleanup

556e31e

SaintPatrck requested changes Jun 30, 2026

View reviewed changes

theMickster reviewed Jun 30, 2026

View reviewed changes

nthompson-bitwarden and others added 9 commits June 30, 2026 11:54

update .cspell.json, add hints to assesment SKILL.md

ed6676c

add evals, genericize plugin/skill descriptions

4c433b1

eval updates

c3b569d

resolving review comments

56d8d2d

Merge branch 'main' into test-engineer-plugin-current-coverage

ceccc1d

rename plugin to Bitwarden-Test-Toolkit

5452777

minor cleanup

df0d8f3

Apply suggestion from @SaintPatrck

20b39cf

Co-authored-by: Patrick Honkonen <1883101+SaintPatrck@users.noreply.github.com>

Apply suggestion from @SaintPatrck

45f00dd

Co-authored-by: Patrick Honkonen <1883101+SaintPatrck@users.noreply.github.com>

nthompson-bitwarden added 4 commits July 2, 2026 09:54

update description in README.md

0b50900

Merge remote-tracking branch 'origin/test-engineer-plugin-current-cov…

d5064ff

…erage' into test-engineer-plugin-current-coverage

depend only on researching-jira-issues skill

c28b2f0

clean up

ee38b4d

nthompson-bitwarden requested review from SaintPatrck and theMickster July 2, 2026 18:46

	"description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report. Designed to grow additional testing capabilities over time."
	"description": "Test engineering toolkit for Bitwarden. Inventories the test coverage a change already has — finding existing tests across the server, client, and mobile repos, bucketing each by layer (unit, integration, E2E), citing it as a stable GitHub permalink, and flagging untested behaviors as gaps in a markdown report."


		## Output file

		Write to `test-engineer-report-<slug>-<date>/coverage.md`:

Uh oh!

Conversation

nthompson-bitwarden commented Jun 25, 2026

🎟️ Tracking

📔 Objective

Uh oh!

github-actions Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Plugin Validation Summary — bitwarden-test-toolkit

1. Plugin Structure (plugin-validator) — ✅ PASS

2. Skill Review (skill-reviewer) — ✅ PASS

3. Security Validation (reviewing-claude-config) — ✅ PASS

Minor Observations (optional — non-blocking)

Positive Highlights

Uh oh!

theMickster left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

SaintPatrck left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

16 of 21 added cspell words are dead entries — present in no shipped file

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 25, 2026 •

edited

Loading

Plugin Validation Summary — `bitwarden-test-toolkit`