Skip to content

Initial version of playwright-testing plugin#140

Open
kdenney wants to merge 1 commit into
mainfrom
add/bitwarden-playwright-testing
Open

Initial version of playwright-testing plugin#140
kdenney wants to merge 1 commit into
mainfrom
add/bitwarden-playwright-testing

Conversation

@kdenney

@kdenney kdenney commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

🎟️ Tracking

N/A — new plugin contribution.

📔 Objective

Adds the bitwarden-playwright-testing plugin (v1.0.0): an automated end-to-end UI testing framework for Bitwarden web changes, driven by Playwright.

The plugin exposes a single user-facing skill, test-web-changes, that takes a Jira ticket ID, an implementation-plan file, or a free-form feature description and turns it into a complete Playwright test run — gathering context, exploring the affected codebases, building grounded test cases, verifying the local dev environment, executing the tests through the browser UI, and compiling an HTML report with full-page screenshots.

How it works

test-web-changes runs as a team lead over an eight-task pipeline, dispatching a seven-agent team and persisting each artifact to .playwright-testing-artifacts/<slug>/:

Agent Responsibility
context-gatherer Acquires feature source content and extracts structured context
code-explorer Explores affected repos → Application Context (changed files, routes, selectors, verification points)
service-mapper Maps changed paths to the local services that must be running
test-planner Builds grounded test cases via build-test-cases
service-manager Verifies Docker containers, /alive endpoints, and Angular bootstrap; halts on failure
test-runner Executes test cases through the playwright-cli agent with guardrails and screenshots
report-compiler Compiles the HTML report

Notable design points

  • Web-first policy — all setup and test actions go through the browser UI; no direct DB queries, out-of-band REST calls, or CLI shortcuts.
  • Verify-only environment handling — the plugin confirms the dev environment is ready but never starts, builds, or stops services; it halts with a hint if something's missing.
  • Self-host not supported — the plugin currently only supports testing cloud-hosted setups. Self-host support could be added as a fast follow.
  • Billing supportbuild-test-cases bakes Stripe test values into UI steps; a billing-related 400 during execution halts the run immediately.
  • Out of scope — browser extension, desktop, and CLI surfaces (no Playwright UI surface).

Prerequisites

  • The playwright-cli Claude Code plugin (render verification and browser execution depend on it).
  • A running Bitwarden local dev environment (containers + the relevant .NET/web services).

Notes for reviewers

  • Registered in marketplace.json and README.md; passes validate-marketplace.sh and validate-plugin-structure.sh (0 errors, 0 warnings).
  • New plugin only — no changes to existing plugins.

Screenshots

image image

@kdenney kdenney requested a review from a team as a code owner June 9, 2026 16:51
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Plugin Validation Report — bitwarden-playwright-testing

PR: #140 — Initial version of playwright-testing plugin
Plugin path: plugins/bitwarden-playwright-testing/
Version: 1.0.0 (initial release)

Overall Verdict

PASS with should-fix follow-ups. No critical security issues, no committed secrets, no broken structural requirements. One major security finding (overly broad curl allowlist on test-runner), one major structural finding (a skill that appears to be a slash command), and a cluster of major path-variable consistency issues (${CLAUDE_SKILL_DIR} vs ${CLAUDE_PLUGIN_ROOT}). All referenced files exist; manifest, marketplace entry, agent/skill frontmatter, and changelog are well-formed.

Component Inventory

  • 1 plugin manifest (.claude-plugin/plugin.json) — valid
  • 7 agents — all frontmatter valid, all marked user-invocable: false
  • 8 skills — all frontmatter valid
  • 3 shell scripts — safe variable handling, good hygiene
  • 1 playwright config, 1 HTML template, 5 reference docs, 1 tool-policy doc
  • 0 hooks, 0 commands, 0 MCP servers
  • README.md + CHANGELOG.md present; marketplace entry consistent at 1.0.0

Critical Issues (0)

None. The plugin can ship as-is from a structural and security standpoint.


Major Issues (must / should fix)

1. Bash(curl:*) allowlist is overly broad on test-runner agent

File: plugins/bitwarden-playwright-testing/agents/test-runner/AGENT.md:11

The pattern Bash(curl:*) permits arbitrary HTTP/S requests to any host. The tool-policy doc only legitimizes curl for Category 3 external-trigger simulation against localhost endpoints (e.g., POST http://localhost:33656/accounts/trial/send-verification-email). Because the upstream context-gatherer ingests Jira tickets verbatim, a malicious ticket body could in theory steer the downstream test-runner to exfiltrate via curl. Enforcement currently relies on prompt instructions only.

Remediation: Narrow to localhost-only patterns, e.g. Bash(curl * http://localhost:*) and Bash(curl * https://localhost:*). Or scope to the single documented endpoint.


2. test-web-changes skill looks like a misclassified slash command

File: plugins/bitwarden-playwright-testing/skills/test-web-changes/SKILL.md (entire file)

Evidence it's a slash command, not a skill:

  • Frontmatter declares argument-hint and allowed-tools (command-only fields).
  • Description: "all in a single command" and "Add --confirm to pause…".
  • File begins with "Step 0 — Parse input" parsing argv-style arguments.
  • Body is imperative orchestration logic dispatching teammates (not declarative know-how).

Remediation: Move to commands/test-web-changes.md (create the commands/ directory). Skills describe declarative behavior; commands implement workflows invoked by /test-web-changes.


3. Inconsistent path variable: ${CLAUDE_SKILL_DIR} used throughout

Files (with line numbers):

  • skills/build-test-cases/SKILL.md:104
  • skills/compiling-test-report/SKILL.md:11, 74
  • skills/determining-required-services/SKILL.md:5
  • skills/exploring-application-context/SKILL.md:14
  • skills/reading-mailcatcher-api/SKILL.md:13, 145
  • skills/verifying-environment-health/SKILL.md:21, 31

${CLAUDE_PLUGIN_ROOT} is the documented and marketplace-standard plugin path variable. ${CLAUDE_SKILL_DIR} is not part of the published plugin spec and is not used by any other plugin in this marketplace. Several files mix the two (e.g. build-test-cases/SKILL.md uses both in adjacent paragraphs).

Remediation: Replace every occurrence with ${CLAUDE_PLUGIN_ROOT}/skills/<skill-name>/.... Single-pattern find/replace across 7 files.


4. reading-mailcatcher-api skill description issues

File: plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/SKILL.md:3 (and frontmatter)

  • Description is 723 chars (exceeds 500-char soft cap) and contains two near-duplicate sentences ("Use this skill whenever you need to read…" and "Invoke whenever a workflow needs to read…").
  • Uses second-person form ("Use this skill whenever you need to…") instead of recommended third-person ("This skill should be used when…").
  • Frontmatter declares argument-hint and allowed-tools (command-only fields). Either remove or expose this functionality via a slash command wrapper.

Remediation: Rewrite to ~350 chars, third-person, no duplicate sentences. Remove command-only frontmatter or convert to a slash command.


5. Other skill descriptions use second-person form

Files:

  • skills/build-test-cases/SKILL.md:3 — "Use when you have plan context…"
  • skills/test-web-changes/SKILL.md:3 — "Use when you want to plan and run UI tests…"

Remediation: Rewrite to third-person: "This skill should be used when…". Add explicit user trigger phrases.


Minor Issues / Warnings

Path/permission scoping

  • agents/test-runner/AGENT.md:11Bash(ls *) is wider than needed; only used to list a screenshot artifacts dir. Consider scoping to the artifacts path.
  • agents/service-manager/AGENT.md:10 and agents/test-runner/AGENT.md:11 — Use leading */bitwarden-playwright-testing/... wildcards rather than ${CLAUDE_PLUGIN_ROOT}/.... Functional but less explicit. The leading-wildcard pattern is acceptable since the plugin install path varies per user.
  • skills/test-web-changes/SKILL.md:5allowed-tools: [Read, Write, Bash] includes Bash, but the body delegates all bash to sub-agents. Consider removing Bash.

Frontmatter consistency

  • All 7 agents lack <example>...</example> blocks in their description fields. Acceptable because every agent declares user-invocable: false; example blocks are only required for user-invocable agents. Worth noting if any agent is ever made user-invocable.
  • agents/code-explorer/AGENT.md:8 uses color: orange and agents/service-manager/AGENT.md:10 uses color: purple. Not on Claude Code's documented canonical color list (blue/cyan/green/yellow/magenta/red), but consistent with other plugins in this marketplace. Not a blocker.

Skill content/structure

  • skills/determining-required-services/SKILL.md — 368 words; bulk of guidance is in references/services.md (acceptable progressive disclosure, but a brief overview of what services.md contains would help).
  • skills/compiling-test-report/SKILL.md — 586 words; could include a brief overview of the report sections produced.
  • skills/executing-web-tests/SKILL.md (2,017 words) and skills/exploring-application-context/SKILL.md (2,129 words) — Several dense rule sets (adaptive assertion evaluation, validity gates, output schemas, screenshot policy) could be extracted to references/*.md for leaner core skill files.
  • skills/reading-mailcatcher-api/SKILL.md — The ### Step 1### Step 4 manual sections duplicate logic already in scripts/read-mailcatcher.sh. Consider extracting to references/manual-curl-procedure.md and elevating the "ALWAYS ask the user before running" warning (currently buried at line 153).
  • skills/test-web-changes/SKILL.md — Inconsistent step numbering: "Step 0/1" then "Task 1–8".
  • Many skill descriptions lack explicit user-trigger phrases users might naturally type ("which services do I need", "verify environment", "compile test report").

Shell scripts

  • skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:47, 72, 742>/dev/null masks curl stderr, reducing diagnostics. Consider preserving stderr or distinguishing unreachable-endpoint from no-match.
  • All three shell scripts use set -u but not set -e -o pipefail. Adding pipefail would surface upstream pipeline failures. Low priority.
  • All scripts have good shell hygiene: variables properly double-quoted, read-mailcatcher.sh passes user input to Python via env vars (not string interpolation), health-check.sh validates service-name argv against a fixed case allowlist. No command injection or unquoted-variable vulnerabilities.

Documentation / linkage

  • README.md:11 — Links playwright-cli to https://github.com/microsoft/playwright-cli. That repo is archived/legacy; current upstream is microsoft/playwright. If this is meant to be a Claude Code plugin dependency, the link should point to the marketplace listing instead. Not a security issue; user-confusion risk.
  • skills/exploring-application-context/SKILL.md:69 — Instructs the agent to substitute admin email from server/dev/secrets.json into flow artifacts. The resolved email will appear in .playwright-testing-artifacts/<slug>/ files. Recommend documenting that artifacts are dev-only and shouldn't be committed/shared, and consider adding .playwright-testing-artifacts/ to a recommended .gitignore snippet.

Test fixtures (informational — intentional, no action needed)

All flagged per review instructions:

  • Stripe public test card numbers (4242 4242 4242 4242, 5555 5555 5555 4444, 4000 0000 0000 0002) in skills/build-test-cases/references/billing-test-data.md:17-22, skills/exploring-application-context/references/known-flows.md (lines 185–209, 291–294), and skills/build-test-cases/SKILL.md:75. All are Stripe-documented public test fixtures, not real PANs.
  • Test expiry 12/29, CVC 123, postal 12345 — same files. Safe test fixtures.
  • Test-user email format testuser-s<N>-<YYYYMMDDHHMMSS>@example.com (skills/build-test-cases/SKILL.md:30). Uses RFC 2606 reserved domain. Safe.
  • playwright.config.json:4 sets ignoreHTTPSErrors: true — appropriate for local dev with self-signed certs; documented in README.

Positive Findings

  • No committed secrets, credentials, JWTs, or API keys anywhere in the plugin (prefix-scanned for AKIA, sk_live_, sk_test_, pk_live_, ghp_, xoxb-, AIza, eyJ…).
  • No settings.local.json, .env*, or other sensitive committed files.
  • No dangerous shell patterns (eval of untrusted input, curl | sh, unbounded rm -rf, sudo, chmod 777, --privileged).
  • Excellent least-privilege tool design — most agents restricted to Read, Skill; bash patterns are mostly path-scoped.
  • Safe shell scripting — proper quoting, env-var passing of user input to Python, allowlisted service names via case statement.
  • Clean version disciplineplugin.json, marketplace.json, agent version: frontmatter, and CHANGELOG.md all aligned at 1.0.0.
  • Comprehensive README with overview, prerequisites, installation, usage, agent/skill tables, web-first policy, billing notes, out-of-scope section, and plugin structure tree.
  • Strong tool policy document (references/tool-policy.md) with four explicit categories and clear web-first assertion mandate.
  • Stripe writes blocked (except clock advancement); no direct DB access; billing-blocker halt policy.
  • Plugin manifest correct and complete with all required fields, semver-valid version, and an explicit agents array enumerating all 7 agents (all on-disk).
  • All referenced files exist — no broken ${CLAUDE_PLUGIN_ROOT} / ${CLAUDE_SKILL_DIR} / templates / scripts / reference paths.
  • Marketplace entry at .claude-plugin/marketplace.json is present and consistent with plugin manifest (name, source path, version, description).

Priority Recommendations (in order)

  1. (Major, security) Narrow Bash(curl:*) on test-runner agent to localhost-only host patterns.
  2. (Major, structural) Move skills/test-web-changes/SKILL.md to commands/test-web-changes.md. Create a commands/ directory.
  3. (Major, consistency) Replace all ${CLAUDE_SKILL_DIR} references with ${CLAUDE_PLUGIN_ROOT}/skills/<skill-name>/... across 7 SKILL.md files.
  4. (Major, quality) Rewrite reading-mailcatcher-api description: trim to ~350 chars, third-person, no duplicate sentence. Remove command-only argument-hint/allowed-tools frontmatter.
  5. (Major, quality) Convert second-person descriptions in build-test-cases and test-web-changes to third-person.
  6. (Minor, polish) Verify and update the playwright-cli link in README.
  7. (Minor, polish) Document that .playwright-testing-artifacts/ may contain dev admin emails and shouldn't be committed.
  8. (Minor, polish) Consider extracting dense rule sets in executing-web-tests and exploring-application-context to references/.
  9. (Minor, polish) Add set -o pipefail to the three shell scripts.
  10. (Minor, polish) Remove Bash from test-web-changes/SKILL.md allowed-tools if orchestration only writes artifacts; tighten Bash(ls *) on test-runner.

Severity Summary

Severity Count
Critical 0
Major 5 (1 security, 1 structural, 1 consistency cluster, 2 description quality)
Minor ~14 (scoping, content, polish, fixtures-informational)

Conclusion: No blockers. Solid initial release with strong security guardrails and comprehensive documentation. Major items should be addressed before or shortly after merge; minor items are polish.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

🤖 Bitwarden Claude Code Review

Overall Assessment: APPROVE

Reviewed the initial v1.0.0 contribution of the bitwarden-playwright-testing plugin — a new, additive plugin with 30 files including 7 agents, 8 skills, 3 shell scripts, an HTML report template, README, CHANGELOG, and the plugin manifest. The plugin is registered correctly in .claude-plugin/marketplace.json and the root README.md table. The orchestration skill (test-web-changes), agent allowlists, and four-category tool policy are internally consistent, and the included shell scripts (preflight-check.sh, health-check.sh, read-mailcatcher.sh) handle untrusted input safely — service-name whitelist validation in the health check, env-var-passed args into the embedded Python in the mailcatcher script, and hardcoded grep patterns in preflight. No new dependencies are introduced (no manifest files in the diff), so no dependency approval flow is required.

Code Review Details

No findings to report.

@theMickster

Copy link
Copy Markdown
Contributor

Howdy 👋🏼 @kdenney,

I ran the performing-multi-agent-code-review against the PR using Opus and Sonnet. The results are below. Kindly review them along with the plugin validation on the PR (frankly they are walls of text that Claude should be able to decipher).

Overall, I think we have a strong start to a plugin. I like how focused the agents are written (zero fluff ✅). The skills themselves have clear purposes. I do think that these two skills: build-test-cases/SKILL.md and executing-web-tests/SKILL.md are too verbose and overly wordy. Please consider leveraging the /skill-creator:skill-creator with a strict prompt to reduce any unnecessary verbiage.

Please also consider breaking down exploring-application-context/references/known-flows.md into multiple documents. They will be must easier to maintain and expand by all teams moving forward (I see big things possible with these!).


Multi agent code reviews

Opus

Expand # Code Review: Initial version of playwright-testing plugin (#140)

Date: 2026-06-26 | Reviewed by: Claude Code | Model: opus

Summary

Severity Count
🛑 Blocker 0
⚠️ Important 6
♻️ Refactor 1

The new bitwarden-playwright-testing plugin is well-designed and architecturally consistent with repo patterns — the 7-agent pipeline, skill structure, and plugin manifest all follow established conventions. Three CI lint failures must be resolved before merge: the cspell dictionary is missing four domain terms the new files introduce, the README table row needs Prettier reformatting, and the HTML report template's angle-bracket placeholders cause a hard Prettier parse error. Two logic bugs in read-mailcatcher.sh will produce misleading diagnostics when Mailcatcher is unreachable or when a URL filter mismatches. The test-runner agent's unrestricted Bash(curl:*) grant creates an SSRF/exfiltration surface reachable via crafted Jira tickets or feature descriptions, and a related policy/permission conflict leaves the Mailcatcher delete-all guard unenforced at the tool level.

Findings

⚠️ Important

New plugin's domain terms missing from .cspell.json; cspell lint fails

.cspell.json:4
Caught by: Architecture agent

Details

The repo CLAUDE.md requires "Add domain-specific terms to .cspell.json" when adding a new plugin, and CI enforces pnpm run lint:spelling (cspell lint ... '**') via lint.yml. This PR adds a new plugin but does not update .cspell.json. Running cspell against the branch flags these unknown words across the new plugin files: mailcatcher / Mailcatcher / MAILCATCHER, bitwardenserver, testuser, and toastr. (playwright, azurite, and preflight already pass.)

Fix: Add the missing terms to the words array in .cspell.json. cspell matching is case-insensitive, so a single mailcatcher entry covers all casings. For deliberate test-fixture strings like testuser, consider inline // cspell:ignore testuser comments per CLAUDE.md guidance instead of polluting the dictionary. This is a formatting/config change — no version bump required.


New marketplace table row breaks Prettier formatting; lint:prettier fails

README.md:18
Caught by: Architecture agent

Details

The bitwarden-playwright-testing row added to the root README.md plugin table uses different column padding than its sibling rows — the Plugin cell is wider and the Version/Description cells are under-padded. The repo enforces pnpm run lint:prettier (prettier --check .) in CI via lint.yml. Prettier flags README.md with a [warn] and rewrites the entire table to realign every column to the new longest cell width, causing the CI check to fail.

Fix: Run pnpm run format (or prettier --write README.md) to realign the table, then commit the result. Formatting-only — no version bump required per CLAUDE.md.


Angle-bracket placeholders make report-template.html unparseable by Prettier (CI error)

plugins/bitwarden-playwright-testing/skills/compiling-test-report/templates/report-template.html:45
Caught by: Architecture agent

Details

report-template.html uses angle-bracket placeholder tokens inside HTML element content — e.g., <h1>Web Test Report: <Plan Name></h1> (line 45), <YYYY-MM-DD HH:mm> (line 47). Prettier's HTML parser treats <Plan Name> as a malformed tag and throws a hard SyntaxError: Unexpected closing tag "h1" at 45:33. This is a Prettier [error] (not a [warn]): prettier --check . fails outright on this file, breaking CI lint:prettier independently of the README issue.

Fix: Replace angle-bracket placeholders with non-tag-like syntax (e.g. {{PLAN_NAME}}, {{DATE}}) that Prettier's HTML parser can handle, or add the file to .prettierignore if angle-bracket tokens are intentional. The compiling-test-report SKILL.md's fill-in instructions (lines 11 and 74) must be updated to match whichever placeholder syntax is chosen.


Mailcatcher connectivity failure is silently reported as NO_MATCH

plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:39
Caught by: Code quality agent

Details

find_message_id pipes curl -fsS "$MAILCATCHER_URL/messages" directly into python3. When Mailcatcher is unreachable (daemon down, wrong port, network error), curl -fsS fails and emits nothing on stdout. The Python script reads empty stdin, json.load raises, and the inline script does sys.exit(2) — but that exit code is discarded by the command substitution. Back in attempt(), the empty $id is indistinguishable from "the inbox is empty," so the script retries once and then exits 1 with NO_MATCH: no email for recipient '<x>'.

The script's documented contract says exit 1 = "the message wasn't found." A transport failure being reported as NO_MATCH misleads the test-runner and the operator: they will hunt for a missing verification email when the real problem is that Mailcatcher isn't running or MAILCATCHER_URL is misconfigured.

Fix: Capture the messages payload into a variable and check curl's exit status before piping to Python, emitting a distinct diagnostic and distinct exit code for an unreachable endpoint:

find_message_id() {
  local json
  if ! json="$(curl -fsS "$MAILCATCHER_URL/messages" 2>/dev/null)"; then
    echo "ERROR: Mailcatcher unreachable at $MAILCATCHER_URL" >&2
    return 3
  fi
  printf '%s' "$json" | RECIPIENT="$RECIPIENT" PATTERN="$PATTERN" python3 -c '...'
}

The caller (attempt() and the top-level driver) should distinguish exit code 3 from 1 so the diagnostic names the right problem.


Final NO_MATCH diagnostic wrongly claims "no email" when email was found but URL filter missed

plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:103
Caught by: Bug analysis agent

Details

attempt() returns exit code 2 specifically for the case "message matched the recipient/subject but its body contained no URL matching $LINK_FILTER", and prints an accurate diagnostic: NO_MATCH: message $id matched but contained no URL filtered by '$LINK_FILTER'.

The outer driver treats exit 2 identically to exit 1:

if attempt; then exit 0; fi
sleep 3
if attempt; then exit 0; fi

if [ -n "$PATTERN" ]; then
  echo "NO_MATCH: no email for recipient '$RECIPIENT' with subject containing '$PATTERN'" >&2
else
  echo "NO_MATCH: no email for recipient '$RECIPIENT'" >&2
fi
exit 1

Two problems on the exit-2 path:

  1. The retry is useless. When a message is found but has no filter-matching URL, the retry re-fetches the same message body (it doesn't change between attempts), so extract_url fails identically.
  2. The final diagnostic is wrong. After both attempts, the script unconditionally prints NO_MATCH: no email for recipient..., which contradicts the true cause — an email was found; only the URL filter failed to match. The accurate stderr line from attempt() is emitted earlier but is then overridden in the operator's view by the misleading final line. The test-runner keys off this diagnostic to decide why email verification failed.

Fix: Capture attempt's return code and, on exit code 2, exit immediately with the specific "filter miss" message rather than retrying and printing the generic "no email" line:

attempt; rc=$?
[ "$rc" -eq 0 ] && exit 0
[ "$rc" -eq 2 ] && exit 1  # message found, no matching URL — already reported, don't retry
sleep 3
attempt; rc=$?
[ "$rc" -eq 0 ] && exit 0
[ "$rc" -eq 2 ] && exit 1
# only the generic "no email" message when rc was 1 (truly no message found)

Unrestricted Bash(curl:*) grant lets untrusted test-plan content drive arbitrary requests

plugins/bitwarden-playwright-testing/agents/test-runner/AGENT.md:11
Caught by: Security & logic agent

Details

The test-runner agent grants Bash(curl:*) — any URL, method, header, and flag, with no host or scheme restriction. The plugin's references/tool-policy.md restricts curl to Category 3 local external-trigger simulation against localhost Bitwarden services, but that restriction is prose only; the enforced permission is a wildcard.

Data flow (source → sink): test-web-changes accepts a Jira ticket ID or free-form feature description (untrusted input). context-gatherer returns the complete raw source content verbatim, the team lead persists it verbatim, test-planner turns it into a test plan containing EXTERNAL TRIGGER steps (literal curl ... <endpoint> commands templated from plan parameters such as email). test-runner reads and executes those steps. No sanitization or host allowlist exists at any pipeline boundary. The default pipeline runs end-to-end with no plan-review gate — --confirm is opt-in.

Why it matters (CWE-918, CWE-77): A crafted Jira ticket or feature description can introduce an EXTERNAL TRIGGER step pointing curl at an arbitrary destination — a cloud metadata endpoint (169.254.169.254), an internal-network service, or an attacker-controlled host that receives exfiltrated values (magic-link tokens, email addresses, Stripe CLI credentials) captured earlier in the run. The AGENT.md compounds the risk by instructing the agent to run allowlisted tools "inline as an ordinary test step — never as an obstacle and never as a pause point" and "Do not request permission to use tools outside it", which actively suppresses the consent prompt that would otherwise surface an anomalous curl target.

Fix: Replace the wildcard with host/scheme-scoped grants matching the documented Category 3 policy. Enumerate the permitted local endpoints or introduce a wrapper script with an internal allowlist — the same pattern used to constrain the mailcatcher reader to a single script path (Bash(*/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh *)). Prose policy in tool-policy.md cannot substitute for permission-level enforcement.


♻️ Refactor

Documentation-only guard on irreversible Mailcatcher delete-all is not enforced by curl grant

plugins/bitwarden-playwright-testing/agents/test-runner/AGENT.md:11
Caught by: Security & logic agent

Details

skills/reading-mailcatcher-api/SKILL.md warns "ALWAYS ask the user before running this command" for curl -X DELETE http://localhost:1080/messages, calling it irreversible and destructive of earlier test evidence. That guard is prose only. The test-runner agent holds Bash(curl:*) (same wildcard as sec-1), which permits the DELETE with no tool-level confirmation, and AGENT.md instructs the agent to execute allowlisted curl "never as a pause point". The advisory gate and the operational instruction directly conflict; the enforced permission sides with the destructive path.

This is a concrete, in-tree policy/configuration inconsistency: the documented safety rule and the agent's operating instruction are incompatible, and any operator or future reviewer who trusts the SKILL.md warning will be misled about how the agent actually behaves.

Fix: Scope the curl grant so destructive Mailcatcher operations are not silently permitted — for example, allow only read endpoints via path-restricted grants, or route all Mailcatcher access through the existing read-only read-mailcatcher.sh script (which performs no writes). This is already the right pattern for mailcatcher reads; applying it to the full Mailcatcher surface removes the conflict. The SKILL.md advisory would then accurately reflect the enforced behavior.


Sonnet

Expand # Code Review: Initial version of playwright-testing plugin (#140)

Date: 2026-06-26 | Reviewed by: Claude Code | Model: sonnet

Summary

Severity Count
🛑 Blocker 0
⚠️ Important 8
♻️ Refactor 8

The plugin is architecturally solid — the agent pipeline design is well-reasoned and the tool policy reflects careful thought about test boundaries. Several real bugs in the orchestration layer would cause silent failures in practice: the test plan composition can race ahead of the services artifact (Task 5 missing a blockedBy), checkpoint appending uses a write-only tool that silently drops prior segments on multi-pause runs, and the test-runner agent has a self-contradicting terminal-shape claim. The mailcatcher script also has a diagnostic-masking issue and a retry-on-wrong-condition bug that would confuse developers troubleshooting failures. A concrete prompt injection path from Jira ticket comments to the test-runner's unrestricted curl allowlist warrants addressing before the plugin is used with sensitive Jira content.

Findings

⚠️ Important

Task 5 blockedBy annotation missing Task 3 — orchestrator may start before services artifact exists

plugins/bitwarden-playwright-testing/skills/test-web-changes/SKILL.md:104
Caught by: Bug analysis agent

Details

Task 5 (Compose test plan) is annotated *(blockedBy: Task 4)*, but it reads both services-<timestamp>.md (produced by Task 3) and test-cases-<timestamp>.md (produced by Task 4). Tasks 3 and 4 are both annotated *(blockedBy: Task 2)* with no declared dependency between them — the standard parallelizability signal. An AI orchestrator following the blockedBy annotations could dispatch Task 5 as soon as Task 4 completes, without waiting for Task 3.

The Read of services-<timestamp>.md would return empty or fail, producing a test plan with no ## Required Services section. Task 6 (service-manager) passes those service names to health-check.sh; with an empty section, the script receives no arguments and exits 1 with a usage error, halting the pipeline.

Fix: Change the annotation to *(blockedBy: Tasks 3 and 4)*.

Checkpoint append on subsequent pauses uses Write tool, which overwrites — prior segments silently lost

plugins/bitwarden-playwright-testing/skills/test-web-changes/SKILL.md:198
Caught by: Bug analysis agent

Details

Task 7 instructs: "Subsequent pauses: open the file in append mode and add the chunk with a blank-line separator." The skill's allowed-tools is [Read, Write, Bash]. The Write tool only overwrites files — it has no append mode.

An LLM following these instructions for a second or later pause will call Write on checkpoint-<timestamp>.md, silently discarding all prior-pause segments. The subsequent merge assembles test-results-<timestamp>.md from all segments — with only the last segment present, all earlier completed test cases are missing, totals are wrong, and the HTML report is incomplete.

Bash is in the allowlist and supports echo ... >> file, but the instruction does not direct the LLM to use it.

Fix: Explicitly instruct the orchestrator to use Bash(echo "..." >> <path>) for subsequent-pause appends, or restructure to accumulate segments in memory and write the merged file once at the end.

test-runner Step 3 states response always ends with TEST RUN COMPLETE — immediately contradicted by the pause case

plugins/bitwarden-playwright-testing/agents/test-runner/AGENT.md:73
Caught by: Bug analysis agent

Details

Step 3 opens with a definitive terminal-shape claim:

"Your response begins with === TEST RUN RESULTS === and ends with === TEST RUN COMPLETE: N total, N passed, N passed (adaptive), N failed ===. This is the same shape for fresh and resumed runs."

The very next paragraph contradicts this:

"If executing-web-tests instead returned a partial response ending with Need user input:, return it verbatim..."

A pause response ends with Need user input:, not the complete marker. A model reading top-down encounters the definitive claim first, which may cause it to suppress the pause signal or try to wrap it in the complete-marker format. The Loop Invariant section (above Step 0) correctly covers both cases — Step 3's opening claim should be conditionalized to match.

Fix: Replace the unconditional claim with: "On a complete run, your response ends with === TEST RUN COMPLETE: ... ===. On a pause, return the partial block verbatim ending with Need user input:."

Retry fires when attempt() returns 2 (URL filter mismatch) — retry is useless and final diagnostic is wrong

plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:105
Caught by: Bug analysis agent

Details

attempt() returns 2 when a matching email was found but no URL in its body passed the --link-filter regex. The main retry logic treats if attempt; then exit 0; fi as false (any non-zero exit), so it sleeps 3 seconds and retries. The retry reads the same message and returns 2 again — the email is still there, the filter still fails, nothing has changed.

After both attempts, the script emits:

NO_MATCH: no email for recipient '...' with subject containing '...'

This is factually wrong — the email was found. The real diagnostic (URL filter mismatch) was already written to stderr by attempt() but is overwritten by the final message, misleading agents and developers reading test results.

Fix: Treat return 2 as a hard failure with no retry. After the first if attempt, check $?; if it is 2, exit 1 immediately without the sleep-and-retry.

attempt() swallows exit code 2 from find_message_id — Mailcatcher-down errors reported as "no email found"

plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:79
Caught by: Code quality agent

Details

find_message_id() exits with code 2 when json.load raises an exception — this happens when Mailcatcher is unreachable or returns non-JSON. attempt() captures its output via command substitution:

id="$(find_message_id)"

When find_message_id exits 2, it produces no stdout, so $id is empty. attempt() checks only [ -z "$id" ] and returns 1 (no-message-found), not 2 — losing the parse-error signal. The retry fires, then the final message says NO_MATCH: no email for recipient '...' with no hint that Mailcatcher is down.

A developer seeing NO_MATCH after preflight passed will spend time checking email patterns rather than checking whether Mailcatcher crashed after startup.

Fix: Explicitly capture the exit code from find_message_id and emit a distinct diagnostic when it is 2:

id="$(find_message_id 2>&1)"; mc_exit=$?
if [ "$mc_exit" -eq 2 ]; then
  echo "NO_MATCH: Mailcatcher unreachable or returned invalid JSON" >&2
  return 2
fi

Checkpoint merge instruction "discard intermediate SUMMARY: lines" is ambiguous — may discard all SUMMARY data

plugins/bitwarden-playwright-testing/skills/test-web-changes/SKILL.md:224
Caught by: Code quality agent

Details

Task 7's multi-pause assembly instructs:

"Discard all segment headers, all intermediate SUMMARY: lines, and all === PARTIAL RUN — PAUSED === markers."

Step 3 then says: "Sum the SUMMARY: counts across all segments to produce final totals."

The word "intermediate" is ambiguous — a language model may interpret it as "all SUMMARY: lines are intermediate," discarding every one. With no SUMMARY lines remaining, the model falls back to counting test case blocks, which may differ if any block lacked a clear status line. The step 3 instruction provides partial recovery, but the contradiction creates a real risk of wrong totals on multi-pause runs.

Fix: Clarify explicitly: "Discard the SUMMARY: lines from segments ending with === PARTIAL RUN — PAUSED ===. Keep the SUMMARY: from the final complete segment for the total count."

Preflight Mailcatcher container pattern -mail- matches any container with "mail" in its Compose name

plugins/bitwarden-playwright-testing/skills/verifying-environment-health/scripts/preflight-check.sh:35
Caught by: Code quality agent

Details

The preflight script checks for the Mailcatcher container using the Compose pattern -mail-:

"Mailcatcher email|-mail-|^mailcatcher-"

This matches any running Docker container whose Compose-assigned name contains -mail- — including mailhog, smtp-relay, or any other mail-related project running on the same Docker daemon. In a shared-daemon environment, a false positive here causes preflight to pass when the actual Bitwarden Mailcatcher is not started.

The test run then fails later with a cryptic NO_MATCH from the mailcatcher script, with no connection back to the wrong-container false positive in preflight.

Fix: Tighten the Compose pattern to -mailcatcher- or anchor to the known Bitwarden Compose project name: match on bitwardenserver-mail- to target bitwardenserver-mail-1 specifically.

Jira ticket content passed verbatim through artifact chain to test-runner — prompt injection path (CWE-1427)

plugins/bitwarden-playwright-testing/agents/context-gatherer/AGENT.md:61
Caught by: Security & logic agent

Details

context-gatherer fetches the full Jira synthesis — including all comments, linked issues, and Confluence pages — and writes it verbatim under ## Source Summary in context-<timestamp>.md. This artifact is read by code-explorer, test-planner, and ultimately flows to test-runner.

The test-runner agent carries the broadest tool allowlist in the pipeline:

tools: ..., Bash(curl:*), Bash(stripe get:*), Bash(stripe post .../advance:*), ...

A Jira ticket comment containing injected instructions would traverse the artifact chain and reach the test-runner with those tool permissions active. Unlike the untrusted-input boundary applied within a single conversation turn, this attack crosses agent turns via a written artifact file — the per-turn boundary does not protect across turns. Anyone with Jira comment access (internal staff, vendors, integrations) can inject. CWE-1427 (Improper Neutralization of Input Used in LLM Prompting).

Fix: (1) Limit ## Source Summary propagation — execution-phase agents (service-manager, test-runner) need only structured sections (Feature Description, Acceptance Criteria, Required Services), not the raw source. (2) Add an explicit untrusted-content preamble in every agent that reads context artifacts.


♻️ Refactor

All seven agents use namespace-qualified skill names in frontmatter — breaks convention used by every other plugin agent

plugins/bitwarden-playwright-testing/agents/context-gatherer/AGENT.md:6
Caught by: Architecture agent

Details

Every other agent in the repository uses bare (unqualified) skill names in the YAML skills: frontmatter field (e.g., avoiding-false-positives, triaging-security-findings). All seven new agents in this PR use namespace-qualified forms — bitwarden-playwright-testing:exploring-application-context, bitwarden-atlassian-tools:researching-jira-issues — diverging from the established convention.

The test-runner agent compounds this by mixing conventions: it lists the external playwright-cli skill without a namespace while listing internal skills with the bitwarden-playwright-testing: prefix.

Fix: Remove namespace prefixes from all skills: frontmatter entries across all seven agents, aligning with the bare-name convention used by every other plugin.

executing-web-tests and build-test-cases hardcode a cross-skill path into the mailcatcher skill's scripts dir

plugins/bitwarden-playwright-testing/skills/executing-web-tests/SKILL.md:160
Caught by: Architecture agent

Details

Both executing-web-tests/SKILL.md and build-test-cases/SKILL.md reference the mailcatcher script directly:

${CLAUDE_PLUGIN_ROOT}/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh

This couples two skills to the internal directory layout of a third (reading-mailcatcher-api). The path is duplicated across two non-contiguous locations. If reading-mailcatcher-api is relocated, both skills break silently — the path is embedded in prose, not in a config or shared reference.

Fix: Define the canonical script path once in references/tool-policy.md (which already governs tool usage) and reference it by that source in both skill documents.

health-check.sh passes curl -k globally — silently skips TLS verification for all services

plugins/bitwarden-playwright-testing/skills/verifying-environment-health/scripts/health-check.sh:72
Caught by: Code quality agent

Details

The health-check script applies curl -k to every service poll, disabling SSL certificate verification for all services including https://localhost:8080. Because ignoreHTTPSErrors: true is also set in playwright.config.json, no cert problem would be caught at either layer. The -k flag is undocumented, and its use alongside ignoreHTTPSErrors creates a consistent pattern of silent cert-bypass assumptions.

Fix: Add a comment documenting the intentional bypass: # -k required: Bitwarden dev certs are self-signed. If dev cert paths are available, consider --cacert for a narrower bypass.

extract_url HTML fallback silently returns truncated URL when the body has HTML-encoded characters

plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:69
Caught by: Code quality agent

Details

When the plain-text email body is empty, extract_url falls back to the .html body. The grep -oE 'https?://[^ >\"\\)]+ pattern may truncate URLs containing characters rendered as HTML entities in text nodes (e.g., &amp; query string separators). The result is returned with exit 0 — the caller navigates to a truncated URL, which fails with a missing-token error indistinguishable from a server rejection.

Fix: Emit a stderr warning when using the HTML fallback (echo "WARNING: using HTML body for URL extraction" >&2) so truncation failures can be traced. Consider using Python's html.parser (already available) for more robust extraction.

ignoreHTTPSErrors:true applies globally — no origin-level TLS enforcement for any navigation

plugins/bitwarden-playwright-testing/scripts/playwright.config.json:4
Caught by: Security & logic agent

Details

The ignoreHTTPSErrors: true context option suppresses certificate errors for every HTTPS URL Playwright navigates to during a test run, not only https://localhost:8080. For a dev-local-only plugin this is an acceptable ergonomics tradeoff, but the scope of the bypass is undocumented.

Fix: Add a comment to playwright.config.json noting that ignoreHTTPSErrors is intentional for Bitwarden's self-signed dev certs and scoped to localhost use only.

extract_url performs no host validation on extracted URLs — any matching URL drives browser navigation

plugins/bitwarden-playwright-testing/skills/reading-mailcatcher-api/scripts/read-mailcatcher.sh:69
Caught by: Security & logic agent

Details

extract_url returns the first URL in any email body matching the LINK_FILTER regex with no assertion that the host is localhost. In normal dev operation, the Bitwarden server is the only sender and all URLs will be localhost. The missing guard means a non-localhost URL matching the filter — even if present accidentally — would silently drive browser navigation with cert errors suppressed (ignoreHTTPSErrors: true).

Fix: Add a host guard before returning the URL:

if ! echo "$url" | grep -qE '^https?://localhost(:[0-9]+)?'; then
  echo "NO_MATCH: extracted URL '$url' is not a localhost URL" >&2
  return 2
fi

Admin email from server/dev/secrets.json flows into artifact files; no .gitignore covers the artifacts directory

plugins/bitwarden-playwright-testing/skills/exploring-application-context/SKILL.md:69
Caught by: Security & logic agent

Details

exploring-application-context reads server/dev/secrets.json to resolve <bitwarden-portal-admin-email> and substitutes it verbatim into the Application Context markdown, which flows into app-context-<timestamp>.md and test-plan-<timestamp>.md under .playwright-testing-artifacts/. No .gitignore covering .playwright-testing-artifacts/ is shipped with the plugin.

server/dev/secrets.json is gitignored because it holds dev credentials; this plugin creates unguarded derivatives of those credentials. A git add . stages the artifacts. CWE-312 (Cleartext Storage of Sensitive Information).

Fix: (1) Add a .gitignore to the plugin root or to .playwright-testing-artifacts/ excluding *.md and *.html artifacts. (2) Avoid writing the resolved admin email into persisted markdown — resolve it at execution time within the agent's ephemeral context only.

Bash(curl:*) tool allowlist is unrestricted — test-runner curl policy is natural-language only

plugins/bitwarden-playwright-testing/agents/test-runner/AGENT.md:11
Caught by: Security & logic agent

Details

The test-runner's tool allowlist declares Bash(curl:*) with no pattern restriction on destination, HTTP method, or headers. The tool policy in references/tool-policy.md restricts curl to Category 3 external-trigger simulation only, but this is a natural-language constraint — not enforced at the harness permission level. An LLM influenced by prompt-injected content (see Jira injection finding above) could issue any curl call, including to local Bitwarden APIs with session tokens.

Fix: Narrow the allowlist to the specific endpoint and method the tool policy authorizes, e.g.:

Bash(curl -s -X POST http://localhost:33656/accounts/trial/send-verification-email *)

Any broader curl call should require explicit user approval.


Reviewed and Dismissed

🔍 1 initial finding dismissed after validation

arch-2 file citation is wrong — cross-skill mailcatcher path is in executing-web-tests and build-test-cases, not verifying-environment-health

plugins/bitwarden-playwright-testing/skills/executing-web-tests/SKILL.md:160
Caught by: Validation agent (collateral)
Original severity: ♻️ Refactor
Original confidence: 100/100
Dismissed at: Step 5 severity audit
Dismissed because: Meta-correction of arch-2's file citation with no independent code finding — arch-2 already documents the coupling issue with the correct files named in its detail; this collateral adds no additional code issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants