Skip to content

feat: add qa-changes plugin for automated PR QA validation#135

Draft
xingyaoww wants to merge 28 commits intomainfrom
feat/qa-changes-plugin
Draft

feat: add qa-changes plugin for automated PR QA validation#135
xingyaoww wants to merge 28 commits intomainfrom
feat/qa-changes-plugin

Conversation

@xingyaoww
Copy link
Copy Markdown
Contributor

@xingyaoww xingyaoww commented Apr 2, 2026

Summary

Add a new qa-changes plugin that goes beyond code review by actually running the code to verify PR changes work as described. While the existing pr-review plugin reads diffs and posts inline code comments, this plugin sets up the environment, runs the test suite, exercises changed behavior, and posts a structured QA report.

Plugin Structure

skills/qa-changes/SKILL.md              # Generic QA methodology skill
plugins/qa-changes/
├── README.md                           # Plugin documentation
├── action.yml                          # Composite GitHub Action
├── scripts/
│   ├── agent_script.py                 # Main QA agent script
│   └── prompt.py                       # Prompt template
├── skills/
│   └── qa-changes -> ../../../skills/qa-changes
└── workflows/
    └── qa-changes-by-openhands.yml     # Example workflow

Four-Phase QA Methodology

The skill defines a generic, language-agnostic methodology:

  1. Understand — Read the diff, classify changes (new feature, bug fix, refactor, config/docs)
  2. Setup — Bootstrap the repo: install deps, build, establish test baseline
  3. Exercise — The core phase. Actually use the software: run CLI commands, make HTTP requests, open browsers. Go beyond "tests pass."
  4. Report — Post structured PR comment with evidence and verdict (PASS / PASS WITH ISSUES / FAIL)

How It Differs from PR Review

Aspect PR Review QA Changes
Method Reads the diff Runs the code
Speed 2-3 minutes 5-15 minutes
Catches Style, security, logic issues Regressions, broken features, build failures
Output Inline code comments Structured QA report with evidence

Usage

- name: Run QA Changes
  uses: OpenHands/extensions/plugins/qa-changes@main
  with:
    llm-model: anthropic/claude-sonnet-4-5-20250929
    llm-api-key: ${{ secrets.LLM_API_KEY }}
    github-token: ${{ secrets.GITHUB_TOKEN }}

Triggers: qa-this label or openhands-agent reviewer request.

Design Decisions

  • Generic skill: The SKILL.md is intentionally language/framework-agnostic. It teaches the agent how to think about QA, not specific commands. Project-specific details come from AGENTS.md or custom skills.
  • Structured from pr-review: The agent_script.py and action.yml follow the same patterns as pr-review for consistency, but the prompt and skill are completely different.
  • Security: Excludes FIRST_TIME_CONTRIBUTOR and NONE from automatic triggers since QA executes code.

Related

This PR was created by an AI assistant (OpenHands).

Add a new plugin that goes beyond code review by actually running
the code to verify PR changes work as described.

Plugin structure:
- skills/qa-changes/SKILL.md: Generic QA methodology skill
- plugins/qa-changes/action.yml: Composite GitHub Action
- plugins/qa-changes/scripts/agent_script.py: Main QA agent
- plugins/qa-changes/scripts/prompt.py: Prompt template
- plugins/qa-changes/workflows/: Example workflow file
- plugins/qa-changes/README.md: Documentation

The QA agent follows a five-phase methodology:
1. Understand the change (classify diff)
2. Set up the environment (install deps, build)
3. Run the test suite (establish baseline, detect regressions)
4. Exercise changed behavior (manually test features/fixes)
5. Report results (structured PR comment with verdict)

Co-authored-by: openhands <openhands@all-hands.dev>
…ful failure

Key changes to the QA skill:
- Merge Setup + Test into one phase; check CI status first, only run
  tests CI doesn't cover
- Raise the bar for Exercise phase: frontend changes must use a real
  browser (Playwright/browser automation), CLI changes must run the
  actual CLI, API changes must make real HTTP requests
- Add specific guidance per change type (frontend, CLI, API, bug fix,
  library, refactor, config)
- Add 'Knowing When to Give Up' section: three attempts per approach,
  two approaches max, then report honestly and suggest AGENTS.md guidance
- Add PARTIAL verdict for when some behavior could not be verified
- Update prompt, README to match new four-phase methodology

Co-authored-by: openhands <openhands@all-hands.dev>
OpenHands SDK performs best with tmux available for terminal management.

Co-authored-by: openhands <openhands@all-hands.dev>
Add .plugin/plugin.json manifest and update agent_script.py to load the
qa-changes plugin via the SDK's Plugin system. This properly loads skills,
hooks, and MCP config bundled in the plugin directory.

Previously the script only loaded project skills via load_project_skills()
and missed the plugin's own skills entirely.

See #136 for the same issue in the pr-review plugin.

Co-authored-by: openhands <openhands@all-hands.dev>
…arketplace entry

- Enable browser tools (enable_browser=True) so the QA agent can
  actually verify UI changes in a real browser, matching the SKILL.md
  methodology
- Switch workflow from pull_request_target to pull_request to avoid
  executing untrusted fork code with the base repo's secrets
- Isolate untrusted PR body in the prompt with an explicit warning
  to mitigate prompt injection
- Add qa-changes skill to marketplaces/default.json (required by CI)
- Add comments explaining tmux (OpenHands runtime) and gh dependencies
- Update README security section to reflect the pull_request change

Co-authored-by: openhands <openhands@all-hands.dev>
- Add max-budget ($10 default), timeout-minutes (30 default), and
  max-iterations (200 default) as action inputs
- Enforce budget via a Conversation callback that raises BudgetExceeded
  when accumulated LLM cost exceeds the limit
- Enforce timeout via GHA step-level timeout-minutes
- Enforce iteration cap via SDK's max_iteration_per_run parameter
- Pass all three values through as env vars (MAX_BUDGET, MAX_ITERATIONS)
  and document them
- Add tests for format_prompt, truncate_diff, and validate_environment
  (20 tests covering fields, edge cases, defaults, and custom overrides)
- Add missing skills/qa-changes/README.md (required by CI)
- Update README action inputs table with new parameters

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
…ll timeout

GitHub Actions does not support `timeout-minutes` on steps inside
composite actions (only at the job level). The `Set up job` step
fails with: 'Unexpected value timeout-minutes'.

Replace with the coreutils `timeout` command, passing the
`timeout-minutes` input via an environment variable and converting
to seconds in the shell.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Add lmnr-api-key input to action.yml with env var and --with lmnr
- Add Laminar trace artifact upload step
- Add save_trace_context() to agent_script.py for trace persistence
- Create evaluate_qa_changes.py for post-close evaluation
- Create qa-changes-evaluation.yml workflow template
- Update workflow template to pass lmnr-api-key

Co-authored-by: openhands <openhands@all-hands.dev>
Test extract_qa_report, extract_human_responses, truncate_text,
calculate_engagement_score, and load_trace_info.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Switch from posting QA results as a plain PR comment (gh pr comment)
to posting them as a GitHub code review thread using the
/github-pr-review skill. The agent now:
- Triggers both /qa-changes and /github-pr-review skills
- Posts a structured review body with the full QA report
- Adds inline review comments on specific lines for issues found
- Uses priority labels (🔴🟠🟡🟢) from the github-pr-review skill
- Bundles everything into a single review API call

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Add symlink skills/github-pr-review -> ../../../skills/github-pr-review
so the skill is explicitly available to the agent, matching the pattern
used by the pr-review plugin.

Co-authored-by: openhands <openhands@all-hands.dev>
Update the QA skill and prompt to produce more scannable reports:

- Verdict + one-sentence summary at the top for instant readability
- Status table gives at-a-glance phase results
- All evidence (code snippets, logs, command output) goes inside
  HTML <details> collapsible blocks
- Explicit formatting rules: no repetition across sections, omit
  empty sections, issues always visible
- Prompt reinforces compact format and collapsible evidence

Motivated by verbose QA reports on PRs like #2798 in software-agent-sdk
where long inline evidence made the report hard to scan.

Co-authored-by: openhands <openhands@all-hands.dev>
Update the prompt and skill to prioritize answering the core question:
'does this PR actually fix the original issue?' rather than just
reporting that tests pass.

Changes:
- Phase 1 (Understand): emphasize identifying the original problem
  from PR description, linked issues, and title
- Phase 3 (Exercise): start by verifying the PR addresses the stated
  issue with concrete examples of what that means for each PR type
- Report template: add 'Does this PR fix the original issue?' section
  with direct answer and evidence
- Key Principles: add principle that answering the core question is
  the primary deliverable
- Prompt: add explicit instruction that the #1 job is to answer
  whether the PR fixes the original issue

Co-authored-by: openhands <openhands@all-hands.dev>
The previous wording assumed PRs always fix issues, but PRs can also
add new features, refactor code, improve performance, update docs, etc.
Reframe to the broader question: "does this PR achieve what it set out
to do?" This covers all PR types without being narrowly issue-focused.

Co-authored-by: openhands <openhands@all-hands.dev>
Update the QA agent prompt and skill to require structured before/after
evidence inside the collapsible Functional Verification section:

1. Reproduce the problem / establish baseline (without the fix)
2. Interpret what the baseline output means
3. Apply the PR's changes
4. Re-run the same verification with the fix in place
5. Interpret what the new output means

This ensures QA reports show concrete proof that the agent actually ran
code to verify fixes, rather than just describing what the diff does.

Changes:
- SKILL.md Phase 3: Expanded bug fix section into 6 explicit steps
- SKILL.md Phase 3: Added general before/after narrative requirement
- SKILL.md Phase 4: Replaced vague collapsible template with structured
  3-step before/after format (reproduce → apply → verify)
- prompt.py: Added 5-step before/after instructions to the Important
  section

Co-authored-by: openhands <openhands@all-hands.dev>
- Rebase onto main to resolve conflicts from marketplace rename
- Add qa-changes to openhands-extensions.json marketplace
- Fix plugin.json author field to object format for Claude Code compat
- Generate missing symlinks and command files via sync_extensions.py

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww force-pushed the feat/qa-changes-plugin branch from ef0c987 to 61824c0 Compare April 22, 2026 15:56
Copy link
Copy Markdown
Contributor Author

Rebased onto main to resolve merge conflicts (marketplace file was renamed from default.jsonopenhands-extensions.json). Also fixed the plugin.json author field to use the object format required by Claude Code compatibility tests, added the qa-changes entry to the marketplace, and ran sync_extensions.py to generate missing symlinks/commands.

All core tests pass (64/64). The test_qa_changes.py errors are expected — they require lmnr which is a CI-only dependency.

We've been using qa-changes for a while now and it's working well. Merging this to main so it can be included in the new openhands-verification-stack marketplace bundle alongside onboarding, pr-review, and iterate.

This comment was posted by an AI assistant (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Contributor

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solid implementation with minor improvements needed

This PR adds valuable QA validation capabilities that go beyond static code review by actually executing PR changes. The implementation follows existing patterns from pr-review, includes comprehensive testing, and has well-documented security tradeoffs.

Three Questions:

  1. Real problem? Yes - many PRs pass review but break when run
  2. Simpler way? No - the complexity matches the problem
  3. What breaks? Nothing - new plugin, well-isolated

Key Strengths:

  • Reuses proven patterns from pr-review plugin
  • Comprehensive test coverage for utility functions
  • Security model is sound and well-documented
  • Clear separation between skill (methodology) and plugin (automation)

Issues to address: See inline comments for documentation inconsistency, magic numbers, and readability improvements.

Comment thread README.md Outdated
Comment thread plugins/qa-changes/scripts/evaluate_qa_changes.py
Comment thread plugins/qa-changes/workflows/qa-changes-by-openhands.yml Outdated
Comment thread plugins/qa-changes/scripts/agent_script.py
Comment thread plugins/qa-changes/scripts/agent_script.py
Comment thread plugins/qa-changes/README.md
Comment thread skills/qa-changes/SKILL.md
Comment thread tests/test_qa_changes.py
xingyaoww pushed a commit that referenced this pull request Apr 22, 2026
Expand the openhands-verification-stack marketplace to include all four
verification layers:

1. onboarding  — repo agent-readiness assessment and setup
2. pr-review   — automated code review via inline GitHub comments
3. qa-changes  — run-the-code QA validation with structured reports
4. iterate     — continuous PR loop through CI, review, and QA

Note: qa-changes depends on PR #135 being merged first.

Co-authored-by: openhands <openhands@all-hands.dev>
@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Apr 23, 2026

@OpenHands read the comments in this PR. First things first: disable uv cache or some other github cache if any. We should not share this workflow data with other workflows. Commit and push only that.

Apart from that, investigate github docs and tell me if it’s possible to partition caches, e.g. somehow cache for workflows a, b and c, but NOT d, e, f.

@openhands-ai

This comment was marked as resolved.

@openhands-ai

This comment was marked as resolved.

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Apr 23, 2026

@OpenHands read all comments here. Let’s focus on the blast radius, but only on this part:

this PR loads project skills from the checked-out repo,
Recommendations
Do not load repo-controlled agent guidance from the PR head
Load AGENTS.md / project skills from the base branch, or from a maintainer-controlled path outside the PR checkout.
This is one of the highest-leverage fixes in this design.

I agree, but I’m a little confused here: look up in agent-sdk repo (clone it from the same org) the system prompt construction; and I think you will find that it tells the agent that these files are untrusted. Verify what I just said. Are all these skills and AGENTS.md covered by the existing prompt?

Does the current behavior of the SDK address this concern already or do you have targeted recommendations? Only answer, directly as a comment on this PR.

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Apr 23, 2026

I'm on it! enyst can track my progress at all-hands.dev

Copy link
Copy Markdown
Collaborator

enyst commented Apr 23, 2026

I checked the SDK prompt construction in OpenHands/software-agent-sdk, and the answer is: partially yes, but not completely.

What is already covered today:

  • load_project_skills() does load AGENTS.md / CLAUDE.md / GEMINI.md and project skill directories from the checked-out workspace.
  • In the prompt builder, repo skills with trigger=None (this includes AGENTS.md and other always-on repo guidance) are injected into <REPO_CONTEXT> and explicitly wrapped in <UNTRUSTED_CONTENT>.
  • The SDK security policy / risk policy also explicitly calls out AGENTS.md, .cursorrules, and .agents/skills as repository-provided context.

So your memory is correct for AGENTS.md and other always-on repo context: the agent is told that content is untrusted.

But I don’t think that fully closes the concern here, because not all project skills are covered equally:

  • AgentSkills / SKILL.md project skills are listed in <available_skills> outside the <UNTRUSTED_CONTENT> wrapper.
  • Triggered project skills are injected via <EXTRA_INFO> without the same untrusted banner.
  • invoke_skill returns the full skill content directly, and the SDK’s skill renderer can execute inline !\cmd`` expansions from skill content. That path is explicitly documented as safe only for trusted skill sources.

So my read is:

  • Does the SDK already help? Yes, partially. It is a real mitigation for AGENTS.md / repo-context prompt injection.
  • Does it eliminate this PR’s blast-radius concern? No. The workflow still loads PR-head-controlled guidance from the checked-out repo, and some of the skill surfaces are not protected by the same untrusted framing.

My targeted recommendation stays the same:

  1. Do not load AGENTS.md / project skills from the PR head checkout for this workflow.
  2. If we want repo guidance, load it from the base branch (or another maintainer-controlled checkout/path).
  3. Separately at the SDK level, I would harden the remaining gaps by treating all repo-loaded skill surfaces as untrusted, including <available_skills>, triggered skill injections, and invoke_skill output; and I would not allow inline command execution for repo/project skills unless the source is explicitly trusted.

So: the current SDK behavior reduces the problem, but it does not fully address the specific concern raised in this review.

This comment was created by an AI assistant (OpenHands) on behalf of the user.

@openhands-ai

This comment was marked as duplicate.

default: main
max-budget:
description: Maximum LLM cost in dollars. The agent stops when this budget is exceeded.
required: false
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
required: false
required: true

What if we make this required? It doesn't seem much of a burden, and someone might become happy if their agent goes wild - and this stops it. WDYT?

- /github-pr-review skill for posting results as a code review thread

The template includes:
- {diff} - The complete git diff for the PR (may be truncated)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- {diff} - The complete git diff for the PR (may be truncated)
- {body} - The PR description
- {diff} - The complete git diff for the PR (may be truncated)

4. Re-run the same verification with the fix in place — show the command
and its output.
5. Interpret the new result: explain what it means (e.g., "The error is
gone, confirming the fix works").
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
gone, confirming the fix works").
gone, confirming the fix works", or "The error is still there" etc).

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Apr 24, 2026

@OpenHands Understand this PR and add a sentence to address this to the prompt for the QA agent:

Treat all PR-derived content as untrusted
Not just the PR body.
Diff, repo guidance, terminal output, browser content, HTTP responses — all of it.
The model should be told explicitly never to follow instructions from those channels.

Note that you should use similar language with the language in the agent-sdk (clone yourself that repo from the same org), where it inserts warning about untrusted content on AGENTS.md on PRs and the like. Make sure also, that if there is other language in this PR that does the same, also follows the same pattern(s) / wording / strength.

Do not make other modifications, just this. In your response on github, link the commit hash.

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Apr 24, 2026

@OpenHands why does CI fail? fix it pls

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Apr 24, 2026

I'm on it! enyst can track my progress at all-hands.dev

enyst and others added 2 commits April 24, 2026 20:11
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Collaborator

enyst commented Apr 24, 2026

CI was failing because plugins/qa-changes/scripts/agent_script.py imported Plugin from openhands.sdk.plugin, but the SDK version used in CI does not export that symbol there. I fixed it by switching the QA plugin loader to the supported PluginSource + Conversation(plugins=[...]) pattern used elsewhere in this repo: 2bfd98a

Per the previous comment, I also made the prompt hardening as a separate follow-up commit so it is easy to review in isolation. That commit adds an SDK-style <UNTRUSTED_CONTENT> warning telling the QA agent to treat all PR-derived content as untrusted input, not just the PR body (diff, repo guidance, terminal output, browser content, HTTP responses, etc.): 81a9527

Validation run locally:

  • uv run pytest -q tests/test_qa_changes.py
  • uv run pytest -q

This comment was created by an AI agent (OpenHands) on behalf of the user.

@openhands-ai

This comment was marked as duplicate.

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Apr 24, 2026

@xingyaoww I think this #135 (comment) is very interesting, and I’d pick it up for more investigation, but it doesn’t really apply as long as we don’t yet run on forks.

So it shouldn’t block this PR, WDYT?

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Collaborator

enyst commented Apr 24, 2026

You understood it correctly: before this change, the example workflow could still get triggered for a fork PR (especially via label / reviewer-request), but because it runs in pull_request context the fork run does not get the repo's LLM_API_KEY. In practice that meant it could start and then fail at config validation instead of declining cleanly.

I fixed that in 9774ad8 . The composite action now does a preflight check: if the PR comes from a fork and the LLM key is unavailable, it exits successfully with a clear skip notice in the workflow summary instead of failing. I also updated the README wording to match.

Validation:

  • uv run pytest -q

This comment was created by an AI agent (OpenHands) on behalf of the user.

@enyst
Copy link
Copy Markdown
Collaborator

enyst commented Apr 24, 2026

A little curious output: OpenHands/software-agent-sdk#2909 (review)

Should it be the case that it didn’t actually try to really run the workflow? It ran dry-run or —help. Now, it was a bit tricky case, because it needs OH API key, so it wouldn’t work, but it never says that in the output. 🤔
@xingyaoww

@xingyaoww
Copy link
Copy Markdown
Contributor Author

@OpenHands please try to fix this issue by changing the prompt. I'd prefer not to change it so drastically.

#135 (comment)

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Apr 24, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

The QA agent was falling back to --help and --dry-run output and presenting
it as functional verification. This adds explicit guidance in both the prompt
and SKILL.md: always attempt real execution first, and if it fails due to
missing credentials or external services, report honestly what could not be
verified instead of substituting --help output.

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Contributor Author

Fixed in 7630b8b — targeted prompt change only, as requested.

Problem: The QA agent was running scripts with --help and --dry-run and presenting that as functional verification. In the linked QA report, the agent ran issue_duplicate_check_openhands.py --help and auto_close_duplicate_issues.py --help but never actually attempted to run them with real inputs. When it couldn't run because of missing credentials (OH API key), it should have reported that honestly instead of settling for --help output.

Fix: Added a single bullet point in both the prompt (prompt.py) and the skill (SKILL.md) that explicitly tells the agent:

  • --help, --dry-run, and --version are NOT functional verification
  • Always attempt real execution first
  • If real execution fails due to missing credentials/services, report the failure honestly (what was tried, what was missing, what couldn't be verified)
  • Do not fall back to --help output and present it as evidence the software works

All 198 tests pass.

This comment was created by an AI agent (OpenHands) on behalf of the user.

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Apr 24, 2026

Summary of Work

Request

@xingyaoww asked to fix the issue where the QA agent was running --help/--dry-run instead of actually executing software, and requested the fix be done by changing the prompt without drastic changes.

Checklist

  • Fix applied via prompt change only — no structural, script, or workflow changes
  • Minimal/non-drastic change — 8 lines added across 2 files (prompt.py and SKILL.md)
  • Addresses the root cause — explicitly tells the agent that --help, --dry-run, --version are NOT functional verification; to attempt real execution first; and to report honestly when credentials/services are missing instead of falling back to --help output
  • Consistent across prompt and skill — both plugins/qa-changes/scripts/prompt.py and skills/qa-changes/SKILL.md updated with matching guidance
  • All tests pass — 198/198 tests pass (including 21 qa-changes-specific tests)
  • Committed and pushed — commit 7630b8b on feat/qa-changes-plugin
  • PR comment posted — linked to the commit with explanation of the problem and fix
  • No extraneous changes — only the two files directly related to the prompt were modified

What Changed (since last summary)

Single commit 7630b8b: Added one bullet point each to:

  1. prompt.py (the "What you MUST do" section) — 7 lines telling the agent to attempt real execution, not --help
  2. SKILL.md (the "DO" list in Phase 3) — 1 line with the same guidance

No other files were touched. The change is deliberately narrow to match the "I'd prefer not to change it so drastically" request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants