feat: add qa-changes plugin for automated PR QA validation#135
feat: add qa-changes plugin for automated PR QA validation#135
Conversation
Add a new plugin that goes beyond code review by actually running the code to verify PR changes work as described. Plugin structure: - skills/qa-changes/SKILL.md: Generic QA methodology skill - plugins/qa-changes/action.yml: Composite GitHub Action - plugins/qa-changes/scripts/agent_script.py: Main QA agent - plugins/qa-changes/scripts/prompt.py: Prompt template - plugins/qa-changes/workflows/: Example workflow file - plugins/qa-changes/README.md: Documentation The QA agent follows a five-phase methodology: 1. Understand the change (classify diff) 2. Set up the environment (install deps, build) 3. Run the test suite (establish baseline, detect regressions) 4. Exercise changed behavior (manually test features/fixes) 5. Report results (structured PR comment with verdict) Co-authored-by: openhands <openhands@all-hands.dev>
…ful failure Key changes to the QA skill: - Merge Setup + Test into one phase; check CI status first, only run tests CI doesn't cover - Raise the bar for Exercise phase: frontend changes must use a real browser (Playwright/browser automation), CLI changes must run the actual CLI, API changes must make real HTTP requests - Add specific guidance per change type (frontend, CLI, API, bug fix, library, refactor, config) - Add 'Knowing When to Give Up' section: three attempts per approach, two approaches max, then report honestly and suggest AGENTS.md guidance - Add PARTIAL verdict for when some behavior could not be verified - Update prompt, README to match new four-phase methodology Co-authored-by: openhands <openhands@all-hands.dev>
OpenHands SDK performs best with tmux available for terminal management. Co-authored-by: openhands <openhands@all-hands.dev>
Add .plugin/plugin.json manifest and update agent_script.py to load the qa-changes plugin via the SDK's Plugin system. This properly loads skills, hooks, and MCP config bundled in the plugin directory. Previously the script only loaded project skills via load_project_skills() and missed the plugin's own skills entirely. See #136 for the same issue in the pr-review plugin. Co-authored-by: openhands <openhands@all-hands.dev>
…arketplace entry - Enable browser tools (enable_browser=True) so the QA agent can actually verify UI changes in a real browser, matching the SKILL.md methodology - Switch workflow from pull_request_target to pull_request to avoid executing untrusted fork code with the base repo's secrets - Isolate untrusted PR body in the prompt with an explicit warning to mitigate prompt injection - Add qa-changes skill to marketplaces/default.json (required by CI) - Add comments explaining tmux (OpenHands runtime) and gh dependencies - Update README security section to reflect the pull_request change Co-authored-by: openhands <openhands@all-hands.dev>
- Add max-budget ($10 default), timeout-minutes (30 default), and max-iterations (200 default) as action inputs - Enforce budget via a Conversation callback that raises BudgetExceeded when accumulated LLM cost exceeds the limit - Enforce timeout via GHA step-level timeout-minutes - Enforce iteration cap via SDK's max_iteration_per_run parameter - Pass all three values through as env vars (MAX_BUDGET, MAX_ITERATIONS) and document them - Add tests for format_prompt, truncate_diff, and validate_environment (20 tests covering fields, edge cases, defaults, and custom overrides) - Add missing skills/qa-changes/README.md (required by CI) - Update README action inputs table with new parameters Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
…ll timeout GitHub Actions does not support `timeout-minutes` on steps inside composite actions (only at the job level). The `Set up job` step fails with: 'Unexpected value timeout-minutes'. Replace with the coreutils `timeout` command, passing the `timeout-minutes` input via an environment variable and converting to seconds in the shell. Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Add lmnr-api-key input to action.yml with env var and --with lmnr - Add Laminar trace artifact upload step - Add save_trace_context() to agent_script.py for trace persistence - Create evaluate_qa_changes.py for post-close evaluation - Create qa-changes-evaluation.yml workflow template - Update workflow template to pass lmnr-api-key Co-authored-by: openhands <openhands@all-hands.dev>
Test extract_qa_report, extract_human_responses, truncate_text, calculate_engagement_score, and load_trace_info. Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Switch from posting QA results as a plain PR comment (gh pr comment) to posting them as a GitHub code review thread using the /github-pr-review skill. The agent now: - Triggers both /qa-changes and /github-pr-review skills - Posts a structured review body with the full QA report - Adds inline review comments on specific lines for issues found - Uses priority labels (🔴🟠🟡🟢) from the github-pr-review skill - Bundles everything into a single review API call Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Add symlink skills/github-pr-review -> ../../../skills/github-pr-review so the skill is explicitly available to the agent, matching the pattern used by the pr-review plugin. Co-authored-by: openhands <openhands@all-hands.dev>
Update the QA skill and prompt to produce more scannable reports: - Verdict + one-sentence summary at the top for instant readability - Status table gives at-a-glance phase results - All evidence (code snippets, logs, command output) goes inside HTML <details> collapsible blocks - Explicit formatting rules: no repetition across sections, omit empty sections, issues always visible - Prompt reinforces compact format and collapsible evidence Motivated by verbose QA reports on PRs like #2798 in software-agent-sdk where long inline evidence made the report hard to scan. Co-authored-by: openhands <openhands@all-hands.dev>
Update the prompt and skill to prioritize answering the core question: 'does this PR actually fix the original issue?' rather than just reporting that tests pass. Changes: - Phase 1 (Understand): emphasize identifying the original problem from PR description, linked issues, and title - Phase 3 (Exercise): start by verifying the PR addresses the stated issue with concrete examples of what that means for each PR type - Report template: add 'Does this PR fix the original issue?' section with direct answer and evidence - Key Principles: add principle that answering the core question is the primary deliverable - Prompt: add explicit instruction that the #1 job is to answer whether the PR fixes the original issue Co-authored-by: openhands <openhands@all-hands.dev>
The previous wording assumed PRs always fix issues, but PRs can also add new features, refactor code, improve performance, update docs, etc. Reframe to the broader question: "does this PR achieve what it set out to do?" This covers all PR types without being narrowly issue-focused. Co-authored-by: openhands <openhands@all-hands.dev>
Update the QA agent prompt and skill to require structured before/after evidence inside the collapsible Functional Verification section: 1. Reproduce the problem / establish baseline (without the fix) 2. Interpret what the baseline output means 3. Apply the PR's changes 4. Re-run the same verification with the fix in place 5. Interpret what the new output means This ensures QA reports show concrete proof that the agent actually ran code to verify fixes, rather than just describing what the diff does. Changes: - SKILL.md Phase 3: Expanded bug fix section into 6 explicit steps - SKILL.md Phase 3: Added general before/after narrative requirement - SKILL.md Phase 4: Replaced vague collapsible template with structured 3-step before/after format (reproduce → apply → verify) - prompt.py: Added 5-step before/after instructions to the Important section Co-authored-by: openhands <openhands@all-hands.dev>
- Rebase onto main to resolve conflicts from marketplace rename - Add qa-changes to openhands-extensions.json marketplace - Fix plugin.json author field to object format for Claude Code compat - Generate missing symlinks and command files via sync_extensions.py Co-authored-by: openhands <openhands@all-hands.dev>
ef0c987 to
61824c0
Compare
|
Rebased onto main to resolve merge conflicts (marketplace file was renamed from All core tests pass (64/64). The We've been using qa-changes for a while now and it's working well. Merging this to main so it can be included in the new This comment was posted by an AI assistant (OpenHands) on behalf of the user. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable - Solid implementation with minor improvements needed
This PR adds valuable QA validation capabilities that go beyond static code review by actually executing PR changes. The implementation follows existing patterns from pr-review, includes comprehensive testing, and has well-documented security tradeoffs.
Three Questions:
- ✅ Real problem? Yes - many PRs pass review but break when run
- ✅ Simpler way? No - the complexity matches the problem
- ✅ What breaks? Nothing - new plugin, well-isolated
Key Strengths:
- Reuses proven patterns from
pr-reviewplugin - Comprehensive test coverage for utility functions
- Security model is sound and well-documented
- Clear separation between skill (methodology) and plugin (automation)
Issues to address: See inline comments for documentation inconsistency, magic numbers, and readability improvements.
Expand the openhands-verification-stack marketplace to include all four verification layers: 1. onboarding — repo agent-readiness assessment and setup 2. pr-review — automated code review via inline GitHub comments 3. qa-changes — run-the-code QA validation with structured reports 4. iterate — continuous PR loop through CI, review, and QA Note: qa-changes depends on PR #135 being merged first. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands read the comments in this PR. First things first: disable uv cache or some other github cache if any. We should not share this workflow data with other workflows. Commit and push only that. Apart from that, investigate github docs and tell me if it’s possible to partition caches, e.g. somehow cache for workflows a, b and c, but NOT d, e, f. |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
@OpenHands read all comments here. Let’s focus on the blast radius, but only on this part:
I agree, but I’m a little confused here: look up in agent-sdk repo (clone it from the same org) the system prompt construction; and I think you will find that it tells the agent that these files are untrusted. Verify what I just said. Are all these skills and AGENTS.md covered by the existing prompt? Does the current behavior of the SDK address this concern already or do you have targeted recommendations? Only answer, directly as a comment on this PR. |
|
I'm on it! enyst can track my progress at all-hands.dev |
|
I checked the SDK prompt construction in What is already covered today:
So your memory is correct for AGENTS.md and other always-on repo context: the agent is told that content is untrusted. But I don’t think that fully closes the concern here, because not all project skills are covered equally:
So my read is:
My targeted recommendation stays the same:
So: the current SDK behavior reduces the problem, but it does not fully address the specific concern raised in this review. This comment was created by an AI assistant (OpenHands) on behalf of the user. |
This comment was marked as duplicate.
This comment was marked as duplicate.
| default: main | ||
| max-budget: | ||
| description: Maximum LLM cost in dollars. The agent stops when this budget is exceeded. | ||
| required: false |
There was a problem hiding this comment.
| required: false | |
| required: true |
What if we make this required? It doesn't seem much of a burden, and someone might become happy if their agent goes wild - and this stops it. WDYT?
| - /github-pr-review skill for posting results as a code review thread | ||
|
|
||
| The template includes: | ||
| - {diff} - The complete git diff for the PR (may be truncated) |
There was a problem hiding this comment.
| - {diff} - The complete git diff for the PR (may be truncated) | |
| - {body} - The PR description | |
| - {diff} - The complete git diff for the PR (may be truncated) |
| 4. Re-run the same verification with the fix in place — show the command | ||
| and its output. | ||
| 5. Interpret the new result: explain what it means (e.g., "The error is | ||
| gone, confirming the fix works"). |
There was a problem hiding this comment.
| gone, confirming the fix works"). | |
| gone, confirming the fix works", or "The error is still there" etc). |
|
@OpenHands Understand this PR and add a sentence to address this to the prompt for the QA agent:
Note that you should use similar language with the language in the agent-sdk (clone yourself that repo from the same org), where it inserts warning about untrusted content on AGENTS.md on PRs and the like. Make sure also, that if there is other language in this PR that does the same, also follows the same pattern(s) / wording / strength. Do not make other modifications, just this. In your response on github, link the commit hash. |
|
@OpenHands why does CI fail? fix it pls |
|
I'm on it! enyst can track my progress at all-hands.dev |
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
|
CI was failing because Per the previous comment, I also made the prompt hardening as a separate follow-up commit so it is easy to review in isolation. That commit adds an SDK-style Validation run locally:
This comment was created by an AI agent (OpenHands) on behalf of the user. |
This comment was marked as duplicate.
This comment was marked as duplicate.
|
@xingyaoww I think this #135 (comment) is very interesting, and I’d pick it up for more investigation, but it doesn’t really apply as long as we don’t yet run on forks. So it shouldn’t block this PR, WDYT? |
Co-authored-by: openhands <openhands@all-hands.dev>
|
You understood it correctly: before this change, the example workflow could still get triggered for a fork PR (especially via label / reviewer-request), but because it runs in I fixed that in 9774ad8 . The composite action now does a preflight check: if the PR comes from a fork and the LLM key is unavailable, it exits successfully with a clear skip notice in the workflow summary instead of failing. I also updated the README wording to match. Validation:
This comment was created by an AI agent (OpenHands) on behalf of the user. |
|
A little curious output: OpenHands/software-agent-sdk#2909 (review) Should it be the case that it didn’t actually try to really run the workflow? It ran dry-run or —help. Now, it was a bit tricky case, because it needs OH API key, so it wouldn’t work, but it never says that in the output. 🤔 |
|
@OpenHands please try to fix this issue by changing the prompt. I'd prefer not to change it so drastically. |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
The QA agent was falling back to --help and --dry-run output and presenting it as functional verification. This adds explicit guidance in both the prompt and SKILL.md: always attempt real execution first, and if it fails due to missing credentials or external services, report honestly what could not be verified instead of substituting --help output. Co-authored-by: openhands <openhands@all-hands.dev>
|
Fixed in 7630b8b — targeted prompt change only, as requested. Problem: The QA agent was running scripts with Fix: Added a single bullet point in both the prompt (
All 198 tests pass. This comment was created by an AI agent (OpenHands) on behalf of the user. |
Summary of WorkRequest@xingyaoww asked to fix the issue where the QA agent was running Checklist
What Changed (since last summary)Single commit
No other files were touched. The change is deliberately narrow to match the "I'd prefer not to change it so drastically" request. |
Summary
Add a new
qa-changesplugin that goes beyond code review by actually running the code to verify PR changes work as described. While the existingpr-reviewplugin reads diffs and posts inline code comments, this plugin sets up the environment, runs the test suite, exercises changed behavior, and posts a structured QA report.Plugin Structure
Four-Phase QA Methodology
The skill defines a generic, language-agnostic methodology:
How It Differs from PR Review
Usage
Triggers:
qa-thislabel oropenhands-agentreviewer request.Design Decisions
AGENTS.mdor custom skills.agent_script.pyandaction.ymlfollow the same patterns aspr-reviewfor consistency, but the prompt and skill are completely different.FIRST_TIME_CONTRIBUTORandNONEfrom automatic triggers since QA executes code.Related
This PR was created by an AI assistant (OpenHands).