
Add Task 25: Playwright E2E Form Test (automation category) #120

Open
hi0001234d wants to merge 1 commit into pinchbench:main from hi0001234d:feat/playwright-e2e

Conversation

@hi0001234d

New Task: Browser Automation Benchmark

Adds the first browser automation task to PinchBench, a category the suite currently lacks.

Related: #52 (browser automation gap discussed there)

What it does

Agent receives a 3-step HTML registration form (form.html) and must write a Playwright E2E test script (test_form.py) that navigates all steps, validates state, and includes retry logic.

Files

  • tasks/task_25_playwright_e2e.md — Task definition with hybrid grading
  • assets/form.html — Self-contained 3-step form fixture (pure HTML/CSS/JS)

Grading

  • Hybrid: 50% automated (8 Python checks) + 50% LLM judge (4 criteria, weights sum to 100%)
  • Automated checks: file creation, valid Python, Playwright import, form reference, field fills, retry logic, assertions, screenshot
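The hybrid arithmetic appears to be a plain weighted average; a minimal sketch (assuming a judge timeout contributes zero, which matches the totals reported later in this thread):

```python
from typing import Optional

def hybrid_score(auto_points: float, auto_max: float,
                 judge_fraction: Optional[float]) -> float:
    """Combine automated checks and LLM-judge score, 50% each.

    judge_fraction is a 0-1 value, or None when the judge timed out;
    a timeout zeroes the judge half (the behavior reported in this thread).
    """
    auto_fraction = auto_points / auto_max
    judge = judge_fraction if judge_fraction is not None else 0.0
    return 0.5 * auto_fraction + 0.5 * judge

# 7.5/8 automated with a judge timeout reproduces the 46.9% seen below
print(round(hybrid_score(7.5, 8, None) * 100, 1))  # 46.9
```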

Testing done

  • 96 automated deep tests passed (PinchBench standard)
  • 32 + 70 Playwright browser tests passed (form.html validation)
  • Live OpenClaw benchmark: 8/8 automated checks = 1.0 (model: claude-sonnet-4-6, cost: $0.24)
  • Existing pytest suite: 2/2 pass, zero side effects
  • workspace_files source format verified against lib_agent.py

Why this task

Category: automation | Timeout: 300s | Difficulty: medium

@ScuttleBot

🦀 PR #120 Test Results — Task 25: Playwright E2E Form Test

Tested on a fresh Vultr instance (Ubuntu 22.04, 4GB RAM, vc2-2c-4gb). Snapshot: 79926f8e (latest). Playwright + Chromium pre-installed before running benchmarks.

Model Scores (task_25 only)

| Model | Automated (50%) | LLM Judge (50%) | Total | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| claude-opus-4.6 | 7.5/8 (93.8%) | ❌ Failed | 46.9% | 101s | $0.38 | 170K |
| gpt-5.4 | 7.5/8 (93.8%) | ❌ Failed | 46.9% | 73s | $0.09 | 66K |
| gemini-2.5-pro | 7.5/8 (93.8%) | ❌ Failed | 46.9% | 139s | $0.20 | 212K |

Note: The original spec called for gemini-3-pro but that model ID doesn't exist on OpenRouter. Used gemini-2.5-pro instead.

Automated Check Breakdown (identical across all 3 models)

| Check | Score |
|---|---|
| ✅ File test_form.py created | 1.0 |
| ✅ Valid Python syntax | 1.0 |
| ✅ Imports playwright.sync_api | 1.0 |
| ✅ References form.html | 1.0 |
| ⚠️ Fills multiple fields (≥5 interactions) | 0.5 |
| ✅ Has retry/wait logic | 1.0 |
| ✅ Has assertions/expect calls | 1.0 |
| ✅ Saves screenshot to success.png | 1.0 |

The fills_multiple_fields check scored 0.5 because it counts fill(), type(), select_option(), check(), click(), and press() calls and wants ≥10 for full marks. All 3 models hit 5-9 range (they're efficient — using helper functions that wrap single fill/click calls, so the raw regex count misses the actual number of interactions). This is a grading check issue, not a model issue.
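A minimal reproduction of the undercount (the pattern below is an assumed reconstruction of the check, not PinchBench's actual regex):

```python
import re

# Hypothetical version of the fills_multiple_fields counter: raw regex over source
INTERACTION_RE = re.compile(r"\.(fill|type|select_option|check|click|press)\(")

def count_interactions(source: str) -> int:
    return len(INTERACTION_RE.findall(source))

inline_style = """
page.fill("#first", "Ada")
page.fill("#last", "Lovelace")
page.click("#next")
"""

helper_style = """
def fill_field(page, sel, value):
    page.fill(sel, value)   # the only raw interaction call in the file

for sel, value in FIELDS:   # ten logical interactions at runtime...
    fill_field(page, sel, value)
"""

print(count_interactions(inline_style))  # 3
print(count_interactions(helper_style))  # 1: helpers hide the real interaction count
```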

🚨 Critical Issue: LLM Judge Always Fails

The LLM judge (claude-opus-4.5) failed for all 3 models with timeout. Root cause: the judge agent doesn't understand it's supposed to evaluate the code — it tries to write and run its own Playwright script instead, spending its entire timeout on that. The judge response never contains the expected JSON with criterion scores.

From the judge logs:

```
Judge raw response: "Now I understand the form structure. Let me create the Playwright test script..."
WARNING - Failed to parse judge JSON response
```

This means the hybrid grading (50% automated + 50% LLM judge) is effectively broken — only the 50% automated portion scores. Every model's true score is capped at ~47% regardless of code quality.
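A hedged sketch of a more tolerant parser (not the PinchBench implementation): scan the reply for JSON candidates instead of requiring the whole response to be JSON, and surface None as an explicit judge failure to retry, rather than silently zeroing the component:

```python
import json
import re

def extract_judge_scores(response: str):
    """Return the last parseable flat JSON object in a judge reply, or None.

    The {...} regex does not handle nested objects, which is enough for a
    flat criterion->score mapping. A None result should be reported as a
    judge failure (and retried), not scored as zero.
    """
    for candidate in reversed(re.findall(r"\{[^{}]*\}", response)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

ok = 'Reviewing the script... {"multi_step_navigation": 0.9, "assertions": 1.0}'
bad = "Now I understand the form structure. Let me create the Playwright test script..."
print(extract_judge_scores(ok))   # parsed criterion scores
print(extract_judge_scores(bad))  # None -> treat as judge failure
```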

Model Observations

  • Claude Opus 4.6: Clean one-shot solution with retry() wrapper, tid() helper for data-testid selectors. Tried to run the script (exec blocked by preflight check for cd && python), then used workdir parameter successfully. Actually ran and verified the script produced success.png.

  • GPT-5.4: Most efficient — fewest tokens, cheapest, single-write approach with get_by_test_id(). Used proper Playwright patterns throughout. Tried to git commit (not a repo, gracefully handled). Didn't attempt to run the script.

  • Gemini 2.5 Pro: Initially wrote get_by_testid (wrong API name), hit an error, debugged it, rewrote with get_by_test_id. Showed good self-correction. Most expensive due to thinking tokens (212K total). Installed Playwright deps itself before running.

Playwright Setup Note

Playwright and Chromium aren't on the PinchBench snapshot. I installed them manually before running (pip3 install playwright && playwright install chromium && playwright install-deps). For this task to be included in standard benchmark runs, the snapshot would need updating, or the task could document that the agent should install deps itself (all 3 models attempted this).

Recommendation

Needs work before merge:

  1. 🔴 Fix LLM judge integration — The judge agent is behaving as a task solver rather than a grading evaluator. The judge prompt needs to be much more explicit that it should only output JSON scores, not attempt the task itself. This is likely a benchmark framework issue, not specific to this PR, but it makes the hybrid grading non-functional.

  2. 🟡 Fix fills_multiple_fields threshold — The regex counting approach undercounts when agents use helper functions. Consider lowering the threshold from 10 to 5 for full marks, or counting at a higher level (e.g., number of unique field references rather than raw method calls). All 3 models filled 7+ fields but scored 0.5.

  3. 🟡 Task number conflict — The snapshot already has task_25_access_log_anomaly.md on main. This PR's task_25_playwright_e2e.md coexists by filename, but the task ID prefix task_25 is shared, which could cause confusion. Consider renumbering (e.g., task_26).

  4. 🟢 Playwright dependency — Either update the snapshot to include Playwright, or document that agents should install it. All models handled this gracefully.

The task itself is well-designed — good difficulty level, realistic browser automation scenario, and the automated checks are mostly solid. The main blocker is the broken judge scoring.

@olearycrew
Member

> Playwright and Chromium aren't on the PinchBench snapshot. I installed them manually before running (pip3 install playwright && playwright install chromium && playwright install-deps). For this task to be included in standard benchmark runs, the snapshot would need updating, or the task could document that the agent should install deps itself (all 3 models attempted this).

I think keeping them out of the snapshot is the way to go here. The model knowing to install them is worth being part of the test, TBH.

@ScuttleBot

🦀 PinchBench PR Test Results — Task 25: Playwright E2E Form Test

Instance: Vultr vc2-2c-4gb (Ubuntu 22.04, 4GB RAM) | Region: ATL
Branch: hi0001234d:feat/playwright-e2e merged onto main (snapshot 79926f8e)
Playwright: v1.58.0 + Chromium Headless Shell 145.0
All 3 models ran in parallel


Scores

| Model | Overall | Auto (50%) | Judge (50%) | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| claude-opus-4.6 | 45.3% ⚠️ | 90.6% (7.25/8) | ❌ Timed out | 103.7s | $0.42 | 194K |
| gpt-5.4 | 46.9% ⚠️ | 93.8% (7.5/8) | ❌ Timed out | 81.3s | $0.09 | 102K |
| gemini-3.1-pro-preview | 91.4% | 96.9% (7.75/8) | 86.3% | 89.5s | $0.12 | 77K |

⚠️ Claude and GPT scores are artificially low — the LLM judge (claude-opus-4.5) timed out during judging, zeroing out the 50% judge component. Their automated scores (90.6% and 93.8%) show strong actual performance.


Automated Check Breakdown

| Check | Claude | GPT | Gemini |
|---|---|---|---|
| file_created | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| valid_python | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| imports_playwright | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| references_form_html | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| fills_multiple_fields | ⚠️ 0.5 | ⚠️ 0.5 | ✅ 1.0 |
| has_retry_or_wait | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| has_assertions | ⚠️ 0.75 | ✅ 1.0 | ⚠️ 0.75 |
| saves_screenshot | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |

LLM Judge Breakdown (Gemini only — others timed out)

| Criterion | Score |
|---|---|
| Multi-step navigation | 0.90 |
| Review data assertions | 1.00 |
| Error handling & retry | 0.75 |
| Code quality & best practices | 0.80 |

Judge notes: Script successfully navigates all 3 steps using data-testid selectors and select_option for state dropdown. All 4 review fields verified with exact text assertions. Retry logic implemented with 3 attempts and 1s delay, but catches generic Exception instead of specific Playwright errors and lacks finally block for cleanup. Code is clean and organized with helper function, but has hardcoded file path and no comments.
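The two gaps the judge flagged (catching generic Exception, no finally cleanup) can be patched with a retry helper along these lines. This is a sketch only: in the real test_form.py the caught type would be playwright.sync_api.TimeoutError and the cleanup a browser.close(); a stdlib stand-in exception keeps the sketch runnable without Playwright installed.

```python
import time

class PlaywrightTimeout(Exception):
    """Stand-in for playwright.sync_api.TimeoutError (not imported here)."""

def with_retry(action, attempts=3, delay=1.0, cleanup=None):
    """Run action, retrying only on the specific timeout error, with guaranteed cleanup."""
    try:
        for attempt in range(1, attempts + 1):
            try:
                return action()
            except PlaywrightTimeout:
                if attempt == attempts:
                    raise          # out of retries: propagate the real error
                time.sleep(delay)  # back off before the next attempt
    finally:
        if cleanup is not None:
            cleanup()              # e.g. browser.close() in the real script

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise PlaywrightTimeout("element not visible yet")
    return "ok"

print(with_retry(flaky, attempts=3, delay=0.0))  # ok (succeeds on the third attempt)
```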


Key Observations

1. fills_multiple_fields check is too strict for well-structured code
Claude (166 LOC) and GPT (123 LOC) used helper functions / data-driven patterns that wrap .fill() calls inside retry loops. The regex counter only found 3-4 raw .fill()/.click() calls in their code, scoring them 0.5 (needs ≥10 for 1.0). Gemini (58 LOC) used inline calls and hit 10. This penalizes good engineering practice (DRY code with helpers) over brute-force repetition. Consider counting the logical field interactions rather than raw regex matches.

2. LLM Judge timeout is a reliability concern
2 of 3 runs had judge timeouts, zeroing out 50% of the score. This is an infrastructure issue (the judge agent session timed out), not a task design issue. The task itself completed fine for all models. For hybrid-graded tasks, a judge failure shouldn't zero the entire judge component — consider a fallback or retry.

3. Task difficulty feels about right

  • All 3 models successfully created valid Playwright scripts with correct structure
  • All produced working screenshots
  • Execution times (81-104s) are well within the 300s timeout
  • The task differentiates code quality through the judge criteria
  • Auto checks at ~91-97% mean the basic requirements are clear and achievable

4. Task ID collision
The snapshot already has task_25_access_log_anomaly.md. The PR adds task_25_playwright_e2e.md. Both loaded fine (different IDs in frontmatter), but having two task_25_* files could confuse operators. Consider renumbering to task_26.


Installation Notes

Playwright setup on the benchmark VM required:

```shell
pip install playwright
playwright install chromium
playwright install-deps chromium
```

This added ~110MB for Chromium. The vc2-2c-4gb plan (4GB RAM) handled it fine. The default vc2-1c-2gb (2GB) would likely also work but could be tight with Chromium + benchmark agent + judge running concurrently.


Recommendation: Merge with changes

The task is well-designed and fills a genuine gap (0 browser automation tasks). Two things to address:

  1. Fix fills_multiple_fields regex grading — it penalizes DRY code. Count logical field interactions or AST-walk for function calls, not just raw regex on .fill().
  2. Renumber to task_26 to avoid the ID collision with existing task_25_access_log_anomaly.
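As a sketch of the first point (illustrative only, not the PinchBench grader): an AST pass can count distinct selector-like string literals, which approximates logical field interactions regardless of whether the calls are inlined or wrapped in helpers and loops.

```python
import ast

def count_unique_field_refs(source: str) -> int:
    """Count distinct selector-like string literals in a test script.

    Heuristic: any string constant starting with '#', '[', or '.' is
    treated as a selector, so '#email' counts once no matter how the
    agent structures the code around it.
    """
    tree = ast.parse(source)
    selectors = {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith(("#", "[", "."))
    }
    return len(selectors)

helper_style = """
FIELDS = [("#first", "Ada"), ("#last", "Lovelace"), ("#email", "ada@example.com")]

def fill_field(page, sel, value):
    page.fill(sel, value)

for sel, value in FIELDS:
    fill_field(page, sel, value)
page.click("[data-testid=next]")
"""

print(count_unique_field_refs(helper_style))  # 4: three fields plus the next button
```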

The judge timeout is a broader infrastructure issue, not specific to this PR.


Tested by PinchBench CI (ScuttleBot) on 2026-04-09 | Instance destroyed after testing
