
Add Task 25: Playwright E2E Form Test (automation category) #120

Open
hi0001234d wants to merge 1 commit into pinchbench:main from hi0001234d:feat/playwright-e2e

Conversation

@hi0001234d

New Task: Browser Automation Benchmark

Adds the first browser automation task to PinchBench, a category the suite currently lacks.

Related: #52 (browser automation gap discussed there)

What it does

Agent receives a 3-step HTML registration form (form.html) and must write a Playwright E2E test script (test_form.py) that navigates all steps, validates state, and includes retry logic.

Files

  • tasks/task_25_playwright_e2e.md — Task definition with hybrid grading
  • assets/form.html — Self-contained 3-step form fixture (pure HTML/CSS/JS)

Grading

  • Hybrid: 50% automated (8 Python checks) + 50% LLM judge (4 criteria, weights sum to 100%)
  • Automated checks: file creation, valid Python, Playwright import, form reference, field fills, retry logic, assertions, screenshot
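The hybrid arithmetic appears to be a plain weighted average; a minimal sketch (assuming a judge timeout contributes zero, which matches the totals reported later in this thread):

```python
from typing import Optional

def hybrid_score(auto_points: float, auto_max: float,
                 judge_fraction: Optional[float]) -> float:
    """Combine automated checks and LLM-judge score, 50% each.

    judge_fraction is a 0-1 value, or None when the judge timed out;
    a timeout zeroes the judge half (the behavior reported in this thread).
    """
    auto_fraction = auto_points / auto_max
    judge = judge_fraction if judge_fraction is not None else 0.0
    return 0.5 * auto_fraction + 0.5 * judge

# 7.5/8 automated with a judge timeout reproduces the 46.9% seen below
print(round(hybrid_score(7.5, 8, None) * 100, 1))  # 46.9
```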

Testing done

  • 96 automated deep tests passed (PinchBench standard)
  • 32 + 70 Playwright browser tests passed (form.html validation)
  • Live OpenClaw benchmark: 8/8 automated checks = 1.0 (model: claude-sonnet-4-6, cost: $0.24)
  • Existing pytest suite: 2/2 pass, zero side effects
  • workspace_files source format verified against lib_agent.py

Why this task

Category: automation | Timeout: 300s | Difficulty: medium

@ScuttleBot

🦀 PR #120 Test Results — Task 25: Playwright E2E Form Test

Tested on a fresh Vultr instance (Ubuntu 22.04, 4GB RAM, vc2-2c-4gb). Snapshot: 79926f8e (latest). Playwright + Chromium pre-installed before running benchmarks.

Model Scores (task_25 only)

| Model | Automated (50%) | LLM Judge (50%) | Total | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| claude-opus-4.6 | 7.5/8 (93.8%) | ❌ Failed | 46.9% | 101s | $0.38 | 170K |
| gpt-5.4 | 7.5/8 (93.8%) | ❌ Failed | 46.9% | 73s | $0.09 | 66K |
| gemini-2.5-pro | 7.5/8 (93.8%) | ❌ Failed | 46.9% | 139s | $0.20 | 212K |

Note: The original spec called for gemini-3-pro but that model ID doesn't exist on OpenRouter. Used gemini-2.5-pro instead.

Automated Check Breakdown (identical across all 3 models)

| Check | Score |
|---|---|
| ✅ File test_form.py created | 1.0 |
| ✅ Valid Python syntax | 1.0 |
| ✅ Imports playwright.sync_api | 1.0 |
| ✅ References form.html | 1.0 |
| ⚠️ Fills multiple fields (≥5 interactions) | 0.5 |
| ✅ Has retry/wait logic | 1.0 |
| ✅ Has assertions/expect calls | 1.0 |
| ✅ Saves screenshot to success.png | 1.0 |

The fills_multiple_fields check scored 0.5 because it counts fill(), type(), select_option(), check(), click(), and press() calls and wants ≥10 for full marks. All 3 models hit 5-9 range (they're efficient — using helper functions that wrap single fill/click calls, so the raw regex count misses the actual number of interactions). This is a grading check issue, not a model issue.
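A minimal reproduction of the undercount (the pattern below is an assumed reconstruction of the check, not PinchBench's actual regex):

```python
import re

# Hypothetical version of the fills_multiple_fields counter: raw regex over source
INTERACTION_RE = re.compile(r"\.(fill|type|select_option|check|click|press)\(")

def count_interactions(source: str) -> int:
    return len(INTERACTION_RE.findall(source))

inline_style = """
page.fill("#first", "Ada")
page.fill("#last", "Lovelace")
page.click("#next")
"""

helper_style = """
def fill_field(page, sel, value):
    page.fill(sel, value)   # the only raw interaction call in the file

for sel, value in FIELDS:   # ten logical interactions at runtime...
    fill_field(page, sel, value)
"""

print(count_interactions(inline_style))  # 3
print(count_interactions(helper_style))  # 1: helpers hide the real interaction count
```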

🚨 Critical Issue: LLM Judge Always Fails

The LLM judge (claude-opus-4.5) failed for all 3 models with timeout. Root cause: the judge agent doesn't understand it's supposed to evaluate the code — it tries to write and run its own Playwright script instead, spending its entire timeout on that. The judge response never contains the expected JSON with criterion scores.

From the judge logs:

```
Judge raw response: "Now I understand the form structure. Let me create the Playwright test script..."
WARNING - Failed to parse judge JSON response
```

This means the hybrid grading (50% automated + 50% LLM judge) is effectively broken — only the 50% automated portion scores. Every model's true score is capped at ~47% regardless of code quality.
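A hedged sketch of a more tolerant parser (not the PinchBench implementation): scan the reply for JSON candidates instead of requiring the whole response to be JSON, and surface None as an explicit judge failure to retry, rather than silently zeroing the component:

```python
import json
import re

def extract_judge_scores(response: str):
    """Return the last parseable flat JSON object in a judge reply, or None.

    The {...} regex does not handle nested objects, which is enough for a
    flat criterion->score mapping. A None result should be reported as a
    judge failure (and retried), not scored as zero.
    """
    for candidate in reversed(re.findall(r"\{[^{}]*\}", response)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

ok = 'Reviewing the script... {"multi_step_navigation": 0.9, "assertions": 1.0}'
bad = "Now I understand the form structure. Let me create the Playwright test script..."
print(extract_judge_scores(ok))   # parsed criterion scores
print(extract_judge_scores(bad))  # None -> treat as judge failure
```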

Model Observations

  • Claude Opus 4.6: Clean one-shot solution with retry() wrapper, tid() helper for data-testid selectors. Tried to run the script (exec blocked by preflight check for cd && python), then used workdir parameter successfully. Actually ran and verified the script produced success.png.

  • GPT-5.4: Most efficient — fewest tokens, cheapest, single-write approach with get_by_test_id(). Used proper Playwright patterns throughout. Tried to git commit (not a repo, gracefully handled). Didn't attempt to run the script.

  • Gemini 2.5 Pro: Initially wrote get_by_testid (wrong API name), hit an error, debugged it, rewrote with get_by_test_id. Showed good self-correction. Most expensive due to thinking tokens (212K total). Installed Playwright deps itself before running.

Playwright Setup Note

Playwright and Chromium aren't on the PinchBench snapshot. I installed them manually before running (pip3 install playwright && playwright install chromium && playwright install-deps). For this task to be included in standard benchmark runs, the snapshot would need updating, or the task could document that the agent should install deps itself (all 3 models attempted this).

Recommendation

Needs work before merge:

  1. 🔴 Fix LLM judge integration — The judge agent is behaving as a task solver rather than a grading evaluator. The judge prompt needs to be much more explicit that it should only output JSON scores, not attempt the task itself. This is likely a benchmark framework issue, not specific to this PR, but it makes the hybrid grading non-functional.

  2. 🟡 Fix fills_multiple_fields threshold — The regex counting approach undercounts when agents use helper functions. Consider lowering the threshold from 10 to 5 for full marks, or counting at a higher level (e.g., number of unique field references rather than raw method calls). All 3 models filled 7+ fields but scored 0.5.

  3. 🟡 Task number conflict — The snapshot already has task_25_access_log_anomaly.md on main. This PR's task_25_playwright_e2e.md coexists by filename, but the task ID prefix task_25 is shared, which could cause confusion. Consider renumbering (e.g., task_26).

  4. 🟢 Playwright dependency — Either update the snapshot to include Playwright, or document that agents should install it. All models handled this gracefully.

The task itself is well-designed — good difficulty level, realistic browser automation scenario, and the automated checks are mostly solid. The main blocker is the broken judge scoring.

@olearycrew
Member

> Playwright and Chromium aren't on the PinchBench snapshot. I installed them manually before running (pip3 install playwright && playwright install chromium && playwright install-deps). For this task to be included in standard benchmark runs, the snapshot would need updating, or the task could document that the agent should install deps itself (all 3 models attempted this).

I think keeping them out of the snapshot is the way to go here. The model knowing to install them is worth being part of the test, TBH.

@ScuttleBot

🦀 PinchBench PR Test Results — Task 25: Playwright E2E Form Test

Instance: Vultr vc2-2c-4gb (Ubuntu 22.04, 4GB RAM) | Region: ATL
Branch: hi0001234d:feat/playwright-e2e merged onto main (snapshot 79926f8e)
Playwright: v1.58.0 + Chromium Headless Shell 145.0
All 3 models ran in parallel


Scores

| Model | Overall | Auto (50%) | Judge (50%) | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| claude-opus-4.6 | 45.3% ⚠️ | 90.6% (7.25/8) | ❌ Timed out | 103.7s | $0.42 | 194K |
| gpt-5.4 | 46.9% ⚠️ | 93.8% (7.5/8) | ❌ Timed out | 81.3s | $0.09 | 102K |
| gemini-3.1-pro-preview | 91.4% | 96.9% (7.75/8) | 86.3% | 89.5s | $0.12 | 77K |

⚠️ Claude and GPT scores are artificially low — the LLM judge (claude-opus-4.5) timed out during judging, zeroing out the 50% judge component. Their automated scores (90.6% and 93.8%) show strong actual performance.


Automated Check Breakdown

| Check | Claude | GPT | Gemini |
|---|---|---|---|
| file_created | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| valid_python | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| imports_playwright | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| references_form_html | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| fills_multiple_fields | ⚠️ 0.5 | ⚠️ 0.5 | ✅ 1.0 |
| has_retry_or_wait | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |
| has_assertions | ⚠️ 0.75 | ✅ 1.0 | ⚠️ 0.75 |
| saves_screenshot | ✅ 1.0 | ✅ 1.0 | ✅ 1.0 |

LLM Judge Breakdown (Gemini only — others timed out)

| Criterion | Score |
|---|---|
| Multi-step navigation | 0.90 |
| Review data assertions | 1.00 |
| Error handling & retry | 0.75 |
| Code quality & best practices | 0.80 |

Judge notes: Script successfully navigates all 3 steps using data-testid selectors and select_option for state dropdown. All 4 review fields verified with exact text assertions. Retry logic implemented with 3 attempts and 1s delay, but catches generic Exception instead of specific Playwright errors and lacks finally block for cleanup. Code is clean and organized with helper function, but has hardcoded file path and no comments.
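The two gaps the judge flagged (catching generic Exception, no finally cleanup) can be patched with a retry helper along these lines. This is a sketch only: in the real test_form.py the caught type would be playwright.sync_api.TimeoutError and the cleanup a browser.close(); a stdlib stand-in exception keeps the sketch runnable without Playwright installed.

```python
import time

class PlaywrightTimeout(Exception):
    """Stand-in for playwright.sync_api.TimeoutError (not imported here)."""

def with_retry(action, attempts=3, delay=1.0, cleanup=None):
    """Run action, retrying only on the specific timeout error, with guaranteed cleanup."""
    try:
        for attempt in range(1, attempts + 1):
            try:
                return action()
            except PlaywrightTimeout:
                if attempt == attempts:
                    raise          # out of retries: propagate the real error
                time.sleep(delay)  # back off before the next attempt
    finally:
        if cleanup is not None:
            cleanup()              # e.g. browser.close() in the real script

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise PlaywrightTimeout("element not visible yet")
    return "ok"

print(with_retry(flaky, attempts=3, delay=0.0))  # ok (succeeds on the third attempt)
```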


Key Observations

1. fills_multiple_fields check is too strict for well-structured code
Claude (166 LOC) and GPT (123 LOC) used helper functions / data-driven patterns that wrap .fill() calls inside retry loops. The regex counter only found 3-4 raw .fill()/.click() calls in their code, scoring them 0.5 (needs ≥10 for 1.0). Gemini (58 LOC) used inline calls and hit 10. This penalizes good engineering practice (DRY code with helpers) over brute-force repetition. Consider counting the logical field interactions rather than raw regex matches.

2. LLM Judge timeout is a reliability concern
2 of 3 runs had judge timeouts, zeroing out 50% of the score. This is an infrastructure issue (the judge agent session timed out), not a task design issue. The task itself completed fine for all models. For hybrid-graded tasks, a judge failure shouldn't zero the entire judge component — consider a fallback or retry.

3. Task difficulty feels about right

  • All 3 models successfully created valid Playwright scripts with correct structure
  • All produced working screenshots
  • Execution times (81-104s) are well within the 300s timeout
  • The task differentiates code quality through the judge criteria
  • Auto checks at ~91-97% mean the basic requirements are clear and achievable

4. Task ID collision
The snapshot already has task_25_access_log_anomaly.md. The PR adds task_25_playwright_e2e.md. Both loaded fine (different IDs in frontmatter), but having two task_25_* files could confuse operators. Consider renumbering to task_26.


Installation Notes

Playwright setup on the benchmark VM required:

```shell
pip install playwright
playwright install chromium
playwright install-deps chromium
```

This added ~110MB for Chromium. The vc2-2c-4gb plan (4GB RAM) handled it fine. The default vc2-1c-2gb (2GB) would likely also work but could be tight with Chromium + benchmark agent + judge running concurrently.


Recommendation: Merge with changes

The task is well-designed and fills a genuine gap (0 browser automation tasks). Two things to address:

  1. Fix fills_multiple_fields regex grading — it penalizes DRY code. Count logical field interactions or AST-walk for function calls, not just raw regex on .fill().
  2. Renumber to task_26 to avoid the ID collision with existing task_25_access_log_anomaly.
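As a sketch of the first point (illustrative only, not the PinchBench grader): an AST pass can count distinct selector-like string literals, which approximates logical field interactions regardless of whether the calls are inlined or wrapped in helpers and loops.

```python
import ast

def count_unique_field_refs(source: str) -> int:
    """Count distinct selector-like string literals in a test script.

    Heuristic: any string constant starting with '#', '[', or '.' is
    treated as a selector, so '#email' counts once no matter how the
    agent structures the code around it.
    """
    tree = ast.parse(source)
    selectors = {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith(("#", "[", "."))
    }
    return len(selectors)

helper_style = """
FIELDS = [("#first", "Ada"), ("#last", "Lovelace"), ("#email", "ada@example.com")]

def fill_field(page, sel, value):
    page.fill(sel, value)

for sel, value in FIELDS:
    fill_field(page, sel, value)
page.click("[data-testid=next]")
"""

print(count_unique_field_refs(helper_style))  # 4: three fields plus the next button
```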

The judge timeout is a broader infrastructure issue, not specific to this PR.


Tested by PinchBench CI (ScuttleBot) on 2026-04-09 | Instance destroyed after testing
