Add Task 25: Playwright E2E Form Test (automation category) #120
hi0001234d wants to merge 1 commit into pinchbench:main
🦀 PR #120 Test Results - Task 25: Playwright E2E Form Test
Tested on a fresh Vultr instance (Ubuntu 22.04, 4GB RAM,
Model Scores (task_25 only)
Automated Check Breakdown (identical across all 3 models)
🚨 Critical Issue: LLM Judge Always Fails
The LLM judge (claude-opus-4.5) failed for all 3 models with a timeout. Root cause: the judge agent doesn't understand that it's supposed to evaluate the code; it tries to write and run its own Playwright script instead, spending its entire timeout on that. The judge response never contains the expected JSON with criterion scores.
From the judge logs:
This means the hybrid grading (50% automated + 50% LLM judge) is effectively broken: only the 50% automated portion scores. Every model's score is capped at ~47% regardless of code quality.
Model Observations
Playwright Setup Note
Playwright and Chromium aren't on the PinchBench snapshot. I installed them manually before running (
Recommendation
Needs work before merge:
The task itself is well-designed: good difficulty level, realistic browser automation scenario, and the automated checks are mostly solid. The main blocker is the broken judge scoring.
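To make the scoring arithmetic above concrete, here is a minimal sketch of a 50/50 hybrid score where a timed-out judge contributes nothing. The function and constant names are hypothetical and not taken from the PinchBench codebase; only the 50% / 50% weights come from the comment above.

```python
from typing import Optional

# Weights as described in the review comment (50% automated + 50% LLM judge).
AUTOMATED_WEIGHT = 0.5
JUDGE_WEIGHT = 0.5


def hybrid_score(automated: float, judge: Optional[float]) -> float:
    """Combine automated and judge scores, both expected in [0.0, 1.0].

    `judge` is None when the judge timed out or returned no parseable
    scores; in this sketch that is treated as a judge score of 0.
    """
    for name, value in (("automated", automated), ("judge", judge)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    judge_part = 0.0 if judge is None else judge
    return AUTOMATED_WEIGHT * automated + JUDGE_WEIGHT * judge_part


# With the judge missing, even a perfect automated run tops out at 0.5.
print(hybrid_score(1.0, None))
```

Under this model, a reported cap of ~47% would correspond to an automated score of roughly 0.94 with a judge score of 0. That figure is an inference from the numbers in the comment, not something the test results state directly.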
I think keeping them out of the snapshot should be the way here. The model knowing to install them is worth being part of the test TBH
🦀 PinchBench PR Test Results - Task 25: Playwright E2E Form Test
Instance: Vultr vc2-2c-4gb (Ubuntu 22.04, 4GB RAM) | Region: ATL
Scores
Automated Check Breakdown
LLM Judge Breakdown (Gemini only — others timed out)
Judge notes: Script successfully navigates all 3 steps using data-testid selectors and select_option for state dropdown. All 4 review fields verified with exact text assertions. Retry logic implemented with 3 attempts and 1s delay, but catches generic Exception instead of specific Playwright errors and lacks finally block for cleanup. Code is clean and organized with helper function, but has hardcoded file path and no comments.
Key Observations
1.
2. LLM Judge timeout is a reliability concern
3. Task difficulty feels about right
4. Task ID collision
Installation Notes
Playwright setup on the benchmark VM required:
pip install playwright
playwright install chromium
playwright install-deps chromium
This added ~110MB for Chromium.
Recommendation: Merge with changes
The task is well-designed and fills a genuine gap (0 browser automation tasks). Two things to address:
The judge timeout is a broader infrastructure issue, not specific to this PR.
Tested by PinchBench CI (ScuttleBot) on 2026-04-09 | Instance destroyed after testing
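For reference, the judge notes above flag two gaps in the submitted retry logic: it catches generic Exception, and it has no finally block for cleanup. A minimal sketch of a retry helper that addresses both might look like the following. The helper name and parameters are hypothetical; the 3 attempts and 1s delay mirror the values in the judge notes.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retry(
    action: Callable[[], T],
    cleanup: Callable[[], None],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay_s: float = 1.0,
) -> T:
    """Run `action`, retrying only on the exception types in `retry_on`.

    `cleanup` runs after every attempt, whether it succeeded or failed.
    Exceptions outside `retry_on` propagate immediately without a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)  # wait before the next attempt
        finally:
            cleanup()  # e.g. close the page or browser for this attempt
    # All attempts failed with a retryable error; surface the last one.
    raise last_error
```

In a Playwright script, `retry_on` could be `(Error, TimeoutError)` imported from `playwright.sync_api`, with `cleanup` closing the browser opened for that attempt. How the task's reference solution structures this is not shown in the thread, so treat this as one possible shape rather than the expected answer.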
New Task: Browser Automation Benchmark
Adds the first browser automation task to PinchBench — a category currently missing
Related: #52 (browser automation gap, discussed in that issue)
What it does
The agent receives a 3-step HTML registration form (form.html) and must write a Playwright E2E test script (test_form.py) that navigates all steps, validates state, and includes retry logic.
Files
tasks/task_25_playwright_e2e.md: Task definition with hybrid grading
assets/form.html: Self-contained 3-step form fixture (pure HTML/CSS/JS)
Grading
Testing done
Why this task
Category: automation | Timeout: 300s | Difficulty: medium