Hi team,
Congratulations on the ICLR acceptance — awesome work! I tried setting up the WebArena-Lite-v2 evaluation from this repo. After fixing a few small issues (e.g., parsing actions like `answer(xxxx)` and handling `write` after cleaning up), I ran the 7B checkpoint with a 15-step budget and got the following results:
| Domain | Pass@1 | Pass@4 | Count |
| --- | --- | --- | --- |
| Gitlab | 0.2833 | 0.4000 | 30 |
| Map | 0.1635 | 0.2308 | 26 |
| Reddit | 0.2632 | 0.3684 | 19 |
| Shopping | 0.2330 | 0.3182 | 44 |
| ShoppingAdmin | 0.3143 | 0.3714 | 35 |
| **Average** | 0.2532 | 0.3377 | 154 |
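In case it helps cross-checking, the Average row is the count-weighted mean of the per-domain rates; a minimal sketch to reproduce it (numbers copied from the table above, so tiny rounding differences against the table are expected):

```python
# Per-domain results: (Pass@1, Pass@4, task count), copied from the table above.
domains = {
    "Gitlab":        (0.2833, 0.4000, 30),
    "Map":           (0.1635, 0.2308, 26),
    "Reddit":        (0.2632, 0.3684, 19),
    "Shopping":      (0.2330, 0.3182, 44),
    "ShoppingAdmin": (0.3143, 0.3714, 35),
}

# Count-weighted averages over all 154 tasks.
total = sum(n for _, _, n in domains.values())
pass1 = sum(p1 * n for p1, _, n in domains.values()) / total
pass4 = sum(p4 * n for _, p4, n in domains.values()) / total
print(f"Count: {total}, Pass@1: {pass1:.4f}, Pass@4: {pass4:.4f}")
```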
This is lower than the reported number (~37). Do you have any insight into what settings or evaluation details might explain the gap (e.g., step limit, prompts, environment versions, or action parsing)?
Thanks a lot!