Hi team,
Congratulations on the ICLR acceptance — awesome work! I tried setting up the WebArena-Lite-v2 evaluation from this repo. After fixing a few small issues (e.g., parsing actions like `answer(xxxx)` and handling `write` after cleaning up), I ran the 7B checkpoint with a 15-step budget and got the following results:
| Domain | Pass@1 | Pass@4 | Count |
| --- | --- | --- | --- |
| Gitlab | 0.2833 | 0.4000 | 30 |
| Map | 0.1635 | 0.2308 | 26 |
| Reddit | 0.2632 | 0.3684 | 19 |
| Shopping | 0.2330 | 0.3182 | 44 |
| ShoppingAdmin | 0.3143 | 0.3714 | 35 |
| **Average** | 0.2532 | 0.3377 | 154 |
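In case it helps cross-checking, the Average row is the count-weighted mean of the per-domain rates; a minimal sketch to reproduce it (numbers copied from the table above, so tiny rounding differences against the table are expected):

```python
# Per-domain results: (Pass@1, Pass@4, task count), copied from the table above.
domains = {
    "Gitlab":        (0.2833, 0.4000, 30),
    "Map":           (0.1635, 0.2308, 26),
    "Reddit":        (0.2632, 0.3684, 19),
    "Shopping":      (0.2330, 0.3182, 44),
    "ShoppingAdmin": (0.3143, 0.3714, 35),
}

# Count-weighted averages over all 154 tasks.
total = sum(n for _, _, n in domains.values())
pass1 = sum(p1 * n for p1, _, n in domains.values()) / total
pass4 = sum(p4 * n for _, p4, n in domains.values()) / total
print(f"Count: {total}, Pass@1: {pass1:.4f}, Pass@4: {pass4:.4f}")
```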
This is lower than the reported number (~37). Do you have any insight into what settings or evaluation details might explain the gap (e.g., step limit, prompts, environment versions, or action parsing)?
Thanks a lot!