About the training process of UI-R1-E-3B

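I am trying to reproduce your ablation experiment for UI-R1-E-3B. Using the same training setup that UI-R1-3B used on its 136 samples, I ran 1 epoch of GRPO training on the 2k samples, but the result on ScreenSpot only reaches 79.6%. Are there other experimental details I should pay attention to? For example, should the number of epochs be set higher? That said, the reward curve already looks close to stable.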

<img width="408" height="270" alt="Image" src="https://github.com/user-attachments/assets/b2a54c22-9818-44a4-ba7e-17291e6e85e0" />


<img width="398" height="262" alt="Image" src="https://github.com/user-attachments/assets/3b4c282c-6b6a-494a-a527-4887abd84ec5" />

<img width="400" height="269" alt="Image" src="https://github.com/user-attachments/assets/79639978-d04d-46bd-b4bb-7fcca69926ab" />

About the training process of UI-R1-E-3B #21
