Description
I used run_mlm.py along with the training data you provided for pre-training. After training for 8 epochs with a batch size of 32 across 32 GPUs, each with 64 GB of VRAM, the model's performance remains quite low even as training approaches convergence. Could you kindly share the detailed pre-training script?
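For reference, my launch looked roughly like this; the paths, initialization checkpoint, and rendezvous settings below are placeholders rather than my exact values:

```bash
# Rough sketch of the launch: 4 nodes x 8 GPUs = 32 GPUs total.
# All paths below are placeholders.
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29500 \
  run_mlm.py \
  --model_name_or_path ./init_model \
  --train_file ./data/pretrain_train.txt \
  --validation_file ./data/pretrain_valid.txt \
  --per_device_train_batch_size 32 \
  --num_train_epochs 8 \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --output_dir ./mlm_output \
  --overwrite_output_dir
```

The tail of the training log: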
```
{'eval_loss': 5.036761283874512, 'eval_accuracy': 0.2256266404065416, 'eval_f1': 0.2653012523859082, 'eval_mcc': 0.22392924231341627, 'eval_runtime': 16.2194, 'eval_samples_per_second': 1190.672, 'eval_steps_per_second': 1.171, 'epoch': 6.88}
{'loss': 5.0858, 'grad_norm': 0.2914751172065735, 'learning_rate': 6.51907022425929e-06, 'epoch': 6.96}
{'loss': 5.0805, 'grad_norm': 0.2778473198413849, 'learning_rate': 6.007529873956458e-06, 'epoch': 7.04}
{'loss': 5.0823, 'grad_norm': 0.28457939624786377, 'learning_rate': 5.495989523653626e-06, 'epoch': 7.12}
{'loss': 5.082, 'grad_norm': 0.282247930765152, 'learning_rate': 4.984449173350794e-06, 'epoch': 7.2}
{'eval_loss': 5.031910419464111, 'eval_accuracy': 0.22645610628951326, 'eval_f1': 0.2671266363782628, 'eval_mcc': 0.22476370202238263, 'eval_runtime': 16.2135, 'eval_samples_per_second': 1191.109, 'eval_steps_per_second': 1.172, 'epoch': 7.2}
{'loss': 5.0789, 'grad_norm': 0.279897004365921, 'learning_rate': 4.472908823047963e-06, 'epoch': 7.28}
{'loss': 5.0803, 'grad_norm': 0.2841487526893616, 'learning_rate': 3.9613684727451305e-06, 'epoch': 7.37}
{'loss': 5.0788, 'grad_norm': 0.32013562321662903, 'learning_rate': 3.4498281224422987e-06, 'epoch': 7.45}
{'loss': 5.076, 'grad_norm': 0.28553149104118347, 'learning_rate': 2.9382877721394665e-06, 'epoch': 7.53}
{'eval_loss': 5.032684803009033, 'eval_accuracy': 0.22627068077954807, 'eval_f1': 0.266603782932162, 'eval_mcc': 0.22457673959967936, 'eval_runtime': 16.3434, 'eval_samples_per_second': 1181.637, 'eval_steps_per_second': 1.163, 'epoch': 7.53}
{'loss': 5.0784, 'grad_norm': 0.29609060287475586, 'learning_rate': 2.4267474218366343e-06, 'epoch': 7.61}
{'loss': 5.079, 'grad_norm': 0.29126057028770447, 'learning_rate': 1.9152070715338025e-06, 'epoch': 7.69}
{'loss': 5.0747, 'grad_norm': 0.27743813395500183, 'learning_rate': 1.4036667212309707e-06, 'epoch': 7.78}
{'loss': 5.0772, 'grad_norm': 0.2757411301136017, 'learning_rate': 8.921263709281388e-07, 'epoch': 7.86}
{'eval_loss': 5.037484169006348, 'eval_accuracy': 0.22560653778113435, 'eval_f1': 0.2660859160040803, 'eval_mcc': 0.22390840831026426, 'eval_runtime': 16.2032, 'eval_samples_per_second': 1191.866, 'eval_steps_per_second': 1.173, 'epoch': 7.86}
{'loss': 5.0776, 'grad_norm': 0.2676142752170563, 'learning_rate': 3.8058602062530694e-07, 'epoch': 7.94}
{'train_runtime': 81132.2624, 'train_samples_per_second': 616.824, 'train_steps_per_second': 0.602, 'train_loss': 5.26640695736439, 'epoch': 8.0}
```
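In case it helps with debugging, the logged throughput is consistent with a global batch size of about 1024, i.e. a per-device batch size of 32 across the 32 GPUs (a quick sanity check in Python):

```python
# All inputs taken from the training summary above.
train_runtime = 81132.2624   # seconds
steps_per_second = 0.602
train_samples = 6255545
epochs = 8

total_steps = train_runtime * steps_per_second   # ~48,842 optimizer steps
samples_seen = train_samples * epochs            # 50,044,360 samples
global_batch = samples_seen / total_steps        # ~1024.6
print(global_batch)  # ~1024 = 32 GPUs x 32 per-device batch
```

The final evaluation and summary metrics: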
```
07/29/2025 23:01:43 - INFO - __main__ - *** Evaluate ***
***** train metrics *****
epoch = 8.0
total_flos = 32624038352GF
train_loss = 5.2664
train_runtime = 22:32:12.26
train_samples = 6255545
train_samples_per_second = 616.824
train_steps_per_second = 0.602
07/29/2025 23:01:44 - INFO - __main__ - *** Evaluate ***
***** eval metrics *****
epoch = 8.0
eval_accuracy = 0.2252
eval_f1 = 0.2643
eval_loss = 5.0414
eval_mcc = 0.2235
eval_runtime = 0:00:15.66
eval_samples = 19312
eval_samples_per_second = 1233.169
eval_steps_per_second = 1.213
perplexity = 154.6872
```
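For what it's worth, the reported perplexity is just the exponential of the eval loss, so it adds no information beyond the loss itself:

```python
import math
print(math.exp(5.0414))  # ~154.69, matching the reported perplexity
```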