Improve eval metrics logging#3840

Open
aireenmei wants to merge 1 commit into main from aireen/eval_metrics

Conversation


aireenmei (Collaborator) commented May 7, 2026

Description

b/509626795

Summary of issue:

When running eval after every train step (eval_interval=1), the reported train step time increases significantly, from ~9.5s to ~11.5s.

Root cause analysis:

  • Major: in the current code, the train step time boundary mistakenly includes the eval loop, so the train step recorded right before an eval includes the eval time, inflating that step's time. Most users set a larger eval_interval, so the inflated step appears only once every many steps and the issue goes unnoticed.
  • Minor: train step metrics are always reported one step late to avoid blocking the data loading of the next step. Eval metrics, however, are computed immediately after the last eval step (the only eval step in this case), blocking the data loading of the following training step.
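The major issue above can be illustrated with a minimal timing sketch (not the actual MaxText code; `train_step` and `eval_step` are hypothetical placeholders): the fix is to close the train-step timer before entering the eval loop, so eval time is never attributed to the preceding train step.

```python
import time


def train_loop(num_steps, eval_interval, train_step, eval_step):
  """Hypothetical loop sketching the corrected step-time boundary."""
  step_times, eval_times = [], []
  for step in range(num_steps):
    start = time.perf_counter()
    train_step(step)
    # Fix: record the train step's end time here, BEFORE any eval work,
    # so the eval loop below cannot inflate the train step time.
    step_times.append(time.perf_counter() - start)
    if eval_interval > 0 and (step + 1) % eval_interval == 0:
      eval_start = time.perf_counter()
      eval_step(step)
      # Eval time is tracked and reported separately.
      eval_times.append(time.perf_counter() - eval_start)
  return step_times, eval_times
```

With the buggy boundary (timer closed after the `if` block), every step preceding an eval would include the eval duration, which matches the ~9.5s → ~11.5s inflation observed with eval_interval=1.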

Tests

Tested on v5e-32, default 1B model, per_device_batch_size=1:

  • With the old code, step time ~1.6s, screenshot
  • With the new code, step times are reported separately: train ~1s + eval ~0.5s, screenshot. There is a slight reduction in the total time, likely due to fixing the minor blocking issue stated above.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov Bot commented May 7, 2026


github-actions Bot commented May 7, 2026

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


github-actions Bot left a comment


## 📋 Review Summary

This Pull Request successfully introduces a buffering mechanism for metrics to overlap I/O and fixes a significant bug where evaluation time was mistakenly included in training step time measurements. The core logic in metric_logger.py is sound, but there are a few critical inconsistencies and potential data quality issues that should be addressed.

🔍 General Feedback

  • Inconsistent Inflation Fix: While the training step time inflation fix was correctly implemented in the main train.py loop, it was missed in the RL and deprecated SFT trainers.
  • TensorBoard Data Quality: The new "running" eval metrics are logged using the evaluation loop index as the step number, which will lead to overlapping and confusing data in TensorBoard.
  • Robustness: A small adjustment to the buffering order in metric_logger.py can prevent the loss of the final training step's metrics when training is stopped due to reaching the target loss.
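The buffering mechanism and the flush-ordering concern described above could be sketched as follows (a hypothetical simplification, not the actual `metric_logger.py` code; `write_fn` stands in for whatever backend writer is used): metrics for the current step are buffered and only the previous step's metrics are materialized, so host-side metric work overlaps with the next step's data loading, and an explicit flush prevents losing the final step's metrics when training stops early.

```python
class BufferedMetricLogger:
  """Sketch of one-step-delayed metric reporting (assumed interface)."""

  def __init__(self, write_fn):
    self._write = write_fn   # e.g. a TensorBoard writer; an assumption here
    self._pending = None     # (step, metrics) buffered but not yet written

  def record(self, step, metrics):
    # Write the PREVIOUS step's metrics first, then buffer the current ones.
    # This keeps the expensive write off the critical path of the step that
    # just finished, overlapping it with the next step's data loading.
    if self._pending is not None:
      self._write(*self._pending)
    self._pending = (step, metrics)

  def flush(self):
    # Must be called when training stops (e.g. on reaching the target loss)
    # so the final step's buffered metrics are not silently dropped.
    if self._pending is not None:
      self._write(*self._pending)
      self._pending = None
```

The reviewer's robustness point corresponds to ordering the buffer swap so that an early exit still leaves the last step's metrics recoverable via `flush()`.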

Comment thread src/maxtext/common/metric_logger.py

github-actions Bot left a comment


## 📋 Review Summary (Addendum)

Adding the missing inline comments regarding the training step time inflation fix in RL/SFT trainers and the buffering order in metric_logger.py.


github-actions Bot left a comment


## 📋 Review Summary (Final Addendum)

Adding the missing inline comments regarding the training step time inflation fix in RL/SFT trainers and the buffering order in metric_logger.py.

Comment thread src/maxtext/common/metric_logger.py Outdated
Comment thread src/maxtext/experimental/rl/grpo_trainer.py
Comment thread src/maxtext/trainers/post_train/sft/train_sft_deprecated.py
@aireenmei aireenmei force-pushed the aireen/eval_metrics branch from e4edcdc to 0b39494 Compare May 7, 2026 22:00
@aireenmei aireenmei force-pushed the aireen/eval_metrics branch 3 times, most recently from 26d1570 to c81f165 Compare May 8, 2026 06:09
@aireenmei aireenmei force-pushed the aireen/eval_metrics branch from c81f165 to 8f9b3a9 Compare May 8, 2026 06:51