This Pull Request successfully introduces a buffering mechanism for metrics to overlap I/O and fixes a significant bug where evaluation time was mistakenly included in training step time measurements. The core logic in metric_logger.py is sound, but there are a few critical inconsistencies and potential data quality issues that should be addressed.
🔍 General Feedback
- Inconsistent Inflation Fix: While the training step time inflation fix was correctly implemented in the main train.py loop, it was missed in the RL and deprecated SFT trainers.
- TensorBoard Data Quality: The new "running" eval metrics are logged using the evaluation loop index as the step number, which will lead to overlapping and confusing data in TensorBoard.
- Robustness: A small adjustment to the buffering order in metric_logger.py can prevent the loss of the final training step's metrics when training is stopped due to reaching the target loss.
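
To illustrate the robustness point, here is a minimal sketch of a one-step-delayed metric buffer. It is not the code in metric_logger.py: the class BufferedMetricLogger, its methods, and the sink callable are hypothetical names chosen for this example. It assumes the buffer holds the current step's metrics while the previous step's metrics are written, and that a flush is issued when the loop exits.

```python
class BufferedMetricLogger:
    """Hypothetical one-step-delayed logger (illustration only)."""

    def __init__(self, sink):
        self.sink = sink        # callable(step, metrics) doing the slow write
        self._pending = None    # (step, metrics) held back from the last call

    def buffer_and_write(self, step, metrics):
        # Store the new entry before writing the old one, so the newest
        # metrics are already held if the caller stops right after this call.
        previous, self._pending = self._pending, (step, metrics)
        if previous is not None:
            self.sink(*previous)

    def flush(self):
        # Write whatever is still buffered, e.g. the final training step.
        if self._pending is not None:
            self.sink(*self._pending)
            self._pending = None


written = []
logger = BufferedMetricLogger(lambda step, m: written.append((step, m["loss"])))

target_loss = 0.5  # assumed early-stop threshold for the example
for step, loss in enumerate([1.0, 0.7, 0.4, 0.3]):
    logger.buffer_and_write(step, {"loss": loss})
    if loss <= target_loss:
        break  # early stop: step 2 is still sitting in the buffer

logger.flush()  # without this, step 2's metrics would be lost
print(written)
```

In this sketch the entry for the step that triggers the early stop is buffered before the loop breaks, so the final flush still writes it.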
Description
b/509626795
Summary of issue:
When running eval after every train step (eval_interval=1), the train step time reported in the logs increases significantly, from ~9.5s to 11.5s.
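
The kind of inflation described here can be reproduced with a small sketch, assuming step time is measured as the gap between consecutive timestamps and that eval runs between them. The function run_loop, its parameters, and the FakeClock helper are hypothetical and are not taken from train.py; the 2.0s eval duration is an illustrative assumption chosen to match the reported numbers.

```python
import time


def run_loop(train_step, eval_step, num_steps, eval_interval,
             clock=time.perf_counter):
    """Return per-step training times that exclude eval time."""
    step_times = []
    last = clock()
    for step in range(num_steps):
        train_step(step)
        step_times.append(clock() - last)  # measure before eval starts
        if eval_interval > 0 and (step + 1) % eval_interval == 0:
            eval_step(step)
        # Reset the reference point after eval, so eval time is not
        # charged to the next training step.
        last = clock()
    return step_times


class FakeClock:
    """Deterministic clock so the example does not depend on wall time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


fake = FakeClock()


def fake_train(step):
    fake.now += 9.5  # assumed training step duration in seconds


def fake_eval(step):
    fake.now += 2.0  # assumed eval duration in seconds


times = run_loop(fake_train, fake_eval, num_steps=3, eval_interval=1,
                 clock=fake)
print(times)
```

If the reference timestamp were not reset after eval, every step after the first would report 11.5s in this sketch instead of 9.5s.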
Root cause analysis:
Tests
Tested on v5e-32, default 1b model, per_device_batch_size=1
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.