Asynchronous Prefetching — submission notes
1191 steps (with technique) vs 1137 steps (default) in 600 s on local compute
Key changes
Same model, optimizer, data layout, and training math as the baseline. This is a general-purpose rework that could apply to most other approaches for a slight speed boost: overlap CPU data prep and host→device copies with GPU work so the GPU spends less time idle.
- A background loader in `train_gpt_og_linux.py` (`PrefetchingDistributedTokenLoader`) builds the next pinned CPU batch while the GPU runs the current step. Primary win: CPU work overlaps GPU compute (not GPU-side double-buffering of H2D vs forward).
- Host→device copies run on a dedicated CUDA stream (`TRAIN_COPY_STREAM`, off when timing diagnostics are on). Transfers use pinned memory; the training path still waits for that step's H2D before forward (`wait_stream`).
- `VAL_BYTECOUNT_DEVICE=cpu` moves BPB byte counting off the GPU vs the original (set `cuda` to mirror the baseline's GPU LUT math).

Diagnostics
To measure how much time this actually saves, I added `TRAINING_TIMING_BREAKDOWN` (batch CPU vs H2D vs FWD/BWD/opt vs val; adds syncs). When enabled, timing lines log every `TRAINING_TIMING_EVERY` steps (default 200) and for the first 10 steps. Extra logs: train/val I/O mode, `val_stage_time_ms`, and the train vs val wall-time split. `VAL_BYTECOUNT_DEVICE` defaults to `cpu` in the improved script (not an extra flag you must set); use `cuda` if you want validation byte math on the GPU like the original.

Optional
`VAL_PROGRESS_LOG_EVERY` (default 0): set to a positive value to log per-batch validation progress (`val_progress: ...`).

Defaults & toggles
Overlap features are on by default (`TRAIN_PREFETCH`, `TRAIN_COPY_STREAM`, `VAL_PREFETCH`, `VAL_COPY_STREAM`, etc.) and can be turned off via env vars if needed. `TRAINING_TIMING_BREAKDOWN` defaults to 0, so no timing output is printed. Prefetch/overlap are automatically disabled when `TRAINING_TIMING_BREAKDOWN=1` so timings stay interpretable.

Idea
Prefetch training and validation batches asynchronously and parallelize CPU ↔ GPU transfers with compute to minimize pipeline bubbles under a fixed wall-clock budget.
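As a rough sketch of this pattern (hypothetical names throughout; the submission's actual `PrefetchingDistributedTokenLoader` differs in its details), a background thread builds pinned CPU batches while a side CUDA stream handles the H2D copy:

```python
import queue
import threading

import torch


class PrefetchLoader:
    """Builds the next CPU batch on a background thread while the GPU runs
    the current step; on CUDA, batches are pinned and copied on a side
    stream so the H2D transfer can overlap compute."""

    def __init__(self, batch_fn, device="cpu", depth=2):
        self.batch_fn = batch_fn  # callable returning a fresh CPU tensor batch
        self.device = torch.device(device)
        self.use_cuda = self.device.type == "cuda"
        self.copy_stream = torch.cuda.Stream() if self.use_cuda else None
        self.q = queue.Queue(maxsize=depth)  # bounded: worker stays `depth` ahead
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            batch = self.batch_fn()
            if self.use_cuda:
                batch = batch.pin_memory()  # pinned memory enables async H2D
            self.q.put(batch)  # blocks when the queue is full

    def next_batch(self):
        batch = self.q.get()
        if self.use_cuda:
            with torch.cuda.stream(self.copy_stream):
                batch = batch.to(self.device, non_blocking=True)
            # The training stream must still wait for *this* step's copy
            # before forward, matching the wait_stream behavior noted above.
            torch.cuda.current_stream().wait_stream(self.copy_stream)
        return batch
```

In a training loop this stands in for the synchronous batch fetch; a production version would also guard the lifetimes of in-flight tensors (e.g. `Tensor.record_stream`) and handle clean shutdown.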
This is an intuitive idea I came up with that could give entries with real research and architectural advances a slight additional speed boost.
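To make the diagnostics mode described above concrete, here is a minimal sketch of phase timing with explicit syncs (a hypothetical helper; the real script's `TRAINING_TIMING_BREAKDOWN` logging is more detailed):

```python
import os
import time

import torch

# Mirrors the toggle described above: breakdown is off by default.
TIMING = os.environ.get("TRAINING_TIMING_BREAKDOWN", "0") == "1"


def timed(label, fn, sink):
    """Run fn() and accumulate its wall time (ms) into sink[label].

    When the breakdown flag is on, synchronize the GPU around the call so
    the elapsed time is attributable to this phase alone. These syncs are
    exactly why prefetch/overlap must be disabled while timing.
    """
    if TIMING and torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn()
    if TIMING and torch.cuda.is_available():
        torch.cuda.synchronize()
    sink[label] = sink.get(label, 0.0) + (time.perf_counter() - t0) * 1e3
    return out
```

A step would then read like `loss = timed("forward", lambda: model(x), stats)` for each phase, with `stats` emitted every `TRAINING_TIMING_EVERY` steps.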
Why this may be unimpactful in some cases
With `TRAINING_TIMING_BREAKDOWN=1`, early-step timing lines look like this (same hardware / config as above; `grad_accum_steps=8`, per-micro averages for batch/forward/backward):

How to read this:
`batch_cpu_ms` and `batch_h2d_ms` are ~0.3 ms per micro-step; `forward_ms` and `backward_ms` are ~30 ms and ~65 ms per micro-step. Scaled by 8 micro-steps, batch prep + H2D is on the order of ~5 ms per optimizer step, while forward + backward + optimizer is on the order of 800+ ms. So data movement is a tiny slice of the step, and overlapping it cannot move wall-clock time much when the GPU is already busy with compute for almost the whole step.

Caveat: on a much faster GPU (or a smaller model / larger batch so steps are shorter), the same CPU + H2D work could become a larger fraction of the step, and prefetch or val overlap might show up more in profiles. The breakdown above is not universal; it only shows why the optimization can be a no-op when compute is the bottleneck.
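The arithmetic above can be checked directly; the numbers below are the rough per-micro figures quoted in the breakdown, not exact measurements:

```python
# Rough per-micro timings from the breakdown above (illustrative values).
micro_steps = 8
data_ms = micro_steps * (0.3 + 0.3)   # batch_cpu + batch_h2d per micro-step
compute_ms = 800.0                    # forward + backward + optimizer, approx

# Even with perfect overlap, the most that can be hidden is the data slice.
data_fraction = data_ms / (compute_ms + data_ms)
max_speedup = (compute_ms + data_ms) / compute_ms

print(f"data is {100 * data_fraction:.2f}% of the step; "
      f"best-case speedup ~{max_speedup:.3f}x")  # well under 1%
```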