(Nonrecord) Applied Async Prefetching Potentially Boosts Performance #785
Open — SirSaltySalmon wants to merge 7 commits into openai:main
LeakyReLU^2 + Legal TTT + Parallel Muon + systems: prefetch & fusion-friendly MLP
Reference baseline: 2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md

Outcome
This variant improves throughput slightly, but does not improve quality versus the original 3-seed 8xH100 runs.
3-seed comparison (8xH100, 600s train budget)

- step_avg: 83.53ms -> 83.44ms (faster)
- val_bpb (final_int6_sliding_window_exact): 1.12184 -> 1.12334 (worse by +0.00151)
- val_bpb (legal_ttt_exact): 1.11938 -> 1.12096 (worse by +0.00158)
1xH100 ablation (Modal sanity check, 600s train budget)
Interpretation
The data is consistent across all three seeds: the systems changes increase training throughput, but that throughput gain does not translate into better final validation quality in this setup.
So the result here is best described as a speed optimization with neutral-to-slightly-negative quality impact relative to the original record recipe. The quality gap likely reflects seed noise, since the training math and process are exactly the same.
On 1xH100, the same systems changes looked clearly positive (more steps and better post-TTT bpb), while on 8xH100 they remain speed-positive but quality-negative. The practical interpretation is that prefetch/fusion behavior does not transfer linearly from single-GPU to multi-GPU quality outcomes and should be treated as a throughput optimization first. Likely, I/O is no longer the bottleneck at larger scale; inter-GPU communication becomes the more relevant target.
I will continue iterating on this, as the increased training speed shows promise. This attempt tries to show that async prefetching and memory pinning can improve the throughput of most approaches, but more experimentation is needed to investigate compatibility with other methods. Next, I aim to improve the optimization's compatibility with parallel GPUs.
What changed vs. base record
All differences are in data loading and the MLP forward; model architecture, banking, Parallel Muon, FlashAttention-3, torch.compile usage, TTT protocol, and env-driven hyperparameters are otherwise aligned with the base PR.

1. Pinned async prefetch (`PrefetchingDistributedTokenLoader`)

- Built on `queue` and `threading`.
- `TRAIN_PREFETCH` (default 1), `TRAIN_PREFETCH_QUEUE` (default 2), `TRAIN_COPY_STREAM` (default 1) — when the copy stream is enabled with prefetch, H2D copies use a dedicated `torch.cuda.Stream` and the default stream waits on it.
- `_cpu_batch_from_stream` and `_h2d_int64_batches`: `(x, y)` are produced on CPU, made `contiguous().pin_memory()`, and placed in a bounded `queue.Queue`; `next_batch` dequeues and copies to device.
- `make_train_loader()` factory; after an optimizer state rewind (e.g. the SWA branch), the existing prefetch thread is `shutdown()` before a fresh loader is created so the token stream does not advance in the background.

2. Fusion-friendly LeakyReLU² MLP
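The prefetch pattern above can be sketched with a CPU-only, stdlib-only toy (the class name and batch source here are stand-ins; the real loader additionally pins host memory and performs the H2D copy on a dedicated `torch.cuda.Stream` inside `next_batch`):

```python
import queue
import threading

class PrefetchingLoaderSketch:
    """Bounded queue + background producer thread, as described above.

    Illustrative only: the actual PrefetchingDistributedTokenLoader also
    calls .contiguous().pin_memory() on each CPU batch and copies to the
    GPU on a dedicated CUDA stream when dequeuing.
    """

    def __init__(self, batch_source, queue_size=2):  # queue_size ~ TRAIN_PREFETCH_QUEUE
        self._queue = queue.Queue(maxsize=queue_size)
        self._stop = threading.Event()
        self._source = batch_source
        self._thread = threading.Thread(target=self._producer, daemon=True)
        self._thread.start()

    def _producer(self):
        # Fill the bounded queue until the source is exhausted or shutdown.
        for batch in self._source:
            while not self._stop.is_set():
                try:
                    self._queue.put(batch, timeout=0.1)
                    break
                except queue.Full:
                    continue
            if self._stop.is_set():
                return

    def next_batch(self):
        # Real loader: this is where the pinned H2D copy would happen.
        return self._queue.get()

    def shutdown(self):
        # Stop the producer so the token stream does not advance in the
        # background (needed before rebuilding the loader, e.g. after an
        # optimizer state rewind).
        self._stop.set()
        while True:  # drain so a blocked put() can exit
            try:
                self._queue.get_nowait()
            except queue.Empty:
                break
        self._thread.join(timeout=1.0)

# Stand-in batch source: (x, y) token chunks.
loader = PrefetchingLoaderSketch(iter([([1, 2], [2, 3]), ([3, 4], [4, 5])]))
first = loader.next_batch()
loader.shutdown()
```

The bounded queue is what keeps the producer from racing arbitrarily far ahead of training; the explicit `shutdown()` mirrors the loader-rebuild path noted above.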
Base:
This submission:
Mathematically identical to LeakyReLU(0.5)² feeding the down projection; the change is a layout / fusion hint for the compiled training graph, so Inductor fuses or simplifies more than before.
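The claimed equivalence is easy to check numerically. One branch-free rewrite (illustrative; the exact formulation in the submission is not shown here) uses the identity LeakyReLU(0.5)(x) = 0.75·x + 0.25·|x|:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # Reference: branchy LeakyReLU definition, then square.
    y = x if x > 0 else negative_slope * x
    return y * y

def leaky_relu_sq_branchfree(x):
    # Branch-free rewrite for slope 0.5: LeakyReLU(0.5)(x) = 0.75*x + 0.25*|x|.
    # Purely elementwise forms like this tend to be easier for
    # torch.compile / Inductor to fuse into neighboring kernels.
    y = 0.75 * x + 0.25 * abs(x)
    return y * y

# Sanity check across sign regimes.
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(leaky_relu_sq(x) - leaky_relu_sq_branchfree(x)) < 1e-12
```

Since the two forms are pointwise identical, any quality difference between runs must come from noise or numerics, not from the activation itself.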
ENV
Same as the base run command, with optional prefetch toggles (defaults match the optimized script):
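A sketch of how such env toggles are typically read; the defaults below match the ones listed above, though the exact parsing in the script may differ:

```python
import os

def read_prefetch_toggles(env=os.environ):
    # Defaults mirror the optimized script: prefetch on, queue depth 2,
    # dedicated H2D copy stream on.
    return {
        "TRAIN_PREFETCH": int(env.get("TRAIN_PREFETCH", "1")),
        "TRAIN_PREFETCH_QUEUE": int(env.get("TRAIN_PREFETCH_QUEUE", "2")),
        "TRAIN_COPY_STREAM": int(env.get("TRAIN_COPY_STREAM", "1")),
    }

defaults = read_prefetch_toggles(env={})  # unset env -> documented defaults
```

Setting `TRAIN_PREFETCH=0` should recover the base loader behavior, which makes A/B throughput comparisons straightforward.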
Credits