Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)#688
RoyiRa wants to merge 4 commits into openai:main
Conversation
Key differences from our failed SGD TTT:

- AdamW(lr=1e-4) instead of SGD(lr=0.002): 20x lower LR
- Only the last 2 blocks unfrozen (9 frozen), protecting the VRL gates
- Polyak weight averaging (decay=0.998) for scoring stability
- Cosine LR decay across chunks
- Score with Polyak weights, train with live weights

PR openai#688 gets -0.05 bpb with this recipe. Our SGD TTT was diverging because we used a 20x higher LR and unfroze all blocks, including the VRL gates.

TTT_ENABLED=1 activates it. Defaults: AdamW lr=1e-4, 3 epochs, 9 frozen blocks, Polyak decay=0.998.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
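For orientation, the "score with Polyak weights, train with live weights" half of the recipe can be sketched in plain Python. All names and the toy update loop below are illustrative assumptions, not this PR's actual code:

```python
# Minimal sketch of Polyak (EMA) weight averaging for scoring stability.
# Names are illustrative, not the PR's code.

def polyak_update(avg_params, live_params, decay=0.998):
    """Blend live weights into the running average:
    avg <- decay * avg + (1 - decay) * live, per parameter."""
    return {name: decay * avg_params[name] + (1.0 - decay) * live_params[name]
            for name in avg_params}

# Toy loop: pretend an optimizer step nudges the live weights each chunk,
# while the Polyak copy trails smoothly; scoring would use `polyak`.
live = {"w": 1.0}
polyak = dict(live)
for chunk in range(3):
    live["w"] += 0.1                 # stand-in for an AdamW step
    polyak = polyak_update(polyak, live)
```

The averaged copy lags the live weights, which is why it gives more stable scores across chunks while training proceeds on the live copy.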
Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not renormalize / reweight the LM's token distribution correctly and look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the Issues tab for more details, and please submit more runs in the future!
@valerio-oai |
Thank you @Eppie for the correction, I should have been more careful! |
Summary
3-seed mean val_bpb: 1.0745 (std 0.021) | <15.5 MB | 8xH100 SXM, 600s
Results
Key Technique: 5-expert Logistic Context Mixer
GPU-vectorized online context mixing using the Hedge algorithm. Five experts blend predictions in log-probability space during TTT eval:
N-gram tables are built incrementally from already-scored tokens only (legal). Each expert produces an NLL for every token, and the mixer maintains one learned weight per expert, updated online via the Hedge rule:

```
log_w -= eta * loss
```

At each position, the mixed prediction is:
```
mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
```

Training Budget
GPTQ calibration runs within the 600s training budget (18s reserved).
Reproduction
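The Hedge mixer described under Key Technique can be sketched as follows. This is a minimal single-stream illustration with assumed names, not the PR's GPU-vectorized code:

```python
import math

def hedge_mix(expert_nlls, eta=0.1):
    """expert_nlls: one list of per-token NLLs per expert.
    Returns the mixed per-token NLLs. Illustrative, not the PR's code."""
    k = len(expert_nlls)
    log_w = [0.0] * k                                  # uniform start
    mixed = []
    for t in range(len(expert_nlls[0])):
        # Normalize log-weights into probabilities (shift for stability).
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        w = [x / z for x in w]
        nlls = [expert_nlls[i][t] for i in range(k)]
        # mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
        mixed.append(-math.log(sum(wi * math.exp(-ni)
                                   for wi, ni in zip(w, nlls))))
        # Hedge update: log_w -= eta * loss
        log_w = [lw - eta * ni for lw, ni in zip(log_w, nlls)]
    return mixed

# Two toy experts: expert 0 is consistently better, so weight shifts to it
# and the mixed NLL improves over positions.
mixed = hedge_mix([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]], eta=0.5)
```

Because losses here are the already-observed per-token NLLs, the weight update at position t uses only tokens scored before t+1, matching the "already-scored tokens only" constraint above.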
Credits