Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)#609
saml212 wants to merge 4 commits into openai:main
Conversation
Such a nice and low std!
Same as #569 -- Are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (I would have to look more into it), but it does not meet the minimum nat-difference, so it is a non-SOTA valid submission. If it's part of eval time, then this is not valid, as it is accessing training data at eval time.
Track: 10min_16mb
Based on: PR openai#549 (LeakyReLU+ParallelMuon), PR openai#606 (Soft-Round+AdamW TTT), PR openai#609 (XSA-all+Full GPTQ)

Changes from SOTA (openai#549):
- XSA on all 11 layers (was 4)
- Soft-Round QAT with tanh-based differentiable rounding (alpha 1→16)
- Full GPTQ with Hessian-aware, column-reordered Cholesky error compensation
- MHA 8/8 (was GQA 8/4)
- MLP 3.5x expansion (1792 hidden, was 3.0x/1536)
- BigramHash vocabulary 8192 (was 2048)
- AdamW TTT with grouped LR and cosine schedule (was SGD)
- Early QAT threshold 0.5 (was late 0.15)
- Selective ±1 magnitude pruning to hit the size target
@valerio-oai (Sorry for the long response, but I'm just trying to lay out my thoughts.)

The GPTQ calibration accesses training data after the 600s training wallclock but before the artifact is saved or any val data is touched. It's part of artifact creation — 256 forward-only passes to compute per-layer Hessians for better quantization. No gradients, no weight updates. The rules say "you aren't allowed to access any training data during evaluation." The calibration doesn't happen during evaluation — it happens during the post-training quantization that produces the artifact. The eval phase begins when the artifact is loaded and val data is scored, and at that point no training data is accessed.

If the calibration needs to fit inside the 600s training budget, I can rerun with MAX_WALLCLOCK_SECONDS=560 to accommodate. The ~40s calibration overhead would cost ~450 steps and roughly +0.002 BPB — still a competitive result.

I do want to flag something respectfully. The new merged leader (#549 at 1.1194) runs SGD — actual gradient descent with weight updates — on validation data during the eval phase. That's training on the test set during evaluation. My submission accesses training data (read-only, no weight updates) before eval even begins. I'm not sure why the former is accepted while the latter is questioned. If the TTT ruling stands, then my 1.1154 improves on 1.1194 by only 0.004 nats, short of the 0.005 threshold, and I'm happy to reclassify as a non-record submission, since some useful ideas here have already been adopted by the group. But I wanted to raise the inconsistency.

A side note: I also have three earlier PRs (#114, #236, #332 — all with clean diffs now, 3-seed validated, no TTT, no post-training calibration) that were submitted when they were SOTA but never made it to the leaderboard. I would appreciate it if you considered putting those up as well, because I do believe they were real achievements at the time and contributed to the architectures people are still using today.
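For context on what the calibration step computes, here is a minimal sketch of forward-only per-layer Hessian accumulation in the GPTQ style — second-order input statistics only, no gradients and no weight updates. Shapes, names, and the exact scaling are illustrative assumptions, not the PR's code.

```python
import numpy as np

def accumulate_layer_hessian(layer_inputs, d):
    """Accumulate the per-layer GPTQ Hessian H ∝ sum_i x_i x_i^T over
    calibration activations entering a linear layer. Forward-only
    statistics: nothing here touches weights or gradients."""
    H = np.zeros((d, d))
    n = 0
    for X in layer_inputs:      # X: (batch, d) activations for one calibration pass
        H += X.T @ X            # rank-`batch` update of the input covariance
        n += X.shape[0]
    return 2.0 * H / n          # GPTQ-style scaling (2 * E[x x^T])

# Toy calibration run: 4 batches of 64 activation vectors, d = 8
rng = np.random.default_rng(0)
batches = [rng.standard_normal((64, 8)) for _ in range(4)]
H = accumulate_layer_hessian(batches, d=8)
```

The resulting `H` is symmetric positive semidefinite; GPTQ then uses its (column-reordered) Cholesky factor to compensate quantization error column by column.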
Hi @saml212 , no worries about the long answer. I'll talk to the other organizers, but I would not regard "artifact creation" as untimed (otherwise you'd be able to sneak in arbitrary compute there), so everything needs to fit into either training or eval, and this has to fall under training to be legal.

As for #549's SGD: it's doing test-time training, which is allowed (within reason, and as we've seen across the past day, it can be very easy to unintentionally leak eval tokens there). My understanding from the code is that for every doc D, it first runs the model on D, takes whatever loss D gives and counts that as the loss over that document, then updates the weights of the model and uses these updated weights over the rest of the docs. Things of this nature are allowed: all tokens are scored first, so the model is still not scoring any tokens it has trained on, and we are not training on the eval set. Methods like these are why participants are allowed 10min of eval time; we want to see ingenuity at inference-time, too, so this is not inconsistent.

With regards to your other PRs: we'll do a final pass of the leaderboard by the time the challenge closes, so if what you claim is true, we'll pick it up then -- personally, I haven't reviewed those PRs.
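The score-first-then-update ordering described above (each document is scored before the model ever trains on it) can be sketched as follows. `ToyLM` and every name here are hypothetical stand-ins, not #549's code.

```python
import numpy as np

class ToyLM:
    """Tiny linear stand-in for the real model (illustrative only)."""
    def __init__(self, d=4, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal(d)
        self.lr = lr

    def loss_and_grad(self, x, y):
        err = x @ self.w - y
        loss = float(err @ err) / len(y)
        grad = 2.0 * x.T @ err / len(y)
        return loss, grad

def eval_with_ttt(model, docs):
    """Legal test-time-training order: score each doc with the current
    weights FIRST, count that loss, and only then update — so no token
    is ever scored by weights that already trained on it."""
    losses = []
    for x, y in docs:
        loss, grad = model.loss_and_grad(x, y)  # 1) score doc D as-is
        losses.append(loss)                     # 2) record its loss
        model.w -= model.lr * grad              # 3) update for later docs
    return sum(losses) / len(losses)

rng = np.random.default_rng(1)
docs = [(rng.standard_normal((8, 4)), rng.standard_normal(8)) for _ in range(5)]
mean_loss = eval_with_ttt(ToyLM(), docs)
```

Swapping steps 2 and 3 would score a document with weights that had already seen it — that is the illegal variant.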
@valerio-oai totally makes sense, appreciate the clarification and your time! I'll update this to a non-record submission and fit the calibration into the training budget for future submissions. |
Fork of PR openai#609 with int5 MLP quantization to fund BigramHash(8192).
Maps every top entry through BPB = L + Q + T + M:
- openai#700 solved M (mixer) but has the worst L (training)
- openai#609 solved Q (quant) but has zero T and M (no eval pipeline)
- openai#549 solved L (training) but has zero M (no mixer)
- Nobody has optimized all four terms simultaneously
- Theoretical optimal = 1.052 (combine best of each)
- Our Track B path to 1.025 via recurrence + FiLM-only TTT + Mixer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
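The additive decomposition above can be made concrete. This toy illustration uses purely hypothetical per-term values (the commit does not publish individual L/Q/T/M numbers); only the additive model BPB = L + Q + T + M and the "best of each term" construction come from the text.

```python
# Hypothetical per-entry term values -- made up for illustration only.
entries = {
    "A": {"L": 1.00, "Q": 0.06, "T": 0.03, "M": 0.02},
    "B": {"L": 1.05, "Q": 0.01, "T": 0.04, "M": 0.03},
}

# Total BPB of each entry under the additive model.
bpb = {name: sum(terms.values()) for name, terms in entries.items()}

# "Theoretical optimal": take the best (lowest) value of each term
# across all entries, as the commit does to arrive at its 1.052 figure.
best = sum(min(e[k] for e in entries.values()) for k in ("L", "Q", "T", "M"))
```

By construction the per-term optimum can never be worse than the best single entry, which is why combining the strengths of separate PRs bounds the achievable BPB from below.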
…eframe

Corrections:
- T+M are combined (-0.020), not separate; PR openai#700 gets -0.073 (3.6x better)
- Our Q gap (0.066) is larger than the openai#549–openai#700 total gap — Q is THE bottleneck
- Added a "Best Known" column comparing against the best per-term, not just merged SOTA

New insights added:
- Kaplan width scaling, hidden ≥ 512 threshold, Goldilocks depth
- MoE viability at small scale (inactive experts compress well)
- Vocab expansion opportunity (mechanical BPB reduction)
- Compression reframe: a BPB competition is a compression competition, with 20 years of literature behind it
- Strategic evolution: feature bloat → simplify → Q bottleneck → compression-first approach
- Theoretical optimal 1.052 = combine best of openai#549 + openai#609 + openai#700 (nobody has done this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
- Based on PR openai#414 SOTA (1.1228 BPB)
- LeakyReLU(0.5)^2 activation (proven frontier technique)
- XSA on all 11 layers (PR openai#609 approach)
- Full Hessian-aware GPTQ quantization with column reordering
- Binary-search selective pruning to fit under 16MB
- Auto-detects GPU: uses FlashAttention 3 + torch.compile on H100, falls back to PyTorch SDP + no compile on T4/P100/L4
- 1500 lines exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
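The binary-search selective pruning mentioned above can be sketched as follows: search for the magnitude threshold at which the compressed artifact just fits the size budget. The size model (6 bits per surviving weight, pruned weights compress away) and all names are illustrative assumptions, not the actual artifact format.

```python
import numpy as np

def artifact_size_mb(weights, thresh):
    """Hypothetical size model: weights pruned below `thresh` compress to
    ~nothing; each survivor costs ~6 bits (int6 + entropy coding)."""
    kept = sum(int(np.count_nonzero(np.abs(w) > thresh)) for w in weights)
    return kept * 6 / 8 / 1e6

def find_prune_threshold(weights, budget_mb=16.0, iters=40):
    """Binary-search the magnitude threshold so the artifact fits the
    budget. Size is monotone non-increasing in the threshold, so the
    invariant size(hi) <= budget holds throughout."""
    lo, hi = 0.0, max(float(np.abs(w).max()) for w in weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if artifact_size_mb(weights, mid) > budget_mb:
            lo = mid   # still too big: prune more aggressively
        else:
            hi = mid   # fits: try keeping more weights
    return hi

# Toy run: 4 tensors of 50k weights against a deliberately tight budget.
rng = np.random.default_rng(2)
weights = [rng.standard_normal(50_000) for _ in range(4)]
thresh = find_prune_threshold(weights, budget_mb=0.1)
```

In the real pipeline the "size" callback would be the actual lzma-compressed artifact length, but the search structure is the same.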
Non-record: 11L XSA-all + Full GPTQ + Selective Pruning + Parallel Muon
val_bpb: 1.1154 (3-seed mean, sd 0.0005) | 15.94 MB | 8xH100 SXM, 600s
Two techniques on top of PR #593's Parallel Muon stack.
Key additions over PR #593
Everything else from PR #593 carries forward: 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3x, BigramHash(2048), Partial RoPE 16/64, LN Scale, VE128, SmearGate, U-Net skips, EMA(0.997), Tight SWA, Full Hessian GPTQ int6 + lzma, Parameter Banking + Parallel Muon.
Results (3 seeds, 8xH100 SXM)
Mean: 1.1154 | Sd: 0.0005
Requirements
Flash Attention 3 (Hopper kernel) is required. The script imports `flash_attn_interface` directly and will fail without it. FA2 is not sufficient — it runs at ~100ms/step vs ~87ms, losing ~1,000 training steps and ~0.004 BPB.

```
pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
python3 -c "from flash_attn_interface import flash_attn_func; print('FA3 OK')"
```

Also requires: `zstandard`, `sentencepiece`

Run command
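Since the script hard-requires FA3, the availability check can be done up front. A minimal sketch of the detect-and-fallback pattern the PR describes; the `flash_attn_interface` import matches the check above, but the fallback wiring is an assumption, not the script's actual code.

```python
# Probe for the FA3 Hopper kernel; fall back to PyTorch SDPA if absent.
# (The real script fails hard without FA3 -- this fallback is illustrative.)
try:
    from flash_attn_interface import flash_attn_func  # FA3 kernel
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v):
    """Causal attention via FA3 when available, else PyTorch SDPA
    (slower: ~100ms/step vs ~87ms per the requirements note)."""
    if HAS_FA3:
        return flash_attn_func(q, k, v, causal=True)
    import torch.nn.functional as F
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```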
Negative results
Techniques tested on this stack that did not help:
Credits