
Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)#609

Open
saml212 wants to merge 4 commits into openai:main from saml212:sam/xsa-all-gptq-pruning

Conversation

@saml212 (Contributor) commented Mar 24, 2026

Non-record: 11L XSA-all + Full GPTQ + Selective Pruning + Parallel Muon

val_bpb: 1.1154 (3-seed mean, sd 0.0005) | 15.94 MB | 8xH100 SXM, 600s

Two techniques on top of PR #593's Parallel Muon stack.

Key additions over PR #593

| Change | Impact |
| --- | --- |
| XSA on all 11 layers | Standard practice is XSA on the last 4 layers. Applying it to all layers forces cross-position information mixing from layer 0. -0.0016 BPB vs XSA-last-4 in ablation. Zero new parameters. |
| Selective ±1 magnitude pruning | Post-GPTQ, sort ±1 quantized values by reconstruction error (scale²) and zero the least-impactful first until the artifact fits. Targets only values whose removal causes minimal reconstruction damage. |
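The pruning rule in the second row can be sketched as follows (illustrative NumPy, not this PR's actual code; `budget` and the flat per-entry `scales` array are simplifying assumptions). Zeroing an entry quantized to ±1 changes the dequantized value by ±scale, i.e. squared reconstruction error scale², so the cheapest entries are pruned first:

```python
import numpy as np

def selective_prune(q, scales, budget):
    """Zero the ±1 quantized entries with the smallest reconstruction
    error (scale**2) until at most `budget` nonzero ±1 entries remain.
    q: int array of quantized values; scales: per-entry dequant scales."""
    q = q.copy()
    # Candidates: entries quantized to exactly +1 or -1.
    cand = np.flatnonzero(np.abs(q) == 1)
    # Zeroing entry i perturbs the reconstruction by scales[i]*q[i],
    # i.e. squared error scales[i]**2 -- sort ascending, prune cheapest first.
    order = cand[np.argsort(scales[cand] ** 2)]
    n_prune = max(0, len(cand) - budget)
    q[order[:n_prune]] = 0
    return q

q = np.array([1, -1, 3, 1, -1, 0])
s = np.array([0.5, 0.1, 0.2, 0.05, 0.3, 0.4])
pruned = selective_prune(q, s, budget=2)
# The two smallest-scale ±1 entries (indices 3 and 1) are zeroed.
```

In the real artifact pipeline the budget would be driven by the compressed size target rather than a fixed count; this only shows the error-ordering idea.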

Everything else from PR #593 carries forward: 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3x, BigramHash(2048), Partial RoPE 16/64, LN Scale, VE128, SmearGate, U-Net skips, EMA(0.997), Tight SWA, Full Hessian GPTQ int6 + lzma, Parameter Banking + Parallel Muon.

Results (3 seeds, 8xH100 SXM)

| Seed | Steps | ms/step | Sliding BPB (s64) | Artifact |
| --- | --- | --- | --- | --- |
| 7 | 6,938 | 86.7 | 1.1153 | 15,937,739 bytes |
| 314 | ~6,930 | 86.7 | 1.1150 | 15,933,191 bytes |
| 2024 | ~6,930 | 86.7 | 1.1159 | 15,928,475 bytes |

Mean: 1.1154 | Sd: 0.0005

Requirements

Flash Attention 3 (Hopper kernel) is required. The script imports flash_attn_interface directly and will fail without it. FA2 is not sufficient — it produces ~100ms/step vs ~87ms, losing ~1,000 training steps and ~0.004 BPB.

```bash
pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
python3 -c "from flash_attn_interface import flash_attn_func; print('FA3 OK')"
```

Also requires: zstandard, sentencepiece

Run command

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
SEED=7 TARGET_MB=15.9 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Negative results

Techniques tested on this stack that did not help:

| Technique | BPB | Delta | Why |
| --- | --- | --- | --- |
| Value Residual Learning (linear) | 1.1298 | +0.0012 | Conflicts with VE128 — both inject identity info into deep layers |
| VRL sigmoid gates + TrigramHash | 1.1174 | +0.0020 | Combined overhead costs ~100 steps, net negative |
| Catalytic Residuals | 1.1285 | -0.0001 | Redundant with existing attn_scale/mlp_scale/resid_mix |
| Backout Connection | 1.1291 | +0.0005 | Redundant with U-Net skip connections |
| Gated Attention + XSA-all | 1.1279 | +0.0011 vs XSA-all | 3% step overhead outweighs quality gain |
| Hadamard rotation + GPTQ | 1.1266 | -0.0002 | +0.5MB artifact size, hurts zstd compressibility |
| TrigramHash (zero params) | 1.1237 | +0.0049 | Changes weight distribution, hurts compression |
| BigramHash(8192) | 1.1200 | -0.0068 | Artifact 0.52MB over 16MB budget |
| BigramHash(4096) | 1.1285 | +0.0097 | Artifact 0.52MB over budget, cold cache |
| Stride=32 eval | n/a | +0.0001 | Negligible at seq2048. Stride=64 already gives 1984 context |
| Temperature scaling (T≠1.0) | n/a | +0.0002 to +0.003 | Model already well-calibrated; T=1.0 optimal |
| Extended context eval (seq4096) | 1.5695 | catastrophic | RoPE breaks completely beyond training length |
| Checkpoint logit ensemble | n/a | infeasible | EMA-raw delta is 16.2MB compressed (int8+zstd) |
| Entropy coding (ANS/Huffman) | n/a | +0.05MB max | lzma already at 99.7% of Shannon entropy limit |
| Magnitude pruning (all ±1) | 1.1341 | +0.015 | Too aggressive — no smooth continuum between threshold=0 and threshold=1 |
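The stride note in the table (stride=64 already giving 1984 tokens of context at seq2048) follows from how strided sliding-window eval works: each window of length `seq` scores only its last `stride` tokens, so every scored token sees at least `seq - stride` tokens of context. A simplified sketch (illustrative, assumed setup; the first partial window is omitted for brevity):

```python
def scored_spans(n_tokens, seq, stride):
    """Return (start, end) index spans of tokens scored in each window
    of a strided sliding-window evaluation."""
    spans = []
    start = 0
    while start + seq <= n_tokens:
        lo = start + seq - stride  # only the last `stride` positions are scored
        spans.append((lo, start + seq))
        start += stride
    return spans

spans = scored_spans(n_tokens=4096, seq=2048, stride=64)
min_context = 2048 - 64  # = 1984, matching the table's note
```

Halving the stride doubles eval compute for at most 32 extra tokens of context per scored position, which is why stride=32 only moved BPB by +0.0001 here.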

Credits

@abaybektursun (Contributor)

Such a nice and low std!

@valerio-oai (Contributor)

Same as #569 -- are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (I'd have to look into it more), but it does not meet the minimum nat-difference, so it would be a non-SOTA valid submission. If it's part of eval time, then this is not valid, as it is accessing training data at eval time.

senstar-hsoleimani added a commit to senstar-hsoleimani/parameter-golf that referenced this pull request Mar 24, 2026
Track: 10min_16mb
Based on: PR openai#549 (LeakyReLU+ParallelMuon), PR openai#606 (Soft-Round+AdamW TTT), PR openai#609 (XSA-all+Full GPTQ)

Changes from SOTA (openai#549):
- XSA on all 11 layers (was 4)
- Soft-Round QAT with tanh-based differentiable rounding (alpha 1->16)
- Full GPTQ with Hessian-aware column-reordered Cholesky error compensation
- MHA 8/8 (was GQA 8/4)
- MLP 3.5x expansion (1792 hidden, was 3.0x/1536)
- BigramHash vocabulary 8192 (was 2048)
- AdamW TTT with grouped LR and cosine schedule (was SGD)
- Early QAT threshold 0.5 (was late 0.15)
- Selective ±1 magnitude pruning to hit size target
@saml212 (Contributor, Author) commented Mar 24, 2026

@valerio-oai (Sorry for the long response but just trying to lay out my thoughts)

The GPTQ calibration accesses training data after the 600s training wallclock but before the artifact is saved or any val data is touched. It's part of artifact creation — 256 forward-only passes to compute per-layer Hessians for better quantization. No gradients, no weight updates.
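For concreteness, a forward-only Hessian calibration of this kind can be sketched as below (illustrative NumPy with assumed shapes and names, not this PR's actual code): each calibration batch contributes its layer inputs' second moments, and no weights are touched.

```python
import numpy as np

def accumulate_hessian(layer_inputs):
    """Accumulate the GPTQ layer Hessian H = E[x x^T] over the layer's
    calibration inputs. layer_inputs: iterable of (tokens, d_in) arrays.
    Forward-only: no gradients, no weight updates."""
    H = None
    n = 0
    for X in layer_inputs:
        if H is None:
            H = np.zeros((X.shape[1], X.shape[1]))
        H += X.T @ X          # second moments of this batch's inputs
        n += X.shape[0]
    return H / n              # average over calibration tokens

# Stand-in for 256 calibration passes' worth of one layer's inputs.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((128, 16)) for _ in range(256)]
H = accumulate_hessian(batches)
# H is symmetric positive semi-definite, as GPTQ's Cholesky step requires.
```

The resulting H is what the Hessian-aware quantizer uses to weight per-column reconstruction error; only activations are read, never labels or gradients.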

The rules say "you aren't allowed to access any training data during evaluation." The calibration doesn't happen during evaluation — it happens during the post-training quantization that produces the artifact. The eval phase begins when the artifact is loaded and val data is scored, and at that point no training data is accessed.

If the calibration needs to fit inside the 600s training budget, I can rerun with MAX_WALLCLOCK_SECONDS=560 to accommodate. The ~40s calibration overhead would cost ~450 steps and roughly +0.002 BPB — still a competitive result.

I do want to flag something respectfully. The new merged leader (#549 at 1.1194) runs SGD — actual gradient descent with weight updates — on validation data during the eval phase. That's training on the test set during evaluation. My submission accesses training data (read-only, no weight updates) before eval even begins. I'm not sure why the former is accepted and the latter is questioned.

If the TTT ruling stands, then my 1.1154 improves on the 1.1194 record by only 0.004 nats, short of the 0.005 threshold, and I'm happy to reclassify this as a non-record submission, since some of the ideas here have already been adopted by the group. But I wanted to raise the inconsistency.

A side note: I also have three earlier PRs (#114, #236, #332 — all with clean diffs now, 3-seed validated, no TTT, no post-training calibration) that were SOTA when submitted but never made it to the leaderboard. I would appreciate it if you considered putting those up as well, because I believe they were genuine achievements at the time and contributed to the architectures people are still using today.

@valerio-oai (Contributor)

Hi @saml212 , no worries about the long answer. I'll talk to the other organizers, but I would not regard "artifact creation" as untimed (otherwise you'd be able to sneak in arbitrary compute there), so everything needs to fit into either training or eval, and this has to fall under training to be legal.

As for #549's SGD: it's doing test-time training, which is allowed (within reason, and as we've seen across the past day, it can be very easy to unintentionally leak eval tokens there): my understanding from the code is that for every doc D, it first runs the model on D, takes whatever loss D gives and counts that as the loss over that document, then updates the weights of the model, and uses these updated weights over the rest of the docs. Things of this nature are allowed: all tokens are scored first, so the model is still not scoring any tokens it has trained on, and we are not training on the eval set. Methods like these are why participants are allowed 10min of eval time, we want to see ingenuity at inference-time, too, so this is not inconsistent.
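A toy sketch of that score-then-update protocol (hypothetical `ToyModel` and `sgd_step`, not #549's actual code): each document is scored with the weights as they stand before that document contributes any update, so no token is ever scored by a model that has trained on it.

```python
def ttt_eval(model, docs, update):
    """Test-time training eval loop: score each doc first with the
    current weights, then adapt on the just-scored doc."""
    losses = []
    for doc in docs:
        losses.append(model.loss(doc))  # score BEFORE any update from this doc
        update(model, doc)              # then train on it for later docs
    return sum(losses) / len(losses)

# Hypothetical one-parameter model: loss is squared distance to the doc.
class ToyModel:
    def __init__(self):
        self.bias = 0.0
    def loss(self, doc):
        return (doc - self.bias) ** 2

def sgd_step(model, doc, lr=0.5):
    model.bias += lr * (doc - model.bias)  # move toward the scored doc

m = ToyModel()
avg = ttt_eval(m, [1.0, 1.0, 1.0], sgd_step)
# Losses shrink across repeated similar docs (1.0, 0.25, 0.0625)
# even though each doc was scored before the model saw it.
```

The illegal variant would call `update` before `model.loss` on the same document; the ordering inside the loop is the entire legality argument.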

With regards to your other PRs: we'll do a final pass of the leaderboard by the time the challenge closes, so if what you claim is true, we'll pick it up then -- personally, I haven't reviewed those PRs.

@saml212 (Contributor, Author) commented Mar 24, 2026

@valerio-oai totally makes sense, appreciate the clarification and your time! I'll update this to a non-record submission and fit the calibration into the training budget for future submissions.

@saml212 changed the title from "Record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)" to "Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)" on Mar 24, 2026
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 24, 2026
Fork of PR openai#609 with int5 MLP quantization to fund BigramHash(8192).
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 27, 2026
Maps every top entry through BPB = L + Q + T + M:
- openai#700 solved M (mixer) but has worst L (training)
- openai#609 solved Q (quant) but has zero T and M (no eval pipeline)
- openai#549 solved L (training) but has zero M (no mixer)
- Nobody has optimized all four terms simultaneously
- Theoretical optimal = 1.052 (combine best of each)
- Our Track B path to 1.025 via recurrence + FiLM-only TTT + Mixer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 27, 2026
…eframe

Corrections:
- T+M are combined (-0.020), not separate. PR openai#700 gets -0.073 (3.6x better)
- Our Q gap (0.066) is larger than the openai#549-openai#700 total gap — Q is THE bottleneck
- Added "Best Known" column comparing against best per-term, not just merged SOTA

New insights added:
- Kaplan width scaling, hidden ≥ 512 threshold, Goldilocks depth
- MoE viability at small scale (inactive experts compress well)
- Vocab expansion opportunity (mechanical BPB reduction)
- Compression reframe: BPB competition = compression competition, 20 years of literature
- Strategic evolution: feature bloat → simplify → Q bottleneck → compression-first approach
- Theoretical optimal 1.052 = combine best of openai#549 + openai#609 + openai#700 (nobody has done this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
EconLearn pushed a commit to EconLearn/parameter-golf that referenced this pull request Mar 27, 2026
- Based on PR openai#414 SOTA (1.1228 BPB)
- LeakyReLU(0.5)^2 activation (proven frontier technique)
- XSA on all 11 layers (PR openai#609 approach)
- Full Hessian-aware GPTQ quantization with column reordering
- Binary-search selective pruning to fit under 16MB
- Auto-detects GPU: uses FlashAttention 3 + torch.compile on H100,
  falls back to PyTorch SDP + no compile on T4/P100/L4
- 1500 lines exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>