Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)#609
saml212 wants to merge 4 commits into openai:main
Conversation
Such a nice and low std!
Same as #569 -- Are you counting the calibration time as part of the training 600s, or the eval 600s? If it's part of training (and you can prove it), then I am more inclined to believe this is legal (I would have to look more into it), but it does not meet the minimum nat-difference, so it is a non-SOTA valid submission. If it's part of eval time, then this is not valid, as it is accessing training data at eval time.
Track: 10min_16mb
Based on: PR openai#549 (LeakyReLU+ParallelMuon), PR openai#606 (Soft-Round+AdamW TTT), PR openai#609 (XSA-all+Full GPTQ)

Changes from SOTA (openai#549):
- XSA on all 11 layers (was 4)
- Soft-Round QAT with tanh-based differentiable rounding (alpha 1→16)
- Full GPTQ with Hessian-aware, column-reordered Cholesky error compensation
- MHA 8/8 (was GQA 8/4)
- MLP 3.5x expansion (1792 hidden, was 3.0x/1536)
- BigramHash vocabulary 8192 (was 2048)
- AdamW TTT with grouped LR and cosine schedule (was SGD)
- Early QAT threshold 0.5 (was late 0.15)
- Selective ±1 magnitude pruning to hit the size target
@valerio-oai (Sorry for the long response, but I'm just trying to lay out my thoughts.)

The GPTQ calibration accesses training data after the 600s training wallclock but before the artifact is saved or any val data is touched. It's part of artifact creation — 256 forward-only passes to compute per-layer Hessians for better quantization. No gradients, no weight updates. The rules say "you aren't allowed to access any training data during evaluation." The calibration doesn't happen during evaluation — it happens during the post-training quantization that produces the artifact. The eval phase begins when the artifact is loaded and val data is scored, and at that point no training data is accessed.

If the calibration needs to fit inside the 600s training budget, I can rerun with MAX_WALLCLOCK_SECONDS=560 to accommodate. The ~40s calibration overhead would cost ~450 steps and roughly +0.002 BPB — still a competitive result.

I do want to flag something respectfully. The new merged leader (#549 at 1.1194) runs SGD — actual gradient descent with weight updates — on validation data during the eval phase. That's training on the test set during evaluation. My submission accesses training data (read-only, no weight updates) before eval even begins. I'm not sure why the former is accepted while the latter is questioned. If the TTT ruling stands, then my 1.1154 improves on 1.1194 by only 0.004 nats, short of the 0.005 threshold, and I'm happy to reclassify as a non-record submission, since some useful ideas here have already been adopted by the group. But I wanted to raise the inconsistency.

A side note: I also have three earlier PRs (#114, #236, #332 — all with clean diffs now, 3-seed validated, no TTT, no post-training calibration) that were submitted when they were SOTA but never made it to the leaderboard. I would appreciate it if you considered putting those up as well, because I do believe they were real achievements at the time and contributed to the architectures people are still using today.
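For context on what the calibration step computes, here is a minimal sketch of forward-only per-layer Hessian accumulation in the GPTQ style — second-order input statistics only, no gradients and no weight updates. Shapes, names, and the exact scaling are illustrative assumptions, not the PR's code.

```python
import numpy as np

def accumulate_layer_hessian(layer_inputs, d):
    """Accumulate the per-layer GPTQ Hessian H ∝ sum_i x_i x_i^T over
    calibration activations entering a linear layer. Forward-only
    statistics: nothing here touches weights or gradients."""
    H = np.zeros((d, d))
    n = 0
    for X in layer_inputs:      # X: (batch, d) activations for one calibration pass
        H += X.T @ X            # rank-`batch` update of the input covariance
        n += X.shape[0]
    return 2.0 * H / n          # GPTQ-style scaling (2 * E[x x^T])

# Toy calibration run: 4 batches of 64 activation vectors, d = 8
rng = np.random.default_rng(0)
batches = [rng.standard_normal((64, 8)) for _ in range(4)]
H = accumulate_layer_hessian(batches, d=8)
```

The resulting `H` is symmetric positive semidefinite; GPTQ then uses its (column-reordered) Cholesky factor to compensate quantization error column by column.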
Hi @saml212 , no worries about the long answer. I'll talk to the other organizers, but I would not regard "artifact creation" as untimed (otherwise you'd be able to sneak in arbitrary compute there), so everything needs to fit into either training or eval, and this has to fall under training to be legal.

As for #549's SGD: it's doing test-time training, which is allowed (within reason, and as we've seen across the past day, it can be very easy to unintentionally leak eval tokens there). My understanding from the code is that for every doc D, it first runs the model on D, takes whatever loss D gives and counts that as the loss over that document, then updates the weights of the model and uses these updated weights over the rest of the docs. Things of this nature are allowed: all tokens are scored first, so the model is still not scoring any tokens it has trained on, and we are not training on the eval set. Methods like these are why participants are allowed 10min of eval time; we want to see ingenuity at inference-time, too, so this is not inconsistent.

With regards to your other PRs: we'll do a final pass of the leaderboard by the time the challenge closes, so if what you claim is true, we'll pick it up then -- personally, I haven't reviewed those PRs.
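The score-first-then-update ordering described above (each document is scored before the model ever trains on it) can be sketched as follows. `ToyLM` and every name here are hypothetical stand-ins, not #549's code.

```python
import numpy as np

class ToyLM:
    """Tiny linear stand-in for the real model (illustrative only)."""
    def __init__(self, d=4, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal(d)
        self.lr = lr

    def loss_and_grad(self, x, y):
        err = x @ self.w - y
        loss = float(err @ err) / len(y)
        grad = 2.0 * x.T @ err / len(y)
        return loss, grad

def eval_with_ttt(model, docs):
    """Legal test-time-training order: score each doc with the current
    weights FIRST, count that loss, and only then update — so no token
    is ever scored by weights that already trained on it."""
    losses = []
    for x, y in docs:
        loss, grad = model.loss_and_grad(x, y)  # 1) score doc D as-is
        losses.append(loss)                     # 2) record its loss
        model.w -= model.lr * grad              # 3) update for later docs
    return sum(losses) / len(losses)

rng = np.random.default_rng(1)
docs = [(rng.standard_normal((8, 4)), rng.standard_normal(8)) for _ in range(5)]
mean_loss = eval_with_ttt(ToyLM(), docs)
```

Swapping steps 2 and 3 would score a document with weights that had already seen it — that is the illegal variant.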
@valerio-oai totally makes sense, appreciate the clarification and your time! I'll update this to a non-record submission and fit the calibration into the training budget for future submissions. |
Fork of PR openai#609 with int5 MLP quantization to fund BigramHash(8192).
Maps every top entry through BPB = L + Q + T + M:
- openai#700 solved M (mixer) but has the worst L (training)
- openai#609 solved Q (quant) but has zero T and M (no eval pipeline)
- openai#549 solved L (training) but has zero M (no mixer)
- Nobody has optimized all four terms simultaneously
- Theoretical optimal = 1.052 (combine best of each)
- Our Track B path to 1.025 via recurrence + FiLM-only TTT + Mixer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
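The additive decomposition above can be made concrete. This toy illustration uses purely hypothetical per-term values (the commit does not publish individual L/Q/T/M numbers); only the additive model BPB = L + Q + T + M and the "best of each term" construction come from the text.

```python
# Hypothetical per-entry term values -- made up for illustration only.
entries = {
    "A": {"L": 1.00, "Q": 0.06, "T": 0.03, "M": 0.02},
    "B": {"L": 1.05, "Q": 0.01, "T": 0.04, "M": 0.03},
}

# Total BPB of each entry under the additive model.
bpb = {name: sum(terms.values()) for name, terms in entries.items()}

# "Theoretical optimal": take the best (lowest) value of each term
# across all entries, as the commit does to arrive at its 1.052 figure.
best = sum(min(e[k] for e in entries.values()) for k in ("L", "Q", "T", "M"))
```

By construction the per-term optimum can never be worse than the best single entry, which is why combining the strengths of separate PRs bounds the achievable BPB from below.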
…eframe

Corrections:
- T+M are combined (-0.020), not separate; PR openai#700 gets -0.073 (3.6x better)
- Our Q gap (0.066) is larger than the openai#549–openai#700 total gap — Q is THE bottleneck
- Added a "Best Known" column comparing against the best per-term, not just merged SOTA

New insights added:
- Kaplan width scaling, hidden ≥ 512 threshold, Goldilocks depth
- MoE viability at small scale (inactive experts compress well)
- Vocab expansion opportunity (mechanical BPB reduction)
- Compression reframe: a BPB competition is a compression competition, with 20 years of literature behind it
- Strategic evolution: feature bloat → simplify → Q bottleneck → compression-first approach
- Theoretical optimal 1.052 = combine best of openai#549 + openai#609 + openai#700 (nobody has done this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
- Based on PR openai#414 SOTA (1.1228 BPB)
- LeakyReLU(0.5)^2 activation (proven frontier technique)
- XSA on all 11 layers (PR openai#609 approach)
- Full Hessian-aware GPTQ quantization with column reordering
- Binary-search selective pruning to fit under 16MB
- Auto-detects GPU: uses FlashAttention 3 + torch.compile on H100, falls back to PyTorch SDP + no compile on T4/P100/L4
- 1500 lines exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
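The binary-search selective pruning mentioned above can be sketched as follows: search for the magnitude threshold at which the compressed artifact just fits the size budget. The size model (6 bits per surviving weight, pruned weights compress away) and all names are illustrative assumptions, not the actual artifact format.

```python
import numpy as np

def artifact_size_mb(weights, thresh):
    """Hypothetical size model: weights pruned below `thresh` compress to
    ~nothing; each survivor costs ~6 bits (int6 + entropy coding)."""
    kept = sum(int(np.count_nonzero(np.abs(w) > thresh)) for w in weights)
    return kept * 6 / 8 / 1e6

def find_prune_threshold(weights, budget_mb=16.0, iters=40):
    """Binary-search the magnitude threshold so the artifact fits the
    budget. Size is monotone non-increasing in the threshold, so the
    invariant size(hi) <= budget holds throughout."""
    lo, hi = 0.0, max(float(np.abs(w).max()) for w in weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if artifact_size_mb(weights, mid) > budget_mb:
            lo = mid   # still too big: prune more aggressively
        else:
            hi = mid   # fits: try keeping more weights
    return hi

# Toy run: 4 tensors of 50k weights against a deliberately tight budget.
rng = np.random.default_rng(2)
weights = [rng.standard_normal(50_000) for _ in range(4)]
thresh = find_prune_threshold(weights, budget_mb=0.1)
```

In the real pipeline the "size" callback would be the actual lzma-compressed artifact length, but the search structure is the same.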
Non-record: 11L XSA-all + Full GPTQ + Selective Pruning + Parallel Muon
val_bpb: 1.1154 (3-seed mean, sd 0.0005) | 15.94 MB | 8xH100 SXM, 600s
Two techniques on top of PR #593's Parallel Muon stack.
Key additions over PR #593
Everything else from PR #593 carries forward: 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3x, BigramHash(2048), Partial RoPE 16/64, LN Scale, VE128, SmearGate, U-Net skips, EMA(0.997), Tight SWA, Full Hessian GPTQ int6 + lzma, Parameter Banking + Parallel Muon.
Results (3 seeds, 8xH100 SXM)
Mean: 1.1154 | Sd: 0.0005
Requirements
Flash Attention 3 (Hopper kernel) is required. The script imports `flash_attn_interface` directly and will fail without it. FA2 is not sufficient — it runs at ~100ms/step vs ~87ms, losing ~1,000 training steps and ~0.004 BPB.

```
pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
python3 -c "from flash_attn_interface import flash_attn_func; print('FA3 OK')"
```

Also requires: `zstandard`, `sentencepiece`

Run command
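Since the script hard-requires FA3, the availability check can be done up front. A minimal sketch of the detect-and-fallback pattern the PR describes; the `flash_attn_interface` import matches the check above, but the fallback wiring is an assumption, not the script's actual code.

```python
# Probe for the FA3 Hopper kernel; fall back to PyTorch SDPA if absent.
# (The real script fails hard without FA3 -- this fallback is illustrative.)
try:
    from flash_attn_interface import flash_attn_func  # FA3 kernel
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v):
    """Causal attention via FA3 when available, else PyTorch SDPA
    (slower: ~100ms/step vs ~87ms per the requirements note)."""
    if HAS_FA3:
        return flash_attn_func(q, k, v, causal=True)
    import torch.nn.functional as F
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```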
Negative results
Techniques tested on this stack that did not help:
Credits