Non-record: 11L XSA + SwiGLU + LoRA TTT (val_bpb=1.1573, 1xH100) by swapp1990 · Pull Request #2 · swapp1990/parameter-golf

swapp1990 · 2026-03-24T21:50:50Z

Summary

val_bpb: 1.1573 (LoRA TTT) | 15.02 MB artifact | 1xH100 PCIe, ~80 min
11-layer transformer: XSA (last 4 layers), SwiGLU 3x MLP, SmearGate, U-Net skips, OrthoInit, Muon WD=0.04, SWA
Mixed quantization: int5-MLP + int6-attn + int8-embed + zstd-22
Score-then-train LoRA TTT (rank-8, 256-token chunks) brings val_bpb from 1.191 → 1.157
18 experiments over 5 days, from val_bpb=3.10 to 1.1573 (~$50 total compute)
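
The score-then-train ordering mentioned above can be illustrated with a toy sketch. This is not the submission's LoRA code: the add-one-smoothed unigram counter is a hypothetical stand-in for the rank-8 adapter, and `score_then_train` is a name introduced here. The only point is the ordering, in which each 256-token chunk is scored with state learned from earlier chunks and is used for adaptation only afterwards.

```python
import math
from collections import Counter


def score_then_train(tokens, chunk_size=256, vocab_size=1024):
    """Return mean NLL (nats/token) under score-first chunked adaptation."""
    counts = Counter()  # toy stand-in for adapter state (assumption)
    seen = 0
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score: probabilities depend only on chunks seen earlier.
        for tok in chunk:
            p = (counts[tok] + 1) / (seen + vocab_size)  # add-one smoothing
            total_nll -= math.log(p)
        # 2) Train: adapt on the chunk only after it has been scored.
        counts.update(chunk)
        seen += len(chunk)
    return total_nll / len(tokens)


first_chunk_only = score_then_train([7] * 256)
two_chunks = score_then_train([7] * 512)
```

In this toy the first chunk is always scored by the unadapted model, so any gain comes only from later chunks.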

Why Non-Record

Trained on 1xH100 PCIe with grad accumulation (~80 min), not 8xH100 in 10 min. Architecture is identical to what would run on 8xH100.

Test plan

Full training pipeline validated on 2xH100 dry run
Mixed quantization fits in 15.02 MB (< 16 MB)
LoRA TTT parallelized across GPUs with all_reduce
Score-then-train ordering verified (legal per the ruling on openai/parameter-golf#568, "Record: PROTEUS v8: 11L INT6 + LoRA TTT 5ep cosine (mean val_bpb=0.7853, 4 seeds)")
Pending: 8xH100 record run when spot capacity available

🤖 Generated with Claude Code

## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)**: 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization**: int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding, which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput**: seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)**: each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
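
As a rough illustration of per-row quantization at two precisions, the sketch below quantizes each row symmetrically with its own scale. It assumes that "31 levels" and "127 levels" refer to the maximum integer magnitude, giving ranges [-31, 31] and [-127, 127]. The helper names and the sample matrix are hypothetical, and the submission's actual export code may differ.

```python
def quantize_per_row(matrix, qmax):
    """Symmetric per-row quantization to integers in [-qmax, qmax]."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per row
        q_rows.append([max(-qmax, min(qmax, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_row(q_rows, scales):
    """Map integer codes back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def max_abs_error(matrix, qmax):
    """Largest round-trip error over all entries."""
    q_rows, scales = quantize_per_row(matrix, qmax)
    restored = dequantize_per_row(q_rows, scales)
    return max(abs(a - b)
               for ra, rb in zip(matrix, restored)
               for a, b in zip(ra, rb))


weights = [[0.50, -0.25, 0.125, 0.0], [2.0, -1.0, 0.3, 0.7]]  # toy data
err_int6 = max_abs_error(weights, qmax=31)    # int6-style range
err_int8 = max_abs_error(weights, qmax=127)   # int8-style range
```

The round-trip error per row is bounded by half of that row's scale, which is why the wider int8 range is the safer choice for a tensor that was not trained with fake quantization.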

… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129 across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats (t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)

5-seed results on 8xH100 SXM:
- Mean slide_bpb: 1.1652 (std=0.0017)
- Mean rt_bpb: 1.1985
- t-statistic: 78.93 (p << 0.001)
- All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The window_starts filter dropped windows shorter than stride, silently skipping up to (stride-1) tokens at the end of the validation set. Now includes all windows with >= 1 scoreable token, and clamps the score start for short final windows.
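
A minimal sketch of window enumeration that avoids dropping the tail is shown below. It is an illustration written for this note, not the patched code from the commit: `sliding_windows` and its tuple layout are hypothetical, and the defaults (1024-token context, stride 64) are taken from the submissions described on this page. Each token ends up in exactly one scored range, including a final window with fewer than `stride` scoreable tokens.

```python
def sliding_windows(total_tokens, seq_len=1024, stride=64):
    """Return (win_start, win_end, score_start, score_end) tuples.

    All positions are absolute token indices with exclusive ends.
    """
    windows = []
    scored_up_to = 0  # first token that has not been scored yet
    ws = 0
    while scored_up_to < total_tokens:
        we = min(ws + seq_len, total_tokens)  # final window may be short
        # Keep the window even if fewer than `stride` tokens are left,
        # and start scoring exactly where the previous window stopped.
        windows.append((ws, we, scored_up_to, we))
        scored_up_to = we
        ws += stride
    return windows


windows = sliding_windows(total_tokens=2000)
```

With 2000 tokens the last window spans tokens 1024 to 2000 and scores only the final 16, which are the tokens a filter requiring at least `stride` scoreable tokens would skip.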

Co-authored-by: spokane-way <spokane@way>

…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

  Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)
* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)
* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)
* Update submission: 1.2000 BPB, FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone
* Update: 1.1748 BPB, Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Warmdown-quantization co-optimization, val_bpb=1.2154

  Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.

* breakthrough: 1.1574 BPB via int6 + MLP 3x + sliding window stride=256

---------

Co-authored-by: Sam Larson <saml212@users.noreply.github.com>

3-seed validation: mean 1.2067 BPB (std 0.00044), improvement 0.0353 nats over baseline, t=-70.69 (p << 0.01). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

V3: Added 10th layer with mixed int8/int6 quantization (middle layers), plus sliding window evaluation (stride=64). 3-seed mean 1.1793 BPB, improvement 0.0815 nats over baseline, t=-137 (p << 0.01). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

V4b: Full int6 quantization [-31,31] + zstd-22 compression enables MLP expansion to 1344 (2.6x). Muon momentum 0.99, LR 0.02, grad clip 0.3. 3-seed mean 1.1632 BPB (sliding window), 0.1087 nats over baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

V6b: Added straight-through estimator fake int6 quantization during training. Completely eliminates quantization gap (pre-quant = post-quant). 3-seed mean 1.1598 BPB (sliding window), beating previous leader (1.1605). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…layers) (openai#39)

* Add Lower LR submission: val_bpb=1.2230 (MATRIX_LR=0.02)

  Systematic LR sweep showed default Muon/Adam learning rates (0.04) were too high. MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 gives consistent improvement. Same 9L/512d architecture, no other changes.

* Add 10L Mixed Precision submission: val_bpb=1.2147

  10 transformer layers (vs baseline 9) with mixed int8/int6 compression:
  - Full int8 for first/last 3 layers (precision-sensitive)
  - Int6 (step=4 rounding) for middle layers 3-6 (compression-friendly)
  - Lower LR: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
  - Artifact: 15,928,974 bytes (under 16MB cap)
  - Improvement: 0.0097 bpb / 0.0217 nats over baseline (1.2244)

  Also adds PRUNE_RATIO and INT4_LAYERS/INT4_STEP support to train_gpt.py for mixed-precision post-training quantization.

* Revert root train_gpt.py to upstream baseline

  The root script should remain the baseline. Submission-specific modifications (PRUNE_RATIO, INT4_LAYERS, INT4_STEP) only belong in the records/ folder copy.

val_bpb: 1.1556 (post-quant int6+zstd-22, sliding window eval stride=64)

## Summary

A 22.4M parameter transformer language model trained in under 10 minutes on 8×H100 GPUs, compressed to a 15.1MB artifact via int6 quantization-aware training and zstd-22. The architecture combines a SmearGate bigram embedding layer, orthogonal weight initialization, 3× MLP expansion, U-Net skip connections, and decoupled Muon weight decay, evaluated with sliding window context at stride 64.

## Architecture

### Transformer Core

A 9-layer, 512-dim transformer with 8 attention heads (4 KV heads via grouped-query attention) and tied input/output embeddings over a 1024-token BPE vocabulary. Sequence length during training is 1024 tokens.

### SmearGate

A learned per-dimension gate (~512 params) that blends each token's embedding with the previous token's embedding before the transformer processes anything:

```python
gate = sigmoid(self.gate)  # shape [dim], init ≈ 0.95
output = gate * current_emb + (1 - gate) * prev_token_emb
```

This injects bigram (two-token) context directly into the embedding layer. Normally a transformer must discover token-pair relationships through self-attention; SmearGate provides this signal for free. The gate is initialized via `sigmoid(3.0) ≈ 0.95` so it starts near-identity (mostly current token), and the model learns per-dimension how much previous-token blending is useful. Applied after embedding lookup and bigram hash addition, before RMS normalization.

### Bigram Hash Embedding

A 4096-bucket hash table (dim=128, projected to 512) maps consecutive token pairs to learned embeddings via `(prev * 92821 + cur) % 4096`. This gives the model direct access to token-pair features at minimal parameter cost.

### MLP 3× Expansion

MLP hidden dimension is 3× the model dimension (1536 for a 512-dim model). The space savings from int6 quantization fund this extra capacity: wider MLPs allow more expressive nonlinear feature transformation between attention operations.
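
The two embedding-side formulas above can be checked with a small framework-free sketch. The hash constant, bucket count, and gate initialization come from the text; the function names are hypothetical, the gate is applied to plain Python lists rather than tensors, and how the first token of a sequence (which has no predecessor) is handled is not stated in the source, so the caller supplies `prev_emb` explicitly.

```python
import math

NUM_BUCKETS = 4096  # bucket count quoted in the text


def bigram_bucket(prev_token, cur_token, num_buckets=NUM_BUCKETS):
    """Hash a consecutive token pair to an embedding-table row."""
    return (prev_token * 92821 + cur_token) % num_buckets


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def smear(current_emb, prev_emb, gate_logits):
    """Blend current and previous token embeddings per dimension."""
    out = []
    for cur, prev, logit in zip(current_emb, prev_emb, gate_logits):
        g = sigmoid(logit)  # ~0.95 when logit = 3.0
        out.append(g * cur + (1.0 - g) * prev)
    return out


bucket = bigram_bucket(1, 2)
blended = smear([1.0, 0.0], [0.0, 1.0], [3.0, 3.0])
```

At initialization roughly 95% of each dimension comes from the current token, so the layer starts close to an identity mapping.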
### U-Net Skip Connections

The 9-layer transformer is split into an encoder half (4 layers) and a decoder half (5 layers) with learned skip weights connecting corresponding encoder/decoder layers. This gives the decoder direct access to earlier representations without relying solely on the residual stream.

## Training

### Muon Optimizer with Weight Decay

The Muon optimizer (MomentUm Orthogonalized by Newton-Schulz) runs SGD with Nesterov momentum, then post-processes each 2D parameter's gradient update by replacing it with the nearest orthogonal matrix via 5-step Newton-Schulz iteration. This is equivalent to steepest descent under the spectral norm, improving the conditioning of the optimization landscape.

Decoupled weight decay (`p.mul_(1 - wd * lr)`, wd=0.01) is applied before each gradient update. This keeps weights smaller and better-distributed, which directly benefits both generalization and downstream quantization: tighter weight distributions quantize into fewer int6 buckets with less error and compress better with zstd.

Momentum is warmed from 0.92 → 0.99 over the first 1500 steps.

### Orthogonal Weight Initialization

All non-zero-init CastedLinear weight matrices are initialized with `nn.init.orthogonal_()`. Orthogonal matrices have all singular values equal to 1, meaning gradients flow uniformly through the network at initialization with no vanishing or exploding signals. Additionally, since Muon's Newton-Schulz step orthogonalizes updates, starting from an already-orthogonal matrix means early updates are immediately useful rather than spent correcting a random initialization. With only ~12k steps in the 10-minute budget, faster convergence matters.

### Int6 Quantization-Aware Training (STE)

All 2D weight matrices are fake-quantized to int6 ([-31, 31]) during every forward pass via Straight-Through Estimator: the forward pass sees quantized weights while gradients flow through the rounding operation as if it were identity.
The model learns weight configurations that are inherently robust to post-training quantization. The tied embedding matrix is stored as fp16 passthrough (not quantized), since it serves double duty for both input embeddings and output predictions where errors compound in both directions.

### Learning Rate Schedule

Warmup over 20 steps, followed by linear warmdown over the final 3000 steps. Separate learning rates for tied embeddings (0.030), matrix parameters (0.020), and scalar parameters (0.020).

## Evaluation

### Sliding Window (stride=64)

Instead of chopping validation text into non-overlapping chunks (where tokens near the start of each chunk lack context), sliding window uses overlapping windows with stride 64 and the full 1024-token context window. Each scored token gets 960+ tokens of prior context. This is purely an evaluation-time technique; it does not change the model.

## Export

### Int6 + zstd-22 Compression

All quantized weights are packed into int8 containers and compressed with zstandard at level 22. The int6 representation plus aggressive compression brings the full submission (model + code) to 15.1MB, under the 16MB cap.
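
Taking the learning-rate schedule description above at face value, the multiplier applied to each base rate might look like the sketch below. It assumes a linear ramp during warmup and a step-based warmdown with a known total step count; the real training script may differ (for example by tying warmdown to wallclock time), and `lr_scale` is a hypothetical name.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Multiplier on the base learning rate at a given step."""
    if step < warmup_steps:
        # Assumed linear ramp up to the base rate over the warmup steps.
        return (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        # Linear decay to zero over the final `warmdown_iters` steps.
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0


TOTAL_STEPS = 12047  # "Train steps completed" reported for this run
matrix_lr_mid_warmdown = 0.020 * lr_scale(TOTAL_STEPS - 1500, TOTAL_STEPS)
```

Halfway through the warmdown the matrix learning rate in this sketch is 0.010, half of its configured 0.020.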
## Metrics

| Metric | Value |
|--------|-------|
| Post-quant sliding window val_bpb | 1.1556 |
| Post-quant sliding window val_loss | 1.9511 |
| Post-quant standard val_bpb | 1.1891 |
| Post-quant standard val_loss | 2.0077 |
| Quantization gap (standard eval) | ~0.0001 BPB |
| Model parameters | 22,368,840 |
| Artifact size (int6+zstd-22) | 15,878,809 bytes (15.1 MB) |
| Train steps completed | 12,047 |
| Train time | 600s (10.0 min) |
| Sliding window eval time | 75s |
| Peak GPU memory | 11,340 MiB |

## Configuration

```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
USE_SMEARGATE=1
TRAIN_SEQ_LEN=1024
TRAIN_BATCH_TOKENS=524288
LOGIT_SOFTCAP=30.0
ROPE_BASE=10000.0
QK_GAIN_INIT=1.5
BIGRAM_HASH_BUCKETS=4096
BIGRAM_HASH_DIM=128
TIED_EMBED_LR=0.030
MATRIX_LR=0.020
SCALAR_LR=0.020
MUON_MOMENTUM=0.99
MUON_MOMENTUM_WARMUP_START=0.92
MUON_MOMENTUM_WARMUP_STEPS=1500
MUON_WEIGHT_DECAY=0.01
MUON_BACKEND_STEPS=5
WARMDOWN_ITERS=3000
WARMUP_STEPS=20
EVAL_STRIDE=64
MAX_WALLCLOCK_SECONDS=600
SEED=1337
```

## Command

```bash
RUN_ID=smeargate_orthoinit_muonwd \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Hardware

8× NVIDIA H100 80GB HBM3 SXM (RunPod).

…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
- Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
- Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB
- Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
- Mean: 1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…U-Net + NeoMuon + 4xrelu²MLP + Smear + Fact Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8 + Bit-packing LZMA + Stride-16 Eval - 2h (openai#641)

* Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary U-Net (15L 768d 8192BPE relu² 4xMLP FP8 SmearGate, 50k steps)
* Updated README.md for Non-record submission.

---------

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>

… relu² 4xMLP FP8) (openai#640) Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>

…What Works, What Doesn't, and Why (openai#363)

* Non-record: depth recurrence + quantization error amplification finding

  4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP
  BigramHash + XSA + LoRA + Late STE QAT + int8+zstd

  Key finding: quantization error amplifies ~900x through recurrence cycles, making int6 incompatible with weight-sharing architectures. Int8 for shared blocks reduces the gap from 1.14 to 0.37 bpb.

  3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)

* docs: comprehensive depth recurrence research writeup

  Complete 4-day experimental report on looped transformers in Parameter Golf:
  - Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
  - Noisy QAT: novel technique collapsing quant error from 0.37 to 0.002 bpb
  - 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
  - 12 negative results with specific numbers
  - Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
  - Updated training script with all experimental features

* Update README.md

  me when I cant write

* fix: remove extra files, update writeup per reviewer feedback

  - Remove pr325_train_gpt.py from PR (dev file, not submission)
  - Restore original README.md
  - Update records/ writeup with v2 content
  - Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
  - Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)

---------

Co-authored-by: Evangeline Kamin <eve@aurora.lan>

…, 1xH100) 11-layer transformer with XSA, SwiGLU, SmearGate, and score-first LoRA TTT. Trained on 1xH100 PCIe (~80 min). val_bpb: 1.1573, artifact: 15.02 MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


aquariouseworkman and others added 30 commits March 19, 2026 03:33

Record: Seq4096 + Sliding Window Eval, val_bpb=1.1808

9d318e7

Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)

b8a1426

Update README.md

6b40978

New SOTA attempt (openai#52)

78c24e2

Co-authored-by: spokane-way <spokane@way>

Update README.md

b87b883

Update README.md

2d6e9e0

Update README.md

ce6cf9a

Update README.md

ad7b62c

Update README.md

cfa5726

Update README.md

f3897c1

Update README.md

5353524

Update README.md

d2bd760

Add Seq2048 + FP16 Tied Embedding submission (mean val_bpb 1.2067)

34fccfb

3-seed validation: mean 1.2067 BPB (std 0.00044), improvement 0.0353 nats over baseline, t=-70.69 (p << 0.01). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update README.md

ae88208

commit ttt record (openai#77)

bd2463a

Update README.md

5e29bfd

Update README.md

45bbccf

Merge branch 'openai:main' into main

3aface5

Add submission: 2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA…

14cdf6f

… (mean val_bpb=1.1483, 3 seeds)

signalrush and others added 26 commits March 22, 2026 00:48

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233…

15776db

…, 3-seed mean)

Fix pre-TTT BPB, TTT gains, and steps to match logs exactly

139d573

Fix author attributions: PR openai#493 @parinzee, PR openai#461 @Chri…

b08d72a

…stopher-Lee-McClendon

Merge pull request openai#265 from unnir/submission/v22-XSA3-beats-top1

56a9283

Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)

Merge pull request openai#287 from jfprincz/submission/11l-xsa4-ema-i…

0d44464

…nt6-mlp3x-wd04-1.1271 Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

Merge pull request openai#315 from jfprincz/submission/11l-partialrop…

cdabe13

…e-lateqat-1.1248 Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

Merge pull request openai#414 from signalrush/submission/ema-gptqlite…

b5ac0de

…-1.1233 Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)

Update README leaderboard with merged records

b82c50d

Use GitHub usernames in new leaderboard rows

d74c0b5

Describe leaderboard entries by base-run diff

8a77849

Merge pull request openai#561 from openai/codex/update-readme-leaderb…

ebda3af

…oard-merged-records Update README leaderboard with merged record submissions

Merge pull request openai#549 from abaybektursun/submission/leaky-rel…

2377f43

…u-legal-ttt-1.1183 Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)

Update README.md

69050b3

Merge pull request openai#616 from openai/valerio-oai-patch-1

91b26be

Update README.md

Update README.md

630bb5e

Update README.md

499d002

Record Submission: 1.1570 BPB - 73.7M Ternary U-Net (10L 768d 8192BPE…

69bc84e

… relu² 4xMLP FP8) (openai#640) Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>

Update README.md

226d817

Update README.md

dc57d78

Update README.md

499a606

Update README.md

9f44bc6

Update README.md

0e5b198

swapp1990 force-pushed the submission/nonrecord-11l-xsa-lora-ttt branch from 1f5091a to 370a048 Compare March 29, 2026 18:14

swapp1990 and others added 3 commits March 29, 2026 11:17

Add technical report placeholder and link from README

3c97468

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Link technical report to fork repo, remove placeholder from submission

b0d7261

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add comprehensive technical report for 11L XSA + LoRA TTT submission

cb7f417

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


swapp1990 wants to merge 87 commits into main from submission/nonrecord-11l-xsa-lora-ttt


20 participants
