Record: 11L XSA + Mixed INT6 + Adaptive N-gram Cache (2->7 backoff) - val_bpb=0.9631, 3-seed by aerosta · Pull Request #993 · openai/parameter-golf

aerosta · 2026-03-28T00:13:56Z

11L XSA + Mixed INT6 + Adaptive N-gram Cache (2->7 backoff)

val_bpb: 0.96308303 (3-seed mean, std 0.00035576) | 15,882,569 bytes mean | 8xH100 SXM, 600s

Results (8xH100 SXM, 600s)

Seed	Steps	Sliding val_bpb	Final val_bpb	Artifact bytes
1337	6,892	1.12124241	0.96314788	15,879,364
42	6,894	1.12125743	0.96340191	15,884,280
2024	6,897	1.12043283	0.96269931	15,884,064

Mean val_bpb: 0.96308303. Inter-seed std: 0.00035576.
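The summary statistics can be re-derived from the per-seed rows in the table above. The sketch below assumes the reported spread is the sample standard deviation (n-1 denominator), which is what reproduces the quoted figure; the variable names are illustrative only.

```python
from statistics import mean, stdev

# Per-seed values copied from the results table.
final_bpb = {1337: 0.96314788, 42: 0.96340191, 2024: 0.96269931}
artifact_bytes = {1337: 15_879_364, 42: 15_884_280, 2024: 15_884_064}

bpb_mean = mean(list(final_bpb.values()))
bpb_std = stdev(list(final_bpb.values()))      # sample std (n-1 denominator)
bytes_mean = mean(list(artifact_bytes.values()))

print(f"mean val_bpb   = {bpb_mean:.8f}")
print(f"inter-seed std = {bpb_std:.8f}")
print(f"mean artifact  = {int(bytes_mean):,} bytes")
```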

Architecture

Component	Setting
Layers	11 (512d, 8Q, 4KV)
MLP	3x with `relu2`
XSA	All 11 layers
Embeddings	Tied
Weight averaging	EMA + late SWA
Quantization	Post-training mixed INT6 + LZMA
Eval	Sliding window, stride 64
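The stride-64 sliding-window evaluation in the last row can be pictured with a small sketch. This is an illustration under stated assumptions, not the code in train_gpt.py: the helper name `sliding_windows` is hypothetical, and the window length in the example is made up because the PR states only the stride. The idea is that each window after the first scores only its last `stride` tokens, so every token is scored exactly once with as much left context as the window allows.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (ctx_start, ctx_end, score_start) triples.

    The model sees tokens [ctx_start, ctx_end) and only the tokens in
    [score_start, ctx_end) contribute to the loss for that window.
    """
    pos = 0  # first token that has not been scored yet
    while pos < n_tokens:
        if pos == 0:
            ctx_end = min(window, n_tokens)        # first window scores everything it sees
        else:
            ctx_end = min(pos + stride, n_tokens)  # later windows score `stride` new tokens
        ctx_start = max(0, ctx_end - window)
        yield ctx_start, ctx_end, pos
        pos = ctx_end


# Hypothetical sizes, chosen only to keep the printout short.
for ctx_start, ctx_end, score_start in sliding_windows(300, window=128, stride=64):
    print(f"context [{ctx_start}, {ctx_end})  scored [{score_start}, {ctx_end})")
```

A smaller stride gives each scored token more context on average, at the cost of more forward passes.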

Adaptive N-gram Cache

Score-first adaptive n-gram cache with backoff orders 2->7.

Backward-looking evaluation order:

1. Score each window under torch.inference_mode().
2. Add only already-scored tokens to the cache.
3. Apply the cache only to later positions and later windows.

No training data is accessed during evaluation.
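The score-first ordering described above can be illustrated with a toy example. This is a minimal sketch, not the submission's implementation: it uses a plain bigram dictionary as a stand-in cache and works token by token rather than window by window, and the function name `score_first_hits` is hypothetical. The point it shows is that a token is looked up in the cache before it is added, so the cache can only ever help later positions.

```python
def score_first_hits(tokens):
    """Return, per position, whether the (previous, current) bigram was
    already in the cache at the moment that position was scored."""
    cache = {}   # bigram -> count, built only from already-scored tokens
    hits = []
    prev = None
    for tok in tokens:
        key = (prev, tok)
        # 1) Score first: read the cache as it stood before this token.
        hits.append(prev is not None and cache.get(key, 0) > 0)
        # 2) Only then add the now-scored token to the cache.
        if prev is not None:
            cache[key] = cache.get(key, 0) + 1
        prev = tok
    return hits


print(score_first_hits("a b a b a b".split()))
# [False, False, False, True, True, True]
```

The first occurrence of each bigram never benefits from the cache, because at that point the cache holds only earlier tokens.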

Parameter	Value
Orders	`2->7`
Adaptive mode	`sigmoid_raw_entropy`
Alpha range	`[0.05, 0.60]`
Hash buckets	`4,194,304`
Min count	`2`
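One way the parameters in this table could fit together is sketched below. Everything beyond the table values is an assumption: the class `HashedBackoffCache`, the helpers `adaptive_alpha` and `mix`, the `center` and `scale` constants, and the reading of `sigmoid_raw_entropy` as "squash the model's predictive entropy through a sigmoid to pick a mixing weight" are all hypothetical, and the PR does not show the actual code. Token ids are assumed to be non-negative integers.

```python
import math

ORDERS = range(7, 1, -1)          # try 7-grams first, back off down to bigrams
BUCKETS = 4_194_304               # hash buckets (2**22)
MIN_COUNT = 2                     # minimum context count before an order is trusted
ALPHA_LO, ALPHA_HI = 0.05, 0.60   # range of the cache mixing weight


def bucket(order, ctx, tok=-1):
    # tok=-1 marks a context-only key; real token ids are assumed >= 0.
    return hash((order, ctx, tok)) % BUCKETS


class HashedBackoffCache:
    def __init__(self):
        self.ctx_counts = {}    # bucket -> times the context was seen
        self.full_counts = {}   # bucket -> times context + next token was seen

    def update(self, history, tok):
        for n in ORDERS:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                k = bucket(n, ctx)
                self.ctx_counts[k] = self.ctx_counts.get(k, 0) + 1
                k = bucket(n, ctx, tok)
                self.full_counts[k] = self.full_counts.get(k, 0) + 1

    def prob(self, history, tok):
        """Estimate from the highest order with enough evidence, else None."""
        for n in ORDERS:
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            total = self.ctx_counts.get(bucket(n, ctx), 0)
            if total >= MIN_COUNT:
                hit = self.full_counts.get(bucket(n, ctx, tok), 0)
                return min(1.0, hit / total)   # clamp in case of hash collisions
        return None


def adaptive_alpha(entropy, center=2.0, scale=1.0):
    # Hypothetical mapping: higher model entropy -> more weight on the cache.
    s = 1.0 / (1.0 + math.exp(-(entropy - center) / scale))
    return ALPHA_LO + (ALPHA_HI - ALPHA_LO) * s


def mix(p_model, p_cache, entropy):
    if p_cache is None:            # no usable n-gram evidence: keep the model's value
        return p_model
    a = adaptive_alpha(entropy)
    return (1.0 - a) * p_model + a * p_cache
```

Under this reading, the cache contributes at most 60% of the mixed probability and at least 5% whenever some order passes the minimum-count check.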

Ablation (seed 1337)

Configuration	val_bpb
Post-EMA, pre-quant	1.1369
+ INT6 quantization	1.14466175
+ Sliding window (stride 64)	1.12124241
+ Adaptive n-gram cache	0.96314788
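The contribution of each stage can be read off as the difference between consecutive rows. The short sketch below does that arithmetic on the table values; note that the first row is reported to only four decimal places, so the first delta is correspondingly coarse.

```python
# Ablation rows for seed 1337, copied from the table.
stages = [
    ("Post-EMA, pre-quant", 1.1369),
    ("+ INT6 quantization", 1.14466175),
    ("+ Sliding window (stride 64)", 1.12124241),
    ("+ Adaptive n-gram cache", 0.96314788),
]

# Change in val_bpb introduced by each added stage (negative is better).
deltas = [
    (name, cur - prev)
    for (_, prev), (name, cur) in zip(stages, stages[1:])
]

for name, delta in deltas:
    print(f"{name:32s} {delta:+.4f}")
```

The n-gram cache accounts for by far the largest step, roughly 0.158 bpb, which is the component the reviewer's ruling applies to.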

Reproducibility

From the records folder:

torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=2024 torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

XSA: PR #265 "Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)" and arXiv:2603.09078
EMA and quantization lineage: PR #414 "Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)" and PR #549 "Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)"
Early sliding-window evaluation path: PR #77 "[record bpb=1.195] sliding window + LoRA TTT"
Score-first evaluation framing: Issue #402 "Invalid submissions due to information leakage during TTT"

valerio-oai · 2026-03-28T00:28:29Z

Hashed n-gram models in this way are disallowed, so this submission is illegal, apologies.

aerosta · 2026-03-28T01:33:24Z

@valerio-oai Noted. Thanks for the review.

aerosta force-pushed the my-submission branch from b654098 to 6c9e5a5 · March 28, 2026 00:24

Commit 479aa08: Add record submission: 11L XSA + Mixed INT6 + Adaptive N-gram Cache (2->7 backoff)

aerosta force-pushed the my-submission branch from 6c9e5a5 to 479aa08 · March 28, 2026 00:27

valerio-oai closed this Mar 28, 2026

aerosta wants to merge 1 commit into openai:main from aerosta:my-submission
