Record: int5 GPTQ + Soft-Round QAT (3-seed mean 1.1162) #606

EthanYangTW wants to merge 1 commit into openai:main
Conversation
int5 GPTQ quantization with Hessian-aware error compensation enables 33.6M params in 16MB. Soft-Round QAT (differentiable tanh rounding, alpha 1→16) replaces STE for better training quality at zero cost.

3-seed results:
- Seed 1337: val_bpb=1.1155, artifact=15.82MB
- Seed 42: val_bpb=1.1163, artifact=15.42MB
- Seed 7: val_bpb=1.1167, artifact=15.37MB
- Mean: 1.1162 (std 0.0006)
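The "Hessian-aware error compensation" at the heart of GPTQ can be sketched in a few lines. This is a toy sketch, not the PR's implementation (no column reordering or Cholesky factorization, and all names are illustrative): each column is rounded to the nearest level, and its rounding error is folded into the not-yet-quantized columns via the inverse-Hessian row.

```python
import numpy as np

def gptq_quantize_row(w, H_inv, grid):
    """Quantize one weight row column-by-column, compensating each column's
    rounding error into later columns via the inverse Hessian (core GPTQ step)."""
    w = w.astype(np.float64).copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = grid[np.argmin(np.abs(grid - w[j]))]  # nearest quantization level
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]          # compensate later columns
    return q

# With an identity inverse Hessian this degenerates to nearest-level rounding:
grid = np.arange(-15, 16, dtype=np.float64)          # int5-style levels [-15, 15]
q = gptq_quantize_row(np.array([0.4, 0.6]), np.eye(2), grid)
print(q)  # → [0. 1.]
```

With a real (non-diagonal) inverse Hessian from calibration data, the compensation term shifts later columns so the layer's output error, not just the per-weight error, is minimized.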
Pull request overview
Adds a new 10-minute / 16MB record submission folder capturing a 33.6M-parameter model that uses int5 GPTQ quantization, Soft-Round QAT, and “legal score-first” test-time training (TTT), along with the logs and metadata to reproduce the reported 3-seed mean val_bpb.
Changes:
- Adds a full training/eval script implementing int5 GPTQ, Soft-Round QAT, and score-first TTT.
- Adds per-seed training logs documenting results, artifact sizes, and run configuration.
- Adds submission metadata (submission.json) and a README describing the method and reproduction command.
Reviewed changes
Copilot reviewed 2 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_gpt.py | New end-to-end training + GPTQ quantization + TTT evaluation script for the record submission. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed1337.log | Seed 1337 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed42.log | Seed 42 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed7.log | Seed 7 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json | Submission metadata capturing aggregate results, per-seed values, and configuration summary. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md | Method/architecture write-up and reproduction instructions for the submission. |
```python
    w32 = self.weight.float()
    row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
    scale = (row_clip / float(cr)).clamp_min(1.0 / float(cr))
    w_scaled = w32 / scale[:, None]
    w_rounded = CastedLinear.soft_round(w_scaled, CastedLinear._soft_round_alpha)
    w_q = (torch.clamp(w_rounded, -(cr+1), cr) * scale[:, None]).to(x.dtype)
    w = w_q  # fully differentiable path
else:
    # Original STE QAT
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
        scale = (row_clip / float(cr)).clamp_min(1.0 / float(cr))
        w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -(cr+1), cr) * scale[:, None]).to(x.dtype)
    w = w + (w_q - w).detach()
```
In QAT paths you clamp weights with torch.clamp(..., -(cr+1), cr), which makes the quantization range asymmetric and increases the number of representable integer levels (e.g., int5 becomes [-16, 15] = 32 values, int6 becomes [-32, 31] = 64 values). This does not match the stated int5/int6 ranges used elsewhere in this script (e.g., GPTQ quantization clamps to [-cr, cr]) and can cause a train/inference mismatch. Consider clamping to [-cr, cr] (or otherwise aligning QAT and post-training quantization ranges and updating the documentation accordingly).
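The mismatch is quick to verify with a pure-Python sketch (cr = 15 is the assumed int5 clip radius):

```python
cr = 15  # assumed int5 clip radius used by the script

# Representable integer levels under each clamp range:
sym = {max(-cr, min(cr, q)) for q in range(-64, 65)}         # GPTQ path: [-cr, cr]
asym = {max(-(cr + 1), min(cr, q)) for q in range(-64, 65)}  # QAT path: [-(cr+1), cr]

print(len(sym), len(asym))    # → 31 32
print(min(sym), min(asym))    # → -15 -16
```

The QAT path trains against 32 levels spanning [-16, 15] while post-training GPTQ emits 31 levels spanning [-15, 15], so weights that trained to -16 get clipped at inference.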
```python
print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
print(f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params", flush=True)
```
mixed_quantize_int6_gptq uses plain print(...) for quantization stats. In distributed runs this will print from every rank, which matches the duplicated lines seen in the added logs and makes logs noisy/harder to parse. Prefer routing these messages through log0 / gating them on rank == 0 (or returning the stats to the caller and logging once).
Suggested change:

```diff
-print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
-print(f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params", flush=True)
+msg1 = f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers"
+msg2 = f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params"
+if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
+    print(msg1, flush=True)
+    print(msg2, flush=True)
```
```python
# Initialize GPU-vectorized logistic context mixer
use_mixer = os.environ.get("USE_MIXER", "1") == "1"
mixer = LogisticContextMixer(
    vocab_size=val_tokens.to(torch.int32).max().item() + 1,
```
LogisticContextMixer vocab size is computed as val_tokens.to(torch.int32).max().item() + 1, which forces a full dtype conversion of the entire validation tensor (tens of millions of tokens) just to get the max. Since vocab size is already known (args.vocab_size, or sp.vocab_size()), prefer using that directly (or int(val_tokens.max()) + 1 without casting) to avoid unnecessary CPU work and memory pressure during eval.
Suggested change:

```diff
-    vocab_size=val_tokens.to(torch.int32).max().item() + 1,
+    vocab_size=args.vocab_size,
```
```python
with open("final_model.int6.ptz", "wb") as f:
    f.write(quant_blob)
quant_file_bytes = len(quant_blob)
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
if distributed:
    dist.barrier()
with open("final_model.int6.ptz", "rb") as f:
```
This record script writes/reads final_model.int6.ptz and logs it as int6+..., but the submission/README describe int5 GPTQ (and most scripts in the repo use the final_model.int8.ptz artifact name). If there are any shared tooling expectations around artifact filenames or log labels, this inconsistency can break automation and is confusing for reproduction. Consider aligning the artifact filename/log labels with the actual format (e.g., final_model.int5.ptz or the repo-standard final_model.int8.ptz naming convention).
Suggested change:

```diff
-with open("final_model.int6.ptz", "wb") as f:
+with open("final_model.int5.ptz", "wb") as f:
     f.write(quant_blob)
 quant_file_bytes = len(quant_blob)
 code_bytes = len(code.encode("utf-8"))
-log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
-log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
+log0(f"Serialized model int5+{_COMPRESSOR}: {quant_file_bytes} bytes")
+log0(f"Total submission size int5+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
 if distributed:
     dist.barrier()
-with open("final_model.int6.ptz", "rb") as f:
+with open("final_model.int5.ptz", "rb") as f:
```
```markdown
## Architecture

- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²
```
The README describes the MLP as "relu²", but the implementation in train_gpt.py uses leaky_relu(..., negative_slope=0.5).square(). Updating the README to match the actual activation used will make the architecture description accurate and easier to reproduce/compare against other runs.
Suggested change:

```diff
-- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²
+- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x leaky_relu² (negative_slope=0.5)
```
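The difference between the two activations is easy to check with a scalar sketch (the actual script applies the torch equivalents elementwise):

```python
def relu_sq(x: float) -> float:
    # What the README currently claims: relu(x)^2
    return max(x, 0.0) ** 2

def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # What train_gpt.py actually uses: leaky_relu(x, 0.5)^2
    return (x if x > 0 else negative_slope * x) ** 2

# Identical for x >= 0, but negative inputs contribute (0.5*x)^2 instead of 0:
print(relu_sq(-1.0), leaky_relu_sq(-1.0))  # → 0.0 0.25
print(relu_sq(2.0), leaky_relu_sq(2.0))    # → 4.0 4.0
```

Note that leaky_relu² is an even-looking but non-monotonic function on the negative side, so the two activations give genuinely different MLPs, not just a cosmetic rename.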
Track: 10min_16mb
Based on: PR openai#549 (LeakyReLU+ParallelMuon), PR openai#606 (Soft-Round+AdamW TTT), PR openai#609 (XSA-all+Full GPTQ)

Changes from SOTA (openai#549):
- XSA on all 11 layers (was 4)
- Soft-Round QAT with tanh-based differentiable rounding (alpha 1->16)
- Full GPTQ with Hessian-aware column-reordered Cholesky error compensation
- MHA 8/8 (was GQA 8/4)
- MLP 3.5x expansion (1792 hidden, was 3.0x/1536)
- BigramHash vocabulary 8192 (was 2048)
- AdamW TTT with grouped LR and cosine schedule (was SGD)
- Early QAT threshold 0.5 (was late 0.15)
- Selective ±1 magnitude pruning to hit size target
As per previous PRs, it's disallowed to use training data in any way during evaluation, which your GPTQ calibration currently does. This would be legal (or at least more likely to be legal) if you ended training early and ran calibration as part of the training time. However, per the provided logs, you train for the full 10 minutes, then spend 3.6s on calibration before the rest of the eval, so the calibration counts toward eval time and is therefore currently illegal.
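One way to make the calibration legal, as the comment suggests, is to stop the training loop a few seconds early and run GPTQ calibration while still inside the training window. A minimal sketch of such budget gating; the function name, the 600 s budget constant, and the 5 s reserve are all assumptions, not the repo's harness:

```python
import time

TRAIN_BUDGET_S = 600.0   # the track's 10-minute training window (assumed constant)
CALIB_RESERVE_S = 5.0    # assumed headroom for the ~3.6 s GPTQ calibration pass

start = time.time()

def should_stop_for_calibration(now: float) -> bool:
    """True once the remaining training budget must be spent on calibration."""
    return TRAIN_BUDGET_S - (now - start) < CALIB_RESERVE_S

# In the training loop, something like:
#   if should_stop_for_calibration(time.time()):
#       break  # then run GPTQ calibration, still inside the 10-minute window
print(should_stop_for_calibration(start + 100.0))  # → False
print(should_stop_for_calibration(start + 596.0))  # → True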
Add GPU-vectorized trigram + entropy experts to the existing 3-expert (neural + unigram + bigram) Hedge mixer from PR openai#606.

Result: 1.0902 BPB (vs 1.1165 without mixer, -0.026 BPB gain), BUT eval takes 1573s (must be under 600s). Speed fix needed.

Experts: neural, unigram, bigram, hashed-trigram, neural-entropy. All GPU-vectorized, no Python per-token loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
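For reference, Hedge-style (multiplicative-weights) expert mixing can be sketched as a toy per-token loop; `eta` and all names here are assumptions for illustration, not the PR's GPU-vectorized mixer:

```python
import math

def hedge_mix(expert_probs, weights):
    """Weighted average of the experts' probabilities for the next token."""
    z = sum(weights)
    return sum(w / z * p for w, p in zip(weights, expert_probs))

def hedge_update(weights, expert_probs, eta=1.0):
    """Multiplicative-weights step: downweight experts whose log loss on the
    observed token was high (loss_i = -log p_i)."""
    return [w * math.exp(-eta * -math.log(p)) for w, p in zip(weights, expert_probs)]

# Expert 0 predicted the observed token with p=0.9, expert 1 with p=0.1:
w = hedge_update([1.0, 1.0], [0.9, 0.1])
print(hedge_mix([0.9, 0.1], w))  # mixture now leans toward expert 0
```

The GPU-vectorized version replaces the Python loop with batched tensor ops over all positions, which is what makes the 600 s eval budget attainable.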
Summary
Results
Key Innovations
s_α(y) = floor(y) + 0.5 * tanh(α·r) / tanh(α/2) + 0.5, with alpha annealing 1→16 during QAT steps.

Architecture
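A scalar sketch of the soft-round formula above, assuming the standard centered fractional part r = y − floor(y) − 0.5 (the script applies this elementwise to weight tensors):

```python
import math

def soft_round(y: float, alpha: float) -> float:
    """Differentiable surrogate for round(): low alpha is near-identity,
    high alpha sharpens toward hard rounding with nonzero gradient everywhere."""
    fy = math.floor(y)
    r = y - fy - 0.5  # centered fractional part in [-0.5, 0.5)
    return fy + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2) + 0.5

# At the final alpha=16, values sit very close to their hard-rounded targets,
# while exact midpoints stay fixed at 0.5:
print(soft_round(0.8, 16.0))  # close to 1.0
print(soft_round(0.2, 16.0))  # close to 0.0
print(soft_round(0.5, 16.0))  # → 0.5
```

Annealing alpha from 1 to 16 lets weights move freely early in QAT and gradually commit to integer levels, which is why it can replace the straight-through estimator without extra cost.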
TTT (Legal, Score-First)
Test plan