Record: int5 GPTQ + Soft-Round QAT (3-seed mean 1.1162) #606

EthanYangTW wants to merge 1 commit into openai:main
Conversation
int5 GPTQ quantization with Hessian-aware error compensation enables 33.6M params in 16MB. Soft-Round QAT (differentiable tanh rounding, alpha 1→16) replaces STE for better training quality at zero cost.

3-seed results:
- Seed 1337: val_bpb=1.1155, artifact=15.82MB
- Seed 42: val_bpb=1.1163, artifact=15.42MB
- Seed 7: val_bpb=1.1167, artifact=15.37MB
- Mean: 1.1162 (std 0.0006)
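The "Hessian-aware error compensation" at the heart of GPTQ can be sketched in a few lines. This is a toy sketch, not the PR's implementation (no column reordering or Cholesky factorization, and all names are illustrative): each column is rounded to the nearest level, and its rounding error is folded into the not-yet-quantized columns via the inverse-Hessian row.

```python
import numpy as np

def gptq_quantize_row(w, H_inv, grid):
    """Quantize one weight row column-by-column, compensating each column's
    rounding error into later columns via the inverse Hessian (core GPTQ step)."""
    w = w.astype(np.float64).copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = grid[np.argmin(np.abs(grid - w[j]))]  # nearest quantization level
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]          # compensate later columns
    return q

# With an identity inverse Hessian this degenerates to nearest-level rounding:
grid = np.arange(-15, 16, dtype=np.float64)          # int5-style levels [-15, 15]
q = gptq_quantize_row(np.array([0.4, 0.6]), np.eye(2), grid)
print(q)  # → [0. 1.]
```

With a real (non-diagonal) inverse Hessian from calibration data, the compensation term shifts later columns so the layer's output error, not just the per-weight error, is minimized.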
Pull request overview
Adds a new 10-minute / 16MB record submission folder capturing a 33.6M-parameter model that uses int5 GPTQ quantization, Soft-Round QAT, and “legal score-first” test-time training (TTT), along with the logs and metadata to reproduce the reported 3-seed mean val_bpb.
Changes:
- Adds a full training/eval script implementing int5 GPTQ, Soft-Round QAT, and score-first TTT.
- Adds per-seed training logs documenting results, artifact sizes, and run configuration.
- Adds submission metadata (submission.json) and a README describing the method and reproduction command.
Reviewed changes
Copilot reviewed 2 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_gpt.py | New end-to-end training + GPTQ quantization + TTT evaluation script for the record submission. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed1337.log | Seed 1337 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed42.log | Seed 42 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed7.log | Seed 7 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json | Submission metadata capturing aggregate results, per-seed values, and configuration summary. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md | Method/architecture write-up and reproduction instructions for the submission. |
```python
    w32 = self.weight.float()
    row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
    scale = (row_clip / float(cr)).clamp_min(1.0 / float(cr))
    w_scaled = w32 / scale[:, None]
    w_rounded = CastedLinear.soft_round(w_scaled, CastedLinear._soft_round_alpha)
    w_q = (torch.clamp(w_rounded, -(cr+1), cr) * scale[:, None]).to(x.dtype)
    w = w_q  # fully differentiable path
else:
    # Original STE QAT
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
        scale = (row_clip / float(cr)).clamp_min(1.0 / float(cr))
        w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -(cr+1), cr) * scale[:, None]).to(x.dtype)
    w = w + (w_q - w).detach()
```
In QAT paths you clamp weights with torch.clamp(..., -(cr+1), cr), which makes the quantization range asymmetric and increases the number of representable integer levels (e.g., int5 becomes [-16, 15] = 32 values, int6 becomes [-32, 31] = 64 values). This does not match the stated int5/int6 ranges used elsewhere in this script (e.g., GPTQ quantization clamps to [-cr, cr]) and can cause a train/inference mismatch. Consider clamping to [-cr, cr] (or otherwise aligning QAT and post-training quantization ranges and updating the documentation accordingly).
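The mismatch is quick to verify with a pure-Python sketch (cr = 15 is the assumed int5 clip radius):

```python
cr = 15  # assumed int5 clip radius used by the script

# Representable integer levels under each clamp range:
sym = {max(-cr, min(cr, q)) for q in range(-64, 65)}         # GPTQ path: [-cr, cr]
asym = {max(-(cr + 1), min(cr, q)) for q in range(-64, 65)}  # QAT path: [-(cr+1), cr]

print(len(sym), len(asym))    # → 31 32
print(min(sym), min(asym))    # → -15 -16
```

The QAT path trains against 32 levels spanning [-16, 15] while post-training GPTQ emits 31 levels spanning [-15, 15], so weights that trained to -16 get clipped at inference.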
```python
print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
print(f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params", flush=True)
```
mixed_quantize_int6_gptq uses plain print(...) for quantization stats. In distributed runs this will print from every rank, which matches the duplicated lines seen in the added logs and makes logs noisy/harder to parse. Prefer routing these messages through log0 / gating them on rank == 0 (or returning the stats to the caller and logging once).
Suggested change:

```diff
-print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
-print(f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params", flush=True)
+msg1 = f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers"
+msg2 = f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params"
+if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
+    print(msg1, flush=True)
+    print(msg2, flush=True)
```
```python
# Initialize GPU-vectorized logistic context mixer
use_mixer = os.environ.get("USE_MIXER", "1") == "1"
mixer = LogisticContextMixer(
    vocab_size=val_tokens.to(torch.int32).max().item() + 1,
```
LogisticContextMixer vocab size is computed as val_tokens.to(torch.int32).max().item() + 1, which forces a full dtype conversion of the entire validation tensor (tens of millions of tokens) just to get the max. Since vocab size is already known (args.vocab_size, or sp.vocab_size()), prefer using that directly (or int(val_tokens.max()) + 1 without casting) to avoid unnecessary CPU work and memory pressure during eval.
Suggested change:

```diff
-    vocab_size=val_tokens.to(torch.int32).max().item() + 1,
+    vocab_size=args.vocab_size,
```
```python
with open("final_model.int6.ptz", "wb") as f:
    f.write(quant_blob)
quant_file_bytes = len(quant_blob)
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
if distributed:
    dist.barrier()
with open("final_model.int6.ptz", "rb") as f:
```
This record script writes/reads final_model.int6.ptz and logs it as int6+..., but the submission/README describe int5 GPTQ (and most scripts in the repo use the final_model.int8.ptz artifact name). If there are any shared tooling expectations around artifact filenames or log labels, this inconsistency can break automation and is confusing for reproduction. Consider aligning the artifact filename/log labels with the actual format (e.g., final_model.int5.ptz or the repo-standard final_model.int8.ptz naming convention).
Suggested change:

```diff
-with open("final_model.int6.ptz", "wb") as f:
+with open("final_model.int5.ptz", "wb") as f:
     f.write(quant_blob)
 quant_file_bytes = len(quant_blob)
 code_bytes = len(code.encode("utf-8"))
-log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
-log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
+log0(f"Serialized model int5+{_COMPRESSOR}: {quant_file_bytes} bytes")
+log0(f"Total submission size int5+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
 if distributed:
     dist.barrier()
-with open("final_model.int6.ptz", "rb") as f:
+with open("final_model.int5.ptz", "rb") as f:
```
```markdown
## Architecture

- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²
```
The README describes the MLP as "relu²", but the implementation in train_gpt.py uses leaky_relu(..., negative_slope=0.5).square(). Updating the README to match the actual activation used will make the architecture description accurate and easier to reproduce/compare against other runs.
Suggested change:

```diff
-- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²
+- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x leaky_relu² (negative_slope=0.5)
```
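The difference between the two activations is easy to check with a scalar sketch (the actual script applies the torch equivalents elementwise):

```python
def relu_sq(x: float) -> float:
    # What the README currently claims: relu(x)^2
    return max(x, 0.0) ** 2

def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # What train_gpt.py actually uses: leaky_relu(x, 0.5)^2
    return (x if x > 0 else negative_slope * x) ** 2

# Identical for x >= 0, but negative inputs contribute (0.5*x)^2 instead of 0:
print(relu_sq(-1.0), leaky_relu_sq(-1.0))  # → 0.0 0.25
print(relu_sq(2.0), leaky_relu_sq(2.0))    # → 4.0 4.0
```

Note that leaky_relu² is an even-looking but non-monotonic function on the negative side, so the two activations give genuinely different MLPs, not just a cosmetic rename.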
Track: 10min_16mb
Based on: PR openai#549 (LeakyReLU+ParallelMuon), PR openai#606 (Soft-Round+AdamW TTT), PR openai#609 (XSA-all+Full GPTQ)

Changes from SOTA (openai#549):
- XSA on all 11 layers (was 4)
- Soft-Round QAT with tanh-based differentiable rounding (alpha 1->16)
- Full GPTQ with Hessian-aware column-reordered Cholesky error compensation
- MHA 8/8 (was GQA 8/4)
- MLP 3.5x expansion (1792 hidden, was 3.0x/1536)
- BigramHash vocabulary 8192 (was 2048)
- AdamW TTT with grouped LR and cosine schedule (was SGD)
- Early QAT threshold 0.5 (was late 0.15)
- Selective ±1 magnitude pruning to hit size target
As per previous PRs, it's disallowed to use training data in any way during evaluation, which your GPTQ calibration currently does. This would be legal (or at least more likely to be legal) if you ended training early and ran calibration as part of the training time. However, per the provided logs, you train for the full 10 minutes, then spend 3.6s on calibration before the rest of the eval, so the calibration counts toward eval time and is therefore currently illegal.
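One way to make the calibration legal, as the comment suggests, is to stop the training loop a few seconds early and run GPTQ calibration while still inside the training window. A minimal sketch of such budget gating; the function name, the 600 s budget constant, and the 5 s reserve are all assumptions, not the repo's harness:

```python
import time

TRAIN_BUDGET_S = 600.0   # the track's 10-minute training window (assumed constant)
CALIB_RESERVE_S = 5.0    # assumed headroom for the ~3.6 s GPTQ calibration pass

start = time.time()

def should_stop_for_calibration(now: float) -> bool:
    """True once the remaining training budget must be spent on calibration."""
    return TRAIN_BUDGET_S - (now - start) < CALIB_RESERVE_S

# In the training loop, something like:
#   if should_stop_for_calibration(time.time()):
#       break  # then run GPTQ calibration, still inside the 10-minute window
print(should_stop_for_calibration(start + 100.0))  # → False
print(should_stop_for_calibration(start + 596.0))  # → True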
Add GPU-vectorized trigram + entropy experts to the existing 3-expert (neural + unigram + bigram) Hedge mixer from PR openai#606.

Result: 1.0902 BPB (vs 1.1165 without mixer, -0.026 BPB gain), BUT eval takes 1573s (must be under 600s). Speed fix needed.

Experts: neural, unigram, bigram, hashed-trigram, neural-entropy. All GPU-vectorized, no Python per-token loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
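For reference, Hedge-style (multiplicative-weights) expert mixing can be sketched as a toy per-token loop; `eta` and all names here are assumptions for illustration, not the PR's GPU-vectorized mixer:

```python
import math

def hedge_mix(expert_probs, weights):
    """Weighted average of the experts' probabilities for the next token."""
    z = sum(weights)
    return sum(w / z * p for w, p in zip(weights, expert_probs))

def hedge_update(weights, expert_probs, eta=1.0):
    """Multiplicative-weights step: downweight experts whose log loss on the
    observed token was high (loss_i = -log p_i)."""
    return [w * math.exp(-eta * -math.log(p)) for w, p in zip(weights, expert_probs)]

# Expert 0 predicted the observed token with p=0.9, expert 1 with p=0.1:
w = hedge_update([1.0, 1.0], [0.9, 0.1])
print(hedge_mix([0.9, 0.1], w))  # mixture now leans toward expert 0
```

The GPU-vectorized version replaces the Python loop with batched tensor ops over all positions, which is what makes the 600 s eval budget attainable.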
Summary
Results
Key Innovations
s_α(y) = floor(y) + 0.5 * tanh(α·r) / tanh(α/2) + 0.5, with alpha annealing 1→16 during QAT steps.

Architecture
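A scalar sketch of the soft-round formula above, assuming the standard centered fractional part r = y − floor(y) − 0.5 (the script applies this elementwise to weight tensors):

```python
import math

def soft_round(y: float, alpha: float) -> float:
    """Differentiable surrogate for round(): low alpha is near-identity,
    high alpha sharpens toward hard rounding with nonzero gradient everywhere."""
    fy = math.floor(y)
    r = y - fy - 0.5  # centered fractional part in [-0.5, 0.5)
    return fy + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2) + 0.5

# At the final alpha=16, values sit very close to their hard-rounded targets,
# while exact midpoints stay fixed at 0.5:
print(soft_round(0.8, 16.0))  # close to 1.0
print(soft_round(0.2, 16.0))  # close to 0.0
print(soft_round(0.5, 16.0))  # → 0.5
```

Annealing alpha from 1 to 16 lets weights move freely early in QAT and gradually commit to integer levels, which is why it can replace the straight-through estimator without extra cost.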
TTT (Legal, Score-First)
Test plan