
Record: int5 GPTQ + Soft-Round QAT (3-seed mean 1.1162) #606

Closed
EthanYangTW wants to merge 1 commit into openai:main from EthanYangTW:submission/softround-3seed
Conversation

@EthanYangTW

Summary

  • 3-seed mean val_bpb = 1.1162 (std 0.0006)
  • int5 GPTQ quantization ([-15,15], 31 levels) with Hessian-aware error compensation enables 33.6M params in 16MB
  • Soft-Round QAT: differentiable tanh-based rounding (alpha 1→16) replacing STE for better training quality at zero cost
  • Legal score-first AdamW TTT with cosine LR decay across chunks

Results

| Seed | TTT BPB | Artifact |
|------|---------|----------|
| 1337 | 1.1155  | 15,822,078 bytes |
| 42   | 1.1163  | 15,415,405 bytes |
| 7    | 1.1167  | 15,368,627 bytes |
| Mean | 1.1162  | |
| Std  | 0.0006  | |

Key Innovations

  1. int5 quantization — 31 unique values stored as int8, ~0.46 bytes/param after zstd (vs int6's ~0.58). Enables 33.6M param model under 16MB.
  2. GPTQ error compensation — Hessian-aware column reordering + Cholesky error redistribution. 256-sample calibration.
  3. Soft-Round QAT — s_α(y) = floor(y) + 0.5 · tanh(α·r) / tanh(α/2) + 0.5, where r = y − floor(y) − 0.5, with alpha annealing 1→16 during QAT steps.
  4. 33.6M param model — MHA 8/8, BigramHash 8192, MLP 3.5x (1792), XSA all 11 layers.
  5. Early QAT (threshold 0.5) + weight EMA 0.997 + FA3 on Hopper
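The surrogate in item 3 can be written down directly. A minimal sketch, assuming the standard soft-round residual r = y − floor(y) − 0.5 (the PR's exact implementation may differ in detail):

```python
import math

import torch

def soft_round(y: torch.Tensor, alpha: float) -> torch.Tensor:
    # Differentiable surrogate for round(): an S-curve between adjacent
    # integers that sharpens toward hard rounding as alpha grows.
    r = y - torch.floor(y) - 0.5
    return torch.floor(y) + 0.5 * torch.tanh(alpha * r) / math.tanh(alpha / 2) + 0.5
```

At alpha=1 the map is nearly the identity, so gradients flow freely early in QAT; by alpha=16 it is close to round(), which is what makes the 1→16 annealing a curriculum rather than a hard switch from smooth to quantized.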

Architecture

  • 11L, model_dim=512, 8H/8KV (MHA), MLP 3.5x relu²
  • XSA on all 11 layers, Partial RoPE 16/64, LN Scale
  • SmearGate + OrthoInit, BigramHash 8192, Shared VE128
  • Tight SWA (every 50), Muon lr=0.025, WD=0.04
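As an illustration of the "Partial RoPE 16/64" bullet, here is a hedged sketch in which rotary position embeddings are applied to only the first 16 of 64 head dims; `partial_rope` and its arguments are illustrative names, and the actual script may pair dimensions differently:

```python
import torch

def partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq, heads, head_dim); rotate the first rot_dims dims,
    # pass the remaining head_dim - rot_dims dims through unchanged.
    T, D = x.shape[1], rot_dims
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    freqs = torch.outer(torch.arange(T, dtype=torch.float32), inv_freq)  # (T, D/2)
    cos, sin = freqs.cos()[None, :, None, :], freqs.sin()[None, :, None, :]
    x_rot, x_pass = x[..., :D], x[..., D:]
    x1, x2 = x_rot[..., ::2], x_rot[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)
    return torch.cat([rotated, x_pass], dim=-1)
```

Rotating only a quarter of each head keeps positional information while leaving most dimensions position-free, a common trade-off in small-budget models.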

TTT (Legal, Score-First)

  • 131072-token chunks, score first then adapt
  • AdamW (lr=0.0001, wd=0.0), 3 epochs/chunk, cosine LR
  • Last 2 blocks + norms + lm_head unfrozen (~5.8M / 33.6M)
  • Every token scored BEFORE any gradient update using it
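The bullets above can be sketched as a loop. Names here are illustrative, not the PR's API, and the real script unfreezes only the last 2 blocks, norms, and lm_head rather than all parameters:

```python
import torch

def score_first_ttt(model, loss_fn, chunks, lr=1e-4, epochs=3):
    # Score each chunk with the current weights BEFORE any gradient step
    # that uses its tokens, then adapt on it; LR follows one cosine decay
    # spanning all chunks.
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=len(chunks) * epochs)
    losses = []
    for chunk in chunks:
        with torch.no_grad():            # 1) score first: these tokens are unseen
            losses.append(loss_fn(model, chunk).item())
        for _ in range(epochs):          # 2) only then adapt on this chunk
            opt.zero_grad()
            loss_fn(model, chunk).backward()
            opt.step()
            sched.step()
    return sum(losses) / len(losses)
```

The ordering is the legality argument in miniature: by the time a chunk influences the weights, its score is already recorded.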

Test plan

…=1.1162)

int5 GPTQ quantization with Hessian-aware error compensation enables 33.6M
params in 16MB. Soft-Round QAT (differentiable tanh rounding, alpha 1→16)
replaces STE for better training quality at zero cost.
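A back-of-envelope check of the size claim; the ~0.46 bytes/param figure is taken from the PR text, and the actual artifacts land at 15.37-15.82 MB:

```python
params = 33.6e6          # model parameters
bytes_per_param = 0.46   # int5 stored as int8, then zstd-compressed (per the PR)
artifact_bytes = params * bytes_per_param
print(f"{artifact_bytes / 1e6:.1f} MB")  # ~15.5 MB, inside the 16 MB budget
```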

3-seed results:
- Seed 1337: val_bpb=1.1155, artifact=15.82MB
- Seed 42:   val_bpb=1.1163, artifact=15.42MB
- Seed 7:    val_bpb=1.1167, artifact=15.37MB
- Mean: 1.1162 (std 0.0006)
@EthanYangTW EthanYangTW marked this pull request as ready for review March 24, 2026 07:50
Copilot AI review requested due to automatic review settings March 24, 2026 07:50
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10-minute / 16MB record submission folder capturing a 33.6M-parameter model that uses int5 GPTQ quantization, Soft-Round QAT, and “legal score-first” test-time training (TTT), along with the logs and metadata to reproduce the reported 3-seed mean val_bpb.

Changes:

  • Adds a full training/eval script implementing int5 GPTQ, Soft-Round QAT, and score-first TTT.
  • Adds per-seed training logs documenting results, artifact sizes, and run configuration.
  • Adds submission metadata (submission.json) and a README describing the method and reproduction command.

Reviewed changes

Copilot reviewed 2 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_gpt.py | New end-to-end training + GPTQ quantization + TTT evaluation script for the record submission. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed1337.log | Seed 1337 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed42.log | Seed 42 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed7.log | Seed 7 run log supporting the reported metrics/size. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json | Submission metadata capturing aggregate results, per-seed values, and configuration summary. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md | Method/architecture write-up and reproduction instructions for the submission. |


Comment on lines +809 to +823
```python
    w32 = self.weight.float()
    row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
    scale = (row_clip / float(cr)).clamp_min(1.0 / float(cr))
    w_scaled = w32 / scale[:, None]
    w_rounded = CastedLinear.soft_round(w_scaled, CastedLinear._soft_round_alpha)
    w_q = (torch.clamp(w_rounded, -(cr + 1), cr) * scale[:, None]).to(x.dtype)
    w = w_q  # fully differentiable path
else:
    # Original STE QAT
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
        scale = (row_clip / float(cr)).clamp_min(1.0 / float(cr))
        w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -(cr + 1), cr) * scale[:, None]).to(x.dtype)
    w = w + (w_q - w).detach()
```

Copilot AI Mar 24, 2026


In QAT paths you clamp weights with torch.clamp(..., -(cr+1), cr), which makes the quantization range asymmetric and increases the number of representable integer levels (e.g., int5 becomes [-16, 15] = 32 values, int6 becomes [-32, 31] = 64 values). This does not match the stated int5/int6 ranges used elsewhere in this script (e.g., GPTQ quantization clamps to [-cr, cr]) and can cause a train/inference mismatch. Consider clamping to [-cr, cr] (or otherwise aligning QAT and post-training quantization ranges and updating the documentation accordingly).
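The level-count mismatch the reviewer describes is easy to verify numerically (`cr` here stands in for the script's clip radius):

```python
cr = 15  # int5 clip radius
asymmetric = range(-(cr + 1), cr + 1)   # what torch.clamp(..., -(cr+1), cr) permits
symmetric = range(-cr, cr + 1)          # what the GPTQ path's [-cr, cr] clamp permits
print(len(asymmetric), len(symmetric))  # 32 vs 31 representable integer levels
```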

Comment on lines +1686 to +1687
```python
print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
print(f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params", flush=True)
```

Copilot AI Mar 24, 2026


mixed_quantize_int6_gptq uses plain print(...) for quantization stats. In distributed runs this will print from every rank, which matches the duplicated lines seen in the added logs and makes logs noisy/harder to parse. Prefer routing these messages through log0 / gating them on rank == 0 (or returning the stats to the caller and logging once).

Suggested change

```diff
-print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
-print(f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params", flush=True)
+msg1 = f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers"
+msg2 = f"mixed_precision: {int5_params} int5 params, {int6_params} int6 params"
+if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
+    print(msg1, flush=True)
+    print(msg2, flush=True)
```

```python
# Initialize GPU-vectorized logistic context mixer
use_mixer = os.environ.get("USE_MIXER", "1") == "1"
mixer = LogisticContextMixer(
    vocab_size=val_tokens.to(torch.int32).max().item() + 1,
```

Copilot AI Mar 24, 2026


LogisticContextMixer vocab size is computed as val_tokens.to(torch.int32).max().item() + 1, which forces a full dtype conversion of the entire validation tensor (tens of millions of tokens) just to get the max. Since vocab size is already known (args.vocab_size, or sp.vocab_size()), prefer using that directly (or int(val_tokens.max()) + 1 without casting) to avoid unnecessary CPU work and memory pressure during eval.

Suggested change

```diff
-    vocab_size=val_tokens.to(torch.int32).max().item() + 1,
+    vocab_size=args.vocab_size,
```

Comment on lines +2190 to +2198
```python
with open("final_model.int6.ptz", "wb") as f:
    f.write(quant_blob)
quant_file_bytes = len(quant_blob)
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
if distributed:
    dist.barrier()
with open("final_model.int6.ptz", "rb") as f:
```

Copilot AI Mar 24, 2026


This record script writes/reads final_model.int6.ptz and logs it as int6+..., but the submission/README describe int5 GPTQ (and most scripts in the repo use the final_model.int8.ptz artifact name). If there are any shared tooling expectations around artifact filenames or log labels, this inconsistency can break automation and is confusing for reproduction. Consider aligning the artifact filename/log labels with the actual format (e.g., final_model.int5.ptz or the repo-standard final_model.int8.ptz naming convention).

Suggested change

```diff
-with open("final_model.int6.ptz", "wb") as f:
+with open("final_model.int5.ptz", "wb") as f:
     f.write(quant_blob)
 quant_file_bytes = len(quant_blob)
 code_bytes = len(code.encode("utf-8"))
-log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
-log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
+log0(f"Serialized model int5+{_COMPRESSOR}: {quant_file_bytes} bytes")
+log0(f"Total submission size int5+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
 if distributed:
     dist.barrier()
-with open("final_model.int6.ptz", "rb") as f:
+with open("final_model.int5.ptz", "rb") as f:
```


## Architecture

- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²

Copilot AI Mar 24, 2026


The README describes the MLP as "relu²", but the implementation in train_gpt.py uses leaky_relu(..., negative_slope=0.5).square(). Updating the README to match the actual activation used will make the architecture description accurate and easier to reproduce/compare against other runs.

Suggested change

```diff
-- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²
+- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x leaky_relu² (negative_slope=0.5)
```

senstar-hsoleimani added a commit to senstar-hsoleimani/parameter-golf that referenced this pull request Mar 24, 2026
Track: 10min_16mb
Based on: PR openai#549 (LeakyReLU+ParallelMuon), PR openai#606 (Soft-Round+AdamW TTT), PR openai#609 (XSA-all+Full GPTQ)

Changes from SOTA (openai#549):
- XSA on all 11 layers (was 4)
- Soft-Round QAT with tanh-based differentiable rounding (alpha 1->16)
- Full GPTQ with Hessian-aware column-reordered Cholesky error compensation
- MHA 8/8 (was GQA 8/4)
- MLP 3.5x expansion (1792 hidden, was 3.0x/1536)
- BigramHash vocabulary 8192 (was 2048)
- AdamW TTT with grouped LR and cosine schedule (was SGD)
- Early QAT threshold 0.5 (was late 0.15)
- Selective ±1 magnitude pruning to hit size target
@valerio-oai
Contributor

As per previous PRs, it's disallowed to use training data in any way during evaluation, which your GPTQ calibration is currently doing. This would be legal (or at least more likely to be legal) if you ended training early and calibrated as part of the training time. However, per the logs provided, it looks like you're training for the full 10 minutes, then doing 3.6s of calibration and then the rest of the eval, meaning this is included in the eval time and therefore currently illegal.

RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 25, 2026
Add GPU-vectorized trigram + entropy experts to the existing
3-expert (neural + unigram + bigram) Hedge mixer from PR openai#606.

Result: 1.0902 BPB (vs 1.1165 without mixer, -0.026 BPB gain)
BUT eval takes 1573s (must be under 600s). Speed fix needed.

Experts: neural, unigram, bigram, hashed-trigram, neural-entropy
All GPU-vectorized, no Python per-token loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>