Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean) #376
anthony-maio wants to merge 28 commits into openai:main
Conversation
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate, BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds TTT LoRA evaluation. TTT passes base_model directly (compiled). If TTT works on this architecture: expected ~1.11-1.12 bpb (new record). If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes from PR openai#162 base: - 11 layers (from 9) — enabled by int6 compression headroom - Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs over val data, freeze first 2 blocks for stability - NTK-RoPE base=50000 (from 10000) for long-context extrapolation - matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035 - weight_decay=0.04 (from 0.01) - BigramHash 2048 buckets (from 4096) - TTT_ENABLED=1 env var toggle Target: match FarnsworthEngine's 1.1303 bpb or beat it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
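A minimal sketch of that full-weight TTT recipe (SGD lr=0.002, momentum 0.9, 3 epochs, first 2 blocks frozen), using toy `nn.Linear` blocks in place of transformer blocks and MSE in place of the script's next-token cross-entropy; `ttt_full_weight` is an illustrative helper, not the PR's actual code:

```python
import torch
import torch.nn as nn

def ttt_full_weight(blocks, val_batches, lr=0.002, momentum=0.9, epochs=3, n_freeze=2):
    """Full-weight SGD test-time training: adapt every weight on the
    validation stream, except the first `n_freeze` blocks, which stay
    frozen for stability (per the commit message)."""
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad_(i >= n_freeze)
    opt = torch.optim.SGD(
        [p for p in blocks.parameters() if p.requires_grad],
        lr=lr, momentum=momentum,
    )
    loss_fn = nn.MSELoss()
    blocks.train()
    for _ in range(epochs):
        for x, y in val_batches:
            opt.zero_grad()
            loss_fn(blocks(x), y).backward()
            opt.step()
    blocks.eval()
    return blocks
```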
At ~5700 steps on our pods, warmdown=3000 means 53% of training is in the LR decay phase. Reducing to 1500 doubles full-LR training time. Council identified this as a free 0.005+ bpb improvement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
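The warmdown arithmetic above can be sketched with a trapezoidal schedule helper (`lr_scale` is a hypothetical name; the record's script may implement the schedule differently):

```python
def lr_scale(step, total_steps, warmdown):
    """Schedule sketch matching the description: full LR until the final
    `warmdown` steps, then linear decay to zero."""
    decay_start = total_steps - warmdown
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown)
```

With ~5700 steps, warmdown=3000 puts 3000/5700 ≈ 53% of steps in decay; warmdown=1500 cuts that to ~26%. Note that setting warmdown above total_steps makes `decay_start` negative, so the LR decays from step 1 (the warmdown-greater-than-steps trick mentioned in a later commit message).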
NameError crashed after TTT epoch 3 completed successfully. eval_stride/eval_sl were local variables from the pre-TTT eval section, not visible in the TTT section. Use args.eval_stride and args.train_seq_len directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 layers (valid artifact under 16MB), full SOTA stack: MLP 3x, SmearGate, BigramHash 2048, int6+zstd-22, SWA, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. Trained 4,782 steps at ~125ms/step on 8xH100 SXM. Custom kernel integration in progress for next submission. TTT disabled (does not improve results on this architecture). Set NUM_LAYERS=11 for 11L variant (requires tighter compression). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new record submission under records/track_10min_16mb/2026-03-21_MatchSOTA_TTT capturing a high-performing 10-minute / 16MB training run configuration (model + training loop + quantization export), along with metadata and a write-up.
Changes:
- Introduces a new standalone `train_gpt.py` for the record, including the Muon/Adam optimizer split, int6 mixed quantization + zstd/zlib export, and sliding-window evaluation.
- Adds `submission.json` metadata for the record's reported score/config.
- Adds a README describing the approach (TTT + kernels) and reproduction steps.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-21_MatchSOTA_TTT/train_gpt.py | Full training/eval/quantization pipeline for the record (includes optional TTT block). |
| records/track_10min_16mb/2026-03-21_MatchSOTA_TTT/submission.json | Record metadata (author, score, hardware, etc.). |
| records/track_10min_16mb/2026-03-21_MatchSOTA_TTT/README.md | Narrative description of the submission and reproduction command. |
```python
compiled_logits_ttt = torch.compile(base_model.forward_logits, dynamic=False) if use_compile else base_model.forward_logits
# Warmup
ttt_eval_sl = args.train_seq_len
warmup_x = torch.zeros(args.eval_batch_seqs, ttt_eval_sl, dtype=torch.int64, device=device)
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    _ = compiled_logits_ttt(warmup_x)
ttt_val_loss, ttt_val_bpb = eval_val_sliding(
    compiled_logits_ttt, rank, world_size, device,
    val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
    ttt_eval_sl, args.eval_stride, eval_batch_seqs=args.eval_batch_seqs,
)
else:
    ttt_val_loss, ttt_val_bpb = eval_val(
        args, base_model, rank, world_size, device, grad_accum_steps,
        val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
```
The full-weight TTT eval path is currently broken: use_compile and val_tokens_eval are undefined, and the eval_val_sliding(...) call uses a different signature than the function defined earlier in this file. Enabling TTT_ENABLED=1 will raise at runtime; please wire this block to the existing val_tokens/eval_val_sliding(args, base_model, ...) API (or add the missing variables and an adapter function) so the advertised TTT mode runs end-to-end.
Suggested change:

```diff
-compiled_logits_ttt = torch.compile(base_model.forward_logits, dynamic=False) if use_compile else base_model.forward_logits
-# Warmup
-ttt_eval_sl = args.train_seq_len
-warmup_x = torch.zeros(args.eval_batch_seqs, ttt_eval_sl, dtype=torch.int64, device=device)
-with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
-    _ = compiled_logits_ttt(warmup_x)
-ttt_val_loss, ttt_val_bpb = eval_val_sliding(
-    compiled_logits_ttt, rank, world_size, device,
-    val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
-    ttt_eval_sl, args.eval_stride, eval_batch_seqs=args.eval_batch_seqs,
-)
-else:
-    ttt_val_loss, ttt_val_bpb = eval_val(
-        args, base_model, rank, world_size, device, grad_accum_steps,
-        val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+# Use the standard sliding-window eval API with the TTT-adapted model
+ttt_val_loss, ttt_val_bpb = eval_val_sliding(
+    args, base_model, rank, world_size, device,
+    val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+)
+else:
+    # Fallback to the standard non-sliding eval API
+    ttt_val_loss, ttt_val_bpb = eval_val(
+        args, base_model, rank, world_size, device, grad_accum_steps,
+        val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
```
```python
"""
The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.

Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
```
The module docstring states a hard stop of 1500 lines for train_gpt.py, but this file is currently ~1541 lines long. Either split/refactor to get back under the stated limit, or update/remove the hard-stop text so it matches the actual policy for record scripts.
Suggested change:

```diff
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
+Guideline: To keep these scripts readable for newcomers, try to keep `train_gpt.py` and `train_gpt_mlx.py` reasonably small and avoid unnecessary complexity, even if they occasionally grow beyond roughly 1500 lines.
```
# FarnsworthEngine-class: 11L + Full-Weight SGD TTT + Custom Kernel Pipeline

## Summary

Combines an 11-layer transformer with the full competitive stack and full-weight SGD test-time training. This submission also introduces a **custom Triton/CUDA kernel pipeline** (via Makora automated generation) targeting fused attention glue ops, MLP activation, and eval-time acceleration — a direction no other submission has explored.

**val_bpb: PENDING (run in progress)**

## Architecture & Techniques

| Component | Details |
|-----------|---------|
| **Layers** | 11 transformer layers, 512 dim, 8 heads, 4 KV heads (GQA) |
| **MLP** | 3x expansion (hidden=1536), ReLU² activation |
| **Quantization** | Int6 mixed precision (MLP+attention int6, embeddings fp16) |
README claims an 11-layer model and shows val_bpb: PENDING, but this record’s train_gpt.py defaults to 9 layers (NUM_LAYERS=9) and submission.json reports val_bpb: 1.1401. Please reconcile the README with the actual configuration/results in this record (or update the code/metadata to match the README).
```json
"author": "Anthony Maio",
"github_id": "anthony-maio",
"val_bpb": 1.1401,
"track": "10min_16mb",
"num_gpus": 8,
"gpu_type": "H100 SXM",
"training_time_seconds": 600,
"compressed_model_bytes": null,
"code_bytes": null,
"total_artifact_bytes": null,
"description": "9L MLP3x + SmearGate + BigramHash 2048 + int6+zstd + SWA + Muon WD=0.04 + NTK-RoPE 50k + OrthoInit + sliding window eval stride=64. Custom Triton/CUDA kernel pipeline in development.",
```
submission.json schema is inconsistent with other records in records/track_10min_16mb (which use keys like name, blurb, val_loss, bytes_total, bytes_code). Using compressed_model_bytes/code_bytes/total_artifact_bytes: null may break tooling that expects the established fields; please align this file to the existing record format or update the repository tooling/schema accordingly.
Suggested change:

```diff
-"author": "Anthony Maio",
-"github_id": "anthony-maio",
-"val_bpb": 1.1401,
-"track": "10min_16mb",
-"num_gpus": 8,
-"gpu_type": "H100 SXM",
-"training_time_seconds": 600,
-"compressed_model_bytes": null,
-"code_bytes": null,
-"total_artifact_bytes": null,
-"description": "9L MLP3x + SmearGate + BigramHash 2048 + int6+zstd + SWA + Muon WD=0.04 + NTK-RoPE 50k + OrthoInit + sliding window eval stride=64. Custom Triton/CUDA kernel pipeline in development.",
+"name": "MatchSOTA_TTT",
+"author": "Anthony Maio",
+"github_id": "anthony-maio",
+"track": "10min_16mb",
+"num_gpus": 8,
+"gpu_type": "H100 SXM",
+"training_time_seconds": 600,
+"val_loss": 1.1401,
+"bytes_total": null,
+"bytes_code": null,
+"blurb": "9L MLP3x + SmearGate + BigramHash 2048 + int6+zstd + SWA + Muon WD=0.04 + NTK-RoPE 50k + OrthoInit + sliding window eval stride=64. Custom Triton/CUDA kernel pipeline in development.",
```
```python
if master_process:
    with open("final_model.int8.ptz", "wb") as f:
        f.write(quant_blob)
    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
    code_bytes = len(code.encode("utf-8"))
    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")

if distributed:
    dist.barrier()
with open("final_model.int8.ptz", "rb") as f:
```
This artifact is written to final_model.int8.ptz and the logs mention int8+zlib, but the code is doing mixed int6 quantization and may use zstd depending on availability. Consider renaming the output filename/log strings to reflect the actual quantization/compressor to avoid confusion when comparing artifacts.
Suggested change:

```diff
-if master_process:
-    with open("final_model.int8.ptz", "wb") as f:
-        f.write(quant_blob)
-    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
-    code_bytes = len(code.encode("utf-8"))
-    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
-    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
-if distributed:
-    dist.barrier()
-with open("final_model.int8.ptz", "rb") as f:
+quant_filename = f"final_model.int6.{_COMPRESSOR}.ptz"
+if master_process:
+    with open(quant_filename, "wb") as f:
+        f.write(quant_blob)
+    quant_file_bytes = os.path.getsize(quant_filename)
+    code_bytes = len(code.encode("utf-8"))
+    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
+    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
+if distributed:
+    dist.barrier()
+with open(quant_filename, "rb") as f:
```
Makora-generated persistent-CTA kernel fuses relu² + second matmul into a single Triton launch during eval. First matmul stays on cuBLAS. Active only during eval (not self.training) to preserve autograd. Called 9x per forward pass during sliding window eval. Expected ~10% eval time reduction (190s → ~170s), freeing eval budget. Falls back to PyTorch when Triton unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FA3 (flash_attn_func) uses Hopper-native TMA/WGMMA for 75-85% GPU utilization vs FA2's ~60%. Expected to cut step time from ~108ms to ~85ms, yielding ~7,000 steps in 10 min. Falls back to F.scaled_dot_product_attention when FA3 unavailable. Also includes fused ReLU² MLP Triton kernel (1.26x during eval). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
int5 for MLP weights (largest tensors, ~60% of params), int6 for attention weights. Expected compression: 19.1MB × (5/6 for MLP portion) ≈ 15.9MB. Also restores NUM_LAYERS=11 as default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous int5-MLP + int6-attention produced 16.56MB (556KB over). Switching all large matrices to int5 should save ~700KB more. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
int5-all was 16.27MB (340KB over). MLP is ~60% of params. int4 MLP + int5 attention should save ~500KB more. Expected: ~15.8MB artifact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int4 MLP was too aggressive (0.028 bpb penalty). Int5-all on 11L was 340KB over. 10L at int5 should be ~14.8MB — safe margin. 10L is faster (~100ms vs 115ms) = more steps = compensates for one fewer layer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int5 was penalizing bpb by ~0.015-0.026. 9L with int6 fits at 15.9MB. QUANT_BITS env var allows int5 for 11L when needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
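The bit-width ladder in these commits comes down to simple size arithmetic. A rough estimator, assuming the stated ~60% MLP parameter share and ignoring embeddings, quantization metadata, and compression (all simplifying assumptions, not repo facts):

```python
def artifact_mb(n_params, frac_mlp, bits_mlp, bits_attn):
    """Rough pre-compression artifact size (MB) for mixed-precision
    quantization: MLP weights at one bit width, everything else at another."""
    bits = n_params * (frac_mlp * bits_mlp + (1 - frac_mlp) * bits_attn)
    return bits / 8 / 1e6
```

For example, an illustrative ~25.5M params at int5 MLP + int6 attention comes to roughly 17.2 MB before compression, which is why compressor behavior (and QAT that reduces weight entropy) ends up deciding what actually fits under 16 MB.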
Research finding: setting warmdown higher than total steps makes LR decay from step 1, compacting weight magnitudes continuously. This reduces int6 quant penalty from ~0.014 to ~0.005 bpb. Our 1.1401 result used warmdown=3000 on ~4800 steps (63% warmdown) while our 1.1518 used warmdown=1500 on ~7400 steps (20% warmdown) — the higher warmdown fraction gave better post-quant quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
From arXiv:2603.09078. Projects out the self-value component from attention output, forcing the network to use contextual information. Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique. Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
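One reading of the XSA idea, subtracting each position's attention weight on itself times its own value so the output must come from context, can be sketched as follows. This is my interpretation of "projects out the self-value component", not code from the PR, and it omits the GQA-aware view reshape:

```python
import torch

def xsa_attention(q, k, v):
    """Softmax attention, then remove the self-value term a_ii * v_i
    from each position's output."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    out = attn @ v
    self_w = attn.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)  # a_ii per query
    return out - self_w * v
```

With sequence length 1, all attention mass sits on the token itself, so the output is exactly zero: the layer can only contribute via context.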
Previous int6 produced 19.0MB on 11L. Int5 should give ~15.8MB. Late QAT STE clusters weights near the int5 grid during training, so the quality penalty should be much smaller than without QAT. val_bpb=1.1309 achieved with int6 (artifact too big). Int5+QAT should preserve most of that while fitting under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full stack: 11 layers, XSA on last 4, Partial RoPE 16/64, Late QAT STE, Tight SWA (scale<0.2), GPTQ-lite clip search, LN Scale 1/sqrt(i+1), FA3, MLP3x, SmearGate, BigramHash 2048, int5+zstd, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. 4,832 steps at 117ms/step on slow pod. On 80ms pod: 1.1309 (invalid artifact). With fast pod + int5: expected ~1.13 valid. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5,660 steps at 101ms/step. Full stack: 11L, XSA, Partial RoPE, Late QAT STE, Tight SWA (7 checkpoints), GPTQ-lite, LN Scale, FA3, MLP3x, SmearGate, BigramHash, int5+zstd, Muon WD, OrthoInit. openai#1 on merged leaderboard. Beats thwu1 (1.1428) by 0.003. On faster pods (80ms): 1.1309 achieved (invalid artifact with int6). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: updated with actual 1.1399 results, removed TTT/PENDING claims - submission.json: aligned to repo schema (name, blurb, bytes_total) - train_gpt.py: fixed docstring line count claim, renamed artifact file, fixed misleading int8+zlib log string to reflect actual int5+compressor - Addresses all 5 Copilot review comments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Fused RMSNorm (fwd+bwd): replaces F.rms_norm in Block.forward for both attn_norm and mlp_norm. Saves rstd for backward. Called 22x per step (2 per block × 11 blocks). 2. Fused ReLU² MLP backward: fuses (grad_out @ proj_weight) * relu_deriv into single Triton kernel, eliminating [M, 1536] HBM intermediate. Called 11x per step backward pass. Both fall back to PyTorch when Triton unavailable. Expected: 13-15ms/step savings on 100ms baseline = 13-15% speedup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
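The second fusion has a simple unfused reference. For y = relu(h)**2 @ W, the gradient with respect to h is (grad_out @ W.T) * 2*relu(h); the Triton kernel fuses the matmul and the elementwise factor to avoid materializing the [M, 1536] intermediate in HBM. A PyTorch check of the math (reference only, not the kernel):

```python
import torch

def relu2_mlp_backward_ref(grad_out, h, proj_weight):
    """Unfused reference for the fused ReLU^2 MLP backward: gradient of
    y = relu(h)**2 @ W with respect to the pre-activation h."""
    return (grad_out @ proj_weight.T) * (2.0 * torch.relu(h))
```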
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Detached .to(x.dtype) copies broke gradient chain to fp32 params. Fix: pass raw fp32 params to Function, cast inside forward, return .float() gradients in backward. 2. Grid capped at num_sms*4 but kernel isn't persistent — tiles beyond cap were never computed, leaving grad_h uninitialized. Fix: launch all tiles (remove min cap). Both kernels re-enabled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom Triton kernels add 38ms/step overhead vs torch.compile baseline. The Inductor compiler already fuses RMSNorm and MLP operations effectively on H100. Custom kernels remain in codebase for future optimization but are disabled for the competition submission. Kernel code is correct (no NaN after bug fixes) but slower than compiled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full reproducibility log showing end-to-end training + eval pipeline. 5,205 steps at 108ms/step. Note: this particular run's artifact was 16.46MB (462KB over limit) due to pod variance in SWA averaging. Our submitted score of 1.1399 comes from a run with valid 15.79MB artifact on a faster pod (101ms/step, 5,660 steps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom binary packing stores 4 int6 values in 3 bytes (6 bits each) instead of wasting 2 bits per value with int8 storage. This reduces raw artifact size by 25%, which combined with zstd-22 compression should fit 11L models under 16MB with int6 precision. Int6 has ~0.015 bpb less quantization penalty than int5, so this change should improve our score from ~1.14 to ~1.125 while keeping artifacts under the 16MB limit. Also switches QUANT_BITS default from 5 back to 6 since packed format eliminates the size constraint that forced int5. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
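The 4-values-in-3-bytes layout is plain bit packing; a sketch of one possible layout (the record's exporter may order bits differently):

```python
def pack_int6(vals):
    """Pack unsigned 6-bit values (0..63): every 4 values become 3 bytes."""
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        a, b, c, d = vals[i:i + 4]
        out += ((a << 18) | (b << 12) | (c << 6) | d).to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: 3 bytes back to 4 six-bit values."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals
```

Eight values take 6 bytes instead of 8, the stated 25% raw saving; the packed bytes still carry the full 6-bit entropy, which is what the compressor has to contend with.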
Packed int6 + zstd-22 produces 20.2MB artifacts — still over 16MB. The extra entropy per int6 value (64 states vs 32 for int5) doesn't compress away. The competition's int6 fits via aggressive QAT that clusters weights near grid points, reducing entropy. Our QAT isn't aggressive enough yet. Keep int5 as default (15.79MB, valid). Packed int6 code is preserved for future use when QAT improves. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Removed unused eval_val_ttt_lora function and all TTT helper functions (_reset_ttt_optimizer, _build_ttt_optimizer, _find_docs, _compute_chunk_window, _accumulate_bpb) — none were called in the scored config - Removed broken full-weight SGD TTT block that used undefined variables (use_compile, val_tokens_eval) — Copilot flagged this as a runtime crash - TTT work continues on the separate submission/reproduce-414 branch - Scored config unchanged: 11L, int5+zstd, 1.1399 bpb, 15.79MB artifact Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closing this PR in favor of a new submission based on the PR #414 lineage. This submission served as our initial exploration (custom kernels, depth recurrence, TTT debugging), but the scored config (1.1399 bpb) is no longer competitive with the current frontier (~1.12x). The techniques explored here (fused Triton kernels, autograd-compatible training kernels, systematic TTT debugging) are documented in the commit history for anyone interested. The new submission will reproduce #414's stack with additional improvements (LeakyReLU(0.5)², VRL, FA3 Hopper) and include 3-seed significance testing as required.
Sub-1.0 bpb! Multi-order n-gram backoff (2-7gram) with entropy-adaptive alpha mixing on top of our 1.1229 neural base. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB. Seed 1337: 0.9640 | Seed 42: 0.9641 | Seed 2025: 0.9644 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Superseded by new submission with n-gram backoff results.
Summary
val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovation: Multi-Order N-gram Backoff Cache
Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.
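A minimal sketch of such a cache, assuming hashed contexts and per-order count tables (the hashing and table layout here are illustrative; the 4M-bucket and min_count=2 figures follow this write-up):

```python
from collections import defaultdict

NUM_BUCKETS = 4_000_000  # hash buckets per order, per the write-up

def make_tables(max_order=7):
    """One hashed count table per n-gram order (2..max_order)."""
    return {n: defaultdict(lambda: defaultdict(int)) for n in range(2, max_order + 1)}

def update(tables, history, token):
    """Causal update: insert an already-scored token under each context order."""
    for n, table in tables.items():
        if len(history) >= n - 1:
            ctx = hash(tuple(history[-(n - 1):])) % NUM_BUCKETS
            table[ctx][token] += 1

def backoff_predict(tables, history, token, min_count=2):
    """Highest matching order wins; raw count ratio, no smoothing."""
    for n in sorted(tables, reverse=True):
        if len(history) < n - 1:
            continue
        counts = tables[n].get(hash(tuple(history[-(n - 1):])) % NUM_BUCKETS)
        if counts and sum(counts.values()) >= min_count:
            return counts.get(token, 0) / sum(counts.values())
    return None
```

Because tokens are inserted only after they have been scored, the cache never sees the future, and it costs nothing in the artifact since it is rebuilt during eval.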
Entropy-Adaptive Alpha: `alpha = 0.05 + 0.55 * sigmoid(2*(H-4))`. Low alpha when the neural model is confident, high alpha when uncertain.

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
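The entropy gate above, as a sketch (constants taken from the formula in this write-up; `mix` is an illustrative helper for blending per-token probabilities):

```python
import math

def adaptive_alpha(entropy_bits):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)): small when the neural
    model is confident (low entropy H), larger when it is uncertain."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))

def mix(p_neural, p_ngram, entropy_bits):
    """Blend neural and n-gram probabilities with the entropy-gated alpha."""
    a = adaptive_alpha(entropy_bits)
    return (1.0 - a) * p_neural + a * p_ngram
```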
Compliance: Score-first — every token is scored under `torch.inference_mode()` before any table update. N-gram tables are built from already-scored tokens only. No training data access during eval. No oracle selection.

Training Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
Credits
Test plan
🤖 Generated with Claude Code