Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#889
anthony-maio wants to merge 2 commits into openai:main from
Conversation
Sub-1.0 bpb via multi-order n-gram backoff (2-7gram) with entropy-adaptive alpha mixing. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new record submission for track_10min_16mb showcasing a multi-order (2–7) n-gram backoff cache combined with VRL + LeakyReLU², along with reproducibility artifacts and metadata.
Changes:
- Added training/eval script implementing n-gram backoff evaluation and model architecture used for the record.
- Added attached training logs for multiple seeds and a README describing results/compliance/repro steps.
- Added submission metadata JSON for the record entry.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py | Training + evaluation script including sliding-window eval and n-gram backoff cache. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed42.log | Attached run log for seed 42 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed1337.log | Attached run log for seed 1337 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/submission.json | Record metadata (val_bpb/val_loss/bytes, hardware, etc.). |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/README.md | Human-readable summary of the method, results, compliance, and reproduction steps. |
```python
all_tokens = val_tokens.cpu().numpy().astype(np.int32)
scored_up_to = my_windows[0] if my_windows else 0
ngram_helped = 0
ngram_total = 0
```
In distributed n-gram eval, each rank's cache starts at scored_up_to = my_windows[0], so ranks whose first window does not start at 0 never fold the earlier (globally previous) tokens into their cache. This makes the n-gram backoff results depend on world_size/window partitioning rather than matching a single causal pass over the validation stream. To make the cache behavior consistent with a global score-first causal ordering, either (a) initialize each rank's cache with the prefix tokens up to the first position it will score (i.e., update the cache over [0, first_scored_pos) before scoring), or (b) run the n-gram backoff evaluation on a single rank (rank 0) and skip the distributed aggregation for that phase.
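A minimal, self-contained sketch of option (a). A toy bigram counter stands in for the real backoff cache; `BigramCache`, `eval_rank`, and all names here are illustrative, not the PR's actual code:

```python
from collections import Counter


class BigramCache:
    """Toy stand-in for the n-gram cache: counts bigrams causally."""

    def __init__(self):
        self.counts = Counter()

    def update(self, tokens):
        for a, b in zip(tokens, tokens[1:]):
            self.counts[(a, b)] += 1

    def score_and_update(self, tokens):
        # Score first (here: count bigrams already seen), then fold
        # the just-scored window into the cache.
        helped = sum((a, b) in self.counts for a, b in zip(tokens, tokens[1:]))
        self.update(tokens)
        return helped


def eval_rank(all_tokens, my_windows, window_len):
    cache = BigramCache()
    first_scored_pos = my_windows[0] if my_windows else 0
    # Option (a): warm the cache on the prefix this rank skips,
    # WITHOUT scoring those tokens, so cache contents match a
    # single global causal pass.
    cache.update(all_tokens[:first_scored_pos])
    return sum(cache.score_and_update(all_tokens[s:s + window_len])
               for s in my_windows)
```

With the prefix warm-up, a rank whose first window starts mid-stream sees the same cache state it would in a single-rank causal pass over the scored positions.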
```python
probs = torch.exp(log_probs)
entropy = -(probs * log_probs).sum(dim=-1)
```
This computes and materializes both log_probs and probs for the full [B, T, V] tensor, which is large and increases peak memory/bandwidth. You can compute entropy directly from log_probs without keeping a separate probs tensor (e.g., using log_probs.exp() inline) to reduce memory pressure.
Suggested change:
```diff
-probs = torch.exp(log_probs)
-entropy = -(probs * log_probs).sum(dim=-1)
+entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
```
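The two forms are numerically identical; the suggestion only avoids keeping a second full-size tensor alive in a named variable. A NumPy analogue (function names illustrative) that checks the equivalence:

```python
import numpy as np


def entropy_two_tensors(log_probs):
    # Original pattern: a separate probs array of the same [B, T, V]
    # shape stays alive for the whole expression.
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


def entropy_inline(log_probs):
    # Suggested pattern: the exp() temporary is consumed immediately
    # by the product and can be freed right away.
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)
```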
```python
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
usable = ((tokens.numel() - 1) // seq_len) * seq_len
if usable <= 0:
    raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
```
The error message hardcodes TRAIN_SEQ_LEN even though the function parameter is seq_len (and the caller may pass a validation/eval seq length). Consider changing the message to refer to seq_len (or EVAL_SEQ_LEN when applicable) to avoid confusion when debugging validation setup.
Suggested change:
```diff
-    raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    raise ValueError(f"Validation split is too short for seq_len={seq_len}")
```
```markdown
|| Seed | step_avg | steps | Pre-ngram bpb | **Post-ngram bpb** | ng_helped | Artifact |
||------|----------|-------|--------------|-------------------|-----------|----------|
|| 1337 | 88.7ms | 6,765 | 1.1225 | **0.9640** | 38.5% | 15,981,848 |
|| 42 | 88.6ms | 6,772 | 1.1224 | **0.9641** | 38.6% | 15,904,632 |
|| 2025 | 88.6ms | 6,776 | 1.1231 | **0.9644** | 38.6% | 15,974,308 |
|| **Mean** | **88.6ms** | **6,771** | **1.1227** | **0.9642 (std 0.0002)** | **38.6%** | |
```
The results table rows start with ||, which renders as an extra empty column in standard Markdown table syntax. Use a single leading | per row so the table formats correctly on GitHub.
| "val_bpb": 0.9642, | ||
| "val_loss": 1.6279, | ||
| "bytes_total": 15953596, |
bytes_total appears to be an average across seeds (it doesn’t match the per-seed totals shown in the attached logs). If submission.json is meant to describe a specific submitted artifact, it should use the exact bytes_total (and ideally the exact val_loss/val_bpb) for that chosen artifact; otherwise consider adding explicit fields indicating these values are 3-seed means.
| "val_bpb": 0.9642, | |
| "val_loss": 1.6279, | |
| "bytes_total": 15953596, | |
| "val_bpb_mean_3seed": 0.9642, | |
| "val_loss_mean_3seed": 1.6279, | |
| "bytes_total_mean_3seed": 15953596, |
Summary
val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovation: Multi-Order N-gram Backoff Cache
Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.
Entropy-Adaptive Alpha:
alpha = 0.05 + 0.55 * sigmoid(2*(H-4)). Neural-confident → alpha ≈ 0.05; neural-uncertain → alpha ≈ 0.60.
Multi-Order Backoff (2-7gram):
Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
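The entropy-adaptive mixing weight is a direct transcription of the formula above; a minimal sketch (function name illustrative):

```python
import math


def adaptive_alpha(entropy_nats):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)): low-entropy (confident)
    # neural predictions get alpha near 0.05, high-entropy ones near 0.60.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_nats - 4.0)))
```

At H = 4 nats the sigmoid is 0.5, so alpha sits at the midpoint 0.325; the mix then interpolates between neural-dominated and n-gram-dominated regimes.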
Compliance: Score-first — every token scored before any table update. N-gram tables built from already-scored tokens only. No training data access during eval. No oracle selection.
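The backoff-plus-score-first discipline described above can be sketched as follows. This is a toy illustration, not the PR's code: exact dictionaries stand in for the 4M hash buckets, and `ngram_prob` / `score_then_update` are hypothetical names.

```python
from collections import defaultdict

ORDERS = range(2, 8)   # 2-grams through 7-grams
MIN_COUNT = 2          # min_count gate from the description

# order -> context tuple -> next-token counts (exact dicts, not hashed buckets)
tables = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}


def ngram_prob(history, token):
    # Highest matching order wins; back off to shorter contexts.
    for n in sorted(ORDERS, reverse=True):
        ctx = tuple(history[-(n - 1):])
        if len(ctx) < n - 1:
            continue
        counts = tables[n].get(ctx)
        if counts and sum(counts.values()) >= MIN_COUNT:
            # Raw count ratio, no smoothing.
            return counts.get(token, 0) / sum(counts.values())
    return None  # no order matched: neural model alone


def score_then_update(tokens):
    probs = []
    for i, tok in enumerate(tokens):
        probs.append(ngram_prob(tokens[:i], tok))   # score first...
        for n in ORDERS:                            # ...then update tables
            if i >= n - 1:
                ctx = tuple(tokens[i - n + 1:i])
                tables[n][ctx][tok] += 1
    return probs
```

Because every token is scored before any table update, the cache only ever reflects already-scored tokens, matching the stated compliance rule.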
Training Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
Credits