Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#889
anthony-maio wants to merge 2 commits into openai:main from
Conversation
Sub-1.0 bpb via multi-order n-gram backoff (2-7gram) with entropy-adaptive alpha mixing. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new record submission for track_10min_16mb showcasing a multi-order (2–7) n-gram backoff cache combined with VRL + LeakyReLU², along with reproducibility artifacts and metadata.
Changes:
- Added training/eval script implementing n-gram backoff evaluation and model architecture used for the record.
- Added attached training logs for multiple seeds and a README describing results/compliance/repro steps.
- Added submission metadata JSON for the record entry.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py | Training + evaluation script including sliding-window eval and n-gram backoff cache. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed42.log | Attached run log for seed 42 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed1337.log | Attached run log for seed 1337 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/submission.json | Record metadata (val_bpb/val_loss/bytes, hardware, etc.). |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/README.md | Human-readable summary of the method, results, compliance, and reproduction steps. |
```python
all_tokens = val_tokens.cpu().numpy().astype(np.int32)
scored_up_to = my_windows[0] if my_windows else 0
ngram_helped = 0
ngram_total = 0
```
In distributed n-gram eval, each rank's cache starts at scored_up_to = my_windows[0], so ranks whose first window does not start at 0 never fold the earlier (globally previous) tokens into their cache. This makes the n-gram backoff results depend on world_size/window partitioning rather than matching a single causal pass over the validation stream. To make the cache behavior consistent with a global score-first causal ordering, either (a) initialize each rank's cache with the prefix tokens up to the first position it will score (i.e., update the cache over [0, first_scored_pos) before scoring), or (b) run the n-gram backoff evaluation on a single rank (rank 0) and skip the distributed aggregation for that phase.
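A minimal, self-contained sketch of option (a). A toy bigram counter stands in for the real backoff cache; `BigramCache`, `eval_rank`, and all names here are illustrative, not the PR's actual code:

```python
from collections import Counter


class BigramCache:
    """Toy stand-in for the n-gram cache: counts bigrams causally."""

    def __init__(self):
        self.counts = Counter()

    def update(self, tokens):
        for a, b in zip(tokens, tokens[1:]):
            self.counts[(a, b)] += 1

    def score_and_update(self, tokens):
        # Score first (here: count bigrams already seen), then fold
        # the just-scored window into the cache.
        helped = sum((a, b) in self.counts for a, b in zip(tokens, tokens[1:]))
        self.update(tokens)
        return helped


def eval_rank(all_tokens, my_windows, window_len):
    cache = BigramCache()
    first_scored_pos = my_windows[0] if my_windows else 0
    # Option (a): warm the cache on the prefix this rank skips,
    # WITHOUT scoring those tokens, so cache contents match a
    # single global causal pass.
    cache.update(all_tokens[:first_scored_pos])
    return sum(cache.score_and_update(all_tokens[s:s + window_len])
               for s in my_windows)
```

With the prefix warm-up, a rank whose first window starts mid-stream sees the same cache state it would in a single-rank causal pass over the scored positions.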
```python
probs = torch.exp(log_probs)
entropy = -(probs * log_probs).sum(dim=-1)
```
This computes and materializes both log_probs and probs for the full [B, T, V] tensor, which is large and increases peak memory/bandwidth. You can compute entropy directly from log_probs without keeping a separate probs tensor (e.g., using log_probs.exp() inline) to reduce memory pressure.
Suggested change:
```diff
-probs = torch.exp(log_probs)
-entropy = -(probs * log_probs).sum(dim=-1)
+entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
```
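The two forms are numerically identical; the suggestion only avoids keeping a second full-size tensor alive in a named variable. A NumPy analogue (function names illustrative) that checks the equivalence:

```python
import numpy as np


def entropy_two_tensors(log_probs):
    # Original pattern: a separate probs array of the same [B, T, V]
    # shape stays alive for the whole expression.
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


def entropy_inline(log_probs):
    # Suggested pattern: the exp() temporary is consumed immediately
    # by the product and can be freed right away.
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)
```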
```python
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
usable = ((tokens.numel() - 1) // seq_len) * seq_len
if usable <= 0:
    raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
```
The error message hardcodes TRAIN_SEQ_LEN even though the function parameter is seq_len (and the caller may pass a validation/eval seq length). Consider changing the message to refer to seq_len (or EVAL_SEQ_LEN when applicable) to avoid confusion when debugging validation setup.
Suggested change:
```diff
-    raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    raise ValueError(f"Validation split is too short for seq_len={seq_len}")
```
```markdown
|| Seed | step_avg | steps | Pre-ngram bpb | **Post-ngram bpb** | ng_helped | Artifact |
||------|----------|-------|--------------|-------------------|-----------|----------|
|| 1337 | 88.7ms | 6,765 | 1.1225 | **0.9640** | 38.5% | 15,981,848 |
|| 42 | 88.6ms | 6,772 | 1.1224 | **0.9641** | 38.6% | 15,904,632 |
|| 2025 | 88.6ms | 6,776 | 1.1231 | **0.9644** | 38.6% | 15,974,308 |
|| **Mean** | **88.6ms** | **6,771** | **1.1227** | **0.9642 (std 0.0002)** | **38.6%** | |
```
The results table rows start with ||, which renders as an extra empty column in standard Markdown table syntax. Use a single leading | per row so the table formats correctly on GitHub.
| "val_bpb": 0.9642, | ||
| "val_loss": 1.6279, | ||
| "bytes_total": 15953596, |
bytes_total appears to be an average across seeds (it doesn’t match the per-seed totals shown in the attached logs). If submission.json is meant to describe a specific submitted artifact, it should use the exact bytes_total (and ideally the exact val_loss/val_bpb) for that chosen artifact; otherwise consider adding explicit fields indicating these values are 3-seed means.
| "val_bpb": 0.9642, | |
| "val_loss": 1.6279, | |
| "bytes_total": 15953596, | |
| "val_bpb_mean_3seed": 0.9642, | |
| "val_loss_mean_3seed": 1.6279, | |
| "bytes_total_mean_3seed": 15953596, |
Summary
val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovation: Multi-Order N-gram Backoff Cache
Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.
Entropy-Adaptive Alpha:
alpha = 0.05 + 0.55 * sigmoid(2*(H-4)). Neural-confident → alpha ≈ 0.05; neural-uncertain → alpha ≈ 0.60.
Multi-Order Backoff (2-7gram):
Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
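The entropy-adaptive mixing weight is a direct transcription of the formula above; a minimal sketch (function name illustrative):

```python
import math


def adaptive_alpha(entropy_nats):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)): low-entropy (confident)
    # neural predictions get alpha near 0.05, high-entropy ones near 0.60.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_nats - 4.0)))
```

At H = 4 nats the sigmoid is 0.5, so alpha sits at the midpoint 0.325; the mix then interpolates between neural-dominated and n-gram-dominated regimes.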
Compliance: Score-first — every token scored before any table update. N-gram tables built from already-scored tokens only. No training data access during eval. No oracle selection.
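The backoff-plus-score-first discipline described above can be sketched as follows. This is a toy illustration, not the PR's code: exact dictionaries stand in for the 4M hash buckets, and `ngram_prob` / `score_then_update` are hypothetical names.

```python
from collections import defaultdict

ORDERS = range(2, 8)   # 2-grams through 7-grams
MIN_COUNT = 2          # min_count gate from the description

# order -> context tuple -> next-token counts (exact dicts, not hashed buckets)
tables = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}


def ngram_prob(history, token):
    # Highest matching order wins; back off to shorter contexts.
    for n in sorted(ORDERS, reverse=True):
        ctx = tuple(history[-(n - 1):])
        if len(ctx) < n - 1:
            continue
        counts = tables[n].get(ctx)
        if counts and sum(counts.values()) >= MIN_COUNT:
            # Raw count ratio, no smoothing.
            return counts.get(token, 0) / sum(counts.values())
    return None  # no order matched: neural model alone


def score_then_update(tokens):
    probs = []
    for i, tok in enumerate(tokens):
        probs.append(ngram_prob(tokens[:i], tok))   # score first...
        for n in ORDERS:                            # ...then update tables
            if i >= n - 1:
                ctx = tuple(tokens[i - n + 1:i])
                tables[n][ctx][tok] += 1
    return probs
```

Because every token is scored before any table update, the cache only ever reflects already-scored tokens, matching the stated compliance rule.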
Training Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
Credits