Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#889

Open
anthony-maio wants to merge 2 commits into openai:main from anthony-maio:submission/ngram-backoff-clean

Conversation

@anthony-maio

Summary

val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-ngram bpb | Post-ngram bpb | ng_helped | Artifact (bytes) |
|------|----------|-------|---------------|----------------|-----------|------------------|
| 1337 | 88.7ms | 6,765 | 1.1225 | 0.9640 | 38.5% | 15,981,848 |
| 42 | 88.6ms | 6,772 | 1.1224 | 0.9641 | 38.6% | 15,904,632 |
| 2025 | 88.6ms | 6,776 | 1.1231 | 0.9644 | 38.6% | 15,974,308 |
| Mean | 88.6ms | 6,771 | 1.1227 | 0.9642 (std 0.0002) | 38.6% | |

All artifacts under 16,000,000 bytes. All 3 train logs attached.

Key Innovation: Multi-Order N-gram Backoff Cache

Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.

Entropy-Adaptive Alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4)). Neural-confident → alpha≈0.05. Neural-uncertain → alpha≈0.60.
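The mixing weight above maps straightforwardly to code. A minimal sketch, assuming the entropy `H` is measured in nats per token (the function name is ours, not the submission's):

```python
import torch

def adaptive_alpha(H: torch.Tensor) -> torch.Tensor:
    """Entropy-adaptive n-gram mixing weight: alpha = 0.05 + 0.55 * sigmoid(2*(H-4)).

    Low entropy (confident neural model)  -> alpha near 0.05 (mostly neural probs).
    High entropy (uncertain neural model) -> alpha near 0.60 (lean on the n-gram cache).
    """
    return 0.05 + 0.55 * torch.sigmoid(2.0 * (H - 4.0))
```

The sigmoid is centered at H = 4 nats, so alpha crosses its midpoint (0.325) exactly where the neural model's per-token uncertainty passes that threshold.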

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.

Compliance: Score-first — every token scored before any table update. N-gram tables built from already-scored tokens only. No training data access during eval. No oracle selection.
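The score-first ordering amounts to a strict score-then-update loop. A sketch with hypothetical `score_fn`/`cache_update` hooks (names are ours):

```python
def eval_stream(tokens, score_fn, cache_update):
    """Score every token before it enters the n-gram tables (score-first)."""
    total_loss = 0.0
    for pos in range(1, len(tokens)):
        total_loss += score_fn(tokens, pos)  # sees tables built from tokens[:pos] only
        cache_update(tokens, pos)            # token joins the tables after scoring
    return total_loss
```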

Training Architecture

PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
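One plausible reading of the LeakyReLU(0.5)² activation named above, by analogy with the squared-ReLU MLPs common in these speedruns; the PR's exact form may differ:

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x: torch.Tensor) -> torch.Tensor:
    # Assumed form: square of LeakyReLU with negative slope 0.5.
    y = F.leaky_relu(x, negative_slope=0.5)
    return y * y
```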

Credits

anthony-maio and others added 2 commits March 26, 2026 15:12
Sub-1.0 bpb via multi-order n-gram backoff (2-7gram) with entropy-adaptive
alpha mixing. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 26, 2026 19:13

Copilot AI left a comment


Pull request overview

Adds a new record submission for track_10min_16mb showcasing a multi-order (2–7) n-gram backoff cache combined with VRL + LeakyReLU², along with reproducibility artifacts and metadata.

Changes:

  • Added training/eval script implementing n-gram backoff evaluation and model architecture used for the record.
  • Added attached training logs for multiple seeds and a README describing results/compliance/repro steps.
  • Added submission metadata JSON for the record entry.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py | Training + evaluation script including sliding-window eval and n-gram backoff cache. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed42.log | Attached run log for seed 42 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed1337.log | Attached run log for seed 1337 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/submission.json | Record metadata (val_bpb/val_loss/bytes, hardware, etc.). |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/README.md | Human-readable summary of the method, results, compliance, and reproduction steps. |


Comment on lines +974 to +977:

```python
all_tokens = val_tokens.cpu().numpy().astype(np.int32)
scored_up_to = my_windows[0] if my_windows else 0
ngram_helped = 0
ngram_total = 0
```

Copilot AI Mar 26, 2026


In distributed n-gram eval, each rank's cache starts at scored_up_to = my_windows[0], so ranks whose first window does not start at 0 will not include earlier (globally previous) tokens in their cache. This makes the n-gram backoff results depend on world_size/window partitioning rather than matching a single causal pass over the validation stream. To make the cache behavior consistent with a global score-first causal ordering, do one of the following: (a) initialize each rank's cache with the prefix tokens up to the first token position it will score (e.g., update the cache over [0, first_scored_pos) before scoring), or (b) run the n-gram backoff evaluation on a single rank (rank 0) and skip the distributed aggregation for that phase.
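Option (a) amounts to replaying the global prefix into the cache before scoring begins. A sketch with an assumed `cache.update(token)` interface (not the PR's actual API):

```python
def warm_cache(cache, all_tokens, first_scored_pos: int) -> None:
    """Replay tokens [0, first_scored_pos) so the rank-local cache matches
    a single causal pass over the validation stream."""
    for t in all_tokens[:first_scored_pos]:
        cache.update(int(t))
```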

Comment on lines +995 to +996:

```python
probs = torch.exp(log_probs)
entropy = -(probs * log_probs).sum(dim=-1)
```

Copilot AI Mar 26, 2026


This computes and materializes both log_probs and probs for the full [B, T, V] tensor, which is large and increases peak memory/bandwidth. You can compute entropy directly from log_probs without keeping a separate probs tensor (e.g., using log_probs.exp() inline) to reduce memory pressure.

Suggested change:

```diff
-probs = torch.exp(log_probs)
-entropy = -(probs * log_probs).sum(dim=-1)
+entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
```

```python
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
usable = ((tokens.numel() - 1) // seq_len) * seq_len
if usable <= 0:
    raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
```

Copilot AI Mar 26, 2026


The error message hardcodes TRAIN_SEQ_LEN even though the function parameter is seq_len (and the caller may pass a validation/eval seq length). Consider changing the message to refer to seq_len (or EVAL_SEQ_LEN when applicable) to avoid confusion when debugging validation setup.

Suggested change:

```diff
-raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+raise ValueError(f"Validation split is too short for seq_len={seq_len}")
```

Comment on lines +7 to +12:

```markdown
| Seed | step_avg | steps | Pre-ngram bpb | **Post-ngram bpb** | ng_helped | Artifact |
|------|----------|-------|--------------|-------------------|-----------|----------|
| 1337 | 88.7ms | 6,765 | 1.1225 | **0.9640** | 38.5% | 15,981,848 |
| 42 | 88.6ms | 6,772 | 1.1224 | **0.9641** | 38.6% | 15,904,632 |
| 2025 | 88.6ms | 6,776 | 1.1231 | **0.9644** | 38.6% | 15,974,308 |
| **Mean** | **88.6ms** | **6,771** | **1.1227** | **0.9642 (std 0.0002)** | **38.6%** | |
```

Copilot AI Mar 26, 2026


The results table rows start with ||, which renders as an extra empty column in standard Markdown table syntax. Use a single leading | per row so the table formats correctly on GitHub.

Comment on lines +9 to +11:

```json
"val_bpb": 0.9642,
"val_loss": 1.6279,
"bytes_total": 15953596,
```

Copilot AI Mar 26, 2026


bytes_total appears to be an average across seeds (it doesn’t match the per-seed totals shown in the attached logs). If submission.json is meant to describe a specific submitted artifact, it should use the exact bytes_total (and ideally the exact val_loss/val_bpb) for that chosen artifact; otherwise consider adding explicit fields indicating these values are 3-seed means.

Suggested change:

```diff
-"val_bpb": 0.9642,
-"val_loss": 1.6279,
-"bytes_total": 15953596,
+"val_bpb_mean_3seed": 0.9642,
+"val_loss_mean_3seed": 1.6279,
+"bytes_total_mean_3seed": 15953596,
```
