
[Closed] Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722) #972

Closed
Idan3011 wants to merge 1 commit into openai:main from Idan3011:normalized-ngram

Conversation


@Idan3011 Idan3011 commented Mar 27, 2026

Note: This PR is closed. Further testing confirmed that the initial n-gram gains were a collision artifact caused by a denominator error; once properly normalized, val_bpb degrades to ~1.23-1.51.


Normalized N-gram + Bayesian First-Match + Pre-Enrichment + XSA

val_bpb: 0.3922 (full-vocab 1024-token normalized n-gram, Bayesian first-match, fixed 0.5 blend)
Sliding window: 1.1478 | 14.94 MB | 8×H100 SXM, 600s


Progress

Version   v1      v2      v3      v4      v5      v6      v7      v8      v9      v10     v11 (this)
val_bpb   1.1855  1.1709  1.1668  1.1629  1.0689  0.9784  0.9408  0.9393  0.2995  0.2722  0.3922

v11 is intentionally higher than v10. I replaced standard single-token scoring with full-vocab 1024-token normalized distributions. The 0.12 bpb increase measures the collision premium: the portion of the n-gram gain that came from inflated pseudo-probabilities rather than genuine statistical signal.


Key Contributions

  • Full-Vocab 1024-Token Normalized Scoring: For each scored position and each n-gram order, look up counts for all 1024 vocabulary tokens and normalize to sum to 1.0, instead of computing a single pair_count / ctx_count ratio for only the target token.
  • Vectorized [chunk, 1024] gather per order: GPU stays saturated.
  • First-match-wins backoff: Orders 11→10→...→2, highest match wins.
  • Bayesian First-Match with Neural Prior: p_local = (raw_correct + beta * p_neural) / (ctx_count + beta) with beta=2.0. Neural prior contributes 2 pseudo-counts. Low-evidence contexts smoothed toward neural prediction rather than overfit to sparse counts.
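A minimal, dict-backed sketch of the scoring path above: first-match-wins backoff over orders 11→2, with the neural prediction supplying beta=2.0 pseudo-counts. The function names and toy Python tables here are illustrative; the PR itself gathers [chunk, 1024] slices from hashed tensors on the GPU.

```python
import numpy as np
from collections import defaultdict

VOCAB, BETA, BLEND = 1024, 2.0, 0.5
ORDERS = range(11, 1, -1)            # orders 11 -> 2, first match wins

# Toy dict-backed count tables: counts[n][ctx] is a length-VOCAB vector of
# next-token counts for an (n-1)-token context. (Illustrative layout only.)
counts = {n: defaultdict(lambda: np.zeros(VOCAB)) for n in ORDERS}

def update(tokens, pos):
    """Record the (context, next-token) pair at `pos` for every order."""
    for n in ORDERS:
        if pos >= n - 1:
            counts[n][tuple(tokens[pos - (n - 1):pos])][tokens[pos]] += 1

def score(tokens, pos, p_neural):
    """Blend the neural distribution with a Bayesian first-match n-gram one."""
    p_local = p_neural               # fallback: no n-gram evidence at all
    for n in ORDERS:
        if pos < n - 1:
            continue
        raw = counts[n].get(tuple(tokens[pos - (n - 1):pos]))
        if raw is not None:
            ctx_count = raw.sum()
            # Neural prior contributes BETA pseudo-counts, so sparse contexts
            # are smoothed toward the neural prediction instead of overfit.
            p_local = (raw + BETA * p_neural) / (ctx_count + BETA)
            break                    # highest matching order wins
    return BLEND * p_local + (1 - BLEND) * p_neural  # fixed 0.5 blend
```

Because `p_neural` sums to 1 and `raw` sums to `ctx_count`, `p_local` is a proper distribution at every backoff order, which is the point of the full-vocab normalization.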

Collision Premium Analysis
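The collision premium can be made concrete with a toy example (the table size and the deliberately collision-heavy hash below are illustrative assumptions, not the PR's actual hashing): single-token pair_count / ctx_count scoring reads shared hash slots more than once, so the implied per-context probability mass exceeds 1, and normalizing over the full vocab removes exactly that excess.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, TABLE = 1024, 512           # toy sizes: table smaller than the vocab
slot = lambda tok: tok % TABLE     # deliberately collision-heavy toy hash

counts = np.zeros(TABLE)
total = 500
for tok in rng.integers(0, VOCAB, total):  # observed next-tokens, one context
    counts[slot(tok)] += 1

# Single-token scoring: pair_count / ctx_count per target token. Every slot
# is shared by two tokens (t and t+512), so the implied "distribution"
# carries each observation twice.
pair = counts[np.arange(VOCAB) % TABLE]
pseudo = pair / total
print(pseudo.sum())                # 2.0: double-counted collision mass

# Full-vocab normalization removes the inflated mass.
p = pair / pair.sum()              # sums to ~1.0 by construction
```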


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048×128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997) on GPU: 37% faster training (64.7ms vs 101ms/step).
  • XSA (Exclusive Self Attention) on Last 4 Layers: Removes self-value bias via orthogonal projection.
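On the EMA item: the speedup comes from keeping the shadow weights on the GPU and updating them in place (in PyTorch, roughly `ema.mul_(decay).add_(p, alpha=1 - decay)` per tensor, with no host transfers). A NumPy stand-in for the update rule, assuming decay 0.997 as stated:

```python
import numpy as np

DECAY = 0.997  # decay stated in the PR

def ema_update(ema_params, params, decay=DECAY):
    """In-place EMA step: ema <- decay * ema + (1 - decay) * param.
    Performed on-device with no CPU round-trips, this is where the
    reported step-time saving comes from."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

# Against a fixed target w, the EMA closes the gap geometrically:
# ema_k = w - (w - ema_0) * decay**k.
```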

What Didn't Work (on valid distributions)

  • Hedge mixer: Online learned weights worse than hand-tuned alpha (0.3265).
  • Learned mixer head (Linear 512→11): Didn't generalize from training to eval (0.3310).
  • TTT (AdamW, score-first): Destroyed quantized weights (0.3528).
  • Dirichlet mixing: Needs high counts, wrong for incremental cache (0.3171).

Compliance

  • Score-first: N-gram cache updated AFTER scoring each chunk.
  • Backward-looking: Cache at position p contains only tokens 0..p-1.
  • Normalized distributions: N-gram probabilities computed across all 1024 tokens.
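The score-first and backward-looking rules amount to a chunked loop that scores before it ingests; `score_chunk` and `update_chunk` are placeholder names for this sketch, not the PR's API:

```python
def evaluate(tokens, chunk_size, score_chunk, update_chunk):
    """Chunked eval honoring the compliance rules: each chunk is scored
    against a cache built only from tokens before it, and the cache is
    updated only AFTER the chunk has been scored."""
    total_bits, n = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_bits += score_chunk(chunk, start)  # sees tokens[0:start] only
        update_chunk(chunk, start)               # ingest after scoring
        n += len(chunk)
    return total_bits / max(n, 1)                # average bits per token
```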

Reproduction

python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py

Key Metrics

Metric                       Value
val_bpb (normalized n-gram)  0.3922
Sliding window val_bpb       1.1478
Training time                600,031 ms
Eval time                    193,472 ms
Artifact size                14,942,971 bytes
Model parameters             25,254,992

@Idan3011 Idan3011 force-pushed the normalized-ngram branch 2 times, most recently from 68dfd02 to a999142 Compare March 27, 2026 18:44
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request Mar 27, 2026
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing the earlier 0.3922 result came from the flawed normalization path.

Made-with: Cursor

AnirudhRahul commented Mar 27, 2026

#978
^Reran this and I think the bpb results are off because your target distribution wasn't normalized correctly

@Idan3011 Idan3011 closed this Mar 27, 2026
@Idan3011 Idan3011 deleted the normalized-ngram branch March 27, 2026 19:58
@Idan3011 Idan3011 changed the title Normalized N-gram + Bayesian First-Match (val_bpb 0.3922) /// Mar 31, 2026
@Idan3011 Idan3011 changed the title /// closed Mar 31, 2026
@Idan3011 Idan3011 changed the title closed [Closed] Mar 31, 2026
@Idan3011 Idan3011 changed the title [Closed] [Closed] Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722) Mar 31, 2026