
[Closed] Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722) #972

Closed
Idan3011 wants to merge 1 commit into openai:main from Idan3011:normalized-ngram

Conversation


@Idan3011 Idan3011 commented Mar 27, 2026

Note: This PR is closed. Further testing confirmed that the initial n-gram gains were a collision artifact caused by a denominator error; once properly normalized, val_bpb degrades to ~1.23-1.51.


Normalized N-gram + Bayesian First-Match + Pre-Enrichment + XSA

val_bpb: 0.3922 (full-vocab 1024-token normalized n-gram, Bayesian first-match, fixed 0.5 blend)
Sliding window: 1.1478 | 14.94 MB | 8×H100 SXM, 600s


Progress

Version   v1      v2      v3      v4      v5      v6      v7      v8      v9      v10     v11 (this)
val_bpb   1.1855  1.1709  1.1668  1.1629  1.0689  0.9784  0.9408  0.9393  0.2995  0.2722  0.3922

v11 is intentionally higher than v10. I replaced standard single-token scoring with full-vocab 1024-token normalized distributions. The 0.12 bpb increase measures the collision premium: the portion of the n-gram gain that came from inflated pseudo-probabilities rather than genuine statistical signal.


Key Contributions

  • Full-Vocab 1024-Token Normalized Scoring: For each scored position and each n-gram order, look up counts for all 1024 vocabulary tokens and normalize to sum to 1.0, instead of computing a single pair_count / ctx_count ratio for only the target token.
  • Vectorized [chunk, 1024] gather per order: GPU stays saturated.
  • First-match-wins backoff: Orders 11→10→...→2, highest match wins.
  • Bayesian First-Match with Neural Prior: p_local = (raw_correct + beta * p_neural) / (ctx_count + beta) with beta=2.0. Neural prior contributes 2 pseudo-counts. Low-evidence contexts smoothed toward neural prediction rather than overfit to sparse counts.
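A minimal, dict-backed sketch of the scoring path above: first-match-wins backoff over orders 11→2, with the neural prediction supplying beta=2.0 pseudo-counts. The function names and toy Python tables here are illustrative; the PR itself gathers [chunk, 1024] slices from hashed tensors on the GPU.

```python
import numpy as np
from collections import defaultdict

VOCAB, BETA, BLEND = 1024, 2.0, 0.5
ORDERS = range(11, 1, -1)            # orders 11 -> 2, first match wins

# Toy dict-backed count tables: counts[n][ctx] is a length-VOCAB vector of
# next-token counts for an (n-1)-token context. (Illustrative layout only.)
counts = {n: defaultdict(lambda: np.zeros(VOCAB)) for n in ORDERS}

def update(tokens, pos):
    """Record the (context, next-token) pair at `pos` for every order."""
    for n in ORDERS:
        if pos >= n - 1:
            counts[n][tuple(tokens[pos - (n - 1):pos])][tokens[pos]] += 1

def score(tokens, pos, p_neural):
    """Blend the neural distribution with a Bayesian first-match n-gram one."""
    p_local = p_neural               # fallback: no n-gram evidence at all
    for n in ORDERS:
        if pos < n - 1:
            continue
        raw = counts[n].get(tuple(tokens[pos - (n - 1):pos]))
        if raw is not None:
            ctx_count = raw.sum()
            # Neural prior contributes BETA pseudo-counts, so sparse contexts
            # are smoothed toward the neural prediction instead of overfit.
            p_local = (raw + BETA * p_neural) / (ctx_count + BETA)
            break                    # highest matching order wins
    return BLEND * p_local + (1 - BLEND) * p_neural  # fixed 0.5 blend
```

Because `p_neural` sums to 1 and `raw` sums to `ctx_count`, `p_local` is a proper distribution at every backoff order, which is the point of the full-vocab normalization.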

Collision Premium Analysis
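The collision premium can be made concrete with a toy example (the table size and the deliberately collision-heavy hash below are illustrative assumptions, not the PR's actual hashing): single-token pair_count / ctx_count scoring reads shared hash slots more than once, so the implied per-context probability mass exceeds 1, and normalizing over the full vocab removes exactly that excess.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, TABLE = 1024, 512           # toy sizes: table smaller than the vocab
slot = lambda tok: tok % TABLE     # deliberately collision-heavy toy hash

counts = np.zeros(TABLE)
total = 500
for tok in rng.integers(0, VOCAB, total):  # observed next-tokens, one context
    counts[slot(tok)] += 1

# Single-token scoring: pair_count / ctx_count per target token. Every slot
# is shared by two tokens (t and t+512), so the implied "distribution"
# carries each observation twice.
pair = counts[np.arange(VOCAB) % TABLE]
pseudo = pair / total
print(pseudo.sum())                # 2.0: double-counted collision mass

# Full-vocab normalization removes the inflated mass.
p = pair / pair.sum()              # sums to ~1.0 by construction
```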


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048×128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997) on GPU: 37% faster training (64.7ms vs 101ms/step).
  • XSA (Exclusive Self Attention) on Last 4 Layers: Removes self-value bias via orthogonal projection.
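On the EMA item: the speedup comes from keeping the shadow weights on the GPU and updating them in place (in PyTorch, roughly `ema.mul_(decay).add_(p, alpha=1 - decay)` per tensor, with no host transfers). A NumPy stand-in for the update rule, assuming decay 0.997 as stated:

```python
import numpy as np

DECAY = 0.997  # decay stated in the PR

def ema_update(ema_params, params, decay=DECAY):
    """In-place EMA step: ema <- decay * ema + (1 - decay) * param.
    Performed on-device with no CPU round-trips, this is where the
    reported step-time saving comes from."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

# Against a fixed target w, the EMA closes the gap geometrically:
# ema_k = w - (w - ema_0) * decay**k.
```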

What Didn't Work (on valid distributions)

  • Hedge mixer: Online learned weights worse than hand-tuned alpha (0.3265).
  • Learned mixer head (Linear 512→11): Didn't generalize from training to eval (0.3310).
  • TTT (AdamW, score-first): Destroyed quantized weights (0.3528).
  • Dirichlet mixing: Needs high counts, wrong for incremental cache (0.3171).

Compliance

  • Score-first: N-gram cache updated AFTER scoring each chunk.
  • Backward-looking: Cache at position p contains only tokens 0..p-1.
  • Normalized distributions: N-gram probabilities computed across all 1024 tokens.
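The score-first and backward-looking rules amount to a chunked loop that scores before it ingests; `score_chunk` and `update_chunk` are placeholder names for this sketch, not the PR's API:

```python
def evaluate(tokens, chunk_size, score_chunk, update_chunk):
    """Chunked eval honoring the compliance rules: each chunk is scored
    against a cache built only from tokens before it, and the cache is
    updated only AFTER the chunk has been scored."""
    total_bits, n = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_bits += score_chunk(chunk, start)  # sees tokens[0:start] only
        update_chunk(chunk, start)               # ingest after scoring
        n += len(chunk)
    return total_bits / max(n, 1)                # average bits per token
```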

Reproduction

python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py

Key Metrics

Metric                       Value
val_bpb (normalized n-gram)  0.3922
Sliding window val_bpb       1.1478
Training time                600,031 ms
Eval time                    193,472 ms
Artifact size                14,942,971 bytes
Model parameters             25,254,992

@Idan3011 Idan3011 force-pushed the normalized-ngram branch 2 times, most recently from 68dfd02 to a999142 Compare March 27, 2026 18:44
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request Mar 27, 2026
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing the earlier 0.3922 result came from the flawed normalization path.

Made-with: Cursor

AnirudhRahul commented Mar 27, 2026

#978
^Reran this and I think the bpb results are off because your target distribution wasn't normalized correctly

@Idan3011 Idan3011 closed this Mar 27, 2026
@Idan3011 Idan3011 deleted the normalized-ngram branch March 27, 2026 19:58
@Idan3011 Idan3011 changed the title Normalized N-gram + Bayesian First-Match (val_bpb 0.3922) /// Mar 31, 2026
@Idan3011 Idan3011 changed the title /// closed Mar 31, 2026
@Idan3011 Idan3011 changed the title closed [Closed] Mar 31, 2026
@Idan3011 Idan3011 changed the title [Closed] [Closed] Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722) Mar 31, 2026