
RFC: A framework for deciding the n-gram question #886

Open
abaybektursun wants to merge 6 commits into openai:main from abaybektursun:nonrecord/eval-time-model-growth-study

Conversation


@abaybektursun abaybektursun commented Mar 26, 2026

📄 Full article with charts: abay.tech/posts/eval-time-model-growth


Summary

We ran eval-time n-gram caching on our ValCalib GPTQ base model (PR #728). Strict causality — only uses already-scored tokens. BPB drops from 1.11 to 0.38. Zero artifact bytes. But the hash tables that make it work grow to 256 MB during evaluation. The "16 MB model" becomes 272 MB by the time it finishes scoring.

We ran the experiments, wrote up the numbers, and landed on a question we think the competition needs to answer: should eval-time state count toward the size limit?

Results

Single GPU (stride=64, FineWeb val, 62M tokens):

| Config | BPB | Eval-time state | Effective model |
| --- | --- | --- | --- |
| Base LM (int6 quantized, leaderboard) | 1.1142 | 0 MB | 16 MB |
| Base LM (float, pre-quant) | 1.1109 | 0 MB | 16 MB |
| N-gram only (no base LM) | 1.0615 | 192 MB | 192 MB |
| Backoff 2-7, α=0.40 | 0.4923 | 192 MB | 208 MB |
| Backoff 2-9, order-adaptive | 0.3779 | 256 MB | 272 MB |
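For reference on the metric: BPB is the summed negative log-likelihood of the scored tokens, converted to bits, divided by the raw byte count of the text. A minimal helper (function name and toy numbers are mine, not from the eval harness):

```python
import math

def bits_per_byte(token_nlls_nats, n_bytes):
    """BPB = total NLL converted from nats to bits, divided by raw bytes scored."""
    total_bits = sum(token_nlls_nats) / math.log(2)
    return total_bits / n_bytes

# Toy check: 4 tokens each costing ln(2) nats (i.e. 1 bit) over 4 bytes of text
bpb = bits_per_byte([math.log(2)] * 4, n_bytes=4)
```

Note the denominator is bytes, not tokens, which is why tokenizer choice does not let a submission game the metric.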

8×H100 with all-reduce sync (first three under 600s budget):

| Config | BPB | Time | Sync overhead |
| --- | --- | --- | --- |
| Base LM | 1.1130 | 110s | n/a |
| Backoff 2-7, α=0.40 | 0.4941 | 401s | 1.6s |
| Backoff 2-9, α=0.40 | 0.4548 | 500s | 1.9s |
| Backoff 2-7, α=0.80 | 0.3942 | 939s | ~2.0s |

Prior entries (PRs #727, #788) used independent per-GPU caches — each GPU sees 1/8 of the data. They scored ~0.91–0.97 BPB. We synced hash table deltas across GPUs via all-reduce. Cost: 1.6 seconds total. That closed a 0.50 BPB gap.
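The sync is just an element-wise sum of per-GPU count deltas. A toy single-process sketch of what `dist.all_reduce(delta, op=SUM)` computes (bucket count, token counts, and function names here are illustrative, not the actual harness):

```python
import numpy as np

N_BUCKETS = 1 << 16
N_GPUS = 8
TOKENS_PER_GPU = 10_000

def local_delta(rank: int) -> np.ndarray:
    """Hash-table count increments accumulated on one GPU's 1/8 data shard."""
    rng = np.random.default_rng(rank)
    delta = np.zeros(N_BUCKETS, dtype=np.int64)
    hits = rng.integers(0, N_BUCKETS, size=TOKENS_PER_GPU)
    np.add.at(delta, hits, 1)  # duplicate bucket indices accumulate correctly
    return delta

deltas = [local_delta(r) for r in range(N_GPUS)]
# The all-reduce: after it, every GPU holds the same global counts and
# effectively sees all 8 shards instead of 1/8 of the data.
global_counts = np.sum(deltas, axis=0)
```

Because the payload is one fixed-size int tensor per sync, the cost stays in the low seconds regardless of corpus size, which is consistent with the 1.6s overhead reported above.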

The reported BPB scores are inflated

Update: our original explanation of the collision mechanism was incomplete. Credit to @Eppie (comment) for identifying the probability validity issue, and to Mirco on Discord for the P(cache_bin) formulation that clarified the mechanism.

We swept hash table sizes from 1M to 256M buckets:

| Buckets | BPB | Table memory |
| --- | --- | --- |
| 1M | 0.5793 | 48 MB |
| 4M | 0.6535 | 192 MB |
| 64M | 1.0629 | 3 GB |
| 256M | 1.1123 | 12 GB |

256M buckets (near collision-free) performs no better than no cache. 1M buckets (maximum collisions) gives the best BPB. Why?

The hash ratio `full_table[hash(ctx, tok)] / ctx_table[hash(ctx)]` is not a conditional probability. The two tables use different hash functions mapping to the same number of buckets. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly populated buckets approaches 1.0. This is `P(cache_bin)` — a collision-aggregated hash ratio, not `P(tok | ctx)`.

The blend `(1-α) * p_model + α * P(cache_bin)` with `P(cache_bin) ≈ 1.0` mechanically pushes the correct token's probability up toward `p_model + α*(1 - p_model)`. NLL drops. BPB looks great. But the blend is only computed for the correct token — the other 1023 tokens are never checked. If you computed `P(cache_bin)` for all tokens, each would also be ~1.0 (same collision dynamics). The distribution would sum to far more than 1. After renormalization, the n-gram contribution washes out.

The 1-bucket extreme makes this obvious: `P(cache_bin) = T/T = 1.0` for every lookup. With `α = 1`, BPB = 0. Perfect score. Obviously wrong.
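The washout is easy to verify numerically. A toy sketch with a uniform base model (α and the 1024-token vocab are from the setup above; the uniform `p_model` is mine for illustration):

```python
import math

ALPHA = 0.40
V = 1024
p_model = [1.0 / V] * V   # toy uniform base model
p_cache = 1.0             # P(cache_bin) ≈ 1.0 for every token under heavy collisions

blended = [(1 - ALPHA) * p + ALPHA * p_cache for p in p_model]
total = sum(blended)   # (1 - α) + α·V ≈ 410.2 — not a probability distribution

nll_point = -math.log2(blended[0])           # point-evaluated correct-token "score"
nll_valid = -math.log2(blended[0] / total)   # after renormalizing all 1024 entries
```

`nll_point` comes out around 1.3 bits, which looks like a huge win; `nll_valid` is back to the uniform model's 10 bits, because every entry was inflated by the same amount.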

The reported BPB numbers from n-gram caching are not achievable by a valid compressor. The improvement is primarily a measurement artifact from point-evaluating an invalid probability distribution. With collision-free tables and proper normalization, n-grams would provide at most a modest improvement from genuine corpus repetition — not the 0.49 or 0.38 BPB being reported.

The real-world problem

The competition limits training to 10 minutes on 8×H100. That limit exists to keep things accessible — real-world training isn't limited. Companies train for weeks on thousands of GPUs.

Inference is different. In deployment, inference is genuinely constrained:

  • You rent one GPU. Maybe a T4. Maybe CPU-only. The point of a 16 MB model is that you don't need expensive hardware. You're serving hundreds of users on that machine. Every megabyte matters.
  • Users wait milliseconds, not minutes. 50–200ms for the first token. The model predicts from a cold start on whatever the user sends. There's no 62M-token corpus to build a cache over.
  • Queries don't repeat. Cooking recipes, Python errors, legal text. The cache starts empty each time. A few hundred tokens of context gives the n-gram table almost nothing to work with.
  • 256 MB per session doesn't scale. A thousand concurrent users means 256 GB of hash tables. The artifact size IS the model size in production.
  • Inference speed matters. The n-gram cache adds K hash lookups and K table updates per token across every order. In our experiments, this roughly doubles eval time (606s → 1,079s for backoff 2-7). The overhead is constant — it doesn't get worse as the cache fills — but a flat 2× slowdown matters when your latency budget is 50–200ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff.

The competition gives evaluation 8×H100 and 600 seconds for the full 62M-token corpus. 640 GB of VRAM. Enough time to build statistical models from the scored data that would be useless in any real deployment.

| | Competition | Deployment |
| --- | --- | --- |
| Training compute | Limited (fairness) | Unlimited |
| Inference compute | Effectively unlimited* | Limited (economics) |
| Inference hardware | 8× H100, 640 GB | 1 GPU or CPU |
| Inference time | 600s for 62M tokens | <200ms per request |
| Inference speed | 2× slowdown absorbed over 62M tokens | 2× slowdown felt on every request |
| Eval-time memory | Unconstrained | Shared across users |
| Corpus | Fixed, repetitive | Independent queries |

*10-minute wall clock, but 640 GB VRAM is effectively unlimited for a 16 MB model.

Training constraints are artificial. Inference constraints would be realistic. The competition constrains the phase where resources are abundant and leaves unconstrained the phase where they're scarce.

A model that scores 0.38 BPB on the benchmark but 1.11 BPB in deployment is not a better model. It's a better test-taker.

Where the line is

Eval-time model growth is already happening at approved scales:

| Technique | Eval-time state | Status |
| --- | --- | --- |
| KV cache (sliding window) | ~20 MB | Uncontroversial |
| Score-first TTT (PRs #549, #548) | ~2 MB | Technique deemed legal |
| Per-doc LoRA TTT, 8 epochs (PR #596) | ~2 MB | Technique deemed legal |
| N-gram cache (backoff 2-7) | 192 MB | Under review |
| N-gram cache (backoff 2-9, 64M buckets) | 4 GB | Under review |

The n-gram cache does the same thing as TTT — builds state from scored tokens. The difference is scale. 2 MB vs 192 MB. Causality isn't the question. The question is whether unbounded eval-time growth fits the spirit of a 16 MB size constraint.

Two suggestions

@0hq @cocohearts @valerio-oai — the n-gram question is hard because the rules distinguish "artifact" from "eval-time behavior," but nobody anticipated techniques that grow the effective model 17× during evaluation. The rules were written for a world where the artifact is the model.

Two ways to resolve it, either or both:

1. Cap auxiliary eval-time state

A subtlety worth flagging: "cap total GPU memory" doesn't work. A 16 MB int6+compressed artifact decompresses into ~50–100 MB of bf16 weights in VRAM. Add activations, KV cache, CUDA overhead, and the base model alone uses several hundred MB. Any naive GPU memory cap would be exceeded before any n-gram tables are allocated.

The right thing to constrain is auxiliary state: tensors that accumulate across the evaluation and are not derivable from the artifact alone.

Constrained (auxiliary):

  • N-gram hash tables (192–256 MB) — built from scored tokens
  • TTT LoRA deltas (~2 MB) — built from scored tokens
  • Any state that persists across batches and grows with the corpus

Not constrained (infrastructure):

  • Model weights (deterministic decompression of the artifact)
  • KV cache (recomputed each sliding window, does not accumulate)
  • Activations (transient, discarded after each forward pass)

A cap on auxiliary state preserves everything currently approved (TTT LoRA at ~2 MB) while constraining techniques that grow the effective model by 10–250×. Enforcement: sum the sizes of all non-model tensors that persist across batches.
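The enforcement rule above reduces to a simple audit. A toy sketch (the class, tensor names, 8 MB tables, and the specific cap value are all illustrative; the exact cap is a design choice):

```python
import numpy as np

CAP_BYTES = 32 << 20   # placeholder cap on auxiliary eval-time state

class EvalState:
    """Toy eval-loop state. Weights are infrastructure (derivable from the
    artifact); anything registered as auxiliary persists across batches."""
    def __init__(self):
        self.model_weights = np.zeros(4 << 20, dtype=np.uint8)  # stand-in weights
        self.auxiliary = {}  # non-model tensors that survive between batches

    def aux_bytes(self) -> int:
        # Enforcement: sum the sizes of all persistent non-model tensors.
        return sum(t.nbytes for t in self.auxiliary.values())

state = EvalState()
# Hypothetical n-gram count tables built from already-scored tokens:
state.auxiliary["ngram_full"] = np.zeros(1 << 20, dtype=np.int64)  # 8 MB
state.auxiliary["ngram_ctx"] = np.zeros(1 << 20, dtype=np.int64)   # 8 MB
over_cap = state.aux_bytes() > CAP_BYTES
```

Note the audit deliberately ignores `model_weights`: decompressed weights, activations, and the KV cache are infrastructure, so the base model never trips the cap regardless of its in-VRAM footprint.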

2. Cap per-token overhead

Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass on the same hardware. Not an absolute number — a ratio. Hardware-agnostic and easy to measure: run eval with and without the technique.

Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×) — well over. KV cache, TTT, LoRA are all within 1.5×. This also catches two-pass rescoring mechanically.
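The check is mechanical to apply. A minimal sketch (function name and constant are mine; timings are from the 8×H100 table above):

```python
def overhead_ratio(t_with_technique_s: float, t_base_s: float) -> float:
    """Ratio of eval wall-clock with the technique vs. the bare base model,
    measured on the same hardware."""
    return t_with_technique_s / t_base_s

MAX_RATIO = 1.5

base_s = 110    # Base LM on 8xH100
ngram_s = 401   # Backoff 2-7, α=0.40

within_budget = overhead_ratio(ngram_s, base_s) <= MAX_RATIO  # fails at ~3.6x
```

Because it is a ratio, the same rule applies unchanged whether eval runs on 8×H100 or a single consumer GPU.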

The exact numbers (32 MB, 1.5×) are design choices. The principle is what matters: eval-time constraints should reflect real-world inference constraints.

What do you think?

Test plan

  • README.md with full study and results
  • submission.json with metadata
  • Single-GPU experiments (EXP-0): 7 configs
  • 8-GPU all-reduce experiments (EXP-11): 4 configs + alpha sweep
  • Bucket sweep (EXP-5): 1M to 256M — collisions inflate `P(cache_bin)` rather than improving accuracy
  • Maintainer input on eval-time state policy

🤖 Generated with Claude Code

abaybektursun and others added 2 commits March 26, 2026 14:03
Study of eval-time n-gram caching — a technique that reduces BPB from
1.11 to 0.38 while preserving strict causality, costing zero artifact
bytes, but growing the effective model to 17x the artifact limit.

Includes single-GPU ablations, 8-GPU all-reduce results (0.49 BPB in
401s, under 600s budget), alpha sweep, and a comparison of competition
eval setup vs real-world inference constraints. Proposes four rule
clarifications to align the competition with deployment realities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun marked this pull request as ready for review March 26, 2026 19:10
@abaybektursun abaybektursun changed the title Non-record: Your 16 MB model is 272 MB at eval time RFC: The leaderboard is optimizing for compression, not language modeling Mar 26, 2026
@abaybektursun abaybektursun changed the title RFC: The leaderboard is optimizing for compression, not language modeling RFC: A framework for deciding the n-gram question Mar 26, 2026
@robinojw

This puts a clear name on something I've been navigating by feel: the line between approved eval-time learning and unbounded model growth is quantitative, not qualitative, and right now nobody knows where it is.
A concrete ruling would help. The 64 MB cap proposed here seems right: it preserves everything currently approved and gives competitors a budget to engineer against instead of guessing at intent.

abaybektursun and others added 3 commits March 26, 2026 17:07
- Base model is ValCalib GPTQ (1.1142 BPB), not PR openai#549 (1.1194)
- Remove stale "not yet deployed" / "we estimate" for EXP-11
- Note α=0.80 (939s) exceeds 600s budget
- Fix PR openai#727 score to 0.9674, PR openai#788 to 0.9059
- Fix PR openai#596 BPB to 0.6430
- "Approved" → "Technique deemed legal" for closed PRs
- Add bucket sweep and per-token overhead proposal
- Replace "neural" with "base LM" throughout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Decompressed model weights alone exceed any naive GPU memory cap.
The right constraint is auxiliary state: tensors that accumulate
during eval and are not derivable from the artifact (hash tables,
TTT deltas). Not model weights, KV cache, or activations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EthanYangTW

Strong support for this RFC. The fact that a frequency table with zero training beat every trained model proves this benchmark is measuring dataset memorization, not language-modeling quality. We've been pushing base-LM improvements (GPTQ, QAT, even novel architectures), and it's demoralizing to see lookup tables dominate.

@abaybektursun
Contributor Author

Correction: our original explanation of why hash collisions help was wrong. Credit to @Eppie (#677 comment) for identifying the probability validity issue, and to Mirco on Discord for the P(cache_bin) formulation.

Our bucket sweep data is correct, but the mechanism is different from what we originally described. The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability — it's a collision-aggregated ratio that approaches 1.0 as tables fill. The blend inflates the correct token's probability without renormalizing the other 1023 tokens. The BPB improvement is primarily a measurement artifact from point-evaluating an invalid distribution, not from useful statistical estimation.

PR body and README updated to reflect this.

Credit to @Eppie and Mirco (Discord) for the correct formulation.
The hash ratio is not a conditional probability — it approaches 1.0
as collision-aggregated counts fill both tables proportionally. The
BPB improvement is a measurement artifact from point-evaluating an
invalid distribution, not from useful statistical estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>