Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)#688

Closed
RoyiRa wants to merge 4 commits into openai:main from RoyiRa:submission-2026-03-24

Conversation


@RoyiRa RoyiRa commented Mar 25, 2026

Summary

3-seed mean val_bpb: 1.0745 (std 0.021) | <15.5 MB | 8xH100 SXM, 600s

Results

| Seed | Pre-TTT BPB | Post-TTT BPB | Artifact |
|------|-------------|--------------|----------|
| 1337 | 1.1248 | 1.0560 | 15.48 MB |
| 42 | 1.1257 | 1.0970 | 15.41 MB |
| 7 | 1.1251 | 1.0704 | 15.43 MB |
| **Mean** | 1.1252 | 1.0745 | |

Key Technique: 5-expert Logistic Context Mixer

GPU-vectorized online context mixing using the Hedge algorithm. Five experts blend predictions in log-probability space during TTT eval:

| Expert | Source |
|--------|--------|
| Neural | Base model log-softmax |
| Unigram | Token frequency from scored tokens |
| Bigram | P(next \| prev) from scored tokens |
| Trigram | Hashed P(next \| prev2, prev1) with 64K buckets |
| Entropy | Neural model entropy as confidence regularizer |

The n-gram tables are built incrementally from already-scored tokens only (i.e., no lookahead). Expert weights are updated online via the Hedge rule: `log_w -= eta * loss`.
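A hashed trigram table of the kind described above can be sketched as follows. The 64K bucket count comes from the expert table; the hash function, add-alpha smoothing, and helper names are illustrative assumptions, not the PR's actual code:

```python
# Minimal sketch of a hashed trigram table with 64K buckets, built
# only from tokens that have already been scored (no lookahead).
# Hash constant and smoothing are illustrative, not from the PR.
NUM_BUCKETS = 65536

def trigram_bucket(prev2, prev1):
    # Hash the (prev2, prev1) context pair into a fixed bucket range.
    return (prev2 * 1000003 + prev1) % NUM_BUCKETS

counts = {}          # (bucket, next_token) -> count
context_totals = {}  # bucket -> total count for that context

def update(prev2, prev1, nxt):
    """Record one already-scored (prev2, prev1, next) transition."""
    b = trigram_bucket(prev2, prev1)
    counts[(b, nxt)] = counts.get((b, nxt), 0) + 1
    context_totals[b] = context_totals.get(b, 0) + 1

def prob(prev2, prev1, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed P(next | prev2, prev1); sums to 1 over vocab."""
    b = trigram_bucket(prev2, prev1)
    c = counts.get((b, nxt), 0)
    tot = context_totals.get(b, 0)
    return (c + alpha) / (tot + alpha * vocab_size)
```

With add-alpha smoothing the per-context distribution normalizes over the full vocabulary, which is the normalization property debated in the review comments below.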

Each expert produces an NLL for every token. The mixer maintains one learned weight per expert, updated via the Hedge algorithm. At each position, the mixed prediction is:
`mixed_NLL = -log(sum_k w_k * exp(-NLL_k))`
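The mixing and update rules above can be sketched as follows. This is a minimal illustration: `eta`, the weight normalization details, and the expert NLL inputs are placeholders, not the PR's actual hyperparameters or vectorized implementation:

```python
# Sketch of Hedge mixing in log-probability space.
# expert_nlls[t, k] is the NLL assigned to token t by expert k.
import numpy as np

def hedge_mix(expert_nlls, eta=0.1):
    """Return per-token mixed NLLs and the final expert weights."""
    T, K = expert_nlls.shape
    log_w = np.zeros(K)  # uniform weights in log space
    mixed = np.empty(T)
    for t in range(T):
        # Normalize weights (shift by max for numerical stability).
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
        mixed[t] = -np.log(np.sum(w * np.exp(-expert_nlls[t])))
        # Hedge update: penalize each expert by its own loss.
        log_w -= eta * expert_nlls[t]
    w = np.exp(log_w - log_w.max())
    return mixed, w / w.sum()
```

Because the mixed probability is a convex combination of the expert probabilities, the mixed NLL always lies between the best and worst expert NLL at each position, and weight mass shifts toward experts with lower cumulative loss.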

Training Budget

GPTQ calibration runs within the 600s training budget (18s reserved).

| Phase | Time |
|-------|------|
| Training loop | 582s |
| EMA + GPTQ calibration + quantization | ~18s |
| **Total training** | ~600s |
| TTT eval with mixer | ~562s |

Reproduction

```shell
pip install -r requirements.txt
SEED=1337 MAX_WALLCLOCK_SECONDS=600 USE_MIXER=1 TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 25, 2026
Key differences from our failed SGD TTT:
- AdamW(lr=1e-4) instead of SGD(lr=0.002) — 20x lower LR
- Only last 2 blocks unfrozen (9 frozen) — protects VRL gates
- Polyak weight averaging (decay=0.998) for scoring stability
- Cosine LR decay across chunks
- Score with Polyak weights, train with live weights

PR openai#688 gets -0.05 bpb with this recipe. Our SGD TTT was diverging
because we used 20x higher LR and unfroze all blocks including VRL gates.

TTT_ENABLED=1 to activate. Default: AdamW lr=1e-4, 3 epochs,
9 frozen blocks, Polyak decay=0.998.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
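The recipe listed in the commit message above can be sketched structurally as follows. The `decay=0.998` and `base_lr=1e-4` values come from the message; the parameter-dict representation and function names are illustrative assumptions, not the actual training code:

```python
# Sketch of the commit's scoring-stability pieces: Polyak (EMA)
# weight averaging and cosine LR decay across chunks. Parameters are
# modeled as plain dicts for illustration only.
import math

def polyak_update(avg_params, live_params, decay=0.998):
    """EMA of parameters: avg <- decay * avg + (1 - decay) * live.
    Scoring uses avg_params; training updates live_params."""
    return {k: decay * avg_params[k] + (1 - decay) * live_params[k]
            for k in avg_params}

def cosine_lr(step, total_steps, base_lr=1e-4):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```

The split between live weights (trained) and Polyak weights (scored) is what the last bullet of the commit message refers to: the EMA smooths out per-chunk noise in the TTT updates before the model is evaluated.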
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches: they do not correctly renormalize/reweight the LM's token distribution, and they look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the Issues tab for more details, and please submit more runs in the future!


Eppie commented Mar 28, 2026

@valerio-oai
Technically, this one has a slightly different problem than the majority of the n-gram cache submissions. I believe it does normalize correctly over the full vocabulary for its n-gram caches. The entropy expert is the actual problem: it assigns the same implicit probability exp(-H) to every token in the vocabulary, which is very unlikely to sum to 1.
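This point can be checked numerically. Assuming the entropy expert assigns probability exp(-H) to all V tokens as described above, the total mass is V * exp(-H), which equals 1 only when H = log(V), i.e. when the model's distribution is exactly uniform:

```python
# Check of the normalization failure described above: if every vocab
# token gets probability exp(-H(p)), the total mass V * exp(-H) is 1
# only for a uniform p. Function name is illustrative.
import numpy as np

def entropy_expert_mass(probs):
    """Total probability mass when every token receives exp(-H(p))."""
    p = probs / probs.sum()
    H = -np.sum(p * np.log(p))
    return len(p) * np.exp(-H)
```

For a uniform distribution over V tokens, H = log(V) and the mass is exactly 1; for any peaked distribution, H < log(V) and the mass exceeds 1, so the expert's implicit distribution is unnormalized.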


RoyiRa commented Mar 28, 2026

Thank you @Eppie for the correction, I should have been more careful!
