Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)#688

Closed
RoyiRa wants to merge 4 commits into openai:main from RoyiRa:submission-2026-03-24

Conversation


@RoyiRa RoyiRa commented Mar 25, 2026

Summary

3-seed mean val_bpb: 1.0745 (std 0.021) | <15.5 MB | 8xH100 SXM, 600s

Results

| Seed | Pre-TTT BPB | Post-TTT BPB | Artifact |
|------|-------------|--------------|----------|
| 1337 | 1.1248 | 1.0560 | 15.48 MB |
| 42 | 1.1257 | 1.0970 | 15.41 MB |
| 7 | 1.1251 | 1.0704 | 15.43 MB |
| **Mean** | 1.1252 | 1.0745 | |

Key Technique: 5-expert Logistic Context Mixer

GPU-vectorized online context mixing using the Hedge algorithm. Five experts blend predictions in log-probability space during TTT eval:

| Expert | Source |
|--------|--------|
| Neural | Base model log-softmax |
| Unigram | Token frequency from scored tokens |
| Bigram | P(next \| prev) from scored tokens |
| Trigram | Hashed P(next \| prev2, prev1) with 64K buckets |
| Entropy | Neural model entropy as confidence regularizer |

The n-gram tables are built incrementally from already-scored tokens only (i.e., no lookahead). Expert weights are updated online via the Hedge rule: `log_w -= eta * loss`.
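A hashed trigram table of the kind described above can be sketched as follows. The 64K bucket count comes from the expert table; the hash function, add-alpha smoothing, and helper names are illustrative assumptions, not the PR's actual code:

```python
# Minimal sketch of a hashed trigram table with 64K buckets, built
# only from tokens that have already been scored (no lookahead).
# Hash constant and smoothing are illustrative, not from the PR.
NUM_BUCKETS = 65536

def trigram_bucket(prev2, prev1):
    # Hash the (prev2, prev1) context pair into a fixed bucket range.
    return (prev2 * 1000003 + prev1) % NUM_BUCKETS

counts = {}          # (bucket, next_token) -> count
context_totals = {}  # bucket -> total count for that context

def update(prev2, prev1, nxt):
    """Record one already-scored (prev2, prev1, next) transition."""
    b = trigram_bucket(prev2, prev1)
    counts[(b, nxt)] = counts.get((b, nxt), 0) + 1
    context_totals[b] = context_totals.get(b, 0) + 1

def prob(prev2, prev1, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed P(next | prev2, prev1); sums to 1 over vocab."""
    b = trigram_bucket(prev2, prev1)
    c = counts.get((b, nxt), 0)
    tot = context_totals.get(b, 0)
    return (c + alpha) / (tot + alpha * vocab_size)
```

With add-alpha smoothing the per-context distribution normalizes over the full vocabulary, which is the normalization property debated in the review comments below.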

Each expert produces an NLL for every token. The mixer maintains one learned weight per expert, updated via the Hedge algorithm. At each position, the mixed prediction is:
`mixed_NLL = -log(sum_k w_k * exp(-NLL_k))`
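The mixing and update rules above can be sketched as follows. This is a minimal illustration: `eta`, the weight normalization details, and the expert NLL inputs are placeholders, not the PR's actual hyperparameters or vectorized implementation:

```python
# Sketch of Hedge mixing in log-probability space.
# expert_nlls[t, k] is the NLL assigned to token t by expert k.
import numpy as np

def hedge_mix(expert_nlls, eta=0.1):
    """Return per-token mixed NLLs and the final expert weights."""
    T, K = expert_nlls.shape
    log_w = np.zeros(K)  # uniform weights in log space
    mixed = np.empty(T)
    for t in range(T):
        # Normalize weights (shift by max for numerical stability).
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
        mixed[t] = -np.log(np.sum(w * np.exp(-expert_nlls[t])))
        # Hedge update: penalize each expert by its own loss.
        log_w -= eta * expert_nlls[t]
    w = np.exp(log_w - log_w.max())
    return mixed, w / w.sum()
```

Because the mixed probability is a convex combination of the expert probabilities, the mixed NLL always lies between the best and worst expert NLL at each position, and weight mass shifts toward experts with lower cumulative loss.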

Training Budget

GPTQ calibration runs within the 600s training budget (18s reserved).

| Phase | Time |
|-------|------|
| Training loop | 582s |
| EMA + GPTQ calibration + quantization | ~18s |
| **Total training** | ~600s |
| TTT eval with mixer | ~562s |

Reproduction

```shell
pip install -r requirements.txt
SEED=1337 MAX_WALLCLOCK_SECONDS=600 USE_MIXER=1 TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 25, 2026
Key differences from our failed SGD TTT:
- AdamW(lr=1e-4) instead of SGD(lr=0.002) — 20x lower LR
- Only last 2 blocks unfrozen (9 frozen) — protects VRL gates
- Polyak weight averaging (decay=0.998) for scoring stability
- Cosine LR decay across chunks
- Score with Polyak weights, train with live weights

PR openai#688 gets -0.05 bpb with this recipe. Our SGD TTT was diverging
because we used 20x higher LR and unfroze all blocks including VRL gates.

TTT_ENABLED=1 to activate. Default: AdamW lr=1e-4, 3 epochs,
9 frozen blocks, Polyak decay=0.998.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
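The recipe listed in the commit message above can be sketched structurally as follows. The `decay=0.998` and `base_lr=1e-4` values come from the message; the parameter-dict representation and function names are illustrative assumptions, not the actual training code:

```python
# Sketch of the commit's scoring-stability pieces: Polyak (EMA)
# weight averaging and cosine LR decay across chunks. Parameters are
# modeled as plain dicts for illustration only.
import math

def polyak_update(avg_params, live_params, decay=0.998):
    """EMA of parameters: avg <- decay * avg + (1 - decay) * live.
    Scoring uses avg_params; training updates live_params."""
    return {k: decay * avg_params[k] + (1 - decay) * live_params[k]
            for k in avg_params}

def cosine_lr(step, total_steps, base_lr=1e-4):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```

The split between live weights (trained) and Polyak weights (scored) is what the last bullet of the commit message refers to: the EMA smooths out per-chunk noise in the TTT updates before the model is evaluated.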
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches: they do not correctly renormalize/reweight the LM's token distribution, and they look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the Issues tab for more details, and please submit more runs in the future!


Eppie commented Mar 28, 2026

@valerio-oai
Technically, this one has a slightly different problem than the majority of the n-gram cache submissions. I believe it does normalize correctly over the full vocabulary for its n-gram caches. The entropy expert is the actual problem: it assigns the same implicit probability exp(-H) to every token in the vocabulary, which is very unlikely to sum to 1.
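This point can be checked numerically. Assuming the entropy expert assigns probability exp(-H) to all V tokens as described above, the total mass is V * exp(-H), which equals 1 only when H = log(V), i.e. when the model's distribution is exactly uniform:

```python
# Check of the normalization failure described above: if every vocab
# token gets probability exp(-H(p)), the total mass V * exp(-H) is 1
# only for a uniform p. Function name is illustrative.
import numpy as np

def entropy_expert_mass(probs):
    """Total probability mass when every token receives exp(-H(p))."""
    p = probs / probs.sum()
    H = -np.sum(p * np.log(p))
    return len(p) * np.exp(-H)
```

For a uniform distribution over V tokens, H = log(V) and the mass is exactly 1; for any peaked distribution, H < log(V) and the mass exceeds 1, so the expert's implicit distribution is unnormalized.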


RoyiRa commented Mar 28, 2026

Thank you @Eppie for the correction, I should have been more careful!
