PP12: Bayesian posterior packets + selective gating (1.1261 BPB)#1043

Open
okezue wants to merge 1 commit into openai:main from okezue:pp12-submission

Conversation


@okezue okezue commented Mar 28, 2026

Summary

  • Bayesian posterior packets distilled from training data, mixed with neural predictions via selective gating
  • Conjugate online updating combines training-time priors with eval-time counts
  • A selective gate prevents the quality degradation caused by naive mixing (0.0006 BPB improvement over pure neural TTT)
  • Built on the #549 stack (LeakyReLU² + Legal Score-First TTT + Parallel Muon, val_bpb 1.1194, 3-seed mean)
  • Observation: early TTT chunks reach 1.109 BPB (below SOTA) but drift to 1.126 by late chunks; a periodic reset is implemented but not yet validated
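The posterior-packet idea above can be sketched with a Dirichlet-multinomial conjugate update and an evidence-based gate. This is a minimal illustration, not the PR's implementation: the function name, the `gate_threshold`/`mix_weight` parameters, and the "gate on accumulated eval-time counts" rule are all assumptions.

```python
import numpy as np

def packet_predict(neural_probs, prior_counts, online_counts,
                   gate_threshold=2.0, mix_weight=0.3):
    """Mix neural next-token probabilities with a conjugate
    Dirichlet-multinomial posterior packet, gated so that
    low-evidence packets fall back to the pure neural prediction.
    Illustrative sketch; names and gating rule are assumptions."""
    # Conjugate online update: training-time pseudo-counts (the prior)
    # plus eval-time counts give the posterior concentration.
    posterior = prior_counts + online_counts
    evidence = online_counts.sum()

    # Selective gate: only mix once enough eval-time evidence has
    # accumulated; naive unconditional mixing can degrade quality.
    if evidence < gate_threshold:
        return neural_probs

    packet_probs = posterior / posterior.sum()
    mixed = (1 - mix_weight) * neural_probs + mix_weight * packet_probs
    return mixed / mixed.sum()

# Illustrative usage on a 3-token vocabulary:
neural = np.array([0.7, 0.2, 0.1])
prior = np.ones(3)                                      # uniform pseudo-counts
cold = packet_predict(neural, prior, np.zeros(3))       # gate closed
warm = packet_predict(neural, prior, np.array([5.0, 0.0, 0.0]))
```

With zero eval-time counts the gate keeps the neural prediction untouched; once counts accumulate, the mixture shifts probability toward tokens the packet has actually observed.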

Test plan

  • Verified on 8xH100 SXM (RunPod)
  • Packet eval improves over pure neural TTT (1.1267 → 1.1261)
  • Periodic TTT reset validation pending (compute grant requested)

@okezue okezue force-pushed the pp12-submission branch 2 times, most recently from 3f47370 to f754794 on March 28, 2026 23:14
@okezue okezue changed the title from "Non-record: PriorPacket-12 — Bayesian posterior packets + selective gating (1.1261 BPB)" to "PP12: Bayesian posterior packets + selective gating (1.1261 BPB)" Mar 28, 2026
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 29, 2026
PR openai#1043 found early TTT chunks achieve 1.109 BPB (below SOTA!)
but accumulated SGD updates cause drift to 1.126 by late chunks.

Fix: periodically reset model weights to the original checkpoint.
This prevents catastrophic drift while preserving local adaptation.

Implementation:
- TTT_RESET_EVERY=N: reset weights every N chunks (0=disabled)
- Resets both weights and optimizer momentum state
- Uses in-place copy (no reallocation, parameter references preserved)
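The reset mechanism described above can be sketched as follows. This is an illustrative stand-in, not the commit's code: the class and method names are invented, and plain numpy arrays stand in for model weights and momentum buffers.

```python
import numpy as np

class ResettableTTT:
    """Sketch of the TTT_RESET_EVERY mechanism: every N chunks,
    restore weights from the original checkpoint in place and
    zero the optimizer momentum. Names are illustrative."""

    def __init__(self, params, momentum, reset_every=0):
        self.params = params            # live weights, mutated by TTT SGD
        self.momentum = momentum        # optimizer momentum buffers
        self.reset_every = reset_every  # 0 = disabled
        # Snapshot of the original checkpoint, taken once up front.
        self._checkpoint = [p.copy() for p in params]
        self._chunks_seen = 0

    def end_of_chunk(self):
        self._chunks_seen += 1
        if self.reset_every and self._chunks_seen % self.reset_every == 0:
            # In-place copy: the parameter objects keep their identity,
            # so external references (e.g. the optimizer) stay valid.
            for live, saved in zip(self.params, self._checkpoint):
                live[...] = saved
            # Momentum is reset too, so a stale update direction from
            # the drifted weights does not carry over past the reset.
            for m in self.momentum:
                m[...] = 0.0

# Illustrative usage: drift the weights for two chunks, reset every 2.
w = [np.zeros(4)]
m = [np.zeros(4)]
ttt = ResettableTTT(w, m, reset_every=2)
w[0] += 1.0
ttt.end_of_chunk()                # chunk 1: no reset yet
w[0] += 1.0
m[0] += 0.5
ttt.end_of_chunk()                # chunk 2: reset fires
```

The in-place `live[...] = saved` assignment is the key detail: it restores values without reallocating the arrays, which is what keeps parameter references intact.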

H100 sweep now tests 11 configurations:
  6 temperatures × sliding eval
  5 TTT configs:
    A: SOTA baseline (lr=0.002, 3ep)
    B: PR openai#1039 (lr=0.0025, 4ep)
    C: 5 epochs (lr=0.002, 5ep)
    D: PR openai#1039 + reset/100 (anti-drift)
    E: PR openai#1039 + reset/50 (anti-drift)
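The sweep above could be expressed as a simple manifest. The structure mirrors the 11 configurations listed; the six temperature values are purely illustrative placeholders, not taken from the PR.

```python
# Hypothetical sweep manifest mirroring the 11 configurations above.
# Temperature values are illustrative assumptions, not from the PR.
TEMPERATURES = [0.80, 0.85, 0.90, 0.95, 1.00, 1.05]  # 6 sliding-eval runs

TTT_CONFIGS = [
    {"name": "A", "lr": 0.002,  "epochs": 3, "reset_every": 0},    # SOTA baseline
    {"name": "B", "lr": 0.0025, "epochs": 4, "reset_every": 0},    # PR openai#1039
    {"name": "C", "lr": 0.002,  "epochs": 5, "reset_every": 0},    # 5 epochs
    {"name": "D", "lr": 0.0025, "epochs": 4, "reset_every": 100},  # anti-drift
    {"name": "E", "lr": 0.0025, "epochs": 4, "reset_every": 50},   # anti-drift
]

n_runs = len(TEMPERATURES) + len(TTT_CONFIGS)
```

This makes the "11 configurations" count explicit: 6 temperature runs plus 5 TTT configurations.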

If early chunks consistently hit 1.109 and reset prevents drift,
the mean across all chunks could drop from 1.119 toward 1.110-1.114.
That's record territory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
