GolfStudent v2 14L: d=352, Value Residuals, GPTQ-lite, Schedule-Free, Muon+EMA #604

Open
whitestone1121-web wants to merge 12 commits into openai:main from whitestone1121-web:feat/alan-samaha-golf

@whitestone1121-web commented Mar 24, 2026

GolfStudent v2 — 16MB Hybrid LM

Architecture: d=352, L=14 (10 GatedMLP blocks + 4 attention blocks, with attention at every 3rd layer), vocab=1024, weight-tied embedding/lm_head, SwiGLU FFN (3x expansion), RoPE on attention layers, orthogonal weight init
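The layer counts above can be checked with a small sketch. The helper name `layer_types` and the exact placement (attention at layers 3, 6, 9, 12, counting from 1) are assumptions for illustration; the PR states only the counts and the "every 3rd layer" rule.

```python
# Hypothetical helper: the PR gives the block counts, not the exact indices.
def layer_types(num_layers: int = 14, attn_every: int = 3) -> list[str]:
    """Return one block type per layer, with attention at every `attn_every`-th layer."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "gated_mlp"
        for i in range(num_layers)
    ]


layout = layer_types()
num_mlp = layout.count("gated_mlp")
num_attn = layout.count("attention")
print(num_mlp, num_attn)
```

With L=14 and attention at every 3rd layer, this placement yields 10 GatedMLP blocks and 4 attention blocks, which matches the stated split.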

v2 improvements over v1:

  • d=352 (from 288) — +65% model capacity, ~15MB INT8+zlib (94% of 16MB budget)
  • Value Residuals — learned scalar skip gates (init=0, tanh-gated) every 3 blocks
  • GPTQ-lite — 5 clip percentile candidates per row, min reconstruction MSE INT8
  • Schedule-Free final 120s — constant LR floor (10%) + faster EMA (decay=0.97) instead of LR→0 warmdown
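A minimal sketch of the tanh-gated scalar skip described in the Value Residuals bullet. What the skip carries (here, an activation saved from an earlier block) and the function name `gated_skip` are assumptions; the PR specifies only a learned scalar gate, tanh gating, and init=0.

```python
import math


def gated_skip(x: list[float], skip: list[float], gate: float) -> list[float]:
    """Add a tanh-gated copy of `skip` to `x`.

    `gate` stands in for the learned scalar. With gate=0, tanh(0)=0,
    so the output equals `x` and the skip path starts switched off.
    """
    g = math.tanh(gate)
    return [xi + g * si for xi, si in zip(x, skip)]


x = [0.5, -1.0, 2.0]
skip = [1.0, 1.0, 1.0]          # assumed: activation saved a few blocks earlier
at_init = gated_skip(x, skip, gate=0.0)
after_training = gated_skip(x, skip, gate=1.0)  # gate value chosen for illustration
```

Initializing the gate at 0 means the model starts as if the skip were absent, and tanh bounds the skip's contribution to the range (-1, 1) times the skipped signal.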

Training:

  • Muon optimizer (momentum 0.85→0.99 warmup over 1500 steps, WD=0.04) for matrix params
  • Adam (fused) for embeddings/scalars
  • EMA decay=0.997, updated every step
  • Wallclock-aware schedule: cosine + schedule-free final 120s
  • Grad clip=0.3
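The schedule pieces listed above can be sketched as three small functions. Several details are assumptions not stated in the PR: the momentum warmup is taken to be linear, the cosine phase is taken to decay from the peak LR down to the 10% floor exactly when the final 120s window begins, and the total wallclock budget (`budget_s`) is a hypothetical parameter.

```python
import math


def muon_momentum(step: int, start: float = 0.85, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum warmup from `start` to `end` (linear ramp is an assumption)."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)


def lr_scale(elapsed_s: float, budget_s: float, final_s: float = 120.0,
             floor: float = 0.10) -> float:
    """LR multiplier: cosine decay to `floor`, then constant for the last `final_s` seconds."""
    cosine_end = budget_s - final_s
    if elapsed_s >= cosine_end:
        return floor  # final window: hold the floor instead of decaying to 0
    progress = elapsed_s / cosine_end
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


def ema_decay(elapsed_s: float, budget_s: float, final_s: float = 120.0,
              base: float = 0.997, fast: float = 0.97) -> float:
    """EMA decay: `base` normally, the faster `fast` during the final window."""
    return fast if elapsed_s >= budget_s - final_s else base


BUDGET_S = 600.0  # hypothetical budget, for illustration only
start_lr = lr_scale(0.0, BUDGET_S)
late_lr = lr_scale(BUDGET_S - 60.0, BUDGET_S)
```

Driving the schedule from elapsed seconds rather than step count keeps the final 120s window aligned with the wallclock limit regardless of step throughput.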

Quantization: Per-row INT8 GPTQ-lite (5 clip percentiles, min MSE) + zlib level=9
Size: ~15.06MB / 16MB (94.1% budget)
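The per-row clip search can be illustrated with a stdlib-only sketch. The five candidate percentiles, the nearest-rank percentile rule, symmetric quantization to [-127, 127], and the names `quantize_row` and `pack_int8` are all assumptions; the PR states only that 5 clip percentiles are tried per row and the one with minimum reconstruction MSE is kept, followed by zlib at level 9.

```python
import zlib

# Hypothetical candidates; the PR does not list the actual percentiles.
CLIP_PERCENTILES = (0.90, 0.95, 0.99, 0.999, 1.0)


def quantize_row(row, percentiles=CLIP_PERCENTILES):
    """Return (int8 values, scale) for the clip candidate with the lowest MSE."""
    mags = sorted(abs(w) for w in row)
    best = None
    for p in percentiles:
        # Clip threshold: magnitude at percentile p (nearest-rank style).
        clip = mags[min(len(mags) - 1, int(p * (len(mags) - 1)))]
        scale = clip / 127.0 if clip > 0 else 1.0
        # Round to the nearest level and clamp outliers to the int8 range.
        q = [max(-127, min(127, round(w / scale))) for w in row]
        mse = sum((w - qi * scale) ** 2 for w, qi in zip(row, q)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    return best[1], best[2]


def pack_int8(rows):
    """Quantize each row, then zlib-compress the int8 payload at level 9."""
    quantized = [quantize_row(r) for r in rows]
    raw = bytes(qi & 0xFF for q, _ in quantized for qi in q)
    return quantized, zlib.compress(raw, 9)


# A row of small weights plus one outlier.
row = [0.01 * i for i in range(-50, 50)] + [5.0]
q, scale = quantize_row(row)
```

Because the candidate list includes the 100th percentile (no clipping), the selected clip can never reconstruct worse than plain max-abs scaling on that row. The per-row scales would also need to be stored alongside the compressed payload; that bookkeeping is omitted here.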

@whitestone1121-web changed the title from "GolfStudent 14L: Hybrid GatedMLP+Attn, d=288, vocab=1024, Muon+EMA, INT8+zlib" to "GolfStudent v2 14L: d=352, Value Residuals, GPTQ-lite, Schedule-Free, Muon+EMA" on Mar 25, 2026
@whitestone1121-web (Author)

Updated for v2: the architecture is now d=352 (from 288) and adds Value Residuals (learned tanh-gated skip connections every 3 blocks, init=0) and GPTQ-lite INT8 (5 clip percentile candidates per row, minimum reconstruction MSE). No distillation: training uses pure cross-entropy (CE) on FineWeb binary shards. Quantization and zlib compression run after the wallclock timer exits, matching the standard contest format. A dry run confirms ~15.06MB (94.1% of the 16MB budget).
