
Record: 11L XSA-all + 7-gram cache (mean val_bpb=1.0465)#758

Open

hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-25_11L_XSA_ngram

Conversation

@hypery11

Results

| Seed | val_bpb |
|------|---------|
| 42   | 1.0467  |
| 1337 | 1.0470  |
| 2024 | 1.0457  |
| **Mean** | **1.0465** |
| **Std**  | **0.0007** |
  • Artifact: 13.99 MB
  • Train: 600s on 8xH100 SXM
  • Eval: ~116s
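The reported mean and std follow directly from the per-seed numbers above (the std matches the sample standard deviation, i.e. the n-1 denominator):

```python
import statistics

# Per-seed validation bits-per-byte from the results table.
seeds_bpb = [1.0467, 1.0470, 1.0457]

mean = statistics.mean(seeds_bpb)
std = statistics.stdev(seeds_bpb)  # sample std (n-1 denominator)

print(round(mean, 4))  # → 1.0465
print(round(std, 4))   # → 0.0007
```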

Method

11-layer transformer with XSA-all (Exclusive Self-Attention on all layers), squared LeakyReLU(0.5) activations, Value Residual, Gated Attention, BigramHash(10240), and SmearGate. Compression: GPTQ-lite int6 + zstd-22. Training: EMA(0.997) + tight SWA + late QAT.
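Of these components, the LeakyReLU(0.5)^2 activation is simple to sketch. A minimal hypothetical version (the PR's actual implementation may differ in details such as where the slope is applied):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """LeakyReLU(0.5)^2: apply a LeakyReLU with negative slope `slope`,
    then square the result. The output is quadratic on both sides of
    zero, with the negative branch scaled by slope**2 (here 0.25)."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```

For example, `leaky_relu_sq` maps 2.0 to 4.0 and -2.0 to 1.0, so large negative pre-activations still contribute (unlike plain ReLU^2) but at a quarter of the positive-side magnitude.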

7-gram backward-looking eval cache (alpha=0.40, 4M hash buckets). Score-first and deterministic, with no test-time training (TTT).
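A minimal sketch of how such a cache could work, assuming alpha is an interpolation weight between the model's distribution and the cache's empirical next-token counts (function names and the bucket-hashing scheme are illustrative, not the PR's actual code):

```python
import math
from collections import defaultdict

def eval_with_ngram_cache(tokens, model_prob, n=7, alpha=0.40,
                          n_buckets=4_000_000):
    # Hypothetical sketch of a backward-looking n-gram eval cache.
    # Each position's context = the previous n-1 tokens, hashed into a
    # fixed number of buckets that hold next-token counts. The model's
    # probability for the true token is blended with the cache's
    # empirical estimate. "Score-first": each token is scored BEFORE the
    # cache is updated, so no position ever sees its own or a later
    # token -- deterministic, and no test-time training.
    cache = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for t, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, t - (n - 1)):t])
        counts = cache[hash(ctx) % n_buckets]
        p = model_prob(t, tok)          # model's prob of the true token
        if counts:                       # blend only if the bucket is warm
            z = sum(counts.values())
            p = (1 - alpha) * p + alpha * counts[tok] / z
        total_bits += -math.log2(max(p, 1e-12))
        counts[tok] += 1                 # update only after scoring
    # Bits per token (the PR reports bits per byte; same idea).
    return total_bits / len(tokens)

# On a highly repetitive stream with a uniform 1-bit model, the cache
# pulls the score well below the 1.0 bits/token baseline.
bpt = eval_with_ngram_cache([7] * 200, lambda t, tok: 0.5)
```

Because the cache is keyed only on already-seen context and updated after scoring, re-running the eval yields identical numbers, which is what "deterministic, no TTT" requires.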

Architecture builds on community techniques from PRs #609, #549.

  • 8xH100 SXM, train ≤600s
  • Eval ≤600s (116s)
  • Artifact ≤16MB (13.99MB)
  • 3-seed validation (std 0.0007)

