
11L AttnRes + Gated Attention + Looped Blocks + EMA + Cosine + QAT #607

Draft
Neopolita wants to merge 1 commit into openai:main from Neopolita:submission/11L-attnres-gated-loop-ema-cos-qat

Conversation

@Neopolita

Summary

  • Block Attention Residuals (arXiv 2603.15031) replacing fixed skip_weights with learned depth routing
  • Per-head gated attention preventing attention-sink pathology
  • Looped middle blocks (layers 4-7 x2) for zero-param compute depth
  • EMA (0.995), cosine LR decay, QAT (last 15%)
  • 11L, 3x MLP, 26.5M params, ~13.7MB artifact
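The first and third bullets can be sketched together: 11 parameter blocks expanded into 15 applied layers by looping blocks 4-7 twice, with each step reading a learned softmax mixture over all earlier residual-stream states instead of a fixed `skip_weights` vector. This is a minimal NumPy sketch under assumed shapes; the function names, the tanh block body, and the routing parameterization are illustrative guesses, not the PR's actual code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def schedule(n_layers=11, loop=range(4, 8), repeats=2):
    """Per-step block order: blocks 4-7 applied twice, so 11 parameter
    sets yield 15 layers of applied compute with zero extra params."""
    head = list(range(loop.start))
    tail = list(range(loop.stop, n_layers))
    return head + list(loop) * repeats + tail

def forward(x, blocks, route_logits, order):
    """Each step mixes all previous stream states with learned softmax
    weights (learned depth routing) before applying its block."""
    hist = [x]
    for step, b in enumerate(order):
        w = softmax(route_logits[step][: len(hist)])
        mixed = sum(wi * h for wi, h in zip(w, hist))
        hist.append(mixed + np.tanh(mixed @ blocks[b]))
    return hist[-1]

rng = np.random.default_rng(0)
d = 16
order = schedule()  # 15 steps drawn from 11 blocks
blocks = [rng.standard_normal((d, d)) * 0.02 for _ in range(11)]
route_logits = rng.standard_normal((len(order), len(order) + 1))
y = forward(rng.standard_normal(d), blocks, route_logits, order)
```

With a real transformer block in place of the tanh toy, the routing logits are the only new parameters, which is consistent with the "zero-param compute depth" claim for the loop itself.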

Status

H100 run pending — requesting compute credits. Local M4 Max validation (500 steps): val_bpb 1.475 (float).

Test plan

  • Run on 8x H100 for 10 minutes
  • Update submission.json with final val_bpb and bytes_total
  • Add training log
  • Mark PR as ready for review

… Cosine + QAT

Architecture: 11 layers, 3x MLP, Block Attention Residuals (replacing skip_weights),
per-head gated attention, looped middle blocks (layers 4-7 x2).
Training: EMA (0.995), cosine LR decay, QAT (last 15%).
26.5M params, ~13.7MB artifact. H100 run pending.
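The three training-side pieces named above (cosine LR decay, EMA at 0.995, QAT over the last 15% of steps) compose as simple schedules. A hedged stand-alone sketch, assuming a decay-to-zero cosine and a step-count cutoff for QAT; `base_lr` and all names here are placeholders, not values from the run.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3):
    # Cosine decay from base_lr down to 0 over the full run.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def qat_active(step, total_steps, final_frac=0.15):
    # Fake-quantization switched on only for the final 15% of steps.
    return step >= math.ceil(total_steps * (1.0 - final_frac))

class EMA:
    """Exponential moving average of parameters (decay 0.995); the EMA
    weights, not the raw ones, are what would be evaluated/exported."""
    def __init__(self, params, decay=0.995):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * float(v)
```

In the training loop this would look like: set the optimizer LR from `cosine_lr(step, total)`, call `ema.update(model_params)` after each optimizer step, and wrap the forward pass in fake-quant ops once `qat_active(step, total)` flips to true.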
@Tqela

Tqela commented Mar 24, 2026

Use FlashAttention 4 or 3.
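In PyTorch the usual route to FlashAttention kernels is `F.scaled_dot_product_attention`, which dispatches to a Flash backend when hardware and dtype allow (e.g. fp16/bf16 on H100) and falls back otherwise. A minimal sketch with illustrative shapes; the per-head sigmoid output gate at the end is a guess at how this PR's gated attention might compose with a fused kernel, not the submission's actual code.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Fused causal attention; PyTorch selects a FlashAttention kernel on
# supported hardware/dtypes, else another fused/math backend.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Hypothetical learned per-head output gate, applied after the fused
# kernel so gating does not block the Flash dispatch.
gate_logits = torch.zeros(8)
out = out * torch.sigmoid(gate_logits).view(1, -1, 1, 1)
```

Applying the gate outside the kernel keeps the attention call itself in the fused fast path.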

