# AttnRes + Gated Attention + Looped Blocks

### Architecture changes
- **11 layers, 3x MLP** — increased capacity, matching top submissions
- **Block Attention Residuals** (Moonshot AI, arXiv 2603.15031) — replaces the fixed `skip_weights` with a softmax attention over all encoder outputs, using per-decoder-layer pseudo-queries
- **Per-head gated attention** — a learnable `sigmoid(gate)` scale on each attention head's output, mitigating the attention-sink pathology
- **Looped middle blocks** — layers 4-7 run twice per forward pass, adding compute depth without adding parameters
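The three architecture changes above can be sketched with plain numpy. All shapes, names, and initializations here are illustrative assumptions, not the submission's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Per-head gated attention: each head's output is scaled by a learnable
# sigmoid gate, so a head can attenuate itself instead of parking attention
# mass on a "sink" token. Zero-init gives every gate sigmoid(0) = 0.5.
n_heads, seq, head_dim = 8, 16, 64
attn_out = rng.normal(size=(n_heads, seq, head_dim))
gate = np.zeros(n_heads)                              # learnable, one per head
gated = (1.0 / (1.0 + np.exp(-gate)))[:, None, None] * attn_out

# Block attention residuals: instead of fixed skip_weights, each decoder
# layer holds a pseudo-query that softmax-attends over the stack of earlier
# block outputs, producing a learned depth-weighted residual.
n_blocks, dim = 6, 512
block_outputs = rng.normal(size=(n_blocks, seq, dim))  # one per earlier block
pseudo_query = rng.normal(size=dim)                    # learnable, per decoder layer
keys = rng.normal(size=(n_blocks, dim))                # learnable depth keys
depth_weights = softmax(keys @ pseudo_query / np.sqrt(dim))   # (n_blocks,)
residual = np.tensordot(depth_weights, block_outputs, axes=(0, 0))  # (seq, dim)

# Looped middle blocks: blocks 4-7 are applied twice per forward pass,
# so depth of compute grows with zero additional parameters.
def looped_forward(x, blocks):
    for i, blk in enumerate(blocks):
        for _ in range(2 if 4 <= i <= 7 else 1):
            x = blk(x)
    return x
```

With 11 blocks and indices 4-7 looped, one forward pass applies 15 block evaluations.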

### Training changes
- **EMA** (decay=0.995) — exponential moving average of the weights, used for final eval/export
- **Cosine LR decay** — replaces the linear warmdown
- **QAT** (last 15% of training) — simulates the int8 per-row quantization in the forward pass to reduce round-trip degradation
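Minimal sketches of the three training-side changes. The decay value and per-row int8 scheme mirror the description above; function names and signatures are assumptions for illustration:

```python
import math
import numpy as np

def ema_update(ema_w, w, decay=0.995):
    """One EMA step; the averaged weights are what gets evaluated/exported."""
    return decay * ema_w + (1.0 - decay) * w

def cosine_lr(step, total_steps, max_lr):
    """Cosine decay from max_lr to 0, replacing a linear warmdown."""
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * step / total_steps))

def fake_quant_int8_per_row(w):
    """QAT forward pass: simulate an int8 per-row quantization round trip."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127)  # int8 symmetric range
    return q * scale                             # dequantized weights
```

In a straight-through-estimator setup the dequantized weights would be used in the forward pass while gradients flow to the float weights; that wiring is omitted here.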

## Local validation (M4 Max, 500 steps)
- val_bpb: 1.475 (float), 1.648 after int8 round-trip
- Artifact size: 13.7MB (2.3MB under the 16MB cap)
- Full H100 run pending

## Config
- 26.5M params, 11 layers, 512 dim, 8 heads, 4 KV heads, 3x MLP, tied embeddings
- Estimated artifact: ~13.7MB after int8+zlib
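A back-of-envelope check of the artifact estimate. The zlib compression ratio below is implied by the reported numbers, not measured here:

```python
# 26.5M params at one byte each (int8) before compression.
params = 26.5e6
raw_int8_mb = params / 1e6          # ≈ 26.5 MB raw
reported_mb = 13.7                  # reported post-zlib artifact size
implied_zlib_ratio = reported_mb / raw_int8_mb   # ≈ 0.52
print(f"raw int8: {raw_int8_mb:.1f} MB, implied zlib ratio: {implied_zlib_ratio:.2f}")
```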
{
  "author": "Neopolita",
  "github_id": "Neopolita",
  "name": "11L AttnRes + Gated Attention + Looped Blocks + EMA + Cosine + QAT",
  "blurb": "Block Attention Residuals (arXiv 2603.15031) replacing fixed skip_weights with learned depth routing, per-head sigmoid gated attention, looped middle blocks (layers 4-7 x2) for zero-param compute depth, EMA(0.995), cosine LR decay, QAT@0.15. 11L 3xMLP relu² baseline with 26.5M params.",
  "date": "2026-03-24T00:00:00Z",
  "val_loss": null,
  "val_bpb": null,
  "bytes_total": null
}