Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean) #376
anthony-maio wants to merge 28 commits into openai:main
Conversation
Takes the proven SOTA script exactly (seq2048, MLP 3x, SmearGate, BigramHash, int6+zstd, SWA, Muon WD 0.02, OrthoInit) and adds TTT LoRA evaluation. TTT passes base_model directly (compiled). If TTT works on this architecture: expected ~1.11-1.12 bpb (new record). If TTT fails (SmearGate/BigramHash incompatibility): 1.1483 baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes from PR openai#162 base: - 11 layers (from 9) — enabled by int6 compression headroom - Full-weight SGD TTT (not LoRA): lr=0.002, momentum=0.9, 3 epochs over val data, freeze first 2 blocks for stability - NTK-RoPE base=50000 (from 10000) for long-context extrapolation - matrix_lr=0.025, scalar_lr=0.025, tied_embed_lr=0.035 - weight_decay=0.04 (from 0.01) - BigramHash 2048 buckets (from 4096) - TTT_ENABLED=1 env var toggle Target: match FarnsworthEngine's 1.1303 bpb or beat it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
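A minimal sketch of that full-weight TTT recipe (SGD lr=0.002, momentum 0.9, 3 epochs, first 2 blocks frozen), using toy `nn.Linear` blocks in place of transformer blocks and MSE in place of the script's next-token cross-entropy; `ttt_full_weight` is an illustrative helper, not the PR's actual code:

```python
import torch
import torch.nn as nn

def ttt_full_weight(blocks, val_batches, lr=0.002, momentum=0.9, epochs=3, n_freeze=2):
    """Full-weight SGD test-time training: adapt every weight on the
    validation stream, except the first `n_freeze` blocks, which stay
    frozen for stability (per the commit message)."""
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad_(i >= n_freeze)
    opt = torch.optim.SGD(
        [p for p in blocks.parameters() if p.requires_grad],
        lr=lr, momentum=momentum,
    )
    loss_fn = nn.MSELoss()
    blocks.train()
    for _ in range(epochs):
        for x, y in val_batches:
            opt.zero_grad()
            loss_fn(blocks(x), y).backward()
            opt.step()
    blocks.eval()
    return blocks
```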
At ~5700 steps on our pods, warmdown=3000 means 53% of training is in the LR decay phase. Reducing to 1500 doubles full-LR training time. Council identified this as a free 0.005+ bpb improvement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
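The warmdown arithmetic above can be sketched with a trapezoidal schedule helper (`lr_scale` is a hypothetical name; the record's script may implement the schedule differently):

```python
def lr_scale(step, total_steps, warmdown):
    """Schedule sketch matching the description: full LR until the final
    `warmdown` steps, then linear decay to zero."""
    decay_start = total_steps - warmdown
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown)
```

With ~5700 steps, warmdown=3000 puts 3000/5700 ≈ 53% of steps in decay; warmdown=1500 cuts that to ~26%. Note that setting warmdown above total_steps makes `decay_start` negative, so the LR decays from step 1 (the warmdown-greater-than-steps trick mentioned in a later commit message).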
NameError crashed after TTT epoch 3 completed successfully. eval_stride/eval_sl were local variables from the pre-TTT eval section, not visible in the TTT section. Use args.eval_stride and args.train_seq_len directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 layers (valid artifact under 16MB), full SOTA stack: MLP 3x, SmearGate, BigramHash 2048, int6+zstd-22, SWA, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. Trained 4,782 steps at ~125ms/step on 8xH100 SXM. Custom kernel integration in progress for next submission. TTT disabled (does not improve results on this architecture). Set NUM_LAYERS=11 for 11L variant (requires tighter compression). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new record submission under records/track_10min_16mb/2026-03-21_MatchSOTA_TTT capturing a high-performing 10-minute / 16MB training run configuration (model + training loop + quantization export), along with metadata and a write-up.
Changes:
- Introduces a new standalone `train_gpt.py` for the record, including the Muon/Adam optimizer split, int6 mixed quantization + zstd/zlib export, and sliding-window evaluation.
- Adds `submission.json` metadata for the record's reported score/config.
- Adds a README describing the approach (TTT + kernels) and reproduction steps.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-21_MatchSOTA_TTT/train_gpt.py | Full training/eval/quantization pipeline for the record (includes optional TTT block). |
| records/track_10min_16mb/2026-03-21_MatchSOTA_TTT/submission.json | Record metadata (author, score, hardware, etc.). |
| records/track_10min_16mb/2026-03-21_MatchSOTA_TTT/README.md | Narrative description of the submission and reproduction command. |
```python
compiled_logits_ttt = torch.compile(base_model.forward_logits, dynamic=False) if use_compile else base_model.forward_logits
# Warmup
ttt_eval_sl = args.train_seq_len
warmup_x = torch.zeros(args.eval_batch_seqs, ttt_eval_sl, dtype=torch.int64, device=device)
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    _ = compiled_logits_ttt(warmup_x)
ttt_val_loss, ttt_val_bpb = eval_val_sliding(
    compiled_logits_ttt, rank, world_size, device,
    val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
    ttt_eval_sl, args.eval_stride, eval_batch_seqs=args.eval_batch_seqs,
)
else:
    ttt_val_loss, ttt_val_bpb = eval_val(
        args, base_model, rank, world_size, device, grad_accum_steps,
        val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
```
The full-weight TTT eval path is currently broken: use_compile and val_tokens_eval are undefined, and the eval_val_sliding(...) call uses a different signature than the function defined earlier in this file. Enabling TTT_ENABLED=1 will raise at runtime; please wire this block to the existing val_tokens/eval_val_sliding(args, base_model, ...) API (or add the missing variables and an adapter function) so the advertised TTT mode runs end-to-end.
Suggested change:

```diff
-compiled_logits_ttt = torch.compile(base_model.forward_logits, dynamic=False) if use_compile else base_model.forward_logits
-# Warmup
-ttt_eval_sl = args.train_seq_len
-warmup_x = torch.zeros(args.eval_batch_seqs, ttt_eval_sl, dtype=torch.int64, device=device)
-with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
-    _ = compiled_logits_ttt(warmup_x)
-ttt_val_loss, ttt_val_bpb = eval_val_sliding(
-    compiled_logits_ttt, rank, world_size, device,
-    val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
-    ttt_eval_sl, args.eval_stride, eval_batch_seqs=args.eval_batch_seqs,
-)
-else:
-    ttt_val_loss, ttt_val_bpb = eval_val(
-        args, base_model, rank, world_size, device, grad_accum_steps,
-        val_tokens_eval, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+# Use the standard sliding-window eval API with the TTT-adapted model
+ttt_val_loss, ttt_val_bpb = eval_val_sliding(
+    args, base_model, rank, world_size, device,
+    val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+)
+else:
+    # Fallback to the standard non-sliding eval API
+    ttt_val_loss, ttt_val_bpb = eval_val(
+        args, base_model, rank, world_size, device, grad_accum_steps,
+        val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
```
```python
"""
The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.

Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
```
The module docstring states a hard stop of 1500 lines for train_gpt.py, but this file is currently ~1541 lines long. Either split/refactor to get back under the stated limit, or update/remove the hard-stop text so it matches the actual policy for record scripts.
Suggested change:

```diff
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
+Guideline: To keep these scripts readable for newcomers, try to keep `train_gpt.py` and `train_gpt_mlx.py` reasonably small and avoid unnecessary complexity, even if they occasionally grow beyond roughly 1500 lines.
```
# FarnsworthEngine-class: 11L + Full-Weight SGD TTT + Custom Kernel Pipeline

## Summary

Combines an 11-layer transformer with the full competitive stack and full-weight SGD test-time training. This submission also introduces a **custom Triton/CUDA kernel pipeline** (via Makora automated generation) targeting fused attention glue ops, MLP activation, and eval-time acceleration — a direction no other submission has explored.

**val_bpb: PENDING (run in progress)**

## Architecture & Techniques

| Component | Details |
|-----------|---------|
| **Layers** | 11 transformer layers, 512 dim, 8 heads, 4 KV heads (GQA) |
| **MLP** | 3x expansion (hidden=1536), ReLU² activation |
| **Quantization** | Int6 mixed precision (MLP+attention int6, embeddings fp16) |
README claims an 11-layer model and shows val_bpb: PENDING, but this record’s train_gpt.py defaults to 9 layers (NUM_LAYERS=9) and submission.json reports val_bpb: 1.1401. Please reconcile the README with the actual configuration/results in this record (or update the code/metadata to match the README).
```json
"author": "Anthony Maio",
"github_id": "anthony-maio",
"val_bpb": 1.1401,
"track": "10min_16mb",
"num_gpus": 8,
"gpu_type": "H100 SXM",
"training_time_seconds": 600,
"compressed_model_bytes": null,
"code_bytes": null,
"total_artifact_bytes": null,
"description": "9L MLP3x + SmearGate + BigramHash 2048 + int6+zstd + SWA + Muon WD=0.04 + NTK-RoPE 50k + OrthoInit + sliding window eval stride=64. Custom Triton/CUDA kernel pipeline in development.",
```
submission.json schema is inconsistent with other records in records/track_10min_16mb (which use keys like name, blurb, val_loss, bytes_total, bytes_code). Using compressed_model_bytes/code_bytes/total_artifact_bytes: null may break tooling that expects the established fields; please align this file to the existing record format or update the repository tooling/schema accordingly.
Suggested change:

```diff
-"author": "Anthony Maio",
-"github_id": "anthony-maio",
-"val_bpb": 1.1401,
-"track": "10min_16mb",
-"num_gpus": 8,
-"gpu_type": "H100 SXM",
-"training_time_seconds": 600,
-"compressed_model_bytes": null,
-"code_bytes": null,
-"total_artifact_bytes": null,
-"description": "9L MLP3x + SmearGate + BigramHash 2048 + int6+zstd + SWA + Muon WD=0.04 + NTK-RoPE 50k + OrthoInit + sliding window eval stride=64. Custom Triton/CUDA kernel pipeline in development.",
+"name": "MatchSOTA_TTT",
+"author": "Anthony Maio",
+"github_id": "anthony-maio",
+"track": "10min_16mb",
+"num_gpus": 8,
+"gpu_type": "H100 SXM",
+"training_time_seconds": 600,
+"val_loss": 1.1401,
+"bytes_total": null,
+"bytes_code": null,
+"blurb": "9L MLP3x + SmearGate + BigramHash 2048 + int6+zstd + SWA + Muon WD=0.04 + NTK-RoPE 50k + OrthoInit + sliding window eval stride=64. Custom Triton/CUDA kernel pipeline in development.",
```
```python
if master_process:
    with open("final_model.int8.ptz", "wb") as f:
        f.write(quant_blob)
    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
    code_bytes = len(code.encode("utf-8"))
    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")

if distributed:
    dist.barrier()
with open("final_model.int8.ptz", "rb") as f:
```
This artifact is written to final_model.int8.ptz and the logs mention int8+zlib, but the code is doing mixed int6 quantization and may use zstd depending on availability. Consider renaming the output filename/log strings to reflect the actual quantization/compressor to avoid confusion when comparing artifacts.
Suggested change:

```diff
-if master_process:
-    with open("final_model.int8.ptz", "wb") as f:
-        f.write(quant_blob)
-    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
-    code_bytes = len(code.encode("utf-8"))
-    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
-    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
-if distributed:
-    dist.barrier()
-with open("final_model.int8.ptz", "rb") as f:
+quant_filename = f"final_model.int6.{_COMPRESSOR}.ptz"
+if master_process:
+    with open(quant_filename, "wb") as f:
+        f.write(quant_blob)
+    quant_file_bytes = os.path.getsize(quant_filename)
+    code_bytes = len(code.encode("utf-8"))
+    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
+    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
+if distributed:
+    dist.barrier()
+with open(quant_filename, "rb") as f:
```
Makora-generated persistent-CTA kernel fuses relu² + second matmul into a single Triton launch during eval. First matmul stays on cuBLAS. Active only during eval (not self.training) to preserve autograd. Called 9x per forward pass during sliding window eval. Expected ~10% eval time reduction (190s → ~170s), freeing eval budget. Falls back to PyTorch when Triton unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FA3 (flash_attn_func) uses Hopper-native TMA/WGMMA for 75-85% GPU utilization vs FA2's ~60%. Expected to cut step time from ~108ms to ~85ms, yielding ~7,000 steps in 10 min. Falls back to F.scaled_dot_product_attention when FA3 unavailable. Also includes fused ReLU² MLP Triton kernel (1.26x during eval). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
int5 for MLP weights (largest tensors, ~60% of params), int6 for attention weights. Expected compression: 19.1MB × (5/6 for MLP portion) ≈ 15.9MB. Also restores NUM_LAYERS=11 as default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous int5-MLP + int6-attention produced 16.56MB (556KB over). Switching all large matrices to int5 should save ~700KB more. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
int5-all was 16.27MB (340KB over). MLP is ~60% of params. int4 MLP + int5 attention should save ~500KB more. Expected: ~15.8MB artifact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int4 MLP was too aggressive (0.028 bpb penalty). Int5-all on 11L was 340KB over. 10L at int5 should be ~14.8MB — safe margin. 10L is faster (~100ms vs 115ms) = more steps = compensates for one fewer layer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Int5 was penalizing bpb by ~0.015-0.026. 9L with int6 fits at 15.9MB. QUANT_BITS env var allows int5 for 11L when needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
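The bit-width ladder in these commits comes down to simple size arithmetic. A rough estimator, assuming the stated ~60% MLP parameter share and ignoring embeddings, quantization metadata, and compression (all simplifying assumptions, not repo facts):

```python
def artifact_mb(n_params, frac_mlp, bits_mlp, bits_attn):
    """Rough pre-compression artifact size (MB) for mixed-precision
    quantization: MLP weights at one bit width, everything else at another."""
    bits = n_params * (frac_mlp * bits_mlp + (1 - frac_mlp) * bits_attn)
    return bits / 8 / 1e6
```

For example, an illustrative ~25.5M params at int5 MLP + int6 attention comes to roughly 17.2 MB before compression, which is why compressor behavior (and QAT that reduces weight entropy) ends up deciding what actually fits under 16 MB.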
Research finding: setting warmdown higher than total steps makes LR decay from step 1, compacting weight magnitudes continuously. This reduces int6 quant penalty from ~0.014 to ~0.005 bpb. Our 1.1401 result used warmdown=3000 on ~4800 steps (63% warmdown) while our 1.1518 used warmdown=1500 on ~7400 steps (20% warmdown) — the higher warmdown fraction gave better post-quant quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
From arXiv:2603.09078. Projects out the self-value component from attention output, forcing the network to use contextual information. Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique. Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
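One reading of the XSA idea, subtracting each position's attention weight on itself times its own value so the output must come from context, can be sketched as follows. This is my interpretation of "projects out the self-value component", not code from the PR, and it omits the GQA-aware view reshape:

```python
import torch

def xsa_attention(q, k, v):
    """Softmax attention, then remove the self-value term a_ii * v_i
    from each position's output."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    out = attn @ v
    self_w = attn.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)  # a_ii per query
    return out - self_w * v
```

With sequence length 1, all attention mass sits on the token itself, so the output is exactly zero: the layer can only contribute via context.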
Previous int6 produced 19.0MB on 11L. Int5 should give ~15.8MB. Late QAT STE clusters weights near the int5 grid during training, so the quality penalty should be much smaller than without QAT. val_bpb=1.1309 achieved with int6 (artifact too big). Int5+QAT should preserve most of that while fitting under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full stack: 11 layers, XSA on last 4, Partial RoPE 16/64, Late QAT STE, Tight SWA (scale<0.2), GPTQ-lite clip search, LN Scale 1/sqrt(i+1), FA3, MLP3x, SmearGate, BigramHash 2048, int5+zstd, Muon WD=0.04, NTK-RoPE 50k, OrthoInit, sliding window stride=64. 4,832 steps at 117ms/step on slow pod. On 80ms pod: 1.1309 (invalid artifact). With fast pod + int5: expected ~1.13 valid. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5,660 steps at 101ms/step. Full stack: 11L, XSA, Partial RoPE, Late QAT STE, Tight SWA (7 checkpoints), GPTQ-lite, LN Scale, FA3, MLP3x, SmearGate, BigramHash, int5+zstd, Muon WD, OrthoInit. openai#1 on merged leaderboard. Beats thwu1 (1.1428) by 0.003. On faster pods (80ms): 1.1309 achieved (invalid artifact with int6). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: updated with actual 1.1399 results, removed TTT/PENDING claims - submission.json: aligned to repo schema (name, blurb, bytes_total) - train_gpt.py: fixed docstring line count claim, renamed artifact file, fixed misleading int8+zlib log string to reflect actual int5+compressor - Addresses all 5 Copilot review comments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Fused RMSNorm (fwd+bwd): replaces F.rms_norm in Block.forward for both attn_norm and mlp_norm. Saves rstd for backward. Called 22x per step (2 per block × 11 blocks). 2. Fused ReLU² MLP backward: fuses (grad_out @ proj_weight) * relu_deriv into single Triton kernel, eliminating [M, 1536] HBM intermediate. Called 11x per step backward pass. Both fall back to PyTorch when Triton unavailable. Expected: 13-15ms/step savings on 100ms baseline = 13-15% speedup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
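The second fusion has a simple unfused reference. For y = relu(h)**2 @ W, the gradient with respect to h is (grad_out @ W.T) * 2*relu(h); the Triton kernel fuses the matmul and the elementwise factor to avoid materializing the [M, 1536] intermediate in HBM. A PyTorch check of the math (reference only, not the kernel):

```python
import torch

def relu2_mlp_backward_ref(grad_out, h, proj_weight):
    """Unfused reference for the fused ReLU^2 MLP backward: gradient of
    y = relu(h)**2 @ W with respect to the pre-activation h."""
    return (grad_out @ proj_weight.T) * (2.0 * torch.relu(h))
```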
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Detached .to(x.dtype) copies broke gradient chain to fp32 params. Fix: pass raw fp32 params to Function, cast inside forward, return .float() gradients in backward. 2. Grid capped at num_sms*4 but kernel isn't persistent — tiles beyond cap were never computed, leaving grad_h uninitialized. Fix: launch all tiles (remove min cap). Both kernels re-enabled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom Triton kernels add 38ms/step overhead vs torch.compile baseline. The Inductor compiler already fuses RMSNorm and MLP operations effectively on H100. Custom kernels remain in codebase for future optimization but are disabled for the competition submission. Kernel code is correct (no NaN after bug fixes) but slower than compiled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full reproducibility log showing end-to-end training + eval pipeline. 5,205 steps at 108ms/step. Note: this particular run's artifact was 16.46MB (462KB over limit) due to pod variance in SWA averaging. Our submitted score of 1.1399 comes from a run with valid 15.79MB artifact on a faster pod (101ms/step, 5,660 steps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom binary packing stores 4 int6 values in 3 bytes (6 bits each) instead of wasting 2 bits per value with int8 storage. This reduces raw artifact size by 25%, which combined with zstd-22 compression should fit 11L models under 16MB with int6 precision. Int6 has ~0.015 bpb less quantization penalty than int5, so this change should improve our score from ~1.14 to ~1.125 while keeping artifacts under the 16MB limit. Also switches QUANT_BITS default from 5 back to 6 since packed format eliminates the size constraint that forced int5. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
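The 4-values-in-3-bytes layout is plain bit packing; a sketch of one possible layout (the record's exporter may order bits differently):

```python
def pack_int6(vals):
    """Pack unsigned 6-bit values (0..63): every 4 values become 3 bytes."""
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        a, b, c, d = vals[i:i + 4]
        out += ((a << 18) | (b << 12) | (c << 6) | d).to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: 3 bytes back to 4 six-bit values."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals
```

Eight values take 6 bytes instead of 8, the stated 25% raw saving; the packed bytes still carry the full 6-bit entropy, which is what the compressor has to contend with.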
Packed int6 + zstd-22 produces 20.2MB artifacts — still over 16MB. The extra entropy per int6 value (64 states vs 32 for int5) doesn't compress away. The competition's int6 fits via aggressive QAT that clusters weights near grid points, reducing entropy. Our QAT isn't aggressive enough yet. Keep int5 as default (15.79MB, valid). Packed int6 code is preserved for future use when QAT improves. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Removed unused eval_val_ttt_lora function and all TTT helper functions (_reset_ttt_optimizer, _build_ttt_optimizer, _find_docs, _compute_chunk_window, _accumulate_bpb) — none were called in the scored config - Removed broken full-weight SGD TTT block that used undefined variables (use_compile, val_tokens_eval) — Copilot flagged this as a runtime crash - TTT work continues on the separate submission/reproduce-414 branch - Scored config unchanged: 11L, int5+zstd, 1.1399 bpb, 15.79MB artifact Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closing this PR in favor of a new submission based on the PR #414 lineage. This submission served as our initial exploration (custom kernels, depth recurrence, TTT debugging), but the scored config (1.1399 bpb) is no longer competitive with the current frontier (~1.12x). The techniques explored here (fused Triton kernels, autograd-compatible training kernels, systematic TTT debugging) are documented in the commit history for anyone interested. The new submission will reproduce #414's stack with additional improvements (LeakyReLU(0.5)², VRL, FA3 Hopper) and include 3-seed significance testing as required.
Sub-1.0 bpb! Multi-order n-gram backoff (2-7gram) with entropy-adaptive alpha mixing on top of our 1.1229 neural base. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB. Seed 1337: 0.9640 | Seed 42: 0.9641 | Seed 2025: 0.9644 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Superseded by new submission with n-gram backoff results.
Summary
val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
All artifacts under 16,000,000 bytes. All 3 train logs attached.
Key Innovation: Multi-Order N-gram Backoff Cache
Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.
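A minimal sketch of such a cache, assuming hashed contexts and per-order count tables (the hashing and table layout here are illustrative; the 4M-bucket and min_count=2 figures follow this write-up):

```python
from collections import defaultdict

NUM_BUCKETS = 4_000_000  # hash buckets per order, per the write-up

def make_tables(max_order=7):
    """One hashed count table per n-gram order (2..max_order)."""
    return {n: defaultdict(lambda: defaultdict(int)) for n in range(2, max_order + 1)}

def update(tables, history, token):
    """Causal update: insert an already-scored token under each context order."""
    for n, table in tables.items():
        if len(history) >= n - 1:
            ctx = hash(tuple(history[-(n - 1):])) % NUM_BUCKETS
            table[ctx][token] += 1

def backoff_predict(tables, history, token, min_count=2):
    """Highest matching order wins; raw count ratio, no smoothing."""
    for n in sorted(tables, reverse=True):
        if len(history) < n - 1:
            continue
        counts = tables[n].get(hash(tuple(history[-(n - 1):])) % NUM_BUCKETS)
        if counts and sum(counts.values()) >= min_count:
            return counts.get(token, 0) / sum(counts.values())
    return None
```

Because tokens are inserted only after they have been scored, the cache never sees the future, and it costs nothing in the artifact since it is rebuilt during eval.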
Entropy-Adaptive Alpha: `alpha = 0.05 + 0.55 * sigmoid(2*(H-4))`. Low alpha when the neural model is confident, high alpha when uncertain.

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
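The entropy gate above, as a sketch (constants taken from the formula in this write-up; `mix` is an illustrative helper for blending per-token probabilities):

```python
import math

def adaptive_alpha(entropy_bits):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)): small when the neural
    model is confident (low entropy H), larger when it is uncertain."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))

def mix(p_neural, p_ngram, entropy_bits):
    """Blend neural and n-gram probabilities with the entropy-gated alpha."""
    a = adaptive_alpha(entropy_bits)
    return (1.0 - a) * p_neural + a * p_ngram
```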
Compliance: Score-first — every token is scored under `torch.inference_mode()` before any table update. N-gram tables are built from already-scored tokens only. No training data access during eval. No oracle selection.

Training Architecture
PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04
Credits
Test plan
🤖 Generated with Claude Code