Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study#756
Open
abaybektursun wants to merge 1 commit into openai:main from
…PTQ stack

6 experiments on the current SOTA stack (1.1142 BPB), all negative:

- Qronos iterative Hessian (3 iters): +0.0007 worse
- CDQuant coordinate descent (3 passes): +0.0005 worse
- Full TTT (all params): +0.0001 worse
- MLP-down-only TTT: +0.0001 neutral
- MLP-all TTT: +0.0001 neutral

Key finding: At int6, the GPTQ algorithm is near-optimal. Remaining quant headroom is in the grid (what values to quantize to), not the algorithm (how to assign). TTT is dead on this stack: 25 total failed attempts across two stacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Negative results on the 1.1142 BPB stack (GPTQ + XSA-all + BigramHash 3072×112): quantization algorithms, TTT, architecture experiments, and a self-generated GPTQ calibration study.
Self-Generated GPTQ Calibration Study
GPTQ calibration estimates H = X^T X (activation covariance) per layer to guide int6 rounding decisions. We tested whether the model can calibrate itself without any external data.
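Concretely, the per-layer Hessian estimate is just an accumulated outer product of that layer's inputs over the calibration stream. A minimal sketch, assuming batches of pre-layer activations (the helper name and normalization convention are illustrative, not the record stack's actual GPTQ code):

```python
import numpy as np

def accumulate_hessian(layer_inputs):
    """Estimate H = X^T X for one layer from calibration activations.

    layer_inputs: iterable of (n_tokens, d_in) activation matrices, one per
    calibration batch. Returns the mean outer product (scaling is a
    convention choice; GPTQ only needs H up to a constant factor).
    """
    H, n = None, 0
    for X in layer_inputs:
        if H is None:
            H = np.zeros((X.shape[1], X.shape[1]))
        H += X.T @ X          # accumulate activation covariance
        n += X.shape[0]
    return H / max(n, 1)
```

The resulting matrix is symmetric positive semidefinite; its diagonal identifies dead or low-energy input columns, which is most of what int6 rounding needs.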
Trained once (seed 314, 6,942 steps), saved checkpoint, ran GPTQ with different calibration sources on the same frozen weights:
Row 2 (autoregressive self-generation): the model generates 64 coherent 2048-token sequences autoregressively from its own learned distribution (temperature=0.8, batch_size=8); no external data is accessed. Confirmed on a separate checkpoint (BigramHash 2048×128, 8×H100); the relative gaps are consistent across stacks.
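The generation step above is plain temperature sampling in a left-to-right loop. A sketch, where `logits_fn` is a hypothetical stand-in for the model's forward pass and the BOS token id is assumed to be 0:

```python
import numpy as np

def self_generate(logits_fn, vocab_size, seq_len=2048, temperature=0.8, seed=0):
    """Autoregressively sample one calibration sequence from the model itself.

    logits_fn(prefix) -> next-token logits (shape: vocab_size). Illustrative
    of the study's setup (temperature=0.8); names are assumptions.
    """
    rng = np.random.default_rng(seed)
    seq = [0]                                   # assumed BOS token id
    for _ in range(seq_len - 1):
        logits = logits_fn(seq) / temperature
        logits = logits - logits.max()          # numerical stability
        p = np.exp(logits)
        p /= p.sum()
        seq.append(int(rng.choice(vocab_size, p=p)))
    return seq
```

Feeding such sequences through the frozen model then yields the activations from which H is accumulated, exactly as with external calibration data.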
Findings:
Autoregressive self-generation closes 84% of the gap. The val-vs-random gap is 0.00204 BPB. Autoregressive generation recovers 0.00173 of that, leaving only 0.00031 BPB. The gap is predominantly natural language vs random noise — coherent text from the model's own distribution produces Hessians nearly identical to val data.
The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the training data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real FineWeb text. It is negligible.
Gibbs refinement does not help (1.11663 vs 1.11650 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.
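For contrast, a minimal Gibbs-style refinement sketch makes the failure mode concrete: each position is resampled in place, conditioned on neighbors that are still mostly random, so no globally coherent prefix ever forms. `cond_logits_fn` is a hypothetical stand-in for a conditional forward pass:

```python
import numpy as np

def gibbs_refine(seq, cond_logits_fn, vocab_size, sweeps=1, temperature=1.0, seed=0):
    """Resample each token given the (mostly random) rest of the sequence.

    cond_logits_fn(seq, pos) -> logits for the token at `pos` given the
    others. Unlike left-to-right generation, early positions are conditioned
    on noise, which is why the resulting activations stay noise-like.
    """
    rng = np.random.default_rng(seed)
    seq = list(seq)
    for _ in range(sweeps):
        for pos in range(len(seq)):
            logits = cond_logits_fn(seq, pos) / temperature
            logits = logits - logits.max()
            p = np.exp(logits)
            p /= p.sum()
            seq[pos] = int(rng.choice(vocab_size, p=p))
    return seq
```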
More random tokens do not help. 131K and 25M tokens give identical BPB (1.11650). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.
Self-generated calibration at 1.1165 beats SOTA (from our PR #549, 1.1194) by 0.003 BPB with zero legality risk. Autoregressive self-generation at 1.1148 comes within 0.0003 of val-calibrated performance.
Why random tokens work at int6: The Hessian diagonal and off-diagonal structure are dominated by the model's learned weights — embedding geometry, attention patterns, MLP scales. At 63 grid levels, the rounding decisions are coarse enough that the Hessian quality threshold is low.
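For concreteness, here is a symmetric signed int6 grid (63 of the 64 int6 codes; one is dropped so the levels are symmetric about zero) with a nearest-rounding baseline. GPTQ's actual assignment is Hessian-aware and compensatory, not shown:

```python
import numpy as np

def int6_grid(max_abs):
    """Symmetric int6 grid: integer levels -31..+31 scaled so the largest
    level hits the tensor's max |weight|. 63 levels, step = max_abs / 31."""
    return np.arange(-31, 32) * (max_abs / 31.0)

def quantize_nearest(w, grid):
    """Round each weight to its nearest grid value (round-to-nearest
    baseline; GPTQ instead orders and error-compensates these decisions)."""
    idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]
```

With this grid the worst-case rounding error is half a step (max_abs/62), which bounds how much Hessian quality can matter per weight at int6.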
Quantization Algorithm Experiments
Quant gap: +0.0036 BPB (pre-quant 1.1341 → roundtrip 1.1377). At int6, GPTQ is near-optimal.
TTT Experiments (Score-First, Legal)
Same protocol as our merged PR #549. 25 total TTT experiments have failed across two stacks.
SGD lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR, grad_clip=1.0. Baselines differ per row because each TTT variant freezes different layers, changing the eval-time forward pass.
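The cosine LR schedule in the protocol above can be sketched as follows (shape only; the stack's exact schedule code may differ, and the SGD/momentum loop is not reproduced):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.002):
    """Cosine decay from base_lr to 0 over total_steps, matching the stated
    TTT hyperparameter lr=0.002 in shape. Illustrative sketch only."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```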
Why TTT fails on this stack but worked on our PR #549 (−0.0025 BPB):
Architecture and Eval-Time Experiments
What's Exhausted
Untested: Non-uniform quantization grid, rate-distortion quantization (CERWU), QK-Norm, Peri-LN.
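Of the untested directions, a non-uniform grid could be fit as a 1-D k-means codebook over each tensor's weights, putting more levels where weight mass concentrates. This is purely illustrative of the "grid, not algorithm" headroom claimed above; it was not run on this stack:

```python
import numpy as np

def nonuniform_grid(w, n_levels=63, iters=20, seed=0):
    """1-D k-means codebook over a weight tensor: a candidate non-uniform
    63-level grid. Hypothetical sketch; not part of the record stack."""
    rng = np.random.default_rng(seed)
    centers = np.sort(rng.choice(w, n_levels, replace=False))
    for _ in range(iters):
        # Assign each weight to its nearest center, then recenter.
        idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if (idx == k).any():
                centers[k] = w[idx == k].mean()
    return np.sort(centers)
```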
🤖 Generated with Claude Code