
Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study#756

Open
abaybektursun wants to merge 1 commit into openai:main from abaybektursun:negative-results-quant-algo-ttt

Conversation

Contributor

@abaybektursun abaybektursun commented Mar 25, 2026

Summary

Negative results on the 1.1142 BPB stack (GPTQ + XSA-all + BigramHash 3072×112): quantization algorithms, TTT, architecture experiments, and a self-generated GPTQ calibration study.


Self-Generated GPTQ Calibration Study

GPTQ calibration estimates H = X^T X (activation covariance) per layer to guide int6 rounding decisions. We tested whether the model can calibrate itself without any external data.
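The per-layer statistic being calibrated can be sketched in a few lines. This is an illustrative accumulator for H = X^T X over calibration batches, not the repo's actual code; the function name and averaging convention are assumptions.

```python
import numpy as np

def accumulate_hessian(layer_inputs, d):
    """Accumulate H = X^T X over calibration batches for one layer.

    layer_inputs: iterable of (n_tokens, d) activation matrices, one
    per calibration batch. Illustrative sketch only.
    """
    H = np.zeros((d, d))
    n = 0
    for X in layer_inputs:
        H += X.T @ X          # activation covariance (unnormalized)
        n += X.shape[0]
    return H / max(n, 1)      # average so scale is token-count independent
```

The study below varies only where `layer_inputs` comes from; the frozen weights and the GPTQ rounding procedure are identical across rows.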

We trained once (seed 314, 6,942 steps), saved the checkpoint, and ran GPTQ with different calibration sources on the same frozen weights:

| # | Calibration source | Tokens | Time | Sliding BPB | vs val-calib |
|---|---|---|---|---|---|
| 1 | Val data | ~50M | ~5s | 1.11446 | |
| 2 | Autoregressive self-generation | 131K | 186s | 1.11477 | +0.00031 |
| 3 | Random tokens (64 batches) | 131K | 3.4s | 1.11650 | +0.00204 |
| 4 | Random tokens (256×48 batches) | 25M | 35s | 1.11650 | +0.00204 |
| 5 | Gibbs-refined (3 rounds, 64×48) | 6.3M | 24.4s | 1.11663 | +0.00217 |

Row 2: the model generates 64 coherent sequences of 2048 tokens autoregressively from its own learned distribution (temperature=0.8, batch_size=8). No external data accessed. Confirmed on a separate checkpoint (BigramHash 2048×128, 8×H100); the relative gaps are consistent across stacks.
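The self-generation step in row 2 can be sketched as below. `next_token_probs` is a hypothetical hook standing in for a forward pass of the trained model; any autoregressive LM fits. Note that applying temperature to probabilities as p^(1/T) (renormalized) is equivalent to dividing logits by T before the softmax.

```python
import numpy as np

def self_generate_calibration(next_token_probs, vocab_size, n_seqs=64,
                              seq_len=2048, temperature=0.8, seed=0):
    """Sample calibration sequences from the model's own distribution.

    next_token_probs(prefix) -> np.ndarray of shape (vocab_size,), the
    model's next-token distribution (hypothetical hook). No external
    data is accessed at any point.
    """
    rng = np.random.default_rng(seed)
    seqs = []
    for _ in range(n_seqs):
        toks = [int(rng.integers(vocab_size))]     # arbitrary start token
        for _ in range(seq_len - 1):
            p = next_token_probs(toks)
            p = p ** (1.0 / temperature)           # temperature sharpening
            p = p / p.sum()
            toks.append(int(rng.choice(vocab_size, p=p)))
        seqs.append(toks)
    return seqs
```

The generated sequences are then fed through the frozen model exactly like any other calibration batch.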

Findings:

  1. Autoregressive self-generation closes 84% of the gap. The val-vs-random gap is 0.00204 BPB. Autoregressive generation recovers 0.00173 of that, leaving only 0.00031 BPB. The gap is predominantly natural language vs random noise — coherent text from the model's own distribution produces Hessians nearly identical to val data.

  2. The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the training data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real FineWeb text. It is negligible.

  3. Gibbs refinement does not help (1.11663 vs 1.11650 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.

  4. More random tokens do not help. 131K and 25M tokens give identical BPB (1.11650). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.

  5. Self-generated calibration at 1.1165 beats SOTA (from our PR #549, 1.1194) by 0.003 BPB with zero legality risk. Autoregressive self-generation at 1.1148 comes within 0.0003 of val-calibrated performance.

Why random tokens work at int6: The Hessian diagonal and off-diagonal structure are dominated by the model's learned weights — embedding geometry, attention patterns, MLP scales. At 63 grid levels, the rounding decisions are coarse enough that the Hessian quality threshold is low.
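For concreteness, here is a minimal symmetric int6 round-to-nearest sketch showing the 63-level grid (q in [-31, 31]). GPTQ replaces this naive rounding with Hessian-weighted error compensation, but the grid itself is the same; the per-tensor scale here is a simplifying assumption.

```python
import numpy as np

def quantize_int6_symmetric(w):
    """Symmetric int6 round-to-nearest: 63 levels, q in [-31, 31].

    Illustrative sketch; real GPTQ uses per-column scales and
    Hessian-aware rounding, not plain round-to-nearest.
    """
    s = np.abs(w).max() / 31.0                 # quantization scale
    q = np.clip(np.round(w / s), -31, 31)
    return q.astype(np.int8), s

w = np.random.default_rng(0).standard_normal(1024)
q, s = quantize_int6_symmetric(w)
# roundtrip error is bounded by half the grid spacing, s / 2
assert np.max(np.abs(w - q * s)) <= s / 2 + 1e-12
```

With rounding error already capped at half a grid step, small perturbations to the Hessian rarely flip a rounding decision, which is consistent with the low quality threshold observed above.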


Quantization Algorithm Experiments

Quant gap: +0.0036 BPB (pre-quant 1.1341 → roundtrip 1.1377). At int6, GPTQ is near-optimal.

| Technique | Paper | Result | Mechanism |
|---|---|---|---|
| Qronos (ICLR 2026) | arXiv:2505.11695 | +0.0007 ❌ | Re-collects Hessians from quantized activations. At int6, activation mismatch is <0.1%, so the updated Hessians are nearly identical. |
| CDQuant | arXiv:2406.17542 | +0.0005 ❌ | Coordinate descent revisiting columns. At ~0.06 scale-unit spacing, most weights are already at their optimal grid point. |

TTT Experiments (Score-First, Legal)

Same protocol as our merged PR #549. 25 total TTT experiments have failed across two stacks.

| Approach | Params unfrozen | TTT BPB | Baseline | Delta |
|---|---|---|---|---|
| Full TTT | 27.1M (100%) | 1.1146 | 1.1139 | +0.0007 ❌ |
| MLP-down | 8.7M (32%) | 1.1145 | 1.1144 | +0.0001 |
| MLP-all | 17.3M (64%) | 1.1144 | 1.1144 | +0.0000 |

SGD lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR, grad_clip=1.0. Baselines differ per row because each TTT variant freezes different layers, changing the eval-time forward pass.
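The optimizer step and schedule from that protocol can be sketched as follows. This is a minimal reimplementation from the stated hyperparameters (SGD, lr=0.002, momentum=0.9, cosine decay, grad_clip=1.0), not the repo's code; real runs update only the unfrozen parameter subset.

```python
import math
import numpy as np

def ttt_step(w, grad, v, lr, momentum=0.9, grad_clip=1.0):
    """One SGD-with-momentum step with global-norm gradient clipping."""
    gnorm = np.linalg.norm(grad)
    if gnorm > grad_clip:
        grad = grad * (grad_clip / gnorm)      # rescale to norm grad_clip
    v = momentum * v + grad                    # momentum buffer update
    return w - lr * v, v

def cosine_lr(step, total_steps, base_lr=0.002):
    """Cosine decay from base_lr to 0 over the 3-epoch adaptation."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

Each 32K-token chunk gets its own short adaptation run from the shared checkpoint, so early tokens in a chunk see an unadapted model, which is part of the tradeoff discussed below.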

Why TTT fails on this stack but worked on our PR #549 (−0.0025 BPB):

  • XSA-all already captures the inter-document context patterns that TTT was adapting to on the previous stack
  • At 27M params, score-first TTT cannot overcome the forgetting/adaptation tradeoff — early chunks get no benefit, and the model is too small for late-chunk gains to compensate

Architecture and Eval-Time Experiments

| Technique | Result | Mechanism |
|---|---|---|
| Spectral Init (λ=10 on QKV, arXiv:2603.07162) | 1.52 BPB, 650ms/step ❌ | λ=10 is 200× the Xavier init magnitude at 27M params; 924 steps in 600s vs 6,950 baseline. The paper tested ~100M-parameter models. |
| SLOT bias (512-dim additive, 3 AdamW steps/chunk) | +0.0013 ❌ | A global shift cannot capture per-document patterns; the final-norm → logit pipeline is already calibrated. |

What's Exhausted

  • Quant algorithm: Qronos, CDQuant both negative
  • Eval-time adaptation: 3× TTT + SLOT all negative
  • Architecture: Spectral Init catastrophic; Gated Attention (+0.0011, PR #609), DiffTransformer (1.5× slower, PR #418), Attention Residuals (54% slower, PR #362) all dead

Untested: Non-uniform quantization grid, rate-distortion quantization (CERWU), QK-Norm, Peri-LN.

🤖 Generated with Claude Code

…PTQ stack

6 experiments on the current SOTA stack (1.1142 BPB), all negative:
- Qronos iterative Hessian (3 iters): +0.0007 worse
- CDQuant coordinate descent (3 passes): +0.0005 worse
- Full TTT (all params): +0.0001 worse
- MLP-down-only TTT: +0.0001 neutral
- MLP-all TTT: +0.0001 neutral

Key finding: At int6, GPTQ algorithm is near-optimal. Remaining quant headroom
is in the grid (what values to quantize to), not the algorithm (how to assign).
TTT is dead on this stack — 25 total failed attempts across two stacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Non-record: Negative results — quantization algorithms & TTT on val-GPTQ stack Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study Mar 26, 2026