
Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study#756

Open
abaybektursun wants to merge 1 commit into openai:main from abaybektursun:negative-results-quant-algo-ttt

Conversation

Contributor

@abaybektursun abaybektursun commented Mar 25, 2026

Summary

Negative results on the 1.1142 BPB stack (GPTQ + XSA-all + BigramHash 3072×112): quantization algorithms, TTT, architecture experiments, and a self-generated GPTQ calibration study.


Self-Generated GPTQ Calibration Study

GPTQ calibration estimates H = X^T X (activation covariance) per layer to guide int6 rounding decisions. We tested whether the model can calibrate itself without any external data.
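The per-layer statistic being calibrated can be sketched in a few lines. This is an illustrative accumulator for H = X^T X over calibration batches, not the repo's actual code; the function name and averaging convention are assumptions.

```python
import numpy as np

def accumulate_hessian(layer_inputs, d):
    """Accumulate H = X^T X over calibration batches for one layer.

    layer_inputs: iterable of (n_tokens, d) activation matrices, one
    per calibration batch. Illustrative sketch only.
    """
    H = np.zeros((d, d))
    n = 0
    for X in layer_inputs:
        H += X.T @ X          # activation covariance (unnormalized)
        n += X.shape[0]
    return H / max(n, 1)      # average so scale is token-count independent
```

The study below varies only where `layer_inputs` comes from; the frozen weights and the GPTQ rounding procedure are identical across rows.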

We trained once (seed 314, 6,942 steps), saved the checkpoint, and ran GPTQ with different calibration sources on the same frozen weights:

| # | Calibration source | Tokens | Time | Sliding BPB | vs val-calib |
|---|---|---|---|---|---|
| 1 | Val data | ~50M | ~5s | 1.11446 | |
| 2 | Autoregressive self-generation | 131K | 186s | 1.11477 | +0.00031 |
| 3 | Random tokens (64 batches) | 131K | 3.4s | 1.11650 | +0.00204 |
| 4 | Random tokens (256×48 batches) | 25M | 35s | 1.11650 | +0.00204 |
| 5 | Gibbs-refined (3 rounds, 64×48) | 6.3M | 24.4s | 1.11663 | +0.00217 |

Row 2: the model generates 64 coherent sequences of 2048 tokens autoregressively from its own learned distribution (temperature=0.8, batch_size=8). No external data accessed. Confirmed on a separate checkpoint (BigramHash 2048×128, 8×H100); the relative gaps are consistent across stacks.
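The self-generation step in row 2 can be sketched as below. `next_token_probs` is a hypothetical hook standing in for a forward pass of the trained model; any autoregressive LM fits. Note that applying temperature to probabilities as p^(1/T) (renormalized) is equivalent to dividing logits by T before the softmax.

```python
import numpy as np

def self_generate_calibration(next_token_probs, vocab_size, n_seqs=64,
                              seq_len=2048, temperature=0.8, seed=0):
    """Sample calibration sequences from the model's own distribution.

    next_token_probs(prefix) -> np.ndarray of shape (vocab_size,), the
    model's next-token distribution (hypothetical hook). No external
    data is accessed at any point.
    """
    rng = np.random.default_rng(seed)
    seqs = []
    for _ in range(n_seqs):
        toks = [int(rng.integers(vocab_size))]     # arbitrary start token
        for _ in range(seq_len - 1):
            p = next_token_probs(toks)
            p = p ** (1.0 / temperature)           # temperature sharpening
            p = p / p.sum()
            toks.append(int(rng.choice(vocab_size, p=p)))
        seqs.append(toks)
    return seqs
```

The generated sequences are then fed through the frozen model exactly like any other calibration batch.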

Findings:

  1. Autoregressive self-generation closes 84% of the gap. The val-vs-random gap is 0.00204 BPB. Autoregressive generation recovers 0.00173 of that, leaving only 0.00031 BPB. The gap is predominantly natural language vs random noise — coherent text from the model's own distribution produces Hessians nearly identical to val data.

  2. The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the training data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real FineWeb text. It is negligible.

  3. Gibbs refinement does not help (1.11663 vs 1.11650 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.

  4. More random tokens do not help. 131K and 25M tokens give identical BPB (1.11650). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.

  5. Self-generated calibration at 1.1165 beats SOTA (from our PR #549, 1.1194) by 0.003 BPB with zero legality risk. Autoregressive self-generation at 1.1148 comes within 0.0003 of val-calibrated performance.

Why random tokens work at int6: The Hessian diagonal and off-diagonal structure are dominated by the model's learned weights — embedding geometry, attention patterns, MLP scales. At 63 grid levels, the rounding decisions are coarse enough that the Hessian quality threshold is low.
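For concreteness, here is a minimal symmetric int6 round-to-nearest sketch showing the 63-level grid (q in [-31, 31]). GPTQ replaces this naive rounding with Hessian-weighted error compensation, but the grid itself is the same; the per-tensor scale here is a simplifying assumption.

```python
import numpy as np

def quantize_int6_symmetric(w):
    """Symmetric int6 round-to-nearest: 63 levels, q in [-31, 31].

    Illustrative sketch; real GPTQ uses per-column scales and
    Hessian-aware rounding, not plain round-to-nearest.
    """
    s = np.abs(w).max() / 31.0                 # quantization scale
    q = np.clip(np.round(w / s), -31, 31)
    return q.astype(np.int8), s

w = np.random.default_rng(0).standard_normal(1024)
q, s = quantize_int6_symmetric(w)
# roundtrip error is bounded by half the grid spacing, s / 2
assert np.max(np.abs(w - q * s)) <= s / 2 + 1e-12
```

With rounding error already capped at half a grid step, small perturbations to the Hessian rarely flip a rounding decision, which is consistent with the low quality threshold observed above.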


Quantization Algorithm Experiments

Quant gap: +0.0036 BPB (pre-quant 1.1341 → roundtrip 1.1377). At int6, GPTQ is near-optimal.

| Technique | Paper | Result | Mechanism |
|---|---|---|---|
| Qronos (ICLR 2026) | arXiv:2505.11695 | +0.0007 ❌ | Re-collects Hessians from quantized activations. At int6, activation mismatch is <0.1%, so the updated Hessians are nearly identical. |
| CDQuant | arXiv:2406.17542 | +0.0005 ❌ | Coordinate descent revisiting columns. At ~0.06 scale-unit spacing, most weights are already at their optimal grid point. |

TTT Experiments (Score-First, Legal)

Same protocol as our merged PR #549. 25 total TTT experiments have failed across two stacks.

| Approach | Params unfrozen | TTT BPB | Baseline | Delta |
|---|---|---|---|---|
| Full TTT | 27.1M (100%) | 1.1146 | 1.1139 | +0.0007 ❌ |
| MLP-down | 8.7M (32%) | 1.1145 | 1.1144 | +0.0001 |
| MLP-all | 17.3M (64%) | 1.1144 | 1.1144 | +0.0000 |

SGD lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR, grad_clip=1.0. Baselines differ per row because each TTT variant freezes different layers, changing the eval-time forward pass.
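The optimizer step and schedule from that protocol can be sketched as follows. This is a minimal reimplementation from the stated hyperparameters (SGD, lr=0.002, momentum=0.9, cosine decay, grad_clip=1.0), not the repo's code; real runs update only the unfrozen parameter subset.

```python
import math
import numpy as np

def ttt_step(w, grad, v, lr, momentum=0.9, grad_clip=1.0):
    """One SGD-with-momentum step with global-norm gradient clipping."""
    gnorm = np.linalg.norm(grad)
    if gnorm > grad_clip:
        grad = grad * (grad_clip / gnorm)      # rescale to norm grad_clip
    v = momentum * v + grad                    # momentum buffer update
    return w - lr * v, v

def cosine_lr(step, total_steps, base_lr=0.002):
    """Cosine decay from base_lr to 0 over the 3-epoch adaptation."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

Each 32K-token chunk gets its own short adaptation run from the shared checkpoint, so early tokens in a chunk see an unadapted model, which is part of the tradeoff discussed below.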

Why TTT fails on this stack but worked on our PR #549 (−0.0025 BPB):

  • XSA-all already captures the inter-document context patterns that TTT was adapting to on the previous stack
  • At 27M params, score-first TTT cannot overcome the forgetting/adaptation tradeoff — early chunks get no benefit, and the model is too small for late-chunk gains to compensate

Architecture and Eval-Time Experiments

| Technique | Result | Mechanism |
|---|---|---|
| Spectral Init (λ=10 on QKV, arXiv:2603.07162) | 1.52 BPB, 650ms/step ❌ | λ=10 is 200× the Xavier init magnitude at 27M params; 924 steps in 600s vs 6,950 baseline. The paper tested ~100M-parameter models. |
| SLOT bias (512-dim additive, 3 AdamW steps/chunk) | +0.0013 ❌ | A global shift cannot capture per-document patterns; the final-norm → logit pipeline is already calibrated. |

What's Exhausted

  • Quant algorithm: Qronos, CDQuant both negative
  • Eval-time adaptation: 3× TTT + SLOT all negative
  • Architecture: Spectral Init catastrophic; Gated Attention (+0.0011, PR #609), DiffTransformer (1.5× slower, PR #418), Attention Residuals (54% slower, PR #362) all dead

Untested: Non-uniform quantization grid, rate-distortion quantization (CERWU), QK-Norm, Peri-LN.

🤖 Generated with Claude Code

…PTQ stack

6 experiments on the current SOTA stack (1.1142 BPB), all negative:
- Qronos iterative Hessian (3 iters): +0.0007 worse
- CDQuant coordinate descent (3 passes): +0.0005 worse
- Full TTT (all params): +0.0001 worse
- MLP-down-only TTT: +0.0001 neutral
- MLP-all TTT: +0.0001 neutral

Key finding: At int6, GPTQ algorithm is near-optimal. Remaining quant headroom
is in the grid (what values to quantize to), not the algorithm (how to assign).
TTT is dead on this stack — 25 total failed attempts across two stacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Non-record: Negative results — quantization algorithms & TTT on val-GPTQ stack Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study Mar 26, 2026