Separating "what's used" from "what's remembered" in GPT-2 hidden states.
This project builds on Language Models are Injective and Hence Invertible (SipIt) and asks a follow-up question: is the information that makes hidden states invertible the same information the model uses for next-token prediction, or does it live in a separate channel?
The answer is that GPT-2 appears to maintain two largely distinct channels in its hidden representations:
- A low-dimensional behavior channel that controls next-token predictions
- A high-dimensional identity channel that preserves prompt-specific information and drives invertibility
We show this at two levels of description: continuous geometry (gradient-based subspace decomposition) and sparse mechanisms (CLT feature analysis).
```bash
git clone https://github.com/giorgosnikolaou/SIPIT.git
cd SIPIT
pip install -e .
pip install datasets matplotlib scipy
```

For CLT analysis, also install CLT-Forge in a Python 3.11 environment.
For a layer l and a set of calibration prompts, compute the gradient of the next-token loss with respect to the last-token hidden state h at layer l:
g_i = grad_{h^l} (-log p(x_{t+1} | x_{<=t}))
Form the empirical gradient covariance:
C_l = (1/N) sum_i g_i g_i^T
The top-k eigenvectors of C_l span the behavior subspace B_l(k). The orthogonal complement B_l(k)^perp is the identity subspace. This is a first-order causal proxy: directions with large gradient energy are directions where changing the hidden state changes predictions most.
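The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's implementation (`compute_subspace.py`); the helper names are ours.

```python
import numpy as np

def behavior_subspace(grads, k):
    """Top-k eigenvectors of the empirical gradient covariance
    C_l = (1/N) sum_i g_i g_i^T.  grads: (N, d) array of per-prompt
    gradients of the next-token loss w.r.t. the layer-l hidden state.
    Returns B (d, k) spanning the behavior subspace, plus all
    eigenvalues sorted in descending order."""
    C = grads.T @ grads / grads.shape[0]      # (d, d) gradient covariance
    evals, evecs = np.linalg.eigh(C)          # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:k]], evals[order]

def projectors(B):
    """Orthogonal projectors onto the behavior subspace (P_B) and its
    orthogonal complement, the identity subspace (P_I = I - P_B)."""
    P_B = B @ B.T
    return P_B, np.eye(B.shape[0]) - P_B
```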
Insert a projection P_B (behavior) or P_I = I - P_B (identity) at layer l as a forward hook, then measure perplexity degradation (dPPL) on held-out data. Also measure KL divergence and top-1 token agreement against the unmodified model.
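A minimal sketch of such a hook in PyTorch, assuming a module whose output carries hidden states of shape `(batch, seq, d)`; the HuggingFace attribute path in the comment is an assumption, not taken from the repo:

```python
import torch

def make_projection_hook(P):
    """Forward hook inserting a projector at layer l: projects the
    last-token hidden state, leaving earlier positions untouched.
    Use P = P_B for the behavior channel or P = I - P_B for identity."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h.clone()
        h[:, -1, :] = h[:, -1, :] @ P.T
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return hook

# Hypothetical usage on a HuggingFace GPT-2 (attribute names assumed):
#   handle = model.transformer.h[6].register_forward_hook(make_projection_hook(P_B))
#   ... measure perplexity (dPPL), KL, and top-1 agreement vs. the clean model ...
#   handle.remove()
```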
For each prefix p and true next token y, sweep all v in the vocabulary and compute:
margin(p) = min_{v != y} || g(h(p+y)) - g(h(p+v)) ||_2
where g is the transform (projection + quantization). This one-step margin is the object that controls SipIt's local verifier: if the margin is large, the token is uniquely recoverable from the (possibly quantized) hidden state.
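The margin computation reduces to a nearest-neighbor distance after applying the transform. A sketch under assumed forms (the repo's `transforms.py` defines the actual projection and quantization variants):

```python
import numpy as np

def transform(h, P, bits=8):
    """Assumed form of the transform g: projection onto P followed by
    symmetric uniform quantization at the given bit width."""
    z = h @ P.T
    scale = np.abs(z).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(z / scale) * scale

def one_step_margin(g_true, g_alts):
    """margin(p) = min_{v != y} || g(h(p+y)) - g(h(p+v)) ||_2.
    g_true: (d,) transformed state for the true continuation y;
    g_alts: (V-1, d) transformed states for every other vocab token."""
    return float(np.linalg.norm(g_alts - g_true, axis=1).min())
```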
Train a cross-layer transcoder (CLT) on GPT-2 and decompose hidden states into sparse feature activations a^l(p). Define:
- Support code: s(p) = 1[a^l(p) > 0] (binary pattern of which features fire)
- Amplitude code: the continuous activation values on active features
Compute support-code margins (minimum Hamming distance to any other token's support pattern) and classify features by behavioral impact.
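The support code and its Hamming margin can be sketched as follows (illustrative helpers, not the repo's API):

```python
import numpy as np

def support_code(a):
    """s(p) = 1[a^l(p) > 0]: binary pattern of which CLT features fire."""
    return (np.asarray(a) > 0).astype(np.uint8)

def support_margin(s_true, s_alts):
    """Minimum Hamming distance from the true token's support pattern
    to the pattern of any alternative next token."""
    return int((np.asarray(s_alts) != s_true).sum(axis=1).min())
```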
160 utility configs, 128 leakage configs, GPT-2 small, layers 6 and 11.
The behavior subspace captures utility far more efficiently than the identity complement, while the identity complement carries most of the inversion margin:
| Layer 6, 8-bit | Behavior dPPL | Identity dPPL | Behavior margin (% of full) | Identity margin (% of full) |
|---|---|---|---|---|
| k=32 | 18,953 | 1,079 | 18% | 98% |
| k=64 | 4,233 | 4,299 | 27% | 96% |
| k=128 | 1,731 | 41,638 | 39% | 92% |
| k=256 | 893 | 673,984 | 55% | 84% |
At k=128, behavior projection preserves 24x more utility (dPPL 1,731 vs 41,638) while retaining only 39% of the inversion margin (vs 92% for identity). This is the privacy-utility frontier: you can keep what the model uses for predictions while destroying most of the information needed for prompt recovery.
The crossover at k~64 (layer 6) is visible in the plots: below it, the behavior subspace is too small to preserve utility (identity projection actually degrades perplexity less); above it, the behavior subspace suffices for predictions while the identity complement is what invertibility needs.
Gradient energy spectrum: the top-32 eigenvectors capture ~30% of gradient energy, the top-128 ~56%, and the top-256 ~74%. Behavior-relevant directions are spread across many dimensions, but gradient energy is far more concentrated than a uniform spread over all 768 dimensions would be.
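The energy fractions quoted above come directly from the eigenvalues of the gradient covariance; a one-liner to reproduce the calculation from any eigenvalue spectrum:

```python
import numpy as np

def energy_fraction(eigvals, k):
    """Fraction of total gradient energy (sum of eigenvalues of C_l)
    captured by the top-k eigenvectors."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return float(lam[:k].sum() / lam.sum())
```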
CLT with d_latent=12288, trained on GPT-2 small.
Feature extraction across 12 layers reveals a clear partition:
- 72% of alive features (75,081/103,732) have zero behavioral score (logit-silent). These are scaffold features that fire in response to input but do not measurably affect next-token predictions.
- 28% of alive features (28,651/103,732) have nonzero behavioral score. These are backbone features that drive predictions.
The scaffold is densest in early layers (L0-L1: 90%+ scaffold) and late layers (L10-L11: 95%+ scaffold). Backbone features concentrate in middle layers (L4-L6: ~50% backbone).
The support code is empirically injective. Out of 500 randomly sampled prompt-position pairs, all 500 had unique binary support patterns. Each prompt maps to a distinct cell in the piecewise-linear feature space.
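The injectivity check amounts to counting distinct binary patterns; a sketch (hypothetical helper, not the repo's code):

```python
import numpy as np

def count_unique_supports(codes):
    """Number of distinct binary support patterns among N samples.
    A count equal to N means every sampled prompt-position pair
    occupies its own combinatorial cell (empirical injectivity)."""
    return len({np.asarray(c, dtype=np.uint8).tobytes() for c in codes})
```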
Support-code margin exceeds hidden-state margin. In a scaled evaluation over 100 prefixes with full vocabulary sweep:
| Margin type | Median | p10 | p90 |
|---|---|---|---|
| Support code (Hamming) | 37.0 | 15.0 | 101.3 |
| Hidden state (L2) | 24.8 | | |
| Amplitude (L2) | 4.8 | | |
The binary pattern of which features fire is more discriminative than the full continuous hidden state for token identification. Amplitudes alone are poor discriminators (median 4.8), confirming that the identity information lives primarily in the combinatorial support pattern, not the continuous signal.
Support and amplitude distances are negatively correlated (Spearman rho = -0.451): prompts with similar support patterns tend to have more dissimilar amplitude patterns, suggesting the two codes carry partially independent information.
Pairwise Hamming distances between support codes have median 962 (out of 12,288 features per layer), confirming that prompts live in well-separated combinatorial cells.
The main 12-panel figure is at artifacts/clt_v2/figures/main_figure_12panel.png. Individual plots are in artifacts/plots/.
Key plots:
- `dppl_vs_k_layer6_b32.png`: behavior vs identity utility at layer 6
- `margin_vs_k_layer6_b8.png`: inversion margin by subspace dimension
- `privacy_utility_frontier.png`: the tradeoff between utility and leakage robustness
```
two_channel/
  compute_subspace.py   # gradient covariance eigenvectors
  transforms.py         # projection, quantization, noise
  eval_utility.py       # perplexity/KL under projection
  eval_leakage.py       # one-step margin and collision counting
  plot_results.py       # figure generation
  run_full_gpu.py       # full GPU pipeline (subspace + utility + leakage + plots)
  clt_analysis/
    run_stage2.py       # CLT training + feature extraction + scoring
    run_stage3.py       # attribution graphs + math objects + visualizations
    scaffold_backbone.py  # feature classification utilities
```
```bash
python two_channel/run_full_gpu.py
```

This runs subspace computation (5k samples), utility evaluation (160 configs), leakage evaluation (128 configs), and plot generation. Takes about 14 hours on an A10G GPU.

```bash
python two_channel/clt_analysis/run_stage2.py
```

Trains a CLT, extracts features, computes behavioral scores, and measures support-code margins. Takes about 20 minutes for a 300K-token CLT on an A10G.

```bash
python two_channel/clt_analysis/run_stage3.py --skip a
```

Generates attribution graphs, computes pairwise distances and cell statistics, runs scaled margin analysis, and produces the 12-panel main figure.
This project builds on:
- Language Models are Injective and Hence Invertible (Nikolaou et al., 2025)
- Transcoders Find Interpretable LLM Feature Circuits (Dunefsky et al., 2024)
- Circuit Tracing (Anthropic, 2025)
- CLT-Forge