
Two-Channel Coding in Injective LLM Representations

Separating "what's used" from "what's remembered" in GPT-2 hidden states.

This project builds on Language Models are Injective and Hence Invertible (SipIt) and asks a follow-up question: is the information that makes hidden states invertible the same information the model uses for next-token prediction, or does it live in a separate channel?

The answer is that GPT-2 appears to maintain two largely distinct channels in its hidden representations:

  1. A low-dimensional behavior channel that controls next-token predictions
  2. A high-dimensional identity channel that preserves prompt-specific information and drives invertibility

We show this at two levels of description: continuous geometry (gradient-based subspace decomposition) and sparse mechanisms (CLT feature analysis).

Setup

git clone https://github.com/giorgosnikolaou/SIPIT.git
cd SIPIT
pip install -e .
pip install datasets matplotlib scipy

For CLT analysis, also install CLT-Forge in a Python 3.11 environment.

Method

Defining the behavior subspace

For a layer l and a set of calibration prompts, compute the gradient of the next-token loss with respect to the last-token hidden state h at layer l:

g_i = grad_{h^l} (-log p(x_{t+1} | x_{<=t}))

Form the empirical gradient covariance:

C_l = (1/N) sum_i g_i g_i^T

The top-k eigenvectors of C_l span the behavior subspace B_l(k). The orthogonal complement B_l(k)^perp is the identity subspace. This is a first-order causal proxy: directions with large gradient energy are directions where changing the hidden state changes predictions most.
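A minimal sketch of this decomposition (function name illustrative, not from the repo), assuming the per-prompt gradients have already been collected as vectors:

```python
import numpy as np

def behavior_subspace(grads, k):
    """Top-k eigenvectors of the empirical gradient covariance
    C = (1/N) * sum_i g_i g_i^T, spanning the behavior subspace B(k)."""
    G = np.stack(grads)                   # (N, d) per-prompt gradients
    C = G.T @ G / G.shape[0]              # (d, d) gradient covariance
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    return eigvecs[:, -k:]                # (d, k) orthonormal basis of B(k)
```

The identity subspace needs no separate basis: projecting with I - B B^T lands in the orthogonal complement.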

Measuring utility

Insert a projection P_B (behavior) or P_I = I - P_B (identity) at layer l as a forward hook, then measure perplexity degradation (dPPL) on held-out data. Also measure KL divergence and top-1 token agreement against the unmodified model.
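A sketch of such a hook, assuming `B` is an orthonormal basis of the behavior subspace and that the hooked module returns the hidden-state tensor (possibly inside a tuple, as Hugging Face GPT-2 blocks do); the function name is illustrative:

```python
import torch

def make_projection_hook(B, keep="behavior"):
    """Forward hook applying P_B = B B^T (keep='behavior') or
    P_I = I - B B^T (keep='identity') to the last-token hidden state."""
    P = B @ B.T                                   # (d, d) orthogonal projector
    if keep == "identity":
        P = torch.eye(B.shape[0], dtype=B.dtype) - P

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[..., -1, :] = hidden[..., -1, :] @ P  # P is symmetric
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook
```

For a Hugging Face GPT-2 model this would be registered with `model.transformer.h[l].register_forward_hook(hook)` before running the perplexity evaluation.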

Measuring leakage (robust invertibility)

For each prefix p and true next token y, sweep all v in the vocabulary and compute:

margin(p) = min_{v != y} || g(h(p+y)) - g(h(p+v)) ||_2

where g is the transform (projection + quantization). This one-step margin is the object that controls SipIt's local verifier: if the margin is large, the token is uniquely recoverable from the (possibly quantized) hidden state.
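The margin itself reduces to a nearest-neighbor distance. A sketch (names illustrative), assuming the transformed states for the true continuation and for all vocabulary alternatives have been precomputed:

```python
import numpy as np

def one_step_margin(g_true, g_alts):
    """One-step inversion margin: L2 distance from the transformed hidden
    state of the true continuation g(h(p+y)) to its nearest vocabulary
    alternative, i.e. min over v != y of ||g(h(p+y)) - g(h(p+v))||_2."""
    return float(np.linalg.norm(g_alts - g_true, axis=-1).min())
```

A margin of zero means two continuations collide under the transform and the token is no longer uniquely recoverable.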

CLT feature analysis

Train a cross-layer transcoder (CLT) on GPT-2 and decompose hidden states into sparse feature activations a^l(p). Define:

  • Support code: s(p) = 1[a^l(p) > 0] (binary pattern of which features fire)
  • Amplitude code: the continuous activation values on active features

Compute support-code margins (minimum Hamming distance to any other token's support pattern) and classify features by behavioral impact.
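The two codes and the support-code margin can be sketched as follows (function names illustrative), given a matrix of feature activations:

```python
import numpy as np

def support_code(acts):
    """Binary support code s(p) = 1[a^l(p) > 0]: which features fire."""
    return (acts > 0).astype(np.uint8)

def support_margin(codes, i):
    """Minimum Hamming distance from support code i to any other code."""
    dists = (codes != codes[i]).sum(axis=1)
    return int(np.delete(dists, i).min())
```

The amplitude code is simply the continuous values `acts[codes > 0]` restricted to the active features.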

Results

Study 1: Continuous geometry

160 utility configs, 128 leakage configs, GPT-2 small, layers 6 and 11.

The behavior subspace captures utility far more efficiently than the identity complement, while the identity complement carries most of the inversion margin:

| Layer 6, 8-bit | Behavior dPPL | Identity dPPL | Behavior margin (% of full) | Identity margin (% of full) |
|---|---|---|---|---|
| k=32  | 18,953 | 1,079   | 18% | 98% |
| k=64  | 4,233  | 4,299   | 27% | 96% |
| k=128 | 1,731  | 41,638  | 39% | 92% |
| k=256 | 893    | 673,984 | 55% | 84% |

At k=128, the behavior projection incurs roughly 24x less perplexity degradation (dPPL 1,731 vs 41,638) while retaining only 39% of the inversion margin (vs 92% for identity). This is the privacy-utility frontier: you can keep what the model uses for predictions while destroying most of the information needed for prompt recovery.

The crossover at k~64 (layer 6) is visible in the plots. Below it, the behavior subspace is too small to carry the prediction-relevant information (the identity complement actually preserves utility better); above it, the behavior subspace suffices for predictions while the identity complement is what sustains invertibility.

Gradient energy spectrum: the top-32 eigenvectors capture ~30% of gradient energy, the top-128 ~56%, and the top-256 ~74%. The behavior-relevant directions are spread across many dimensions, but far more concentrated than a uniform spread over the full 768-dimensional space would be.
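These fractions come directly from the eigenvalue spectrum of the gradient covariance; a minimal sketch (function name illustrative):

```python
import numpy as np

def energy_fraction(eigvals, k):
    """Fraction of total gradient energy (trace of C_l) captured by the
    top-k eigenvalues of the gradient covariance."""
    s = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    return float(s[:k].sum() / s.sum())
```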

Study 2: CLT scaffold and backbone

CLT with d_latent=12288, trained on GPT-2 small.

Feature extraction across 12 layers reveals a clear partition:

  • 72% of alive features (75,081/103,732) have zero behavioral score (logit-silent). These are scaffold features that fire in response to input but do not measurably affect next-token predictions.
  • 28% of alive features (28,651/103,732) have nonzero behavioral score. These are backbone features that drive predictions.

The scaffold is densest in early layers (L0-L1: 90%+ scaffold) and late layers (L10-L11: 95%+ scaffold). Backbone features concentrate in middle layers (L4-L6: ~50% backbone).

Study 3: Mathematical objects

The support code is empirically injective. Out of 500 randomly sampled prompt-position pairs, all 500 had unique binary support patterns. Each prompt maps to a distinct cell in the piecewise-linear feature space.

Support-code margin exceeds hidden-state margin. In a scaled evaluation over 100 prefixes with full vocabulary sweep:

| Margin type | Median | p10 | p90 |
|---|---|---|---|
| Support code (Hamming) | 37.0 | 15.0 | 101.3 |
| Hidden state (L2) | 24.8 | | |
| Amplitude (L2) | 4.8 | | |

The binary pattern of which features fire is more discriminative than the full continuous hidden state for token identification. Amplitudes alone are poor discriminators (median 4.8), confirming that the identity information lives primarily in the combinatorial support pattern, not the continuous signal.

Support and amplitude distances are negatively correlated (Spearman rho = -0.451): prompts with similar support patterns tend to have more dissimilar amplitude patterns, suggesting the two channels encode partially independent information.

Pairwise Hamming distances between support codes have median 962 (out of 12,288 features per layer), confirming that prompts live in well-separated combinatorial cells.

Figures

The main 12-panel figure is at artifacts/clt_v2/figures/main_figure_12panel.png. Individual plots are in artifacts/plots/.

Key plots:

  • dppl_vs_k_layer6_b32.png: behavior vs identity utility at layer 6
  • margin_vs_k_layer6_b8.png: inversion margin by subspace dimension
  • privacy_utility_frontier.png: the tradeoff between utility and leakage robustness

Code

two_channel/
  compute_subspace.py      # gradient covariance eigenvectors
  transforms.py            # projection, quantization, noise
  eval_utility.py          # perplexity/KL under projection
  eval_leakage.py          # one-step margin and collision counting
  plot_results.py          # figure generation
  run_full_gpu.py          # full GPU pipeline (subspace + utility + leakage + plots)
  clt_analysis/
    run_stage2.py          # CLT training + feature extraction + scoring
    run_stage3.py          # attribution graphs + math objects + visualizations
    scaffold_backbone.py   # feature classification utilities

Running the experiments

Study 1 (two-channel subspace)

python two_channel/run_full_gpu.py

This runs subspace computation (5k samples), utility evaluation (160 configs), leakage evaluation (128 configs), and plot generation. Takes about 14 hours on an A10G GPU.

Study 2 (CLT analysis)

python two_channel/clt_analysis/run_stage2.py

Trains a CLT, extracts features, computes behavioral scores, and measures support-code margins. Takes about 20 minutes for 300K-token CLT on A10G.

Study 3 (attribution + math objects)

python two_channel/clt_analysis/run_stage3.py --skip a

Generates attribution graphs, computes pairwise distances and cell statistics, runs scaled margin analysis, and produces the 12-panel main figure.

Citation

This project builds on Language Models are Injective and Hence Invertible (SipIt).
