Separating "what's used" from "what's remembered" in GPT-2 hidden states.
This project builds on Language Models are Injective and Hence Invertible (SipIt) and asks a follow-up question: is the information that makes hidden states invertible the same information the model uses for next-token prediction, or does it live in a separate channel?
The answer is that GPT-2 appears to maintain two largely distinct channels in its hidden representations:
- A low-dimensional behavior channel that controls next-token predictions
- A high-dimensional identity channel that preserves prompt-specific information and drives invertibility
We show this at two levels of description: continuous geometry (gradient-based subspace decomposition) and sparse mechanisms (CLT feature analysis).
```bash
git clone https://github.com/giorgosnikolaou/SIPIT.git
cd SIPIT
pip install -e .
pip install datasets matplotlib scipy
```

For CLT analysis, also install CLT-Forge in a Python 3.11 environment.
For a layer l and a set of calibration prompts, compute the gradient of the next-token loss with respect to the last-token hidden state h at layer l:
g_i = grad_{h^l} (-log p(x_{t+1} | x_{<=t}))
Form the empirical gradient covariance:
C_l = (1/N) sum_i g_i g_i^T
The top-k eigenvectors of C_l span the behavior subspace B_l(k). The orthogonal complement B_l(k)^perp is the identity subspace. This is a first-order causal proxy: directions with large gradient energy are directions where changing the hidden state changes predictions most.
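The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's implementation (`compute_subspace.py`); the helper names are ours.

```python
import numpy as np

def behavior_subspace(grads, k):
    """Top-k eigenvectors of the empirical gradient covariance
    C_l = (1/N) sum_i g_i g_i^T.  grads: (N, d) array of per-prompt
    gradients of the next-token loss w.r.t. the layer-l hidden state.
    Returns B (d, k) spanning the behavior subspace, plus all
    eigenvalues sorted in descending order."""
    C = grads.T @ grads / grads.shape[0]      # (d, d) gradient covariance
    evals, evecs = np.linalg.eigh(C)          # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:k]], evals[order]

def projectors(B):
    """Orthogonal projectors onto the behavior subspace (P_B) and its
    orthogonal complement, the identity subspace (P_I = I - P_B)."""
    P_B = B @ B.T
    return P_B, np.eye(B.shape[0]) - P_B
```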
Insert a projection P_B (behavior) or P_I = I - P_B (identity) at layer l as a forward hook, then measure perplexity degradation (dPPL) on held-out data. Also measure KL divergence and top-1 token agreement against the unmodified model.
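A minimal sketch of such a hook in PyTorch, assuming a module whose output carries hidden states of shape `(batch, seq, d)`; the HuggingFace attribute path in the comment is an assumption, not taken from the repo:

```python
import torch

def make_projection_hook(P):
    """Forward hook inserting a projector at layer l: projects the
    last-token hidden state, leaving earlier positions untouched.
    Use P = P_B for the behavior channel or P = I - P_B for identity."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h.clone()
        h[:, -1, :] = h[:, -1, :] @ P.T
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return hook

# Hypothetical usage on a HuggingFace GPT-2 (attribute names assumed):
#   handle = model.transformer.h[6].register_forward_hook(make_projection_hook(P_B))
#   ... measure perplexity (dPPL), KL, and top-1 agreement vs. the clean model ...
#   handle.remove()
```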
For each prefix p and true next token y, sweep all v in the vocabulary and compute:
margin(p) = min_{v != y} || g(h(p+y)) - g(h(p+v)) ||_2
where g is the transform (projection + quantization). This one-step margin is the object that controls SipIt's local verifier: if the margin is large, the token is uniquely recoverable from the (possibly quantized) hidden state.
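The margin computation reduces to a nearest-neighbor distance after applying the transform. A sketch under assumed forms (the repo's `transforms.py` defines the actual projection and quantization variants):

```python
import numpy as np

def transform(h, P, bits=8):
    """Assumed form of the transform g: projection onto P followed by
    symmetric uniform quantization at the given bit width."""
    z = h @ P.T
    scale = np.abs(z).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(z / scale) * scale

def one_step_margin(g_true, g_alts):
    """margin(p) = min_{v != y} || g(h(p+y)) - g(h(p+v)) ||_2.
    g_true: (d,) transformed state for the true continuation y;
    g_alts: (V-1, d) transformed states for every other vocab token."""
    return float(np.linalg.norm(g_alts - g_true, axis=1).min())
```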
Train a cross-layer transcoder (CLT) on GPT-2 and decompose hidden states into sparse feature activations a^l(p). Define:
- Support code: s(p) = 1[a^l(p) > 0] (binary pattern of which features fire)
- Amplitude code: the continuous activation values on active features
Compute support-code margins (minimum Hamming distance to any other token's support pattern) and classify features by behavioral impact.
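The support code and its Hamming margin can be sketched as follows (illustrative helpers, not the repo's API):

```python
import numpy as np

def support_code(a):
    """s(p) = 1[a^l(p) > 0]: binary pattern of which CLT features fire."""
    return (np.asarray(a) > 0).astype(np.uint8)

def support_margin(s_true, s_alts):
    """Minimum Hamming distance from the true token's support pattern
    to the pattern of any alternative next token."""
    return int((np.asarray(s_alts) != s_true).sum(axis=1).min())
```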
160 utility configs, 128 leakage configs, GPT-2 small, layers 6 and 11.
The behavior subspace captures utility far more efficiently than the identity complement, while the identity complement carries most of the inversion margin:
| Layer 6, 8-bit | Behavior dPPL | Identity dPPL | Behavior margin (% of full) | Identity margin (% of full) |
|---|---|---|---|---|
| k=32 | 18,953 | 1,079 | 18% | 98% |
| k=64 | 4,233 | 4,299 | 27% | 96% |
| k=128 | 1,731 | 41,638 | 39% | 92% |
| k=256 | 893 | 673,984 | 55% | 84% |
At k=128, behavior projection preserves 24x more utility (dPPL 1,731 vs 41,638) while retaining only 39% of the inversion margin (vs 92% for identity). This is the privacy-utility frontier: you can keep what the model uses for predictions while destroying most of the information needed for prompt recovery.
The crossover at k~64 (layer 6) is visible in the plots: below it, the behavior subspace is too small to preserve utility (identity projection actually degrades perplexity less); above it, the behavior subspace suffices for predictions while the identity complement is what invertibility needs.
Gradient energy spectrum: the top-32 eigenvectors capture ~30% of gradient energy, the top-128 ~56%, and the top-256 ~74%. Behavior-relevant directions are spread across many dimensions, but gradient energy is far more concentrated than a uniform spread over all 768 dimensions would be.
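The energy fractions quoted above come directly from the eigenvalues of the gradient covariance; a one-liner to reproduce the calculation from any eigenvalue spectrum:

```python
import numpy as np

def energy_fraction(eigvals, k):
    """Fraction of total gradient energy (sum of eigenvalues of C_l)
    captured by the top-k eigenvectors."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return float(lam[:k].sum() / lam.sum())
```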
CLT with d_latent=12288, trained on GPT-2 small.
Feature extraction across 12 layers reveals a clear partition:
- 72% of alive features (75,081/103,732) have zero behavioral score (logit-silent). These are scaffold features that fire in response to input but do not measurably affect next-token predictions.
- 28% of alive features (28,651/103,732) have nonzero behavioral score. These are backbone features that drive predictions.
The scaffold is densest in early layers (L0-L1: 90%+ scaffold) and late layers (L10-L11: 95%+ scaffold). Backbone features concentrate in middle layers (L4-L6: ~50% backbone).
The support code is empirically injective. Out of 500 randomly sampled prompt-position pairs, all 500 had unique binary support patterns. Each prompt maps to a distinct cell in the piecewise-linear feature space.
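The injectivity check amounts to counting distinct binary patterns; a sketch (hypothetical helper, not the repo's code):

```python
import numpy as np

def count_unique_supports(codes):
    """Number of distinct binary support patterns among N samples.
    A count equal to N means every sampled prompt-position pair
    occupies its own combinatorial cell (empirical injectivity)."""
    return len({np.asarray(c, dtype=np.uint8).tobytes() for c in codes})
```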
Support-code margin exceeds hidden-state margin. In a scaled evaluation over 100 prefixes with full vocabulary sweep:
| Margin type | Median | p10 | p90 |
|---|---|---|---|
| Support code (Hamming) | 37.0 | 15.0 | 101.3 |
| Hidden state (L2) | 24.8 | | |
| Amplitude (L2) | 4.8 | | |
The binary pattern of which features fire is more discriminative than the full continuous hidden state for token identification. Amplitudes alone are poor discriminators (median 4.8), confirming that the identity information lives primarily in the combinatorial support pattern, not the continuous signal.
Support and amplitude distances are negatively correlated (Spearman rho = -0.451): prompts with similar support patterns tend to have more dissimilar amplitude patterns, suggesting the two codes carry partially independent information.
Pairwise Hamming distances between support codes have median 962 (out of 12,288 features per layer), confirming that prompts live in well-separated combinatorial cells.
The main 12-panel figure is at artifacts/clt_v2/figures/main_figure_12panel.png. Individual plots are in artifacts/plots/.
Key plots:
- `dppl_vs_k_layer6_b32.png`: behavior vs identity utility at layer 6
- `margin_vs_k_layer6_b8.png`: inversion margin by subspace dimension
- `privacy_utility_frontier.png`: the tradeoff between utility and leakage robustness
```
two_channel/
  compute_subspace.py   # gradient covariance eigenvectors
  transforms.py         # projection, quantization, noise
  eval_utility.py       # perplexity/KL under projection
  eval_leakage.py       # one-step margin and collision counting
  plot_results.py       # figure generation
  run_full_gpu.py       # full GPU pipeline (subspace + utility + leakage + plots)
  clt_analysis/
    run_stage2.py       # CLT training + feature extraction + scoring
    run_stage3.py       # attribution graphs + math objects + visualizations
    scaffold_backbone.py  # feature classification utilities
```
```bash
python two_channel/run_full_gpu.py
```

This runs subspace computation (5k samples), utility evaluation (160 configs), leakage evaluation (128 configs), and plot generation. Takes about 14 hours on an A10G GPU.

```bash
python two_channel/clt_analysis/run_stage2.py
```

Trains a CLT, extracts features, computes behavioral scores, and measures support-code margins. Takes about 20 minutes for a 300K-token CLT on an A10G.

```bash
python two_channel/clt_analysis/run_stage3.py --skip a
```

Generates attribution graphs, computes pairwise distances and cell statistics, runs scaled margin analysis, and produces the 12-panel main figure.
This project builds on:
- Language Models are Injective and Hence Invertible (Nikolaou et al., 2025)
- Transcoders Find Interpretable LLM Feature Circuits (Dunefsky et al., 2024)
- Circuit Tracing (Anthropic, 2025)
- CLT-Forge