3 changes: 3 additions & 0 deletions .env.template
@@ -0,0 +1,3 @@
HF_TOKEN=
NDIF_API_KEY=
WANDB_API_KEY=
39 changes: 34 additions & 5 deletions README.md
@@ -1,25 +1,24 @@
# Crosslayer Transcoder


This repository trains Crosslayer Transcoders and variants with PyTorch/Lightning on multiple GPUs via tensor parallelism. It implements Anthropic’s [crosslayer transcoders](https://transformer-circuits.pub/2024/crosscoders/index.html) and related architectures (per‑layer transcoders, MOLTs, SAEs, Matryoshka CLTs) and supports nonlinearities such as ReLU, JumpReLU, TopK, and BatchTopK, for learning human‑interpretable features from LLM activations and building replacement models for [circuit tracing](https://transformer-circuits.pub/2025/attribution-graphs/methods.html).

We want to understand the “brain” of LLMs: what their representations encode and what algorithms emerge from their circuits. To start, we can learn feature dictionaries with [Sparse Autoencoders](https://transformer-circuits.pub/2023/monosemantic-features) to break activations into human‑interpretable features. These tell us _what_ features the representations contain but not _how_ they interact to form circuits and algorithms. For that, we need to sparsify the entire model (we call this a sparse replacement model), not just the representations of a single layer. One approach is [Transcoders](https://arxiv.org/abs/2406.11944), which learn features that approximate MLP components, so that we can swap in a replacement model and trace circuits end to end. [Crosslayer transcoders](https://transformer-circuits.pub/2024/crosscoders/index.html) allow features to affect all subsequent layers, essentially letting features live across layers. This yields smaller and more interpretable circuits and enables [circuit tracing](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) and studies of [LLM biology](https://transformer-circuits.pub/2025/attribution-graphs/biology.html).


## Implemented and Planned Features


> **⚠️ Early Development Disclaimer**
> This repository is in early, active development and is not yet a stable, production-ready package. Expect breaking changes as the codebase evolves, including API changes between commits. Use at your own risk.

### Architectures

- ✅ Per-Layer Transcoder (PLT)
- ✅ Crosslayer Transcoder (CLT)
- ✅ Sparse Mixture of Linear Transforms (MOLT)
- ⏳ Matryoshka CLTs
- ⏳ SAEs (by tweaking the activation data extractor)

### Nonlinearities and Loss Functions

- ✅ ReLU and JumpReLU (via straight-through estimators)
- ✅ TopK
- ✅ BatchTopK (per layer and across layers)
@@ -29,6 +28,7 @@ We want to understand the “brain” of LLMs: what their representations encode
- ✅ Activation standardization

### Training

- ✅ On-demand activation extraction and streaming using a shared-memory activation buffer
- ⚠️ Tensor parallelism using PyTorch DTensor API (requires PyTorch 2.8; comms optimization in progress)
- ⏳ Sparse Kernels
@@ -37,13 +37,13 @@ We want to understand the “brain” of LLMs: what their representations encode
- ✅ Mixed precision (float16 + gradient scaler or bfloat16), gradient accumulation, checkpointing, profiling

### Metrics (logged to WandB during training)

- ✅ Replacement Model Accuracy and KL divergence
- ✅ Dead Features
- ✅ Feature activation frequency and other statistics
- ✅ L0
- ⏳ Replacement Model Score


## Installation

Recommended: use the setup script (it installs uv if needed and creates the venv).
@@ -58,17 +58,47 @@ cd crosslayer-transcoder
```

Notes:

- This will create `.venv/` and install from `pyproject.toml`, using `uv.lock` for reproducibility.
- For GPU installs, ensure you have a compatible PyTorch build for your CUDA setup. If needed, follow the official PyTorch instructions to select the right wheel for your CUDA version.

## Environment Variables

### Required Environment Variables

- `HF_TOKEN` - HuggingFace API token for accessing models and datasets
- `NDIF_API_KEY` - NDIF API key required for training features
- `WANDB_API_KEY` - Weights & Biases API key for experiment tracking and logging

### Setup Options

**Option 1: During installation (recommended)**

The `setup.sh` script will automatically prompt you for these values. You can press Enter to skip any prompt, but note that all three are required for training to work properly.

**Option 2: Manual configuration**

```bash
# Copy the template
cp .env.template .env

# Edit the .env file with your API keys
nano .env # or use your preferred editor

```

### Getting API Keys

- **HuggingFace Token**: Get it from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
- **NDIF API Key**: Sign up [here](https://ndif.us/)
- **Weights & Biases Key**: Get it from [https://wandb.ai/authorize](https://wandb.ai/authorize)
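
Before launching a run, it can help to confirm that all three variables are actually visible to the process. The following is a minimal sketch, not part of the repository: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the check assumes that an empty value (as in a freshly copied `.env.template`) should count as missing.

```python
import os
from typing import Mapping

# Key names come from .env.template; the helper itself is a hypothetical illustration.
REQUIRED_KEYS = ("HF_TOKEN", "NDIF_API_KEY", "WANDB_API_KEY")


def missing_keys(env: Mapping[str, str]) -> list[str]:
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run as a script, this only inspects the current process environment. Values that exist only in `.env` become visible once the file has been loaded, for example by the `load_dotenv()` call in `main.py`.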

## How to Use (Configure and Customize with Lightning CLI)

You can customize almost everything: datasets and activation extraction, model architecture, loss functions, and all training hyperparameters. This works by using PyTorch Lightning’s CLI to read a YAML config that defines which classes to use and how to compose them. By editing a single `config.yaml`, you control the entire run and keep every parameter in one place; you can still override any field from the command line for quick experiments.
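
To make the "YAML defines which classes to use" idea concrete, here is a rough sketch of how a `class_path` / `init_args` entry can be turned into an object. It only illustrates the mechanism: Lightning CLI (through jsonargparse) does considerably more, such as type checking and command-line overrides. The `instantiate` helper is hypothetical, and standard-library classes stand in for model and loss classes.

```python
import importlib
from typing import Any


def instantiate(spec: dict[str, Any]) -> Any:
    """Build an object from a {'class_path': ..., 'init_args': ...} mapping."""
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    kwargs = {
        # Nested specs are built first, which is what makes components composable.
        name: instantiate(value)
        if isinstance(value, dict) and "class_path" in value
        else value
        for name, value in spec.get("init_args", {}).items()
    }
    return cls(**kwargs)


# Stand-in config: in a real run these entries would name model, loss, or data classes.
spec = {
    "class_path": "types.SimpleNamespace",
    "init_args": {
        "name": "demo",
        "ratio": {
            "class_path": "fractions.Fraction",
            "init_args": {"numerator": 3, "denominator": 4},
        },
    },
}
component = instantiate(spec)
```

Swapping a component then amounts to changing one `class_path` string and its `init_args`, without touching the code that consumes the object.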

- Why this is great

- Single source of truth for all settings → easy to reproduce and share
- Composable: swap architectures, losses, data pipelines by changing class entries in YAML
- Discoverable and explicit: every knob is visible in one file, with sane defaults
@@ -129,7 +159,6 @@ The `config` folder contains example configuration files for different architect
3. **It plugs into the same training loop** - multi‑GPU (DDP) works out of the box
4. **Tensor parallelism works automatically** because PyTorch Lightning handles the distributed setup and PyTorch's Distributed Tensor API shards your model across GPUs without requiring changes to your component code


## Testing

Run the test suite to ensure everything is working correctly:
2 changes: 2 additions & 0 deletions crosslayer_transcoder/main.py
@@ -4,6 +4,7 @@
"""

import os
from dotenv import load_dotenv

import lightning as L
from lightning.pytorch.cli import LightningCLI
@@ -29,6 +30,7 @@ def add_arguments_to_parser(self, parser):

def main():
    """Main entry point for training."""
    load_dotenv()
    # Set up wandb directories
    os.environ.setdefault("WANDB_DIR", f"{os.getcwd()}/wandb")
    os.environ.setdefault("WANDB_CACHE_DIR", f"{os.getcwd()}/wandb_cache")
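
The precedence in `main()` is worth spelling out: python-dotenv's `load_dotenv()` does not override variables that are already set in the environment by default, and `setdefault` likewise only fills in a value when the key is absent. The sketch below isolates the `setdefault` part so it can be tried without touching real settings. `set_wandb_dirs` is a hypothetical helper, not a function from the repository.

```python
import os
from typing import MutableMapping


def set_wandb_dirs(env: MutableMapping[str, str], cwd: str) -> None:
    """Fill in W&B directories only where the caller has not set them."""
    env.setdefault("WANDB_DIR", f"{cwd}/wandb")
    env.setdefault("WANDB_CACHE_DIR", f"{cwd}/wandb_cache")


if __name__ == "__main__":
    # Mirrors what main() does, using the real process environment.
    set_wandb_dirs(os.environ, os.getcwd())
    print(os.environ["WANDB_DIR"], os.environ["WANDB_CACHE_DIR"])
```

A `WANDB_DIR` exported in the shell or defined in `.env` therefore takes priority over the working-directory default.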
1 change: 1 addition & 0 deletions pyproject.toml
@@ -23,6 +23,7 @@ dependencies = [
"transformers>=4.46.0",
"numpy>=1.24.0",
"jsonargparse[signatures]>=4.27.7",
"dotenv>=0.9.9",
]

[project.optional-dependencies]
76 changes: 76 additions & 0 deletions setup.sh
@@ -54,6 +54,82 @@ if [ $? -eq 0 ]; then
    # Fix permissions for all executables in .venv/bin
    chmod +x .venv/bin/*

    # Setup environment variables
    echo ""
    echo -e "${BLUE}🔑 Setting up environment variables...${NC}"

    # Create .env from template if it doesn't exist
    if [ ! -f ".env" ]; then
        if [ -f ".env.template" ]; then
            cp .env.template .env
            echo -e "${GREEN}✅ Created .env file from template${NC}"
        else
            echo -e "${YELLOW}⚠️ .env.template not found, creating .env file${NC}"
            touch .env
        fi
    else
        echo -e "${GREEN}✅ .env file already exists${NC}"
    fi

    # Function to prompt for environment variable
    prompt_env_var() {
        local var_name="$1"
        local var_description="$2"
        local current_value="${!var_name}"

        # Check if already set in environment
        if [ -n "$current_value" ]; then
            echo -e "${GREEN}✅ $var_name already set in environment${NC}"
            # Ensure it's in .env file
            if ! grep -q "^${var_name}=" .env 2>/dev/null; then
                echo "${var_name}=${current_value}" >> .env
            fi
            return
        fi

        # Check if set in .env file
        if [ -f ".env" ]; then
            local env_value=$(grep "^${var_name}=" .env | cut -d '=' -f 2-)
            if [ -n "$env_value" ]; then
                echo -e "${GREEN}✅ $var_name already set in .env file${NC}"
                export "${var_name}=${env_value}"
                return
            fi
        fi

        # Prompt user
        echo ""
        echo -e "${YELLOW}${var_description}${NC}"
        echo -e "${BLUE}[Press Enter to skip]${NC}"
        read -p "${var_name}=" user_input

        # Update .env file
        if grep -q "^${var_name}=" .env 2>/dev/null; then
            # Update existing line
            sed -i.bak "s|^${var_name}=.*|${var_name}=${user_input}|" .env && rm .env.bak
        else
            # Add new line
            echo "${var_name}=${user_input}" >> .env
        fi

        # Export if not empty
        if [ -n "$user_input" ]; then
            export "${var_name}=${user_input}"
            echo -e "${GREEN}✅ $var_name set and exported${NC}"
        else
            echo -e "${YELLOW}⚠️ $var_name left empty (required for training)${NC}"
        fi
    }

    # Prompt for each environment variable
    prompt_env_var "HF_TOKEN" "HuggingFace API token - required for model/dataset access"
    prompt_env_var "NDIF_API_KEY" "NDIF API key - required for training"
    prompt_env_var "WANDB_API_KEY" "Weights & Biases API key - required for logging"

    echo ""
    echo -e "${BLUE}📝 Environment variables saved to .env file${NC}"
    echo ""

    # Add uv to PATH permanently by updating shell profile
    UV_PATH_EXPORT='export PATH="$HOME/.local/bin:$PATH"'
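
The update step in `prompt_env_var` splices the user's input into a `sed` expression that uses `|` as its delimiter. A value containing `|` would break that expression, and `&` in the replacement would expand to the matched text. For comparison, here is a sketch of the same replace-or-append logic in Python, where the value is never interpreted as a pattern. `upsert_env_line` is a hypothetical helper, not code from this PR. Unlike the `sed` command, it rewrites only the first matching line.

```python
def upsert_env_line(text: str, key: str, value: str) -> str:
    """Return .env content with `key` set to `value`, replacing or appending."""
    lines = text.splitlines()
    prefix = f"{key}="
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            # Replace the existing assignment in place.
            lines[i] = prefix + value
            break
    else:
        # Key not present yet: append a new assignment.
        lines.append(prefix + value)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    template = "HF_TOKEN=\nNDIF_API_KEY=\nWANDB_API_KEY=\n"
    print(upsert_env_line(template, "HF_TOKEN", "example|token&value"), end="")
```

The function works on strings only, so reading and writing the actual `.env` file is left to the caller.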

24 changes: 23 additions & 1 deletion uv.lock