Sparse Autoencoders (SAEs) are a class of models used to find interpretable features in a model's activation space. They have become a useful tool for understanding language models' internals since they were introduced last year.
The goal of this repo is to use SAEs to extract novel features from a model that, unlike current-generation LLMs, is superhuman at its task. This paper from DeepMind puts AlphaFold and AlphaZero in that category. We train AlphaZero to play the board game Othello and use it as our subject model.
View the Interactive Visualization

This is the quickest way to explore the extracted features:
- Loading takes 3-4 seconds initially, faster on subsequent loads due to caching.
- Qualitatively, SAEs with L1 penalties of 3 and 4 and feature counts of 1024 and 2048 are the most interpretable.
If you prefer to run the visualization locally:
cd vis
./download_vis_data.sh # You may need to chmod +x first
python -m http.server 8000
Then open http://localhost:8000 in your browser. It should look like this:
This project requires only 4 dependencies: torch, numpy, tqdm, and wandb (optional, for logging). See requirements.txt for details.
To train your own Sparse Autoencoders:

- Download the training data (model activations): ./sae/download_sae_data.sh
- Run the training script: python train_sae.py
- All data is hosted on Hugging Face.
- To generate your own data, use:
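The data-generation command itself isn't shown here. As a rough, hypothetical illustration of what collecting these activations could look like, here is a minimal sketch using a PyTorch forward hook; the stand-in network, module names, and output file are assumptions rather than the repo's actual interface:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of activation collection; the repo's real data-generation script
# may look different. TinyNet is only a stand-in for the trained AlphaZero network.
class TinyNet(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)  # stand-in for the layer-2 residual stream

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = TinyNet()
model.eval()

activations = []

def save_activation(module, inputs, output):
    activations.append(output.detach().cpu())  # record the hooked layer's output

# A forward hook records intermediate activations without modifying the model
hook = model.layer2.register_forward_hook(save_activation)

with torch.no_grad():
    for _ in range(10):                  # stand-in for iterating over board positions
        model(torch.randn(64, 256))

hook.remove()
torch.save(torch.cat(activations), "activations.pt")  # assumed output path/format
```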
This repo uses a modified version of the AlphaZero implementation from alpha-zero-general, trained to play Othello on a 6x6 board. (Othello is a popular board game typically played on an 8x8 board.) To get a better feel for the extracted features, we recommend playing a game online.
Key files:
- model.py: Contains the neural network architecture. We've replaced the original 8-layer ConvNet with 4 residual blocks, each with a feed-forward layer similar to those in Transformers (see the sketch after this list). This change allows training on an M1 Air in just 15 minutes while achieving performance similar to the original implementation, which took 3 days on an NVIDIA K80, a roughly 250x improvement without using a GPU.
- coach.py: Handles self-play and evaluation.
- mcts.py: Implements Monte Carlo Tree Search.
- NetworkWrapper.py: Contains the training loop for the neural network.
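For intuition about the architecture change described under model.py, here is a hedged sketch of what a residual block with a Transformer-style feed-forward sublayer might look like; the dimensions, naming, and output heads are assumptions, not the repo's exact code:

```python
import torch
import torch.nn as nn

class ResidualFFBlock(nn.Module):
    """Residual block with a Transformer-style feed-forward sublayer (illustrative only)."""

    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        # The block's output is added back onto its input, forming a "residual stream"
        return x + self.ff(self.norm(x))

class TinyOthelloNet(nn.Module):
    """Four residual blocks over a flattened 6x6 board embedding (sizes assumed)."""

    def __init__(self, board_cells=36, dim=256, n_blocks=4):
        super().__init__()
        self.embed = nn.Linear(board_cells, dim)
        self.blocks = nn.ModuleList(ResidualFFBlock(dim) for _ in range(n_blocks))
        self.policy_head = nn.Linear(dim, board_cells + 1)  # one logit per move, plus pass
        self.value_head = nn.Linear(dim, 1)

    def forward(self, board):
        x = self.embed(board.flatten(start_dim=1).float())
        for block in self.blocks:
            x = block(x)
        return torch.log_softmax(self.policy_head(x), dim=-1), torch.tanh(self.value_head(x))
```

The residual structure is also why the 256-dimensional intermediate vector (the "residual stream" referenced below) is a natural target for the SAEs.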
To train your own model:
python train_alphazero.py
To play against your trained model or pit it against another AI:
python othello/play.py --human # Play against the AI
python othello/play.py --games 10 # AI vs AI for 10 games
A Sparse Autoencoder is a simple 2-layer feed-forward network. It takes an input vector, expands it into a higher-dimensional space, and then compresses it back to the original dimension. Here's the simplest possible autoencoder:
import torch
import torch.nn as nn

input = torch.randn(5)
encoder, decoder = nn.Linear(5, 20), nn.Linear(20, 5)  # Two feed-forward layers
encoded = torch.relu(encoder(input))  # ReLU is applied after layer 1
output = decoder(encoded)  # Inputs are reconstructed from the encoded representation
l1_penalty = 5  # L1 penalty coefficient used to control sparsity
loss = ((output - input) ** 2).sum() + l1_penalty * encoded.sum()  # Reconstruction error + sparsity loss (encoded is non-negative after ReLU, so .sum() is its L1 norm)
The theory is that encoded[i] is more interpretable than input[i].
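Wrapping the same idea in a module makes the sparsity penalty and reconstruction loss explicit; below is a minimal sketch of such a class, which matches the repo's actual train_sae.py only in spirit:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: expand, apply ReLU, reconstruct, penalize the L1 norm of the code."""

    def __init__(self, input_dim=256, n_features=1024, l1_penalty=3.0):
        super().__init__()
        self.encoder = nn.Linear(input_dim, n_features)
        self.decoder = nn.Linear(n_features, input_dim)
        self.l1_penalty = l1_penalty

    def forward(self, x):
        encoded = torch.relu(self.encoder(x))  # sparse feature activations
        reconstructed = self.decoder(encoded)
        return reconstructed, encoded

    def loss(self, x):
        reconstructed, encoded = self(x)
        recon_loss = ((reconstructed - x) ** 2).sum(dim=-1).mean()
        sparsity_loss = self.l1_penalty * encoded.abs().sum(dim=-1).mean()
        return recon_loss + sparsity_loss
```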
I've trained 25 SAEs with L1 penalties [1, 2, 3, 4, 5] and feature counts [256, 512, 1024, 2048, 4096]. All models were trained with:
- Batch size: 16384
- Learning Rate: 0.0001 (Selected through a hyperparameter sweep)
- Input: Layer 2 residual stream (dim = 256) activations from AlphaZero
- Number of Training Examples: 3M activations
- Epochs: 12000
- Hardware: RTX 4090
- Training Time: 1-30 minutes depending on the model size
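To tie these settings together, here is a hedged sketch of a training loop using the hyperparameters above; the activation file name and the SparseAutoencoder class from the earlier sketch are assumptions about interfaces, not the repo's code:

```python
import torch

# Assumes the SparseAutoencoder sketch above and a saved tensor of ~3M 256-dim activations
activations = torch.load("activations.pt")   # hypothetical file of AlphaZero activations
sae = SparseAutoencoder(input_dim=256, n_features=1024, l1_penalty=3.0)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

batch_size = 16384
n_steps = 12000                              # the "Epochs" above, treated here as update steps
for step in range(n_steps):
    idx = torch.randint(0, activations.shape[0], (batch_size,))  # sample a random batch
    loss = sae.loss(activations[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```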
Here are some scaling trends observed across different model sizes. Note that the x-axis (number of features) is in log scale! Details of these runs can be found in scaling.json.
- AlphaZero General for reference implementation of AlphaZero.
- Anthropic for publishing their SAE training setup.
- Scaling Scaling Laws with Board Games which influenced a lot of my training decisions.