hannamw/MIB-circuit-track

This repository provides code for the circuit localization track of the Mechanistic Interpretability Benchmark, including code for circuit discovery and evaluation.

Dependencies

This code requires only EAP-IG (included as a submodule) and tabulate. You can pull the submodule with git submodule update --init --recursive. Once it is pulled, install the dependencies by running pip install . in this directory. If you wish to visualize the circuits you find, you may also want to run pip install EAP-IG[viz], which installs the pygraphviz package; installing pygraphviz can be challenging, which is why it is excluded from the default dependencies. Our code was tested with torch == 2.4.1.

If you use the oa.py script, you will additionally need nnsight. We used nnsight==0.2.15.
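
Putting this together, a typical setup sequence looks like the following (the EAP-IG[viz] line is optional and only needed for visualization; the nnsight line is only needed if you plan to use oa.py):

git submodule update --init --recursive
pip install .
pip install EAP-IG[viz]
pip install nnsight==0.2.15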

Circuit Discovery

Overview of the circuit localization track.

Here, we describe how to run the circuit discovery methods that we compare in the paper. In general, circuit discovery is run as follows:

python run_attribution.py
--models [MODELS]
--tasks [TASKS]
--method [METHOD]
--level [LEVEL="edge"]
--ablation [ABLATION="patching"]
--batch-size [BATCH_SIZE=20]
--circuit-dir [CIRCUIT-DIR="circuits/"]

This will iterate over each model and task specified, producing an attribution graph file for each pair. Each entry of MODELS should be in {interpbench, gpt2, qwen2.5, gemma2, llama3}, and each entry of TASKS should be in {ioi, mcqa, arithmetic_subtraction, arithmetic_addition, arc_easy, arc_challenge}. The ablation option controls the ablation used: patching ablations by default, but mean and zero ablations are also possible for certain circuit-finding methods (eap, eap-ig-activations, and exact). level is the granularity at which attribution is performed: edge (the default), node, or neuron. batch-size is the batch size used during attribution and applies to all models. circuit-dir is the directory where circuit files are written.

We support the following attribution methods:

  • Edge Attribution Patching (EAP; eap). Note that by changing --level to node or neuron, you obtain node / neuron attribution patching. Node-level patching happens at the level of submodules (e.g., the MLP at layer 10, or attention head 5 at layer 3), whereas neuron-level patching assigns scores to each neuron in each of those submodules.

  • EAP with Optimal Ablations. You will first need to compute the optimal ablation vector for a given model and task. This can be done by running oa.py --models model1,model2 --tasks task1,task2, which requires the nnsight package. Then, run python run_attribution.py with --method EAP --ablation optimal --optimal_ablation_path=[PATH_to_OA_outputs]; see the sketch after this list.

  • Edge Attribution Patching with Integrated Gradients (EAP-IG; eap-ig-inputs / eap-ig-activations). EAP-IG-inputs runs an interpolation between many values of the input embeddings, but allows the activations to flow freely through the rest of the model from there. EAP-IG-activations interpolates between intermediate activations at the component that is being attributed. We would recommend starting with EAP-IG-inputs, as it runs faster—and, in most cases, performs better.

  • Activation Patching (exact). This is the exact activation patching approach that EAP is approximating. Its runtime is long, so it is generally only feasible to run on smaller models unless you have a large enough GPU to increase the batch size significantly. Note that this approach operates at the level of edges, not nodes.

  • Information Flow Routes (IFR; information-flow-routes).

  • Uniform Gradient Sampling (UGS). To obtain the UGS results, first run this script with the reg_lamb hyperparameter set to 0.001. This will train the continuous mask $\alpha$ over the model’s edges. Then, run the convert_mask_to_graph.py script to convert the learned mask into a graph object, where each edge is assigned a weight equal to its corresponding $\alpha$ value. These edge weights are then used to determine the subgraphs for evaluation.
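
As a sketch of the two-step optimal-ablation workflow described above (model1/model2 and task1/task2 are placeholders, and the run_attribution.py flags simply combine those listed earlier):

python oa.py --models model1,model2 --tasks task1,task2

python run_attribution.py \
--models model1 model2 \
--tasks task1 task2 \
--method EAP \
--ablation optimal \
--optimal_ablation_path=[PATH_to_OA_outputs]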

For example, to perform EAP-IG (inputs) with patching for IOI and MCQA on both Qwen-2.5 (0.5B) and Gemma-2 (2B) at the edge level, run:

python run_attribution.py \
--models qwen2.5 gemma2 \
--tasks ioi mcqa \
--method EAP-IG-inputs \
--level edge \
--ablation patching \
--batch-size 20

Evaluation

To evaluate these circuits, run:

python run_evaluation.py
--models [MODELS]
--tasks [TASKS]
--split [SPLIT="validation"]
--level [LEVEL="edge"]
--ablation [ABLATION="patching"]
--batch-size [BATCH_SIZE=20]
--circuit-dir [CIRCUIT-DIR="circuits/"]
--output-dir [OUTPUT_DIR="results/"]

By default, this will evaluate on the validation set. To evaluate on the train or (public) test set, use --split train / --split test.

The argument structure is the same as for the attribution script, so you can reuse the arguments from circuit discovery and change only the script name (see the example below). Circuits will be loaded from the locations where the circuit discovery step described above saved them.
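
For instance, to evaluate the circuits produced by the EAP-IG (inputs) example above on the validation split (a sketch; the --method and --ablation flags are assumed to carry over from the attribution run, per the note above):

python run_evaluation.py \
--models qwen2.5 gemma2 \
--tasks ioi mcqa \
--method EAP-IG-inputs \
--level edge \
--ablation patching \
--split validation \
--batch-size 20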

If you are using custom circuits not obtained using this code, use the --circuit-files argument. This takes a series of space-separated paths to circuits to be evaluated. These circuits must be provided in either .json or .pt format; see examples provided here.

This script will save your results as .pkl files inside --output-dir; these contain the faithfulness scores at each circuit size, the weighted edge counts of all circuits, and the CPR and CMD scores.

Printing Results

Once you've finished evaluation, run:

python print_results.py
--output-dir [OUTPUT_DIR="results/"]
--split [SPLIT="validation"]
--metric [METRIC="cpr"]

This will output a table of scores for the specified split and metric. To display CMD scores instead, set --metric cmd. To display InterpBench scores (AUROC), use --metric auroc.
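
For example, to display CMD scores for the validation split using the default results directory:

python print_results.py \
--output-dir results/ \
--split validation \
--metric cmd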

Submitting to the MIB Leaderboard

If you would like to submit your circuits for evaluation on the private test set, start by collecting your circuits. We expect one folder per task/model combination, where each folder is named after the task and the model, separated by an underscore (for example, ioi_gpt2 or arc-easy_llama3).

Each folder should contain either (1) a single .json or .pt file with floating-point importance scores assigned to each node or edge in the model, or (2) 9 .json or .pt files with binary membership variables assigned to each node or edge in the model. If (2), there should be one circuit containing no more than each of the following percentages of edges:

{0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50}

In other words, we expect one circuit with $k \leq 0.1$% of edges, one with $k \leq 0.2$% of edges, etc., where $k$ is the percentage of edges in the circuit compared to the full model.

We require the circuits to be publicly available on HuggingFace. We will request a URL to a directory in a HuggingFace repository that contains one folder per task/model. These folders should contain either your importance scores or your 9 circuits. If you used our code, you'll already have this directory structure: simply upload the folder corresponding to the method name.
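
A hypothetical submission layout might look like the following (the folder names follow the task_model convention above; the file names inside each folder are placeholders, since only the .json / .pt format matters):

my-submission/
    ioi_gpt2/
        importance_scores.json          (option 1: a single file of floating-point scores)
    mcqa_qwen2.5/
        circuit_0.1pct.json             (option 2: one binary-membership circuit per size)
        circuit_0.2pct.json
        ...
        circuit_50pct.json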

Example Circuits and Submissions

We provide examples of valid submissions in this repository. See here for an example of importance scores, and here for an example of multiple circuits. You do not need to provide folders for all tasks/models; however, to prevent trivial submissions, we require you to provide circuits for $\geq$ 2 models and $\geq$ 2 tasks.

We provide an example of an edge-level circuit in the importance-score format here. If you choose to provide multiple circuits instead of importance scores, the circuit file format is nearly identical, but without the floating-point edge/node scores. We provide an example of a neuron-level node circuit here.

Rate Limit

There is a rate limit of 2 submissions per user per week to prevent hill-climbing on the private test set. Our automatic submission checker will verify that your submission is in a valid format, and will only count it toward your limit if it is. We ask that you provide a contact email so that we can reach you in case of issues.

Citation

If you use the resources in this repository, please cite our paper:

@article{mib-2025,
	title = {{MIB}: A Mechanistic Interpretability Benchmark},
	author = {Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv{\'a}n Arcuschin and Adam Belfki and Yik Siu Chan and Jaden Fiotto-Kaufman and Tal Haklay and Michael Hanna and Jing Huang and Rohan Gupta and Yaniv Nikankin and Hadas Orgad and Nikhil Prakash and Anja Reusch and Aruna Sankaranarayanan and Shun Shao and Alessandro Stolfo and Martin Tutek and Amir Zur and David Bau and Yonatan Belinkov},
	year = {2025},
	journal = {CoRR},
	volume = {arXiv:2504.13151},
	url = {https://arxiv.org/abs/2504.13151v1}
}

License

We release the content in this repository under an Apache 2.0 license.
