This repository provides code for the circuit localization track of the Mechanistic Interpretability Benchmark, including code for circuit discovery and evaluation.
This code requires only `EAP-IG` (included as a submodule) and `tabulate`. You can pull submodules using `git submodule update --init --recursive`. Once the submodule is pulled, you can install dependencies by running `pip install .` in this directory. Note that if you wish to visualize the circuits you find, you may want to `pip install EAP-IG[viz]`, which will also install the necessary `pygraphviz` package; installing `pygraphviz` can be challenging, which is why it has been excluded from the default dependencies. Our code was tested using `torch == 2.4.1`.
If you use the `oa.py` script, you will additionally need `nnsight`. We used `nnsight==0.2.15`.
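
Putting these steps together, a typical setup (including the optional extras noted above) might look like:

```bash
# Pull the EAP-IG submodule and install dependencies
git submodule update --init --recursive
pip install .

# Optional: visualization support (installs pygraphviz, which can be tricky to build)
pip install "EAP-IG[viz]"

# Optional: only needed for oa.py (optimal ablations)
pip install nnsight==0.2.15
```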
Here, we describe how to run the circuit discovery methods that we compare in the paper. In general, you can run circuit discovery by running:
```bash
python run_attribution.py \
    --models [MODELS] \
    --tasks [TASKS] \
    --method [METHOD] \
    --level [LEVEL="edge"] \
    --ablation [ABLATION="patching"] \
    --batch-size [BATCH_SIZE=20] \
    --circuit-dir [CIRCUIT-DIR="circuits/"]
```
This will iterate over each model and task specified, producing an attribution graph file for each. Each entry of `MODELS` should be in `{interpbench, gpt2, qwen2.5, gemma2, llama3}`, and each entry of `TASKS` should be in `{ioi, mcqa, arithmetic_subtraction, arithmetic_addition, arc_easy, arc_challenge}`. The `ablation` option controls the ablation used: patching ablations by default, but `mean` and `zero` ablations are also possible for certain circuit-finding methods (`eap`, `eap-ig-activations`, and `exact`). `level` is the level of granularity at which attribution is performed: `edge` (the default), `node`, or `neuron`. `batch-size` is the batch size used during attribution, and the same value is used across models. `circuit-dir` is the directory where circuit files are written.
We support the following attribution methods:
- Edge Attribution Patching (EAP; `eap`). Note that by changing `--level` to `node` or `neuron`, you obtain node / neuron attribution patching. Node-level patching happens at the level of submodules (e.g., the MLP at layer 10, or attention head 5 at layer 3), whereas neuron-level patching assigns scores to each neuron within each of those submodules.
- EAP with Optimal Ablations. You will first need to compute the optimal ablation vector for a given model and task. This can be done by running `oa.py --models model1,model2 --tasks task1,task2`, which requires the `nnsight` package. Then, run `python run_attribution.py` with `--method EAP --ablation optimal --optimal_ablation_path=[PATH_to_OA_outputs]`; see the example command after this list.
- Edge Attribution Patching with Integrated Gradients (EAP-IG; `eap-ig-inputs` / `eap-ig-activations`). EAP-IG-inputs interpolates between many values of the input embeddings, but allows the activations to flow freely through the rest of the model from there. EAP-IG-activations interpolates between intermediate activations at the component being attributed. We recommend starting with EAP-IG-inputs, as it runs faster and, in most cases, performs better.
- Activation Patching (`exact`). This is the exact activation patching approach that EAP approximates. Its runtime is long, so it is generally only feasible to run on smaller models unless you have a GPU large enough to increase the batch size significantly. Note that this approach operates at the level of edges, not nodes.
- Information Flow Routes (IFR; `information-flow-routes`).
- Uniform Gradient Sampling (UGS). To obtain the UGS results, first run this script with the `reg_lamb` hyperparameter set to 0.001. This trains a continuous mask $\alpha$ over the model's edges. Then, run the `convert_mask_to_graph.py` script to convert the learned mask into a graph object, where each edge is assigned a weight equal to its corresponding $\alpha$ value. These edge weights are then used to determine the subgraphs for evaluation.
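
For instance, a sketch of the optimal-ablations workflow for IOI on GPT-2 (the output path is left as a placeholder) might look like:

```bash
# 1) Compute the optimal ablation vectors (requires nnsight)
python oa.py --models gpt2 --tasks ioi

# 2) Run EAP with optimal ablations, pointing to the oa.py outputs
#    (replace the placeholder with wherever oa.py saved its outputs)
python run_attribution.py \
    --models gpt2 \
    --tasks ioi \
    --method EAP \
    --ablation optimal \
    --optimal_ablation_path=[PATH_to_OA_outputs]
```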
For example, to perform EAP-IG (inputs) with patching for IOI and MCQA on both Qwen-2.5 (0.5B) and Gemma-2 (2B) at the edge level, run:
```bash
python run_attribution.py \
    --models qwen2.5 gemma2 \
    --tasks ioi mcqa \
    --method EAP-IG-inputs \
    --level edge \
    --ablation patching \
    --batch-size 20
```
To evaluate these circuits, run:
```bash
python run_evaluation.py \
    --models [MODELS] \
    --tasks [TASKS] \
    --split [SPLIT="validation"] \
    --level [LEVEL="edge"] \
    --ablation [ABLATION="patching"] \
    --batch-size [BATCH_SIZE=20] \
    --circuit-dir [CIRCUIT-DIR="circuits/"] \
    --output-dir [OUTPUT_DIR="results/"]
```
By default, this will evaluate on the validation set. To evaluate on the train or (public) test set, use `--split train` / `--split test`.
The argument structure is the same as for the attribution script, so you can reuse the same arguments you used for circuit discovery and change only the script name. This will load circuits from the locations in which the circuit discovery methods described above would have saved them.
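
For instance, to evaluate the EAP-IG (inputs) circuits from the earlier example on the validation split, a sketch (assuming, per the note above, that the `--method` flag is carried over unchanged from the attribution call) would be:

```bash
python run_evaluation.py \
    --models qwen2.5 gemma2 \
    --tasks ioi mcqa \
    --method EAP-IG-inputs \
    --split validation \
    --level edge \
    --ablation patching \
    --batch-size 20
```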
If you are using custom circuits not obtained using this code, use the `--circuit-files` argument. This takes a series of space-separated paths to the circuits to be evaluated. These circuits must be provided in either .json or .pt format; see the examples provided here.
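
As a sketch (the circuit file paths below are hypothetical, and we assume the model/task flags are still supplied as in the examples above):

```bash
# Evaluate two externally produced IOI circuits for GPT-2
# (replace the paths with your own .json / .pt circuit files)
python run_evaluation.py \
    --models gpt2 \
    --tasks ioi \
    --circuit-files my_circuits/ioi_gpt2_a.json my_circuits/ioi_gpt2_b.pt
```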
This script will save your results as .pkl files inside `--output-dir`, containing the faithfulness scores at all circuit sizes, the weighted edge counts of all circuits, and the CPR and CMD scores.
Once you've finished evaluation, run:
```bash
python print_results.py \
    --output-dir [OUTPUT_DIR="results/"] \
    --split [SPLIT="validation"] \
    --metric [METRIC="cpr"]
```
This will output a table of scores for the specified split and metric. To display CMD scores instead, set `--metric cmd`. To display InterpBench scores (AUROC), use `--metric auroc`.
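
For example, assuming results were written to the default `results/` directory, CMD scores on the public test set can be printed with:

```bash
python print_results.py \
    --output-dir results/ \
    --split test \
    --metric cmd
```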
If you would like to submit your circuits for evaluation on the private test set, start by collecting your circuits. We expect one folder per task/model pair, where each folder name contains the name of the task and the model, separated by an underscore: for example, `ioi_gpt2` or `arc-easy_llama3`.
Each folder should contain either (1) a single .json or .pt file with floating-point importance scores assigned to each node or edge in the model, or (2) 9 .json or .pt files with binary membership variables assigned to each node or edge in the model. If (2), there should be one circuit containing no more than each of the following percentages of edges: `{0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50}`. In other words, we expect one circuit per percentage threshold listed above, each containing no more than that percentage of the model's edges.
We require the circuits to be publicly available on HuggingFace. We will request a URL to a directory in a HuggingFace repository that contains one folder per task/model. These folders should contain either your importance scores or your 9 circuits. If you used our code, you'll already have this directory structure: simply upload the folder corresponding to the method name.
We provide examples of valid submissions in this repository. See here for an example of importance scores, and here for an example of multiple circuits. You do not need to provide folders for all tasks/models; however, to prevent trivial submissions, we require you to provide circuits for
We provide an example of an edge-level circuit in the importance-score format here. If you choose to provide multiple circuits instead of importance scores, the circuit file format is nearly identical, but without the floating-point edge/node scores. We provide an example of a neuron-level node circuit here.
There is a rate limit of 2 submissions per user per week to prevent hill-climbing on the private test set. Our automatic submission checker will verify whether what you have provided is in a valid format, and only count your submission toward your limit if it is. In case of issues, we ask that you provide a contact email.
If you use the resources in this repository, please cite our paper:
```bibtex
@article{mib-2025,
    title = {{MIB}: A Mechanistic Interpretability Benchmark},
    author = {Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv{\'a}n Arcuschin and Adam Belfki and Yik Siu Chan and Jaden Fiotto-Kaufman and Tal Haklay and Michael Hanna and Jing Huang and Rohan Gupta and Yaniv Nikankin and Hadas Orgad and Nikhil Prakash and Anja Reusch and Aruna Sankaranarayanan and Shun Shao and Alessandro Stolfo and Martin Tutek and Amir Zur and David Bau and Yonatan Belinkov},
    year = {2025},
    journal = {CoRR},
    volume = {arXiv:2504.13151},
    url = {https://arxiv.org/abs/2504.13151v1}
}
```
We release the content in this repository under an Apache 2.0 license.