GNEprop model associated with the manuscript: "A high-throughput phenotypic screen combined with an ultra-large-scale deep learning-based virtual screening reveals novel scaffolds of antibacterial compounds" (an updated version will be released soon).
GNEprop is a graph neural network (GNN)-based model for predicting antibacterial activity from molecular structures in virtual screening settings. It is built on a GNN encoder and includes several features: self-supervised contrastive pre-training, multi-scale adversarial augmentations, and meta-learning fine-tuning for out-of-distribution generalization (see the manuscript for additional details).
We are currently finalizing the release of this repository; stay tuned for more updates.
The Python environment is managed by conda. First, install Miniconda from https://conda.io/miniconda.html, then run:
```
conda env create -f environment.yml --name gneprop
conda activate gneprop
```
Then install learn2learn from source:

```
git clone https://github.com/learnables/learn2learn/
cd learn2learn
pip install .
```
No further installation is currently needed.
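As a quick sanity check of the environment (a minimal sketch, not part of the repository):

```python
# Hypothetical sanity check: confirm that the core dependencies import
# correctly and report whether a CUDA device is visible.
import torch
import pytorch_lightning
import learn2learn

print("torch:", torch.__version__)
print("pytorch_lightning:", pytorch_lightning.__version__)
print("CUDA available:", torch.cuda.is_available())
```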
GNEprop has two main entry points:
- `clr.py`: runs the model for self-supervised pretraining and representation learning
- `gneprop_pyg.py`: runs the model for supervised training
The following training commands use the best hyperparameters reported in the manuscript.
A hyperparameter search can be run by specifying the search space as a `.yaml` file. The default configuration is provided in `config/hparams_search.yaml` and can be used with the following command:

```
python gneprop_pyg.py --dataset_path support_data/s1b.csv --gpus 1 --split_type scaffold --keep_all_checkpoints --max_epochs 30 --metric val_ap --num_workers 8 --log_directory <log_directory> --parallel_folds 20 --adv flag --adv_m 5 --hparams_search_conf_path config/hparams_search.yaml
```
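The search-space file can also be inspected programmatically; a minimal sketch, assuming only that it is standard YAML readable with PyYAML (the schema itself is defined by the repository):

```python
# Minimal sketch: load and print the default hyperparameter search space.
# Assumes only that config/hparams_search.yaml is standard YAML.
import yaml

with open("config/hparams_search.yaml") as f:
    search_space = yaml.safe_load(f)
print(search_space)
```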
- GNEprop trained without using pretrained weights:

  ```
  python gneprop_pyg.py --dataset_path support_data/s1b.csv --lr 4.9379e-05 --hidden_size 500 --depth 5 --num_readout_layers 1 --dropout 0.13 --lr_strategy warmup_cosine_step --aggr mean --gpus 1 --split_type scaffold --max_epochs 30 --metric val_ap --num_workers 3 --log_directory <log_directory> --parallel_folds 20 --adv flag --adv_m 5
  ```

- GNEprop trained using pretrained weights:

  ```
  python gneprop_pyg.py --dataset_path support_data/s1b.csv --lr 4.9379e-05 --hidden_size 500 --depth 5 --num_readout_layers 1 --dropout 0.13 --lr_strategy warmup_cosine_step --aggr mean --gpus 1 --split_type scaffold --max_epochs 30 --metric val_ap --num_workers 3 --log_directory <log_directory> --parallel_folds 20 --pretrain_path <pretrained_path> --mp_to_freeze 0 --freeze_ab_embeddings --freeze_batchnorm --adv flag --adv_m 5
  ```
- To also add molecular features computed with RDKit, add the argument `--use_mol_features`.
- To use random splitting instead of scaffold splitting, use `--split_type random` (the idea behind scaffold splitting is sketched after this list).
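Scaffold splitting groups molecules by their Bemis-Murcko scaffold so that structurally related compounds do not leak between training and evaluation folds. A minimal sketch of the idea using RDKit (illustrative only; the repository implements its own splitting logic):

```python
# Illustrative only: group SMILES by Bemis-Murcko scaffold, the idea behind
# --split_type scaffold. The repository's own implementation may differ.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_groups(smiles_list):
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(i)
    return groups  # scaffold SMILES -> indices of molecules sharing it

print(scaffold_groups(["c1ccccc1CCO", "c1ccccc1CC(=O)O", "C1CCCCC1O"]))
```

Entire scaffold groups are then assigned to a single fold, which makes the held-out set structurally dissimilar from the training set.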
- GNEprop training, scaffold splitting:

  ```
  python gneprop_pyg.py --dataset_path support_data/GNEtolC.csv --lr 4.9379e-05 --hidden_size 500 --depth 5 --num_readout_layers 1 --dropout 0.13 --lr_strategy warmup_cosine_step --aggr mean --gpus 1 --split_type scaffold --keep_all_checkpoints --max_epochs 50 --metric val_ap --num_workers 8 --exclude_bn_bias --log_directory <log_directory> --parallel_folds 8 --pretrain_path <pretrained_path> --mp_to_freeze 0 --freeze_ab_embeddings --freeze_batchnorm --freeze_bias --ig_baseline_ratio 0.3 --adv flag --adv_m 5
  ```

- GNEprop training, scaffold-cluster splitting:

  ```
  python gneprop_pyg.py --dataset_path support_data/GNEtolC.csv --lr 4.9379e-05 --hidden_size 500 --depth 5 --num_readout_layers 1 --dropout 0.13 --lr_strategy warmup_cosine_step --aggr mean --gpus 1 --keep_all_checkpoints --max_epochs 50 --metric val_ap --num_workers 8 --exclude_bn_bias --log_directory <log_directory> --parallel_folds 8 --pretrain_path <pretrained_path> --mp_to_freeze 0 --freeze_ab_embeddings --freeze_batchnorm --freeze_bias --ig_baseline_ratio 0.3 --adv flag --adv_m 5 --split_type index_predetermined --index_predetermined_file support_data/dataset_100k_v1.pkl
  ```
The self-supervised model (trained on ~120M molecules from ZINC15) is available as `20210827-082422.zip` (see the "Data Availability" section).
The self-supervised model can be re-trained using:

```
python clr.py --dataset_path data_path/zinc15_cell_screening_GNE_all_081320_normalized_unique.csv --gpus 1 --max_epoch 50 --lr 1e-03 --model_hidden_size 500 --model_depth 5 --batch_size 1024 --weight_decay 0. --exclude_bn_bias --num_workers 64 --project_output_dim 256
```
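For intuition, contrastive pretraining of this kind typically optimizes an NT-Xent-style loss that pulls the embeddings of two augmented views of the same molecule together while pushing other molecules in the batch apart. A generic sketch of the standard formulation (not necessarily GNEprop's exact implementation):

```python
# Generic NT-Xent contrastive loss over two views of a batch of molecular
# embeddings; the standard SimCLR formulation, not GNEprop's exact code.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                       # pairwise similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float("-inf"))               # drop self-similarity
    # positives: row i (view 1) matches row i + n (view 2), and vice versa
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 256), torch.randn(8, 256))
```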
Meta-learning fine-tuning can be enabled with the `--meta` argument, for example:

```
python gneprop_pyg.py --dataset_path data_path/<dataset_file> --lr 4.9379e-05 --hidden_size 500 --depth 5 --num_readout_layers 1 --dropout 0.13 --lr_strategy warmup_cosine_step --aggr mean --gpus 1 --split_type scaffold --max_epochs 30 --metric val_ap --num_workers 8 --log_directory <log_directory> --parallel_folds 8 --mp_to_freeze 0 --freeze_ab_embeddings --freeze_batchnorm --adv flag --adv_m 5 --supervised_pretrain_path_folds <supervised_training_path_dir> --meta --keep_all_checkpoints --meta_test half_val --keep_last_checkpoint
```
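The learn2learn package installed above provides MAML-style inner/outer-loop machinery for this kind of fine-tuning. A minimal generic sketch (stand-in model and data; see `gneprop_pyg.py` for the actual training loop):

```python
# Generic MAML-style adaptation with learn2learn; an illustration of the
# meta-learning idea, not the repository's exact training loop.
import torch
import learn2learn as l2l

model = torch.nn.Linear(16, 1)                      # stand-in for the GNN head
maml = l2l.algorithms.MAML(model, lr=0.01)
opt = torch.optim.Adam(maml.parameters(), lr=1e-3)

for _ in range(10):                                 # meta-training steps
    x, y = torch.randn(32, 16), torch.randn(32, 1)  # stand-in task data
    learner = maml.clone()                          # task-specific copy
    support_loss = torch.nn.functional.mse_loss(learner(x), y)
    learner.adapt(support_loss)                     # inner-loop update
    query_loss = torch.nn.functional.mse_loss(learner(x), y)
    opt.zero_grad()
    query_loss.backward()                           # outer-loop meta-update
    opt.step()
```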
Filter definitions are included in `chem_utils.py`.
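As a generic illustration of structural filtering (using RDKit's built-in PAINS catalog, which is an assumption and not necessarily what `chem_utils.py` implements):

```python
# Generic structural-filter example using RDKit's built-in PAINS catalog.
# Illustrative only: the actual filter definitions are in chem_utils.py.
from rdkit import Chem
from rdkit.Chem import FilterCatalog

params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog.FilterCatalog(params)

mol = Chem.MolFromSmiles("O=C(Nc1ccccc1)c1ccccc1")
print("PAINS match:", catalog.HasMatch(mol))
```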
Refer to `gneprop_pyg.py` and `clr.py` for other parameters.
Refer to `explainability.py`, in particular the `explain_graph` method.
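For background, the `--ig_baseline_ratio` argument used above suggests an integrated-gradients-style attribution. A generic sketch in plain PyTorch (illustrative only; `explain_graph` implements the repository's own method):

```python
# Generic integrated-gradients sketch over input features (plain PyTorch);
# explain_graph in explainability.py implements the repository's own method.
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # interpolate between the baseline and the input, then accumulate
        # gradients of the model output with respect to that point
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        model(point).sum().backward()
        total += point.grad
    return (x - baseline) * total / steps  # per-feature attributions

attr = integrated_gradients(torch.nn.Linear(8, 1), torch.randn(4, 8))
```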
Refer to `ood.py`.
Data are available at https://drive.google.com/drive/folders/1g3wZFa0jxadElcayJR0euvCWTWymXZ1J?usp=sharing
In particular:
- support_data:
  - Public dataset (Stokes et al., 2020): `s1b.csv`
  - GNEtolC dataset: `GNEtolC.csv`
  - Scaffold-cluster splitting for the GNEtolC dataset: `dataset_100k_v1.pkl`
  - Known antibiotics for the novel-MOA detection analysis: `Extended_data_table_antibiotics.csv`
  - Dataset for self-supervised training: `zinc15_cell_screening_GNE_all_081320_normalized_unique.csv.tgz`
  - Virtual hits labeled with result label: `screening_hits.xlsx`
- checkpoints:
  - `20240709-103356`: GNEprop training on the GNEtolC dataset with scaffold splitting
  - `20240709-103508`: GNEprop training on the GNEtolC dataset with scaffold-cluster splitting
- pretrained_weights:
  - `20210827-082422`: self-supervised checkpoint
All associated data is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.
In general, small datasets (hundreds of molecules) can be trained in a few minutes using CPU only; training GNEprop on a CUDA-enabled GPU is recommended.
Multiple GPUs can be used to speed up multi-fold training (using the arguments `--parallel_folds` and `--num_gpus_per_fold`) or to speed up a single training run with data parallelism (using the `--gpus` argument from `pytorch_lightning`).
For self-supervised training on large datasets (e.g., the 120M molecules used in the manuscript), having multiple GPUs and CPUs available is recommended.
GNEprop relies on `cudatoolkit` and `cuDNN`.
For a full list of parameters, run:

```
python clr.py --help
python gneprop_pyg.py --help
```
Refer to the manuscript.
Reach out to Gabriele Scalia ([email protected]), Ziqing Lu ([email protected]), or Tommaso Biancalani ([email protected]) for questions on the repository.