Building a Profile Hidden Markov Model for the Kunitz-Type Protease Inhibitor Domain
This repository contains the implementation of a computational pipeline for building a Profile Hidden Markov Model (HMM) targeting the Kunitz-type protease inhibitor domain (Pfam ID: PF00014). The Kunitz domain is a small protein domain found in a variety of protease inhibitors across multiple species. Detecting and classifying these domains is crucial for understanding protease regulation in biological systems. Using a structure-based multiple alignment of representative Kunitz domain proteins, an HMM model is built and calibrated. The model is then evaluated on both positive (true Kunitz) and negative (non-Kunitz) protein sequences to assess its detection performance. Additionally, a comparison is performed with a sequence-based multiple alignment HMM model, built using MUSCLE, to evaluate differences in predictive performance. This project was developed as part of a Laboratory of Bioinformatics 1 assignment during my MSc in Bioinformatics (at University of Bologna), with the goal of integrating structural bioinformatics, sequence analysis, and statistical evaluation of predictive models.
The following packages are required:
-
CD-HIT
Purpose: Clustering and redundancy reduction of protein sequences. Install via conda:
conda install -c bioconda cd-hit
-
HMMER
Purpose: Building and searching Hidden Markov Models (HMMs) for protein domain detection. Install via conda:
conda install -c bioconda hmmer
-
BLAST+
Purpose: Performing protein sequence similarity searches using
blastp
. Install via conda:conda install -c bioconda blast
-
MUSCLE (Optional — required only for Comparison with sequence based hmm)
Purpose: Performing multiple sequence alignments for sequence-based HMM construction. Install via conda:
conda install -c bioconda muscle
-
BIOPYTHON
Purpose: Required to run
get_seq.py
, used for sequence extraction.
Install via conda:conda install -c conda-forge biopython
This repository contains:
-
rcsb_pdb_custom_report_20250410062557.csv
see Additional Notes A custom report downloaded from the PDB using the following query:Data Collection Resolution <= 3.5 AND ( Identifier = "PF00014" AND Annotation Type = "Pfam" ) AND Polymer Entity Sequence Length <= 80 AND Polymer Entity Sequence Length >= 45
-
pdb_kunitz_rp.ali
An initially empty file. You will paste into this file the multi-structure alignment obtained from PDBeFold. -
all_kunitz.fasta
see Additional Notes A FASTA file containing all Kunitz proteins (human and non-human). -
uniprot_sprot.fasta
Full Swiss-Prot protein dataset in FASTA format, used to extract negative sets and full Kunitz reference. -
script_recover_representative_kunitz.sh
A script that extracts representative Kunitz domain PDB IDs and writes them totmp_pdb_efold_ids.txt
. -
create_hmm_str.sh
A script that builds the structural HMM from the PDBeFold alignment. -
create_testing_sets.sh
A script that performs test set generation and 2-fold cross-validation. It removes overlap with training data, computes optimal thresholds (via MCC), and outputs results tohmm_results_strali.txt
. -
create_hmm_seq.sh
A script for sequence-based alignment using MUSCLE and subsequent HMM construction. Uses the same test sets and evaluates performance, writing results tohmm_results_seqali.txt
. -
get_seq.py
A Python script used to extract specific sequences from a FASTA file (typically the full Swiss-Prot database). It matches accession numbers (from the FASTA headers) against a provided list of IDs and prints the corresponding sequences in FASTA format. This script is used to generate the input FASTA files for positive and negative sets, based on pre-selected identifiers. -
performance.py
A Python script that evaluates prediction performance using E-values and known labels. It computes metrics such as accuracy (q2), Matthews Correlation Coefficient (MCC), true positive rate (TPR), and precision (PPV). It allows comparison of results using full-sequence vs. best-domain E-values, with manual threshold selection. -
OUTPUT_EXAMPLES
folder A folder containing examples of the final output files generated by the pipeline. -
graphics_and_tables
folder A folder containing summary plots, figures, and tables generated during the evaluation of the HMM models. -
INTERPRO_ANALYSIS
This folder contains all materials and results obtained from a secondary exploratory analysis that integrates the InterPro domain annotation IPR036880, which is an alternative to the Pfam ID PF00014 used in the primary HMM analysis. The motivation for this integration stems from the observation that several Kunitz-type proteins were excluded from the positive set due to incomplete or inconsistent domain annotations in UniProt/Swiss-Prot. To address this, all sequences annotated with InterPro entry IPR036880 were retrieved and analyzed. -
marco_cuscuna_report.pdf
This document is the final lab report for the Laboratory of Bioinformatics 1 course (MSc in Bioinformatics, University of Bologna), authored by Marco Cuscunà.
It describes the complete development and evaluation of profile Hidden Markov Models (HMMs) for detecting the Kunitz-type protease inhibitor domain.
The report compares sequence- and structure-based alignment approaches using 2-fold cross-validation, highlights annotation-related challenges (e.g., Pfam vs. InterPro cross-reference issues), and discusses the impact of structural information on classification performance.
Visit UniProt and download all protein sequences annotated in the Swiss-Prot database. Insert the file into the folder with the scripts and recall it uniprot_sprot.fasta
.
This file will be used to extract non-Kunitz protein sequences.
Install the packages listed in the Required Packages section using conda
.
bash script_recover_representative_kunitz.sh
This will create the file tmp_pdb_efold_ids.txt
.
Note: Before submitting the list to PDBeFold, manually inspect the sequences and remove any that are too long, too short, or have unstructured tails. This ensures compatibility with PDBeFold and maintains alignment and hmm quality.
-
Go to the PDBeFold Multi Alignment Tool
-
Choose:
- Mode: Multiple
- Source: List of PDB codes
-
Upload the file
tmp_pdb_efold_ids.txt
-
Click Download FASTA Alignment
-
Paste the downloaded content into the file
pdb_kunitz_rp.ali
bash create_hmm_str.sh
bash create_testing_sets.sh
This pipeline will:
-
Build a structural HMM from the PDBeFold structural alignment
-
Remove training sequences from the full dataset and generate random subsets of positive and negative sequences, which will be used to create test sets
-
Automatically identify the optimal E-value thresholds via MCC evaluation (2-fold CV)
-
Perform evaluation on:
- Set 1, using Set 2’s threshold
- Set 2, using Set 1’s threshold
- Combined Set 1 + Set 2, using both thresholds for overall assessment
-
Report MCC, precision, recall, and identify false positives and false negatives
-
Write detailed results to
hmm_results_strali.txt
bash create_hmm_seq.sh
This builds an HMM using a MUSCLE sequence alignment of the same representative sequences. It uses the same test sets and evaluates performance, writing results to hmm_results_seqali.txt
.
hmm_results_strali.txt
andhmm_results_seqali.txt
contain:- The best E-value thresholds selected by maximizing the Matthews Correlation Coefficient (MCC)
- Performance metrics for each test set and overall, calculated using the E-value that yielded the highest MCC, based on either full sequence or best single domain evaluations
- Lists of false positives and false negatives
neg_1.fasta
andneg_2.fasta
: FASTA files of non-Kunitz sequences used respectively as the negative set 1 and set 2.pos_1.fasta
andpos_2.fasta
: FASTA files of Kunitz (positive) sequences used in set 1 and set 2.pdb_kunitz_rp_clean.fasta
: Cleaned representative Kunitz sequences used for both structural and sequence alignments (after filtering for length).pdb_kunitz_rp_seqali.fasta
andpdb_kunitz_rp_seqali.hmm
: The multiple sequence alignment and resulting HMM model produced using MUSCLE.pdb_kunitz_rp_strali.fasta
andpdb_kunitz_rp_strali.hmm
: The multiple structure alignment and resulting HMM model built from PDBeFold.set_1_strali.class
andset_2_strali.class
: Classification results fromhmmsearch
for set 1 and set 2 (structure-based model).temp_overall.class
: Classification results for the combined dataset (positive + negative), used to estimate overall performance for each threshold.
These results can be used to compare performances of structural vs. sequence-based modeling approaches.
All files ending in
_seqali
were generated using the sequence-based alignment and evaluation (5. (Optional) Compare with sequence-based HMM)*). All files ending in_strali
were generated using the structure-based alignment and model evaluation (default pipeline).
You can see an Example of Output files into
OUTPUT_EXAMPLES
folder.
You are not required to use the provided rcsb_pdb_custom_report_20250410062557.csv
or all_kunitz.fasta
files.
- To generate your own PDB custom report, visit the RCSB PDB Advanced Search and customize the filters to suit your needs (e.g., domain = PF00014, resolution limits, sequence length).
- To build your own all_kunitz.fasta, go to UniProt, search for all proteins annotated with the Pfam domain
PF00014
, and download the resulting sequences in FASTA format.
For any questions, suggestions, or contributions, feel free to open an issue or contact the maintainer: Marco Cuscunà.