Skip to content

A structural bioinformatics project focused on building and evaluating a Profile Hidden Markov Model (HMM) for the detection of Kunitz-type protease inhibitor domains (PF00014), developed during the Laboratory of Bioinformatics 1 course at the University of Bologna.

Notifications You must be signed in to change notification settings

Markus2409/MSc_Bioinformatics_HMM_KunitzDomain

Repository files navigation

Building a Profile Hidden Markov Model for the Kunitz-Type Protease Inhibitor Domain

Introduction

This repository contains the implementation of a computational pipeline for building a Profile Hidden Markov Model (HMM) targeting the Kunitz-type protease inhibitor domain (Pfam ID: PF00014). The Kunitz domain is a small protein domain found in a variety of protease inhibitors across multiple species. Detecting and classifying these domains is crucial for understanding protease regulation in biological systems. Using a structure-based multiple alignment of representative Kunitz domain proteins, an HMM model is built and calibrated. The model is then evaluated on both positive (true Kunitz) and negative (non-Kunitz) protein sequences to assess its detection performance. Additionally, a comparison is performed with a sequence-based multiple alignment HMM model, built using MUSCLE, to evaluate differences in predictive performance. This project was developed as part of a Laboratory of Bioinformatics 1 assignment during my MSc in Bioinformatics (at University of Bologna), with the goal of integrating structural bioinformatics, sequence analysis, and statistical evaluation of predictive models.

Required Packages

The following packages are required:

  • CD-HIT

    Purpose: Clustering and redundancy reduction of protein sequences. Install via conda:

    conda install -c bioconda cd-hit
  • HMMER

    Purpose: Building and searching Hidden Markov Models (HMMs) for protein domain detection. Install via conda:

    conda install -c bioconda hmmer
  • BLAST+

    Purpose: Performing protein sequence similarity searches using blastp. Install via conda:

    conda install -c bioconda blast
  • MUSCLE (Optional — required only for Comparison with sequence based hmm)

    Purpose: Performing multiple sequence alignments for sequence-based HMM construction. Install via conda:

    conda install -c bioconda muscle
  • BIOPYTHON

    Purpose: Required to run get_seq.py, used for sequence extraction.
    Install via conda:

    conda install -c conda-forge biopython

Repository Contents

This repository contains:

  • rcsb_pdb_custom_report_20250410062557.csv see Additional Notes A custom report downloaded from the PDB using the following query:

    Data Collection Resolution <= 3.5 AND ( Identifier = "PF00014" AND Annotation Type = "Pfam" ) AND Polymer Entity Sequence Length <= 80 AND Polymer Entity Sequence Length >= 45

  • pdb_kunitz_rp.ali An initially empty file. You will paste into this file the multi-structure alignment obtained from PDBeFold.

  • all_kunitz.fasta see Additional Notes A FASTA file containing all Kunitz proteins (human and non-human).

  • uniprot_sprot.fasta
    Full Swiss-Prot protein dataset in FASTA format, used to extract negative sets and full Kunitz reference.

  • script_recover_representative_kunitz.sh A script that extracts representative Kunitz domain PDB IDs and writes them to tmp_pdb_efold_ids.txt.

  • create_hmm_str.sh A script that builds the structural HMM from the PDBeFold alignment.

  • create_testing_sets.sh A script that performs test set generation and 2-fold cross-validation. It removes overlap with training data, computes optimal thresholds (via MCC), and outputs results to hmm_results_strali.txt.

  • create_hmm_seq.sh A script for sequence-based alignment using MUSCLE and subsequent HMM construction. Uses the same test sets and evaluates performance, writing results to hmm_results_seqali.txt.

  • get_seq.py A Python script used to extract specific sequences from a FASTA file (typically the full Swiss-Prot database). It matches accession numbers (from the FASTA headers) against a provided list of IDs and prints the corresponding sequences in FASTA format. This script is used to generate the input FASTA files for positive and negative sets, based on pre-selected identifiers.

  • performance.py A Python script that evaluates prediction performance using E-values and known labels. It computes metrics such as accuracy (q2), Matthews Correlation Coefficient (MCC), true positive rate (TPR), and precision (PPV). It allows comparison of results using full-sequence vs. best-domain E-values, with manual threshold selection.

  • OUTPUT_EXAMPLES folder A folder containing examples of the final output files generated by the pipeline.

  • graphics_and_tables folder A folder containing summary plots, figures, and tables generated during the evaluation of the HMM models.

  • INTERPRO_ANALYSIS This folder contains all materials and results obtained from a secondary exploratory analysis that integrates the InterPro domain annotation IPR036880, which is an alternative to the Pfam ID PF00014 used in the primary HMM analysis. The motivation for this integration stems from the observation that several Kunitz-type proteins were excluded from the positive set due to incomplete or inconsistent domain annotations in UniProt/Swiss-Prot. To address this, all sequences annotated with InterPro entry IPR036880 were retrieved and analyzed.

  • marco_cuscuna_report.pdf This document is the final lab report for the Laboratory of Bioinformatics 1 course (MSc in Bioinformatics, University of Bologna), authored by Marco Cuscunà.
    It describes the complete development and evaluation of profile Hidden Markov Models (HMMs) for detecting the Kunitz-type protease inhibitor domain.
    The report compares sequence- and structure-based alignment approaches using 2-fold cross-validation, highlights annotation-related challenges (e.g., Pfam vs. InterPro cross-reference issues), and discusses the impact of structural information on classification performance.


What to Do

0. Download the Swiss-Prot FASTA file

Visit UniProt and download all protein sequences annotated in the Swiss-Prot database. Insert the file into the folder with the scripts and recall it uniprot_sprot.fasta. This file will be used to extract non-Kunitz protein sequences.


1. Install the required packages

Install the packages listed in the Required Packages section using conda.


2. Run the representative ID extraction script

bash script_recover_representative_kunitz.sh

This will create the file tmp_pdb_efold_ids.txt.

Note: Before submitting the list to PDBeFold, manually inspect the sequences and remove any that are too long, too short, or have unstructured tails. This ensures compatibility with PDBeFold and maintains alignment and hmm quality.


3. Perform structure-based multiple alignment

  1. Go to the PDBeFold Multi Alignment Tool

  2. Choose:

    • Mode: Multiple
    • Source: List of PDB codes
  3. Upload the file tmp_pdb_efold_ids.txt

  4. Click Download FASTA Alignment

  5. Paste the downloaded content into the file pdb_kunitz_rp.ali


4. Build and test the HMM model for structural alignment

bash create_hmm_str.sh
bash create_testing_sets.sh

This pipeline will:

  • Build a structural HMM from the PDBeFold structural alignment

  • Remove training sequences from the full dataset and generate random subsets of positive and negative sequences, which will be used to create test sets

  • Automatically identify the optimal E-value thresholds via MCC evaluation (2-fold CV)

  • Perform evaluation on:

    • Set 1, using Set 2’s threshold
    • Set 2, using Set 1’s threshold
    • Combined Set 1 + Set 2, using both thresholds for overall assessment
  • Report MCC, precision, recall, and identify false positives and false negatives

  • Write detailed results to hmm_results_strali.txt


5. (Optional) Compare with sequence-based HMM

bash create_hmm_seq.sh

This builds an HMM using a MUSCLE sequence alignment of the same representative sequences. It uses the same test sets and evaluates performance, writing results to hmm_results_seqali.txt.


Output(s)

  • hmm_results_strali.txt and hmm_results_seqali.txt contain:
    • The best E-value thresholds selected by maximizing the Matthews Correlation Coefficient (MCC)
    • Performance metrics for each test set and overall, calculated using the E-value that yielded the highest MCC, based on either full sequence or best single domain evaluations
    • Lists of false positives and false negatives
  • neg_1.fasta and neg_2.fasta: FASTA files of non-Kunitz sequences used respectively as the negative set 1 and set 2.
  • pos_1.fasta and pos_2.fasta: FASTA files of Kunitz (positive) sequences used in set 1 and set 2.
  • pdb_kunitz_rp_clean.fasta: Cleaned representative Kunitz sequences used for both structural and sequence alignments (after filtering for length).
  • pdb_kunitz_rp_seqali.fasta and pdb_kunitz_rp_seqali.hmm: The multiple sequence alignment and resulting HMM model produced using MUSCLE.
  • pdb_kunitz_rp_strali.fasta and pdb_kunitz_rp_strali.hmm: The multiple structure alignment and resulting HMM model built from PDBeFold.
  • set_1_strali.class and set_2_strali.class: Classification results from hmmsearch for set 1 and set 2 (structure-based model).
  • temp_overall.class: Classification results for the combined dataset (positive + negative), used to estimate overall performance for each threshold.

These results can be used to compare performances of structural vs. sequence-based modeling approaches.

All files ending in _seqali were generated using the sequence-based alignment and evaluation (5. (Optional) Compare with sequence-based HMM)*). All files ending in _strali were generated using the structure-based alignment and model evaluation (default pipeline).

You can see an Example of Output files into OUTPUT_EXAMPLES folder.


Contact and Additional Notes

You are not required to use the provided rcsb_pdb_custom_report_20250410062557.csv or all_kunitz.fasta files.

  • To generate your own PDB custom report, visit the RCSB PDB Advanced Search and customize the filters to suit your needs (e.g., domain = PF00014, resolution limits, sequence length).
  • To build your own all_kunitz.fasta, go to UniProt, search for all proteins annotated with the Pfam domain PF00014, and download the resulting sequences in FASTA format.

For any questions, suggestions, or contributions, feel free to open an issue or contact the maintainer: Marco Cuscunà.

About

A structural bioinformatics project focused on building and evaluating a Profile Hidden Markov Model (HMM) for the detection of Kunitz-type protease inhibitor domains (PF00014), developed during the Laboratory of Bioinformatics 1 course at the University of Bologna.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published