TrackMyPDB - Streamlit Application

A comprehensive bioinformatics pipeline for extracting heteroatoms from protein structures and finding molecularly similar compounds using fingerprint-based similarity analysis.

🎯 Overview

TrackMyPDB is a user-friendly Streamlit web application that combines two powerful components:

Heteroatom Extraction Tool: Systematically extracts all heteroatoms from PDB structures associated with UniProt proteins
Molecular Similarity Analyzer: Finds ligands most similar to a target molecule using Morgan fingerprints and Tanimoto similarity

🚀 Quick Start

Web Application

https://trackmypdbsscai.streamlit.app/

Run on a Local Machine

Prerequisites

Python 3.7+
Internet connection for API calls
Windows OS (optimized for Windows environment)

Installation

Clone the repository:

```
git clone <repository-url>
cd TrackMyPDB
```

Install dependencies:
```
pip install -r requirements.txt
```
Launch the application:
```
streamlit run streamlit_app.py
```
Open your browser to http://localhost:8501

Basic Usage

Navigate to the web interface
Choose analysis type:
- 🔍 Heteroatom Extraction
- 🧪 Similarity Analysis
- 📊 Complete Pipeline
Input your data:
- UniProt IDs (e.g., Q9UNQ0, P37231, P06276)
- Target SMILES structure
Run analysis and download CSV results

📋 Application Features

🔍 Heteroatom Extraction

Input: UniProt protein identifiers
Process: Fetches PDB structures, extracts heteroatoms, retrieves SMILES
Output: Comprehensive CSV with chemical information
APIs: RCSB PDB, PubChem integration
Features: Progress tracking, error handling, result caching
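
The extraction step can be illustrated with a minimal sketch. It assumes the structure is available as plain PDB-format text and reads the fixed-width `HETATM` records directly. The function name `extract_heteroatoms` and the choice to skip water (`HOH`) are illustrative assumptions, not the code in `backend/heteroatom_extractor.py`.

```python
from collections import Counter


def extract_heteroatoms(pdb_text, skip_water=True):
    """Count HETATM atoms per (residue name, chain ID, residue number).

    Hypothetical helper: relies only on the fixed-width PDB column layout.
    """
    found = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()  # columns 18-20: residue name
        chain_id = line[21:22].strip()  # column 22: chain identifier
        res_seq = line[22:26].strip()   # columns 23-26: residue sequence number
        if skip_water and res_name == "HOH":
            continue
        found[(res_name, chain_id, res_seq)] += 1
    return found
```

Each key in the returned counter identifies one heteroatom residue, and the value is the number of atoms recorded for it.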

🧪 Molecular Similarity Analysis

Input: Target SMILES structure
Process: Morgan fingerprint computation, Tanimoto similarity calculation
Output: Ranked similarity results with interactive visualizations
Features: Configurable parameters, real-time analysis, comprehensive reports

📊 Complete Pipeline

Workflow: End-to-end processing from UniProt IDs to similarity results
Integration: Automatic heteroatom extraction followed by similarity analysis
Output: Both heteroatom database and similarity results

🏗️ Project Structure

TrackMyPDB/
├── streamlit_app.py              # Main Streamlit application
├── requirements.txt              # Python dependencies
├── backend/
│   ├── __init__.py              # Package initialization
│   ├── heteroatom_extractor.py  # Heteroatom extraction logic
│   └── similarity_analyzer.py   # Similarity analysis logic
└── README.md                    # This file

🛠️ Technical Details

Dependencies

Streamlit: Web application framework
RDKit: Cheminformatics and molecular similarity
Pandas: Data manipulation and analysis
Plotly: Interactive visualizations
Requests: API communications
NumPy: Numerical computations

API Integration

PDBe REST API: PDB structure mappings
RCSB PDB API: Chemical component data
PubChem API: Backup molecular data

Molecular Analysis

Morgan Fingerprints: Circular molecular fingerprints (radius=2, 2048 bits)
Tanimoto Similarity: Widely used fingerprint similarity metric on a 0-1 scale (1 means identical fingerprints)
Interactive Visualizations: Distribution plots, similarity rankings, statistical analysis
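
For two fingerprints A and B, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either. The sketch below shows that arithmetic on plain Python sets of on-bit indices. It is a toy illustration only: the application computes real Morgan fingerprints with RDKit, and the function name `tanimoto` and the bit indices here are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints: these indices are invented, not real Morgan bits.
fp_target = {3, 17, 256, 1021}
fp_ligand = {3, 17, 900}
print(tanimoto(fp_target, fp_ligand))  # 2 shared bits / 5 distinct bits = 0.4
```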

🎮 User Interface

Apple-Inspired Design

Modern UI: Clean, minimalist design inspired by Apple Design principles
Responsive Layout: Optimized for different screen sizes
Interactive Elements: Smooth animations and hover effects
Intuitive Navigation: Clear section organization and progress indicators

Key Features

Real-time Progress: Progress bars and status updates
Error Handling: Graceful error messages and troubleshooting
Data Export: CSV download functionality with timestamps
Result Caching: Session state management for efficiency

📊 Expected Results

Typical Output

Heteroatoms: ~1000-5000 heteroatoms per 10 UniProt proteins
SMILES Success: ~60-80% success rate for SMILES retrieval
Similar Ligands: ~50-200 similar compounds per target (similarity > 0.2)
Processing Time: 30-60 minutes for complete pipeline

File Outputs

heteroatom_results_YYYYMMDD_HHMMSS.csv: Complete heteroatom extraction results
similarity_results_YYYYMMDD_HHMMSS.csv: Molecular similarity analysis results
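
The `YYYYMMDD_HHMMSS` suffix corresponds to the `strftime` pattern `%Y%m%d_%H%M%S`. A minimal sketch of how such a file name could be built follows. The helper `timestamped_name` is hypothetical and may differ from what the application does.

```python
from datetime import datetime


def timestamped_name(prefix, now=None):
    """Build '<prefix>_YYYYMMDD_HHMMSS.csv' (hypothetical helper)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


print(timestamped_name("heteroatom_results", datetime(2024, 1, 31, 9, 5, 7)))
# heteroatom_results_20240131_090507.csv
```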

🔧 Configuration Options

Heteroatom Extraction

UniProt IDs: Multiple input formats (comma-separated, line-separated)
Result Caching: Previous results loading and management
API Settings: Automatic retry logic and rate limiting
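
Accepting both comma-separated and line-separated input amounts to splitting on commas and whitespace together. The sketch below shows one way to do that. The function `parse_uniprot_ids`, the uppercasing, and the de-duplication are assumptions made for illustration, not confirmed behavior of the app.

```python
import re


def parse_uniprot_ids(raw):
    """Split on commas and/or whitespace; uppercase; drop blanks and duplicates."""
    tokens = re.split(r"[,\s]+", raw.strip())
    ids = []
    for token in tokens:
        token = token.upper()
        if token and token not in ids:  # keep first occurrence, preserve order
            ids.append(token)
    return ids


print(parse_uniprot_ids("Q9UNQ0, P37231\nP06276\nq9unq0"))
# ['Q9UNQ0', 'P37231', 'P06276']
```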

Similarity Analysis

Fingerprint Parameters:
- Morgan radius: 1, 2, 3 (default: 2)
- Fingerprint bits: 1024, 2048, 4096 (default: 2048)
Analysis Parameters:
- Top N results: 10-100 (default: 50)
- Minimum similarity: 0.0-1.0 (default: 0.2)
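
The two analysis parameters act as a filter followed by a cut-off: drop everything below the minimum similarity, sort what remains, and keep the first N. A minimal sketch under those assumptions is shown here. The function `rank_hits`, the inclusive `>=` comparison, and the scores are all illustrative.

```python
def rank_hits(scores, top_n=50, min_similarity=0.2):
    """scores maps ligand code -> similarity; returns (code, score) pairs, best first."""
    # Assumption: the threshold is inclusive; the app may use a strict comparison.
    kept = [(code, s) for code, s in scores.items() if s >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


scores = {"ATP": 0.82, "NAG": 0.15, "HEM": 0.40, "FAD": 0.55}  # made-up values
print(rank_hits(scores, top_n=2))
# [('ATP', 0.82), ('FAD', 0.55)]
```

Raising `min_similarity` shrinks the candidate list before sorting, which is why a higher threshold also reduces the work done downstream.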

🚨 Troubleshooting

Common Issues

"Module not found" errors

```
# Install dependencies
pip install -r requirements.txt

# For RDKit installation issues on Windows
conda install -c conda-forge rdkit
```

"Invalid SMILES" warnings

Verify SMILES syntax using online validators
Check for special characters or formatting issues
Example valid SMILES: CCO (ethanol), CC(=O)O (acetic acid)
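
Many formatting problems can be caught before the string reaches a chemistry toolkit. The sketch below is a rough pre-check only: it rejects empty strings, embedded whitespace, and unbalanced parentheses or brackets, but it will still accept strings that are chemically invalid. Full validation needs a real parser (RDKit's `Chem.MolFromSmiles` returns `None` for strings it cannot parse). The function name `smiles_precheck` is hypothetical.

```python
def smiles_precheck(smiles):
    """Cheap pre-check: non-empty, no whitespace, balanced () and []."""
    if not smiles or any(ch.isspace() for ch in smiles):
        return False
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack  # leftover openers mean the string is unbalanced


print(smiles_precheck("CC(=O)O"))  # True
print(smiles_precheck("CC(=O"))    # False
```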

Slow performance

Reduce number of UniProt IDs for testing
Use higher minimum similarity threshold
Check internet connection stability

API timeout errors

Wait a few minutes and retry
Check if external APIs (RCSB, PubChem) are accessible
Reduce batch size for large datasets
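
Transient timeouts are commonly handled by retrying with an increasing delay. The following sketch shows exponential backoff around an arbitrary callable. It is an illustration of the general pattern, not the application's retry logic, and `with_retries` with its parameters is a hypothetical helper.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**i, then retry.

    The exception from the final attempt is re-raised to the caller.
    """
    for i in range(attempts):
        try:
            return func()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. In practice `func` would wrap a single API request.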

💡 Use Cases

Drug Discovery

Lead Optimization: Find similar compounds to known drugs
Scaffold Hopping: Identify alternative molecular frameworks
Target Analysis: Understand ligand binding preferences

Chemical Biology

Cofactor Analysis: Study enzyme cofactor preferences
Binding Site Analysis: Characterize pocket properties
Cross-reactivity Prediction: Assess off-target binding

Academic Research

Structural Biology: Build custom screening libraries
Comparative Analysis: Study protein-ligand interactions
Database Construction: Create specialized molecular databases

🤝 Contributing

Development Guidelines

Follow PEP 8 style guidelines
Add comprehensive error handling
Include progress indicators for long operations
Document all functions and classes
Test with various input formats

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Open Source Project - Free to use, modify, and distribute under the MIT License terms.

Please respect API terms of service and rate limits when using this application.

🙏 Acknowledgments

RCSB PDB: Protein structure data
PDBe: Structure mapping services
PubChem: Chemical information database
RDKit: Cheminformatics toolkit
Streamlit: Web application framework

👨‍💻 Developers

Project Lead/Senior Engineer: Sul sharif
Lead Engineer: Anu Gamage
Associate Engineers: Damilola Bodun, Kalana Kotawalagedara & Logan Geffen

📞 Support

For issues or questions:

Check the troubleshooting section
Verify input data format
Test with provided examples
Review browser console for errors
Contact the developers through LinkedIn

Happy molecular hunting! 🧬🔍
