A comprehensive bioinformatics pipeline for extracting heteroatoms from protein structures and finding molecularly similar compounds using fingerprint-based similarity analysis.
© 2025 Standard Seed Corporation. This is an open-source project developed and released by Standard Seed Corporation under the MIT License. All rights reserved.
TrackMyPDB is a user-friendly Streamlit web application that combines two powerful components:
- Heteroatom Extraction Tool: Systematically extracts all heteroatoms from PDB structures associated with UniProt proteins
- Molecular Similarity Analyzer: Finds ligands most similar to a target molecule using Morgan fingerprints and Tanimoto similarity
- Python 3.7+
- Internet connection for API calls
- Windows OS (optimized for Windows environment)
- Clone the repository:
  git clone <repository-url>
  cd TrackMyPDB
- Install dependencies:
  pip install -r requirements.txt
- Launch the application:
  streamlit run streamlit_app.py
- Open your browser to http://localhost:8501
- Navigate to the web interface
- Choose analysis type:
- Heteroatom Extraction
- Similarity Analysis
- Complete Pipeline
- Input your data:
- UniProt IDs (e.g., Q9UNQ0, P37231, P06276)
- Target SMILES structure
- Run analysis and download CSV results
- Input: UniProt protein identifiers
- Process: Fetches PDB structures, extracts heteroatoms, retrieves SMILES
- Output: Comprehensive CSV with chemical information
- APIs: RCSB PDB, PubChem integration
- Features: Progress tracking, error handling, result caching
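The extraction step above can be pictured with a minimal sketch that collects distinct heteroatom residue codes from PDB-format text. It assumes the legacy fixed-column PDB layout and skips water by default. The function name `extract_het_codes` and the sample records are hypothetical and are not taken from `backend/heteroatom_extractor.py`.

```python
def extract_het_codes(pdb_text, skip=("HOH",)):
    """Return sorted unique (residue_name, chain_id) pairs found in HETATM records."""
    found = set()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()  # columns 18-20: residue name
        chain_id = line[21:22].strip()  # column 22: chain identifier
        if res_name and res_name not in skip:
            found.add((res_name, chain_id))
    return sorted(found)


# Hand-written sample records (illustrative only, not from a real entry).
SAMPLE = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "HETATM 1001  PA  ATP A 501      10.000  11.000  12.000",
    "HETATM 1002  O1A ATP A 501      10.500  11.500  12.500",
    "HETATM 1100 MG    MG B 601       5.000   6.000   7.000",
    "HETATM 1200  O   HOH A 701       1.000   2.000   3.000",
])

print(extract_het_codes(SAMPLE))
```

Each residue code found this way could then be looked up against a chemical component service to retrieve its SMILES string.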
- Input: Target SMILES structure
- Process: Morgan fingerprint computation, Tanimoto similarity calculation
- Output: Ranked similarity results with interactive visualizations
- Features: Configurable parameters, real-time analysis, comprehensive reports
- Workflow: End-to-end processing from UniProt IDs to similarity results
- Integration: Automatic heteroatom extraction followed by similarity analysis
- Output: Both heteroatom database and similarity results
TrackMyPDB/
├── streamlit_app.py               # Main Streamlit application
├── requirements.txt               # Python dependencies
├── backend/
│   ├── __init__.py                # Package initialization
│   ├── heteroatom_extractor.py    # Heteroatom extraction logic
│   └── similarity_analyzer.py     # Similarity analysis logic
└── README.md                      # This file
- Streamlit: Web application framework
- RDKit: Cheminformatics and molecular similarity
- Pandas: Data manipulation and analysis
- Plotly: Interactive visualizations
- Requests: API communications
- NumPy: Numerical computations
- PDBe REST API: PDB structure mappings
- RCSB PDB API: Chemical component data
- PubChem API: Backup molecular data
- Morgan Fingerprints: Circular molecular fingerprints (radius=2, 2048 bits)
- Tanimoto Similarity: Industry-standard similarity metric (0-1 scale)
- Interactive Visualizations: Distribution plots, similarity rankings, statistical analysis
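To make the Tanimoto metric concrete, here is a small sketch that computes it on fingerprints represented as sets of on-bit indices: the number of shared bits divided by the number of bits set in either fingerprint. In the application the bits would come from RDKit Morgan fingerprints; the hand-written sets and the `tanimoto` helper below are illustrative assumptions only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints: 3 shared bits, 6 distinct bits in total.
fp_query = {3, 17, 42, 128, 901}
fp_ligand = {3, 17, 42, 512}

score = tanimoto(fp_query, fp_ligand)
print(score)
```

A score of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.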
- Modern UI: Clean, minimalist design inspired by Apple Design principles
- Responsive Layout: Optimized for different screen sizes
- Interactive Elements: Smooth animations and hover effects
- Intuitive Navigation: Clear section organization and progress indicators
- Real-time Progress: Progress bars and status updates
- Error Handling: Graceful error messages and troubleshooting
- Data Export: CSV download functionality with timestamps
- Result Caching: Session state management for efficiency
- Heteroatoms: ~1000-5000 heteroatoms per 10 UniProt proteins
- SMILES Success: ~60-80% success rate for SMILES retrieval
- Similar Ligands: ~50-200 similar compounds per target (similarity > 0.2)
- Processing Time: 30-60 minutes for complete pipeline
- heteroatom_results_YYYYMMDD_HHMMSS.csv: Complete heteroatom extraction results
- similarity_results_YYYYMMDD_HHMMSS.csv: Molecular similarity analysis results
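The naming pattern for the output files can be reproduced with a short sketch like the one below. The helper name `timestamped_name` is hypothetical; the application may build its file names differently.

```python
from datetime import datetime


def timestamped_name(prefix, when=None):
    """Build '<prefix>_YYYYMMDD_HHMMSS.csv' for the given (or current) time."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.csv"


print(timestamped_name("heteroatom_results"))
print(timestamped_name("similarity_results"))
```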
- UniProt IDs: Multiple input formats (comma-separated, line-separated)
- Result Caching: Previous results loading and management
- API Settings: Automatic retry logic and rate limiting
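Accepting both comma-separated and line-separated UniProt IDs can be handled with a single split on commas and whitespace. The following is a rough sketch under that assumption; `parse_uniprot_ids` is a hypothetical helper, and upper-casing plus de-duplication are choices made here rather than documented behavior.

```python
import re


def parse_uniprot_ids(raw):
    """Split free-form input on commas/whitespace; upper-case and de-duplicate in order."""
    ids = []
    for token in re.split(r"[,\s]+", raw.strip()):
        token = token.upper()
        if token and token not in ids:
            ids.append(token)
    return ids


print(parse_uniprot_ids("Q9UNQ0, P37231\nP06276"))
```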
- Fingerprint Parameters:
- Morgan radius: 1, 2, 3 (default: 2)
- Fingerprint bits: 1024, 2048, 4096 (default: 2048)
- Analysis Parameters:
- Top N results: 10-100 (default: 50)
- Minimum similarity: 0.0-1.0 (default: 0.2)
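One way to picture how these parameters interact is a small settings object plus a ranking step that applies the similarity cutoff before truncating to the top N. The class and function names are hypothetical, and treating the minimum similarity as inclusive (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisParams:
    radius: int = 2              # Morgan radius: 1, 2, or 3
    n_bits: int = 2048           # fingerprint size: 1024, 2048, or 4096
    top_n: int = 50              # number of results to keep: 10-100
    min_similarity: float = 0.2  # Tanimoto cutoff: 0.0-1.0

    def __post_init__(self):
        if self.radius not in (1, 2, 3):
            raise ValueError("radius must be 1, 2, or 3")
        if self.n_bits not in (1024, 2048, 4096):
            raise ValueError("n_bits must be 1024, 2048, or 4096")
        if not 10 <= self.top_n <= 100:
            raise ValueError("top_n must be between 10 and 100")
        if not 0.0 <= self.min_similarity <= 1.0:
            raise ValueError("min_similarity must be between 0.0 and 1.0")


def rank_hits(scores, params):
    """Filter a {ligand: similarity} mapping by the cutoff, sort descending, keep top N."""
    kept = [(name, s) for name, s in scores.items() if s >= params.min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:params.top_n]


# Made-up scores for illustration.
example_scores = {"ATP": 0.91, "HEM": 0.15, "NAD": 0.40}
print(rank_hits(example_scores, AnalysisParams()))
```

Raising the cutoff shrinks the candidate list before sorting, which is why a higher minimum similarity also helps with slow runs.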
# Install dependencies
pip install -r requirements.txt
# For RDKit installation issues on Windows
conda install -c conda-forge rdkit
- Verify SMILES syntax using online validators
- Check for special characters or formatting issues
- Example valid SMILES: CCO (ethanol), CC(=O)O (acetic acid)
- Reduce number of UniProt IDs for testing
- Use higher minimum similarity threshold
- Check internet connection stability
- Wait a few minutes and retry
- Check if external APIs (RCSB, PubChem) are accessible
- Reduce batch size for large datasets
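Transient API failures are usually handled by retrying with an increasing delay. The sketch below shows that general pattern with exponential backoff; `with_retries` is a hypothetical helper and not necessarily how the application's own retry logic is written.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # real code should catch narrower error types
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Simulated endpoint that fails twice before succeeding.
state = {"calls": 0}


def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []  # record requested delays instead of actually sleeping
result = with_retries(flaky_fetch, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the helper easy to test without real waiting.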
- Lead Optimization: Find similar compounds to known drugs
- Scaffold Hopping: Identify alternative molecular frameworks
- Target Analysis: Understand ligand binding preferences
- Cofactor Analysis: Study enzyme cofactor preferences
- Binding Site Analysis: Characterize pocket properties
- Cross-reactivity Prediction: Assess off-target binding
- Structural Biology: Build custom screening libraries
- Comparative Analysis: Study protein-ligand interactions
- Database Construction: Create specialized molecular databases
- Follow PEP 8 style guidelines
- Add comprehensive error handling
- Include progress indicators for long operations
- Document all functions and classes
- Test with various input formats
This project is licensed under the MIT License - see the LICENSE file for details.
Open Source Project - Free to use, modify, and distribute under the MIT License terms.
Please respect API terms of service and rate limits when using this application.
- RCSB PDB: Protein structure data
- PDBe: Structure mapping services
- PubChem: Chemical information database
- RDKit: Cheminformatics toolkit
- Streamlit: Web application framework
- Project Lead/Senior Engineer: Sul sharif
- Lead Engineer: Anu Gamage
- Associate Engineers: Damilola Bodun, Kalana Kotawalagedara & Logan Geffen
For issues or questions:
- Check the troubleshooting section
- Verify input data format
- Test with provided examples
- Review browser console for errors
- Contact the developers through LinkedIn
Happy molecular hunting!