ErwinATyper Tool

A comprehensive genomic analysis tool for Erwinia amylovora providing:

Multi-locus sequence typing (MLST) analysis
Locus typing (capsule, cellulose, LPS, sorbitol)
Plasmid detection
Streptomycin resistance gene identification
CRISPR genotype analysis
Type III/VI secretion system variant identification
Identification of flagellar systems

Prerequisites

Python 3.9+
Docker 26.0.0 (with the required images pulled)
BLAST 2.15.0+
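
As a quick sanity check before installing, the prerequisites above can be probed from Python. This is a minimal sketch, not part of ErwinATyper: the function name `check_prerequisites` is hypothetical, `blastn` is assumed to be the BLAST executable of interest, and only presence on PATH is checked, not the Docker or BLAST version.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # taken from the Prerequisites list


def check_prerequisites(which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    # Assumption: only checks that the executables are on PATH, not their versions.
    for tool in ("docker", "blastn"):
        if which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

The `which` and `version` parameters exist only so the check can be exercised without touching the real system.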

Installation

Clone the repository:

```
git clone https://github.com/beach-fossils/BioFago.git
cd BioFago
```

Set up a Python virtual environment (optional but recommended):

Create a virtual environment named 'biofago_env'
```
python -m venv biofago_env
```
Activate the virtual environment

On Windows:
```
biofago_env\Scripts\activate
```
On Unix or MacOS:
```
source biofago_env/bin/activate
```
Your command prompt should now show (biofago_env), indicating that the environment is active.

Using a Conda environment (alternative method)

Create a Conda environment named 'biofago_env'
```
conda create -n biofago_env python=3.9
```
Activate the Conda environment
```
conda activate biofago_env
```
Your command prompt should now show (biofago_env), indicating that the environment is active.

Install the required Python packages:
```
pip install -r requirements.txt
```
Install BLAST:

The bundled script external/blast/install_blast.sh checks whether BLAST is already installed at the required version. If it is not, the script attempts to install or update BLAST.

Note: The script requires sudo privileges on Linux and Homebrew on macOS. On Windows, run the commands below from Git Bash or WSL, because Windows CMD and PowerShell do not support chmod. If you encounter any issues, please refer to the BLAST manual installation instructions.
```
  chmod +x external/blast/install_blast.sh
  ./external/blast/install_blast.sh
```
Install Docker:

Follow the official Docker installation guide for your operating system.

Docker Images

ErwinATyper currently relies on two Docker images for part of its functionality. Pull both images before running the tool:

Prokka (for genome annotation):
```
docker pull staphb/prokka:latest
```

Average Nucleotide Identity (ANI) calculator:

```
docker pull leightonpritchard/average_nucleotide_identity:v0.2.9
```
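
To confirm that both images are available locally before a long run, something like the following could be used. It is a sketch under assumptions: `missing_images` is a hypothetical helper, not part of ErwinATyper, and it relies on `docker image inspect` exiting with a non-zero status when an image is not present locally.

```python
import subprocess

# Image names copied from the pull commands in this section.
REQUIRED_IMAGES = (
    "staphb/prokka:latest",
    "leightonpritchard/average_nucleotide_identity:v0.2.9",
)


def missing_images(images=REQUIRED_IMAGES, run=subprocess.run):
    """Return the images that `docker image inspect` cannot find locally."""
    missing = []
    for image in images:
        result = run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(image)
    return missing
```

Calling `missing_images()` with no arguments requires Docker to be installed and on PATH. The `run` parameter is there so the logic can be tested with a stand-in for `subprocess.run`.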

Usage

You can run ErwinATyper using command-line arguments.

Using Command-Line Arguments

Run the tool with the following command-line arguments:

```
python biofago_runner.py --input <input_path> --output_dir <output_directory> [options]
```

Available options:

--input: Specify the genome file (.fasta) or directory containing multiple genome files to be processed as input (mandatory)
--output_dir: Specify the output directory for results (mandatory)
--keep_sequence_loci: Flag to retain sequences for each analyzed locus (optional)
--threshold_species: Set the ANI threshold for species assignment (optional, default: 0.95)
--skip_species_assignment: Flag to skip the species identification module (optional)
--log_level: Set logging verbosity (optional, choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, default: INFO)
--batch_size: Number of genomes to process in parallel (optional, default: 0 = process all at once)
--num_workers: Number of worker processes for parallelization (optional, default: 4)
--docker_limit: Maximum number of concurrent Docker containers (optional, default: 4)
--quiet: Reduce console output and show only progress bars and essential messages (optional)
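
For readers who want to see how the options and defaults listed above fit together, here is an illustrative `argparse` sketch. It mirrors only what this README documents. The actual parser in biofago_runner.py may be structured differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(prog="biofago_runner.py")
    parser.add_argument("--input", required=True,
                        help="genome .fasta file or directory of genome files")
    parser.add_argument("--output_dir", required=True,
                        help="output directory for results")
    parser.add_argument("--keep_sequence_loci", action="store_true")
    parser.add_argument("--threshold_species", type=float, default=0.95)
    parser.add_argument("--skip_species_assignment", action="store_true")
    parser.add_argument("--log_level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    parser.add_argument("--batch_size", type=int, default=0)  # 0 = all at once
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("--docker_limit", type=int, default=4)
    parser.add_argument("--quiet", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--input", "genome1.fasta", "--output_dir", "out"])
print(args.threshold_species, args.batch_size, args.log_level)
```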

Examples:

```
# Process a single genome
python biofago_runner.py --input /path/to/genome1.fasta --output_dir /path/to/output --keep_sequence_loci

# Process a directory with multiple genomes
python biofago_runner.py --input /path/to/genomes_folder --output_dir /path/to/output --keep_sequence_loci

# Process a large batch of genomes with controlled parallelism
python biofago_runner.py --input /path/to/many_genomes --output_dir /path/to/output --batch_size 10 --num_workers 8 --docker_limit 4

# Process genomes with minimal console output (just progress bars and final result)
python biofago_runner.py --input /path/to/genomes --output_dir /path/to/output --quiet
```

Batch Processing and Parallelization

ErwinATyper supports processing large batches of genomes efficiently:

Batch Size (--batch_size): Process genomes in smaller batches to control memory usage. A value of 0 means process all genomes at once.
Worker Processes (--num_workers): Control how many genomes are processed in parallel.
Docker Container Limit (--docker_limit): Limit the number of concurrent Docker containers to prevent system overload.

For best performance on large datasets (200+ genomes):

Use a batch size of 10-20 genomes
Set worker processes based on your CPU cores (typically 4-8)
Limit Docker containers to 4-6 to avoid excessive resource usage
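
The interaction between the three settings can be pictured with a small sketch. It is not ErwinATyper's implementation: `make_batches` and `run_all` are hypothetical names, threads stand in for the worker processes for brevity, and the whole per-genome analysis is treated as the Docker-bound step.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_batches(genomes, batch_size):
    """Split genomes into batches; a batch_size of 0 means one batch with everything."""
    genomes = list(genomes)
    if batch_size <= 0:
        return [genomes] if genomes else []
    return [genomes[i:i + batch_size] for i in range(0, len(genomes), batch_size)]


def run_all(genomes, analyse, batch_size=10, num_workers=4, docker_limit=4):
    """Process batches one after another, and genomes within a batch in parallel."""
    docker_slots = threading.Semaphore(docker_limit)

    def guarded(genome):
        # At most docker_limit calls run at the same time, even with more workers.
        with docker_slots:
            return analyse(genome)

    results = []
    for batch in make_batches(genomes, batch_size):
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            results.extend(pool.map(guarded, batch))  # map keeps input order
    return results
```

In this sketch, only one batch is in flight at a time, which is what bounds memory use. Within a batch, `num_workers` bounds parallelism and the semaphore caps the container-bound work separately.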

File Naming

ErwinATyper preserves full genome filenames (without extension) in all results. For example:

Input file: GCA_023183245.1_GCA_023183245.1_ASM2318324v1_genomic.fna
Name in results: GCA_023183245.1_GCA_023183245.1_ASM2318324v1_genomic
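
The naming rule in the example above matches what `pathlib` calls the stem of a path, shown here as an illustration. How ErwinATyper derives the name internally is not stated in this README, and `result_name` is a hypothetical helper.

```python
from pathlib import Path


def result_name(genome_path):
    """Drop only the final extension, keeping earlier dots in the filename."""
    return Path(genome_path).stem


print(result_name("GCA_023183245.1_GCA_023183245.1_ASM2318324v1_genomic.fna"))
# prints GCA_023183245.1_GCA_023183245.1_ASM2318324v1_genomic
```

Note that under this rule a doubly suffixed name such as genome.fasta.gz would keep its inner .fasta part.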

Note: The species_finder folder and, when --keep_sequence_loci is used, the types_finder folder are created automatically in the specified output directory.

Output Structure

After running the tool, you can expect the following output structure:

```
output_dir/
│
├── species_finder/
│   └── all_results.csv
│
├── types_finder/ (if --keep_sequence_loci is used)
│   └── genome1/
│       ├── types_capsule/
│       ├── types_cellulose/
│       ├── types_lps/
│       └── types_srl/
│           ├── PROKKA_[DATE].fna
│           └── PROKKA_[DATE].gbk
```
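
Given the layout above, the summary table can be loaded for downstream analysis with the standard library. This is a sketch that assumes all_results.csv is comma-delimited with a header row. `load_results` is a hypothetical helper, and no column names are assumed.

```python
import csv
from pathlib import Path


def load_results(output_dir):
    """Read species_finder/all_results.csv into a list of dicts, one per row."""
    csv_path = Path(output_dir) / "species_finder" / "all_results.csv"
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Each returned row maps the header names found in the file to that row's values, so `load_results("/path/to/output")[0].keys()` lists the available columns.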

Example `.csv` output

For a comprehensive example of the analysis results, you can view a sample example.csv output file in this GitHub repository:

example.csv

About the Project

Funding & Timeline

Project Reference: PRR-C05-i03-I-000179
Funded by: Plano de Recuperação e Resiliência (PRR)
República Portuguesa
União Europeia - NextGenerationEU
Duration: Jan 2023 - Sep 2025

Partners

Universities: UM (Coordinator), FCUP, IPVC
Research Centers: INIAV
Industry: ANP, COTHN, Asfertglobal, Frutus, Granfer, CAB, Coopval, Fruoeste, Cooperfrutas

Scientific Documentation

For a detailed scientific description of the project and of ErwinATyper's implementation and methodology, please refer to the scientific document (ErwinATyper_sci_doc.pdf in this repository).

Contact

Bioinformatician / Software developer: José Diogo Moura (UM), email: [email protected]
