Releases: ncbi/egapx
v0.4.1-alpha
Bugfixes
- Fixed formatting of long-read identifiers that was causing issues in chainer #121
- Fixes to prepare_submission for GenBank compliance
- Fixed an error where extra SRA runs were incorrectly retrieved and used when providing explicit SRA accessions in some situations
- Fixed an issue on AWS to prevent repeated HTTP connections causing connection timeout errors on tasks with periodic logging
- Fixed escape slashes triggering errors in sra_uids_query #144
v0.4.0-alpha
New features
- Added support for long-read RNA sequencing (e.g. PacBio, ONT) as evidence for annotation. This can increase isoform diversity and model accuracy
  - Uses the long_reads: YAML parameter
  - Supports long reads from SRA or local files
- Added prepare_egapx_gb_submission.py to support annotation submission to GenBank for both new and existing assemblies.
  - The script produces submission files; however, they are not yet GenBank submission-compliant. A subsequent patch release (likely 0.4.1) will produce GenBank-compliant files.
  - Existing assemblies must have been processed through GenBank (not EMBL or DDBJ) and must not have been submitted by a different group.
- Added quality assessments:
  - BUSCO for annotation quality evaluation, automatically run on one longest isoform per gene with the closest available BUSCO lineage dataset. This is equivalent to BUSCO reporting on NCBI RefSeq annotations
  - Contamination report from gnomon_biotype to help identify potential assembly problems
  - Gnomon annotation reports summarizing evidence for each model: gnomon_report.txt, gnomon_quality_report.txt
  - Reporting of RNA-seq model support in final annotation as part of the model_evidence attribute in GFF3
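The BUSCO item above evaluates one longest isoform per gene. A minimal sketch of that selection step follows, assuming each isoform is available as a (gene ID, isoform ID, length) tuple. The function name, the tuple layout, and the tie-breaking rule (first seen wins) are illustrative assumptions, not EGAPx internals.

```python
def longest_isoform_per_gene(isoforms):
    """Pick one longest isoform per gene.

    isoforms: iterable of (gene_id, isoform_id, length) tuples
              (hypothetical layout, chosen for this sketch).
    Returns a dict mapping gene_id -> isoform_id.
    """
    best = {}  # gene_id -> (isoform_id, length)
    for gene_id, isoform_id, length in isoforms:
        current = best.get(gene_id)
        # Strict '>' keeps the first isoform seen when lengths tie.
        if current is None or length > current[1]:
            best[gene_id] = (isoform_id, length)
    return {gene_id: isoform_id for gene_id, (isoform_id, _) in best.items()}


# Made-up identifiers and lengths, for illustration only.
example = [
    ("gene1", "gene1.t1", 310),
    ("gene1", "gene1.t2", 452),
    ("gene2", "gene2.t1", 120),
]
selected = longest_isoform_per_gene(example)
print(selected)
```

Reducing each gene to a single representative keeps alternative isoforms of the same gene from being scored more than once.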
- Added additional outputs:
  - Added summary of RNA-seq datasets at the beginning of an EGAPx run
  - Added warning message when >20 RNA-seq datasets are in scope using an SRA Entrez query, to help prevent runs with unexpectedly large amounts of data. Use --force to override
  - STAR BAM alignment files. Use export_bam: true to produce BAM output
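The dataset-count safeguard described above can be pictured as a simple guard. This is a sketch only: the threshold of 20 and the --force override come from the release notes, while the function name, the message text, and the choice to stop the run rather than merely warn are assumptions.

```python
import sys

MAX_DATASETS_WITHOUT_FORCE = 20  # threshold stated in the release notes


def may_proceed(run_ids, force=False, limit=MAX_DATASETS_WITHOUT_FORCE):
    """Return True if the run should continue.

    run_ids: RNA-seq datasets matched by a query (hypothetical input).
    force:   stands in for the --force command-line flag.
    """
    if len(run_ids) > limit and not force:
        # Assumed behavior: warn on stderr and refuse to continue.
        print(
            f"Warning: {len(run_ids)} RNA-seq datasets are in scope "
            f"(more than {limit}); rerun with --force to override.",
            file=sys.stderr,
        )
        return False
    return True


# Placeholder accessions, not real SRA runs.
runs = [f"SRR{n:07d}" for n in range(1, 26)]
blocked = may_proceed(runs)
allowed = may_proceed(runs, force=True)
```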
- Parameter changes:
  - Updated short-read RNA sequencing specification to short_reads: YAML parameter, to better distinguish from the long_reads option
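To show how the short_reads, long_reads, and export_bam keys named in these notes might sit together in an input file, here is a small sketch that renders a YAML-style fragment. Only the three key names come from the release notes. The list-of-strings value shape, the placeholder accession, the placeholder path, and the helper name are assumptions, and a real input file needs other fields not covered here.

```python
def render_reads_config(short_reads, long_reads, export_bam=False):
    """Render a YAML-style fragment for the read-related keys.

    Key names follow the release notes; treating each value as a
    list of strings is an assumption made for this sketch.
    """
    lines = []
    for key, items in (("short_reads", short_reads), ("long_reads", long_reads)):
        if items:
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in items)
    if export_bam:
        lines.append("export_bam: true")
    return "\n".join(lines) + "\n"


config_text = render_reads_config(
    short_reads=["SRR0000001"],                # placeholder accession
    long_reads=["/path/to/long_reads.fastq"],  # placeholder local file
    export_bam=True,
)
print(config_text)
```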
- Data model revisions:
  - Revised Ig/TCR segment annotations to omit CDS, and revised pseudogenes to use mRNA instead of individual exon features, for better compatibility with GenBank submission
  - Gene naming improvements
  - Increased representation of alternative splicing based on RNA-seq support
  - Adopted some RefSeq filtering criteria:
    - Recategorization of some transcripts as 'transcript' or 'misc_RNA' when identified as NMD candidates
    - Filtering out weakly supported two-exon lncRNAs
    - Filtering out alternate splice isoforms that lack support for all introns from either a long transcript or a single RNA-seq sample or run
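The last filtering criterion above requires that all introns of an alternate isoform be supported by one source: a long transcript or a single RNA-seq sample or run. A minimal sketch of that test follows, assuming introns are represented as (start, end) tuples and evidence is grouped per source. The function name and data layout are hypothetical, and the actual Gnomon logic may differ.

```python
def supported_by_single_source(isoform_introns, introns_by_source):
    """Return True if one source alone supports every intron.

    isoform_introns:   iterable of (start, end) tuples for the isoform.
    introns_by_source: dict mapping a source ID (a long transcript, or an
                       RNA-seq sample or run) to the introns it supports.
    """
    required = set(isoform_introns)
    # Set inclusion: every required intron must appear in the same source.
    return any(required <= set(observed) for observed in introns_by_source.values())


isoform = [(100, 200), (300, 400)]

# Support split across two samples: neither covers both introns.
split_support = {"sampleA": {(100, 200)}, "sampleB": {(300, 400)}}

# One sample covers both introns (plus an unrelated one).
joint_support = {"sampleA": {(100, 200), (300, 400), (500, 600)}}

kept_when_split = supported_by_single_source(isoform, split_support)
kept_when_joint = supported_by_single_source(isoform, joint_support)
```

Requiring a single source avoids keeping isoforms whose intron chain is only ever seen piecemeal across unrelated datasets.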
- Additional improvements:
  - Updated to miniprot-0.15 to improve handling of protein alignments in tandem gene duplications
  - Added genome masking using WindowMasker, which is used to exclude regions from Gnomon ab initio gene prediction. This does not impact gene prediction based on protein or transcript alignment evidence
  - Added logic to include some models predicted solely by ab initio logic based on high coverage alignments to SwissProt proteins, referred to as "secondary support". This helps identify some genes that lack supporting transcript or protein alignments to the genome
- Support Data updates:
  - Added target protein set for Echinoderms
  - Added ortholog reference files for Bombyx mori and Tribolium castaneum, used for orthology calls and nomenclature in Lepidoptera and Coleoptera, respectively
  - Updated SwissProt protein support data to the latest release and removed some proteins whose names do not conform to GenBank standards
- Documentation updates
  - Revised docs to include new features
  - Added Table of Contents
  - Added FAQ
Bugfixes
v0.3.2-alpha
v0.3.1-alpha
Release 0.3.0-alpha
New features integrated from RefSeq EGAP:
- Ortholog analysis vs. a pre-defined reference species
- Refinement of gene biotype (protein-coding, pseudogene, lncRNA) based on annotation and orthology properties
- Assignment of gene symbols, names, and protein names based on orthology or comparison to SwissProt proteins
- Better annotation of single-exon protein-coding genes based on well supported proteins
- Automatic selection of organism symbol format, ortholog reference species, protein reference sets, maximum intron size, and some annotation-related parameters
- Added target protein sets for plant clades and additional vertebrates
- Integration of structural and functional annotation into final output, including: ASN.1, GFF, GTF, mRNA FASTA, CDS FASTA, protein FASTA
Execution improvements:
- Added versioning for EGAPx (egapx.py runner, Docker/Singularity images)
- Added check for user input files
- Improved support for pre-download of reference files
- Updated STAR to produce csi index instead of bai index to work for large sequences
- Increased time limit for chainer
- Updated chunk size for miniprot tasks to 25k
- Enable skipping gnomon training when parameters from closely-related taxa are available
- Relocated Python requirements.txt to repo root
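On the switch from bai to csi indexes noted above: the BAI format can only address reference sequences up to 2^29 bases (512 Mbp), which is why large sequences need CSI. The sketch below illustrates that limit with a hypothetical helper that picks an index format from sequence lengths. The release notes state only that STAR output now uses a csi index, so this selection logic is not EGAPx code.

```python
BAI_MAX_REFERENCE_LENGTH = 2 ** 29  # 536,870,912 bp, the BAI addressing limit


def index_format_for(reference_lengths):
    """Return 'csi' if any sequence exceeds the BAI limit, else 'bai'.

    reference_lengths: dict mapping sequence name -> length in bp
                       (hypothetical input for this sketch).
    """
    longest = max(reference_lengths.values(), default=0)
    return "csi" if longest > BAI_MAX_REFERENCE_LENGTH else "bai"


# Made-up sequence names and lengths.
small_genome = {"chr1": 150_000_000, "chr2": 90_000_000}
large_genome = {"chr1": 700_000_000}

small_choice = index_format_for(small_genome)
large_choice = index_format_for(large_genome)
```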
Future plans:
- Workflow for GenBank submission. Contact us if you want to help with testing.
- Long-read transcript evidence using minimap2
- Short ncRNA prediction with tRNAscan and Rfam
Release v0.2-alpha
- Updated resource allocation for different tasks
- Added support for non-SRA reads
- Added option for off-line mode
- Bug fixes
Release v0.1.2-alpha
- Added configs for Biowulf cluster and Biowulf local
- Added config for SLURM, which users will need to edit according to their cluster specifications
- Bug fixes
EGAPx alpha release
This version of EGAPx is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use.