Releases · ncbi/egapx

15 Jul 20:21

boukn

v0.4.1-alpha

6d5d399

v0.4.1-alpha Latest

Bugfixes

Fixed formatting of long-read identifiers that was causing issues in chainer (#121)
Fixes to prepare_submission for GenBank compliance
Fixed an error where extra SRA runs were incorrectly retrieved and used when providing explicit SRA accessions in some situations
Fixed an issue on AWS where repeated HTTP connections caused connection timeout errors on tasks with periodic logging
Fixed escape slashes triggering errors in sra_uids_query (#144)

04 Jun 15:33

boukn

v0.4.0-alpha

e787e70

v0.4.0-alpha

New features

Added support for long-read RNA sequencing (e.g. PacBio, ONT) as evidence for annotation. This can increase isoform diversity and model accuracy
- uses long_reads: YAML parameter
- supports long reads from SRA or local files
Added prepare_egapx_gb_submission.py to support annotation submission to GenBank for both new and existing assemblies.
- The script produces submission files, but they are not yet GenBank submission-compliant. A subsequent patch release (likely 0.4.1) will produce GenBank-compliant files.
- Existing assemblies must have been processed through GenBank (not EMBL or DDBJ) and not submitted by a different group.
Added quality assessments:
- BUSCO for annotation quality evaluation, automatically run on one longest isoform per gene with the closest available BUSCO lineage dataset. This is equivalent to BUSCO reporting on NCBI RefSeq annotations
- Contamination report from gnomon_biotype to help identify potential assembly problems
- Gnomon annotation reports summarizing evidence for each model: gnomon_report.txt, gnomon_quality_report.txt
- Reporting of RNA-seq model support in final annotation as part of the model_evidence attribute in GFF3
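The "one longest isoform per gene" selection used for the BUSCO evaluation can be pictured with a minimal sketch. The record layout, the function name `longest_isoform_per_gene`, and the tie-break rule (first protein seen wins) are assumptions made for illustration; they are not taken from the EGAPx code.

```python
def longest_isoform_per_gene(proteins):
    """Pick one longest protein per gene.

    `proteins` is an iterable of (gene_id, protein_id, sequence) tuples.
    This layout is hypothetical, chosen only for the illustration.
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        # On ties the first protein seen is kept (an assumed rule).
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


# Toy records: gene1 has two isoforms, gene2 has one.
records = [
    ("gene1", "prot1.a", "MKT"),
    ("gene1", "prot1.b", "MKTAYIA"),
    ("gene2", "prot2.a", "MSLL"),
]
selected = longest_isoform_per_gene(records)
print({gene: protein_id for gene, (protein_id, _) in selected.items()})
```

Reducing each gene to a single protein keeps alternative isoforms from being counted as duplicated BUSCO hits.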
Added additional outputs:
- Added a summary of RNA-seq datasets at the beginning of the EGAPx run
- Added a warning when an SRA Entrez query brings more than 20 RNA-seq datasets into scope, to help prevent runs with unexpectedly large amounts of data. Use --force to override
- STAR BAM alignment files. Use export_bam: true to produce BAM output
Parameter changes:
- Short-read RNA sequencing is now specified with the short_reads: YAML parameter, to better distinguish it from the long_reads: option
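As a rough illustration of how the three YAML keys named in these notes (short_reads, long_reads, export_bam) might sit together in an input file, the sketch below renders a small Python dict as YAML text. The accession values are placeholders, the list-of-accessions value format is an assumption, and the helper `to_yaml` is hypothetical; a real input file needs further settings that these notes do not list.

```python
def to_yaml(config):
    """Render a flat dict of scalars and lists as simple YAML text."""
    lines = []
    for key, value in config.items():
        if isinstance(value, bool):
            # YAML booleans are lowercase.
            lines.append(f"{key}: {'true' if value else 'false'}")
        elif isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Placeholder accessions; only the key names come from the release notes.
config = {
    "short_reads": ["SRR0000001", "SRR0000002"],
    "long_reads": ["SRR0000003"],
    "export_bam": True,
}
yaml_text = to_yaml(config)
print(yaml_text)
```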
Data model revisions:
- Revised Ig/TCR segment annotations to omit CDS, and pseudogenes to use mRNA instead of individual exon features, for better compatibility with GenBank submission
- Gene naming improvements
- Increased representation of alternative splicing based on RNA-seq support
- Adopted some RefSeq filtering criteria:
  - recategorization of some transcripts as 'transcript' or 'misc_RNA' when identified as NMD candidates
  - filtering out weakly supported two-exon lncRNAs
  - filtering out alternate splice isoforms that lack support for all introns from either a long transcript or a single RNA-seq sample or run
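The last criterion above can be read as: an alternate isoform is kept only if at least one single source (a long transcript, or one RNA-seq sample or run) supports every one of its introns. The sketch below illustrates that reading only. The function name, the intron representation as (start, end) tuples, and the source labels are assumptions, and this is not Gnomon's implementation.

```python
def has_single_source_support(isoform_introns, introns_by_source):
    """Return True if one source alone supports all introns of the isoform.

    `introns_by_source` maps a source label (hypothetical names below)
    to the collection of introns that source supports.
    """
    introns = set(isoform_introns)
    return any(introns <= set(supported)
               for supported in introns_by_source.values())


introns_by_source = {
    "long_read_1": {(100, 200), (300, 400)},
    "sample_A": {(100, 200)},
    "sample_B": {(300, 400), (500, 600)},
}

# Both introns are supported together by long_read_1.
fully_supported = [(100, 200), (300, 400)]
# Each intron is supported, but never by the same source.
patchwork = [(100, 200), (500, 600)]

print(has_single_source_support(fully_supported, introns_by_source))
print(has_single_source_support(patchwork, introns_by_source))
```

Requiring all introns from one source guards against isoforms stitched together from introns that never co-occur in any observed transcript.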
Additional improvements:
- Updated to miniprot-0.15 to improve handling of protein alignments in tandem gene duplications
- Added genome masking using WindowMasker, which is used to exclude regions from Gnomon ab initio gene prediction. This does not impact gene prediction based on protein or transcript alignment evidence
- Added logic to include some models predicted solely by ab initio logic based on high coverage alignments to SwissProt proteins, referred to as "secondary support". This helps identify some genes that lack supporting transcript or protein alignments to the genome
Support Data updates:
- Added target protein set for Echinoderms
- Added ortholog reference files for Bombyx mori and Tribolium castaneum, used for orthology calls and nomenclature in Lepidoptera and Coleoptera, respectively
- Updated SwissProt protein support data to the latest release and removed some proteins whose names do not conform to GenBank standards
Documentation updates:
- Revised docs to include new features
- Added Table of Contents
- Added FAQ

Bugfixes

Fixed a problem where short-read SRA queries were returning (and subsequently aligning) long reads (#80)
Putative fix for integer seq-ids (#60)
Fixed a problem where miniprot alignment processing was incompatible with proteins containing internal stops (#74)

07 Jan 19:15

etvedte

v0.3.2-alpha

92cf9bd

v0.3.2-alpha

Bug fix release 0.3.2-alpha
Fixed issues:

Issue #61 run_gnomon_biotype errors
Issue #65 SRR formatting errors
Print effective max_intron parameter applied during run
Enable hard stop on EGAPx pipeline for unsupported taxids

18 Nov 22:53

victzh

v0.3.1-alpha

1057884

v0.3.1-alpha

Bug fix release 0.3.1-alpha
Fixed issues:

Issues #44, #49 FTP access for ortho files ("Could not find path for ortho taxid")
Issue #47 incorrect temp directory for rnaseq_divide_by_strandedness
Issue #37 gnomon_training error

05 Nov 22:17

victzh

v0.3.0-alpha

f21ac06

Release 0.3.0-alpha

New features integrated from RefSeq EGAP:

Ortholog analysis vs. a pre-defined reference species
Refinement of gene biotype (protein-coding, pseudogene, lncRNA) based on annotation and orthology properties
Assignment of gene symbols, names, and protein names based on orthology or comparison to SwissProt proteins
Better annotation of single-exon protein-coding genes based on well supported proteins
Automatic selection of organism symbol format, ortholog reference species, protein reference sets, maximum intron size, and some annotation-related parameters
Added target protein sets for plant clades and additional vertebrates
Integration of structural and functional annotation into final output, including: ASN.1, GFF, GTF, mRNA FASTA, CDS FASTA, protein FASTA

Execution improvements:

Added versioning for EGAPx (egapx.py runner, Docker/Singularity images)
Added check for user input files
Improved support for pre-download of reference files
Updated STAR to produce csi index instead of bai index to work for large sequences
Increased time limit for chainer
Updated chunk size for miniprot tasks to 25k
Enable skipping gnomon training when parameters from closely-related taxa are available
Relocated Python requirements.txt to repo root
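The move from BAI to CSI indexes matters because the BAI format cannot address positions beyond 2^29 bp (512 Mbp) on a single sequence, while CSI can. A minimal sketch of that decision follows. The function name and the example sequence lengths are hypothetical, and how EGAPx actually invokes indexing is not shown in these notes.

```python
# BAI indexes can only address positions up to 2**29 bp on one sequence.
BAI_MAX_SEQUENCE_LENGTH = 2**29  # 536,870,912 bp


def index_format_for(sequence_lengths):
    """Choose an index format from a mapping of sequence name to length."""
    longest = max(sequence_lengths.values())
    return "csi" if longest > BAI_MAX_SEQUENCE_LENGTH else "bai"


# Hypothetical assemblies: one with a very long chromosome, one without.
large_genome = {"chr1": 600_000_000, "chr2": 150_000_000}
small_genome = {"chr1": 100_000_000, "chr2": 80_000_000}

print(index_format_for(large_genome))
print(index_format_for(small_genome))
```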

Future plans:

Workflow for GenBank submission. Contact us if you want to help with testing.
Long-read transcript evidence using minimap2
Short ncRNA prediction with tRNAscan and Rfam

26 Jul 00:44

pstrope

v0.2-alpha

314cbbc

Release v0.2-alpha

Updated resource allocation for different tasks
Added support for non-SRA reads
Added option for off-line mode
Bug fixes

09 May 13:31

pstrope

v0.1.2-alpha

5a916d3

Release v0.1.2-alpha

Added configs for the Biowulf cluster and for Biowulf local
Added a config for SLURM, which users will need to edit according to their cluster specifications
Bug fixes

01 Apr 14:48

pstrope

v0.1.0-alpha

203afce

EGAPx alpha release Pre-release

This version of EGAPx is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use.
