Skip to content

Commit b2e6f1a

Browse files
author
Victor Joukov
committed
Release 0.3.1-alpha
1 parent f21ac06 commit b2e6f1a

File tree

10 files changed

+57
-85
lines changed

10 files changed

+57
-85
lines changed

README.md

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
EGAPx is the publicly accessible version of the updated NCBI [Eukaryotic Genome Annotation Pipeline](https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/process/).
44

5-
EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. Functional annotation is added to the final structural annotation set based on the type and quality of the model and orthology information. The final annotation for the input assembly is produced as a `gff` file.
5+
EGAPx takes an assembly FASTA file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. Functional annotation is added to the final structural annotation set based on the type and quality of the model and orthology information. The final annotation for the input assembly is produced as a `gff` file.
66

77
We currently have protein datasets posted that are suitable for most vertebrates, arthropods, and some plants:
88
- Chordata - Mammalia, Sauropsida, Actinopterygii (ray-finned fishes), other Vertebrates
@@ -363,8 +363,9 @@ Description of the outputs:
363363
* `complete.transcripts.fna`: annotated transcripts in FASTA format (includes UTRs).
364364
* `complete.proteins.faa`: annotated protein products in FASTA format.
365365
* `annotated_genome.asn`: final annotation set in ASN1 format.
366+
366367
Description of the logs and miscellaneous outputs:
367-
* `annot_builder_output/accept.ftable_annot`: intermediate file with accepted annotation models called by GNOMON.
368+
* `annot_builder_output/accept.ftable_annot`: intermediate file with accepted annotation models called by Gnomon.
368369
* `annotation_data.cmt`: annotation structured comment file. Used for submission to GenBank.
369370
* `nextflow.log`: main Nextflow log that captures all the process information and their work directories.
370371
* `resume.sh`: Nextflow command for resuming a run from the last successful task.
@@ -377,13 +378,13 @@ Description of the logs and miscellaneous outputs:
377378
378379
## Interpreting Output
379380
380-
`stats/feature_counts.xml` contains summary counts of features by model prediction categories determined by GNOMON.
381+
`stats/feature_counts.xml` contains summary counts of features by model prediction categories determined by Gnomon.
381382
382383
**NOTE** not all categories are expected to have counts data (e.g. model RefSeq, fully supported, ab initio)
383384
384385
Genes with `major correction` are likely protein-coding genes with frameshifts and/or internal stops. These models include "LOW QUALITY PROTEIN" in the protein FASTA title, are marked up with exception=low-quality sequence region on the mRNA and CDS features, and the annotation is adjusted to meet GenBank criteria (frameshifts are compensated for by 1-2 bp microintrons in the mRNA and CDS features, and internal stops have a transl_except to translate the codon as X instead of a stop). For RefSeq, we set a threshold of no more than 10% of protein-coding genes with major corrections to release the annotation. We recommend users polish assembly sequences if the rate is higher than 10%.
385386
386-
Counts of protein-coding genes should be considered versus similar species. Low counts may result from insufficient supporting evidence (e.g. low RNAseq coverage or an unusual organism compared to the available protein data). High counts may indicate genome fragmentation or noise from genes annotated on transposons.
387+
Counts of protein-coding genes should be considered versus similar species. Low counts may result from insufficient supporting evidence (e.g. low RNAseq coverage or an unusual organism compared to the available protein data). High counts may indicate genome fragmentation, uncollapsed haplotypic duplication, or noise from genes annotated on transposons.
387388
388389
`stats/feature_stats.xml` contains summary statistics on transcript counts per gene, exon counts per transcript, and the counts and length distributions of features by sub-type.
389390
@@ -419,7 +420,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
419420
```
420421
rm egap*sif
421422
singularity cache clean
422-
singularity pull docker://ncbi/egapx:0.3.0-alpha
423+
singularity pull docker://ncbi/egapx:0.3.1-alpha
423424
```
424425
425426
- Clone the repo:

nf/subworkflows/ncbi/gnomon-training-iteration/gnomon_training_iterations/main.nf

Lines changed: 9 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -17,24 +17,23 @@ workflow gnomon_training_iterations {
1717
chainer_gap_fill_allowlist
1818
chainer_trusted_genes
1919
chainer_scaffolds
20-
gnomon_softmask_lds2
21-
gnomon_softmask_lds2_source
20+
gnomon_softmask
2221
gnomon_scaffolds
2322
max_intron
2423
parameters
2524
main:
2625
gnomon_training_iteration(models_file, genome_asn, proteins_asn ,chainer_alignments,chainer_evidence_denylist,chainer_gap_fill_allowlist,
27-
chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
28-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
26+
chainer_trusted_genes, chainer_scaffolds,
27+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
2928
gnomon_training_iteration2(gnomon_training_iteration.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
30-
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
31-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
29+
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds,
30+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
3231
gnomon_training_iteration3(gnomon_training_iteration2.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
33-
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
34-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
32+
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds,
33+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
3534
gnomon_training_iteration4(gnomon_training_iteration3.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
36-
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
37-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
35+
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds,
36+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
3837

3938
emit:
4039
hmm_params_file = gnomon_training_iteration4.out.hmm_params_file

nf/subworkflows/ncbi/gnomon-training-iteration/main.nf

Lines changed: 11 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -9,32 +9,31 @@ include { gnomon_training_iteration; gnomon_training_iteration as gnomon_trainin
99

1010
workflow gnomon_training_iterations {
1111
take:
12-
models_file
12+
initial_hmm_params
1313
genome_asn
1414
proteins_asn
1515
chainer_alignments
1616
chainer_evidence_denylist
1717
chainer_gap_fill_allowlist
1818
chainer_trusted_genes
1919
chainer_scaffolds
20-
gnomon_softmask_lds2
21-
gnomon_softmask_lds2_source
20+
gnomon_softmask
2221
gnomon_scaffolds
2322
max_intron
2423
parameters
2524
main:
26-
gnomon_training_iteration(models_file, genome_asn, proteins_asn ,chainer_alignments,chainer_evidence_denylist,chainer_gap_fill_allowlist,
27-
chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
28-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
25+
gnomon_training_iteration(initial_hmm_params, genome_asn, proteins_asn ,chainer_alignments,chainer_evidence_denylist,chainer_gap_fill_allowlist,
26+
chainer_trusted_genes, chainer_scaffolds,
27+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
2928
gnomon_training_iteration2(gnomon_training_iteration.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
30-
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
31-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
29+
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds,
30+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
3231
gnomon_training_iteration3(gnomon_training_iteration2.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
33-
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
34-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
32+
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds,
33+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
3534
gnomon_training_iteration4(gnomon_training_iteration3.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
36-
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
37-
gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
35+
chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds,
36+
gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
3837

3938
emit:
4039
hmm_params_file = gnomon_training_iteration4.out.hmm_params_file
@@ -81,7 +80,6 @@ workflow gnomon_training_iterations {
8180
chainer_trusted_genes
8281
chainer_scaffolds
8382
gnomon_softmask_lds2
84-
gnomon_softmask_lds2_source
8583
gnomon_scaffolds
8684
max_intron
8785
parameters

nf/subworkflows/ncbi/gnomon-training-iteration/utilities.nf

Lines changed: 5 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -9,23 +9,22 @@ include { gnomon_training } from '../gnomon/gnomon_training/main'
99

1010
workflow gnomon_training_iteration {
1111
take:
12-
models_file
12+
initial_hmm_params
1313
genome_asn
1414
proteins_asn
1515
chainer_alignments
1616
chainer_evidence_denylist
1717
chainer_gap_fill_allowlist
1818
chainer_trusted_genes
1919
chainer_scaffolds
20-
gnomon_softmask_lds2
21-
gnomon_softmask_lds2_source
20+
gnomon_softmask
2221
gnomon_scaffolds
2322
max_intron
2423
parameters
2524
main:
2625

27-
chainer(chainer_alignments, models_file, chainer_evidence_denylist, chainer_gap_fill_allowlist, chainer_scaffolds, chainer_trusted_genes, genome_asn, proteins_asn, parameters.get('chainer', [:]))
28-
gnomon_wnode(gnomon_scaffolds, chainer.out.chains, chainer.out.chains_slices, models_file, gnomon_softmask_lds2, gnomon_softmask_lds2_source, genome_asn, proteins_asn, parameters.get('gnomon', [:]))
26+
chainer(chainer_alignments, initial_hmm_params, chainer_evidence_denylist, chainer_gap_fill_allowlist, chainer_scaffolds, chainer_trusted_genes, genome_asn, proteins_asn, parameters.get('chainer_wnode', [:]))
27+
gnomon_wnode(gnomon_scaffolds, chainer.out.chains, chainer.out.chains_slices, initial_hmm_params, gnomon_softmask, [], genome_asn, proteins_asn, parameters.get('gnomon_wnode', [:]))
2928
gnomon_training(genome_asn, gnomon_wnode.out.outputs, max_intron, parameters.get('gnomon_training', [:]))
3029

3130
emit:
@@ -37,8 +36,7 @@ workflow gnomon_training_iteration {
3736
chainer_gap_fill_allowlist = chainer_gap_fill_allowlist
3837
chainer_trusted_genes = chainer_trusted_genes
3938
chainer_scaffolds = chainer_scaffolds
40-
gnomon_softmask_lds2 = gnomon_softmask_lds2
41-
gnomon_softmask_lds2_source = gnomon_softmask_lds2_source
39+
gnomon_softmask = gnomon_softmask
4240
gnomon_scaffolds = gnomon_scaffolds
4341
max_intron = max_intron
4442
parameters = parameters

nf/subworkflows/ncbi/gnomon/gnomon_wnode/main.nf

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ workflow gnomon_wnode {
1818
main:
1919
String gpx_qsubmit_params = merge_params("", parameters, 'gpx_qsubmit')
2020
String annot_params = merge_params("-margin 1000 -mincont 1000 -minlen 225 -mpp 10.0 -ncsp 25 -window 200000 -nonconsens -open", parameters, 'annot_wnode')
21-
String gpx_qdump_params = merge_params("-slices-for affinity -sort-by affinity", parameters, 'gpx_qdump')
21+
String gpx_qdump_params = merge_params("-unzip '*' -slices-for affinity -sort-by affinity", parameters, 'gpx_qdump')
2222

2323
def (jobs, lines_per_file) = gpx_qsubmit(scaffolds, chains, chains_slices, gpx_qsubmit_params)
2424
def annot_files = annot(jobs.flatten(), chains, hmm_params, softmask_lds2, softmask_lds2_source, genome, proteins, lines_per_file, annot_params)
@@ -140,7 +140,7 @@ process gpx_qdump {
140140
path "*.out", emit: "outputs"
141141
script:
142142
"""
143-
gpx_qdump $params -input-path inputs -output gnomon_wnode.out
143+
gpx_qdump $params -input-path inputs -output gnomon_wnode.out
144144
"""
145145
stub:
146146
"""

nf/subworkflows/ncbi/gnomon/main.nf

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -42,8 +42,7 @@ workflow gnomon_plane {
4242
effective_hmm = hmm_params
4343
} else {
4444
effective_hmm = gnomon_training_iterations(hmm_params, genome_asn, proteins_asn, alignments, /* evidence_denylist */ [], /* gap_fill_allowlist */ [],
45-
[proteins_trusted].flatten(), scaffolds, softmask,
46-
softmask, scaffolds,
45+
[proteins_trusted].flatten(), scaffolds, softmask, scaffolds,
4746
max_intron,
4847
task_params)
4948
}

nf/subworkflows/ncbi/only_gnomon.nf

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@
55
nextflow.enable.dsl=2
66

77
include { setup_genome; setup_proteins } from './setup/main'
8+
include { get_hmm_params; run_get_hmm } from './default/get_hmm_params/main'
89
include { chainer_wnode as chainer } from './gnomon/chainer_wnode/main'
910
include { gnomon_wnode } from './gnomon/gnomon_wnode/main'
1011
include { prot_gnomon_prepare } from './annot_proc/prot_gnomon_prepare/main'
@@ -63,7 +64,17 @@ workflow only_gnomon {
6364

6465
// GNOMON
6566

66-
chainer(alignments, hmm_params, /* evidence_denylist */ [], /* gap_fill_allowlist */ [], scaffolds, /* trusted_genes */ [], genome_asn, proteins_asn, task_params.get('chainer', [:]))
67+
def effective_hmm
68+
if (hmm_params) {
69+
effective_hmm = hmm_params
70+
} else {
71+
tmp_hmm = run_get_hmm(tax_id)
72+
b = tmp_hmm | splitText( { it.split('\n') } ) | flatten
73+
c = b | last
74+
effective_hmm = c
75+
}
76+
77+
chainer(alignments, effective_hmm, /* evidence_denylist */ [], /* gap_fill_allowlist */ [], scaffolds, /* trusted_genes */ [], genome_asn, proteins_asn, task_params.get('chainer', [:]))
6778

6879
gnomon_wnode(scaffolds, chainer.out.chains, chainer.out.chains_slices, effective_hmm, [], softmask, genome_asn, proteins_asn, task_params.get('gnomon', [:]))
6980
def models = gnomon_wnode.out.outputs

nf/subworkflows/ncbi/rnaseq_short/bam_strandedness/main.nf

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,9 +35,10 @@ process rnaseq_divide_by_strandedness {
3535
script:
3636
"""
3737
mkdir -p output
38+
mkdir -p tmp
3839
samtools=\$(which samtools)
3940
echo "${bam_list.join('\n')}" > bam_list.mft
40-
rnaseq_divide_by_strandedness -align-manifest bam_list.mft -metadata $metadata_file $parameters -samtools-executable \$samtools -stranded-output output/stranded.list -strandedness-output output/run.strandedness -unstranded-output output/unstranded.list
41+
TMPDIR=tmp rnaseq_divide_by_strandedness -align-manifest bam_list.mft -metadata $metadata_file $parameters -samtools-executable \$samtools -stranded-output output/stranded.list -strandedness-output output/run.strandedness -unstranded-output output/unstranded.list
4142
"""
4243
stub:
4344
"""

ui/assets/config/docker_image.config

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
process.container = 'ncbi/egapx:0.3.0-alpha'
1+
process.container = 'ncbi/egapx:0.3.1-alpha'

ui/egapx.py

Lines changed: 8 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@
2525

2626
import yaml
2727

28-
software_version = "0.3.0-alpha"
28+
software_version = "0.3.1-alpha"
2929

3030
VERBOSITY_DEFAULT=0
3131
VERBOSITY_QUIET=-1
@@ -496,7 +496,7 @@ def expand_and_validate_params(run_inputs):
496496
else:
497497
# Given max_intron is a hard limit, no further calculation is necessary
498498
inputs['genome_size_threshold'] = 0
499-
499+
500500
if 'ortho' not in inputs or inputs['ortho'] is None or len(inputs['ortho']) < 4:
501501
ortho_files = dict()
502502
if 'ortho' in inputs and isinstance(inputs['ortho'], dict):
@@ -508,24 +508,14 @@ def expand_and_validate_params(run_inputs):
508508
if chosen_taxid == 0:
509509
chosen_taxid = get_closest_ortho_ref_taxid(taxid)
510510
ortho_files['taxid'] = chosen_taxid
511-
511+
512512
file_id = ['genomic.fna', 'genomic.gff', 'protein.faa']
513-
514-
possible_files = []
515-
try:
516-
possible_files = get_files_under_path('ortholog_references', f'{chosen_taxid}/current')
517-
except:
518-
print(f'Could not find path for ortho taxid {chosen_taxid}')
519-
return False
520-
for pf in possible_files:
521-
for fi in file_id:
522-
if fi in ortho_files:
523-
continue
524-
if pf.find(fi) > -1:
525-
ortho_files[fi] = pf
526-
513+
for fi in file_id:
514+
ortho_files[fi] = get_file_path('ortholog_references', f'{chosen_taxid}/current/{fi}.gz')
515+
527516
ortho_files['name_from.rpt'] = get_file_path('ortholog_references',f'{chosen_taxid}/name_from_ortholog.rpt')
528517
inputs['ortho'] = ortho_files
518+
529519
if 'reference_sets' not in inputs or inputs['reference_sets'] is None:
530520
inputs['reference_sets'] = get_file_path('reference_sets', 'swissprot.asnb.gz')
531521

@@ -613,31 +603,6 @@ def get_file_path(subsystem, filename):
613603
return file_path
614604
return file_url
615605

616-
def get_files_under_path(subsystem, part_path):
617-
cache_dir = get_cache_dir()
618-
vfn = get_versioned_path(subsystem, part_path)
619-
file_path = os.path.join(cache_dir, vfn)
620-
file_url = f"{FTP_EGAP_ROOT}/{vfn}"
621-
## look under file_path
622-
files_below = list()
623-
try:
624-
for i in Path(file_path).iterdir():
625-
files_below.append(str(i))
626-
if files_below:
627-
return files_below
628-
except:
629-
None
630-
## if nothing, look under file_url
631-
if not files_below:
632-
ftpd = FtpDownloader()
633-
ftpd.connect(FTP_EGAP_SERVER)
634-
ftp_dir = f'{FTP_EGAP_ROOT_PATH}/{vfn}'
635-
files_found = ftpd.list_ftp_dir(ftp_dir)
636-
files_online = list()
637-
for i in files_found:
638-
files_online.append( f"{FTP_EGAP_ROOT}/{vfn}/{i}") ### .replace('//','/') )
639-
return files_online
640-
return list()
641606

642607
def get_config(script_directory, args):
643608
config_file = ""
@@ -1059,7 +1024,7 @@ def main(argv):
10591024
else:
10601025
minlen = 165
10611026
minscor = 25.0
1062-
task_params = merge_params(task_params, {'tasks': { 'chainer': {'chainer_wnode': f"-minlen {minlen} -minscor {minscor}"}}})
1027+
task_params = merge_params(task_params, {'tasks': { 'chainer_wnode': {'chainer_wnode': f"-minlen {minlen} -minscor {minscor}"}}})
10631028

10641029
# Add some parameters to specific tasks
10651030
inputs = run_inputs['input']

0 commit comments

Comments
 (0)