Commit 92cf9bd

Merge pull request #75 from ncbi/release-0.3.2-alpha
Release 0.3.2-alpha
2 parents 1057884 + 02439ea commit 92cf9bd

20 files changed: +633 -252 lines changed

README.md

Lines changed: 10 additions & 7 deletions
@@ -15,10 +15,11 @@ We currently have protein datasets posted that are suitable for most vertebrates
 
 Fungi, protists and nematodes are currently out-of-scope for EGAPx pending additional refinements.
 
-
+**Submitting to GenBank:**
+If you’d like to be an early tester as we refine the output and workflow for submitting EGAPx annotation to GenBank, please contact us at [email protected].
 
 **Warning:**
-The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. You can also write to [email protected] to give us your feedback or if you have any questions.
+The current version is an early release and still under active development to add features and refine outputs. The workflow for GenBank submission is still under development. Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. You can also write to [email protected] to give us your feedback or if you have any questions.
 
 
 **Security Notice:**
@@ -59,7 +60,9 @@ Input to EGAPx is in the form of a YAML file.
 taxid: NCBI Taxonomy identifier of the target organism
 reads: RNA-seq data
 ```
-You can obtain taxid from the [NCBI Taxonomy page](https://www.ncbi.nlm.nih.gov/taxonomy).
+- The assembled genome should be screened for contamination prior to running EGAPx. See the NCBI [Foreign Contamination Screen](https://github.com/ncbi/fcs) for a fast, user-friendly contamination screening tool.
+
+- You can obtain taxid from the [NCBI Taxonomy page](https://www.ncbi.nlm.nih.gov/taxonomy).
 
 
 - RNA-seq data can be supplied in any one of the following ways:
@@ -71,9 +74,9 @@ Input to EGAPx is in the form of a YAML file.
 reads: SRA query for reads
 ```
 
-- The following are the _optional_ key-value pairs for the input file:
+- The following are the _optional_ key-value pairs for the input file. The default taxid-based settings (i.e. omitting these parameters) are recommended for most use cases:
 
-- A protein set. A taxid-based protein set will be chosen if no protein set is provided.
+- A protein set. A taxid-based protein set will be chosen if no protein set is provided. This should only be needed for annotation of obscure organisms or those with little RNAseq data available.
 ```
 proteins: path to proteins data in FASTA format.
 ```
@@ -420,7 +423,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
 ```
 rm egap*sif
 singularity cache clean
-singularity pull docker://ncbi/egapx:0.3.1-alpha
+singularity pull docker://ncbi/egapx:0.3.2-alpha
 ```
 
 - Clone the repo:
@@ -452,7 +455,7 @@ Now edit the file paths of SRA reads files in `examples/input_D_farinae_small.ya
 - Run `egapx.py` first to edit the `biowulf_cluster.config`:
 ```
 ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
-echo "process.container = '/path_to_/egapx_0.3-alpha.sif'" >> egapx_config/biowulf_cluster.config
+echo "process.container = '/path_to_/egapx_0.3.2-alpha.sif'" >> egapx_config/biowulf_cluster.config
 ```
 
 - Run `egapx.py`:

nf/subworkflows/ncbi/annot_proc/diamond/main.nf

Lines changed: 0 additions & 24 deletions
This file was deleted.

nf/subworkflows/ncbi/annot_proc/gnomon_biotype/main.nf

Lines changed: 9 additions & 12 deletions
@@ -14,17 +14,15 @@ workflow gnomon_biotype {
 raw_blastp_hits
 parameters // Map : extra parameter and parameter update
 main:
-default_params = ""
-effective_params = merge_params(default_params, parameters, 'gnomon_biotype')
-run_gnomon_biotype(models_files, splices_files, denylist, gencoll_asn, swiss_prot_asn, lds2_source, raw_blastp_hits, default_params)
+def effective_params = merge_params("", parameters, 'gnomon_biotype')
+run_gnomon_biotype(models_files, splices_files, denylist, gencoll_asn, swiss_prot_asn, lds2_source, raw_blastp_hits, effective_params)
 emit:
 biotypes = run_gnomon_biotype.out.biotypes
 prots_rpt = run_gnomon_biotype.out.prots_rpt
 all = run_gnomon_biotype.out.all
 }
 
 
-
 process run_gnomon_biotype {
 input:
 path models_files
@@ -34,7 +32,7 @@ process run_gnomon_biotype {
 path swiss_prot_asn
 path lds2_source, stageAs: 'genome/*'
 path raw_blastp_hits
-val parameters
+val parameters
 output:
 path ('output/biotypes.tsv'), emit: 'biotypes'
 path ('output/prots_rpt.tsv'), emit: 'prots_rpt'
@@ -45,18 +43,18 @@ process run_gnomon_biotype {
 mkdir -p ./asncache/
 prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i ${swiss_prot_asn} -oseq-ids spids -split-sequences
 prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i ${models_files} -oseq-ids gnids -split-sequences
-lds2_indexer -source genome/ -db LDS2
+lds2_indexer -source genome/ -db LDS2
 echo "${raw_blastp_hits.join('\n')}" > raw_blastp_hits.mft
 merge_blastp_hits -asn-cache ./asncache/ -nogenbank -lds2 LDS2 -input-manifest raw_blastp_hits.mft -o prot_hits.asn
 echo "${models_files.join('\n')}" > models.mft
 echo "prot_hits.asn" > prot_hits.mft
 echo "${splices_files.join('\n')}" > splices.mft
-if [ -z "$denylist" ]
-then
-gnomon_biotype -gc $gencoll_asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
-else
-gnomon_biotype -gc $gencoll_asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_denylist $denylist -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
+effective_params="${parameters}"
+if [ -n "$denylist" ]; then
+effective_params="\$effective_params -prot_denylist $denylist"
 fi
+gnomon_biotype \$effective_params -logfile ./gn_biotype_log.txt -gc $gencoll_asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
+cat ./gn_biotype_log.txt
 """
 stub:
 """
@@ -65,4 +63,3 @@ process run_gnomon_biotype {
 touch output/biotypes.tsv
 """
 }
-
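
Note on the `run_gnomon_biotype` change: the new script block replaces two near-identical `gnomon_biotype` command lines with a single call whose argument string is extended only when a denylist is present. In the Nextflow script, `$denylist` and `${parameters}` are filled in by Nextflow before the shell runs, while `\$effective_params` is escaped so that bash expands it at run time. A minimal bash sketch of the pattern, outside Nextflow, might look like the following. `build_biotype_args` and `-example_flag` are hypothetical names used only for illustration; they are not part of EGAPx or of the `gnomon_biotype` command line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EGAPx): builds an argument string the same
# way the new script block does.
#   $1 = extra parameters, may be empty
#   $2 = denylist path, may be empty
build_biotype_args() {
    local effective_params="$1"
    local denylist="$2"
    # Append the denylist option only when a denylist was supplied.
    if [ -n "$denylist" ]; then
        effective_params="$effective_params -prot_denylist $denylist"
    fi
    printf '%s\n' "$effective_params"
}

# No denylist: the parameters pass through unchanged.
build_biotype_args "-example_flag 1" ""
# With a denylist: the option is appended to whatever was already there.
build_biotype_args "-example_flag 1" "deny.txt"
```

Building the string once and issuing a single command keeps the two cases from drifting apart, which is easy to miss when the full command line is duplicated in both branches of an `if`.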

nf/subworkflows/ncbi/annot_proc/main.nf

Lines changed: 2 additions & 1 deletion
@@ -59,6 +59,7 @@ workflow annot_proc_plane {
 symbol_format_class // string for how to format gene names
 ortho_files /// ortho reference input files
 reference_sets // reference sets, for now only swissprot
+prot_denylist
 task_params // task parameters for every task
 main:
 // Post GNOMON
@@ -70,7 +71,7 @@ workflow annot_proc_plane {
 // Seed Protein-Model Hits
 diamond_worker(prot_gnomon_prepare.out.prot_ids, swiss_prot_ids, gnomon_models, swiss_prot_asn, task_params.get('diamond_identify', [:]))
 best_protein_hits(gnomon_models, swiss_prot_asn, diamond_worker.out.alignments , task_params.get('best_protein_hits', [:]))
-gnomon_biotype(gnomon_models,/*splices_file -- constant*/ [], /*denylist -- constant*/ [], gencoll_asn, swiss_prot_asn, [], diamond_worker.out.alignments,task_params.get('gnomon_biotype', [:]))
+gnomon_biotype(gnomon_models,/*splices_file -- constant*/ [], prot_denylist, gencoll_asn, swiss_prot_asn, [], diamond_worker.out.alignments,task_params.get('gnomon_biotype', [:]))
 
 annot_builder(gencoll_asn, gnomon_models, genome_asn, task_params.get('annot_builder', [:]))
 def accept_ftable_file = annot_builder.out.accept_ftable_annot

nf/subworkflows/ncbi/default/convert_annotations/main.nf

Lines changed: 33 additions & 46 deletions
@@ -34,58 +34,45 @@ process run_converter {
 path 'output/*.cds.fna', emit: 'cds_fasta'
 path 'output/*.proteins.faa', emit: 'proteins_fasta'
 script:
-//def basename = asn_file.baseName.toString()
-def basename = asn_files.first().baseName.toString()
 """
-echo "${asn_files.join('\n')}" > ${basename}.mft
 mkdir -p output
-##if [ -s ${asn_files} ]; then
-mkdir -p tmpout
-for af in ${asn_files}
-do
-afb=\$(basename \$af)
-annotwriter ${gff_params} -nogenbank -i \${af} -format gff3 -o tmpout/\${afb}.genomic.gff
-annotwriter ${gtf_params} -nogenbank -i \${af} -format gtf -o tmpout/\${afb}.genomic.gtf
-asn2fasta -nogenbank -i \${af} -nucs-only |sed -e 's/^>lcl|\\(.*\\)/>\\1/' > tmpout/\${afb}.genomic.fna
-asn2fasta -nogenbank -i \${af} -feats rna_fasta -o tmpout/\${afb}.transcripts.fna
-asn2fasta -nogenbank -i \${af} -feats fasta_cds_na -o tmpout/\${afb}.cds.fna
-asn2fasta -nogenbank -i \${af} -prots-only -o tmpout/\${afb}.proteins.faa
-done
-cat tmpout/*.gff > output/complete.genomic.gff
-cat tmpout/*.gtf > output/complete.genomic.gtf
-cat tmpout/*.genomic.fna > output/complete.genomic.fna
-cat tmpout/*.transcripts.fna > output/complete.transcripts.fna
-cat tmpout/*.cds.fna > output/complete.cds.fna
-cat tmpout/*.proteins.faa > output/complete.proteins.faa
-rm tmpout/*
-
-##annotwriter ${gff_params} -nogenbank -i ${asn_files} -format gff3 -o output/${basename}.genomic.gff
-##annotwriter ${gtf_params} -nogenbank -i ${asn_files} -format gtf -o output/${basename}.genomic.gtf
-##asn2fasta -nogenbank -nucs-only -indir asn_inputs -o - |sed -e 's/^>lcl|\\(.*\\)/>\\1/' >output/${basename}.genomic.fna
-##asn2fasta -nogenbank -feats rna_fasta -indir asn_inputs -o output/${basename}.transcripts.fna
-##asn2fasta -nogenbank -feats fasta_cds_na -i -indir asn_inputs -o output/${basename}.cds.fna
-##asn2fasta -nogenbank -prots-only -i -indir asn_inputs -o output/${basename}.proteins.faa
-##else
-## touch output/${basename}.genomic.gff
-## touch output/${basename}.genomic.gtf
-## touch output/${basename}.genomic.fna
-## touch output/${basename}.transcripts.fna
-## touch output/${basename}.cds.fna
-## touch output/${basename}.proteins.faa
-##fi
+mkdir -p tmpout
+found_afbs=(0)
+for af in asn_inputs/*
+do
+afb=\$(basename \$af)
+found_afbs+=(\${afb})
+annotwriter ${gff_params} -nogenbank -i \${af} -format gff3 -o tmpout/\${afb}.genomic.gff
+annotwriter ${gtf_params} -nogenbank -i \${af} -format gtf -o tmpout/\${afb}.genomic.gtf
+asn2fasta -nogenbank -i \${af} -nucs-only |sed -e 's/^>lcl|\\(.*\\)/>\\1/' > tmpout/\${afb}.genomic.fna
+asn2fasta -nogenbank -i \${af} -feats rna_fasta -o tmpout/\${afb}.transcripts.fna
+asn2fasta -nogenbank -i \${af} -feats fasta_cds_na -o tmpout/\${afb}.cds.fna
+asn2fasta -nogenbank -i \${af} -prots-only -o tmpout/\${afb}.proteins.faa
+done
+##echo 'D: ' \${found_afbs[@]}
+cat `find tmpout -name g*.gff -o -name all_unannot*.genomic.gff` > output/complete.genomic.gff
+cat `find tmpout -name g*.gtf -o -name all_unannot*.genomic.gtf` > output/complete.genomic.gtf
+cat `find tmpout -name g*.genomic.fna -o -name all_unannot*.genomic.fna` > output/complete.genomic.fna
+cat `find tmpout -name g*.transcripts.fna -o -name all_unannot*.transcripts.fna` > output/complete.transcripts.fna
+cat `find tmpout -name g*.cds.fna -o -name all_unannot*.cds.fna` > output/complete.cds.fna
+cat `find tmpout -name g*.proteins.faa -o -name all_unannot*.proteins.faa` > output/complete.proteins.faa
+rm tmpout/*
+touch output/complete.genomic.gff
+touch output/complete.genomic.gtf
+touch output/complete.genomic.fna
+touch output/complete.transcripts.fna
+touch output/complete.cds.fna
+touch output/complete.proteins.faa
 """
 
 stub:
-def basename = asn_files.first().baseName.toString()
-print(asn_files)
-print(basename)
 """
 mkdir -p output
-echo "Genomic GFF" > output/${basename}.genomic.gff
-echo "Genomic GTF" > output/${basename}.genomic.gtf
-echo "Genomic FASTA" > output/${basename}.genomic.fna
-echo "Transcript FASTA" > output/${basename}.transcripts.fna
-echo "CDS FASTA" > output/${basename}.cds.fna
-echo "Protein FASTA" > output/${basename}.proteins.faa
+echo "Genomic GFF" > output/complete.genomic.gff
+echo "Genomic GTF" > output/complete.genomic.gtf
+echo "Genomic FASTA" > output/complete.genomic.fna
+echo "Transcript FASTA" > output/complete.transcripts.fna
+echo "CDS FASTA" > output/complete.cds.fna
+echo "Protein FASTA" > output/complete.proteins.faa
 """
 }
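
Note on the `run_converter` change: the new script concatenates per-input files selected by name with `find`, then runs `touch` on each `complete.*` file so that every declared output exists even when nothing matched. The `-name` patterns in the commit are unquoted, so the shell may expand them against the current directory before `find` sees them. The sketch below shows the same select-and-concatenate idea in plain bash with the patterns quoted. It is an illustration under stated assumptions, not the code in the commit: `concat_matching` and the sample file names are hypothetical, and file names are assumed not to contain newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EGAPx): concatenate every file under a
# directory whose name matches one of the given patterns into a single output.
#   $1 = directory to search, $2 = output file, remaining args = -name patterns
concat_matching() {
    local dir="$1" out="$2"
    shift 2
    # Create or truncate the output first, so it exists even with no matches.
    : > "$out"
    local pattern
    for pattern in "$@"; do
        # Quoting "$pattern" hands the glob to find instead of the shell.
        find "$dir" -type f -name "$pattern" | sort | while IFS= read -r f; do
            cat "$f" >> "$out"
        done
    done
}

# Demo with made-up files in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/tmpout" "$workdir/output"
printf 'A\n' > "$workdir/tmpout/g1.genomic.gff"
printf 'B\n' > "$workdir/tmpout/all_unannot.genomic.gff"
printf 'X\n' > "$workdir/tmpout/other.genomic.gff"   # matches neither pattern

concat_matching "$workdir/tmpout" "$workdir/output/complete.genomic.gff" \
    'g*.gff' 'all_unannot*.genomic.gff'
# No .gtf inputs exist, so this output is created empty.
concat_matching "$workdir/tmpout" "$workdir/output/complete.genomic.gtf" \
    'g*.gtf' 'all_unannot*.genomic.gtf'

cat "$workdir/output/complete.genomic.gff"
```

Truncating the output up front plays the role of the trailing `touch` calls in the commit: the process's declared outputs are present whether or not any input matched.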

nf/subworkflows/ncbi/gnomon-training-iteration/utilities.nf

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ workflow gnomon_training_iteration {
 
 chainer(chainer_alignments, initial_hmm_params, chainer_evidence_denylist, chainer_gap_fill_allowlist, chainer_scaffolds, chainer_trusted_genes, genome_asn, proteins_asn, parameters.get('chainer_wnode', [:]))
 gnomon_wnode(gnomon_scaffolds, chainer.out.chains, chainer.out.chains_slices, initial_hmm_params, gnomon_softmask, [], genome_asn, proteins_asn, parameters.get('gnomon_wnode', [:]))
-gnomon_training(genome_asn, gnomon_wnode.out.outputs, max_intron, parameters.get('gnomon_training', [:]))
+gnomon_training(genome_asn, gnomon_wnode.out.gn_models, max_intron, parameters.get('gnomon_training', [:]))
 
 emit:
 hmm_params_file = gnomon_training.out.hmm_params_file

nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf

Lines changed: 2 additions & 2 deletions
@@ -121,9 +121,9 @@ process run_chainer {
 # with the same filename. We need to avoid that to be able to stage
 # the output files for gpx_make_outputs. We add the job file numeric
 # extension as a prefix to the filename.
-mkdir interim
+mkdir -p interim
 chainer_wnode $params -start-job-id \$start_job_id -workers 32 -input-jobs ${job} -O interim -nogenbank -lds2 LDS2 -evidence-denylist-manifest evidence_denylist.mft -gap-fill-allowlist-manifest gap_fill_allowlist.mft -param ${hmm_params} -scaffolds-manifest scaffolds.mft -trusted-genes-manifest trusted_genes.mft
-mkdir output
+mkdir -p output
 for f in interim/*; do
 if [ -f \$f ]; then
 mv \$f output/\${extension}_\$(basename \$f)
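
Note on the `run_chainer` change: switching `mkdir` to `mkdir -p` makes directory creation succeed when the directory already exists. The commit does not state the motivation; a plausible reading is that the script should not fail if `interim` or `output` is already present in the task directory. The short bash sketch below only demonstrates the difference in behavior, using a scratch directory; the variable names are illustrative.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
mkdir "$workdir/interim"    # first creation succeeds

# Plain mkdir reports an error when the directory already exists.
if mkdir "$workdir/interim" 2>/dev/null; then plain_failed=0; else plain_failed=1; fi

# mkdir -p treats an existing directory as success.
if mkdir -p "$workdir/interim"; then p_failed=0; else p_failed=1; fi

echo "mkdir    on existing directory failed: $plain_failed"
echo "mkdir -p on existing directory failed: $p_failed"
```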
