Release 0.3.1-alpha

Victor Joukov · Victor Joukov · commit b2e6f1a79ade · 2024-11-18T16:34:39.000-05:00
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 EGAPx is the publicly accessible version of the updated NCBI [Eukaryotic Genome Annotation Pipeline](https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/process/). 
 
-EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. Functional annotation is added to the final structural annotation set based on the type and quality of the model and orthology information. The final annotation for the input assembly is produced as a `gff` file. 
+EGAPx takes an assembly FASTA file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. Functional annotation is added to the final structural annotation set based on the type and quality of the model and orthology information. The final annotation for the input assembly is produced as a `gff` file. 
 
 We currently have protein datasets posted that are suitable for most vertebrates, arthropods, and some plants:
   - Chordata - Mammalia, Sauropsida, Actinopterygii (ray-finned fishes), other Vertebrates
@@ -363,8 +363,9 @@ Description of the outputs:
 * `complete.transcripts.fna`: annotated transcripts in FASTA format (includes UTRs).
 * `complete.proteins.faa`: annotated protein products in FASTA format.
 * `annotated_genome.asn`: final annotation set in ASN1 format.
+
 Description of the logs and miscellaneous outputs:
-* `annot_builder_output/accept.ftable_annot`: intermediate file with accepted annotation models called by GNOMON.
+* `annot_builder_output/accept.ftable_annot`: intermediate file with accepted annotation models called by Gnomon.
 * `annotation_data.cmt`: annotation structured comment file. Used for submission to GenBank.
 * `nextflow.log`: main Nextflow log that captures all the process information and their work directories.
 * `resume.sh`: Nextflow command for resuming a run from the last successful task.
@@ -377,13 +378,13 @@ Description of the logs and miscellaneous outputs:
 
 ## Interpreting Output
 
-`stats/feature_counts.xml` contains summary counts of features by model prediction categories determined by GNOMON.
+`stats/feature_counts.xml` contains summary counts of features by model prediction categories determined by Gnomon.
 
 **NOTE** not all categories are expected to have counts data (e.g. model RefSeq, fully supported, ab initio)
 
 Genes with `major correction` are likely protein-coding genes with frameshifts and/or internal stops. These models include "LOW QUALITY PROTEIN" in the protein FASTA title, are marked up with exception=low-quality sequence region on the mRNA and CDS features, and the annotation is adjusted to meet GenBank criteria (frameshifts are compensated for by 1-2 bp microintrons in the mRNA and CDS features, and internal stops have a transl_except to translate the codon as X instead of a stop). For RefSeq, we set a threshold of no more than 10% of protein-coding genes with major corrections to release the annotation. We recommend users polish assembly sequences if the rate is higher than 10%.
 
-Counts of protein-coding genes should be considered versus similar species. Low counts may result from insufficient supporting evidence (e.g. low RNAseq coverage or an unusual organism compared to the available protein data). High counts may indicate genome fragmentation or noise from genes annotated on transposons.
+Counts of protein-coding genes should be considered versus similar species. Low counts may result from insufficient supporting evidence (e.g. low RNAseq coverage or an unusual organism compared to the available protein data). High counts may indicate genome fragmentation, uncollapsed haplotypic duplication, or noise from genes annotated on transposons.
 
 `stats/feature_stats.xml` contains summary statistics on transcript counts per gene, exon counts per transcript, and the counts and length distributions of features by sub-type.
 
@@ -419,7 +420,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
 ```
 rm egap*sif
 singularity cache clean
-singularity pull docker://ncbi/egapx:0.3.0-alpha
+singularity pull docker://ncbi/egapx:0.3.1-alpha
 ```
 
 - Clone the repo:
diff --git a/nf/subworkflows/ncbi/gnomon-training-iteration/gnomon_training_iterations/main.nf b/nf/subworkflows/ncbi/gnomon-training-iteration/gnomon_training_iterations/main.nf
@@ -17,24 +17,23 @@ workflow gnomon_training_iterations {
         chainer_gap_fill_allowlist
         chainer_trusted_genes
         chainer_scaffolds
-        gnomon_softmask_lds2
-        gnomon_softmask_lds2_source
+        gnomon_softmask
         gnomon_scaffolds
         max_intron
         parameters
     main:
     gnomon_training_iteration(models_file, genome_asn, proteins_asn ,chainer_alignments,chainer_evidence_denylist,chainer_gap_fill_allowlist,
-               chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
     gnomon_training_iteration2(gnomon_training_iteration.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
-               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
     gnomon_training_iteration3(gnomon_training_iteration2.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
-               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
     gnomon_training_iteration4(gnomon_training_iteration3.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
-               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
 
     emit:
         hmm_params_file = gnomon_training_iteration4.out.hmm_params_file
diff --git a/nf/subworkflows/ncbi/gnomon-training-iteration/main.nf b/nf/subworkflows/ncbi/gnomon-training-iteration/main.nf
@@ -9,32 +9,31 @@ include { gnomon_training_iteration; gnomon_training_iteration as gnomon_trainin
 
 workflow gnomon_training_iterations {
     take:
-        models_file
+        initial_hmm_params
         genome_asn
         proteins_asn
         chainer_alignments
         chainer_evidence_denylist
         chainer_gap_fill_allowlist
         chainer_trusted_genes
         chainer_scaffolds
-        gnomon_softmask_lds2
-        gnomon_softmask_lds2_source
+        gnomon_softmask
         gnomon_scaffolds
         max_intron
         parameters
     main:
-    gnomon_training_iteration(models_file, genome_asn, proteins_asn ,chainer_alignments,chainer_evidence_denylist,chainer_gap_fill_allowlist,
-               chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+    gnomon_training_iteration(initial_hmm_params, genome_asn, proteins_asn ,chainer_alignments,chainer_evidence_denylist,chainer_gap_fill_allowlist,
+               chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
     gnomon_training_iteration2(gnomon_training_iteration.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
-               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
     gnomon_training_iteration3(gnomon_training_iteration2.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
-               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
     gnomon_training_iteration4(gnomon_training_iteration3.out.hmm_params_file, genome_asn, proteins_asn ,chainer_alignments,
-               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, gnomon_softmask_lds2,
-               gnomon_softmask_lds2_source, gnomon_scaffolds, max_intron, parameters)
+               chainer_evidence_denylist,chainer_gap_fill_allowlist, chainer_trusted_genes, chainer_scaffolds, 
+               gnomon_softmask, gnomon_scaffolds, max_intron, parameters)
 
     emit:
         hmm_params_file = gnomon_training_iteration4.out.hmm_params_file
@@ -81,7 +80,6 @@ workflow gnomon_training_iterations {
         chainer_trusted_genes
         chainer_scaffolds
         gnomon_softmask_lds2
-        gnomon_softmask_lds2_source
         gnomon_scaffolds
         max_intron
         parameters
diff --git a/nf/subworkflows/ncbi/gnomon-training-iteration/utilities.nf b/nf/subworkflows/ncbi/gnomon-training-iteration/utilities.nf
@@ -9,23 +9,22 @@ include { gnomon_training } from '../gnomon/gnomon_training/main'
 
 workflow gnomon_training_iteration {
     take:
-        models_file
+        initial_hmm_params
         genome_asn
         proteins_asn
         chainer_alignments
         chainer_evidence_denylist
         chainer_gap_fill_allowlist
         chainer_trusted_genes
         chainer_scaffolds
-        gnomon_softmask_lds2
-        gnomon_softmask_lds2_source
+        gnomon_softmask
         gnomon_scaffolds
         max_intron
         parameters
     main:
 
-        chainer(chainer_alignments, models_file, chainer_evidence_denylist, chainer_gap_fill_allowlist, chainer_scaffolds, chainer_trusted_genes, genome_asn, proteins_asn, parameters.get('chainer', [:]))
-        gnomon_wnode(gnomon_scaffolds, chainer.out.chains, chainer.out.chains_slices, models_file, gnomon_softmask_lds2, gnomon_softmask_lds2_source, genome_asn, proteins_asn,  parameters.get('gnomon', [:]))
+        chainer(chainer_alignments, initial_hmm_params, chainer_evidence_denylist, chainer_gap_fill_allowlist, chainer_scaffolds, chainer_trusted_genes, genome_asn, proteins_asn, parameters.get('chainer_wnode', [:]))
+        gnomon_wnode(gnomon_scaffolds, chainer.out.chains, chainer.out.chains_slices, initial_hmm_params, gnomon_softmask, [], genome_asn, proteins_asn,  parameters.get('gnomon_wnode', [:]))
         gnomon_training(genome_asn, gnomon_wnode.out.outputs, max_intron, parameters.get('gnomon_training', [:]))
 
     emit:
@@ -37,8 +36,7 @@ workflow gnomon_training_iteration {
         chainer_gap_fill_allowlist = chainer_gap_fill_allowlist
         chainer_trusted_genes = chainer_trusted_genes
         chainer_scaffolds = chainer_scaffolds
-        gnomon_softmask_lds2 = gnomon_softmask_lds2
-        gnomon_softmask_lds2_source = gnomon_softmask_lds2_source
+        gnomon_softmask = gnomon_softmask
         gnomon_scaffolds = gnomon_scaffolds
         max_intron = max_intron
         parameters = parameters
diff --git a/nf/subworkflows/ncbi/gnomon/gnomon_wnode/main.nf b/nf/subworkflows/ncbi/gnomon/gnomon_wnode/main.nf
@@ -18,7 +18,7 @@ workflow gnomon_wnode {
     main:
         String gpx_qsubmit_params =  merge_params("", parameters, 'gpx_qsubmit')
         String annot_params =  merge_params("-margin 1000 -mincont 1000 -minlen 225 -mpp 10.0 -ncsp 25 -window 200000 -nonconsens -open", parameters, 'annot_wnode')
-        String gpx_qdump_params =  merge_params("-slices-for affinity -sort-by affinity", parameters, 'gpx_qdump')
+        String gpx_qdump_params =  merge_params("-unzip '*' -slices-for affinity -sort-by affinity", parameters, 'gpx_qdump')
 
         def (jobs, lines_per_file) = gpx_qsubmit(scaffolds, chains, chains_slices, gpx_qsubmit_params)
         def annot_files = annot(jobs.flatten(), chains, hmm_params, softmask_lds2, softmask_lds2_source, genome, proteins, lines_per_file, annot_params)
@@ -140,7 +140,7 @@ process gpx_qdump {
         path "*.out", emit: "outputs"
     script:
     """
-    gpx_qdump $params -input-path inputs -output gnomon_wnode.out
+    gpx_qdump $params  -input-path inputs -output gnomon_wnode.out
     """
     stub:
     """
diff --git a/nf/subworkflows/ncbi/gnomon/main.nf b/nf/subworkflows/ncbi/gnomon/main.nf
@@ -42,8 +42,7 @@ workflow gnomon_plane {
             effective_hmm = hmm_params
         } else {
             effective_hmm = gnomon_training_iterations(hmm_params, genome_asn, proteins_asn, alignments, /* evidence_denylist */ [], /* gap_fill_allowlist */ [],
-                [proteins_trusted].flatten(), scaffolds, softmask,
-                softmask, scaffolds,
+                [proteins_trusted].flatten(), scaffolds, softmask, scaffolds,
                 max_intron,
                 task_params)
         }
diff --git a/nf/subworkflows/ncbi/only_gnomon.nf b/nf/subworkflows/ncbi/only_gnomon.nf
@@ -5,6 +5,7 @@
 nextflow.enable.dsl=2
 
 include { setup_genome; setup_proteins } from './setup/main'
+include { get_hmm_params; run_get_hmm } from './default/get_hmm_params/main'
 include { chainer_wnode as chainer } from './gnomon/chainer_wnode/main'
 include { gnomon_wnode } from './gnomon/gnomon_wnode/main'
 include { prot_gnomon_prepare } from './annot_proc/prot_gnomon_prepare/main'
@@ -63,7 +64,17 @@ workflow only_gnomon {
 
         // GNOMON
 
-        chainer(alignments, hmm_params, /* evidence_denylist */ [], /* gap_fill_allowlist */ [], scaffolds, /* trusted_genes */ [], genome_asn, proteins_asn, task_params.get('chainer', [:]))
+        def effective_hmm
+        if (hmm_params) {
+            effective_hmm = hmm_params
+        } else {
+            tmp_hmm = run_get_hmm(tax_id)
+            b = tmp_hmm | splitText( { it.split('\n') } ) | flatten 
+            c = b | last
+            effective_hmm = c
+        }
+
+        chainer(alignments, effective_hmm, /* evidence_denylist */ [], /* gap_fill_allowlist */ [], scaffolds, /* trusted_genes */ [], genome_asn, proteins_asn, task_params.get('chainer', [:]))
 
         gnomon_wnode(scaffolds, chainer.out.chains, chainer.out.chains_slices, effective_hmm, [], softmask, genome_asn, proteins_asn, task_params.get('gnomon', [:]))
         def models = gnomon_wnode.out.outputs
diff --git a/nf/subworkflows/ncbi/rnaseq_short/bam_strandedness/main.nf b/nf/subworkflows/ncbi/rnaseq_short/bam_strandedness/main.nf
@@ -35,9 +35,10 @@ process rnaseq_divide_by_strandedness {
     script:
     """
     mkdir -p output
+    mkdir -p tmp
     samtools=\$(which samtools)
     echo "${bam_list.join('\n')}" > bam_list.mft
-    rnaseq_divide_by_strandedness -align-manifest bam_list.mft -metadata $metadata_file  $parameters  -samtools-executable \$samtools -stranded-output output/stranded.list -strandedness-output output/run.strandedness -unstranded-output output/unstranded.list
+    TMPDIR=tmp rnaseq_divide_by_strandedness -align-manifest bam_list.mft -metadata $metadata_file  $parameters  -samtools-executable \$samtools -stranded-output output/stranded.list -strandedness-output output/run.strandedness -unstranded-output output/unstranded.list
     """
     stub:
     """
diff --git a/ui/assets/config/docker_image.config b/ui/assets/config/docker_image.config
@@ -1 +1 @@
-process.container = 'ncbi/egapx:0.3.0-alpha'
+process.container = 'ncbi/egapx:0.3.1-alpha'
diff --git a/ui/egapx.py b/ui/egapx.py
@@ -25,7 +25,7 @@
 
 import yaml
 
-software_version = "0.3.0-alpha"
+software_version = "0.3.1-alpha"
 
 VERBOSITY_DEFAULT=0
 VERBOSITY_QUIET=-1
@@ -496,7 +496,7 @@ def expand_and_validate_params(run_inputs):
     else:
         # Given max_intron is a hard limit, no further calculation is necessary
         inputs['genome_size_threshold'] = 0
-    
+
     if 'ortho' not in inputs or inputs['ortho'] is None or len(inputs['ortho']) < 4:
         ortho_files = dict()
         if 'ortho' in inputs and isinstance(inputs['ortho'], dict):
@@ -508,24 +508,14 @@ def expand_and_validate_params(run_inputs):
         if chosen_taxid == 0: 
             chosen_taxid = get_closest_ortho_ref_taxid(taxid)
         ortho_files['taxid'] = chosen_taxid
-        
+
         file_id = ['genomic.fna', 'genomic.gff', 'protein.faa']
-        
-        possible_files = []
-        try:
-            possible_files = get_files_under_path('ortholog_references', f'{chosen_taxid}/current')
-        except: 
-            print(f'Could not find path for ortho taxid {chosen_taxid}')
-            return False
-        for pf in possible_files:
-            for fi in file_id:
-                if fi in ortho_files:
-                    continue
-                if pf.find(fi) > -1:
-                    ortho_files[fi] = pf
-        
+        for fi in file_id:
+            ortho_files[fi] = get_file_path('ortholog_references', f'{chosen_taxid}/current/{fi}.gz')
+
         ortho_files['name_from.rpt'] = get_file_path('ortholog_references',f'{chosen_taxid}/name_from_ortholog.rpt')
         inputs['ortho'] = ortho_files
+
     if 'reference_sets' not in inputs or inputs['reference_sets'] is None:
         inputs['reference_sets'] = get_file_path('reference_sets', 'swissprot.asnb.gz')
 
@@ -613,31 +603,6 @@ def get_file_path(subsystem, filename):
         return file_path
     return file_url
 
-def get_files_under_path(subsystem, part_path):
-    cache_dir = get_cache_dir()
-    vfn = get_versioned_path(subsystem, part_path)
-    file_path = os.path.join(cache_dir, vfn)
-    file_url = f"{FTP_EGAP_ROOT}/{vfn}"
-    ## look under file_path
-    files_below = list()
-    try:
-        for i in Path(file_path).iterdir():
-            files_below.append(str(i))
-        if files_below:
-            return files_below
-    except:
-        None
-    ## if nothing, look under file_url
-    if not files_below:
-        ftpd = FtpDownloader()
-        ftpd.connect(FTP_EGAP_SERVER)
-        ftp_dir = f'{FTP_EGAP_ROOT_PATH}/{vfn}'
-        files_found = ftpd.list_ftp_dir(ftp_dir)
-        files_online = list()
-        for i in files_found:
-            files_online.append(  f"{FTP_EGAP_ROOT}/{vfn}/{i}")  ### .replace('//','/') ) 
-        return files_online
-    return list()
 
 def get_config(script_directory, args):
     config_file = ""
@@ -1059,7 +1024,7 @@ def main(argv):
     else:
         minlen = 165
         minscor = 25.0
-    task_params = merge_params(task_params, {'tasks': { 'chainer': {'chainer_wnode': f"-minlen {minlen} -minscor {minscor}"}}})
+    task_params = merge_params(task_params, {'tasks': { 'chainer_wnode': {'chainer_wnode': f"-minlen {minlen} -minscor {minscor}"}}})
 
     # Add some parameters to specific tasks
     inputs = run_inputs['input']
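Note on the `chainer` to `chainer_wnode` key rename: the last hunk in `ui/egapx.py` moves the chainer options under the same key that `utilities.nf` now reads with `parameters.get('chainer_wnode', [:])`. The sketch below illustrates why the two sides must agree. It assumes the workflow receives the map stored under `tasks`. `deep_merge` is a hypothetical stand-in for the real `merge_params`, whose implementation is not shown in this diff, and the pre-existing `gnomon_wnode` entry is illustrative only.

```python
def deep_merge(base, update):
    """Recursively merge `update` into a copy of `base`.

    Hypothetical helper standing in for egapx.py's merge_params.
    """
    merged = dict(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Values taken from the else-branch shown in the hunk above.
minlen, minscor = 165, 25.0

# Illustrative starting parameters (not from the source).
task_params = {"tasks": {"gnomon_wnode": {"annot_wnode": "-margin 1000"}}}

# Same shape as the '+' line in main(): options filed under 'chainer_wnode'.
task_params = deep_merge(
    task_params,
    {"tasks": {"chainer_wnode": {"chainer_wnode": f"-minlen {minlen} -minscor {minscor}"}}},
)

# Python analogue of the Groovy lookup parameters.get('chainer_wnode', [:]).
chainer_section = task_params["tasks"].get("chainer_wnode", {})

# A lookup under the old key now falls back to the empty default.
old_section = task_params["tasks"].get("chainer", {})
```

Under these assumptions, a consumer that still looks up `chainer` (as the `task_params.get('chainer', [:])` call in `only_gnomon.nf` appears to) would receive the empty default rather than the `-minlen`/`-minscor` options.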
