diff --git a/LICENSE b/LICENSE
Lines changed: 5 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 14 additions & 8 deletions
diff --git a/examples/example_source_table.src b/examples/example_source_table.src
Lines changed: 7 additions & 0 deletions
diff --git a/examples/input_Brevipalpus_obovatus.yaml b/examples/input_Brevipalpus_obovatus.yaml
Lines changed: 6 additions & 0 deletions
diff --git a/examples/input_C_longicornis.yaml b/examples/input_C_longicornis.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/nf/bin/run_wnode_batch.py b/nf/bin/run_wnode_batch.py
Lines changed: 3 additions & 2 deletions
@@ -58,18 +58,21 @@ Authors:  Sean R. Eddy
 License:  BSD License
           [https://github.com/EddyRivasLab/infernal/blob/master/LICENSE]
 
+Location: /img/gp/third-party/tRNAscan-SE
+Authors:  Patricia P. Chan, Brian Lin, and Todd M. Lowe
+License:  GPL-3.0
+          [https://github.com/EddyRivasLab/infernal/blob/master/LICENSE]
+
 Location: /img/gp/third-party/hmmer
 Authors:  Sean R. Eddy
 License:  BSD License
           [https://github.com/EddyRivasLab/hmmer/blob/master/LICENSE]
 
-
 Location: /usr/local/bin/busco
 Authors:  Evgeny Zdobnov
 License:  MIT License
           [https://gitlab.com/ezlab/busco/-/blob/master/LICENSE]
 
-
 Location: /img/gp/third-party/minimap2
 Authors:  Heng Li
 License:  MIT License
 
@@ -156,7 +156,7 @@ Input to EGAPx is in the form of a YAML file.
 
 
 ### Running EGAPx with short and long RNA-seq reads
-- Optionally, you can also include long reads RNA-seq data from SRA or local files using the same formatting structure for short reads, using the label `long_reads:`
+- Optionally, you can also include long-read RNA-seq data from SRA or local files (FASTA or FASTQ, not BAM), using the same format as for short reads, under the label `long_reads:`
 
   ```
   genome: path to assembled genome in FASTA format
@@ -171,6 +171,7 @@ Input to EGAPx is in the form of a YAML file.
     short_reads: txid43150[Organism] AND 75:350[ReadLength] AND illumina[Platform] AND biomol_rna[Properties]
     long_reads: txid43150[Organism] AND (oxford_nanopore[Platform] OR pacbio_smrt[Platform]) AND biomol_rna[Properties]
     ```
+- We have not rigorously tested EGAPx performance using clustered vs. non-clustered IsoSeq reads. EGAPx uses read depth for filtering and removing rare isoforms with limited support, but clustered reads will reduce compute cost.
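
The SRA query strings in the snippet above follow a regular pattern: a `txid<taxid>[Organism]` term joined to filter terms with `AND`. As a minimal sketch of that pattern, such strings could be assembled as shown below. The helper name `sra_query` and its parameters are hypothetical and not part of EGAPx, which only reads the finished string from the YAML file.

```python
def sra_query(taxid, filters, exclude_accessions=()):
    """Join an organism term and filter terms into an SRA query string.

    Hypothetical helper for illustration only; it is not an EGAPx function.
    """
    terms = [f"txid{taxid}[Organism]"] + list(filters)
    query = " AND ".join(terms)
    # Exclusions are appended as NOT terms, as in examples/input_C_longicornis.yaml.
    for accession in exclude_accessions:
        query += f" NOT {accession}[Accession]"
    return query


short_reads = sra_query(
    43150,
    ["75:350[ReadLength]", "illumina[Platform]", "biomol_rna[Properties]"],
)
long_reads = sra_query(
    43150,
    ["(oxford_nanopore[Platform] OR pacbio_smrt[Platform])", "biomol_rna[Properties]"],
)

# Print the two lines in the form they take in the input YAML.
print(f"short_reads: {short_reads}")
print(f"long_reads: {long_reads}")
```

Building the query from a list of terms keeps the read-length and platform filters easy to add or drop per run.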
 
 ## Input example
 [Back to Top](#Contents)
@@ -310,7 +311,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
   ```
   rm egap*sif
   singularity cache clean
-  singularity pull docker://ncbi/egapx:0.4.0-alpha
+  singularity pull docker://ncbi/egapx:0.4.1-alpha
   ```
 
 - Clone the repo:
@@ -343,7 +344,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
 - Run `egapx.py` first to edit the `biowulf_cluster.config`:
   ```
   ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
-  echo "process.container = '/path_to_/egapx_0.4.0-alpha.sif'"  >> egapx_config/biowulf_cluster.config
+  echo "process.container = '/path_to_/egapx_0.4.1-alpha.sif'"  >> egapx_config/biowulf_cluster.config
   ```
 
 - Run `egapx.py`:
@@ -570,8 +571,6 @@ max_intron: 700000
 ## Submitting EGAPx annotation to NCBI
 [Back to Top](#Contents)
 
-:warning: The current EGAPx release (0.4.0) will produce submission files, however they are not yet GenBank submission-compliant. A subsequent patch release (likely 0.4.1) will produce GenBank-compliant files. We welcome users to try the process below to produce submission files and create a GitHub issue with errors or questions.
-
 After annotating your genome with EGAPx, you can prepare your annotation for submission to NCBI.
 
 ### Prepare required files and metadata
@@ -585,7 +584,13 @@ You will need:
   - To submit annotation for existing GenBank assemblies, you can access the BioProject information on Datasets Genome pages by searching the assembly accession at https://www.ncbi.nlm.nih.gov/datasets/genome/. locus_tag prefix is not needed in your `prepare_submission` command 
 
 - To submit annotation with new assemblies, you will need additional inputs:
-  - Source modifiers table file prepared from https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html
+  - Source modifiers table file (see `examples/example_source_table.src`)
+    - Tab-delimited file containing sequence identifiers, chromosome names, location, topology
+    - Chromosome names follow these [rules](https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#chr_names)
+    - Default topology is `linear`; specify `circular` only for organelles
+    - Unplaced sequences can be omitted from the file entirely
+    - In the rare case of unlocalized sequences (assigned to a chromosome, but not part of the chromosome sequence itself), include them with the chromosome name in the chromosome column and leave the location column blank
+
   - Assembly data structured comment file prepared from https://submit.ncbi.nlm.nih.gov/structcomment/genomes/
   - linkage evidence argument from options at https://www.ncbi.nlm.nih.gov/genbank/wgs_gapped/, e.g. `proximity-ligation` from Hi-C, `paired-ends` from Illumina
 
@@ -610,15 +615,16 @@ You are ready to run `prepare_submission`. See below for full list of required/o
 | `--submission-comment`                   | table2asn `-y` arg https://www.ncbi.nlm.nih.gov/genbank/table2asn/ |
 | `--name-cleanup-rules-file`                   | Two-column TSV of search/replace regexes to be applied to product and gene names |
 | `--source-quals`                   | table2asn `-j` arg. https://www.ncbi.nlm.nih.gov/genbank/mods_fastadefline/ |
+| `--unknown-gap-len`                   | table2asn `-gaps-unknown` arg. The exact number of consecutive Ns recognized as a gap with unknown length. (default: 100) |
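
To make the `--unknown-gap-len` description concrete, the sketch below scans a sequence for runs of `N` and flags those whose length equals the configured value. It only illustrates the "exact number of consecutive Ns" wording in the table. The function name `find_n_runs` is hypothetical, and how table2asn treats runs of other lengths depends on options not covered here, so those runs are simply labelled `other`.

```python
import re


def find_n_runs(sequence, unknown_gap_len=100):
    """Return (start, length, label) for every run of N in the sequence.

    A run is labelled "unknown" only when its length matches unknown_gap_len
    exactly; every other run is labelled "other". Illustration only, this is
    not how table2asn is implemented.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        label = "unknown" if length == unknown_gap_len else "other"
        runs.append((match.start(), length, label))
    return runs


# Toy sequence: one run of exactly 100 Ns and one run of 25 Ns.
toy_sequence = "ACGT" + "N" * 100 + "ACGT" + "N" * 25 + "AC"
for start, length, label in find_n_runs(toy_sequence):
    print(start, length, label)
```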
 
 Command:
 
 ```
 # Using Docker:
-alias prepare_submission='docker run --rm -i --volume="$PWD:$PWD" --workdir="$PWD" ncbi/egapx:0.4.0-alpha prepare_submission'
+alias prepare_submission='docker run --rm -i --volume="$PWD:$PWD" --workdir="$PWD" ncbi/egapx:0.4.1-alpha prepare_submission'
 
 # Using Singularity or Apptainer:
-alias prepare_submission='singularity exec --cleanenv --bind "$PWD:$PWD" --pwd "$PWD" docker://ncbi/egapx:0.4.0-alpha prepare_submission'
+alias prepare_submission='singularity exec --cleanenv --bind "$PWD:$PWD" --pwd "$PWD" docker://ncbi/egapx:0.4.1-alpha prepare_submission'
 
 # Invoke the app:
 prepare_submission --egapx-annotated-genome-asn annotated_genome.asn --submission-template-file template.sbt --bioproject-id PRJNA# --src-file source-table.txt --assembly-data-structured-comment-file genome.asm --linkage-evidence paired-ends --out-dir out
 
@@ -0,0 +1,7 @@
+SeqID	chromosome	location 	topology
+contig_Chr1	1	chromosome	
+contig_Chr2	2	chromosome	
+contig_Chr3	3	chromosome	
+contig_997		mitochondrion	circular
+contig_998		chloroplast	circular
+contig_999		plastid	circular
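
A source table like the one above can be sanity-checked before it is passed to `prepare_submission`. The sketch below is a hypothetical pre-flight check, not part of EGAPx. It assumes the four column names from the example file, treats a blank topology as `linear`, and flags `circular` on anything other than the three organelle locations that appear in the example.

```python
import csv
import io

# Only the organelle locations that appear in the example file (an assumption).
ORGANELLE_LOCATIONS = {"mitochondrion", "chloroplast", "plastid"}


def check_source_table(text):
    """Return a list of problems found in a tab-delimited source table."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [name.strip() for name in next(reader)]
    for line_no, row in enumerate(reader, start=2):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank lines
        # Pad short rows so that trailing empty columns may be left out.
        cells += [""] * (len(header) - len(cells))
        record = dict(zip(header, cells))
        topology = record["topology"] or "linear"
        if topology not in ("linear", "circular"):
            problems.append(f"line {line_no}: unknown topology {topology!r}")
        elif topology == "circular" and record["location"] not in ORGANELLE_LOCATIONS:
            problems.append(
                f"line {line_no}: circular topology on non-organelle {record['SeqID']!r}"
            )
    return problems


# A shortened copy of the example table, built with explicit tabs.
EXAMPLE_ROWS = [
    ["SeqID", "chromosome", "location", "topology"],
    ["contig_Chr1", "1", "chromosome", ""],
    ["contig_Chr2", "2", "chromosome", ""],
    ["contig_997", "", "mitochondrion", "circular"],
    ["contig_998", "", "chloroplast", "circular"],
]
EXAMPLE = "\n".join("\t".join(row) for row in EXAMPLE_ROWS) + "\n"

print(check_source_table(EXAMPLE))
```

Stripping each header cell also tolerates stray spaces around column names.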
@@ -0,0 +1,6 @@
+genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/050/580/445/GCA_050580445.1_ASM5058044v1/GCA_050580445.1_ASM5058044v1_genomic.fna.gz
+taxid: 246614
+short_reads:
+ - SRR33694212
+long_reads:
+ - SRR33704642
@@ -1,4 +1,4 @@
 genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/603/195/GCA_029603195.2_ASM2960319v2/GCA_029603195.2_ASM2960319v2_genomic.fna.gz
-short_reads: txid2530218[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]
+short_reads: txid2530218[Organism] AND biomol_transcript[properties] AND 75:350[ReadLength] AND illumina[Platform] NOT SRS024887[Accession]
 taxid: 2530218
 locus_tag_prefix: egapxtmp
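
The next hunk, in `nf/bin/run_wnode_batch.py`, derives the `flock` lock-file name from `args.command[0]` after replacing `/` with `_`. Presumably this is because a command given as a path would otherwise place the lock file in a subdirectory of `/tmp` that does not exist. The sketch below reproduces that substitution and the ceiling-division batch arithmetic in isolation. The function names and the command path are hypothetical; only the expressions themselves come from the script.

```python
import posixpath


def ceildiv(a, b):
    """Ceiling division using only integer arithmetic, as in the script."""
    return -(a // -b)


def starting_job_id(num_jobs, num_batches, batch_num):
    """First job id of a 1-based batch when jobs are split into equal batches."""
    batch_size = ceildiv(num_jobs, num_batches)
    return batch_size * (batch_num - 1) + 1


def lock_path(command):
    """Lock-file path with '/' in the command name replaced by '_'."""
    cmd_name = command[0].replace("/", "_")
    return f"/tmp/egapx.{cmd_name}.lock"


# Hypothetical command given as an absolute path.
command = ["/img/gp/bin/some_tool", "-flag"]

old_style = f"/tmp/egapx.{command[0]}.lock"  # name as built before the change
new_style = lock_path(command)

print(posixpath.dirname(old_style))  # a nested directory under /tmp
print(posixpath.dirname(new_style))  # /tmp itself
print(starting_job_id(num_jobs=10, num_batches=3, batch_num=2))
```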
@@ -45,11 +45,12 @@
 # and must ensure that all job-ids are unique (among all invocations of a wnode for a task).
 #
 # NB: an alternative to this acrobatics is to allow multiple job-ids in gpx_qdump and gpx_make_outputs.
-batch_size = -(len(jobs) // -args.num_batches)  # ceildiv
+batch_size      = -(len(jobs) // -args.num_batches)  # ceildiv
 starting_job_id = batch_size * (args.batch_num - 1) + 1
+cmd_name        = args.command[0].replace("/", "_")
 
 subprocess.run(
-    (["flock", "-x", f"/tmp/egapx.{args.command[0]}.lock" ] if args.exclusive else [])
+    (["flock", "-x", f"/tmp/egapx.{cmd_name}.lock" ] if args.exclusive else [])
     + args.command
     + [
         "-input-jobs"   , args.work_dir + "/inp/jobs_batch.xml",