Commit 6d5d399

Merge pull request #147 from ncbi/release-0.4.1-alpha
Release 0.4.1 alpha
2 parents 12b1f3d + 26af27e commit 6d5d399

47 files changed: +2031 -511 lines changed

LICENSE

Lines changed: 5 additions & 2 deletions
@@ -58,18 +58,21 @@ Authors: Sean R. Eddy
 License: BSD License
 [https://github.com/EddyRivasLab/infernal/blob/master/LICENSE]
 
+Location: /img/gp/third-party/tRNAscan-SE
+Authors: Patricia P. Chan, Brian Lin, and Todd M. Lowe
+License: GPL-3.0
+[https://github.com/EddyRivasLab/infernal/blob/master/LICENSE]
+
 Location: /img/gp/third-party/hmmer
 Authors: Sean R. Eddy
 License: BSD License
 [https://github.com/EddyRivasLab/hmmer/blob/master/LICENSE]
 
-
 Location: /usr/local/bin/busco
 Authors: Evgeny Zdobnov
 License: MIT License
 [https://gitlab.com/ezlab/busco/-/blob/master/LICENSE]
 
-
 Location: /img/gp/third-party/minimap2
 Authors: Heng Li
 License: MIT License

README.md

Lines changed: 14 additions & 8 deletions
@@ -156,7 +156,7 @@ Input to EGAPx is in the form of a YAML file.
 
 
 ### Running EGAPx with short and long RNA-seq reads
-- Optionally, you can also include long reads RNA-seq data from SRA or local files using the same formatting structure for short reads, using the label `long_reads:`
+- Optionally, you can also include long reads RNA-seq data from SRA or local files (FASTA or FASTQ, not BAM) using the same formatting structure for short reads, using the label `long_reads:`
 
 ```
 genome: path to assembled genome in FASTA format
@@ -171,6 +171,7 @@ Input to EGAPx is in the form of a YAML file.
 short_reads: txid43150[Organism] AND 75:350[ReadLength] AND illumina[Platform] AND biomol_rna[Properties]
 long_reads: txid43150[Organism] AND (oxford_nanopore[Platform] OR pacbio_smrt[Platform]) AND biomol_rna[Properties]
 ```
+- We have not rigorously tested EGAPx performance using clustered vs. non-clustered IsoSeq reads. EGAPx uses read depth for filtering and removing rare isoforms with limited support, but clustered reads will reduce compute cost.
 
 ## Input example
 [Back to Top](#Contents)
@@ -310,7 +311,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
 ```
 rm egap*sif
 singularity cache clean
-singularity pull docker://ncbi/egapx:0.4.0-alpha
+singularity pull docker://ncbi/egapx:0.4.1-alpha
 ```
 
 - Clone the repo:
@@ -343,7 +344,7 @@ If you do not have internet access from your cluster, you can run EGAPx in offli
 - Run `egapx.py` first to edit the `biowulf_cluster.config`:
 ```
 ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
-echo "process.container = '/path_to_/egapx_0.4.0-alpha.sif'" >> egapx_config/biowulf_cluster.config
+echo "process.container = '/path_to_/egapx_0.4.1-alpha.sif'" >> egapx_config/biowulf_cluster.config
 ```
 
 - Run `egapx.py`:
@@ -570,8 +571,6 @@ max_intron: 700000
 ## Submitting EGAPx annotation to NCBI
 [Back to Top](#Contents)
 
-:warning: The current EGAPx release (0.4.0) will produce submission files, however they are not yet GenBank submission-compliant. A subsequent patch release (likely 0.4.1) will produce GenBank-compliant files. We welcome users to try the process below to produce submission files and create a GitHub issue with errors or questions.
-
 After annotating your genome with EGAPx, you can prepare your annotation for submission to NCBI.
 
 ### Prepare required files and metadata
@@ -585,7 +584,13 @@ You will need:
 - To submit annotation for existing GenBank assemblies, you can access the BioProject information on Datasets Genome pages by searching the assembly accession at https://www.ncbi.nlm.nih.gov/datasets/genome/. locus_tag prefix is not needed in your `prepare_submission` command
 
 - To submit annotation with new assemblies, you will need additional inputs:
-- Source modifiers table file prepared from https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html
+- Source modifiers table file (see `examples/example_source_table.src`)
+- Tab-delimited file containing sequence identifiers, chromosome names, location, topology
+- Chromosome names follow these [rules](https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#chr_names)
+- Default topology is `linear`, only specify `circular` for organelles
+- Unplaced sequences can be completely omitted from the file
+- Rare cases of unlocalized sequences (not "the" chromosome, but with a chromosome assignment) should be included with the chromosome name in the chromosome column and blank in the location column
+
 - Assembly data structured comment file prepared from https://submit.ncbi.nlm.nih.gov/structcomment/genomes/
 - linkage evidence argument from options at https://www.ncbi.nlm.nih.gov/genbank/wgs_gapped/, e.g. `proximity-ligation` from Hi-C, `paired-ends` from Illumina
 
@@ -610,15 +615,16 @@ You are ready to run `prepare_submission`. See below for full list of required/o
 | `--submission-comment` | table2asn `-y` arg https://www.ncbi.nlm.nih.gov/genbank/table2asn/ |
 | `--name-cleanup-rules-file` | Two-column TSV of search/replace regexes to be applied to product and gene names |
 | `--source-quals` | table2asn `-j` arg. https://www.ncbi.nlm.nih.gov/genbank/mods_fastadefline/ |
+| `--unknown-gap-len` | table2asn `-gaps-unknown` arg. The exact number of consecutive Ns recognized as a gap with unknown length. (default: 100) |
 
 Command:
 
 ```
 # Using Docker:
-alias prepare_submission='docker run --rm -i --volume="$PWD:$PWD" --workdir="$PWD" ncbi/egapx:0.4.0-alpha prepare_submission'
+alias prepare_submission='docker run --rm -i --volume="$PWD:$PWD" --workdir="$PWD" ncbi/egapx:0.4.1-alpha prepare_submission'
 
 # Using Singularity or Apptainer:
-alias prepare_submission='singularity exec --cleanenv --bind "$PWD:$PWD" --pwd "$PWD" docker://ncbi/egapx:0.4.0-alpha prepare_submission'
+alias prepare_submission='singularity exec --cleanenv --bind "$PWD:$PWD" --pwd "$PWD" docker://ncbi/egapx:0.4.1-alpha prepare_submission'
 
 # Invoke the app:
 prepare_submission --egapx-annotated-genome-asn annotated_genome.asn --submission-template-file template.sbt --bioproject-id PRJNA# --src-file source-table.txt --assembly-data-structured-comment-file genome.asm --linkage-evidence paired-ends --out-dir out
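
Reviewer note: the new `--unknown-gap-len` row says a gap of unknown length is recognized only when a run of Ns has exactly the given length. The sketch below illustrates that reading on a plain sequence string. It is not table2asn's implementation; the function name `find_unknown_length_gaps` is hypothetical, and treating lowercase `n` as a gap character is an assumption.

```python
import re


def find_unknown_length_gaps(sequence, unknown_gap_len=100):
    """Return 0-based, half-open (start, end) spans of N-runs whose
    length is exactly unknown_gap_len. Shorter or longer runs are skipped."""
    return [
        (m.start(), m.end())
        # Assumption: lowercase 'n' counts as N, so normalize case first.
        for m in re.finditer(r"N+", sequence.upper())
        if m.end() - m.start() == unknown_gap_len
    ]


if __name__ == "__main__":
    seq = "ACGT" + "N" * 100 + "ACGT" + "N" * 50 + "AC"
    # Only the 100-N run qualifies; the 50-N run is ignored.
    print(find_unknown_length_gaps(seq))
```

Under this reading, a run of 101 Ns would not count as an unknown-length gap either, which is why the match uses equality and not a minimum length.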

examples/example_source_table.src

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+SeqID chromosome location topology
+contig_Chr1 1 chromosome
+contig_Chr2 2 chromosome
+contig_Chr3 3 chromosome
+contig_997 mitochondrion circular
+contig_998 chloroplast circular
+contig_999 plastid circular
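
Reviewer note: the README change describes this file as tab-delimited with four columns, a default topology of `linear`, and `circular` reserved for organelles. A minimal sketch of a pre-submission check is shown below. The function name `check_source_table` is hypothetical and not part of EGAPx. The tabs are not visible in this rendering, so the sample rows assume that organelle rows leave the chromosome column blank and that a blank topology means `linear`.

```python
import csv
import io

EXPECTED_HEADER = ["SeqID", "chromosome", "location", "topology"]
# Assumption: a blank topology cell stands for the default, linear.
ALLOWED_TOPOLOGY = {"", "linear", "circular"}


def check_source_table(text):
    """Return a list of problem descriptions; an empty list means none found."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be: " + "\t".join(EXPECTED_HEADER)]
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        # Pad short rows so trailing blank cells can be omitted.
        row = row + [""] * (len(EXPECTED_HEADER) - len(row))
        seq_id, _chromosome, _location, topology = row[:4]
        if not seq_id:
            problems.append(f"line {line_no}: missing SeqID")
        if topology not in ALLOWED_TOPOLOGY:
            problems.append(f"line {line_no}: unknown topology {topology!r}")
    return problems


if __name__ == "__main__":
    sample = (
        "SeqID\tchromosome\tlocation\ttopology\n"
        "contig_Chr1\t1\tchromosome\t\n"
        "contig_997\t\tmitochondrion\tcircular\n"
    )
    print(check_source_table(sample))
```

The check covers only the header, the sequence identifier, and the topology value. It does not validate chromosome names against the NCBI naming rules linked in the README.
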
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/050/580/445/GCA_050580445.1_ASM5058044v1/GCA_050580445.1_ASM5058044v1_genomic.fna.gz
+taxid: 246614
+short_reads:
+- SRR33694212
+long_reads:
+- SRR33704642

examples/input_C_longicornis.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/603/195/GCA_029603195.2_ASM2960319v2/GCA_029603195.2_ASM2960319v2_genomic.fna.gz
-short_reads: txid2530218[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]
+short_reads: txid2530218[Organism] AND biomol_transcript[properties] AND 75:350[ReadLength] AND illumina[Platform] NOT SRS024887[Accession]
 taxid: 2530218
 locus_tag_prefix: egapxtmp
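
Reviewer note: the updated `short_reads` value narrows the SRA query by read length and platform before the accession exclusion. A small sketch of how such a query string could be assembled is shown below. The helper `build_sra_query` is hypothetical and not part of EGAPx; it only joins the terms in the same order as the new line in this file.

```python
def build_sra_query(taxid, filters, exclude_accessions=()):
    """Join an organism term and extra filters with AND, then append
    one NOT clause per excluded accession."""
    terms = [f"txid{taxid}[Organism]"] + list(filters)
    query = " AND ".join(terms)
    for accession in exclude_accessions:
        query += f" NOT {accession}[Accession]"
    return query


if __name__ == "__main__":
    print(
        build_sra_query(
            2530218,
            ["biomol_transcript[properties]", "75:350[ReadLength]", "illumina[Platform]"],
            exclude_accessions=["SRS024887"],
        )
    )
```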

nf/bin/run_wnode_batch.py

Lines changed: 3 additions & 2 deletions
@@ -45,11 +45,12 @@
 # and must ensure that all job-ids are unique (among all invocations of a wnode for a task).
 #
 # NB: an alternative to this acrobatics is to allow multiple job-ids in gpx_qdump and gpx_make_outputs.
-batch_size = -(len(jobs) // -args.num_batches) # ceildiv
+batch_size = -(len(jobs) // -args.num_batches) # ceildiv
 starting_job_id = batch_size * (args.batch_num - 1) + 1
+cmd_name = args.command[0].replace("/", "_")
 
 subprocess.run(
-(["flock", "-x", f"/tmp/egapx.{args.command[0]}.lock" ] if args.exclusive else [])
+(["flock", "-x", f"/tmp/egapx.{cmd_name}.lock" ] if args.exclusive else [])
 + args.command
 + [
 "-input-jobs" , args.work_dir + "/inp/jobs_batch.xml",

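Reviewer note: the `run_wnode_batch.py` hunk relies on two small idioms, ceiling division written as `-(a // -b)` and replacing `/` in the command name so the lock file stays directly under `/tmp`. The sketch below isolates both so they can be checked on their own. The helper names `ceildiv`, `batch_layout`, and `lock_path` are hypothetical and do not appear in the script.

```python
from typing import Tuple


def ceildiv(a: int, b: int) -> int:
    """Ceiling division with integer arithmetic only: -(a // -b)."""
    return -(a // -b)


def batch_layout(num_jobs: int, num_batches: int, batch_num: int) -> Tuple[int, int]:
    """Return (batch_size, starting_job_id) for a 1-based batch number,
    mirroring the two assignments in the hunk."""
    batch_size = ceildiv(num_jobs, num_batches)
    starting_job_id = batch_size * (batch_num - 1) + 1
    return batch_size, starting_job_id


def lock_path(command: str) -> str:
    """Build the lock-file path from a command name.

    Replacing "/" keeps a command given as a path from being read as
    nested directories under /tmp, which would not exist.
    """
    return f"/tmp/egapx.{command.replace('/', '_')}.lock"


if __name__ == "__main__":
    print(batch_layout(num_jobs=10, num_batches=3, batch_num=2))
    print(lock_path("/usr/bin/tool"))
```

With the pre-change code, a command passed as an absolute path would produce a lock path containing extra directory components, so `flock` could fail to create the file. That appears to be the motivation for the `cmd_name` line, although the commit message does not state it.
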