Commit 314cbbc

Merge pull request #21 from ncbi/dev-v0.2
v0.2
2 parents 5569933 + a886af6 commit 314cbbc

47 files changed

+2447
-491
lines changed

LICENSE

Lines changed: 8 additions & 3 deletions
@@ -43,10 +43,15 @@ Authors: Dana-Farber Cancer Institute
  License: MIT License
  [https://github.com/lh3/miniprot/blob/master/LICENSE.txt]

- Location: img/gp/third-party/bamtools
- Authors: Derek Barnett, Erik Garrison, Gabor Marth, Michael Stromberg
+ Location: img/gp/third-party/diamond
+ Authors: Benjamin Buchfink
+ License: GNU GENERAL PUBLIC LICENSE
+ [https://github.com/bbuchfink/diamond/blob/master/LICENSE]
+
+ Location: img/gp/third-party/seqkit
+ Authors: Wei Shen
  License: MIT License
- [https://github.com/pezmaster31/bamtools/blob/master/LICENSE]
+ [https://github.com/shenwei356/seqkit/blob/master/LICENSE]

  ================================================================================


README.md

Lines changed: 126 additions & 26 deletions
@@ -4,7 +4,14 @@ EGAPx is the publicly accessible version of the updated NCBI [Eukaryotic Genome

  EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. The final annotation for the input assembly is produced as a `gff` file.

- We currently have protein datasets posted for most vertebrates (mammals, sauropsids, ray-finned fishes), hymenoptera, diptera, lepidoptera and choleoptera. We will be adding datasets for more arthropods, vertebrates and plants in the next couple of months. Fungi, protists and nematodes are currently out-of-scope for EGAPx pending additional refinements.
+ We currently have protein datasets posted that are suitable for most vertebrates and arthropods:
+ - Chordata - Mammalia, Sauropsida, Actinopterygii (ray-finned fishes)
+ - Insecta - Hymenoptera, Diptera, Lepidoptera, Coleoptera, Hemiptera
+ - Arthropoda - Arachnida, other Arthropoda
+
+ We will be adding datasets for plants and other invertebrates in the next couple of months. Fungi, protists and nematodes are currently out-of-scope for EGAPx pending additional refinements.
+
+ We currently have protein datasets posted for most vertebrates (mammals, sauropsids, ray-finned fishes) and arthropods. We will be adding datasets for more arthropods, vertebrates and plants in the next couple of months. Fungi, protists and nematodes are currently out-of-scope for EGAPx pending additional refinements.

  **Warning:**
  The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. You can also write to [email protected] to give us your feedback or if you have any questions.
@@ -41,23 +48,46 @@ Notes:

  Input to EGAPx is in the form of a YAML file.

- - The following two are the _required_ key-value pairs for the input file:
+ - The following are the _required_ key-value pairs for the input file:

  ```
  genome: path to assembled genome in FASTA format
  taxid: NCBI Taxonomy identifier of the target organism
+ reads: RNA-seq data
  ```
  You can obtain taxid from the [NCBI Taxonomy page](https://www.ncbi.nlm.nih.gov/taxonomy).


- - The following are the _optional_ key-value pairs for the input file:
+ - RNA-seq data can be supplied in any one of the following ways:

- - RNA-seq data. Use one of the following options:
  ```
- reads: [ array of paths to reads FASTA files]
- reads_ids: [ array of SRA run ids ]
- reads_query: query for reads SRA
+ reads: [ array of paths to reads FASTA or FASTQ files]
+ reads: [ array of SRA run IDs ]
+ reads: [SRA Study ID]
+ reads: SRA query for reads
+ ```
+ - If you are using your local reads, then the FASTA/FASTQ headers need to be in the following format:
  ```
+ head SRR8506572_1.fasta| grep ">"
+ >SRR8506572.1.1
+ >SRR8506572.2.1
+
+ head SRR8506572_2.fasta| grep ">"
+ >SRR8506572.1.2
+ >SRR8506572.2.2
+
+ head SRR8506572_2.fastq| grep "@"
+ @SRR8506572.1.2
+ @SRR8506572.2.2
+
+ head SRR8506572_1.fastq| grep "@"
+ @SRR8506572.1.1
+ @SRR8506572.2.1
+ ```
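The header convention shown in the added README lines can be checked before a run. The sketch below is only an illustration and is not part of EGAPx: the pattern `<accession>.<spot>.<read>` is inferred from the four examples above, and the function names `parse_read_header` and `nonconforming_fasta_headers` are hypothetical.

```python
import re

# Assumed pattern, inferred from the README examples:
# a ">" or "@" marker, then <run accession>.<spot number>.<read number>
HEADER_RE = re.compile(r"^[>@]([A-Za-z0-9_]+)\.(\d+)\.(\d+)$")


def parse_read_header(line):
    """Return (accession, spot, read) for a conforming header line, else None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    accession, spot, read = match.groups()
    return accession, int(spot), int(read)


def nonconforming_fasta_headers(lines):
    """Collect FASTA header lines (starting with '>') that do not fit the pattern.

    FASTQ needs more care: quality strings may also start with '@', so only
    every fourth line of a FASTQ record should be treated as a header.
    """
    return [
        line.strip()
        for line in lines
        if line.startswith(">") and parse_read_header(line) is None
    ]
```

Calling `nonconforming_fasta_headers(open("SRR8506572_1.fasta"))` would list any headers that still carry extra fields such as a length annotation.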
+
+ - If you provide an SRA Study ID, all the SRA run ID's belonging to that Study ID will be included in the EGAPx run.
+
+ - The following are the _optional_ key-value pairs for the input file:

  - A protein set. A taxid-based protein set will be chosen if no protein set is provided.
  ```
@@ -86,19 +116,19 @@ Input to EGAPx is in the form of a YAML file.
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.2
  ```

- - To specify an array of NCBI SRA datasets using `reads_ids:`
+ - To specify an array of NCBI SRA datasets:
  ```
- reads_ids:
+ reads:
  - SRR8506572
  - SRR9005248
  ```

- - To specify an SRA entrez query using `reads_query:`
+ - To specify an SRA entrez query:
  ```
- reads_query: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )'
+ reads: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )'
  ```

- **Note:** Both the above examples `reads_ids` and `reads_query` will have more RNA-seq data than the `input_D_farinae_small.yaml` example. To make sure the `reads_query` does not produce a large number of SRA runs, please run it first at the [NCBI SRA page](https://www.ncbi.nlm.nih.gov/sra). If there are too many SRA runs, then select a few of them and use the `reads_ids` option.
+ **Note:** Both the above examples will have more RNA-seq data than the `input_D_farinae_small.yaml` example. To make sure the entrez query does not produce a large number of SRA runs, please run it first at the [NCBI SRA page](https://www.ncbi.nlm.nih.gov/sra). If there are too many SRA runs, then select a few of them and list it in the input yaml.
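Because this commit folds `reads_ids` and `reads_query` into a single `reads:` key, whatever consumes the YAML has to tell the forms apart from the value alone. The sketch below shows one way that distinction could be drawn. It is an assumption for illustration, not the logic in `egapx.py`; `classify_reads` and both regular expressions are hypothetical names, and only the `SRR`/`ERR`/`DRR` run and `SRP`/`ERP`/`DRP` study accession prefixes are taken as known.

```python
import re

RUN_ID_RE = re.compile(r"^[SED]RR\d+$")    # SRA run accession, e.g. SRR8506572
STUDY_ID_RE = re.compile(r"^[SED]RP\d+$")  # SRA study accession prefix SRP/ERP/DRP


def classify_reads(value):
    """Guess which of the four documented forms a `reads:` value takes.

    Hypothetical heuristic: a plain string is treated as an Entrez query,
    a list of run accessions as run IDs, a one-item list holding a study
    accession as a study ID, and any other list as file paths or URLs.
    """
    if isinstance(value, str):
        return "query"
    if isinstance(value, list) and value:
        items = [str(item) for item in value]
        if all(RUN_ID_RE.match(item) for item in items):
            return "run_ids"
        if len(items) == 1 and STUDY_ID_RE.match(items[0]):
            return "study_id"
        return "paths"
    raise ValueError("reads: must be a non-empty list or a query string")
```

A check like this is also a convenient place to warn when a query string is supplied, since the note above recommends previewing queries on the NCBI SRA page first.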

  - First, test EGAPx on the example provided (`input_D_farinae_small.yaml`, a dust mite) to make sure everything works. This example usually runs under 30 minutes depending upon resource availability. There are other examples you can try: `input_C_longicornis.yaml`, a green fly, and `input_Gavia_tellata.yaml`, a bird. These will take close to two hours. You can prepare your input YAML file following these examples.

@@ -144,40 +174,57 @@ Input to EGAPx is in the form of a YAML file.
  - use `-e aws` for AWS batch using Docker image
  - use `-e docker` for using Docker image
  - use `-e singularity` for using the Singularity image
- - use `-e slurm` for using SLURM in your HPC.
+ - use `-e biowulf_cluster` for Biowulf cluster using Singularity image
+ - use '-e slurm` for using SLURM in your HPC.
  - Note that for this option, you have to edit `./egapx_config/slurm.config` according to your cluster specifications.
  - type `python3 ui/egapx.py  -h ` for the help menu

  ```
- $ ./egapx.py -h
-
+ $ ui/egapx.py -h
+
+
  !!WARNING!!
  This is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use.

- usage: egapx.py [-h] [-e EXECUTOR] [-c CONFIG_DIR] [-o OUTPUT] [-w WORKDIR] [-r REPORT] [-n] [-q] [-v] [-fn FUNC_NAME] filename
+ usage: egapx.py [-h] [-o OUTPUT] [-e EXECUTOR] [-c CONFIG_DIR] [-w WORKDIR] [-r REPORT] [-n] [-st]
+ [-so] [-dl] [-lc LOCAL_CACHE] [-q] [-v] [-fn FUNC_NAME]
+ [filename]

  Main script for EGAPx

- positional arguments:
- filename YAML file with input: section with at least genome: and reads: parameters
-
  optional arguments:
  -h, --help show this help message and exit
  -e EXECUTOR, --executor EXECUTOR
- Nextflow executor, one of local, docker, aws. Uses corresponding Nextflow config file
+ Nextflow executor, one of docker, singularity, aws, or local (for NCBI
+ internal use only). Uses corresponding Nextflow config file
  -c CONFIG_DIR, --config-dir CONFIG_DIR
- Directory for executor config files, default is ./egapx_config. Can be also set as env EGAPX_CONFIG_DIR
- -o OUTPUT, --output OUTPUT
- Output path
+ Directory for executor config files, default is ./egapx_config. Can be also
+ set as env EGAPX_CONFIG_DIR
  -w WORKDIR, --workdir WORKDIR
- Working directory for cloud executor
+ Working directory for cloud executor
  -r REPORT, --report REPORT
- Report file prefix for report (.report.html) and timeline (.timeline.html) files, default is in output directory
+ Report file prefix for report (.report.html) and timeline (.timeline.html)
+ files, default is in output directory
  -n, --dry-run
+ -st, --stub-run
+ -so, --summary-only Print result statistics only if available, do not compute result
+ -lc LOCAL_CACHE, --local-cache LOCAL_CACHE
+ Where to store the downloaded files
  -q, --quiet
  -v, --verbose
  -fn FUNC_NAME, --func_name FUNC_NAME
  func_name
+
+ run:
+ filename YAML file with input: section with at least genome: and reads: parameters
+ -o OUTPUT, --output OUTPUT
+ Output path
+
+ download:
+ -dl, --download-only Download external files to local storage, so that future runs can be
+ isolated
+
+
  ```


@@ -270,16 +317,69 @@ $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/outpu
  2024-03-27 11:20:24 17127134 aligns.paf
  ```

+ ## Offline mode
+
+ If you do not have internet access from your cluster, you can run EGAPx in offline mode. To do this, you would first pull the Singularity image, then download the necessary files from NCBI FTP using `egapx.py` script, and then finally use the path of the downloaded folder in the run command. Here is an example of how to download the files and execute EGAPx in the Biowulf cluster.
+
+
+ - Download the Singularity image:
+ ```
+ rm egap*sif
+ singularity cache clean
+ singularity pull docker://ncbi/egapx:0.2-alpha
+ ```
+
+ - Clone the repo:
+ ```
+ git clone https://github.com/ncbi/egapx.git
+ cd egapx
+ ```
+
+ - Download EGAPx related files from NCBI:
+ ```
+ python3 ui/egapx.py -dl -lc ../local_cache
+ ```
+
+ - Download SRA reads:
+ ```
+ prefetch SRR8506572
+ prefetch SRR9005248
+ fasterq-dump --skip-technical --threads 6 --split-files --seq-defline ">\$ac.\$si.\$ri" --fasta -O sradir/ ./SRR8506572
+ fasterq-dump --skip-technical --threads 6 --split-files --seq-defline ">\$ac.\$si.\$ri" --fasta -O sradir/ ./SRR9005248
+
+ ```
+ You should see downloaded files inside the 'sradir' folder":
+ ```
+ ls sradir/
+ SRR8506572_1.fasta SRR8506572_2.fasta SRR9005248_1.fasta SRR9005248_2.fasta
+ ```
+ Now edit the file paths of SRA reads files in `examples/input_D_farinae_small.yaml` to include the above SRA files.
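The manual edit described in this step amounts to listing the downloaded files under the `reads:` key. A minimal sketch of generating that list follows. The helper name `reads_yaml_block` is hypothetical, and whether EGAPx expects relative or absolute paths here is not stated in the README, so the paths are emitted exactly as given.

```python
from pathlib import Path


def reads_yaml_block(read_dir, pattern="*.fasta"):
    """Build a YAML `reads:` list from the files found in `read_dir`.

    Files are sorted by name so that the _1/_2 mates of each run stay adjacent.
    """
    files = sorted(Path(read_dir).glob(pattern))
    lines = ["reads:"]
    lines += [f"  - {path.as_posix()}" for path in files]
    return "\n".join(lines)
```

Printing `reads_yaml_block("sradir")` after the `fasterq-dump` step gives a block that can be pasted over the existing `reads:` entry in the example YAML.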
+
+ - Run `egapx.py` first to edit the `biowulf_cluster.config`:
+ ```
+ ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
+ echo "process.container = '/path_to_/egapx_0.2-alpha.sif'" >> egapx_config/biowulf_cluster.config
+ ```
+
+ - Run `egapx.py`:
+ ```
+ ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
+
+ ```
+
+
  ## References

- Barnett DW, Garrison EK, Quinlan AR, Strömberg MP, Marth GT. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011 Jun 15;27(12):1691-2. doi: 10.1093/bioinformatics/btr174. Epub 2011 Apr 14. PMID: 21493652; PMCID: PMC3106182.
+ Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021 Apr;18(4):366-368. doi: 10.1038/s41592-021-01101-x. Epub 2021 Apr 7. PMID: 33828273; PMCID: PMC8026399.

  Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.

  Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PMID: 23104886; PMCID: PMC3530905.

  Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023 Jan 1;39(1):btad014. doi: 10.1093/bioinformatics/btad014. PMID: 36648328; PMCID: PMC9869432.

+ Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.
+


  ## Contact us

examples/input_C_longicornis.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
  genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029//603/195/GCF_029603195.1_ASM2960319v2/GCF_029603195.1_ASM2960319v2_genomic.fna.gz
- reads_query: 'txid2530218[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]'
+ reads: txid2530218[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]
  taxid: 2530218

examples/input_Gavia_stellata.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
  genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/936/135/GCF_030936135.1_bGavSte3.hap2/GCF_030936135.1_bGavSte3.hap2_genomic.fna.gz
- reads_query: 'txid37040[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]'
+ reads: txid37040[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]
  taxid: 37040

nf/subworkflows/ncbi/default/annot_builder/main.nf

Lines changed: 21 additions & 5 deletions
@@ -30,11 +30,13 @@ workflow annot_builder {
          def m = annot_builder_main('outdir', params).collect()
          def i = annot_builder_input('outdir', m, '01', gnomon_file, params)
          // FIXME: intended params 4-5 to be lists of all input files and all input manifests, but it complained with only one entry
-         def (all, accept) = annot_builder_run('outdir', i[0], gencoll_asn, i[1], gnomon_file, genome_asn, params)
+         def (all, accept, accept_ftable, annot) = annot_builder_run('outdir', i[0], gencoll_asn, i[1], gnomon_file, genome_asn, params)

      emit:
          outputs = all
          accept_asn = accept
+         accept_ftable_annot = accept_ftable
+         annot_files = annot
  }


@@ -76,6 +78,7 @@ process annot_builder_main {
      stub:
      """
      touch annot_builder_main.ini
+     echo 'main' > annot_builder_main.ini
      """
  }

@@ -137,6 +140,8 @@ process annot_builder_input {
      """
      touch annot_builder_input.ini
      touch input_manifest_${provider_number}.mft
+     cp ${prior_file} annot_builder_input.ini
+     echo 'input ${provider_number}' >> annot_builder_input.ini
      """
  }

@@ -152,8 +157,10 @@ process annot_builder_run {
          path genome_asn, stageAs: 'genome/*'
          val params
      output:
-         path "${outdir}/*"
-         path "${outdir}/ACCEPT/accept.asn", optional: true
+         path "${outdir}/*", emit: "all"
+         path "${outdir}/ACCEPT/accept.asn", emit: "accept", optional: true
+         path "${outdir}/ACCEPT/accept.ftable_annot", emit: "accept_ftable_annot", optional: true
+         path "${outdir}/ACCEPT/*.annot", optional: true
      script:
      """
      mkdir -p $outdir/ACCEPT
@@ -165,6 +172,7 @@ process annot_builder_run {
      lds2_indexer -source genome/ -db LDS2
      # EXCEPTION_STACK_TRACE_LEVEL=Warning DEBUG_STACK_TRACE_LEVEL=Warning DIAG_POST_LEVEL=Trace
      annot_builder -accept-output both -nogenbank -lds2 LDS2 -conffile $conffile -gc-assembly $gencoll_asn -logfile ${outdir}/annot_builder.log
+     cat ${outdir}/ACCEPT/*.ftable.annot > ${outdir}/ACCEPT/accept.ftable_annot
      """
      stub:
      """
@@ -174,7 +182,15 @@ process annot_builder_run {
      mkdir -p $outdir/REPORT
      mkdir -p $outdir/TEST

-     touch ${outdir}/annot_builder.log
-     touch ${outdir}/accept.asn
+     echo "1" > ${outdir}/annot_builder.log
+     echo "2" > ${outdir}/accept.asn
+     echo "3" > ${outdir}/accept.ftable.annot
+
+
+     echo "4" > ${outdir}/ACCEPT/accept.asn
+     echo "5" > ${outdir}/ACCEPT/accept.ftable_annot
+     echo "S1" > ${outdir}/ACCEPT/S1.annot
+     echo "S2" > ${outdir}/ACCEPT/S2.annot
+
      """
  }
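
The new `cat ${outdir}/ACCEPT/*.ftable.annot > ${outdir}/ACCEPT/accept.ftable_annot` line merges the per-sequence feature tables into one file. The sketch below mirrors that step in Python to make its behavior explicit. It is an illustration only: `concat_ftable_annots` is a hypothetical name, and the sort order is by code point, which may differ from the shell's locale-dependent glob order.

```python
from pathlib import Path


def concat_ftable_annots(accept_dir, out_name="accept.ftable_annot"):
    """Concatenate every *.ftable.annot file in accept_dir into one output file.

    The output name ends in ".ftable_annot", so it does not match the
    "*.ftable.annot" pattern and cannot be picked up as its own input.
    Returns the output path and the names of the files that were merged.
    """
    accept_dir = Path(accept_dir)
    parts = sorted(accept_dir.glob("*.ftable.annot"))
    out_path = accept_dir / out_name
    with out_path.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return out_path, [part.name for part in parts]
```

One difference from the shell line is worth noting: when no input matches, this sketch writes an empty output file, whereas `cat` with an unmatched glob reports an error.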

nf/subworkflows/ncbi/default/annotwriter/main.nf

Lines changed: 7 additions & 3 deletions
@@ -17,17 +17,21 @@ process run_annotwriter {
      input:
          path accept_asn_file
      output:
-         path ('output/accept.gff') , emit: 'annoted_file'
+         path ('output/accept.gff'), emit: 'annoted_file'

      script:
      """
      mkdir -p output
-     annotwriter -i ${accept_asn_file} -nogenbank -format gff3 -o output/accept.gff
+     if [ -s ${accept_asn_file} ]; then
+         annotwriter -i ${accept_asn_file} -nogenbank -format gff3 -o output/accept.gff
+     else
+         touch output/accept.gff
+     fi
      """

      stub:
      """
      mkdir -p output
-     touch output/accept.gff
+     echo "1" > output/accept.gff
      """
  }
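
The guard added to `run_annotwriter` relies on the shell test `[ -s file ]`, which succeeds only when the file exists and is not empty. The sketch below restates that decision in Python for readers less familiar with the test operator. The helper names `has_content` and `annotwriter_command` are hypothetical; the command arguments are copied from the script block above.

```python
import os


def has_content(path):
    """Mirror `[ -s path ]`: true only if the path exists and its size is non-zero."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def annotwriter_command(accept_asn_file, out_gff="output/accept.gff"):
    """Return the command the script block would run for this input.

    A non-empty ASN.1 file is converted to GFF3; otherwise an empty
    placeholder GFF is created so the declared output still exists.
    """
    if has_content(accept_asn_file):
        return ["annotwriter", "-i", accept_asn_file, "-nogenbank",
                "-format", "gff3", "-o", out_gff]
    return ["touch", out_gff]
```

Creating the placeholder keeps the declared `output/accept.gff` present even when `accept.asn` is empty, so the process output is always satisfied.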
