- Dataset/Sample Discovery - Reads dataset or sample information from CSV input file
- Public Dataset Detection - Identifies public datasets (`GSE*`, `E-MTAB-*`, `PRJEB*` patterns)
- Metadata Parsing - Extracts metadata from public repositories for public datasets
- Quality Control - Generates mapping QC statistics from STARsolo output (if not already present)
- File Collection - Gathers all data files and metadata files for upload
- iRODS Upload - Transfers files to iRODS with checksums
- Metadata Attachment - Attaches comprehensive metadata to iRODS collections
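The upload and metadata-attachment steps above map naturally onto standard iRODS icommands. A minimal sketch of the per-file operations, shown here by printing the command invocations rather than running them (the collection path and attribute name are illustrative, not taken from the pipeline):

```shell
# Sketch of the iRODS operations wrapped by the upload modules; this prints
# the icommand invocations instead of executing them, since a live iRODS
# connection is needed for the real thing.
upload_cmds() {  # args: local_file target_collection
  echo "imkdir -p $2"                      # ensure the target collection exists
  echo "iput -K $1 $2/"                    # upload; -K verifies the server-side checksum
  echo "imeta add -C $2 uploaded_from $1"  # attach metadata to the collection
}
upload_cmds Log.final.out /archive/cellgeni/sanger/GSE123456/SAMPLE1
```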
The pipeline supports a dry run mode (`--dry_run true`) that lets you test and validate your pipeline configuration without actually uploading data to iRODS. In this mode, the pipeline:
- Validates input parameters - Checks that all required parameters are provided and correctly formatted
- Processes metadata - Runs all metadata collection and parsing steps
- Generates QC reports - Creates mapping statistics and quality control files
- Creates output files - Generates `sample_metadata.csv` and `dataset_metadata.csv` files locally
- Shows intended operations - Displays what files would be uploaded and where
The following operations are skipped in dry run mode:
- iRODS file uploads - No files are actually transferred to iRODS storage
- iRODS metadata attachment - No metadata is attached to iRODS collections
- MD5 checksum validation - No checksums are computed or compared with iRODS
This pipeline processes datasets containing single-cell RNA sequencing results (typically from STARsolo/CellRanger), extracts metadata, performs quality control analysis, and uploads the data with comprehensive metadata to iRODS for long-term storage and data management.
- `main.nf` — the main Nextflow pipeline that orchestrates data processing and iRODS upload
- `nextflow.config` — configuration script for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
- `modules/local/reprocess10x/parsemetadata/` — module for parsing metadata from public repositories (GEO, ENA)
- `modules/local/reprocess10x/mappingqc/` — module for extracting mapping quality control statistics
- `modules/local/irods/storefile/` — module for uploading files to iRODS
- `modules/local/irods/attachcollectionmeta/` — module for attaching metadata to iRODS collections
Upload processed datasets to iRODS:
```shell
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/"
```
Upload individual samples to iRODS:
```shell
nextflow run main.nf \
    --samples samples.csv \
    --irodspath "/archive/cellgeni/sanger/"
```
Specify which file types to ignore during upload:
```shell
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --ignore_pattern ".bam,.fastq.gz"
```
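One plausible way to interpret a comma-separated ignore pattern is suffix matching against each file name, sketched below (the pipeline's actual matching logic may differ):

```shell
# Suffix-match a file name against a comma-separated ignore list.
should_ignore() {  # args: filename patterns
  local f="$1"
  local IFS=','
  for p in $2; do
    case "$f" in *"$p") return 0 ;; esac  # suffix match
  done
  return 1
}
should_ignore sample.bam ".bam,.fastq.gz" && echo skip || echo keep   # skip
should_ignore matrix.mtx ".bam,.fastq.gz" && echo skip || echo keep   # keep
```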
Skip automatic metadata retrieval for public datasets:
```shell
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --collect_public_metadata false
```
Get detailed logging information:
```shell
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --verbose true
```
Specify a different output directory:
```shell
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --output_dir "my_results"
```
Test the pipeline without actually uploading files to iRODS:
```shell
nextflow run main.nf \
    --datasets datasets.csv \
    --irodspath "/archive/cellgeni/sanger/" \
    --dry_run true
```
This mode will:
- Validate all input parameters and files
- Process metadata and generate QC reports
- Show what would be uploaded without actual iRODS operations
- Create local output files (metadata CSV files) for review
- `--datasets` — Path to a CSV file containing dataset information with columns: `id` (dataset identifier) and `path` (local filesystem path to processed data directory), OR
- `--samples` — Path to a CSV file containing sample information with columns: `id` (sample identifier), `path` (local filesystem path), and optionally `dataset_id`
- `--irodspath` — Base path in iRODS where datasets will be stored (e.g., `/archive/cellgeni/sanger/`)
- `--output_dir` — Output directory for pipeline results (default: `"results"`)
- `--publish_mode` — File publishing mode (default: `"copy"`)
- `--ignore_pattern` — Comma-separated list of file patterns to ignore during upload (default: `".bam,.bai,.cram,.crai,.fastq.gz,.fq.gz,.fastq,.fq,.mate1.bz2,.mate2.bz2,.sh,.bsub,.pl"`)
- `--collect_public_metadata` — Collect metadata from public repositories for public datasets (default: `true`)
- `--parse_mapper_metrics` — Parse mapping QC metrics from STARsolo output (default: `true`)
- `--verbose` — Enable verbose output (default: `false`)
- `--dry_run` — Perform a dry run without uploading files to iRODS (default: `false`)
The pipeline supports two input modes:
CSV file with the following structure:
```
id,path
GSE123456,/path/to/processed/GSE123456
PRJEB12345,/path/to/processed/PRJEB12345
EGA_DATASET,/path/to/processed/EGA_DATASET
```
Where:
- `id`: Unique dataset identifier (can be GEO accession, ENA project, or custom ID)
- `path`: Absolute path to the directory containing processed single-cell data with sample subdirectories
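If each dataset lives in its own subdirectory of one parent directory, the input CSV can be generated with a short loop. A sketch (the helper name and directory layout are assumptions):

```shell
# Build an input CSV from a directory of processed datasets, using each
# subdirectory name as the dataset id (layout is an assumption).
make_datasets_csv() {  # args: base_dir out_csv
  printf 'id,path\n' > "$2"
  for d in "$1"/*/; do
    [ -d "$d" ] || continue  # skip if the glob matched nothing
    printf '%s,%s\n' "$(basename "$d")" "${d%/}" >> "$2"
  done
}
make_datasets_csv /path/to/processed datasets.csv
```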
CSV file with the following structure:
```
id,path,dataset_id
SAMPLE1,/path/to/processed/SAMPLE1,GSE123456
SAMPLE2,/path/to/processed/SAMPLE2,GSE123456
SAMPLE3,/path/to/processed/SAMPLE3,
```
Where:
- `id`: Unique sample identifier
- `path`: Absolute path to the sample directory containing processed single-cell data
- `dataset_id`: (Optional) Dataset identifier to group samples under. If omitted, samples will be uploaded directly to `irodspath/id`
The pipeline uploads data to different iRODS paths depending on the input type:
- Datasets (`--datasets`): Uploaded to `irodspath/id`
  - Example: `/archive/cellgeni/sanger/GSE123456/`
- Samples with dataset_id (`--samples`): Uploaded to `irodspath/dataset_id/id`
  - Example: `/archive/cellgeni/sanger/GSE123456/SAMPLE1/`
- Samples without dataset_id (`--samples`): Uploaded to `irodspath/id`
  - Example: `/archive/cellgeni/sanger/SAMPLE1/`
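The destination rules above reduce to one small function. A sketch (the function name is illustrative, not part of the pipeline):

```shell
# Compute the iRODS destination collection from the upload rules:
# datasets and ungrouped samples go to irodspath/id, grouped samples
# go to irodspath/dataset_id/id.
dest_path() {  # args: irodspath id [dataset_id]
  local base="${1%/}" id="$2" ds="${3:-}"
  if [ -n "$ds" ]; then
    echo "$base/$ds/$id"   # sample grouped under its dataset
  else
    echo "$base/$id"       # dataset, or sample without dataset_id
  fi
}
dest_path /archive/cellgeni/sanger/ SAMPLE1 GSE123456  # /archive/cellgeni/sanger/GSE123456/SAMPLE1
dest_path /archive/cellgeni/sanger/ GSE123456          # /archive/cellgeni/sanger/GSE123456
```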
Each dataset directory should contain:
- Sample directories: Named with sample identifiers containing STARsolo output files
- QC files: `*solo_qc.tsv` files (generated automatically if missing)
- Metadata files: For public datasets, metadata will be automatically retrieved
Example directory structure:
```
GSE123456/
├── SAMPLE1/
│   ├── Aligned.sortedByCoord.out.bam
│   ├── Log.final.out
│   ├── Solo.out/
│   └── ...
├── SAMPLE2/
│   └── ...
└── GSE123456_solo_qc.tsv
```
Each sample directory should contain STARsolo output files:
```
SAMPLE1/
├── Aligned.sortedByCoord.out.bam
├── Log.final.out
├── Solo.out/
│   ├── Gene/
│   ├── GeneFull/
│   └── ...
└── ...
```
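A quick pre-flight check that a sample directory contains the expected STARsolo outputs can be written as follows (file names follow the structure above; the check itself is not part of the pipeline):

```shell
# Verify that a sample directory looks like STARsolo output before
# listing it in the input CSV.
check_sample() {  # arg: sample_dir
  local f
  for f in Log.final.out Solo.out; do
    if [ ! -e "$1/$f" ]; then
      echo "missing: $1/$f"
      return 1
    fi
  done
  echo "ok: $1"
}
```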
For datasets matching patterns `GSE*`, `E-MTAB-*`, or `PRJEB*`, the pipeline automatically:
- Retrieves sample metadata from public repositories
- Generates accession mapping files
- Creates sample relationship files
- Downloads study metadata
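The accession patterns above amount to a simple prefix check. A sketch of how such detection might look (the pipeline's actual test lives in the Nextflow code):

```shell
# Classify a dataset id as public (known repository accession) or private.
is_public() {
  case "$1" in
    GSE*|E-MTAB-*|PRJEB*) echo public ;;   # GEO, ArrayExpress, ENA accessions
    *)                    echo private ;;
  esac
}
is_public GSE123456    # public
is_public E-MTAB-9876  # public
is_public MY_DATASET   # private
```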
The pipeline extracts and stores:
- Sample-level metadata: Species, sequencing type (paired/single), strand information, read counts, whitelist version
- Dataset-level metadata: Study accession numbers, dataset identifiers
- Technical metadata: File checksums, upload timestamps
Data is organized in iRODS based on the input type:
```
/archive/cellgeni/sanger/
├── GSE123456/                   # Dataset uploaded to irodspath/id
│   ├── SAMPLE1/                 # Individual samples within dataset
│   │   ├── [STARsolo output files]
│   │   └── [metadata attached to collection]
│   ├── SAMPLE2/
│   └── [dataset metadata files]
└── PRJEB12345/                  # Another dataset
    └── ...
```
With dataset_id specified:
```
/archive/cellgeni/sanger/
├── GSE123456/                   # Dataset ID from CSV
│   ├── SAMPLE1/                 # Sample uploaded to irodspath/dataset_id/id
│   │   ├── [STARsolo output files]
│   │   └── [metadata attached to collection]
│   └── SAMPLE2/
└── ...
```
Without dataset_id specified:
```
/archive/cellgeni/sanger/
├── SAMPLE1/                     # Sample uploaded directly to irodspath/id
│   ├── [STARsolo output files]
│   └── [metadata attached to collection]
├── SAMPLE2/
└── ...
```
- Nextflow: Version 25.04.4 or higher
- Singularity: For containerized execution
- LSF: For job scheduling on HPC
- iRODS: Client tools for data upload
- Storage: Sufficient space in `/lustre` and `/nfs` mount points
The pipeline is configured for Sanger's HPC environment with:
- LSF executor with per-job memory limits
- Singularity containers from `/nfs/cellgeni/singularity/images/`
- Bind mounts for `/lustre` and `/nfs` filesystems
- Retry strategy with up to 5 attempts per process
The pipeline generates comprehensive reports:
- Timeline report: `reports/YYYYMMDD-HH-mm-ss_timeline.html`
- Execution report: `reports/YYYYMMDD-HH-mm-ss_report.html`
- Trace file: `reports/YYYYMMDD-HH-mm-ss_trace.tsv`
- iRODS Collections: Organized dataset collections with attached metadata
- QC Reports: Mapping statistics and quality metrics
- Metadata Files: Comprehensive sample and dataset annotations
- Upload Logs: Records of transferred files with checksums
- Missing QC files: The pipeline will generate them automatically
- iRODS connection: Ensure iRODS client is properly configured
- File permissions: Check read access to input directories
- Storage space: Ensure sufficient space for temporary files
Check the Nextflow work directory (`nf-work/`) for detailed process logs and error messages.
Current version: 0.0.1
See repository for license information.