Releases: pachterlab/kb_python
v0.27.1
General
- [DEPRECATION] Support for split indices (with the `-n` option) will be deprecated in the next major release. It is now recommended to pass the `--include-attribute` and `--exclude-attribute` options to `kb ref`, similar to Cellranger's `mkref` options (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references), to reduce index size and memory usage.
ref
- A remote URL may be provided as the `fasta` (genomic FASTA) and/or `gtf` (gene annotation GTF) arguments. Support from `ngs_tools 1.5.13`.
- GTF is now allowed to have 0-length segments (pachterlab/kallisto#340).
count
- [DEPRECATION] Technology `SMARTSEQ` is now deprecated. All future uses should use `BULK`, `SMARTSEQ2` or `SMARTSEQ3`.
- Genes that do not have a gene name will now have their gene IDs in the `gene_name` column (or the `adata.var_names` if `--gene-names` is used).
- Support for `--workflow lamanno` for `-x BULK` and `-x SMARTSEQ2` technologies.
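The gene-name fallback described above can be pictured with a minimal Python sketch. The function name `fill_missing_gene_names` is hypothetical and this is not kb's own code; it only illustrates the rule that a gene without a name is labeled by its ID.

```python
from typing import List, Optional


def fill_missing_gene_names(
    gene_ids: List[str], gene_names: List[Optional[str]]
) -> List[str]:
    """Return one label per gene, using the gene ID where the name is missing.

    Hypothetical helper: a missing name is assumed to be None or an empty string.
    """
    if len(gene_ids) != len(gene_names):
        raise ValueError("gene_ids and gene_names must have the same length")
    return [name if name else gene_id for gene_id, name in zip(gene_ids, gene_names)]
```

For example, IDs `["ENSG1", "ENSG2"]` with names `["ACTB", ""]` would yield `["ACTB", "ENSG2"]`.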
v0.27.0
General
- Added the `compile` command. See below for more information. (#139)
- Fixed an issue where a call to kallisto would hang indefinitely due to a full stderr buffer.
- Changed docstring style to Google-style. Added typings to all functions.
- Updated kallisto binaries to `v0.48.0`.
- Updated bustools binaries to `v0.41.0`.
- Added binary compatibility checks. If a binary is incompatible, `kb compile` is suggested.
compile
- This command can be used to compile the `kallisto` and/or `bustools` binary from source. At the most basic level, it downloads the latest release source distributions from the respective GitHub repositories, compiles them, and places them where `kb` can automatically detect them.
- The `target` positional argument specifies which binary (or both) to compile. Possible values are `kallisto`, `bustools` and `all`.
- The `--url` optional argument may be provided with a URL to a remote archive that will be used instead of the latest GitHub release. When this option is used, `target` may not be `all`.
- The `--ref` optional argument may be provided with a commit hash or git tag. When this option is used, `target` may not be `all`.
- The `-o` optional argument may be used to place the compiled binaries in a different directory. Note that if this option is used, the `--kallisto` and `--bustools` options will have to be set appropriately when running `ref` or `count`.
- The `--view` option may be used to simply view what binaries (their locations and versions) will be used by `kb`.
- The `--remove` option may be used to remove existing compiled binaries.
- The `--overwrite` option may be used to overwrite existing compiled binaries.
- The `kallisto` compilation follows https://pachterlab.github.io/kallisto/source and has the same dependencies.
- The `bustools` compilation follows https://bustools.github.io/source and has the same dependencies.
- The `--cmake-arguments` argument may be used to pass a string of additional arguments directly to the `cmake` command. For instance, to manually specify additional include directories, `--cmake-arguments "-DCMAKE_CXX_FLAGS='-I /usr/include'"`.
- Note that the compilation is performed in shared mode, which means the binary will contain links to shared libraries (i.e. not statically linked).
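As a rough illustration of how these options combine, the following Python sketch assembles a `kb compile` argument list and enforces the rule that `--url` and `--ref` cannot be used with the `all` target. The helper `build_compile_command` is hypothetical and only builds the list; the order in which the flags are emitted is an assumption, not something the notes specify.

```python
from typing import List, Optional

VALID_TARGETS = ("kallisto", "bustools", "all")


def build_compile_command(
    target: str,
    url: Optional[str] = None,
    ref: Optional[str] = None,
    out_dir: Optional[str] = None,
    cmake_arguments: Optional[str] = None,
    overwrite: bool = False,
) -> List[str]:
    """Build an argument list for `kb compile` (hypothetical helper, not part of kb)."""
    if target not in VALID_TARGETS:
        raise ValueError(f"target must be one of {VALID_TARGETS}")
    # Per the notes above, --url and --ref may not be used when target is `all`.
    if target == "all" and (url is not None or ref is not None):
        raise ValueError("--url and --ref cannot be used with target 'all'")

    command = ["kb", "compile", target]
    if url is not None:
        command += ["--url", url]
    if ref is not None:
        command += ["--ref", ref]
    if out_dir is not None:
        command += ["-o", out_dir]
    if cmake_arguments is not None:
        # Passed as one string, which kb forwards to cmake.
        command += ["--cmake-arguments", cmake_arguments]
    if overwrite:
        command.append("--overwrite")
    return command
```

The resulting list could be handed to `subprocess.run`. Building a list rather than a shell string avoids quoting problems with values such as `-DCMAKE_CXX_FLAGS='-I /usr/include'`.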
ref
- Added `--include-attribute` and `--exclude-attribute` options, which can be used to include/exclude specific GTF entries based on their attributes. The argument to these options must be in the form of a `key:value` pair, where `key` is a GTF attribute name and `value` is the value of the aforementioned attribute to include/exclude. Only one of these two options may be specified, and each option may be specified more than once. When multiple `--include-attribute` are provided, GTF entries that have any one of the attributes will be processed. When multiple `--exclude-attribute` are provided, GTF entries that have any one of the attributes will not be processed.
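The include/exclude semantics can be sketched in a few lines of Python. This is an illustration of the rules stated above, not kb's implementation; `parse_attribute` and `keep_entry` are hypothetical names, and each GTF entry is assumed to be already parsed into a dictionary of attributes.

```python
from typing import Dict, Optional, Sequence, Tuple


def parse_attribute(spec: str) -> Tuple[str, str]:
    """Split a `key:value` argument at the first colon."""
    key, sep, value = spec.partition(":")
    if not sep or not key or not value:
        raise ValueError(f"expected key:value, got {spec!r}")
    return key, value


def keep_entry(
    attributes: Dict[str, str],
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> bool:
    """Decide whether a GTF entry with the given attributes should be processed."""
    if include and exclude:
        raise ValueError("only one of include/exclude may be specified")
    if include:
        # Processed if the entry matches ANY of the include pairs.
        return any(attributes.get(k) == v for k, v in map(parse_attribute, include))
    if exclude:
        # Skipped if the entry matches ANY of the exclude pairs.
        return not any(attributes.get(k) == v for k, v in map(parse_attribute, exclude))
    return True
```

With `include=["gene_biotype:protein_coding", "gene_biotype:lncRNA"]`, an entry whose `gene_biotype` is `lncRNA` is kept, while one whose `gene_biotype` is `pseudogene` is dropped.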
count
- Added `--filter-threshold` option to specify the barcode filter threshold. This option may only be used when also providing `--filter bustools` and indicates the minimum number of times a barcode must appear to be retained from filtering. (#142)
- Added `--strand` option to override automatic strandedness setting by `kallisto bus`. Available options are `unstranded`, `forward`, and `reverse`.
- Changed the `transcript_ids` column to be a semicolon-delimited string instead of a list (only applicable when `--tcc` is provided) as a workaround for an issue with writing lists to h5ad with `h5py>=3`. #141
- Added `BULK` and `SMARTSEQ2` technologies. The two technologies behave identically. The FASTQs may be provided either directly via command-line (only for multiplexed samples), in which case `kb` will perform demultiplexing, or as a single batch definition text file (only for demultiplexed samples). See https://pachterlab.github.io/kallisto/manual section about `batch.txt` for formatting. This batch text file may also contain remote URLs to FASTQ files, which will be streamed for supported operating systems. Additionally, added `--parity`, `--fragment-l` and `--fragment-s` options, which may only be provided for these technologies. The first must always be provided, indicating the parity of the reads (`single`, `paired`), and the latter two may only be provided when `--parity single` is also provided, specifying the mean length of the fragments and standard deviation of the fragment lengths.
- DEPRECATION: The `SMARTSEQ` technology has been deprecated and will be removed in the next release. Instead, `SMARTSEQ2` should be used. See previous point for more information.
- Added `SMARTSEQ3` technology.
- The full binary path is used for `--dry-run` instead of an alias.
- Added `--umi-gene` option, which deduplicates UMIs by gene. Cannot be used with smartseq or bulk technologies.
- Added `--em` option, which estimates gene abundances using the EM algorithm. Cannot be used with smartseq or bulk technologies, or with `--tcc`.
- Fixed an issue that occurs when the `-o` option to `bustools count` already exists, but as a directory. For instance, `counts_unfiltered/cells_x_genes`. Such folders are removed before running the command.
- Improved output file validation so that all expected files must exist.
- Added `--gene-names` option, which may only be used with `--h5ad` or `--loom` and not `--tcc`. By specifying this option, the output h5ad or loom matrix will be aggregated by gene names instead of IDs.
- Added support for the following technologies: `BDWTA` (BD Rhapsody), `SPLIT-SEQ`, `Visium` (10x).
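Downstream code that previously read `transcript_ids` as a list now has to split the semicolon-delimited string. A minimal sketch follows; the helper names are hypothetical, and loading the column itself is left to the reader's own tooling.

```python
from typing import List


def split_transcript_ids(value: str) -> List[str]:
    """Turn a semicolon-delimited `transcript_ids` string back into a list."""
    return [transcript_id for transcript_id in value.split(";") if transcript_id]


def join_transcript_ids(transcript_ids: List[str]) -> str:
    """Inverse operation: join a list of transcript IDs into one string."""
    return ";".join(transcript_ids)
```

Applied to `"ENST0001;ENST0002"`, `split_transcript_ids` returns `["ENST0001", "ENST0002"]`, and an empty string yields an empty list.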
v0.26.4
v0.26.3
v0.26.2
[YANKED] v0.26.1
This version has been yanked due to an issue with installation. Do not try to install this version!
General
- Added a check for whether the temporary directory exists. If it does, now prints out an error and exits. (#119)
- Logging is now handled by a specialized logger implemented in the `ngs-tools` library, which provides logger namespacing.
- Updated supported technologies text and syntax for `kb --list` so that they are more compact. Added link to the kallisto manual for custom technology definitions.
- Updated citation in `info`.
ref
- Fixed `--tmp` option to set the temporary directory properly (#122)
- Major refactor of FASTA and GTF parsing. All relevant functions were replaced with appropriate ones from the `ngs-tools` library. The ones provided in this library are far more robust in dealing with GTF entries (especially missing attributes). FASTA and GTF files no longer have to be sorted or decompressed. Together, these changes result in an approximately order-of-magnitude speedup in splitting the genomic FASTA. Additionally, more helpful error messages are printed, which should make debugging easier for users.
- Fixed an issue where no logging messages were displayed when downloading a reference with `-d`.
count
- Whitelists are now provided by the `ngs-tools` library.
v0.26.0
General
- Added the optional arguments `--kallisto` and `--bustools`, which may be used to override the packaged kallisto and bustools binaries. The argument may be a command in the user's PATH, which will be expanded to the full absolute path, or an absolute/relative path to the binary (#109, thanks @apeltzer, @dpryan79, @Maarten-vd-Sande).
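One way to picture how such an argument could be resolved is the sketch below, which uses only the Python standard library. It is an assumption about the general approach and not kb's actual code; `resolve_binary` is a hypothetical name.

```python
import os
import shutil


def resolve_binary(argument: str) -> str:
    """Resolve a command name or a path to an absolute path (illustrative only)."""
    # shutil.which searches PATH for bare command names and checks
    # explicit paths (absolute or relative) directly.
    found = shutil.which(argument)
    if found is not None:
        return os.path.abspath(found)
    # Fall back to a plain file check for paths that are not marked executable.
    if os.path.isfile(argument):
        return os.path.abspath(argument)
    raise FileNotFoundError(f"could not find binary: {argument}")
```

Under this sketch, both a bare command name such as `kallisto` and a relative path such as `./bin/kallisto` end up as absolute paths, or raise an error when nothing is found.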
ref
- Any spaces in GTF groups are now removed. For instance, if a transcript has ID `TRANSCRIPT ID` then the resulting transcript sequence will be named `TRANSCRIPTID`. (#97, thanks @axelalmet)
count
- Fixed an issue where converting the count matrix using `--loom` and `--workflow lamanno` would cause an error (#91)
- Fixed an issue with parsing FASTQ paths when using `-x smartseq`, where the second read file would be erroneously used as the first (#114, thanks @jma1991)
- Added entries to indicate the current working directory when the `kb` command was called, along with the `kallisto` and `bustools` binary paths and versions in `kb_info.json`.
v0.25.1
count
- Fixed `loompy does not accept empty matrices as data` error when providing `--loom` with `--workflow lamanno` (#91)
- When using `--h5ad` or `--loom` with `-x smartseq`, the output matrix has genes as columns, instead of transcripts. For genes that have multiple transcripts, the counts are added. (#93)
- For `-x smartseq`, it is now possible to provide a batch TSV instead of FASTQs directly. The batch TSV must contain exactly three columns: cell ID, FASTQ 1 (read 1), FASTQ 2 (read 2).
- Added an error when an uneven number of FASTQs are provided for `-x smartseq` (only paired-end reads are currently supported)
- Turned off all logging and warning messages from `h5py` and `anndata`.
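A batch TSV with the three columns described above could be generated with a short Python sketch like the one below. The function name `format_batch_tsv` is hypothetical. The sketch assumes the file has no header row, which the notes do not state either way.

```python
import csv
import io
from typing import Iterable, Tuple


def format_batch_tsv(cells: Iterable[Tuple[str, str, str]]) -> str:
    """Format (cell ID, FASTQ 1, FASTQ 2) records as tab-separated text.

    Assumption: no header row is written.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for cell_id, fastq_1, fastq_2 in cells:
        writer.writerow([cell_id, fastq_1, fastq_2])
    return buffer.getvalue()
```

The returned string can be written to a file and passed to `kb count` in place of the FASTQ list.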
v0.25.0
ref
- Progress bar is now displayed when downloading pre-packaged reference files.
- Added checks to provide more useful outputs for common errors, including: 1) when FASTA and GTF chromosomes do not match, 2) when a GTF entry is not parsable, and 3) when either `transcript` or `exon` entry for a transcript is missing in the GTF (both are required).
- Added `-k` option to override default (or calculated optimal) kmer length for the Kallisto index.
- Added functionality to generate a feature barcode reference for use with the KITE feature-barcoding workflow. To use this option, supply `--workflow kite` and a feature-barcode to cell-barcode mapping.
- Added `-n` option to be able to split indices into `n` parts. This reduces the maximum memory used at any given time. Useful for running in memory-limited environments. When the `-n` option is used, the `-i` argument is used as the prefix to the `n` indices generated. Each of these indices is appended with a `.i` where `i` is the index number, starting from `i=0`. When `-n` is used, the built indices must be passed in as a comma-delimited list to `kb count` (NOTE: this feature is EXPERIMENTAL. See `count` for more details). When `-n` is used with `--workflow lamanno` or `--workflow nucleus`, only the intron FASTA is split into `n-1` parts, which are then each indexed separately. The cDNA FASTA is indexed in its entirety and is never split.
- Added functionality to build a single index using multiple references. Useful for mixed species experiments. The `fasta` argument should be a comma-delimited list of genome FASTAs, and the `gtf` argument should be a comma-delimited list of GTFs, corresponding in position to each genome FASTA.
- Added `--tmp` option to manually specify temporary directory. Otherwise, behavior is identical to previous version (`tmp` directory at the location `kb` is executed).
- Added support for IUPAC nucleotide code. Note that `kallisto` replaces non-ACGUT nucleotides with pseudorandom ones. Thanks @Maarten-vd-Sande
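The naming rule for split indices (the `-i` prefix followed by `.i`, starting at `i=0`) and the comma-delimited list expected by `kb count` can be illustrated with a small Python sketch. The helper names are hypothetical, and the sketch only covers the plain `-n` case, not the `lamanno`/`nucleus` workflows.

```python
from typing import List


def split_index_paths(prefix: str, n: int) -> List[str]:
    """Return the n index paths produced for a given -i prefix: prefix.0, prefix.1, ..."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [f"{prefix}.{i}" for i in range(n)]


def index_argument(prefix: str, n: int) -> str:
    """Return the comma-delimited list to pass to `kb count -i`."""
    return ",".join(split_index_paths(prefix, n))
```

For a prefix of `index.idx` and `n=3`, the indices are `index.idx.0`, `index.idx.1` and `index.idx.2`, and the value for `-i` is those three names joined by commas.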
count
- Added support for KITE feature-barcoding workflow. The `bustools` binary was updated to support this feature.
- DEPRECATION: The `--lamanno` and `--nucleus` flags will be deprecated in the next release. These have been replaced with `--workflow lamanno` and `--workflow nucleus`.
- All BUS files that are input/outputs are validated before/after running `kallisto` or `bustools`. A BUS file is considered valid if it is read with `bustools` without error and it has a positive number of BUS records. This should prevent `bustools` from trying to sort empty BUS files and crashing (#31).
- Added functionality to generate TCC matrices with the `--tcc` flag.
- Added `--tcc` flag to include reads that pseudoalign to multiple genes.
- When running in verbose mode (`--verbose`), commands are no longer printed with the full path to the `bustools` and `kallisto` binaries. These paths are printed once at the start of the program.
- Added `--dry-run` flag, which prints the entire workflow to standard output as shell commands, without actually running them.
- EXPERIMENTAL: Added support for multiple indices by passing a comma-delimited list of indices to `-i`. `kb` will align the reads to each of these indices and merge the BUS files with `bustools mash` and `bustools merge`. This feature is currently EXPERIMENTAL, and there are known issues that cause the loss of reads. This feature will be fully supported in a future release. In the meantime, use at your own risk!
- Added `--tmp` option to manually specify temporary directory. The default behavior has also changed: the default `tmp` directory is created IN THE OUTPUT FOLDER (specified by `-o`). Previously, the `tmp` directory was created where `kb` was run, which was causing issues when running multiple instances of `kb` from the same location. Thanks to @Munfred and @kokitsuyuzaki for the suggestion.
- `kb` now outputs a `kb_info.json` which includes useful run information, such as the commands run and their runtimes.
- Added functionality to generate a brief standalone HTML report that includes basic statistics (run_info.json, inspect.json) and quality-control plots (knee plot, elbow plot, pca, genes detected). This feature is available with the `--report` flag. Using this flag on velocity matrices may cause `kb` to crash due to high memory usage, and a corresponding warning is printed at the start. Plots for TCC matrices are not supported.
- When the matrix is converted to H5AD or Loom format (using the `--h5ad` or `--loom` options), the gene/feature names are included as a column in the `var` of the anndata. Related to #52
- Added a `--cellranger` option, which converts the raw gene matrices to cellranger-compatible format in a separate `cellranger` directory for `standard` workflow (and `cellranger_spliced` and `cellranger_unspliced` for `velocity` and `nucleus` workflows). Note that cellranger outputs matrices with genes as rows and cells (barcodes) as columns.
- Added `--mm` flag to include bus records that pseudoalign to multiple genes, via the `--multimapping` flag in `bustools count` (#57).
- `None` can be provided as the whitelist, which will force `kb` to use the `bustools whitelist` command, even if there exists a pre-packaged whitelist.
- Added support for Smart-seq reads with `-x smartseq`. FASTQs are paired by first sorting the list of FASTQ paths in lexicographical order, and taking every two to be a pair. For instance, if `1.fastq 3.fastq 2.fastq 4.fastq` is provided, `1.fastq` and `2.fastq` will be a pair, and `3.fastq` and `4.fastq` will be another pair. The FASTQ argument now supports glob expressions to make it easier to provide a long list of FASTQs.
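The Smart-seq pairing rule (sort lexicographically, then take every two paths as a pair) can be reproduced with a minimal Python sketch. `pair_fastqs` is a hypothetical name and this is not kb's own code. The check for an odd number of files mirrors the error added in v0.25.1.

```python
from typing import List, Sequence, Tuple


def pair_fastqs(paths: Sequence[str]) -> List[Tuple[str, str]]:
    """Sort FASTQ paths lexicographically and group consecutive paths into pairs."""
    ordered = sorted(paths)
    if len(ordered) % 2 != 0:
        raise ValueError("an even number of FASTQs is required for paired-end reads")
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]
```

Given `["1.fastq", "3.fastq", "2.fastq", "4.fastq"]`, this returns `[("1.fastq", "2.fastq"), ("3.fastq", "4.fastq")]`. Because the sort is lexicographical rather than numeric, `10.fastq` sorts before `2.fastq`, so zero-padded file names are the safer choice.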
v0.24.4
--info
- Fix typo with `indropsv3`
ref
- If any input (FASTA or GTF) files are provided as gzip files, they are uncompressed to the temporary directory, instead of being streamed directly. This is because `ref` relies on being able to access arbitrary locations of the files quickly. Working with decompressed files results in a considerable speedup.
count
- For `--lamanno`: spliced and unspliced busfiles no longer contain the `.s` suffix. This was done to make the output consistent with the normal (non-`--lamanno`) command
- Implemented `--filter` with `--lamanno`
- Support for single nuclei RNA-seq with `--nuclei`. The only difference between `--nuclei` and `--lamanno` is how the spliced and unspliced matrices are combined. Specifically, `--nuclei` sums the matrices. Using `--nuclei` with neither `--loom` nor `--h5ad` results in behavior identical with `--lamanno`.
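The summing step for `--nuclei` amounts to an element-wise addition of the spliced and unspliced count matrices. The sketch below uses plain nested lists and a hypothetical function name. It assumes both matrices share the same barcode (row) and gene (column) ordering, which the notes do not spell out.

```python
from typing import List, Sequence


def sum_matrices(
    spliced: Sequence[Sequence[float]], unspliced: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Element-wise sum of two equally shaped count matrices (rows = barcodes)."""
    if len(spliced) != len(unspliced):
        raise ValueError("matrices must have the same number of rows")
    summed = []
    for row_s, row_u in zip(spliced, unspliced):
        if len(row_s) != len(row_u):
            raise ValueError("matrices must have the same number of columns")
        summed.append([s + u for s, u in zip(row_s, row_u)])
    return summed
```

For real data, a sparse matrix library would be the practical choice; the nested lists here only keep the example dependency-free.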