Contributions welcome :)
The BPCells 0.3.0 release covers 6 months of changes and 45 commits from 5 contributors. Notable improvements this release include support for peak calling with MACS and the addition of pseudobulk matrix and stats calculations. We also released an initial prototype of a BPCells Python library (more details here). Full details of changes below.
Thanks to @ycli1995, @Yunuuuu, and @douglasgscofield for pull requests that contributed to this release, as well as to users who submitted GitHub issues to help identify and fix bugs. We also added @immanuelazn to the team as a new hire! He is responsible for many of the new features in this release and will continue to help with maintenance and new development moving forward.
apply_by_col() and apply_by_row() allow providing custom R functions to compute per row/col summaries. In initial tests, calculating row/col means using R functions is ~2x slower than the C++-based implementation, but memory usage remains low.
Added rowMaxs() and colMaxs() functions, which return the maximum value in each row or column of a matrix. If the matrixStats or MatrixGenerics packages are installed, BPCells::rowMaxs() will fall back to their implementations for non-BPCells objects. Thanks to @immanuelazn for their first contribution as a new lab hire!
Added regress_out() to allow removing unwanted sources of variation via least squares linear regression models. Thanks to @ycli1995 for pull request #110.
Added trackplot_genome_annotation() for plotting peaks, with options for directional arrows, colors, labels, and peak widths. (pull request #113)
Added call_peaks_macs() for peak calling with MACS (pull request #118). Note: renamed from call_macs_peaks() in pull request #143.
Added rowQuantiles() and colQuantiles() functions, which return the quantiles of each row/column of a matrix. Currently rowQuantiles() only works on row-major matrices and colQuantiles() only works on col-major matrices. If the matrixStats or MatrixGenerics packages are installed, BPCells::colQuantiles() will fall back to their implementations for non-BPCells objects. (pull request #128)
Added pseudobulk_matrix(), which allows pseudobulk aggregation by sum or mean and calculation of per-pseudobulk variance and nonzero statistics for each gene. (pull request #128)
trackplot_loop()
now accepts discrete color scales.
trackplot_combine() now has smarter layout logic for margins, and detects when the plots being combined cover different genomic regions. (pull request #116)
select_cells() and select_chromosomes() now also allow using a logical mask for selection. (pull request #117)
LDFLAGS or CFLAGS can now be set as environment variables, in addition to setting them in ~/.R/Makevars. (pull request #124)
open_matrix_anndata_hdf5() now supports reading AnnData matrices in the dense format. (pull request #146)
cluster_graph_leiden() now has better defaults that produce reasonable cluster counts regardless of dataset size. (pull request #147)
Fixed trackplot_coverage() with fragments from a single cluster. (Thanks to @sjessa for directly reporting this bug and coming up with a fix.)
Fixed trackplot_coverage() when called with ranges less than 500 bp in length. (Thanks to @bettybliu for directly reporting this bug.)
Fixed tile_matrix() with fragment mode. (pull request #141)
Fixed sctransform_pearson() on ARM architecture. (pull request #141)
Fixed pseudobulk_matrix() when it gets an integer matrix. (pull request #174)
The trackplot_coverage() legend_label argument is now ignored, as the color legend is no longer shown by default for coverage plots.
We are finally declaring a new release version, covering a large number of changes and improvements over the past year. Among the major features here are parallelization options for svds() and matrix_stats(), improved genomic track plots, and runtime CPU feature detection for SIMD code (enables higher performance, more portable builds). Full details of changes below.
This version also comes with a new installation path, in preparation for a future Python package release: the repository can then have one folder for R and one for Python, rather than having all the R files sit in the root folder. This is a breaking change and requires a slightly modified installation command.
Thanks to @brgew, @ycli1995, and @Yunuuuu for pull requests that contributed to this release, as well as all users who submitted GitHub issues to help identify and fix bugs.
remotes::install_github("bnprks/BPCells/r") (note the additional /r)
Users with a packages.json config will need to add "subdir": "r" to it.
all_matrix_inputs(). Outside of loading old RDS files, no changes should be needed.
trackplot_gene() now returns a plot with a facet label to match the new trackplot system. This label can be removed by calling trackplot_gene(...) + ggplot2::facet_null() to be equivalent to the old function's output.
draw_trackplot_grid() is deprecated, replaced by trackplot_combine() with simplified arguments.
trackplot_bulk() has been deprecated, replaced by trackplot_coverage() with equivalent functionality.
svds()
function, based on the excellent Spectra C++ library (used in RSpectra) by Yixuan Qiu. This should ensure lower memory usage compared to irlba, while achieving similar speed and accuracy.
Added a threads argument to matrix_stats() and svds().
marker_features() and writing a matrix to disk remain single-threaded.
Running svds() with many threads on gene-major matrices can result in high memory usage for now. This problem is not present for cell-major matrices.
Added import_matrix_market() and the convenience function import_matrix_market_10x(). Our implementation uses disk-backed sorting to allow importing large files with low memory usage.
Added binarize() function and associated generics <, <=, >, and >=. This only supports comparison with non-negative numbers currently. (Thanks to contribution from @brgew)
Added round() matrix transformation. (Thanks to contributions from @brgew)
Added all_matrix_inputs() to help enable relocating the underlying storage for BPCells matrix transform objects.
Added a gzip_level parameter, which will enable a shuffle + gzip filter for compression. This is generally much slower than bitpacking compression, but it adds improved storage options for files that must be read by outside programs. Thanks to @ycli1995 for submitting this improvement in pull #42.
Added write_matrix_anndata_hdf5(). (issue #49)
Added support for subset assignment (m1[i,j] <- m2). Note that this does not modify data on disk. Instead, it uses a series of subsetting and concatenation operations to provide the appearance of overwriting the appropriate entries.
Added knn_to_geodesic_graph(), which matches the Scanpy default construction for graph-based clustering.
Added checksum(), which allows for calculating an MD5 checksum of a matrix's contents. Thanks to @brgew for submitting this improvement in pull request #83.
write_insertion_bedgraph() allows exporting pseudobulk insertion data to bedgraph format.
c()
now handles inputs with mismatched chromosome names.
knn_to_snn_graph() should work more smoothly on large datasets due to its C++ implementation.
Improved marker_features() for samples with millions of cells and a large number of clusters to compare.
Subsetting with [ now propagates through, so we always avoid computing parts of the peak/tile matrix that have been discarded by our subset. Subsetting a tile matrix will automatically convert into a peak matrix when possible for improved efficiency.
as.matrix() will produce integer matrices when appropriate. (Thanks to @Yunuuuu in pull #77)
trackplot_combine()
trackplot_gene() now draws arrows for the direction of transcription.
trackplot_loop() is a new track type that allows plotting interactions between genomic regions, for instance peak-gene correlations or loop calls from Hi-C.
trackplot_scalebar() is added to show genomic scale.
set_trackplot_label() and set_trackplot_height()
rowVars() and colVars() functions, as convenience wrappers around matrix_stats(). If the matrixStats or MatrixGenerics packages are installed, BPCells::rowVars() will fall back to their implementations for non-BPCells objects. Unfortunately, matrixStats::rowVars() is not generic, so either BPCells::rowVars() or BPCells::colVars()
highway. The Pow SIMD implementation is removed, but Square gets a new SIMD implementation.
log1p(), and sctransform_pearson()
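The subset assignment entry earlier in this list (m1[i,j] <- m2) is described as a composition of subsetting and concatenation rather than an in-place write. The idea can be sketched in a few lines of Python on plain nested lists with a contiguous block. This is only an illustration of the concept: the function name assign_block and the list-of-rows representation are assumptions made for the example and are not BPCells code.

```python
def assign_block(m1, row_start, col_start, m2):
    """Return a new matrix equal to m1 with the block whose top-left corner is
    (row_start, col_start) replaced by m2.

    m1 is never modified: the result is built only from slices (subsetting)
    and list joins (concatenation). Hypothetical helper for illustration.
    """
    if not m2 or not m2[0]:
        raise ValueError("replacement block must be non-empty")
    n_rows, n_cols = len(m2), len(m2[0])
    row_end, col_end = row_start + n_rows, col_start + n_cols
    if row_end > len(m1) or col_end > len(m1[0]):
        raise ValueError("replacement block does not fit inside m1")

    # Rows above and below the block are taken unchanged (copied).
    top = [row[:] for row in m1[:row_start]]
    bottom = [row[:] for row in m1[row_end:]]
    # Rows crossing the block: left part of m1 + row of m2 + right part of m1.
    middle = [
        m1[row_start + k][:col_start] + m2[k] + m1[row_start + k][col_end:]
        for k in range(n_rows)
    ]
    return top + middle + bottom


original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
patched = assign_block(original, 1, 1, [[0, 0], [0, 0]])
print(patched)   # the lower-right 2x2 block is replaced
print(original)  # the source matrix is untouched
```

Because the result is assembled from views of the inputs, the source data never has to be rewritten, which mirrors the note that data on disk is not modified.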
Fixed cases where chrNames(frags) <- val or cellNames(frags) <- val could cause downstream errors.
Fixed transpose_storage_order() for matrices with >4 billion non-zero entries.
Fixed transpose_storage_order() for matrices with no non-zero entries.
rownames() or colnames() is now propagated when saving matrices. (Issue #29 reported thanks to @realzehuali, with an additional fix after report thanks to @Dario-Rocha)
Fixed marker_features() for features with more than 2.6 million zeros.
Fixed calling convert_matrix_type() twice in a row such that it cancels out (e.g. double -> uint32_t -> double). Thanks to @brgew for reporting issue #43.
Fixed svds() not handling row-major matrices correctly. Thanks to @ycli1995 for reporting this in issue #55.
Fixed [<-. Thanks to @Yunuuuu for identifying issue #67.
Fixed transpose_storage_order() on a densely-transformed matrix. Thanks to @Yunuuuu for reporting this in issue #71.
readRDS() can be used from different working directories.
footprints() now respects user interrupts via Ctrl-C.
+, -, *, /, and log1p for streaming normalization, along with other less common operations. This allows implementation of ATAC-seq LSI and Seurat default normalization, along with most published log-based normalizations.
Note: All operations interoperate with all storage formats. For example, all matrix operations can be applied directly to an AnnData or 10x matrix file. In many cases the bitpacking-compressed formats will provide performance/space advantages, but are not required to use the computations.