Title: | Single Cell Counts Matrices to PCA |
---|---|
Description: | > Efficient operations for single cell ATAC-seq fragments and RNA counts matrices. Interoperable with standard file formats, and introduces efficient bit-packed formats that allow large storage savings and increased read speeds. |
Authors: | Benjamin Parks [aut, cre, cph] , Stanford University [cph, fnd], Genentech, Inc. [cph, fnd] |
Maintainer: | Benjamin Parks <[email protected]> |
License: | Apache-2.0 or MIT |
Version: | 0.2.0 |
Built: | 2024-11-16 08:28:20 UTC |
Source: | https://github.com/bnprks/BPCells |
Convenience functions for adding or multiplying each row / column of a matrix by a number.
add_rows(mat, vec) add_cols(mat, vec) multiply_rows(mat, vec) multiply_cols(mat, vec)
add_rows(mat, vec) add_cols(mat, vec) multiply_rows(mat, vec) multiply_cols(mat, vec)
mat |
Matrix-like object |
vec |
Numeric vector |
Matrix-like object
A matrix object can either be an input (i.e. a file on disk or a raw matrix in memory),
or it can represent a delayed operation on one or more matrices. The all_matrix_inputs()
getter and setter functions allow accessing the base-level input matrices as a list, and
changing them. This is useful if you want to re-locate data on disk without losing your
transformed BPCells matrix. (Note: experimental API; potentially subject to revisions).
all_matrix_inputs(x) all_matrix_inputs(x) <- value
all_matrix_inputs(x) all_matrix_inputs(x) <- value
x |
IterableMatrix |
value |
List of IterableMatrix objects |
List of IterableMatrix objects. If a matrix m
is itself an input object, then
all_matrix_inputs(m)
will return list(m)
.
Apply a custom R function to each row/col of a BPCells matrix. This will run slower than the builtin C++-backed functions, but will keep most of the memory benefits from disk-backed operations.
apply_by_row(mat, fun, ...) apply_by_col(mat, fun, ...)
apply_by_row(mat, fun, ...) apply_by_col(mat, fun, ...)
mat |
IterableMatrix object |
fun |
|
... |
Optional additional arguments passed to |
These functions require row-major matrix storage for apply_by_row and col-major storage for apply_by_col,
so matrices stored in the wrong order may neeed a re-ordered copy created using transpose_storage_order()
first.
This is required to be able to keep memory-usage low and allow calculating the result with a single streaming pass of the
input matrix.
If vector/matrix outputs are desired instead of lists, calling unlist(x)
or do.call(cbind, x)
or do.call(rbind, x)
can convert the list output.
apply_by_row - A list of length nrow(matrix)
with the results returned by fun()
on each row
apply_by_col - A list of length ncol(matrix)
with the results returned by fun()
on each row
For an interface more similar to base::apply
, see the BPCellsArray
project. For calculating colMeans on a sparse single cell RNA matrix it is about 8x slower than apply_by_col
, due to the
base::apply
interface not being sparsity-aware. (See pull request #104 for benchmarking.)
Binarize compares the matrix element values to the
threshold value and sets the output elements to either zero
or one. By default, element values greater than the threshold
are set to one; otherwise, set to zero. When strict_inequality
is set to FALSE, element values greater than or equal to the
threshold are set to one. As an alternative, the <
, <=
, >
,
and >=
operators are also supported.
binarize(mat, threshold = 0, strict_inequality = TRUE)
binarize(mat, threshold = 0, strict_inequality = TRUE)
mat |
IterableMatrix |
threshold |
A numeric value that determines whether the elements of x are set to zero or one. |
strict_inequality |
A logical value determining whether the comparison to the threshold is >= (strict_inequality=FALSE) or > (strict_inequality=TRUE). |
binarized IterableMatrix object
Export pseudobulk bed files as input for MACS, then run MACS and read the output peaks as a tibble. Each step can can be run independently, allowing for quickly re-loading the results of an already completed call, or running MACS externally (e.g. via cluster job submisison) for increased parallelization. See details for more information.
call_peaks_macs( fragments, path, cell_groups = rlang::rep_along(cellNames(fragments), "all"), effective_genome_size = 2.9e+09, insertion_mode = c("both", "start_only", "end_only"), step = c("all", "prep-inputs", "run-macs", "read-outputs"), macs_executable = NULL, additional_params = "--call-summits --keep-dup all --shift -75 --extsize 150 --nomodel --nolambda", verbose = FALSE, threads = 1 )
call_peaks_macs( fragments, path, cell_groups = rlang::rep_along(cellNames(fragments), "all"), effective_genome_size = 2.9e+09, insertion_mode = c("both", "start_only", "end_only"), step = c("all", "prep-inputs", "run-macs", "read-outputs"), macs_executable = NULL, additional_params = "--call-summits --keep-dup all --shift -75 --extsize 150 --nomodel --nolambda", verbose = FALSE, threads = 1 )
fragments |
IterableFragments object |
path |
(string) Parent directory to store MACS inputs and outputs.
Inputs are stored in |
cell_groups |
Grouping vector with one entry per cell in fragments, e.g. cluster IDs |
effective_genome_size |
(numeric) Effective genome size for MACS. Default is |
insertion_mode |
(string) Which fragment ends to use for coverage calculation. One of |
step |
(string) Which step to run. One of |
macs_executable |
(string) Path to either MACS2/3 executable. Default ( |
additional_params |
(string) Additional parameters to pass to MACS2/3. |
verbose |
(bool) Whether to provide verbose output from MACS. Only used if step is |
threads |
(int) Number of threads to use. |
File format:
Inputs are written such that a bed file used as input into MACS, as well as a shell file containing a call to MACS are written for each cell group.
Bed files containing chr
, start
, and end
coordinates of insertions are written at <path>/input/<group>.bed.gz
.
Shell commands to run MACS manually are written at <path>/input/<group>.sh
.
Outputs are written to an output directory with a subdirectory for each cell group. Each cell group's output directory contains a file for narrowPeaks, peaks, and summits.
NarrowPeaks are written at <path>/output/<group>/<group>_peaks.narrowPeak
.
Peaks are written at <path>/output/<group>/<group>_peaks.xls
.
Summits are written at <path>/output/<group>/<group>_summits.bed
.
Only the narrowPeaks file is read into a tibble and returned. For more information on outputs from MACS, visit the MACS docs
Performance:
Running on a 2600 cell dataset and taking both start and end insertions into account, written input bedfiles and MACS outputs used 364 MB and 158 MB of space respectively. With 4 threads, running this function end to end took 74 seconds, with 61 of those seconds spent on running MACS.
Running MACS manually:
To run MACS manually, you will first run call_peaks_macs()
with step="prep-inputs
. Then, manually run all of the
shell scripts generated at <path>/input/<group>.sh
. Finally, run call_peaks_macs()
again with the same original arguments, but
setting step="read-outputs"
.
If step is prep-inputs
, return script paths for each cell group given as a character vector.
If step is run-macs
, return NULL
.
If step is read-outputs
or all
, returns a tibble with all the peaks from each cell group concatenated.
Columnns are chr
, start
, end
, group
, name
, score
, strand
, fold_enrichment
, log10_pvalue
, log10_qvalue
, summit_offset
Calling peaks from a pre-set list of tiles can be much faster than using
dedicated peak-calling software like macs3
. The resulting peaks are less
precise in terms of exact coordinates, but should be sufficient for most
analyses.
call_peaks_tile( fragments, chromosome_sizes, cell_groups = rep.int("all", length(cellNames(fragments))), effective_genome_size = NULL, peak_width = 200, peak_tiling = 3, fdr_cutoff = 0.01, merge_peaks = c("all", "group", "none") )
call_peaks_tile( fragments, chromosome_sizes, cell_groups = rep.int("all", length(cellNames(fragments))), effective_genome_size = NULL, peak_width = 200, peak_tiling = 3, fdr_cutoff = 0.01, merge_peaks = c("all", "group", "none") )
fragments |
IterableFragments object |
chromosome_sizes |
Chromosome start and end coordinates given as GRanges, data.frame, or list. See
See |
cell_groups |
Grouping vector with one entry per cell in fragments, e.g. cluster IDs |
effective_genome_size |
(Optional) effective genome size for poisson background rate estimation. See deeptools for values for common genomes. Defaults to sum of chromosome sizes, which overestimates peak significance |
peak_width |
Width of candidate peaks |
peak_tiling |
Number of candidate peaks overlapping each base of genome. E.g. peak_width = 300 and peak_tiling = 3 results in candidate peaks of 300bp spaced 100bp apart |
fdr_cutoff |
Adjusted p-value significance cutoff |
merge_peaks |
How to merge significant peaks with
|
Peak calling steps:
Estimate the genome-wide expected insertions per tile based on
peak_width
, effective_genome_size
, and per-group read counts
Tile the genome with nonoverlapping tiles of size peak_width
For each tile and group, calculate p_value based on a Poisson model
Compute adjusted p-values using BH method and using the total number of tiles as the number of hypotheses tested.
Repeat steps 2-4 peak_tiling
times, with evenly spaced offsets
If merge_peaks
is "all" or "group": use merge_peaks_iterative()
within each group to keep only the most
significant of the overlapping candidate peaks
If merge_peaks
is "all", perform a final round of merge_peaks_iterative()
,
prioritizing each peak by its within-group significance rank
tibble with peak calls and the following columns:
chr
, start
, end
: genome coordinates
group
: group ID that this peak was identified in
p_val
, q_val
: Poission p-value and BH-corrected p-value
enrichment
: Enrichment of counts in this peak compared to a genome-wide
background
Calculate the MD5 checksum of an IterableMatrix and return the checksum in hexidecimal format.
checksum(matrix)
checksum(matrix)
matrix |
IterableMatrix object |
checksum()
converts the non-zero elements of the sparse input matrix to double
precision, concatenates each element value with the element row and column index words,
and uses these 16-byte blocks along with the matrix dimensions and row and column
names to calculate the checksum. The checksum value depends on the storage order so
column- and row-order matrices with the same element values give different checksum
values. checksum()
uses element and index values in little-endian CPU storage order.
It converts to little-endian order on big-endian architecture although this has not
been tested.
MD5 checksum string in hexidecimal format.
library(Matrix) library(BPCells) m1 <- matrix(seq(1,12), nrow=3) m2 <- as(m1, 'dgCMatrix') m3 <- as(m2, 'IterableMatrix') checksum(m3)
library(Matrix) library(BPCells) m1 <- matrix(seq(1,12), nrow=3) m2 <- as(m1, 'dgCMatrix') m3 <- as(m2, 'IterableMatrix') checksum(m3)
Cluster an adjacency matrix
cluster_graph_leiden( snn, resolution = 1, objective_function = c("modularity", "CPM"), seed = 12531, ... ) cluster_graph_louvain(snn, resolution = 1, seed = 12531) cluster_graph_seurat(snn, resolution = 0.8, ...)
cluster_graph_leiden( snn, resolution = 1, objective_function = c("modularity", "CPM"), seed = 12531, ... ) cluster_graph_louvain(snn, resolution = 1, seed = 12531) cluster_graph_seurat(snn, resolution = 0.8, ...)
snn |
Symmetric adjacency matrix (dgCMatrix) output from e.g. |
resolution |
Resolution parameter. Higher values result in more clusters |
objective_function |
Graph statistic to optimize during clustering. Modularity is the default as it keeps resolution independent of dataset size (see details below).
For the meaning of each option, see |
seed |
Random seed for clustering initialization |
... |
Additional arguments to underlying clustering function |
cluster_graph_leiden: Leiden clustering algorithm igraph::cluster_leiden()
.
Note that when using objective_function = "CPM"
the number of clusters empirically scales with cells * resolution
,
so 1e-3 is a good resolution for 10k cells, but 1M cells is better with a 1e-5 resolution. A resolution of 1 is a
good default when objective_function = "modularity"
per the default.
cluster_graph_louvain: Louvain graph clustering algorithm igraph::cluster_louvain()
cluster_graph_seurat: Seurat's clustering algorithm Seurat::FindClusters()
Factor vector containing the cluster assignment for each cell.
Converts a vector of membership IDs into a sparse matrix
cluster_membership_matrix(groups, group_order = NULL)
cluster_membership_matrix(groups, group_order = NULL)
groups |
Vector with one entry per cell, specifying the cell's group |
group_order |
Optional vector listing ordering of groups |
cell x group matrix where an entry is 1 when a cell is in a given group
Helper function for data on features to plot from a diverse set of data sources.
collect_features( source, features = NULL, gene_mapping = human_gene_mapping, n = 1 )
collect_features( source, features = NULL, gene_mapping = human_gene_mapping, n = 1 )
source |
Matrix or data frame to pull features from, or a vector of feature values for a single feature. For a matrix, the features must be rows. |
features |
Character vector of features names to plot if |
gene_mapping |
An optional vector for gene name matching with match_gene_symbol(). Ignored if source is a data frame. |
n |
Internal-use parameter marking the number of nested calls. This is used for finding the name of the "source" input variable from the caller's perspective |
If source
is a data.frame, features will be drawn from the columns. If
source
is a matrix object (IterableMatrix
, dgCMatrix
, or matrix
), features
will be drawn from rows.
Data frame with one column for each feature requested
Convert the type of a matrix
convert_matrix_type(matrix, type = c("uint32_t", "double", "float"))
convert_matrix_type(matrix, type = c("uint32_t", "double", "float"))
matrix |
IterableMatrix object input |
type |
One of uint32_t (unsigned 32-bit integer), float (32-bit real number), or double (64-bit real number) |
IterableMatrix object
BPCells fragments can be interconverted with GRanges and data.frame R objects.
The main conversion method is R's builtin as()
function, though the
convert_to_fragments()
helper is also available. For all R objects except
GRanges, BPCells assumes a 0-based, end-exclusive coordinate system. (See
genomic-ranges-like reference for details)
# Convert from R to BPCells convert_to_fragments(x, zero_based_coords = !is(x, "GRanges")) as(x, "IterableFragments") # Convert from BPCells to R as.data.frame(bpcells_fragments) as(bpcells_fragments, "data.frame") as(bpcells_fragments, "GRanges")
# Convert from R to BPCells convert_to_fragments(x, zero_based_coords = !is(x, "GRanges")) as(x, "IterableFragments") # Convert from BPCells to R as.data.frame(bpcells_fragments) as(bpcells_fragments, "data.frame") as(bpcells_fragments, "GRanges")
x |
Fragment coordinates given as GRanges, data.frame, or list. See
|
zero_based_coords |
Whether to convert the ranges from a 1-based end-inclusive coordinate system to a 0-based end-exclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) |
convert_to_fragments(): IterableFragments object
These color palettes are derived from the ArchR color palettes, and provide large sets of distinguishable colors
discrete_palette(name, n = 1) continuous_palette(name)
discrete_palette(name, n = 1) continuous_palette(name)
name |
Name of the color palette. Valid discrete palettes are: |
n |
Minimum number of colors needed |
If the requested number of colors is too large, a new palette will be constructed via interpolation from the requested palette
Character vector of hex color codes
Extend genome ranges in a strand-aware fashion.
extend_ranges( ranges, upstream = 0, downstream = 0, metadata_cols = c("strand"), chromosome_sizes = NULL, zero_based_coords = !is(ranges, "GRanges") )
extend_ranges( ranges, upstream = 0, downstream = 0, metadata_cols = c("strand"), chromosome_sizes = NULL, zero_based_coords = !is(ranges, "GRanges") )
ranges |
Genomic regions given as GRanges, data.frame, or list. See
|
upstream |
Number of bases to extend each range upstream (negative to shrink width) |
downstream |
Number of bases to extend each range downstream (negative to shrink width) |
metadata_cols |
Optional list of metadata columns to require & extract |
chromosome_sizes |
(optional) Size of chromosomes as a genomic-ranges object |
zero_based_coords |
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range |
Note that ranges will be blocked from extending past the beginning of the chromosome (base 0),
and if chromosome_sizes
is given then they will also be blocked from extending past the end of the chromosome
Get footprints around a set of genomic coordinates
footprint( fragments, ranges, zero_based_coords = !is(ranges, "GRanges"), cell_groups = rlang::rep_along(cellNames(fragments), "all"), cell_weights = rlang::rep_along(cell_groups, 1), flank = 125L, normalization_width = flank%/%10L )
footprint( fragments, ranges, zero_based_coords = !is(ranges, "GRanges"), cell_groups = rlang::rep_along(cellNames(fragments), "all"), cell_weights = rlang::rep_along(cell_groups, 1), flank = 125L, normalization_width = flank%/%10L )
fragments |
IterableFragments object |
ranges |
Footprint centers given as GRanges, data.frame, or list. See
"+" strand ranges will footprint around the start coordinate, and "-" strand ranges around the end coordinate. |
zero_based_coords |
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range |
cell_groups |
Character or factor assigning a group to each cell, in order of
|
cell_weights |
Numeric vector assigning weight factors
(e.g. inverse of total reads) to each cell, in order of |
flank |
Number of flanking basepairs to include on either side of the motif |
normalization_width |
Number of basepairs at the upstream + downstream extremes to use for calculating enrichment |
tibble::tibble()
with columns group
, position
, and count
, enrichment
Check if two fragments objects are identical
fragments_identical(fragments1, fragments2)
fragments_identical(fragments1, fragments2)
fragments1 |
First IterableFragments to compare |
fragments2 |
Second IterableFragments to compare |
boolean for whether the fragments objects are identical
Conveniently look up the region of a gene by gene symbol. The value returned by this function
can be used as the region
argument for trackplot functions such as
trackplot_coverage()
or trackplot_gene()
gene_region( genes, gene_symbol, extend_bp = c(10000, 10000), gene_mapping = human_gene_mapping )
gene_region( genes, gene_symbol, extend_bp = c(10000, 10000), gene_mapping = human_gene_mapping )
genes |
Transcipt features given as GRanges, data.frame, or list. See
|
gene_symbol |
Name of gene symbol or ID |
extend_bp |
Bases to extend region upstream and downstream of gene. If length 1, extension is symmetric. If length 2, provide upstream extension then downstream extension as positive distances. |
gene_mapping |
Named vector where names are gene symbols or IDs and values are canonical gene symbols |
List of chr, start, end positions for use with trackplot functions.
ArchR-style gene activity scores are based on a weighted sum of each tile according to the signed distance from the tile to a gene body. This function calculates the signed distances according to ArchR's default parameters.
gene_score_tiles_archr( genes, chromosome_sizes = NULL, tile_width = 500, addArchRBug = FALSE )
gene_score_tiles_archr( genes, chromosome_sizes = NULL, tile_width = 500, addArchRBug = FALSE )
genes |
Gene coordinates given as GRanges, data.frame, or list. See
|
chromosome_sizes |
(optional) Size of chromosomes as a genomic-ranges object |
tile_width |
Size of tiles to consider |
addArchRBug |
Replicate ArchR bug in handling nested genes |
ArchR's tile distance algorithm works as follows
Genes are extended 5kb upstream
Genes are linked to any tiles 1kb-100kb upstream + downstream, but tiles beyond a neighboring gene are not considered
Tibble with one range per tile, with additional metadata columns gene_idx (row index of the gene this tile corresponds to) and distance.
Distance is a signed distance calculated such that if the tile has a smaller start coordinate than the gene and the gene is on the + strand, distance will be negative. The distance of adjacent but non-overlapping regions is 1bp, counting up from there.
Gene activity scores can be calculated as a distance-weighted sum of per-tile accessibility.
The tile weights for each gene can be represented as a sparse matrix of dimension genes x tiles.
If we multiply this weight matrix by a corresponding tile matrix (tiles x cells), then we can
get a gene activity score matrix of genes x cells. gene_score_weights_archr()
calculates the
weight matrix (best if you have a pre-computed tile matrix), while gene_score_archr()
provides
a easy-to-use wrapper.
gene_score_weights_archr( genes, chromosome_sizes, blacklist = NULL, tile_width = 500, gene_name_column = "gene_id", addArchRBug = FALSE ) gene_score_archr( fragments, genes, chromosome_sizes, blacklist = NULL, tile_width = 500, gene_name_column = "gene_id", addArchRBug = FALSE, tile_max_count = 4, scale_factor = 10000, tile_matrix_path = tempfile(pattern = "gene_score_tile_mat") )
gene_score_weights_archr( genes, chromosome_sizes, blacklist = NULL, tile_width = 500, gene_name_column = "gene_id", addArchRBug = FALSE ) gene_score_archr( fragments, genes, chromosome_sizes, blacklist = NULL, tile_width = 500, gene_name_column = "gene_id", addArchRBug = FALSE, tile_max_count = 4, scale_factor = 10000, tile_matrix_path = tempfile(pattern = "gene_score_tile_mat") )
genes |
Gene coordinates given as GRanges, data.frame, or list. See
|
chromosome_sizes |
Chromosome start and end coordinates given as GRanges, data.frame, or list. See
See |
blacklist |
Regions to exclude from calculations, given as GRanges, data.frame, or list. See
|
tile_width |
Size of tiles to consider |
gene_name_column |
If not NULL, a column name of |
addArchRBug |
Replicate ArchR bug in handling nested genes |
fragments |
Input fragments object |
tile_max_count |
Maximum value in the tile counts matrix. If not null, tile counts higher than this will be clipped to |
scale_factor |
If not null, counts for each cell will be scaled to sum to |
tile_matrix_path |
Path of a directory where the intermediate tile matrix will be saved |
gene_score_weights_archr:
Given a set of tile coordinates and distances returned by gene_score_tiles_archr()
,
calculate a weight matrix of dimensions genes x tiles. This matrix can be
multiplied with a tile matrix to obtain ArchR-compatible gene activity scores.
gene_score_weights_archr
Weight matrix of dimension genes x tiles
gene_score_archr
Gene score matrix of dimension genes x cells.
BPCells accepts a flexible set of genomic ranges-like objects as input, either
GRanges, data.frame, lists, or character vectors. These objects must specify chromosome, start,
and end coordinates along with optional metadata about each range.
With the exception of GenomicRanges::GRanges
objects, BPCells assumes all
objects use a zero-based, end-exclusive coordinate system (see below for details).
BPCells can interpret the following types as ranges:
list()
, data.frame()
, with columns:
chr
: Character or factor of chromosome names
start
: Start coordinates (0-based)
end
: End coordinates (exclusive)
(optional) strand
: "+"
/"-"
or TRUE
/FALSE
for pos/neg strand
(optional) Additional metadata as named list
entries or data.frame
columns
GenomicRanges::GRanges
start(x)
is interpreted as a 1-based start coordinate
end(x)
is interpreted as an inclusive end coordinate
strand(x)
: "*"
entries are interpeted as postive strand
(optional) mcols(x)
holds additional metadata
character
Given in format "chr1:1000-2000"
or "chr1:1,000-2,000"
Uses 0-based, end-exclusive coordinate system
Cannot be used for ranges where additional metadata is required
There are two main conventions for the coordinate systems:
One-based, end-inclusive ranges
The first base of a chromosome is numbered 1
The last base in a range is equal to the end coordinate
e.g. 1-5 describes the first 5 bases of the chromosome
Used in formats such as SAM, GTF
In BPCells, used when reading or writing GenomicRanges::GRanges
objects
Zero-based, end-exclusive ranges
The first base of a chromosome is numbered 0
The last base in a range is one less than the end coordinate
e.g. 0-5 describes the first 5 bases of the chromosome
Used in formats such as BAM, BED
In BPCells, used for all other range objects
Mapping of the canonical gene symbols corresponding to each unambiguous alias, previous symbol, ensembl ID, or entrez ID.
human_gene_mapping mouse_gene_mapping
human_gene_mapping mouse_gene_mapping
human_gene_mapping
A named character vector. Names are aliases or IDs and values are the corresponding canonical gene symbol
mouse_gene_mapping
A named character vector. Names are aliases or IDs and values are the corresponding canonical gene symbol
See the source code in data-raw/human_gene_mapping.R
and
data-raw/mouse_gene_mapping.R
for exactly how these mappings were made.
human_gene_mapping
http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/non_alt_loci_set.txt
mouse_gene_mapping
http://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt http://www.informatics.jax.org/downloads/reports/MRK_ENSEMBL.rpt
Read a sparse matrix from a MatrixMarket file. This is a text-based format used by 10x, Parse, and others to store sparse matrices. Format details on the NIST website.
import_matrix_market( mtx_path, outdir = tempfile("matrix_market"), row_names = NULL, col_names = NULL, row_major = FALSE, tmpdir = tempdir(), load_bytes = 4194304L, sort_bytes = 1073741824L ) import_matrix_market_10x( mtx_dir, outdir = tempfile("matrix_market"), feature_type = NULL, row_major = FALSE, tmpdir = tempdir(), load_bytes = 4194304L, sort_bytes = 1073741824L )
import_matrix_market( mtx_path, outdir = tempfile("matrix_market"), row_names = NULL, col_names = NULL, row_major = FALSE, tmpdir = tempdir(), load_bytes = 4194304L, sort_bytes = 1073741824L ) import_matrix_market_10x( mtx_dir, outdir = tempfile("matrix_market"), feature_type = NULL, row_major = FALSE, tmpdir = tempdir(), load_bytes = 4194304L, sort_bytes = 1073741824L )
mtx_path |
Path of mtx or mtx.gz file |
outdir |
Directory to store the output |
row_names |
Character vector of row names |
col_names |
Character vector of col names |
row_major |
If true, store the matrix in row-major orientation |
tmpdir |
Temporary directory to use for intermediate storage |
load_bytes |
The minimum contiguous load size during the merge sort passes |
sort_bytes |
The amount of memory to allocate for re-sorting chunks of entries |
mtx_dir |
Directory holding matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz |
feature_type |
String or vector of feature types to include. (cellranger 3.0 and newer) |
Import MatrixMarket mtx files to the BPCells format. This implementation ensures fixed memory usage even for very large inputs by doing on-disk sorts. It will be much slower than hdf5 inputs, so only use MatrixMarket format when absolutely necessary.
As a rough speed estimate, importing the 17GB Parse
1M PBMC
DGE_1M_PBMC.mtx
file takes about 4 minutes and 1.3GB of RAM, producing a compressed output matrix of 1.5GB. mtx.gz
files will be slower to import due to gzip decompression.
When importing from 10x mtx files, the row and column names can be read automatically
using the import_matrix_market_10x()
convenience function.
MatrixDir object with the imported matrix
Methods for IterableFragments objects
## S4 method for signature 'IterableFragments' show(object) cellNames(x) cellNames(x, ...) <- value chrNames(x) chrNames(x, ...) <- value
## S4 method for signature 'IterableFragments' show(object) cellNames(x) cellNames(x, ...) <- value chrNames(x) chrNames(x, ...) <- value
object |
IterableFragments object |
x |
an IterableFragments object |
value |
Character vector of new names |
cellNames<-
It is only possible to replace names, not add new names.
chrNames<-
It is only possible to replace names, not add new names.
cellNames()
Character vector of cell names, or NULL if none are known
chrNames()
: Character vector of chromosome names, or NULL if none are known
show(IterableFragments)
: Print IterableFragments
cellNames()
: Get cell names
cellNames(x, ...) <- value
: Set cell names
chrNames()
: Set chromosome names
chrNames(x, ...) <- value
: Set chromosome names
Generic methods and built-in functions for IterableMatrix objects
matrix_type(x) storage_order(x) ## S4 method for signature 'IterableMatrix' show(object) ## S4 method for signature 'IterableMatrix' t(x) ## S4 method for signature 'IterableMatrix,matrix' x %*% y ## S4 method for signature 'IterableMatrix' rowSums(x) ## S4 method for signature 'IterableMatrix' colSums(x) ## S4 method for signature 'IterableMatrix' rowMeans(x) ## S4 method for signature 'IterableMatrix' colMeans(x) colVars( x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, ..., useNames = TRUE ) rowVars( x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, ..., useNames = TRUE ) rowMaxs(x, rows = NULL, cols = NULL, na.rm = FALSE, ..., useNames = TRUE) colMaxs(x, rows = NULL, cols = NULL, na.rm = FALSE, ..., useNames = TRUE) rowQuantiles( x, rows = NULL, cols = NULL, probs = seq(from = 0, to = 1, by = 0.25), na.rm = FALSE, type = 7L, digits = 7L, ..., useNames = TRUE, drop = TRUE ) colQuantiles( x, rows = NULL, cols = NULL, probs = seq(from = 0, to = 1, by = 0.25), na.rm = FALSE, type = 7L, digits = 7L, ..., useNames = TRUE, drop = TRUE ) ## S4 method for signature 'IterableMatrix' log1p(x) log1p_slow(x) ## S4 method for signature 'IterableMatrix' expm1(x) expm1_slow(x) ## S4 method for signature 'IterableMatrix,numeric' e1 ^ e2 ## S4 method for signature 'numeric,IterableMatrix' e1 < e2 ## S4 method for signature 'IterableMatrix,numeric' e1 > e2 ## S4 method for signature 'numeric,IterableMatrix' e1 <= e2 ## S4 method for signature 'IterableMatrix,numeric' e1 >= e2 ## S4 method for signature 'IterableMatrix' round(x, digits = 0) ## S4 method for signature 'IterableMatrix,numeric' e1 * e2 ## S4 method for signature 'IterableMatrix,numeric' e1 + e2 ## S4 method for signature 'IterableMatrix,numeric' e1 / e2 ## S4 method for signature 'IterableMatrix,numeric' e1 - e2
matrix_type(x) storage_order(x) ## S4 method for signature 'IterableMatrix' show(object) ## S4 method for signature 'IterableMatrix' t(x) ## S4 method for signature 'IterableMatrix,matrix' x %*% y ## S4 method for signature 'IterableMatrix' rowSums(x) ## S4 method for signature 'IterableMatrix' colSums(x) ## S4 method for signature 'IterableMatrix' rowMeans(x) ## S4 method for signature 'IterableMatrix' colMeans(x) colVars( x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, ..., useNames = TRUE ) rowVars( x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, ..., useNames = TRUE ) rowMaxs(x, rows = NULL, cols = NULL, na.rm = FALSE, ..., useNames = TRUE) colMaxs(x, rows = NULL, cols = NULL, na.rm = FALSE, ..., useNames = TRUE) rowQuantiles( x, rows = NULL, cols = NULL, probs = seq(from = 0, to = 1, by = 0.25), na.rm = FALSE, type = 7L, digits = 7L, ..., useNames = TRUE, drop = TRUE ) colQuantiles( x, rows = NULL, cols = NULL, probs = seq(from = 0, to = 1, by = 0.25), na.rm = FALSE, type = 7L, digits = 7L, ..., useNames = TRUE, drop = TRUE ) ## S4 method for signature 'IterableMatrix' log1p(x) log1p_slow(x) ## S4 method for signature 'IterableMatrix' expm1(x) expm1_slow(x) ## S4 method for signature 'IterableMatrix,numeric' e1 ^ e2 ## S4 method for signature 'numeric,IterableMatrix' e1 < e2 ## S4 method for signature 'IterableMatrix,numeric' e1 > e2 ## S4 method for signature 'numeric,IterableMatrix' e1 <= e2 ## S4 method for signature 'IterableMatrix,numeric' e1 >= e2 ## S4 method for signature 'IterableMatrix' round(x, digits = 0) ## S4 method for signature 'IterableMatrix,numeric' e1 * e2 ## S4 method for signature 'IterableMatrix,numeric' e1 + e2 ## S4 method for signature 'IterableMatrix,numeric' e1 / e2 ## S4 method for signature 'IterableMatrix,numeric' e1 - e2
x |
IterableMatrix object or a matrix-like object. |
object |
IterableMatrix object |
y |
matrix |
probs |
(Numeric) Quantile value(s) to be computed, between 0 and 1. |
type |
(Integer) between 4 and 9 selecting which quantile algorithm to use, detailed in |
t()
Transposed object
x %*% y
: dense matrix result
rowSums()
: vector of row sums
colSums()
: vector of col sums
rowMeans()
: vector of row means
colMeans()
: vector of col means
colVars()
: vector of col variance
rowVars()
: vector of row variance
rowMaxs()
: vector of maxes for every row
colMaxs()
: vector of column maxes
rowQuantiles():
If length(probs) == 1
, return a numeric with number of entries equal to the number of rows in the matrix.
Else, return a Matrix of quantile values, with cols representing each quantile, and each row representing a row in the input matrix.
colQuantiles():
If length(probs) == 1
, return a numeric with number of entries equal to the number of columns in the matrix.
Else, return a Matrix of quantile values, with cols representing each quantile, and each row representing a col in the input matrix.
matrix_type()
: Get the matrix data type (mat_uint32_t, mat_float, or mat_double for now)
storage_order()
: Get the matrix storage order ("row" or "col")
show(IterableMatrix)
: Display an IterableMatrix
t(IterableMatrix)
: Transpose an IterableMatrix
x %*% y
: Multiply by a dense matrix
rowSums(IterableMatrix)
: Calculate rowSums
colSums(IterableMatrix)
: Calculate colSums
rowMeans(IterableMatrix)
: Calculate rowMeans
colMeans(IterableMatrix)
: Calculate colMeans
colVars()
: Calculate colVars (replacement for matrixStats::colVars()
)
rowVars()
: Calculate rowVars (replacement for matrixStats::rowVars()
)
rowMaxs()
: Calculate rowMaxs (replacement for matrixStats::rowMaxs()
)
colMaxs()
: Calculate colMax (replacement for matrixStats::colMax()
)
rowQuantiles()
: Calculate rowQuantiles (replacement for matrixStats::rowQuantiles
)
colQuantiles()
: Calculate colQuantiles (replacement for matrixStats::colQuantiles
)
log1p(IterableMatrix)
: Calculate log(x + 1)
log1p_slow()
: Calculate log(x + 1) (non-SIMD version)
expm1(IterableMatrix)
: Calculate exp(x) - 1
expm1_slow()
: Calculate exp(x) - 1 (non-SIMD version)
e1^e2
: Calculate x^y (elementwise)
e1 < e2
: Binarize matrix according to numeric < matrix comparison
e1 > e2
: Binarize matrix according to matrix > numeric comparison
e1 <= e2
: Binarize matrix according to numeric <= matrix comparison
e1 >= e2
: Binarize matrix according to matrix >= numeric comparison
round(IterableMatrix)
: round to nearest integer (digits must be 0)
e1 * e2
: Multiply by a constant, or multiply rows by a vector length nrow(mat)
e1 + e2
: Add a constant, or row-wise addition with a vector length nrow(mat)
e1 / e2
: Divide by a constant, or divide rows by a vector length nrow(mat)
e1 - e2
: Subtract a constant, or row-wise subtraction with a vector length nrow(mat)
Search for approximate nearest neighbors between cells in the reduced dimensions (e.g. PCA), and return the k nearest neighbors (knn) for each cell. Optionally, we can find neighbors between two separate sets of cells by utilizing both data and query.
knn_hnsw( data, query = NULL, k = 10, metric = c("euclidean", "cosine"), verbose = TRUE, threads = 1, ef = 100 ) knn_annoy( data, query = data, k = 10, metric = c("euclidean", "cosine", "manhattan", "hamming"), n_trees = 50, search_k = -1 )
knn_hnsw( data, query = NULL, k = 10, metric = c("euclidean", "cosine"), verbose = TRUE, threads = 1, ef = 100 ) knn_annoy( data, query = data, k = 10, metric = c("euclidean", "cosine", "manhattan", "hamming"), n_trees = 50, search_k = -1 )
data |
cell x dims matrix for reference dataset |
query |
cell x dims matrix for query dataset (optional) |
k |
number of neighbors to calculate |
metric |
distance metric to use |
verbose |
whether to print progress information during search |
threads |
Number of threads to use. Note that result is non-deterministic if threads > 1 |
ef |
ef parameter for RccppHNSW::hnsw_search. Increase for slower search but improved accuracy |
n_trees |
Number of trees during index build time. More trees gives higher accuracy |
search_k |
Number of nodes to inspect during the query, or -1 for default value. Higher number gives higher accuracy |
knn_hnsw: Use RcppHNSW as knn engine
knn_annoy: Use RcppAnnoy as knn engine
List of 2 matrices – idx for cell x K neighbor indices, dist for cell x K neighbor distances. If no query is given, nearest neighbors are found mapping the data matrix to itself, prohibiting self-neighbors
Convert a KNN object (e.g. returned by knn_hnsw()
or knn_annoy()
) into
a graph. The graph is represented as a sparse adjacency matrix.
knn_to_graph(knn, use_weights = FALSE, self_loops = TRUE) knn_to_snn_graph( knn, min_val = 1/15, self_loops = FALSE, return_type = c("matrix", "list") ) knn_to_geodesic_graph(knn, return_type = c("matrix", "list"), threads = 0L)
knn_to_graph(knn, use_weights = FALSE, self_loops = TRUE) knn_to_snn_graph( knn, min_val = 1/15, self_loops = FALSE, return_type = c("matrix", "list") ) knn_to_geodesic_graph(knn, return_type = c("matrix", "list"), threads = 0L)
knn |
List of 2 matrices – idx for cell x K neighbor indices, dist for cell x K neighbor distances |
use_weights |
boolean for whether to replace all distance weights with 1 |
self_loops |
Whether to allow self-loops in the output graph |
min_val |
minimum jaccard index between neighbors. Values below this will round to 0 |
return_type |
Whether to return a sparse adjacency matrix or an edge list |
threads |
Number of threads to use during calculations |
knn_to_graph Create a knn graph
knn_to_snn_graph Convert a knn object into a shared nearest neighbors adjacency matrix. This follows the algorithm that Seurat uses to compute SNN graphs
knn_to_geodesic_graph
Convert a knn object into an undirected weighted graph, using the same
geodesic distance estimation method as the UMAP package.
This matches the output of umap._umap.fuzzy_simplicial_set
from the umap-learn
python package, used by default in scanpy.pp.neighbors
.
Because this only re-weights and symmetrizes the KNN graph, it will usually use
less memory and return a sparser graph than knn_to_snn_graph
which computes
2nd-order neighbors. Note: when cells don't have themselves listed as the nearest
neighbor, results may differ slightly from umap._umap.fuzzy_simplicial_set
, which
assumes self is always successfully found in the approximate nearest neighbor search.
knn_to_graph
Sparse matrix (dgCMatrix) where mat[i,j]
= distance from cell i
to
cell j
, or 0 if cell j
is not in the K nearest neighbors of i
knn_to_snn_graph
return_type == "matrix"
:
Sparse matrix (dgCMatrix) where mat[i,j]
= jaccard index of the overlap
in nearest neigbors between cell i
and cell j
, or 0 if the jaccard index
is < min_val
. Only the lower triangle is filled in, which is compatible with
the BPCells clustering methods
return_type == "list"
:
List of 3 equal-length vectors i
, j
, and weight
, along with an integer dim
.
These correspond to the rows, cols, and values of non-zero entries in the lower triangle
adjacency matrix. dim
is the total number of vertices (cells) in the graph
knn_to_geodesic_graph
return_type == "matrix"
:
Sparse matrix (dgCMatrix) where mat[i,j]
= normalized similarity between cell i
and cell j
.
Only the lower triangle is filled in, which is compatible with the BPCells clustering methods
return_type == "list"
:
List of 3 equal-length vectors i
, j
, and weight
, along with an integer dim
.
These correspond to the rows, cols, and values of non-zero entries in the lower triangle
adjacency matrix. dim
is the total number of vertices (cells) in the graph
Given a features x cells matrix, perform one-vs-all differential tests to find markers.
marker_features(mat, groups, method = "wilcoxon")
marker_features(mat, groups, method = "wilcoxon")
mat |
IterableMatrix object of dimensions features x cells |
groups |
Character/factor vector of cell groups/clusters. Length #cells |
method |
Test method to use. Current options are:
|
Tips for using the values from this function:
Use dplyr::mutate()
to add columns for e.g. adjusted p-value and log fold change.
Use dplyr::filter()
to get only differential genes above some given threshold
To get adjusted p-values, use R p.adjust()
, recommended method is "BH"
To get log2 fold change: if your input matrix was already log-transformed,
calculate (foreground_mean - background_mean)/log(2)
. If your input
matrix was not log-transformed, calculate log2(forground_mean/background_mean)
tibble with the following columns:
foreground: Group ID used for the foreground
background: Group ID used for the background (or NA if comparing to rest of cells)
feature: ID of the feature
p_val_raw: Unadjusted p-value for differential test
foreground_mean: Average value in the foreground group
background_mean: Average value in the background group
Correct alias gene symbols, Ensembl IDs, and Entrez IDs to canonical gene symbols. This is useful for matching gene names between different datasets which might not always use the same gene naming conventions.
match_gene_symbol(query, subject, gene_mapping = human_gene_mapping) canonical_gene_symbol(query, gene_mapping = human_gene_mapping)
match_gene_symbol(query, subject, gene_mapping = human_gene_mapping) canonical_gene_symbol(query, gene_mapping = human_gene_mapping)
query |
Character vector of gene symbols or IDs |
subject |
Vector of gene symbols or IDs to index into |
gene_mapping |
Named vector where names are gene symbols or IDs and values are canonical gene symbols |
match_gene_symbol
Integer vector of indices v
such that subject[v]
corresponds to the gene
symbols in query
canonical_gene_symbol
Character vector of canonical gene symbols for each symbol in query
BPCells matrices can be interconverted with Matrix package dgCMatrix sparse matrices, as well as base R dense matrices (though this may result in high memory usage for large matrices)
# Convert to R from BPCells as(bpcells_mat, "dgCMatrix") # Sparse matrix conversion as.matrix(bpcells_mat) # Dense matrix conversion # Convert to BPCells from R as(dgc_mat, "IterableMatrix")
# Convert to R from BPCells as(bpcells_mat, "dgCMatrix") # Sparse matrix conversion as.matrix(bpcells_mat) # Dense matrix conversion # Convert to BPCells from R as(dgc_mat, "IterableMatrix")
Calculate matrix stats
matrix_stats( matrix, row_stats = c("none", "nonzero", "mean", "variance"), col_stats = c("none", "nonzero", "mean", "variance"), threads = 0L )
matrix_stats( matrix, row_stats = c("none", "nonzero", "mean", "variance"), col_stats = c("none", "nonzero", "mean", "variance"), threads = 0L )
matrix |
Input matrix object |
row_stats |
Which row statistics to compute |
col_stats |
Which col statistics to compute |
threads |
Number of threads to use during execution |
The statistics will be calculated in a single pass over the matrix, so this method is desirable to use for efficiency purposes compared to the more standard rowMeans or colMeans if multiple statistics are needed. The stats are ordered by complexity: nonzero, mean, then variance. All less complex stats are calculated in the process of calculating a more complicated stat. So to calculate mean and variance simultaneously, just ask for variance, which will compute mean and nonzero counts as a side-effect
List of row_stats: matrix of n_stats x n_rows, col_stats: matrix of n_stats x n_cols
Peak and tile matrix calculations can be sped up by reducing the number of cells. For cases where the outputs are going to be added together afterwards, this can provide a performance improvement
merge_cells(fragments, cell_groups)
merge_cells(fragments, cell_groups)
fragments |
Input fragments object |
cell_groups |
Character or factor vector providing a group for each cell.
Ordering is the same as |
Merge peaks according to ArchR's iterative merging algorithm. More details on the ArchR website
merge_peaks_iterative(peaks)
merge_peaks_iterative(peaks)
peaks |
Peaks given as GRanges, data.frame, or list. See
Must be ordered by priority and have columns chr, start, end. |
Properties of merged peaks:
No peaks in the merged set overlap
Peaks are prioritized according to their order in the original input
The output peaks are a subset of the input peaks, with no peak boundaries changed
tibble::tibble()
with a nonoverlapping subset of the rows in peaks. All metadata
columns are preserved
min_scalar: Take minumum with a global constant
min_by_row: Take the minimum with a per-row constant
min_by_col: Take the minimum with a per-col constant
min_scalar(mat, val) min_by_row(mat, vals) min_by_col(mat, vals)
min_scalar(mat, val) min_by_row(mat, vals) min_by_col(mat, vals)
mat |
IterableMatrix |
val |
Single positive numeric value |
Take the minimum value of a matrix with a per-row, per-col, or global constant. This constant must be >0 to preserve sparsity of the matrix. This has the effect of capping the maximum value in the matrix.
IterableMatrix
Normalize an object representing genomic ranges
normalize_ranges( ranges, metadata_cols = character(0), zero_based_coords = !is(ranges, "GRanges"), n = 1 )
normalize_ranges( ranges, metadata_cols = character(0), zero_based_coords = !is(ranges, "GRanges"), n = 1 )
ranges |
Genomic regions given as GRanges, data.frame, or list. See
|
metadata_cols |
Optional list of metadata columns to require & extract |
zero_based_coords |
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range |
data frame with zero-based coordinates, and elements chr (factor), start (int), and end (int).
If ranges
does not have chr level information, chr levels are the sorted unique values of chr.
If strand is in metadata_cols, then the output strand element will be TRUE for positive strand, and FALSE for negative strand. (Converted from a character vector of "+"/"-" if necessary)
Count fragments by nucleosomal size
nucleosome_counts(fragments, nucleosome_width = 147)
nucleosome_counts(fragments, nucleosome_width = 147)
fragments |
Fragments object |
nucleosome_width |
Integer cutoff to use as nucleosome width |
Shorter than nucleosome_width
is subNucleosomal
,
nucleosome_width
to 2*nucleosome_width-1
is monoNucleosomal
, and anything longer is multiNucleosomal
.
The sum of all fragments is given as nFrags
List with names subNucleosomal
, monoNucleosomal
, multiNucleosomal
, and nFrags
, containing the
count vectors of fragments in each class per cell.
10x fragment files come in a bed-like format, with columns chr, start, end, cell_id, and pcr_duplicates. Unlike a standard bed format, the format from cellranger has an inclusive end-coordinate, meaning the end coordinate itself is what should be counted as the tagmentation site, rather than offset by 1.
open_fragments_10x(path, comment = "#", end_inclusive = TRUE) write_fragments_10x( fragments, path, end_inclusive = TRUE, append_5th_column = FALSE )
open_fragments_10x(path, comment = "#", end_inclusive = TRUE) write_fragments_10x( fragments, path, end_inclusive = TRUE, append_5th_column = FALSE )
path |
File path (e.g. fragments.tsv or fragments.tsv.gz) |
comment |
Skip lines at beginning of file which start with comment string |
end_inclusive |
Whether the end coordinate of the bed is inclusive – i.e. there was an insertion at the end coordinate rather than the base before the end coordinate. This is the 10x default, though it's not quite standard for the bed file format. |
fragments |
Input fragments object |
append_5th_column |
Whether to include 5th column of all 0 for compatibility with 10x fragment file outputs (defaults to 4 columns chr,start,end,cell) |
open_fragments_10x
No disk operations will take place until the fragments are used in a function
write_fragments_10x
Fragments will be written to disk immediately, then returned in a readable object.
10x fragments file object
Read/write a 10x feature matrix
open_matrix_10x_hdf5(path, feature_type = NULL, buffer_size = 16384L) write_matrix_10x_hdf5( mat, path, barcodes = colnames(mat), feature_ids = rownames(mat), feature_names = rownames(mat), feature_types = "Gene Expression", feature_metadata = list(), buffer_size = 16384L, chunk_size = 1024L, gzip_level = 0L, type = c("uint32_t", "double", "float", "auto") )
open_matrix_10x_hdf5(path, feature_type = NULL, buffer_size = 16384L) write_matrix_10x_hdf5( mat, path, barcodes = colnames(mat), feature_ids = rownames(mat), feature_names = rownames(mat), feature_types = "Gene Expression", feature_metadata = list(), buffer_size = 16384L, chunk_size = 1024L, gzip_level = 0L, type = c("uint32_t", "double", "float", "auto") )
path |
Path to the hdf5 file on disk |
feature_type |
Optional selection of feature types to include in output matrix. For multiome data, the options are "Gene Expression" and "Peaks". This option is only compatible with files from cellranger 3.0 and newer. |
buffer_size |
For performance tuning only. The number of items to be buffered in memory before calling writes to disk. |
mat |
IterableMatrix |
barcodes |
Vector of names for the cells |
feature_ids |
Vector of IDs for the features |
feature_names |
Vector of names for the features |
feature_types |
String or vector of feature types |
feature_metadata |
Named list of additional metadata vectors to store for each feature |
chunk_size |
For performance tuning only. The chunk size used for the HDF5 array storage. |
gzip_level |
Gzip compression level. Default is 0 (no compression) |
type |
Data type of the output matrix. Default is |
The 10x format makes use of gzip compression for the matrix data, which can slow down read performance. Consider writing into another format if the read performance is important to you.
Input matrices must be in column-major storage order, and if the rownames and colnames are not set, names must be provided for the relevant metadata parameters. Some of the metadata parameters are not read by default in BPCells, but it is possible to export them for use with other tools.
BPCells matrix object
Read or write a sparse matrix from an anndata hdf5 file. These functions will
automatically transpose matrices when converting to/from the AnnData
format. This is because the AnnData convention stores cells as rows, whereas the R
convention stores cells as columns. If this behavior is undesired, call t()
manually on the matrix inputs and outputs of these functions.
open_matrix_anndata_hdf5(path, group = "X", buffer_size = 16384L) write_matrix_anndata_hdf5( mat, path, group = "X", buffer_size = 16384L, chunk_size = 1024L, gzip_level = 0L )
open_matrix_anndata_hdf5(path, group = "X", buffer_size = 16384L) write_matrix_anndata_hdf5( mat, path, group = "X", buffer_size = 16384L, chunk_size = 1024L, gzip_level = 0L )
path |
Path to the hdf5 file on disk |
group |
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use |
buffer_size |
For performance tuning only. The number of items to be buffered in memory before calling writes to disk. |
chunk_size |
For performance tuning only. The chunk size used for the HDF5 array storage. |
gzip_level |
Gzip compression level. Default is 0 (no compression) |
AnnDataMatrixH5 object, with cells as the columns.
Use this function to order regioins prior to calling peak_matrix()
or tile_matrix()
.
order_ranges(ranges, chr_levels, sort_by_end = TRUE)
order_ranges(ranges, chr_levels, sort_by_end = TRUE)
ranges |
Genomic regions given as GRanges, data.frame, or list. See
|
chr_levels |
Ordering of chromosome names |
sort_by_end |
If TRUE (defualt), sort by (chr, end, start). Else sort by (chr, start, end) |
Numeric vector analagous to the order
function. Provides an index
selection that will reorder the input ranges to be sorted by chr, end, start
Calculate ranges x cells overlap matrix
peak_matrix( fragments, ranges, mode = c("insertions", "fragments", "overlaps"), zero_based_coords = !is(ranges, "GRanges"), explicit_peak_names = TRUE )
peak_matrix( fragments, ranges, mode = c("insertions", "fragments", "overlaps"), zero_based_coords = !is(ranges, "GRanges"), explicit_peak_names = TRUE )
fragments |
Input fragments object. Must have cell names and chromosome names defined |
ranges |
Peaks/ranges to overlap, given as GRanges, data.frame, or list. See
|
mode |
Mode for counting peak overlaps. (See "value" section for more details) |
zero_based_coords |
Whether to convert the ranges from a 1-based end-inclusive coordinate system to a 0-based end-exclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) |
explicit_peak_names |
Boolean for whether to add rownames to the output matrix in format e.g chr1:500-1000, where start and end coords are given in a 0-based coordinate system. Note that either way, peak names will be written when the matrix is saved. |
Iterable matrix object with dimension ranges x cells. When saved, the column names of the output matrix will be in the format chr1:500-1000, where start and end coords are given in a 0-based coordinate system.
mode
options
"insertions"
: Start and end coordinates are separately overlapped with each peak
"fragments"
: Like "insertions"
, but each fragment can contribute at most 1 count
to each peak, even if both the start and end coordinates overlap
"overlaps"
: Like "fragments"
, but an overlap is also counted if the fragment fully
spans the peak even if neither the start or end falls within the peak
When calculating the matrix directly from a fragments tsv, it's necessary to first call select_chromosomes()
in order to
provide the ordering of chromosomes to expect while reading the tsv.
Plot feature levels per group or cluster as a grid of dots. Dots are colored by z-score normalized average expression, and sized by percent non-zero.
plot_dot( source, features, groups, group_order = NULL, gene_mapping = human_gene_mapping, colors = c("lightgrey", "#4682B4"), return_data = FALSE, apply_styling = TRUE )
plot_dot( source, features, groups, group_order = NULL, gene_mapping = human_gene_mapping, colors = c("lightgrey", "#4682B4"), return_data = FALSE, apply_styling = TRUE )
source |
Feature x cell matrix or data.frame with features. For best results,
features should be sparse and log-normalized (e.g. run |
features |
Character vector of features to plot |
groups |
Vector with one entry per cell, specifying the cell's group |
group_order |
Optional vector listing ordering of groups |
gene_mapping |
An optional vector for gene name matching with match_gene_symbol(). |
colors |
Color scale for plot |
return_data |
If true, return data from just before plotting rather than a plot. |
apply_styling |
If false, return a plot without pretty styling applied |
Plot one or more features by coloring cells in a UMAP plot.
plot_embedding( source, embedding, features = NULL, quantile_range = c(0.01, 0.99), randomize_order = TRUE, smooth = NULL, smooth_rounds = 3, gene_mapping = human_gene_mapping, size = NULL, rasterize = FALSE, raster_pixels = 512, legend_continuous = c("auto", "quantile", "value"), labels_quantile_range = TRUE, colors_continuous = c("lightgrey", "#4682B4"), legend_discrete = TRUE, labels_discrete = TRUE, colors_discrete = discrete_palette("stallion"), return_data = FALSE, return_plot_list = FALSE, apply_styling = TRUE )
plot_embedding( source, embedding, features = NULL, quantile_range = c(0.01, 0.99), randomize_order = TRUE, smooth = NULL, smooth_rounds = 3, gene_mapping = human_gene_mapping, size = NULL, rasterize = FALSE, raster_pixels = 512, legend_continuous = c("auto", "quantile", "value"), labels_quantile_range = TRUE, colors_continuous = c("lightgrey", "#4682B4"), legend_discrete = TRUE, labels_discrete = TRUE, colors_discrete = discrete_palette("stallion"), return_data = FALSE, return_plot_list = FALSE, apply_styling = TRUE )
source |
Matrix, or data frame to pull features from, or a vector of feature values for a single feature. For a matrix, the features must be rows. |
embedding |
A matrix of dimensions cells x 2 with embedding coordinates |
features |
Character vector of features to plot if |
quantile_range |
(optional) Length 2 vector giving the quantiles to clip the minimum and maximum color scale values, as fractions between 0 and 1. NULL or NA values to skip clipping |
randomize_order |
If TRUE, shuffle cells to prevent overplotting biases. Can pass an integer instead to specify a random seed to use. |
smooth |
(optional) Sparse matrix of dimensions cells x cells with cell-cell distance weights for smoothing. |
smooth_rounds |
Number of multiplication rounds to apply when smoothing. |
gene_mapping |
An optional vector for gene name matching with match_gene_symbol().
Ignored if |
size |
Point size for plotting |
rasterize |
Whether to rasterize the point drawing to speed up display in graphics programs. |
raster_pixels |
Number of pixels to use when rasterizing. Can provide one number for square dimensions, or two numbers for width x height. |
legend_continuous |
Whether to label continuous features by quantile or value. "auto"
labels by quantile only when all features are continuous and |
labels_quantile_range |
Whether to add a text label with the value range of each feature when the legend is set to quantile |
colors_continuous |
Vector of colors to use for continuous color palette |
legend_discrete |
Whether to show the legend for discrete (categorical) features. |
labels_discrete |
Whether to add text labels at the center of each group for discrete (categorical) features. |
colors_discrete |
Vector of colors to use for discrete (categorical) features. |
return_data |
If true, return data from just before plotting rather than a plot. |
return_plot_list |
If |
apply_styling |
If false, return a plot without pretty styling applied |
Smoothing is performed as follows: first, the smoothing matrix is normalized so the sum of incoming weights to every cell is 1. Then, the raw data values are repeatedly multiplied by the smoothing matrix and re-scaled so the average value stays the same.
By default, returns a ggplot2 object with all the requested features plotted
in a grid. If return_data
or return_plot_list
is called, the return value will
match that argument.
Plot the distribution of fragment lengths, with length in basepairs on the x-axis, and proportion of fragments on the y-axis. Typical plots will show 10-basepair periodicity, as well as humps spaced at multiples of a nucleosome width (about 150bp).
plot_fragment_length( fragments, max_length = 500, return_data = FALSE, apply_styling = TRUE )
plot_fragment_length( fragments, max_length = 500, return_data = FALSE, apply_styling = TRUE )
fragments |
Fragments object |
max_length |
Maximum length to show on the plot |
return_data |
If true, return data from just before plotting rather than a plot. |
apply_styling |
If false, return a plot without pretty styling applied |
Numeric vector where index i contans the number of length-i fragments
Plots read count rank vs. number of reads on a log-log scale.
plot_read_count_knee( read_counts, cutoff = NULL, return_data = FALSE, apply_styling = TRUE )
plot_read_count_knee( read_counts, cutoff = NULL, return_data = FALSE, apply_styling = TRUE )
read_counts |
Vector of read counts per cell |
cutoff |
(optional) Read cutoff to mark on the plot |
return_data |
If true, return data from just before plotting rather than a plot. |
apply_styling |
If false, return a plot without pretty styling applied |
Performs logarithmic downsampling to reduce the number of points plotted
ggplot2 plot object
Plot the footprinting around TF motif sites
plot_tf_footprint( fragments, motif_positions, cell_groups = rlang::rep_along(cellNames(fragments), "all"), flank = 250L, smooth = 0L, zero_based_coords = !is(genes, "GRanges"), colors = discrete_palette("stallion"), return_data = FALSE, apply_styling = TRUE )
plot_tf_footprint( fragments, motif_positions, cell_groups = rlang::rep_along(cellNames(fragments), "all"), flank = 250L, smooth = 0L, zero_based_coords = !is(genes, "GRanges"), colors = discrete_palette("stallion"), return_data = FALSE, apply_styling = TRUE )
fragments |
IterableFragments object |
motif_positions |
Coordinate ranges for motifs (must include strand) and have constant width |
cell_groups |
Character or factor assigning a group to each cell, in order of
|
flank |
Number of flanking basepairs to include on either side of the motif |
smooth |
(optional) Sparse matrix of dimensions cells x cells with cell-cell distance weights for smoothing. |
zero_based_coords |
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range |
return_data |
If true, return data from just before plotting rather than a plot. |
apply_styling |
If false, return a plot without pretty styling applied |
footprint()
, plot_tss_profile()
Plot the enrichmment of insertions relative to transcription start sites (TSS). Typically, this plot shows strong enrichment of insertions near a TSS, and a small bump downstream around 220bp downstream of the TSS for the +1 nucleosome.
plot_tss_profile( fragments, genes, cell_groups = rlang::rep_along(cellNames(fragments), "all"), flank = 2000L, smooth = 0L, zero_based_coords = !is(genes, "GRanges"), colors = discrete_palette("stallion"), return_data = FALSE, apply_styling = TRUE )
plot_tss_profile( fragments, genes, cell_groups = rlang::rep_along(cellNames(fragments), "all"), flank = 2000L, smooth = 0L, zero_based_coords = !is(genes, "GRanges"), colors = discrete_palette("stallion"), return_data = FALSE, apply_styling = TRUE )
fragments |
IterableFragments object |
genes |
Coordinate ranges for genes (must include strand) |
cell_groups |
Character or factor assigning a group to each cell, in order of
|
flank |
Number of flanking basepairs to include on either side of the motif |
smooth |
Number of bases to smooth over (rolling average) |
zero_based_coords |
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range |
return_data |
If true, return data from just before plotting rather than a plot. |
apply_styling |
If false, return a plot without pretty styling applied |
footprint()
, plot_tf_footprint()
Density scatter plot with log10(fragment_count) on the x-axis and TSS enrichment on the y-axis. This plot is most useful to select which cell barcodes in an experiment correspond to high-quality cells
plot_tss_scatter( atac_qc, min_frags = NULL, min_tss = NULL, bins = 100, apply_styling = TRUE )
plot_tss_scatter( atac_qc, min_frags = NULL, min_tss = NULL, bins = 100, apply_styling = TRUE )
atac_qc |
Tibble as returned by |
min_frags |
Minimum fragment count cutoff |
min_tss |
Minimum TSS Enrichment cutoff |
bins |
Number of bins for density calculation |
apply_styling |
If false, return a plot without pretty styling applied |
Rename cells by adding a prefix to the names. Most commonly this will be a sample name.
All cells will recieve the exact text of prefix
added to the beginning, so any separator
characters like "_" must be included in the given prefix
. Use this prior to merging
fragments from different experiments with c()
in order to help prevent cell name
clashes.
prefix_cell_names(fragments, prefix)
prefix_cell_names(fragments, prefix)
fragments |
Input fragments object. |
prefix |
String to add as the prefix |
Fragments object with prefixed names
Given a (features x cells)
matrix, group cells by cell_groups
and aggregate counts by method
for each
feature.
pseudobulk_matrix(mat, cell_groups, method = "sum", threads = 1L)
pseudobulk_matrix(mat, cell_groups, method = "sum", threads = 1L)
mat |
IterableMatrix object of dimensions features x cells |
cell_groups |
(Character/factor) Vector of group/cluster assignments for each cell. Length must be |
method |
(Character vector) Method(s) to aggregate counts. If one method is provided, the output will be a matrix. If multiple methods are provided, the output will be a named list of matrices. Current options are: |
threads |
(integer) Number of threads to use. |
Some simpler stats are calculated in the process of calculating more complex
statistics. So when calculating variance
, nonzeros
and mean
can be included with no
extra calculation time, and when calculating mean
, adding nonzeros
will take no extra time.
If method
is length 1
, returns a matrix of shape (features x groups)
.
If method
is greater than length 1
, returns a list of matrices with each matrix representing a pseudobulk matrix with a different aggregation method.
Each matrix is of shape (features x groups)
, and names are one of nonzeros
, sum
, mean
, variance
.
Calculate ArchR-compatible per-cell QC statistics
qc_scATAC(fragments, genes, blacklist)
qc_scATAC(fragments, genes, blacklist)
fragments |
IterableFragments object |
genes |
Gene coordinates given as GRanges, data.frame, or list. See
|
blacklist |
Blacklisted regions given as GRanges, data.frame, or list. See
|
This implementation mimics ArchR's default parameters. For uses requiring more flexibility to tweak default parameters, the best option is to re-implement this function with required changes. Output columns of data.frame:
cellName
: cell name for each cell
nFrags
: number of fragments per cell
subNucleosomal
, monoNucleosomal
, multiNucleosomal
: number of fragments of size 1-146bp, 147-254bp, and 255bp + respectively.
equivalent to ArchR's nMonoFrags, nDiFrags, nMultiFrags respectively
TSSEnrichment
: AvgInsertInTSS / max(AvgInsertFlankingTSS, 0.1)
, where AvgInsertInTSS
is ReadsInTSS / 101
(window size),
and AvgInsertFlankingTSS
is ReadsFlankingTSS / (100*2)
(window size). The max(0.1)
ensures that very low-read cells
do not get assigned spuriously high TSSEnrichment.
ReadsInPromoter
: Number of reads from 2000bp upstream of TSS to 101bp downstream of TSS
ReadsInBlacklist
: Number of reads in the provided blacklist region
ReadsInTSS
: Number of reads overlapping the 101bp centered around each TSS
ReadsFlankingTSS
: Number of reads overlapping 1901-2000bp +/- each TSS
Differences from ArchR: Note that ArchR by default uses a different set of annotations to derive TSS sites and promoter sites. This function uses just one annotation for gene start+end sites, so must be called twice to exactly re-calculate the ArchR QC stats.
ArchR's PromoterRatio
and BlacklistRatio
are not included in the output, as they can be easily calculated
from ReadsInPromoter / nFrags
and ReadsInBlacklist / nFrags
. Similarly, ArchR's NucleosomeRatio
can be calculated
as (monoNucleosomal + multiNucleosomal) / subNucleosomal
.
data.frame with QC data
Given a set of genomic ranges, find the distance to the nearest neighbors both upstream and downstream.
range_distance_to_nearest( ranges, addArchRBug = FALSE, zero_based_coords = !is(ranges, "GRanges") )
range_distance_to_nearest( ranges, addArchRBug = FALSE, zero_based_coords = !is(ranges, "GRanges") )
ranges |
Genomic regions given as GRanges, data.frame, or list. See
|
addArchRBug |
boolean to reproduce ArchR's bug that incorrectly handles nested genes |
zero_based_coords |
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range |
A 2-column data.frame with columns upstream and downstream, containing
the distances to the nearest neighbor in the respective directions.
For ranges on +
or *
strand, distance is calculated as:
upstream = max(start(range) - end(upstreamNeighbor), 0)
downstream = max(start(downstreamNeighbor) - end(range), 0)
For ranges on -
strand, the definition of upstream and downstream is flipped.
Note that this definition of distance is one off from
GenomicRanges::distance()
, as ranges that neighbor but don't overlap are given
a distance of 1 rather than 0.
Bed files can contain peak or blacklist annotations. These utilities help read thos annotations
read_bed( path, additional_columns = character(0), backup_url = NULL, timeout = 300 ) read_encode_blacklist( dir, genome = c("hg38", "mm10", "hg19", "dm6", "dm3", "ce11", "ce10"), timeout = 300 )
read_bed( path, additional_columns = character(0), backup_url = NULL, timeout = 300 ) read_encode_blacklist( dir, genome = c("hg38", "mm10", "hg19", "dm6", "dm3", "ce11", "ce10"), timeout = 300 )
path |
Path to file (or desired save location if backup_url is used) |
additional_columns |
Names for additional columns in the bed file |
backup_url |
If path does not exist, provides a URL to download the gtf from |
timeout |
Maximum time in seconds to wait for download from backup_url |
dir |
Output directory to cache the downloaded gtf file |
genome |
genome name |
read_bed
Read a bed file from disk or a url.
read_encode_blacklist
Downloads the Boyle Lab blacklist, as described in https://doi.org/10.1038/s41598-019-45839-z
Data frame with coordinates using the 0-based convention.
read_gtf()
, read_gencode_genes()
Read gene annotations from gtf format into a data frame. The source can be a URL, a gtf file on disk, or a gencode release version.
read_gtf( path, attributes = c("gene_id"), tags = character(0), features = c("gene"), keep_attribute_column = FALSE, backup_url = NULL, timeout = 300 ) read_gencode_genes( dir, release = "latest", annotation_set = c("basic", "comprehensive"), gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene", attributes = c("gene_id", "gene_type", "gene_name"), tags = character(0), features = c("gene"), timeout = 300 ) read_gencode_transcripts( dir, release = "latest", transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"), annotation_set = c("basic", "comprehensive"), gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene", attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"), features = c("transcript", "exon"), timeout = 300 )
read_gtf( path, attributes = c("gene_id"), tags = character(0), features = c("gene"), keep_attribute_column = FALSE, backup_url = NULL, timeout = 300 ) read_gencode_genes( dir, release = "latest", annotation_set = c("basic", "comprehensive"), gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene", attributes = c("gene_id", "gene_type", "gene_name"), tags = character(0), features = c("gene"), timeout = 300 ) read_gencode_transcripts( dir, release = "latest", transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"), annotation_set = c("basic", "comprehensive"), gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene", attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"), features = c("transcript", "exon"), timeout = 300 )
path |
Path to file (or desired save location if backup_url is used) |
attributes |
Vector of GTF attribute names to parse out as columns |
tags |
Vector of tags to parse out as boolean presence/absence |
features |
List of features types to keep from the GTF (e.g. gene, transcript, exon, intron) |
keep_attribute_column |
Boolean for whether to preserve the raw attribute text column |
backup_url |
If path does not exist, provides a URL to download the gtf from |
timeout |
Maximum time in seconds to wait for download from backup_url |
dir |
Output directory to cache the downloaded gtf file |
release |
release version (prefix with M for mouse versions). For most recent version, use "latest" or "latest_mouse" |
annotation_set |
Either "basic" or "comprehensive" annotation sets (see details section). |
gene_type |
Regular expression with which gene types to keep. Defaults to protein_coding, lncRNA, and IG/TR genes |
transcript_choice |
Method for selecting representative transcripts. Choices are:
|
read_gtf
Read gtf from a file or URL
read_gencode_genes
Read gene annotations directly from GENCODE. The file name will vary depending
on the release and annotation set requested, but will be of the format
gencode.v42.annotation.gtf.gz
. GENCODE currently recommends the basic set:
https://www.gencodegenes.org/human/. In release 42, both the comprehensive and
basic sets had identical gene-level annotations, but the comprehensive set had
additional transcript variants annotated.
read_gencode_transcripts
Read transcript models from GENCODE, for use with trackplot_gene()
Data frame with coordinates using the 0-based convention. Columns are:
chr
source
feature
start
end
score
strand
frame
attributes (optional; named according to listed attributes)
tags (named according to listed tags)
read_bed()
, read_encode_blacklist()
Read chromosome sizes from UCSC and return as a tibble with one row per chromosome. The underlying data is pulled from here: https://hgdownload.soe.ucsc.edu/downloads.html
read_ucsc_chrom_sizes( dir, genome = c("hg38", "mm39", "mm10", "mm9", "hg19"), keep_chromosomes = "chr[0-9]+|chrX|chrY", timeout = 300 )
read_ucsc_chrom_sizes( dir, genome = c("hg38", "mm39", "mm10", "mm9", "hg19"), keep_chromosomes = "chr[0-9]+|chrX|chrY", timeout = 300 )
Regress out the effects of confounding variables using a linear least squares regression model.
regress_out(mat, latent_data, prediction_axis = c("row", "col"))
regress_out(mat, latent_data, prediction_axis = c("row", "col"))
mat |
Input IterableMatrix |
latent_data |
Data to regress out, as a |
prediction_axis |
Which axis corresponds to prediction outputs from the linear models (e.g. the gene axis in typical single cell analysis). Options include "row" (default) and "col". |
Conceptually, regress_out
calculates a linear least squares best fit model for each row of the matrix.
(Or column if prediction_axis
is "col"
).
The input data for each regression model are the columns of latent_data
, and each model tries to
predict the values in the corresponding row (or column) of mat
. After fitting each model, regress_out
will subtract the model predictions from the input values, aiming to only retain effects that are
not explained by the variables in latent_data
.
These models can be fit efficiently since they all share the same input data and so most of the calculations for the closed-form best fit solution are shared. A QR factorization of the model matrix and a dense matrix-vector multiply are sufficient to fully calculate the residual values.
Efficiency considerations: As the output matrix is dense rather than sparse, mean and variance calculations may
run comparatively slowly. However, PCA and matrix/vector multiply operations can be performed at nearly the same
cost as the input matrix due to mathematical simplifications. Memory usage scales with n_features * ((nrow(mat) + ncol(mat))
.
Generally, n_features == ncol(latent_data)
, but for categorical variables in latent_data
, each
category will be expanded into its own indicator variable. Memory usage will therefore be higher when
using categorical input variables with many (i.e. >100) distinct values.
IterableMatrix
Rotate ggplot x axis labels
rotate_x_labels(degrees = 45)
rotate_x_labels(degrees = 45)
degrees |
Number of degrees to rotate by |
Calculate pearson residuals of a negative binomial sctransform model.
Normalized values are calculated as (X - mu) / sqrt(mu + mu^2/theta)
.
mu is calculated as cell_read_counts * gene_beta
.
sctransform_pearson( mat, gene_theta, gene_beta, cell_read_counts, min_var = -Inf, clip_range = c(-10, 10), columns_are_cells = TRUE, slow = FALSE )
sctransform_pearson( mat, gene_theta, gene_beta, cell_read_counts, min_var = -Inf, clip_range = c(-10, 10), columns_are_cells = TRUE, slow = FALSE )
mat |
IterableMatrix (raw counts) |
gene_theta |
Vector of per-gene thetas (overdispersion values) |
gene_beta |
Vector of per-gene betas (expression level values) |
cell_read_counts |
Vector of total reads per (umi count for RNA) |
min_var |
Minimum value for clipping variance |
clip_range |
Length 2 vector of min and max clipping range |
columns_are_cells |
Whether the columns of the matrix correspond to cells (default) or genes |
slow |
If TRUE, use a 10x slower but more precise implementation (default FALSE) |
The parameterization used is somewhat simplified compared to the original SCTransform paper, in particular it uses a linear-scale rather than log-scale to represent the cell_read_counts and gene_beta variables. It also does not support the addition of arbitrary cell metadata (e.g. batch) to add to the negative binomial regression.
IterableMatrix
Subset, translate, or reorder cell IDs
select_cells(fragments, cell_selection)
select_cells(fragments, cell_selection)
fragments |
Input fragments object |
cell_selection |
List of cell IDs (numeric), names (character), or logical mask. |
Numeric cell IDs will be re-assigned in the order of cell_selection
.
The output cell ID n
will be taken from the input cell with ID/name cell_selection[n]
.
Subset, translate, or reorder chromosome IDs
select_chromosomes(fragments, chromosome_selection)
select_chromosomes(fragments, chromosome_selection)
fragments |
Input fragments object |
chromosome_selection |
List of chromosme IDs (numeric), or names (character), or logical mask. |
Numeric chromosome IDs will be re-assigned in the order of chromosome_selection
.
The output chromosome ID n
will be taken from the input chromosome with ID/name chromosome_selection[n]
.
Fragments can be subset based on overlapping (or not overlapping) a set of regions
select_regions( fragments, ranges, invert_selection = FALSE, zero_based_coords = !is(ranges, "GRanges") )
select_regions( fragments, ranges, invert_selection = FALSE, zero_based_coords = !is(ranges, "GRanges") )
fragments |
Input fragments object. |
ranges |
Peaks/ranges to overlap, given as GRanges, data.frame, or list. See
|
invert_selection |
If TRUE, select fragments not overlapping selected regions instead of only fragments overlapping the selected regions. |
zero_based_coords |
Whether to convert the ranges from a 1-based end-inclusive coordinate system to a 0-based end-exclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) |
Fragments object filtered according to the selected regions
Adjust labels and heights on trackplots. Labels are set as facet labels in ggplot2, and heights
are additional properties read by trackplot_combine()
to determine relative height of input plots.
set_trackplot_label(plot, labels) set_trackplot_height(plot, height) get_trackplot_height(plot)
set_trackplot_label(plot, labels) set_trackplot_height(plot, height) get_trackplot_height(plot)
plot |
ggplot object |
labels |
character vector of labels – must match existing number of facets in plot |
height |
New height. If numeric, adjusts relative height. If |
set_trackplot_label: ggplot object with adjusted facet labels
set_trackplot_height: ggplot object with adjusted trackplot height
get_trackplot_height: ggplot2::unit
object with height setting
Shifts start or end of fragments by a fixed amount, which can be useful to correct the Tn5 offset.
shift_fragments(fragments, shift_start = 0L, shift_end = 0L)
shift_fragments(fragments, shift_start = 0L, shift_end = 0L)
fragments |
Input fragments object |
shift_start |
How many basepairs to shift the start coords |
shift_end |
How many basepairs to shift the end coords |
The correct Tn5 offset is +/- 4bp since the Tn5 cut sites on opposite strands are offset by 9bp. However, +4/-5 bp is often applied to bed-format files, since the end coordinate in bed files is 1 past the last basepair of the sequenced DNA fragment. This results in a bed-like format except with inclusive end coordinates.
Shifted fragments object
Subset fragments by length
subset_lengths(fragments, min_len = 0L, max_len = NA_integer_)
subset_lengths(fragments, min_len = 0L, max_len = NA_integer_)
fragments |
Input fragments object |
min_len |
Minimum bases in fragment (inclusive) |
max_len |
Maximum bases in fragment (inclusive) |
Fragment length is calculated as end-start
Fragments object
Use the C++ Spectra solver (same as RSpectra package), in order to
compute the largest k values and corresponding singular vectors.
Empirically, memory usage is much lower than using irlba::irlba()
, likely
due to avoiding R garbage creation while solving due to the pure-C++ solver.
This documentation is a slightly-edited version of the RSpectra::svds()
documentation.
svds(A, k, nu = k, nv = k, opts = list(), threads=0L, ...)
svds(A, k, nu = k, nv = k, opts = list(), threads=0L, ...)
A |
The matrix whose truncated SVD is to be computed. |
k |
Number of singular values requested. |
nu |
Number of right singular vectors to be computed. This must be between 0 and 'k'. (Must be equal to 'k' for BPCells IterableMatrix) |
opts |
Control parameters related to computing algorithm. See Details below |
threads |
Control threads to use calculating mat-vec producs (BPCells specific) |
When RSpectra is installed, this function will just add a method to
RSpectra::svds()
for the IterableMatrix
class.
The opts
argument is a list that can supply any of the
following parameters:
ncv
Number of Lanzcos basis vectors to use. More vectors
will result in faster convergence, but with greater
memory use. ncv
must be satisfy
where
p = min(m, n)
.
Default is min(p, max(2*k+1, 20))
.
tol
Precision parameter. Default is 1e-10.
maxitr
Maximum number of iterations. Default is 1000.
center
Either a logical value (TRUE
/FALSE
), or a numeric
vector of length . If a vector
is supplied, then
SVD is computed on the matrix
,
in an implicit way without actually forming this matrix.
center = TRUE
has the same effect as
center = colMeans(A)
. Default is FALSE
. Ignored in BPCells
scale
Either a logical value (TRUE
/FALSE
), or a numeric
vector of length . If a vector
is supplied, then
SVD is computed on the matrix
,
where
is the centering vector and
.
If
scale = TRUE
, then the vector is computed as
the column norm of
.
Default is
FALSE
. Ignored in BPCells
A list with the following components:
d |
A vector of the computed singular values. |
u |
An |
v |
An |
nconv |
Number of converged singular values. |
niter |
Number of iterations used. |
nops |
Number of matrix-vector multiplications used. |
Qiu Y, Mei J (2022). RSpectra: Solvers for Large-Scale Eigenvalue and SVD Problems. R package version 0.16-1, https://CRAN.R-project.org/package=RSpectra.
Calculate ranges x cells tile overlap matrix
tile_matrix( fragments, ranges, mode = c("insertions", "fragments"), zero_based_coords = !is(ranges, "GRanges"), explicit_tile_names = FALSE )
tile_matrix( fragments, ranges, mode = c("insertions", "fragments"), zero_based_coords = !is(ranges, "GRanges"), explicit_tile_names = FALSE )
fragments |
Input fragments object |
ranges |
Tiled regions given as GRanges, data.frame, or list. See
Must be non-overlapping and sorted by
(chr, start), with chromosomes ordered according to the chromosome names of |
mode |
Mode for counting tile overlaps. (See "value" section for more detail) |
zero_based_coords |
Whether to convert the ranges from a 1-based end-inclusive coordinate system to a 0-based end-exclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) |
explicit_tile_names |
Boolean for whether to add rownames to the output matrix in format e.g chr1:500-1000, where start and end coords are given in a 0-based coordinate system. For whole-genome Tile matrices the names will take ~5 seconds to generate and take up 400MB of memory. Note that either way, tile names will be written when the matrix is saved. |
Iterable matrix object with dimension ranges x cells. When saved, the column names will be in the format chr1:500-1000, where start and end coords are given in a 0-based coordinate system.
mode
options
"insertions"
: Start and end coordinates are separately overlapped with each tile
"fragments"
: Like "insertions"
, but each fragment can contribute at most 1 count
to each tile, even if both the start and end coordinates overlap
When calculating the matrix directly from a fragments tsv, it's necessary to first call select_chromosomes()
in order to
provide the ordering of chromosomes to expect while reading the tsv.
Combines multiple track plots of the same region into a single grid.
Uses the patchwork
package to perform the alignment.
trackplot_combine( tracks, side_plot = NULL, title = NULL, side_plot_width = 0.3 )
trackplot_combine( tracks, side_plot = NULL, title = NULL, side_plot_width = 0.3 )
tracks |
List of tracks in order from top to bottom, generally ggplots as output from
the other |
side_plot |
Optional plot to align to the right (e.g. RNA expression per cluster). Will be aligned to the first
|
title |
Text for overarching title of the plot |
side_plot_width |
Fraction of width that should be used for the side plot relative to the main track area |
A plot object with aligned genome plots. Each aligned row has the text label, y-axis, and plot body. The relative height of each row is given by heights. A shared title and x-axis are put at the top.
trackplot_coverage()
, trackplot_gene()
, trackplot_loop()
, trackplot_scalebar()
Plot a pseudobulk genome track, showing the number of fragment insertions across a region for each cell type or group.
trackplot_coverage( fragments, region, groups, cell_read_counts, group_order = NULL, bins = 500, clip_quantile = 0.999, colors = discrete_palette("stallion"), legend_label = NULL, zero_based_coords = !is(region, "GRanges"), return_data = FALSE )
trackplot_coverage( fragments, region, groups, cell_read_counts, group_order = NULL, bins = 500, clip_quantile = 0.999, colors = discrete_palette("stallion"), legend_label = NULL, zero_based_coords = !is(region, "GRanges"), return_data = FALSE )
fragments |
Fragments object |
region |
Region to plot, e.g. output from |
groups |
Vector with one entry per cell, specifying the cell's group |
cell_read_counts |
Numeric vector of read counts for each cell (used for normalization) |
group_order |
Optional vector listing ordering of groups |
bins |
Number of bins to plot across the region |
clip_quantile |
(optional) Quantile of values for clipping y-axis limits. Default of 0.999 will crop out just the most extreme outliers across the region. NULL to disable clipping |
colors |
Character vector of color values (optionally named by group) |
legend_label |
Custom label to put on the legend (no longer used as color legend is not shown anymore) |
zero_based_coords |
Whether to convert the ranges from a 1-based end-inclusive coordinate system to a 0-based end-exclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) |
return_data |
If true, return data from just before plotting rather than a plot. |
scale_bar |
Whether to include a scale bar in the top track ( |
Returns a combined plot of pseudobulk genome tracks. For compatability with
draw_trackplot_grid()
, the extra attribute $patches$labels
will be added to
specify the labels for each track. If return_data
or return_plot_list
is
TRUE
, the return value will be modified accordingly.
trackplot_combine()
, trackplot_gene()
, trackplot_loop()
, trackplot_scalebar()
Plot transcript models
trackplot_gene( transcripts, region, exon_size = 2.5, gene_size = 0.5, label_size = 11 * 0.8/ggplot2::.pt, track_label = "Genes", return_data = FALSE )
trackplot_gene( transcripts, region, exon_size = 2.5, gene_size = 0.5, label_size = 11 * 0.8/ggplot2::.pt, track_label = "Genes", return_data = FALSE )
transcripts |
Transcipt features given as GRanges, data.frame, or list. See
Usually given as the output from |
region |
Region to plot, e.g. output from |
exon_size |
size for exon lines in units of mm |
gene_size |
size for intron/gene lines in units of mm |
label_size |
size for transcript labels in units of mm |
return_data |
If true, return data from just before plotting rather than a plot. |
labels |
Character vector with labels for each item in transcripts. NA for items that should not be labeled |
transcript_size |
size for transcript lines in units of mm |
Plot of gene locations
trackplot_combine()
, trackplot_coverage()
, trackplot_loop()
, trackplot_scalebar()
Plot range-based annotation tracks (e.g. peaks)
trackplot_genome_annotation( loci, region, color_by = NULL, colors = NULL, label_by = NULL, label_size = 11 * 0.8/ggplot2::.pt, show_strand = FALSE, annotation_size = 2.5, track_label = "Peaks", return_data = FALSE )
trackplot_genome_annotation( loci, region, color_by = NULL, colors = NULL, label_by = NULL, label_size = 11 * 0.8/ggplot2::.pt, show_strand = FALSE, annotation_size = 2.5, track_label = "Peaks", return_data = FALSE )
loci |
Genomic loci given as GRanges, data.frame, or list. See
|
region |
Region to plot, e.g. output from |
color_by |
Name of a metadata column in |
colors |
Vector of hex color codes to use for the color scale. For numeric |
label_by |
Name of a metadata column in |
label_size |
size for labels in units of mm |
show_strand |
If TRUE, show strand direction as arrows |
annotation_size |
size for annotation lines in mm |
return_data |
If true, return data from just before plotting rather than a plot. |
Plot of genomic loci if return_data is FALSE, otherwise returns the data frame used to generate the plot
trackplot_combine()
, trackplot_coverage()
, trackplot_loop()
, trackplot_scalebar()
, trackplot_gene()
Plot loops
trackplot_loop( loops, region, color_by = NULL, colors = NULL, allow_truncated = TRUE, curvature = 0.75, track_label = "Links", return_data = FALSE )
trackplot_loop( loops, region, color_by = NULL, colors = NULL, allow_truncated = TRUE, curvature = 0.75, track_label = "Links", return_data = FALSE )
loops |
Genomic regions given as GRanges, data.frame, or list. See
|
region |
Region to plot, e.g. output from |
color_by |
Name of a metadata column in |
colors |
Vector of hex color codes to use for the color scale. For numeric |
allow_truncated |
If FALSE, remove any loops that are not fully contained within |
curvature |
Curvature value between 0 and 1. 1 is a 180-degree arc, and 0 is flat lines. |
return_data |
If true, return data from just before plotting rather than a plot. |
Plot of loops connecting genomic coordinates
trackplot_combine()
, trackplot_coverage()
, trackplot_gene()
, trackplot_scalebar()
, trackplot_genome_annotation()
Plots a human-readable scale bar and coordinates of the region being plotted
trackplot_scalebar(region, font_pt = 11)
trackplot_scalebar(region, font_pt = 11)
region |
Region to plot, e.g. output from |
font_pt |
Font size for scale bar labels in units of pt. |
Plot with coordinates and scalebar for plotted genomic region
trackplot_combine()
, trackplot_coverage()
, trackplot_gene()
, trackplot_loop()
Transpose the storage order for a matrix
transpose_storage_order( matrix, outdir = tempfile("transpose"), tmpdir = tempdir(), load_bytes = 4194304L, sort_bytes = 1073741824L )
transpose_storage_order( matrix, outdir = tempfile("transpose"), tmpdir = tempdir(), load_bytes = 4194304L, sort_bytes = 1073741824L )
matrix |
Input matrix |
outdir |
Directory to store the output |
tmpdir |
Temporary directory to use for intermediate storage |
load_bytes |
The minimum contiguous load size during the merge sort passes |
sort_bytes |
The amount of memory to allocate for re-sorting chunks of entries |
This re-sorts the entries of a matrix to change the storage order from row-major to col-major. For large matrices, this can be slow – around 2 minutes to transpose a 500k cell RNA-seq matrix The default load_bytes (4MiB) and sort_bytes (1GiB) parameters allow ~85GB of data to be sorted with two passes through the data, and ~7.3TB of data to be sorted in three passes through the data.
MatrixDir object with a copy of the input matrix, but the storage order flipped
BPCells fragments can be read/written in compressed (bitpacked) or uncompressed form in a variety of storage locations: in memory (as an R object), in an hdf5 file, or in a directory on disk (containing binary files).
write_fragments_memory(fragments, compress = TRUE) write_fragments_dir( fragments, dir, compress = TRUE, buffer_size = 1024L, overwrite = FALSE ) open_fragments_dir(dir, buffer_size = 1024L) write_fragments_hdf5( fragments, path, group = "fragments", compress = TRUE, buffer_size = 8192L, chunk_size = 1024L, overwrite = FALSE, gzip_level = 0L ) open_fragments_hdf5(path, group = "fragments", buffer_size = 16384L)
write_fragments_memory(fragments, compress = TRUE) write_fragments_dir( fragments, dir, compress = TRUE, buffer_size = 1024L, overwrite = FALSE ) open_fragments_dir(dir, buffer_size = 1024L) write_fragments_hdf5( fragments, path, group = "fragments", compress = TRUE, buffer_size = 8192L, chunk_size = 1024L, overwrite = FALSE, gzip_level = 0L ) open_fragments_hdf5(path, group = "fragments", buffer_size = 16384L)
fragments |
Input fragments object |
compress |
Whether or not to compress the data. With compression, storage size is be about half the size of a gzip-compressed 10x fragments file. |
dir |
Directory to read/write the data from |
buffer_size |
For performance tuning only. The number of items to be bufferred in memory before calling writes to disk. |
overwrite |
If |
path |
Path to the hdf5 file on disk |
group |
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use |
chunk_size |
For performance tuning only. The chunk size used for the HDF5 array storage. |
gzip_level |
Gzip compression level. Default is 0 (no compression). This is recommended when both compression and compatibility with outside programs is required. Otherwise, using compress=TRUE is recommended as it is >10x faster with often similar compression levels. |
Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing hdf5 files to group data together, and the in memory format provides the fastest performance in the event memory usage is unimportant.
Fragment object
Write insertion counts data for one or more pseudobulks to bedgraph format. This reports the total
number insertions at each basepair for each group listed in cell_groups
.
write_insertion_bedgraph( fragments, path, cell_groups = rlang::rep_along(cellNames(fragments), "all"), insertion_mode = c("both", "start_only", "end_only") )
write_insertion_bedgraph( fragments, path, cell_groups = rlang::rep_along(cellNames(fragments), "all"), insertion_mode = c("both", "start_only", "end_only") )
fragments |
IterableFragments object |
path |
(character vector) Path(s) to save bedgraph to, optionally ending in ".gz" to add gzip compression. If |
cell_groups |
Character or factor assigning a group to each cell, in order of
|
insertion_mode |
(string) Which fragment ends to use for coverage calculation. One of "both", "start_only", or "end_only" |
BPCells matrices are stored in sparse format, meaning only the non-zero entries are stored. Matrices can store integer counts data or decimal numbers (float or double). See details for more information.
write_matrix_memory(mat, compress = TRUE) write_matrix_dir( mat, dir, compress = TRUE, buffer_size = 8192L, overwrite = FALSE ) open_matrix_dir(dir, buffer_size = 8192L) write_matrix_hdf5( mat, path, group, compress = TRUE, buffer_size = 8192L, chunk_size = 1024L, overwrite = FALSE, gzip_level = 0L ) open_matrix_hdf5(path, group, buffer_size = 16384L)
write_matrix_memory(mat, compress = TRUE) write_matrix_dir( mat, dir, compress = TRUE, buffer_size = 8192L, overwrite = FALSE ) open_matrix_dir(dir, buffer_size = 8192L) write_matrix_hdf5( mat, path, group, compress = TRUE, buffer_size = 8192L, chunk_size = 1024L, overwrite = FALSE, gzip_level = 0L ) open_matrix_hdf5(path, group, buffer_size = 16384L)
compress |
Whether or not to compress the data. |
dir |
Directory to save the data into |
buffer_size |
For performance tuning only. The number of items to be buffered in memory before calling writes to disk. |
overwrite |
If |
path |
Path to the hdf5 file on disk |
group |
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use |
chunk_size |
For performance tuning only. The chunk size used for the HDF5 array storage. |
gzip_level |
Gzip compression level. Default is 0 (no compression). This is recommended when both compression and compatibility with outside programs is required. Otherwise, using compress=TRUE is recommended as it is >10x faster with often similar compression levels. |
matrix |
Input matrix, either IterableMatrix or dgCMatrix |
Matrices can be stored in a directory on disk, in memory, or in an HDF5 file. Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing hdf5 files to group data together, and the in memory format provides the fastest performance in the event memory usage is unimportant.
For typical RNA counts matrices holding integer counts, this bitpacking compression will result in 6-8x less space than an R dgCMatrix, and 4-6x smaller than a scipy csc_matrix. The compression will be more effective when the count values in the matrix are small, and when the rows of the matrix are sorted by rowMeans. In tests on RNA-seq data optimal ordering could save up to 40% of storage space. On non-integer data only the row indices are compressed, not the values themselves so space savings will be smaller.
For non-integer data matrices, bitpacking compression is much less effective, as it can only be applied to the indexes of each entry but not the values. There will still be some space savings, but far less than for counts matrices.
BPCells matrix object