Convenience functions for adding or multiplying each row / column of a matrix by a number.
add_rows(mat, vec)
add_cols(mat, vec)
multiply_rows(mat, vec)
multiply_cols(mat, vec)
mat 
Matrixlike object 
vec 
Numeric vector 
Matrixlike object
A matrix object can either be an input (i.e. a file on disk or a raw matrix in memory),
or it can represent a delayed operation on one or more matrices. The all_matrix_inputs()
getter and setter functions allow accessing the baselevel input matrices as a list, and
changing them. This is useful if you want to relocate data on disk without losing your
transformed BPCells matrix. (Note: experimental API; potentially subject to revisions).
all_matrix_inputs(x)
all_matrix_inputs(x) < value
x 
IterableMatrix 
value 
List of IterableMatrix objects 
List of IterableMatrix objects. If a matrix m
is itself an input object, then
all_matrix_inputs(m)
will return list(m)
.
Binarize compares the matrix element values to the
threshold value and sets the output elements to either zero
or one. By default, element values greater than the threshold
are set to one; otherwise, set to zero. When strict_inequality
is set to FALSE, element values greater than or equal to the
threshold are set to one. As an alternative, the <
, <=
, >
,
and >=
operators are also supported.
binarize(mat, threshold = 0, strict_inequality = TRUE)
mat 
IterableMatrix 
threshold 
A numeric value that determines whether the elements of x are set to zero or one. 
strict_inequality 
A logical value determining whether the comparison to the threshold is >= (strict_inequality=FALSE) or > (strict_inequality=TRUE). 
binarized IterableMatrix object
Calling peaks from a preset list of tiles can be much faster than using
dedicated peakcalling software like macs3
. The resulting peaks are less
precise in terms of exact coordinates, but should be sufficient for most
analyses.
call_peaks_tile(
fragments,
chromosome_sizes,
cell_groups = rep.int("all", length(cellNames(fragments))),
effective_genome_size = NULL,
peak_width = 200,
peak_tiling = 3,
fdr_cutoff = 0.01,
merge_peaks = c("all", "group", "none")
)
fragments 
IterableFragments object 
chromosome_sizes 
Chromosome start and end coordinates given as GRanges, data.frame, or list. See
See 
cell_groups 
Grouping vector with one entry per cell in fragments, e.g. cluster IDs 
effective_genome_size 
(Optional) effective genome size for poisson background rate estimation. See deeptools for values for common genomes. Defaults to sum of chromosome sizes, which overestimates peak significance 
peak_width 
Width of candidate peaks 
peak_tiling 
Number of candidate peaks overlapping each base of genome. E.g. peak_width = 300 and peak_tiling = 3 results in candidate peaks of 300bp spaced 100bp apart 
fdr_cutoff 
Adjusted pvalue significance cutoff 
merge_peaks 
How to merge significant peaks with

Peak calling steps:
Estimate the genomewide expected insertions per tile based on
peak_width
, effective_genome_size
, and pergroup read counts
Tile the genome with nonoverlapping tiles of size peak_width
For each tile and group, calculate p_value based on a Poisson model
Compute adjusted pvalues using BH method and using the total number of tiles as the number of hypotheses tested.
Repeat steps 24 peak_tiling
times, with evenly spaced offsets
If merge_peaks
is "all" or "group": use merge_peaks_iterative()
within each group to keep only the most
significant of the overlapping candidate peaks
If merge_peaks
is "all", perform a final round of merge_peaks_iterative()
,
prioritizing each peak by its withingroup significance rank
tibble with peak calls and the following columns:
chr
, start
, end
: genome coordinates
group
: group ID that this peak was identified in
p_val
, q_val
: Poission pvalue and BHcorrected pvalue
enrichment
: Enrichment of counts in this peak compared to a genomewide
background
Cluster an adjacency matrix
cluster_graph_leiden(snn, resolution = 0.001, seed = 12531, ...)
cluster_graph_louvain(snn, resolution = 1, seed = 12531)
cluster_graph_seurat(snn, resolution = 0.8, ...)
snn 
Symmetric adjacency matrix (dgCMatrix) output from e.g. knn_to_snn_graph. Only the lower triangle is used 
resolution 
Resolution parameter. Higher values result in more clusters 
seed 
Random seed for clustering initialization 
... 
Additional arguments to underlying clustering function 
cluster_graph_leiden: Leiden graph clustering algorithm igraph::cluster_leiden()
cluster_graph_louvain: Louvain graph clustering algorithm igraph::cluster_louvain()
cluster_graph_seurat: Seurat's clustering algorithm Seurat::FindClusters()
Factor vector containing the cluster assignment for each cell.
Converts a vector of membership IDs into a sparse matrix
cluster_membership_matrix(groups, group_order = NULL)
groups 
Vector with one entry per cell, specifying the cell's group 
group_order 
Optional vector listing ordering of groups 
cell x group matrix where an entry is 1 when a cell is in a given group
Helper function for data on features to plot from a diverse set of data sources.
collect_features(
source,
features = NULL,
gene_mapping = human_gene_mapping,
n = 1
)
source 
Matrix or data frame to pull features from, or a vector of feature values for a single feature. For a matrix, the features must be rows. 
features 
Character vector of features names to plot if 
gene_mapping 
An optional vector for gene name matching with match_gene_symbol(). Ignored if source is a data frame. 
n 
Internaluse parameter marking the number of nested calls. This is used for finding the name of the "source" input variable from the caller's perspective 
If source
is a data.frame, features will be drawn from the columns. If
source
is a matrix object (IterableMatrix
, dgCMatrix
, or matrix
), features
will be drawn from rows.
Data frame with one column for each feature requested
Convert the type of a matrix
convert_matrix_type(matrix, type = c("uint32_t", "double", "float"))
matrix 
IterableMatrix object input 
type 
One of uint32_t (unsigned 32bit integer), float (32bit real number), or double (64bit real number) 
IterableMatrix object
BPCells fragments can be interconverted with GRanges and data.frame R objects.
The main conversion method is R's builtin as()
function, though the
convert_to_fragments()
helper is also available. For all R objects except
GRanges, BPCells assumes a 0based, endexclusive coordinate system. (See
genomicrangeslike reference for details)
# Convert from R to BPCells
convert_to_fragments(x, zero_based_coords = !is(x, "GRanges"))
as(x, "IterableFragments")
# Convert from BPCells to R
as.data.frame(bpcells_fragments)
as(bpcells_fragments, "data.frame")
as(bpcells_fragments, "GRanges")
x 
Fragment coordinates given as GRanges, data.frame, or list. See

zero_based_coords 
Whether to convert the ranges from a 1based endinclusive coordinate system to a 0based endexclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) 
convert_to_fragments(): IterableFragments object
These color palettes are derived from the ArchR color palettes, and provide large sets of distinguishable colors
discrete_palette(name, n = 1)
name 
Name of the color palette 
n 
Minimum number of colors needed 
If the requested number of colors is too large, a new palette will be constructed via interpolation from the requested palette
Combine ggplot track plots into an aligned grid. Uses patchwork to perform the alignment
draw_trackplot_grid(
...,
labels,
title = NULL,
heights = rep(1, length(plots)),
label_width = 0.2,
label_style = list(fontface = "bold", size = 4)
)
... 
Plots in order from top to bottom, generally plain ggplots.
To better accomodate many bulk tracks, patchwork objects which contain multiple
tracks are also accepted. In this case, plot labels will be drawn from the
attribute 
labels 
Text labels to display for each track 
title 
Text for overarching title of the plot 
heights 
Relative heights for each component plot. It is suggested to use 1 as standard height of a pseudobulk track. 
label_width 
Fraction of width that should be used for labels relative to the main track area 
label_style 
Arguments to pass to geom_text to adjust label text style 
A plot object with aligned genome plots. Each aligned row has the text label, yaxis, and plot body. The relative height of each row is given by heights. A shared title and xaxis are put at the top.
Extend genome ranges in a strandaware fashion.
extend_ranges(
ranges,
upstream = 0,
downstream = 0,
metadata_cols = c("strand"),
chromosome_sizes = NULL,
zero_based_coords = !is(ranges, "GRanges")
)
ranges 
Genomic regions given as GRanges, data.frame, or list. See

upstream 
Number of bases to extend each range upstream (negative to shrink width) 
downstream 
Number of bases to extend each range downstream (negative to shrink width) 
metadata_cols 
Optional list of metadata columns to require & extract 
chromosome_sizes 
(optional) Size of chromosomes as a genomicranges object 
zero_based_coords 
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range 
Note that ranges will be blocked from extending past the beginning of the chromosome (base 0),
and if chromosome_sizes
is given then they will also be blocked from extending past the end of the chromosome
Get footprints around a set of genomic coordinates
footprint(
fragments,
ranges,
zero_based_coords = !is(ranges, "GRanges"),
cell_groups = rlang::rep_along(cellNames(fragments), "all"),
cell_weights = rlang::rep_along(cell_groups, 1),
flank = 125L,
normalization_width = flank%/%10L
)
fragments 
IterableFragments object 
ranges 
Footprint centers given as GRanges, data.frame, or list. See
"+" strand ranges will footprint around the start coordinate, and "" strand ranges around the end coordinate. 
zero_based_coords 
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range 
cell_groups 
Character or factor assigning a group to each cell, in order of

cell_weights 
Numeric vector assigning weight factors
(e.g. inverse of total reads) to each cell, in order of 
flank 
Number of flanking basepairs to include on either side of the motif 
normalization_width 
Number of basepairs at the upstream + downstream extremes to use for calculating enrichment 
tibble::tibble()
with columns group
, position
, and count
, enrichment
Check if two fragments objects are identical
fragments_identical(fragments1, fragments2)
fragments1 
First IterableFragments to compare 
fragments2 
Second IterableFragments to compare 
boolean for whether the fragments objects are identical
Conveniently look up the region of a gene by gene symbol. The value returned by this function
can be used as the region
argument for trackplot functions such as
trackplot_bulk()
or trackplot_gene()
gene_region(
genes,
gene_symbol,
extend_bp = 10000,
gene_mapping = human_gene_mapping
)
genes 
GRanges, list, or data.frame of transcript features to plot. Required attributes are:

gene_symbol 
Name of gene symbol or ID 
extend_bp 
Bases to extend region upstream and downstream of gene 
gene_mapping 
Named vector where names are gene symbols or IDs and values are canonical gene symbols 
list of chr, start, end positions for use with , etc.
ArchRstyle gene activity scores are based on a weighted sum of each tile according to the signed distance from the tile to a gene body. This function calculates the signed distances according to ArchR's default parameters.
gene_score_tiles_archr(
genes,
chromosome_sizes = NULL,
tile_width = 500,
addArchRBug = FALSE
)
genes 
Gene coordinates given as GRanges, data.frame, or list. See

chromosome_sizes 
(optional) Size of chromosomes as a genomicranges object 
tile_width 
Size of tiles to consider 
addArchRBug 
Replicate ArchR bug in handling nested genes 
ArchR's tile distance algorithm works as follows
Genes are extended 5kb upstream
Genes are linked to any tiles 1kb100kb upstream + downstream, but tiles beyond a neighboring gene are not considered
Tibble with one range per tile, with additional metadata columns gene_idx (row index of the gene this tile corresponds to) and distance.
Distance is a signed distance calculated such that if the tile has a smaller start coordinate than the gene and the gene is on the + strand, distance will be negative. The distance of adjacent but nonoverlapping regions is 1bp, counting up from there.
Gene activity scores can be calculated as a distanceweighted sum of pertile accessibility.
The tile weights for each gene can be represented as a sparse matrix of dimension genes x tiles.
If we multiply this weight matrix by a corresponding tile matrix (tiles x cells), then we can
get a gene activity score matrix of genes x cells. gene_score_weights_archr()
calculates the
weight matrix (best if you have a precomputed tile matrix), while gene_score_archr()
provides
a easytouse wrapper.
gene_score_weights_archr(
genes,
chromosome_sizes,
blacklist = NULL,
tile_width = 500,
gene_name_column = "gene_id",
addArchRBug = FALSE
)
gene_score_archr(
fragments,
genes,
chromosome_sizes,
blacklist = NULL,
tile_width = 500,
gene_name_column = "gene_id",
addArchRBug = FALSE,
tile_max_count = 4,
scale_factor = 10000,
tile_matrix_path = tempfile(pattern = "gene_score_tile_mat")
)
genes 
Gene coordinates given as GRanges, data.frame, or list. See

chromosome_sizes 
Chromosome start and end coordinates given as GRanges, data.frame, or list. See
See 
blacklist 
Regions to exclude from calculations, given as GRanges, data.frame, or list. See

tile_width 
Size of tiles to consider 
gene_name_column 
If not NULL, a column name of 
addArchRBug 
Replicate ArchR bug in handling nested genes 
fragments 
Input fragments object 
tile_max_count 
Maximum value in the tile counts matrix. If not null, tile counts higher than this will be clipped to 
scale_factor 
If not null, counts for each cell will be scaled to sum to 
tile_matrix_path 
Path of a directory where the intermediate tile matrix will be saved 
gene_score_weights_archr:
Given a set of tile coordinates and distances returned by gene_score_tiles_archr()
,
calculate a weight matrix of dimensions genes x tiles. This matrix can be
multiplied with a tile matrix to obtain ArchRcompatible gene activity scores.
gene_score_weights_archr
Weight matrix of dimension genes x tiles
gene_score_archr
Gene score matrix of dimension genes x cells.
BPCells accepts a flexible set of genomic rangeslike objects as input, either
GRanges, data.frame, lists, or character vectors. These objects must specify chromosome, start,
and end coordinates along with optional metadata about each range.
With the exception of GenomicRanges::GRanges
objects, BPCells assumes all
objects use a zerobased, endexclusive coordinate system (see below for details).
BPCells can interpret the following types as ranges:
list()
, data.frame()
, with columns:
chr
: Character or factor of chromosome names
start
: Start coordinates (0based)
end
: End coordinates (exclusive)
(optional) strand
: "+"
/""
or TRUE
/FALSE
for pos/neg strand
(optional) Additional metadata as named list
entries or data.frame
columns
GenomicRanges::GRanges
start(x)
is interpreted as a 1based start coordinate
end(x)
is interpreted as an inclusive end coordinate
strand(x)
: "*"
entries are interpeted as postive strand
(optional) mcols(x)
holds additional metadata
character
Given in format "chr1:10002000"
or "chr1:1,0002,000"
Uses 0based, endexclusive coordinate system
Cannot be used for ranges where additional metadata is required
There are two main conventions for the coordinate systems:
Onebased, endinclusive ranges
The first base of a chromosome is numbered 1
The last base in a range is equal to the end coordinate
e.g. 15 describes the first 5 bases of the chromosome
Used in formats such as SAM, GTF
In BPCells, used when reading or writing GenomicRanges::GRanges
objects
Zerobased, endexclusive ranges
The first base of a chromosome is numbered 0
The last base in a range is one less than the end coordinate
e.g. 05 describes the first 5 bases of the chromosome
Used in formats such as BAM, BED
In BPCells, used for all other range objects
Mapping of the canonical gene symbols corresponding to each unambiguous alias, previous symbol, ensembl ID, or entrez ID.
human_gene_mapping
mouse_gene_mapping
human_gene_mapping
A named character vector. Names are aliases or IDs and values are the corresponding canonical gene symbol
mouse_gene_mapping
A named character vector. Names are aliases or IDs and values are the corresponding canonical gene symbol
See the source code in dataraw/human_gene_mapping.R
and
dataraw/mouse_gene_mapping.R
for exactly how these mappings were made.
human_gene_mapping
http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/non_alt_loci_set.txt
mouse_gene_mapping
http://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt http://www.informatics.jax.org/downloads/reports/MRK_ENSEMBL.rpt
Read a sparse matrix from a MatrixMarket file. This is a textbased format used by 10x, Parse, and others to store sparse matrices. Format details on the NIST website.
import_matrix_market(
mtx_path,
outdir = tempfile("matrix_market"),
row_names = NULL,
col_names = NULL,
row_major = FALSE,
tmpdir = tempdir(),
load_bytes = 4194304L,
sort_bytes = 1073741824L
)
import_matrix_market_10x(
mtx_dir,
outdir = tempfile("matrix_market"),
feature_type = NULL,
row_major = FALSE,
tmpdir = tempdir(),
load_bytes = 4194304L,
sort_bytes = 1073741824L
)
mtx_path 
Path of mtx or mtx.gz file 
outdir 
Directory to store the output 
row_names 
Character vector of row names 
col_names 
Character vector of col names 
row_major 
If true, store the matrix in rowmajor orientation 
tmpdir 
Temporary directory to use for intermediate storage 
load_bytes 
The minimum contiguous load size during the merge sort passes 
sort_bytes 
The amount of memory to allocate for resorting chunks of entries 
mtx_dir 
Directory holding matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz 
feature_type 
String or vector of feature types to include. (cellranger 3.0 and newer) 
Import MatrixMarket mtx files to the BPCells format. This implementation ensures fixed memory usage even for very large inputs by doing ondisk sorts. It will be much slower than hdf5 inputs, so only use MatrixMarket format when absolutely necessary.
As a rough speed estimate, importing the 17GB Parse
1M PBMC
DGE_1M_PBMC.mtx
file takes about 4 minutes and 1.3GB of RAM, producing a compressed output matrix of 1.5GB. mtx.gz
files will be slower to import due to gzip decompression.
When importing from 10x mtx files, the row and column names can be read automatically
using the import_matrix_market_10x()
convenience function.
MatrixDir object with the imported matrix
Methods for IterableFragments objects
## S4 method for signature 'IterableFragments'
show(object)
cellNames(x)
cellNames(x, ...) < value
chrNames(x)
chrNames(x, ...) < value
object 
IterableFragments object 
x 
an IterableFragments object 
value 
Character vector of new names 
cellNames<
It is only possible to replace names, not add new names.
chrNames<
It is only possible to replace names, not add new names.
cellNames()
Character vector of cell names, or NULL if none are known
chrNames()
: Character vector of chromosome names, or NULL if none are known
show(IterableFragments)
: Print IterableFragments
cellNames()
: Get cell names
cellNames(x, ...) < value
: Set cell names
chrNames()
: Set chromosome names
chrNames(x, ...) < value
: Set chromosome names
Generic methods and builtin functions for IterableMatrix objects
matrix_type(x)
storage_order(x)
## S4 method for signature 'IterableMatrix'
show(object)
## S4 method for signature 'IterableMatrix'
t(x)
## S4 method for signature 'IterableMatrix,matrix'
x %*% y
## S4 method for signature 'IterableMatrix'
rowSums(x)
## S4 method for signature 'IterableMatrix'
colSums(x)
## S4 method for signature 'IterableMatrix'
rowMeans(x)
## S4 method for signature 'IterableMatrix'
colMeans(x)
## S4 method for signature 'IterableMatrix'
log1p(x)
log1p_slow(x)
## S4 method for signature 'IterableMatrix'
expm1(x)
expm1_slow(x)
## S4 method for signature 'IterableMatrix,numeric'
e1 ^ e2
pow_slow(x, exponent)
## S4 method for signature 'numeric,IterableMatrix'
e1 < e2
## S4 method for signature 'IterableMatrix,numeric'
e1 > e2
## S4 method for signature 'numeric,IterableMatrix'
e1 <= e2
## S4 method for signature 'IterableMatrix,numeric'
e1 >= e2
## S4 method for signature 'IterableMatrix'
round(x, digits = 0)
## S4 method for signature 'IterableMatrix,numeric'
e1 * e2
## S4 method for signature 'IterableMatrix,numeric'
e1 + e2
## S4 method for signature 'IterableMatrix,numeric'
e1 / e2
## S4 method for signature 'IterableMatrix,numeric'
e1  e2
x 
IterableMatrix object 
object 
IterableMatrix object 
y 
matrix 
t()
Transposed object
x %*% y
: dense matrix result
rowSums()
: vector of row sums
colSums()
: vector of col sums
rowMeans()
: vector of row means
colMeans()
: vector of col means
matrix_type()
: Get the matrix data type (mat_uint32_t, mat_float, or mat_double for now)
storage_order()
: Get the matrix storage order ("row" or "col")
show(IterableMatrix)
: Display an IterableMatrix
t(IterableMatrix)
: Transpose an IterableMatrix
x %*% y
: Multiply by a dense matrix
rowSums(IterableMatrix)
: Calculate rowSums
colSums(IterableMatrix)
: Calculate colSums
rowMeans(IterableMatrix)
: Calculate rowMeans
colMeans(IterableMatrix)
: Calculate colMeans
log1p(IterableMatrix)
: Calculate log(x + 1)
log1p_slow()
: Calculate log(x + 1) (nonSIMD version)
expm1(IterableMatrix)
: Calculate exp(x)  1
expm1_slow()
: Calculate exp(x)  1 (nonSIMD version)
e1^e2
: Calculate x^y (elementwise)
pow_slow()
: Calculate x^y (elementwise, nonSIMD version)
e1 < e2
: Binarize matrix according to numeric < matrix comparison
e1 > e2
: Binarize matrix according to matrix > numeric comparison
e1 <= e2
: Binarize matrix according to numeric <= matrix comparison
e1 >= e2
: Binarize matrix according to matrix >= numeric comparison
round(IterableMatrix)
: round to nearest integer (digits must be 0)
e1 * e2
: Multiply by a constant, or multiply rows by a vector length nrow(mat)
e1 + e2
: Add a constant, or rowwise addition with a vector length nrow(mat)
e1 / e2
: Divide by a constant, or divide rows by a vector length nrow(mat)
e1  e2
: Subtract a constant, or rowwise subtraction with a vector length nrow(mat)
Search for approximate nearest neighbors between cells in the reduced dimensions (e.g. PCA), and return the k nearest neighbors (knn) for each cell. Optionally, we can find neighbors between two separate sets of cells by utilizing both data and query.
knn_hnsw(
data,
query = NULL,
k = 10,
metric = c("euclidean", "cosine"),
verbose = TRUE,
threads = 1,
ef = 100
)
knn_annoy(
data,
query = data,
k = 10,
metric = c("euclidean", "cosine", "manhattan", "hamming"),
n_trees = 50,
search_k = 1
)
data 
cell x dims matrix for reference dataset 
query 
cell x dims matrix for query dataset (optional) 
k 
number of neighbors to calculate 
metric 
distance metric to use 
verbose 
whether to print progress information during search 
threads 
Number of threads to use. Note that result is nondeterministic if threads > 1 
ef 
ef parameter for RccppHNSW::hnsw_search. Increase for slower search but improved accuracy 
n_trees 
Number of trees during index build time. More trees gives higher accuracy 
search_k 
Number of nodes to inspect during the query, or 1 for default value. Higher number gives higher accuracy 
knn_hnsw: Use RcppHNSW as knn engine
knn_annoy: Use RcppAnnoy as knn engine
List of 2 matrices – idx for cell x K neighbor indices, dist for cell x K neighbor distances. If no query is given, nearest neighbors are found mapping the data matrix to itself, prohibiting selfneighbors
Convert a KNN object (e.g. returned by knn_hnsw()
or knn_annoy()
) into
a graph. The graph is represented as a sparse adjacency matrix.
knn_to_graph(knn, use_weights = FALSE, self_loops = TRUE)
knn_to_snn_graph(
knn,
min_val = 1/15,
self_loops = FALSE,
return_type = c("matrix", "list")
)
knn_to_geodesic_graph(knn, return_type = c("matrix", "list"), threads = 0L)
knn 
List of 2 matrices – idx for cell x K neighbor indices, dist for cell x K neighbor distances 
use_weights 
boolean for whether to replace all distance weights with 1 
self_loops 
Whether to allow selfloops in the output graph 
min_val 
minimum jaccard index between neighbors. Values below this will round to 0 
return_type 
Whether to return a sparse adjacency matrix or an edge list 
threads 
Number of threads to use during calculations 
knn_to_graph Create a knn graph
knn_to_snn_graph Convert a knn object into a shared nearest neighbors adjacency matrix. This follows the algorithm that Seurat uses to compute SNN graphs
knn_to_geodesic_graph
Convert a knn object into an undirected weighted graph, using the same
geodesic distance estimation method as the UMAP package.
This matches the output of umap._umap.fuzzy_simplicial_set
from the umaplearn
python package, used by default in scanpy.pp.neighbors
.
Because this only reweights and symmetrizes the KNN graph, it will usually use
less memory and return a sparser graph than knn_to_snn_graph
which computes
2ndorder neighbors. Note: when cells don't have themselves listed as the nearest
neighbor, results may differ slightly from umap._umap.fuzzy_simplicial_set
, which
assumes self is always successfully found in the approximate nearest neighbor search.
knn_to_graph
Sparse matrix (dgCMatrix) where mat[i,j]
= distance from cell i
to
cell j
, or 0 if cell j
is not in the K nearest neighbors of i
knn_to_snn_graph
return_type == "matrix"
:
Sparse matrix (dgCMatrix) where mat[i,j]
= jaccard index of the overlap
in nearest neigbors between cell i
and cell j
, or 0 if the jaccard index
is < min_val
. Only the lower triangle is filled in, which is compatible with
the BPCells clustering methods
return_type == "list"
:
List of 3 equallength vectors i
, j
, and weight
, along with an integer dim
.
These correspond to the rows, cols, and values of nonzero entries in the lower triangle
adjacency matrix. dim
is the total number of vertices (cells) in the graph
knn_to_geodesic_graph
return_type == "matrix"
:
Sparse matrix (dgCMatrix) where mat[i,j]
= normalized similarity between cell i
and cell j
.
Only the lower triangle is filled in, which is compatible with the BPCells clustering methods
return_type == "list"
:
List of 3 equallength vectors i
, j
, and weight
, along with an integer dim
.
These correspond to the rows, cols, and values of nonzero entries in the lower triangle
adjacency matrix. dim
is the total number of vertices (cells) in the graph
Given a features x cells matrix, perform onevsall differential tests to find markers.
marker_features(mat, groups, method = "wilcoxon")
mat 
IterableMatrix object of dimensions features x cells 
groups 
Character/factor vector of cell groups/clusters. Length #cells 
method 
Test method to use. Current options are:

Tips for using the values from this function:
Use dplyr::mutate()
to add columns for e.g. adjusted pvalue and log fold change.
Use dplyr::filter()
to get only differential genes above some given threshold
To get adjusted pvalues, use R p.adjust()
, recommended method is "BH"
To get log2 fold change: if your input matrix was already logtransformed,
calculate (foreground_mean  background_mean)/log(2)
. If your input
matrix was not logtransformed, calculate log2(forground_mean/background_mean)
tibble with the following columns:
foreground: Group ID used for the foreground
background: Group ID used for the background (or NA if comparing to rest of cells)
feature: ID of the feature
p_val_raw: Unadjusted pvalue for differential test
foreground_mean: Average value in the foreground group
background_mean: Average value in the background group
Correct alias gene symbols, Ensembl IDs, and Entrez IDs to canonical gene symbols. This is useful for matching gene names between different datasets which might not always use the same gene naming conventions.
match_gene_symbol(query, subject, gene_mapping = human_gene_mapping)
canonical_gene_symbol(query, gene_mapping = human_gene_mapping)
query 
Character vector of gene symbols or IDs 
subject 
Vector of gene symbols or IDs to index into 
gene_mapping 
Named vector where names are gene symbols or IDs and values are canonical gene symbols 
match_gene_symbol
Integer vector of indices v
such that subject[v]
corresponds to the gene
symbols in query
canonical_gene_symbol
Character vector of canonical gene symbols for each symbol in query
BPCells matrices can be interconverted with Matrix package dgCMatrix sparse matrices, as well as base R dense matrices (though this may result in high memory usage for large matrices)
# Convert to R from BPCells
as(bpcells_mat, "dgCMatrix") # Sparse matrix conversion
as.matrix(bpcells_mat) # Dense matrix conversion
# Convert to BPCells from R
as(dgc_mat, "IterableMatrix")
Calculate matrix stats
matrix_stats(
matrix,
row_stats = c("none", "nonzero", "mean", "variance"),
col_stats = c("none", "nonzero", "mean", "variance"),
threads = 0L
)
matrix 
Input matrix object 
row_stats 
Which row statistics to compute 
col_stats 
Which col statistics to compute 
threads 
Number of threads to use during execution 
The statistics will be calculated in a single pass over the matrix, so this method is desirable to use for efficiency purposes compared to the more standard rowMeans or colMeans if multiple statistics are needed. The stats are ordered by complexity: nonzero, mean, then variance. All less complex stats are calculated in the process of calculating a more complicated stat. So to calculate mean and variance simultaneously, just ask for variance, which will compute mean and nonzero counts as a sideeffect
List of row_stats: matrix of n_stats x n_rows, col_stats: matrix of n_stats x n_cols
Peak and tile matrix calculations can be sped up by reducing the number of cells. For cases where the outputs are going to be added together afterwards, this can provide a performance improvement
merge_cells(fragments, cell_groups)
fragments 
Input fragments object 
cell_groups 
Character or factor vector providing a group for each cell.
Ordering is the same as 
Merge peaks according to ArchR's iterative merging algorithm. More details on the ArchR website
merge_peaks_iterative(peaks)
peaks 
Peaks given as GRanges, data.frame, or list. See
Must be ordered by priority and have columns chr, start, end. 
Properties of merged peaks:
No peaks in the merged set overlap
Peaks are prioritized according to their order in the original input
The output peaks are a subset of the input peaks, with no peak boundaries changed
tibble::tibble()
with a nonoverlapping subset of the rows in peaks. All metadata
columns are preserved
min_scalar: Take minumum with a global constant
min_by_row: Take the minimum with a perrow constant
min_by_col: Take the minimum with a percol constant
min_scalar(mat, val)
min_by_row(mat, vals)
min_by_col(mat, vals)
mat 
IterableMatrix 
val 
Single positive numeric value 
Take the minimum value of a matrix with a perrow, percol, or global constant. This constant must be >0 to preserve sparsity of the matrix. This has the effect of capping the maximum value in the matrix.
IterableMatrix
Normalize an object representing genomic ranges
normalize_ranges(
ranges,
metadata_cols = character(0),
zero_based_coords = !is(ranges, "GRanges"),
n = 1
)
ranges 
Genomic regions given as GRanges, data.frame, or list. See

metadata_cols 
Optional list of metadata columns to require & extract 
zero_based_coords 
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range 
data frame with zerobased coordinates, and elements chr (factor), start (int), and end (int).
If ranges
does not have chr level information, chr levels are the sorted unique values of chr.
If strand is in metadata_cols, then the output strand element will be TRUE for positive strand, and FALSE for negative strand. (Converted from a character vector of "+"/"" if necessary)
Count fragments by nucleosomal size
nucleosome_counts(fragments, nucleosome_width = 147)
fragments 
Fragments object 
nucleosome_width 
Integer cutoff to use as nucleosome width 
Shorter than nucleosome_width
is subNucleosomal
,
nucleosome_width
to 2*nucleosome_width1
is monoNucleosomal
, and anything longer is multiNucleosomal
.
The sum of all fragments is given as nFrags
List with names subNucleosomal
, monoNucleosomal
, multiNucleosomal
, and nFrags
, containing the
count vectors of fragments in each class per cell.
10x fragment files come in a bedlike format, with columns chr, start, end, cell_id, and pcr_duplicates. Unlike a standard bed format, the format from cellranger has an inclusive endcoordinate, meaning the end coordinate itself is what should be counted as the tagmentation site, rather than offset by 1.
open_fragments_10x(path, comment = "#", end_inclusive = TRUE)
write_fragments_10x(
fragments,
path,
end_inclusive = TRUE,
append_5th_column = FALSE
)
path 
File path (e.g. fragments.tsv or fragments.tsv.gz) 
comment 
Skip lines at beginning of file which start with comment string 
end_inclusive 
Whether the end coordinate of the bed is inclusive – i.e. there was an insertion at the end coordinate rather than the base before the end coordinate. This is the 10x default, though it's not quite standard for the bed file format. 
fragments 
Input fragments object 
append_5th_column 
Whether to include 5th column of all 0 for compatibility with 10x fragment file outputs (defaults to 4 columns chr,start,end,cell) 
open_fragments_10x
No disk operations will take place until the fragments are used in a function
write_fragments_10x
Fragments will be written to disk immediately, then returned in a readable object.
10x fragments file object
Read/write a 10x feature matrix
open_matrix_10x_hdf5(path, feature_type = NULL, buffer_size = 16384L)
write_matrix_10x_hdf5(
mat,
path,
barcodes = colnames(mat),
feature_ids = rownames(mat),
feature_names = rownames(mat),
feature_types = "Gene Expression",
feature_metadata = list(),
buffer_size = 16384L,
chunk_size = 1024L,
gzip_level = 0L
)
path 
Path to the hdf5 file on disk 
feature_type 
Optional selection of feature types to include in output matrix. For multiome data, the options are "Gene Expression" and "Peaks". This option is only compatible with files from cellranger 3.0 and newer. 
buffer_size 
For performance tuning only. The number of items to be buffered in memory before calling writes to disk. 
mat 
IterableMatrix 
barcodes 
Vector of names for the cells 
feature_ids 
Vector of IDs for the features 
feature_names 
Vector of names for the features 
feature_types 
String or vector of feature types 
feature_metadata 
Named list of additional metadata vectors to store for each feature 
chunk_size 
For performance tuning only. The chunk size used for the HDF5 array storage. 
gzip_level 
Gzip compression level. Default is 0 (no compression) 
The 10x format makes use of gzip compression for the matrix data, which can slow down read performance. Consider writing into another format if the read performance is important to you.
Input matrices must be in columnmajor storage order, and if the rownames and colnames are not set, names must be provided for the relevant metadata parameters. Some of the metadata parameters are not read by default in BPCells, but it is possible to export them for use with other tools.
BPCells matrix object
Read or write a sparse matrix from an anndata hdf5 file. These functions will
automatically transpose matrices when converting to/from the AnnData
format. This is because the AnnData convention stores cells as rows, whereas the R
convention stores cells as columns. If this behavior is undesired, call t()
manually on the matrix inputs and outputs of these functions.
open_matrix_anndata_hdf5(path, group = "X", buffer_size = 16384L)
write_matrix_anndata_hdf5(
mat,
path,
group = "X",
buffer_size = 16384L,
chunk_size = 1024L,
gzip_level = 0L
)
path 
Path to the hdf5 file on disk 
group 
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use 
buffer_size 
For performance tuning only. The number of items to be buffered in memory before calling writes to disk. 
chunk_size 
For performance tuning only. The chunk size used for the HDF5 array storage. 
gzip_level 
Gzip compression level. Default is 0 (no compression) 
AnnDataMatrixH5 object, with cells as the columns.
Use this function to order regioins prior to calling peak_matrix()
or tile_matrix()
.
order_ranges(ranges, chr_levels, sort_by_end = TRUE)
ranges 
Genomic regions given as GRanges, data.frame, or list. See

chr_levels 
Ordering of chromosome names 
sort_by_end 
If TRUE (defualt), sort by (chr, end, start). Else sort by (chr, start, end) 
Numeric vector analagous to the order
function. Provides an index
selection that will reorder the input ranges to be sorted by chr, end, start
Calculate ranges x cells overlap matrix
peak_matrix(
fragments,
ranges,
mode = c("insertions", "fragments", "overlaps"),
zero_based_coords = !is(ranges, "GRanges"),
explicit_peak_names = TRUE
)
fragments 
Input fragments object. Must have cell names and chromosome names defined 
ranges 
Peaks/ranges to overlap, given as GRanges, data.frame, or list. See

mode 
Mode for counting peak overlaps. (See "value" section for more details) 
zero_based_coords 
Whether to convert the ranges from a 1based endinclusive coordinate system to a 0based endexclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) 
explicit_peak_names 
Boolean for whether to add rownames to the output matrix in format e.g chr1:5001000, where start and end coords are given in a 0based coordinate system. Note that either way, peak names will be written when the matrix is saved. 
Iterable matrix object with dimension ranges x cells. When saved, the column names of the output matrix will be in the format chr1:5001000, where start and end coords are given in a 0based coordinate system.
mode
options
"insertions"
: Start and end coordinates are separately overlapped with each peak
"fragments"
: Like "insertions"
, but each fragment can contribute at most 1 count
to each peak, even if both the start and end coordinates overlap
"overlaps"
: Like "fragments"
, but an overlap is also counted if the fragment fully
spans the peak even if neither the start or end falls within the peak
When calculating the matrix directly from a fragments tsv, it's necessary to first call select_chromosomes()
in order to
provide the ordering of chromosomes to expect while reading the tsv.
Plot feature levels per group or cluster as a grid of dots. Dots are colored by zscore normalized average expression, and sized by percent nonzero.
plot_dot(
source,
features,
groups,
group_order = NULL,
gene_mapping = human_gene_mapping,
colors = c("lightgrey", "#4682B4"),
return_data = FALSE,
apply_styling = TRUE
)
source 
Feature x cell matrix or data.frame with features. For best results,
features should be sparse and lognormalized (e.g. run 
features 
Character vector of features to plot 
groups 
Vector with one entry per cell, specifying the cell's group 
group_order 
Optional vector listing ordering of groups 
gene_mapping 
An optional vector for gene name matching with match_gene_symbol(). 
colors 
Color scale for plot 
return_data 
If true, return data from just before plotting rather than a plot. 
apply_styling 
If false, return a plot without pretty styling applied 
Plot one or more features by coloring cells in a UMAP plot.
plot_embedding(
source,
embedding,
features = NULL,
quantile_range = c(0.01, 0.99),
randomize_order = TRUE,
smooth = NULL,
smooth_rounds = 3,
gene_mapping = human_gene_mapping,
size = NULL,
rasterize = FALSE,
raster_pixels = 512,
legend_continuous = c("auto", "quantile", "value"),
labels_quantile_range = TRUE,
colors_continuous = c("lightgrey", "#4682B4"),
legend_discrete = TRUE,
labels_discrete = TRUE,
colors_discrete = discrete_palette("stallion"),
return_data = FALSE,
return_plot_list = FALSE,
apply_styling = TRUE
)
source 
Matrix, or data frame to pull features from, or a vector of feature values for a single feature. For a matrix, the features must be rows. 
embedding 
A matrix of dimensions cells x 2 with embedding coordinates 
features 
Character vector of features to plot if 
quantile_range 
(optional) Length 2 vector giving the quantiles to clip the minimum and maximum color scale values, as fractions between 0 and 1. NULL or NA values to skip clipping 
randomize_order 
If TRUE, shuffle cells to prevent overplotting biases. Can pass an integer instead to specify a random seed to use. 
smooth 
(optional) Sparse matrix of dimensions cells x cells with cellcell distance weights for smoothing. 
smooth_rounds 
Number of multiplication rounds to apply when smoothing. 
gene_mapping 
An optional vector for gene name matching with match_gene_symbol().
Ignored if 
size 
Point size for plotting 
rasterize 
Whether to rasterize the point drawing to speed up display in graphics programs. 
raster_pixels 
Number of pixels to use when rasterizing. Can provide one number for square dimensions, or two numbers for width x height. 
legend_continuous 
Whether to label continuous features by quantile or value. "auto"
labels by quantile only when all features are continuous and 
labels_quantile_range 
Whether to add a text label with the value range of each feature when the legend is set to quantile 
colors_continuous 
Vector of colors to use for continuous color palette 
legend_discrete 
Whether to show the legend for discrete (categorical) features. 
labels_discrete 
Whether to add text labels at the center of each group for discrete (categorical) features. 
colors_discrete 
Vector of colors to use for discrete (categorical) features. 
return_data 
If true, return data from just before plotting rather than a plot. 
return_plot_list 
If 
apply_styling 
If false, return a plot without pretty styling applied 
Smoothing is performed as follows: first, the smoothing matrix is normalized so the sum of incoming weights to every cell is 1. Then, the raw data values are repeatedly multiplied by the smoothing matrix and rescaled so the average value stays the same.
By default, returns a ggplot2 object with all the requested features plotted
in a grid. If return_data
or return_plot_list
is called, the return value will
match that argument.
Plot the distribution of fragment lengths, with length in basepairs on the xaxis, and proportion of fragments on the yaxis. Typical plots will show 10basepair periodicity, as well as humps spaced at multiples of a nucleosome width (about 150bp).
plot_fragment_length(
fragments,
max_length = 500,
return_data = FALSE,
apply_styling = TRUE
)
fragments 
Fragments object 
max_length 
Maximum length to show on the plot 
return_data 
If true, return data from just before plotting rather than a plot. 
apply_styling 
If false, return a plot without pretty styling applied 
Numeric vector where index i contans the number of lengthi fragments
Plots read count rank vs. number of reads on a loglog scale.
plot_read_count_knee(
read_counts,
cutoff = NULL,
return_data = FALSE,
apply_styling = TRUE
)
read_counts 
Vector of read counts per cell 
cutoff 
(optional) Read cutoff to mark on the plot 
return_data 
If true, return data from just before plotting rather than a plot. 
apply_styling 
If false, return a plot without pretty styling applied 
Performs logarithmic downsampling to reduce the number of points plotted
ggplot2 plot object
Plot the footprinting around TF motif sites
plot_tf_footprint(
fragments,
motif_positions,
cell_groups = rlang::rep_along(cellNames(fragments), "all"),
flank = 250L,
smooth = 0L,
zero_based_coords = !is(genes, "GRanges"),
colors = discrete_palette("stallion"),
return_data = FALSE,
apply_styling = TRUE
)
fragments 
IterableFragments object 
motif_positions 
Coordinate ranges for motifs (must include strand) and have constant width 
cell_groups 
Character or factor assigning a group to each cell, in order of

flank 
Number of flanking basepairs to include on either side of the motif 
smooth 
(optional) Sparse matrix of dimensions cells x cells with cellcell distance weights for smoothing. 
zero_based_coords 
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range 
return_data 
If true, return data from just before plotting rather than a plot. 
apply_styling 
If false, return a plot without pretty styling applied 
footprint()
, plot_tss_profile()
Plot the enrichmment of insertions relative to transcription start sites (TSS). Typically, this plot shows strong enrichment of insertions near a TSS, and a small bump downstream around 220bp downstream of the TSS for the +1 nucleosome.
plot_tss_profile(
fragments,
genes,
cell_groups = rlang::rep_along(cellNames(fragments), "all"),
flank = 2000L,
smooth = 0L,
zero_based_coords = !is(genes, "GRanges"),
colors = discrete_palette("stallion"),
return_data = FALSE,
apply_styling = TRUE
)
fragments 
IterableFragments object 
genes 
Coordinate ranges for genes (must include strand) 
cell_groups 
Character or factor assigning a group to each cell, in order of

flank 
Number of flanking basepairs to include on either side of the motif 
smooth 
Number of bases to smooth over (rolling average) 
zero_based_coords 
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range 
return_data 
If true, return data from just before plotting rather than a plot. 
apply_styling 
If false, return a plot without pretty styling applied 
footprint()
, plot_tf_footprint()
Density scatter plot with log10(fragment_count) on the xaxis and TSS enrichment on the yaxis. This plot is most useful to select which cell barcodes in an experiment correspond to highquality cells
plot_tss_scatter(
atac_qc,
min_frags = NULL,
min_tss = NULL,
bins = 100,
apply_styling = TRUE
)
atac_qc 
Tibble as returned by 
min_frags 
Minimum fragment count cutoff 
min_tss 
Minimum TSS Enrichment cutoff 
bins 
Number of bins for density calculation 
apply_styling 
If false, return a plot without pretty styling applied 
Rename cells by adding a prefix to the names. Most commonly this will be a sample name.
All cells will recieve the exact text of prefix
added to the beginning, so any separator
characters like "_" must be included in the given prefix
. Use this prior to merging
fragments from different experiments with c()
in order to help prevent cell name
clashes.
prefix_cell_names(fragments, prefix)
fragments 
Input fragments object. 
prefix 
String to add as the prefix 
Fragments object with prefixed names
Calculate ArchRcompatible percell QC statistics
qc_scATAC(fragments, genes, blacklist)
fragments 
IterableFragments object 
genes 
Gene coordinates given as GRanges, data.frame, or list. See

blacklist 
Blacklisted regions given as GRanges, data.frame, or list. See

This implementation mimics ArchR's default parameters. For uses requiring more flexibility to tweak default parameters, the best option is to reimplement this function with required changes. Output columns of data.frame:
cellName
: cell name for each cell
nFrags
: number of fragments per cell
subNucleosomal
, monoNucleosomal
, multiNucleosomal
: number of fragments of size 1146bp, 147254bp, and 255bp + respectively.
equivalent to ArchR's nMonoFrags, nDiFrags, nMultiFrags respectively
TSSEnrichment
: AvgInsertInTSS / max(AvgInsertFlankingTSS, 0.1)
, where AvgInsertInTSS
is ReadsInTSS / 101
(window size),
and AvgInsertFlankingTSS
is ReadsFlankingTSS / (100*2)
(window size). The max(0.1)
ensures that very lowread cells
do not get assigned spuriously high TSSEnrichment.
ReadsInPromoter
: Number of reads from 2000bp upstream of TSS to 101bp downstream of TSS
ReadsInBlacklist
: Number of reads in the provided blacklist region
ReadsInTSS
: Number of reads overlapping the 101bp centered around each TSS
ReadsFlankingTSS
: Number of reads overlapping 19012000bp +/ each TSS
Differences from ArchR: Note that ArchR by default uses a different set of annotations to derive TSS sites and promoter sites. This function uses just one annotation for gene start+end sites, so must be called twice to exactly recalculate the ArchR QC stats.
ArchR's PromoterRatio
and BlacklistRatio
are not included in the output, as they can be easily calculated
from ReadsInPromoter / nFrags
and ReadsInBlacklist / nFrags
. Similarly, ArchR's NucleosomeRatio
can be calculated
as (monoNucleosomal + multiNucleosomal) / subNucleosomal
.
data.frame with QC data
Given a set of genomic ranges, find the distance to the nearest neighbors both upstream and downstream.
range_distance_to_nearest(
ranges,
addArchRBug = FALSE,
zero_based_coords = !is(ranges, "GRanges")
)
ranges 
Genomic regions given as GRanges, data.frame, or list. See

addArchRBug 
boolean to reproduce ArchR's bug that incorrectly handles nested genes 
zero_based_coords 
If true, coordinates start and 0 and the end coordinate is not included in the range. If false, coordinates start at 1 and the end coordinate is included in the range 
A 2column data.frame with columns upstream and downstream, containing
the distances to the nearest neighbor in the respective directions.
For ranges on +
or *
strand, distance is calculated as:
upstream = max(start(range)  end(upstreamNeighbor), 0)
downstream = max(start(downstreamNeighbor)  end(range), 0)
For ranges on 
strand, the definition of upstream and downstream is flipped.
Note that this definition of distance is one off from
GenomicRanges::distance()
, as ranges that neighbor but don't overlap are given
a distance of 1 rather than 0.
Bed files can contain peak or blacklist annotations. These utilities help read thos annotations
read_bed(
path,
additional_columns = character(0),
backup_url = NULL,
timeout = 300
)
read_encode_blacklist(
dir,
genome = c("hg38", "mm10", "hg19", "dm6", "dm3", "ce11", "ce10"),
timeout = 300
)
path 
Path to file (or desired save location if backup_url is used) 
additional_columns 
Names for additional columns in the bed file 
backup_url 
If path does not exist, provides a URL to download the gtf from 
timeout 
Maximum time in seconds to wait for download from backup_url 
dir 
Output directory to cache the downloaded gtf file 
genome 
genome name 
read_bed
Read a bed file from disk or a url.
read_encode_blacklist
Downloads the Boyle Lab blacklist, as described in https://doi.org/10.1038/s4159801945839z
Data frame with coordinates using the 0based convention.
read_gtf()
, read_gencode_genes()
Read gene annotations from gtf format into a data frame. The source can be a URL, a gtf file on disk, or a gencode release version.
read_gtf(
path,
attributes = c("gene_id"),
tags = character(0),
features = c("gene"),
keep_attribute_column = FALSE,
backup_url = NULL,
timeout = 300
)
read_gencode_genes(
dir,
release = "latest",
annotation_set = c("basic", "comprehensive"),
gene_type = "lncRNAprotein_codingIG_.*_geneTR_.*_gene",
attributes = c("gene_id", "gene_type", "gene_name"),
tags = character(0),
features = c("gene"),
timeout = 300
)
read_gencode_transcripts(
dir,
release = "latest",
transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"),
annotation_set = c("basic", "comprehensive"),
gene_type = "lncRNAprotein_codingIG_.*_geneTR_.*_gene",
attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"),
features = c("transcript", "exon"),
timeout = 300
)
path 
Path to file (or desired save location if backup_url is used) 
attributes 
Vector of GTF attribute names to parse out as columns 
tags 
Vector of tags to parse out as boolean presence/absence 
features 
List of features types to keep from the GTF (e.g. gene, transcript, exon, intron) 
keep_attribute_column 
Boolean for whether to preserve the raw attribute text column 
backup_url 
If path does not exist, provides a URL to download the gtf from 
timeout 
Maximum time in seconds to wait for download from backup_url 
dir 
Output directory to cache the downloaded gtf file 
release 
release version (prefix with M for mouse versions). For most recent version, use "latest" or "latest_mouse" 
annotation_set 
Either "basic" or "comprehensive" annotation sets (see details section). 
gene_type 
Regular expression with which gene types to keep. Defaults to protein_coding, lncRNA, and IG/TR genes 
transcript_choice 
Method for selecting representative transcripts. Choices are:

read_gtf
Read gtf from a file or URL
read_gencode_genes
Read gene annotations directly from GENCODE. The file name will vary depending
on the release and annotation set requested, but will be of the format
gencode.v42.annotation.gtf.gz
. GENCODE currently recommends the basic set:
https://www.gencodegenes.org/human/. In release 42, both the comprehensive and
basic sets had identical genelevel annotations, but the comprehensive set had
additional transcript variants annotated.
read_gencode_transcripts
Read transcript models from GENCODE, for use with trackplot_gene()
Data frame with coordinates using the 0based convention. Columns are:
chr
source
feature
start
end
score
strand
frame
attributes (optional; named according to listed attributes)
tags (named according to listed tags)
read_bed()
, read_encode_blacklist()
Read chromosome sizes from UCSC and return as a tibble with one row per chromosome. The underlying data is pulled from here: https://hgdownload.soe.ucsc.edu/downloads.html
read_ucsc_chrom_sizes(
dir,
genome = c("hg38", "mm39", "mm10", "mm9", "hg19"),
keep_chromosomes = "chr[09]+chrXchrY",
timeout = 300
)
Rotate ggplot x axis labels
rotate_x_labels(degrees = 45)
degrees 
Number of degrees to rotate by 
Calculate pearson residuals of a negative binomial sctransform model.
Normalized values are calculated as (X  mu) / sqrt(mu + mu^2/theta)
.
mu is calculated as cell_read_counts * gene_beta
.
sctransform_pearson(
mat,
gene_theta,
gene_beta,
cell_read_counts,
min_var = Inf,
clip_range = c(10, 10),
columns_are_cells = TRUE,
slow = FALSE
)
mat 
IterableMatrix (raw counts) 
gene_theta 
Vector of pergene thetas (overdispersion values) 
gene_beta 
Vector of pergene betas (expression level values) 
cell_read_counts 
Vector of total reads per (umi count for RNA) 
min_var 
Minimum value for clipping variance 
clip_range 
Length 2 vector of min and max clipping range 
columns_are_cells 
Whether the columns of the matrix correspond to cells (default) or genes 
slow 
If TRUE, use a 10x slower but more precise implementation (default FALSE) 
The parameterization used is somewhat simplified compared to the original SCTransform paper, in particular it uses a linearscale rather than logscale to represent the cell_read_counts and gene_beta variables. It also does not support the addition of arbitrary cell metadata (e.g. batch) to add to the negative binomial regression.
IterableMatrix
Subset, translate, or reorder cell IDs
select_cells(fragments, cell_selection)
fragments 
Input fragments object 
cell_selection 
List of chromosme IDs (numeric), or names (character).
The output cell ID 
Subset, translate, or reorder chromosome IDs
select_chromosomes(fragments, chromosome_selection)
fragments 
Input fragments object 
chromosome_selection 
List of chromosme IDs (numeric), or names (character).
The output chromosome ID 
Fragments can be subset based on overlapping (or not overlapping) a set of regions
select_regions(
fragments,
ranges,
invert_selection = FALSE,
zero_based_coords = !is(ranges, "GRanges")
)
fragments 
Input fragments object. 
ranges 
Peaks/ranges to overlap, given as GRanges, data.frame, or list. See

invert_selection 
If TRUE, select fragments not overlapping selected regions instead of only fragments overlapping the selected regions. 
zero_based_coords 
Whether to convert the ranges from a 1based endinclusive coordinate system to a 0based endexclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) 
Fragments object filtered according to the selected regions
Shifts start or end of fragments by a fixed amount, which can be useful to correct the Tn5 offset.
shift_fragments(fragments, shift_start = 0L, shift_end = 0L)
fragments 
Input fragments object 
shift_start 
How many basepairs to shift the start coords 
shift_end 
How many basepairs to shift the end coords 
The correct Tn5 offset is +/ 4bp since the Tn5 cut sites on opposite strands are offset by 9bp. However, +4/5 bp is often applied to bedformat files, since the end coordinate in bed files is 1 past the last basepair of the sequenced DNA fragment. This results in a bedlike format except with inclusive end coordinates.
Shifted fragments object
Subset fragments by length
subset_lengths(fragments, min_len = 0L, max_len = NA_integer_)
fragments 
Input fragments object 
min_len 
Minimum bases in fragment (inclusive) 
max_len 
Maximum bases in fragment (inclusive) 
Fragment length is calculated as endstart
Fragments object
Use the C++ Spectra solver (same as RSpectra package), in order to compute the largest k values and corresponding singular vectors. Empirically, memory usage is much lower than using irlba::irlba, likely due to avoiding R garbage creation while solving due to the pureC++ solver.
svds(A, k, nu = k, nv = k, opts = list(), threads=0L, ...)
A 
The matrix whose truncated SVD is to be computed. 
k 
Number of singular values requested. 
nu 
Number of right singular vectors to be computed. This must be between 0 and 'k'. (Must be equal to 'k' for BPCells IterableMatrix) 
opts 
Control parameters related to computing algorithm. See Details below 
threads 
Control threads to use calculating matvec producs (BPCells specific) 
When RSpectra is installed, this function will just add a method to
RSpectra::svds()
for the IterableMatrix
class. This
documentation is a slightlyedited version of the RSpectra::svds()
documentation.
The opts
argument is a list that can supply any of the
following parameters:
ncv
Number of Lanzcos basis vectors to use. More vectors
will result in faster convergence, but with greater
memory use. ncv
must be satisfy
$k < ncv \le p$
where
p = min(m, n)
.
Default is min(p, max(2*k+1, 20))
.
tol
Precision parameter. Default is 1e10.
maxitr
Maximum number of iterations. Default is 1000.
center
Either a logical value (TRUE
/FALSE
), or a numeric
vector of length $n$
. If a vector $c$
is supplied, then
SVD is computed on the matrix $A  1c'$
,
in an implicit way without actually forming this matrix.
center = TRUE
has the same effect as
center = colMeans(A)
. Default is FALSE
. Ignored in BPCells
scale
Either a logical value (TRUE
/FALSE
), or a numeric
vector of length $n$
. If a vector $s$
is supplied, then
SVD is computed on the matrix $(A  1c')S$
,
where $c$
is the centering vector and $S = diag(1/s)$
.
If scale = TRUE
, then the vector $s$
is computed as
the column norm of $A  1c'$
.
Default is FALSE
. Ignored in BPCells
A list with the following components:
d 
A vector of the computed singular values. 
u 
An 
v 
An 
nconv 
Number of converged singular values. 
niter 
Number of iterations used. 
nops 
Number of matrixvector multiplications used. 
Qiu Y, Mei J (2022). RSpectra: Solvers for LargeScale Eigenvalue and SVD Problems. R package version 0.161, https://CRAN.Rproject.org/package=RSpectra.
Calculate ranges x cells tile overlap matrix
tile_matrix(
fragments,
ranges,
zero_based_coords = !is(ranges, "GRanges"),
explicit_tile_names = FALSE
)
fragments 
Input fragments object 
ranges 
Tiled regions given as GRanges, data.frame, or list. See
Must be nonoverlapping and sorted by
(chr, start), with chromosomes ordered according to the chromosome names of 
zero_based_coords 
Whether to convert the ranges from a 1based endinclusive coordinate system to a 0based endexclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) 
explicit_tile_names 
Boolean for whether to add rownames to the output matrix in format e.g chr1:5001000, where start and end coords are given in a 0based coordinate system. For wholegenome Tile matrices the names will take ~5 seconds to generate and take up 400MB of memory. Note that either way, tile names will be written when the matrix is saved. 
Iterable matrix object with dimension ranges x cells. When saved, the column names will be in the format chr1:5001000, where start and end coords are given in a 0based coordinate system.
When calculating the matrix directly from a fragments tsv, it's necessary to first call select_chromosomes()
in order to
provide the ordering of chromosomes to expect while reading the tsv.
Plot a pseudobulk genome track, showing the number of fragment insertions across a region.
trackplot_bulk(
fragments,
region,
groups,
cell_read_counts,
group_order = NULL,
bins = 200,
clip_quantile = 0.999,
colors = discrete_palette("stallion"),
legend_label = "group",
zero_based_coords = !is(region, "GRanges"),
return_data = FALSE,
return_plot_list = FALSE,
apply_styling = TRUE
)
fragments 
Fragments object 
region 
GRanges of length 1 with region to plot, or list/data.frame with
one entry each for chr, start, end. See 
groups 
Vector with one entry per cell, specifying the cell's group 
cell_read_counts 
Numeric vector of read counts for each cell (used for normalization) 
group_order 
Optional vector listing ordering of groups 
bins 
Number of bins to plot across the region 
clip_quantile 
(optional) Quantile of values for clipping yaxis limits. Default of 0.999 will crop out just the most extreme outliers across the region. NULL to disable clipping 
colors 
Character vector of color values (optionally named by group) 
legend_label 
Custom label to put on the legend 
zero_based_coords 
Whether to convert the ranges from a 1based endinclusive coordinate system to a 0based endexclusive coordinate system. Defaults to true for GRanges and false for other formats (see this archived UCSC blogpost) 
return_data 
If true, return data from just before plotting rather than a plot. 
return_plot_list 
If 
apply_styling 
If false, return a plot without pretty styling applied 
Returns a combined plot of pseudobulk genome tracks. For compatability with
draw_trackplot_grid()
, the extra attribute $patches$labels
will be added to
specify the labels for each track. If return_data
or return_plot_list
is
TRUE
, the return value will be modified accordingly.
Plot transcript models
trackplot_gene(
transcripts,
region,
exon_size = 3,
gene_size = 1,
label_size = 3,
return_data = FALSE
)
transcripts 
GRanges, list, or data.frame of transcript features to plot. Required attributes are:

region 
GRanges of length 1 with region to plot, or list/data.frame with
one entry each for chr, start, end. See 
exon_size 
size for exon lines 
label_size 
size for transcript labels 
return_data 
If true, return data from just before plotting rather than a plot. 
labels 
Character vector with labels for each item in transcripts. NA for items that should not be labeled 
transcript_size 
size for transcript lines 
Plot of gene locations
Transpose the storage order for a matrix
transpose_storage_order(
matrix,
outdir = tempfile("transpose"),
tmpdir = tempdir(),
load_bytes = 4194304L,
sort_bytes = 1073741824L
)
matrix 
Input matrix 
outdir 
Directory to store the output 
tmpdir 
Temporary directory to use for intermediate storage 
load_bytes 
The minimum contiguous load size during the merge sort passes 
sort_bytes 
The amount of memory to allocate for resorting chunks of entries 
This resorts the entries of a matrix to change the storage order from rowmajor to colmajor. For large matrices, this can be slow – around 2 minutes to transpose a 500k cell RNAseq matrix The default load_bytes (4MiB) and sort_bytes (1GiB) parameters allow ~85GB of data to be sorted with two passes through the data, and ~7.3TB of data to be sorted in three passes through the data.
MatrixDir object with a copy of the input matrix, but the storage order flipped
BPCells fragments can be read/written in compressed (bitpacked) or uncompressed form in a variety of storage locations: in memory (as an R object), in an hdf5 file, or in a directory on disk (containing binary files).
write_fragments_memory(fragments, compress = TRUE)
write_fragments_dir(
fragments,
dir,
compress = TRUE,
buffer_size = 1024L,
overwrite = FALSE
)
open_fragments_dir(dir, buffer_size = 1024L)
write_fragments_hdf5(
fragments,
path,
group = "fragments",
compress = TRUE,
buffer_size = 8192L,
chunk_size = 1024L,
overwrite = FALSE,
gzip_level = 0L
)
open_fragments_hdf5(path, group = "fragments", buffer_size = 16384L)
fragments 
Input fragments object 
compress 
Whether or not to compress the data. With compression, storage size is be about half the size of a gzipcompressed 10x fragments file. 
dir 
Directory to read/write the data from 
buffer_size 
For performance tuning only. The number of items to be bufferred in memory before calling writes to disk. 
overwrite 
If 
path 
Path to the hdf5 file on disk 
group 
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use 
chunk_size 
For performance tuning only. The chunk size used for the HDF5 array storage. 
gzip_level 
Gzip compression level. Default is 0 (no compression). This is recommended when both compression and compatibility with outside programs is required. Otherwise, using compress=TRUE is recommended as it is >10x faster with often similar compression levels. 
Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing hdf5 files to group data together, and the in memory format provides the fastest performance in the event memory usage is unimportant.
Fragment object
BPCells matrices are stored in sparse format, meaning only the nonzero entries are stored. Matrices can store integer counts data or decimal numbers (float or double). See details for more information.
write_matrix_memory(mat, compress = TRUE)
write_matrix_dir(
mat,
dir,
compress = TRUE,
buffer_size = 8192L,
overwrite = FALSE
)
open_matrix_dir(dir, buffer_size = 8192L)
write_matrix_hdf5(
mat,
path,
group,
compress = TRUE,
buffer_size = 8192L,
chunk_size = 1024L,
overwrite = FALSE,
gzip_level = 0L
)
open_matrix_hdf5(path, group, buffer_size = 16384L)
compress 
Whether or not to compress the data. 
dir 
Directory to save the data into 
buffer_size 
For performance tuning only. The number of items to be buffered in memory before calling writes to disk. 
overwrite 
If 
path 
Path to the hdf5 file on disk 
group 
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use 
chunk_size 
For performance tuning only. The chunk size used for the HDF5 array storage. 
gzip_level 
Gzip compression level. Default is 0 (no compression). This is recommended when both compression and compatibility with outside programs is required. Otherwise, using compress=TRUE is recommended as it is >10x faster with often similar compression levels. 
matrix 
Input matrix, either IterableMatrix or dgCMatrix 
Matrices can be stored in a directory on disk, in memory, or in an HDF5 file. Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing hdf5 files to group data together, and the in memory format provides the fastest performance in the event memory usage is unimportant.
For typical RNA counts matrices holding integer counts, this bitpacking compression will result in 68x less space than an R dgCMatrix, and 46x smaller than a scipy csc_matrix. The compression will be more effective when the count values in the matrix are small, and when the rows of the matrix are sorted by rowMeans. In tests on RNAseq data optimal ordering could save up to 40% of storage space. On noninteger data only the row indices are compressed, not the values themselves so space savings will be smaller.
For noninteger data matrices, bitpacking compression is much less effective, as it can only be applied to the indexes of each entry but not the values. There will still be some space savings, but far less than for counts matrices.
BPCells matrix object