CellAnova (Zhaojun Zhang et al., 2024, Nat Biotech) is a method that removes or mitigates batch effects in single-cell data and returns a "corrected" data matrix rather than only a low-dimensional embedding. This function implements the second step of CellAnova. Specifically, it is an R implementation of the calc_BE() function from the CellAnova Python package (https://github.com/Janezjz/cellanova).
Usage
cellanova_calc_BE(
object = NULL,
assay = NULL,
layer = "scale.data",
integrate_key = NULL,
features = NULL,
control_dict = NULL,
reduction = NULL,
var_cutoff = 0.9,
k_max = 1500,
k_select = NULL,
new.assay.name = "CORRECTED",
verbose = TRUE
)
Arguments
- object
A Seurat object
- assay
The name of the assay on which to perform the CellAnova correction.
- layer
the name of the layer to be used to correct for the batch effect. Should be scale.data.
- integrate_key
A string indicating the smallest batch unit in the meta-data (e.g., library, donor, etc.), which will be used for integration later.
- features
features to compute corrected expression for. Defaults to the variable features set in the assay specified.
- control_dict
A list giving the control-group assignment of the control samples. The name of each element in the list should correspond to a batch name in the 'integrate_key' column.
- reduction
The name of the DimReduc object to use as the integrated embedding. It should come from a method such as Harmony or one of the Seurat integration methods (e.g., CCA).
- var_cutoff
The fraction of explained variance used to determine the value of k in the truncated SVD when calculating the basis of the batch effect. Default is 0.9.
- k_max
The maximum number of singular values and vectors to compute.
- k_select
A user-defined number of singular values and vectors to compute (overrides var_cutoff and k_max). Default is NULL.
- new.assay.name
the name for the new assay to store the corrected expression matrix
- verbose
Whether to display progress and messages.
Value
Returns a Seurat object with a new assay added containing the batch-corrected expression matrix
Details
This function takes as input a Seurat object, its pre-computed integrated embedding (from methods such as Harmony or Seurat-CCA), a batch-effect index, and a case-control index. It estimates the batch effect from the control samples and then removes it from the full original expression data. Most of the procedure is kept the same as in the Python implementation, with the following modifications:
- Currently, only one control group is supported.
- future_lapply() and a more efficient regression framework are used to improve efficiency.
- The procedure can be run on "sketched" data and the result later projected to the whole dataset (for efficiency and data balance).
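The interaction between var_cutoff, k_max, and k_select described under Arguments can be illustrated with a minimal base-R sketch. The helper select_k() below is hypothetical and is not part of this package. It assumes that explained variance is measured as the cumulative share of squared singular values among the first k_max values, which may differ from what cellanova_calc_BE() does internally.

```r
# Hypothetical helper (not part of the package): choose the number of
# singular vectors k from a vector of singular values `d`.
# Assumption: explained variance = cumulative share of squared singular
# values among the (at most k_max) values that were computed.
select_k <- function(d, var_cutoff = 0.9, k_max = 1500, k_select = NULL) {
  # A user-supplied k takes precedence over var_cutoff and k_max.
  if (!is.null(k_select)) {
    return(min(k_select, length(d)))
  }
  # Keep at most k_max singular values.
  d <- d[seq_len(min(k_max, length(d)))]
  explained <- cumsum(d^2) / sum(d^2)
  # Smallest k whose cumulative explained variance reaches the cutoff.
  k <- which(explained >= var_cutoff)[1]
  if (is.na(k)) length(d) else k
}

d <- c(3, 2, 1, 1)                 # toy singular values
k_default <- select_k(d)           # cumulative shares: 0.60, 0.87, 0.93, 1.00
k_manual  <- select_k(d, k_select = 2)
k_capped  <- select_k(d, k_max = 2)
```

In this toy case the default cutoff of 0.9 is first reached at the third singular value, while setting k_select bypasses the cutoff entirely.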