In this tutorial, we will describe an R package “Mixscale” for
analyzing Perturb-seq data. Mixscale contains functions designed to
tackle the following tasks:
1. calculate ‘Mixscale scores’ for cells that receives the same
perturbation to quantify the heterogeneity in perturbation
strength
2. perform a scoring-based weighted differential expression (DE) tests
to identify DE genes for each perturbation
3. perform different levels of decomposition analysis to identify
correlated perturbations and group them into a program
4. perform a PCA-based permutation test to extract shared genes for the
perturbation programs (program gene signature)
5. identify shared and unique signature between two relevant
programs
6. perform gene set enrichment tests using the program signature for new
datasets
7. perform module score analyses using the program signature to quantify
the program activity in new datasets
The tutorial is divided into two sections. The first section will
describe task 1 to 5 using a public Perturb-seq dataset from the
Weissman Lab (Jost et al
2020), which can be downloaded from GSE132080.
The second section will describe task 6 and 7 using the pathway gene
lists generated from our study (available at Zenodo) and an interferon-beta stimulated
human PBMCs dataset (ifnb) from Kang et al 2017.
This ifnb dataset is available via the SeuratData package (see section 2
below).
The package can be easily installed using the following command. Please make sure the dependencies like Seurat are installed prior to this.
# dependencies
install.packages("Seurat")
install.packages("PMA")
install.packages("protoclust")
BiocManager::install("glmGamPoi")
# Mixscale package
devtools::install_github("longmanz/Mixscale")
options(Seurat.object.assay.version = 'v3')
library(Seurat)
library(ggridges)
library(ggplot2)
library(Mixscale)
In this section we will focus on how to use Mixscale to analyze Perturb-seq data.
The demo dataset from Jost et al
2020 contains CRISPRi Perturb-seq data targeting 25 key genes
involved in essential cell biological processes. We will load in the
count matrix to create a Seurat object and append the provided meta data
to it.
One special feature of this dataset is that, for each perturbation
target gene, there are five different gRNAs designed to target it. One
of the gRNA has the perfectly matched sequence for the target region
(labelled with “_00”), while the others contain 1~3 nucleotide
mismatches so that their perturbation stength is “titrated”. We will
treat the cells that have the same target gene as the same group in our
downstream analyses.
library(stringr)
# load in the count matrix downloaded from GSE132080
ct_mat = ReadMtx(mtx = "GSE132080/GSE132080_10X_matrix.mtx",
cells = "GSE132080/GSE132080_10X_barcodes.tsv",
features = "GSE132080/GSE132080_10X_genes.tsv")
# load the meta_data
meta_data = read.csv("GSE132080/GSE132080_cell_identities.csv")
rownames(meta_data) = meta_data$cell_barcode
# create a seurat object
seurat_obj = CreateSeuratObject(counts = ct_mat, meta.data = meta_data)
rm(ct_mat, meta_data)
# retrieve the guide information for each cell
txt = seurat_obj$guide_identity
txt2 = str_extract(txt, "^[^_]+")
txt3 = gsub(pattern = "^[^_]+_", replacement = "", txt)
seurat_obj[['gene']] = txt2
seurat_obj[['gRNA_name']] = txt3
seurat_obj[['cell_type']] = "K562"
rm(txt, txt2, txt3)
# remove ambiguous cells
seurat_obj = subset(seurat_obj, subset = number_of_cells == 1) # 19594 cells remain
seurat_obj = subset(seurat_obj, subset = guide_identity != '*') # 19587 cells remain
We will first run standard pre-processing (normalization, find variable features, etc) for the dataset. Then, we will follow the standard Mixscape analysis to calculate local perturbation signatures that mitigate confounding effects. Briefly speaking, for each cell we will search for its 20 nearest neighbors from the non-targeted (NT) cells, and then remove all technical variation so that perturbation-specific effect can be revealed.
# quick check of the data
seurat_obj
## An object of class Seurat
## 33694 features across 19587 samples within 1 assay
## Active assay: RNA (33694 features, 0 variable features)
## 2 layers present: counts, data
# standard pre-processing
seurat_obj = NormalizeData(seurat_obj)
seurat_obj = FindVariableFeatures(seurat_obj)
seurat_obj = ScaleData(seurat_obj)
seurat_obj = RunPCA(seurat_obj)
# calculate Perturbation signatures
seurat_obj <- CalcPerturbSig(
object = seurat_obj,
assay = "RNA",
slot = "data",
gd.class ="gene",
nt.cell.class = "neg",
reduction = "pca",
ndims = 40,
num.neighbors = 20,
new.assay.name = "PRTB",
split.by = NULL) # to specify the metadata column if multiple biological contexts (e.g., cell lines) exist
We will now compute the Mixscale scores for each cell in every perturbation group. The standardized Mixscale scores will be saved in the metadata under the column named “mixscale_score”. Note that if multiple biological contexts (e.g., cell lines) exist, we can use the ‘split.by’ to specify the corresponding metadata column (in this example we only have one cell line).
# Mixscale
seurat_obj = RunMixscale(
object = seurat_obj,
assay = "PRTB",
slot = "scale.data",
labels = "gene",
nt.class.name = "neg",
min.de.genes = 5,
logfc.threshold = 0.2,
de.assay = "RNA",
max.de.genes = 100,
new.class.name = "mixscale_score",
fine.mode = F,
verbose = F,
split.by = NULL)
We will now use some plotting functions to explore the perturbation scores that we just calculated.
# a. Check the distribution of the scores for the perturbations
RidgePlot(
seurat_obj,
features = "mixscale_score",
group.by = "gene") + NoLegend()
# b. Check if the scores correlate with the expression level of the target gene itself
Mixscale_ScatterPlot(object = seurat_obj,
nt.class.name = "neg",
slct.ident = unique(seurat_obj$gene)[unique(seurat_obj$gene) != "neg"][1:10],
nbin = 10,
facet_wrap = "gene") + NoLegend()
After calculating the scores, we can use the scores to enhance the statistical power of DE analysis by using them as a “weights” in the regression model. Briefly speaking, instead of coding the NT cells as 0 and the targeted cells as 1, we used the standardized scores to code the targeted cells in the regression, so that cells with stronger perturbation strength will have higher “weights” and vice versa. Similar as above, if multiple biological contexts (e.g., cell lines) exist, we can use the ‘split.by’ to specify the corresponding metadata column, and the DE tests will be performed in a multi-variate manner. Cell line specific P-values will be reported separately in the final output.
# run score-based weighted DE test for 12 selected perturbations. It will return a list of data frames (one for each perturbation)
de_res = Run_wmvRegDE(object = seurat_obj, assay = "RNA", slot = "counts",
labels = "gene", nt.class.name = "neg",
PRTB_list = c("GATA1", "GINS1", "MTOR"),
logfc.threshold = 0.2,
split.by = NULL)
# have a quick look at the DE results
head(de_res[[1]])
## gene_ID log2FC beta_weight p_weight DE_method
## CYBA CYBA -1.4268525 0.08554419 0 weighted
## RPL18A RPL18A 0.4359628 -0.03871204 0 weighted
## RPS12 RPS12 0.4241743 -0.03550264 0 weighted
## RPS2 RPS2 0.4265143 -0.03764625 0 weighted
## TMSB4X TMSB4X -1.6341147 0.08718808 0 weighted
## YBX1 YBX1 0.4969422 -0.05072183 0 weighted
We can now explore the top DE genes for each perturbations using the customized DoHeatmap function, where cells are ordered by Mixscale scores.
# select the top 20 DE genes from the perturbation
top_res = de_res[["GATA1"]][order(de_res[["GATA1"]]$p_weight)[1:20], ]
# order the DE genes based on its log-fold-change
top_DEG = rownames(top_res[order(top_res$log2FC, decreasing = T), ])
# heatmap for the top DE genes. cells ordered by Mixscale scores. The expression level of the target gene will be plotted in the first row.
Mixscale_DoHeatmap(object = seurat_obj,
labels = "gene",
nt.class.name = "neg",
slct.ident = "GATA1",
mixscale.score.name = "mixscale_score",
features = c("GATA1", top_DEG), angle = 0, hjust = 0.5)
We can examine the DE results for additional perturbations using similar codes as above. For the “GINS1” perturbation, the target gene’s expression appears poorly correlated with Mixscale scores, as shown in the Mixscale_ScatterPlot. However, the Mixscale scores correlate well with the expression levels of other DE genes, suggesting that Mixscale scores might more accurately indicate the perturbation strength in cells compared to target gene’s expression (especially when it is a lowly-expressed gene as “GINS1”).
In this section we will focus on how to use the pathway signatures from our study to run gene set enrichment test in external datasets.
We will use the interferon-beta (IFNB) stimulated human PBMCs dataset (ifnb) from Kang et al 2017 (available via SeuratData package) to demonstrate how to perform gene set enrichment analyses using the pathway gene sets from our study. We aim to show that by using our pathway gene lists, we can correctly infer the pathway activation of IFNB across different cell types in the human PBMCs.
# install the ifnb dataset
SeuratData::InstallData("ifnb")
# load dataset
ifnb <- SeuratData::LoadData("ifnb")
We can then load the pathway gene sets we generated (can be downloaded from Zenodo). There are two versions of pathway gene lists provided. One is the standard pathway gene list for different pathway programs we compiled, and the other one is the pathway exclusive gene list that filtered out the shared genes shared with other relevant pathways in the experiment.
plist = readRDS("Pathway_genelist.rds")
exclusive_plist = readRDS("Pathway_Exclusive_genelist.rds")
We will first conduct Wilcox DE tests between the control and the IFNB-stimulated cells in each cell types in the ifnb dataset. Then, we will perform Fisher enrichment tests for the DE genes from each of the cell types, testing them against the pathway gene lists we just load. These two steps are merged by a wrapper function. During the DEenrich() process, if users wish to conduct the DE tests for each cell type individually, they can use the ‘split.by’ to specify the corresponding metadata column.
# Normalize the counts
ifnb = NormalizeData(ifnb)
# A wrapper function to perform both DE and enrichment test
Idents(ifnb) = "stim"
res = DEenrich(object = ifnb,
plist = plist,
ident = "stim",
ident.1 = "STIM",
ident.2 = "CTRL",
split.by = "seurat_annotations",
direction = "up",
logfc.threshold = 0.2,
p.val.cutoff = 0.05,
min.pct = 0.1)
# check the enrichment results for CD14 Monocytes
head(res$`CD14 Mono`)
## GO_term OR Pval combined_score num_LH num_PH
## 1 IFNB_program1_down 30.551460 1.566042e-69 4840.26233710267 148 164
## 3 IFNG_program1_down 13.764122 6.311636e-47 1464.21496809152 125 154
## 5 TNFA_program1_down 7.315289 8.145677e-26 422.602219842384 91 128
## 4 IFNG_program2_down 14.092937 5.223443e-21 658.156092839788 54 64
## 2 IFNB_program2_down 1.353308 1.792922e-01 2.3259826904708 20 57
## 6 TNFA_program2_down 1.537152 3.469431e-01 1.62722098095663 5 11
## n_GO_term num_DEG direction_DEG slct.ct
## 1 300 590 upDEG CD14 Mono
## 3 300 590 upDEG CD14 Mono
## 5 300 590 upDEG CD14 Mono
## 4 200 590 upDEG CD14 Mono
## 2 223 590 upDEG CD14 Mono
## 6 98 590 upDEG CD14 Mono
Gene lists from related pathways, such as IFNG, IFNB, and TNFA which are all linked to immune responses, frequently share many genes. This overlap makes it challenging to differentiate the activation of these pathways. For example, as the result above shows, DE genes due to IFNB stimulation are enriched in not just the IFNB pathway, but also in IFNG and TNFA pathways. To overcome this challenge, we have introduced a concept of pathway-exclusive gene lists. Essentially, for any two related pathways, we define the exclusive genes of one pathway as those that are absent from the gene list of the other. To refine this further, we employed a more stringent criterion to exclude genes that, while potentially related, are not explicitly listed in the gene list of the other pathway (For a detailed explanation, please refer to our paper). Performing enrichment tests using the exclusive gene lists enhances our ability to accurately distinguish activations among closely associated pathways.
# only extract the exclusive gene lists that are relevant to IFNB pathway
exclusive_plist = exclusive_plist[grep(pattern = "IFNB", names(exclusive_plist))]
# A wrapper function to perform both DE and enrichment test
res_exclusive = DEenrich(object = ifnb,
plist = exclusive_plist,
ident = "stim",
ident.1 = "STIM",
ident.2 = "CTRL",
split.by = "seurat_annotations",
direction = "up",
logfc.threshold = 0.2,
p.val.cutoff = 0.05,
min.pct = 0.1)
# check the enrichment results for CD14 Monocytes
head(res_exclusive$`CD14 Mono`)
## GO_term OR Pval combined_score num_LH num_PH
## 3 IFNB_REMOVE_TNFA 9.8573087 8.350578e-30 659.998290889661 90 117
## 2 IFNB_REMOVE_IFNG 9.8435531 1.200209e-19 428.850346678085 58 74
## 1 IFNG_REMOVE_IFNB 1.7696262 6.530225e-02 4.82882984756938 16 38
## 4 TNFA_REMOVE_IFNB 0.1313289 9.999754e-01 3.22910851679664e-06 3 42
## n_GO_term num_DEG direction_DEG slct.ct
## 3 433 590 upDEG CD14 Mono
## 2 289 590 upDEG CD14 Mono
## 1 198 590 upDEG CD14 Mono
## 4 188 590 upDEG CD14 Mono
We can see that the exclusive gene lists for IFNB (removing TNFA) and IFNB (removing IFNG) are still enriched for IFNB-stimulated DE genes. But we do not observe signals from IFNG (removing IFNB) or TNFA (removing IFNB), indicating that the underlying activated pathway during IFNB stimulation is indeed IFNB, while IFNG and TNFA are showing enrichment just because of their substantial overlap with IFNB.
We can now visualize the enrichment results across all the cell types in the ifnb dataset. First we will check the results for the standard enrichment test
DEenrich_DotPlot(res,
direction = "up",
plot_title = "Standard pathway gene lists")
DEenrich_DotPlot(res_exclusive,
direction = "up",
plot_title = "Pathway exclusive gene lists",
OR_cutoff = 10)
Apart from performing enrishment test, we can also evaluate the pathway activity by calculating the over all expression level of all the genes within a gene list (the so-called module score analysis). We will use package “UCell” for module score analysis. Alternatively, we can use the built-in function AddModuleScore() from Seurat as well.
ifnb = UCell::AddModuleScore_UCell(ifnb,
features = plist[c("IFNB_program1_down", "IFNG_program1_down",
"TNFA_program1_down", "TGFB1_program1_down")] )
# using VlnPlot to visualize the score of each cell
VlnPlot(ifnb,
features = grep("_UCell", names(ifnb@meta.data), value = T),
pt.size = 0,
group.by = "seurat_annotations",
split.by = "stim",
ncol = 2) &
theme(legend.position = "NA",
axis.title = element_text(size = 15),
axis.text = element_text(size = 12),
plot.title = element_text(size = 15))
We can observe very similar results as in our enrichment tests, where all IFNB, IFNG, and TNF pathways show a high activity (and not for TGFB pathway) in the IFNB-stimulated cells compared to the non-stimulated cells. And if we repeat the module score analysis using the pathway exclusive gene lists, we should be able to determine the pathway actually being activated (i.e., IFNB pathway).
ifnb = UCell::AddModuleScore_UCell(ifnb,
features = exclusive_plist)
# using VlnPlot to visualize the score of each cell
VlnPlot(ifnb,
features = grep("_REMOVE_.*UCell", names(ifnb@meta.data), value = T),
pt.size = 0,
group.by = "seurat_annotations",
split.by = "stim",
ncol = 2) &
theme(legend.position = "NA",
axis.title = element_text(size = 15),
axis.text = element_text(size = 12),
plot.title = element_text(size = 15))