Introduction

In this tutorial, we will describe an R package “Mixscale” for analyzing Perturb-seq data. Mixscale contains functions designed to tackle the following tasks:
1. calculate ‘Mixscale scores’ for cells that receives the same perturbation to quantify the heterogeneity in perturbation strength
2. perform a scoring-based weighted differential expression (DE) tests to identify DE genes for each perturbation
3. perform different levels of decomposition analysis to identify correlated perturbations and group them into a program
4. perform a PCA-based permutation test to extract shared genes for the perturbation programs (program gene signature)
5. identify shared and unique signature between two relevant programs
6. perform gene set enrichment tests using the program signature for new datasets
7. perform module score analyses using the program signature to quantify the program activity in new datasets

The tutorial is divided into two sections. The first section will describe task 1 to 5 using a public Perturb-seq dataset from the Weissman Lab (Jost et al 2020), which can be downloaded from GSE132080. The second section will describe task 6 and 7 using the pathway gene lists generated from our study (available at Zenodo) and an interferon-beta stimulated human PBMCs dataset (ifnb) from Kang et al 2017. This ifnb dataset is available via the SeuratData package (see section 2 below).

install the Mixscale package

The package can be easily installed using the following command. Please make sure the dependencies like Seurat are installed prior to this.

# dependencies
install.packages("Seurat")
install.packages("PMA")
install.packages("protoclust")
BiocManager::install("glmGamPoi")

# Mixscale package 
devtools::install_github("longmanz/Mixscale")

load the packages

options(Seurat.object.assay.version = 'v3') 

library(Seurat)
library(ggridges)
library(ggplot2)
library(Mixscale)

Section A.

In this section we will focus on how to use Mixscale to analyze Perturb-seq data.

0. Description of the demo data

The demo dataset from Jost et al 2020 contains CRISPRi Perturb-seq data targeting 25 key genes involved in essential cell biological processes. We will load in the count matrix to create a Seurat object and append the provided meta data to it.
One special feature of this dataset is that, for each perturbation target gene, there are five different gRNAs designed to target it. One of the gRNA has the perfectly matched sequence for the target region (labelled with “_00”), while the others contain 1~3 nucleotide mismatches so that their perturbation stength is “titrated”. We will treat the cells that have the same target gene as the same group in our downstream analyses.

Click here to see how to generate the Seurat object
library(stringr)

# load in the count matrix downloaded from GSE132080
ct_mat = ReadMtx(mtx = "GSE132080/GSE132080_10X_matrix.mtx", 
                 cells = "GSE132080/GSE132080_10X_barcodes.tsv", 
                 features = "GSE132080/GSE132080_10X_genes.tsv")
# load the meta_data
meta_data = read.csv("GSE132080/GSE132080_cell_identities.csv")
rownames(meta_data) = meta_data$cell_barcode

# create a seurat object 
seurat_obj = CreateSeuratObject(counts = ct_mat, meta.data = meta_data)
rm(ct_mat, meta_data)

# retrieve the guide information for each cell
txt = seurat_obj$guide_identity
txt2 = str_extract(txt, "^[^_]+")
txt3 = gsub(pattern = "^[^_]+_", replacement = "", txt)
seurat_obj[['gene']] = txt2
seurat_obj[['gRNA_name']] = txt3
seurat_obj[['cell_type']] = "K562"
rm(txt, txt2, txt3)

# remove ambiguous cells 
seurat_obj = subset(seurat_obj, subset = number_of_cells == 1)  # 19594 cells remain
seurat_obj = subset(seurat_obj, subset = guide_identity != '*')  # 19587 cells remain


1. Pre-processing and calculating the Mixscale score

We will first run standard pre-processing (normalization, find variable features, etc) for the dataset. Then, we will follow the standard Mixscape analysis to calculate local perturbation signatures that mitigate confounding effects. Briefly speaking, for each cell we will search for its 20 nearest neighbors from the non-targeted (NT) cells, and then remove all technical variation so that perturbation-specific effect can be revealed.

# quick check of the data
seurat_obj
## An object of class Seurat 
## 33694 features across 19587 samples within 1 assay 
## Active assay: RNA (33694 features, 0 variable features)
##  2 layers present: counts, data
# standard pre-processing
seurat_obj = NormalizeData(seurat_obj)
seurat_obj = FindVariableFeatures(seurat_obj)
seurat_obj = ScaleData(seurat_obj)
seurat_obj = RunPCA(seurat_obj)

# calculate Perturbation signatures 
seurat_obj <- CalcPerturbSig(
    object = seurat_obj, 
    assay = "RNA", 
    slot = "data", 
    gd.class ="gene", 
    nt.cell.class = "neg", 
    reduction = "pca", 
    ndims = 40, 
    num.neighbors = 20, 
    new.assay.name = "PRTB", 
    split.by = NULL)  # to specify the metadata column if multiple biological contexts (e.g., cell lines) exist

We will now compute the Mixscale scores for each cell in every perturbation group. The standardized Mixscale scores will be saved in the metadata under the column named “mixscale_score”. Note that if multiple biological contexts (e.g., cell lines) exist, we can use the ‘split.by’ to specify the corresponding metadata column (in this example we only have one cell line).

# Mixscale
seurat_obj = RunMixscale(
    object = seurat_obj, 
    assay = "PRTB", 
    slot = "scale.data", 
    labels = "gene", 
    nt.class.name = "neg", 
    min.de.genes = 5, 
    logfc.threshold = 0.2,
    de.assay = "RNA",
    max.de.genes = 100, 
    new.class.name = "mixscale_score", 
    fine.mode = F, 
    verbose = F, 
    split.by = NULL)

2. Visualizations for the scores

We will now use some plotting functions to explore the perturbation scores that we just calculated.

# a. Check the distribution of the scores for the perturbations
RidgePlot(
    seurat_obj,
    features = "mixscale_score",
    group.by = "gene") + NoLegend()

# b. Check if the scores correlate with the expression level of the target gene itself
Mixscale_ScatterPlot(object = seurat_obj, 
                     nt.class.name = "neg", 
                     slct.ident = unique(seurat_obj$gene)[unique(seurat_obj$gene) != "neg"][1:10], 
                     nbin = 10, 
                     facet_wrap = "gene") + NoLegend()

3. Differential expression (DE) analysis

After calculating the scores, we can use the scores to enhance the statistical power of DE analysis by using them as a “weights” in the regression model. Briefly speaking, instead of coding the NT cells as 0 and the targeted cells as 1, we used the standardized scores to code the targeted cells in the regression, so that cells with stronger perturbation strength will have higher “weights” and vice versa. Similar as above, if multiple biological contexts (e.g., cell lines) exist, we can use the ‘split.by’ to specify the corresponding metadata column, and the DE tests will be performed in a multi-variate manner. Cell line specific P-values will be reported separately in the final output.

# run score-based weighted DE test for 12 selected perturbations. It will return a list of data frames (one for each perturbation)
de_res = Run_wmvRegDE(object = seurat_obj, assay = "RNA", slot = "counts",
                      labels = "gene", nt.class.name = "neg", 
                      PRTB_list = c("GATA1", "GINS1", "MTOR"),
                      logfc.threshold = 0.2, 
                      split.by = NULL)

# have a quick look at the DE results
head(de_res[[1]])
##        gene_ID     log2FC beta_weight p_weight DE_method
## CYBA      CYBA -1.4268525  0.08554419        0  weighted
## RPL18A  RPL18A  0.4359628 -0.03871204        0  weighted
## RPS12    RPS12  0.4241743 -0.03550264        0  weighted
## RPS2      RPS2  0.4265143 -0.03764625        0  weighted
## TMSB4X  TMSB4X -1.6341147  0.08718808        0  weighted
## YBX1      YBX1  0.4969422 -0.05072183        0  weighted

We can now explore the top DE genes for each perturbations using the customized DoHeatmap function, where cells are ordered by Mixscale scores.

# select the top 20 DE genes from the perturbation
top_res = de_res[["GATA1"]][order(de_res[["GATA1"]]$p_weight)[1:20], ]
# order the DE genes based on its log-fold-change
top_DEG = rownames(top_res[order(top_res$log2FC, decreasing = T), ])

# heatmap for the top DE genes. cells ordered by Mixscale scores. The expression level of the target gene will be plotted in the first row.
Mixscale_DoHeatmap(object = seurat_obj, 
                   labels = "gene", 
                   nt.class.name = "neg", 
                   slct.ident = "GATA1", 
                   mixscale.score.name = "mixscale_score",
                   features = c("GATA1", top_DEG), angle = 0, hjust = 0.5) 

We can examine the DE results for additional perturbations using similar codes as above. For the “GINS1” perturbation, the target gene’s expression appears poorly correlated with Mixscale scores, as shown in the Mixscale_ScatterPlot. However, the Mixscale scores correlate well with the expression levels of other DE genes, suggesting that Mixscale scores might more accurately indicate the perturbation strength in cells compared to target gene’s expression (especially when it is a lowly-expressed gene as “GINS1”).

Section B.

In this section we will focus on how to use the pathway signatures from our study to run gene set enrichment test in external datasets.

0. Introduction

We will use the interferon-beta (IFNB) stimulated human PBMCs dataset (ifnb) from Kang et al 2017 (available via SeuratData package) to demonstrate how to perform gene set enrichment analyses using the pathway gene sets from our study. We aim to show that by using our pathway gene lists, we can correctly infer the pathway activation of IFNB across different cell types in the human PBMCs.

# install the ifnb dataset 
SeuratData::InstallData("ifnb")
# load dataset
ifnb <- SeuratData::LoadData("ifnb")

We can then load the pathway gene sets we generated (can be downloaded from Zenodo). There are two versions of pathway gene lists provided. One is the standard pathway gene list for different pathway programs we compiled, and the other one is the pathway exclusive gene list that filtered out the shared genes shared with other relevant pathways in the experiment.

plist = readRDS("Pathway_genelist.rds")
exclusive_plist = readRDS("Pathway_Exclusive_genelist.rds")

1. DE tests and Fisher enrichment tests for ifnb dataset

We will first conduct Wilcox DE tests between the control and the IFNB-stimulated cells in each cell types in the ifnb dataset. Then, we will perform Fisher enrichment tests for the DE genes from each of the cell types, testing them against the pathway gene lists we just load. These two steps are merged by a wrapper function. During the DEenrich() process, if users wish to conduct the DE tests for each cell type individually, they can use the ‘split.by’ to specify the corresponding metadata column.

# Normalize the counts 
ifnb = NormalizeData(ifnb)

# A wrapper function to perform both DE and enrichment test
Idents(ifnb) = "stim"
res = DEenrich(object = ifnb, 
               plist = plist, 
               ident = "stim",
               ident.1 = "STIM",
               ident.2 = "CTRL", 
               split.by = "seurat_annotations", 
               direction = "up", 
               logfc.threshold = 0.2,
               p.val.cutoff = 0.05, 
               min.pct = 0.1)

# check the enrichment results for CD14 Monocytes
head(res$`CD14 Mono`)
##              GO_term        OR         Pval   combined_score num_LH num_PH
## 1 IFNB_program1_down 30.551460 1.566042e-69 4840.26233710267    148    164
## 3 IFNG_program1_down 13.764122 6.311636e-47 1464.21496809152    125    154
## 5 TNFA_program1_down  7.315289 8.145677e-26 422.602219842384     91    128
## 4 IFNG_program2_down 14.092937 5.223443e-21 658.156092839788     54     64
## 2 IFNB_program2_down  1.353308 1.792922e-01  2.3259826904708     20     57
## 6 TNFA_program2_down  1.537152 3.469431e-01 1.62722098095663      5     11
##   n_GO_term num_DEG direction_DEG   slct.ct
## 1       300     590         upDEG CD14 Mono
## 3       300     590         upDEG CD14 Mono
## 5       300     590         upDEG CD14 Mono
## 4       200     590         upDEG CD14 Mono
## 2       223     590         upDEG CD14 Mono
## 6        98     590         upDEG CD14 Mono

2. Enrichment tests using pathway exclusive gene lists

Gene lists from related pathways, such as IFNG, IFNB, and TNFA which are all linked to immune responses, frequently share many genes. This overlap makes it challenging to differentiate the activation of these pathways. For example, as the result above shows, DE genes due to IFNB stimulation are enriched in not just the IFNB pathway, but also in IFNG and TNFA pathways. To overcome this challenge, we have introduced a concept of pathway-exclusive gene lists. Essentially, for any two related pathways, we define the exclusive genes of one pathway as those that are absent from the gene list of the other. To refine this further, we employed a more stringent criterion to exclude genes that, while potentially related, are not explicitly listed in the gene list of the other pathway (For a detailed explanation, please refer to our paper). Performing enrichment tests using the exclusive gene lists enhances our ability to accurately distinguish activations among closely associated pathways.

# only extract the exclusive gene lists that are relevant to IFNB pathway
exclusive_plist = exclusive_plist[grep(pattern = "IFNB", names(exclusive_plist))]

# A wrapper function to perform both DE and enrichment test
res_exclusive = DEenrich(object = ifnb, 
                         plist = exclusive_plist, 
                         ident = "stim",
                         ident.1 = "STIM",
                         ident.2 = "CTRL", 
                         split.by = "seurat_annotations", 
                         direction = "up", 
                         logfc.threshold = 0.2,
                         p.val.cutoff = 0.05, 
                         min.pct = 0.1)

# check the enrichment results for CD14 Monocytes
head(res_exclusive$`CD14 Mono`)
##            GO_term        OR         Pval       combined_score num_LH num_PH
## 3 IFNB_REMOVE_TNFA 9.8573087 8.350578e-30     659.998290889661     90    117
## 2 IFNB_REMOVE_IFNG 9.8435531 1.200209e-19     428.850346678085     58     74
## 1 IFNG_REMOVE_IFNB 1.7696262 6.530225e-02     4.82882984756938     16     38
## 4 TNFA_REMOVE_IFNB 0.1313289 9.999754e-01 3.22910851679664e-06      3     42
##   n_GO_term num_DEG direction_DEG   slct.ct
## 3       433     590         upDEG CD14 Mono
## 2       289     590         upDEG CD14 Mono
## 1       198     590         upDEG CD14 Mono
## 4       188     590         upDEG CD14 Mono

We can see that the exclusive gene lists for IFNB (removing TNFA) and IFNB (removing IFNG) are still enriched for IFNB-stimulated DE genes. But we do not observe signals from IFNG (removing IFNB) or TNFA (removing IFNB), indicating that the underlying activated pathway during IFNB stimulation is indeed IFNB, while IFNG and TNFA are showing enrichment just because of their substantial overlap with IFNB.

3. Visualization

We can now visualize the enrichment results across all the cell types in the ifnb dataset. First we will check the results for the standard enrichment test

DEenrich_DotPlot(res, 
                 direction = "up", 
                 plot_title = "Standard pathway gene lists")

DEenrich_DotPlot(res_exclusive, 
                 direction = "up", 
                 plot_title = "Pathway exclusive gene lists", 
                 OR_cutoff = 10)

4. Module score analysis

Apart from performing enrishment test, we can also evaluate the pathway activity by calculating the over all expression level of all the genes within a gene list (the so-called module score analysis). We will use package “UCell” for module score analysis. Alternatively, we can use the built-in function AddModuleScore() from Seurat as well.

ifnb = UCell::AddModuleScore_UCell(ifnb, 
                                   features = plist[c("IFNB_program1_down", "IFNG_program1_down", 
                                                      "TNFA_program1_down", "TGFB1_program1_down")] )

# using VlnPlot to visualize the score of each cell 
VlnPlot(ifnb, 
             features = grep("_UCell", names(ifnb@meta.data), value = T), 
             pt.size = 0, 
             group.by = "seurat_annotations", 
             split.by = "stim", 
             ncol = 2) &
    theme(legend.position = "NA", 
          axis.title = element_text(size = 15), 
          axis.text = element_text(size = 12), 
          plot.title = element_text(size = 15)) 

We can observe very similar results as in our enrichment tests, where all IFNB, IFNG, and TNF pathways show a high activity (and not for TGFB pathway) in the IFNB-stimulated cells compared to the non-stimulated cells. And if we repeat the module score analysis using the pathway exclusive gene lists, we should be able to determine the pathway actually being activated (i.e., IFNB pathway).

ifnb = UCell::AddModuleScore_UCell(ifnb, 
                                   features = exclusive_plist)

# using VlnPlot to visualize the score of each cell 
VlnPlot(ifnb, 
             features = grep("_REMOVE_.*UCell", names(ifnb@meta.data), value = T), 
             pt.size = 0, 
             group.by = "seurat_annotations", 
             split.by = "stim", 
             ncol = 2) &
    theme(legend.position = "NA", 
          axis.title = element_text(size = 15), 
          axis.text = element_text(size = 12), 
          plot.title = element_text(size = 15))