In this vignette, we demonstrate RNA fingerprinting on a Perturb-seq dataset that can be assessed on GEO under accession number GSE132080. We can begin by loading in the appropriate packages and setting up the data.
library(RNAFingerprinting)
library(Seurat)
library(pheatmap)
library(stringr)
options(future.globals.maxSize = 10 * 1024^3)
counts <- ReadMtx(mtx = "/brahms/grabskii/jost_data/GSE132080_10X_matrix.mtx.gz",
cells = "/brahms/grabskii/jost_data/GSE132080_10X_barcodes.tsv.gz",
features = "/brahms/grabskii/jost_data/GSE132080_10X_genes.tsv.gz")
meta_data <- read.csv("/brahms/grabskii/jost_data/GSE132080_cell_identities.csv.gz")
rownames(meta_data) <- meta_data$cell_barcode
jost <- CreateSeuratObject(counts = counts, meta.data = meta_data)
jost <- subset(jost, subset = number_of_cells == 1)
jost <- subset(jost, subset = guide_identity != '*')
jost <- jost[,!is.na(jost$guide_identity)]
jost$gene <- str_extract(jost$guide_identity, "^[^_]+")
jost$gRNA_id <- sapply(jost$guide_identity, function(x) str_sub(x, -2))Here, every gene was targeted by 5 different guide RNAs, where the one ending with “00” indicates a perfectly matched sequence, and the rest have titrated strength due to nucleotide mismatches. We split this object into two: one containing only the perfectly matched sequences, and one containing the rest. We also arbitrarily distribute the non-targeting guide RNAs among them.
jost_match <- jost[,jost$gRNA_id == '00' |
(jost$gene == 'neg' &
jost$gRNA_id %in% c('01','02','06','17','27'))]
jost_mismatch <- jost[,(jost$gRNA_id != '00' & jost$gene != 'neg') |
(jost$gene == 'neg' &
jost$gRNA_id %in% c('28','40','54','83','89'))]Finally, we normalize each resulting object with
SCTransform; importantly, we specify
return.only.var.genes = FALSE.
We will learn fingerprints corresponding to each perturbation in
jost_match. To start, we need to specify the features that
should be used for these fingerprints. We recommend using the
ChooseFeatures function to automatically select features,
but a custom feature set can also be passed in. We provide two options
for ChooseFeatures:
method = de: This option performs differential
expression for each unique perturbation against the control cells, then
combines those gene sets together. We typically recommend this
option.method = expression: This option selects all genes
whose average expression across all cells is above some threshold. This
should be used if there are too many unique perturbations to feasibly
run differential expression for each one.Here, we use the de option. Note that we specify the
metadata column where the perturbation IDs are stored
(gene), and we provide the name of the control class
(neg). We can further subset the features to ensure that
they are all present in our query data jost_mismatch; while
a degree of feature mismatch between the fingerprints and the query data
is allowable, we encourage using consistent feature sets where
possible.
features <- ChooseFeatures(jost_match, method = 'de',
perturbation_meta = 'gene', control_class = 'neg')
features <- intersect(features,
rownames(GetAssayData(jost_mismatch, assay = 'SCT',
layer = 'scale.data')))Now we can learn our dictionary of fingerprints. Note that this step
can be parallelized by setting mc.cores, which is just 1 by
default.
dictionary <- LearnDictionary(jost_match, perturbation_meta = 'gene',
control_class = 'neg', features = features)
#> Processing data...
#> Learning fingerprints...
#> Learned 25 of 25 possible perturbations; to learn all possible perturbations, even if low-signal, set override_filter=TRUE
#> Perturbations kept: ALDOA, ATP5E, BCR, CAD, CDC23, COX11, DBR1, DUT, EIF2S1, GATA1, GINS1, GNB2L1, HSPA5, HSPA9, HSPE1, MTOR, POLR1D, POLR2H, RAN, RPL9, RPS14, RPS15, RPS18, SEC61A1, TUBB
#> Perturbations filtered out:We use built-in filtering steps to exclude any low-signal
fingerprints that we are unlikely to be able to classify, which can be
overriden by setting override_filter = TRUE.
We can visualize relationships among perturbations by looking at the correlation of their fingerprints.
Next, we can map cells from jost_mismatch to this
dictionary of fingerprints. We first demonstrate this at the
group-level, where cells of the same condition can be assigned together.
Here, we map cells according to their perturbation labels, and obtain a
dataframe summarizing these assignments.
jost_mismatch <- FingerprintCells(jost_mismatch, condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = TRUE)
#> Processing data...
#> Preparing...
#> Classifying...
group_level_summary <- SummarizeResults(jost_mismatch, group_level = TRUE,
condition_meta = 'gene', control_class = 'neg')
print(group_level_summary[1:5,])
#> top_credible_set.gene top_lbf.gene all_credible_sets.gene all_lbfs.gene
#> CAD CAD 4210.96 CAD 4210.96
#> SEC61A1 SEC61A1 3077.94 SEC61A1 3077.94
#> RPL9 RPL9 3004.63 RPL9 3004.63
#> HSPA5 HSPA5 2872.76 HSPA5 2872.76
#> DBR1 DBR1 2833.08 DBR1 2833.08
#> all_signs.gene
#> CAD +
#> SEC61A1 +
#> RPL9 +
#> HSPA5 +
#> DBR1 +The columns in this dataframe can be interpreted as follows:
top_credible_set: The top identified effect. If there
is uncertainty among multiple perturbations, they are separated by
commas. If no assignment is made, then unassigned is
returned.top_lbf: The log Bayes factor associated with the top
credible set. Higher values indicate greater certainty. If no assignment
is made, then this is set to NA.all_credible_sets: If more than one credible set is
found, all credible sets separated by semicolons.all_lbfs: All log Bayes factors separated by
semicolons.all_signs: The signs of the matches made for each
perturbation in each credible set. Positive signs indicate a match in
the same directionality as the perturbation. Negative signs should
generally be interpreted with care.These values are also added as metadata to the Seurat object, with
each cell containing the information associated with its condition
label. Note that these metadata columns are appended with the condition
metadata suffix by default, if classification is done at the
group-level. A user-specified suffix can also be passed in using the
suffix parameter.
By default, these summaries are ranked by the Bayes factor of the top credible set, with higher Bayes factor assignments indicating greater confidence. Here, we find a high degree of accuracy in the assignments, meaning that the right perturbation is in the top credible set.
Optionally, this summary can also be put into tidy
format, where each individual credible set is described in its own row.
Here, the resulting dataframe is not much different, because only one
credible set happened to be returned for each group.
tidy_summary <- SummarizeResults(jost_mismatch, group_level = TRUE,
condition_meta = 'gene', control_class = 'neg',
tidy = TRUE)
print(tidy_summary[1:10,])
#> query cs lbf sign
#> 1 CAD CAD 4210.96 +
#> 2 SEC61A1 SEC61A1 3077.94 +
#> 3 RPL9 RPL9 3004.63 +
#> 4 HSPA5 HSPA5 2872.76 +
#> 5 DBR1 DBR1 2833.08 +
#> 6 EIF2S1 EIF2S1,HSPA9 2517.88 +,+
#> 7 POLR2H POLR2H 2162.38 +
#> 8 HSPE1 HSPE1 2127.7 +
#> 9 DUT DUT 2122.6 +
#> 10 POLR1D POLR1D 1950.08 +We can visualize individual assignments for any group using what we
call Long Island City plots, which are inspired by Manhattan plots in
genome-wide association-studies. This displays the largest coefficients,
in magnitude, for each perturbation conditional on inclusion in the
model, and sorts the perturbations via hierarchical clustering (default)
or alphabetically. Each identified credible set is indicated with a
different color. Optionally, a faster version can be created by setting
approx = TRUE to show marginal correlations on the y-axis.
Here, we visualize two examples of our matches.
LongIslandCityPlot(jost_mismatch, query = 'CAD', condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = T, cs = 10)
#> This can be slow in large settings; you can speed up this function with approx = TRUE.LongIslandCityPlot(jost_mismatch, query = 'EIF2S1', condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = T, cs = 10)For any given match, we can identify what genes were most responsible
for making that assignment using the ExplainMatch function.
This is done by testing each gene to see which genes, when removed, have
the biggest impact on the Bayes factor. Here, we can identify and print
out the top 25 genes that drove the assignment of CAD-perturbed cells to
CAD by setting plot = FALSE:
ExplainMatch(jost_mismatch, query = 'CAD', match = 'CAD', condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = T, num_genes = 25, plot = FALSE)
#> genes delta_BF
#> AC016629.3 AC016629.3 -84.67232
#> MT-CYB MT-CYB -78.95598
#> RPLP1 RPLP1 -65.54225
#> RPS4X RPS4X -63.03516
#> RPS12 RPS12 -62.33166
#> MT-CO1 MT-CO1 -60.31671
#> BCYRN1 BCYRN1 -57.69514
#> PVT1 PVT1 -53.88305
#> FTL FTL -51.96460
#> UBB UBB -49.56154
#> TXNIP TXNIP -49.06386
#> APOC1 APOC1 -49.04864
#> RPL41 RPL41 -46.24781
#> MYL6 MYL6 -43.58336
#> SOD1 SOD1 -42.09382
#> UROD UROD -41.72311
#> RPLP0 RPLP0 -35.76947
#> RPL18A RPL18A -35.30757
#> MT-CO3 MT-CO3 -33.91766
#> DYNLL1 DYNLL1 -33.07042
#> SNAPC5 SNAPC5 -30.83733
#> RPL28 RPL28 -29.46071
#> NEAT1 NEAT1 -29.32222
#> GSTP1 GSTP1 -29.26513
#> RPL26 RPL26 -27.74860We can also visualize these top genes in both the query data
(jost_mismatch) and the reference data
(jost_match) by using plot = TRUE (which is
the default setting) and specifying details about the reference
object:
ExplainMatch(jost_mismatch, query = 'CAD', match = 'CAD', condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = T, ref = jost_match, perturbation_meta = 'gene',
control_class_ref = 'neg', num_genes = 25)By specifying the include_de parameter, we can
additionally show the top differentially expressed genes in both the
query and the reference data below these top driver genes:
ExplainMatch(jost_mismatch, query = 'CAD', match = 'CAD', condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = T, ref = jost_match, perturbation_meta = 'gene',
control_class_ref = 'neg', num_genes = 25, include_de = 10)As an additional example, here we again map cells at the group-level, but this time using their guide identities. Because we want to use all the non-targeting cells as the “control” cells, regardless of the specific non-targeting guide, we create a new metadata column that contains the guide IDs for all the perturbed cells, but just “neg” for each non-targeting cell.
jost_mismatch$guide_identity2 <- jost_mismatch$guide_identity
jost_mismatch$guide_identity2[jost_mismatch$gene == 'neg'] <- 'neg'Now we can map, specifying the metadata values appropriately:
jost_mismatch <- FingerprintCells(jost_mismatch, condition_meta = 'guide_identity2',
control_class = 'neg', dictionary = dictionary,
group_level = TRUE)
#> Processing data...
#> Preparing...
#> Classifying...
group_level_summary <- SummarizeResults(jost_mismatch, group_level = TRUE,
condition_meta = 'guide_identity2',
control_class = 'neg')
print(group_level_summary[1:5,])
#> top_credible_set.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08 HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_01 HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_04 HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_06 HSPA5
#> CAD_CAD_+_27440280.23-P1P2_03 CAD
#> top_lbf.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08 11389.69
#> HSPA5_HSPA5_+_128003624.23-P1P2_01 10790.87
#> HSPA5_HSPA5_+_128003624.23-P1P2_04 10681.25
#> HSPA5_HSPA5_+_128003624.23-P1P2_06 4996.05
#> CAD_CAD_+_27440280.23-P1P2_03 3972.76
#> all_credible_sets.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08 HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_01 HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_04 HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_06 HSPA5
#> CAD_CAD_+_27440280.23-P1P2_03 CAD
#> all_lbfs.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08 11389.69
#> HSPA5_HSPA5_+_128003624.23-P1P2_01 10790.87
#> HSPA5_HSPA5_+_128003624.23-P1P2_04 10681.25
#> HSPA5_HSPA5_+_128003624.23-P1P2_06 4996.05
#> CAD_CAD_+_27440280.23-P1P2_03 3972.76
#> all_signs.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08 +
#> HSPA5_HSPA5_+_128003624.23-P1P2_01 +
#> HSPA5_HSPA5_+_128003624.23-P1P2_04 +
#> HSPA5_HSPA5_+_128003624.23-P1P2_06 +
#> CAD_CAD_+_27440280.23-P1P2_03 +In most cases, we still find the expected match for each guide, albeit sometimes with more uncertainty. Among our lowest certainty calls, we do see some mistakes, and occasionally more than one credible set is used. There are also some guides which are left unassigned, indicating that the signal differed enough that we were unable to make a match.
For one of our incorrect assignments, the ExplainMatch
function helps us see that the match is imperfect, which demonstrates
how this visualization can be used to assess whether an assignment makes
sense. While the top driver genes look consistent, the top DE genes from
the reference are not replicated in the query.
ExplainMatch(jost_mismatch, query = 'MTOR_MTOR_+_11322547.23-P1P2_10',
match = 'GINS1', condition_meta = 'guide_identity2',
control_class = 'neg', dictionary = dictionary,
group_level = T, ref = jost_match, perturbation_meta = 'gene',
control_class_ref = 'neg', num_genes = 25, include_de = 10)For some of the guides that don’t match, we can look at top DEGs from the reference to understand why:
gata1_degs <- FindMarkers(jost_match, ident.1 = 'GATA1', ident.2 = 'neg',
group.by = 'gene')
gata1_degs <- rownames(gata1_degs)[1:25][order(gata1_degs$avg_log2FC[1:25])]
DoHeatmap(jost_mismatch, gata1_degs, group.by = 'guide_identity2', label = F,
cells = colnames(jost_mismatch)[jost_mismatch$gene %in% c('neg','GATA1')])Here, we see that the guide identities in the query that did match indeed show a strong phenotype, whereas the one that didn’t match (GATA1_GATA1_-_48645022.23-P1P2_12) does not show the same gene expression pattern.
We can also map cells on an individual basis by setting
group_level = FALSE. This will be slower than group-level
classification, and we can speed it up by using parallelization with
mc.cores. For example, here, we use 5 cores.
jost_mismatch <- FingerprintCells(jost_mismatch, condition_meta = 'gene',
control_class = 'neg', dictionary = dictionary,
group_level = FALSE, mc.cores = 5)
#> Processing data...
#> Preparing...
#> Classifying...
cell_level_summary <- SummarizeResults(jost_mismatch, group_level = FALSE,
condition_meta = 'gene', control_class = 'neg')
print(head(cell_level_summary))
#> gene top_credible_set top_lbf all_credible_sets all_lbfs
#> CAGAATCCATTAACCG-2 HSPA5 HSPA5 2955.00 HSPA5 2955
#> CATTATCTCCGCATAA-2 HSPA5 HSPA5 2795.29 HSPA5 2795.29
#> CTCTACGGTCAACATC-2 HSPA5 HSPA5 2760.40 HSPA5 2760.4
#> TCAATCTAGCGTCTAT-2 HSPA5 HSPA5 2611.56 HSPA5 2611.56
#> CACCTTGGTTACGTCA-2 HSPA5 HSPA5 2572.06 HSPA5 2572.06
#> GGAACTTAGGAACTGC-2 HSPA5 HSPA5 2523.96 HSPA5 2523.96
#> all_signs
#> CAGAATCCATTAACCG-2 +
#> CATTATCTCCGCATAA-2 +
#> CTCTACGGTCAACATC-2 +
#> TCAATCTAGCGTCTAT-2 +
#> CACCTTGGTTACGTCA-2 +
#> GGAACTTAGGAACTGC-2 +This time, the metadata added to the Seurat object reflects the assignments made for each individual cell. We can confirm that many of these top assignments match the true perturbation labels.
Group-level and cell-level assignments represent a trade-off, where group-level assignments increase power to detect even very subtle signals, but cell-level assignments allow us to better characterize heterogeneity. For example, here we can visualize HSPA5 markers in all the cells that truly received an HSPA5 guide split by their top assigned label.
hspa5_markers <- FindMarkers(jost_match, ident.1 = 'HSPA5', ident.2 = 'neg',
group.by = 'gene')
hspa5_markers <- rownames(hspa5_markers)[1:50][order(hspa5_markers$avg_log2FC[1:50])]
DoHeatmap(jost_mismatch, hspa5_markers,
cells = colnames(jost_mismatch)[jost_mismatch$gene == 'HSPA5'],
group.by = 'top_credible_set', size = 3, label = F)We can see that the majority of these cells either received the specific label HSPA5, or were left unassigned. When comparing these two groups, we see that the cells labeled as HSPA5 strongly show the associated signal, whereas the unassigned cells appear to have escaped the perturbation.
Finally, we can produce a 2D visualization of all assigned cells by
using the ProjectFingerprints function. This projects each
assigned cell onto the fingerprints, which is stored in a
proj assay, and then runs UMAP on those values, which is
stored as the projUMAP reduction. Here, we can label each
cell by its true perturbation:
jost_mismatch <- ProjectFingerprints(jost_mismatch, 'gene', 'neg', dictionary)
DimPlot(jost_mismatch, group.by = 'gene', reduction = 'projUMAP', label = T)We can see that cells generally separate based on their true perturbation labels, with very similar perturbation classes grouping together.