jost-example-usage

Set-up

In this vignette, we demonstrate RNA fingerprinting on a Perturb-seq dataset that can be assessed on GEO under accession number GSE132080. We can begin by loading in the appropriate packages and setting up the data.

library(RNAFingerprinting)
library(Seurat)
library(pheatmap)
library(stringr)
options(future.globals.maxSize = 10 * 1024^3)

counts <- ReadMtx(mtx = "/brahms/grabskii/jost_data/GSE132080_10X_matrix.mtx.gz",
                  cells = "/brahms/grabskii/jost_data/GSE132080_10X_barcodes.tsv.gz", 
                  features = "/brahms/grabskii/jost_data/GSE132080_10X_genes.tsv.gz")
meta_data <- read.csv("/brahms/grabskii/jost_data/GSE132080_cell_identities.csv.gz")
rownames(meta_data) <- meta_data$cell_barcode
jost <- CreateSeuratObject(counts = counts, meta.data = meta_data)
jost <- subset(jost, subset = number_of_cells == 1) 
jost <- subset(jost, subset = guide_identity != '*')  
jost <- jost[,!is.na(jost$guide_identity)]

jost$gene <- str_extract(jost$guide_identity, "^[^_]+")
jost$gRNA_id <- sapply(jost$guide_identity, function(x) str_sub(x, -2))

Here, every gene was targeted by 5 different guide RNAs, where the one ending with “00” indicates a perfectly matched sequence, and the rest have titrated strength due to nucleotide mismatches. We split this object into two: one containing only the perfectly matched sequences, and one containing the rest. We also arbitrarily distribute the non-targeting guide RNAs among them.

jost_match <- jost[,jost$gRNA_id == '00' |
                     (jost$gene == 'neg' & 
                        jost$gRNA_id %in% c('01','02','06','17','27'))]
jost_mismatch <- jost[,(jost$gRNA_id != '00' & jost$gene != 'neg') |
                     (jost$gene == 'neg' & 
                        jost$gRNA_id %in% c('28','40','54','83','89'))]

Finally, we normalize each resulting object with SCTransform; importantly, we specify return.only.var.genes = FALSE.

jost_match <- SCTransform(jost_match, return.only.var.genes = FALSE)
jost_mismatch <- SCTransform(jost_mismatch, return.only.var.genes = FALSE)

Learning fingerprints

We will learn fingerprints corresponding to each perturbation in jost_match. To start, we need to specify the features that should be used for these fingerprints. We recommend using the ChooseFeatures function to automatically select features, but a custom feature set can also be passed in. We provide two options for ChooseFeatures:

method = de: This option performs differential expression for each unique perturbation against the control cells, then combines those gene sets together. We typically recommend this option.
method = expression: This option selects all genes whose average expression across all cells is above some threshold. This should be used if there are too many unique perturbations to feasibly run differential expression for each one.

Here, we use the de option. Note that we specify the metadata column where the perturbation IDs are stored (gene), and we provide the name of the control class (neg). We can further subset the features to ensure that they are all present in our query data jost_mismatch; while a degree of feature mismatch between the fingerprints and the query data is allowable, we encourage using consistent feature sets where possible.

features <- ChooseFeatures(jost_match, method = 'de', 
                           perturbation_meta = 'gene', control_class = 'neg')
features <- intersect(features, 
                      rownames(GetAssayData(jost_mismatch, assay = 'SCT', 
                                            layer = 'scale.data')))

Now we can learn our dictionary of fingerprints. Note that this step can be parallelized by setting mc.cores, which is just 1 by default.

dictionary <- LearnDictionary(jost_match, perturbation_meta = 'gene', 
                                control_class = 'neg', features = features)
#> Processing data...
#> Learning fingerprints...
#> Learned 25 of 25 possible perturbations; to learn all possible perturbations, even if low-signal, set override_filter=TRUE
#> Perturbations kept:  ALDOA, ATP5E, BCR, CAD, CDC23, COX11, DBR1, DUT, EIF2S1, GATA1, GINS1, GNB2L1, HSPA5, HSPA9, HSPE1, MTOR, POLR1D, POLR2H, RAN, RPL9, RPS14, RPS15, RPS18, SEC61A1, TUBB
#> Perturbations filtered out:

We use built-in filtering steps to exclude any low-signal fingerprints that we are unlikely to be able to classify, which can be overriden by setting override_filter = TRUE.

We can visualize relationships among perturbations by looking at the correlation of their fingerprints.

pheatmap(cor(dictionary$fingerprints$Lambdas), treeheight_row = 0, treeheight_col = 0)

Mapping cells at the group-level (perturbation label)

Next, we can map cells from jost_mismatch to this dictionary of fingerprints. We first demonstrate this at the group-level, where cells of the same condition can be assigned together. Here, we map cells according to their perturbation labels, and obtain a dataframe summarizing these assignments.

jost_mismatch <- FingerprintCells(jost_mismatch, condition_meta = 'gene', 
                                  control_class = 'neg', dictionary = dictionary,
                                  group_level = TRUE)
#> Processing data...
#> Preparing...
#> Classifying...
group_level_summary <- SummarizeResults(jost_mismatch, group_level = TRUE,
                                         condition_meta = 'gene', control_class = 'neg')
print(group_level_summary[1:5,])
#>         top_credible_set.gene top_lbf.gene all_credible_sets.gene all_lbfs.gene
#> CAD                       CAD      4210.96                    CAD       4210.96
#> SEC61A1               SEC61A1      3077.94                SEC61A1       3077.94
#> RPL9                     RPL9      3004.63                   RPL9       3004.63
#> HSPA5                   HSPA5      2872.76                  HSPA5       2872.76
#> DBR1                     DBR1      2833.08                   DBR1       2833.08
#>         all_signs.gene
#> CAD                  +
#> SEC61A1              +
#> RPL9                 +
#> HSPA5                +
#> DBR1                 +

The columns in this dataframe can be interpreted as follows:

top_credible_set: The top identified effect. If there is uncertainty among multiple perturbations, they are separated by commas. If no assignment is made, then unassigned is returned.
top_lbf: The log Bayes factor associated with the top credible set. Higher values indicate greater certainty. If no assignment is made, then this is set to NA.
all_credible_sets: If more than one credible set is found, all credible sets separated by semicolons.
all_lbfs: All log Bayes factors separated by semicolons.
all_signs: The signs of the matches made for each perturbation in each credible set. Positive signs indicate a match in the same directionality as the perturbation. Negative signs should generally be interpreted with care.

These values are also added as metadata to the Seurat object, with each cell containing the information associated with its condition label. Note that these metadata columns are appended with the condition metadata suffix by default, if classification is done at the group-level. A user-specified suffix can also be passed in using the suffix parameter.

By default, these summaries are ranked by the Bayes factor of the top credible set, with higher Bayes factor assignments indicating greater confidence. Here, we find a high degree of accuracy in the assignments, meaning that the right perturbation is in the top credible set.

Optionally, this summary can also be put into tidy format, where each individual credible set is described in its own row. Here, the resulting dataframe is not much different, because only one credible set happened to be returned for each group.

tidy_summary <- SummarizeResults(jost_mismatch, group_level = TRUE, 
                                 condition_meta = 'gene', control_class = 'neg', 
                                 tidy = TRUE)
print(tidy_summary[1:10,])
#>      query           cs     lbf sign
#> 1      CAD          CAD 4210.96    +
#> 2  SEC61A1      SEC61A1 3077.94    +
#> 3     RPL9         RPL9 3004.63    +
#> 4    HSPA5        HSPA5 2872.76    +
#> 5     DBR1         DBR1 2833.08    +
#> 6   EIF2S1 EIF2S1,HSPA9 2517.88  +,+
#> 7   POLR2H       POLR2H 2162.38    +
#> 8    HSPE1        HSPE1  2127.7    +
#> 9      DUT          DUT  2122.6    +
#> 10  POLR1D       POLR1D 1950.08    +

We can visualize individual assignments for any group using what we call Long Island City plots, which are inspired by Manhattan plots in genome-wide association-studies. This displays the largest coefficients, in magnitude, for each perturbation conditional on inclusion in the model, and sorts the perturbations via hierarchical clustering (default) or alphabetically. Each identified credible set is indicated with a different color. Optionally, a faster version can be created by setting approx = TRUE to show marginal correlations on the y-axis. Here, we visualize two examples of our matches.

LongIslandCityPlot(jost_mismatch, query = 'CAD', condition_meta = 'gene',
                   control_class = 'neg', dictionary = dictionary,
                   group_level = T, cs = 10)
#> This can be slow in large settings; you can speed up this function with approx = TRUE.

LongIslandCityPlot(jost_mismatch, query = 'EIF2S1', condition_meta = 'gene',
                   control_class = 'neg', dictionary = dictionary,
                   group_level = T, cs = 10)

For any given match, we can identify what genes were most responsible for making that assignment using the ExplainMatch function. This is done by testing each gene to see which genes, when removed, have the biggest impact on the Bayes factor. Here, we can identify and print out the top 25 genes that drove the assignment of CAD-perturbed cells to CAD by setting plot = FALSE:

ExplainMatch(jost_mismatch, query = 'CAD', match = 'CAD', condition_meta = 'gene',
             control_class = 'neg', dictionary = dictionary,
             group_level = T, num_genes = 25, plot = FALSE)
#>                 genes  delta_BF
#> AC016629.3 AC016629.3 -84.67232
#> MT-CYB         MT-CYB -78.95598
#> RPLP1           RPLP1 -65.54225
#> RPS4X           RPS4X -63.03516
#> RPS12           RPS12 -62.33166
#> MT-CO1         MT-CO1 -60.31671
#> BCYRN1         BCYRN1 -57.69514
#> PVT1             PVT1 -53.88305
#> FTL               FTL -51.96460
#> UBB               UBB -49.56154
#> TXNIP           TXNIP -49.06386
#> APOC1           APOC1 -49.04864
#> RPL41           RPL41 -46.24781
#> MYL6             MYL6 -43.58336
#> SOD1             SOD1 -42.09382
#> UROD             UROD -41.72311
#> RPLP0           RPLP0 -35.76947
#> RPL18A         RPL18A -35.30757
#> MT-CO3         MT-CO3 -33.91766
#> DYNLL1         DYNLL1 -33.07042
#> SNAPC5         SNAPC5 -30.83733
#> RPL28           RPL28 -29.46071
#> NEAT1           NEAT1 -29.32222
#> GSTP1           GSTP1 -29.26513
#> RPL26           RPL26 -27.74860

We can also visualize these top genes in both the query data (jost_mismatch) and the reference data (jost_match) by using plot = TRUE (which is the default setting) and specifying details about the reference object:

ExplainMatch(jost_mismatch, query = 'CAD', match = 'CAD', condition_meta = 'gene',
             control_class = 'neg', dictionary = dictionary, 
             group_level = T, ref = jost_match, perturbation_meta = 'gene', 
             control_class_ref = 'neg', num_genes = 25)

By specifying the include_de parameter, we can additionally show the top differentially expressed genes in both the query and the reference data below these top driver genes:

ExplainMatch(jost_mismatch, query = 'CAD', match = 'CAD', condition_meta = 'gene',
             control_class = 'neg', dictionary = dictionary, 
             group_level = T, ref = jost_match, perturbation_meta = 'gene', 
             control_class_ref = 'neg', num_genes = 25, include_de = 10)

Mapping cells at the group-level (guide identity)

As an additional example, here we again map cells at the group-level, but this time using their guide identities. Because we want to use all the non-targeting cells as the “control” cells, regardless of the specific non-targeting guide, we create a new metadata column that contains the guide IDs for all the perturbed cells, but just “neg” for each non-targeting cell.

jost_mismatch$guide_identity2 <- jost_mismatch$guide_identity
jost_mismatch$guide_identity2[jost_mismatch$gene == 'neg'] <- 'neg'

Now we can map, specifying the metadata values appropriately:

jost_mismatch <- FingerprintCells(jost_mismatch, condition_meta = 'guide_identity2', 
                                  control_class = 'neg', dictionary = dictionary,
                                  group_level = TRUE)
#> Processing data...
#> Preparing...
#> Classifying...
group_level_summary <- SummarizeResults(jost_mismatch, group_level = TRUE,
                                         condition_meta = 'guide_identity2', 
                                         control_class = 'neg')
print(group_level_summary[1:5,])
#>                                    top_credible_set.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08                            HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_01                            HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_04                            HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_06                            HSPA5
#> CAD_CAD_+_27440280.23-P1P2_03                                   CAD
#>                                    top_lbf.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08                11389.69
#> HSPA5_HSPA5_+_128003624.23-P1P2_01                10790.87
#> HSPA5_HSPA5_+_128003624.23-P1P2_04                10681.25
#> HSPA5_HSPA5_+_128003624.23-P1P2_06                 4996.05
#> CAD_CAD_+_27440280.23-P1P2_03                      3972.76
#>                                    all_credible_sets.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08                             HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_01                             HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_04                             HSPA5
#> HSPA5_HSPA5_+_128003624.23-P1P2_06                             HSPA5
#> CAD_CAD_+_27440280.23-P1P2_03                                    CAD
#>                                    all_lbfs.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08                 11389.69
#> HSPA5_HSPA5_+_128003624.23-P1P2_01                 10790.87
#> HSPA5_HSPA5_+_128003624.23-P1P2_04                 10681.25
#> HSPA5_HSPA5_+_128003624.23-P1P2_06                  4996.05
#> CAD_CAD_+_27440280.23-P1P2_03                       3972.76
#>                                    all_signs.guide_identity2
#> HSPA5_HSPA5_+_128003624.23-P1P2_08                         +
#> HSPA5_HSPA5_+_128003624.23-P1P2_01                         +
#> HSPA5_HSPA5_+_128003624.23-P1P2_04                         +
#> HSPA5_HSPA5_+_128003624.23-P1P2_06                         +
#> CAD_CAD_+_27440280.23-P1P2_03                              +

In most cases, we still find the expected match for each guide, albeit sometimes with more uncertainty. Among our lowest certainty calls, we do see some mistakes, and occasionally more than one credible set is used. There are also some guides which are left unassigned, indicating that the signal differed enough that we were unable to make a match.

For one of our incorrect assignments, the ExplainMatch function helps us see that the match is imperfect, which demonstrates how this visualization can be used to assess whether an assignment makes sense. While the top driver genes look consistent, the top DE genes from the reference are not replicated in the query.

ExplainMatch(jost_mismatch, query = 'MTOR_MTOR_+_11322547.23-P1P2_10', 
             match = 'GINS1', condition_meta = 'guide_identity2',
             control_class = 'neg', dictionary = dictionary, 
             group_level = T, ref = jost_match, perturbation_meta = 'gene', 
             control_class_ref = 'neg', num_genes = 25, include_de = 10)

For some of the guides that don’t match, we can look at top DEGs from the reference to understand why:

gata1_degs <- FindMarkers(jost_match, ident.1 = 'GATA1', ident.2 = 'neg',
                          group.by = 'gene')
gata1_degs <- rownames(gata1_degs)[1:25][order(gata1_degs$avg_log2FC[1:25])]
DoHeatmap(jost_mismatch, gata1_degs, group.by = 'guide_identity2', label = F,
          cells = colnames(jost_mismatch)[jost_mismatch$gene %in% c('neg','GATA1')])

Here, we see that the guide identities in the query that did match indeed show a strong phenotype, whereas the one that didn’t match (GATA1_GATA1_-_48645022.23-P1P2_12) does not show the same gene expression pattern.

Mapping cells individually

We can also map cells on an individual basis by setting group_level = FALSE. This will be slower than group-level classification, and we can speed it up by using parallelization with mc.cores. For example, here, we use 5 cores.

jost_mismatch <- FingerprintCells(jost_mismatch, condition_meta = 'gene', 
                                  control_class = 'neg', dictionary = dictionary,
                                  group_level = FALSE, mc.cores = 5)
#> Processing data...
#> Preparing...
#> Classifying...
cell_level_summary <- SummarizeResults(jost_mismatch, group_level = FALSE,
                                       condition_meta = 'gene', control_class = 'neg')
print(head(cell_level_summary))
#>                     gene top_credible_set top_lbf all_credible_sets all_lbfs
#> CAGAATCCATTAACCG-2 HSPA5            HSPA5 2955.00             HSPA5     2955
#> CATTATCTCCGCATAA-2 HSPA5            HSPA5 2795.29             HSPA5  2795.29
#> CTCTACGGTCAACATC-2 HSPA5            HSPA5 2760.40             HSPA5   2760.4
#> TCAATCTAGCGTCTAT-2 HSPA5            HSPA5 2611.56             HSPA5  2611.56
#> CACCTTGGTTACGTCA-2 HSPA5            HSPA5 2572.06             HSPA5  2572.06
#> GGAACTTAGGAACTGC-2 HSPA5            HSPA5 2523.96             HSPA5  2523.96
#>                    all_signs
#> CAGAATCCATTAACCG-2         +
#> CATTATCTCCGCATAA-2         +
#> CTCTACGGTCAACATC-2         +
#> TCAATCTAGCGTCTAT-2         +
#> CACCTTGGTTACGTCA-2         +
#> GGAACTTAGGAACTGC-2         +

This time, the metadata added to the Seurat object reflects the assignments made for each individual cell. We can confirm that many of these top assignments match the true perturbation labels.

Group-level and cell-level assignments represent a trade-off, where group-level assignments increase power to detect even very subtle signals, but cell-level assignments allow us to better characterize heterogeneity. For example, here we can visualize HSPA5 markers in all the cells that truly received an HSPA5 guide split by their top assigned label.

hspa5_markers <- FindMarkers(jost_match, ident.1 = 'HSPA5', ident.2 = 'neg', 
                             group.by = 'gene')
hspa5_markers <- rownames(hspa5_markers)[1:50][order(hspa5_markers$avg_log2FC[1:50])]
DoHeatmap(jost_mismatch, hspa5_markers, 
          cells = colnames(jost_mismatch)[jost_mismatch$gene == 'HSPA5'],
          group.by = 'top_credible_set', size = 3, label = F)

We can see that the majority of these cells either received the specific label HSPA5, or were left unassigned. When comparing these two groups, we see that the cells labeled as HSPA5 strongly show the associated signal, whereas the unassigned cells appear to have escaped the perturbation.

Finally, we can produce a 2D visualization of all assigned cells by using the ProjectFingerprints function. This projects each assigned cell onto the fingerprints, which is stored in a proj assay, and then runs UMAP on those values, which is stored as the projUMAP reduction. Here, we can label each cell by its true perturbation:

jost_mismatch <- ProjectFingerprints(jost_mismatch, 'gene', 'neg', dictionary)
DimPlot(jost_mismatch, group.by = 'gene', reduction = 'projUMAP', label = T)

We can see that cells generally separate based on their true perturbation labels, with very similar perturbation classes grouping together.