jost-gwps-example

Set-up

In this vignette, we demonstrate RNA fingerprinting using the pre-computed fingerprints for Genome-Wide Perturb-Seq (GWPS) on a Perturb-seq dataset that can be assessed on GEO under accession number GSE132080. Note that the RNAFingerprintingData package needs to be installed to access the GWPS data. We can begin by loading in the appropriate packages and setting up the data.

library(RNAFingerprinting)
library(RNAFingerprintingData)
library(Seurat)
library(stringr)
options(future.globals.maxSize = 10 * 1024^3)

counts <- ReadMtx(mtx = "/brahms/grabskii/jost_data/GSE132080_10X_matrix.mtx.gz",
                  cells = "/brahms/grabskii/jost_data/GSE132080_10X_barcodes.tsv.gz", 
                  features = "/brahms/grabskii/jost_data/GSE132080_10X_genes.tsv.gz")
meta_data <- read.csv("/brahms/grabskii/jost_data/GSE132080_cell_identities.csv.gz")
rownames(meta_data) <- meta_data$cell_barcode
jost <- CreateSeuratObject(counts = counts, meta.data = meta_data)
jost <- subset(jost, subset = number_of_cells == 1) 
jost <- subset(jost, subset = guide_identity != '*')  
jost <- jost[,!is.na(jost$guide_identity)]
jost$gene <- str_extract(jost$guide_identity, "^[^_]+")

We normalize the query data with SCTransform; importantly, we specify return.only.var.genes = FALSE.

jost <- SCTransform(jost, return.only.var.genes = FALSE)

Now we can install the precomputed GWPS dictionary. Note that all available dictionaries can be viewed using AvailableDictionaries().

InstallPrecomputedDictionary(name = 'gwps')
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#>   
✔  checking for file ‘/tmp/RtmpGa6BhN/remotes28d6ca92d7cf/gwps/DESCRIPTION’ (1.4s)
#> 
  
─  preparing ‘gwps’: (949ms)
#>    checking DESCRIPTION meta-information ...
  
✔  checking DESCRIPTION meta-information
#> 
  
─  checking for LF line-endings in source and make files and shell scripts
#> 
  
─  checking for empty or unneeded directories
#> ─  looking to see if a ‘data/datalist’ file should be added
#> 
  
─  building ‘gwps_0.0.0.9000.tar.gz’ (708ms)
#> 
  
   Warning: invalid uid value replaced by that for user 'nobody'
#>    Warning: invalid gid value replaced by that for user 'nobody'
#>    
#>

Next, we can load in this dictionary using the convenience loader function LoadPrecomputedDictionary. The GWPS fingerprints were learned with genes in the ENSEMBL ID format, whereas our query data uses gene symbols. To make these compatible, we specify gene_key = 'SYMBOL' when loading it in. More generally, the ConvertGenes function can be used to update any set of fingerprints from one gene key type to another.

gwps_dictionary <- LoadPrecomputedDictionary(name = 'gwps', gene_key = 'SYMBOL')

As a note, there is likely to be at least some degree of feature mismatch between these GWPS fingerprints and a given query dataset. While this is allowable and unlikely to yield substantive differences in results, this represents an approximation of the procedure and is a trade-off for the convenience of using these pre-computed fingerprints.

Mapping cells at the group-level

We can start by trying to map cells from the query data to GWPS at the group-level, where cells of the same condition can be assigned together. Here, we map cells according to their perturbation labels, and obtain a dataframe summarizing these assignments. This can be slow due to the size of GWPS, but faster results can be obtained using parallelization with the mc.cores parameter.

jost <- FingerprintCells(jost,
                         condition_meta = 'gene', 
                         control_class = 'neg',
                         dictionary = gwps_dictionary, 
                         group_level = TRUE)
#> The following 541 features are not present in the SCT assay:LINC01128, INTS11, MRPL20-AS1, FNDC10, CENPS, SPEN-AS1, MICOS10, ELOA-AS1, ELOA, MACO1, ATP5IF1, SNHG3, YARS1, TMEM35B, ADPRS, UTP11, TUT4, CZIB, IFT25, MIGA1, SELENOF, KYAT3, LINC02609, DPH5-DT, EEIG2, SARS1, ATP5PB, LINC02802, LINC02805, H2AC20, PRUNE1, GBA1, ENTREP3, KHDC4, MIR9-1HG, NAXE, METTL25B, FIRRM, GAS5, COP1, ODR4, RO60, UTP25, PACC1, FLVCR1-DT, EPRS1, H3-3A, COQ8A, H2AC25, RAB4A-AS1...
#> Processing data...
#> Preparing...
#> Classifying...
group_level_summary <- SummarizeResults(jost, group_level = TRUE,
                                         condition_meta = 'gene', 
                                         control_class = 'neg')
print(group_level_summary[1:5,])
#>                                                                                                                                                                                                                                  top_credible_set.gene
#> RPL9                                                                                                                    SPATA5L1,URB2,ABCF1,NLE1,SPATA5,RPL3,CENPP,HERC1,EIF6,SURF6,DDX51,TMA16,CINP,RSL1D1,MYBBP1A,CARF,UBE2E1,KRR1,NOC3L,FBXL14,RPL9
#> GATA1                                                                                                                                                                                                                                     HMBOX1,GATA1
#> POLR1D                                                                                                                                                                                   TAF1C,RPL3,TAF1D,GTF3C6,POLR1E,POLR1C,CD3EAP,RPL26,GNL2,RPL37
#> POLR2H                                                                                                                                                                                             UXT,NUMA1,C1QBP,PFDN2,MED26,RPAP2,POLR2B,ERCC2,GPN3
#> RPS14  KRR1,WDR3,FBXL14,AATF,RIOK1,RPS12,RIOK2,DDX52,DIMT1,DDX10,ZNF84,FCF1,RPS3,WDR36,NOP10,RPS14,RPS2,RPS19,DDX47,TSR2,RPP30,POP4,RRP9,POP1,ZNHIT6,USP36,UTP20,RPS3A,DNTTIP2,RPS20,RPS15A,ABT1,IMPA2,RPP14,ZCCHC9,ESF1,BUD23,LAMB1,RPS10-NUDT3,RRP12
#>        top_lbf.gene
#> RPL9        2882.97
#> GATA1       2762.02
#> POLR1D      2093.92
#> POLR2H      2056.41
#> RPS14       1900.83
#>                                                                                                                                                                                                                                 all_credible_sets.gene
#> RPL9                                                                                                                    SPATA5L1,URB2,ABCF1,NLE1,SPATA5,RPL3,CENPP,HERC1,EIF6,SURF6,DDX51,TMA16,CINP,RSL1D1,MYBBP1A,CARF,UBE2E1,KRR1,NOC3L,FBXL14,RPL9
#> GATA1                                                                                                                                                                                                                                     HMBOX1,GATA1
#> POLR1D                                                                                                                                                                                   TAF1C,RPL3,TAF1D,GTF3C6,POLR1E,POLR1C,CD3EAP,RPL26,GNL2,RPL37
#> POLR2H                                                                                                                                                                                             UXT,NUMA1,C1QBP,PFDN2,MED26,RPAP2,POLR2B,ERCC2,GPN3
#> RPS14  KRR1,WDR3,FBXL14,AATF,RIOK1,RPS12,RIOK2,DDX52,DIMT1,DDX10,ZNF84,FCF1,RPS3,WDR36,NOP10,RPS14,RPS2,RPS19,DDX47,TSR2,RPP30,POP4,RRP9,POP1,ZNHIT6,USP36,UTP20,RPS3A,DNTTIP2,RPS20,RPS15A,ABT1,IMPA2,RPP14,ZCCHC9,ESF1,BUD23,LAMB1,RPS10-NUDT3,RRP12
#>        all_lbfs.gene
#> RPL9         2882.97
#> GATA1        2762.02
#> POLR1D       2093.92
#> POLR2H       2056.41
#> RPS14        1900.83
#>                                                                         all_signs.gene
#> RPL9                                         +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> GATA1                                                                              +,+
#> POLR1D                                                             +,+,+,+,+,+,+,+,+,+
#> POLR2H                                                               +,+,+,+,+,+,+,+,+
#> RPS14  +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+

Some of these assignments involve very long credible sets, which makes sense given that many of these perturbations are of genes belonging to large complexes. We can visualize these assignments using Long Island City plots. Below, we should an example of an assignment with a long credible set, and one with a short credible set. We could alternatively sort the fingerprints in alphabetical order by setting the order argument appropriately.

LongIslandCityPlot(jost, query = 'RPS14', condition_meta = 'gene',
                   control_class = 'neg', dictionary = gwps_dictionary,
                   group_level = T, cs = 10)
LongIslandCityPlot(jost, query = 'BCR', condition_meta = 'gene',
                   control_class = 'neg', dictionary = gwps_dictionary,
                   group_level = T, cs = 10)

These plots show the conditional posterior mean coefficients for each fingerprint on the y-axis. However, in large-scale settings, this can be slow to compute. For a faster version, we can specify the argument approx = TRUE to instead show marginal correlations on the y-axis, e.g.:

LongIslandCityPlot(jost, query = 'RPS14', condition_meta = 'gene',
                   control_class = 'neg', dictionary = gwps_dictionary,
                   group_level = T, cs = 10, approx = TRUE)

LongIslandCityPlot(jost, query = 'BCR', condition_meta = 'gene',
                   control_class = 'neg', dictionary = gwps_dictionary,
                   group_level = T, cs = 10, approx = TRUE)

To understand what genes were most responsible for driving these matches, we can use the ExplainMatch function. Here, the visualizations are automatically made with the stored pseudobulk GWPS object inside the precomputed dictionary, since we don’t have the full GWPS object loaded in.

ExplainMatch(jost, query = 'BCR', match = 'BCR', 
             condition_meta = 'gene', control_class = 'neg', 
             dictionary = gwps_dictionary, group_level = T, 
             num_genes = 25)

Mapping cells individually

We can also map cells on an individual basis. This will be slower than group-level classification, so here we speed it up by using parallelization with mc.cores.

jost <- FingerprintCells(jost, condition_meta = 'gene', control_class = 'neg', 
                         dictionary = gwps_dictionary, 
                         group_level = FALSE, mc.cores = 10)
#> The following 541 features are not present in the SCT assay:LINC01128, INTS11, MRPL20-AS1, FNDC10, CENPS, SPEN-AS1, MICOS10, ELOA-AS1, ELOA, MACO1, ATP5IF1, SNHG3, YARS1, TMEM35B, ADPRS, UTP11, TUT4, CZIB, IFT25, MIGA1, SELENOF, KYAT3, LINC02609, DPH5-DT, EEIG2, SARS1, ATP5PB, LINC02802, LINC02805, H2AC20, PRUNE1, GBA1, ENTREP3, KHDC4, MIR9-1HG, NAXE, METTL25B, FIRRM, GAS5, COP1, ODR4, RO60, UTP25, PACC1, FLVCR1-DT, EPRS1, H3-3A, COQ8A, H2AC25, RAB4A-AS1...
#> Processing data...
#> Preparing...
#> Classifying...
cell_level_summary <- SummarizeResults(jost, group_level = FALSE,
                                       condition_meta = 'gene', 
                                       control_class = 'neg')
print(head(cell_level_summary))
#>                    gene
#> GCAGTTACAAGCTGGA-2 RPL9
#> CAGTAACAGAGGGCTT-2 RPL9
#> GGGTTGCCACGTCTCT-2 RPL9
#> GACACGCTCAGCTCTC-2 RPL9
#> TGTTCCGTCGTTGCCT-2 RPL9
#> ATCACGACAGGTCGTC-2 RPL9
#>                                                                                                                                                                                                                                                                             top_credible_set
#> GCAGTTACAAGCTGGA-2                                                                                                                                             ABCF1,RPL30,RPL19,RPL10,RPL8,RPL21,RPL14,RPL24,RPL9,URB2,EIF6,DDX51,NLE1,TEX10,SPATA5L1,MYBBP1A,CINP,RPL37,PPAN,RPL10A,HEATR1
#> CAGTAACAGAGGGCTT-2     ABCF1,NLE1,DDX24,EIF6,RPL8,RPL5,RPL36,RPL24,RPL30,SPATA5L1,CINP,URB2,CENPP,NIFK,SDAD1,GNL2,FBL,RPS7,TSR2,RPS3,RPS4X,KRR1,UTP20,RPS15A,EIF3A,RPS6,RPS24,TTF1,GLIS1,PES1,MYBBP1A,SOX3,WDR12,RPL18,STK24,NOC3L,CARF,DDX51,RRS1,DDX54,NOP16,MAK16,RPL34,RPL7,FBXL14,RPL21
#> GGGTTGCCACGTCTCT-2                                                                                                        NLE1,DDX51,ABCF1,MYBBP1A,EIF6,RRS1,RPL37,RPL30,RPL10,RPL24,RPL8,GNL2,RPL17,SOX3,DDX54,CCDC86,SDAD1,NOC3L,RPL9,RPL36,RPL18,NSA2,SPATA5L1,NOP2,RPL19,RPL14,RPL7,CINP
#> GACACGCTCAGCTCTC-2                                                                                    URB2,SPATA5L1,ABCF1,DDX55,LTV1,HERC1,SDAD1,RPS19,RPS10-NUDT3,NOC3L,TEX10,TMA16,NOLC1,HDAC6,RPL3,SURF6,RSL1D1,EIF6,CINP,DDX52,TSR2,KRR1,RCL1,HIGD2A,FCF1,TPP2,EIF3C,DNTTIP2,RPS12,MAK16
#> TGTTCCGTCGTTGCCT-2 URB2,NLE1,GNL2,ABCF1,RPL24,NOL9,SPATA5L1,KIF18B,CINP,ZSCAN2,RPL18,NR2F2,DDX56,RPS20,EIF5B,GLIS1,RPL21,SOX3,CEBPZ,DDX24,NOL8,RBM28,NIFK,RRS1,MDN1,UBE2E1,GTPBP4,RPL27,NOC3L,LAS1L,WDR55,TEX10,RPL18A,LTV1,PES1,RSL24D1,RPL11,RPL13,SURF6,RPL23A,TPP2,SPATA5,DDX54,NVL,NMD3
#> ATCACGACAGGTCGTC-2                                                                                              SDAD1,NLE1,RPL37,EIF6,GNL2,RRS1,RPL24,URB2,RPL30,RPL5,RPL10,RPL17,RPL13,RPL38,RPL31,RPLP0,GLIS1,DDX56,RPL11,DDX24,ABCF1,NOP16,MAK16,SPATA5L1,DDX51,CINP,EBNA1BP2,RPL18,WDR12
#>                    top_lbf
#> GCAGTTACAAGCTGGA-2 1077.60
#> CAGTAACAGAGGGCTT-2  594.60
#> GGGTTGCCACGTCTCT-2  491.14
#> GACACGCTCAGCTCTC-2  491.08
#> TGTTCCGTCGTTGCCT-2  475.88
#> ATCACGACAGGTCGTC-2  474.95
#>                                                                                                                                                                                                                                                                            all_credible_sets
#> GCAGTTACAAGCTGGA-2                                                                                                                                             ABCF1,RPL30,RPL19,RPL10,RPL8,RPL21,RPL14,RPL24,RPL9,URB2,EIF6,DDX51,NLE1,TEX10,SPATA5L1,MYBBP1A,CINP,RPL37,PPAN,RPL10A,HEATR1
#> CAGTAACAGAGGGCTT-2     ABCF1,NLE1,DDX24,EIF6,RPL8,RPL5,RPL36,RPL24,RPL30,SPATA5L1,CINP,URB2,CENPP,NIFK,SDAD1,GNL2,FBL,RPS7,TSR2,RPS3,RPS4X,KRR1,UTP20,RPS15A,EIF3A,RPS6,RPS24,TTF1,GLIS1,PES1,MYBBP1A,SOX3,WDR12,RPL18,STK24,NOC3L,CARF,DDX51,RRS1,DDX54,NOP16,MAK16,RPL34,RPL7,FBXL14,RPL21
#> GGGTTGCCACGTCTCT-2                                                                                                        NLE1,DDX51,ABCF1,MYBBP1A,EIF6,RRS1,RPL37,RPL30,RPL10,RPL24,RPL8,GNL2,RPL17,SOX3,DDX54,CCDC86,SDAD1,NOC3L,RPL9,RPL36,RPL18,NSA2,SPATA5L1,NOP2,RPL19,RPL14,RPL7,CINP
#> GACACGCTCAGCTCTC-2                                                                                    URB2,SPATA5L1,ABCF1,DDX55,LTV1,HERC1,SDAD1,RPS19,RPS10-NUDT3,NOC3L,TEX10,TMA16,NOLC1,HDAC6,RPL3,SURF6,RSL1D1,EIF6,CINP,DDX52,TSR2,KRR1,RCL1,HIGD2A,FCF1,TPP2,EIF3C,DNTTIP2,RPS12,MAK16
#> TGTTCCGTCGTTGCCT-2 URB2,NLE1,GNL2,ABCF1,RPL24,NOL9,SPATA5L1,KIF18B,CINP,ZSCAN2,RPL18,NR2F2,DDX56,RPS20,EIF5B,GLIS1,RPL21,SOX3,CEBPZ,DDX24,NOL8,RBM28,NIFK,RRS1,MDN1,UBE2E1,GTPBP4,RPL27,NOC3L,LAS1L,WDR55,TEX10,RPL18A,LTV1,PES1,RSL24D1,RPL11,RPL13,SURF6,RPL23A,TPP2,SPATA5,DDX54,NVL,NMD3
#> ATCACGACAGGTCGTC-2                                                                                              SDAD1,NLE1,RPL37,EIF6,GNL2,RRS1,RPL24,URB2,RPL30,RPL5,RPL10,RPL17,RPL13,RPL38,RPL31,RPLP0,GLIS1,DDX56,RPL11,DDX24,ABCF1,NOP16,MAK16,SPATA5L1,DDX51,CINP,EBNA1BP2,RPL18,WDR12
#>                    all_lbfs
#> GCAGTTACAAGCTGGA-2   1077.6
#> CAGTAACAGAGGGCTT-2    594.6
#> GGGTTGCCACGTCTCT-2   491.14
#> GACACGCTCAGCTCTC-2   491.08
#> TGTTCCGTCGTTGCCT-2   475.88
#> ATCACGACAGGTCGTC-2   474.95
#>                                                                                                      all_signs
#> GCAGTTACAAGCTGGA-2                                                   +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> CAGTAACAGAGGGCTT-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> GGGTTGCCACGTCTCT-2                                     +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> GACACGCTCAGCTCTC-2                                 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> TGTTCCGTCGTTGCCT-2   +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> ATCACGACAGGTCGTC-2                                   +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+

We can visualize these assignments by using the ProjectFingerprints function to create a UMAP of the projections of each assigned cell onto the fingerprints. Below, we overlay the plot with each cell’s true perturbation label:

jost <- ProjectFingerprints(jost, 'gene', 
                            'neg', gwps_dictionary)
DimPlot(jost, group.by = 'gene', 
        reduction = 'projUMAP', label = T)