In this vignette, we demonstrate RNA fingerprinting using the
pre-computed fingerprints for Genome-Wide Perturb-Seq (GWPS) on a
Perturb-seq dataset that can be assessed on GEO under accession number
GSE132080. Note that the RNAFingerprintingData package
needs to be installed to access the GWPS data. We can begin by loading
in the appropriate packages and setting up the data.
library(RNAFingerprinting)
library(RNAFingerprintingData)
library(Seurat)
library(stringr)
options(future.globals.maxSize = 10 * 1024^3)
counts <- ReadMtx(mtx = "/brahms/grabskii/jost_data/GSE132080_10X_matrix.mtx.gz",
cells = "/brahms/grabskii/jost_data/GSE132080_10X_barcodes.tsv.gz",
features = "/brahms/grabskii/jost_data/GSE132080_10X_genes.tsv.gz")
meta_data <- read.csv("/brahms/grabskii/jost_data/GSE132080_cell_identities.csv.gz")
rownames(meta_data) <- meta_data$cell_barcode
jost <- CreateSeuratObject(counts = counts, meta.data = meta_data)
jost <- subset(jost, subset = number_of_cells == 1)
jost <- subset(jost, subset = guide_identity != '*')
jost <- jost[,!is.na(jost$guide_identity)]
jost$gene <- str_extract(jost$guide_identity, "^[^_]+")We normalize the query data with SCTransform;
importantly, we specify return.only.var.genes = FALSE.
Now we can install the precomputed GWPS dictionary. Note that all
available dictionaries can be viewed using
AvailableDictionaries().
InstallPrecomputedDictionary(name = 'gwps')
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#>
✔ checking for file ‘/tmp/RtmpGa6BhN/remotes28d6ca92d7cf/gwps/DESCRIPTION’ (1.4s)
#>
─ preparing ‘gwps’: (949ms)
#> checking DESCRIPTION meta-information ...
✔ checking DESCRIPTION meta-information
#>
─ checking for LF line-endings in source and make files and shell scripts
#>
─ checking for empty or unneeded directories
#> ─ looking to see if a ‘data/datalist’ file should be added
#>
─ building ‘gwps_0.0.0.9000.tar.gz’ (708ms)
#>
Warning: invalid uid value replaced by that for user 'nobody'
#> Warning: invalid gid value replaced by that for user 'nobody'
#>
#> Next, we can load in this dictionary using the convenience loader
function LoadPrecomputedDictionary. The GWPS fingerprints
were learned with genes in the ENSEMBL ID format, whereas our query data
uses gene symbols. To make these compatible, we specify
gene_key = 'SYMBOL' when loading it in. More generally, the
ConvertGenes function can be used to update any set of
fingerprints from one gene key type to another.
As a note, there is likely to be at least some degree of feature mismatch between these GWPS fingerprints and a given query dataset. While this is allowable and unlikely to yield substantive differences in results, this represents an approximation of the procedure and is a trade-off for the convenience of using these pre-computed fingerprints.
We can start by trying to map cells from the query data to GWPS at
the group-level, where cells of the same condition can be assigned
together. Here, we map cells according to their perturbation labels, and
obtain a dataframe summarizing these assignments. This can be slow due
to the size of GWPS, but faster results can be obtained using
parallelization with the mc.cores parameter.
jost <- FingerprintCells(jost,
condition_meta = 'gene',
control_class = 'neg',
dictionary = gwps_dictionary,
group_level = TRUE)
#> The following 541 features are not present in the SCT assay:LINC01128, INTS11, MRPL20-AS1, FNDC10, CENPS, SPEN-AS1, MICOS10, ELOA-AS1, ELOA, MACO1, ATP5IF1, SNHG3, YARS1, TMEM35B, ADPRS, UTP11, TUT4, CZIB, IFT25, MIGA1, SELENOF, KYAT3, LINC02609, DPH5-DT, EEIG2, SARS1, ATP5PB, LINC02802, LINC02805, H2AC20, PRUNE1, GBA1, ENTREP3, KHDC4, MIR9-1HG, NAXE, METTL25B, FIRRM, GAS5, COP1, ODR4, RO60, UTP25, PACC1, FLVCR1-DT, EPRS1, H3-3A, COQ8A, H2AC25, RAB4A-AS1...
#> Processing data...
#> Preparing...
#> Classifying...
group_level_summary <- SummarizeResults(jost, group_level = TRUE,
condition_meta = 'gene',
control_class = 'neg')
print(group_level_summary[1:5,])
#> top_credible_set.gene
#> RPL9 SPATA5L1,URB2,ABCF1,NLE1,SPATA5,RPL3,CENPP,HERC1,EIF6,SURF6,DDX51,TMA16,CINP,RSL1D1,MYBBP1A,CARF,UBE2E1,KRR1,NOC3L,FBXL14,RPL9
#> GATA1 HMBOX1,GATA1
#> POLR1D TAF1C,RPL3,TAF1D,GTF3C6,POLR1E,POLR1C,CD3EAP,RPL26,GNL2,RPL37
#> POLR2H UXT,NUMA1,C1QBP,PFDN2,MED26,RPAP2,POLR2B,ERCC2,GPN3
#> RPS14 KRR1,WDR3,FBXL14,AATF,RIOK1,RPS12,RIOK2,DDX52,DIMT1,DDX10,ZNF84,FCF1,RPS3,WDR36,NOP10,RPS14,RPS2,RPS19,DDX47,TSR2,RPP30,POP4,RRP9,POP1,ZNHIT6,USP36,UTP20,RPS3A,DNTTIP2,RPS20,RPS15A,ABT1,IMPA2,RPP14,ZCCHC9,ESF1,BUD23,LAMB1,RPS10-NUDT3,RRP12
#> top_lbf.gene
#> RPL9 2882.97
#> GATA1 2762.02
#> POLR1D 2093.92
#> POLR2H 2056.41
#> RPS14 1900.83
#> all_credible_sets.gene
#> RPL9 SPATA5L1,URB2,ABCF1,NLE1,SPATA5,RPL3,CENPP,HERC1,EIF6,SURF6,DDX51,TMA16,CINP,RSL1D1,MYBBP1A,CARF,UBE2E1,KRR1,NOC3L,FBXL14,RPL9
#> GATA1 HMBOX1,GATA1
#> POLR1D TAF1C,RPL3,TAF1D,GTF3C6,POLR1E,POLR1C,CD3EAP,RPL26,GNL2,RPL37
#> POLR2H UXT,NUMA1,C1QBP,PFDN2,MED26,RPAP2,POLR2B,ERCC2,GPN3
#> RPS14 KRR1,WDR3,FBXL14,AATF,RIOK1,RPS12,RIOK2,DDX52,DIMT1,DDX10,ZNF84,FCF1,RPS3,WDR36,NOP10,RPS14,RPS2,RPS19,DDX47,TSR2,RPP30,POP4,RRP9,POP1,ZNHIT6,USP36,UTP20,RPS3A,DNTTIP2,RPS20,RPS15A,ABT1,IMPA2,RPP14,ZCCHC9,ESF1,BUD23,LAMB1,RPS10-NUDT3,RRP12
#> all_lbfs.gene
#> RPL9 2882.97
#> GATA1 2762.02
#> POLR1D 2093.92
#> POLR2H 2056.41
#> RPS14 1900.83
#> all_signs.gene
#> RPL9 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> GATA1 +,+
#> POLR1D +,+,+,+,+,+,+,+,+,+
#> POLR2H +,+,+,+,+,+,+,+,+
#> RPS14 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+Some of these assignments involve very long credible sets, which
makes sense given that many of these perturbations are of genes
belonging to large complexes. We can visualize these assignments using
Long Island City plots. Below, we should an example of an assignment
with a long credible set, and one with a short credible set. We could
alternatively sort the fingerprints in alphabetical order by setting the
order argument appropriately.
LongIslandCityPlot(jost, query = 'RPS14', condition_meta = 'gene',
control_class = 'neg', dictionary = gwps_dictionary,
group_level = T, cs = 10)
LongIslandCityPlot(jost, query = 'BCR', condition_meta = 'gene',
control_class = 'neg', dictionary = gwps_dictionary,
group_level = T, cs = 10)These plots show the conditional posterior mean coefficients for each
fingerprint on the y-axis. However, in large-scale settings, this can be
slow to compute. For a faster version, we can specify the argument
approx = TRUE to instead show marginal correlations on the
y-axis, e.g.:
LongIslandCityPlot(jost, query = 'RPS14', condition_meta = 'gene',
control_class = 'neg', dictionary = gwps_dictionary,
group_level = T, cs = 10, approx = TRUE)LongIslandCityPlot(jost, query = 'BCR', condition_meta = 'gene',
control_class = 'neg', dictionary = gwps_dictionary,
group_level = T, cs = 10, approx = TRUE)To understand what genes were most responsible for driving these
matches, we can use the ExplainMatch function. Here, the
visualizations are automatically made with the stored pseudobulk GWPS
object inside the precomputed dictionary, since we don’t have the full
GWPS object loaded in.
ExplainMatch(jost, query = 'BCR', match = 'BCR',
condition_meta = 'gene', control_class = 'neg',
dictionary = gwps_dictionary, group_level = T,
num_genes = 25)We can also map cells on an individual basis. This will be slower
than group-level classification, so here we speed it up by using
parallelization with mc.cores.
jost <- FingerprintCells(jost, condition_meta = 'gene', control_class = 'neg',
dictionary = gwps_dictionary,
group_level = FALSE, mc.cores = 10)
#> The following 541 features are not present in the SCT assay:LINC01128, INTS11, MRPL20-AS1, FNDC10, CENPS, SPEN-AS1, MICOS10, ELOA-AS1, ELOA, MACO1, ATP5IF1, SNHG3, YARS1, TMEM35B, ADPRS, UTP11, TUT4, CZIB, IFT25, MIGA1, SELENOF, KYAT3, LINC02609, DPH5-DT, EEIG2, SARS1, ATP5PB, LINC02802, LINC02805, H2AC20, PRUNE1, GBA1, ENTREP3, KHDC4, MIR9-1HG, NAXE, METTL25B, FIRRM, GAS5, COP1, ODR4, RO60, UTP25, PACC1, FLVCR1-DT, EPRS1, H3-3A, COQ8A, H2AC25, RAB4A-AS1...
#> Processing data...
#> Preparing...
#> Classifying...
cell_level_summary <- SummarizeResults(jost, group_level = FALSE,
condition_meta = 'gene',
control_class = 'neg')
print(head(cell_level_summary))
#> gene
#> GCAGTTACAAGCTGGA-2 RPL9
#> CAGTAACAGAGGGCTT-2 RPL9
#> GGGTTGCCACGTCTCT-2 RPL9
#> GACACGCTCAGCTCTC-2 RPL9
#> TGTTCCGTCGTTGCCT-2 RPL9
#> ATCACGACAGGTCGTC-2 RPL9
#> top_credible_set
#> GCAGTTACAAGCTGGA-2 ABCF1,RPL30,RPL19,RPL10,RPL8,RPL21,RPL14,RPL24,RPL9,URB2,EIF6,DDX51,NLE1,TEX10,SPATA5L1,MYBBP1A,CINP,RPL37,PPAN,RPL10A,HEATR1
#> CAGTAACAGAGGGCTT-2 ABCF1,NLE1,DDX24,EIF6,RPL8,RPL5,RPL36,RPL24,RPL30,SPATA5L1,CINP,URB2,CENPP,NIFK,SDAD1,GNL2,FBL,RPS7,TSR2,RPS3,RPS4X,KRR1,UTP20,RPS15A,EIF3A,RPS6,RPS24,TTF1,GLIS1,PES1,MYBBP1A,SOX3,WDR12,RPL18,STK24,NOC3L,CARF,DDX51,RRS1,DDX54,NOP16,MAK16,RPL34,RPL7,FBXL14,RPL21
#> GGGTTGCCACGTCTCT-2 NLE1,DDX51,ABCF1,MYBBP1A,EIF6,RRS1,RPL37,RPL30,RPL10,RPL24,RPL8,GNL2,RPL17,SOX3,DDX54,CCDC86,SDAD1,NOC3L,RPL9,RPL36,RPL18,NSA2,SPATA5L1,NOP2,RPL19,RPL14,RPL7,CINP
#> GACACGCTCAGCTCTC-2 URB2,SPATA5L1,ABCF1,DDX55,LTV1,HERC1,SDAD1,RPS19,RPS10-NUDT3,NOC3L,TEX10,TMA16,NOLC1,HDAC6,RPL3,SURF6,RSL1D1,EIF6,CINP,DDX52,TSR2,KRR1,RCL1,HIGD2A,FCF1,TPP2,EIF3C,DNTTIP2,RPS12,MAK16
#> TGTTCCGTCGTTGCCT-2 URB2,NLE1,GNL2,ABCF1,RPL24,NOL9,SPATA5L1,KIF18B,CINP,ZSCAN2,RPL18,NR2F2,DDX56,RPS20,EIF5B,GLIS1,RPL21,SOX3,CEBPZ,DDX24,NOL8,RBM28,NIFK,RRS1,MDN1,UBE2E1,GTPBP4,RPL27,NOC3L,LAS1L,WDR55,TEX10,RPL18A,LTV1,PES1,RSL24D1,RPL11,RPL13,SURF6,RPL23A,TPP2,SPATA5,DDX54,NVL,NMD3
#> ATCACGACAGGTCGTC-2 SDAD1,NLE1,RPL37,EIF6,GNL2,RRS1,RPL24,URB2,RPL30,RPL5,RPL10,RPL17,RPL13,RPL38,RPL31,RPLP0,GLIS1,DDX56,RPL11,DDX24,ABCF1,NOP16,MAK16,SPATA5L1,DDX51,CINP,EBNA1BP2,RPL18,WDR12
#> top_lbf
#> GCAGTTACAAGCTGGA-2 1077.60
#> CAGTAACAGAGGGCTT-2 594.60
#> GGGTTGCCACGTCTCT-2 491.14
#> GACACGCTCAGCTCTC-2 491.08
#> TGTTCCGTCGTTGCCT-2 475.88
#> ATCACGACAGGTCGTC-2 474.95
#> all_credible_sets
#> GCAGTTACAAGCTGGA-2 ABCF1,RPL30,RPL19,RPL10,RPL8,RPL21,RPL14,RPL24,RPL9,URB2,EIF6,DDX51,NLE1,TEX10,SPATA5L1,MYBBP1A,CINP,RPL37,PPAN,RPL10A,HEATR1
#> CAGTAACAGAGGGCTT-2 ABCF1,NLE1,DDX24,EIF6,RPL8,RPL5,RPL36,RPL24,RPL30,SPATA5L1,CINP,URB2,CENPP,NIFK,SDAD1,GNL2,FBL,RPS7,TSR2,RPS3,RPS4X,KRR1,UTP20,RPS15A,EIF3A,RPS6,RPS24,TTF1,GLIS1,PES1,MYBBP1A,SOX3,WDR12,RPL18,STK24,NOC3L,CARF,DDX51,RRS1,DDX54,NOP16,MAK16,RPL34,RPL7,FBXL14,RPL21
#> GGGTTGCCACGTCTCT-2 NLE1,DDX51,ABCF1,MYBBP1A,EIF6,RRS1,RPL37,RPL30,RPL10,RPL24,RPL8,GNL2,RPL17,SOX3,DDX54,CCDC86,SDAD1,NOC3L,RPL9,RPL36,RPL18,NSA2,SPATA5L1,NOP2,RPL19,RPL14,RPL7,CINP
#> GACACGCTCAGCTCTC-2 URB2,SPATA5L1,ABCF1,DDX55,LTV1,HERC1,SDAD1,RPS19,RPS10-NUDT3,NOC3L,TEX10,TMA16,NOLC1,HDAC6,RPL3,SURF6,RSL1D1,EIF6,CINP,DDX52,TSR2,KRR1,RCL1,HIGD2A,FCF1,TPP2,EIF3C,DNTTIP2,RPS12,MAK16
#> TGTTCCGTCGTTGCCT-2 URB2,NLE1,GNL2,ABCF1,RPL24,NOL9,SPATA5L1,KIF18B,CINP,ZSCAN2,RPL18,NR2F2,DDX56,RPS20,EIF5B,GLIS1,RPL21,SOX3,CEBPZ,DDX24,NOL8,RBM28,NIFK,RRS1,MDN1,UBE2E1,GTPBP4,RPL27,NOC3L,LAS1L,WDR55,TEX10,RPL18A,LTV1,PES1,RSL24D1,RPL11,RPL13,SURF6,RPL23A,TPP2,SPATA5,DDX54,NVL,NMD3
#> ATCACGACAGGTCGTC-2 SDAD1,NLE1,RPL37,EIF6,GNL2,RRS1,RPL24,URB2,RPL30,RPL5,RPL10,RPL17,RPL13,RPL38,RPL31,RPLP0,GLIS1,DDX56,RPL11,DDX24,ABCF1,NOP16,MAK16,SPATA5L1,DDX51,CINP,EBNA1BP2,RPL18,WDR12
#> all_lbfs
#> GCAGTTACAAGCTGGA-2 1077.6
#> CAGTAACAGAGGGCTT-2 594.6
#> GGGTTGCCACGTCTCT-2 491.14
#> GACACGCTCAGCTCTC-2 491.08
#> TGTTCCGTCGTTGCCT-2 475.88
#> ATCACGACAGGTCGTC-2 474.95
#> all_signs
#> GCAGTTACAAGCTGGA-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> CAGTAACAGAGGGCTT-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> GGGTTGCCACGTCTCT-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> GACACGCTCAGCTCTC-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> TGTTCCGTCGTTGCCT-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+
#> ATCACGACAGGTCGTC-2 +,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+We can visualize these assignments by using the
ProjectFingerprints function to create a UMAP of the
projections of each assigned cell onto the fingerprints. Below, we
overlay the plot with each cell’s true perturbation label:
jost <- ProjectFingerprints(jost, 'gene',
'neg', gwps_dictionary)
DimPlot(jost, group.by = 'gene',
reduction = 'projUMAP', label = T)