Overview

This vignette demonstrates how to map scRNA-seq data to Azimuth reference datasets in R.

Reference-based mapping provides an attractive alternative to unsupervised analysis. When well-curated and annotated references are available, reference-mapping can rapdidly, robustly, and sensitively annotate and interpret query datasets. As part of the Human Biomolecular Atlas Project, we have built integrated references for multiple human tissues, available at azimuth.hubmapconsortium.org. Azimuth is a web-tool that maps user-uploaded datasets - starting from an unnormalized expression counts matrix.

We have recently released updated software that enables the Azimuth workflow to be replicated directly in R, without the need to upload data into the web application. Reference datasets are automatically downloaded as part of out previously released SeuratData framework. Datasets can then be passed through Azimuth using a single command.

As with the web application, Azimuth is compatible with a wide range of inputs, including Seurat objects, 10x HDF5 files, and Scanpy/h5ad files. Once Azimuth is run, a Seurat object is returned which contains

  • Cell annotations (at multiple levels of resolution)
  • Prediction scores (i.e. confidence scores) for each annotation
  • Projection onto the reference-derived 2-dimensional UMAP visualization

In this vignette, we demonstrate the use of a function RunAzimuth() which facilitates annotation of single cell datasets.

View data download code

To download all the required files, you can run the following lines in a shell:

wget https://figshare.com/ndownloader/files/27686835 -O human_cd34_bone_marrow.h5ad

Installation

We first install and load Seurat, Azimuth, and Seurat-Data

devtools::install_github("satijalab/seurat-data")
devtools::install_github("satijalab/azimuth")
install.packages("Seurat")

Map PBMC scRNA-seq datasets from 7 different technologies

We first consider a benchmark dataset from Systematic comparison of single-cell and single-nucleus RNA-sequencing methods, Nat. Biotech 2020, which profiled a total of 31,021 human peripheral blood mononuclear cells (PBMCs) using seven different experimental technologies. The raw data is avaialble for download here, but the dataset is also available for easy loading via SeuratData. We load the data in, and map it to our PBMC reference.

# Install the PBMC systematic comparative analyis (pmbcsca) dataset
InstallData("pbmcsca")

# returns a Seurat object named pbmcsca
LoadData("pbmcsca")

# The RunAzimuth function can take a Seurat object as input
pbmcsca <- RunAzimuth(pbmcsca, reference = "pbmcref")

We can now visualize the outputs of Azimuth. Note that all cells are labeled with high-resolution annotations, and are projected into a harmonized space despite being analyzed from a wide variety of technologies.

p1 <- DimPlot(pbmcsca, group.by = "predicted.celltype.l2", label = TRUE, label.size = 3) + NoLegend()
p2 <- DimPlot(pbmcsca, group.by = "Method")
p1 + p2

We can visualize the expression of canonical marker genes to examine the accuracy of predictions. Note that Azimuth normalizes data (internally) before mapping, but does not return the results, so we normalize the data here before visualization.

Here, we specifically visualize:

  • The expression of CCR7 on CD4 and CD8 Naive T cells
  • The expression of FCGR3A on CD16+ monocytes, CD56dim NK cells, and cytotoxic CD8 T cells
  • The expression of AXL on rare populations of AXL+SIGLEC6+ dendritic cells (ASDC)
  • Prediction scores for the annotation CD4+ regulatory T cells (Treg)
pbmcsca <- NormalizeData(pbmcsca)
Idents(pbmcsca) <- "predicted.celltype.l2"

p1 <- FeaturePlot(pbmcsca, features = "CCR7")
p2 <- FeaturePlot(pbmcsca, features = "FCGR3A")
p3 <- VlnPlot(pbmcsca, features = "AXL", group.by = "predicted.celltype.l2", idents = c("ASDC",
    "pDC", "cDC1", "cDC2"))
p4 <- FeaturePlot(pbmcsca, features = "predictionscorecelltypel2_Treg")

p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

See available references

You can search all available datasets in SeuratData (focusing on Azimuth references)

available_data <- AvailableData()
available_data[grep("Azimuth", available_data[, 3]), 1:3]
##                                  Dataset Version                        Summary
## bonemarrowref.SeuratData   bonemarrowref   1.0.0  Azimuth Reference: bonemarrow
## fetusref.SeuratData             fetusref   1.0.0       Azimuth Reference: fetus
## humancortexref.SeuratData humancortexref   1.0.0 Azimuth Reference: humancortex
## kidneyref.SeuratData           kidneyref   1.0.1      Azimuth Reference: kidney
## lungref.SeuratData               lungref   2.0.0        Azimuth Reference: lung
## mousecortexref.SeuratData mousecortexref   1.0.0 Azimuth Reference: mousecortex
## pancreasref.SeuratData       pancreasref   1.0.0    Azimuth Reference: pancreas
## pbmcref.SeuratData               pbmcref   1.0.0        Azimuth Reference: pbmc

Map CD34+ cells from Human Bone Marrow (stored as an h5ad file)

As a final example, we map data from CD34+ human bone marrow cells from the manuscript Characterization of cell fate probabilities in single-cell data with Palantir. We map to our Azimuth Human Bone Marrow reference, which includes both progenitor and differentiated cells.

Azimuth can also take the path to an h5ad object as input. In this case, Azimuth extracts the unnormalized counts from the object, and proceeds with mapping.

bm <- RunAzimuth(query = "human_cd34_bone_marrow.h5ad", reference = "bonemarrowref")

As expected, query cells map to CD34+ celltypes which represent a subset of celltypes present in the reference. Rare cells that map to differentiated cell populations (i.e. CD4 memory), map with low prediction scores. Reference-mapping also harmonizes two separate runs.

p1 <- DimPlot(bm, group.by = "predicted.celltype.l2", label = TRUE, label.size = 3, repel = TRUE) +
    NoLegend()
p2 <- DimPlot(bm, group.by = "orig.ident")
sort(table(bm$predicted.celltype.l2), decreasing = TRUE)
## 
##           LMPP            GMP            HSC    Early Eryth            CLP 
##           1138           1065           1028            849            470 
##          pre B     Late Eryth        pre-pDC            EMP transitional B 
##            422            214            152            141             82 
##            pDC        Prog Mk          pro B         BaEoMa        pre-mDC 
##             68             60             38             18             14 
##     CD4 Memory           cDC2           ASDC           MAIT        Stromal 
##              8              6              5              1              1
p3 <- VlnPlot(bm, features = "predicted.celltype.l2.score") + NoLegend()
(p1 + p2)/p3

Lastly, we can visualize markers associated with lineage differentiation to verify our annotations including AVP (HSC), KLF1 (erythroid), MPO (myeloid), and VWF (platelet).

# normalize before visualization
bm <- NormalizeData(bm)
FeaturePlot(bm, features = c("AVP", "KLF1", "MPO", "VWF"))

Session Info
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] patchwork_1.1.1                pbmcsca.SeuratData_3.0.0      
##  [3] pbmcref.SeuratData_1.0.0       pbmc3k.SeuratData_3.1.4       
##  [5] lungref.SeuratData_1.0.0       kidneyref.SeuratData_1.0.0    
##  [7] bonemarrowref.SeuratData_1.0.0 SeuratData_0.2.2              
##  [9] Azimuth_0.4.5                  shinyBS_0.61.1                
## [11] sp_1.4-7                       SeuratObject_4.1.0            
## [13] Seurat_4.1.1                  
## 
## loaded via a namespace (and not attached):
##   [1] systemfonts_1.0.4     plyr_1.8.7            igraph_1.3.1         
##   [4] lazyeval_0.2.2        shinydashboard_0.7.2  splines_4.1.3        
##   [7] listenv_0.8.0         scattermore_0.8       ggplot2_3.3.6        
##  [10] digest_0.6.29         htmltools_0.5.2       fansi_1.0.3          
##  [13] magrittr_2.0.3        memoise_2.0.1         googlesheets4_1.0.0  
##  [16] tensor_1.5            cluster_2.1.3         ROCR_1.0-11          
##  [19] globals_0.15.0        matrixStats_0.62.0    pkgdown_2.0.3        
##  [22] spatstat.sparse_2.1-1 colorspace_2.0-3      rappdirs_0.3.3       
##  [25] ggrepel_0.9.1         textshaping_0.3.6     xfun_0.30            
##  [28] dplyr_1.0.9           crayon_1.5.1          jsonlite_1.8.0       
##  [31] progressr_0.10.0      spatstat.data_2.2-0   survival_3.3-1       
##  [34] zoo_1.8-10            glue_1.6.2            polyclip_1.10-0      
##  [37] gargle_1.2.0          gtable_0.3.0          leiden_0.4.2         
##  [40] future.apply_1.9.0    abind_1.4-5           scales_1.2.0         
##  [43] DBI_1.1.2             spatstat.random_2.2-0 miniUI_0.1.1.1       
##  [46] Rcpp_1.0.8.3          viridisLite_0.4.0     xtable_1.8-4         
##  [49] reticulate_1.24       spatstat.core_2.4-2   bit_4.0.4            
##  [52] DT_0.23               htmlwidgets_1.5.4     httr_1.4.3           
##  [55] RColorBrewer_1.1-3    ellipsis_0.3.2        ica_1.0-2            
##  [58] farver_2.1.0          pkgconfig_2.0.3       sass_0.4.1           
##  [61] uwot_0.1.11           deldir_1.0-6          utf8_1.2.2           
##  [64] labeling_0.4.2        tidyselect_1.1.2      rlang_1.0.2          
##  [67] reshape2_1.4.4        later_1.3.0           cellranger_1.1.0     
##  [70] munsell_0.5.0         tools_4.1.3           cachem_1.0.6         
##  [73] cli_3.3.0             generics_0.1.2        ggridges_0.5.3       
##  [76] evaluate_0.15         stringr_1.4.0         fastmap_1.1.0        
##  [79] yaml_2.3.5            ragg_1.2.2            goftest_1.2-3        
##  [82] bit64_4.0.5           knitr_1.38            fs_1.5.2             
##  [85] fitdistrplus_1.1-8    purrr_0.3.4           RANN_2.6.1           
##  [88] pbapply_1.5-0         future_1.25.0         nlme_3.1-157         
##  [91] mime_0.12             formatR_1.12          hdf5r_1.3.5          
##  [94] compiler_4.1.3        rstudioapi_0.13       curl_4.3.2           
##  [97] plotly_4.10.0         png_0.1-7             spatstat.utils_2.3-1 
## [100] tibble_3.1.7          bslib_0.3.1           stringi_1.7.6        
## [103] highr_0.9             desc_1.4.1            rgeos_0.5-9          
## [106] lattice_0.20-45       Matrix_1.4-1          SeuratDisk_0.0.0.9020
## [109] shinyjs_2.1.0         vctrs_0.4.1           pillar_1.7.0         
## [112] lifecycle_1.0.1       BiocManager_1.30.16   spatstat.geom_2.4-0  
## [115] lmtest_0.9-40         jquerylib_0.1.4       RcppAnnoy_0.0.19     
## [118] data.table_1.14.2     cowplot_1.1.1         irlba_2.3.5          
## [121] httpuv_1.6.5          R6_2.5.1              promises_1.2.0.1     
## [124] renv_0.15.4           KernSmooth_2.23-20    gridExtra_2.3        
## [127] parallelly_1.31.1     codetools_0.2-18      MASS_7.3-56          
## [130] assertthat_0.2.1      rprojroot_2.0.3       withr_2.5.0          
## [133] presto_1.0.0          sctransform_0.3.3     mgcv_1.8-40          
## [136] parallel_4.1.3        grid_4.1.3            rpart_4.1.16         
## [139] tidyr_1.2.0           rmarkdown_2.13        googledrive_2.0.0    
## [142] Rtsne_0.16            shiny_1.7.1