Seurat is an R package for analysing single-cell RNA-seq data. More
info about the package can be found here.
The public dataset we will be working with today is part of SeuratData,
which is a package that distributes datasets in Seurat-object form.
Acquiring single cells entails dissociating some individual cells from tissue:
Acquired sequencing data, output of raw data processing: genes x single cells count matrix in scRNA-seq:
DoubletFinderSeurat v5: full process automatic data pre-processing: QC - Normalisation - feature selection - dimension red - data correction (batch effect removal with CCA), with some downstream analysis
This public dataset includes single-cell RNA-seq data of human
peripheral blood mononuclear cells (PBMCs) using multiple sequencing
platforms. Only data that passed quality control are
included in the pbmcsca dataset.
InstallData("pbmcsca")
data("pbmcsca")
# needs to update Seurat object to new structure (v5) for storing data/calculations
pbmcsca <- UpdateSeuratObject(pbmcsca)
Let’s view a summary of the data:
table(pbmcsca$Method)
##
## 10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B 10x Chromium (v3)
## 3362 3222 3222 3222
## CEL-Seq2 Drop-seq inDrops Seq-Well
## 526 6584 6584 3773
## Smart-seq2
## 526
We see the number of cells sequenced per sequencing platform.
We are going to focus on single-cell sequencing of peripheral blood
mononuclear cells (PBMC) from two patients, sequenced on two different
platforms: 10x Chromium (v2) and 10x Chromium (v3)
pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]
A Seurat object is a representation of single-cell expression data
for R. Each Seurat object consists of one or more Assay objects aka
representations of expression data (eg. RNA-seq, ATAC-seq, etc). Each
assay can have multiple layers. Each layer would be a transformed
version of the raw gene expression data. These assays can be reduced
from their high-dimensional state to a lower-dimension state and stored
as DimReduc objects.
Seurat objects also store additional metadata. The object is designed to be as self-contained as possible and as more information is put within the object, the larger it becomes in size.
pbmc_10x_v3@assays$RNA #raw count matrix (genes x cells) of scRNA-seq
## Assay data with 33694 features for 3222 cells
## First 10 features:
## TSPAN6, TNMD, DPM1, SCYL3, C1orf112, FGR, CFH, FUCA2, GCLC, NFYA
head(pbmc_10x_v3@meta.data) # info and labels of each cell
## orig.ident nCount_RNA nFeature_RNA nGene nUMI
## pbmc1_10x_v3_AAACCCACACTTGGGC pbmc1 6914 1860 1860 6914
## pbmc1_10x_v3_AAACCCATCTTACACT pbmc1 6168 1758 1758 6168
## pbmc1_10x_v3_AAAGAACAGCAGGCAT pbmc1 8265 2425 2425 8265
## pbmc1_10x_v3_AAAGAACTCAAAGACA pbmc1 5789 1553 1553 5789
## pbmc1_10x_v3_AAAGTCCGTAGGCAGT pbmc1 9163 2655 2655 9163
## pbmc1_10x_v3_AAAGTGATCGTGACTA pbmc1 6359 1707 1707 6359
## percent.mito Cluster CellType Experiment
## pbmc1_10x_v3_AAACCCACACTTGGGC 0.0911194677466011 1 CD4+ T cell pbmc1
## pbmc1_10x_v3_AAACCCATCTTACACT 0.0856031128404669 1 CD4+ T cell pbmc1
## pbmc1_10x_v3_AAAGAACAGCAGGCAT 0.0815486993345433 1 CD4+ T cell pbmc1
## pbmc1_10x_v3_AAAGAACTCAAAGACA 0.0730696147866644 1 CD4+ T cell pbmc1
## pbmc1_10x_v3_AAAGTCCGTAGGCAGT 0.0966932227436429 1 CD4+ T cell pbmc1
## pbmc1_10x_v3_AAAGTGATCGTGACTA 0.104576191225035 1 CD4+ T cell pbmc1
## Method
## pbmc1_10x_v3_AAACCCACACTTGGGC 10x Chromium (v3)
## pbmc1_10x_v3_AAACCCATCTTACACT 10x Chromium (v3)
## pbmc1_10x_v3_AAAGAACAGCAGGCAT 10x Chromium (v3)
## pbmc1_10x_v3_AAAGAACTCAAAGACA 10x Chromium (v3)
## pbmc1_10x_v3_AAAGTCCGTAGGCAGT 10x Chromium (v3)
## pbmc1_10x_v3_AAAGTGATCGTGACTA 10x Chromium (v3)
pbmc_10x_v3@reductions #dim red info is here; empty, because we haven't done any dim reductions on the assay, so there is nothing to store!
## list()
For now, let’s focus on just the patient sample sequenced on 10x Chromium (v3).
dim(pbmc_10x_v3)
## [1] 33694 3222
We have 3222 cells and 33694 genes that have passed quality control.
By default, Seurat implements a global-scaling normalization method
called LogNormalize. Global-scaling means it aims to make
your samples (for us, cells) comparable. LogNormalize
normalises the gene expression measurements for each
cell by the total number of counts across all genes per
cell, and multiplies this by a scaling factor (10,000 by
default), then log transforms the result using log1p (i.e
\(log_e(value+1)\)):
pbmc_10x_v3 <- Seurat::NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize",
scale.factor = 10000)
How we could check we understand what LogNormalize
does:
counts_of_first_cell <- pbmc_10x_v3@assays$RNA@counts[,1]
total_num_counts_first_cell <- sum(counts_of_first_cell)
base::all.equal(pbmc_10x_v3@assays$RNA@data[,1],
log1p(counts_of_first_cell*10000/total_num_counts_first_cell)) # testing ‘near equality’ for numeric; Differences smaller than tolerance are not reported. The default value is close to 1.5e-8.
## [1] TRUE
We use the FindVariableFeatures() Seurat function to
select highly variable genes which have most of the useful information
needed for downstream analysis. For now, there is nothing there:
pbmc_10x_v3@assays$RNA@var.features
## character(0)
Here we select the top 3,000 most variable genes to save some computing time. In practice, you can select more genes (5,000 or more) to preserve more information from the scRNA-seq experiment:
pbmc_10x_v3 <- Seurat::FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)
head(pbmc_10x_v3@assays$RNA@var.features)
## [1] "S100A9" "IGLC2" "CST3" "IGKC" "HLA-DPA1" "S100A8"
After feature selection, the downstream Seurat functions will only use these HVGs, unless otherwise specified.
Variance stabilising transformation, or vst, is a method for fitting
a line to the relationship of log(variance) and log(mean) using local
polynomial regression (loess), then standardising the feature (aka gene
expression) values using the observed mean and expected variance (given
by the fitted line). Feature variance is then calculated on the
standardised values after clipping to a maximum (see clip.max
parameter). - as per Seurat documentation.
We are searching for the highest variable genes. A way to do this would
be to find the variance of each gene (using gene expr values across all
cells), sort from highest to lowest and have a cutoff of the top 3000
genes. However, genes that have a higher expression will have a higher
variance (and lower expression lower variance), meaning some genes will
be in the cutoff even though they are not particulary variable, instead
just due to how large of a number an observation (hence variance) can
get to. So to make genes more comparable across variances with the
diverse range of means, you want to standardise the expression
values.
When standardising the expression values, normally you would centre by
the mean, and scale by the variance of each gene. But the variance could
be noisy due to low number of samples -> a sc count matrix is sparse!
This is where the variance stabilising method comes in. So the idea is
to “borrow” from other genes of a similar mean value by fitting a
(loess) line to a scatter plot of log_mean x log_var pairs, and use the
predicted (expected) value of log_var for each log_mean (aka use the
stabilised variance) in standardising every gene expression value: \[\frac{gene\_expr\_value -
mean}{\sqrt{predicted\_var}},\] where \(predicted\_var = e^{predicted\_log\_var}\).
This stabilised variance talks about the expected variance given a
certain mean, so by standardising with this stabilised variance, we will
achieve in removing the mean-var dependency. For implementation details,
see here,
and note when the counts or data slot is used
for feature value standardisation.
Many downstream statistical analyses requires the data matrix to be
centred and scaled. We use the Seurat function ScaleData()
for this:
pbmc_10x_v3 <- Seurat::ScaleData(pbmc_10x_v3)
## Centering and scaling data matrix
The scaled z-scored residuals of these models are stored in the
scale.data slot, and are used for downstream dimension
reduction and clustering tasks.
We perform PCA on the scaled data. By default, HVGs
in pbmc_10x_v3@var.genes are used as input, but they can be
defined by specifying the argument pc.genes.
We run PCA on top 3,000 most variable genes, and the total number of PCs to compute and store is 50 by default:
pbmc_10x_v3 <- Seurat::RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))
## PC_ 1
## Positive: LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA
## CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1
## CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130
## Negative: IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A
## TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6
## CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2
## PC_ 2
## Positive: NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC
## CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4
## PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X
## Negative: EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0
## PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1
## CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3
## PC_ 3
## Positive: CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH
## FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247
## MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A
## Negative: CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74
## IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B
## P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19
## PC_ 4
## Positive: IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93
## IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E
## SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2
## Negative: FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10
## MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3
## GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC
## PC_ 5
## Positive: GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW
## CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM
## HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1
## Negative: IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP
## CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2
## NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3
The PCA result is stored in pbmc_10x_v3@reduction. The
output information prints us 30 (by default) genes that
are positively, or negatively correlated with the top 5 (by default)
principal components.
PCA is a dimension reduction technique. It captures the most
variability in the data, achieved by using an orthogonal linear
transformation (\(\sum_i{ax_i}\)) of
the data. Let’s say each data point is in p dimensions (e.g. p genes
describe a person). The idea is that you have made a new p-dimension
(new p axes), projected the data onto the new dimensions by a linear
transformation of the data and the earliest dimensions 1) hold
the most data variability (info) whilst maintaining 2) orthogonality to
all previously defined axes (all the p axes are orthogonal). These new
axes are called principal components (PCs) that capture the max variance
in data while keeping orthogonality. Each data point (e.g. a person)
then gets new coordinates, as they are projected onto the new coordinate
system.
PCA is defined as the linear transformation shown below, where “x” are
coordinates of original data, “c” are the projected coordinates, “a” are
the loadings and “p” is the number of dimensions describing a
point.
PCA is fully reversible, as it would equate to solving a system of
linear equations seen below where “c” and “a” are the known, and “x” is
the unknown.
\[
x_1, x_2, ..., x_p \rightarrow data \\
c_1, c_2, ..., c_p \rightarrow new~coordinates \\
a_{11}, a_{12} ~ ... ~ a_{1p} ~ ... ~ a_{pp} \rightarrow loadings\\
(x_1, x_2, ..., x_p) \rightarrow (c_1, c_2, ..., c_p) \\
c_1 = a_{11}x_1+a_{12}x_2 +...+ a_{1p}x_p \\
c_2 = a_{21}x_1+a_{22}x_2 +...+ a_{2p}x_p \\
...\\
c_p = a_{p1}x_1+a_{p2}x_2 +...+ a_{pp}x_p
\] Instead of looking at the original genes, we are looking at
new/derived features aka principal components that came from
transforming the original data.
So the first PC is a linear combination of features that describes the
most variability in the data.
The \(k^{th}\) PC is a linear
combination of features that is uncorrelated to all the previous k-1 PCs
and describes the most variability unaccounted to all the previous k-1
PCs.
We’ve reduced the dimensions by retaining the first k (k < p) PCs
instead of all p dimensions as we started with.
If you were to perform PCA mathematically, you would do the following:
S <- cov(x) # x is the matrix of centered data
eigen(S)
## eigen() decomposition
## $values
## [1] 6.858209 1.796837
##
## $vectors
## [,1] [,2]
## [1,] -0.8044078 0.5940776
## [2,] -0.5940776 -0.8044078
The largest eigenvalue is \(\lambda_1\) = 6.858, and the corresponding
eigenvector (= unit vector) is w1 = (-0.804, -0.594)^T
\(\lambda_i\) is the variance of the
\(i^{th}\) PC. The total variance of
the data is the sum of all . The proportion of variance explained by the
i^{th} PC is \(\frac{\lambda_i}{\sum^p_{j=1}{\lambda_j}}\).
The loadings can be found as:
eigenvalues_sqrt <- sqrt(eigen_result$values)
loadings <- eigen_result$vectors %*% diag(eigenvalues_sqrt)
The default parameters instill for PCA to be approximated by using
truncated singular value decomposition (SVD) and fixed with a random
seed preset to 42. Under the hood, it is using the irlba
package for truncated singular value decomposition (SVD) that uses an
iterative algorithm that starts with a random initial vector to
approximate the largest singular vectors.
Important note: How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue that is out of the scope of today’s workshop.
But we highly recommend you to read this document before analyzing your own scRNA-seq data, where the authors show how to use some visualization methods to guide your choice.
Uniform manifold approximation and project (UMAP) is a tool for visualising higher dimensional data. It takes PCs as input. Here we use the top 30 PCs:
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims=1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
We draw the UMAP plot with the Dimplot() function by
specifying the argument reduction = ‘umap’. We can colour points (cells)
by using the metadata information (pbmc_10x_v3@meta.data) by specifying the argument
group.by:
Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The Method metadata refers to the sequencing platform
used, and as we are currently analysing the data from only one
sequencing platform, we are expecting to see all the cells labelled with
the same colour.
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:2)
p1 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:10)
p2 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:30)
p3 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:50)
p4 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
# combining plots by patchwork system
p1 + p2 + p3 + p4
Two PCs aren’t enough to see any structure and this would be because
the first 2 PCs don’t capture enough variability in the data. The first
30 and 50 seem to give quite similar UMAP visualisations.
To get the legend font to be smaller in size, I searched the
documentation for DimPlot and didn’t find a specific
parameter for it, however, there was ..., which means you
can send through parameters that would be used by the functions
DimPlot employs. Looking at the implementation
of DimPlot, I saw there is NoLegend(), so I
instead decided to remove the legend for each individual plot using
+NoLegend()
By default, Seurat uses the Louvain algorithm. Another popular one is
Leiden algorithm.
The Louvain algorithm requires a neighbor graph as input. Therefore, we
first run the FindNeighbors() function first. It is based
on PCs, so here we use the 30 top PCs (previously, we conducted PCA with
the default number of PCs being 50). We then run the
FindClusters() function for clustering. The argument
Algorithm=1 means we are using the Louvain algorithm for
clustering:
pbmc_10x_v3 <- Seurat::FindNeighbors(pbmc_10x_v3, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
pbmc_10x_v3 <- Seurat::FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3222
## Number of edges: 134293
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds
The key tunning parameter here is resolution, which determines how many clusters we get. The number of clusters to have depends on biology. The higher the number, the more clusters.
We can use UMAP to visualise the clustering results. The argument
group.by='seurat_clusters' is used to colour the cells
according to the clustering results.
Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')
Above we see that cluster 3 is visually its own cluster. To see what characterises that cluster, we can look at its marker genes. Marker genes would be the genes that are differentially expressed in ONLY that cluster, aka we compare that cluster to all the other clusters together.
pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)
## Calculating cluster 0
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5,]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## CD79A 0 9.438875 0.980 0.012 0 3 CD79A
## HLA-DQA1 0 5.120361 0.966 0.024 0 3 HLA-DQA1
## MS4A1 0 6.123866 0.858 0.010 0 3 MS4A1
## HLA-DQB1 0 3.781385 0.909 0.103 0 3 HLA-DQB1
## BANK1 0 7.059084 0.805 0.007 0 3 BANK1
We then search these marker genes in this database website. If there are genes with a dash in their name, you need to remove the dash before searching.
Based on the result provided by the database, it looks like cluster 3
might be a cluster of B cells. This is a rare situation in which we can
check whether this is the case by checking the cell-type ground truth
information pre-saved in the pbmc_10x_v3 object, and
plotting this as UMAP:
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'CellType')
And so we see that cluster 3 (in our previous UMAP) is labelled as B cells, as we got from the database using our marker genes.
Very often we might have a few genes of interest, say CD14, a marker of CD14+ monocytes. Then we can look for clusters where CD14 is up-regulated. This can be done by:
Idents(object = pbmc_10x_v3) <- "seurat_clusters"
VlnPlot(pbmc_10x_v3, features = 'CD14')
## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
So, cluster 4 is high likely to be annotated as CD14+ monocytes.
Typically, cell-type information is unknown in single cell data, as this is what we are trying to find out! But publicly available datasets may have been previously annotated using the technique highlighted above, or using other reference datasets.
pbmc_10x_v2Suppose we have annotated pbmc_10x_v3. We can then use
it as a reference to annotate a new PBMC dataset (like
pbmc_10x_v2). This can be done computationally, so that we
do not need to go through the process of clustering, finding
differentially expressed genes and looking up gene names.
Remember: pbmc_10x_v2 and
pbmc_10x_v3 include different cells from different patients
and sequenced using different platforms.
pbmc_10x_v2# Normalise it
pbmc_10x_v2 <- NormalizeData(pbmc_10x_v2)
# Feature Selection
pbmc_10x_v2 <- FindVariableFeatures(pbmc_10x_v2, selection.method = "vst", nfeatures = 3000)
# Scale it
pbmc_10x_v2.all.genes <- rownames(pbmc_10x_v2)
pbmc_10x_v2 <- ScaleData(pbmc_10x_v2, features = pbmc_10x_v2.all.genes)
# Do PCA
pbmc_10x_v2 <- RunPCA(pbmc_10x_v2, features = VariableFeatures(object = pbmc_10x_v2))
# Draw UMAP
pbmc_10x_v2 <- FindNeighbors(pbmc_10x_v2, dims = 1:30)
pbmc_10x_v2 <- RunUMAP(pbmc_10x_v2, dims=1:30)
DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 as
ReferenceIn Seurat we can learn cell type annotation results from one scRNA-seq data to provide cell type annotation for another scRNA-seq dataset.
We use the FindTransferAnchors() function to predict
which cells in two datasets are of the same cell type. Here we use
pbmc_10x_v3 as the reference data set, and
pbmc_10x_v2 is the query data.
We specify that we use the top 30 PCs, then assign the cell-type of
the cells in pbmc_10x_v2 using the
TransferData() function:
anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2, dims = 1:30)
## Performing PCA on the provided reference using 3000 features as input.
## Projecting cell embeddings
## Finding neighborhoods
## Finding anchors
## Found 2995 anchors
predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType, dims = 1:30)
## Finding integration vectors
## Finding integration vector weights
## Predicting cell labels
Seurat will provide a table with the most likely cell type and the
probability of each cell type based on the existing cell types in the
reference. This means if there is a cell type in
pbmc_10x_v2 that doesn’t exist in the reference, we won’t
be able to correctly assign those cells, resulting in a misleading
classification. We assign the most likely cell type to the
pbmc_10x_v2 object, and then use UMAP to visualise this
annotation:
pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id
cell_type_pred_umap <- DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType_Prediction')
cell_type_pred_umap
Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.
The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file, we retain only the useful information for cell type annotation with the following command lines:
DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
#saveRDS(pbmc_10x_v2_simple, 'output/pbmc_10x_v2.Rds')
An Rds file called pbmc_10x_v2.Rds is saved in your
working directory. You can check where is your working directory by
using the getwd() function.
How to use Azimuth for cell type annotation:
The tsv file has the same data structure of Seurat annotation result
(predictions). We read the tsv file, then add the annotation result to
the pbmc_10x_v2 object with the AddMetaData() function and
we use UMAP to visualise the cell type annotation result from
Azimuth:
azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)
cell_type_azimuth_umap <- DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'predicted.celltype.l2')
cell_type_azimuth_umap
Here is the cell type annotation results provided by the data provider:
cell_type_truth_umap <- DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType')
cell_type_truth_umap
Assuming the ground truth of the cell type annotation is from the provider, we can see that the cell type annotation provided from Azimuth is better than from Seurat, which would be due to Azimuth using a larger reference dataset, enabling Azimuth to learn the characteristics of more cell types.
We have combined the two data sets without any processing in the
Seurat object pbmc_combo In this section we will focus on
combining these data sets (note: the process would be similar if you are
combining different patients).
Let’s use our previous workflow to analyse this data that came from combining two independent datasets.
# Normalise it
pbmc_combo <- NormalizeData(pbmc_combo)
# Feature Selection
pbmc_combo <- FindVariableFeatures(pbmc_combo,
selection.method = "vst", nfeatures = 3000)
# Scale it
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)
# Do PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))
# Draw UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'Method')
This is where we can see that we have clusters form with solely cells from a single patient. Although we can have cell types in one patient that doesn’t appear in the other patient, we have too many clusters here that follow that pattern of clustering based on sequencing platform. This hinders our ability to detect valuable biological signal. We call differences caused by non-biological factors such as sequencing platforms or data sources batch effects. We need to first use some statistical methods to remove batch effects prior to downstream analysis.
If we weren’t sure whether this is batch effect, we could continue a bit with our analysis and do find clusters and see whether e.g. all cells expressing CD14 are in one cluster, as CD14 is a marker of CD14+ monocytes. If they are not in one cluster, they are not clustering based on cell types.
pbmc_combo <- FindClusters(object = pbmc_combo, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6584
## Number of edges: 281263
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9558
## Number of communities: 15
## Elapsed time: 0 seconds
DimPlot(pbmc_combo, reduction = "umap", label = FALSE)
VlnPlot(pbmc_combo, features = 'CD14')
In Seurat, we use the FindIntegrationAnchors() function
to identify cells with similar biological information between two data
sets. The difference between cells in two data sets with similar
biological information is considered as batch effect. We then use the
IntegrateData() function to remove batch effects and
integrate the two data sets:
anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)
pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)
## Warning: Layer counts isn't present in the assay object; returning NULL
# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)
# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))
# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Method')
These clusters look much better in terms of not having clear distinct separation based on the sequencing platform.
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'CellType')