Introduction to scRNA-seq analysis

Seurat

Seurat is an R package for analysing single-cell RNA-seq data. More info about the package can be found here.
The public dataset we will be working with today is part of SeuratData, which is a package that distributes datasets in Seurat-object form.

Single-cell Analysis Overview

Overview:

1. Acquiring single cells and sequencing them

Acquiring single cells entails dissociating some individual cells from tissue:

droplet-based: higher throughput (aka higher # of individual cells you are sequencing), lower depth -> good for main transcriptional patterns
plate-based: higher depth for fewer cells, useful for lowly expressed transcripts

2. Raw data processing: Processing reads to counts per cell (count matrix)

Acquired sequencing data, output of raw data processing: genes x single cells count matrix in scRNA-seq:

high-dimensional data, a lot of rows (whole transcriptome) and columns (lots of cells)
very sparse, signal spread out, characteristic of droplet-based approach

3. Data pre-processing:

3.1. Quality Control (filtering out poor quality cells)

As a result of the wet-lab technique, our droplet could be:
- singlet (what we want): one cell in a droplet
- empty: no cell in a droplet; since ambient RNA exists, it won’t be an empty column
- doublet: has one (or more) cells in a droplet -> DoubletFinder
- dead cell: can recognise by higher % of reads mapped to genes are mapped to the mitochondrial genome; because we have so many cells, we can remove all cells with a more stringent cutoff % e.g. >5% of all genes mapping to that mitochondrial genome (look at the data to pick a cutoff, vioolin plots quite useful)
Cell Ranger has proprietary software package (that you will have run as their customer), and they provide you with the count matrix and do the pre-processing part for you (up until, and including, QC)

3.2. Normalization

Some cells will be sequenced to a bigger depth than other cells, (cell of type A got 1000 reads, cell of type B got 500 reads e.g.) -> apply size factor for each cell to make it comparable

3.3. Feature Selection

genes with low or no expression - exclude, reasoning: won’t be crucial in cell type identity
housekeeping genes - exclude, they are expressed at a stable level in every cell type all the time, doesn’t help in determining cell types -> you care about 3000~5000 most variable usually

3.4. Dimension Reduction

PCA: 30 to 50 PCs instead of 20k you had before for genes; preserves major sources of variability, as it is expected some genes will have correlated behaviours; downstream steps will take the PCs as input -> not best way for visualisation, but is a sanity check that something isn’t horribly messed up

3.5. Data Correction

batch effect correction: these cells can be from e.g. different people, or sequenced on different days, need to differentiate between batch effect and biological effect
tools: Matching mutual nearest neighbours (MNN), Canonical correlation analysis (CCA), Harmony

+ Visualisations along the way

UMAP (non-linear dimensional reduction, non-deterministic) gives local preservation, meaning similar cells are grouped cells together (each dot is a cell), but not good at global structure, meaning just because cluster1 is further away from cluster3 than cluster2 is, doesn’t mean cluster2 and cluster 3 are more similar in cell type than cluster1 and cluster3; not good for downstream analysis, just good for visualisation; PCA was input, otherwise the UMAP is really slow

4. Downstream analysis:

4-a) Clustering

idea: similar cells (aka cells of same cell type) will cluster together
Two most popular clustering algorithms, both have a resolution parameter, the higher, more clusters:
- Leiden clustering (faster and usually leads to better clusters)
- Louvain clustering (Seurat default)

4-b) Cell type annotation

label the clusters based on different biological functions (aka seen through highly expressed genes)
can be manual or automatic
can be of a main or fine labelling (more specific biological calls e.g. CD4 CTL and CD4 Naive instead of CD4 T)
Seurat does cluster resolution (higher acc), SingleR does cell per cell annonation, (usually lower acc)

4-c) Proportion test

once you have cell types annotated, are there difference in the proportions of cell types in our control vs case

4-d) Differential expression

determining which genes are differentially expressed when comparing e.g. two clusters, or a cluster with the rest of the clusters

4-e) Gene Ontology

with a gene list of e.g. differentially expressed genes, whether they are related (and enriching) a pathway/process
R package: clusterProfiler

4-f) Trajectory inference cell states

inferring pseudo time: it looks like cell state A and cell state B are two cell states of the same cell, and they are going through time
tool: Monocle

4-g) Cell-cell interaction

to understand interactions between cell types
tool: CellPhoneDB

Seurat v5: full process automatic data pre-processing: QC - Normalisation - feature selection - dimension red - data correction (batch effect removal with CCA), with some downstream analysis

Case study

Acquiring the data

This public dataset includes single-cell RNA-seq data of human peripheral blood mononuclear cells (PBMCs) using multiple sequencing platforms. Only data that passed quality control are included in the pbmcsca dataset.

InstallData("pbmcsca")
data("pbmcsca")
# needs to update Seurat object to new structure (v5) for storing data/calculations
pbmcsca <- UpdateSeuratObject(pbmcsca)

Let’s view a summary of the data:

table(pbmcsca$Method)

## 
##   10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B   10x Chromium (v3) 
##                3362                3222                3222                3222 
##            CEL-Seq2            Drop-seq             inDrops            Seq-Well 
##                 526                6584                6584                3773 
##          Smart-seq2 
##                 526

We see the number of cells sequenced per sequencing platform.
We are going to focus on single-cell sequencing of peripheral blood mononuclear cells (PBMC) from two patients, sequenced on two different platforms: 10x Chromium (v2) and 10x Chromium (v3)

pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]

A Seurat object is a representation of single-cell expression data for R. Each Seurat object consists of one or more Assay objects aka representations of expression data (eg. RNA-seq, ATAC-seq, etc). Each assay can have multiple layers. Each layer would be a transformed version of the raw gene expression data. These assays can be reduced from their high-dimensional state to a lower-dimension state and stored as DimReduc objects.

Seurat objects also store additional metadata. The object is designed to be as self-contained as possible and as more information is put within the object, the larger it becomes in size.

pbmc_10x_v3@assays$RNA #raw count matrix (genes x cells) of scRNA-seq

## Assay data with 33694 features for 3222 cells
## First 10 features:
##  TSPAN6, TNMD, DPM1, SCYL3, C1orf112, FGR, CFH, FUCA2, GCLC, NFYA

head(pbmc_10x_v3@meta.data) # info and labels of each cell

##                               orig.ident nCount_RNA nFeature_RNA nGene nUMI
## pbmc1_10x_v3_AAACCCACACTTGGGC      pbmc1       6914         1860  1860 6914
## pbmc1_10x_v3_AAACCCATCTTACACT      pbmc1       6168         1758  1758 6168
## pbmc1_10x_v3_AAAGAACAGCAGGCAT      pbmc1       8265         2425  2425 8265
## pbmc1_10x_v3_AAAGAACTCAAAGACA      pbmc1       5789         1553  1553 5789
## pbmc1_10x_v3_AAAGTCCGTAGGCAGT      pbmc1       9163         2655  2655 9163
## pbmc1_10x_v3_AAAGTGATCGTGACTA      pbmc1       6359         1707  1707 6359
##                                     percent.mito Cluster    CellType Experiment
## pbmc1_10x_v3_AAACCCACACTTGGGC 0.0911194677466011       1 CD4+ T cell      pbmc1
## pbmc1_10x_v3_AAACCCATCTTACACT 0.0856031128404669       1 CD4+ T cell      pbmc1
## pbmc1_10x_v3_AAAGAACAGCAGGCAT 0.0815486993345433       1 CD4+ T cell      pbmc1
## pbmc1_10x_v3_AAAGAACTCAAAGACA 0.0730696147866644       1 CD4+ T cell      pbmc1
## pbmc1_10x_v3_AAAGTCCGTAGGCAGT 0.0966932227436429       1 CD4+ T cell      pbmc1
## pbmc1_10x_v3_AAAGTGATCGTGACTA  0.104576191225035       1 CD4+ T cell      pbmc1
##                                          Method
## pbmc1_10x_v3_AAACCCACACTTGGGC 10x Chromium (v3)
## pbmc1_10x_v3_AAACCCATCTTACACT 10x Chromium (v3)
## pbmc1_10x_v3_AAAGAACAGCAGGCAT 10x Chromium (v3)
## pbmc1_10x_v3_AAAGAACTCAAAGACA 10x Chromium (v3)
## pbmc1_10x_v3_AAAGTCCGTAGGCAGT 10x Chromium (v3)
## pbmc1_10x_v3_AAAGTGATCGTGACTA 10x Chromium (v3)

pbmc_10x_v3@reductions #dim red info is here; empty, because we haven't done any dim reductions on the assay, so there is nothing to store!

## list()

Analysis of a single scRNA-seq experiment

For now, let’s focus on just the patient sample sequenced on 10x Chromium (v3).

dim(pbmc_10x_v3)

## [1] 33694  3222

We have 3222 cells and 33694 genes that have passed quality control.

Normalisation

By default, Seurat implements a global-scaling normalization method called LogNormalize. Global-scaling means it aims to make your samples (for us, cells) comparable. LogNormalize normalises the gene expression measurements for each cell by the total number of counts across all genes per cell, and multiplies this by a scaling factor (10,000 by default), then log transforms the result using log1p (i.e \(log_e(value+1)\)):

pbmc_10x_v3 <- Seurat::NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize", 
    scale.factor = 10000)

How we could check we understand what LogNormalize does:

counts_of_first_cell <- pbmc_10x_v3@assays$RNA@counts[,1]
total_num_counts_first_cell <- sum(counts_of_first_cell)
base::all.equal(pbmc_10x_v3@assays$RNA@data[,1],
                log1p(counts_of_first_cell*10000/total_num_counts_first_cell)) # testing ‘near equality’ for numeric; Differences smaller than tolerance are not reported. The default value is close to 1.5e-8.

## [1] TRUE

Feature Selection and Scaling

We use the FindVariableFeatures() Seurat function to select highly variable genes which have most of the useful information needed for downstream analysis. For now, there is nothing there:

pbmc_10x_v3@assays$RNA@var.features

## character(0)

Here we select the top 3,000 most variable genes to save some computing time. In practice, you can select more genes (5,000 or more) to preserve more information from the scRNA-seq experiment:

pbmc_10x_v3 <- Seurat::FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)
head(pbmc_10x_v3@assays$RNA@var.features)

## [1] "S100A9"   "IGLC2"    "CST3"     "IGKC"     "HLA-DPA1" "S100A8"

After feature selection, the downstream Seurat functions will only use these HVGs, unless otherwise specified.

Variance stabilising transformation, or vst, is a method for fitting a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess), then standardising the feature (aka gene expression) values using the observed mean and expected variance (given by the fitted line). Feature variance is then calculated on the standardised values after clipping to a maximum (see clip.max parameter). - as per Seurat documentation.
We are searching for the highest variable genes. A way to do this would be to find the variance of each gene (using gene expr values across all cells), sort from highest to lowest and have a cutoff of the top 3000 genes. However, genes that have a higher expression will have a higher variance (and lower expression lower variance), meaning some genes will be in the cutoff even though they are not particulary variable, instead just due to how large of a number an observation (hence variance) can get to. So to make genes more comparable across variances with the diverse range of means, you want to standardise the expression values.
When standardising the expression values, normally you would centre by the mean, and scale by the variance of each gene. But the variance could be noisy due to low number of samples -> a sc count matrix is sparse! This is where the variance stabilising method comes in. So the idea is to “borrow” from other genes of a similar mean value by fitting a (loess) line to a scatter plot of log_mean x log_var pairs, and use the predicted (expected) value of log_var for each log_mean (aka use the stabilised variance) in standardising every gene expression value: \[\frac{gene\_expr\_value - mean}{\sqrt{predicted\_var}},\] where \(predicted\_var = e^{predicted\_log\_var}\). This stabilised variance talks about the expected variance given a certain mean, so by standardising with this stabilised variance, we will achieve in removing the mean-var dependency. For implementation details, see here, and note when the counts or data slot is used for feature value standardisation.

Many downstream statistical analyses requires the data matrix to be centred and scaled. We use the Seurat function ScaleData() for this:

pbmc_10x_v3 <- Seurat::ScaleData(pbmc_10x_v3)

## Centering and scaling data matrix

The scaled z-scored residuals of these models are stored in the scale.data slot, and are used for downstream dimension reduction and clustering tasks.

Dimension Reduction - Principal component analysis (PCA)

We perform PCA on the scaled data. By default, HVGs in pbmc_10x_v3@var.genes are used as input, but they can be defined by specifying the argument pc.genes.

We run PCA on top 3,000 most variable genes, and the total number of PCs to compute and store is 50 by default:

pbmc_10x_v3 <- Seurat::RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))

## PC_ 1 
## Positive:  LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA 
##     CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1 
##     CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130 
## Negative:  IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A 
##     TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6 
##     CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2 
## PC_ 2 
## Positive:  NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC 
##     CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4 
##     PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X 
## Negative:  EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0 
##     PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1 
##     CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3 
## PC_ 3 
## Positive:  CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH 
##     FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247 
##     MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A 
## Negative:  CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74 
##     IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B 
##     P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19 
## PC_ 4 
## Positive:  IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93 
##     IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E 
##     SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2 
## Negative:  FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10 
##     MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3 
##     GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC 
## PC_ 5 
## Positive:  GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW 
##     CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM 
##     HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1 
## Negative:  IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP 
##     CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2 
##     NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3

The PCA result is stored in pbmc_10x_v3@reduction. The output information prints us 30 (by default) genes that are positively, or negatively correlated with the top 5 (by default) principal components.

PCA is a dimension reduction technique. It captures the most variability in the data, achieved by using an orthogonal linear transformation (\(\sum_i{ax_i}\)) of the data. Let’s say each data point is in p dimensions (e.g. p genes describe a person). The idea is that you have made a new p-dimension (new p axes), projected the data onto the new dimensions by a linear transformation of the data and the earliest dimensions 1) hold the most data variability (info) whilst maintaining 2) orthogonality to all previously defined axes (all the p axes are orthogonal). These new axes are called principal components (PCs) that capture the max variance in data while keeping orthogonality. Each data point (e.g. a person) then gets new coordinates, as they are projected onto the new coordinate system.
PCA is defined as the linear transformation shown below, where “x” are coordinates of original data, “c” are the projected coordinates, “a” are the loadings and “p” is the number of dimensions describing a point.
PCA is fully reversible, as it would equate to solving a system of linear equations seen below where “c” and “a” are the known, and “x” is the unknown.

\[ x_1, x_2, ..., x_p \rightarrow data \\ c_1, c_2, ..., c_p \rightarrow new~coordinates \\ a_{11}, a_{12} ~ ... ~ a_{1p} ~ ... ~ a_{pp} \rightarrow loadings\\ (x_1, x_2, ..., x_p) \rightarrow (c_1, c_2, ..., c_p) \\ c_1 = a_{11}x_1+a_{12}x_2 +...+ a_{1p}x_p \\ c_2 = a_{21}x_1+a_{22}x_2 +...+ a_{2p}x_p \\ ...\\ c_p = a_{p1}x_1+a_{p2}x_2 +...+ a_{pp}x_p \] Instead of looking at the original genes, we are looking at new/derived features aka principal components that came from transforming the original data.
So the first PC is a linear combination of features that describes the most variability in the data.
The \(k^{th}\) PC is a linear combination of features that is uncorrelated to all the previous k-1 PCs and describes the most variability unaccounted to all the previous k-1 PCs.
We’ve reduced the dimensions by retaining the first k (k < p) PCs instead of all p dimensions as we started with.

If you were to perform PCA mathematically, you would do the following:

S <- cov(x) # x is the matrix of centered data
eigen(S)
## eigen() decomposition
## $values
## [1] 6.858209 1.796837
##
## $vectors
##      [,1]        [,2]
## [1,] -0.8044078 0.5940776
## [2,] -0.5940776 -0.8044078

The largest eigenvalue is \(\lambda_1\) = 6.858, and the corresponding eigenvector (= unit vector) is w1 = (-0.804, -0.594)^T
\(\lambda_i\) is the variance of the \(i^{th}\) PC. The total variance of the data is the sum of all . The proportion of variance explained by the i^{th} PC is \(\frac{\lambda_i}{\sum^p_{j=1}{\lambda_j}}\).
The loadings can be found as:

eigenvalues_sqrt <- sqrt(eigen_result$values)
loadings <- eigen_result$vectors %*% diag(eigenvalues_sqrt)

The default parameters instill for PCA to be approximated by using truncated singular value decomposition (SVD) and fixed with a random seed preset to 42. Under the hood, it is using the irlba package for truncated singular value decomposition (SVD) that uses an iterative algorithm that starts with a random initial vector to approximate the largest singular vectors.

Important note: How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue that is out of the scope of today’s workshop.

But we highly recommend you to read this document before analyzing your own scRNA-seq data, where the authors show how to use some visualization methods to guide your choice.

Visualisation - 2D UMAP

Uniform manifold approximation and project (UMAP) is a tool for visualising higher dimensional data. It takes PCs as input. Here we use the top 30 PCs:

pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims=1:30)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

We draw the UMAP plot with the Dimplot() function by specifying the argument reduction = ‘umap’. We can colour points (cells) by using the metadata information (pbmc_10x_v3@meta.data) by specifying the argument group.by:

Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The Method metadata refers to the sequencing platform used, and as we are currently analysing the data from only one sequencing platform, we are expecting to see all the cells labelled with the same colour.

pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:2)
p1 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:10)
p2 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:30)
p3 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
pbmc_10x_v3 <- Seurat::RunUMAP(pbmc_10x_v3, dims = 1:50)
p4 <- Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", group.by = 'Method') + Seurat::NoLegend()
# combining plots by patchwork system

p1 + p2 + p3 + p4

Two PCs aren’t enough to see any structure and this would be because the first 2 PCs don’t capture enough variability in the data. The first 30 and 50 seem to give quite similar UMAP visualisations.
To get the legend font to be smaller in size, I searched the documentation for DimPlot and didn’t find a specific parameter for it, however, there was ..., which means you can send through parameters that would be used by the functions DimPlot employs. Looking at the implementation of DimPlot, I saw there is NoLegend(), so I instead decided to remove the legend for each individual plot using +NoLegend()

Clustering

By default, Seurat uses the Louvain algorithm. Another popular one is Leiden algorithm.
The Louvain algorithm requires a neighbor graph as input. Therefore, we first run the FindNeighbors() function first. It is based on PCs, so here we use the 30 top PCs (previously, we conducted PCA with the default number of PCs being 50). We then run the FindClusters() function for clustering. The argument Algorithm=1 means we are using the Louvain algorithm for clustering:

pbmc_10x_v3 <- Seurat::FindNeighbors(pbmc_10x_v3, dims = 1:30)

## Computing nearest neighbor graph

## Computing SNN

pbmc_10x_v3 <- Seurat::FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3222
## Number of edges: 134293
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds

The key tunning parameter here is resolution, which determines how many clusters we get. The number of clusters to have depends on biology. The higher the number, the more clusters.

We can use UMAP to visualise the clustering results. The argument group.by='seurat_clusters' is used to colour the cells according to the clustering results.

Seurat::DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')

Cell Type Annotation

Above we see that cluster 3 is visually its own cluster. To see what characterises that cluster, we can look at its marker genes. Marker genes would be the genes that are differentially expressed in ONLY that cluster, aka we compare that cluster to all the other clusters together.

pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)

## Calculating cluster 0

## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

## Calculating cluster 9

cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5,]

##          p_val avg_log2FC pct.1 pct.2 p_val_adj cluster     gene
## CD79A        0   9.438875 0.980 0.012         0       3    CD79A
## HLA-DQA1     0   5.120361 0.966 0.024         0       3 HLA-DQA1
## MS4A1        0   6.123866 0.858 0.010         0       3    MS4A1
## HLA-DQB1     0   3.781385 0.909 0.103         0       3 HLA-DQB1
## BANK1        0   7.059084 0.805 0.007         0       3    BANK1

We then search these marker genes in this database website. If there are genes with a dash in their name, you need to remove the dash before searching.

Based on the result provided by the database, it looks like cluster 3 might be a cluster of B cells. This is a rare situation in which we can check whether this is the case by checking the cell-type ground truth information pre-saved in the pbmc_10x_v3 object, and plotting this as UMAP:

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'CellType')

And so we see that cluster 3 (in our previous UMAP) is labelled as B cells, as we got from the database using our marker genes.

Annotating datasets

Very often we might have a few genes of interest, say CD14, a marker of CD14+ monocytes. Then we can look for clusters where CD14 is up-regulated. This can be done by:

Idents(object = pbmc_10x_v3) <- "seurat_clusters"
VlnPlot(pbmc_10x_v3, features = 'CD14')

## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

So, cluster 4 is high likely to be annotated as CD14+ monocytes.

Typically, cell-type information is unknown in single cell data, as this is what we are trying to find out! But publicly available datasets may have been previously annotated using the technique highlighted above, or using other reference datasets.

Automatic Cell Type Annotation for `pbmc_10x_v2`

Suppose we have annotated pbmc_10x_v3. We can then use it as a reference to annotate a new PBMC dataset (like pbmc_10x_v2). This can be done computationally, so that we do not need to go through the process of clustering, finding differentially expressed genes and looking up gene names.

Remember: pbmc_10x_v2 and pbmc_10x_v3 include different cells from different patients and sequenced using different platforms.

Pre-processing and Visualization for `pbmc_10x_v2`

# Normalise it
pbmc_10x_v2 <- NormalizeData(pbmc_10x_v2)

# Feature Selection
pbmc_10x_v2 <- FindVariableFeatures(pbmc_10x_v2, selection.method = "vst", nfeatures = 3000)

# Scale it
pbmc_10x_v2.all.genes <- rownames(pbmc_10x_v2)
pbmc_10x_v2 <- ScaleData(pbmc_10x_v2, features = pbmc_10x_v2.all.genes)

# Do PCA
pbmc_10x_v2 <- RunPCA(pbmc_10x_v2, features = VariableFeatures(object = pbmc_10x_v2))

# Draw UMAP
pbmc_10x_v2 <- FindNeighbors(pbmc_10x_v2, dims = 1:30)
pbmc_10x_v2 <- RunUMAP(pbmc_10x_v2, dims=1:30)
DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'Method')

Cell Type Annotation Using `pbmc_10x_v3` as Reference

In Seurat we can learn cell type annotation results from one scRNA-seq data to provide cell type annotation for another scRNA-seq dataset.

We use the FindTransferAnchors() function to predict which cells in two datasets are of the same cell type. Here we use pbmc_10x_v3 as the reference data set, and pbmc_10x_v2 is the query data.

We specify that we use the top 30 PCs, then assign the cell-type of the cells in pbmc_10x_v2 using the TransferData() function:

anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2, dims = 1:30)

## Performing PCA on the provided reference using 3000 features as input.

## Projecting cell embeddings

## Finding neighborhoods

## Finding anchors

##  Found 2995 anchors

predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType, dims = 1:30)

## Finding integration vectors

## Finding integration vector weights

## Predicting cell labels

Seurat will provide a table with the most likely cell type and the probability of each cell type based on the existing cell types in the reference. This means if there is a cell type in pbmc_10x_v2 that doesn’t exist in the reference, we won’t be able to correctly assign those cells, resulting in a misleading classification. We assign the most likely cell type to the pbmc_10x_v2 object, and then use UMAP to visualise this annotation:

pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id 

cell_type_pred_umap <- DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType_Prediction')
cell_type_pred_umap

Cell Type Annotation Using Azimuth

Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.

The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file, we retain only the useful information for cell type annotation with the following command lines:

DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
#saveRDS(pbmc_10x_v2_simple, 'output/pbmc_10x_v2.Rds')

An Rds file called pbmc_10x_v2.Rds is saved in your working directory. You can check where is your working directory by using the getwd() function.

How to use Azimuth for cell type annotation:

Find ‘References for scRNA-seq Queries’ -> Then find ‘Human - PBMC’ -> click ‘Go to App’
Click ‘Browse’ -> find ‘pbmc_10x_v2.Rds’ at your working directory -> Click ‘Open’
Wait for the Rds file to upload to the website
Click ‘Map cells to reference’
Click ‘Download Results’
Find ‘Predicted cell types and scores (TSV)’
Click ‘Download’ to get the cell type annotation result stored in azimuth_pred.tsv
Copy the tsv file (azimuth_pred.tsv) to your R working directory

The tsv file has the same data structure of Seurat annotation result (predictions). We read the tsv file, then add the annotation result to the pbmc_10x_v2 object with the AddMetaData() function and we use UMAP to visualise the cell type annotation result from Azimuth:

azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)
cell_type_azimuth_umap <- DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'predicted.celltype.l2')
cell_type_azimuth_umap

Comparing annotations

Here is the cell type annotation results provided by the data provider:

cell_type_truth_umap <- DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType')
cell_type_truth_umap

Assuming the ground truth of the cell type annotation is from the provider, we can see that the cell type annotation provided from Azimuth is better than from Seurat, which would be due to Azimuth using a larger reference dataset, enabling Azimuth to learn the characteristics of more cell types.

Integration of two data sets

We have combined the two data sets without any processing in the Seurat object pbmc_combo In this section we will focus on combining these data sets (note: the process would be similar if you are combining different patients).

Let’s use our previous workflow to analyse this data that came from combining two independent datasets.

# Normalise it
pbmc_combo <- NormalizeData(pbmc_combo)

# Feature Selection
pbmc_combo <- FindVariableFeatures(pbmc_combo,
                                   selection.method = "vst", nfeatures = 3000)

# Scale it
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)

# Do PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))

# Draw UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'Method')

This is where we can see that we have clusters form with solely cells from a single patient. Although we can have cell types in one patient that doesn’t appear in the other patient, we have too many clusters here that follow that pattern of clustering based on sequencing platform. This hinders our ability to detect valuable biological signal. We call differences caused by non-biological factors such as sequencing platforms or data sources batch effects. We need to first use some statistical methods to remove batch effects prior to downstream analysis.

If we weren’t sure whether this is batch effect, we could continue a bit with our analysis and do find clusters and see whether e.g. all cells expressing CD14 are in one cluster, as CD14 is a marker of CD14+ monocytes. If they are not in one cluster, they are not clustering based on cell types.

pbmc_combo <- FindClusters(object = pbmc_combo, resolution = 0.3, algorithm=1)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6584
## Number of edges: 281263
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9558
## Number of communities: 15
## Elapsed time: 0 seconds

DimPlot(pbmc_combo, reduction = "umap", label = FALSE)

VlnPlot(pbmc_combo, features = 'CD14')

Removing batch effects

In Seurat, we use the FindIntegrationAnchors() function to identify cells with similar biological information between two data sets. The difference between cells in two data sets with similar biological information is considered as batch effect. We then use the IntegrateData() function to remove batch effects and integrate the two data sets:

anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)

pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)

## Warning: Layer counts isn't present in the assay object; returning NULL

# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)

# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))

# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Method')

These clusters look much better in terms of not having clear distinct separation based on the sequencing platform.

DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'CellType')

Introduction to scRNA-seq analysis

Katarina Stankovic

2025-09

Seurat

Single-cell Analysis Overview

Overview:

1. Acquiring single cells and sequencing them

2. Raw data processing: Processing reads to counts per cell (count matrix)

3. Data pre-processing:

3.1. Quality Control (filtering out poor quality cells)

3.2. Normalization

3.3. Feature Selection

3.4. Dimension Reduction

3.5. Data Correction

+ Visualisations along the way

4. Downstream analysis:

4-a) Clustering

4-b) Cell type annotation

4-c) Proportion test

4-d) Differential expression

4-e) Gene Ontology

4-f) Trajectory inference cell states

4-g) Cell-cell interaction

Case study

Acquiring the data

Analysis of a single scRNA-seq experiment

Normalisation

Feature Selection and Scaling

Dimension Reduction - Principal component analysis (PCA)

Visualisation - 2D UMAP

Clustering

Cell Type Annotation

Annotating datasets

Automatic Cell Type Annotation for `pbmc_10x_v2`

Pre-processing and Visualization for `pbmc_10x_v2`

Cell Type Annotation Using `pbmc_10x_v3` as Reference

Cell Type Annotation Using Azimuth

Comparing annotations

Integration of two data sets

Removing batch effects

Introduction to scRNA-seq analysis

Katarina Stankovic

2025-09

Seurat

Single-cell Analysis Overview

Overview:

1. Acquiring single cells and sequencing them More

2. Raw data processing: Processing reads to counts per cell (count matrix) More

3. Data pre-processing:

3.1. Quality Control (filtering out poor quality cells) More

3.2. Normalization More

3.3. Feature Selection More

3.4. Dimension Reduction More

3.5. Data Correction More

+ Visualisations along the way More

4. Downstream analysis:

4-a) Clustering More

4-b) Cell type annotation More

4-c) Proportion test More

4-d) Differential expression More

4-e) Gene Ontology More

4-f) Trajectory inference cell states More

4-g) Cell-cell interaction More

Case study

Acquiring the data

Analysis of a single scRNA-seq experiment

Normalisation

Feature Selection and Scaling

Dimension Reduction - Principal component analysis (PCA)

Visualisation - 2D UMAP

Clustering

Cell Type Annotation

Annotating datasets

Automatic Cell Type Annotation for pbmc_10x_v2

Pre-processing and Visualization for pbmc_10x_v2

Cell Type Annotation Using pbmc_10x_v3 as Reference

Cell Type Annotation Using Azimuth

Comparing annotations

Integration of two data sets

Removing batch effects

1. Acquiring single cells and sequencing them

2. Raw data processing: Processing reads to counts per cell (count matrix)

3.1. Quality Control (filtering out poor quality cells)

3.2. Normalization

3.3. Feature Selection

3.4. Dimension Reduction

3.5. Data Correction

+ Visualisations along the way

4-a) Clustering

4-b) Cell type annotation

4-c) Proportion test

4-d) Differential expression

4-e) Gene Ontology

4-f) Trajectory inference cell states

4-g) Cell-cell interaction

Automatic Cell Type Annotation for `pbmc_10x_v2`

Pre-processing and Visualization for `pbmc_10x_v2`

Cell Type Annotation Using `pbmc_10x_v3` as Reference