- 1 Packages and data import
- 2 Tools list for sc RNA–Seq
- 3 nb fits from DESeq2 to get dropout probs
- 4 Variance stabilization
- 5 Session information

**LAST UPDATE AT**

` [1] "Tue Jun 5 15:47:12 2018"`

We first set global chunk options and load the necessary packages and the data.

```
library("scran")
library("rmarkdown")
library("BiocStyle")
library("magrittr")
library("stringr")
library("ggthemes")
library("scales")
library("ggbeeswarm")
library("broom")
library("tidyverse")
library("readxl")
library("ggrepel")
library("DESeq2")
# install patchwork from GitHub only if it is not already available
if (!requireNamespace("patchwork", quietly = TRUE)) {
  devtools::install_github("thomasp85/patchwork")
}
library("patchwork")
data_dir <- file.path("../../Teaching/EMBL-Teaching/stat_methods_bioinf/data")
theme_set(theme_solarized(base_size = 18))
glog2 <- function(x) ((asinh(x) - log(2))/log(2))
```
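The helper `glog2` defined above is a generalized logarithm: because `asinh(x)` approaches `log(2 * x)` for large `x`, `glog2(x)` approaches `log2(x)`, while staying finite at zero. The following sketch illustrates this on a few arbitrarily chosen values (the vector `x` is only an illustration, not data from this document).

```r
# Same definition as in the setup chunk above
glog2 <- function(x) ((asinh(x) - log(2)) / log(2))

# Illustrative values only
x <- c(0, 1, 10, 100, 1000)

# glog2 is finite at 0 (where log2 is -Inf) and tracks log2 for large x
comparison <- data.frame(x = x, glog2 = glog2(x), log2 = log2(x))
print(comparison)
```

This is why such a transformation is convenient for count data with many zeros: no pseudocount is needed.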

- “Conquer” workflow by Charlotte Soneson / Mark Robinson:

- Aaron Lun & Co preprocessing & QC workflow:

http://bioconductor.org/packages/release/workflows/html/simpleSingleCell.html

- SCRAN size factors

- cool overview / review:

- Caveat: uses a zero-inflated model: [Perraudeau, Fanny, Davide Risso, Kelly Street, Elizabeth Purdom, and Sandrine Dudoit. 2017. “Bioconductor Workflow for Single-Cell RNA Sequencing: Normalization, Dimensionality Reduction, Clustering, and Lineage Inference.”](https://doi.org/10.12688/f1000research.12122.1)

- using resampling makes most sense:

Single-cell data are often used to infer (developmental) hierarchies of single cells. For this, a three-step approach has emerged:

1. Dimensionality reduction
2. (optional) Clustering
3. Graph fitting
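The three steps can be sketched in base R on simulated data. This is only an illustration under stated assumptions: PCA, k-means, and a minimum spanning tree over cluster centroids are stand-ins chosen for simplicity, not the methods of any particular package, and the toy data, `centers`, and the helper `mst_edges` are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 3 groups of 50 "cells" in 10 dimensions whose
# means are shifted along the first two coordinates.
centers <- rbind(c(0, 0), c(4, 0), c(8, 3))
expr <- do.call(rbind, lapply(1:3, function(k) {
  m <- matrix(rnorm(50 * 10), nrow = 50)
  m[, 1] <- m[, 1] + centers[k, 1]
  m[, 2] <- m[, 2] + centers[k, 2]
  m
}))

# Step 1: dimensionality reduction (PCA, keep two components)
pcs <- prcomp(expr, center = TRUE, scale. = FALSE)$x[, 1:2]

# Step 2 (optional): clustering in the reduced space
km <- kmeans(pcs, centers = 3, nstart = 20)

# Step 3: graph fitting, here a minimum spanning tree over the cluster
# centroids, built with Prim's algorithm. Returns one edge per row.
mst_edges <- function(d) {
  d <- as.matrix(d)
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (i in seq_len(n - 1)) {
    # distances from nodes already in the tree to nodes not yet in it
    sub <- d[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    edges[i, ] <- c(which(in_tree)[idx[1]], which(!in_tree)[idx[2]])
    in_tree[edges[i, 2]] <- TRUE
  }
  edges
}

tree <- mst_edges(dist(km$centers))
print(tree)
```

Each row of `tree` connects two cluster indices; ordering the clusters along the tree gives a crude "trajectory". Whether such a tree is meaningful depends entirely on whether the clusters themselves are robust.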

*sincell* is a Bioconductor package wrapping a couple of these techniques; typical examples of such techniques include *monocle*. Before using any of these algorithms, it is always a good idea to try to obtain robust clusterings via packages like *clusterExperiment*, as the dimensionality reduction step can be misleading.

In fact, dimensionality reduction methods can produce trajectory-like patterns even for random data (see W. Huber's examples). This is often related to a certain covariance structure of the data. For an interesting example of an influential PCA misinterpretation in genetics, see Novembre and Stephens (2008). Traditionally, wave-like patterns in PC maps have been interpreted as migration events. However, as they show, these patterns arise naturally as soon as genetic similarity decays with distance.
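The effect of a distance-decaying covariance can be illustrated with a small base R sketch. It assumes 100 hypothetical sampling locations on a line and an exponential decay of similarity with an arbitrary length scale of 20; for simplicity it takes the eigenvectors of the similarity matrix directly and skips the centering step of a full PCA. The helper `n_sign_changes` is introduced here only for illustration.

```r
# Hypothetical setup: 100 locations on a line, similarity decays with distance
n_loc <- 100
pos <- seq_len(n_loc)
Sigma <- exp(-abs(outer(pos, pos, "-")) / 20)  # length scale 20 is arbitrary

# Eigenvectors of the similarity matrix play the role of PC coordinates
eig <- eigen(Sigma, symmetric = TRUE)

# Count how often a component changes sign along the line
n_sign_changes <- function(v) sum(diff(sign(v)) != 0)

# Leading components oscillate more and more, like waves of increasing frequency
sign_changes <- sapply(1:3, function(k) n_sign_changes(eig$vectors[, k]))
print(sign_changes)
```

Plotting the leading components against position, for example with `matplot(pos, eig$vectors[, 1:3], type = "l")`, shows the wave-like shapes, although nothing in the setup encodes a trajectory or a migration event.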

```
library("Rtsne")

# Run t-SNE on a data matrix or a "dist" object and return a tibble with
# one row per cell: cell_id plus the two t-SNE coordinates.
run_tsne <- function(X, perplexity = 20, pca = FALSE, max_iter = 5000,
                     verbose = FALSE, is_distance = TRUE, seed = 123L, ...) {
  set.seed(seed)
  # additional arguments in ... are passed on to Rtsne()
  tX <- Rtsne(X, perplexity = perplexity, pca = pca, max_iter = max_iter,
              verbose = verbose, is_distance = is_distance, ...)$Y
  # inherits() also works when class(X) has length > 1 (e.g. a matrix in R >= 4.0)
  labs <- if (inherits(X, "dist")) labels(X) else rownames(X)
  colnames(tX) <- c("tSNE_dimension_1", "tSNE_dimension_2")
  add_column(as_tibble(tX),
             cell_id = labs,
             .before = "tSNE_dimension_1")
}
```

```
library("ggthemes")
set.seed(123)
x <- c(rnorm(10, mean = 5, sd = 2), rnorm(100, mean = -5, sd = 2))
y <- c(rnorm(10, mean = 5, sd = 2), rnorm(100, mean = -5, sd = 2))
clusters <- c(rep("cl_1", 10), rep("cl_2", 100))
dat_sim <- tibble(x, y, clusters)
org_plot <- ggplot(dat_sim, aes(x, y, color = clusters)) +
  geom_point() +
  ggtitle("simulated data")
org_plot +
  theme_solarized() +
  scale_color_tableau("tableau10medium")
```

- Two normally distributed clusters of different sizes (10 and 100 points), with no within-cluster structure

```
X <- as.matrix(dat_sim[, 1:2])
rownames(X) <- dat_sim$clusters
tsne <- run_tsne(X, perplexity = 5,
                 pca = FALSE, max_iter = 5000,
                 verbose = FALSE, is_distance = FALSE, seed = 123)
# run_tsne() stores the row names (here the cluster labels) in cell_id
tsne_plot <- ggplot(tsne, aes(x = tSNE_dimension_1,
                              y = tSNE_dimension_2,
                              color = cell_id)) +
  geom_point(size = 3) +
  labs(color = "clusters") +
  ggtitle("t-SNE of sim data, perplexity = 5") +
  coord_equal()
tsne_plot +
  theme_solarized() +
  scale_color_tableau("tableau10medium")
```