- 1 Preparations
- 2 Preparing count matrices
- 3 Aligning reads to a reference and feature counting
- 4 The *DESeqDataSet* class, column and row metadata
- 5 Experimental design and the design formula
- 6 Quality control and normalization of the count data
- 7 Normalization
- 8 PCA and sample heatmaps
- 9 Differential expression analysis
- 10 Statistical testing of differential expression
- 11 Gene ontology enrichment analysis
- 12 Running topGO
- 13 Session information

Bernd Klaus [2] and Wolfgang Huber [2], based on material by Michael Love [1] and Simon Anders [2].

[1] Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, US;

[2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.

**LAST UPDATE AT**

Wed Oct 19 11:09:15 2016

We first set global chunk options and load the necessary packages.

```
library(BiocStyle)
library(rmarkdown)
library(geneplotter)
library(ggplot2)
library(plyr)
library(LSD)
library(DESeq2)
library(gplots)
library(RColorBrewer)
library(stringr)
library(topGO)
library(genefilter)
library(biomaRt)
library(dplyr)
library(EDASeq)
library(fdrtool)
library(org.Mm.eg.db)
data.dir <- file.path("/g/huber/users/klaus/Data/NCBI")
```

In this document we introduce a workflow for a typical RNA–Seq data analysis. We start from a “count table”, which summarizes the number of sequence reads mapping to each of the genes, and discuss quality control, differential expression and enrichment analysis of the data. Important aspects of the `R` language and Bioconductor data structures for high–dimensional genomic data are discussed along the way.

We re–analyze RNA–Seq data obtained by comparing the expression profiles of WT mice to those of mice harboring a deletion of a MYC enhancer that is required for proper lip/palate formation. The data were published along with the following paper:

Note that in the workflows below, we do not aim at giving a comprehensive treatment of all possible options, but rather try to introduce sensible default approaches for the necessary steps. This document is based on a protocol published in Nature Methods, which contains more detailed information on various steps. Additionally, the “RNA–Seq workflow” is well worth reading and contains a lot of additional background information.

This lab will walk you through an end–to-end RNA–Seq differential expression workflow, using *DESeq2* along with other Bioconductor packages.

As input, the *DESeq2* package expects count data as obtained, e.g., from RNA–Seq or another high–throughput sequencing experiment, in the form of a matrix of integer values. The value in the *i*–th row and the *j*–th column of the matrix tells how many reads have been mapped to gene *i* in sample *j*. Analogously, for other types of assays, the rows of the matrix might correspond e.g., to binding regions (with ChIP–Seq) or peptide sequences (with quantitative mass spectrometry).

The count values must be raw counts of sequencing reads. This is important for *DESeq2*’s statistical model to hold, as only the actual counts allow assessing the measurement precision correctly. Hence, please do not supply other quantities, such as (rounded) normalized counts, or counts of covered base pairs — this will only lead to nonsensical results.

The computational analysis of an RNA–Seq experiment begins earlier: what we get from the sequencing machine is a set of FASTQ files that contain the nucleotide sequence of each read and a quality score at each position. These reads must first be aligned to a reference genome or transcriptome. It is important to know if the sequencing experiment was single–end or paired–end, as the alignment software will require the user to specify both FASTQ files for a paired–end experiment. The output of this alignment step is commonly stored in a file format called SAM/BAM.

To download the published data associated with the paper from the Gene Expression Omnibus repository, we need the SRA toolkit, which allows us to download the FASTQ files from the Short Read Archive. Binaries are available for Ubuntu and CentOS. The accession numbers (and links to the landing pages) are given in the article; they are –

A number of software programs exist to align reads to the reference genome, and the development is too rapid for this document to provide an up–to–date list. We recommend consulting benchmarking papers that discuss the advantages and disadvantages of each software, which include accuracy, ability to align reads over splice junctions, speed, memory footprint, and many other features.

The reads for this experiment were aligned to the Ensembl release 66 mouse reference genome using TopHat, together with a GTF (gene annotation) file obtained from the same Ensembl release. The Python library HTSeq was then used to count the reads aligned to each feature. All of these steps are treated in detail in the references given above.

For a full example of using the *HTSeq* Python package for read counting, please see the *pasilla* vignette. For an example of generating the *DESeqDataSet* from files produced by *htseq–count*, please see the *DESeq2* vignette.

We now obtain the count table of the experiment directly from a pre–saved file.

```
load(url("http://www-huber.embl.de/users/klaus/geneCounts.RData"))
DESeq2Table
```

```
## class: DESeqDataSet
## dim: 37991 8
## metadata(1): version
## assays(1): counts
## rownames(37991): ENSMUSG00000000001 ENSMUSG00000000003 ... ENSMUSG00000093788
## ENSMUSG00000093789
## rowData names(0):
## colnames(8): WT-SRR1042885 WT-SRR1042886 ... Del_8_17_homo-SRR1042891
## Del_8_17_homo-SRR1042892
## colData names(3): condition sampleNO fastq
```

Bioconductor software packages often define and use a custom class for their data object, which makes sure that all the needed data slots are consistently provided and fulfill the requirements. In addition, Bioconductor has general data classes (such as the *SummarizedExperiment*) that can be used to move data between packages.

In *DESeq2*, the custom class is called *DESeqDataSet*. It is built on top of the *SummarizedExperiment* class (in technical term, it is a subclass), and it is easy to convert *SummarizedExperiment* instances into *DESeqDataSet* and vice versa.

One of the main differences is that the `assay` slot is instead accessed using the `counts` accessor, and the class enforces that the values in this matrix are non–negative integers.

Here we show the component parts of a *SummarizedExperiment* object, and also its subclasses, such as the *DESeqDataSet* which we downloaded.

The `assay(s)` (red block) contains the matrix (or matrices) of summarized values, the `rowData` (blue block) contains information about the genomic ranges, and the `colData` (purple block) contains information about the samples or experiments. The highlighted line in each block represents the first row (note that the first row of `colData` lines up with the first column of the `assay`).

We can investigate the content of the downloaded *DESeqDataSet* by looking at the counts in the `assay` slot, the phenotypic data about the samples in `colData` (sample information), and the data about the genes in the `rowData` (information on genomic features) slot.

`DESeq2Table`

```
## class: DESeqDataSet
## dim: 37991 8
## metadata(1): version
## assays(1): counts
## rownames(37991): ENSMUSG00000000001 ENSMUSG00000000003 ... ENSMUSG00000093788
## ENSMUSG00000093789
## rowData names(0):
## colnames(8): WT-SRR1042885 WT-SRR1042886 ... Del_8_17_homo-SRR1042891
## Del_8_17_homo-SRR1042892
## colData names(3): condition sampleNO fastq
```

`head(assay(DESeq2Table))`

```
## WT-SRR1042885 WT-SRR1042886 WT-SRR1042887 WT-SRR1042888 Del_8_17_homo-SRR1042889
## ENSMUSG00000000001 2478 2010 4371 4222 3115
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 492 377 913 909 647
## ENSMUSG00000000031 15493 9791 21816 21350 16111
## ENSMUSG00000000037 108 87 127 140 114
## ENSMUSG00000000049 1 0 2 1 0
## Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892
## ENSMUSG00000000001 1346 2599 7895
## ENSMUSG00000000003 0 0 0
## ENSMUSG00000000028 265 542 1674
## ENSMUSG00000000031 7307 14793 40392
## ENSMUSG00000000037 43 95 301
## ENSMUSG00000000049 0 0 4
```

`colSums(assay(DESeq2Table))`

```
## WT-SRR1042885 WT-SRR1042886 WT-SRR1042887 WT-SRR1042888
## 8080886 6138993 13637002 13292699
## Del_8_17_homo-SRR1042889 Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892
## 9796176 4272191 8337066 25035865
```

`colData(DESeq2Table)`

```
## DataFrame with 8 rows and 3 columns
## condition sampleNO fastq
## <factor> <factor> <factor>
## WT-SRR1042885 WT SRR1042885 SRR1042885.fastq.gz
## WT-SRR1042886 WT SRR1042886 SRR1042886.fastq.gz
## WT-SRR1042887 WT SRR1042887 SRR1042887.fastq.gz
## WT-SRR1042888 WT SRR1042888 SRR1042888.fastq.gz
## Del_8_17_homo-SRR1042889 Del_8_17_homo SRR1042889 SRR1042889.fastq.gz
## Del_8_17_homo-SRR1042890 Del_8_17_homo SRR1042890 SRR1042890.fastq.gz
## Del_8_17_homo-SRR1042891 Del_8_17_homo SRR1042891 SRR1042891.fastq.gz
## Del_8_17_homo-SRR1042892 Del_8_17_homo SRR1042892 SRR1042892.fastq.gz
```

`rowData(DESeq2Table)`

`## DataFrame with 37991 rows and 0 columns`

Note that the `rowData` slot is a *GRangesList*, which contains all the information about the exons for each gene, i.e., for each row of the count matrix. It can also contain additional annotation for each gene, e.g. a gene description and a gene symbol.

`mcols(rowData(DESeq2Table))`

`## DataFrame with 0 rows and 2 columns`


At this point, we have counted the reads which overlap the genes in the gene model we specified. Note that the class `DataFrame` is a Bioconductor variant of the standard `data.frame` class in `R`, which has, among other things, a more sensible print function that shows only the head and the tail of the table.

**Before we proceed, we now correct an annotation error: the samples SRR1042891 and SRR1042886 have the opposite genotype.**

```
## swap the condition labels of the two misannotated samples
con <- as.character(colData(DESeq2Table)$condition)
con[2] <- "Del_8_17_homo"   # WT-SRR1042886 is actually a deletion sample
con[7] <- "WT"              # Del_8_17_homo-SRR1042891 is actually WT
colData(DESeq2Table)$condition <- factor(con)
```

We can now move on to the actual differential expression analysis steps after specifying an appropriate *design*.

The *DESeqDataSet* has an associated *design formula*. The design is specified at the beginning of the analysis, as it will inform many of the *DESeq2* functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on the design formula). The design formula tells which variables in the column metadata table (`colData`) specify the experimental design and how these factors should be used in the analysis.

The simplest design formula for differential expression would be `~ condition`, where `condition` is a column in `colData(DESeq2Table)` that specifies which of two (or more) groups each sample belongs to. This is also what we will be using for the data at hand.

You can use `R`’s formula notation to express any experimental design that can be described within an ANOVA–like framework. Note that *DESeq2* uses the same formula notation as, for instance, the *lm* function of base `R`. If the question of interest is whether a fold change due to treatment is different across groups, interaction terms can be included using models such as `~ group + treatment + group:treatment`. More complex designs such as these are covered in the *DESeq2* vignette.
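
As a small illustration of how a design formula is expanded into a design matrix, base `R`’s `model.matrix` can be applied to a toy factor (made up here for illustration, not the data at hand):

```r
## toy two-group factor, analogous to the condition column in colData;
## "WT" is the base level, so the second column indicates the "Del" samples
condition <- factor(c("WT", "WT", "Del", "Del"), levels = c("WT", "Del"))

mm <- model.matrix(~ condition)
colnames(mm)            # "(Intercept)" and "conditionDel"
mm[, "conditionDel"]    # 0 for WT samples, 1 for Del samples
```

The coefficient corresponding to the `conditionDel` column is what is tested per gene in a two-group comparison.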

Having created a count table and imported it into a *DESeqDataSet* object, the next step in the workflow is quality control of the data. Let’s check how many genes we capture by counting the number of genes that have non–zero counts in all samples.

```
GeneCounts <- counts(DESeq2Table)
idx.nz <- apply(GeneCounts, 1, function(x) { all(x > 0)})
sum(idx.nz)
```

`## [1] 16149`

```
### random sample from the count matrix
nz.counts <- subset(GeneCounts, idx.nz)
sam <- sample(dim(nz.counts)[1], 5)
nz.counts[sam, ]
```

```
## WT-SRR1042885 WT-SRR1042886 WT-SRR1042887 WT-SRR1042888 Del_8_17_homo-SRR1042889
## ENSMUSG00000037363 38 15 58 52 47
## ENSMUSG00000029060 333 255 498 554 381
## ENSMUSG00000078247 58 57 71 96 73
## ENSMUSG00000045095 644 531 1105 1117 800
## ENSMUSG00000071506 2 9 11 12 11
## Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892
## ENSMUSG00000037363 23 36 100
## ENSMUSG00000029060 175 326 951
## ENSMUSG00000078247 37 57 190
## ENSMUSG00000045095 336 682 2124
## ENSMUSG00000071506 4 8 19
```

In a typical RNA-Seq experiment, there will be at least several thousand genes that are expressed in all samples. If the number of non–zero genes is very low, there is usually something wrong with at least some of the samples (e.g. the sample preparation was not done properly, or there was some problem with the sequencing).

As different libraries will be sequenced to different depths, offsets are built into the statistical model of *DESeq2* to ensure that parameters are comparable.

The term normalization is often used for this, but it should be noted that the raw read counts are not actually altered. *DESeq2* defines a virtual reference sample by taking the geometric mean of each gene’s counts across samples, and then computes the size factor of each sample as the median of the ratios of its counts to the reference sample.

Generally, the ratios of the size factors should roughly match the ratios of the library sizes. Dividing each column of the count table by the corresponding size factor yields normalized count values, which can be scaled to give a counts per million interpretation.

Thus, if all size factors are roughly equal to one, the libraries have been sequenced equally deeply.
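
The median–of–ratios computation can be sketched in a few lines of base `R` on a toy count matrix (illustration only; on real data `estimateSizeFactors` does this, with additional care for genes containing zero counts):

```r
## toy counts: sample B is "sequenced" twice as deeply as sample A
cts <- matrix(c(100, 200,
                 50, 100,
                 10,  20),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("gene", 1:3), c("A", "B")))

## virtual reference sample: per-gene geometric mean across samples
geoMeans <- exp(rowMeans(log(cts)))

## size factor of a sample: median of its ratios to the reference
sf <- apply(cts, 2, function(cnts) median(cnts / geoMeans))
sf / sf["A"]   # B's factor is twice A's, matching the library sizes
```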

Count data sets typically show a (left–facing) trombone shape in mean–difference plots (MA–plots), reflecting the higher variability of log ratios at lower counts. In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide.

In the code below we produce density estimates of the sample counts and pairwise mean–difference plots after the computation of the size factors. The MA–plots are saved to a file in order to avoid cluttering the document.

We first relevel the condition factor to make sure that we have the wild type samples as the base level. This ensures that the fold changes are always computed as deletion versus WT. We also rename the deletion level to “Del”.

```
## recode the condition factor: rename "Del_8_17_homo" to "Del" and
## make "WT" the base (first) level
colData(DESeq2Table)$condition <- factor(
  ifelse(colData(DESeq2Table)$condition == "WT", "WT", "Del"),
  levels = c("WT", "Del"))
```

We now estimate the size factors and print them:

```
DESeq2Table <- estimateSizeFactors(DESeq2Table)
sizeFactors(DESeq2Table)
```

```
## WT-SRR1042885 WT-SRR1042886 WT-SRR1042887 WT-SRR1042888
## 0.8362 0.6322 1.3975 1.3709
## Del_8_17_homo-SRR1042889 Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892
## 1.0086 0.4473 0.8602 2.6086
```

To assess whether the normalization has worked, we plot the densities of counts for the different samples. Since most of the genes are not (heavily) affected by the experimental conditions, a successful normalization will lead to overlapping densities.

```
multidensity(counts(DESeq2Table, normalized = TRUE)[idx.nz, ],
             xlab = "mean counts", xlim = c(0, 1000))
```

This is indeed the case here. The same holds true for the empirical cumulative distribution functions (ECDFs), which can be thought of as integrals of the densities and give the probability of observing a count value less than or equal to \(x\).

In an ECDF plot, the estimated probability is plotted on the y–axis and the count values on the x–axis, i.e. you can read off the median and other quantiles from this plot.
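
As a minimal base `R` illustration of reading values off an ECDF (toy numbers, not counts from this data set):

```r
x <- c(2, 4, 6, 8, 10)
Fn <- ecdf(x)

Fn(6)              # estimated probability of observing a value <= 6
quantile(x, 0.5)   # the median: where the ECDF crosses 0.5
```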

As already mentioned, if the normalization has worked, the ECDFs of the different samples should be overlapping.

```
multiecdf(counts(DESeq2Table, normalized = TRUE)[idx.nz, ],
          xlab = "mean counts", xlim = c(0, 1000))
```

To further assess systematic differences between the samples, we can also produce pairwise mean–difference plots (MA plots): we plot the average of the log–transformed counts vs the log fold change per gene for each of the sample pairs.

The function `combn` helps us to automatically generate all sample pairs, and the function `MDPlot` from the *EDASeq* package then generates the pairwise MA plots.

```
pdf("pairwiseMAs.pdf")
MA.idx <- t(combn(1:8, 2))
for (i in seq_along(MA.idx[, 1])) {
  MDPlot(counts(DESeq2Table, normalized = TRUE)[idx.nz, ],
         c(MA.idx[i, 1], MA.idx[i, 2]),
         main = paste(colnames(DESeq2Table)[MA.idx[i, 1]], " vs ",
                      colnames(DESeq2Table)[MA.idx[i, 2]]),
         ylim = c(-3, 3))
}
dev.off()
```

```
## png
## 2
```

The plots look good: if there is no systematic shift between the samples, the log fold changes should scatter around zero, and this is the case here.

A useful first step in an RNA–Seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit the expectation from the experiment’s design? We use the function `dist` to calculate the Euclidean distance between samples. To avoid the distance measure being dominated by a few highly variable genes, and to have a roughly equal contribution from all genes, we apply it to the regularized log–transformed data.

The aim of the regularized log–transform is to stabilize the variance of the data and to make its distribution roughly symmetric. Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal–component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observable quantity (here, the expression strength of a gene) does not depend on the mean.

In RNA–Seq data, however, variance grows with the mean. For example, if one performs PCA directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples.

A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with low counts tend to dominate the results because, due to the strong Poisson noise inherent to small count values, they show the strongest relative differences between samples. Note that this effect can be diminished by adding a relatively high number of pseudocounts, e.g. 32, since this will also substantially reduce the variance of the fold changes.
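
A quick numerical sketch of the pseudocount effect, using made-up counts for a single low-count gene:

```r
## a low-count gene with counts 2 vs 8 in two samples
a <- 2; b <- 8

log2((b + 1)  / (a + 1))    # small pseudocount: large apparent log ratio
log2((b + 32) / (a + 32))   # large pseudocount: log ratio strongly shrunk
```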

As a solution, *DESeq2* offers the regularized–logarithm transformation, or rlog for short. For genes with high counts, the rlog transformation does not differ much from an ordinary log2 transformation.

For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. Using an empirical Bayesian prior in the form of a ridge penalty, this is done such that the rlog–transformed data are approximately homoskedastic. Note that the rlog transformation is provided for applications other than differential testing. For differential testing it is always recommended to apply the DESeq function to raw counts.

Note the use of the function `t` to transpose the data matrix. We need this because `dist` calculates distances between data rows, and our samples constitute the columns. We visualize the distances in a heatmap, using the function `heatmap.2` from the *gplots* package.

Note that we have set the row names of the distance matrix to the genotype and the column names to the sample ID, so that we have all this information in view when looking at the heatmap.

```
rld <- rlogTransformation(DESeq2Table, blind = TRUE)
distsRL <- dist(t(assay(rld)))    # Euclidean distances between samples
mat <- as.matrix(distsRL)
rownames(mat) <- colData(rld)$condition
colnames(mat) <- colData(rld)$sampleNO
hmcol <- colorRampPalette(brewer.pal(9, "Blues"))(255)
heatmap.2(mat, trace = "none", col = rev(hmcol), margins = c(13, 13))
```

Another way to visualize sample–to–sample distances is a principal–components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out optimally.

```
ntop <- 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]
PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)
```

Since PCA can be slightly problematic with high–dimensional data, we first select only the 500 genes showing the highest variance.

We then use the function `prcomp` to compute the principal components. The resulting principal component scores, i.e. the coordinates of the samples in the space of the principal components, can then be plotted.

For this we use the function `qplot` from *ggplot2*. *ggplot2* is a comprehensive plotting package for R, and very good tutorials are available online. The function `qplot` is meant to mimic the standard plotting function `plot` as closely as possible, but additional changes can be made, e.g. lattice–type graphics (splitting the plot by a factor of interest) can easily be generated. Additional elements are changed by using `+` and the corresponding commands. In our PCA plot, we change the plotting colors used and the axis labels.

```
dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     sampleNO = colData(rld)$sampleNO,
                     condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition,
       main = "PC1 vs PC2, top variable genes", size = I(6))
 + labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4)))
 + scale_colour_brewer(type = "qual", palette = 2))
```

We can clearly identify two outliers in the PCA plot, one in each experimental group. These two outliers essentially drive the separation along the first principal component. Thus, they will most likely increase the overall variability and diminish statistical power later when testing for differential expression; therefore we remove them. We can identify them by selecting only those samples that have a value greater than zero for the first principal component score. This can conveniently be done using the function `subset` and the logical vector `dataGG$PC1 > 0`: `subset` will then return the sample names at the positions where `dataGG$PC1 > 0` evaluates to `TRUE`.

```
outliers <- as.character(subset(colnames(DESeq2Table), dataGG$PC1 > 0))
outliers
```

`## [1] "WT-SRR1042888" "Del_8_17_homo-SRR1042890"`

`DESeq2Table <- DESeq2Table[, !(colnames(DESeq2Table) %in% outliers)] `

We now produce the cluster heatmap and the PCA plot without the outliers.

```
rld <- rlogTransformation(DESeq2Table, blind = TRUE)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colData(rld)$condition
colnames(mat) <- colData(rld)$sampleNO
hmcol <- colorRampPalette(brewer.pal(9, "OrRd"))(255)
heatmap.2(mat, trace = "none", col = rev(hmcol), margins = c(13, 13))
```

```
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]
PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)

dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     sampleNO = colData(rld)$sampleNO,
                     condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition,
       main = "PC1 vs PC2, top variable genes, no outliers", size = I(6))
 + labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4)))
 + scale_colour_brewer(type = "qual", palette = 4))
```

The separation between the two groups is not extremely clear: they do not cluster perfectly in the heatmap, and they are separated only by the second principal component.

This will most likely result in only a modest number of differentially expressed genes and is in line with the small number of differentially expressed genes (~ 100) the authors report in the original publication.

It is known that a standard Poisson model can only account for the technical noise in RNA–Seq data. In the Poisson model, the variance is equal to the mean, while in real RNA–Seq data the variance is greater than the mean, a phenomenon often encountered in count data in general and referred to as “overdispersion”.

A popular way to model this is to use a negative binomial (NB) distribution, which includes an additional dispersion parameter \(\alpha\) such that \(E(\mathrm{NB}) = \mu\) and

\[ \operatorname{Var}[\mathrm{NB}(\mu, \alpha)] = \mu + \alpha \cdot \mu^2 \]

Hence, the variance is greater than the mean. *DESeq2* uses the NB model as well; thus, the first step in the analysis of differential expression is to obtain an estimate of the dispersion parameter for each gene. The typical shape of the dispersion fit is an exponentially decaying curve. The asymptotic dispersion for highly expressed genes can be seen as a measurement of biological variability in the sense of a squared coefficient of variation: a dispersion value of 0.01 means that the gene’s expression tends to differ by typically \(\sqrt{0.01}\) = 10% between samples of the same treatment group.
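
The NB mean–variance relationship can be checked by simulation with base `R`’s `rnbinom`, which parameterizes the dispersion via `size = 1/alpha` (toy parameters, for illustration):

```r
## simulate NB counts and compare the empirical variance to mu + alpha * mu^2
set.seed(1)
mu <- 100
alpha <- 0.05
x <- rnbinom(2e5, mu = mu, size = 1 / alpha)

c(empirical = var(x), theoretical = mu + alpha * mu^2)
```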

For weak genes, the Poisson noise is an additional source of noise, which is added to the dispersion. The function `plotDispEsts` visualizes *DESeq2*’s dispersion estimates:

```
DESeq2Table <- estimateDispersions(DESeq2Table)
plotDispEsts(DESeq2Table)
```

The black points are the dispersion estimates for each gene as obtained by considering the information from each gene separately. Unless one has many samples, these values fluctuate strongly around their true values.

Therefore, the red trend line is fitted, which shows the dispersions’ dependence on the mean; each gene’s estimate is then shrunk towards the red line to obtain the final estimates (blue points) that are used in the hypothesis test.

The blue circles above the main “cloud” of points are genes which have high gene–wise dispersion estimates which are labelled as dispersion outliers. These estimates are therefore not shrunk toward the fitted trend line.

The warnings just indicate that the dispersion estimation failed for some genes.

We can now perform the statistical testing for differential expression and extract its results. Calling `nbinomWaldTest` performs the test for differential expression, while the call to the `results` function extracts the results of the test and returns adjusted \(p\)–values according to the Benjamini–Hochberg rule to control the FDR. The test performed is a Wald test, which is a test for coefficients in a regression model. It is based on a z–score, i.e. a test statistic that is \(N(0,1)\)–distributed under \(H_0\).

```
DESeq2Table <- nbinomWaldTest(DESeq2Table)
DESeq2Res <- results(DESeq2Table, pAdjustMethod = "BH")
table(DESeq2Res$padj < 0.1)
```

```
##
## FALSE TRUE
## 17525 147
```

We identify 147 differentially expressed genes.

In addition to this, *DESeq2* performs automated independent filtering: for weakly expressed genes, we have no chance of seeing differential expression, because the low read counts suffer from such high Poisson noise that any biological effect is drowned in the uncertainties from the read counting.

At first sight, there may seem to be little benefit in filtering out these genes. After all, the test found them to be non–significant anyway. However, these genes have an influence on the multiple testing procedure, whose performance commonly improves if such genes are removed. By removing the weakly–expressed genes from the input to the BH–FDR procedure, we can find more genes to be significant among those which we keep, and so improve the power of our test. This approach is known as independent filtering.
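
A toy sketch of why filtering helps the BH procedure (made-up \(p\)–values, not this data set; a real filter must of course be chosen independently of the test statistic): removing uninformative tests lowers the multiple-testing burden for the tests that remain.

```r
## 3 promising tests among 97 hopeless ones
pvals <- c(0.001, 0.004, 0.01, rep(0.5, 97))

padj.all  <- p.adjust(pvals,       method = "BH")  # adjust among all 100
padj.filt <- p.adjust(pvals[1:50], method = "BH")  # adjust after dropping 50

## the surviving tests get uniformly smaller adjusted p-values
rbind(all = padj.all[1:3], filtered = padj.filt[1:3])
```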

The *DESeq2* software automatically performs independent filtering which maximizes the number of genes which will have a BH–adjusted \(p\)–value less than a critical value (by default, alpha is set to 0.1). This automatic independent filtering is performed by, and can be controlled by, the results function. We can observe how the number of rejections changes for various cutoffs based on mean normalized count. The following optimal threshold and table of possible values is stored as an attribute of the results object.

```
plot(metadata(DESeq2Res)$filterNumRej, type = "b",
     xlab = "quantiles of 'baseMean'", ylab = "number of rejections")
```

The term **independent** highlights an important caveat. Such filtering is permissible only if the filter criterion is independent of the actual test statistic under the null–hypothesis.

Otherwise, the filtering would invalidate the test and consequently the assumptions of the BH procedure. This is why we filtered on the average over all samples: this filter is blind to the assignment of samples to the deletion and control group and hence independent under the null hypothesis of equal expression.

\(p\)–values computed under the null hypothesis follow a uniform distribution on the unit interval [0, 1], provided they are computed using a continuous null distribution. Significant \(p\)–values thus become visible as an enrichment of \(p\)–values near zero in the histogram.

Thus, a histogram of “correctly” computed \(p\)–values will have a rectangular shape with a peak at 0.

A histogram of \(p\)–values should always be plotted in order to check whether they have been computed correctly. We also do this here:

`hist(DESeq2Res$pvalue, col = "lavender", main = "WT vs Deletion", xlab = "p-values")`

We can see that this is clearly not the case for the \(p\)–values returned by *DESeq2* in this case.

Very often, if the assumed variance of the null distribution is too high, we see a hill–shaped \(p\)–value histogram. If the variance is too low, we get a U–shaped histogram, with peaks at both ends.

Here we have a hill–shape, indicating an overestimation of the variance in the null distribution. Thus, the \(N(0,1)\) null distribution of the Wald test is not appropriate here.

The dispersion estimation is not condition specific; *DESeq2* provides only a single dispersion estimate per gene. This is sensible, since the number of replicates is usually low.

However, if we have e.g. batches or “outlying” samples that are consistently a bit different from the others within a group, the dispersion within an experimental group can be different, and a single dispersion parameter may not be appropriate.

For an example of the estimation of multiple dispersions, see the analysis performed in Reyes et al., “Drift and conservation of differential exon usage across tissues in primate species”, 2013.

Fortunately, there is software available to estimate the variance of the null–model from the test statistics. This is commonly referred to as “empirical null modelling”.

Here we use the *fdrtool* package for this, with the Wald statistic as input. This package returns the estimated null variance, as well as estimates of various other FDR–related quantities, and the \(p\)–values computed using the estimated null model parameters.

We first remove the genes filtered out by independent filtering and the dispersion outliers: they have NA adjusted \(p\)–values and NA \(p\)–values, respectively.

```
DESeq2Res <- DESeq2Res[ !is.na(DESeq2Res$padj), ]
DESeq2Res <- DESeq2Res[ !is.na(DESeq2Res$pvalue), ]
```

We now remove the original adjusted \(p\)–values, since we will add the corrected ones later on.

`DESeq2Res <- DESeq2Res[, -which(names(DESeq2Res) == "padj")]`

We can now use the z–scores returned by *DESeq2* as input to *fdrtool* to re–estimate the \(p\)–values.

`FDR.DESeq2Res <- fdrtool(DESeq2Res$stat, statistic = "normal", plot = TRUE)`

```
## Step 1... determine cutoff point
## Step 2... estimate parameters of null distribution and eta0
## Step 3... compute p-values and estimate empirical PDF/CDF
## Step 4... compute q-values and local fdr
## Step 5... prepare for plotting
```

`FDR.DESeq2Res$param[1, "sd"]`

```
## sd
## 0.7888
```

Interestingly, the estimated standard deviation of the null model is 0.7888, which is lower than \(1\), the theoretical value.
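
To make the effect of the empirical null concrete, here is a small sketch of how two–sided \(p\)–values change when the null standard deviation is 0.7888 instead of 1 (the z–scores are made up, and we assume here that \(p\)–values are computed from a \(N(0, \mathrm{sd}^2)\) null; the exact internals of *fdrtool* may differ):

```r
## estimated null sd from the fdrtool output above
sd.null <- 0.7888

## hypothetical Wald statistics (made up for illustration)
z <- c(-3.1, 0.5, 2.2)

p.theoretical <- 2 * pnorm(-abs(z))                # N(0, 1) null
p.empirical   <- 2 * pnorm(-abs(z), sd = sd.null)  # empirical null

## the narrower null makes the same statistics more extreme,
## so the empirical p-values are uniformly smaller
rbind(p.theoretical, p.empirical)
```

This is why correcting the null model gains power: every test statistic becomes more extreme relative to the narrower null.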

We now add the new BH–adjusted \(p\)–values to the results data frame.

`DESeq2Res[,"padj"] <- p.adjust(FDR.DESeq2Res$pval, method = "BH")`
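
As a reminder, the Benjamini–Hochberg adjustment multiplies the \(i\)–th smallest \(p\)–value by \(n/i\) and then enforces monotonicity from the largest value downwards. A toy example with made–up \(p\)–values:

```r
## made-up p-values, already sorted for simplicity
p <- c(0.001, 0.008, 0.039, 0.041, 0.60)
n <- length(p)

## multiply the i-th smallest p-value by n/i, then enforce monotonicity
bh.manual <- rev(cummin(rev(p * n / seq_len(n))))
bh.manual
## [1] 0.00500 0.02000 0.05125 0.05125 0.60000

all.equal(bh.manual, p.adjust(p, method = "BH"))
```

Note how the third value is pulled down to 0.05125 by the monotonicity step.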

We can now plot the histogram of the “correct” \(p\)–values.

```
hist(FDR.DESeq2Res$pval, col = "royalblue4",
     main = "WT vs Deletion, correct null model", xlab = "CORRECTED p-values")
```

As we can see, the null model is now correct and the histogram has the expected shape.

We can now extract the number of differentially expressed genes and produce an MA plot of the two–group comparison as shown in figure 4a of the original paper. (“M” for minus, because a log ratio is equal to log minus log, and “A” for average).

Similar to its usage in the initial quality control, an MA plot provides a useful overview of an experiment with a two–group comparison:

The plot represents each gene with a dot. The \(x\)–axis shows the average expression over all samples, the \(y\)–axis the log2 fold change between treatment and control. Genes with an FDR value below a threshold (here 0.1) are shown in red. The plot illustrates that only genes with a sufficiently large average normalized count contain enough information to yield a significant call.

Also note *DESeq2*’s shrinkage estimation of log fold changes (LFCs): When count values are too low to allow an accurate estimate of the LFC, the value is “shrunken” towards zero to avoid that these values, which otherwise would frequently be unrealistically large, dominate the top–ranked log fold changes. Thus, in contrast to the “trumpet shape” of the MA plot in the original paper (they used the original *DESeq*), we obtain a “cigar-shaped” MA plot.
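
To make the “M” and “A” in the name concrete, here is a minimal sketch with made–up mean counts for two conditions (not data from this experiment):

```r
## hypothetical mean normalized counts for three genes in two conditions
ctrl  <- c(10, 200, 1500)
treat <- c(12, 400, 1500)

M <- log2(treat) - log2(ctrl)        # "minus": the log2 fold change
A <- (log2(treat) + log2(ctrl)) / 2  # "average": mean log2 expression

## gene 2 doubles (M = 1), gene 3 is unchanged (M = 0)
plot(A, M, pch = 19, xlab = "A (mean log2 counts)", ylab = "M (log2 fold change)")
```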

`table(DESeq2Res[,"padj"] < 0.1)`

```
##
## FALSE TRUE
## 17369 303
```

`plotMA(DESeq2Res)`

We now identify 303 differentially expressed genes at an FDR of \(0.1\).

We can now check the overlap with the original results of the paper. In the original publication 100 genes with an FDR value of less than \(0.05\) were identified. The excel table “ng.2971–S3.xls” containing their FC and \(p\)–values is available as supplementary table 4 of the original publication. We convert the table into a `.csv` file and import it.

```
paperRes <- read.csv("ng.2971-S3.csv", skip = 1, header = TRUE)
names(paperRes)
```

```
## [1] "Gene_ID" "baseMean" "baseMean.WT" "baseMean.DEL" "fold.change"
## [6] "log2FoldChange" "pval" "padj" "GROUP" "Gene.name"
## [11] "chr" "start.mm10." "end.mm10." "strand" "Gene.name.1"
## [16] "KEGG" "GO"
```

We now extract our significant genes and their annotation.

```
sigGenes <- rownames(subset(DESeq2Res, padj < 0.1))
anno <- AnnotationDbi::select(org.Mm.eg.db,
                              keys = rownames(DESeq2Res),
                              columns = c("SYMBOL", "GENENAME"),
                              keytype = "ENSEMBL")
anSig <- as.data.frame(subset(anno, ENSEMBL %in% sigGenes))
sample_n(anSig, 5)
```

```
## ENSEMBL SYMBOL GENENAME
## 11862 ENSMUSG00000047935 Gm5607 predicted gene 5607
## 7750 ENSMUSG00000032399 Rpl4 ribosomal protein L4
## 13496 ENSMUSG00000058799 Nap1l1 nucleosome assembly protein 1-like 1
## 14631 ENSMUSG00000069516 Lyz2 lysozyme 2
## 11735 ENSMUSG00000047215 Rpl9 ribosomal protein L9
```

We now extract the genes identified in the original paper that we also call differentially expressed.

```
overlap <- subset(paperRes, Gene.name %in% anSig$SYMBOL )
dim(overlap)
```

`## [1] 231 17`

We can now ask how many of the genes that we identify at FDR < 0.1 also had FDR < 0.1 in the original paper, and obtain the total number of genes with FDR < 0.1 in the original paper (there are 116).

`dim(subset(overlap, padj < 0.1))[1]`

`## [1] 105`

`dim(subset(paperRes, padj < 0.1))[1]`

`## [1] 116`

Out of the 303 differentially expressed genes that we identify at an FDR of \(0.1\), 105 have also been identified in the original paper (out of 116 in total in the original paper with an FDR of \(0.1\)).

The MA–plot also looks very similar to the paper plot. Thus, we can reproduce the paper results with a slightly different software setting and can gain power by correcting the null model. The “blood” and “ribosomal proteins” parts of the MA–plot marked in the paper figure are also clearly visible.

A quick way to visualize the counts for a particular gene is to use the `plotCounts` function, which takes as arguments the `DESeqDataSet`, a gene name, and the group over which to plot the counts. Some of the differentially expressed genes identified via RNA–Seq have been validated by qPCR (see Figure 4b of the paper). Here, we check the expression of Nr2f1.

```
Nr2f1ENSEMBL <- subset(anSig, SYMBOL == "Nr2f1")$ENSEMBL
plotCounts(DESeq2Table, gene=Nr2f1ENSEMBL, intgroup=c("condition"))
```

We can now try to characterize the identified differentially expressed genes a bit better by performing a GO enrichment analysis. Essentially, the Gene Ontology (GO) is a hierarchically organized collection of functional gene sets. For a nice introduction to the GO see

The function `genefinder` from the *genefilter* package will be used to find background genes that are similar in expression to the differentially expressed genes. The function tries to identify 10 genes for each DE gene that match its expression strength.

We then check whether the background has roughly the same distribution of average expression strength as the foreground by plotting the densities.

We do this in order not to select a biased background, since the gene set testing is performed by a simple Fisher test on a 2x2 table, which uses only the status of a gene, i.e. whether it is differentially expressed or not, and not its fold–change or absolute expression.

Note that for RNA–Seq data, the chance of a gene being identified as DE will most probably depend on its gene expression (and potentially also its gene length etc.). Thus it is important to find a matching background. Our testing approach here is very similar to web tools like DAVID; however, we explicitly model the background here.
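
To illustrate the 2x2 testing scheme, here is a toy example for a single GO category; all counts are made up:

```r
## toy 2x2 table for one GO category (counts are made up):
## rows: DE vs. background genes; columns: in / not in the category
tab <- matrix(c(30, 273, 120, 2317), nrow = 2, byrow = TRUE,
              dimnames = list(c("DE", "background"), c("inGO", "notInGO")))
tab

## a small p-value indicates enrichment of the category among the DE genes
fisher.test(tab)$p.value
```

Here the category contains about 10% of the DE genes but only about 5% of the background, so the test reports enrichment. If the background were biased towards, say, highly expressed genes, such a comparison would be misleading, which is why the matching below matters.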

We first get average gene expressions for each of the genes and then find non–DE genes that show a similar expression as the DE–genes. These genes are then our background.

```
overallBaseMean <- as.matrix(DESeq2Res[, "baseMean", drop = FALSE])

sig_idx <- match(anSig$ENSEMBL, rownames(overallBaseMean))

backG <- c()

for (i in sig_idx) {
  ind <- genefinder(overallBaseMean, i, 10, method = "manhattan")[[1]]$indices
  backG <- c(backG, ind)
}

backG <- unique(backG)
backG <- rownames(overallBaseMean)[backG]
```

We now remove the DE genes from the background and then get the total number of genes in the background.

```
backG <- setdiff(backG, anSig$ENSEMBL)
length(backG)
```

`## [1] 2437`

Plotting the densities of the average expressions shows that the background matching has worked reasonably well.

```
multidensity(list(
    all        = log2(DESeq2Res[, "baseMean"]),
    foreground = log2(DESeq2Res[anSig$ENSEMBL, "baseMean"]),
    background = log2(DESeq2Res[backG, "baseMean"])),
  xlab = "log2 mean normalized counts", main = "Matching for enrichment analysis")
```