```{r show_Data, dependson="load_counts"}
DESeq2Table
head(assay(DESeq2Table))
colSums(assay(DESeq2Table))
colData(DESeq2Table)
rowData(DESeq2Table)
```

Note that the `rowData` slot is a *GRangesList*, which
contains all the information about the exons for each gene, i.e., for each row
of the count matrix. It also contains additional annotation for each gene, i.e.
a gene description and a gene symbol.
```{r, dependson="load_counts"}
mcols(rowData(DESeq2Table))
```

The `colData` slot contains all the sample metadata.
```{r, dependson="load_counts"}
colData(DESeq2Table)
```

At this point, we have counted the reads which overlap the genes in
the gene model we specified. Note that the class `DataFrame` is a
[Bioconductor](http://bioconductor.org/) variant of the standard `data.frame` class in `R`,
which has, among other things, a more sensible print function that shows only
the head and the tail of the table.
**Before we proceed, we correct an annotation error: the samples
SRR1042891 and SRR1042886 have been assigned the opposite genotype.**
```{r correct_anno_error, dependson="load_counts", eval=TRUE}
con <- as.character(colData(DESeq2Table)$condition)
con[2] <- "Del_8_17_homo"
con[7] <- "WT"
colData(DESeq2Table)$condition <- factor(con)
```
We can now move on to the actual differential
expression analysis steps after specifying an appropriate *design*.
# Experimental design and the design formula
The *DESeqDataSet* has an associated
*design formula*. The design is specified at the beginning of the
analysis, as it will inform many of the *DESeq2* functions how to
treat the samples in the analysis (one exception is the size factor
estimation, i.e., the adjustment for differing library sizes, which
does not depend on the design formula). The design formula tells
which variables in the column metadata table (`colData`) specify the
experimental design and how these factors should be used in the
analysis.
The simplest design formula for differential expression would be
`~ condition`, where `condition` is a column in `colData(DESeq2Table)` which
specifies which of two (or more groups) the samples belong to. This
is also what we will be using for the data at hand.
You can use `R`'s formula notation to express any experimental design
that can be described within an ANOVA--like framework. Note that
*DESeq2* uses the same formula notation as, for instance, the *lm*
function of base `R`. If the question of interest is whether a fold
change due to treatment is different across groups, interaction terms
can be included using models such as
`~ group + treatment + group:treatment`.
More complex designs such as these are covered in
the `r Biocpkg("DESeq2")` vignette.
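For illustration, the design formula can be inspected or changed via the `design`
accessor of *DESeq2*. The chunk below is a minimal sketch and is not evaluated,
since it only restates the design choices discussed above:
```{r design_sketch, eval=FALSE}
## inspect the current design formula of the DESeqDataSet
design(DESeq2Table)
## set a simple one-factor design
design(DESeq2Table) <- ~ condition
## an interaction design for a two-factor experiment would be, e.g.:
## design(DESeq2Table) <- ~ group + treatment + group:treatment
```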
# Quality control and Normalization of the count data
Having created a count table and fed it into a DESeq2 object, the
next step in the workflow is the quality control of the data. Let's check
how many genes we capture by counting the number of genes that have
non--zero counts in all samples.
```{r, non_zero_counts, dependson="load_counts"}
GeneCounts <- counts(DESeq2Table)
idx.nz <- apply(GeneCounts, 1, function(x) { all(x > 0)})
sum(idx.nz)
### random sample from the count matrix
nz.counts <- subset(GeneCounts, idx.nz)
sam <- sample(dim(nz.counts)[1], 5)
nz.counts[sam, ]
```
In a typical RNA-Seq experiment, there will be at least several thousand genes
that are expressed in all samples. If the number of non--zero genes
is very low, there is usually something wrong with at least some of the
samples (e.g. the sample preparation was not done properly, or there was some
problem with the sequencing).
# Normalization
As different libraries will be sequenced to different depths, offsets are built
in the statistical model of `r Biocpkg("DESeq2")` to ensure that parameters are comparable.
The term normalization is often used for that, but it should be noted that the
raw read counts are not actually altered. `r Biocpkg("DESeq2")` defines a virtual
reference sample by taking the median of each gene's values across samples and then computes
size factors as the median of ratios of each sample to the reference sample.
Generally, the ratios of the size factors should roughly match the ratios of the
library sizes. Dividing each column of the count table by the corresponding
size factor yields normalized count values, which can be scaled to give
a counts per million interpretation.
Thus, if all size factors are roughly equal to one, the libraries have
been sequenced equally deeply.
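The median--of--ratios idea can be sketched in a few lines of base `R` on a toy
count matrix (hypothetical numbers, all positive for simplicity; *DESeq2* itself
handles zeros by using only genes with finite log geometric means):
```{r toy_size_factors}
## toy count matrix: 3 genes (rows) x 3 samples (columns);
## sample j is sequenced j times as deeply as sample 1
cnts <- matrix(c(10,  20,  30,
                 100, 200, 300,
                 5,   10,  15),
               nrow = 3, byrow = TRUE)
geoMeans <- exp(rowMeans(log(cnts)))              # the virtual reference sample
sizeF <- apply(cnts, 2, function(s) median(s / geoMeans))
sizeF / sizeF[1]                                  # ratios 1 : 2 : 3, matching depth
norm.cnts <- sweep(cnts, 2, sizeF, "/")           # normalized counts
```
After dividing by the size factors, all three columns of `norm.cnts` agree gene-wise.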
Count data sets typically show a (left--facing) trombone shape in
mean--difference plots (MA--plots), reflecting the higher
variability of log ratios at lower counts.
In addition, points will typically be
centered around a log ratio of 0 if the normalization factors are calculated
appropriately, although this is just a general guide.
In the code below we produce density estimates of the sample counts and
pairwise mean--difference plots after the computation of the size factors.
The MA--plots are saved to a file in order to avoid cluttering the
document.
We first relevel the condition factor to make sure that the wild type
samples form the base level. This way, the fold changes are always
reported for the deletion relative to WT. We also rename the deletion
level to "Del".
```{r relevel_condition, dependson="load_counts"}
cond <- as.character(colData(DESeq2Table)$condition)
cond[cond != "WT"] <- "Del"
colData(DESeq2Table)$condition <- factor(cond, levels = c("WT", "Del"))
```
We now estimate the size factors and print them:
```{r getSizeFactors, dependson="load_counts"}
DESeq2Table <- estimateSizeFactors(DESeq2Table)
sizeFactors(DESeq2Table)
```
To assess whether the normalization has worked, we plot the densities of
counts for the different samples. Since most of the genes are not (heavily) affected
by the experimental conditions, a successful normalization will lead to overlapping
densities.
```{r plotDensities, dependson="getSizeFactors"}
multidensity(counts(DESeq2Table, normalized = TRUE)[idx.nz, ],
             xlab = "mean counts", xlim = c(0, 1000))
```
This is indeed the case here. The same holds true for the empirical cumulative
distribution functions (ECDFs), which can be thought of as integrals of the densities:
they give the probability of observing a count less than or equal to $x$.
In an ECDF plot, the estimated probability is plotted on the y--axis and the count
values on the x--axis, i.e. you can read off the median and other
quantiles from this plot.
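As a small base--`R` illustration of reading probabilities off an ECDF
(toy numbers, not the real count data):
```{r toy_ecdf}
x  <- c(1, 2, 2, 3, 10, 50)   # toy counts
Fn <- ecdf(x)                  # empirical cumulative distribution function
Fn(3)                          # P(count <= 3) = 4/6 for these data
median(x)                      # the 0.5 quantile read off the ECDF
```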
As already mentioned, if the normalization has worked, the ECDFs of the different samples
should be overlapping.
```{r plotECDFs, dependson= c("non_zero_counts", "load_counts", "getSizeFactors")}
multiecdf(counts(DESeq2Table, normalized = TRUE)[idx.nz, ],
          xlab = "mean counts", xlim = c(0, 1000))
```
To further assess systematic differences between
the samples, we can also plot pairwise mean--difference plots (MA plots): we plot the average of
the log--transformed counts vs. the log fold change per gene for each pair of samples.
The function `combn` helps us to automatically generate all sample pairs, and the
function `MDPlot` from the `r Biocpkg("EDASeq")` package then generates
the pairwise MA plots.
```{r pairwiseMAs, dependson= c("non_zero_counts", "load_counts", "getSizeFactors")}
pdf("pairwiseMAs.pdf")
MA.idx <- t(combn(1:8, 2))
for (i in seq_along(MA.idx[, 1])) {
  MDPlot(counts(DESeq2Table, normalized = TRUE)[idx.nz, ],
         c(MA.idx[i, 1], MA.idx[i, 2]),
         main = paste(colnames(DESeq2Table)[MA.idx[i, 1]], " vs ",
                      colnames(DESeq2Table)[MA.idx[i, 2]]),
         ylim = c(-3, 3))
}
dev.off()
```
The plots look good: if there is no systematic shift between the samples, the
log fold changes should scatter around zero, which is the case here.
# PCA and sample heatmaps
A useful first step in an RNA--Seq analysis is often to assess overall similarity
between samples: Which samples are similar to each other, which are different?
Does this fit the expectation from the experiment's design?
We use the \R{} function \Rfunction{dist} to calculate the Euclidean distance between samples.
To avoid the distance measure being dominated by a few highly variable genes,
and to obtain roughly equal contributions from all genes, we apply it to the regularized log--transformed data.
The aim of the regularized log--transform is to stabilize the variance
of the data and to make its distribution roughly symmetric since
many common statistical methods for exploratory analysis of multidimensional data,
especially methods for clustering and ordination (e.g., principal--component analysis and the like),
work best for (at least approximately) homoskedastic data; this means that the
variance of an observable quantity (i.e., here, the expression strength of a gene)
does not depend on the mean.
In RNA--Seq data, however, variance grows with the mean. For example,
if one performs PCA directly on a matrix of normalized read counts, the result
typically depends only on the few most strongly expressed genes because they
show the largest absolute differences between samples.
A simple and often used strategy to avoid this is to take the logarithm of
the normalized count values plus a small pseudocount; however, now the genes with
low counts tend to dominate the results because, due to the strong Poisson noise
inherent to small count values, they show the strongest relative differences
between samples. Note that this effect can be diminished by adding a relatively
high number of pseudocounts, e.g. 32, since this will also substantially
reduce the variance of the fold changes.
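A quick numeric illustration (hypothetical counts) of how a larger pseudocount
damps the log fold changes of low counts:
```{r toy_pseudocount}
a <- 2; b <- 8                 # two low counts for the same gene
log2((a + 1)  / (b + 1))       # pseudocount 1:  log2(3/9)   ~ -1.58
log2((a + 32) / (b + 32))      # pseudocount 32: log2(34/40) ~ -0.23
```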
As a solution, `r Biocpkg("DESeq2") ` offers the regularized--logarithm transformation,
or rlog for short. For genes with high counts, the rlog transformation
does not differ much from an ordinary log2 transformation.
For genes with lower counts, however, the values are shrunken towards the genes'
averages across all samples. Using an empirical Bayesian prior in the form of
a ridge penalty, this is done such that the rlog--transformed data are approximately
homoskedastic. Note that the rlog transformation is provided for applications
other than differential testing. For differential testing it is always recommended to apply
the DESeq function to raw counts.
Note the use of the function \Rfunction{t} to transpose the data matrix.
We need this because `dist` calculates distances between the rows of a matrix,
and our samples constitute the columns. We visualize the distances in a heatmap,
using the function `heatmap.2` from the `r CRANpkg("gplots")` package.
Note that we have changed the row names of the distance matrix to
contain the genotype and the column names to contain the sample ID, so
that we have all this information in view when looking at the heatmap.
``` {r sample_heatmaps, fig.cap= "sample heatmaps", fig.height = 20, fig.width = 20, dependson= "load_counts"}
rld <- rlogTransformation(DESeq2Table, blind=TRUE)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colData(rld)$condition
colnames(mat) <- colData(rld)$sampleNO
hmcol <- colorRampPalette(brewer.pal(9, "Blues"))(255)
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
```
Another way to visualize sample--to--sample distances is a principal--components analysis (PCA).
In this ordination method, the data points (i.e., here, the samples) are projected
onto the 2D plane such that they spread out optimally.
```{r PCA_computations, dependson="sample_heatmaps" }
ntop = 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop,
length(Pvars)))]
PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
```
Since PCA can be problematic with high--dimensional data,
we first select only the `r ntop` genes showing the highest variance.
We then use the function `prcomp` to compute the principal components. The
resulting principal component scores, i.e. the coordinates of the samples in
the space of the principal components, can then be plotted.
For this we use the function `qplot` from `r CRANpkg("ggplot2")`. `r CRANpkg("ggplot2")`
is a comprehensive plotting package for R; a very good tutorial can be found [here](http://jofrhwld.github.io/avml2012/).
The function `qplot` is supposed to mimic the standard plotting function `plot`
as closely as possible, but additional changes can be made; e.g. lattice--type
graphics (splitting the plot by a factor of interest) can easily be generated.
Additional elements are added by using `+` and the corresponding commands.
In our PCA plot, we change the plotting colors and the axis labels.
```{r PCA_plots, fig.cap= "PCA plot", fig.width=7.5, fig.height=6, dependson="PCA_computations" }
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
PC3 = PCA$x[,3], PC4 = PCA$x[,4],
sampleNO = colData(rld)$sampleNO,
condition = colData(rld)$condition)
(qplot(PC1, PC2, data = dataGG, color = condition,
main = "PC1 vs PC2, top variable genes", size = I(6))
+ labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
y = paste0("PC2, VarExp:", round(percentVar[2],4)))
+ scale_colour_brewer(type="qual", palette=2)
)
```
We can clearly identify two outliers in the PCA plot, one in each experimental
group. These two outliers essentially drive the separation along the first
principal component. Thus, the outliers will most likely increase the
overall variability and thereby diminish statistical power later when testing
for differential expression. Therefore we remove them. We can identify them
by selecting only those samples that have a positive score on the
first principal component. This can conveniently be done using the function
`subset` with the logical vector `dataGG$PC1 > 0`: `subset` will
then return the sample names at the positions where `dataGG$PC1 > 0` evaluates to `TRUE`.
```{r identify_and_remove_outliers, dependson="PCA_plots"}
outliers <- as.character(subset(colnames(DESeq2Table), dataGG$PC1 > 0))
outliers
DESeq2Table <- DESeq2Table[, !(colnames(DESeq2Table) %in% outliers)]
```
We now produce the cluster heatmap and the PCA plot without the outliers.
```{r sample_heatmaps_no_outlier, fig.cap= "Quality control plots without outliers", fig.width=20, fig.height=20, dependson="identify_and_remove_outliers"}
rld <- rlogTransformation(DESeq2Table, blind=TRUE)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colData(rld)$condition
colnames(mat) <- colData(rld)$sampleNO
hmcol <- colorRampPalette(brewer.pal(9, "OrRd"))(255)
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
```
```{r PCA_no_outlier, fig.cap= "PCA plot without outliers", fig.width=7.5, fig.height=6}
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop,
length(Pvars)))]
PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
PC3 = PCA$x[,3], PC4 = PCA$x[,4],
sampleNO = colData(rld)$sampleNO,
condition = colData(rld)$condition)
(qplot(PC1, PC2, data = dataGG, color = condition,
main = "PC1 vs PC2, top variable genes, no outliers", size = I(6))
+ labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
y = paste0("PC2, VarExp:", round(percentVar[2],4)))
+ scale_colour_brewer(type="qual", palette=4)
)
```
The separation between the two groups is not extremely clear: they do not
cluster together in the heatmap, and they are only separated by
the second principal component.
This will most likely result in only a modest number of differentially expressed
genes, which is in line with the small number of differentially expressed genes
(~100) the authors report in the original publication.
# Differential expression analysis
## Dispersion estimation
It is known that a standard Poisson model can only account for the technical
noise in RNA--Seq data. In the Poisson model the variance is equal to the
mean, while in real RNA--Seq data the variance is greater than the mean, a phenomenon
often encountered in count data in general and referred to as "overdispersion".
A popular way to model this is to use the negative binomial (NB) distribution, which includes an
additional dispersion parameter $\alpha$ such that $E(\text{NB}) = \mu$ and
\[
\text{Var}[\text{NB}(\mu, \alpha)] = \mu + \alpha \cdot \mu^2 .
\]
Hence, the variance is greater than the mean. `r Biocpkg("DESeq2") ` also uses the NB
model. Hence, the first step in the analysis of differential expression, is
to obtain an estimate of the dispersion parameter for each gene. The typical
shape of the dispersion fit is an exponentially decaying curve. The asymptotic
dispersion for highly expressed genes can be seen as a measurement of biological
variability in the sense of a squared coefficient of variation: a dispersion
value of 0.01 means that the gene's expression tends to differ by typically
$\sqrt{0.01}$ = 10\% between samples of the same treatment group.
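To make this concrete (numbers for illustration only):
```{r toy_NB_variance}
mu    <- 1000    # mean expression
alpha <- 0.01    # dispersion
mu + alpha * mu^2   # Var = 1000 + 0.01 * 1000^2 = 11000
sqrt(alpha)         # biological coefficient of variation: 0.1, i.e. 10%
```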
For weak genes, the Poisson noise is an additional source of noise, which is
added to the dispersion. The function \Rfunction{plotDispEsts}
visualizes `r Biocpkg("DESeq2") `'s dispersion estimates:
```{r dispersion_estimation, dependson=c("load_counts", "getSizeFactors") }
DESeq2Table <- estimateDispersions(DESeq2Table)
plotDispEsts(DESeq2Table)
```
The black points are the dispersion estimates for each gene as obtained by considering
the information from each gene separately. Unless one has many samples,
these values fluctuate strongly around their true values.
Therefore, the red trend line is fitted, which shows the dispersions' dependence
on the mean; each gene's estimate is then shrunk towards the red line
to obtain the final estimates (blue points) that are used in the hypothesis tests.
The blue circles above the main "cloud" of points are genes which have high gene--wise
dispersion estimates which are labelled as dispersion outliers.
These estimates are therefore not shrunk toward the fitted trend line.
The warnings just indicate that the dispersion estimation failed for some genes.
# Statistical testing of Differential expression
We can perform the statistical testing for differential expression
and extract its results. Calling \Rfunction{nbinomWaldTest} performs
the test for differential expression, while the call to the
\Rfunction{results} function extracts the results of the test and returns
adjusted $p$--values according to the Benjamini--Hochberg rule to control
the FDR. The test performed is a Wald test, i.e. a test for coefficients
in a regression model. It is based on a z--score, i.e. a test statistic that is
$N(0,1)$--distributed under $H_0$.
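As a quick reminder of how a Wald $p$--value arises from the z--score
(illustrative value, not taken from our data):
```{r toy_wald_p}
z <- 1.96              # an example Wald z-score
2 * pnorm(-abs(z))     # two-sided p-value under N(0,1), ~ 0.05
```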
```{r DESeq2_Results, dependson=c("load_counts", "getSizeFactors", "dispersion_estimation")}
DESeq2Table <- nbinomWaldTest(DESeq2Table)
DESeq2Res <- results(DESeq2Table, pAdjustMethod = "BH")
table(DESeq2Res$padj < 0.1)
```
We identify `r sum(na.exclude(DESeq2Res$padj < 0.1)) ` differentially expressed genes.
## Independent filtering
In addition to this, `r Biocpkg("DESeq2") ` performs automated independent filtering:
For weakly expressed genes, we have no chance of seeing differential expression,
because the low read counts suffer from such high Poisson noise that any biological
effect is drowned in the uncertainties of the read counting.
At first sight, there may seem to be little benefit in filtering out these genes.
After all, the test found them to be non--significant anyway. However, these
genes have an influence on the multiple testing procedure, whose performance commonly
improves if such genes are removed. By removing the weakly--expressed genes from
the input to the BH--FDR procedure, we can find more genes to be significant among
those which we keep, and so improve the power of our test.
This approach is known as independent filtering.
The `r Biocpkg("DESeq2") ` software automatically performs independent filtering which maximizes
the number of genes which will have a BH--adjusted $p$--value less than a critical
value (by default, alpha is set to 0.1).
This automatic independent filtering is performed by, and can be controlled by,
the results function. We can observe how the number of rejections changes for
various cutoffs based on mean normalized count. The following optimal threshold
and table of possible values is stored as an attribute of the results object.
```{r independent_filtering_res, dependson="DESeq2_Results"}
plot(metadata(DESeq2Res)$filterNumRej, type="b", xlab="quantiles of 'baseMean'",
ylab="number of rejections")
```
The term **independent** highlights an important caveat. Such filtering is permissible
only if the filter criterion is independent of the actual test statistic under the
null--hypothesis.
Otherwise, the filtering would invalidate the test and consequently the
assumptions of the BH procedure. This is why we filtered on the average over all
samples: this filter is blind to the assignment of samples to the deletion
and control group and hence independent under the null hypothesis of equal
expression.
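The benefit of filtering can be sketched with hypothetical $p$--values: removing
hypotheses that cannot become significant lowers the multiplicity burden of the
BH procedure, so the genuinely small $p$--values survive the adjustment.
```{r toy_independent_filtering}
## 2 true signals among 1000 tests (illustrative, evenly spaced null p-values)
p <- c(1e-4, 2e-4, seq(0.01, 1, length.out = 998))
sum(p.adjust(p, method = "BH") < 0.1)          # all 1000 tests enter BH: 0 hits
sum(p.adjust(p[1:500], method = "BH") < 0.1)   # half filtered out first: 2 hits
```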
## Inspection and correction of $p$--values
The null--$p$--values follow a uniform distribution on the unit interval
[0,1] if they are computed using a continuous null distribution. Significant $p$--values
thus become visible as an enrichment of $p$--values near zero in the histogram.
Thus, the $p$--value histogram of "correctly" computed $p$--values
will have a rectangular shape with a peak at 0.
A histogram of $p$--values should always be plotted in order to check whether
they have been computed correctly. We also do this here:
```{r pvalues_from_DESEq, fig.cap="p-values, wrong null distribution", dependson="DESeq2_Results"}
hist(DESeq2Res$pvalue, col = "lavender", main = "WT vs Deletion", xlab = "p-values")
```
We can see that this is clearly not the case for the $p$--values returned by
`r Biocpkg("DESeq2") ` in this case.
Very often, if the assumed variance of the null distribution is too high, we see
hill--shaped $p$--value histogram. If the variance is too low, we get a U--shaped
histogram, with peaks at both ends.
Here we have a hill--shape, indicating an overestimation of the variance in the
null distribution. Thus, the $N(0,1)$ null distribution of the Wald test is
not appropriate here.
The dispersion estimation is not condition specific and estimates only a
single dispersion estimate per gene. This is sensible, since the number of
replicates is usually low.
However, if we have e.g. batches or "outlying" samples that are consistently
a bit different from others within a group, the dispersion within the experimental
group can be different and a single dispersion parameter not be appropriate.
For an example of the estimation of multiple dispersions, see the analysis performed
in:
[Reyes et al. -- Drift and conservation of differential exon usage across tissues in primate species, 2013](http://dx.doi.org/10.1073/pnas.1307202110)
Fortunately, there is software available to estimate the variance of the
null--model from the test statistics. This is commonly referred to as
"empirical null modelling".
Here we use the `r CRANpkg("fdrtool")` package for this, with the Wald statistic as input.
This package returns the estimated
null variance, as well as estimates of various other FDR--related quantities
and the $p$--values computed using the estimated null model parameters.
We first remove the genes filtered out by independent filtering and the
dispersion outliers: they have NA adjusted $p$--values and NA $p$--values, respectively.
```{r remove_outliers, dependson="DESeq2_Results"}
DESeq2Res <- DESeq2Res[ !is.na(DESeq2Res$padj), ]
DESeq2Res <- DESeq2Res[ !is.na(DESeq2Res$pvalue), ]
```
We now remove the original adjusted $p$--values, since we will add the corrected
ones later on.
```{r remove_original_pvals, dependson=c("DESeq2_Results", "remove_outliers")}
DESeq2Res <- DESeq2Res[, -which(names(DESeq2Res) == "padj")]
```
We can now use the z--scores returned by `r Biocpkg("DESeq2")` as input to `r CRANpkg("fdrtool")`
to re--estimate the $p$--values.
```{r reEstimatePvals, dependson="remove_original_pvals"}
FDR.DESeq2Res <- fdrtool(DESeq2Res$stat, statistic = "normal", plot = TRUE)
FDR.DESeq2Res$param[1, "sd"]
```
Interestingly, the estimated standard deviation of the null model is `r FDR.DESeq2Res$param[1, "sd"] `,
which is lower than $1$, the theoretical value.
We now add the new BH--adjusted $p$--values to the results data frame.
```{r addNewPvals, dependson="reEstimatePvals"}
DESeq2Res[,"padj"] <- p.adjust(FDR.DESeq2Res$pval, method = "BH")
```
We can now plot the histogram of the "correct" $p$--values.
```{r corrected_pvalues, fig.cap="p-values, correct null distribution", dependson="addNewPvals"}
hist(FDR.DESeq2Res$pval, col = "royalblue4",
main = "WT vs Deletion, correct null model", xlab = "CORRECTED p-values")
```
As we can see, the null model is now correct and the histogram has the
expected shape.
## Extracting differentially expressed genes
We can now extract the number of differentially expressed
genes and produce an MA plot of the two--group comparison as
shown in figure 4a of the original paper. ("M" for minus,
because a log ratio is equal to log minus log, and "A" for average).
Similar to its use in the initial quality control, an
MA plot provides a useful overview of an experiment with a
two--group comparison:
The plot represents each gene with a dot. The $x$--axis is the average expression
over all samples, the y axis the log2 fold change between treatment and control.
Genes with an FDR value below a threshold (here 0.1) are shown in red.
This plot demonstrates that only genes with a sufficiently high average normalized
count carry enough information to yield a significant call.
Also note `r Biocpkg("DESeq2")`'s shrinkage estimation of log fold changes (LFCs):
When count values are too low to allow an accurate estimate of the LFC, the
value is "shrunken" towards zero to avoid that these values, which otherwise
would frequently be unrealistically large, dominate the top--ranked log fold changes.
Thus, in contrast to the "trumpet" shape
of the MA plot in the original paper (they used the original `r Biocpkg("DESeq")` package),
we obtain a "cigar--shaped" MA plot.
```{r DE_Genes, fig.cap= "MA plot for DESeq2 analysis", dependson="addNewPvals"}
table(DESeq2Res[,"padj"] < 0.1)
plotMA(DESeq2Res)
```
We now identify `r sum(DESeq2Res[,"padj"] < 0.1)` differentially expressed
genes at an FDR of $0.1$.
## Check overlap with the paper results
We can now check the overlap with the original results of the paper.
In the original publication 100 genes with an FDR value of less than
$0.05$ were identified. The Excel table "ng.2971--S3.xls" containing
their fold changes and $p$--values is available
as supplementary table 4 of the original publication. We convert the table
into a `.csv` file and import it.
```{r check_overlap_import_table }
paperRes <- read.csv("ng.2971-S3.csv", skip = 1, header =T)
names(paperRes)
```
We now extract our significant genes and their annotation
```{r check_overlap_extract_sig_genes, dependson="addNewPvals"}
sigGenes <- rownames(subset(DESeq2Res, padj < 0.1))
anno <- AnnotationDbi::select(org.Mm.eg.db,
                              keys = rownames(DESeq2Res),
                              columns = c("SYMBOL", "GENENAME"),
                              keytype = "ENSEMBL")
anSig <- as.data.frame(subset(anno, ENSEMBL %in% sigGenes))
sample_n(anSig, 5)
```
We now extract the genes identified in the original paper that we call as
differentially expressed.
```{r }
overlap <- subset(paperRes, Gene.name %in% anSig$SYMBOL )
dim(overlap)
```
We can now ask how many of the genes that we identify with FDR < 0.1 also had
FDR < 0.1 in the original paper, and get the total number of genes in the original
paper with FDR < 0.1 (there are `r dim(subset(paperRes, padj < 0.1))[1] `).
```{r }
dim(subset(overlap, padj < 0.1))[1]
dim(subset(paperRes, padj < 0.1))[1]
```
Out of the `r sum(DESeq2Res[,"padj"] < 0.1) ` differentially expressed
genes that we identify at an FDR of $0.1$, `r dim(subset(overlap, padj < 0.1))[1]`
have also been identified in the original paper (out of `r dim(subset(paperRes, padj < 0.1))[1]`
in total in the original paper with an FDR of $0.1$).
The MA--plot also looks very similar to the paper plot. Thus, we can reproduce the
paper results with a slightly different software setting and can gain power
by correcting the null model. The "blood" and "ribosomal proteins" parts of
the MA--plot marked in the paper figure are also clearly visible.
## Check a validated gene
A quick way to visualize the counts for a particular gene is to use
the `plotCounts` function, which takes as arguments the
`DESeqDataSet`, a gene name, and the group over which to plot the
counts. Some of the differentially expressed genes identified via RNA--Seq
have been validated by qPCR (see Figure 4b of the paper). Here, we check the
expression of [Nr2f1](http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000069171;r=13:78188973-78199757).
```{r plotcounts}
Nr2f1ENSEMBL <- subset(anSig, SYMBOL == "Nr2f1")$ENSEMBL
plotCounts(DESeq2Table, gene=Nr2f1ENSEMBL, intgroup=c("condition"))
```
# Gene ontology enrichment analysis
We can now try to characterize the identified differentially expressed genes
a bit better by performing a GO enrichment analysis. Essentially, the
Gene Ontology (\url{http://www.geneontology.org/}) is a hierarchically organized
collection of functional gene sets. For a nice introduction to the GO, see:
* [du Plessis et al. -- The what, where, how and why of gene ontology -- a primer for bioinformaticians, Briefings in Bioinformatics, 2011](http://dx.doi.org/10.1093/bib/bbr002)
## Matching the background set
The function `genefinder` from the `r Biocpkg("genefilter")` package will be used
to find background genes that are similar in expression to the differentially expressed
genes. The function tries to identify 10 genes for each DE gene that match its expression strength.
We then check whether the background has roughly the same distribution
of average expression strength as the foreground by plotting the densities.
We do this in order not to select a biased background, since the gene set testing
is performed with a simple Fisher test on a 2x2 table, which uses only the status
of a gene, i.e. whether it is differentially expressed or not, and not its fold change
or absolute expression.
Note that for RNA--Seq data, the chance of a gene being identified as DE will
most probably depend on its expression strength (and potentially
also on its gene length etc.). Thus it is important to find a matching background.
Our testing approach here is very similar to that of web tools like
[DAVID](http://david.abcc.ncifcrf.gov/);
however, we explicitly model the background here.
We first get average gene expressions for each of the genes and then find
non--DE genes that show a similar expression as the DE--genes. These genes
are then our background.
```{r GO_Analysis_create_backgrounds}
overallBaseMean <- as.matrix(DESeq2Res[, "baseMean", drop = FALSE])
sig_idx <- match(anSig$ENSEMBL, rownames(overallBaseMean))
backG <- c()
for (i in sig_idx) {
  ind <- genefinder(overallBaseMean, i, 10, method = "manhattan")[[1]]$indices
  backG <- c(backG, ind)
}
backG <- unique(backG)
backG <- rownames(overallBaseMean)[backG]
```
We now remove the DE genes from the background and get the total number of genes in
the background.
```{r remove_DE_from_backG}
backG <- setdiff(backG, anSig$ENSEMBL)
length(backG)
```
Plotting the densities of the average expressions shows that the background matching
has worked reasonably well.
```{r GO, fig.cap= "matching foreground and background", dependson="remove_DE_from_backG"}
multidensity(list(all        = log2(DESeq2Res[, "baseMean"]),
                  foreground = log2(DESeq2Res[anSig$ENSEMBL, "baseMean"]),
                  background = log2(DESeq2Res[backG, "baseMean"])),
             xlab = "log2 mean normalized counts",
             main = "Matching for enrichment analysis")
```
We can now
perform the actual testing. For this purpose we use the \Biocpkg{topGO} package,
which implements a nice interface to Fisher--test based enrichment
testing and also offers additional algorithms that take the GO structure into account,
e.g. by reporting only the most specific
gene set in the hierarchy. For more details, see the paper:
* [Alexa et al. -- Improved scoring of functional groups from gene expression data by decorrelating GO graph structure, Bioinformatics, 2006](http://dx.doi.org/10.1093/bioinformatics/btl140)
The GO has three top ontologies: cellular component (CC), biological process
(BP), and molecular function (MF). Here we test all of them in turn.
# Running topGO
We first create a factor `alg` which indicates for every gene in
our universe (the union of background and DE genes) whether it is
differentially expressed or not. It has the ENSEMBL IDs of the genes in
our universe as names and contains 1 if the gene is DE and 0 otherwise.
```{r create factor of interesting genes, dependson="remove_DE_from_backG" }
onts <- c("MF", "BP", "CC")
geneIDs <- rownames(overallBaseMean)
inUniverse <- geneIDs %in% c(anSig$ENSEMBL, backG)
inSelection <- geneIDs %in% anSig$ENSEMBL
alg <- factor(as.integer(inSelection[inUniverse]))
names(alg) <- geneIDs[inUniverse]
```
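To see what the subsetting in the chunk above does, here is a tiny made-up example (the gene names are hypothetical):
```r
ids    <- paste0("g", 1:5)
inUniv <- c(TRUE, TRUE, FALSE, TRUE, TRUE)   # g3 is outside the universe
inSel  <- c(TRUE, FALSE, FALSE, FALSE, TRUE) # g1 and g5 are "DE"
f <- factor(as.integer(inSel[inUniv]))       # 1 = DE, 0 = background
names(f) <- ids[inUniv]
f
## g1 g2 g4 g5
##  1  0  0  1
## Levels: 0 1
```
Genes outside the universe (here `g3`) simply drop out; the remaining genes carry their DE status as a two-level factor, which is exactly the input `topGO` expects.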
We first initialize the `r Biocpkg("topGO")` data set for each top ontology,
using the GO annotations contained in the mouse annotation data base \Biocannopkg{org.Mm.eg.db}.
The \Robject{nodeSize} parameter specifies the minimum size of a GO category to be used: here,
categories with fewer than 5 genes are not included in the testing. Since
we use ENSEMBL IDs as our keys, we have to specify this explicitly, as \Biocpkg{topGO}
uses ENTREZ identifiers by default.
Now the tests can be run. `r Biocpkg("topGO")` offers a wide range of options,
for details see the paper or the package vignette.
Here we run two common tests: an ordinary Fisher test for every GO category, and the
"elim" algorithm, which tries to incorporate the hierarchical structure of the
GO and to "decorrelate" it.
The "elim" algorithm starts processing the nodes/GO categories
at the most specific (bottommost) level of the hierarchy and then iteratively
moves up to more general nodes. If a node is scored as significant,
all of its genes are marked as removed in all ancestor nodes.
This way, the "elim" algorithm aims at finding the most specific node
for every gene.
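The effect of "elim" can be sketched on a toy two-level hierarchy (all counts and gene names below are invented): a specific child term is tested first, and since it is significant, its genes are removed before the more general parent term is tested.
```r
universe <- paste0("g", 1:1000)
de     <- c(paste0("g", 1:40), paste0("g", 501:520))  # 60 "DE" genes
child  <- paste0("g", 1:40)                           # specific term, fully DE
parent <- paste0("g", 1:200)                          # general term containing child

## one-sided Fisher test for over-representation of a term among DE genes
fisherP <- function(term, de, universe){
  fisher.test(table(universe %in% term, universe %in% de),
              alternative = "greater")$p.value
}

pChild         <- fisherP(child, de, universe)         # highly significant
pParentClassic <- fisherP(parent, de, universe)        # "classic": also significant
## "elim": the significant child's genes are removed before testing the parent
pParentElim    <- fisherP(setdiff(parent, child), setdiff(de, child), universe)
c(pChild, pParentClassic, pParentElim)
```
In this toy setup the parent's classic significance is driven entirely by the child's genes, so "elim" no longer flags the parent and only the most specific term is reported.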
The tests use a 0.01 $p$--value cutoff by default. We order by the "classic"
algorithm to make the analysis comparable to the one performed in the original
paper, whose enrichment results
are reported in supplementary tables 6 and 7.
```{r run tests, results='hide', dependson="create factor of interesting genes" }
tab <- as.list(onts)
names(tab) <- onts
for(i in 1:3){
  ## prepare data
  tgd <- new("topGOdata", ontology = onts[i], allGenes = alg, nodeSize = 5,
             annot = annFUN.org, mapping = "org.Mm.eg.db", ID = "ensembl")
  ## run tests
  resultTopGO.elim <- runTest(tgd, algorithm = "elim", statistic = "Fisher")
  resultTopGO.classic <- runTest(tgd, algorithm = "classic", statistic = "Fisher")
  ## look at results
  tab[[i]] <- GenTable(tgd, Fisher.elim = resultTopGO.elim,
                       Fisher.classic = resultTopGO.classic,
                       orderBy = "Fisher.classic", topNodes = 200)
}
```
We can now look at the results: we inspect the top 200 GO categories according
to the "Fisher classic" algorithm. The function `GenTable` produces
a table of significant GO categories. Finally, we bind all resulting tables
together and write them to a `csv` file that can be viewed with
spreadsheet programs.
```{r process results, dependson="run tests" }
## rbind.fill from the plyr package pads missing columns with NA
topGOResults <- rbind.fill(tab)
write.csv(topGOResults, file = "topGOResults.csv")
```
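`rbind.fill` combines data frames whose column sets differ, filling missing columns with `NA`; a minimal base-R imitation on made-up toy tables:
```r
df1 <- data.frame(GO.ID = "GO:0000001", Fisher.elim = 0.01)
df2 <- data.frame(GO.ID = "GO:0000002", Term = "toy term")
allCols <- union(names(df1), names(df2))
## add any missing columns as NA, then put columns in a common order
pad <- function(d){ d[setdiff(allCols, names(d))] <- NA; d[allCols] }
combined <- rbind(pad(df1), pad(df2))
combined
```
This matters here because `GenTable` may return slightly different column sets for the three ontologies, so a plain `rbind` could fail.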
Gene set enrichment analysis is an area of very active
research in bioinformatics. For additional approaches, see the \Biocpkg{topGO}
vignette and the references therein, as well as the GeneSetEnrichment view
in Bioconductor:
* <http://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment>
# Session information
As the last part of this document, we call the function *sessionInfo*,
which reports the version numbers of R and of all the packages used in
this session. It is good practice to always keep such a record, as it
will help to track down what has happened in case an R script
ceases to work because functions have been changed in a newer
version of a package. The session information should also **always**
be included in any emails to the
[Bioconductor support site](https://support.bioconductor.org), along
with all code used in the analysis.
```{r}
sessionInfo()
```