---
title: "Analysis of RNA-Seq data: gene-level exploratory analysis and differential expression"
output:
  BiocStyle::html_document:
    toc: true
    highlight: tango
    css: Andrzej.css
---

Bernd Klaus [2] and Wolfgang Huber [2], based on material by Michael Love [1] and Simon Anders [2].

[1] Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, US; [2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.

```{r style, echo=FALSE, results="asis", cache=FALSE}
BiocStyle::markdown()
opts_chunk$set(fig.width=14, fig.height=10, cache=TRUE, error=FALSE, message=FALSE)
options(digits = 4, scipen = 10, stringsAsFactors = FALSE, width = 100)
```

**LAST UPDATE AT** `r date()`

# Preparations

We first set global chunk options and load the necessary packages.

```{r setup, cache = FALSE}
library(BiocStyle)
library(rmarkdown)
library(geneplotter)
library(ggplot2)
library(plyr)
library(LSD)
library(DESeq2)
library(gplots)
library(RColorBrewer)
library(stringr)
library(topGO)
library(genefilter)
library(biomaRt)
library(dplyr)
library(EDASeq)
library(fdrtool)
library(org.Mm.eg.db)

data.dir <- file.path("/g/huber/users/klaus/Data/NCBI")
```

In this document we introduce a workflow for a typical RNA--Seq data analysis. We start from a "count table", which summarizes the number of sequence reads mapping to each of the genes, and discuss quality control, differential expression and enrichment analysis of the data. Important aspects of the R language and [Bioconductor](http://bioconductor.org/) data structures for high--dimensional genomic data are discussed along the way. We re--analyze RNA--Seq data obtained by comparing the expression profiles of WT mice to mice harboring a deletion associated with a MYC enhancer that is required for proper lip/palate formation. The data were published along with the following paper:

* [Uslu et al.
-- Long--range enhancers regulating Myc expression are required for normal facial morphogenesis, 2014](http://dx.doi.org/10.1038/ng.2971)

Note that in the workflows below, we do not aim at giving a comprehensive treatment of all possible options, but rather try to introduce sensible default approaches for the necessary steps. This document is based on a protocol published in Nature Methods, which contains more detailed information on the various steps. Additionally, the "RNA--Seq workflow" is well worth reading and contains a lot of additional background information.

* [Anders et al. -- Count--based differential expression analysis of RNA sequencing data using R and Bioconductor, 2013](http://dx.doi.org/10.1038/nprot.2013.099)
* [Love et al. -- RNA--Seq workflow: gene--level exploratory analysis and differential expression](http://master.bioconductor.org/help/workflows/rnaseqGene/)

This lab will walk you through an end--to--end RNA--Seq differential expression workflow, using `r Biocpkg("DESeq2")` along with other Bioconductor packages.

# Preparing count matrices

As input, the *DESeq2* package expects count data as obtained, e.g., from RNA--Seq or another high--throughput sequencing experiment, in the form of a matrix of integer values. The value in the *i*--th row and the *j*--th column of the matrix tells how many reads have been mapped to gene *i* in sample *j*. Analogously, for other types of assays, the rows of the matrix might correspond, e.g., to binding regions (with ChIP--Seq) or peptide sequences (with quantitative mass spectrometry).

The count values must be raw counts of sequencing reads. This is important for *DESeq2*'s statistical model to hold, as only the actual counts allow assessing the measurement precision correctly. Hence, please do not supply other quantities, such as (rounded) normalized counts, or counts of covered base pairs --- this will only lead to nonsensical results.
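As a minimal illustration of this expected input format, a *DESeqDataSet* can be constructed directly from a raw integer matrix. The counts, gene names and sample names below are made up for illustration and are not the data analyzed in this document:

```r
library(DESeq2)

# toy matrix of raw read counts: 3 genes x 4 samples (made-up numbers)
countData <- matrix(c(12,  0, 523,
                      30,  2, 611,
                      11,  1, 480,
                      55,  0, 701),
                    ncol = 4,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("S1", "S2", "S3", "S4")))

# sample table: one row per column of the count matrix
colData <- DataFrame(condition = factor(c("WT", "WT", "Del", "Del")))

dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData   = colData,
                              design    = ~ condition)
dds
```

Supplying, e.g., rounded FPKM values instead of the raw counts here would silently break the variance model described above.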
# Aligning reads to a reference and feature counting

The computational analysis of an RNA--Seq experiment begins earlier: what we get from the sequencing machine is a set of FASTQ files that contain the nucleotide sequence of each read and a quality score at each position. These reads must first be aligned to a reference genome or transcriptome. It is important to know if the sequencing experiment was single--end or paired--end, as the alignment software will require the user to specify both FASTQ files for a paired--end experiment. The output of this alignment step is commonly stored in a file format called [SAM/BAM](http://samtools.github.io/hts-specs).

To download the published data associated with the paper from the Gene Expression Omnibus repository, we need the [SRA toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/), which allows us to download the FASTQ files from the Short Read Archive. Binaries are available for Ubuntu and CentOS. The accession numbers (and links to the landing pages) are given in the article; they are [GSM1279692](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1279692) -- [GSM1279701](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1279701).

A number of software programs exist to align reads to the reference genome, and the development is too rapid for this document to provide an up--to--date list. We recommend consulting benchmarking papers that discuss the advantages and disadvantages of each software, which include accuracy, ability to align reads over splice junctions, speed, memory footprint, and many other features. The reads for this experiment were aligned to the Ensembl release 66 mouse reference genome using [TopHat](http://ccb.jhu.edu/software/tophat/index.shtml), packaged with a GTF (gene annotation) file from the [iGenomes](http://tophat.cbcb.umd.edu/igenomes.shtml) project.
The Python library [HTSeq](http://www-huber.embl.de/users/anders/HTSeq) was then used to count the aligned reads per feature. All of these steps are treated in detail in the references given above. For a full example of using the *HTSeq* Python package for read counting, please see the `r Biocexptpkg("pasilla")` vignette. For an example of generating the *DESeqDataSet* from files produced by *htseq--count*, please see the `r Biocpkg("DESeq2")` vignette. We now obtain the count table of the experiment directly from a pre--saved file.

```{r load_counts, eval=TRUE}
load(url("http://www-huber.embl.de/users/klaus/geneCounts.RData"))
DESeq2Table
```

# The *DESeqDataSet* class, column and row metadata

Bioconductor software packages often define and use a custom class for their data object, which makes sure that all the needed data slots are consistently provided and fulfill the requirements. In addition, Bioconductor has general data classes (such as the *SummarizedExperiment*) that can be used to move data between packages. In `r Biocpkg("DESeq2")`, the custom class is called *DESeqDataSet*. It is built on top of the *SummarizedExperiment* class (in technical terms, it is a subclass), and it is easy to convert *SummarizedExperiment* instances into *DESeqDataSet* objects and vice versa. One of the main differences is that the assay slot is instead accessed using the *counts* accessor, and the class enforces that the values in this matrix are non--negative integers.

```{r sumexp, echo=FALSE}
par(mar=c(0,0,0,0))
plot(1,1,xlim=c(0,100),ylim=c(0,100),bty="n",
     type="n",xlab="",ylab="",xaxt="n",yaxt="n")
polygon(c(45,80,80,45),c(10,10,70,70),col=rgb(1,0,0,.5),border=NA)
polygon(c(45,80,80,45),c(68,68,70,70),col=rgb(1,0,0,.5),border=NA)
text(62.5,40,"assay(s)")
text(62.5,30,"e.g. 'counts'")
polygon(c(20,40,40,20),c(10,10,70,70),col=rgb(0,0,1,.5),border=NA)
polygon(c(20,40,40,20),c(68,68,70,70),col=rgb(0,0,1,.5),border=NA)
text(30,40,"rowData")
polygon(c(45,80,80,45),c(75,75,90,90),col=rgb(.5,0,.5,.5),border=NA)
polygon(c(45,47,47,45),c(75,75,90,90),col=rgb(.5,0,.5,.5),border=NA)
text(62.5,82.5,"colData")
```

Here we show the component parts of a *SummarizedExperiment* object, and also its subclasses, such as the *DESeqDataSet* which we downloaded. The assay(s) (red block) contains the matrix (or matrices) of summarized values, the rowData (blue block) contains information about the genomic ranges, and the colData (purple block) contains information about the samples or experiments. The highlighted line in each block represents the first row (note that the first row of colData lines up with the first column of the assay).

We can investigate the content of the downloaded *DESeqDataSet* by looking at the counts in the assay slot, the phenotypic data about the samples in colData (sample information), and the data about the genes in the rowData (information on genomic features) slot.

```{r show_Data, dependson="load_counts"}
DESeq2Table
head(assay(DESeq2Table))
colSums(assay(DESeq2Table))
colData(DESeq2Table)
rowData(DESeq2Table)
```

Note that the rowData slot is a *GRangesList*, which contains all the information about the exons for each gene, i.e., for each row of the count matrix. It also contains additional annotation for each gene, e.g. a gene description and a gene symbol.

```{r, dependson="load_counts"}
mcols(rowData(DESeq2Table))
```

The colData slot contains all the sample metadata.

```{r, dependson="load_counts"}
colData(DESeq2Table)
```

At this point, we have counted the reads which overlap the genes in the gene model we specified. Note that the class DataFrame is a [Bioconductor](http://bioconductor.org/) variant of the standard data.frame class in R,
which has, among other things, a more sensible print function, showing only the head and the tail of the table.

**Before we proceed, we now correct an annotation error: the samples SRR1042891 and SRR1042886 have the opposite genotype.**

```{r correct_anno_error, dependson="load_counts", eval=TRUE}
con <- as.character(colData(DESeq2Table)$condition)
con[2] <- "Del_8_17_homo"
con[7] <- "WT"
colData(DESeq2Table)$condition <- factor(con)
```

We can now move on to the actual differential expression analysis steps after specifying an appropriate *design*.

# Experimental design and the design formula

The *DESeqDataSet* has an associated *design formula*. The design is specified at the beginning of the analysis, as it will inform many of the *DESeq2* functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on the design formula). The design formula tells which variables in the column metadata table (colData) specify the experimental design and how these factors should be used in the analysis.

The simplest design formula for differential expression would be ~ condition, where condition is a column in colData(DESeq2Table) which specifies which of two (or more) groups the samples belong to. This is also what we will be using for the data at hand. You can use R's formula notation to express any experimental design that can be described within an ANOVA--like framework. Note that *DESeq2* uses the same formula notation as, for instance, the *lm* function of base R. If the question of interest is whether a fold change due to treatment is different across groups, interaction terms can be included using models such as ~ group + treatment + group:treatment. More complex designs such as these are covered in the `r Biocpkg("DESeq2")` vignette.
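To see which coefficients such formulas actually produce, base R's model.matrix can be inspected directly; the small two-factor sample table below is hypothetical and only serves to illustrate the formula notation:

```r
# hypothetical sample table with a batch and a treatment factor
samples <- data.frame(batch     = factor(c("b1", "b1", "b2", "b2")),
                      treatment = factor(c("ctrl", "trt", "ctrl", "trt")))

# a simple one-factor design, as used for the data in this document
model.matrix(~ treatment, data = samples)

# accounting for a batch effect in addition to the treatment
model.matrix(~ batch + treatment, data = samples)

# interaction design: does the treatment effect differ between batches?
model.matrix(~ batch + treatment + batch:treatment, data = samples)
```

The first level of each factor is absorbed into the intercept, which is why releveling a factor (as we do for condition below) changes the sign and interpretation of the reported fold changes.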
# Quality control and normalization of the count data

Having created a count table and fed it into a *DESeqDataSet* object, the next step in the workflow is the quality control of the data. Let's check how many genes we capture by counting the number of genes that have non--zero counts in all samples.

```{r non_zero_counts, dependson="load_counts"}
GeneCounts <- counts(DESeq2Table)
idx.nz <- apply(GeneCounts, 1, function(x) { all(x > 0)})
sum(idx.nz)

### random sample from the count matrix
nz.counts <- subset(GeneCounts, idx.nz)
sam <- sample(dim(nz.counts)[1], 5)
nz.counts[sam, ]
```

In a typical RNA--Seq experiment, there will be at least several thousand genes that are expressed in all samples. If the number of non--zero genes is very low, there is usually something wrong with at least some of the samples (e.g. the sample preparation was not done properly, or there was some problem with the sequencing).

# Normalization

As different libraries will be sequenced to different depths, offsets are built into the statistical model of `r Biocpkg("DESeq2")` to ensure that parameters are comparable. The term normalization is often used for this, but it should be noted that the raw read counts are not actually altered. `r Biocpkg("DESeq2")` defines a virtual reference sample by taking the geometric mean of each gene's values across samples and then computes size factors as the median of the ratios of each sample to the reference sample. Generally, the ratios of the size factors should roughly match the ratios of the library sizes. Dividing each column of the count table by the corresponding size factor yields normalized count values, which can be scaled to give a counts--per--million interpretation. Thus, if all size factors are roughly equal to one, the libraries have been sequenced equally deeply.

Count data sets typically show a (left--facing) trombone shape in mean--difference plots (MA--plots), reflecting the higher variability of log ratios at lower counts.
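The median--of--ratios computation just described can be sketched in a few lines of base R. The counts below are made up, and DESeq2's own estimateSizeFactors additionally handles genes with zero counts, so this is only an illustration of the idea:

```r
# toy count matrix: 4 genes x 3 samples, where sample 2 is "sequenced
# twice as deeply" as sample 1, and sample 3 three times as deeply
cnts <- matrix(c(10, 20, 15, 100,
                 20, 40, 30, 200,
                 30, 60, 45, 300),
               ncol = 3)

# virtual reference sample: per-gene geometric mean across samples
logGeoMeans <- rowMeans(log(cnts))

# size factor per sample: median of the ratios to the reference
sizeFactors <- apply(cnts, 2, function(col)
  exp(median(log(col) - logGeoMeans)))

# the size factors come out in the ratio 1 : 2 : 3,
# matching the ratios of the library sizes
round(sizeFactors, 2)
```

Note that the size factors are only defined up to a common scale; what matters for the model are their ratios.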
In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide. In the code below we produce density estimates of the sample counts and pairwise mean--difference plots after the computation of the size factors. The MA--plots are saved to a file in order to avoid cluttering the document.

We first relevel the condition factor to make sure that we have the wild type samples as the base level. This will allow us to always obtain the fold changes as WT--deletion. We also rename the deletion level to "Del".

```{r relevel_condition, dependson="load_counts"}
colData(DESeq2Table)$condition <- factor(colData(DESeq2Table)$condition,
                                         levels = c("WT", "Del"))
# "Del_8_17_homo" is not among the declared levels and has become NA;
# recode these NAs as "Del"
colData(DESeq2Table)$condition <- factor(ifelse(is.na(colData(DESeq2Table)$condition),
                                                "Del", "WT"),
                                         levels = c("WT", "Del"))
```

We now estimate the size factors and print them.

```{r getSizeFactors, dependson="load_counts"}
DESeq2Table <- estimateSizeFactors(DESeq2Table)
sizeFactors(DESeq2Table)
```

To assess whether the normalization has worked, we plot the densities of counts for the different samples. Since most of the genes are not (strongly) affected by the experimental condition, a successful normalization will lead to overlapping densities.

```{r plotDensities, dependson="getSizeFactors"}
multidensity( counts(DESeq2Table, normalized = TRUE)[idx.nz ,],
              xlab="mean counts", xlim=c(0, 1000))
```

This is indeed the case here. The same holds true for the empirical cumulative distribution functions (ECDFs), which can be thought of as integrals of the densities and give the probability of observing a count value equal to $x$ or less. In an ECDF plot, the estimated probability is plotted on the y--axis and the count values on the x--axis, i.e. you can read off the median and other quantiles from this plot. As already mentioned, if the normalization has worked, the ECDFs of the different samples should be overlapping.
```{r plotECDFs, dependson=c("non_zero_counts", "load_counts", "getSizeFactors")}
multiecdf( counts(DESeq2Table, normalized = TRUE)[idx.nz ,],
           xlab="mean counts", xlim=c(0, 1000))
```

To further assess systematic differences between the samples, we can also plot pairwise mean--difference plots: we plot the average of the log--transformed counts vs the log fold change per gene for each of the sample pairs. The function combn helps us to automatically generate all sample pairs, and the function MDPlot from the `r Biocpkg("EDASeq")` package then generates the pairwise MA plots.

```{r pairwiseMAs, dependson=c("non_zero_counts", "load_counts", "getSizeFactors")}
pdf("pairwiseMAs.pdf")
MA.idx = t(combn(1:8, 2))
for( i in seq_along( MA.idx[,1])){
  MDPlot(counts(DESeq2Table, normalized = TRUE)[idx.nz ,],
         c(MA.idx[i,1], MA.idx[i,2]),
         main = paste( colnames(DESeq2Table)[MA.idx[i,1]], " vs ",
                       colnames(DESeq2Table)[MA.idx[i,2]] ),
         ylim = c(-3,3))
}
dev.off()
```

The plots look good: if there is no systematic shift between the samples, the log fold changes should scatter around zero, which is the case here.

# PCA and sample heatmaps

A useful first step in an RNA--Seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit the expectation from the experiment's design? We use the R function *dist* to calculate the Euclidean distance between samples. To avoid the distance measure being dominated by a few highly variable genes, and to have a roughly equal contribution from all genes, we use it on the regularized log--transformed data.
The aim of the regularized log--transform is to stabilize the variance of the data and to make its distribution roughly symmetric. Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal--component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observable quantity (i.e., here, the expression strength of a gene) does not depend on the mean. In RNA--Seq data, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with low counts tend to dominate the results because, due to the strong Poisson noise inherent to small count values, they show the strongest relative differences between samples. Note that this effect can be diminished by adding a relatively high pseudocount, e.g. 32, since this will also substantially reduce the variance of the fold changes.

As a solution, `r Biocpkg("DESeq2")` offers the regularized--logarithm transformation, or rlog for short. For genes with high counts, the rlog transformation does not differ much from an ordinary log2 transformation. For genes with lower counts, however, the values are shrunken towards the genes' averages across all samples. Using an empirical Bayesian prior in the form of a ridge penalty, this is done such that the rlog--transformed data are approximately homoskedastic. Note that the rlog transformation is provided for applications other than differential testing; for differential testing it is always recommended to apply the DESeq function to raw counts.
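The variance-stabilizing effect can be seen in a toy simulation (the simulated counts below are not the mouse data; this is only a sketch comparing a shifted-log transform with the rlog used in the next section):

```r
library(DESeq2)

# simulate genes spanning a wide range of mean counts, no true
# differences between the two conditions
set.seed(1)
means <- rep(c(2, 20, 200, 2000), each = 25)
cnts  <- matrix(rnbinom(length(means) * 6, mu = means, size = 10),
                ncol = 6)
dds <- DESeqDataSetFromMatrix(cnts,
                              DataFrame(condition = factor(rep(c("A", "B"), each = 3))),
                              design = ~ condition)
dds <- estimateSizeFactors(dds)

# shifted log2 vs regularized log
logCnts <- log2(counts(dds, normalized = TRUE) + 1)
rld     <- rlogTransformation(dds, blind = TRUE)

# per-gene standard deviations: the rlog tames the inflated
# variability of the weakly expressed genes
summary(apply(logCnts, 1, sd))
summary(apply(assay(rld), 1, sd))
```

On the shifted-log scale the weakly expressed genes show much larger per-gene standard deviations than the strongly expressed ones; after the rlog the spread is far more uniform, which is exactly the homoskedasticity that distance calculations and PCA rely on.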
Note the use of the function *t* to transpose the data matrix. We need this because dist calculates distances between data rows, and our samples constitute the columns. We visualize the distances in a heatmap, using the function heatmap.2 from the `r CRANpkg("gplots")` package. Note that we have changed the row names of the distance matrix to contain the genotype and the column names to contain the sample ID, so that we have all this information in view when looking at the heatmap.

```{r sample_heatmaps, fig.cap="sample heatmaps", fig.height=20, fig.width=20, dependson="load_counts"}
rld <- rlogTransformation(DESeq2Table, blind=TRUE)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colData(rld)$condition
colnames(mat) <- colData(rld)$sampleNO

hmcol <- colorRampPalette(brewer.pal(9, "Blues"))(255)
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
```

Another way to visualize sample--to--sample distances is a principal--components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out optimally.

```{r PCA_computations, dependson="sample_heatmaps"}
ntop = 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]

PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
```

Since PCA can be slightly problematic with high--dimensional data, we first select only the `r ntop` genes showing the highest variance. We then use the function prcomp to compute the principal components. The resulting principal component scores, i.e. the coordinates of the samples in the space of the principal components, can then be plotted. For this we use the function qplot from `r CRANpkg("ggplot2")`. `r CRANpkg("ggplot2")` is a comprehensive plotting package for R; a very good tutorial is [here](http://jofrhwld.github.io/avml2012/).
The function qplot is supposed to mimic the standard plotting function plot as closely as possible, but additional changes can be made, e.g. lattice--type graphics (splitting the plot by a factor of interest) can easily be generated. Additional elements are changed by using + and the corresponding commands. In our PCA plot, we change the plotting colors used and the axis labels.

```{r PCA_plots, fig.cap="PCA plot", fig.width=7.5, fig.height=6, dependson="PCA_computations"}
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
                    PC3 = PCA$x[,3], PC4 = PCA$x[,4],
                    sampleNO = colData(rld)$sampleNO,
                    condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition,
       main = "PC1 vs PC2, top variable genes", size = I(6)) +
   labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
        y = paste0("PC2, VarExp:", round(percentVar[2],4))) +
   scale_colour_brewer(type="qual", palette=2))
```

We can clearly identify two outliers in the PCA plot, one in each experimental group. These two outliers essentially provide the separation along the first principal component. Thus, these outliers will most likely increase the overall variability and thus diminish statistical power later when testing for differential expression. Therefore we remove them. We can get them by selecting only those samples that have a value greater than zero for the first principal component score. This can conveniently be done using the function subset and specifying the logical vector dataGG$PC1 > 0. subset will then return the sample names at the positions where dataGG$PC1 > 0 evaluates to TRUE.

```{r identify_and_remove_outliers, dependson="PCA_plots"}
outliers <- as.character(subset(colnames(DESeq2Table), dataGG$PC1 > 0))
outliers
DESeq2Table <- DESeq2Table[, !(colnames(DESeq2Table) %in% outliers)]
```

We now produce the cluster heatmap and the PCA plot without the outliers.
```{r sample_heatmaps_no_outlier, fig.cap="Quality control plots without outliers", fig.width=20, fig.height=20, dependson="identify_and_remove_outliers"}
rld <- rlogTransformation(DESeq2Table, blind=TRUE)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colData(rld)$condition
colnames(mat) <- colData(rld)$sampleNO

hmcol <- colorRampPalette(brewer.pal(9, "OrRd"))(255)
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
```

```{r PCA_no_outlier, fig.cap="PCA plot without outliers", fig.width=7.5, fig.height=6}
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]

PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)

dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
                    PC3 = PCA$x[,3], PC4 = PCA$x[,4],
                    sampleNO = colData(rld)$sampleNO,
                    condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition,
       main = "PC1 vs PC2, top variable genes, no outliers", size = I(6)) +
   labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
        y = paste0("PC2, VarExp:", round(percentVar[2],4))) +
   scale_colour_brewer(type="qual", palette=4))
```

The separation between the two groups is not extremely clear: the samples do not cluster by group in the heatmap, and in the PCA plot they are separated only by the second principal component. This will most likely result in only a modest number of differentially expressed genes, and is in line with the small number of differentially expressed genes (~ 100) the authors report in the original publication.

# Differential expression analysis

## Dispersion estimation

It is known that a standard Poisson model can only account for the technical noise in RNA--Seq data. In the Poisson model the variance is equal to the mean, while in real RNA--Seq data the variance is greater than the mean, a phenomenon often encountered in count data in general and referred to as "overdispersion".
A popular way to model this is to use a negative binomial (NB) distribution, which includes an additional dispersion parameter $\alpha$ such that $E[\text{NB}(\mu, \alpha)] = \mu$ and $\text{Var}[\text{NB}(\mu, \alpha)] = \mu + \alpha \cdot \mu^2$. Hence, the variance is greater than the mean. `r Biocpkg("DESeq2")` also uses the NB model. Hence, the first step in the analysis of differential expression is to obtain an estimate of the dispersion parameter for each gene. The typical shape of the dispersion fit is an exponentially decaying curve. The asymptotic dispersion for highly expressed genes can be seen as a measurement of biological variability in the sense of a squared coefficient of variation: a dispersion value of 0.01 means that the gene's expression tends to differ by typically $\sqrt{0.01} = 10\%$ between samples of the same treatment group. For weakly expressed genes, the Poisson noise is an additional source of noise, which is added to the dispersion.

The function *plotDispEsts* visualizes `r Biocpkg("DESeq2")`'s dispersion estimates:

```{r dispersion_estimation, dependson=c("load_counts", "getSizeFactors")}
DESeq2Table <- estimateDispersions(DESeq2Table)
plotDispEsts(DESeq2Table)
```

The black points are the dispersion estimates for each gene as obtained by considering the information from each gene separately. Unless one has many samples, these values fluctuate strongly around their true values. Therefore, the red trend line is fitted, which shows the dispersions' dependence on the mean, and each gene's estimate is then shrunk towards the red line to obtain the final estimates (blue points) that are then used in the hypothesis test. The blue circles above the main "cloud" of points are genes which have high gene--wise dispersion estimates and are labelled as dispersion outliers. These estimates are therefore not shrunk toward the fitted trend line. The warnings just indicate that the dispersion estimation failed for some genes.
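The NB mean--variance relationship above can be checked with a quick simulation in base R. Note that rnbinom uses the mu/size parameterization, where size corresponds to $1/\alpha$:

```r
set.seed(42)

mu    <- 100   # mean expression
alpha <- 0.05  # dispersion

# base R's NB parameterization: size = 1 / alpha
x <- rnbinom(1e5, mu = mu, size = 1 / alpha)

mean(x)            # close to mu = 100
var(x)             # close to the theoretical NB variance below
mu + alpha * mu^2  # mu + alpha * mu^2 = 600, well above the Poisson variance of 100
```

The gap between the empirical variance and the mean is exactly the overdispersion that the per-gene $\alpha$ estimates capture.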
# Statistical testing of differential expression

We can now perform the statistical testing for differential expression and extract its results. Calling *nbinomWaldTest* performs the test for differential expression, while the call to the *results* function extracts the results of the test and returns adjusted $p$--values according to the Benjamini--Hochberg rule to control the FDR. The test performed is a Wald test, which is a test for coefficients in a regression model. It is based on a z--score, i.e. a test statistic that is $N(0,1)$--distributed under $H_0$.

```{r DESeq2_Results, dependson=c("load_counts", "getSizeFactors", "dispersion_estimation")}
DESeq2Table <- nbinomWaldTest(DESeq2Table)
DESeq2Res <- results(DESeq2Table, pAdjustMethod = "BH")
table(DESeq2Res$padj < 0.1)
```

We identify `r sum(na.exclude(DESeq2Res$padj < 0.1))` differentially expressed genes.

## Independent filtering

In addition to this, `r Biocpkg("DESeq2")` performs automated independent filtering: For weakly expressed genes, we have no chance of seeing differential expression, because the low read counts suffer from such high Poisson noise that any biological effect is drowned in the uncertainties from the read counting. At first sight, there may seem to be little benefit in filtering out these genes. After all, the test found them to be non--significant anyway. However, these genes have an influence on the multiple testing procedure, whose performance commonly improves if such genes are removed. By removing the weakly expressed genes from the input to the BH--FDR procedure, we can find more genes to be significant among those which we keep, and so improve the power of our test. This approach is known as independent filtering. The `r Biocpkg("DESeq2")` software automatically performs independent filtering using a threshold that maximizes the number of genes which will have a BH--adjusted $p$--value less than a critical value (by default, alpha is set to 0.1).
This automatic independent filtering is performed by, and can be controlled by, the results function. We can observe how the number of rejections changes for various cutoffs based on the mean normalized count. The optimal threshold and the table of possible values are stored as an attribute of the results object.

```{r independent_filtering_res, dependson="DESeq2_Results"}
plot(metadata(DESeq2Res)$filterNumRej, type="b",
     xlab="quantiles of 'baseMean'",
     ylab="number of rejections")
```

The term **independent** highlights an important caveat. Such filtering is permissible only if the filter criterion is independent of the actual test statistic under the null hypothesis. Otherwise, the filtering would invalidate the test and consequently the assumptions of the BH procedure. This is why we filtered on the average over all samples: this filter is blind to the assignment of samples to the deletion and control groups and hence independent under the null hypothesis of equal expression.

## Inspection and correction of $p$--values

Null $p$--values follow a uniform distribution on the unit interval [0,1] if they are computed using a continuous null distribution. Significant $p$--values thus become visible as an enrichment of $p$--values near zero in the histogram. Thus, the histogram of "correctly" computed $p$--values will have a rectangular shape with a peak at 0. A histogram of the $p$--values should always be plotted in order to check whether they have been computed correctly. We also do this here:

```{r pvalues_from_DESeq, fig.cap="p-values, wrong null distribution", dependson="DESeq2_Results"}
hist(DESeq2Res$pvalue, col = "lavender",
     main = "WT vs Deletion", xlab = "p-values")
```

We can see that this is clearly not the case for the $p$--values returned by `r Biocpkg("DESeq2")` here. Very often, if the assumed variance of the null distribution is too high, we see a hill--shaped $p$--value histogram.
If the variance is too low, we get a U--shaped histogram, with peaks at both ends. Here we have a hill shape, indicating an overestimation of the variance in the null distribution. Thus, the $N(0,1)$ null distribution of the Wald test is not appropriate here.

The dispersion estimation is not condition specific and estimates only a single dispersion parameter per gene. This is sensible, since the number of replicates is usually low. However, if we have e.g. batches or "outlying" samples that are consistently a bit different from the others within a group, the dispersion within the experimental group can be different, and a single dispersion parameter may not be appropriate. For an example of the estimation of multiple dispersions, see the analysis performed in: [Reyes et al. -- Drift and conservation of differential exon usage across tissues in primate species, 2013](http://dx.doi.org/10.1073/pnas.1307202110)

Fortunately, there is software available to estimate the variance of the null model from the test statistics. This is commonly referred to as "empirical null modelling". Here we use the `r CRANpkg("fdrtool")` package for this, using the Wald statistic as input. This package returns the estimated null variance, as well as estimates of various other FDR--related quantities and the $p$--values computed using the estimated null model parameters.

We first remove the genes filtered out by independent filtering and the dispersion outliers; they have NA adjusted $p$--values and NA $p$--values, respectively.

```{r remove_outliers, dependson="DESeq2_Results"}
DESeq2Res <- DESeq2Res[ !is.na(DESeq2Res$padj), ]
DESeq2Res <- DESeq2Res[ !is.na(DESeq2Res$pvalue), ]
```

We now remove the original adjusted $p$--values, since we will add the corrected ones later on.
```{r remove_original_pvals, dependson=c("DESeq2_Results", "remove_outliers")}
DESeq2Res <- DESeq2Res[, -which(names(DESeq2Res) == "padj")]
```

We can now use the z--scores returned by `r Biocpkg("DESeq2")` as input to `r CRANpkg("fdrtool")` to re--estimate the $p$--values.

```{r reEstimatePvals, dependson="remove_original_pvals"}
FDR.DESeq2Res <- fdrtool(DESeq2Res$stat, statistic = "normal", plot = TRUE)
FDR.DESeq2Res$param[1, "sd"]
```

Interestingly, the estimated null model standard deviation is `r FDR.DESeq2Res$param[1, "sd"]`, which is lower than $1$, the theoretical value. We now add the new BH--adjusted $p$--values to the results data frame.

```{r addNewPvals, dependson="reEstimatePvals"}
DESeq2Res[,"padj"] <- p.adjust(FDR.DESeq2Res$pval, method = "BH")
```

We can now plot the histogram of the "correct" $p$--values.

```{r corrected_pvalues, fig.cap="p-values, correct null distribution", dependson="addNewPvals"}
hist(FDR.DESeq2Res$pval, col = "royalblue4",
     main = "WT vs Deletion, correct null model", xlab = "CORRECTED p-values")
```

As we can see, the null model is now correct and the histogram has the expected shape.

## Extracting differentially expressed genes

We can now extract the number of differentially expressed genes and produce an MA plot of the two--group comparison, as shown in figure 4a of the original paper ("M" for minus, because a log ratio is equal to log minus log, and "A" for average). Similar to its use in the initial quality control, an MA plot provides a useful overview for an experiment with a two--group comparison: each gene is represented by a dot. The $x$--axis is the average expression over all samples, the $y$--axis the log2 fold change between treatment and control. Genes with an FDR value below a threshold (here 0.1) are shown in red. The plot illustrates that only genes with a sufficiently large average normalized count carry enough information to yield a significant call.
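The quantities behind a generic MA plot can be computed directly; here is a base--R sketch with hypothetical normalized counts for four toy genes (the values are made up for illustration):

```r
## "M" (minus) and "A" (average) for an MA plot, from normalized counts.
## A pseudocount of 0.5 avoids taking log2 of zero.
treatment <- c(120, 30, 2000, 5)
control   <- c( 60, 33,  480, 6)

M <- log2(treatment + 0.5) - log2(control + 0.5)          # log ratio = log minus log
A <- 0.5 * (log2(treatment + 0.5) + log2(control + 0.5))  # average log2 expression

round(cbind(A, M), 2)
```

Gene 1 and gene 3 are up in the treatment group (positive M), gene 2 slightly down; genes with low A, like gene 4, have noisy M values, which is exactly what the shrinkage discussed below addresses.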
Also note `r Biocpkg("DESeq2")`'s shrinkage estimation of log fold changes (LFCs): when count values are too low to allow an accurate estimate of the LFC, the value is "shrunken" towards zero to prevent these values, which would otherwise frequently be unrealistically large, from dominating the top--ranked log fold changes. Thus, in contrast to the "trumpet--shaped" MA plot in the original paper (which used the original `r Biocpkg("DESeq")`), we obtain a "cigar--shaped" MA plot.

```{r DE_Genes, fig.cap="MA plot for DESeq2 analysis", dependson="addNewPvals"}
table(DESeq2Res[,"padj"] < 0.1)
plotMA(DESeq2Res)
```

We now identify `r sum(DESeq2Res[,"padj"] < 0.1)` differentially expressed genes at an FDR of $0.1$.

## Check overlap with the paper results

We can now check the overlap with the original results of the paper. In the original publication, 100 genes with an FDR value of less than $0.05$ were identified. The Excel table "ng.2971-S3.xls" containing their fold changes and $p$--values is available as supplementary table 4 of the original publication. We convert the table into a .csv file and import it.

```{r check_overlap_import_table}
paperRes <- read.csv("ng.2971-S3.csv", skip = 1, header = TRUE)
names(paperRes)
```

We now extract our significant genes and their annotation.

```{r check_overlap_extract_sig_genes, dependson="addNewPvals"}
sigGenes <- rownames(subset(DESeq2Res, padj < 0.1))
anno <- AnnotationDbi::select(org.Mm.eg.db,
                              keys = rownames(DESeq2Res),
                              columns = c("SYMBOL", "GENENAME"),
                              keytype = "ENSEMBL")
anSig <- as.data.frame(subset(anno, ENSEMBL %in% sigGenes))
sample_n(anSig, 5)
```

We now extract the genes identified in the original paper that we also call differentially expressed.

```{r}
overlap <- subset(paperRes, Gene.name %in% anSig$SYMBOL)
dim(overlap)
```

We can now ask how many of the genes that we identify with FDR < 0.1 also had FDR < 0.1 in the original paper,
and get the total number of genes in the original paper with FDR < 0.1 (there are `r dim(subset(paperRes, padj < 0.1))[1]`).

```{r}
dim(subset(overlap, padj < 0.1))[1]
dim(subset(paperRes, padj < 0.1))[1]
```

Out of the `r sum(DESeq2Res[,"padj"] < 0.1)` differentially expressed genes that we identify at an FDR of $0.1$, `r dim(subset(overlap, padj < 0.1))[1]` have also been identified in the original paper (out of `r dim(subset(paperRes, padj < 0.1))[1]` genes reported there with an FDR of $0.1$). The MA plot also looks very similar to the one in the paper. Thus, we can reproduce the paper's results with a slightly different software setup, and we gain power by correcting the null model. The "blood" and "ribosomal proteins" parts of the MA plot marked in the paper figure are also clearly visible.

## Check a validated gene

A quick way to visualize the counts for a particular gene is the *plotCounts* function, which takes as arguments the *DESeqDataSet*, a gene name, and the group over which to plot the counts. Some of the differentially expressed genes identified via RNA--Seq have been validated by qPCR (see figure 4b of the paper). Here, we check the expression of [Nr2f1](http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000069171;r=13:78188973-78199757).

```{r plotcounts}
Nr2f1ENSEMBL <- subset(anSig, SYMBOL == "Nr2f1")$ENSEMBL
plotCounts(DESeq2Table, gene = Nr2f1ENSEMBL, intgroup = c("condition"))
```

# Gene ontology enrichment analysis

We can now try to characterize the identified differentially expressed genes a bit better by performing a GO enrichment analysis. Essentially, the gene ontology (<http://www.geneontology.org/>) is a hierarchically organized collection of functional gene sets. For a nice introduction to the GO, see

* [du Plessis et al.
-- The what, where, how and why of gene ontology -- a primer for bioinformaticians, Briefings in Bioinformatics, 2011](http://dx.doi.org/10.1093/bib/bbr002)

## Matching the background set

The function *genefinder* from the `r Biocpkg("genefilter")` package will be used to find background genes that are similar in expression to the differentially expressed genes: it tries to identify 10 genes for each DE gene that match its expression strength. We then check whether the background has roughly the same distribution of average expression strength as the foreground by plotting the densities.

We do this in order not to select a biased background, since the gene set testing is performed by a simple Fisher test on a 2x2 table. This test uses only the status of a gene, i.e. whether it is differentially expressed or not, and not its fold change or absolute expression. Note that for RNA--Seq data, the chance of a gene being identified as DE will most probably depend on its expression strength (and potentially also on its gene length, etc.). Thus it is important to find a matching background. Our testing approach here is very similar to web tools like [DAVID](http://david.abcc.ncifcrf.gov/); however, we explicitly model the background.

We first get the average expression for each gene and then find non--DE genes with expression similar to the DE genes. These genes are then our background.

```{r GO_Analysis_create_backgrounds}
overallBaseMean <- as.matrix(DESeq2Res[, "baseMean", drop = FALSE])

sig_idx <- match(anSig$ENSEMBL, rownames(overallBaseMean))

backG <- c()
for(i in sig_idx){
  ind <- genefinder(overallBaseMean, i, 10, method = "manhattan")[[1]]$indices
  backG <- c(backG, ind)
}

backG <- unique(backG)
backG <- rownames(overallBaseMean)[backG]
```

We now remove the DE genes from the background and get the total number of genes in the background.
```{r remove_DE_from_backG}
backG <- setdiff(backG, anSig$ENSEMBL)
length(backG)
```

Plotting the densities of the average expressions shows that the background matching has worked reasonably well.

```{r GO, fig.cap="matching foreground and background", dependson="remove_DE_from_backG"}
multidensity(list(
       all        = log2(DESeq2Res[, "baseMean"]),
       foreground = log2(DESeq2Res[anSig$ENSEMBL, "baseMean"]),
       background = log2(DESeq2Res[backG, "baseMean"])),
     xlab = "log2 mean normalized counts",
     main = "Matching for enrichment analysis")
```

We can now perform the actual testing. For this purpose we use the `r Biocpkg("topGO")` package, which implements a nice interface to Fisher--test based enrichment testing and also offers additional algorithms that take the GO structure into account, e.g. by reporting only the most specific gene set in the hierarchy. For more details, see the paper:

* [Alexa et al. -- Improved scoring of functional groups from gene expression data by decorrelating GO graph structure, Bioinformatics, 2006](http://dx.doi.org/10.1093/bioinformatics/btl140)

The GO has three top ontologies: cellular component (CC), biological process (BP), and molecular function (MF). Here we test all of them, one after the other.

## Running topGO

We first create a factor `alg` which indicates for every gene in our universe (the union of background and DE genes) whether it is differentially expressed or not. It has the ENSEMBL IDs of the genes in our universe as names and contains 1 if the gene is DE and 0 otherwise.

```{r create factor of interesting genes, dependson="remove_DE_from_backG"}
onts <- c("MF", "BP", "CC")

geneIDs <- rownames(overallBaseMean)
inUniverse <- geneIDs %in% c(anSig$ENSEMBL, backG)
inSelection <- geneIDs %in% anSig$ENSEMBL

alg <- factor(as.integer(inSelection[inUniverse]))
names(alg) <- geneIDs[inUniverse]
```

We first initialize the `r Biocpkg("topGO")` data set for each top ontology, using the GO annotations contained in the mouse annotation data base `r Biocannopkg("org.Mm.eg.db")`.
The *nodeSize* parameter specifies a minimum size for the GO categories we want to use: here, categories with fewer than 5 genes are not included in the testing. Since we use ENSEMBL IDs as keys, we have to specify this explicitly, as `r Biocpkg("topGO")` uses ENTREZ identifiers by default.

Now the tests can be run. `r Biocpkg("topGO")` offers a wide range of options; for details see the paper or the package vignette. Here we run two common tests: an ordinary Fisher test for every GO category, and the "elim" algorithm, which tries to incorporate the hierarchical structure of the GO and to "decorrelate" it. The "elim" algorithm starts processing the nodes/GO categories at the most specific (bottommost) level of the GO graph and then iteratively moves up to more general nodes. If a node is scored as significant, all of its genes are marked as removed in all ancestor nodes. This way, the "elim" algorithm aims at finding the most specific node for every gene.

The tests use a 0.01 $p$--value cutoff by default. We order by the "classic" algorithm to make the analysis comparable to the one performed in the original paper. The enrichment results of the original publication are reported in supplementary tables 6 and 7.

```{r run tests, results='hide', dependson="create factor of interesting genes"}
tab <- as.list(onts)
names(tab) <- onts

for(i in 1:3){
  ## prepare data
  tgd <- new("topGOdata", ontology = onts[i], allGenes = alg, nodeSize = 5,
             annot = annFUN.org, mapping = "org.Mm.eg.db", ID = "ensembl")

  ## run tests
  resultTopGO.elim    <- runTest(tgd, algorithm = "elim",    statistic = "Fisher")
  resultTopGO.classic <- runTest(tgd, algorithm = "classic", statistic = "Fisher")

  ## look at results
  tab[[i]] <- GenTable(tgd, Fisher.elim = resultTopGO.elim,
                       Fisher.classic = resultTopGO.classic,
                       orderBy = "Fisher.classic", topNodes = 200)
}
```

We can now look at the results: we inspect the top 200 GO categories according to the "Fisher classic" algorithm. The function *GenTable* produces a table of significant GO categories.
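The per--category test underlying the "classic" algorithm is an ordinary Fisher test on a 2x2 table of DE status versus category membership. A minimal base--R illustration with made--up counts (the numbers are hypothetical, not taken from this analysis):

```r
## Fisher test on a 2x2 table, as run per GO category by the "classic" algorithm.
## Hypothetical counts: 40 of 100 DE genes fall in the category,
## versus 200 of 1900 non-DE genes in the universe.
contingency <- matrix(c(40,  60,     # DE genes:     in category / not in category
                        200, 1700),  # non-DE genes: in category / not in category
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("DE", "notDE"),
                                      c("inGO", "notInGO")))

res <- fisher.test(contingency, alternative = "greater")
res$p.value                          # very small: the category is enriched among DE genes
```

The one--sided alternative asks whether category membership is over--represented among the DE genes; this is why a well--matched background is so important, since any bias in the universe directly distorts the table.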
Finally, we bind all resulting tables together and write them to a .csv file that can be viewed with spreadsheet programs.

```{r process results, dependson="run tests"}
topGOResults <- rbind.fill(tab)
write.csv(topGOResults, file = "topGOResults.csv")
```

Gene set enrichment analysis is a field of very active research in bioinformatics. For additional approaches, see the `r Biocpkg("topGO")` vignette and the references therein, as well as the GeneSetEnrichment view in Bioconductor:

* <http://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment>

# Session information

As the last part of this document, we call the function *sessionInfo*, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it will help to track down what has happened in case an R script ceases to work because functions have been changed in a newer version of a package. The session information should also **always** be included in any posts to the [Bioconductor support site](https://support.bioconductor.org), along with all code used in the analysis.
```{r}
sessionInfo()
```