Required packages:
library(tidyverse)
library(DESeq2)
library(airway)
Table taken from [1].
We can transform these study reports into a 2x2 matrix:
therapy <- matrix(c(21,15,2,3), nrow=2)
rownames(therapy) = c("surgery","radio")
colnames(therapy) = c("controlled", "not controlled")
therapy
## controlled not controlled
## surgery 21 2
## radio 15 3
Then perform a chi-square test:
chisq.test(therapy)
## Warning in chisq.test(therapy): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: therapy
## X-squared = 0.085967, df = 1, p-value = 0.7694
R gives a warning, because in tables with low counts (<5) it is usually advised to use the Fisher-test instead of the chi-square test. The null and alternative hypotheses are the same in both cases, but the technical details differ.
fisher.test(therapy)
##
## Fisher's Exact Test for Count Data
##
## data: therapy
## p-value = 0.6384
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2089115 27.5538747
## sample estimates:
## odds ratio
## 2.061731
For visualization, we can use the mosaic plot:
mosaicplot(therapy, shade=TRUE)
In the PlantGrowth
data, two treatments and a control condition were tested with respect to the yield, i.e. the resulting weights of the plants.
As always, we first look at the data:
ggplot(PlantGrowth, aes(x=group, y=weight))+
geom_boxplot(outlier.alpha=0)+
geom_jitter(width=0.25)+
scale_y_continuous(limits=c(0,7))
To test whether any of the groups differ from each other, we can use the ANOVA F-test. It tests against the null hypothesis that all three groups are equal.
my_aov <- aov(lm(weight ~ group, data=PlantGrowth))
my_aov
## Call:
## aov(formula = lm(weight ~ group, data = PlantGrowth))
##
## Terms:
## group Residuals
## Sum of Squares 3.76634 10.49209
## Deg. of Freedom 2 27
##
## Residual standard error: 0.6233746
## Estimated effects may be unbalanced
summary(my_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 3.766 1.8832 4.846 0.0159 *
## Residuals 27 10.492 0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To explain the code: The ANOVA is performed on a linear model object, which we define with the lm
function. Here, we set up a linear model that models the plant weight as a function of the treatment group.
The ANOVA is significant, which means not all groups are equal. We can then perform a follow-up Tukey test, to obtain adjusted p-values for the individual comparisons:
TukeyHSD(my_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lm(weight ~ group, data = PlantGrowth))
##
## $group
## diff lwr upr p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
Three t-tests with Bonferroni correction:
control <- filter(PlantGrowth, group=="ctrl") %>% select(weight)
trt1 <- filter(PlantGrowth, group=="trt1") %>% select(weight)
trt2 <- filter(PlantGrowth, group=="trt2") %>% select(weight)
p_vals <- c(t.test(control,trt1)$p.value,
t.test(control,trt2)$p.value,
t.test(trt1,trt2)$p.value)
p_vals
## [1] 0.250382509 0.047899256 0.009298405
p.adjust(p_vals,method="bonferroni")
## [1] 0.75114753 0.14369777 0.02789521
These two procedures (ANOVA followed by TukeyHSD and Bonferroni-corrected t-tests) test against the same null hypothesis: “All groups are equal.” But they differ technically, resulting in slightly different adjusted p-values.
This in an example from Chapter 6 in MSMB.
data("airway")
aw = DESeqDataSet(se = airway, design = ~ cell + dex)
aw = DESeq(aw)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
awde = as.data.frame(results(aw)) %>% dplyr::filter(!is.na(pvalue))
The above commands load the airway
data and run a differential expression analysis on them. This data set contains read counts per gene for airway smooth muscle cell lines in an RNA-Seq experiment. The aim of this experiment was to compare the treatment with dex
(dexamethasone, a drug which is used to treat asthma) to a control (no treatment). Another variable that differs between the samples is the cell line, encoded by cell
. DESeq2
runs a test for every gene to check whether the dex
treatment has a significant effect on its expression. The design ~cell + dex
specifies a model where the read count is modeled as the sum of the cell line and a treatment effect. The residuals are modeled as gamma-poisson distributed.
We can make a histogram of the p-values:
awde %>%
ggplot(aes(pvalue))+
geom_histogram(binwidth=0.025)
Usually, we control for the false discovery rate in high-throughput experiments. This is done with the Benjamini-Hochberg algorithm:
adjusted_pvalues <- p.adjust(awde$pvalue, method="BH")
Let’s compare the number of DE genes found with un-adjusted and adjusted p-values:
# unadjusted with alpha = 0.5
sum(awde$pvalue < 0.05)
## [1] 5677
# adkusted with FDR = 0.1
sum(adjusted_pvalues < 0.05)
## [1] 3467
[1] Oehlert, G. W. (2000). A first course in design and analysis of experiments.