This demonstration roughly follows an example shown here: https://rafalab.github.io/pages/harvardx.html.
We will draw a sample of mice fed with high fat diet and decide whether their average weight is different from the average weight of the mice in the control group.
First, we load the packages and the data:
library(dplyr)    # for filter() and %>%
library(ggplot2)  # for the plots below

mice_pheno <- read.csv2(file= url("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"), sep=",")
# read.csv2 expects "," as the decimal separator, so Bodyweight is read as
# character and must be converted explicitly:
mice_pheno$Bodyweight <- as.numeric(mice_pheno$Bodyweight)
str(mice_pheno)
## 'data.frame': 846 obs. of 3 variables:
## $ Sex : chr "F" "F" "F" "F" ...
## $ Diet : chr "hf" "hf" "hf" "hf" ...
## $ Bodyweight: num 31.9 32.5 22.8 19.9 32.2 ...
The loaded object is a data.frame with three columns, giving the sex, diet and body weight of the mice.
Filter out NAs:
mice_pheno <- filter(mice_pheno, !is.na(Bodyweight))
For demonstration purposes, we create one weight vector per group:
control <- filter(mice_pheno, Sex=="M", Diet=="chow")$Bodyweight
hf_Diet <- filter(mice_pheno, Sex=="M", Diet == "hf")$Bodyweight
population <- c(control,hf_Diet)
length(control)
## [1] 223
length(hf_Diet)
## [1] 193
We want to compare the two diets. As the two groups of mice are both large, we are going to assume that they are our “populations of interest”.
Suppose the true average weight of the control population, \(\mu_0\), is known.
mu0 <- mean(control)
We have a sample of \(n\) mice with high-fat diet and measure their weights:
set.seed(51)
n <- 15
my_sample <- sample(hf_Diet,n)
The observed weight difference between our sample and the control:
observed_diff <- mean(my_sample) - mu0
observed_diff
## [1] 3.338188
Null hypothesis: The true weight difference between the control group (mu0) and the high-fat diet mice is zero.
This part is adapted from: http://genomicsclass.github.io/book/pages/clt_in_practice.html
The t-statistic compares the difference between observed mean and \(\mu_0\), and standardizes it by the standard error:
\(t = \frac{\overline{x} - \mu_0}{SE}, \quad SE = \sqrt{\frac{s^2}{n}}\)
where \(s^2\) is the sample variance:
sample_se <- sqrt(var(my_sample)/n)
tstat <- observed_diff/sample_se
Because of the CLT, we expect that tstat follows a normal distribution with mean = 0 and sd = 1.
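As a quick sanity check of this expectation, here is a sketch using a simulated (hypothetical) population rather than the mice data: for a reasonably large sample size, t-statistics from repeated sampling have mean close to 0 and standard deviation close to 1.

```r
# Sketch with a simulated population (an assumption, not the mice data):
set.seed(1)
pop <- rnorm(100000, mean = 30, sd = 5)   # hypothetical population
mu  <- mean(pop)
n   <- 50                                 # reasonably large sample size
tstats <- replicate(10000, {
  s <- sample(pop, n)
  (mean(s) - mu) / sqrt(var(s) / n)       # t-statistic for one sample
})
mean(tstats)   # close to 0
sd(tstats)     # close to 1
```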
We can visually compare tstat to this expectation:
t_vals <- seq(-5,5, by=0.01)
density <- dnorm(t_vals, mean=0, sd=1)
ggplot(data.frame(t_vals,density), aes(t_vals,density))+
geom_line()+
geom_vline(xintercept=tstat)
We can calculate the (two-sided) p-value as the probability of observing a value at least as extreme as tstat under the normal distribution:
(pnorm(tstat, mean=0, sd=1,lower.tail=FALSE))*2
## [1] 0.0430436
We can approximate the true null distribution of the t statistic by repeatedly sampling from the whole population:
# choose a small sample size N
N <- 7
# under the null hypothesis both diets come from the same population,
# so we use the mean of the combined population as mu0
mu0 <- mean(population)
null_t <- rep(NA, 10000)
for(i in seq_along(null_t)){
null_sample <- sample(population, N)
se <- sqrt(var(null_sample)/N)
null_t[i] <- (mean(null_sample)-mu0) / se
}
Histogram of the null distribution:
data.frame(null_t) %>%
ggplot(aes(x=null_t)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Make a normal Q-Q plot of this distribution to check whether it is normal:
data.frame(null_t) %>%
ggplot(aes(sample=null_t))+
stat_qq()+ stat_qq_line()
The t-statistic is not normally distributed if \(n\) is too small:
- If \(n\) is too small, we cannot estimate the standard error well (because we calculate it from a small sample).
- Therefore the normal approximation given by the CLT does not hold.
- Extreme values of the t-statistic are more likely than under a normal distribution.
- Using the normal distribution anyway underestimates the p-value (i.e. it produces more false positives).
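The inflated false-positive rate can be illustrated with a small simulation (a sketch using a simulated standard-normal population, not the mice data): with N = 7, the fraction of null t-statistics beyond the normal 5% cutoff is clearly larger than 0.05.

```r
# Sketch: false-positive rate of the normal cutoff at small N
# (simulated population is an assumption, not the mice data)
set.seed(1)
N <- 7
cutoff <- qnorm(0.975)            # ~1.96 under the normal approximation
tstats <- replicate(50000, {
  s <- rnorm(N)                   # the null is true: population mean is 0
  mean(s) / sqrt(var(s) / N)
})
mean(abs(tstats) > cutoff)        # noticeably larger than the nominal 0.05
```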
The t-distribution has been found to be a suitable distribution for the t-statistic. It has heavier tails than the normal distribution, meaning that extreme values of \(t\) are more likely than under the normal distribution. We can visually compare our data to the t-distribution:
ggplot(data.frame(null_t), aes(sample=null_t))+
stat_qq(distribution=stats::qt, dparams=list(df=N-1))+
stat_qq_line(distribution=stats::qt, dparams=list(df=N-1))
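The heavier tails also show up directly in the quantiles: the two-sided 5% cutoff of the t-distribution with \(N - 1 = 6\) degrees of freedom is noticeably larger than the normal cutoff.

```r
# two-sided 5% cutoffs: normal vs. t with 6 degrees of freedom
qnorm(0.975)        # 1.959964
qt(0.975, df = 6)   # 2.446912
```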
We can now compare the p-values obtained when calculating the probability of \(t\) under the normal distribution versus under the t-distribution.
For the normal distribution:
(1-pnorm(tstat, mean=0, sd=1))*2
## [1] 0.0430436
For the t-distribution:
(1-pt(tstat,df=n-1))*2
## [1] 0.06256966
The t-test uses the t-distribution:
t.test(my_sample,mu = mean(control))
##
## One Sample t-test
##
## data: my_sample
## t = 2.0233, df = 14, p-value = 0.06257
## alternative hypothesis: true mean is not equal to 30.96381
## 95 percent confidence interval:
## 30.76335 37.84065
## sample estimates:
## mean of x
## 34.302
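To confirm the connection, here is a small self-contained sketch (with simulated data, not the mice) showing that t.test reproduces the manual calculation with pt and df = n - 1.

```r
# Sketch with hypothetical data: t.test vs. manual p-value calculation
set.seed(2)
x   <- rnorm(15, mean = 34, sd = 5)   # hypothetical sample
mu0 <- 31                             # hypothetical null mean
n   <- length(x)
tstat  <- (mean(x) - mu0) / sqrt(var(x) / n)
manual <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)
c(manual = manual, t_test = t.test(x, mu = mu0)$p.value)  # identical values
```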