This demonstration roughly follows an example shown here: https://rafalab.github.io/pages/harvardx.html.

We will draw a sample of mice fed a high-fat diet and decide whether their average weight differs from the average weight of the mice in the control group.

1 Data

First, we load the data we are going to work with. The `dplyr` and `ggplot2` packages are needed for the filtering and plotting steps below.

library(dplyr)    # for filter() and %>%
library(ggplot2)  # for the plots used throughout

mice_pheno <- read.csv(url("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"))
mice_pheno$Bodyweight <- as.numeric(mice_pheno$Bodyweight)
str(mice_pheno)
## 'data.frame':    846 obs. of  3 variables:
##  $ Sex       : chr  "F" "F" "F" "F" ...
##  $ Diet      : chr  "hf" "hf" "hf" "hf" ...
##  $ Bodyweight: num  31.9 32.5 22.8 19.9 32.2 ...

The loaded object is a data.frame with three columns, giving the sex, diet, and body weight of each mouse.

Filter out NAs:

mice_pheno <- filter(mice_pheno, !is.na(Bodyweight))

For demonstration purposes, we create one weight vector per diet group, restricting to male mice:

control <- filter(mice_pheno, Sex=="M", Diet=="chow")$Bodyweight
hf_Diet <- filter(mice_pheno, Sex=="M", Diet == "hf")$Bodyweight
population <- c(control,hf_Diet)

length(control)
## [1] 223
length(hf_Diet)
## [1] 193

We want to compare the two diets. As the two groups of mice are both large, we are going to assume that they are our “populations of interest”.

1.1 Null hypothesis and observation

Suppose the true average weight of the control population, \(\mu_0\), is known.

mu0 <- mean(control)

We have a sample of \(n\) mice fed the high-fat diet and measure their weights:

set.seed(51)
n <- 15
my_sample <- sample(hf_Diet,n)

The observed difference between our sample mean and the control mean:

observed_diff <- mean(my_sample) - mu0
observed_diff
## [1] 3.338188

Null hypothesis: The true mean weight of the high-fat diet mice equals the control mean \(\mu_0\); equivalently, the true weight difference between the two groups is zero.

1.2 t statistic

This part is adapted from: http://genomicsclass.github.io/book/pages/clt_in_practice.html

The t-statistic compares the difference between the observed mean and \(\mu_0\), standardized by the standard error \(SE = s/\sqrt{n}\), where \(s\) is the sample standard deviation:

\(t = \frac{\overline{x} - \mu_0}{SE} = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}\)

sample_se <- sqrt(var(my_sample)/n)
tstat <- observed_diff/sample_se
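As a quick sanity check (on simulated data, not the mouse weights), `sqrt(var(x)/n)` is indeed the same as the more familiar `sd(x)/sqrt(n)`:

```r
# Sanity check: two equivalent ways to compute the standard error
set.seed(1)
x <- rnorm(15, mean = 30, sd = 5)  # simulated data, not the mouse weights
n <- length(x)

se1 <- sqrt(var(x) / n)
se2 <- sd(x) / sqrt(n)
isTRUE(all.equal(se1, se2))  # TRUE
```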

1.3 Null distribution and p-value according to CLT

Because of the CLT, we expect tstat to approximately follow a standard normal distribution (mean = 0, sd = 1).

Visually compare tstat to expectation:

t_vals <- seq(-5,5, by=0.01)
density <- dnorm(t_vals, mean=0, sd=1)

ggplot(data.frame(t_vals,density), aes(t_vals,density))+
  geom_line()+
  geom_vline(xintercept=tstat)

We can calculate the two-sided p-value as the probability, under the standard normal distribution, of observing a value at least as extreme as tstat:

(pnorm(tstat, mean=0, sd=1,lower.tail=FALSE))*2
## [1] 0.0430436

1.4 Is the normal distribution a good null distribution?

We can approximate the true null distribution of the t statistic by repeatedly sampling from the whole population:

# choose N
N <- 7

mu0 <- mean(population)
null_t <- rep(NA, 10000)

for(i in seq_along(null_t)){
  null_sample <- sample(population, N)
  se <- sqrt(var(null_sample)/N)
  null_t[i] <- (mean(null_sample)-mu0) / se
}
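The loop above can also be written more compactly with `replicate`. A minimal, self-contained sketch using simulated stand-in weights (`rnorm` values, not the actual mouse data):

```r
set.seed(1)
# Stand-in population: simulated weights, NOT the real mouse data
fake_population <- rnorm(400, mean = 32, sd = 6)
mu0_sim <- mean(fake_population)
N <- 7

# 10000 simulated t statistics under the null
null_t_sim <- replicate(10000, {
  s <- sample(fake_population, N)
  (mean(s) - mu0_sim) / (sd(s) / sqrt(N))
})
length(null_t_sim)  # 10000
```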

Histogram of null distribution:

data.frame(null_t) %>% 
  ggplot(aes(x=null_t)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Make a qqnorm plot of this distribution to check if it’s normal:

data.frame(null_t) %>% 
  ggplot(aes(sample=null_t))+
  stat_qq()+ stat_qq_line()

The t-statistic is not normally distributed if \(n\) is too small:
- With a small \(n\), the standard error is itself estimated from few observations and is therefore noisy.
- As a consequence, the normal approximation from the CLT does not hold for the t-statistic.
- Extreme values of the t-statistic are more likely than under a normal distribution.
- Using the normal distribution anyway therefore underestimates the p-value (i.e. produces more false positives).
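The last point can be checked directly: for the same cutoff, the two-sided tail probability under a t-distribution with few degrees of freedom is clearly larger than under the standard normal (a quick check, independent of the mouse data):

```r
# Two-sided tail probability beyond |t| = 2.5:
# standard normal vs t with N - 1 = 6 degrees of freedom
p_normal <- 2 * pnorm(2.5, lower.tail = FALSE)
p_t6     <- 2 * pt(2.5, df = 6, lower.tail = FALSE)
round(c(normal = p_normal, t_df6 = p_t6), 4)
```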

1.5 The t-distribution

The t-distribution is the appropriate null distribution for the t-statistic; for a sample of size \(n\) it has \(n - 1\) degrees of freedom. It has heavier tails than the normal distribution, which means that extreme values of \(t\) are more likely. We can visually compare our simulated null distribution to the t-distribution:

ggplot(data.frame(null_t), aes(sample=null_t))+
  stat_qq(distribution=stats::qt, dparams=list(df=N-1))+ 
  stat_qq_line(distribution=stats::qt, dparams=list(df=N-1))
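The heavier tails also show up in the critical values: the 97.5% quantile of a t-distribution with \(N - 1 = 6\) degrees of freedom is clearly larger than the familiar normal value of about 1.96:

```r
# 97.5% critical values: t with 6 df vs standard normal
qt(0.975, df = 6)  # larger than the normal value
qnorm(0.975)       # approximately 1.96
```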

We can now compare the p-values obtained when calculating the probability of \(t\) under the normal distribution versus the t-distribution.

For the normal distribution:

(1-pnorm(tstat, mean=0, sd=1))*2
## [1] 0.0430436

For the t-distribution:

(1-pt(tstat,df=n-1))*2
## [1] 0.06256966

The built-in t-test uses the t-distribution and reproduces this p-value:

t.test(my_sample,mu = mean(control))
## 
##  One Sample t-test
## 
## data:  my_sample
## t = 2.0233, df = 14, p-value = 0.06257
## alternative hypothesis: true mean is not equal to 30.96381
## 95 percent confidence interval:
##  30.76335 37.84065
## sample estimates:
## mean of x 
##    34.302
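The correspondence between the manual calculation and `t.test` can be verified on simulated data (made-up numbers, not the mouse weights): the one-sample t-test computes exactly the statistic and p-value derived above.

```r
set.seed(42)
x_sim  <- rnorm(15, mean = 33, sd = 6)  # simulated sample, not the mouse data
mu_sim <- 31

# Manual t statistic and two-sided p-value
t_manual <- (mean(x_sim) - mu_sim) / (sd(x_sim) / sqrt(length(x_sim)))
p_manual <- 2 * pt(abs(t_manual), df = length(x_sim) - 1, lower.tail = FALSE)

fit <- t.test(x_sim, mu = mu_sim)
all.equal(unname(fit$statistic), t_manual)  # TRUE
all.equal(fit$p.value, p_manual)            # TRUE
```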