This exercise is from [1].
The above table comes from one of the first studies of the link between lung cancer and smoking, by Richard Doll and A. Bradford Hill. In 20 hospitals in London, UK, patients admitted with lung cancer in the previous year were asked about their smoking behavior. For each of these patients, the researchers also recorded the smoking behavior of a non-cancer control patient at the same hospital, of the same sex and in the same 5-year age group. A smoker was defined as a person who had smoked at least one cigarette a day for at least a year.
Here’s the data again:
epitope_data <- data.frame(position=1:100,
count=c(2, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 2, 2, 7, 1, 0, 2, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0))
As on day 3, we assume a known false positive rate of 1%. This time, we perform a one-sided binomial test at every position:
count <- epitope_data$count
# One-sided binomial test per position: 50 trials, null probability 0.01
p_values <- sapply(count, function(k) {
  binom.test(k, 50, p = 0.01, alternative = "greater")$p.value
})
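To see what each of these tests computes: with alternative="greater", the p-value is the upper tail probability P(X ≥ k) for X ~ Binomial(50, 0.01). The following is a minimal sketch, assuming the 50 trials and the 1% rate from the call above. The names k_example, tail_prob, and from_test are made up for illustration, and the four counts are an arbitrary selection rather than the full data set.

```r
# Illustrative counts only (hypothetical selection, not the full data set)
k_example <- c(0, 1, 2, 7)

# Upper tail P(X >= k) under Binomial(50, 0.01);
# pbinom(k - 1, ..., lower.tail = FALSE) gives P(X > k - 1) = P(X >= k)
tail_prob <- pbinom(k_example - 1, size = 50, prob = 0.01, lower.tail = FALSE)

# The same p-values obtained from binom.test, for comparison
from_test <- sapply(k_example, function(k) {
  binom.test(k, 50, p = 0.01, alternative = "greater")$p.value
})

print(cbind(k = k_example, tail_prob, from_test))
```

The two columns should agree, which shows that the loop over positions is just evaluating a binomial tail probability once per position.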
Look up the p.adjust function. Can you use it to adjust the p-values and control the family-wise error rate? Is the peak at position 42 still significant after the correction?
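One possible way to approach this is sketched below. It assumes the "bonferroni" and "holm" methods of p.adjust, both of which control the family-wise error rate, and a significance level of 0.05. The level and the names alpha, p_bonferroni, and p_holm are assumptions made for this sketch, not part of the exercise.

```r
# Counts per position, copied from epitope_data above
count <- c(2, 0, 1, 0, 0, 0, 2, 1, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
           0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
           0, 0, 0, 1, 0, 1, 1, 0, 1, 2,
           2, 7, 1, 0, 2, 0, 1, 0, 1, 1,
           1, 1, 0, 1, 0, 0, 0, 0, 1, 2,
           2, 1, 0, 0, 0, 0, 0, 1, 0, 0,
           0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
           0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
           1, 1, 0, 0, 1, 1, 0, 0, 1, 0)

# Unadjusted one-sided binomial p-values, as in the exercise
p_values <- sapply(count, function(k) {
  binom.test(k, 50, p = 0.01, alternative = "greater")$p.value
})

alpha <- 0.05  # assumed significance level, not given in the exercise

# Two corrections that control the family-wise error rate
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
p_holm <- p.adjust(p_values, method = "holm")

# Positions that remain below alpha after each correction
print(which(p_bonferroni < alpha))
print(which(p_holm < alpha))

# Adjusted p-values for the peak at position 42
print(c(bonferroni = p_bonferroni[42], holm = p_holm[42]))
```

Bonferroni multiplies each p-value by the number of tests and caps the result at 1. Holm is a step-down variant that also controls the family-wise error rate and never gives a larger adjusted p-value than Bonferroni, so comparing the two shows how much the choice of method matters for these data.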
[1] Agresti, A. (2006). An Introduction to Categorical Data Analysis (2nd ed.). https://doi.org/10.1002/0470114754