1 Disease prevalence

There is a disease with a known prevalence of 4%. You have a group of 100 randomly selected persons.

  1. How many of them would you expect to have the disease?
  2. What is the probability of observing exactly 7 persons with the disease?
  3. What is the probability of observing 7 or more persons with the disease?
  4. Assume the 100 persons are a group of randomly selected persons with a precondition. Would the fact that 7 of them have the disease convince you that the prevalence is higher for people with the precondition?
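
One possible way to approach these questions is sketched below. It assumes that the number of affected persons among 100 independent, randomly selected persons follows a binomial distribution with success probability 0.04. The variable names are illustrative only.

```r
n <- 100   # group size
p <- 0.04  # known prevalence

# Question 1: expected number of affected persons under the binomial model
expected <- n * p

# Question 2: probability of observing exactly 7 affected persons
p_exactly_7 <- dbinom(7, size = n, prob = p)

# Question 3: probability of observing 7 or more, i.e. P(X > 6)
p_7_or_more <- pbinom(6, size = n, prob = p, lower.tail = FALSE)

c(expected = expected, exactly_7 = p_exactly_7, seven_or_more = p_7_or_more)
```

For question 4, the tail probability from question 3 can be read as a one-sided p-value under the assumption that the prevalence in the precondition group is also 4%. Whether it is small enough to be convincing is the judgment the question asks for.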

2 The discoveries data

Consider the discoveries data. This data set ships with base R and records the yearly number of “great” inventions and scientific discoveries. These are clearly count data. Let’s transform them into a vector:

discov <- discoveries[1:100]

  1. Look up the example data using the R help function.
  2. Compare the fit of a Poisson distribution with the fit of a negative binomial (= gamma-Poisson) distribution.
  3. Which of the two distributions do you think describes the data better? And what could be a reason for that (remember what the data describe)?
  4. What could be problematic about fitting a Poisson or negative binomial distribution to these data? Which assumption could be violated?
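
A minimal sketch of one way to compare the two fits follows. It assumes the MASS package (distributed with standard R installations) is available and uses its fitdistr() function for maximum likelihood fitting. Other tools would work as well, and the object names are illustrative.

```r
library(MASS)  # assumption: MASS is installed (it ships with standard R)

discov <- discoveries[1:100]

# A Poisson distribution has variance equal to its mean,
# so comparing the two is a quick first check for overdispersion
c(mean = mean(discov), variance = var(discov))

# Maximum likelihood fits of both candidate distributions
fit_pois <- fitdistr(discov, "Poisson")
fit_nb   <- fitdistr(discov, "negative binomial")

# AIC penalizes the extra parameter of the negative binomial; lower is better
c(poisson = AIC(fit_pois), negative_binomial = AIC(fit_nb))

# Observed frequencies next to the frequencies expected under each fit
k <- 0:max(discov)
observed <- as.vector(table(factor(discov, levels = k)))
expected_pois <- length(discov) * dpois(k, lambda = fit_pois$estimate[["lambda"]])
expected_nb <- length(discov) * dnbinom(k,
                                        size = fit_nb$estimate[["size"]],
                                        mu = fit_nb$estimate[["mu"]])
fit_table <- data.frame(k = k,
                        observed = observed,
                        poisson = round(expected_pois, 1),
                        negative_binomial = round(expected_nb, 1))
fit_table
```

Both fits treat the yearly counts as independent draws from one distribution, which is worth keeping in mind for question 4.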

3 ELISA example

This example is modified from chapter 1 of Modern Statistics for Modern Biology (MSMB) by Susan Holmes and Wolfgang Huber.

When testing certain pharmaceutical compounds, it is important to detect proteins that provoke an allergic reaction. The molecular sites that are responsible for such reactions are called epitopes.

ELISAs are used to detect specific epitopes at different positions along a protein. The protein is tested at 100 different positions, which are assumed to be independent. For each patient, each position can either be a hit or not. We’re going to study the data for 50 patients tallied at each of the 100 positions.

Run the following lines:

epitope_data <- data.frame(position=1:100,
                           count=c(2, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 2, 2, 7, 1, 0, 2, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0))

In this data frame, the number of hits among the 50 patients is counted at each position.

  1. Plot the data in a meaningful way.
  2. Fit the counts to a Poisson distribution. What is the fitted rate parameter?
  3. Does this look like a good fit?
  4. ELISAs can give false positives at a certain rate. False positive means declaring a hit – we think we have an epitope – when there is none. Assume that most of the positions actually don’t contain an epitope. In this case, you can consider the fitted Poisson model as the “background noise” model, with lambda giving the expected number of false positives. Given this model, what are the chances of seeing a value as large as 7, if no epitope is present?
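
The following sketch illustrates questions 2 and 4. To stay self-contained it rebuilds the counts from their tally in the data frame above (58 zeros, 34 ones, 7 twos and a single 7); in a session where epitope_data exists, epitope_data$count can be used instead. The last step, which looks at the largest of 100 independent positions, is an additional consideration and not something the exercise text prescribes.

```r
# Stand-in with the same tally as epitope_data$count
# (the order of the positions does not matter for the fit)
counts <- rep(c(0, 1, 2, 7), times = c(58, 34, 7, 1))

# Question 2: the maximum likelihood estimate of the Poisson rate is the sample mean
lambda_hat <- mean(counts)

# Question 4: probability of a count of 7 or more at one given position
# under the background-noise model, i.e. P(X > 6)
p_single <- ppois(6, lambda = lambda_hat, lower.tail = FALSE)

# Probability that at least one of 100 independent positions reaches 7 or more
p_any <- 1 - (1 - p_single)^100

c(lambda_hat = lambda_hat, p_single = p_single, p_any = p_any)
```

Since the position with 7 hits was picked out after looking at all 100 positions, the second probability is arguably the more relevant of the two.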

4 Mice data

Load the mice data:

# read.csv() expects "," as field separator and "." as decimal mark, which matches this file
mice_pheno <- read.csv(file = url("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"))
mice_pheno$Bodyweight <- as.numeric(mice_pheno$Bodyweight)

  1. Plot a histogram and a normal Q-Q plot of the body weights of female control mice. Would you say the weights follow a normal distribution?
  2. To compare the control and high-fat groups (female mice only), show the ECDFs of both groups in the same plot.
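
A possible starting point for both plots, using base R graphics, is sketched below. It assumes that mice_pheno has the columns Sex (coded "F"/"M"), Diet (coded "chow" for control and "hf" for high fat) and Bodyweight; this coding should be checked with str(mice_pheno) before relying on it. The helper functions female_weights() and plot_female_weights() are hypothetical names introduced for this sketch.

```r
# Body weights of female mice on a given diet, with missing values dropped.
# Assumed coding: Sex is "F"/"M", Diet is "chow" (control) or "hf" (high fat).
female_weights <- function(mice, diet) {
  w <- mice$Bodyweight[mice$Sex == "F" & mice$Diet == diet]
  w[!is.na(w)]
}

plot_female_weights <- function(mice) {
  control  <- female_weights(mice, "chow")
  high_fat <- female_weights(mice, "hf")

  old_par <- par(mfrow = c(1, 3))  # three panels side by side
  on.exit(par(old_par))            # restore the graphics settings afterwards

  # Question 1: histogram and normal Q-Q plot for the control group
  hist(control, breaks = 20, main = "Female control mice", xlab = "Body weight")
  qqnorm(control, main = "Normal Q-Q plot, female control")
  qqline(control)

  # Question 2: ECDFs of both groups in one panel
  plot(ecdf(control), main = "ECDF, female mice", xlab = "Body weight",
       xlim = range(c(control, high_fat)))
  lines(ecdf(high_fat), col = "red")
  legend("bottomright", legend = c("control", "high fat"),
         col = c("black", "red"), lty = 1)

  invisible(list(control = control, high_fat = high_fat))
}

# Draw the plots if the data have been loaded as shown above
if (exists("mice_pheno")) plot_female_weights(mice_pheno)
```

Points that follow the reference line in the Q-Q plot are compatible with a normal distribution, and a horizontal shift between the two ECDF curves indicates a difference in body weight between the diets.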