One of the main challenges of biological data analysis is dealing with heterogeneity. The quantities we are interested in often do not show a simple, unimodal “textbook distribution”. For example, in the last part of Chapter 2) we saw how the histogram of sequence scores in Figure 2.25 had two separate modes, one for CpG-islands and one for non-islands. We can see the data as a simple mixture of a few (in this case: two) components. We call these *finite mixtures*. Other mixtures can involve almost as many components as we have observations. These we call *infinite mixtures*58 We will see that—as for so many modeling choices–the right complexity of the mixture is in the eye of the beholder and often depends on the amount of data and the resolution and smoothness we want to attain..

In Chapter 1 we saw how a simple generative model with a Poisson distribution led us to make useful inference in the detection of an epitope. Unfortunately, a satisfactory fit to real data with such a simple model is often out of reach. However, simple models such as the normal or Poisson distributions can serve as building blocks for more realistic models using the mixing framework that we cover in this chapter. Mixtures occur naturally for flow cytometry data, biometric measurements, RNA-Seq, ChIP-Seq, microbiome and many other types of data collected using modern biotechnologies. In this chapter we will learn from simple examples how to build more realistic models of distributions using mixtures.

In this chapter we will:

Generate our own mixture model data from distributions composed of two normal populations.

See how the Expectation-Maximization (EM) algorithm and allows us to “reverse engineer” the underlying mixtures in dataset.

Use a special type of mixture called zero-inflation for data such as ChIP-Seq data that have many extra zeros.

Discover the empirical cumulative distribution: a special mixture that we can build from observed data. This will enable us to see how we can simulate the variability of our estimates using the bootstrap.

Build the Laplace distribution as an instance of an infinite mixture model- with many components. We will use it to model promoter lengths and microarray intensities.

Have our first encounter with the gamma-Poisson distribution, a hierarchical model useful for RNA-Seq data. We will see it comes natually from mixing different Poisson distributed sources.

See how mixture models enable us to choose data transformations.

Here is a first example of a mixture model with two equal-sized components. We decompose the generating process into steps:

**Flip a fair coin.**

Generate a random number from a normal distribution with mean 1 and variance 0.25.

Generate a random number from a normal distribution with mean 3 and variance 0.25. The histogram shown in Figure 4.1 was produced by repeating these two steps 10,000 times using the following code.

```
coinflips = (runif(10000) > 0.5)
table(coinflips)
```

```
## coinflips
## FALSE TRUE
## 5005 4995
```

Figure 4.1: Histogram of 10,000 random draws from a fair mixture of two normals. The left hand part of the histogram is dominated by numbers generated from (A), on the right from (B).

```
oneFlip = function(fl, mean1 = 1, mean2 = 3, sd1 = 0.5, sd2 = 0.5) {
if (fl) {
rnorm(1, mean1, sd1)
} else {
rnorm(1, mean2, sd2)
}
}
fairmix = vapply(coinflips, oneFlip, numeric(1))
library("ggplot2")
library("dplyr")
ggplot(tibble(value = fairmix), aes(x = value)) +
geom_histogram(fill = "purple", binwidth = 0.1)
```

► Question 4.1

How can you use R’s vectorized syntax to remove the `vapply`

-loop and generate the `fairmix`

vector more efficiently?

► Solution

► Question 4.2

Using your improved code, use one million coin flips and make a histogram with 500 bins. What do you notice?

► Solution

The density function for a normal \(N(\mu,\sigma)\) random variable can be written explicitly; we usually call it \(\phi(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\)

► Question 4.3

- Plot a histogram of those values of
`fair$values`

for which`coinflips`

is`TRUE`

. Hint: use`y = ..density..`

in the call to`aes`

(which means that the vertical axis is to show the proportion of counts) and set the binwidth to 0.01.

- Overlay the line corresponding to \(\phi(z)\).

► Solution

In fact we can write the mathematical formula for the density of all of `fair$values`

(the limiting curve that the histograms tend to look like) as a sum of the two densities.

\[\begin{equation} f(x)=\frac{1}{2}\phi_1(x)+\frac{1}{2}\phi_2(x), \tag{4.1} \end{equation}\]

where \(\phi_1\) is the density of the normal \(N(\mu_1=1,\sigma^2=0.25)\) and \(\phi_2\) is the density of the normal \(N(\mu_2=3,\sigma^2=0.25)\). Figure 4.3 was generated by the following code.

Figure 4.3: The theoretical density of the mixture.

```
fairtheory = tibble(
x = seq(-1, 5, length.out = 1000),
f = 0.5 * dnorm(x, mean = means[1], sd = sds[1]) +
0.5 * dnorm(x, mean = means[2], sd = sds[2]))
ggplot(fairtheory, aes(x = x, y = f)) +
geom_line(color = "red", size = 1.5) + ylab("mixture density")
```

In this case the mixture model is extremely visible as the two component distributions have little overlap. Figure 4.3 shows two distinct peaks: we call this a **bimodal** distribution. However in many cases the separation is not so clear.

Figure 4.4: A mixture of two normals that is harder to recognize.

► Question 4.4

Figure 4.4 is a histogram of a fair mixture of two normals with the same variances. Can you guess the two *mean* parameters of the component distributions? Hint: you can use trial and error, and simulate various mixtures to see if you can make a matching histogram. Looking at the R code for the chapter will show you exactly how the data were generated.

► Solution

In Figure 4.5, the bars from the two component distributions are plotted on top of each other. A different way of showing the components is Figure 4.6, produced by the code below.

Figure 4.6: As Figure 4.5, with stacked bars for the two mixture components.

```
ggplot(mystery, aes(x = values, fill = coinflips)) +
geom_histogram(data = dplyr::filter(mystery, coinflips),
fill = "red", alpha = 0.2, breaks = br) +
geom_histogram(data = dplyr::filter(mystery, !coinflips),
fill = "darkblue", alpha = 0.2, breaks = br) +
geom_histogram(fill = "purple", breaks = br, alpha = 0.2)
```

► Question 4.5

In Figures 4.5 and 4.6, we were able to use the `coinflips`

column in the data to disentangle the components. In real data, this information is missing.

A book-long treatment on the subject of finite mixtures (McLachlan and Peel 2004).

Ecological and molecular data often come in the form of counts. For instance, this may be the number of individuals from each of several species at each of several locations. Such data can often be seen as a mixture of two scenarios: if the species is not present, the count is necessarily zero, but *if* the species is present, the number of individuals we observe varies, with a random sampling distribution; this distribution may also include zeros. We model this as a mixture:

\[\begin{equation*} f_{\text{zi}}(y) = \lambda \, \delta(y) + (1-\lambda) \, f_{\text{count}}(y), \end{equation*}\]

where \(\delta\) is Dirac’s delta function, which represents a probability distribution that has all its mass at 0. The zeros from the first mixture component are called “structural”: in our example, they occur because certain species do not live in certain habitats. The second component, \(f_{\text{count}}\) may also include zeros and other small numbers, simply due to random sampling. The R packages **pscl** (Zeileis, Kleiber, and Jackman 2008) and **zicounts** provide many examples and functions for working with **zero inflated** counts.

Let’s consider the example of ChIP-Seq data. These data are sequences of pieces of DNA that are obtained from chromatin immunoprecipitation (ChIP). This technology enables the mapping of the locations along genomic DNA of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, polymerases and other protein. It was the main technology used by the Encyclopedia of DNA Elements (ENCODE) Project. Here we use an example (Kuan et al. 2011) from the **mosaicsExample** package, which shows data measured on chromosome 22 from ChIP-Seq of antibodies for the STAT1 protein and the H3K4me3 histone modification applied to the GM12878 cell line. Here we do not show the code to construct the `binTFBS`

object; it is shown in the source code file for this chapter and follows the vignette of the **mosaics** package.

`binTFBS`

```
## Summary: bin-level data (class: BinData)
## ----------------------------------------
## - # of chromosomes in the data: 1
## - total effective tag counts: 462479
## (sum of ChIP tag counts of all bins)
## - control sample is incorporated
## - mappability score is NOT incorporated
## - GC content score is NOT incorporated
## - uni-reads are assumed
## ----------------------------------------
```

From this object, we can create the histogram of per-bin counts.

Figure 4.7: The number of binding sites found in 200nt windows along chromosome 22 in a ChIP-Seq dataset.

```
bincts = print(binTFBS)
ggplot(bincts, aes(x = tagCount)) +
geom_histogram(binwidth = 1, fill = "forestgreen")
```

We see in Figure 4.7 that there are many zeros, although from this plot it is not immediately obvious whether the number of zeros is really extraordinary, given the frequencies of the other small numbers (\(1,2,...\)).

► Question 4.9

- Redo the histogram of the counts using a logarithm base 10 scale on the \(y\)-axis.

- Estimate \(\pi_0\), the proportion of bins with zero counts.

► Solution

So far we have looked at mixtures of two components. We can extend our description to cases where there may be more. For instance, when weighing N=7,000 nucleotides obtained from mixtures of deoxyribonucleotide monophosphates (each type has a different weight, measured with the same standard deviation sd=3), we might observe the histogram (shown in Figure 4.9) generated by the following code.

Figure 4.9: Simulation of 7,000 nucleotide mass measurements.

```
masses = c(A = 331, C = 307, G = 347, T = 322)
probs = c(A = 0.12, C = 0.38, G = 0.36, T = 0.14)
N = 7000
sd = 3
nuclt = sample(length(probs), N, replace = TRUE, prob = probs)
quadwts = rnorm(length(nuclt),
mean = masses[nuclt],
sd = sd)
ggplot(tibble(quadwts = quadwts), aes(x = quadwts)) +
geom_histogram(bins = 100, fill = "purple")
```

► Question 4.10

Repeat this simulation experiment with \(N=1000\) nucleotide measurements. What do you notice in the histogram?

► Question 4.11

What happens when \(N=7000\) but the standard deviation is 10?

► Question 4.12

Plot the theoretical density curve for the distribution simulated in Figure 4.9.

In this case, as we have enough measurements with good enough precision, we are able to distinguish the four nucleotides and decompose the distribution shown in Figure 4.9. With fewer data and/or more noisy measurements, the four modes and the distribution component might be less clear.

In this section, we will consider an extreme case of mitxure models, where we model our sample of \(n\) data points as a mixture of \(n\) point masses. We could use almost any set of data here; to be concrete, we use Darwin’s *Zea Mays* data63 They were collected by Darwin who asked his cousin, Francis Galton to analyse them. R.A. Fisher re-analysed the same data using a paired t-test (Bulmer 2003). We will get back to this example in Chapter 13. in which he compared the heights of 15 pairs of *Zea Mays* plants (15 self-hybridized versus 15 crossed). The data are available in the **HistData** package, and we plot the distribution of the 15 differences in height:

```
library("HistData")
ZeaMays$diff
```

```
## [1] 6.125 -8.375 1.000 2.000 0.750 2.875 3.500 5.125
## [9] 1.750 3.625 7.000 3.000 9.375 7.500 -6.000
```

Figure 4.10: The observed sample can be seen as a mixture of point masses at each of the values (real point masses would be bars without any width whatsoever).

```
ggplot(ZeaMays, aes(x = diff, ymax = 1/15, ymin = 0)) +
geom_linerange(size = 1, col = "forestgreen") + ylim(0, 0.1)
```

In Section 3.6.6 we saw that the empirical cumulative distribution function (ECDF) for a sample of size \(n\) is

\[\begin{equation} \hat{F}_n(x)= \frac{1}{n}\sum_{i=1}^n {\mathbb 1}_{x \leq x_i}, \tag{4.4} \end{equation}\]

and we saw ECDF plots in Figure 3.22. We can also write the *density* of our sample as

\[\begin{equation} \hat{f}_n(x) =\frac{1}{n}\sum_{i=1}^n \delta_{x_i}(x) \tag{4.5} \end{equation}\]

In general, the density of a probability distribution is the derivative (if it exists) of the distribution function. We have applied this principle here: the density of the distribution defined by Equation (4.4) is Equation (4.5). We could do this since one can consider the function \(\delta_a\) the “derivative” of the step function \({\mathbb 1}_{x \leq a}\): it is completely flat almost everywhere, except at the one point \(a\) where there is the step, and where its value is “infinite” There is a bit of advanced mathematics (beyond standard calculus) required for this to make sense, which is however beyond the scope of our treatment here.. Equation (4.5) highlights that our data sample can be considered a mixture of \(n\) **point masses** at the observed values \(x_1,x_2,..., x_{n}\), such as in Figure 4.10.

Figure 4.11: The value of a statistic \(\tau\) is estimated from data (the grey matrices) generated from an underlying distribution \(F\). Different samples from \(F\) lead to different data, and so to different values of the estimate \(\hat{\tau}\): this is called **sampling variability**. The distribution of all the \(\hat{\tau}\)’s is the **sampling distribution**.

Statistics of our sample, such as the mean, minimum or median, can now be written as a function of the ECDF, for instance, \(\bar{x} = \int \delta_{x_i}(x)\,\text{d}x\). As another instance, if \(n\) is an odd number, the median is \(x_{(\frac{n+1}{2})}\), the value right in the middle of the ordered list.

The true **sampling distribution** of a statistic \(\hat{\tau}\) is often hard to know as it requires many different data samples from which to compute the statistic; this is shown in Figure 4.11.

The **bootstrap** principle approximates the true sampling distribution of \(\hat{\tau}\) by creating new samples drawn from the empirical distribution built from the original sample (Figure 4.12). We *reuse* the data (by considering it a mixture distribution of \(\delta\)s) to create new “datasets” by taking samples and looking at the sampling distribution of the statistics \(\hat{\tau}^*\) computed on them. This is called the **nonparametric bootstrap** resampling approach, see Efron and Tibshirani (1994) for a complete reference. It is very versatile and powerful method that can be applied to basically any statistic, no matter how complicated it is. We will see example applications of this method, in particular to clustering, in Chapter 5).

Figure 4.12: The bootstrap simulates the sampling variability by drawing samples not from the underlying true distribution \(F\) (as in Figure 4.11), but from the empirical distribution function \(\hat{F}_n\).

Using these ideas, let’s try to estimate the sampling distribution of the median of the Zea Mays differences we saw in Figure 4.10. We use similar simulations to those in the previous sections: Draw \(B=1000\) samples of size 15 from the 15 values (each being a component in the 15-component mixture). Then compute the 1000 medians of each of these samples of 15 values and look at their distribution: this is the bootstrap sampling distribution of the median.

Figure 4.13: The bootstrap sampling distribution of the median of the Zea Mays differences.

```
B = 1000
meds = replicate(B, {
i = sample(15, 15, replace = TRUE)
median(ZeaMays$diff[i])
})
ggplot(tibble(medians = meds), aes(x = medians)) +
geom_histogram(bins = 30, fill = "purple")
```

► Question 4.13

Estimate a 99% confidence interval for the median based on these simulations. What can you conclude from looking at the overlap between this interval and 0?

► Question 4.14

Use the **bootstrap** package to redo the same analysis using the function `bootstrap`

for both `median`

and `mean`

. What differences do you notice between the sampling distribution of the mean and the median?

► Solution

In theoretical statistics, nonparametric methods are those that have infinitely many degrees of freedom or numbers of unknown parameters.

Despite their name, nonparametric methods are not methods that do not use parameters: all statistical methods estimate unknown quantities. In practice, we do not wait for infinity; when the number of parameters becomes as large or larger than the amount of data available, we say the method is nonparametric. The bootstrap uses a mixture with \(n\) components, so with a sample of size \(n\), it qualifies as a nonparametric method.

► Question 4.15

If the sample is composed of \(n=3\) different values, how many different bootstrap resamples are possible? Answer the same question with \(n=15\).

► Solution

► Question 4.16

What are the two types of error that can occur when using the bootstrap as it is implemented in the **bootstrap** package? Which parameter can you modify to improve one of them?

► Solution

Sometimes mixtures can be useful even if we don’t aim to assign a label to each observation or, to put it differently, if we allow as many `labels’ as there are observations. If the number of mixture components is as big as (or bigger than) the number of observations, we say we have an **infinite mixture**. Let’s look at some examples.

Figure 4.14: Laplace knew already that the probability density \[f_Y(y)=\frac{1}{2\phi}\exp\left(-\frac{|y-\theta|}{\phi}\right),\qquad\phi>0\] has the median as its location parameter \(\theta\) and the median absolute deviation (MAD) as its scale parameter \(\phi\).

Consider the following two-level data generating scheme: **Level 1** Create a sample of `W`

s from an exponential distribution.

`w = rexp(10000, rate = 1)`

**Level 2** The \(w\)s serve as the variances of normal variables with mean \(\mu\) generated using `rnorm`

.

Figure 4.15: Data sampled from a Laplace distribution.

```
mu = 0.3
lps = rnorm(length(w), mean = mu, sd = sqrt(w))
ggplot(data.frame(lps), aes(x = lps)) +
geom_histogram(fill = "purple", binwidth = 0.1)
```

This turns out to be a rather useful distribution. It has well-understood properties and is named after Laplace, who proved that the median is a good estimator of its location parameter \(\theta\) and that the median absolute deviation can be used to estimate its scale parameter \(\phi\). From the formula in the caption of Figure 4.14 we see that the \(L_1\) distance (absolute value of the difference) holds a similar position in the Laplace density as the \(L_2\) (square of the difference) does for the normal density.

Conversely, in Bayesian regression64 Don’t worry if you are not familiar with this, in that case just skip this sentence., having a Laplace distribution as a prior on the coefficients amounts to an \(L_1\) penalty, called the *lasso* (Tibshirani 1996), while a normal distribution as a prior leads to an \(L_2\) penalty, called ridge regression.

► Question 4.17

Write a random variable whose distribution is the symmetric Laplace as a function of normal and exponential random variables.

► Solution

In the Laplace distribution, the variances of the normal components depend on \(W\), while the means are unaffected. A useful extension adds another parameter \(\theta\) that controls the locations or centers of the components. We generate data `alps`

from a hierarchical model with \(W\) an exponential variable; the output shown in Figure 4.16 is a histogram of normal \(N(\theta+w\mu,\sigma w)\) random numbers, where the \(w\)’s themselves were randomly generated from an exponential distribution with mean \(1\) as shown in the code:

Figure 4.16: Histogram of data generated from an asymmetric Laplace distribution – a scale mixture of many normals whose means and variances are dependent. We write \(X \sim AL(\theta, \mu, \sigma)\).

```
mu = 0.3; sigma = 0.4; theta = -1
w = rexp(10000, 1)
alps = rnorm(length(w), theta + mu * w, sigma * sqrt(w))
ggplot(tibble(alps), aes(x = alps)) +
geom_histogram(fill = "purple", binwidth = 0.1)
```

Such hierarchical mixture distributions, where every instance of the data has its own mean and variance, are useful models in many biological settings. Examples are shown in Figure 4.17.

► Question 4.18

Looking at the log-ratio of gene expression values from a microarray, one gets a distribution as shown on the right of Figure 4.17. How would one explain that the data have a histogram of this form?

The Laplace distribution is an example of where the consideration of the generative process indicates how the variance and mean are linked. The expectation value and variance of an asymmetric Laplace distribution \(AL(\theta, \mu, \sigma)\) are

\[\begin{equation} E(Y ) = \theta + \mu \quad\quad\text{and}\quad\quad\text{var}(Y)=\sigma^2+\mu^2. \tag{4.7} \end{equation}\]

Note the variance is dependent on the mean, unless \(\mu = 0\) (the case of the symmetric Laplace Distribution). This is the feature of the distribution that makes it useful. Having a mean-variance dependence is very common for physical measurements, be they microarray fluorescence intensities, peak heights from a mass spectrometer, or reads counts from high-throughput sequencing, as we’ll see in the next section.

Figure 4.18: How to count the fish in a lake? MC Escher.

A similar two-level hierarchical model is often also needed to model real-world count data. At the lower level, simple Poisson and binomial distributions serve as the building blocks, but their parameters may depend on some underlying (latent) process. In ecology, for instance, we might be interested in variations of fish species in all the lakes in a region. We sample the fish species in each lake to estimate their true abundances, and that could be modeled by a Poisson. But the true abundances will vary from lake to lake. And if we want to see whether, for instance, changes in climate or altitude play a role, we need to disentangle such systematic effects from random lake-to-lake variation. The different Poisson rate parameters \(\lambda\) can be modeled as coming from a distribution of rates. Such a hierarchical model also enables us to add supplementary steps in the hierarchy, for instance we could be interested in many different types of fish, model altitude and other environmental factors separately, etc.

Further examples of sampling schemes that are well modeled by mixtures of Poisson variables include applications of high-throughput sequencing, such as RNA-Seq, which we will cover in detail in Chapter 8, or 16S rRNA-Seq data used in microbial ecology.

Now we are getting to know a new distribution that we haven’t seen before. The gamma distribution is an extension of the (one-parameter) exponential distribution, but it has two parameters, which makes it more flexible. It is often useful as a building block for the upper level of a hierarchical model. The gamma distribution is positive-valued and continuous. While the density of the exponential has its maximum at zero and then simply decreases towards 0 as the value goes to infinity, the density of the gamma distribution has its maximum at some finite value. Let’s explore it by simulation examples. The histograms in Figure 4.19 were generated by the following lines of code:

Figure 4.19: Histograms of random samples of gamma distributions. The upper histogram shows a gamma\((2,\frac{1}{3})\), the lower a gamma\((10,\frac{3}{2})\) distribution. The gamma is a flexible two parameter distribution: .

```
ggplot(tibble(x = rgamma(10000, shape = 2, rate = 1/3)),
aes(x = x)) + geom_histogram(bins = 100, fill= "purple")
ggplot(tibble(x = rgamma(10000, shape = 10, rate = 3/2)),
aes(x = x)) + geom_histogram(bins = 100, fill= "purple")
```

Generate a set of parameters: \(\lambda_1,\lambda_2,...\) from a gamma distribution.

Use these to generate a set of Poisson(\(\lambda_i\)) random variables, one for each \(\lambda_1\).

Figure 4.20: Histogram of `gp`

, generated via a gamma-Poisson hierachical model.

```
lambda = rgamma(10000, shape = 10, rate = 3/2)
gp = rpois(length(lambda), lambda = lambda)
ggplot(tibble(x = gp), aes(x = x)) +
geom_histogram(bins = 100, fill= "purple")
```

The resulting values are said to come from a gamma–Poisson mixture. Figure 4.20 shows the histogram of `gp`

.

► Question 4.19

- Are the values generated from such a gamma–Poisson mixture continuous or discrete ?

- What is another name for this distribution? Hint: Try the different distributions provided by the
`goodfit`

function from the**vcd**package.

► Solution

In R, and in some other places, the gamma-Poisson distribution travels under the alias name of **negative binomial distribution**. The two names are synonyms; the second one alludes to the fact that Equation (4.8) bears some formal similarities to the probabilities of a binomial distribution. The first name, gamma–Poisson distribution, is more indicative of its generating mechanism, and that’s what we will use in the rest of the book. It is a discrete distribution, that means that it takes values only on the natural numbers (in contrast to the gamma distribution, which covers the whole positive real axis). Its probability distribution is

\[\begin{equation} \text{P}(K=k)=\left(\begin{array}{c}k+a-1\\k\end{array}\right) \, p^a \, (1-p)^k, \tag{4.8} \end{equation}\]

which depends on the two parameters \(a\in\mathbb{R}^+\) and \(p\in[0,1]\). Equivalently, the two parameters can be expressed by the mean \(\mu=pa/(1-p)\) and a parameter called the **dispersion** \(\alpha=1/a\). The variance of the distribution depends on these parameters, and is \(\mu+\alpha\mu^2\).

► Question 4.20

If you are more interested in analytical derivations than illustrative simulations, try writing out the mathematical derivation of the gamma-Poisson probability distribution.

► Solution

A key issue we need to control when we analyse experimental data is how much variability there is between repeated measurements of the same underlying true value, i.e. between replicates. This will determine whether and how well we can see any true differences, i.e. between different conditions. Data that arise through the type of hierarchical models we have studied in this chapter often turn out to have very heterogenous variances, and this can be a challenge. We will see how in such cases **variance-stabilizing transformations** (Anscombe 1948) can help. Note how we construct the dataframe (or, more precisely, the *tibble*) `simdat`

: the output of the `lapply`

loop is a list of *tibble*s, one for each value of `lambdas`

. With the pipe operator `%>%`

we send it to the function `bind_rows`

(from the **dplyr** package). The result is a dataframe of all the list elements neatly stacked on top of each other. Let’s start with a series of Poisson variables with rates `lambdas`

:

```
lambdas = seq(100, 900, by = 100)
simdat = lapply(lambdas, function(l)
tibble(y = rpois(n = 40, lambda=l), lambda = l)
) %>% bind_rows
library("ggbeeswarm")
ggplot(simdat, aes(x = lambda, y = y)) +
geom_beeswarm(alpha = 0.6, color = "purple")
ggplot(simdat, aes(x = lambda, y = sqrt(y))) +
geom_beeswarm(alpha = 0.6, color = "purple")
```

The data that we see in the left panel of Figure 4.23 are an example of what is called **heteroscedasticity**: the standard deviations (or, equivalently, the variance) of our data is different in different regions of our data space. In particular, it increases along the \(x\)-axis, with the mean. For the Poisson distribution, we indeed know that the standard deviation is the square root of the mean; for other types of data, there may be other dependencies. This can be a problem if we want to apply subsequent analysis techniques (for instance, regression, or a statistical test) that are based on assuming that the variances are the same. In Figure 4.23, the number of replicates going into each beeswarm are quite large (40). In practice, this is not always the case, so the heteroskedasticity may be harder to see, even though it is there. However, as we see in the right panel of Figure 4.23, if we apply the square root transformation to the \(y\)-variables, then the transformed variables will have approximately the same variance. In the following code, we use the transformation \(y \mapsto 2\sqrt{y}\), which yields variance approximately equal to 1.

`summarise(group_by(simdat, lambda), sd(y), sd(2*sqrt(y)))`

```
## # A tibble: 9 x 3
## lambda `sd(y)` `sd(2 * sqrt(y))`
## <dbl> <dbl> <dbl>
## 1 100 11.3 1.13
## 2 200 14.4 1.02
## 3 300 20.5 1.17
## 4 400 22.2 1.10
## 5 500 20.6 0.917
## 6 600 23.3 0.945
## 7 700 16.6 0.630
## 8 800 28.5 1.00
## 9 900 28.0 0.937
```

► Question 4.21

Repeat the computation in the above code chunk for a version of `simdat`

with a larger number of replicates than 40.

Another example, now using the gamma-Poisson distribution, is shown in Figure 4.24. We generate gamma-Poisson variables `u`

66 To catch a greater range of values for the mean value `mu`

, without creating too dense a sequence, we use a geometric series: \(\mu_{i+1} = 2\mu_i\). and plot the 95% confidence intervals around the mean.

```
muvalues = 2^seq(0, 10, by = 1)
simgp = lapply(muvalues, function(mu) {
u = rnbinom(n = 1e4, mu = mu, size = 4)
tibble(mean = mean(u), sd = sd(u),
lower = quantile(u, 0.025),
upper = quantile(u, 0.975),
mu = mu)
} ) %>% bind_rows
head(as.data.frame(simgp), 2)
```

```
## mean sd lower upper mu
## 1 0.9993 1.124556 0 4 1
## 2 2.0013 1.738563 0 6 2
```

Figure 4.24: gamma-Poisson distributed measurement data, for a range of \(\mu\) from 1 to 1024.

```
ggplot(simgp, aes(x = mu, y = mean, ymin = lower, ymax = upper)) +
geom_point() + geom_errorbar()
```

► Question 4.22

How can we find a transformation for these data that stabilizes the variance, similar to the square root function for the Poisson distributed data?

► Solution

Figure 4.25: Piecewise linear function that stabilizes the variance of the data in Figure 4.24.

```
simgp = mutate(simgp,
slopes = 1 / sd,
trsf = cumsum(slopes * mean))
ggplot(simgp, aes(x = mean, y = trsf)) +
geom_point() + geom_line() + xlab("")
```

We see in Figure 4.25 that this function has some resemblance to a square root function in particular at its lower end. At the upper end, it seems to look more like a logarithm. The more mathematically inclined will see that an elegant extension of these numerical calcuations can be done through a little calculus known as the **delta method**, as follows.

Call our transformation function \(g\), and assume it’s differentiable67 That’s not a very strong assumption. Pretty much any function that we might consider reasonable here is differentiable.. Also call our random variables \(X_i\), with means \(\mu_i\) and variances \(v_i\), and we assume that \(v_i\) and \(\mu_i\) are related by a functional relationship \(v_i = v(\mu_i)\). Then, for values of \(X_i\) in the neighborhood of its mean \(\mu_i\),

\[\begin{equation} g(X_i) = g(\mu_i) + g'(\mu_i) (X_i-\mu_i) + ... \tag{4.10} \end{equation}\]

where the dots stand for higher order terms that we can neglect. The variances of the transformed values are then

\[\begin{equation} \text{Var}(g(X_i)) \simeq g'(\mu_i)^2 \; \text{Var}(X_i) = g'(\mu_i)^2 \, v(\mu_i), \tag{4.11} \end{equation}\]

where we have used the rules \(\text{Var}(X-c)=\text{Var}(X)\) and \(\text{Var}(cX)=c^2\,\text{Var}(X)\) that hold whenever \(c\) is a constant number. Requiring that this be constant leads to the differential equation

\[\begin{equation} g'(x) = \frac{1}{\sqrt{v(x)}}. \tag{4.12} \end{equation}\]

For any given mean-variance relationship \(v(\mu)\), we can solve this for the function \(g\). Let’s check this for some simple cases:

if \(v(\mu)=\mu\) (Poisson), we recover \(g(x)=\sqrt{x}\), the square root transformation.

If \(v(\mu)=c\,\mu^2\), solving the differential equation (4.12) gives \(g(x)=\log(x)\). This explains why the logarithm transformation is so popular in many data analysis applications: it acts as a variance stabilizing transformation whenever the data have a constant coefficient of variation, that is, when the standard deviation is proportional to the mean.

► Question 4.23

What is the variance-stabilizing transformation associated with \(v(\mu) = \mu + c\,\mu^2\)?

We have given motivating examples and ways of using mixtures to model biological data. We saw how the EM algorithm is an interesting example of solving a difficult optimization problem by iteratively pretending we know one part of the solution to compute the other part, then alternating to pretend the other component is known and computing the first part, and so on, until convergence.

We have seen how to model mixtures of two or more normal distributions with different means and variances. We have seen how to decompose a given sample of data from such a mixture, even without knowing the latent variable, using the EM algorithm. The EM approach requires that we know the parametric form of the distributions and the number of components. In Chapter 5, we will see how we can find groupings in data even without relying on such information – this is then called clustering. We can keep in mind that there is a strong conceptual relationship between clustering and mixture modeling.

Infinite mixture models are good for constructing new distributions (such as the gamma-Poisson or the Laplace) out of more basic ones (such as binomial, normal, Poisson). Common examples are

mixtures of normals (often with a hierarchical model on the means and the variances);

beta-binomial mixtures – where the probability \(p\) in the binomial is generated according to a \(\text{beta}(a, b)\) distribution;

gamma-Poisson for read counts (see Chapter 8);

gamma-exponential for PCR.

Mixture models are useful whenever there are several layers of experimental variability. For instance, at the lowest layer, our measurement precision may be limited by basic physical detection limits, and these may be modeled by a Poisson distribution in the case of a counting-based assay, or a normal distribution in the case of the continuous measurement. On top of there may be one (or more) layers of instrument-to-instrument variation, variation in the reagents, operator variaton etc.

Mixture models reflect that there is often heterogeneous amounts of variability (variances) in the data. In such cases, suitable data transformations, i.e., variance stabilizing transformations, are necessary before subsequent visualization or analysis. We’ll study in depth an example for RNA-Seq in Chapter 8, and this also proves useful in the normalization of next generation reads in microbial ecology (McMurdie and Holmes 2014).

Another important application of mixture modeling is the two-component model in multiple testing – we will come back to this in Chapter 6.

We saw that by using the observed sample as a mixture we could generate many simulated samples that inform us about the sampling distribution of an estimate. This method is called the bootstrap and we will return to it several times as it provides a way of evaluating estimates even when an analytic formulae are not available (we say it is non-parametric).

A useful book-long treatment of finite mixture models is by McLachlan and Peel (2004); for the EM algorithm, see also the book by McLachlan and Krishnan (2007). A recent book that presents all EM type algorithms within the Majorize-Minimization (MM) framework is by Lange (2016).

There are in fact mathematical reasons why many natural phenomena can be seen as mixtures: this occurs when the observed events are exchangeable (the order in which they occur doesn’t matter). The theory underlying this is quite mathematical, a good way to start is to look at the Wikipedia entry and the paper by Diaconis and Freedman (1980).

In particular, we use mixtures for high-throughput data. You will see examples in Chapters 8 and 11.

The bootstrap can be used in many situations and is a very useful tool to know about, a friendly treatment is given in (Efron and Tibshirani 1993).

A historically interesting paper is the original article on variance stabilization by Anscombe (1948), who proposed ways of making variance stabilizing transformations for Poisson and gamma-Poisson random variables. Variance stabilization is explained using the delta method in many standard texts in theoretical statistics, e.g., those by Rice (2006 Chapter 6) and Kéry and Royle (2015 Page 35).

Kéry and Royle (2015) provide a nice exploration of using R to build hierarchical models for abundance estimation in niche and spatial ecology.

**The EM algorithm step by step.** As an example dataset, we use the values in the file `Myst.rds`

. As always, it is a good idea to first visualize the data. The histogram is shown in Figure 4.26. We are going to model these data as a mixture of two normal distributions with unknown means and standard deviations. We’ll call the two components A and B.

Figure 4.26: Histogram of our example dataset for the EM algorithm.

```
yvar = readRDS("../data/Myst.rds")$yvar
ggplot(tibble(yvar), aes(x = yvar)) + geom_histogram(binwidth=0.025)
str(yvar)
```

`## num [1:1800] 0.3038 0.0596 -0.0204 0.1849 0.2842 ...`

We start by randomly assigning a “probability of membership” to each of the values in `yvar`

for each of the groups, A and B. These are numbers between 0 and 1, `pA`

represents the probability of coming from mixture component A. The complementary probability is `pB`

.

```
pA = runif(length(yvar))
pB = 1 - pA
```

We also need to set up some housekeeping variables: `iter`

counts over the iterations of the EM algorithm; `loglik`

stores the current log-likelihood; `delta`

stores the change in the log-likelihood from the previous iteration to the current one. We also define the parameters `tolerance`

, `miniter`

and `maxiter`

of the algorithm.

```
iter = 0
loglik = -Inf
delta = +Inf
tolerance = 1e-3
miniter = 50; maxiter = 1000
```

Study the code below and answer the following questions:

Which lines correspond to the E-step, which to the M-step?

What does the M-step do, what does the E-step do?

What is the role of the algorithm arguments `tolerance`

, `miniter`

and `maxiter`

?

Why do we need to compute `loglik`

?

Compare the result of what we are doing here to the output of the `normalmixEM`

function from the **mixtools** package.

```
while((delta > tolerance) && (iter <= maxiter) || (iter < miniter)) {
lambda = mean(pA)
muA = weighted.mean(yvar, pA)
muB = weighted.mean(yvar, pB)
sdA = sqrt(weighted.mean((yvar - muA)^2, pA))
sdB = sqrt(weighted.mean((yvar - muB)^2, pB))
phiA = dnorm(yvar, mean = muA, sd = sdA)
phiB = dnorm(yvar, mean = muB, sd = sdB)
pA = lambda * phiA
pB = (1 - lambda) * phiB
ptot = pA + pB
pA = pA / ptot
pB = pB / ptot
loglikOld = loglik
loglik = sum(log(pA))
delta = abs(loglikOld - loglik)
iter = iter + 1
}
param = tibble(group = c("A","B"), mean = c(muA,muB), sd = c(sdA,sdB))
param
```

```
## # A tibble: 2 x 3
## group mean sd
## <chr> <dbl> <dbl>
## 1 A -0.169 0.0983
## 2 B 0.147 0.150
```

`iter`

`## [1] 370`

▢

Compare the theoretical values of the gamma-Poisson distribution with parameters given by the estimates in `ofit$par`

in Section 4.4.3 to the data used for the estimation using a QQ-plot.

▢

**Mixture modeling examples for regression**. The **flexmix** package (Grün, Scharl, and Leisch 2012) enables us to cluster and fit regressions to the data at the same time. The standard M-step `FLXMRglm`

of **flexmix** is an interface to R’s generalized linear modeling facilities (the `glm`

function). Load the package and an example dataset.

```
library("flexmix")
data("NPreg")
```

a) First, plot the data and try to guess how the points were generated.

b) Fit a two component mixture model using the commands

`m1 = flexmix(yn ~ x + I(x^2), data = NPreg, k = 2)`

c) Look at the estimated parameters of the mixture components and make a truth table that cross-classifies true classes versus cluster memberships. What does the summary of the object `m1`

show us?

d) Plot the data again, this time coloring each point according to its estimated class.

▢

**Other hierarchical noise models:**

Find two papers that explore the use of other infite mixtures for modeling molecular biology technological variation.

▢

Anscombe, Francis J. 1948. “The Transformation of Poisson, Binomial and Negative-Binomial Data.” *Biometrika*. JSTOR, 246–54.

Bulmer, Michael George. 2003. *Francis Galton: Pioneer of Heredity and Biometry*. JHU Press.

Chen, Min, Yang Xie, and Michael Story. 2011. “An Exponential-Gamma Convolution Model for Background Correction of Illumina BeadArray Data.” *Communications in Statistics-Theory and Methods* 40 (17). Taylor & Francis: 3055–69.

Diaconis, Persi, and David Freedman. 1980. “Finite Exchangeable Sequences.” *The Annals of Probability*. JSTOR, 745–64.

Diaconis, Persi, and Susan Holmes. 1994. “Gray Codes for Randomization Procedures.” *Statistics and Computing* 4 (4). Springer: 287–302.

Efron, Bradley, and Robert J Tibshirani. 1994. *An Introduction to the Bootstrap*. CRC press.

Efron, B., and R. Tibshirani. 1993. *An Introduction to the Bootstrap*. Chapman & Hall/CRC.

Grün, Bettina, Theresa Scharl, and Friedrich Leisch. 2012. “Modelling Time Course Gene Expression Data with Finite Mixtures of Linear Additive Models.” *Bioinformatics* 28 (2): 222–28. https://doi.org/10.1093/bioinformatics/btr653.

Hoeting, Jennifer A, David Madigan, Adrian E Raftery, and Chris T Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” *Statistical Science*. JSTOR, 382–401.

Kéry, Marc, and J Andrew Royle. 2015. *Applied Hierarchical Modeling in Ecology: Analysis of Distribution, Abundance and Species Richness in R and Bugs: Volume 1: Prelude and Static Models*. Academic Press.

Kristiansson, Erik, Michael Thorsen, Markus J Tamás, and Olle Nerman. 2009. “Evolutionary Forces Act on Promoter Length: Identification of Enriched Cis-Regulatory Elements.” *Molecular Biology and Evolution* 26 (6). SMBE: 1299–1307.

Kuan, Pei Fen, Dongjun Chung, Guangjin Pan, James A Thomson, Ron Stewart, and Sündüz Keleş. 2011. “A Statistical Framework for the Analysis of ChIP-Seq Data.” *Journal of the American Statistical Association* 106 (495). Taylor & Francis: 891–903.

Lange, Kenneth. 2016. *MM Optimization Algorithms*. SIAM.

McLachlan, Geoffrey, and Thriyambakam Krishnan. 2007. *The EM Algorithm and Extensions*. Vol. 382. John Wiley & Sons.

McLachlan, Geoffrey, and David Peel. 2004. *Finite Mixture Models*. John Wiley & Sons.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” *PLoS Computational Biology* 10 (4). Public Library of Science: e1003531.

Purdom, Elizabeth, and Susan P Holmes. 2005. “Error Distribution for Gene Expression Data.” *Statistical Applications in Genetics and Molecular Biology* 4 (1).

Rice, John. 2006. *Mathematical Statistics and Data Analysis*. Cengage Learning.

Shalizi, Cosma. 2017. *Advanced Data Analysis from an Elementary Point of View*. Cambridge University Press. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf.

Slonim, Noam, Gurinder Singh Atwal, Gašper Tkačik, and William Bialek. 2005. “Information-Based Clustering.” *PNAS* 102 (51). National Acad Sciences: 18297–18302.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” *Journal of the Royal Statistical Society. Series B (Methodological)*. JSTOR, 267–88.

Wills, Quin F, Kenneth J Livak, Alex J Tipping, Tariq Enver, Andrew J Goldson, Darren W Sexton, and Chris Holmes. 2013. “Single-Cell Gene Expression Analysis Reveals Genetic Associations Masked in Whole-Tissue Experiments.” *Nature Biotechnology* 31 (8). Nature Publishing Group: 748–52.

Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. “Regression Models for Count Data in R.” *Journal of Statistical Software* 27 (8). http://www.jstatsoft.org/v27/i08/.

Page built: 2019-05-15

Support for maintaining the online version of this book is provided by de.NBI

For website support please contact msmith@embl.de