# 4 Mixture Models

One of the main challenges of biological data analysis is dealing with heterogeneity. The quantities we are interested in often do not show a simple, unimodal “textbook distribution”. For example, in the last part of Chapter 2) we saw how the histogram of sequence scores in Figure 2.25 had two separate modes, one for CpG-islands and one for non-islands. We can see the data as a simple mixture of a few (in this case: two) components. We call these finite mixtures. Other mixtures can involve almost as many components as we have observations. These we call infinite mixtures58 We will see that—as for so many modeling choices–the right complexity of the mixture is in the eye of the beholder and often depends on the amount of data and the resolution and smoothness we want to attain..

In Chapter 1 we saw how a simple generative model with a Poisson distribution led us to make useful inference in the detection of an epitope. Unfortunately, a satisfactory fit to real data with such a simple model is often out of reach. However, simple models such as the normal or Poisson distributions can serve as building blocks for more realistic models using the mixing framework that we cover in this chapter. Mixtures occur naturally for flow cytometry data, biometric measurements, RNA-Seq, ChIP-Seq, microbiome and many other types of data collected using modern biotechnologies. In this chapter we will learn from simple examples how to build more realistic models of distributions using mixtures.

## 4.1 Goals for this chapter

In this chapter we will:

• Generate our own mixture model data from distributions composed of two normal populations.

• See how the Expectation-Maximization (EM) algorithm and allows us to “reverse engineer” the underlying mixtures in dataset.

• Use a special type of mixture called zero-inflation for data such as ChIP-Seq data that have many extra zeros.

• Discover the empirical cumulative distribution: a special mixture that we can build from observed data. This will enable us to see how we can simulate the variability of our estimates using the bootstrap.

• Build the Laplace distribution as an instance of an infinite mixture model- with many components. We will use it to model promoter lengths and microarray intensities.

• Have our first encounter with the gamma-Poisson distribution, a hierarchical model useful for RNA-Seq data. We will see it comes natually from mixing different Poisson distributed sources.

• See how mixture models enable us to choose data transformations.

## 4.2 Finite mixtures

### 4.2.1 Simple examples and computer experiments

Here is a first example of a mixture model with two equal-sized components. We decompose the generating process into steps:

Flip a fair coin.

Generate a random number from a normal distribution with mean 1 and variance 0.25.

Generate a random number from a normal distribution with mean 3 and variance 0.25. The histogram shown in Figure 4.1 was produced by repeating these two steps 10,000 times using the following code.

coinflips = (runif(10000) > 0.5)
table(coinflips)
## coinflips
## FALSE  TRUE
##  5005  4995

Figure 4.1: Histogram of 10,000 random draws from a fair mixture of two normals. The left hand part of the histogram is dominated by numbers generated from (A), on the right from (B).

oneFlip = function(fl, mean1 = 1, mean2 = 3, sd1 = 0.5, sd2 = 0.5) {
if (fl) {
rnorm(1, mean1, sd1)
} else {
rnorm(1, mean2, sd2)
}
}
fairmix = vapply(coinflips, oneFlip, numeric(1))
library("ggplot2")
library("dplyr")
ggplot(tibble(value = fairmix), aes(x = value)) +
geom_histogram(fill = "purple", binwidth = 0.1)

► Question 4.1

How can you use R’s vectorized syntax to remove the vapply-loop and generate the fairmix vector more efficiently?

► Solution

► Question 4.2

Using your improved code, use one million coin flips and make a histogram with 500 bins. What do you notice?

► Solution

The density function for a normal $$N(\mu,\sigma)$$ random variable can be written explicitly; we usually call it $$\phi(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$

► Question 4.3

1. Plot a histogram of those values of fair$values for which coinflips is TRUE. Hint: use y = ..density.. in the call to aes (which means that the vertical axis is to show the proportion of counts) and set the binwidth to 0.01. 2. Overlay the line corresponding to $$\phi(z)$$. ► Solution In fact we can write the mathematical formula for the density of all of fair$values (the limiting curve that the histograms tend to look like) as a sum of the two densities.

$$$f(x)=\frac{1}{2}\phi_1(x)+\frac{1}{2}\phi_2(x), \tag{4.1}$$$

where $$\phi_1$$ is the density of the normal $$N(\mu_1=1,\sigma^2=0.25)$$ and $$\phi_2$$ is the density of the normal $$N(\mu_2=3,\sigma^2=0.25)$$. Figure 4.3 was generated by the following code.

Figure 4.3: The theoretical density of the mixture.

fairtheory = tibble(
x = seq(-1, 5, length.out = 1000),
f = 0.5 * dnorm(x, mean = means[1], sd = sds[1]) +
0.5 * dnorm(x, mean = means[2], sd = sds[2]))
ggplot(fairtheory, aes(x = x, y = f)) +
geom_line(color = "red", size = 1.5) + ylab("mixture density")

In this case the mixture model is extremely visible as the two component distributions have little overlap. Figure 4.3 shows two distinct peaks: we call this a bimodal distribution. However in many cases the separation is not so clear.

Figure 4.4: A mixture of two normals that is harder to recognize.

► Question 4.4

Figure 4.4 is a histogram of a fair mixture of two normals with the same variances. Can you guess the two mean parameters of the component distributions? Hint: you can use trial and error, and simulate various mixtures to see if you can make a matching histogram. Looking at the R code for the chapter will show you exactly how the data were generated.

► Solution

In Figure 4.5, the bars from the two component distributions are plotted on top of each other. A different way of showing the components is Figure 4.6, produced by the code below.

Figure 4.6: As Figure 4.5, with stacked bars for the two mixture components.

ggplot(mystery, aes(x = values, fill = coinflips)) +
geom_histogram(data = dplyr::filter(mystery, coinflips),
fill = "red", alpha = 0.2, breaks = br) +
geom_histogram(data = dplyr::filter(mystery, !coinflips),
fill = "darkblue", alpha = 0.2, breaks = br) +
geom_histogram(fill = "purple", breaks = br, alpha = 0.2)

► Question 4.5

Why do the bars in Figure 4.6, but not those in Figure 4.5 have the same heights as in Figure 4.4?

In Figures 4.5 and 4.6, we were able to use the coinflips column in the data to disentangle the components. In real data, this information is missing.

A book-long treatment on the subject of finite mixtures (McLachlan and Peel 2004).

### 4.2.2 Discovering the hidden class labels

We use a method called the expectation-maximization (EM) algorithm to infer the value of the hidden groupings. The expectation-maximization algorithm is a popular iterative procedure that alternates between

• pretending we know the probability with which each observation belongs to a component and estimating the distribution parameters of the components, and

• pretending we know the parameters of the component distributions and estimating the probability with which each observation belongs to the components.

Let’s take a simple example. We measure a variable $$Y$$ on a series of objects that we think come from two groups, although we do not know the group labels. We start by augmenting59 Adding another variable which was not measured, called a hidden or latent variable. the data with the unobserved (latent) group label, which we’ll call $$U$$.

We are interested in finding the values of $$U$$ and the unknown parameters of the underlying densities. We will use the maximum likelihood approach introduced in Chapter 2 to estimate the parameters that make the data $$Y$$ the most likely. We will use the notion of bivariate distribution here because we are going to look at the distribution of “couples” $$(Y, U)$$. We can write the probability densities under the parametric model:

$$$f_\theta(y,u) = f_\theta(y\,|\,u) f_\theta(u) \tag{4.2}$$$

where $$\theta$$ stands for the tuple of parameters of the underlying density. In our previous example, $$\theta$$ would be the two means, the two standard deviations, and the mixture fraction $$\lambda=0.5$$; thus $$\theta=(\mu_1,\mu_2,\sigma_1,\sigma_2,\lambda)$$.

► Question 4.6

Suppose we have two unfair coins, whose probabilities of heads are $$p_1 = 0.125$$ and $$p_2=0.25$$. With probability $$\pi$$ we pick coin 1, with probability $$1-\pi$$, coin 2. We then toss that coin twice and record the number of heads $$K$$.
a) Simulate 100 instances of this procedure, with $$\pi=\frac{1}{8}$$, and compute the contingency table of $$K$$.
b) Do the same with $$\pi=\frac{1}{4}$$.
c) If you were not told $$p_1$$, $$p_2$$ and $$\pi$$, could you infer them from the contingency table?

► Solution

#### Mixture of normals

Suppose we have a mixture of two normals with mean parameters unknown and standard deviations $$1$$; thus, $$(\mu_1=?,\,\mu_2=?,\,\sigma_1=\sigma_2=1)$$. Here is an example of data generated according to such a model. The labels are $$u$$.

mus = c(-0.5, 1.5)
u = sample(2, 100, replace = TRUE)
y = rnorm(length(u), mean = mus[u])
duy = tibble(u, y)
head(duy)
## # A tibble: 6 x 2
##       u       y
##   <int>   <dbl>
## 1     2  1.74
## 2     1 -0.259
## 3     1 -2.83
## 4     1 -1.47
## 5     1  0.0517
## 6     1 -2.37

If we know the labels $$u$$, we can estimate the means using separate MLEs for each group. The overall MLE is obtained by maximizing

$$$f(y, u \,|\, \theta) = \prod_{\{i:\,u_i=1\}} \phi_1(y_i) \prod_{\{i:\,u_i=2\}} \phi_2(y_i), \tag{4.3}$$$

or its logarithm. The maximization can be split into two independent pieces and solved as if we had two different MLEs to find:

group_by(duy, u) %>% summarize(mean(y))
## # A tibble: 2 x 2
##       u mean(y)
##   <int>     <dbl>
## 1     1    -0.355
## 2     2     1.37

► Question 4.7

Suppose we knew the mixing proportion was $$\lambda=\frac{1}{2}$$, so that the density is as above in Equation (4.1), namely $$\frac{1}{2}\phi_1+\frac{1}{2}\phi_2$$. Try writing out the (log) likelihood. What prevents us from solving for the MLE explicitly here?

► Solution

In reality we do not know the $$u$$ labels, nor do we know the mixture proportions (i.e., $$\lambda$$). We have to start with an initial guess for the labels, estimate the parameters and go through several iterations of the algorithm, updating at each step our current best guess of the group labels and the parameters until we see no substantial improvement in our optimizations.

In fact we can do something more elaborate and replace the “hard” labels $$u$$ for each observation (it is either in group 1 or 2) by membership probabilities that sum up to 1. The probability $$p(u, x\,|\,\theta)$$ serves as a weight or “participation” of observation $$x$$ in the likelihood function60 This is sometimes called “soft” averaging, (Slonim et al. 2005). This highlights the fact that we create weighted averages where each point’s probability serves as the weight, so we don’t have to make a hard decision that a point belongs to this or that group.. The marginal likelihood for the observed $$y$$ is the sum over all possible values of $$u$$ of the densities at $$(y,u)$$:

$\begin{equation*} \text{marglike}(\theta;y)=f(y\,|\,\theta)=\sum_u f(y,u\,|\,\theta) \,\text{d}u. \end{equation*}$

At each iteration (we mark the present values with an asterisk, $$*$$), the current best guesses for the unknown parameters (for instance $$\mu_1^*$$ and $$\mu_2^*$$) are combined into $$\theta^*=(\mu_1^*,\mu_2^*,\lambda^*)$$. We use these to compute the so-called Expectation61 The term expectation here means that we average, or integrate, over all possible values of $$u$$. function $$E^*(\theta)$$

$\begin{equation*} E^*(\theta)=E_{\theta^*,Y}[\log p(u,y\,|\,\theta^*)]=\sum_u p(u\,|\,y ,\theta^*) \log p(u,y\,|\,\theta^*), \end{equation*}$

The value of $$\theta$$ that maximizes $$E^*$$ is found in what is known as the Maximization step. These two iterations (E and M) are repeated until the improvements are small; this is a numerical indication that we are close to a flattening of the likelihood and so we have reached a local maximum62 The iteration trajectory will depend on the starting point, but, as in an uphill hike, while the intermediary points may be different, we always end up on top of the hill - as long as there is only one hilltop and not several.. It’s good practice to repeat such a procedure several times from different starting points and check that we always get the same answer.

► Question 4.8

Several R packages provide EM implementations, including mclust, EMcluster and EMMIXskew. Choose one and run the EM function several times with different starting values. Then use the function normalmixEM from the mixtools package to compare the outputs.

► Solution

The EM algorithm is very instructive:

1. It shows us how we can tackle a difficult problem with too many unknowns by alternating between solving simpler problems. In this way, we eventually find estimates of hidden variables.

2. It provides a first example of soft averaging i.e., where we don’t decide whether an observation belongs to one group or another, but allow it to participate in several groups by using probabilities of membership as weights, and thus obtain more nuanced estimates.

3. The method employed here can be extended to the more general case of model-averaging (Hoeting et al. 1999). It can be sometimes beneficial to consider several models simultaneously if we are unsure which one is relevant for our data. We can combine them together into a weighted model. The weights are provided by the likelihoods of the models.

### 4.2.3 Models for zero inflated data

Ecological and molecular data often come in the form of counts. For instance, this may be the number of individuals from each of several species at each of several locations. Such data can often be seen as a mixture of two scenarios: if the species is not present, the count is necessarily zero, but if the species is present, the number of individuals we observe varies, with a random sampling distribution; this distribution may also include zeros. We model this as a mixture:

$\begin{equation*} f_{\text{zi}}(y) = \lambda \, \delta(y) + (1-\lambda) \, f_{\text{count}}(y), \end{equation*}$

where $$\delta$$ is Dirac’s delta function, which represents a probability distribution that has all its mass at 0. The zeros from the first mixture component are called “structural”: in our example, they occur because certain species do not live in certain habitats. The second component, $$f_{\text{count}}$$ may also include zeros and other small numbers, simply due to random sampling. The R packages pscl (Zeileis, Kleiber, and Jackman 2008) and zicounts provide many examples and functions for working with zero inflated counts.

#### Example: ChIP-Seq data

Let’s consider the example of ChIP-Seq data. These data are sequences of pieces of DNA that are obtained from chromatin immunoprecipitation (ChIP). This technology enables the mapping of the locations along genomic DNA of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, polymerases and other protein. It was the main technology used by the Encyclopedia of DNA Elements (ENCODE) Project. Here we use an example (Kuan et al. 2011) from the mosaicsExample package, which shows data measured on chromosome 22 from ChIP-Seq of antibodies for the STAT1 protein and the H3K4me3 histone modification applied to the GM12878 cell line. Here we do not show the code to construct the binTFBS object; it is shown in the source code file for this chapter and follows the vignette of the mosaics package.

binTFBS
## Summary: bin-level data (class: BinData)
## ----------------------------------------
## - # of chromosomes in the data: 1
## - total effective tag counts: 462479
##   (sum of ChIP tag counts of all bins)
## - control sample is incorporated
## - mappability score is NOT incorporated
## - GC content score is NOT incorporated
## ----------------------------------------

From this object, we can create the histogram of per-bin counts.

Figure 4.7: The number of binding sites found in 200nt windows along chromosome 22 in a ChIP-Seq dataset.

bincts = print(binTFBS)
ggplot(bincts, aes(x = tagCount)) +
geom_histogram(binwidth = 1, fill = "forestgreen")

We see in Figure 4.7 that there are many zeros, although from this plot it is not immediately obvious whether the number of zeros is really extraordinary, given the frequencies of the other small numbers ($$1,2,...$$).

► Question 4.9

1. Redo the histogram of the counts using a logarithm base 10 scale on the $$y$$-axis.
2. Estimate $$\pi_0$$, the proportion of bins with zero counts.

► Solution

### 4.2.4 More than two components

So far we have looked at mixtures of two components. We can extend our description to cases where there may be more. For instance, when weighing N=7,000 nucleotides obtained from mixtures of deoxyribonucleotide monophosphates (each type has a different weight, measured with the same standard deviation sd=3), we might observe the histogram (shown in Figure 4.9) generated by the following code.

Figure 4.9: Simulation of 7,000 nucleotide mass measurements.

masses = c(A =  331, C =  307, G =  347, T =  322)
probs  = c(A = 0.12, C = 0.38, G = 0.36, T = 0.14)
N  = 7000
sd = 3
nuclt   = sample(length(probs), N, replace = TRUE, prob = probs)
mean = masses[nuclt],
sd   = sd)
geom_histogram(bins = 100, fill = "purple")

► Question 4.10

Repeat this simulation experiment with $$N=1000$$ nucleotide measurements. What do you notice in the histogram?

► Question 4.11

What happens when $$N=7000$$ but the standard deviation is 10?

► Question 4.12

Plot the theoretical density curve for the distribution simulated in Figure 4.9.

In this case, as we have enough measurements with good enough precision, we are able to distinguish the four nucleotides and decompose the distribution shown in Figure 4.9. With fewer data and/or more noisy measurements, the four modes and the distribution component might be less clear.

## 4.3 Empirical distributions and the nonparametric bootstrap

In this section, we will consider an extreme case of mitxure models, where we model our sample of $$n$$ data points as a mixture of $$n$$ point masses. We could use almost any set of data here; to be concrete, we use Darwin’s Zea Mays data63 They were collected by Darwin who asked his cousin, Francis Galton to analyse them. R.A. Fisher re-analysed the same data using a paired t-test (Bulmer 2003). We will get back to this example in Chapter 13. in which he compared the heights of 15 pairs of Zea Mays plants (15 self-hybridized versus 15 crossed). The data are available in the HistData package, and we plot the distribution of the 15 differences in height:

library("HistData")
ZeaMays$diff ## [1] 6.125 -8.375 1.000 2.000 0.750 2.875 3.500 5.125 ## [9] 1.750 3.625 7.000 3.000 9.375 7.500 -6.000 Figure 4.10: The observed sample can be seen as a mixture of point masses at each of the values (real point masses would be bars without any width whatsoever). ggplot(ZeaMays, aes(x = diff, ymax = 1/15, ymin = 0)) + geom_linerange(size = 1, col = "forestgreen") + ylim(0, 0.1) In Section 3.6.6 we saw that the empirical cumulative distribution function (ECDF) for a sample of size $$n$$ is $$$\hat{F}_n(x)= \frac{1}{n}\sum_{i=1}^n {\mathbb 1}_{x \leq x_i}, \tag{4.4}$$$ and we saw ECDF plots in Figure 3.22. We can also write the density of our sample as $$$\hat{f}_n(x) =\frac{1}{n}\sum_{i=1}^n \delta_{x_i}(x) \tag{4.5}$$$ In general, the density of a probability distribution is the derivative (if it exists) of the distribution function. We have applied this principle here: the density of the distribution defined by Equation (4.4) is Equation (4.5). We could do this since one can consider the function $$\delta_a$$ the “derivative” of the step function $${\mathbb 1}_{x \leq a}$$: it is completely flat almost everywhere, except at the one point $$a$$ where there is the step, and where its value is “infinite” There is a bit of advanced mathematics (beyond standard calculus) required for this to make sense, which is however beyond the scope of our treatment here.. Equation (4.5) highlights that our data sample can be considered a mixture of $$n$$ point masses at the observed values $$x_1,x_2,..., x_{n}$$, such as in Figure 4.10. Figure 4.11: The value of a statistic $$\tau$$ is estimated from data (the grey matrices) generated from an underlying distribution $$F$$. Different samples from $$F$$ lead to different data, and so to different values of the estimate $$\hat{\tau}$$: this is called sampling variability. The distribution of all the $$\hat{\tau}$$’s is the sampling distribution. Statistics of our sample, such as the mean, minimum or median, can now be written as a function of the ECDF, for instance, $$\bar{x} = \int \delta_{x_i}(x)\,\text{d}x$$. As another instance, if $$n$$ is an odd number, the median is $$x_{(\frac{n+1}{2})}$$, the value right in the middle of the ordered list. The true sampling distribution of a statistic $$\hat{\tau}$$ is often hard to know as it requires many different data samples from which to compute the statistic; this is shown in Figure 4.11. The bootstrap principle approximates the true sampling distribution of $$\hat{\tau}$$ by creating new samples drawn from the empirical distribution built from the original sample (Figure 4.12). We reuse the data (by considering it a mixture distribution of $$\delta$$s) to create new “datasets” by taking samples and looking at the sampling distribution of the statistics $$\hat{\tau}^*$$ computed on them. This is called the nonparametric bootstrap resampling approach, see Efron and Tibshirani (1994) for a complete reference. It is very versatile and powerful method that can be applied to basically any statistic, no matter how complicated it is. We will see example applications of this method, in particular to clustering, in Chapter 5). Figure 4.12: The bootstrap simulates the sampling variability by drawing samples not from the underlying true distribution $$F$$ (as in Figure 4.11), but from the empirical distribution function $$\hat{F}_n$$. Using these ideas, let’s try to estimate the sampling distribution of the median of the Zea Mays differences we saw in Figure 4.10. We use similar simulations to those in the previous sections: Draw $$B=1000$$ samples of size 15 from the 15 values (each being a component in the 15-component mixture). Then compute the 1000 medians of each of these samples of 15 values and look at their distribution: this is the bootstrap sampling distribution of the median. Figure 4.13: The bootstrap sampling distribution of the median of the Zea Mays differences. B = 1000 meds = replicate(B, { i = sample(15, 15, replace = TRUE) median(ZeaMays$diff[i])
})
ggplot(tibble(medians = meds), aes(x = medians)) +
geom_histogram(bins = 30, fill = "purple")

► Question 4.13

Estimate a 99% confidence interval for the median based on these simulations. What can you conclude from looking at the overlap between this interval and 0?

► Question 4.14

Use the bootstrap package to redo the same analysis using the function bootstrap for both median and mean. What differences do you notice between the sampling distribution of the mean and the median?

► Solution

#### Why nonparametric?

In theoretical statistics, nonparametric methods are those that have infinitely many degrees of freedom or numbers of unknown parameters.

Despite their name, nonparametric methods are not methods that do not use parameters: all statistical methods estimate unknown quantities. In practice, we do not wait for infinity; when the number of parameters becomes as large or larger than the amount of data available, we say the method is nonparametric. The bootstrap uses a mixture with $$n$$ components, so with a sample of size $$n$$, it qualifies as a nonparametric method.

► Question 4.15

If the sample is composed of $$n=3$$ different values, how many different bootstrap resamples are possible? Answer the same question with $$n=15$$.

► Solution

► Question 4.16

What are the two types of error that can occur when using the bootstrap as it is implemented in the bootstrap package? Which parameter can you modify to improve one of them?

► Solution

## 4.4 Infinite mixtures

Sometimes mixtures can be useful even if we don’t aim to assign a label to each observation or, to put it differently, if we allow as many labels’ as there are observations. If the number of mixture components is as big as (or bigger than) the number of observations, we say we have an infinite mixture. Let’s look at some examples.

### 4.4.1 Infinite mixture of normals

Figure 4.14: Laplace knew already that the probability density $f_Y(y)=\frac{1}{2\phi}\exp\left(-\frac{|y-\theta|}{\phi}\right),\qquad\phi>0$ has the median as its location parameter $$\theta$$ and the median absolute deviation (MAD) as its scale parameter $$\phi$$.

Consider the following two-level data generating scheme: Level 1 Create a sample of Ws from an exponential distribution.

w = rexp(10000, rate = 1)

Level 2 The $$w$$s serve as the variances of normal variables with mean $$\mu$$ generated using rnorm.

Figure 4.15: Data sampled from a Laplace distribution.

mu  = 0.3
lps = rnorm(length(w), mean = mu, sd = sqrt(w))
ggplot(data.frame(lps), aes(x = lps)) +
geom_histogram(fill = "purple", binwidth = 0.1)

This turns out to be a rather useful distribution. It has well-understood properties and is named after Laplace, who proved that the median is a good estimator of its location parameter $$\theta$$ and that the median absolute deviation can be used to estimate its scale parameter $$\phi$$. From the formula in the caption of Figure 4.14 we see that the $$L_1$$ distance (absolute value of the difference) holds a similar position in the Laplace density as the $$L_2$$ (square of the difference) does for the normal density.

Conversely, in Bayesian regression64 Don’t worry if you are not familiar with this, in that case just skip this sentence., having a Laplace distribution as a prior on the coefficients amounts to an $$L_1$$ penalty, called the lasso (Tibshirani 1996), while a normal distribution as a prior leads to an $$L_2$$ penalty, called ridge regression.

► Question 4.17

Write a random variable whose distribution is the symmetric Laplace as a function of normal and exponential random variables.

► Solution

#### Asymmetric Laplace

In the Laplace distribution, the variances of the normal components depend on $$W$$, while the means are unaffected. A useful extension adds another parameter $$\theta$$ that controls the locations or centers of the components. We generate data alps from a hierarchical model with $$W$$ an exponential variable; the output shown in Figure 4.16 is a histogram of normal $$N(\theta+w\mu,\sigma w)$$ random numbers, where the $$w$$’s themselves were randomly generated from an exponential distribution with mean $$1$$ as shown in the code:

Figure 4.16: Histogram of data generated from an asymmetric Laplace distribution – a scale mixture of many normals whose means and variances are dependent. We write $$X \sim AL(\theta, \mu, \sigma)$$.

mu = 0.3; sigma = 0.4; theta = -1
w  = rexp(10000, 1)
alps = rnorm(length(w), theta + mu * w, sigma * sqrt(w))
ggplot(tibble(alps), aes(x = alps)) +
geom_histogram(fill = "purple", binwidth = 0.1)

Such hierarchical mixture distributions, where every instance of the data has its own mean and variance, are useful models in many biological settings. Examples are shown in Figure 4.17.

► Question 4.18

Looking at the log-ratio of gene expression values from a microarray, one gets a distribution as shown on the right of Figure 4.17. How would one explain that the data have a histogram of this form?

The Laplace distribution is an example of where the consideration of the generative process indicates how the variance and mean are linked. The expectation value and variance of an asymmetric Laplace distribution $$AL(\theta, \mu, \sigma)$$ are

$$$E(Y ) = \theta + \mu \quad\quad\text{and}\quad\quad\text{var}(Y)=\sigma^2+\mu^2. \tag{4.7}$$$

Note the variance is dependent on the mean, unless $$\mu = 0$$ (the case of the symmetric Laplace Distribution). This is the feature of the distribution that makes it useful. Having a mean-variance dependence is very common for physical measurements, be they microarray fluorescence intensities, peak heights from a mass spectrometer, or reads counts from high-throughput sequencing, as we’ll see in the next section.

### 4.4.2 Infinite mixtures of Poisson variables.

Figure 4.18: How to count the fish in a lake? MC Escher.

A similar two-level hierarchical model is often also needed to model real-world count data. At the lower level, simple Poisson and binomial distributions serve as the building blocks, but their parameters may depend on some underlying (latent) process. In ecology, for instance, we might be interested in variations of fish species in all the lakes in a region. We sample the fish species in each lake to estimate their true abundances, and that could be modeled by a Poisson. But the true abundances will vary from lake to lake. And if we want to see whether, for instance, changes in climate or altitude play a role, we need to disentangle such systematic effects from random lake-to-lake variation. The different Poisson rate parameters $$\lambda$$ can be modeled as coming from a distribution of rates. Such a hierarchical model also enables us to add supplementary steps in the hierarchy, for instance we could be interested in many different types of fish, model altitude and other environmental factors separately, etc.

Further examples of sampling schemes that are well modeled by mixtures of Poisson variables include applications of high-throughput sequencing, such as RNA-Seq, which we will cover in detail in Chapter 8, or 16S rRNA-Seq data used in microbial ecology.

### 4.4.3 Gamma distribution: two parameters (shape and scale)

Now we are getting to know a new distribution that we haven’t seen before. The gamma distribution is an extension of the (one-parameter) exponential distribution, but it has two parameters, which makes it more flexible. It is often useful as a building block for the upper level of a hierarchical model. The gamma distribution is positive-valued and continuous. While the density of the exponential has its maximum at zero and then simply decreases towards 0 as the value goes to infinity, the density of the gamma distribution has its maximum at some finite value. Let’s explore it by simulation examples. The histograms in Figure 4.19 were generated by the following lines of code:

Figure 4.19: Histograms of random samples of gamma distributions. The upper histogram shows a gamma$$(2,\frac{1}{3})$$, the lower a gamma$$(10,\frac{3}{2})$$ distribution. The gamma is a flexible two parameter distribution: .

ggplot(tibble(x = rgamma(10000, shape = 2, rate = 1/3)),
aes(x = x)) + geom_histogram(bins = 100, fill= "purple")
ggplot(tibble(x = rgamma(10000, shape = 10, rate = 3/2)),
aes(x = x)) + geom_histogram(bins = 100, fill= "purple")

#### Gamma–Poisson mixture: a hierarchical model

1. Generate a set of parameters: $$\lambda_1,\lambda_2,...$$ from a gamma distribution.

2. Use these to generate a set of Poisson($$\lambda_i$$) random variables, one for each $$\lambda_1$$.

Figure 4.20: Histogram of gp, generated via a gamma-Poisson hierachical model.

lambda = rgamma(10000, shape = 10, rate = 3/2)
gp = rpois(length(lambda), lambda = lambda)
ggplot(tibble(x = gp), aes(x = x)) +
geom_histogram(bins = 100, fill= "purple")

The resulting values are said to come from a gamma–Poisson mixture. Figure 4.20 shows the histogram of gp.

► Question 4.19

1. Are the values generated from such a gamma–Poisson mixture continuous or discrete ?
2. What is another name for this distribution? Hint: Try the different distributions provided by the goodfit function from the vcd package.

► Solution

In R, and in some other places, the gamma-Poisson distribution travels under the alias name of negative binomial distribution. The two names are synonyms; the second one alludes to the fact that Equation (4.8) bears some formal similarities to the probabilities of a binomial distribution. The first name, gamma–Poisson distribution, is more indicative of its generating mechanism, and that’s what we will use in the rest of the book. It is a discrete distribution, that means that it takes values only on the natural numbers (in contrast to the gamma distribution, which covers the whole positive real axis). Its probability distribution is

$$$\text{P}(K=k)=\left(\begin{array}{c}k+a-1\\k\end{array}\right) \, p^a \, (1-p)^k, \tag{4.8}$$$

which depends on the two parameters $$a\in\mathbb{R}^+$$ and $$p\in[0,1]$$. Equivalently, the two parameters can be expressed by the mean $$\mu=pa/(1-p)$$ and a parameter called the dispersion $$\alpha=1/a$$. The variance of the distribution depends on these parameters, and is $$\mu+\alpha\mu^2$$.

► Question 4.20

If you are more interested in analytical derivations than illustrative simulations, try writing out the mathematical derivation of the gamma-Poisson probability distribution.

► Solution

### 4.4.4 Variance stabilizing transformations

A key issue we need to control when we analyse experimental data is how much variability there is between repeated measurements of the same underlying true value, i.e. between replicates. This will determine whether and how well we can see any true differences, i.e. between different conditions. Data that arise through the type of hierarchical models we have studied in this chapter often turn out to have very heterogenous variances, and this can be a challenge. We will see how in such cases variance-stabilizing transformations (Anscombe 1948) can help. Note how we construct the dataframe (or, more precisely, the tibble) simdat: the output of the lapply loop is a list of tibbles, one for each value of lambdas. With the pipe operator %>% we send it to the function bind_rows (from the dplyr package). The result is a dataframe of all the list elements neatly stacked on top of each other. Let’s start with a series of Poisson variables with rates lambdas:

lambdas = seq(100, 900, by = 100)
simdat = lapply(lambdas, function(l)
tibble(y = rpois(n = 40, lambda=l), lambda = l)
) %>% bind_rows
library("ggbeeswarm")
ggplot(simdat, aes(x = lambda, y = y)) +
geom_beeswarm(alpha = 0.6, color = "purple")
ggplot(simdat, aes(x = lambda, y = sqrt(y))) +
geom_beeswarm(alpha = 0.6, color = "purple")

The data that we see in the left panel of Figure 4.23 are an example of what is called heteroscedasticity: the standard deviations (or, equivalently, the variance) of our data is different in different regions of our data space. In particular, it increases along the $$x$$-axis, with the mean. For the Poisson distribution, we indeed know that the standard deviation is the square root of the mean; for other types of data, there may be other dependencies. This can be a problem if we want to apply subsequent analysis techniques (for instance, regression, or a statistical test) that are based on assuming that the variances are the same. In Figure 4.23, the number of replicates going into each beeswarm are quite large (40). In practice, this is not always the case, so the heteroskedasticity may be harder to see, even though it is there. However, as we see in the right panel of Figure 4.23, if we apply the square root transformation to the $$y$$-variables, then the transformed variables will have approximately the same variance. In the following code, we use the transformation $$y \mapsto 2\sqrt{y}$$, which yields variance approximately equal to 1.

summarise(group_by(simdat, lambda), sd(y), sd(2*sqrt(y)))
## # A tibble: 9 x 3
##   lambda sd(y) sd(2 * sqrt(y))
##    <dbl>   <dbl>             <dbl>
## 1    100    11.3             1.13
## 2    200    14.4             1.02
## 3    300    20.5             1.17
## 4    400    22.2             1.10
## 5    500    20.6             0.917
## 6    600    23.3             0.945
## 7    700    16.6             0.630
## 8    800    28.5             1.00
## 9    900    28.0             0.937

► Question 4.21

Repeat the computation in the above code chunk for a version of simdat with a larger number of replicates than 40.

Another example, now using the gamma-Poisson distribution, is shown in Figure 4.24. We generate gamma-Poisson variables u66 To catch a greater range of values for the mean value mu, without creating too dense a sequence, we use a geometric series: $$\mu_{i+1} = 2\mu_i$$. and plot the 95% confidence intervals around the mean.

muvalues = 2^seq(0, 10, by = 1)
simgp = lapply(muvalues, function(mu) {
u = rnbinom(n = 1e4, mu = mu, size = 4)
tibble(mean = mean(u), sd = sd(u),
lower = quantile(u, 0.025),
upper = quantile(u, 0.975),
mu = mu)
} ) %>% bind_rows
head(as.data.frame(simgp), 2)
##     mean       sd lower upper mu
## 1 0.9993 1.124556     0     4  1
## 2 2.0013 1.738563     0     6  2

Figure 4.24: gamma-Poisson distributed measurement data, for a range of $$\mu$$ from 1 to 1024.

ggplot(simgp, aes(x = mu, y = mean, ymin = lower, ymax = upper)) +
geom_point() + geom_errorbar()

► Question 4.22

How can we find a transformation for these data that stabilizes the variance, similar to the square root function for the Poisson distributed data?

► Solution

Figure 4.25: Piecewise linear function that stabilizes the variance of the data in Figure 4.24.

simgp = mutate(simgp,
slopes = 1 / sd,
trsf   = cumsum(slopes * mean))
ggplot(simgp, aes(x = mean, y = trsf)) +
geom_point() + geom_line() + xlab("")

We see in Figure 4.25 that this function has some resemblance to a square root function in particular at its lower end. At the upper end, it seems to look more like a logarithm. The more mathematically inclined will see that an elegant extension of these numerical calcuations can be done through a little calculus known as the delta method, as follows.

Call our transformation function $$g$$, and assume it’s differentiable67 That’s not a very strong assumption. Pretty much any function that we might consider reasonable here is differentiable.. Also call our random variables $$X_i$$, with means $$\mu_i$$ and variances $$v_i$$, and we assume that $$v_i$$ and $$\mu_i$$ are related by a functional relationship $$v_i = v(\mu_i)$$. Then, for values of $$X_i$$ in the neighborhood of its mean $$\mu_i$$,

$$$g(X_i) = g(\mu_i) + g'(\mu_i) (X_i-\mu_i) + ... \tag{4.10}$$$

where the dots stand for higher order terms that we can neglect. The variances of the transformed values are then

$$$\text{Var}(g(X_i)) \simeq g'(\mu_i)^2 \; \text{Var}(X_i) = g'(\mu_i)^2 \, v(\mu_i), \tag{4.11}$$$

where we have used the rules $$\text{Var}(X-c)=\text{Var}(X)$$ and $$\text{Var}(cX)=c^2\,\text{Var}(X)$$ that hold whenever $$c$$ is a constant number. Requiring that this be constant leads to the differential equation

$$$g'(x) = \frac{1}{\sqrt{v(x)}}. \tag{4.12}$$$

For any given mean-variance relationship $$v(\mu)$$, we can solve this for the function $$g$$. Let’s check this for some simple cases:

• if $$v(\mu)=\mu$$ (Poisson), we recover $$g(x)=\sqrt{x}$$, the square root transformation.

• If $$v(\mu)=c\,\mu^2$$, solving the differential equation (4.12) gives $$g(x)=\log(x)$$. This explains why the logarithm transformation is so popular in many data analysis applications: it acts as a variance stabilizing transformation whenever the data have a constant coefficient of variation, that is, when the standard deviation is proportional to the mean.

► Question 4.23

What is the variance-stabilizing transformation associated with $$v(\mu) = \mu + c\,\mu^2$$?

## 4.5 Summary of this chapter

We have given motivating examples and ways of using mixtures to model biological data. We saw how the EM algorithm is an interesting example of solving a difficult optimization problem by iteratively pretending we know one part of the solution to compute the other part, then alternating to pretend the other component is known and computing the first part, and so on, until convergence.

#### Finite mixture models.

We have seen how to model mixtures of two or more normal distributions with different means and variances. We have seen how to decompose a given sample of data from such a mixture, even without knowing the latent variable, using the EM algorithm. The EM approach requires that we know the parametric form of the distributions and the number of components. In Chapter 5, we will see how we can find groupings in data even without relying on such information – this is then called clustering. We can keep in mind that there is a strong conceptual relationship between clustering and mixture modeling.

#### Common infinite mixture models

Infinite mixture models are good for constructing new distributions (such as the gamma-Poisson or the Laplace) out of more basic ones (such as binomial, normal, Poisson). Common examples are

• mixtures of normals (often with a hierarchical model on the means and the variances);

• beta-binomial mixtures – where the probability $$p$$ in the binomial is generated according to a $$\text{beta}(a, b)$$ distribution;

• gamma-Poisson for read counts (see Chapter 8);

• gamma-exponential for PCR.

#### Applications

Mixture models are useful whenever there are several layers of experimental variability. For instance, at the lowest layer, our measurement precision may be limited by basic physical detection limits, and these may be modeled by a Poisson distribution in the case of a counting-based assay, or a normal distribution in the case of the continuous measurement. On top of there may be one (or more) layers of instrument-to-instrument variation, variation in the reagents, operator variaton etc.

Mixture models reflect that there is often heterogeneous amounts of variability (variances) in the data. In such cases, suitable data transformations, i.e., variance stabilizing transformations, are necessary before subsequent visualization or analysis. We’ll study in depth an example for RNA-Seq in Chapter 8, and this also proves useful in the normalization of next generation reads in microbial ecology (McMurdie and Holmes 2014).

Another important application of mixture modeling is the two-component model in multiple testing – we will come back to this in Chapter 6.

#### The ECDF and bootstrapping

We saw that by using the observed sample as a mixture we could generate many simulated samples that inform us about the sampling distribution of an estimate. This method is called the bootstrap and we will return to it several times as it provides a way of evaluating estimates even when an analytic formulae are not available (we say it is non-parametric).

A useful book-long treatment of finite mixture models is by McLachlan and Peel (2004); for the EM algorithm, see also the book by McLachlan and Krishnan (2007). A recent book that presents all EM type algorithms within the Majorize-Minimization (MM) framework is by Lange (2016).

There are in fact mathematical reasons why many natural phenomena can be seen as mixtures: this occurs when the observed events are exchangeable (the order in which they occur doesn’t matter). The theory underlying this is quite mathematical, a good way to start is to look at the Wikipedia entry and the paper by Diaconis and Freedman (1980).

In particular, we use mixtures for high-throughput data. You will see examples in Chapters 8 and 11.

The bootstrap can be used in many situations and is a very useful tool to know about, a friendly treatment is given in (Efron and Tibshirani 1993).

A historically interesting paper is the original article on variance stabilization by Anscombe (1948), who proposed ways of making variance stabilizing transformations for Poisson and gamma-Poisson random variables. Variance stabilization is explained using the delta method in many standard texts in theoretical statistics, e.g., those by Rice (2006 Chapter 6) and Kéry and Royle (2015 Page 35).

Kéry and Royle (2015) provide a nice exploration of using R to build hierarchical models for abundance estimation in niche and spatial ecology.

## 4.7 Exercises

► Exercise 4.1

The EM algorithm step by step. As an example dataset, we use the values in the file Myst.rds. As always, it is a good idea to first visualize the data. The histogram is shown in Figure 4.26. We are going to model these data as a mixture of two normal distributions with unknown means and standard deviations. We’ll call the two components A and B.

Figure 4.26: Histogram of our example dataset for the EM algorithm.

yvar = readRDS("../data/Myst.rds")$yvar ggplot(tibble(yvar), aes(x = yvar)) + geom_histogram(binwidth=0.025) str(yvar) ## num [1:1800] 0.3038 0.0596 -0.0204 0.1849 0.2842 ... We start by randomly assigning a “probability of membership” to each of the values in yvar for each of the groups, A and B. These are numbers between 0 and 1, pA represents the probability of coming from mixture component A. The complementary probability is pB. pA = runif(length(yvar)) pB = 1 - pA We also need to set up some housekeeping variables: iter counts over the iterations of the EM algorithm; loglik stores the current log-likelihood; delta stores the change in the log-likelihood from the previous iteration to the current one. We also define the parameters tolerance, miniter and maxiter of the algorithm. iter = 0 loglik = -Inf delta = +Inf tolerance = 1e-3 miniter = 50; maxiter = 1000 Study the code below and answer the following questions: Which lines correspond to the E-step, which to the M-step? What does the M-step do, what does the E-step do? What is the role of the algorithm arguments tolerance, miniter and maxiter? Why do we need to compute loglik? Compare the result of what we are doing here to the output of the normalmixEM function from the mixtools package. while((delta > tolerance) && (iter <= maxiter) || (iter < miniter)) { lambda = mean(pA) muA = weighted.mean(yvar, pA) muB = weighted.mean(yvar, pB) sdA = sqrt(weighted.mean((yvar - muA)^2, pA)) sdB = sqrt(weighted.mean((yvar - muB)^2, pB)) phiA = dnorm(yvar, mean = muA, sd = sdA) phiB = dnorm(yvar, mean = muB, sd = sdB) pA = lambda * phiA pB = (1 - lambda) * phiB ptot = pA + pB pA = pA / ptot pB = pB / ptot loglikOld = loglik loglik = sum(log(pA)) delta = abs(loglikOld - loglik) iter = iter + 1 } param = tibble(group = c("A","B"), mean = c(muA,muB), sd = c(sdA,sdB)) param ## # A tibble: 2 x 3 ## group mean sd ## <chr> <dbl> <dbl> ## 1 A -0.169 0.0983 ## 2 B 0.147 0.150 iter ## [1] 370 ► Exercise 4.2 Compare the theoretical values of the gamma-Poisson distribution with parameters given by the estimates in ofit$par in Section 4.4.3 to the data used for the estimation using a QQ-plot.

► Exercise 4.3

Mixture modeling examples for regression. The flexmix package (Grün, Scharl, and Leisch 2012) enables us to cluster and fit regressions to the data at the same time. The standard M-step FLXMRglm of flexmix is an interface to R’s generalized linear modeling facilities (the glm function). Load the package and an example dataset.

library("flexmix")
data("NPreg")

a) First, plot the data and try to guess how the points were generated.
b) Fit a two component mixture model using the commands

m1 = flexmix(yn ~ x + I(x^2), data = NPreg, k = 2)

c) Look at the estimated parameters of the mixture components and make a truth table that cross-classifies true classes versus cluster memberships. What does the summary of the object m1` show us?
d) Plot the data again, this time coloring each point according to its estimated class.

► Exercise 4.4

Other hierarchical noise models:
Find two papers that explore the use of other infite mixtures for modeling molecular biology technological variation.

### References

Anscombe, Francis J. 1948. “The Transformation of Poisson, Binomial and Negative-Binomial Data.” Biometrika. JSTOR, 246–54.

Bulmer, Michael George. 2003. Francis Galton: Pioneer of Heredity and Biometry. JHU Press.

Chen, Min, Yang Xie, and Michael Story. 2011. “An Exponential-Gamma Convolution Model for Background Correction of Illumina BeadArray Data.” Communications in Statistics-Theory and Methods 40 (17). Taylor & Francis: 3055–69.

Diaconis, Persi, and David Freedman. 1980. “Finite Exchangeable Sequences.” The Annals of Probability. JSTOR, 745–64.

Diaconis, Persi, and Susan Holmes. 1994. “Gray Codes for Randomization Procedures.” Statistics and Computing 4 (4). Springer: 287–302.

Efron, Bradley, and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC press.

Efron, B., and R. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC.

Grün, Bettina, Theresa Scharl, and Friedrich Leisch. 2012. “Modelling Time Course Gene Expression Data with Finite Mixtures of Linear Additive Models.” Bioinformatics 28 (2): 222–28. https://doi.org/10.1093/bioinformatics/btr653.

Hoeting, Jennifer A, David Madigan, Adrian E Raftery, and Chris T Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science. JSTOR, 382–401.

Kéry, Marc, and J Andrew Royle. 2015. Applied Hierarchical Modeling in Ecology: Analysis of Distribution, Abundance and Species Richness in R and Bugs: Volume 1: Prelude and Static Models. Academic Press.

Kristiansson, Erik, Michael Thorsen, Markus J Tamás, and Olle Nerman. 2009. “Evolutionary Forces Act on Promoter Length: Identification of Enriched Cis-Regulatory Elements.” Molecular Biology and Evolution 26 (6). SMBE: 1299–1307.

Kuan, Pei Fen, Dongjun Chung, Guangjin Pan, James A Thomson, Ron Stewart, and Sündüz Keleş. 2011. “A Statistical Framework for the Analysis of ChIP-Seq Data.” Journal of the American Statistical Association 106 (495). Taylor & Francis: 891–903.

Lange, Kenneth. 2016. MM Optimization Algorithms. SIAM.

McLachlan, Geoffrey, and Thriyambakam Krishnan. 2007. The EM Algorithm and Extensions. Vol. 382. John Wiley & Sons.

McLachlan, Geoffrey, and David Peel. 2004. Finite Mixture Models. John Wiley & Sons.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4). Public Library of Science: e1003531.

Purdom, Elizabeth, and Susan P Holmes. 2005. “Error Distribution for Gene Expression Data.” Statistical Applications in Genetics and Molecular Biology 4 (1).

Rice, John. 2006. Mathematical Statistics and Data Analysis. Cengage Learning.