Finding categories of cells, illnesses, organisms and then naming them is a core activity in the natural sciences. In Chapter 4 we’ve seen that some data can be modeled as mixtures from different groups or populations with a clear parametric generative model. We saw how in those examples we could use the EM algorithm to disentangle the components. We are going to extend the idea of unraveling of different groups to cases where the **clusters** do not necessarily have nice elliptic69 Mixture modeling with multivariate normal distributions implies elliptic cluster boundaries. shapes.

Clustering takes data (continuous or quasi-continuous) and adds to them a new categorical *group* variable that can often simplify decision making; even if this sometimes comes at a cost of ignoring *intermediate* states. For instance, medical decisions are simplified by replacing possibly complex, high-dimensional diagnostic measurements by simple groupings: a full report of numbers associated with fasting glucose, glycated hemoglobin and plasma glucose two hours after intake is replaced by assigning the patient to a diabetes mellitus “group”.

In this chapter, we will study how to find meaningful clusters or groups in both low-dimensional and high-dimensional **nonparametric** settings. However, there is a caveat: clustering algorithms are designed to find clusters, so they will find clusters, even where there are none70 This is reminescent of humans: we like to see patterns – even in randomness.. So, cluster *validation* is an essential component of our process, especially if there is no prior domain knowledge that supports the existence of clusters.

In this chapter we will

Study the different types of data that can be beneficially clustered.

See measures of (dis)similarity and

**distances**that help us define clusters.Uncover hidden or latent clustering by partitioning the data into

*tighter*sets.Use clustering when given biomarkers on each of hundreds of thousands cells. We’ll see that for instance immune cells can be naturally grouped into tight subpopulations.

Run nonparametric algorithms such as

**\(k\)-means**,**\(k\)-medoids**on real single cell data.Experiment with recursive approaches to clustering that combine observations and groups into a hierarchy of sets; these methods are known as

**hierarchical clustering**.Study how to validate clusters through resampling-based bootstrap approaches, which we will demonstrate on a single-cell dataset.

John Snow made a map of cholera cases and identified *clusters* of cases. He then collected additional information about the situation of the pumps. The proximity of dense clusters of cases to the Broadstreet pump pointed to the water as a possible culprit. He collected separate sources of information that enabled him to infer the source of the cholera outbreak.

Figure 5.1: John Snow’s map of cholera cases: small barcharts at each house indicate a clustering of diagnosed cases.

David Freedman has a wonderful detailed account of all the steps that led to this discovery (Freedman 1991).

Now, let’s look at another map of London, shown in Figure 5.2. The red dots designate locations that were bombed during World War II. Many theories were put forward during the war by the analytical teams. They attempted to find a rational explanation for the bombing patterns (proximity to utility plants, arsenals, \(...\)). In fact, after the war it was revealed that the bombings were randomly distributed without any attempt at hitting particular targets.

Clustering is a useful technique for understanding complex multivariate data; it is an **unsupervised**71 Thus named because all variables have the same status, we are not trying to predict or learn the value of one variable (the supervisory response) based on the information from explanatory variables.. Exploratory techniques show groupings that can be important in interpreting the data.

For instance, clustering has enabled researchers to enhance their understanding of cancer biology. Tumors that appeared to be the same based on their anatomical location and histopathology fell into multiple clusters based on their molecular signatures, such as gene expression data (Hallett et al. 2012). Eventually, such clusterings might lead to the definition of new, more relevant disease types. Relevance is evidenced, e.g., by the fact that they are associated with different patient outcomes. What we aim to do in this chapter is understand how pictures like Figure 5.3 are constructed and how to interpret them.

In Chapter 4, we have already studied one technique, the EM algorithm, for uncovering groups. The techniques we explore in this chapter are more general and can be applied to more complex data. Many of them are based on distances between pairs of observations (this can be all versus all, or sometimes only all versus some), and they make no explicit assumptions about the generative mechanism of the data involving particular families of distributions, such as normal, gamma-Poisson, etc. There is a proliferation of clustering algorithms in the literature and in the scientific software landscape; this can be intimidating. In fact it is linked to the diversity of the types of data and the objectives pursued in different domains.

► Task

Look up the BiocViews Clustering or the Cluster view on CRAN and count the number of packages providing clustering tools.

**Of a feather**: how the distances are measured and similarities between observations defined has a strong impact on the clustering result. Our first step is to decide what we mean by *similar*. There are multiple ways of comparing birds: for instance, a distance using size and weight will give a different clustering than one using diet or habitat. Once we have chosen the relevant features, we have to decide how we combine differences between the multiple features into a single number. Here is a selection of choices, some of them are illustrated in Figure 5.5.

Figure 5.5: Equal-distance contour plots according to four different distances: points on any one curve are all the same distance from the center point.

**Euclidean** The Euclidean distance between two points \(A=(a_1,...,a_p)\) and \(B= (b_1,...,b_p)\) in a \(p\)-dimensional space (for the \(p\) features) is the square root of the sum of squares of the differences in all \(p\) coordinate directions:

\[\begin{equation*} d(A,B)=\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+... +(a_p-b_p)^2}. \end{equation*}\]

**Manhattan** The Manhattan, City Block, Taxicab or \(L_1\) distance takes the sum of the absolute differences in all coordinates.

\[\begin{equation*} d(A,B)=|a_1-b_1|+|a_2-b_2|+... +|a_p-b_p|. \end{equation*}\]

**Maximum** The maximum of the absolute differences between coordinates is also called the \(L_\infty\) distance:

\[\begin{equation*} d_\infty(A,B)= \max_{i}|a_i-b_i|. \end{equation*}\]

**Weighted Euclidean distance** is a generalization of the ordinary Euclidean distance, by giving different directions in feature space different weights. We have already encountered one example of a weighted Euclidean distance in Chapter 2, the \(\chi^2\) distance. It is used to compare rows in contingency tables, and the weight of each feature is the inverse of the expected value. The *Mahalanobis* distance is another weighted Euclidean distance that takes into account the fact that different features may have a different dynamic range, and that some features may be positively or negatively correlated with each other. The weights in this case are derived from the covariance matrix of the features. See also Question 5.1.

**Minkowski** Allowing the exponent to be \(m\) instead of \(2\), as in the Euclidean distance, gives the Minkowski distance

\[\begin{equation} d(A,B) = \left( (a_1-b_1)^m+(a_2-b_2)^m+... +(a_p-b_p)^m \right)^\frac{1}{m}. \tag{5.1} \end{equation}\]

**Edit, Hamming** This distance is the simplest way to compare character sequences. It simply counts the number of differences between two character strings. This could be applied to nucleotide or amino acid sequences – although in that case, the different character substitutions are usually associated with different contributions to the distance (to account for physical or evolutionary similarity), and deletions and insertions may also be allowed.

**Binary** When the two vectors have binary bits as coordinates, we can think of the non-zero elements as ‘on’ and the zero elements as ‘off’. The binary distance is the proportion of features having only one bit on amongst those features that have at least one bit on.

**Jaccard Distance** Occurrence of traits or features in ecological or mutation data can be translated into presence and absence and encoded as 1’s and 0’s. In such situations, co-occurence is often more informative than co-absence. For instance, when comparing mutation patterns in HIV, the co-existence in two different strains of a mutation tends to be a more important observation than its co-absence. For this reason, biologists use the **Jaccard index**. Let’s call our two observation vectors \(S\) and \(T\), \(f_{11}\) the number of times a feature co-occurs in \(S\) and \(T\), \(f_{10}\) (and \(f_{01}\)) the number of times a feature occurs in \(S\) but not in \(T\) (and vice versa), and \(f_{00}\) the number of times a feature is co-absent. The Jaccard index is

\[\begin{equation} J(S,T) = \frac{f_{11}}{f_{01}+f_{10}+f_{11}}, \tag{5.2} \end{equation}\]

(i.e., it ignores \(f_{00}\)), and the **Jaccard dissimilarity** is

\[\begin{equation} d_J(S,T) = 1-J(S,T) = \frac{f_{01}+f_{10}}{f_{01}+f_{10}+f_{11}}. \tag{5.3} \end{equation}\]

**Correlation based distance**

\[\begin{equation*} d(A,B)=\sqrt{2(1-\text{cor}(A,B))}. \end{equation*}\]

Figure 5.6: An example for the use of Mahalanobis distances to measure the distance of a new data point (red) from two cluster centers.

► Question 5.1

Which of the two cluster centers in Figure 5.6 is the red point closest to?

► Solution

Figure 5.7: The lower triangle of distances can be computed by any of a hundred different functions in various **R** packages (`vegdist`

in **vegan**, `daisy`

in **cluster**, `genetic_distance`

in **gstudio**, `dist.dna`

in **ape**, `Dist`

in **amap**, `distance`

in **ecodist**, `dist.multiPhylo`

in **distory**, `shortestPath`

in **gdistance**, % `dudi.dist`

and `dist.genet`

in **ade4**).

The centers of the groups are sometimes called medoids, thus the name PAM (partitioning around medoids). Partitioning or iterative relocation methods work well in high-dimensional settings, where we cannot72 This is due to the so-called curse of dimensionality. We will discuss this in more detail in Chapter 12. easily use probability densities, the EM algorithm and parametric mixture modeling in the way we did in Chapter 4. Besides the distance measure, the main choice to be made is the number of clusters \(k\). The PAM (partitioning around medoids, Kaufman and Rousseeuw (2009)) method is as follows:

Starts from a matrix of \(p\) features measured on a set of \(n\) observations.

Randomly pick \(k\) distinct

*cluster centers*out of the \(n\) observations (“seeds”).Assign each of the remaining observation to the group to whose center it is the closest.

For each group, choose a new center from the observations in the group, such that the sum of the distances of group members to the center is minimal; this is called the

*medoid*.Repeat Steps 3 and 4 until the groups stabilize.

Each time the algorithm is run, different initial seeds will be picked in Step 2, and in general, this can lead to different final results. A popular implementation is the `pam`

function in the package **cluster**.

A slight variation of the method replaces the medoids by the arithmetic means (centers of gravity) of the clusters and is called \(k\)-means. While in PAM, the centers are observations, this is not, in general, the case with \(k\)-means. The function `kmeans`

comes with every installation of R in the **stats** package; an example run is shown in Figure 5.9.

These so-called \(k\)-methods are the most common off-the-shelf methods for clustering; they work particularly well when the clusters are of comparable size and convex (blob-shaped). On the other hand, if the true clusters are very different in size, the larger ones will tend to be broken up; the same is true for groups that have pronounced non-spherical or non-elliptic shapes.

► Question 5.3

The \(k\)-means algorithm alternates between computing the average point and assigning the points to clusters. How does this alternating, iterative method differ from an EM-algorithm?

► Solution

There are clever schemes that repeat the process many times using different initial centers or resampled datasets. Repeating a clustering procedure multiple times on the same data, but with different starting points creates *strong forms* according to Diday and Brito (1989). Repeated subsampling of the dataset and applying a clustering method will result in groups of observations that are “almost always” grouped together; these are called *tight clusters* (Tseng and Wong 2005). The study of strong forms or tight clusters facilitates the choice of the number of clusters. A recent package developed to combine and compare the output from many different clusterings is **clusterExperiment**. Here we give an example from its vignette. Single-cell RNA-Seq experiments provide counts of reads, representing gene transcripts, from individual cells. The single cell resolution enables scientists, among other things, to follow cell lineage dynamics. Clustering has proved very useful for analysing such data.

► Question 5.4

Follow the vignette of the package **clusterExperiment**. Call the ensemble clustering function `clusterMany`

, using `pam`

for the individual clustering efforts. Set the choice of genes to include at either the 60, 100 or 150 most variable genes. Plot the clustering results for \(k\) varying between 4 and 9. What do you notice?

► Solution

You can find reviews of bioinformatics methods for flow cytometry in (O’Neill et al. 2013) and a well-kept wikipedia article.

Studying measurements on single cells improves both the focus and resolution with which we can analyze cell types and dynamics. Flow cytometry enables the simultaneous measurement of about 10 different cell markers. Mass cytometry expands the collection of measurements to as many as 80 proteins per cell. A particularly promising application of this technology is the study of immune cell dynamics.

At different stages of their development, immune cells express unique combinations of proteins on their surfaces. These protein-markers are called **CD**s (**clusters of differentiation**) and are collected by flow cytometry (using fluorescence, see Hulett et al. (1969)) or mass cytometry (using single-cell atomic mass spectrometry of heavy element reporters, see Bendall et al. (2012)). An example of a commonly used CD is CD4, this protein is expressed by helper T cells that are referred to as being “CD4+”. Note however that some cells express CD4 (thus are CD4+), but are not actually helper T cells. We start by loading some useful Bioconductor packages for cytometry data, **flowCore** and **flowViz**, and read in an examplary data object `fcsB`

as follows:

```
library("flowCore")
library("flowViz")
fcsB = read.FCS("../data/Bendall_2011.fcs")
slotNames(fcsB)
```

`## [1] "exprs" "parameters" "description"`

Figure 5.11 shows a scatterplot of two of the variables available in the `fcsB`

data. (We will see how to make such plots below.) We can see clear bimodality and clustering in these two dimensions.

► Question 5.5

- Look at the structure of the
`fcsB`

object (hint: the`colnames`

function). How many variables were measured?

- Subset the data to look at the first few rows (hint: use
`Biobase::exprs(fcsB)`

). How many cells were measured?

First we load the table data that reports the mapping between isotopes and markers (antibodies), and then we replace the isotope names in the column names of `fcsB`

with the marker names. This makes the subsequent analysis and plotting code more readable:

```
markersB = readr::read_csv("../data/Bendall_2011_markers.csv")
mt = match(markersB$isotope, colnames(fcsB))
stopifnot(!any(is.na(mt)))
colnames(fcsB)[mt] = markersB$marker
```

Now we are ready to generate Figure 5.11

Figure 5.11: Cell measurements that show clear clustering in two dimensions.

`flowPlot(fcsB, plotParameters = colnames(fcsB)[2:3], logy = TRUE)`

Plotting the data in two dimensions as in Figure 5.11 already shows that the cells can be grouped into subpopulations. Sometimes just one of the markers can be used to define populations on their own; in that case simple **rectangular gating** is used to separate the populations; for instance, CD4+ cells can be gated by taking the subpopulation with high values for the CD4 marker. Cell clustering can be improved by carefully choosing transformations of the data. The left part of Figure 5.12 shows a simple one dimensional histogram before transformation; on the right of Figure 5.12 we see the distribution after transformation. It reveals a bimodality and the existence of two cell populations.

**Data Transformation: hyperbolic arcsin (asinh)**. It is standard to transform both flow and mass cytometry data using one of several special functions. We take the example of the inverse hyperbolic sine (asinh):

\[\begin{equation*} \operatorname{asinh}(x) = \log{(x + \sqrt{x^2 + 1})}. \end{equation*}\]

From this we can see that for large values of \(x\), \(\operatorname{asinh}(x)\) behaves like the \(\log\) and is practically equal to \(\log(x)+\log(2)\); for small \(x\) the function is close to linear in \(x\).

► Task

Try running the following code to see the two main regimes of the transformation: small values and large values.

```
v1 = seq(0, 1, length.out = 100)
plot(log(v1), asinh(v1), type = 'l')
plot(v1, asinh(v1), type = 'l')
v3 = seq(30, 3000, length = 100)
plot(log(v3), asinh(v3), type= 'l')
```

This is another example of a variance stabilizing transformation, also mentioned in Chapters 4 and 8. Figure 5.12 is produced by the following code, that uses the **flowCore** package.

```
asinhtrsf = arcsinhTransform(a = 0.1, b = 1)
fcsBT = transform(fcsB,
transformList(colnames(fcsB)[-c(1, 2, 41)], asinhtrsf))
densityplot( ~`CD3all`, fcsB)
densityplot( ~`CD3all`, fcsBT)
```

► Question 5.6

How many dimensions does the following code use to split the data into 2 groups using \(k\)-means ?

```
kf = kmeansFilter("CD3all" = c("Pop1","Pop2"), filterId="myKmFilter")
fres = flowCore::filter(fcsBT, kf)
summary(fres)
```

```
## Pop1: 33429 of 91392 events (36.58%)
## Pop2: 57963 of 91392 events (63.42%)
```

```
fcsBT1 = flowCore::split(fcsBT, fres, population = "Pop1")
fcsBT2 = flowCore::split(fcsBT, fres, population = "Pop2")
```

Figure 5.13, generated by the following code, shows a naïve projection of the data into the two dimensions spanned by the CD3 and CD56 markers:

Figure 5.13: After transformation these cells were clustered using `kmeans`

.

```
library("flowPeaks")
fp = flowPeaks(Biobase::exprs(fcsBT)[, c("CD3all", "CD56")])
plot(fp)
```

When plotting points that densely populate an area we should try to avoid overplotting. We saw some of the preferred techniques in Chapter 3; here we use contours and shading. This is done as follows:

Figure 5.14: Like Figure 5.13, using contours.

```
flowPlot(fcsBT, plotParameters = c("CD3all", "CD56"), logy = FALSE)
contour(fcsBT[, c(40, 19)], add = TRUE)
```

This produces Figure 5.14; a more informative version of Figure 5.13.

► Task

A more recent Bioconductor package, **ggcyto**, has been designed to enable the plotting of each patient in a diffent facet using `ggplot`

.

Try comparing the output using this approach to what we did above using the following:

```
library("ggcyto")
library("labeling")
ggcd4cd8=ggcyto(fcsB,aes(x=CD4,y=CD8))
ggcd4=ggcyto(fcsB,aes(x=CD4))
ggcd8=ggcyto(fcsB,aes(x=CD8))
p1=ggcd4+geom_histogram(bins=60)
p1b=ggcd8+geom_histogram(bins=60)
asinhT = arcsinhTransform(a=0,b=1)
transl = transformList(colnames(fcsB)[-c(1,2,41)], asinhT)
fcsBT = transform(fcsB, transl)
p1t=ggcyto(fcsBT,aes(x=CD4))+geom_histogram(bins=90)
p2t=ggcyto(fcsBT,aes(x=CD4,y=CD8))+geom_density2d(colour="black")
p3t=ggcyto(fcsBT,aes(x=CD45RA,y=CD20))+geom_density2d(colour="black")
```

Data sets such as flow cytometry, that contain only a few markers and a large number of cells, are amenable to density-based clustering. This method looks for regions of high density separated by sparser regions. It has the advantage of being able to cope with clusters that are not necessarily convex. One implementation of such a method is called dbscan. Let’s look at an example by running the following code.

```
library("dbscan")
mc5 = Biobase::exprs(fcsBT)[, c(15,16,19,40,33)]
res5 = dbscan::dbscan(mc5, eps = 0.65, minPts = 30)
mc5df = data.frame(mc5, cluster = as.factor(res5$cluster))
table(mc5df$cluster)
```

```
##
## 0 1 2 3 4 5 6 7 8
## 75954 4031 5450 5310 259 257 63 25 43
```

```
ggplot(mc5df, aes(x=CD4, y=CD8, col=cluster))+geom_density2d()
ggplot(mc5df, aes(x=CD3all, y=CD20, col=cluster))+geom_density2d()
```

The output is shown in Figure 5.15. The overlaps of the clusters in the 2D projections enable us to appreciate the multidimensional nature of the clustering.

► Question 5.7

Try increasing the dimension to 6 by adding one CD marker-variables from the input data.

Then vary `eps`

, and try to find four clusters such that at least two of them have more than 100 points.

Repeat this will 7 CD marker-variables, what do you notice?

► Solution

The dbscan method clusters points in dense regions according to the *density-connectedness* criterion. It looks at small neighborhood spheres of radius \(\epsilon\) to see if points are connected.

The building block of dbscan is the concept of density-reachability: a point \(q\) is directly **density-reachable** from a point \(p\) if it is not further away than a given threshold \(\epsilon\), and if \(p\) is surrounded by sufficiently many points such that one may consider \(p\) (and \(q\)) be part of a dense region. We say that \(q\) is *density-reachable* from \(p\) if there is a sequence of points \(p_1,...,p_n\) with \(p_1 = p\) and \(p_n = q\), so that each \(p_{i + 1}\) is directly density-reachable from \(p_i\).

A *cluster* is then a subset of points that satisfy the following properties:

All points within the cluster are mutually density-connected.

If a point is density-connected to any point of the cluster, it is part of the cluster as well.

Groups of points must have at least

`MinPts`

points to count as a cluster.

It is important that the method looks for high density of points in a neighborhood. Other methods exist that try to define clusters by a void, or “missing points” between clusters. But these are vulnerable to the curse of dimensionality; these can create spurious “voids”.

Figure 5.16: A snippet of Linn{}us’ taxonomy that clusters organisms according to feature similarities.

Hierarchical clustering is a bottom-up approach, where similar observations and subclasses are assembled iteratively. Figure 5.16 shows how Linnæus made nested clusters of organisms according to specific characteristics. Such hierarchical organization has been useful in many fields and goes back to Aristotle who postulated a *ladder of nature*.

**Dendrogram ordering**. As you can see in the example of Figure 5.17, the order of the labels does not matter within sibling pairs. Horizontal distances are usually meaningless, while the vertical distances do encode some information. These properties are important to remember when making interpretations about neighbors that are not monophyletic (i.e., not in the same subtree or clade), but appear as neighbors in the plot (for instance B and D in the right hand tree are non-monophyletic neighbors).

**Top-down hierarchies**. An alternative, top-down, approach takes all the objects and splits them sequentially according to a chosen criterion. Such so-called **recursive partitioning** methods are often used to make decision trees. They can be useful for prediction (say, survival time, given a medical diagnosis): we are hoping in those instances to split heterogeneous populations into more homogeneous subgroups by partitioning. In this chapter, we concentrate on the bottom-up approaches. We will return to partitioning when we talk about supervised learning and classifcation in Chapter 12.

Figure 5.18: In the single linkage method, the distance between groups \(C_1\) and \(C_2\) is defined as the distance between the closest two points from the groups.

Figure 5.19: In the complete linkage method, the distance between groups \(C_1\) and \(C_2\) is defined as the maximum distance between pairs of points from the two groups.

When creating a hierarchical clustering by aggregation, we will need more than just the distances between all pairs of individual objects: we also need a way to calculate distances between the aggregates. There are different choices of how to define them, based on the object-object distances, and each choice results in a different type of hierarchical clustering.

A hierarchical clustering algorithms is easy enough to get started, by grouping the most similar observations together. But once an aggregation has occurred, one is required to say what the distance between a newly formed cluster and all other points is computed, or between two clusters.

- The
**minimal jump**method, also called**single linkage**or nearest neighbor method computes the distance between clusters as the smallest distance between any two points in the two clusters (as shown in Figure 5.18):

\[\begin{equation*} d_{12} = \min_{i \in C_1, i \in C_2 } d_{ij}. \end{equation*}\]

This method tends to create clusters that look like contiguous strings of points. The cluster tree often looks like a comb.

- The
**maximum jump**(or**complete linkage**) method defines the distance between clusters as the largest distance between any two objects in the two clusters, as represented in Figure 5.19:

\[\begin{equation*} d_{12} = \max_{i \in C_1, i \in C_2 } d_{ij}. \end{equation*}\]

Figure 5.20: The Ward method maximizes the between group sum of squares (red edges), while minimizing the sums of squares within groups (black edges).

- The
*average linkage*method is half way between the two above:

\[\begin{equation*} d_{12} = \frac{1}{|C_1| |C_2|}\sum_{i \in C_1, i \in C_2 } d_{ij} \end{equation*}\]

**Ward’s method**takes an analysis of variance approach, where the goal is to minimize the variance within clusters. This method is very efficient, however, it tends to create break the clusters up into ones of smaller sizes.

Advantages and disadvantages of the various distances between aggregates(Chakerian and Holmes 2012).

Method | Pros | Cons |
---|---|---|

Single linkage | number of clusters | comblike trees. |

Complete linkage | compact classes | one obs. can alter groups |

Average linkage | similar size and variance | not robust |

Centroid | robust to outliers | smaller number of clusters |

Ward | minimising an inertia | classes small if high variability |

Figure 5.21: Hierarchical clustering output has similar properties to a mobile: the branches can rotate freely from their suspension points.

These are the choices we have to make building hierarchical clustering trees. An advantage of hierarchical clustering compared to the partitioning methods is that it offers a graphical diagnostic of the strength of groupings: the length of the inner edges in the tree.

When we have prior knowledge that the clusters are about the same size, using average linkage or Ward’s method of minimizing the within class variance is the best tactic.

► Question 5.8

**Hierarchical clustering for cell populations** The `Morder`

data are gene expression measurements for 156 genes on T cells of 3 types (naïve, effector, memory) from 10 patients (Holmes et al. 2005). Using the **pheatmap** package, make two simple heatmaps, without dendogram or reordering, for Euclidean and Manhattan distances of these data.

► Question 5.9

Now, look at the differences in orderings in the hierarchical clustering trees with these two distances. What differences are noticeable?

Figure 5.23: This tree can be drawn in many different ways. The ordering of the leaves as it is appears here is \((8,11,9,10,7,5,6,1,4,2,3)\).

► Question 5.10

A hierarchical clustering tree is like the Calder mobile in Figure 5.21 that can swing around many internal pivot points, giving many orderings of the tips consistent with a given tree. Look at the tree in Figure 5.23. How many ways are there of ordering the tip labels and still maintain consistence with that tree?

It is common to see heatmaps whose rows and/or columns are ordered based on a hierachical clustering tree. Sometimes this makes some clusters look very strong – stronger than what the tree really implies. There are alternative ways of ordering the rows and columns in heatmaps, for instance, in the package **NeatMap**, that uses ordination methods73 These will be explained in Chapter 9. to find orderings.

The clustering methods we have described are tailored to deliver good groupings of the data under various constraints. However, keep in mind that clustering methods will always deliver groups, even if there are none. If, in fact, there are no real clusters in the data, a hierarchical clustering tree may show relatively short inner branches; but it is difficult to quantify this. In general it is important to validate your choice of clusters with more objective criteria.

One criterion to assess the quality of a clustering result is to ask to what extent it maximizes the between group differences while keeping the within-group distances small (maximizing the lengths of red lines and minimizing those of the black lines in Figure 5.20). We formalize this with the within-groups sum of squared distances (WSS):

\[\begin{equation} \text{WSS}_k=\sum_{\ell=1}^k \sum_{x_i \in C_\ell} d^2(x_i, \bar{x}_{\ell}) \tag{5.4} \end{equation}\]

Here, \(k\) is the number of clusters, \(C_\ell\) is the set of objects in the \(\ell\)-th cluster, and \(\bar{x}_\ell\) is the center of mass (the average point) in the \(\ell\)-th cluster. We state the dependence on \(k\) of the WSS in Equation (5.4) as we are interested in comparing this quantity across different values of \(k\), for the same cluster algorithm. Stated as it is, the WSS is however not a sufficient criterion: the smallest value of WSS would simply be obtained by making each point its own cluster. The WSS is a useful building block, but we need more sophisticated ideas than just looking at this number alone.

One idea is to look at \(\text{WSS}_k\) as a function of \(k\). This will always be a decreasing function, but if there is a pronounced region where it decreases sharply and then flattens out, we call this an *elbow* and might take this as a potential sweet spot for the number of clusters.

► Question 5.11

**An alternative expression for \(\text{WSS}_k\)**. Use R to compute the sum of distances between all pairs of points in a cluster and compare it to \(\text{WSS}_k\). Can you see how \(\text{WSS}_k\) can also be written:

\[\begin{equation} \text{WSS}_k=\sum_{\ell=1}^k \frac{1}{2 n_\ell} \sum_{x_i \in C_\ell} \sum_{x_j \in C_\ell} d^2(x_i,x_j), \tag{5.5} \end{equation}\]

where \(n_\ell\) is the size of the \(\ell\)-th cluster.

Question 5.11 show us that the within-cluster sums of squares \(\text{WSS}_k\) measures both the distances of all points in a cluster to its center, and the average distance between all pairs of points in the cluster.

When looking at the behavior of various indices and statistics that help us decide how many clusters are appropriate for the data, it can be useful to look at cases where we actually know the right answer.

To start, we simulate data coming from four groups. We use the pipe (`%>%`

) operator and the `bind_rows`

function from **dplyr** to concatenate the four *tibble*s corresponding to each cluster into one big *tibble*74 The pipe operator passes the value to its left into the function to its right. This can make the flow of data easier to follow in code: `f(x) %>% g(y)`

is equivalent to `g(f(x), y)`

.

```
library("dplyr")
simdat = lapply(c(0, 8), function(mx) {
lapply(c(0,8), function(my) {
tibble(x = rnorm(100, mean = mx, sd = 2),
y = rnorm(100, mean = my, sd = 2),
class = paste(mx, my, sep = ":"))
}) %>% bind_rows
}) %>% bind_rows
simdat
```

```
## # A tibble: 400 x 3
## x y class
## <dbl> <dbl> <chr>
## 1 -2.42 -4.59 0:0
## 2 1.89 -1.56 0:0
## 3 0.558 2.17 0:0
## 4 2.51 -0.873 0:0
## 5 -2.52 -0.766 0:0
## 6 3.62 0.953 0:0
## 7 0.774 2.43 0:0
## 8 -1.71 -2.63 0:0
## 9 2.01 1.28 0:0
## 10 2.03 -1.25 0:0
## # … with 390 more rows
```

`simdatxy = simdat[, c("x", "y")] # without class label`

Figure 5.24: The `simdat`

data colored by the class labels. Here, we know the labels since we generated the data – usually we do not know them.

```
ggplot(simdat, aes(x = x, y = y, col = class)) + geom_point() +
coord_fixed()
```

We compute the within-groups sum of squares for the clusters obtained from the \(k\)-means method:

Figure 5.25: The barchart of the WSS statistic as a function of \(k\) shows that the last substantial jump is just before \(k=4\). This indicates that the best choice for these data is \(k=4\).

```
wss = tibble(k = 1:8, value = NA_real_)
wss$value[1] = sum(scale(simdatxy, scale = FALSE)^2)
for (i in 2:nrow(wss)) {
km = kmeans(simdatxy, centers = wss$k[i])
wss$value[i] = sum(km$withinss)
}
ggplot(wss, aes(x = k, y = value)) + geom_col()
```

► Question 5.12

- Run the code above several times and compare the
`wss`

values for different runs. Why are they different?

- Create a set of data with uniform instead of normal distributions with the same range and dimensions as
`simdat`

. Compute the WSS values for for thess data. What do you conclude?

► Question 5.13

The so-called **Calinski-Harabasz** index uses the WSS and BSS (between group sums of squares). It is inspired by the \(F\) statistic75 The \(F\) statistic is the ratio of the mean sum of squares explained by a factor to the mean residual sum of squares. used in analysis of variance:

\[\begin{equation*} \text{CH}(k)=\frac{\text{BSS}_k}{\text{WSS}_k}\times\frac{N-k}{N-1} \qquad \text{where} \quad \text{BSS}_k = \sum_{\ell=1}^k n_\ell(\bar{x}_{\ell}-\bar{x})^2, \end{equation*}\]

where \(\bar{x}\) is the overall center of mass (average point). Plot the Calinski-Harabasz index for the `simdat`

data.

► Solution

Taking the logarithm of the within-sum-of-squares (\(\log(\text{WSS}_k)\)) and comparing it to averages from simulated data with less structure can be a good way of choosing \(k\). This is the basic idea of the **gap statistic** introduced by Tibshirani, Walther, and Hastie (2001). We compute \(\log(\text{WSS}_k)\) for a range of values of \(k\), the number of clusters, and compare it to that obtained on reference data of similar dimensions with various possible ‘non-clustered’ distributions. We can use uniformly distributed data as we did above or data simulated with the same covariance structure as our original data.

This algorithm is a Monte Carlo method that compares the gap statistic \(\log(\text{WSS}_k)\) for the observed data to an average over simulations of data with similar structure.

**Algorithm for computing the gap statistic (Tibshirani, Walther, and Hastie 2001):**

Cluster the data with \(k\) clusters and compute \(\text{WSS}_k\) for the various choices of \(k\).

Generate \(B\) plausible reference data sets, using Monte Carlo sampling from a homogeneous distribution and redo Step 1 above for these new simulated data. This results in \(B\) new within-sum-of-squares for simulated data \(W_{kb}^*\), for \(b=1,...,B\).

Compute the \(\text{gap}(k)\)-statistic:

\[\begin{equation*} \text{gap}(k) = \overline{l}_k - \log \text{WSS}_k \quad\text{with}\quad \overline{l}_k =\frac{1}{B}\sum_{b=1}^B \log W^*_{kb} \end{equation*}\]

Note that the first term is expected to be bigger than the second one if the clustering is good (i.e., the WSS is smaller); thus the gap statistic will be mostly positive and we are looking for its highest value.

- We can use the standard deviation

\[\begin{equation*} \text{sd}_k^2 = \frac{1}{B-1}\sum_{b=1}^B\left(\log(W^*_{kb})-\overline{l}_k\right)^2 \end{equation*}\]

to help choose the best \(k\). Several choices are available, for instance, to choose the smallest \(k\) such that

\[\begin{equation*} \text{gap}(k) \geq \text{gap}(k+1) - s'_{k+1}\qquad \text{where } s'_{k+1}=\text{sd}_{k+1}\sqrt{1+1/B}. \end{equation*}\]

The packages **cluster** and **clusterCrit** provide implementations.

► Question 5.14

Make a function that plots the gap statistic as in Figure 5.27. Show the output for the `simdat`

example dataset clustered with the `pam`

function.

► Solution

Let’s now use the method on a real example. We load the **Hiiragi** data that we already explored in Chapter 3 (Ohnishi et al. 2014) and will see how the cells cluster.

```
library("Hiiragi2013")
data("x")
```

We start by choosing the 50 most variable genes (features)76 The intention behind this step is to reduce the influence of technical (or batch) effects. Although individually small, when accumulated over all the 45101 features in `x`

, many of which match genes that are weakly or not expressed, without this feature selection step, such effects are prone to suppress the biological signal..

```
selFeats = order(rowVars(Biobase::exprs(x)), decreasing = TRUE)[1:50]
embmat = t(Biobase::exprs(x)[selFeats, ])
embgap = clusGap(embmat, FUN = pamfun, K.max = 24, verbose = FALSE)
k1 = maxSE(embgap$Tab[, "gap"], embgap$Tab[, "SE.sim"])
k2 = maxSE(embgap$Tab[, "gap"], embgap$Tab[, "SE.sim"],
method = "Tibs2001SEmax")
c(k1, k2)
```

`## [1] 9 7`

The default choice for the number of clusters, `k1`

, is the first value of \(k\) for which the gap is not larger than the first local maximum minus a standard error \(s\) (see the manual page of the `clusGap`

function). This gives a number of clusters \(k = 9\), whereas the choice recommended by Tibshirani, Walther, and Hastie (2001) is the smallest \(k\) such that \(\text{gap}(k) \geq \text{gap}(k+1) - s_{k+1}'\), this gives \(k = 7\). Let’s plot the gap statistic (Figure 5.28).

Figure 5.28: The gap statistic for the **Hiiragi2013** data.

```
plot(embgap, main = "")
cl = pamfun(embmat, k = k1)$cluster
table(pData(x)[names(cl), "sampleGroup"], cl)
```

```
## cl
## 1 2 3 4 5 6 7 8 9
## E3.25 23 11 1 1 0 0 0 0 0
## E3.25 (FGF4-KO) 0 0 1 16 0 0 0 0 0
## E3.5 (EPI) 2 1 0 0 0 8 0 0 0
## E3.5 (FGF4-KO) 0 0 8 0 0 0 0 0 0
## E3.5 (PE) 0 0 0 0 9 2 0 0 0
## E4.5 (EPI) 0 0 0 0 0 0 0 4 0
## E4.5 (FGF4-KO) 0 0 0 0 0 0 0 0 10
## E4.5 (PE) 0 0 0 0 0 0 4 0 0
```

Above we see the comparison between the clustering that we got from `pamfun`

with the sample labels in the annotation of the data.

► Question 5.15

How do the results change if you use all the features in `x`

, rather than subsetting the top 50 most variable genes?

We saw the bootstrap principle in Chapter 4: ideally, we would like to use many new samples (sets of data) from the underlying data generating process, for each of them apply our clustering method, and then see how stable the clusterings are, or how much they change, using an index such as those we used above to compare clusterings. Of course, we don’t have these additional samples. So we are, in fact, going to create new datasets simply by taking different random subsamples of the data, look at the different clusterings we get each time, and compare them. Tibshirani, Walther, and Hastie (2001) recommend using bootstrap resampling to infer the number of clusters using the gap statistic.

We will continue using the **Hiiragi2013** data. Here we follow the investigation of the hypothesis that the inner cell mass (ICM) of the mouse blastocyst in embyronic day 3.5 (E3.5) falls “naturally” into two clusters corresponding to pluripotent epiblast (EPI) versus primitive endoderm (PE), while the data for embryonic day 3.25 (E3.25) do not yet show this symmetry breaking.

We will not use the true group labels in our clustering and only use them in the final interpretation of the results. We will apply the bootstrap to the two different data sets (E3.5) and (E3.25) separately. Each step of the bootstrap will generate a clustering of a random subset of the data and we will need to compare these through a consensus of an ensemble of clusters. There is a useful framework for this in the **clue** package (Hornik 2005). The function `clusterResampling`

, taken from the supplement of Ohnishi et al. (2014), implements this approach:

```
clusterResampling = function(x, ngenes = 50, k = 2, B = 250,
prob = 0.67) {
mat = Biobase::exprs(x)
ce = cl_ensemble(list = lapply(seq_len(B), function(b) {
selSamps = sample(ncol(mat), size = round(prob * ncol(mat)),
replace = FALSE)
submat = mat[, selSamps, drop = FALSE]
sel = order(rowVars(submat), decreasing = TRUE)[seq_len(ngenes)]
submat = submat[sel,, drop = FALSE]
pamres = pam(t(submat), k = k)
pred = cl_predict(pamres, t(mat[sel, ]), "memberships")
as.cl_partition(pred)
}))
cons = cl_consensus(ce)
ag = sapply(ce, cl_agreement, y = cons)
list(agreements = ag, consensus = cons)
}
```

The function `clusterResampling`

performs the following steps:

[clue1] Draw a random subset of the data (the data are either all E3.25 or all E3.5 samples) by selecting 67% of the samples without replacement.

Select the top

`ngenes`

features by overall variance (in the subset) .[clue2] Apply \(k\)-means clustering and predict the cluster memberships of the samples that were not in the subset with the

`cl_predict`

method from the**clue**package, through their proximity to the cluster centres.Repeat steps [clue1]-[clue2]

`B`

times.Apply consensus clustering (

`cl_consensus`

).[clueagree] For each of the

`B`

clusterings, measure the agreement with the consensus through the function(`cl_agreement`

). Here a good agreement is indicated by a value of 1, and less agreement by smaller values. If the agreement is generally high, then the clustering into \(k\) classes can be considered stable and reproducible; inversely, if it is low, then no stable partition of the samples into \(k\) clusters is evident.

As a measure of between-cluster distance for the consensus clustering, the *Euclidean* dissimilarity of the memberships is used, i.e., the square root of the minimal sum of the squared differences of \(\mathbf{u}\) and all column permutations of \(\mathbf{v}\), where \(\mathbf{u}\) and \(\mathbf{v}\) are the cluster membership matrices. As agreement measure for Step [clueagree], the quantity \(1 - d/m\) is used, where \(d\) is the Euclidean dissimilarity, and \(m\) is an upper bound for the maximal Euclidean dissimilarity.

```
iswt = (x$genotype == "WT")
cr1 = clusterResampling(x[, x$Embryonic.day == "E3.25" & iswt])
cr2 = clusterResampling(x[, x$Embryonic.day == "E3.5" & iswt])
```

The results are shown in Figure 5.30. They confirm the hypothesis that the E.35 data fall into two clusters.

```
ag1 = tibble(agreements = cr1$agreements, day = "E3.25")
ag2 = tibble(agreements = cr2$agreements, day = "E3.5")
ggplot(bind_rows(ag1, ag2), aes(x = day, y = agreements)) +
geom_boxplot() +
ggbeeswarm::geom_beeswarm(cex = 1.5, col = "#0000ff40")
mem1 = tibble(y = sort(cl_membership(cr1$consensus)[, 1]),
x = seq(along = y), day = "E3.25")
mem2 = tibble(y = sort(cl_membership(cr2$consensus)[, 1]),
x = seq(along = y), day = "E3.5")
ggplot(bind_rows(mem1, mem2), aes(x = x, y = y, col = day)) +
geom_point() + facet_grid(~ day, scales = "free_x")
```

**Computational complexity**. An algorithm is said to be \(O(n^k)\), if, as \(n\) gets larger, the resource consumption (CPU time or memory) grows proportionally to \(n^k\). There may be other (sometimes considerable) baseline costs, or costs that grow proportionally to lower powers of \(n\), but these always become negligible compared to the leading term as \(n\to\infty\).

It is important to remember that the computation of all versus all distances of \(n\) objects is an \(O(n^2)\) operation (in time and memory). Classic hierarchical clustering approaches (such as `hclust`

in the **stats** package) are even \(O(n^3)\) in time. For large \(n\), this may become impractical77 E.g., the distance matrix for one million objects, stored as 8-byte floating point numbers, would take up about 4 Terabytes, and an `hclust`

-like algorithm would run 30 years even under the optimistic assumption that each of the iterations only takes a nanosecond.. We can avoid the complete computation of the all-vs-all distance matrix. For instance, \(k\)-means has the advantage of only requiring \(O(n)\) computations, since it only keeps track of the distances between each object and the cluster centers, whose number remains the same even if \(n\) increases.

Fast implementations such as **fastclust** (Müllner 2013) and **dbscan** have been carefully optimized to deal with a large number of observations.

Consider a set of measurements that reflect some underlying true values (say, species represented by DNA sequences from their genomes), but have been degraded by technical noise. Clustering can be used to remove such noise.

Suppose that we have a bivariate distribution of observations made with the same error variances. However, the sampling is from two groups that have very different baseline frequencies. Suppose, further, that the errors are continuous independent bivariate normally distributed. We have \(10^{3}\) of `seq1`

and \(10^{5}\) of `seq2`

, as generated for instance by the code:

Figure 5.31: Although both groups have noise distributions with the same variances, the apparent radii of the groups are very different. The \(10^{5}\) instances in `seq2`

have many more opportunities for errors than what we see in `seq1`

, of which there are only \(10^{3}\). Thus we see that frequencies are important in clustering the data.

```
library("mixtools")
library("ggplot2")
seq1 = rmvnorm(n = 1e3, mu = -c(1, 1), sigma = 0.5 * diag(c(1, 1)))
seq2 = rmvnorm(n = 1e5, mu = c(1, 1), sigma = 0.5 * diag(c(1, 1)))
twogr = data.frame(
rbind(seq1, seq2),
seq = factor(c(rep(1, nrow(seq1)),
rep(2, nrow(seq2))))
)
colnames(twogr)[1:2] = c("x", "y")
ggplot(twogr, aes(x = x, y = y, colour = seq,fill = seq)) +
geom_hex(alpha = 0.5, bins = 50) + coord_fixed()
```

The observed values would look as in Figure 5.31.

► Question 5.16

Take the data `seq1`

and `seq2`

and cluster them into two groups according to distance from group center. Do you think the results should depend on the frequencies of each of the two sequence types?

► Solution

See Kahneman (2011) for a book-length treatment of our natural heuristics and the ways in which they can mislead us when we make probability calculations (we recommend especially Chapters 14 and 15).

► Task

Simulate `n=2000`

binary variables of length `len=200`

that indicate the quality of `n`

sequencing reads of length `len`

. For simplicity, let us assume that sequencing errors occur independently and uniformly with probability `perr=0.001`

. That is, we only care whether a base was called correctly (`TRUE`

) or not (`FALSE`

).

```
n = 2000
len = 200
perr = 0.001
seqs = matrix(runif(n * len) >= perr, nrow = n, ncol = len)
```

Now, compute all pairwise distances between reads.

`dists = as.matrix(dist(seqs, method = "manhattan"))`

For various values of number of reads `k`

(from 2 to `n`

), the maximum distance within this set of reads is computed by the code below and shown in Figure 5.32.

Figure 5.32: The diameter of a set of sequences as a function of the number of sequences.

```
library("tibble")
dfseqs = tibble(
k = 10 ^ seq(log10(2), log10(n), length.out = 20),
diameter = vapply(k, function(i) {
s = sample(n, i)
max(dists[s, s])
}, numeric(1)))
ggplot(dfseqs, aes(x = k, y = diameter)) + geom_point()+geom_smooth()
```

We will now improve the 16SrRNA-read clustering using a denoising mechanism that incorporates error probabilities.

**What are the data?** In the bacterial 16SrRNA gene there are so-called **variable regions** that are taxa-specific. These provide fingerprints that enables *taxon*78 Calling different groups of bacteria *taxa* rather than *species* highlights the approximate nature of the concept, as the notion of species is more fluid in bacteria than, say, in animals. identification. The raw data are FASTQ-files with quality scored sequences of PCR-amplified DNA regions79 The FASTQ format is described here.. We use an iterative alternating approach80 Similar to the EM algorithm we saw in Chapter 4. to build a probabilistic noise model from the data. We call this a *de novo* method, because we use clustering, and we use the cluster centers as our denoised sequence variants (a.k.a. Amplicon Sequence Variants, ASVs, see (Callahan, McMurdie, and Holmes 2017)). After finding all the denoised variants, we create contingency tables of their counts across the different samples. We will show in Chapter 10 how these tables can be used to infer properties of the underlying bacterial communities using networks and graphs.

In order to improve data quality, one often has to start with the raw data and model all the sources of variation carefully. One can think of this as an example of *cooking from scratch* (see the gruesome details in Ben J Callahan et al. (2016) and Exercise 5.5).

► Question 5.17

Suppose that we have two sequences of length 200 (seq1 and seq2) present in our sample at very different abundances. We are told that the technological sequencing errors occur as independent Bernoulli(0.0005) random events for each nucleotide.

What is the distribution of the number of errors per sequence?

► Solution

Figure 5.33 shows us how close the distribution is to being Poisson distributed.

The DADA method (Divisive Amplicon Denoising Algorithm, Rosen et al. (2012)) uses a parameterized model of substitution errors that distinguishes sequencing errors from real biological variation. The model computes the probabilities of base substitutions, such as seeing an \({\tt A}\) instead of a \({\tt C}\). It assumes that these probabilities are independent of the position along the sequence. Because error rates vary substantially between sequencing runs and PCR protocols, the model parameters are estimated from the data themselves using an EM-type approach. A read is classified as noisy or exact given the current parameters, and the noise model parameters are updated accordingly81 In the case of a large data set, the noise model estimation step does not have to be done on the complete set. See https://benjjneb.github.io/dada2/bigdata.html for tricks and tools when dealing with large data sets..

The dereplicated sequences82 F stands for forward strand and R for reverse strand. are read in and then divisive denoising and estimation is run with the `dada`

function as in the following code:

```
derepFs = readRDS(file="../data/derepFs.rds")
derepRs = readRDS(file="../data/derepRs.rds")
library("dada2")
ddF = dada(derepFs, err = NULL, selfConsist = TRUE)
ddR = dada(derepRs, err = NULL, selfConsist = TRUE)
```

In order to verify that the error transition rates have been reasonably well estimated, we inspect the fit between the observed error rates (black points) and the fitted error rates (black lines) (Figure 5.34).

`plotErrors(ddF)`

Once the errors have been estimated, the algorithm is rerun on the data to find the sequence variants:

```
dadaFs = dada(derepFs, err=ddF[[1]]$err_out, pool = TRUE)
dadaRs = dada(derepRs, err=ddR[[1]]$err_out, pool = TRUE)
```

**Note:** The sequence inference function can run in two different modes: Independent inference by sample (`pool = FALSE`

), and pooled inference from the sequencing reads combined from all samples. Independent inference has two advantages: as a functions of the number of samples, computation time is linear and memory requirements are constant. Pooled inference is more computationally taxing, however it can improve the detection of rare variants that occur just once or twice in an individual sample but more often across all samples. As this dataset is not particularly large, we performed pooled inference.

Sequence inference removes nearly all substitution and **indel**83 The term *indel* stands for insertion-deletion; when comparing two sequences that differ by a small stretch of characters, it is a matter of viewpoint whether this is an insertion or a deletion, thus the name. errors from the data. We merge the inferred forward and reverse sequences, while removing paired sequences that do not perfectly overlap as a final control against residual errors.

`mergers = mergePairs(dadaFs, derepFs, dadaRs, derepRs)`

We produce a contingency table of counts of ASVs. This is a higher-resolution analogue of the “OTU84 operational taxonomic units table”, i.e., a samples by features table whose cells contain the number of times each sequence variant was observed in each sample.

`seqtab.all = makeSequenceTable(mergers[!grepl("Mock",names(mergers))])`

► Question 5.18

Explore the components of the objects `dadaRs`

and `mergers`

.

► Solution

Chimera are sequences that are artificially created during PCR amplification by the melding of two (in rare cases, more) of the original sequences. To complete our denoising workflow, we remove them with a call to the function `removeBimeraDenovo`

, leaving us with a clean contingency table we will use later on.

`seqtab = removeBimeraDenovo(seqtab.all)`

► Question 5.19

Why do you think the chimera are quite easy to recognize?

What proportion of the reads were chimeric in the `seqtab.all`

data?

What proportion of unique sequence variants are chimeric?

► Solution

**Of a feather: how to compare observations** We saw at the start of the chapter how finding the **right distance** is an essential first step in a clustering analysis; this is a case where the *garbage in, garbage out* motto is in full force. Always choose a distance that is scientifically meaningful and compare output from as many distances as possible; sometimes the same data require different distances when different scientific objectives are pursued.

**Two ways of clustering** We saw there are two approaches to clustering:

iterative partitioning approaches such as \(k\)-means and \(k\)-medoids (PAM) that alternated between estimating the cluster centers and assigning points to them;

hierarchical clustering approaches that first agglomerate points, and subsequently the growing clusters, into nested sequences of sets that can be represented by hierarchical clustering

*trees*.

**Biological examples** Clustering is important tool for finding latent classes in single cell measurements, especially in immunology and single cell data analyses. We saw how density-based clustering is useful for lower dimensional data where sparsity is not an issue.

**Validation the clusters** Clustering algorithms *always* deliver clusters, so we need to assess their quality and the number of clusters to choose carefully. These validation steps are perfomed using visualization tools and repeating the clustering on many resamples of the data. We saw how statistics such as WSS/BSS or \(\log(\text{WSS})\) can be calibrated using simulations on data where we understand the group structure and can provide useful benchmarks for choosing the number of clusters on new data. Of course, the use of biologically relevant information to inform and confirm the meaning of clusters is always the best validation approach.

**Distances and probabilities** Finally: distances are not everything. We showed how important it was to take into account baseline frequencies and local densities when clustering. This is essential in a cases such as clustering to denoise 16S rRNA sequence reads where the true class or taxa group occur at very different frequencies.

For a complete book on *Finding groups in data*, see Kaufman and Rousseeuw (2009). The vignette of the **clusterExperiment** package contains a complete workflow for generating clusters using many different techniques, including preliminary dimension reduction (PCA) that we will cover in Chapter 7. There is no consensus on methods for deciding how many clusters are needed to describe data in the absence of contiguous biological information. However, making hierarchical clusters of the *strong forms* is a method that has the advantage of allowing the user to decide how far down to cut the hierarchical tree and be careful not to cut in places where these inner branches are short. See the vignette of **clusterExperiment** for an application to single cell RNA experimental data.

In analyzing the Hiiragi data, we used cluster probabilities, a concept already mentioned in Chapter 4, where the EM algorithm used them as weights to compute expected value statistics. The notion of probabilistic clustering is well-developed in the Bayesian nonparametric mixture framework, which enriches the mixture models we covered in Chapter 4 to more general settings. See Dundar et al. (2014) for a real example using this framework for flow cytometry. In the denoising and assignment of high-throughput sequencing reads to specific strains of bacteria or viruses, clustering is essential. In the presence of noise, clustering into groups of *true* strains of very unequal sizes can be challenging. Using the data to create a noise model enables both denoising and cluster assignment concurrently. Denoising algorithms such as those by Rosen et al. (2012) or Benjamin J Callahan et al. (2016) use an iterative workflow inspired by the EM method (McLachlan and Krishnan 2007).

We can define the average dissimilarity of a point \(x_i\) to a cluster \(C_k\) as the average of the distances from \(x_i\) to all points in \(C_k\). Let \(A(i)\) be the average dissimilarity of all points in the cluster that \(x_i\) belongs to. Let \(B(i)\) be the lowest average dissimilarity of \(x_i\) to any other cluster of which \(x_i\) is not a member. The cluster with this lowest average dissimilarity is said to be the **neighboring cluster** of \(x_i\), because it is the next best fit cluster for point \(x_i\). The **silhouette index** is

\[\begin{equation*} S(i)=\frac{B(i)-A(i)}{\max_i(A(i),B(i))}. \end{equation*}\]

**5.1.a** Compute the silhouette index for the `simdat`

data we simulated in Section 5.7.

```
library("cluster")
pam4 = pam(simdatxy, 4)
sil = silhouette(pam4, 4)
plot(sil, col=c("red","green","blue","purple"), main="Silhouette")
```

**5.1.b** Change the number of clusters \(k\) and assess which \(k\) gives the best silhouette index.

**5.1.c** Now, repeat this for groups that have uniform (unclustered) data distributions over a whole range of values.

▢

**5.2.a** Make a “character” representation of the distance between the 20 locations in the `dune`

data from the **vegan** package using the function `symnum`

.

**5.2.b** Make a heatmap plot of these distances.

▢

**5.3.a** Load the `spirals`

data from the **kernlab** package. Plot the results of using \(k\)-means on the data. This should give you something similar to Figure 5.35.

**5.3.b** You’ll notice that the clustering in Figure 5.35 seems unsatisfactory. Show how a different method, such as `specc`

or `dbscan`

, could cluster `spirals`

data in a more useful manner.

**5.3.c** Repeat the `dbscan`

clustering with different parameters. How robust is the number of groups?

▢

Looking at graphical representations in simple two-dimensional maps can often reveal important clumping patterns. We saw an example for this with the map that enabled Snow to discover the source of the London cholera outbreak. Such clusterings can often indicate important information about hidden variables acting on the observations. Look at a map for breast cancer incidence in the US at:

http://www.huffingtonpost.com/bill-davenhall/post_1663_b_817254.html (Mandal et al. 2009); the areas of high incidence seem spatially clustered. Can you guess the reason(s) for this clustering and high incidence rates on the West and East coasts and around Chicago?

▢

**Amplicon bioinformatics: from raw reads to dereplicated sequences**. As a supplementary exercise, we provide the intermediate steps necessary to a full data preprocessing workflow for denoising 16S rRNA sequences. We start by setting the directories and loading the downloaded data:

```
base_dir = "../data"
miseq_path = file.path(base_dir, "MiSeq_SOP")
filt_path = file.path(miseq_path, "filtered")
fnFs = sort(list.files(miseq_path, pattern="_R1_001.fastq"))
fnRs = sort(list.files(miseq_path, pattern="_R2_001.fastq"))
sampleNames = sapply(strsplit(fnFs, "_"), `[`, 1)
if (!file_test("-d", filt_path)) dir.create(filt_path)
filtFs = file.path(filt_path, paste0(sampleNames, "_F_filt.fastq.gz"))
filtRs = file.path(filt_path, paste0(sampleNames, "_R_filt.fastq.gz"))
fnFs = file.path(miseq_path, fnFs)
fnRs = file.path(miseq_path, fnRs)
print(length(fnFs))
```

`## [1] 20`

The data are highly-overlapping Illumina Miseq \(2\times 250\) amplicon sequences from the V4 region of the 16S rRNA gene (Kozich et al. 2013). There were originally 360 fecal samples collected longitudinally from 12 mice over the first year of life. These were collected by Schloss et al. (2012) to investigate the development and stabilization of the murine microbiome. We have selected 20 samples to illustrate how to preprocess the data.

We will need to filter out low-quality reads and trim them to a consistent length. While generally recommended filtering and trimming parameters serve as a starting point, no two datasets are identical and therefore it is always worth inspecting the quality of the data before proceeding. We show the sequence quality plots for the two first samples in Figure 5.36. They are generated by:

```
plotQualityProfile(fnFs[1:2]) + ggtitle("Forward")
plotQualityProfile(fnRs[1:2]) + ggtitle("Reverse")
```

Note that we also see the background distribution of quality scores at each position in Figure 5.36 as a grey-scale heat map. The dark colors correspond to higher frequency.

▢

Generate similar plots for four randomly selected sets of forward and reverse reads. Compare forward and reverse read qualities; what do you notice?

▢

Here, the forward reads maintain high quality throughout, while the quality of the reverse reads drops significantly at about position 160. Therefore, we truncate the forward reads at position 240, and trimm the first 10 nucleotides as these positions are of lower quality. The reverse reads are trimmed at position 160. Combine these trimming parameters with standard filtering parameters remember to enforce a maximum of 2 expected errors per-read. (Hint: Trim and filter on paired reads jointly, i.e., both reads must pass the filter for the pair to pass. The input arguments should be chosen following the **dada2** vignette carefully. We recommend filtering out all reads with any ambiguous nucleotides.)

▢

Use R to create a map like the one shown in Figure 5.2. Hint: go to the website of the British National Archives and download street addresses of hits, use an address resolution service to convert these into geographic coordinates, and display these as points on a map of London.

▢

Aure, Miriam Ragle, Valeria Vitelli, Sandra Jernström, Surendra Kumar, Marit Krohn, Eldri U Due, Tonje Husby Haukaas, et al. 2017. “Integrative Clustering Reveals a Novel Split in the Luminal A Subtype of Breast Cancer with Impact on Outcome.” *Breast Cancer Research* 19 (1). BioMed Central: 44.

Bendall, Sean C, Garry P Nolan, Mario Roederer, and Pratip K Chattopadhyay. 2012. “A Deep Profiler’s Guide to Cytometry.” *Trends in Immunology* 33 (7). Elsevier: 323–32.

Callahan, Benjamin J, Paul J McMurdie, and Susan P Holmes. 2017. “Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker Gene Data Analysis.” *ISME Journal*. Nature publishing, 1–5.

Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy J Johnson, and Susan P Holmes. 2016. “DADA2: High Resolution Sample Inference from Amplicon Data.” *Nature Methods*. Nature Publishing, 1–4.

Callahan, Ben J, Kris Sankaran, Julia A Fukuyama, Paul J McMurdie, and Susan P Holmes. 2016. “Bioconductor Workflow for Microbiome Data Analysis: From Raw Reads to Community Analyses.” *F1000Research* 5. Faculty of 1000 Ltd.

Caporaso, J.G., J. Kuczynski, J. Stombaugh, K. Bittinger, F.D. Bushman, E.K. Costello, N. Fierer, et al. 2010. “QIIME Allows Analysis of High-Throughput Community Sequencing Data.” *Nature Methods* 7 (5). Nature Publishing Group: 335–36.

Chakerian, John, and Susan Holmes. 2012. “Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees.” *Journal of Computational and Graphical Statistics* 21 (3). Taylor & Francis Group: 581–99.

Diday, Edwin, and M Paula Brito. 1989. “Symbolic Cluster Analysis.” In *Conceptual and Numerical Analysis of Data*, 45–84. Springer.

Dundar, Murat, Ferit Akova, Halid Z. Yerebakan, and Bartek Rajwa. 2014. “A Non-Parametric Bayesian Model for Joint Cell Clustering and Cluster Matching: Identification of Anomalous Sample Phenotypes with Random Effects.” *BMC Bioinformatics* 15 (1): 1–15. https://doi.org/10.1186/1471-2105-15-314.

Freedman, David A. 1991. “Statistical Models and Shoe Leather.” *Sociological Methodology* 21 (2): 291–313.

Hallett, Robin M, Anna Dvorkin-Gheva, Anita Bane, and John A Hassell. 2012. “A Gene Signature for Predicting Outcome in Patients with Basal-Like Breast Cancer.” *Scientific Reports* 2. Nature Publishing Group.

Holmes, Susan, Michael He, Tong Xu, and Peter P Lee. 2005. “Memory T Cells Have Gene Expression Patterns Intermediate Between Naive and Effector.” *PNAS* 102 (15). National Acad Sciences: 5519–23.

Hornik, Kurt. 2005. “A CLUE for CLUster Ensembles.” *Journal of Statistical Software* 14 (12). University of California, Los Angeles.

Hulett, Henry R, William A Bonner, Janet Barrett, and Leonard A Herzenberg. 1969. “Cell Sorting: Automated Separation of Mammalian Cells as a Function of Intracellular Fluorescence.” *Science* 166 (3906). American Association for the Advancement of Science: 747–49.

Kahneman, Daniel. 2011. *Thinking, Fast and Slow*. Macmillan.

Kaufman, Leonard, and Peter J Rousseeuw. 2009. *Finding Groups in Data: An Introduction to Cluster Analysis*. Vol. 344. John Wiley & Sons.

Kozich, James J, Sarah L Westcott, Nielson T Baxter, Sarah K Highlander, and Patrick D Schloss. 2013. “Development of a Dual-Index Sequencing Strategy and Curation Pipeline for Analyzing Amplicon Sequence Data on the Miseq Illumina Sequencing Platform.” *Applied and Environmental Microbiology* 79 (17). Am Soc Microbiol: 5112–20.

Mandal, Rakesh, Sophie St-Hilaire, John G Kie, and DeWayne Derryberry. 2009. “Spatial Trends of Breast and Prostate Cancers in the United States Between 2000 and 2005.” *International Journal of Health Geographics* 8 (1). BioMed Central: 53.

McLachlan, Geoffrey, and Thriyambakam Krishnan. 2007. *The EM Algorithm and Extensions*. Vol. 382. John Wiley & Sons.

Müllner, Daniel. 2013. “Fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python.” *Journal of Statistical Software* 53 (9): 1–18.

Ohnishi, Y., W. Huber, A. Tsumura, M. Kang, P. Xenopoulos, K. Kurimoto, A. K. Oles, et al. 2014. “Cell-to-Cell Expression Variability Followed by Signal Reinforcement Progressively Segregates Early Mouse Lineages.” *Nature Cell Biology* 16 (1): 27–37.

O’Neill, Kieran, Nima Aghaeepour, Josef Špidlen, and Ryan Brinkman. 2013. “Flow Cytometry Bioinformatics.” *PLoS Computational Biology* 9 (12). Public Library of Science: e1003365.

Rosen, Michael J, Benjamin J Callahan, Daniel S Fisher, and Susan P Holmes. 2012. “Denoising PCR-Amplified Metagenome Data.” *BMC Bioinformatics* 13 (1). BioMed Central Ltd: 283.

Schloss, P.D., A.M. Schuber, J.P. Zackular, K.D. Iverson, Young V.B., and Petrosino J.F. 2012. “Stabilization of the Murine Gut Microbiome Following Weaning.” *Gut Microbes* 3 (4): 383–93.

Schloss, P D, S L Westcott, T Ryabin, J R Hall, M Hartmann, E B Hollister, R A Lesniewski, et al. 2009. “Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities.” *Applied and Environmental Microbiology* 75 (23): 7537–41.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” *JRSSB* 63 (2). Wiley Online Library: 411–23.

Tseng, George C, and Wing H Wong. 2005. “Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data.” *Biometrics* 61 (1). Wiley Online Library: 10–16.

Tversky, Amos, and Daniel Kahneman. 1974. “Heuristics and Biases: Judgement Under Uncertainty.” *Science* 185: 1124–30.

Tversky, Amos, and Daniel Kahneman. 1975. “Judgment Under Uncertainty: Heuristics and Biases.” In *Utility, Probability, and Human Decision Making*, 141–62. Springer.

Page built: 2019-05-15

Support for maintaining the online version of this book is provided by de.NBI

For website support please contact msmith@embl.de