# 12 Supervised Learning

In a supervised learning setting, we have a yardstick or plumbline to judge how well we are doing: the response itself.

A frequent question in biological and biomedical applications is whether a property of interest (say, disease type, cell type, the prognosis of a patient) can be “predicted”, given one or more other properties, called the predictors. Often we are motivated by a situation in which the property to be predicted is unknown (it lies in the future, or is hard to measure), while the predictors are known. The crucial point is that we learn the prediction rule from a set of training data in which the property of interest is also known. Once we have the rule, we can either apply it to new data, and make actual predictions of unknown outcomes; or we can dissect the rule with the aim of better understanding the underlying biology.

Compared to unsupervised learning and what we have seen in Chapters 5, 7 and 9, where we do not know what we are looking for or how to decide whether our result is “right”, we are on much more solid ground with supervised learning: the objective is clearly stated, and there are straightforward criteria to measure how well we are doing.

The central issues in supervised learning154 Sometimes the term statistical learning is used, more or less exchangeably. are overfitting and generalizability: did we just learn the training data “by heart” by constructing a rule that has 100% accuracy on the training data, but would perform poorly on any new data? Or did our rule indeed pick up some of the pertinent patterns in the system being studied, which will also apply to yet unseen new data? (Figure 12.1)

Figure 12.1: An example for overfitting: two regression lines are fit to data in the $$(x, y)$$-plane (black points). We can think of such a line as a rule that predicts the $$y$$-value, given an $$x$$-value. Both lines are smooth, but the fits differ in what is called their bandwidth, which intuitively can be interpreted their stiffness. The blue line seems overly keen to follow minor wiggles in the data, while the orange line captures the general trend but is less detailed. The effective number of parameters needed to describe the blue line is much higher than for the orange line. Also, if we were to obtain additional data, it is likely that the blue line would do a worse job than the orange line in modeling the new data. We’ll formalize these concepts –training error and test set error– later in this chapter. Although exemplified here with line fitting, the concept applies more generally to prediction models.

## 12.1 Goals for this chapter

In this chapter we will:

• See exemplary applications that motivate the use of supervised learning methods.

• Learn what discriminant analysis does.

• Define measures of performance.

• Encounter the curse of dimensionality and see what overfitting is.

• Find out about regularization – in particular, penalization – and understand the concepts of generizability and model complexity.

• See how to use cross-validation to tune parameters of algorithms.

• Discuss method hacking.

## 12.2 What are the data?

The basic data structure for both supervised and unsupervised learning is (at least conceptually) a dataframe, where each row corresponds to an object and the columns are different features (usually numerical values) of the objects155 This is a simplified description. Machine learning is a huge field, and lots of generalizations of this simple conceptual picture have been made. Already the construction of relevant features is an art by itself — we have seen examples with images of cells in Chapter 11, and more generally there are lots of possibilities to extract features from images, sounds, movies, free text, $$...$$ Moreover, there is a variant of machine learning methods called kernel methods that do not need features at all; instead, kernel methods use distances or measures of similarity between objects. It may be easier, for instance, to define a measure of similarity between two natural language text objects than to find relevant numerical features to represent them. Kernel methods are beyond the scope of this book.. While in unsupervised learning we aim to find (dis)similarity relationships between the objects based on their feature values (e.g., by clustering or ordination), in supervised learning we aim to find a mathematical function (or a computational algorithm) that predicts the value of one of the features from the other features. Many implementations require that there are no missing values, whereas other methods can be made to work with some amount of missing data.

The feature that we select over all the others with the aim of predicting is called the objective or the response. Sometimes the choice is natural, but sometimes it is also instructive to reverse the roles, especially if we are interested in dissecting the prediction function for the purpose of biological understanding, or in disentangling correlations from causation.

The framework for supervised learning covers both continuous and categorical response variables. In the continuous case we also call it regression, in the categorical case, classification. It turns out that this distinction is not a detail, as it has quite far-reaching consequences for the choice of loss function (Section 12.5) and thus the choice of algorithm .

The first question to consider in any supervised learning task is how the number of objects compares to the number of predictors. The more objects, the better, and much of the hard work in supervised learning has to do with overcoming the limitations of having a finite (and typically, too small) training set.

Figure 12.2: In supervised learning, we assign two different roles to our variables. We have labeled the explanatory variables $$X$$ and the response variable(s) $$Y$$. There are also two different sets of observations: the training set $$X_\ell$$ and $$Y_\ell$$ and the test set $$X_v$$ and $$Y_v$$. (The subscripts refer to alternative names for the two sets: “learning” and “validation”.)

Give examples where we have encountered instances of supervised learning with a categorical response in this book.

### 12.2.1 Motivating examples

#### Predicting diabetes type

The diabetes dataset  presents three different groups of diabetes patients and five clinical variables measured on them.

data("diabetes", package = "rrcov")
head(diabetes)
##     rw fpg glucose insulin sspg  group
## 1 0.81  80     356     124   55 normal
## 2 0.95  97     289     117   76 normal
## 3 0.94 105     319     143  105 normal
## 4 1.04  90     356     199  108 normal
## 5 1.00  90     323     240  143 normal
## 6 0.76  86     381     157  165 normal

The univariate distributions (more precisely, some density estimates of them) are shown in Figure 12.3.

Figure 12.3: We see already from the one-dimensional distributions that some of the individual variables could potentially predict which group a patient is more likely to belong to. Our goal is to combine variables to improve over such one-dimensional prediction models.

library("ggplot2")
library("reshape2")
ggplot(melt(diabetes, id.vars = "group"), aes(x = value, col = group)) +
geom_density() + facet_wrap( ~variable, ncol = 1, scales = "free") +
theme(legend.position = "bottom")

The variables are explained in the manual page of the dataset, and in the paper :

• rw: relative weight

• fpg: fasting plasma glucose

• glucose: area under the plasma glucose curve for the three hour oral glucose tolerance test (OGTT)

• insulin: area under the plasma insulin curve for the OGTT

• sspg: steady state plasma glucose response

• group: normal, chemical diabetes and overt diabetes

#### Predicting cellular phenotypes

Neumann et al. (2010) observed human cancer cells using live-cell imaging. The cells were genetically engineered so that their histones were tagged with a green fluorescent protein (GFP). A genome-wide RNAi library was applied to the cells, and for each siRNA perturbation, movies of a few hundred cells were recorded for about two days, to see what effect the depletion of each gene had on cell cycle, nuclear morphology and cell proliferation. Their paper reports the use of an automated image classification algorithm that quantified the visual appearance of each cell’s nucleus and enabled the prediction of normal mitosis states or aberrant nuclei. The algorithm was trained on the data from around 3000 cells that were annotated by a human expert. It was then applied to almost 2 billions images of nuclei (Figure 12.4). Using automated image classification provided scalablity (annotating 2 billion images manually would take a long time) and objectivity.

#### Predicting embryonic cell states

We will revisit the mouse embryo data , which we have already seen in Chapters 3, 5 and 7. We’ll try to predict cell state and genotype from the gene expression measurements in Sections 12.3.2 and 12.6.3.

## 12.3 Linear discrimination

We start with one of the simplest possible discrimination problems156 Arguably the simplest possible problem is a single continuous feature, two classes, and the task of finding a single threshold to discriminate between the two groups – as in Figure 6.2.: we have objects described by two continuous features (so the objects can be thought of as points in the 2D plane) and falling into three groups. Our aim is to define class boundaries, which are lines in the 2D space.

### 12.3.1 Diabetes data

Let’s see whether we can predict the group from the sspg and glucose variables in the diabetes data. It’s always a good idea to first visualise the data (Figure 12.5).

Figure 12.5: Scatterplot of two of the variables in the diabetes data. Each point is a sample, and the color indicates the diabetes type as encoded in the group variable.

ggdb = ggplot(mapping = aes(x = sspg, y = glucose)) +
geom_point(aes(colour = group), data = diabetes)
ggdb

We’ll start with a method called linear discriminant analysis (LDA). This method is a foundation stone of classification, many of the more complicated (and sometimes more powerful) algorithms are really just generalizations of LDA.

library("MASS")
diabetes_lda = lda(group ~ sspg + glucose, data = diabetes)
diabetes_lda
## Call:
## lda(group ~ sspg + glucose, data = diabetes)
##
## Prior probabilities of groups:
##    normal  chemical     overt
## 0.5241379 0.2482759 0.2275862
##
## Group means:
##              sspg   glucose
## normal   114.0000  349.9737
## chemical 208.9722  493.9444
## overt    318.8788 1043.7576
##
## Coefficients of linear discriminants:
##                 LD1         LD2
## sspg    0.005036943 -0.01539281
## glucose 0.005461400  0.00449050
##
## Proportion of trace:
##    LD1    LD2
## 0.9683 0.0317
ghat = predict(diabetes_lda)$class table(ghat, diabetes$group)
##
## ghat       normal chemical overt
##   normal       69       12     1
##   chemical      7       24     6
##   overt         0        0    26
mean(ghat != diabetes$group) ## [1] 0.1793103 ► Question 205 What do the different parts of the above output mean? Now, let’s visualise the LDA result. We are going to plot the prediction regions for each of the three groups. We do this by creating a grid of points and using our prediction rule on each of them. We’ll then also dig a bit deeper into the mechanics of LDA and plot the class centers (diabetes_lda$means) and ellipses that correspond to the fitted covariance matrix (diabetes_lda$scaling). Assembling this visualization requires us to write a bit of code. make1Dgrid = function(x) { rg = grDevices::extendrange(x) seq(from = rg[1], to = rg[2], length.out = 100) } Set up the points for prediction, a $$100 \times 100$$ grid that covers the data range. diabetes_grid = with(diabetes, expand.grid(sspg = make1Dgrid(sspg), glucose = make1Dgrid(glucose))) Do the predictions. diabetes_grid$ghat =
predict(diabetes_lda, newdata = diabetes_grid)$class The group centers. centers = diabetes_lda$means

Compute the ellipse. We start from a unit circle (approximated by a polygon with 360 sides) and apply the corresponding affine transformation from the LDA output.

unitcircle = exp(1i * seq(0, 2*pi, length.out = 360)) |>
($$z) cbind(Re(z), Im(z)))() ellipse = unitcircle %*% solve(diabetes_ldascaling) |> as_tibble() All three ellipses, one for each group center. ellipses = lapply(rownames(centers), function(gr) { mutate(ellipse, sspg = sspg + centers[gr, "sspg"], glucose = glucose + centers[gr, "glucose"], group = gr) }) |> bind_rows() Now we are ready to plot (Figure 12.6). Figure 12.6: As Figure 12.5, with the classification regions from the LDA model shown. The three ellipses represent the class centers and the covariance matrix of the LDA model; note that there is only one covariance matrix, which is the same for all three classes. Therefore also the sizes and orientations of the ellipses are the same for the three classes, only their centers differ. They represent contours of equal class membership probability. ggdb + geom_raster(aes(fill = ghat), data = diabetes_grid, alpha = 0.25, interpolate = TRUE) + geom_point(data = as_tibble(centers), pch = "+", size = 8) + geom_path(aes(colour = group), data = ellipses) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) ► Question 206 Why is the boundary between the prediction regions for chemical and overt not perpendicular to the line between the group centers? ► Solution ► Question 207 How confident would you be about the predictions in those areas of the 2D plane that are far from all of the cluster centers? ► Solution ► Question 208 Why is the boundary between the prediction regions for normal and chemical not half-way between the centers, but shifted in favor of normal? Hint: have a look at the prior argument of lda. Try again with uniform prior. ► Solution ► Question 209 Figures 12.6 and 12.7 show both the fitted LDA model, through the ellipses, and the prediction regions, through the area coloring. What part of this visualization is generic for all sorts of classification methods, what part is method-specific? ► Solution ► Question 210 What is the difference in the prediction accuracy if we use all 5 variables instead of just glucose and sspg? ► Solution ► Question 211 Instead of approximating the prediction regions by classification from a grid of points, compute the separating lines explicitly from the linear determinant coefficients. ► Solution ### 12.3.2 Predicting embryonic cell state from gene expression Assume that we already know that the four genes FN1, TIMD2, GATA4 and SOX7 are relevant to the classification task157 Later in this chapter we will see methods that can drop this assumption and screen all available features.. We want to build a classifier that predicts the developmental time (embryonic days: E3.25, E3.5, E4.5). We load the data and select four corresponding probes. library("Hiiragi2013") data("x") probes = c("1426642_at", "1418765_at", "1418864_at", "1416564_at") embryoCells = t(Biobase::exprs(x)[probes, ]) |> as_tibble() |> mutate(Embryonic.day = xEmbryonic.day) |> dplyr::filter(xgenotype == "WT") We can use the Bioconductor annotation package associated with the microarray to verify that the probes correspond to the intended genes. annotation(x) ## [1] "mouse4302" library("mouse4302.db") anno = AnnotationDbi::select(mouse4302.db, keys = probes, columns = c("SYMBOL", "GENENAME")) anno ## PROBEID SYMBOL ## 1 1426642_at Fn1 ## 2 1418765_at Timd2 ## 3 1418864_at Gata4 ## 4 1416564_at Sox7 ## GENENAME ## 1 fibronectin 1 ## 2 T cell immunoglobulin and mucin domain containing 2 ## 3 GATA binding protein 4 ## 4 SRY (sex determining region Y)-box 7 mt = match(annoPROBEID, colnames(embryoCells)) colnames(embryoCells)[mt] = annoSYMBOL Now we are ready to visualize the data in a pairs plot (Figure 12.8). library("GGally") ggpairs(embryoCells, mapping = aes(col = Embryonic.day), columns = annoSYMBOL, upper = list(continuous = "points")) We can now call lda on these data. The linear combinations LD1 and LD2 that serve as discriminating variables are given in the slot ed_ldascaling of the output from lda. ec_lda = lda(Embryonic.day ~ Fn1 + Timd2 + Gata4 + Sox7, data = embryoCells) round(ec_ldascaling, 1) ## LD1 LD2 ## Fn1 -0.2 -0.4 ## Timd2 0.5 0.0 ## Gata4 -0.1 -0.6 ## Sox7 -0.7 0.5 For the visualization of the learned model in Figure 12.9, we need to build the prediction regions and their boundaries by expanding the grid in the space of the two new coordinates LD1 and LD2. Figure 12.9: LDA classification regions for Embryonic.day. ec_rot = predict(ec_lda)x |> as_tibble() |> mutate(ed = embryoCellsEmbryonic.day) ec_lda2 = lda(ec_rot[, 1:2], predict(ec_lda)class) ec_grid = with(ec_rot, expand.grid( LD1 = make1Dgrid(LD1), LD2 = make1Dgrid(LD2))) ec_gridedhat = predict(ec_lda2, newdata = ec_grid)class ggplot() + geom_point(aes(x = LD1, y = LD2, colour = ed), data = ec_rot) + geom_raster(aes(x = LD1, y = LD2, fill = edhat), data = ec_grid, alpha = 0.4, interpolate = TRUE) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) + coord_fixed() ► Question 212 Repeat these analyses using quadratic discriminant analysis (qda). What difference do you see in the shape of the boundaries? ► Solution ► Question 213 What happens if you call lda or qda with a lot more genes, say the first 1000, in the Hiiragi dataset? ► Solution ## 12.4 Machine learning vs rote learning Computers are really good at memorizing facts. In the worst case, a machine learning algorithm is a roundabout way of doing this158 The not-so roundabout way is database technologies.. The central goal in statistical learning, however, is generalizability. We want an algorithm that is able to generalize, i.e., interpolate and extrapolate from given data to make good predictions about future data. Let’s look at the following example. We generate random data (rnorm) for n objects, with different numbers of features (given by p). We train a LDA on these data and compute the misclassification rate, i.e., the fraction of times the prediction is wrong (pred != resp). library("dplyr") p = 2:21 n = 20 mcl = lapply(p, function(pp) { replicate(100, { xmat = matrix(rnorm(n * pp), nrow = n) resp = sample(c("apple", "orange"), n, replace = TRUE) fit = lda(xmat, resp) pred = predict(fit)class mean(pred != resp) }) |> mean() |> (\(x) tibble(mcl = x, p = pp))() }) |> bind_rows() Figure 12.11: Misclassification rate of LDA applied to random data. While the number of observations n is held constant (at 20), we are increasing the number of features p starting from 2 up to 21. The misclassification rate becomes almost zero as p approaches 20. The LDA model becomes so elaborate and over-parameterized that it manages to learn the random labels “by heart”. (As p becomes even larger, the “performance” degrades again somewhat, apparently due to numerical properties of the lda implementation used here.) ggplot(mcl, aes(x = p, y = mcl)) + geom_line() + geom_point() + ylab("Misclassification rate") ► Question 214 What is the purpose of the replicate loop in the above code? What happens if you omit it (or replace the 100 by 1)? ► Solution Figure 12.11 seems to imply that we can perfectly predict random labels from random data, if we only fit a complex enough model, i.e., one with many parameters. How can we overcome such an absurd conclusion? The problem with the above code is that the model performance is evaluated on the same data on which it was trained. This generally leads to positive bias, as you see in this crass example. How can we overcome this problem? The key idea is to assess model performance on different data than those on which the model was trained. ### 12.4.1 Cross-validation A naive approach might be to split the data in two halves, and use the first half for learning (“training”) and the second half for assessment (“testing”). It turns out that this is needlessly variable and needlessly inefficient. It is needlessly variable, since by splitting the data only once, our results can be quite affected by how the split happens to fall. It seems better to do the splitting many times, and average. This will give us more stable results. It is needlessly inefficient, since the performance of machine learning algorithms depends on the number of observations, and the performance measured on half the data is likely159 Unless we have such an excess of data that it doesn’t matter. to be worse than what it is with all the data. For this reason, it is better to use unequal sizes of training and test data. In the extreme case, we’ll use as much as \(n-1$$ observations for training, and the remaining one for testing. After we’ve done this likewise for all observations, we can average our performance metric. This is called leave-one-out cross-validation.

See Chapter Model Assessment and Selection in the book by Hastie, Tibshirani, and Friedman (2008) for further discussion on these trade-offs.

An alternative is $$k$$-fold cross-validation, where the observations are repeatedly split into a training set of size of around $$n(k-1)/k$$ and a test set of size of around $$n/k$$. Both alternatives have pros and contras, and there is not a universally best choice. An advantage of leave-one-out is that the amount of data used for training is close to the maximally available data; this is especially important if the sample size is limiting and “every little matters” for the algorithm. A drawback of leave-one-out is that the training sets are all very similar, so they may not model sufficiently well the kind of sampling changes to be expected if a new dataset came along. For large $$n$$, leave-one-out cross-validation can be needlessly time-consuming.

estimate_mcl_loocv = function(x, resp) {
vapply(seq_len(nrow(x)), function(i) {
fit  = lda(x[-i, ], resp[-i])
ptrn = predict(fit, newdata = x[-i,, drop = FALSE])$class ptst = predict(fit, newdata = x[ i,, drop = FALSE])$class
c(train = mean(ptrn != resp[-i]), test = (ptst != resp[i]))
}, FUN.VALUE = numeric(2)) |> rowMeans() |> t() |> as_tibble()
}

xmat = matrix(rnorm(n * last(p)), nrow = n)
resp = sample(c("apple", "orange"), n, replace = TRUE)

mcl = lapply(p, function(k) {
estimate_mcl_loocv(xmat[, 1:k], resp)
}) |> bind_rows() |> data.frame(p) |> melt(id.var = "p")

Figure 12.12: Cross-validation: the misclassification rate of LDA applied to random data, when evaluated on test data that were not used for learning, hovers around 0.5 independent of p. The misclassification rate on the training data is also shown. It behaves similar to what we already saw in Figure 12.11.

ggplot(mcl, aes(x = p, y = value, col = variable)) + geom_line() +
geom_point() + ylab("Misclassification rate")

The result is show in Figure 12.12.

► Question 215

Why are the curves in Figure 12.12 more variable (“wiggly”) than in Figure 12.11? How can you overcome this?

► Solution

### 12.4.2 The curse of dimensionality

In Section 12.4.1 we have seen overfitting and cross-validation on random data, but how does it look if there is in fact a relevant class separation?

p   = 2:20
mcl = replicate(100, {
xmat = matrix(rnorm(n * last(p)), nrow = n)
resp = sample(c("apple", "orange"), n, replace = TRUE)
xmat[, 1:6] = xmat[, 1:6] + as.integer(factor(resp))

lapply(p, function(k) {
estimate_mcl_loocv(xmat[, 1:k], resp)
}) |> bind_rows() |> cbind(p = p) |> melt(id.var = "p")
}, simplify = FALSE) |> bind_rows()

Figure 12.13: As we increase the number of features included in the model, the misclassification rate initially improves; as we start including more and more irrelevant features, it increases again, as we are fitting noise.

mcl = group_by(mcl, p, variable) |> summarise(value = mean(value))

ggplot(mcl, aes(x = p, y = value, col = variable)) + geom_line() +
geom_point() + ylab("Misclassification rate")

Figure 12.14: Idealized version of Figure 12.13, from Hastie, Tibshirani, and Friedman (2008). A recurrent goal in machine learning is finding the sweet spot in the variance–bias trade-off.

The result is shown in Figure 12.13. The group centers are the vectors (in $$\mathbb{R}^{20}$$) given by the coordinates $$(1, 1, 1, 1, 1, 1, 0, 0, 0, ...)$$ (apples) and $$(2, 2, 2, 2, 2, 2, 0, 0, 0, ...)$$ (oranges), and the optimal decision boundary is the hyperplane orthogonal to the line between them. For $$p$$ smaller than $$6$$, the decision rule cannot reach this hyperplane – it is biased. As a result, the misclassification rate is suboptimal, and it decreases with $$p$$. But what happens for $$p$$ larger than $$6$$? The algorithm is, in principle, able to model the optimal hyperplane, and it should not be distracted by the additional features. The problem is that it is. The more additional features enter the dataset, the higher the probability that one or more of them happen to fall in a way that they look like good, discriminating features in the training data – only to mislead the classifier and degrade its performance in the test data. Shortly we’ll see how to use penalization to (try to) control this problem.

The term curse of dimensionality was coined by Bellman (1961). It refers to the fact that high-dimensional spaces are very hard, if not impossible, to sample thoroughly: for instance, to cover a 2-dimensional square of side length 1 with grid points that are 0.1 apart, we need $$10^2=100$$ points. In 100 dimensions, we need $$10^{100}$$ – which is already more than the number of protons in the universe. In genomics, we often aim to fit models to data with thousands of features. Also our intuitions about distances between points or about the relationship between a volume and its surface break down in a high-dimensional settings. We’ll explore some of the weirdnesses of high-dimensional spaces in the next few questions.

► Question 216

Assume you have a dataset with 1 000 000 data points in $$p$$ dimensions. The data are uniformly distributed in the unit hybercube (i.e., all features lie in the interval $$[0,1]$$). What’s the side length of a hybercube that can be expected to contain just 10 of the points, as a function of $$p$$?

► Solution

Next, let’s look at the relation between inner regions of the feature space versus its boundary regions. Generally speaking, prediction at the boundaries of feature space is more difficult than in its interior, as it tends to involve extrapolation, rather than interpolation. In the next question you’ll see how this difficulty explodes with feature space dimension.

► Question 217

What fraction of a unit cube’s total volume is closer than 0.01 to any of its surfaces, as a function of the dimension?

► Solution

► Question 218

What is the coefficient of variation (ratio of standard deviation over average) of the distance between two randomly picked points in the unit hypercube, as a function of the dimension?

► Solution

## 12.5 Objective functions

We’ve already seen the misclassification rate (MCR) used to assess our classification performance in Figures 12.1112.13. Its population version is defined as

$$$\text{MCR} = \text{E}\left[ 𝟙_{\hat{y} \neq y} \right], \tag{12.1}$$$

and for a finite sample

$$$\widehat{\text{MCR}} = \frac{1}{n}\sum_{i=1}^n 𝟙_{\hat{y_i} \neq y_i}. \tag{12.2}$$$

This is not the only choice we could make. Perhaps we care more about the misclassification of apples as oranges than vice versa, and we can reflect this by introducing weights that depend on the type of error made into the sum of Equation (12.2) (or the integral of Equation (12.1)). This can get even more elaborate if we have more than two classes. Often we want to see the whole confusion table, which we can get via

table(truth, response)

An important special case is binary classification with asymmetric costs – think about, say, a medical test. Here, the sensitivity (a.k.a. true positive rate or recall) is related to the misclassification of healthy as ill, and the specificity (or true negative rate) depends on the probability of misclassification of ill as healthy. Often, there is a single parameter (e.g., a threshold) that can be moved up and down, allowing a trade-off between sensitivity and specificity (and thus, equivalently, between the two types of misclassification). In those cases, we usually are not content to know the classifier performance at one single choice of threshold, but at many (or all) of them. This leads to receiver operating characteristic (ROC) or precision-recall curves.

► Question 219

What are the exact relationships between the per-class misclassification rates and sensitivity and specificity?

► Solution

Another cost function can be computed from the Jaccard index, which we already saw in Chapter 5.

$$$J(A,B) = \frac{|\,A\cap B\,|}{|\,A\cup B\,|}, \tag{12.3}$$$

where $$A$$ is the set of observations for which the true class is 1 ($$A=\{i\,|\,y_i=1\}$$) and $$B$$ is the set of observations for which the predicted class is 1. The number $$J$$ is between 0 and 1, and when $$J$$ is large, it indicates high overlap of the two sets. Note that $$J$$ does not depend on the number of observations for which both true and predicted class is 0 – so it is particularly suitable for measuring the performance of methods that try to find rare events.

We can also consider probabilistic class predictions, which come in the form $$\hat{P}(Y\,|\,X)$$. In this case, a possible risk function would be obtained by looking at distances between the true probability distribution and the estimated probability distributions. For two classes, the finite sample version of the $$\log \text{loss}$$ is

$$$\log \text{loss} = -\frac{1}{n}\sum_{i=1}^n y_i\log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i), \tag{12.4}$$$

where $$\hat{p}_i \in [0,1]$$ is the prediction, and $$y_i\in\{0,1\}$$ is the truth.

Note that the $$\log \text{loss}$$ will be infinite if a prediction is totally confident ($$\hat{p}_i$$ is exactly $$0$$ or $$1$$) but wrong.

For continuous continuous response variables (regression), a natural choice is the mean squared error (MSE). It is the average squared error,

$$$\widehat{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n ( \hat{Y}_i - Y_i )^2. \tag{12.5}$$$

The population version is defined analogously, by turning the summation into an integral as in Equations (12.1) and (12.2).

Statisticians call functions like Equations ((12.1)(12.5)) variously (and depending on context and predisposition) risk function, cost function, objective function160 There is even an R package dedicated to evaluation of statistical learners called metrics..

Figure 12.18: In the upper bull’s eye, the estimates are systematically off target, but in a quite reproducible manner. The green segment represents the bias. In the lower bull’s eye, the estimates are not biased, as they are centered in the right place, however they have high variance. We can distinguish the two scenarios since we see the result from many shots. If we only had one shot and missed the bull’s eye, we could not easily know whether that’s because of bias or variance.

An important fact that helps us understand the tradeoffs when picking a statistical learning model is that the MSE is the sum of two terms, and often the choices we can make are such that one of those terms goes down while the other one goes up. The bias measures how different the average of all the different estimates is from the truth, and variance, how much an individual one might scatter from the average value (Figure 12.18). In applications, we often only get one shot, therefore being reliably almost on target can beat being right on the long term average but really off today. The decomposition

$$$\text{MSE} = \underbrace{\text{Var}(\hat{Y})}_{\text{variance}} + \underbrace{\mathbb{E}[\hat{Y}-Y]^2}_{\text{bias}} \tag{12.6}$$$

follows by straightforward algebra.

When trying to minimize the MSE, it is important to realize that sometimes we can pay the price of a small bias to greatly reduce variance, and thus overall improve MSE. We already encountered shrinkage estimation in Chapter 8. In classification (i.e., when we have categorical response variables), different objective functions than the MSE are used, and there is usually no such straightforward decomposition as in Equation (12.6). The good news is that we can usually go even much further than in the case of continuous responses with our trading biases for variance. This is because the discreteness of the response absorbs certain biases , so that the cost of higher bias is almost zero, while we still get the benefit of better (smaller) variance.

### 12.6.1 Penalization

In high-dimensional statistics, we are constantly plagued by variance: there is just not enough data to fit all the possible parameters. One of the most fruitful ideas in high-dimensional statistics is penalization: a tool to actively control and exploit the variance-bias tradeoff. Penalization is part of a larger class of regularization methods that are used to ensure stable estimates.

Although generalization of LDA to high-dimensional settings is possible , it turns out that logistic regression is a more general approach161 It fits into the framework of generalized linear models, which we encounted in Chapter 8., and therefore we’ll now switch to that, using the glmnet package.

For multinomial—or, for the special case of two classes, binomial—logistic regression models, the posterior log-odds between $$k$$ classes and can be written in the form (see the section on Logistic Regression in the book by Hastie, Tibshirani, and Friedman (2008) for a more complete presentation):

$$$\log \frac{P(Y=i\,|\,X=x)}{P(Y=k\,|\,X=x)} = \beta^0_i + \beta_i x, \tag{12.7}$$$

where $$i=1,...,k-1$$ enumerates the different classes and the $$k$$-th class is chosen as a reference. The data matrix $$x$$ has dimensions $$n\times p$$, where $$n$$ is number of observations and $$p$$ the number of features. The $$p$$-dimensional vector $$\beta_i$$ determines how the classification odds for class $$i$$ versus class $$k$$ depend on $$x$$. The numbers $$\beta^0_i$$ are intercepts and depend, among other things, on the classes’ prior probabilities. Instead of the log odds (12.7) (i.e., ratios of class probabilities), we can also write down an equivalent model for the class probabilities themselves, and the fact that we here used the $$k$$-th class as a reference is an arbitrary choice, as the model estimates are equivariant under this choice . The model is fit by maximising the log-likelihood $$\mathcal{l}(\beta, \beta^0; x)$$, where $$\beta=(\beta_1,...,\beta_{k-1})$$ and analogously for $$\beta^0$$.

So far, so good. But as $$p$$ gets larger, there is an increasing chance that some of the estimates go wildly off the mark, due to random sampling happenstances in the data (remember Figure 12.1). This is true even if for each individual coordinate of the vector $$\beta_i$$, the error distribution is bounded: the probabilty of there being one coordinate that is in the far tails increases the more coordiates there are, i.e., the larger $$p$$ is.

A related problem can also occur, not in (12.7), but in other, non-linear models, as the model dimension $$p$$ increases while the sample size $$n$$ remains the same: the likelihood landscape around its maximum becomes increasingly flat, and the maximum-likelihood estimate of the model parameters becomes more and more variable. Eventually, the maximum is no longer a point, but a submanifold, and the maximum likelihood estimate is unidentifiable. Both of these limitations can be overcome with a modification of the objective: instead of maximising the bare log-likelihood, we maximise a penalized version of it,

$$$\hat{\beta}= \arg\max_\beta \mathcal{l}(\beta, \beta^0; x) + \lambda \operatorname{pen}(\beta), \tag{12.8}$$$

where $$\lambda\ge0$$ is a real number, and $$\operatorname{pen}$$ is a convex function, called the penalty function. Popular choices are $$\operatorname{pen}(\beta)=|\beta|^2$$ (ridge regression) and $$\operatorname{pen}(\beta)=|\beta|^1$$ (lasso).

Here, $$|\beta|^\nu=\sum_i \beta_i^\nu$$ is the $$L_\nu$$-norm of the vector $$\beta$$. Variations are possible, for instead we could include in this summation only some but not all of the elements of $$\beta$$; or we could scale different elements differently, for instance based on some prior belief of their scale and importance. In the elastic net, ridge and lasso are hybridized by using the penalty function $$\operatorname{pen}(\beta)=(1-\alpha)|\beta|^1+\alpha|\beta|^2$$ with some further parameter $$\alpha\in[0,1]$$. The crux is, of course, how to choose the right $$\lambda$$, and we will discuss that in the following.

### 12.6.2 Example: predicting colon cancer from stool microbiome composition

Zeller et al. (2014) studied metagenome sequencing data from fecal samples of 156 humans that included colorectal cancer patients and tumor-free controls. Their aim was to see whether they could identify biomarkers (presence or abundance of certain taxa) that could help with early tumor detection. The data are available from Bioconductor through its ExperimentHub service under the identifier EH361.

library("ExperimentHub")
eh = ExperimentHub()
zeller = eh[["EH361"]]
table(zeller$disease) ## ## cancer large_adenoma n small_adenoma ## 53 15 61 27 ► Question 220 Explore the eh object to see what other datasets there are. ► Solution For the following, let’s focus on the normal and cancer samples and set the adenomas aside. zellerNC = zeller[, zeller$disease %in% c("n", "cancer")]

Before jumping into model fitting, as always it’s a good idea to do some exploration of the data. First, let’s look at the sample annotations. The following code prints the data from three randomly picked samples. (Only looking at the first ones, say with the R function head, is also an option, but may not be representative of the whole dataset).

pData(zellerNC)[ sample(ncol(zellerNC), 3), ]
##                    subjectID age gender bmi country disease
## CCIS07277498ST-4-0    FR-276  63   male  NA  france       n
## CCIS90164298ST-4-0    FR-294  84   male  23  france       n
## CCIS98832363ST-4-0    FR-552  55 female  25  france  cancer
##                    tnm_stage ajcc_stage localization     fobt
## CCIS07277498ST-4-0      <NA>       <NA>         <NA>     <NA>
## CCIS90164298ST-4-0      <NA>       <NA>         <NA> negative
## CCIS98832363ST-4-0    t3n1m0        iii           lc positive
##                    wif-1_gene_methylation_test   group bodysite
## CCIS07277498ST-4-0                    negative control    stool
## CCIS90164298ST-4-0                    negative control    stool
## CCIS98832363ST-4-0                    negative     crc    stool
## CCIS07277498ST-4-0     white     66936604
## CCIS90164298ST-4-0     white     64903644
## CCIS98832363ST-4-0     white     69793122

Next, let’s explore the feature names:

We define the helper function formatfn to line wrap these long character strings for the available space here.

formatfn = function(x)
gsub("|", "| ", x, fixed = TRUE) |> lapply(strwrap)

rownames(zellerNC)[1:4]
## [1] "k__Bacteria"
## [2] "k__Viruses"
## [3] "k__Bacteria|p__Firmicutes"
## [4] "k__Bacteria|p__Bacteroidetes"
rownames(zellerNC)[nrow(zellerNC) + (-2:0)] |> formatfn()
## [[1]]
## [1] "k__Bacteria| p__Proteobacteria| c__Deltaproteobacteria|"
## [2] "o__Desulfovibrionales| f__Desulfovibrionaceae|"
## [3] "g__Desulfovibrio| s__Desulfovibrio_termitidis"
##
## [[2]]
## [1] "k__Viruses| p__Viruses_noname| c__Viruses_noname|"
## [2] "o__Viruses_noname| f__Baculoviridae|"
## [3] "g__Alphabaculovirus|"
## [4] "s__Bombyx_mori_nucleopolyhedrovirus|"
## [5] "t__Bombyx_mori_nucleopolyhedrovirus_unclassified"
##
## [[3]]
## [1] "k__Bacteria| p__Proteobacteria| c__Deltaproteobacteria|"
## [2] "o__Desulfovibrionales| f__Desulfovibrionaceae|"
## [3] "g__Desulfovibrio| s__Desulfovibrio_termitidis|"
## [4] "t__GCF_000504305"

As you can see, the features are a mixture of abundance quantifications at different taxonomic levels, from kingdom over phylum to species. We could select only some of these, but here we continue with all of them. Next, let’s look at the distribution of some of the features. Here, we show an arbitrary choice of two, number 510 and 527; in practice, it is helpful to scroll through many such plots quickly to get an impression (Figure 12.19).

Figure 12.19: Histograms of the distributions for two randomly selected features. The distributions are highly skewed, with many zero values and a thin, long tail of non-zero values.

ggplot(melt(Biobase::exprs(zellerNC)[c(510, 527), ]), aes(x = value)) +
geom_histogram(bins = 25) +
facet_wrap( ~ Var1, ncol = 1, scales = "free")

In the simplest case, we fit model (12.7) as follows.

library("glmnet")
glmfit = glmnet(x = t(Biobase::exprs(zellerNC)),
y = factor(zellerNC$disease), family = "binomial") A remarkable feature of the glmnet function is that it fits (12.7) not only for one choice of $$\lambda$$, but for all possible $$\lambda$$s at once. For now, let’s look at the prediction performance for, say, $$\lambda=0.04$$. The name of the function parameter is s: predTrsf = predict(glmfit, newx = t(Biobase::exprs(zellerNC)), type = "class", s = 0.04) table(predTrsf, zellerNC$disease)
##
## predTrsf cancer  n
##   cancer     51  0
##   n           2 61

Not bad – but remember that this is on the training data, without cross-validation. Let’s have a closer look at glmfit. The glmnet package offers a a diagnostic plot that is worth looking at (Figure 12.20).

Figure 12.20: Regularization paths for glmfit.

plot(glmfit, col = brewer.pal(8, "Dark2"), lwd = sqrt(3), ylab = "")

► Question 221

What are the $$x$$- and $$y$$-axes in Figure 12.20? What are the different lines?

► Solution

Let’s get back to the question of how to choose the parameter $$\lambda$$. We could try many different choices –and indeed, all possible choices– of $$\lambda$$, assess classification performance in each case using cross-validation, and then choose the best $$\lambda$$.

You'll already realize from the description of this strategy that if we optimize $$\lambda$$ in this way, the resulting apparent classification performance will likely be exaggerated. We need a truly independent dataset, or at least another, outer cross-validation loop to get a more realistic impression of the generalizability. We will get back to this question at the end of the chapter. We could do so by writing a loop as we did in the estimate_mcl_loocv function in Section 12.4.1. It turns out that the glmnet package already has built-in functionality for that, with the function cv.glmnet, which we can use instead.

Figure 12.21: Diagnostic plot for cv.glmnet: shown is a measure of cross-validated prediction performance, the deviance, as a function of $$\lambda$$. The dashed vertical lines show lambda.min and lambda.1se.

cvglmfit = cv.glmnet(x = t(Biobase::exprs(zellerNC)),
y = factor(zellerNC$disease), family = "binomial") plot(cvglmfit) The diagnostic plot is shown in Figure 12.21. We can access the optimal value with cvglmfit$lambda.min
## [1] 0.06376534

As this value results from finding a minimum in an estimated curve, it turns out that it is often too small, i.e., that the implied penalization is too weak. A heuristic recommended by the authors of the glmnet package is to use a somewhat larger value instead, namely the largest value of $$\lambda$$ such that the performance measure is within 1 standard error of the minimum.

embryoCellsClassifier = cv.glmnet(t(Biobase::exprs(sx)), sx$genotype, family = "binomial", type.measure = "class") plot(embryoCellsClassifier) In Figure 12.23 we see that the misclassification error is (essentially) monotonously increasing with $$\lambda$$, and is smallest for $$\lambda\to 0$$, i.e., if we apply no penalization at all. ► Question 226 What is going on with these data? ► Solution ## 12.7 A large choice of methods We have now seen three classification methods: linear discriminant analysis (lda), quadratic discriminant analysis (qda) and logistic regression using elastic net penalization (glmnet). In fact, there are hundreds of different learning algorithms162 For an introduction to the subject that uses R and provides many examples and exercises, we recommend . available in R and its add-on packages. You can get an overview in the CRAN task view Machine Learning & Statistical Learning. Some examples are: • Support vector machines: the function svm in the package e1071; ksvm in kernlab • Tree based methods in the packages rpart, tree, randomForest • Boosting methods: the functions glmboost and gamboost in package mboost • PenalizedLDA in the package PenalizedLDA, dudi.discr and dist.pcaiv in ade4). The complexity and heterogeneity of choices of learning strategies, tuning parameters and evaluation criteria in each of these packages can be confusing. You will already have noted differences in the interfaces of the lda, qda and glmnet functions, i.e., in how they expect their input data to presented and what they return. There is even greater diversity across all the other packages and functions. At the same time, there are common tasks such as cross-validation, parameter tuning and performance assessment that are more or less the same no matter what specific method is used. As you have seen, e.g., in our estimate_mcl_loocv function, the looping and data shuffling involved led to rather verbose code. So what to do if you want to try out and explore different learning algorithms? Fortunately, there are several projects that provide unified interfaces to the large number of different machine learning interfaces in R, and also try to provide “best practice” implementations of the common tasks such as parameter tuning and performance assessment. The two most well-known ones are the packages caret and mlr. Here were have a look at caret. You can get a list of supported methods through its getModelInfo function. There are quite a few, here we just show the first 8. library("caret") caretMethods = names(getModelInfo()) head(caretMethods, 8) ## [1] "ada" "AdaBag" "AdaBoost.M1" "adaboost" ## [5] "amdai" "ANFIS" "avNNet" "awnb" length(caretMethods) ## [1] 239 We will check out a neural network method, the nnet function from the eponymous package. The parameter slot informs us on the the available tuning parameters163 They are described in the manual of the nnet function.. getModelInfo("nnet", regex = FALSE)[[1]]$parameter
##   parameter   class         label
## 1      size numeric #Hidden Units
## 2     decay numeric  Weight Decay

Let’s try it out.

trnCtrl = trainControl(
method = "repeatedcv",
repeats = 3,
classProbs = TRUE)
tuneGrid = expand.grid(
size = c(2, 4, 8),
decay = c(0, 1e-2, 1e-1))
nnfit = train(
Embryonic.day ~ Fn1 + Timd2 + Gata4 + Sox7,
data = embryoCells,
method = "nnet",
tuneGrid  = tuneGrid,
trControl = trnCtrl,
metric = "Accuracy")

That’s quite a mouthful, but the nice thing is that this syntax is standardized and applies across many different methods. All you need to do specify the name of the method and the grid of tuning parameters that should be explored via the tuneGrid argument.

Now we can have a look at the output (Figure 12.25).

nnfit
## Neural Network
##
## 66 samples
##  4 predictor
##  3 classes: 'E3.25', 'E3.5', 'E4.5'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 60, 59, 60, 59, 59, 61, ...
## Resampling results across tuning parameters:
##
##   size  decay  Accuracy   Kappa
##   2     0.00   0.7177778  0.4252365
##   2     0.01   0.7850397  0.6022012
##   2     0.10   0.7705556  0.5658417
##   4     0.00   0.7484524  0.5291407
##   4     0.01   0.7517063  0.5297076
##   4     0.10   0.7792857  0.5811742
##   8     0.00   0.7649603  0.5968491
##   8     0.01   0.7846825  0.6006144
##   8     0.10   0.7608333  0.5475901
##
## Accuracy was used to select the optimal model using the
##  largest value.
## The final values used for the model were size = 2 and decay
##  = 0.01.

Figure 12.25: Parameter tuning of the neural net by cross-validation.

plot(nnfit)
predict(nnfit) |> head(10)
##  [1] E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25
## Levels: E3.25 E3.5 E4.5

► Question 227

Will the accuracy that we obtained above for the optimal tuning parameters generalize to a new dataset? What could you do to address that?

► Solution

### 12.7.1 Method hacking

In Chapter 6 we encountered p-value hacking. A similar phenomenon exists in statistical learning: given a dataset, we explore various different methods of preprocessing (such as normalization, outlier detection, transformation, feature selection), try out different machine learning algorithms and tune their parameters until we are content with the result. The measured accuracy is likely to be too optimistic, i.e., will not generalize to a new dataset. Embedding as many of our methodical choices into a computational formalism and having an outer cross-validation loop (not to be confused with the inner loop that does the parameter tuning) will ameliorate the problem. But is unlikely to address it completely, since not all our choices can be formalized.

The gold standard remains validation on truly unseen data. In addition, it is never a bad thing if the classifier is not a black box but can be interpreted in terms of domain knowledge. Finally, report not just summary statistics, such as misclassification rates, but lay open the complete computational workflow, so that anyone (including your future self) can convince themselves of the robustness of the result or of the influence of the preprocessing, model selection and tuning choices .

## 12.8 Summary of this chapter

We have seen examples of machine learning applications; we have focused on predicting categorical variables (like diabetes type or cell class). Predicting continuous outcomes is also part of machine learning, although we have not considered it here. There are many parallels and overlaps between machine learning and statistical regression (which we studied in Chapter 8). One can consider them two different names for pretty much the same activity, although each has its own flavors: in machine learning, the emphasis is on the prediction of the outcome variables, whereas in regression we often care at least as much about the role of the covariates – which of them have an effect on the outcome, and what is the nature of these effects? In other words, we do not only want predictions, we also want to understand them.

We saw linear and quadratic discriminant analysis, two intuitive methods for partitioning a two-dimensional data plane (or a $$p$$-dimensional space) into regions using either linear or quadratic separation lines (or hypersurfaces). We also saw logistic regression, which takes a slightly different approach but is more amenable to operating in higher dimensions and to regularization.

We encountered the main challenge of machine learning: how to avoid overfitting? We explored why overfitting happens in the context of the so-called curse of dimensionality, and we learned how it may be overcome using regularization.

In other words, machine learning would be easy if we had infinite amounts of data representatively covering the whole space of possible inputs and outputs164 It would “just” be a formidable database / data management problem.. The challenge is to make the best out of a finite amount of training data, and to generalize these to new, unseen inputs. There is a vigorous trade-off between the amount, resolution and coverage of training data and the complexity of the model. Many models have continuous parameters that enable us to “tune” their complexity or the strength of their regularization. Cross-validation can help us with such tuning, although it is not a panacea, and caveats apply, as we saw in Section 12.6.3.

## 12.10 Exercises

► Exercise 12.1 Apply a kernel support vector machine, available in the kernlab package, to the zeller microbiome data. What kernel function works well?

► Exercise 12.2 Use glmnet for a prediction of a continous variable, i.e., for regression. Use the prostate cancer data from Chapter 3 of . The data are available in the CRAN package ElemStatLearn. Explore the effects of using ridge versus lasso penalty.

► Exercise 12.3 Consider smoothing as a regression and model selection problem (remember Figure 12.1). What is the equivalent quantity to the penalization parameter $$\lambda$$ in Equation (12.8)? How do you choose it?

► Solution

We refer to Chapter 5 of

► Exercise 12.4 Scale invariance. Consider a rescaling of one of the features in the (generalized) linear model (12.7). For instance, denote the $$\nu$$-th column of $$x$$ by $$x_{\cdot\nu}$$, and suppose that $$p\ge2$$ and that we rescale $$x_{\cdot\nu} \mapsto s\, x_{\cdot\nu}$$ with some number $$s\neq0$$. What will happen to the estimate $$\hat{\beta}$$ from Equation (12.8) in (a) the unpenalized case ($$\lambda=0$$) and (b) the penalized case ($$\lambda>0$$)?

► Solution

In the unpenalized case, the estimates will be scaled by $$1/s$$, so that the resulting model is, in effect, the same. In the penalized case, the penalty from the $$\nu$$-th component of $$\beta$$ will be different. If $$|s|>1$$, the amplitude of the feature is increased, smaller $$\beta$$-components are required for it to have the same effect in the prediction, and therefore the feature is more likely to receive a non-zero and/or larger estimate, possibly on the cost of the other features; conversely for $$|s|<1$$. Regular linear regression is scale-invariant, whereas penalized regression is scale-dependent. It’s important to remember this when interpreting penalized model fits.

► Exercise 12.5 It has been quipped that all classification methods are just refinements of two archetypal ideas: discriminant analysis and $$k$$ nearest neighbors. In what sense might that be a useful classification?

► Solution

In linear discriminant analysis, we consider our objects as elements of $$\mathbb{R}^p$$, and the learning task is to define regions in this space, or boundary hyperplanes between them, which we use to predict the class membership of new objects. This is archetypal for classification by partition. Generalizations of linear discriminant analysis permit more general spaces and more general boundary shapes.

In $$k$$ nearest neighbors, no embedding into a coordinate space is needed, but instead we require a distance (or dissimilarity) measure that can be computed between each pair of objects, and the classification decision for a new object depends on its distances to the training objects and their classes. This is archetypal for kernel-based methods.

Page built: 2022-08-09 using R version 4.2.0 (2022-04-22)

Support for maintaining the online version of this book is provided by de.NBI