In a supervised learning setting, we have a yardstick or plumbline to judge how well we are doing: the response itself.

A frequent question in biological and biomedical applications is whether a property of interest (say, disease type, cell type, the prognosis of a patient) can be “predicted”, given one or more other properties, called the **predictors**. Often we are motivated by a situation in which the property to be predicted is unknown (it lies in the future, or is hard to measure), while the predictors are known. The crucial point is that we *learn* the prediction rule from a set of *training data* in which the property of interest is also known. Once we have the rule, we can either apply it to new data, and make actual predictions of unknown outcomes; or we can dissect the rule with the aim of better understanding the underlying biology.

Compared to unsupervised learning and what we have seen in Chapters 5, 7 and 9, where we do not know what we are looking for or how to decide whether our result is “right”, we are on much more solid ground with supervised learning: the objective is clearly stated, and there are straightforward criteria to measure how well we are doing.

The central issues in **supervised learning**154 Sometimes the term **statistical learning** is used, more or less exchangeably. are **overfitting** and **generalizability**: did we just learn the training data “by heart” by constructing a rule that has 100% accuracy on the training data, but would perform poorly on any new data? Or did our rule indeed pick up some of the pertinent patterns in the system being studied, which will also apply to yet unseen new data? (Figure 12.1)

Figure 12.1: An example for **overfitting**: two regression lines are fit to data in the \((x, y)\)-plane (black points). We can think of such a line as a rule that predicts the \(y\)-value, given an \(x\)-value. Both lines are smooth, but the fits differ in what is called their **bandwidth**, which intuitively can be interpreted their stiffness. The blue line seems overly keen to follow minor wiggles in the data, while the orange line captures the general trend but is less detailed. The effective number of parameters needed to describe the blue line is much higher than for the orange line. Also, if we were to obtain additional data, it is likely that the blue line would do a **worse** job than the orange line in modeling the new data. We’ll formalize these concepts –training error and test set error– later in this chapter. Although exemplified here with line fitting, the concept applies more generally to prediction models.

In this chapter we will:

See exemplary applications that motivate the use of supervised learning methods.

Learn what discriminant analysis does.

Define measures of performance.

Encounter the curse of dimensionality and see what overfitting is.

Find out about regularization – in particular, penalization – and understand the concepts of generizability and model complexity.

See how to use cross-validation to tune parameters of algorithms.

Discuss method hacking.

The basic data structure for both supervised and unsupervised learning is (at least conceptually) a dataframe, where each row corresponds to an object and the columns are different features (usually numerical values) of the objects155 This is a simplified description. Machine learning is a huge field, and lots of generalizations of this simple conceptual picture have been made. Already the construction of relevant features is an art by itself — we have seen examples with images of cells in Chapter 11, and more generally there are lots of possibilities to extract features from images, sounds, movies, free text, \(...\) Moreover, there is a variant of machine learning methods called **kernel methods** that do not need features at all; instead, kernel methods use distances or measures of similarity between objects. It may be easier, for instance, to define a measure of similarity between two natural language text objects than to find relevant numerical features to represent them. Kernel methods are beyond the scope of this book.. While in unsupervised learning we aim to find (dis)similarity relationships between the objects based on their feature values (e.g., by clustering or ordination), in supervised learning we aim to find a mathematical function (or a computational algorithm) that predicts the value of one of the features from the other features. Many implementations require that there are no missing values, whereas other methods can be made to work with some amount of missing data.

The feature that we select over all the others with the aim of predicting is called the **objective** or the **response**. Sometimes the choice is natural, but sometimes it is also instructive to reverse the roles, especially if we are interested in dissecting the prediction function for the purpose of biological understanding, or in disentangling correlations from causation.

The framework for supervised learning covers both continuous and categorical response variables. In the continuous case we also call it **regression**, in the categorical case, **classification**. It turns out that this distinction is not a detail, as it has quite far-reaching consequences for the choice of loss function (Section 12.5) and thus the choice of algorithm (Friedman 1997).

The first question to consider in any supervised learning task is how the number of objects compares to the number of predictors. The more objects, the better, and much of the hard work in supervised learning has to do with overcoming the limitations of having a finite (and typically, too small) training set.

Figure 12.2: In supervised learning, we assign two different roles to our variables. We have labeled the explanatory variables \(X\) and the response variable(s) \(Y\). There are also two different sets of observations: the training set \(X_\ell\) and \(Y_\ell\) and the test set \(X_v\) and \(Y_v\). (The subscripts refer to alternative names for the two sets: “learning” and “validation”.)

► Task

Give examples where we have encountered instances of supervised learning with a categorical response in this book.

The `diabetes`

dataset (Reaven and Miller 1979) presents three different groups of diabetes patients and five clinical variables measured on them.

```
data("diabetes", package = "rrcov")
head(diabetes)
```

```
## rw fpg glucose insulin sspg group
## 1 0.81 80 356 124 55 normal
## 2 0.95 97 289 117 76 normal
## 3 0.94 105 319 143 105 normal
## 4 1.04 90 356 199 108 normal
## 5 1.00 90 323 240 143 normal
## 6 0.76 86 381 157 165 normal
```

The univariate distributions (more precisely, some density estimates of them) are shown in Figure 12.3.

Figure 12.3: We see already from the one-dimensional distributions that some of the individual variables could potentially predict which group a patient is more likely to belong to. Our goal is to combine variables to improve over such one-dimensional prediction models.

```
library("ggplot2")
library("reshape2")
ggplot(melt(diabetes, id.vars = "group"), aes(x = value, col = group)) +
geom_density() + facet_wrap( ~variable, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
```

The variables are explained in the manual page of the dataset, and in the paper (Reaven and Miller 1979):

rw: relative weight

fpg: fasting plasma glucose

glucose: area under the plasma glucose curve for the three hour oral glucose tolerance test (OGTT)

insulin: area under the plasma insulin curve for the OGTT

sspg: steady state plasma glucose response

group: normal, chemical diabetes and overt diabetes

Neumann et al. (2010) observed human cancer cells using live-cell imaging. The cells were genetically engineered so that their histones were tagged with a green fluorescent protein (GFP). A genome-wide RNAi library was applied to the cells, and for each siRNA perturbation, movies of a few hundred cells were recorded for about two days, to see what effect the depletion of each gene had on cell cycle, nuclear morphology and cell proliferation. Their paper reports the use of an automated image classification algorithm that quantified the visual appearance of each cell’s nucleus and enabled the prediction of normal mitosis states or aberrant nuclei. The algorithm was trained on the data from around 3000 cells that were annotated by a human expert. It was then applied to almost 2 billions images of nuclei (Figure 12.4). Using automated image classification provided scalablity (annotating 2 billion images manually would take a long time) and objectivity.

We start with one of the simplest possible discrimination problems156 Arguably the simplest possible problem is a single continuous feature, two classes, and the task of finding a single threshold to discriminate between the two groups – as in Figure 6.2.: we have objects described by two continuous features (so the objects can be thought of as points in the 2D plane) and falling into three groups. Our aim is to define class boundaries, which are lines in the 2D space.

Let’s see whether we can predict the `group`

from the `sspg`

and `glucose`

variables in the `diabetes`

data. It’s always a good idea to first visualise the data (Figure 12.5).

Figure 12.5: Scatterplot of two of the variables in the `diabetes`

data. Each point is a sample, and the color indicates the diabetes type as encoded in the `group`

variable.

```
= ggplot(mapping = aes(x = sspg, y = glucose)) +
ggdb geom_point(aes(colour = group), data = diabetes)
ggdb
```

We’ll start with a method called **linear discriminant analysis** (**LDA**). This method is a foundation stone of classification, many of the more complicated (and sometimes more powerful) algorithms are really just generalizations of LDA.

```
library("MASS")
= lda(group ~ sspg + glucose, data = diabetes)
diabetes_lda diabetes_lda
```

```
## Call:
## lda(group ~ sspg + glucose, data = diabetes)
##
## Prior probabilities of groups:
## normal chemical overt
## 0.5241379 0.2482759 0.2275862
##
## Group means:
## sspg glucose
## normal 114.0000 349.9737
## chemical 208.9722 493.9444
## overt 318.8788 1043.7576
##
## Coefficients of linear discriminants:
## LD1 LD2
## sspg 0.005036943 -0.01539281
## glucose 0.005461400 0.00449050
##
## Proportion of trace:
## LD1 LD2
## 0.9683 0.0317
```

```
= predict(diabetes_lda)$class
ghat table(ghat, diabetes$group)
```

```
##
## ghat normal chemical overt
## normal 69 12 1
## chemical 7 24 6
## overt 0 0 26
```

`mean(ghat != diabetes$group)`

`## [1] 0.1793103`

► Question 205

What do the different parts of the above output mean?

Now, let’s visualise the LDA result. We are going to plot the prediction regions for each of the three groups. We do this by creating a grid of points and using our prediction rule on each of them. We’ll then also dig a bit deeper into the mechanics of LDA and plot the class centers (`diabetes_lda$means`

) and ellipses that correspond to the fitted covariance matrix (`diabetes_lda$scaling`

). Assembling this visualization requires us to write a bit of code.

```
= function(x) {
make1Dgrid = grDevices::extendrange(x)
rg seq(from = rg[1], to = rg[2], length.out = 100)
}
```

Set up the points for prediction, a \(100 \times 100\) grid that covers the data range.

```
= with(diabetes,
diabetes_grid expand.grid(sspg = make1Dgrid(sspg),
glucose = make1Dgrid(glucose)))
```

Do the predictions.

```
$ghat =
diabetes_gridpredict(diabetes_lda, newdata = diabetes_grid)$class
```

The group centers.

`= diabetes_lda$means centers `

Compute the ellipse. We start from a unit circle (approximated by a polygon with 360 sides) and apply the corresponding affine transformation from the LDA output.

```
= exp(1i * seq(0, 2*pi, length.out = 360)) |>
unitcircle cbind(Re(z), Im(z)))()
(\(z) = unitcircle %*% solve(diabetes_lda$scaling) |> as_tibble() ellipse
```

All three ellipses, one for each group center.

```
= lapply(rownames(centers), function(gr) {
ellipses mutate(ellipse,
sspg = sspg + centers[gr, "sspg"],
glucose = glucose + centers[gr, "glucose"],
group = gr)
|> bind_rows() })
```

Now we are ready to plot (Figure 12.6).

Figure 12.6: As Figure 12.5, with the classification regions from the LDA model shown. The three ellipses represent the class centers and the covariance matrix of the LDA model; note that there is only one covariance matrix, which is the same for all three classes. Therefore also the sizes and orientations of the ellipses are the same for the three classes, only their centers differ. They represent contours of equal class membership probability.

```
+ geom_raster(aes(fill = ghat),
ggdb data = diabetes_grid, alpha = 0.25, interpolate = TRUE) +
geom_point(data = as_tibble(centers), pch = "+", size = 8) +
geom_path(aes(colour = group), data = ellipses) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
```

► Question 206

Why is the boundary between the prediction regions for *chemical* and *overt* not perpendicular to the line between the group centers?

► Solution

► Question 207

How confident would you be about the predictions in those areas of the 2D plane that are far from all of the cluster centers?

► Solution

► Question 208

Why is the boundary between the prediction regions for *normal* and *chemical* not half-way between the centers, but shifted in favor of *normal*? Hint: have a look at the `prior`

argument of `lda`

. Try again with uniform prior.

► Solution

► Question 209

► Solution

► Question 210

What is the difference in the prediction accuracy if we use all 5 variables instead of just `glucose`

and `sspg`

?

► Solution

► Question 211

Instead of approximating the prediction regions by classification from a grid of points, compute the separating lines explicitly from the linear determinant coefficients.

► Solution

Assume that we already know that the four genes *FN1*, *TIMD2*, *GATA4* and *SOX7* are relevant to the classification task157 Later in this chapter we will see methods that can drop this assumption and screen all available features.. We want to build a classifier that predicts the developmental time (embryonic days: E3.25, E3.5, E4.5). We load the data and select four corresponding probes.

```
library("Hiiragi2013")
data("x")
= c("1426642_at", "1418765_at", "1418864_at", "1416564_at")
probes = t(Biobase::exprs(x)[probes, ]) |> as_tibble() |>
embryoCells mutate(Embryonic.day = x$Embryonic.day) |>
::filter(x$genotype == "WT") dplyr
```

We can use the Bioconductor annotation package associated with the microarray to verify that the probes correspond to the intended genes.

`annotation(x)`

`## [1] "mouse4302"`

```
library("mouse4302.db")
= AnnotationDbi::select(mouse4302.db, keys = probes,
anno columns = c("SYMBOL", "GENENAME"))
anno
```

```
## PROBEID SYMBOL
## 1 1426642_at Fn1
## 2 1418765_at Timd2
## 3 1418864_at Gata4
## 4 1416564_at Sox7
## GENENAME
## 1 fibronectin 1
## 2 T cell immunoglobulin and mucin domain containing 2
## 3 GATA binding protein 4
## 4 SRY (sex determining region Y)-box 7
```

```
= match(anno$PROBEID, colnames(embryoCells))
mt colnames(embryoCells)[mt] = anno$SYMBOL
```

Now we are ready to visualize the data in a pairs plot (Figure 12.8).

```
library("GGally")
ggpairs(embryoCells, mapping = aes(col = Embryonic.day),
columns = anno$SYMBOL, upper = list(continuous = "points"))
```

We can now call `lda`

on these data. The linear combinations `LD1`

and `LD2`

that serve as discriminating variables are given in the slot `ed_lda$scaling`

of the output from `lda`

.

```
= lda(Embryonic.day ~ Fn1 + Timd2 + Gata4 + Sox7,
ec_lda data = embryoCells)
round(ec_lda$scaling, 1)
```

```
## LD1 LD2
## Fn1 -0.2 -0.4
## Timd2 0.5 0.0
## Gata4 -0.1 -0.6
## Sox7 -0.7 0.5
```

For the visualization of the learned model in Figure 12.9, we need to build the prediction regions and their boundaries by expanding the grid in the space of the two new coordinates LD1 and LD2.

Figure 12.9: LDA classification regions for Embryonic.day.

```
= predict(ec_lda)$x |> as_tibble() |>
ec_rot mutate(ed = embryoCells$Embryonic.day)
= lda(ec_rot[, 1:2], predict(ec_lda)$class)
ec_lda2 = with(ec_rot, expand.grid(
ec_grid LD1 = make1Dgrid(LD1),
LD2 = make1Dgrid(LD2)))
$edhat = predict(ec_lda2, newdata = ec_grid)$class
ec_gridggplot() +
geom_point(aes(x = LD1, y = LD2, colour = ed), data = ec_rot) +
geom_raster(aes(x = LD1, y = LD2, fill = edhat),
data = ec_grid, alpha = 0.4, interpolate = TRUE) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
coord_fixed()
```

► Question 212

Repeat these analyses using quadratic discriminant analysis (`qda`

). What difference do you see in the shape of the boundaries?

► Solution

► Question 213

What happens if you call `lda`

or `qda`

with a lot more genes, say the first 1000, in the Hiiragi dataset?

► Solution

Computers are really good at memorizing facts. In the worst case, a machine learning algorithm is a roundabout way of doing this158 The not-so roundabout way is database technologies.. The central goal in statistical learning, however, is *generalizability*. We want an algorithm that is able to generalize, i.e., interpolate and extrapolate from given data to make good predictions about future data.

Let’s look at the following example. We generate random data (`rnorm`

) for `n`

objects, with different numbers of features (given by `p`

). We train a LDA on these data and compute the **misclassification rate**, i.e., the fraction of times the prediction is wrong (`pred != resp`

).

```
library("dplyr")
= 2:21
p = 20
n
= lapply(p, function(pp) {
mcl replicate(100, {
= matrix(rnorm(n * pp), nrow = n)
xmat = sample(c("apple", "orange"), n, replace = TRUE)
resp = lda(xmat, resp)
fit = predict(fit)$class
pred mean(pred != resp)
|> mean() |> (\(x) tibble(mcl = x, p = pp))()
}) |> bind_rows() })
```

Figure 12.11: Misclassification rate of LDA applied to random data. While the number of observations `n`

is held constant (at 20), we are increasing the number of features `p`

starting from 2 up to 21. The misclassification rate becomes almost zero as `p`

approaches 20. The LDA model becomes so elaborate and over-parameterized that it manages to learn the random labels “by heart”. (As `p`

becomes even larger, the “performance” degrades again somewhat, apparently due to numerical properties of the `lda`

implementation used here.)

```
ggplot(mcl, aes(x = p, y = mcl)) + geom_line() + geom_point() +
ylab("Misclassification rate")
```

► Question 214

What is the purpose of the `replicate`

loop in the above code? What happens if you omit it (or replace the 100 by 1)?

► Solution

Figure 12.11 seems to imply that we can perfectly predict random labels from random data, if we only fit a complex enough model, i.e., one with many parameters. How can we overcome such an absurd conclusion? The problem with the above code is that the model performance is evaluated on the same data on which it was trained. This generally leads to positive bias, as you see in this crass example. How can we overcome this problem? The key idea is to assess model performance on different data than those on which the model was trained.

A naive approach might be to split the data in two halves, and use the first half for learning (“training”) and the second half for assessment (“testing”). It turns out that this is needlessly variable and needlessly inefficient. It is needlessly variable, since by splitting the data only once, our results can be quite affected by how the split happens to fall. It seems better to do the splitting many times, and average. This will give us more stable results. It is needlessly inefficient, since the performance of machine learning algorithms depends on the number of observations, and the performance measured on half the data is likely159 Unless we have such an excess of data that it doesn’t matter. to be worse than what it is with all the data. For this reason, it is better to use unequal sizes of training and test data. In the extreme case, we’ll use as much as \(n-1\) observations for training, and the remaining one for testing. After we’ve done this likewise for all observations, we can average our performance metric. This is called **leave-one-out cross-validation**.

See Chapter *Model Assessment and Selection* in the book by Hastie, Tibshirani, and Friedman (2008) for further discussion on these trade-offs.

An alternative is **\(k\)-fold cross-validation**, where the observations are repeatedly split into a training set of size of around \(n(k-1)/k\) and a test set of size of around \(n/k\). Both alternatives have pros and contras, and there is not a universally best choice. An advantage of leave-one-out is that the amount of data used for training is close to the maximally available data; this is especially important if the sample size is limiting and “every little matters” for the algorithm. A drawback of leave-one-out is that the training sets are all very similar, so they may not model sufficiently well the kind of sampling changes to be expected if a new dataset came along. For large \(n\), leave-one-out cross-validation can be needlessly time-consuming.

```
= function(x, resp) {
estimate_mcl_loocv vapply(seq_len(nrow(x)), function(i) {
= lda(x[-i, ], resp[-i])
fit = predict(fit, newdata = x[-i,, drop = FALSE])$class
ptrn = predict(fit, newdata = x[ i,, drop = FALSE])$class
ptst c(train = mean(ptrn != resp[-i]), test = (ptst != resp[i]))
FUN.VALUE = numeric(2)) |> rowMeans() |> t() |> as_tibble()
},
}
= matrix(rnorm(n * last(p)), nrow = n)
xmat = sample(c("apple", "orange"), n, replace = TRUE)
resp
= lapply(p, function(k) {
mcl estimate_mcl_loocv(xmat[, 1:k], resp)
|> bind_rows() |> data.frame(p) |> melt(id.var = "p") })
```

Figure 12.12: Cross-validation: the misclassification rate of LDA applied to random data, when evaluated on test data that were not used for learning, hovers around 0.5 independent of `p`

. The misclassification rate on the training data is also shown. It behaves similar to what we already saw in Figure 12.11.

```
ggplot(mcl, aes(x = p, y = value, col = variable)) + geom_line() +
geom_point() + ylab("Misclassification rate")
```

The result is show in Figure 12.12.

► Question 215

► Solution

In Section 12.4.1 we have seen overfitting and cross-validation on random data, but how does it look if there is in fact a relevant class separation?

```
= 2:20
p = replicate(100, {
mcl = matrix(rnorm(n * last(p)), nrow = n)
xmat = sample(c("apple", "orange"), n, replace = TRUE)
resp 1:6] = xmat[, 1:6] + as.integer(factor(resp))
xmat[,
lapply(p, function(k) {
estimate_mcl_loocv(xmat[, 1:k], resp)
|> bind_rows() |> cbind(p = p) |> melt(id.var = "p")
}) simplify = FALSE) |> bind_rows() },
```

Figure 12.13: As we increase the number of features included in the model, the misclassification rate initially improves; as we start including more and more irrelevant features, it increases again, as we are fitting noise.

```
= group_by(mcl, p, variable) |> summarise(value = mean(value))
mcl
ggplot(mcl, aes(x = p, y = value, col = variable)) + geom_line() +
geom_point() + ylab("Misclassification rate")
```

Figure 12.14: Idealized version of Figure 12.13, from Hastie, Tibshirani, and Friedman (2008). A recurrent goal in machine learning is finding the sweet spot in the variance–bias trade-off.

The result is shown in Figure 12.13. The group centers are the vectors (in \(\mathbb{R}^{20}\)) given by the coordinates \((1, 1, 1, 1, 1, 1, 0, 0, 0, ...)\) (apples) and \((2, 2, 2, 2, 2, 2, 0, 0, 0, ...)\) (oranges), and the optimal decision boundary is the hyperplane orthogonal to the line between them. For \(p\) smaller than \(6\), the decision rule cannot reach this hyperplane – it is biased. As a result, the misclassification rate is suboptimal, and it decreases with \(p\). But what happens for \(p\) larger than \(6\)? The algorithm is, in principle, able to model the optimal hyperplane, and it should not be distracted by the additional features. The problem is that it is. The more additional features enter the dataset, the higher the probability that one or more of them happen to fall in a way that they *look like* good, discriminating features in the training data – only to mislead the classifier and degrade its performance in the test data. Shortly we’ll see how to use penalization to (try to) control this problem.

The term **curse of dimensionality** was coined by Bellman (1961). It refers to the fact that high-dimensional spaces are very hard, if not impossible, to sample thoroughly: for instance, to cover a 2-dimensional square of side length 1 with grid points that are 0.1 apart, we need \(10^2=100\) points. In 100 dimensions, we need \(10^{100}\) – which is already more than the number of protons in the universe. In genomics, we often aim to fit models to data with thousands of features. Also our intuitions about distances between points or about the relationship between a volume and its surface break down in a high-dimensional settings. We’ll explore some of the weirdnesses of high-dimensional spaces in the next few questions.

► Question 216

Assume you have a dataset with 1 000 000 data points in \(p\) dimensions. The data are uniformly distributed in the unit hybercube (i.e., all features lie in the interval \([0,1]\)). What’s the side length of a hybercube that can be expected to contain just 10 of the points, as a function of \(p\)?

► Solution

Next, let’s look at the relation between inner regions of the feature space versus its boundary regions. Generally speaking, prediction at the boundaries of feature space is more difficult than in its interior, as it tends to involve extrapolation, rather than interpolation. In the next question you’ll see how this difficulty explodes with feature space dimension.

► Question 217

What fraction of a unit cube’s total volume is closer than 0.01 to any of its surfaces, as a function of the dimension?

► Solution

► Question 218

What is the coefficient of variation (ratio of standard deviation over average) of the distance between two randomly picked points in the unit hypercube, as a function of the dimension?

► Solution

We’ve already seen the **misclassification rate** (MCR) used to assess our classification performance in Figures 12.11–12.13. Its population version is defined as

\[\begin{equation}
\text{MCR} = \text{E}\left[ 𝟙_{\hat{y} \neq y} \right],
\tag{12.1}
\end{equation}\]

and for a finite sample

\[\begin{equation}
\widehat{\text{MCR}} = \frac{1}{n}\sum_{i=1}^n 𝟙_{\hat{y_i} \neq y_i}.
\tag{12.2}
\end{equation}\]

This is not the only choice we could make. Perhaps we care more about the misclassification of apples as oranges than vice versa, and we can reflect this by introducing weights that depend on the type of error made into the sum of Equation (12.2) (or the integral of Equation (12.1)). This can get even more elaborate if we have more than two classes. Often we want to see the whole **confusion table**, which we can get via

`table(truth, response)`

An important special case is binary classification with asymmetric costs – think about, say, a medical test. Here, the **sensitivity** (a.k.a. **true positive rate** or **recall**) is related to the misclassification of healthy as ill, and the **specificity** (or **true negative rate**) depends on the probability of misclassification of ill as healthy. Often, there is a single parameter (e.g., a threshold) that can be moved up and down, allowing a trade-off between sensitivity and specificity (and thus, equivalently, between the two types of misclassification). In those cases, we usually are not content to know the classifier performance at one single choice of threshold, but at many (or all) of them. This leads to **receiver operating characteristic** (**ROC**) or **precision-recall** curves.

► Question 219

What are the exact relationships between the per-class misclassification rates and sensitivity and specificity?

► Solution

Another cost function can be computed from the **Jaccard index**, which we already saw in Chapter 5.

\[\begin{equation}
J(A,B) = \frac{|\,A\cap B\,|}{|\,A\cup B\,|},
\tag{12.3}
\end{equation}\]

where \(A\) is the set of observations for which the true class is 1 (\(A=\{i\,|\,y_i=1\}\)) and \(B\) is the set of observations for which the predicted class is 1. The number \(J\) is between 0 and 1, and when \(J\) is large, it indicates high overlap of the two sets. Note that \(J\) does not depend on the number of observations for which both true and predicted class is 0 – so it is particularly suitable for measuring the performance of methods that try to find rare events.

We can also consider probabilistic class predictions, which come in the form \(\hat{P}(Y\,|\,X)\). In this case, a possible risk function would be obtained by looking at distances between the true probability distribution and the estimated probability distributions. For two classes, the finite sample version of the \(\log \text{loss}\) is

\[\begin{equation}
\log \text{loss} = -\frac{1}{n}\sum_{i=1}^n y_i\log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i),
\tag{12.4}
\end{equation}\]

where \(\hat{p}_i \in [0,1]\) is the prediction, and \(y_i\in\{0,1\}\) is the truth.

Note that the \(\log \text{loss}\) will be infinite if a prediction is totally confident (\(\hat{p}_i\) is exactly \(0\) or \(1\)) but wrong.

For continuous continuous response variables (regression), a natural choice is the **mean squared error** (**MSE**). It is the average squared error,

\[\begin{equation}
\widehat{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n ( \hat{Y}_i - Y_i )^2.
\tag{12.5}
\end{equation}\]

The population version is defined analogously, by turning the summation into an integral as in Equations (12.1) and (12.2).

Statisticians call functions like Equations ((12.1)–(12.5)) variously (and depending on context and predisposition) **risk function**, **cost function**, **objective function**160 There is even an R package dedicated to evaluation of statistical learners called **metrics**..

Figure 12.18: In the upper bull’s eye, the estimates are systematically off target, but in a quite reproducible manner. The green segment represents the bias. In the lower bull’s eye, the estimates are not biased, as they are centered in the right place, however they have high variance. We can distinguish the two scenarios since we see the result from many shots. If we only had one shot and missed the bull’s eye, we could not easily know whether that’s because of bias or variance.

An important fact that helps us understand the tradeoffs when picking a statistical learning model is that the MSE is the sum of two terms, and often the choices we can make are such that one of those terms goes down while the other one goes up. The bias measures how different the average of all the different estimates is from the truth, and variance, how much an individual one might scatter from the average value (Figure 12.18). In applications, we often only get one shot, therefore being reliably almost on target can beat being right on the long term average but really off today. The decomposition

\[\begin{equation}
\text{MSE} = \underbrace{\text{Var}(\hat{Y})}_{\text{variance}}
+ \underbrace{\mathbb{E}[\hat{Y}-Y]^2}_{\text{bias}}
\tag{12.6}
\end{equation}\]

follows by straightforward algebra.

When trying to minimize the MSE, it is important to realize that sometimes we can pay the price of a small bias to greatly reduce variance, and thus overall improve MSE. We already encountered shrinkage estimation in Chapter 8. In classification (i.e., when we have categorical response variables), different objective functions than the MSE are used, and there is usually no such straightforward decomposition as in Equation (12.6). The good news is that we can usually go even much further than in the case of continuous responses with our trading biases for variance. This is because the discreteness of the response absorbs certain biases (Friedman 1997), so that the cost of higher bias is almost zero, while we still get the benefit of better (smaller) variance.

In high-dimensional statistics, we are constantly plagued by variance: there is just not enough data to fit all the possible parameters. One of the most fruitful ideas in high-dimensional statistics is **penalization**: a tool to actively control and exploit the variance-bias tradeoff. Penalization is part of a larger class of regularization methods that are used to ensure stable estimates.

Although generalization of LDA to high-dimensional settings is possible (Clemmensen et al. 2011; Witten and Tibshirani 2011), it turns out that logistic regression is a more general approach161 It fits into the framework of generalized linear models, which we encounted in Chapter 8., and therefore we’ll now switch to that, using the **glmnet** package.

For multinomial—or, for the special case of two classes, binomial—logistic regression models, the posterior log-odds between \(k\) classes and can be written in the form (see the section on *Logistic Regression* in the book by Hastie, Tibshirani, and Friedman (2008) for a more complete presentation):

\[\begin{equation}
\log \frac{P(Y=i\,|\,X=x)}{P(Y=k\,|\,X=x)} = \beta^0_i + \beta_i x,
\tag{12.7}
\end{equation}\]

where \(i=1,...,k-1\) enumerates the different classes and the \(k\)-th class is chosen as a reference. The data matrix \(x\) has dimensions \(n\times p\), where \(n\) is number of observations and \(p\) the number of features. The \(p\)-dimensional vector \(\beta_i\) determines how the classification odds for class \(i\) versus class \(k\) depend on \(x\). The numbers \(\beta^0_i\) are intercepts and depend, among other things, on the classes’ prior probabilities. Instead of the log odds (12.7) (i.e., ratios of class probabilities), we can also write down an equivalent model for the class probabilities themselves, and the fact that we here used the \(k\)-th class as a reference is an arbitrary choice, as the model estimates are equivariant under this choice (Hastie, Tibshirani, and Friedman 2008). The model is fit by maximising the log-likelihood \(\mathcal{l}(\beta, \beta^0; x)\), where \(\beta=(\beta_1,...,\beta_{k-1})\) and analogously for \(\beta^0\).

So far, so good. But as \(p\) gets larger, there is an increasing chance that some of the estimates go wildly off the mark, due to random sampling happenstances in the data (remember Figure 12.1). This is true even if for each individual coordinate of the vector \(\beta_i\), the error distribution is bounded: the probabilty of there being one coordinate that is in the far tails increases the more coordiates there are, i.e., the larger \(p\) is.

A related problem can also occur, not in (12.7), but in other, non-linear models, as the model dimension \(p\) increases while the sample size \(n\) remains the same: the likelihood landscape around its maximum becomes increasingly flat, and the maximum-likelihood estimate of the model parameters becomes more and more variable. Eventually, the maximum is no longer a point, but a submanifold, and the maximum likelihood estimate is unidentifiable. Both of these limitations can be overcome with a modification of the objective: instead of maximising the bare log-likelihood, we maximise a penalized version of it,

\[\begin{equation}
\hat{\beta}= \arg\max_\beta \mathcal{l}(\beta, \beta^0; x) + \lambda \operatorname{pen}(\beta),
\tag{12.8}
\end{equation}\]

where \(\lambda\ge0\) is a real number, and \(\operatorname{pen}\) is a convex function, called the **penalty function**. Popular choices are \(\operatorname{pen}(\beta)=|\beta|^2\) (**ridge regression**) and \(\operatorname{pen}(\beta)=|\beta|^1\) (**lasso**).

Here, \(|\beta|^\nu=\sum_i \beta_i^\nu\) is the \(L_\nu\)-norm of the vector \(\beta\). Variations are possible, for instead we could include in this summation only some but not all of the elements of \(\beta\); or we could scale different elements differently, for instance based on some prior belief of their scale and importance. In the **elastic net**, ridge and lasso are hybridized by using the penalty function \(\operatorname{pen}(\beta)=(1-\alpha)|\beta|^1+\alpha|\beta|^2\) with some further parameter \(\alpha\in[0,1]\). The crux is, of course, how to choose the right \(\lambda\), and we will discuss that in the following.

Zeller et al. (2014) studied metagenome sequencing data from fecal samples of 156 humans that included colorectal cancer patients and tumor-free controls. Their aim was to see whether they could identify biomarkers (presence or abundance of certain taxa) that could help with early tumor detection. The data are available from Bioconductor through its **ExperimentHub** service under the identifier EH361.

```
library("ExperimentHub")
= ExperimentHub()
eh = eh[["EH361"]] zeller
```

`table(zeller$disease)`

```
##
## cancer large_adenoma n small_adenoma
## 53 15 61 27
```

► Question 220

Explore the `eh`

object to see what other datasets there are.

► Solution

For the following, let’s focus on the normal and cancer samples and set the adenomas aside.

`= zeller[, zeller$disease %in% c("n", "cancer")] zellerNC `

Before jumping into model fitting, as always it’s a good idea to do some exploration of the data. First, let’s look at the sample annotations. The following code prints the data from three randomly picked samples. (Only looking at the first ones, say with the R function `head`

, is also an option, but may not be representative of the whole dataset).

`pData(zellerNC)[ sample(ncol(zellerNC), 3), ]`

```
## subjectID age gender bmi country disease
## CCIS07277498ST-4-0 FR-276 63 male NA france n
## CCIS90164298ST-4-0 FR-294 84 male 23 france n
## CCIS98832363ST-4-0 FR-552 55 female 25 france cancer
## tnm_stage ajcc_stage localization fobt
## CCIS07277498ST-4-0 <NA> <NA> <NA> <NA>
## CCIS90164298ST-4-0 <NA> <NA> <NA> negative
## CCIS98832363ST-4-0 t3n1m0 iii lc positive
## wif-1_gene_methylation_test group bodysite
## CCIS07277498ST-4-0 negative control stool
## CCIS90164298ST-4-0 negative control stool
## CCIS98832363ST-4-0 negative crc stool
## ethnicity number_reads
## CCIS07277498ST-4-0 white 66936604
## CCIS90164298ST-4-0 white 64903644
## CCIS98832363ST-4-0 white 69793122
```

Next, let’s explore the feature names:

We define the helper function `formatfn`

to line wrap these long character strings for the available space here.

```
= function(x)
formatfn gsub("|", "| ", x, fixed = TRUE) |> lapply(strwrap)
rownames(zellerNC)[1:4]
```

```
## [1] "k__Bacteria"
## [2] "k__Viruses"
## [3] "k__Bacteria|p__Firmicutes"
## [4] "k__Bacteria|p__Bacteroidetes"
```

`rownames(zellerNC)[nrow(zellerNC) + (-2:0)] |> formatfn()`

```
## [[1]]
## [1] "k__Bacteria| p__Proteobacteria| c__Deltaproteobacteria|"
## [2] "o__Desulfovibrionales| f__Desulfovibrionaceae|"
## [3] "g__Desulfovibrio| s__Desulfovibrio_termitidis"
##
## [[2]]
## [1] "k__Viruses| p__Viruses_noname| c__Viruses_noname|"
## [2] "o__Viruses_noname| f__Baculoviridae|"
## [3] "g__Alphabaculovirus|"
## [4] "s__Bombyx_mori_nucleopolyhedrovirus|"
## [5] "t__Bombyx_mori_nucleopolyhedrovirus_unclassified"
##
## [[3]]
## [1] "k__Bacteria| p__Proteobacteria| c__Deltaproteobacteria|"
## [2] "o__Desulfovibrionales| f__Desulfovibrionaceae|"
## [3] "g__Desulfovibrio| s__Desulfovibrio_termitidis|"
## [4] "t__GCF_000504305"
```

As you can see, the features are a mixture of abundance quantifications at different taxonomic levels, from *k*ingdom over *p*hylum to *s*pecies. We could select only some of these, but here we continue with all of them. Next, let’s look at the distribution of some of the features. Here, we show an arbitrary choice of two, number 510 and 527; in practice, it is helpful to scroll through many such plots quickly to get an impression (Figure 12.19).

Figure 12.19: Histograms of the distributions for two randomly selected features. The distributions are highly skewed, with many zero values and a thin, long tail of non-zero values.

```
ggplot(melt(Biobase::exprs(zellerNC)[c(510, 527), ]), aes(x = value)) +
geom_histogram(bins = 25) +
facet_wrap( ~ Var1, ncol = 1, scales = "free")
```

In the simplest case, we fit model (12.7) as follows.

```
library("glmnet")
= glmnet(x = t(Biobase::exprs(zellerNC)),
glmfit y = factor(zellerNC$disease),
family = "binomial")
```

A remarkable feature of the `glmnet`

function is that it fits (12.7) not only for one choice of \(\lambda\), but for all possible \(\lambda\)s at once. For now, let’s look at the prediction performance for, say, \(\lambda=0.04\). The name of the function parameter is `s`

:

```
= predict(glmfit, newx = t(Biobase::exprs(zellerNC)),
predTrsf type = "class", s = 0.04)
table(predTrsf, zellerNC$disease)
```

```
##
## predTrsf cancer n
## cancer 51 0
## n 2 61
```

Not bad – but remember that this is on the training data, without cross-validation. Let’s have a closer look at `glmfit`

. The **glmnet** package offers a a diagnostic plot that is worth looking at (Figure 12.20).

Figure 12.20: Regularization paths for `glmfit`

.

`plot(glmfit, col = brewer.pal(8, "Dark2"), lwd = sqrt(3), ylab = "")`

► Question 221

What are the \(x\)- and \(y\)-axes in Figure 12.20? What are the different lines?

► Solution

Let’s get back to the question of how to choose the parameter \(\lambda\). We could try many different choices –and indeed, all possible choices– of \(\lambda\), assess classification performance in each case using cross-validation, and then choose the best \(\lambda\).

You'll already realize from the description of this strategy that if we optimize \(\lambda\) in this way, the resulting apparent classification performance will likely be exaggerated. We need a truly independent dataset, or at least another, outer cross-validation loop to get a more realistic impression of the generalizability. We will get back to this question at the end of the chapter. We could do so by writing a loop as we did in the `estimate_mcl_loocv`

function in Section 12.4.1. It turns out that the **glmnet** package already has built-in functionality for that, with the function `cv.glmnet`

, which we can use instead.

Figure 12.21: Diagnostic plot for `cv.glmnet`

: shown is a measure of cross-validated prediction performance, the deviance, as a function of \(\lambda\). The dashed vertical lines show `lambda.min`

and `lambda.1se`

.

```
= cv.glmnet(x = t(Biobase::exprs(zellerNC)),
cvglmfit y = factor(zellerNC$disease),
family = "binomial")
plot(cvglmfit)
```

The diagnostic plot is shown in Figure 12.21. We can access the optimal value with

`$lambda.min cvglmfit`

`## [1] 0.06376534`

As this value results from finding a minimum in an estimated curve, it turns out that it is often too small, i.e., that the implied penalization is too weak. A heuristic recommended by the authors of the **glmnet** package is to use a somewhat larger value instead, namely the largest value of \(\lambda\) such that the performance measure is within 1 standard error of the minimum.

`$lambda.1se cvglmfit`

`## [1] 0.1114317`

► Question 222

How does the confusion table look like for \(\lambda=\;\)`lambda.1se`

?

► Solution

► Question 223

What features drive the classification?

► Solution

► Question 224

How do the results change if we transform the data, say, with the `asinh`

transformation as we saw in Chapter 5?

► Solution

► Question 225

Would a good classification performance on these data mean that this assay is ready for screening and early cancer detection?

► Solution

Figures 12.21 and 12.22 are textbook examples of how we expect the dependence of (cross-validated) classification performance versus model complexity (\(\lambda\)) to look. Now let’s get back to the mouse embryo cells data. We’ll try to classify the cells from embryonic day `E3.25`

with respect to their genotype.

Figure 12.23: Cross-validated misclassification error versus penalty parameter for the mouse cells data.

```
= x[, x$Embryonic.day == "E3.25"]
sx = cv.glmnet(t(Biobase::exprs(sx)), sx$genotype,
embryoCellsClassifier family = "binomial", type.measure = "class")
plot(embryoCellsClassifier)
```

In Figure 12.23 we see that the misclassification error is (essentially) monotonously increasing with \(\lambda\), and is smallest for \(\lambda\to 0\), i.e., if we apply no penalization at all.

► Question 226

What is going on with these data?

► Solution

We have now seen three classification methods: linear discriminant analysis (`lda`

), quadratic discriminant analysis (`qda`

) and logistic regression using elastic net penalization (`glmnet`

). In fact, there are hundreds of different learning algorithms162 For an introduction to the subject that uses R and provides many examples and exercises, we recommend (James et al. 2013). available in R and its add-on packages. You can get an overview in the CRAN task view Machine Learning & Statistical Learning. Some examples are:

Support vector machines: the function

`svm`

in the package**e1071**;`ksvm`

in**kernlab**Tree based methods in the packages

**rpart**,**tree**,**randomForest**Boosting methods: the functions

`glmboost`

and`gamboost`

in package**mboost**`PenalizedLDA`

in the package**PenalizedLDA**,`dudi.discr`

and`dist.pcaiv`

in**ade4**).

The complexity and heterogeneity of choices of learning strategies, tuning parameters and evaluation criteria in each of these packages can be confusing. You will already have noted differences in the interfaces of the `lda`

, `qda`

and `glmnet`

functions, i.e., in how they expect their input data to presented and what they return. There is even greater diversity across all the other packages and functions. At the same time, there are common tasks such as cross-validation, parameter tuning and performance assessment that are more or less the same no matter what specific method is used. As you have seen, e.g., in our `estimate_mcl_loocv`

function, the looping and data shuffling involved led to rather verbose code.

So what to do if you want to try out and explore different learning algorithms? Fortunately, there are several projects that provide unified interfaces to the large number of different machine learning interfaces in R, and also try to provide “best practice” implementations of the common tasks such as parameter tuning and performance assessment. The two most well-known ones are the packages **caret** and **mlr**. Here were have a look at **caret**. You can get a list of supported methods through its `getModelInfo`

function. There are quite a few, here we just show the first 8.

```
library("caret")
= names(getModelInfo())
caretMethods head(caretMethods, 8)
```

```
## [1] "ada" "AdaBag" "AdaBoost.M1" "adaboost"
## [5] "amdai" "ANFIS" "avNNet" "awnb"
```

`length(caretMethods)`

`## [1] 239`

We will check out a neural network method, the `nnet`

function from the eponymous package. The `parameter`

slot informs us on the the available tuning parameters163 They are described in the manual of the `nnet`

function..

`getModelInfo("nnet", regex = FALSE)[[1]]$parameter`

```
## parameter class label
## 1 size numeric #Hidden Units
## 2 decay numeric Weight Decay
```

Let’s try it out.

```
= trainControl(
trnCtrl method = "repeatedcv",
repeats = 3,
classProbs = TRUE)
= expand.grid(
tuneGrid size = c(2, 4, 8),
decay = c(0, 1e-2, 1e-1))
= train(
nnfit ~ Fn1 + Timd2 + Gata4 + Sox7,
Embryonic.day data = embryoCells,
method = "nnet",
tuneGrid = tuneGrid,
trControl = trnCtrl,
metric = "Accuracy")
```

That’s quite a mouthful, but the nice thing is that this syntax is standardized and applies across many different methods. All you need to do specify the name of the method and the grid of tuning parameters that should be explored via the `tuneGrid`

argument.

Now we can have a look at the output (Figure 12.25).

` nnfit`

```
## Neural Network
##
## 66 samples
## 4 predictor
## 3 classes: 'E3.25', 'E3.5', 'E4.5'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 60, 59, 60, 59, 59, 61, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 2 0.00 0.7177778 0.4252365
## 2 0.01 0.7850397 0.6022012
## 2 0.10 0.7705556 0.5658417
## 4 0.00 0.7484524 0.5291407
## 4 0.01 0.7517063 0.5297076
## 4 0.10 0.7792857 0.5811742
## 8 0.00 0.7649603 0.5968491
## 8 0.01 0.7846825 0.6006144
## 8 0.10 0.7608333 0.5475901
##
## Accuracy was used to select the optimal model using the
## largest value.
## The final values used for the model were size = 2 and decay
## = 0.01.
```

Figure 12.25: Parameter tuning of the neural net by cross-validation.

```
plot(nnfit)
predict(nnfit) |> head(10)
```

```
## [1] E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25
## Levels: E3.25 E3.5 E4.5
```

► Question 227

Will the accuracy that we obtained above for the optimal tuning parameters generalize to a new dataset? What could you do to address that?

► Solution

In Chapter 6 we encountered *p-value hacking*. A similar phenomenon exists in statistical learning: given a dataset, we explore various different methods of preprocessing (such as normalization, outlier detection, transformation, feature selection), try out different machine learning algorithms and tune their parameters until we are content with the result. The measured accuracy is likely to be too optimistic, i.e., will not generalize to a new dataset. Embedding as many of our methodical choices into a computational formalism and having an outer cross-validation loop (not to be confused with the inner loop that does the parameter tuning) will ameliorate the problem. But is unlikely to address it completely, since not all our choices can be formalized.

The gold standard remains validation on truly unseen data. In addition, it is never a bad thing if the classifier is not a black box but can be interpreted in terms of domain knowledge. Finally, report not just summary statistics, such as misclassification rates, but lay open the complete computational workflow, so that anyone (including your future self) can convince themselves of the robustness of the result or of the influence of the preprocessing, model selection and tuning choices (Holmes 2018).

We have seen examples of machine learning applications; we have focused on predicting categorical variables (like diabetes type or cell class). Predicting continuous outcomes is also part of machine learning, although we have not considered it here. There are many parallels and overlaps between *machine learning* and *statistical regression* (which we studied in Chapter 8). One can consider them two different names for pretty much the same activity, although each has its own flavors: in machine learning, the emphasis is on the prediction of the outcome variables, whereas in regression we often care at least as much about the role of the covariates – which of them have an effect on the outcome, and what is the nature of these effects? In other words, we do not only want predictions, we also want to understand them.

We saw linear and quadratic discriminant analysis, two intuitive methods for partitioning a two-dimensional data plane (or a \(p\)-dimensional space) into regions using either linear or quadratic separation lines (or hypersurfaces). We also saw logistic regression, which takes a slightly different approach but is more amenable to operating in higher dimensions and to regularization.

We encountered the main challenge of machine learning: how to avoid overfitting? We explored why overfitting happens in the context of the so-called curse of dimensionality, and we learned how it may be overcome using regularization.

In other words, machine learning would be easy if we had infinite amounts of data representatively covering the whole space of possible inputs and outputs164 It would “just” be a formidable database / data management problem.. The challenge is to make the best out of a finite amount of training data, and to generalize these to new, unseen inputs. There is a vigorous trade-off between the amount, resolution and coverage of training data and the complexity of the model. Many models have continuous parameters that enable us to “tune” their complexity or the strength of their regularization. Cross-validation can help us with such tuning, although it is not a panacea, and caveats apply, as we saw in Section 12.6.3.

An introduction to statistical learning that employs many concrete data examples and uses little mathematical formalism is given by James et al. (2013). An extension, with more mathematical background, is the textbook by Hastie, Tibshirani, and Friedman (2008).

The CRAN task view on machine learning gives an overview over machine learning software in R.

RStudio’s API for the “deep learning” platforms Keras and TensorFlow and the associated teaching materials and demos are a good place to try out some of the recent developments in this field.

**► Exercise 12.1 **Apply a *kernel support vector machine*, available in the **kernlab** package, to the `zeller`

microbiome data. What kernel function works well?

▢

**► Exercise 12.2 **Use `glmnet`

for a *prediction of a continous variable*, i.e., for regression. Use the prostate cancer data from Chapter 3 of (Hastie, Tibshirani, and Friedman 2008). The data are available in the CRAN package **ElemStatLearn**. Explore the effects of using ridge versus lasso penalty.

▢

**► Exercise 12.3 **Consider *smoothing as a regression and model selection problem* (remember Figure 12.1). What is the equivalent quantity to the penalization parameter \(\lambda\) in Equation (12.8)? How do you choose it?

▢

► Solution

We refer to Chapter 5 of (Hastie, Tibshirani, and Friedman 2008)

**► Exercise 12.4 ***Scale invariance*. Consider a rescaling of one of the features in the (generalized) linear model (12.7). For instance, denote the \(\nu\)-th column of \(x\) by \(x_{\cdot\nu}\), and suppose that \(p\ge2\) and that we rescale \(x_{\cdot\nu} \mapsto s\, x_{\cdot\nu}\) with some number \(s\neq0\). What will happen to the estimate \(\hat{\beta}\) from Equation (12.8) in (a) the unpenalized case (\(\lambda=0\)) and (b) the penalized case (\(\lambda>0\))?

▢

► Solution

In the unpenalized case, the estimates will be scaled by \(1/s\), so that the resulting model is, in effect, the same. In the penalized case, the penalty from the \(\nu\)-th component of \(\beta\) will be different. If \(|s|>1\), the amplitude of the feature is increased, smaller \(\beta\)-components are required for it to have the same effect in the prediction, and therefore the feature is more likely to receive a non-zero and/or larger estimate, possibly on the cost of the other features; conversely for \(|s|<1\). Regular linear regression is scale-invariant, whereas penalized regression is scale-dependent. It’s important to remember this when interpreting penalized model fits.

**► Exercise 12.5 **It has been quipped that all classification methods are just refinements of *two archetypal ideas*: discriminant analysis and \(k\) nearest neighbors. In what sense might that be a useful classification?

▢

► Solution

In linear discriminant analysis, we consider our objects as elements of \(\mathbb{R}^p\), and the learning task is to define regions in this space, or boundary hyperplanes between them, which we use to predict the class membership of new objects. This is archetypal for *classification by partition*. Generalizations of linear discriminant analysis permit more general spaces and more general boundary shapes.

In \(k\) nearest neighbors, no embedding into a coordinate space is needed, but instead we require a distance (or dissimilarity) measure that can be computed between each pair of objects, and the classification decision for a new object depends on its distances to the training objects and their classes. This is archetypal for *kernel-based* methods.

Page built: 2022-08-09 using R version 4.2.0 (2022-04-22)

Support for maintaining the online version of this book is provided by de.NBI

For website support please contact msmith@embl.de