12 Supervised Learning

In a supervised learning setting, we have a yardstick or plumbline to judge how well we are doing: the response itself.

A frequent question in biological and biomedical applications is whether a property of interest (say, disease type, cell type, the prognosis of a patient) can be “predicted”, given one or more other properties, called the predictors. Often we are motivated by a situation in which the property to be predicted is unknown (it lies in the future, or is hard to measure), while the predictors are known. The crucial point is that we learn the prediction rule from a set of training data in which the property of interest is also known. Once we have the rule, we can either apply it to new data, and make actual predictions of unknown outcomes; or we can dissect the rule with the aim of better understanding the underlying biology.

Compared to unsupervised learning and what we have seen in Chapters 5, 7 and 9, where we do not know what we are looking for or how to decide whether our result is “right”, we are on much more solid ground with supervised learning: the objective is clearly stated, and there are straightforward criteria to measure how well we are doing.

The central issues in supervised learning151 Sometimes the term statistical learning is used, more or less exchangeably. are overfitting and generalizability: did we just learn the training data “by heart” by constructing a rule that has 100% accuracy on the training data, but would perform poorly on any new data? Or did our rule indeed pick up some of the pertinent patterns in the system being studied, which will also apply to yet unseen new data? (Figure 12.1)

An example for **overfitting**: two regression lines are fit to data in the $(x, y)$-plane (black points). We can think of such a line as a rule that predicts the $y$-value, given an $x$-value. Both lines are smooth, but the fits differ in what is called their **bandwidth**, which intuitively can be interpreted their stiffness. The blue line seems overly keen to follow minor wiggles in the data, while the orange line captures the general trend but is less detailed. The effective number of parameters needed to describe the blue line is much higher than for the orange line. Also, if we were to obtain additional data, it is likely that the blue line would do a **worse** job than the orange line in modeling the new data. We'll formalize these concepts --training error and test set error-- later in this chapter. Although exemplified here with line fitting, the concept applies more generally to prediction models. Figure 12.1: An example for overfitting: two regression lines are fit to data in the \((x, y)\)-plane (black points). We can think of such a line as a rule that predicts the \(y\)-value, given an \(x\)-value. Both lines are smooth, but the fits differ in what is called their bandwidth, which intuitively can be interpreted their stiffness. The blue line seems overly keen to follow minor wiggles in the data, while the orange line captures the general trend but is less detailed. The effective number of parameters needed to describe the blue line is much higher than for the orange line. Also, if we were to obtain additional data, it is likely that the blue line would do a worse job than the orange line in modeling the new data. We’ll formalize these concepts –training error and test set error– later in this chapter. Although exemplified here with line fitting, the concept applies more generally to prediction models.

12.1 Goals for this chapter

In this chapter we will:

  • See exemplary applications that motivate the use of supervised learning methods.

  • Learn what discriminant analysis does.

  • Define measures of performance.

  • Encounter the curse of dimensionality and see what overfitting is.

  • Find out about regularization – in particular, penalization – and understand the concepts of generizability and model complexity.

  • See how to use cross-validation to tune parameters of algorithms.

  • Discuss method hacking.

12.2 What are the data?

The basic data structure for both supervised and unsupervised learning is (at least conceptually) a dataframe, where each row corresponds to an object and the columns are different features (usually numerical values) of the objects152 This is a simplified description. Machine learning is a huge field, and lots of generalizations of this simple conceptual picture have been made. Already the construction of relevant features is an art by itself — we have seen examples with images of cells in Chapter 11, and more generally there are lots of possibilities to extract features from images, sounds, movies, free text, \(...\) Moreover, there is a variant of machine learning methods called kernel methods that do not need features at all; instead, kernel methods use distances or measures of similarity between objects. It may be easier, for instance, to define a measure of similarity between two natural language text objects than to find relevant numerical features to represent them. Kernel methods are beyond the scope of this book.. While in unsupervised learning we aim to find (dis)similarity relationships between the objects based on their feature values (e.g., by clustering or ordination), in supervised learning we aim to find a mathematical function (or a computational algorithm) that predicts the value of one of the features from the other features. Many implementations require that there are no missing values, whereas other methods can be made to work with some amount of missing data.

The feature that we select over all the others with the aim of predicting is called the objective or the response. Sometimes the choice is natural, but sometimes it is also instructive to reverse the roles, especially if we are interested in dissecting the prediction function for the purpose of biological understanding, or in disentangling correlations from causation.

The framework for supervised learning covers both continuous and categorical response variables. In the continuous case we also call it regression, in the categorical case, classification. It turns out that this distinction is not a detail, as it has quite far-reaching consequences for the choice of loss function (Section 12.5) and thus the choice of algorithm (Friedman 1997).

The first question to consider in any supervised learning task is how the number of objects compares to the number of predictors. The more objects, the better, and much of the hard work in supervised learning has to do with overcoming the limitations of having a finite (and typically, too small) training set.

In supervised learning, we assign two different roles to our variables. We have labeled the explanatory variables $X$ and the response variable(s) $Y$. There are also two different sets of observations: the training set $X_\ell$ and $Y_\ell$ and the test set $X_v$ and $Y_v$. (The subscripts refer to alternative names for the two sets: "learning" and "validation".) Figure 12.2: In supervised learning, we assign two different roles to our variables. We have labeled the explanatory variables \(X\) and the response variable(s) \(Y\). There are also two different sets of observations: the training set \(X_\ell\) and \(Y_\ell\) and the test set \(X_v\) and \(Y_v\). (The subscripts refer to alternative names for the two sets: “learning” and “validation”.)

► Task

Give examples where we have encountered instances of supervised learning with a categorical response in this book.

12.2.1 Motivating examples

Predicting diabetes type

The diabetes dataset (Reaven and Miller 1979) presents three different groups of diabetes patients and five clinical variables measured on them.

library("readr")
library("magrittr")
diabetes = read_csv("../data/diabetes.csv", col_names = TRUE)
diabetes
## # A tibble: 144 x 7
##       id relwt glufast glutest steady insulin group
##    <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl> <dbl>
##  1     1  0.81      80     356    124      55     3
##  2     3  0.94     105     319    143     105     3
##  3     5  1         90     323    240     143     3
##  4     7  0.91     100     350    221     119     3
##  5     9  0.99      97     379    142      98     3
##  6    11  0.9       91     353    221      53     3
##  7    13  0.96      78     290    136     142     3
##  8    15  0.74      86     312    208      68     3
##  9    17  1.1       90     364    152      76     3
## 10    19  0.83      85     296    116      60     3
## # … with 134 more rows
diabetes$group %<>% factor

We used the forward-backward pipe operator %<>% to convert the group column into a factor. The plot is shown in Figure 12.3.

We see already from the one-dimensional distributions that some of the individual variables could potentially predict which group a patient is more likely to belong to. Our goal will be to combine variables to improve over such one-dimensional prediction models. Figure 12.3: We see already from the one-dimensional distributions that some of the individual variables could potentially predict which group a patient is more likely to belong to. Our goal will be to combine variables to improve over such one-dimensional prediction models.

library("ggplot2")
library("reshape2")
ggplot(melt(diabetes, id.vars = c("id", "group")),
       aes(x = value, col = group)) +
 geom_density() + facet_wrap( ~variable, ncol = 1, scales = "free") +
 theme(legend.position = "bottom")

Predicting cellular phenotypes

Neumann et al. (2010) observed human cancer cells using live-cell imaging. The cells were genetically engineered so that their histones were tagged with a green fluorescent protein (GFP). A genome-wide RNAi library was applied to the cells, and for each siRNA perturbation, movies of a few hundred cells were recorded for about two days, to see what effect the depletion of each gene had on cell cycle, nuclear morphology and cell proliferation. Their paper reports the use of an automated image classification algorithm that quantified the visual appearance of each cell’s nucleus and enabled the prediction of normal mitosis states or aberrant nuclei. The algorithm was trained on the data from around 3000 cells that were annotated by a human expert. It was then applied to almost 2 billions images of nuclei (Figure 12.4). Using automated image classification provided scalablity (annotating 2 billion images manually would take a long time) and objectivity.

Figure 12.4: The data were images of \(2\times10^9\) nuclei from movies. The images were segmented to identify the nuclei, and numeric features were computed for each nucleus, corresponding to size, shape, brightness and lots of other more or less abstract quantitative summaries of the joint distribution of pixel intensities. From the features, the cells were classified into 16 different nuclei morphology classes, represented by the rows of the barplot. Representative images for each class are shown in black and white in the center column. The class frequencies, which are very unbalanced, are shown by the lengths of the bars.

The data were images of $2\times10^9$ nuclei from movies. The images were segmented to identify the nuclei, and numeric features were computed for each nucleus, corresponding to size, shape, brightness and lots of other more or less abstract quantitative summaries of the joint distribution of pixel intensities. From the features, the cells were classified into 16 different nuclei morphology classes, represented by the rows of the barplot. Representative images for each class are shown in black and white in the center column. The class frequencies, which are very unbalanced, are shown by the lengths of the bars.

Predicting embryonic cell states

We will revisit the mouse embryo data (Ohnishi et al. 2014), which we have already seen in Chapters 3, 5 and 7. We’ll try to predict cell state and genotype from the gene expression measurements in Sections 12.3.2 and 12.6.3.

12.3 Linear discrimination

We start with one of the simplest possible discrimination problems153 Arguably the simplest possible problem is a single continuous feature, two classes, and the task of finding a single threshold to discriminate between the two groups – as in Figure 6.5.: we have objects described by two continuous features (so the objects can be thought of as points in the 2D plane) and falling into three groups. Our aim is to define class boundaries, which are lines in the 2D space.

12.3.1 Diabetes data

Let’s see whether we can predict the group from the insulin and glutest variables in the diabetes data. It’s always a good idea to first visualise the data (Figure 12.5).

Scatterplot of two of the variables in the `diabetes` data. Each point is a sample, and the color indicates the diabetes type as encoded in the `group` variable. Figure 12.5: Scatterplot of two of the variables in the diabetes data. Each point is a sample, and the color indicates the diabetes type as encoded in the group variable.

ggdb = ggplot(mapping = aes(x = insulin, y = glutest)) +
  geom_point(aes(colour = group), data = diabetes)
ggdb

We’ll start with a method called linear discriminant analysis (LDA). This method is a foundation stone of classification, many of the more complicated (and sometimes more powerful) algorithms are really just generalizations of LDA.

library("MASS")
diabetes_lda = lda(group ~ insulin + glutest, data = diabetes)
diabetes_lda
## Call:
## lda(group ~ insulin + glutest, data = diabetes)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2222222 0.2500000 0.5277778 
## 
## Group means:
##    insulin   glutest
## 1 320.9375 1027.3750
## 2 208.9722  493.9444
## 3 114.0000  349.9737
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## insulin -0.004463900 -0.01591192
## glutest -0.005784238  0.00480830
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9677 0.0323
ghat = predict(diabetes_lda)$class
table(ghat, diabetes$group)
##     
## ghat  1  2  3
##    1 25  0  0
##    2  6 24  6
##    3  1 12 70
mean(ghat != diabetes$group)
## [1] 0.1736111

► Question 12.1

What do the different parts of the above output mean?

Now, let’s visualise the LDA result. We are going to plot the prediction regions for each of the three groups. We do this by creating a grid of points and using our prediction rule on each of them. We’ll then also dig a bit deeper into the mechanics of LDA and plot the class centers (diabetes_lda$means) and ellipses that correspond to the fitted covariance matrix (diabetes_lda$scaling). Assembling this visualization requires us to write a bit of code.

make1Dgrid = function(x) {
  rg = grDevices::extendrange(x)
  seq(from = rg[1], to = rg[2], length.out = 100)
}

Set up the points for prediction, a \(100 \times 100\) grid that covers the data range.

diabetes_grid = with(diabetes,
  expand.grid(insulin = make1Dgrid(insulin),
              glutest = make1Dgrid(glutest)))

Do the predictions.

diabetes_grid$ghat =
  predict(diabetes_lda, newdata = diabetes_grid)$class

The group centers.

centers = diabetes_lda$means

Compute the ellipse. We start from a unit circle and apply the corresponding affine transformation from the LDA output.

unitcircle = exp(1i * seq(0, 2*pi, length.out = 90)) %>%
          {cbind(Re(.), Im(.))}
ellipse = unitcircle %*% solve(diabetes_lda$scaling)

All three ellipses, one for each group center.

ellipses = lapply(seq_len(nrow(centers)), function(i) {
  (ellipse +
   matrix(centers[i, ], byrow = TRUE,
          ncol = ncol(centers), nrow = nrow(ellipse))) %>%
     cbind(group = i)
}) %>% do.call(rbind, .) %>% data.frame
ellipses$group %<>% factor

Now we are ready to plot (Figure 12.6).

As Figure \@ref(fig:chap16-r-scatterdiabetes-1), with the classification regions from the LDA model shown. The three ellipses represent the class centers and the covariance matrix of the LDA model; note that there is only one covariance matrix, which is the same for all three classes. Therefore also the sizes and orientations of the ellipses are the same for the three classes, only their centers differ. They represent contours of equal class membership probability. Figure 12.6: As Figure 12.5, with the classification regions from the LDA model shown. The three ellipses represent the class centers and the covariance matrix of the LDA model; note that there is only one covariance matrix, which is the same for all three classes. Therefore also the sizes and orientations of the ellipses are the same for the three classes, only their centers differ. They represent contours of equal class membership probability.

ggdb + geom_raster(aes(fill = ghat),
            data = diabetes_grid, alpha = 0.25, interpolate = TRUE) +
    geom_point(data = as_tibble(centers), pch = "+", size = 8) +
    geom_path(aes(colour = group), data = ellipses) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0))

► Question 12.2

Why is the boundary between the prediction regions for groups 1 and 2 not perpendicular to the line between the cluster centers?

► Solution

► Question 12.3

How confident would you be about the predictions in those areas of the 2D plane that are far from all of the cluster centers?

► Solution

► Question 12.4

Why is the boundary between the prediction regions for groups 2 and 3 not half-way between the centers, but shifted in favor of class 3? (Hint: have a look at the prior argument of lda.) Try again with uniform prior.

► Solution

► Question 12.5

Figures 12.6 and 12.7 show both the fitted LDA model, through the ellipses, and the prediction regions, through the area coloring. What part of this visualization is generic for all sorts of classification methods, what part is method-specific?

► Solution

► Question 12.6

What is the difference in the prediction accuracy if we use all 5 variables instead of just insulin and glufast?

► Solution

► Question 12.7

Instead of approximating the prediction regions by classification from a grid of points, compute the separating lines explicitly from the linear determinant coefficients.

► Solution

12.3.2 Predicting embryonic cell state from gene expression

Assume that we already know that the four genes FN1, TIMD2, GATA4 and SOX7 are relevant to the classification task154 Later in this chapter we will see methods that can drop this assumption and screen all available features.. We want to build a classifier that predicts the developmental time (embryonic days: E3.25, E3.5, E4.5). We load the data and select four corresponding probes.

library("Hiiragi2013")
data("x")
probes = c("1426642_at", "1418765_at", "1418864_at", "1416564_at")
embryoCells = t(Biobase::exprs(x)[probes, ]) %>% as_tibble %>%
  mutate(Embryonic.day = x$Embryonic.day) %>%
  dplyr::filter(x$genotype == "WT")

We can use the Bioconductor annotation package associated with the microarray to verify that the probes correspond to the intended genes.

annotation(x)
## [1] "mouse4302"
library("mouse4302.db")
anno = AnnotationDbi::select(mouse4302.db, keys = probes,
         columns = c("SYMBOL", "GENENAME"))
anno
##      PROBEID SYMBOL
## 1 1426642_at    Fn1
## 2 1418765_at  Timd2
## 3 1418864_at  Gata4
## 4 1416564_at   Sox7
##                                              GENENAME
## 1                                       fibronectin 1
## 2 T cell immunoglobulin and mucin domain containing 2
## 3                              GATA binding protein 4
## 4                SRY (sex determining region Y)-box 7
mt = match(anno$PROBEID, colnames(embryoCells))
colnames(embryoCells)[mt] = anno$SYMBOL

Now we are ready to visualize the data in a pairs plot (Figure 12.8).

library("GGally")
ggpairs(embryoCells, mapping = aes(col = Embryonic.day),
  columns = anno$SYMBOL, upper = list(continuous = "points"))

Figure 12.8: Expression values of the discriminating genes, with the prediction target Embryonic.day shown by color.

Expression values of the discriminating genes, with the prediction target Embryonic.day shown by color.

We can now call lda on these data. The linear combinations LD1 and LD2 that serve as discriminating variables are given in the slot ed_lda$scaling of the output from lda.

ec_lda = lda(Embryonic.day ~ Fn1 + Timd2 + Gata4 + Sox7,
             data = embryoCells)
round(ec_lda$scaling, 1)
##        LD1  LD2
## Fn1   -0.2 -0.4
## Timd2  0.5  0.0
## Gata4 -0.1 -0.6
## Sox7  -0.7  0.5

For the visualization of the learned model in Figure 12.9, we need to build the prediction regions and their boundaries by expanding the grid in the space of the two new coordinates LD1 and LD2.

LDA classification regions for Embryonic.day. Figure 12.9: LDA classification regions for Embryonic.day.

ec_rot = predict(ec_lda)$x %>% as_tibble %>%
           mutate(ed = embryoCells$Embryonic.day)
ec_lda2 = lda(ec_rot[, 1:2], predict(ec_lda)$class)
ec_grid = with(ec_rot, expand.grid(
  LD1 = make1Dgrid(LD1),
  LD2 = make1Dgrid(LD2)))
ec_grid$edhat = predict(ec_lda2, newdata = ec_grid)$class
ggplot() +
  geom_point(aes(x = LD1, y = LD2, colour = ed), data = ec_rot) +
  geom_raster(aes(x = LD1, y = LD2, fill = edhat),
            data = ec_grid, alpha = 0.4, interpolate = TRUE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_fixed()

► Question 12.8

Repeat these analyses using quadratic discriminant analysis (qda). What difference do you see in the shape of the boundaries?

► Solution

► Question 12.9

What happens if you call lda or qda with a lot more genes, say the first 1000, in the Hiiragi dataset?

► Solution

12.4 Machine learning vs rote learning

Computers are really good at memorizing facts. In the worst case, a machine learning algorithm is a roundabout way of doing this155 The not-so roundabout way is database technologies.. The central goal in statistical learning, however, is generalizability. We want an algorithm that is able to generalize, i.e., interpolate and extrapolate from given data to make good predictions about future data.

Let’s look at the following example. We generate random data (rnorm) for n objects, with different numbers of features (given by p). We train a LDA on these data and compute the misclassification rate, i.e., the fraction of times the prediction is wrong (pred != resp).

Misclassification rate of LDA applied to random data. While the number of observations `n` is held constant (at 20), we are increasing the number of features `p` starting from 2 up to 21. The misclassification rate becomes almost zero as `p` approaches 20. The LDA model becomes so elaborate and over-parameterized that it manages to learn the random labels "by heart". (As `p` becomes even larger, the "performance" degrades again somewhat, apparently due to numerical properties of the `lda` implementation used here.) Figure 12.11: Misclassification rate of LDA applied to random data. While the number of observations n is held constant (at 20), we are increasing the number of features p starting from 2 up to 21. The misclassification rate becomes almost zero as p approaches 20. The LDA model becomes so elaborate and over-parameterized that it manages to learn the random labels “by heart”. (As p becomes even larger, the “performance” degrades again somewhat, apparently due to numerical properties of the lda implementation used here.)

library("dplyr")
p = 2:21
n = 20

mcl = lapply(p, function(pp) {
  replicate(100, {
    xmat = matrix(rnorm(n * pp), nrow = n)
    resp = sample(c("apple", "orange"), n, replace = TRUE)
    fit  = lda(xmat, resp)
    pred = predict(fit)$class
    mean(pred != resp)
  }) %>% mean %>% tibble(mcl = ., p = pp)
}) %>% bind_rows

ggplot(mcl, aes(x = p, y = mcl)) + geom_line() + geom_point() +
  ylab("Misclassification rate")

► Question 12.10

What is the purpose of the replicate loop in the above code? What happens if you omit it (or replace the 100 by 1)?

► Solution

Figure 12.11 seems to imply that we can perfectly predict random labels from random data, if we only fit a complex enough model, i.e., one with many parameters. How can we overcome such an absurd conclusion? The problem with the above code is that the model performance is evaluated on the same data on which it was trained. This generally leads to positive bias, as you see in this crass example. How can we overcome this problem? The key idea is to assess model performance on different data than those on which the model was trained.

12.4.1 Cross-validation

A naive approach might be to split the data in two halves, and use the first half for learning (“training”) and the second half for assessment (“testing”). It turns out that this is needlessly variable and needlessly inefficient. It is needlessly variable, since by splitting the data only once, our results can be quite affected by how the split happens to fall. It seems better to do the splitting many times, and average. This will give us more stable results. It is needlessly inefficient, since the performance of machine learning algorithms depends on the number of observations, and the performance measured on half the data is likely156 Unless we have such an excess of data that it doesn’t matter. to be worse than what it is with all the data. For this reason, it is better to use unequal sizes of training and test data. In the extreme case, we’ll use as much as \(n-1\) observations for training, and the remaining one for testing. After we’ve done this likewise for all observations, we can average our performance metric. This is called leave-one-out cross-validation.

See Chapter Model Assessment and Selection in the book by Hastie, Tibshirani, and Friedman (2008) for further discussion on these trade-offs.

An alternative is \(k\)-fold cross-validation, where the observations are repeatedly split into a training set of size of around \(n(k-1)/k\) and a test set of size of around \(n/k\). Both alternatives have pros and contras, and there is not a universally best choice. An advantage of leave-one-out is that the amount of data used for training is close to the maximally available data; this is especially important if the sample size is limiting and “every little matters” for the algorithm. A drawback of leave-one-out is that the training sets are all very similar, so they may not model sufficiently well the kind of sampling changes to be expected if a new dataset came along. For large \(n\), leave-one-out cross-validation can be needlessly time-consuming.

Cross-validation: the misclassification rate of LDA applied to random data, when evaluated on test data that were not used for learning, hovers around 0.5 independent of `p`. The misclassification rate on the training data is also shown. It behaves similar to what we already saw in Figure \@ref(fig:chap16-r-learnbyheart-1). Figure 12.12: Cross-validation: the misclassification rate of LDA applied to random data, when evaluated on test data that were not used for learning, hovers around 0.5 independent of p. The misclassification rate on the training data is also shown. It behaves similar to what we already saw in Figure 12.11.

estimate_mcl_loocv = function(x, resp) {
  vapply(seq_len(nrow(x)), function(i) {
    fit  = lda(x[-i, ], resp[-i])
    ptrn = predict(fit, newdata = x[-i,, drop = FALSE])$class
    ptst = predict(fit, newdata = x[ i,, drop = FALSE])$class
    c(train = mean(ptrn != resp[-i]), test = (ptst != resp[i]))
  }, FUN.VALUE = numeric(2)) %>% rowMeans %>% t %>% as_tibble
}

xmat = matrix(rnorm(n * last(p)), nrow = n)
resp = sample(c("apple", "orange"), n, replace = TRUE)

mcl = lapply(p, function(k) {
  estimate_mcl_loocv(xmat[, 1:k], resp)
}) %>% bind_rows %>% data.frame(p) %>% melt(id.var = "p")

ggplot(mcl, aes(x = p, y = value, col = variable)) + geom_line() +
  geom_point() + ylab("Misclassification rate")

The result is show in Figure 12.12.

► Question 12.11

Why are the curves in Figure 12.12 more variable (“wiggly”) than in Figure 12.11? How can you overcome this?

► Solution

12.4.2 The curse of dimensionality

In Section 12.4.1 we have seen overfitting and cross-validation on random data, but how does it look if there is in fact a relevant class separation?

As we increase the number of features included in the model, the misclassification rate initially improves; as we start including more and more irrelevant features, it increases again, as we are fitting noise. Figure 12.13: As we increase the number of features included in the model, the misclassification rate initially improves; as we start including more and more irrelevant features, it increases again, as we are fitting noise.

p   = 2:20
mcl = replicate(100, {
  xmat = matrix(rnorm(n * last(p)), nrow = n)
  resp = sample(c("apple", "orange"), n, replace = TRUE)
  xmat[, 1:6] = xmat[, 1:6] + as.integer(factor(resp))

  lapply(p, function(k) {
    estimate_mcl_loocv(xmat[, 1:k], resp)
  }) %>% bind_rows %>% cbind(p = p) %>% melt(id.var = "p")
}, simplify = FALSE) %>% bind_rows

mcl =  group_by(mcl, p, variable) %>%
   summarise(value = mean(value))

ggplot(mcl, aes(x = p, y = value, col = variable)) + geom_line() +
   geom_point() + ylab("Misclassification rate")

Idealized version of Figure \@ref(fig:chap16-r-curseofdim), from @HastieTibshiraniFriedman. A recurrent goal in machine learning is finding the sweet spot in the variance--bias trade-off. Figure 12.14: Idealized version of Figure 12.13, from Hastie, Tibshirani, and Friedman (2008). A recurrent goal in machine learning is finding the sweet spot in the variance–bias trade-off.

The result is shown in Figure 12.13. The group centers are the vectors (in \(\mathbb{R}^{20}\)) given by the coordinates \((1, 1, 1, 1, 1, 1, 0, 0, 0, ...)\) (apples) and \((2, 2, 2, 2, 2, 2, 0, 0, 0, ...)\) (oranges), and the optimal decision boundary is the hyperplane orthogonal to the line between them. For \(p\) smaller than \(6\), the decision rule cannot reach this hyperplane – it is biased. As a result, the misclassification rate is suboptimal, and it decreases with \(p\). But what happens for \(p\) larger than \(6\)? The algorithm is, in principle, able to model the optimal hyperplane, and it should not be distracted by the additional features. The problem is that it is. The more additional features enter the dataset, the higher the probability that one or more of them happen to fall in a way that they look like good, discriminating features in the training data – only to mislead the classifier and degrade its performance in the test data. Shortly we’ll see how to use penalization to (try to) control this problem.

The term curse of dimensionality was coined by Bellman (1961). It refers to the fact that high-dimensional spaces are very hard, if not impossible, to sample thoroughly: for instance, to cover a 2-dimensional square of side length 1 with grid points that are 0.1 apart, we need \(10^2=100\) points. In 100 dimensions, we need \(10^{100}\) – which is already more than the number of protons in the universe. In genomics, we often aim to fit models to data with thousands of features. Also our intuitions about distances between points or about the relationship between a volume and its surface break down in a high-dimensional settings. We’ll explore some of the weirdnesses of high-dimensional spaces in the next few questions.

► Question 12.12

Assume you have a dataset with 1 000 000 data points in \(p\) dimensions. The data are uniformly distributed in the unit hybercube (i.e., all features lie in the interval \([0,1]\)). What’s the side length of a hybercube that can be expected to contain just 10 of the points, as a function of \(p\)?

► Solution

Next, let’s look at the relation between inner regions of the feature space versus its boundary regions. Generally speaking, prediction at the boundaries of feature space is more difficult than in its interior, as it tends to involve extrapolation, rather than interpolation. In the next question you’ll see how this difficulty explodes with feature space dimension.

► Question 12.13

What fraction of a unit cube’s total volume is closer than 0.01 to any of its surfaces, as a function of the dimension?

► Solution

► Question 12.14

What is the coefficient of variation (ratio of standard deviation over average) of the distance between two randomly picked points in the unit hypercube, as a function of the dimension?

► Solution

12.5 Objective functions

We’ve already seen the misclassification rate (MCR) used to assess our classification performance in Figures 12.1112.13. Its population version is defined as

\[\begin{equation} \text{MCR} = \text{E}\left[ 𝟙_{\hat{y} \neq y} \right], \tag{12.1} \end{equation}\]

and for a finite sample

\[\begin{equation} \widehat{\text{MCR}} = \frac{1}{n}\sum_{i=1}^n 𝟙_{\hat{y_i} \neq y_i}. \tag{12.2} \end{equation}\]

This is not the only choice we could make. Perhaps we care more about the misclassification of apples as oranges than vice versa, and we can reflect this by introducing weights that depend on the type of error made into the sum of Equation (12.2) (or the integral of Equation (12.1)). This can get even more elaborate if we have more than two classes. Often we want to see the whole confusion table, which we can get via

table(truth, response)

An important special case is binary classification with asymmetric costs – think about, say, a medical test. Here, the sensitivity (a.k.a. true positive rate or recall) is related to the misclassification of healthy as ill, and the specificity (or true negative rate) depends on the probability of misclassification of ill as healthy. Often, there is a single parameter (e.g., a threshold) that can be moved up and down, allowing a trade-off between sensitivity and specificity (and thus, equivalently, between the two types of misclassification). In those cases, we usually are not content to know the classifier performance at one single choice of threshold, but at many (or all) of them. This leads to receiver operating characteristic (ROC) or precision-recall curves.

► Question 12.15

What are the exact relationships between the per-class misclassification rates and sensitivity and specificity?

► Solution

Another cost function can be computed from the Jaccard index, which we already saw in Chapter 5.

\[\begin{equation} J(A,B) = \frac{|\,A\cap B\,|}{|\,A\cup B\,|}, \tag{12.3} \end{equation}\]

where \(A\) is the set of observations for which the true class is 1 (\(A=\{i\,|\,y_i=1\}\)) and \(B\) is the set of observations for which the predicted class is 1. The number \(J\) is between 0 and 1, and when \(J\) is large, it indicates high overlap of the two sets. Note that \(J\) does not depend on the number of observations for which both true and predicted class is 0 – so it is particularly suitable for measuring the performance of methods that try to find rare events.

We can also consider probabilistic class predictions, which come in the form \(\hat{P}(Y\,|\,X)\). In this case, a possible risk function would be obtained by looking at distances between the true probability distribution and the estimated probability distributions. For two classes, the finite sample version of the \(\log \text{loss}\) is

\[\begin{equation} \log \text{loss} = -\frac{1}{n}\sum_{i=1}^n y_i\log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i), \tag{12.4} \end{equation}\]

where \(\hat{p}_i \in [0,1]\) is the prediction, and \(y_i\in\{0,1\}\) is the truth.

Note that the \(\log \text{loss}\) will be infinite if a prediction is totally confident (\(\hat{p}_i\) is exactly \(0\) or \(1\)) but wrong.

For continuous continuous response variables (regression), a natural choice is the mean squared error (MSE). It is the average squared error,

\[\begin{equation} \widehat{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n ( \hat{Y}_i - Y_i )^2. \tag{12.5} \end{equation}\]

The population version is defined analogously, by turning the summation into an integral as in Equations (12.1) and (12.2).

Statisticians call functions like Equations ((12.1)(12.5)) variously (and depending on context and predisposition) risk function, cost function, objective function157 There is even an R package dedicated to evaluation of statistical learners called metrics..

12.6 Variance–bias trade-off

In the upper bull's eye, the estimates are systematically off target, but in a quite reproducible manner. The green segment represents the bias. In the lower bull's eye, the estimates are not biased, as they are centered in the right place, however they have high variance. We can distinguish the two scenarios since we see the result from many shots. If we only had one shot and missed the bull's eye, we could not easily know whether that's because of bias or variance.In the upper bull's eye, the estimates are systematically off target, but in a quite reproducible manner. The green segment represents the bias. In the lower bull's eye, the estimates are not biased, as they are centered in the right place, however they have high variance. We can distinguish the two scenarios since we see the result from many shots. If we only had one shot and missed the bull's eye, we could not easily know whether that's because of bias or variance. Figure 12.18: In the upper bull’s eye, the estimates are systematically off target, but in a quite reproducible manner. The green segment represents the bias. In the lower bull’s eye, the estimates are not biased, as they are centered in the right place, however they have high variance. We can distinguish the two scenarios since we see the result from many shots. If we only had one shot and missed the bull’s eye, we could not easily know whether that’s because of bias or variance.

An important fact that helps us understand the tradeoffs when picking a statistical learning model is that the MSE is the sum of two terms, and often the choices we can make are such that one of those terms goes down while the other one goes up. The bias measures how different the average of all the different estimates is from the truth, and variance, how much an individual one might scatter from the average value (Figure 12.18). In applications, we often only get one shot, therefore being reliably almost on target can beat being right on the long term average but really off today. The decomposition

\[\begin{equation} \text{MSE} = \underbrace{\text{Var}(\hat{Y})}_{\text{variance}} + \underbrace{\mathbb{E}[\hat{Y}-Y]^2}_{\text{bias}} \tag{12.6} \end{equation}\]

follows by straightforward algebra.

When trying to minimize the MSE, it is important to realize that sometimes we can pay the price of a small bias to greatly reduce variance, and thus overall improve MSE. We already encountered shrinkage estimation in Chapter 8. In classification (i.e., when we have categorical response variables), different objective functions than the MSE are used, and there is usually no such straightforward decomposition as in Equation (12.6). The good news is that we can usually go even much further than in the case of continuous responses with our trading biases for variance. This is because the discreteness of the response absorbs certain biases (Friedman 1997), so that the cost of higher bias is almost zero, while we still get the benefit of better (smaller) variance.

12.6.1 Penalization

In high-dimensional statistics, we are constantly plagued by variance: there is just not enough data to fit all the possible parameters. One of the most fruitful ideas in high-dimensional statistics is penalization: a tool to actively control and exploit the variance-bias tradeoff. Penalization is part of a larger class of regularization methods that are used to ensure stable estimates.

Although generalization of LDA to high-dimensional settings is possible (Clemmensen et al. 2011; Witten and Tibshirani 2011), it turns out that logistic regression is a more general approach158 It fits into the framework of generalized linear models, which we encounted in Chapter 8., and therefore we’ll now switch to that, using the glmnet package.

Multinomial–or, for the special case of two classes, binomial– logistic regression models the posterior log-odds between \(k\) classes and can be written in the form (see Hastie, Tibshirani, and Friedman (2008) for a complete presentation).

\[\begin{equation} \log \frac{P(Y=i\,|\,X=x)}{P(Y=k\,|\,X=x)} = \beta^0_i + \beta_i x, \tag{12.7} \end{equation}\]

where \(i=1,...,k-1\) counts over the different classes. The data matrix \(x\) has dimensions \(n\times p\), where \(n\) is number of observations and \(p\) the number of features). The \(p\)-dimensional vector \(\beta_i\) determines how the classification odds for class \(i\) versus class \(k\) depend on \(x\). The numbers \(\beta^0_i\) are intercepts and depend, among other things, on the classes’ prior probabilities. Instead of the log odds (12.7) (i.e., ratios of class probabilities), we can also write down an equivalent model for the class probabilities themselves, and the fact that we here used the \(k\)-th class as a reference is an arbitrary choice, as the model estimates are equivariant under this choice (Hastie, Tibshirani, and Friedman 2008). The model is fit by maximising the log-likelihood \(\mathcal{l}(\beta, \beta^0; x)\), where \(\beta=(\beta_1,...,\beta_{k-1})\) and analogously for \(\beta^0\).

So far, so good. But as \(p\) gets larger, there is an increasing chance that some of the estimates go wildly off the mark, due to random sampling happenstances in the data (remember Figure 12.1). This is true even if for each individual coordinate of the vector \(\beta_i\), the error distribution is bounded: the probabilty of there being one coordinate that is in the far tails increases the more coordiates there are, i.e., the larger \(p\) is.

A related problem can also occur, not in (12.7), but in other, non-linear models, as the model dimension \(p\) increases while the sample size \(n\) remains the same: the likelihood landscape around its maximum becomes increasingly flat, and the maximum-likelihood estimate of the model parameters becomes more and more variable. Eventually, the maximum is no longer a point, but a submanifold, and the maximum likelihood estimate is unidentifiable. Both of these limitations can be overcome with a modification of the objective: instead of maximising the bare log-likelihood, we maximise a penalized version of it,

\[\begin{equation} \hat{\beta}= \arg\max_\beta \mathcal{l}(\beta, \beta^0; x) + \lambda \operatorname{pen}(\beta), \tag{12.8} \end{equation}\]

where \(\lambda\ge0\) is a real number, and \(\operatorname{pen}\) is a convex function, called the penalty function. Popular choices are \(\operatorname{pen}(\beta)=|\beta|^2\) (ridge regression) and \(\operatorname{pen}(\beta)=|\beta|^1\) (lasso).

Here, \(|\beta|^\nu=\sum_i \beta_i^\nu\) is the \(L_\nu\)-norm of the vector \(\beta\). Variations are possible, for instead we could include in this summation only some but not all of the elements of \(\beta\); or we could scale different elements differently, for instance based on some prior belief of their scale and importance. In the elastic net, ridge and lasso are hybridized by using the penalty function \(\operatorname{pen}(\beta)=(1-\alpha)|\beta|^1+\alpha|\beta|^2\) with some further parameter \(\alpha\in[0,1]\). The crux is, of course, how to choose the right \(\lambda\), and we will discuss that in the following.

12.6.2 Example: predicting colon cancer from stool microbiome composition

Zeller et al. (2014) studied metagenome sequencing data from fecal samples of 156 humans that included colorectal cancer patients and tumor-free controls. Their aim was to see whether they could identify biomarkers (presence or abundance of certain taxa) that could help with early tumor detection. The data are available from Bioconductor through its ExperimentHub service under the identifier EH361.

library("ExperimentHub")
eh = ExperimentHub()
zeller = eh[["EH361"]]
table(zeller$disease)
## 
##        cancer large_adenoma             n small_adenoma 
##            53            15            61            27

► Question 12.16

Explore the eh object to see what other datasets there are.

► Solution

For the following, let’s focus on the normal and cancer samples and set the adenomas aside.

zellerNC = zeller[, zeller$disease %in% c("n", "cancer")]

Before jumping into model fitting, as always it’s a good idea to do some exploration of the data. First, let’s look at the sample annotations. The following code prints the data from three randomly picked samples. (Only looking at the first ones, say with the R function head, is also an option, but may not be representative of the whole dataset).

pData(zellerNC)[ sample(ncol(zellerNC), 3), ]
##                    subjectID age gender bmi country disease
## CCIS77252613ST-4-0    FR-783  65   male  26  france  cancer
## CCIS36797902ST-4-0    FR-198  63   male  25  france       n
## CCIS78318719ST-4-0    FR-768  69 female  30  france  cancer
##                    tnm_stage ajcc_stage localization     fobt
## CCIS77252613ST-4-0    t1n0m0          i           lc positive
## CCIS36797902ST-4-0      <NA>       <NA>         <NA> negative
## CCIS78318719ST-4-0    t3n0m0         ii           rc negative
##                    wif-1_gene_methylation_test   group bodysite
## CCIS77252613ST-4-0                    positive     crc    stool
## CCIS36797902ST-4-0                    positive control    stool
## CCIS78318719ST-4-0                    negative     crc    stool
##                    ethnicity number_reads
## CCIS77252613ST-4-0     white     89494083
## CCIS36797902ST-4-0     white     43056463
## CCIS78318719ST-4-0     white     59124409

Next, let’s explore the feature names:

We define the helper function formatfn to line wrap these long character strings for the available space here.

formatfn = function(x)
   gsub("|", "| ", x, fixed = TRUE) %>% lapply(strwrap)

rownames(zellerNC)[1:4]
## [1] "k__Bacteria"                 
## [2] "k__Viruses"                  
## [3] "k__Bacteria|p__Firmicutes"   
## [4] "k__Bacteria|p__Bacteroidetes"
rownames(zellerNC)[nrow(zellerNC) + (-2:0)] %>% formatfn
## [[1]]
## [1] "k__Bacteria| p__Proteobacteria| c__Deltaproteobacteria|"
## [2] "o__Desulfovibrionales| f__Desulfovibrionaceae|"         
## [3] "g__Desulfovibrio| s__Desulfovibrio_termitidis"          
## 
## [[2]]
## [1] "k__Viruses| p__Viruses_noname| c__Viruses_noname|"
## [2] "o__Viruses_noname| f__Baculoviridae|"             
## [3] "g__Alphabaculovirus|"                             
## [4] "s__Bombyx_mori_nucleopolyhedrovirus|"             
## [5] "t__Bombyx_mori_nucleopolyhedrovirus_unclassified" 
## 
## [[3]]
## [1] "k__Bacteria| p__Proteobacteria| c__Deltaproteobacteria|"
## [2] "o__Desulfovibrionales| f__Desulfovibrionaceae|"         
## [3] "g__Desulfovibrio| s__Desulfovibrio_termitidis|"         
## [4] "t__GCF_000504305"

As you can see, the features are a mixture of abundance quantifications at different taxonomic levels, from kingdom over phylum to species. We could select only some of these, but here we continue with all of them. Next, let’s look at the distribution of some of the features. Here, we show an arbitrary choice of two, number 510 and 527; in practice, it is helpful to scroll through many such plots quickly to get an impression (Figure 12.19).

Histograms of the distributions for two randomly selected features. The distributions are highly skewed, with many zero values and a thin, long tail of non-zero values. Figure 12.19: Histograms of the distributions for two randomly selected features. The distributions are highly skewed, with many zero values and a thin, long tail of non-zero values.

ggplot(melt(Biobase::exprs(zellerNC)[c(510, 527), ]), aes(x = value)) +
    geom_histogram(bins = 25) +
    facet_wrap( ~ Var1, ncol = 1, scales = "free")

In the simplest case, we fit model (12.7) as follows.

library("glmnet")
glmfit = glmnet(x = t(Biobase::exprs(zellerNC)),
                y = factor(zellerNC$disease),
                family = "binomial")

A remarkable feature of the glmnet function is that it fits (12.7) not only for one choice of \(\lambda\), but for all possible \(\lambda\)s at once. For now, let’s look at the prediction performance for, say, \(\lambda=0.04\). The name of the function parameter is s:

predTrsf = predict(glmfit, newx = t(Biobase::exprs(zellerNC)),
                   type = "class", s = 0.04)
table(predTrsf, zellerNC$disease)
##         
## predTrsf cancer  n
##   cancer     51  0
##   n           2 61

Not bad – but remember that this is on the training data, without cross-validation. Let’s have a closer look at glmfit. The glmnet package offers a a diagnostic plot that is worth looking at (Figure 12.20).

Regularization paths for `glmfit`. Figure 12.20: Regularization paths for glmfit.

plot(glmfit, col = brewer.pal(8, "Dark2"), lwd = sqrt(3), ylab = "")

► Question 12.17

What are the \(x\)- and \(y\)-axes in Figure 12.20? What are the different lines?

► Solution

Let’s get back to the question of how to choose the parameter \(\lambda\). We could try many different choices –and indeed, all possible choices– of \(\lambda\), assess classification performance in each case using cross-validation, and then choose the best \(\lambda\).

You'll already realize from the description of this strategy that if we optimize \(\lambda\) in this way, the resulting apparent classification performance will likely be exaggerated. We need a truly independent dataset, or at least another, outer cross-validation loop to get a more realistic impression of the generalizability. We will get back to this question at the end of the chapter. We could do so by writing a loop as we did in the estimate_mcl_loocv function in Section 12.4.1. It turns out that the glmnet package already has built-in functionality for that, with the function cv.glmnet, which we can use instead.

Diagnostic plot for `cv.glmnet`: shown is a measure of cross-validated prediction performance, the deviance, as a function of $\lambda$. The dashed vertical lines show `lambda.min` and `lambda.1se`. Figure 12.21: Diagnostic plot for cv.glmnet: shown is a measure of cross-validated prediction performance, the deviance, as a function of \(\lambda\). The dashed vertical lines show lambda.min and lambda.1se.

cvglmfit = cv.glmnet(x = t(Biobase::exprs(zellerNC)),
                     y = factor(zellerNC$disease),
                     family = "binomial")
plot(cvglmfit)

The diagnostic plot is shown in Figure 12.21. We can access the optimal value with

cvglmfit$lambda.min
## [1] 0.106367

As this value results from finding a minimum in an estimated curve, it turns out that it is often too small, i.e., that the implied penalization is too weak. A heuristic recommended by the authors of the glmnet package is to use a somewhat larger value instead, namely the largest value of \(\lambda\) such that the performance measure is within 1 standard error of the minimum.

cvglmfit$lambda.1se
## [1] 0.1473062

► Question 12.18

How does the confusion table look like for \(\lambda=\;\)lambda.1se?

► Solution

► Question 12.19

What features drive the classification?

► Solution

► Question 12.20

How do the results change if we transform the data, say, with the asinh transformation as we saw in Chapter 5?

► Solution

► Question 12.21

Would a good classification performance on these data mean that this assay is ready for screening and early cancer detection?

► Solution

12.6.3 Example: classifying mouse cells from their expression profiles

Figures 12.21 and 12.22 are textbook examples of how we expect the dependence of (cross-validated) classification performance versus model complexity (\(\lambda\)) to look. Now let’s get back to the mouse embryo cells data. We’ll try to classify the cells from embryonic day E3.25 with respect to their genotype.

Cross-validated misclassification error versus penalty parameter for the mouse cells data. Figure 12.23: Cross-validated misclassification error versus penalty parameter for the mouse cells data.

sx = x[, x$Embryonic.day == "E3.25"]
embryoCellsClassifier = cv.glmnet(t(Biobase::exprs(sx)), sx$genotype,
                family = "binomial", type.measure = "class")
plot(embryoCellsClassifier)

In Figure 12.23 we see that the misclassification error is (essentially) monotonously increasing with \(\lambda\), and is smallest for \(\lambda\to 0\), i.e., if we apply no penalization at all.

► Question 12.22

What is going on with these data?

► Solution

12.7 A large choice of methods

We have now seen three classification methods: linear discriminant analysis (lda), quadratic discriminant analysis (qda) and logistic regression using elastic net penalization (glmnet). In fact, there are hundreds of different learning algorithms159 For an introduction to the subject that uses R and provides many examples and exercises, we recommend (James et al. 2013). available in R and its add-on packages. You can get an overview in the CRAN task view Machine Learning & Statistical Learning. Some examples are:

  • Support vector machines: the function svm in the package e1071; ksvm in kernlab

  • Tree based methods in the packages rpart, tree, randomForest

  • Boosting methods: the functions glmboost and gamboost in package mboost

  • PenalizedLDA in the package PenalizedLDA, dudi.discr and dist.pcaiv in ade4).

The complexity and heterogeneity of choices of learning strategies, tuning parameters and evaluation criteria in each of these packages can be confusing. You will already have noted differences in the interfaces of the lda, qda and glmnet functions, i.e., in how they expect their input data to presented and what they return. There is even greater diversity across all the other packages and functions. At the same time, there are common tasks such as cross-validation, parameter tuning and performance assessment that are more or less the same no matter what specific method is used. As you have seen, e.g., in our estimate_mcl_loocv function, the looping and data shuffling involved led to rather verbose code.

So what to do if you want to try out and explore different learning algorithms? Fortunately, there are several projects that provide unified interfaces to the large number of different machine learning interfaces in R, and also try to provide “best practice” implementations of the common tasks such as parameter tuning and performance assessment. The two most well-known ones are the packages caret and mlr. Here were have a look at caret. You can get a list of supported methods through its getModelInfo function. There are quite a few, here we just show the first 8.

library("caret")
caretMethods = names(getModelInfo())
head(caretMethods, 8)
## [1] "ada"         "AdaBag"      "AdaBoost.M1" "adaboost"   
## [5] "amdai"       "ANFIS"       "avNNet"      "awnb"
length(caretMethods)
## [1] 238

We will check out a neural network method, the nnet function from the eponymous package. The parameter slot informs us on the the available tuning parameters160 They are described in the manual of the nnet function..

getModelInfo("nnet", regex = FALSE)[[1]]$parameter
##   parameter   class         label
## 1      size numeric #Hidden Units
## 2     decay numeric  Weight Decay

Let’s try it out.

trnCtrl = trainControl(
  method = "repeatedcv",
  repeats = 3,
  classProbs = TRUE)
tuneGrid = expand.grid(
  size = c(2, 4, 8),
  decay = c(0, 1e-2, 1e-1))
nnfit = train(
  Embryonic.day ~ Fn1 + Timd2 + Gata4 + Sox7,
  data = embryoCells,
  method = "nnet",
  tuneGrid  = tuneGrid,
  trControl = trnCtrl,
  metric = "Accuracy")

That’s quite a mouthful, but the nice thing is that this syntax is standardized and applies across many different methods. All you need to do specify the name of the method and the grid of tuning parameters that should be explored via the tuneGrid argument.

Now we can have a look at the output (Figure 12.25).

nnfit
## Neural Network 
## 
## 66 samples
##  4 predictor
##  3 classes: 'E3.25', 'E3.5', 'E4.5' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 59, 60, 59, 58, 59, 60, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   2     0.00   0.7646429  0.5599349
##   2     0.01   0.7569048  0.5580271
##   2     0.10   0.7643254  0.5709431
##   4     0.00   0.7486508  0.5335503
##   4     0.01   0.7641270  0.5735682
##   4     0.10   0.7655159  0.5713145
##   8     0.00   0.7854762  0.6156410
##   8     0.01   0.7755952  0.6045886
##   8     0.10   0.7504365  0.5433072
## 
## Accuracy was used to select the optimal model using the
##  largest value.
## The final values used for the model were size = 8 and decay = 0.

Parameter tuning of the neural net by cross-validation. Figure 12.25: Parameter tuning of the neural net by cross-validation.

plot(nnfit)
predict(nnfit) %>% head(10)
##  [1] E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25 E3.25
## Levels: E3.25 E3.5 E4.5

► Question 12.23

Will the accuracy that we obtained above for the optimal tuning parameters generalize to a new dataset? What could you do to address that?

► Solution

12.7.1 Method hacking

In Chapter 6 we encountered p-value hacking. A similar phenomenon exists in statistical learning: given a dataset, we explore various different methods of preprocessing (such as normalization, outlier detection, transformation, feature selection), try out different machine learning algorithms and tune their parameters until we are content with the result. The measured accuracy is likely to be too optimistic, i.e., will not generalize to a new dataset. Embedding as many of our methodical choices into a computational formalism and having an outer cross-validation loop (not to be confused with the inner loop that does the parameter tuning) will ameliorate the problem. But is unlikely to address it completely, since not all our choices can be formalized.

The gold standard remains validation on truly unseen data. In addition, it is never a bad thing if the classifier is not a black box but can be interpreted in terms of domain knowledge. Finally, report not just summary statistics, such as misclassification rates, but lay open the complete computational workflow, so that anyone (including your future self) can convince themselves of the robustness of the result or of the influence of the preprocessing, model selection and tuning choices (Holmes 2018).

12.8 Summary of this chapter

We have seen examples of machine learning applications; we have focused on predicting categorical variables (like diabetes type or cell class). Predicting continuous outcomes is also part of machine learning, although we have not considered it here. There are many parallels and overlaps between machine learning and statistical regression (which we studied in Chapter 8). One can consider them two different names for pretty much the same activity, although each has its own flavors: in machine learning, the emphasis is on the prediction of the outcome variables, whereas in regression we often care at least as much about the role of the covariates – which of them have an effect on the outcome, and what is the nature of these effects? In other words, we do not only want predictions, we also want to understand them.

We saw linear and quadratic discriminant analysis, two intuitive methods for partitioning a two-dimensional data plane (or a \(p\)-dimensional space) into regions using either linear or quadratic separation lines (or hypersurfaces). We also saw logistic regression, which takes a slightly different approach but is more amenable to operating in higher dimensions and to regularization.

We encountered the main challenge of machine learning: how to avoid overfitting? We explored why overfitting happens in the context of the so-called curse of dimensionality, and we learned how it may be overcome using regularization.

In other words, machine learning would be easy if we had infinite amounts of data representatively covering the whole space of possible inputs and outputs161 It would “just” be a formidable database / data management problem.. The challenge is to make the best out of a finite amount of training data, and to generalize these to new, unseen inputs. There is a vigorous trade-off between the amount, resolution and coverage of training data and the complexity of the model. Many models have continuous parameters that enable us to “tune” their complexity or the strength of their regularization. Cross-validation can help us with such tuning, although it is not a panacea, and caveats apply, as we saw in Section 12.6.3.

12.9 Further reading

12.10 Exercises

► Exercise 12.1

Apply a kernel support vector machine, available in the kernlab package, to the zeller microbiome data. What kernel function works well?

► Exercise 12.2

Use glmnet for a prediction of a continous variable, i.e., for regression. Use the prostate cancer data from Chapter 3 of (Hastie, Tibshirani, and Friedman 2008). The data are available in the CRAN package ElemStatLearn. Explore the effects of using ridge versus lasso penalty.

► Exercise 12.3

Consider smoothing as a regression and model selection problem (remember Figure 12.1). What is the equivalent quantity to the penalization parameter \(\lambda\) in Equation (12.8)? How do you choose it?

► Exercise 12.4

Scale invariance. Consider a rescaling of one of the features in the (generalized) linear model (12.7). For instance, denote the \(\nu\)-th column of \(x\) by \(x_{\cdot\nu}\), and suppose that \(p\ge2\) and that we rescale \(x_{\cdot\nu} \mapsto s\, x_{\cdot\nu}\) with some number \(s\neq0\). What will happen to the estimate \(\hat{\beta}\) from Equation (12.8) in (a) the unpenalized case (\(\lambda=0\)) and (b) the penalized case (\(\lambda>0\))?

► Exercise 12.5

It has been quipped that all classification methods are just refinements of two archetypal ideas: discriminant analysis and \(k\) nearest neighbors. In what sense might that be a useful classification?

References

Ambroise, Christophe, and Geoffrey J. McLachlan. 2002. “Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data.” PNAS 99 (10): 6562–6.

Bellman, Richard Ernest. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press.

Clemmensen, Line, Trevor Hastie, Daniela Witten, and Bjarne Ersbøll. 2011. “Sparse Discriminant Analysis.” Technometrics 53: 406–13.

Friedman, Jerome H. 1997. “On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality.” Data Mining and Knowledge Discovery 1: 55–77.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2008. The Elements of Statistical Learning. \(2^{\text{nd}}\). Springer.

Holmes, Susan. 2018. “Statistical Proof? The Problem of Irreproducibility.” Bulletin of the AMS 55 (1): 31–55.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer.

Neumann, B., T. Walter, J. K. Heriche, J. Bulkescher, H. Erfle, C. Conrad, P. Rogers, et al. 2010. “Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes.” Nature 464 (7289): 721–27.

Ohnishi, Y., W. Huber, A. Tsumura, M. Kang, P. Xenopoulos, K. Kurimoto, A. K. Oles, et al. 2014. “Cell-to-Cell Expression Variability Followed by Signal Reinforcement Progressively Segregates Early Mouse Lineages.” Nature Cell Biology 16 (1): 27–37.

Reaven, GM, and RG Miller. 1979. “An Attempt to Define the Nature of Chemical Diabetes Using a Multidimensional Analysis.” Diabetologia 16 (1). Springer: 17–24.

Witten, Daniela M, and Robert Tibshirani. 2011. “Penalized Classification Using Fisher’s Linear Discriminant.” JRSSB 73 (5): 753–72.

Zeller, Georg, Julien Tap, Anita Y Voigt, Shinichi Sunagawa, Jens Roat Kultima, Paul I Costea, Aurélien Amiot, et al. 2014. “Potential of Fecal Microbiota for Early-Stage Detection of Colorectal Cancer.” Molecular Systems Biology 10 (11): 766. https://doi.org/10.15252/msb.20145645.

Page built: 2019-05-15

Support for maintaining the online version of this book is provided by de.NBI
For website support please contact msmith@embl.de