Many datasets consist of several variables measured on the same set of subjects: patients, samples, or organisms. For instance, we may have biometric characteristics such as height, weight, age as well as clinical variables such as blood pressure, blood sugar, heart rate, and genetic data for, say, a thousand patients. The *raison d’être* for multivariate analysis is the investigation of connections or associations between the different variables measured. Usually the data are reported in a tabular data structure with one row for each subject and one column for each variable. In the following, we will focus on the special case where each of the variables is numeric, so we can represent the data structure as a *matrix* in R.

If the columns of the matrix are all independent of each other (unrelated), we can simply study each column separately and do standard “univariate” statistics on them one by one; there would be no benefit in studying them as a matrix.

More often, there will be patterns and dependencies. For instance, in the biology of cells, we know that the proliferation rate will influence the expression of many genes simultaneously. Studying the expression of 25,000 gene (columns) on many samples (rows) of patient-derived cells, we notice that many of the genes act together, either that they are positively correlated or that they are anti-correlated. We would miss a lot of important information if we were to only study each gene separately. Important connections between genes are detectable only if we consider the data as a whole: each row representing the many measurements made on the same observational unit. However, having 25,000 dimensions of variation to consider at once is daunting; we will show how to reduce our data to a smaller number of most important dimensions107 We will elaborate this idea of dimension reduction in much more detail below. For the time being, remember we live in a four-dimensional world. without losing too much information.

It might be a while since you saw the type of mathematics that underlies PCA, so we illustrate it more by pictures more than by formula notation. For a refresher, see the Khan academy material on Youtube, or textbooks such as that by Strang (2009). This chapter presents many examples of multivariate data matrices that we encounter in high-throughput experiments, as well as some more elementary examples that we hope will enhance your intuition. We will focus in this chapter on **P**rincipal **C**omponent **A**nalysis, abbreviated as **PCA**, a **dimension reduction** method. We will provide geometric explanations of the algorithm as well as visualizations that help interprete the output of PCA analyses.

In this chapter we will:

See examples of matrices that come up in the study of biological data.

Perform dimension reduction to understand correlations between variables.

Preprocess, rescale and center the data before starting a multivariate analysis.

Build new variables, called principal components (PC), that are more useful than the original measurements.

See what is “under the hood” of PCA: the singular value decomposition of a matrix.

Visualize what this decomposition achieves and learn how to choose the number of principal components.

Run through a complete PCA analysis from start to finish.

Project factor covariates onto the PCA map to enable a more useful interpretation of the results.

First, let’s look at a set of examples of rectangular **matrices** used to represent tables of measurements. In each matrix, the rows and columns represent specific entities.

**Turtles:** A very simple data set that will help us understand the basic principles is a matrix of three dimensions of biometric measurements on painted turtles.

```
= read.table("../data/PaintedTurtles.txt", header = TRUE)
turtles 1:4, ] turtles[
```

```
## sex length width height
## 1 f 98 81 38
## 2 f 103 84 38
## 3 f 103 86 42
## 4 f 105 86 40
```

The last three columns are length measurements (in millimetres), whereas the first column is a factor variable that tells us the sex of each animal.

**Athletes:** This matrix is an interesting example from the sports world. It reports the performances for 33 athletes in the ten disciplines of the decathlon: `m100`

, `m400`

and `m1500`

are times in seconds for the 100 meters, 400 meters, and 1500 meters respectively; `m110`

is the time to finish the 110 meters hurdles; `pole`

is the pole-vault height, and `highj`

and `long`

are the results of the high and long jumps, all in meters; `weight`

, `disc`

, and `javel`

are the lengths in meters the athletes were able to throw the weight, discus and javelin. Here are these variables for the first three athletes:

```
load("../data/athletes.RData")
1:3, ] athletes[
```

```
## m100 long weight highj m400 m110 disc pole javel m1500
## 1 11.25 7.43 15.48 2.27 48.90 15.13 49.28 4.7 61.32 268.95
## 2 10.87 7.45 14.97 1.97 47.71 14.46 44.36 5.1 61.76 273.02
## 3 11.18 7.44 14.20 1.97 48.29 14.81 43.66 5.2 64.16 263.20
```

**Cell Types:** Holmes et al. (2005) studied gene expression profiles of sorted T-cell populations from different subjects. The columns are a subset of gene expression measurements, they correspond to 156 genes that show differential expression between cell types.

```
load("../data/Msig3transp.RData")
round(Msig3transp,2)[1:5, 1:6]
```

```
## X3968 X14831 X13492 X5108 X16348 X585
## HEA26_EFFE_1 -2.61 -1.19 -0.06 -0.15 0.52 -0.02
## HEA26_MEM_1 -2.26 -0.47 0.28 0.54 -0.37 0.11
## HEA26_NAI_1 -0.27 0.82 0.81 0.72 -0.90 0.75
## MEL36_EFFE_1 -2.24 -1.08 -0.24 -0.18 0.64 0.01
## MEL36_MEM_1 -2.68 -0.15 0.25 0.95 -0.20 0.17
```

**Bacterial Species Abundances:** Matrices of counts are used in microbial ecology studies (as we saw in Chapter 4). Here the columns represent different species (or operational taxonomic units, OTUs) of bacteria, which are identified by numerical tags. The rows are labeled according to the samples in which they were measured, and the (integer) numbers represent the number of times of each of the OTUs was observed in each of the samples.

```
data("GlobalPatterns", package = "phyloseq")
= as.matrix(t(phyloseq::otu_table(GlobalPatterns)))
GPOTUs 1:4, 6:13] GPOTUs[
```

```
## OTU Table: [8 taxa and 4 samples]
## taxa are columns
## 246140 143239 244960 255340 144887 141782 215972 31759
## CL3 0 7 0 153 3 9 0 0
## CC1 0 1 0 194 5 35 3 1
## SV1 0 0 0 0 0 0 0 0
## M31Fcsw 0 0 0 0 0 0 0 0
```

Notice the propensity of the matrix entries to be zero; we call such data **sparse**.

**mRNA reads:** RNA-Seq transcriptome data report the number of sequence reads matching each gene108 Or sub-gene structures, such as exons. in each of several biological samples. We will study this type of data in detail in Chapter 8.

```
library("SummarizedExperiment")
data("airway", package = "airway")
assay(airway)[1:3, 1:4]
```

```
## SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003 679 448 873 408
## ENSG00000000005 0 0 0 0
## ENSG00000000419 467 515 621 365
```

It is customary in the RNA-Seq field—and so it is for the `airway`

data above—to report the genes in the rows and the samples in the columns. Compared to the other matrices we look at here, this is *transposed*: rows and columns are swapped. Such different conventions easily lead to errors, so they are worthwhile paying attention to109 The Bioconductor project tries to help users and developers to avoid such ambiguities by defining data containers in which such conventions are explicitly fixed. In Chapter 8, we will see the example of the *SummarizedExperiment* class.. **Proteomic profiles:** Here, the columns are aligned **mass spectroscopy** peaks or molecules identified through their \(m/z\)-ratios; the entries in the matrix are the measured intensities110 More details can be found, e.g., on Wikipedia..

```
= t(as.matrix(read.csv("../data/metabolites.csv", row.names = 1)))
metab 1:4, 1:4] metab[
```

```
## 146.0985388 148.7053275 310.1505057 132.4512963
## KOGCHUM1 29932.36 17055.70 1132.82 785.5129
## KOGCHUM2 94067.61 74631.69 28240.85 5232.0499
## KOGCHUM3 146411.33 147788.71 64950.49 10283.0037
## WTGCHUM1 229912.57 384932.56 220730.39 26115.2007
```

In many of the matrices we have seen here, important information about the samples (subjects) and the measured features is stored in the row or column names, often through some ad hoc string concatenation. This is not the best place to store all available information, and quickly becomes limiting and error-prone. A much better approach is the Bioconductor *SummarizedExperiment* class.

► Task

When a peak was not detected for a particular \(m/z\) score in the mass spectrometry run, a zero was recorded in `metab`

. Similarly, zeros in `GPOTUs`

or in the `airway`

object occur when there were no matching sequence reads deteced. Tabulate the frequencies of zeros in these data matrices.

► Question 113

a) What are the columns of these data matrices usually called ?

b) In each of these examples, what are the rows of the matrix ?

c) What does a cell in a matrix represent ?

d) If the data matrix is called `athletes`

and you want to see the

value of the third variable for the fifth athlete, what do you type into R?

If we are studying only one variable, i.e., just the third column of the turtles matrix111 The third column of a matrix \(X\) is denoted mathematically by \({\mathbf x}_{\cdot 3}\) or accessed in R using `X[, 3]`

., we say we are looking at one-dimensional data. Such a vector, say all the turtle weights, can be visualized by plots such as those that we saw in Section 3.6, e.g., a histogram. If we compute a one number summary, say mean or median, we have made a zero-dimensional summary of our one-dimensional data. This is already an example of dimension reduction.

In Chapter 3 we studied two-dimensional scatterplots. We saw that if there are too many observations, it can be beneficial to group the data into (hexagonal) bins: these are *two-dimensional* histograms. When considering two variables (\(x\) and \(y\)) measured together on a set of observations, the **correlation coefficient** measures how the variables co-vary. This is a single number summary of two-dimensional data. Its formula involves the summaries \(\bar{x}\) and \(\bar{y}\):

\[\begin{equation}
\hat{\rho}=
\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}
{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}
\sqrt{\sum_{j=1}^n (y_j-\bar{y})^2}}
\tag{7.1}
\end{equation}\]

In R, we use the `cor`

function to calculate its value. Applied to a matrix this function computes all the two way correlations between continuous variables. In Chapter 9 we will see how to analyse multivariate categorical data.

► Question 114

Compute the matrix of all correlations between the measurements from the turtles data. What do you notice ?

► Solution

It is always beneficial to start a multidimensional analysis by checking these simple one-dimensional and two-dimensional summary statistics using visual displays such as those we look at in the next two questions.

► Question 115

- Produce all pairwise scatterplots, as well as the one-dimensional histograms on the diagonal, for the turtles data. Use the package
**GGally**.

- Guess the underlying or “true dimension” of these data?

► Solution

► Question 116

Compute all pairwise correlations of the variables in the `athletes`

data and display the matrix in a heatmap. What do you notice?

► Solution

In many cases, different variables are measured in different units, so they have different baselines and different scales112 Common measures of scale are the range and the standard deviation. For instance, the times for the 110 metres vary between 14.18 and 16.2, with a standard deviation of 0.51, whereas the times to complete the 1500 metres vary between 256.64 and 303.17, with a standard deviation of 13.66; more than an order of magnitude larger. Moreover, the `athletes`

data also contain measurements in different units (seconds, metres), whose choice is arbitrary (lengths could also be recorded in centimetres or feet, times in milliseconds).. These are not directly comparable in their original form.

For PCA and many other methods, we therefore need to transform the numeric values to some common scale in order to make comparisons meaningful. **Centering** means subtracting the mean, so that the mean of the centered data is at the origin. **Scaling** or **standardizing** then means dividing by the standard deviation, so that the new standard deviation is \(1\). In fact, we have already encountered these operations when computing the correlation coefficient (Equation (7.1)): the correlation coefficient is simply the vector product of the centered and scaled variables. To perform these operations, there is the R function `scale`

, whose default behavior when given a matrix or a data frame is to make every column have a mean of zero and a standard deviation of \(1\).

► Question 117

- Compute the means and standard deviations of the
`turtle`

data, then use the`scale`

function to center and standardize the continuous variables. Call this`scaledTurtles`

, then verify the new values for mean and standard deviation of`scaledTurtles`

.

- Make a scatterplot of the scaled and centered width and height variables of the turtle data and color the points by their sex.

► Solution

We have already encountered other data transformation choices in Chapters 4 and 5, where we used the log and asinh functions. The aim of these transformations is (usually) variance stabilization, i.e., to make the variances of replicate measurements of *one and the same variable* in different parts of its dynamic range more similar. In contrast, the standardizing transformation described above aims to make the scale (as measured by mean and standard deviation) of *different variables* the same.

Sometimes it is preferable to leave variables at different scales because they are truly of different importance. If their original scale is relevant, then we can (should) leave the data as is. In other cases, the variables have different precisions known a priori. We will see in Chapter 9 that there are several ways of weighting such variables.

After preprocessing the data, we are ready to undertake data *simplification* through **dimension reduction**.

Useful books with relevant chapters are Flury (1997) for an introductory account and Mardia, Kent, and Bibby (1979) for a detailed mathematical approach.

We will explain dimension reduction from several different perspectives. It was invented in 1901 by Karl Pearson (Pearson 1901) as a way to reduce a two-variable scatterplot to a single coordinate. It was used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects (Hotelling 1933); thus providing overall scores that summarize many tested variables at once.

*Principal* and *principle* are two different words, which have different meanings. So please do not confuse them. With PCA, it is always *principal*. This idea of **principal** scores inspired the name Principal Component Analysis (abbreviated PCA). PCA is called an **unsupervised learning** technique because, as in clustering, it treats all variables as having the same **status**. We are not trying to predict or explain one particular variable’s value from the others; rather, we are trying to find a mathematical model for an underlying structure for all the variables. PCA is primarily an exploratory technique that produces maps that show the relations between variables and between observations in a useful way.

We first provide a flavor of what this multivariate analysis does to the data. There is an elegant mathematical formulation of these methods through linear algebra, although here we will try to minimize its use and focus on visualization and data examples.

We use geometrical **projections** that take points in higher-dimensional spaces and projects them down onto lower dimensions. Figure 7.5 shows the projection of the point \(A\) onto the line generated by the vector \({\mathbf v}\).

Figure 7.5: Point \(A\) is projected onto the red line generated by the vector \(v\). The dashed projection line is perpendicular (or **orthogonal**) to the red line. The intersection point of the projection line and the red line is called the orthogonal projection of A onto the red line generated by the vector \(v\).

PCA is a **linear** technique, meaning that we look for linear relations between variables and that we will use new variables that are linear functions of the original ones (\(f(ax+by)=af(x)+b(y)\)). The linearity constraints makes computations particularly easy. We will see non-linear techniques in Chapter 9.

Here we show one way of projecting two-dimensional data onto a line using the `athletes`

data. The code below provides the preprocessing and plotting steps that were used to generate Figure 7.6:

Figure 7.6: Scatterplot of two variables showing the projection on the horizontal x axis (defined by \(y=0\)) in red and the lines of projection appear as dashed.

```
= data.frame(scale(athletes))
athletes = ggplot(athletes, aes(x = weight, y = disc)) +
ath_gg geom_point(size = 2, shape = 21)
+ geom_point(aes(y = 0), colour = "red") +
ath_gg geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")
```

► Task

a) Calculate the variance of the red points in Figure 7.6.

b) Make a plot showing projection lines onto the \(y\) axis and projected points.

c) Compute the variance of the points projected onto the vertical \(y\) axis.

In general, we lose information about the points when we project from two dimensions (a plane) to one (a line). If we do it just by using the original coordinates, as we did on the weight variable in Figure 7.6, we lose all the information about the disc variable. Our goal is to keep as much information as we can about *both* variables. There are actually many ways of projecting the point cloud onto a line. One is to use what are known as **regression lines**. Let’s look at these lines and how they are constructed in **R**.

If you have seen linear regression, you already know how to compute lines that summarize scatterplots; **linear regression** is a **supervised** method that gives preference minimizing the residual sum of squares in one direction: that of the response variable.

In Figure 7.7, we use the `lm`

(linear model) function to find the regression line. Its slope and intercept are given by the values in the `coefficients`

slot of the resulting object `reg1`

.

Figure 7.7: The blue line minimizes the sum of squares of the vertical residuals (in red).

```
= lm(disc ~ weight, data = athletes)
reg1 = reg1$coefficients[1] # intercept
a1 = reg1$coefficients[2] # slope
b1 = ath_gg + geom_abline(intercept = a1, slope = b1,
pline1 col = "blue", lwd = 1.5)
+ geom_segment(aes(xend = weight, yend = reg1$fitted),
pline1 colour = "red", arrow = arrow(length = unit(0.15, "cm")))
```

Figure 7.8 shows the line produced when reversing the roles of the two variables; weight becomes the response variable.

Figure 7.8: The green line minimizes the sum of squares of the horizontal residuals (in orange).

```
= lm(weight ~ disc, data = athletes)
reg2 = reg2$coefficients[1] # intercept
a2 = reg2$coefficients[2] # slope
b2 = ath_gg + geom_abline(intercept = -a2/b2, slope = 1/b2,
pline2 col = "darkgreen", lwd = 1.5)
+ geom_segment(aes(xend=reg2$fitted, yend=disc),
pline2 colour = "orange", arrow = arrow(length = unit(0.15, "cm")))
```

Each of the regression lines in Figures 7.7 and 7.8 gives us an approximate linear relationship between `disc`

and `weight`

. However, the relationship differs depending on which of the variables we choose to be the predictor and which the response.

► Question 118

How large is the variance of the projected points that lie on the blue regression line of Figure 7.7? Compare this to the variance of the data when projected on the original axes, `weight`

and `disc`

.

► Solution

Figure 7.9 shows the line chosen to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto it; we call this the **principal component** line. All our three ways of fitting a line (Figures 7.7–7.9) together in one plot are shown in Figure 7.10.

Figure 7.9: The purple **principal component** line minimizes the sums of squares of the orthogonal projections.

```
= cbind(athletes$disc, athletes$weight)
xy = svd(xy)
svda = xy %*% svda$v[, 1] %*% t(svda$v[, 1])
pc = svda$v[2, 1] / svda$v[1, 1]
bp = mean(pc[, 2]) - bp * mean(pc[, 1])
ap + geom_segment(xend = pc[, 1], yend = pc[, 2]) +
ath_gg geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5)
```

Figure 7.10: The blue line minimizes the sum of squares of the vertical residuals, the green line minimizes the horizontal residuals, the purple line, called the **principal component**, minimizes the orthogonal projections. Notice the ordering of the slopes of the three lines.

► Question 119

- What is particular about the slope of the purple line?

- Redo the plots on the original (unscaled) variables. What happens?

► Solution

► Question 120

Compute the variance of the points on the purple line.

► Solution

Pythagoras’ theorem tells us two interesting things here:

If we are minimizing in both horizontal and vertical directions we are in fact minimizing the orthogonal projections onto the line from each point.

The total variability of the points is measured by the sum of squares of the projection of the points onto the center of gravity, which is the origin (0,0) if the data are centered. This is called the

*total variance*or the**inertia**of the point cloud. This inertia can be decomposed into the sum of the squares of the projections onto the line plus the variances along that line. For a fixed variance, minimizing the projection distances also maximizes the variance along that line. Often we define the first principal component as the line with maximum variance.

small Image credit: Sara Holmes The PC line we found in the previous section could be written

\[\begin{equation}
PC= \frac{1}{2} \mbox{disc} + \frac{1}{2} \mbox{weight}.
\tag{7.2}
\end{equation}\]

Principal components are *linear combinations* of the variables that were originally measured, they provide a *new coordinate system*. To understand what a **linear combination** really is, we can take an analogy. When making a healthy juice mix, you will follow a recipe like

\[\begin{equation*} V = 2 \times \text{Beet} + 1 \times \text{Carrot} + \frac{1}{2} \text{Gala} + \frac{1}{2} \text{GrannySmith} + 0.02 \times \text{Ginger} + 0.25 \times \text{Lemon}. \end{equation*}\]

This recipe is a linear combination of individual juice types (the original variables). The result is a new variable, \(V\), and the coefficients \((2,1,\frac{1}{2},\frac{1}{2},0.02,0.25)\) are called the **loadings**.

► Question 121

How would you compute the calories in a glass of juice?

A linear combination of variables defines a line in higher dimensions in the same way we constructed lines in the scatterplot plane of two dimensions. As we saw in that case, there are many ways to choose lines onto which we project the data, there is however a `best’ line for our purpose.

The total variance of all the points in all the variables can de decomposed. In PCA, we use the fact that the total sums of squares of the distances between the points and any line can be decomposed into the distance to the line and the variance along the line.

We saw that the principal component minimizes the distance to the line, and it also maximizes the variance of the projections along the line.

Why is maximizing the variance along a line a good idea? Let’s look at another example of a projection from three dimensions into two. In fact, human vision depends on such dimension reduction:

Figure 7.11: A mystery silhouette.

► Question 122

In Figure 7.11, there is a two-dimensional projection of a three-dimensional object. What is the object?

► Question 123

► Solution

Figure 7.12: Many choices have to be made during PCA processing.

PCA is based on the principle of finding the axis showing the largest inertia/variability, removing the variability in that direction and then iterating to find the next best orthogonal axis, and so on. In fact, we do not have to run iterations, all the axes can be found in one linear algebra operation called the **S**ingular **V**alue **D**ecomposition (we will delve more deeply into the details below).

In the diagram in Figure 7.12, we see that first the means and variances are computed and the choice of whether to work directly with the covariance matrix or with the correlation matrix has to be made. The next step is the choice of \(k\), the number of components we deem relevant to the data. We say that \(k\) is the rank of the approximation. The best choice of \(k\) is a difficult question, and we discuss on how to approach it below. The choice of \(k\) requires looking at a plot of the variances explained by the successive principal components. Once we have chosen \(k\), we can proceed to the projections of the data in the new \(k\)-dimensional subspace.

The end results of the PCA workflow are useful maps of both the variables and the samples. Understanding how these maps are constructed will maximize the information we can gather from them.

This is a small section for those whose background in linear algebra is but a faint memory. It tries to give some intuition to the singular value decomposition method underlying PCA, without too much notation.

Figure 7.13: Another two-dimensional projection of the same object shown in Figure 7.11. Here, the perspective is more informative. Generally, choosing the perspective such that the spread (in other words, the variance) of the points is maximal generally provides most information. We want to see as much of the variation as possible, that’s what PCA does.

The singular value decomposition of a matrix finds horizontal and vertical vectors (called the singular vectors) and normalizing values (called singular values). As before, we start by giving the forward-generative explanation before doing the actual reverse engineering that is used in creating the decomposition. To calibrate the meaning of each step, we will start with an artificial example before moving to the complexity of real data.

A simple generative model demonstrates the meaning of the **rank of a matrix** and explains how we find it in practice. Suppose we have two vectors, \(u\) (a one-column matrix), and \(v^t=t(v)\) (a one-row matrix–the transpose of a one-column matrix \(v\)). For instance, \(u =\left( \begin{smallmatrix} 1\\2\\3\\4 \end{smallmatrix} \right)\) and \(v =\left( \begin{smallmatrix} 2\\4\\8 \end{smallmatrix} \right)\). The transpose of \(v\) is written \(v^t = t(v) = (2\; 4\; 8)\). We multiply a copy of \(u\) by each of the elements of \(v^t\) in turn as follows:

Step 0:

X | 2 | 4 | 8 |
---|---|---|---|

1 | |||

2 | |||

3 | |||

4 |

Step 1:

X | 2 | 4 | 8 |
---|---|---|---|

1 | 2 | ||

2 | 4 | ||

3 | 6 | ||

4 | 8 |

Step 2:

X | 2 | 4 | 8 |
---|---|---|---|

1 | 2 | 4 | |

2 | 4 | 8 | |

3 | 6 | 12 | |

4 | 8 | 16 |

Step 3:

X | 2 | 4 | 8 |
---|---|---|---|

1 | 2 | 4 | 8 |

2 | 4 | 8 | 16 |

3 | 6 | 12 | 24 |

4 | 8 | 16 | 32 |

Thus, the \((2,3)\) entry of the matrix \(X\), written \(x_{2,3}\), is obtained by multiplying \(u_2\) by \(v_3\). We can write this

\[\begin{equation}
X=\begin{pmatrix}
2&4&8\\ 4&8&16\\
6 &12&24\\
8&16&32\\
\end{pmatrix}
= u * t(v)= u * v^t
\tag{7.3}
\end{equation}\]

The matrix \(X\) we obtain here is said to be of rank 1, because both \(u\) and \(v\) have one column.

► Question 124

Why can we say that writing \(X = u*v^t\) is more economical than spelling out the full matrix \(X\)?

► Solution

On the other hand, suppose that we want to reverse the process and simplify another matrix \(X\) given below with 3 rows and 4 columns (12 numbers). Can we always express it in a similar way as a product of vectors without loss of information? In the diagrams shown in Figures 7.14 and 7.15, the colored boxes have areas proportional to the numbers in the cells of the matrix ((7.4)).

► Question 125

Here is a matrix \(X\) we want to decompose.

Figure 7.14: Some special matrices have numbers in them that make them easy to decompose. Each colored rectangle in this diagram has an area that corresponds to the number in it.

\[\begin{equation}
\begin{array}{rrrrr}
\hline
{\large X} & x_{.1} & x_{.2} & x_{.3} & x_{.4} \\
\hline
x_{1.} & 780 & 936 & 1300 & 728\\
x_{2.} & 75 & 90 & 125 & 70\\
x_{3.} & 540 & 648 & 900 & 504\\
\hline
\end{array}
\tag{7.4}
\end{equation}\]

\(X\) has been redrawn as series of rectangles in Figure 7.14. What numbers could we put in the white \(u\) and \(v\) boxes so that the values of the sides of the rectangle give the numbers as their product?

A matrix with the special property of being perfectly “rectangular” like \(X\) is said to be of rank 1. We can represent the numbers in \(X\) by the areas of rectangles, where the sides of rectangles are given by the values in the side vectors (\(u\) and \(v\)).

We see in Figure 7.15 that the decomposition of \(X\) is not unique: there are several candidate choices for the vectors \(u\) and \(v\). We will make the choice unique by requiring that the sum of the squares of each vector’s elements add to 1 (we say the vectors \(v\) and \(u\) have norm 1). Then we have to keep track of one extra number by which to multiply each of the products, and which represents the “overall scale” of \(X\). This is the value we have put in the upper left hand corner. It is called the singular value \(s_1\). In the R code below, we start by supposing we know the values in `u`

, `v`

and `s1`

; later we will see a function that finds them for us. Let’s check the multiplication and norm properties in R:

```
= matrix(c(780, 75, 540,
X 936, 90, 648,
1300, 125, 900,
728, 70, 504), nrow = 3)
= c(0.8196, 0.0788, 0.5674)
u = c(0.4053, 0.4863, 0.6754, 0.3782)
v = 2348.2
s1 sum(u^2)
```

`## [1] 1`

`sum(v^2)`

`## [1] 1`

`* u %*% t(v) s1 `

```
## [,1] [,2] [,3] [,4]
## [1,] 780 936 1300 728
## [2,] 75 90 125 70
## [3,] 540 648 900 504
```

`- s1 * u %*% t(v) X `

```
## [,1] [,2] [,3] [,4]
## [1,] -0.03419 0.0745 0.1355 0.1221
## [2,] 0.00403 0.0159 0.0252 0.0186
## [3,] -0.00903 0.0691 0.1182 0.0982
```

► Question 126

Try `svd(X)`

in R. Look at the components of the output of the `svd`

function carefully. Check the norm of the columns of the matrices that result from this call. Where did the above value of `s1`

= 2348.2 come from?

► Solution

In fact, in this particular case we were lucky: we see that the second and third singular values are 0 (up to the numeric precision we care about). That is why we say that \(X\) is of **rank** 1. For a more general matrix \(X\), it is rare to be able to write \(X\) exactly as this type of two-vector product. The next subsection shows how we can decompose \(X\) when it is not of rank 1: we will just need more pieces.

In the above decomposition, there were three elements: the horizontal and vertical singular vectors, and the diagonal corner, called the singular value. These can be found using the singular value decomposition function (`svd`

). For instance:

```
= matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32,
Xtwo 18, 28, 52, 36, 18, 10.5, 64.5, 36), ncol = 4, byrow = TRUE)
= svd(Xtwo) USV
```

► Question 127

Look at the `USV`

object, the result of calling the `svd`

function. What are its components?

► Solution

► Question 128

Check how each successive pair of singular vectors improves our approximation to `Xtwo`

. What do you notice about the third and fourth singular values?

► Solution

Again, there are many ways to write a rank two matrix such as `Xtwo`

as a sum of rank one matrices: in order to ensure uniqueness, we impose yet another113 Above, we had chosen the norm of the vectors to be 1. condition on the singular vectors. The output vectors of the singular decomposition do not only have their norms equal to 1, each column vector in the \(U\) matrix is orthogonal to all the previous ones. We write \(u_{\cdot 1} \perp u_{\cdot 2}\), this means that the sum of the products of the values in the same positions is \(0\): \(\sum_i u_{i1} u_{i2} = 0\). Ditto for the \(V\) matrix.

► Task

Check the orthonormality by computing the cross product of the \(U\) and \(V\) matrices:

```
t(USV$u) %*% USV$u
t(USV$v) %*% USV$v
```

Let’s submit our `scaledTurtles`

matrix to a singular value decomposition.

```
= svd(scaledTurtles)
turtles.svd $d turtles.svd
```

`## [1] 11.746475 1.419035 1.003329`

`$v turtles.svd`

```
## [,1] [,2] [,3]
## [1,] 0.5787981 -0.3250273 -0.74789704
## [2,] 0.5779840 -0.4834699 0.65741263
## [3,] 0.5752628 0.8127817 0.09197088
```

`dim(turtles.svd$u)`

`## [1] 48 3`

► Question 129

What can you conclude about the turtles matrix from the `svd`

output?

► Solution

“`[r, SumRankOneD, eval = TRUE, echo = FALSE, fig.show = 'hold', fig.keep = 'high']{} knitr::include_graphics(c('images/SumRankOneD.png'), dpi = NA) "`

\(X\) is decomposed additively into rank-one pieces. Each of the \(u\) vectors is combined into the \(U\) matrix, and each of \(v\) vectors into \(V\). The *Singular Value Decomposition* is

\[\begin{equation}
{\mathbf X} = U S V^t, V^t V={\mathbb I}, U^t U={\mathbb I},
\tag{7.5}
\end{equation}\]

where \(S\) is the diagonal matrix of singular values, \(V^t\) is the transpose of \(V\), and \({\mathbb I}\) is the Identity matrix. Expression (7.5) can be written elementwise as

\[\begin{equation*} X_{ij} = u_{i1}s_1v_{1j} + u_{i2}s_2v_{2j} + u_{i3}s_3v_{3j} +... + u_{ir}s_rv_{rj}, \end{equation*}\]

\(U\) and \(V\) are said to be orthonormal114 Nothing to do with the normal distribution, it stands for orthogonal and having norm 1., because their self-crossproducts are the identity matrix.

The singular vectors from the singular value decomposition (provided by the `svd`

function in R) contain the coefficients to put in front of the original variables to make the more informative ones we call the principal components. We write this as:

\[\begin{equation}
Z_1=v_{11} X_{\cdot 1} +v_{21} X_{\cdot 2} + v_{31} X_{\cdot 3}+ \cdots + v_{p1} X_{\cdot p}.
\tag{7.6}
\end{equation}\]

If `usv = svd(X)`

, then \((v_{11},v_{21},v_{31},...)\) are given by the first column of `usv$v`

; similarly for \(Z_2\) with the second column of `usv$v`

, and so son. \(p\) is the number of columns of \(X\) and the number of rows of \(V\). These new variables \(Z_1, Z_2, Z_3, ...\) have variances that decrease in size: \(s_1^2 \geq s_2^2 \geq s_3^2 \geq ...\).

► Question 130

Compute the first principal component for the turtles data by multiplying by the first singular value `d[1]`

by `u[,1]`

. What is another way of computing it ?

► Solution

Here are two useful facts, first in words, then with the mathematical shorthand.

The number of principal components \(k\) is always chosen to be fewer than the number of original variables or the number of observations. We are “lowering” the dimension of the problem:

\[\begin{equation*} k\leq \min(n,p). \end{equation*}\]

The principal component transformation is defined so that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components:

\[\begin{equation*} \max_{aX \perp bX}\mbox{var}(\mbox{Proj}_{aX} (X)), \qquad \mbox{where } bX=\mbox{previous components} \end{equation*}\]

We revisit our two-variable athletes data with the `discus`

and the `weight`

variables. In section 7.3.2, we computed the first principal component and represented it as the purple line in Figure 7.10. We showed that \(Z_1\) was the linear combination given by the diagonal. As the coefficients have to have their sum of squares add to \(1\), we have that

Z1 = -0.707*athletes\(disc -0.707*athletes\)weight. This is the same as if the two coordinates were \(c_1=0.7071\) and \(c_2=0.7071\).

► Question 131

What part of the output of the svd functions leads us to the first PC coefficients, also known as the PC **loadings** ?

Note that we use `svda`

which was the svd applied to the two variables `discus`

and `weight`

.

► Solution

If we rotate the `(discus, weight)`

plane by making the purple line the horizontal \(x\) axis, we obtain what is know as the first **principal plane**.

```
= tibble(PC1n = -svda$u[, 1] * svda$d[1],
ppdf PC2n = svda$u[, 2] * svda$d[2])
= ggplot(ppdf, aes(x = PC1n, y = PC2n)) +
gg geom_point() +
geom_hline(yintercept = 0, color = "purple", lwd = 1.5, alpha = 0.5) +
xlab("PC1 ")+ ylab("PC2") + xlim(-3.5, 2.7) + ylim(-2, 2) + coord_fixed()
+ geom_point(aes(x = PC1n, y = 0), color = "red") +
gg geom_segment(aes(xend = PC1n, yend = 0), color = "red")
+ geom_point(aes(x = 0, y = PC2n), color = "blue") +
gg geom_segment(aes(yend = PC2n, xend = 0), color = "blue") +
geom_vline(xintercept = 0, color = "skyblue", lwd = 1.5, alpha = 0.5)
```

► Question 132

► Solution

► Task

Use `prcomp`

to compute the PCA of the first two columns of the athletes data, look at the output. Compare to the singular value decomposition.

We now want to do a complete PCA analysis on the turtles data. Remember, we already looked at the summary statistics for the one- and two-dimensional data. Now we are going to answer the question about the “true” dimensionality of these rescaled data.

In the following code, we use the function `princomp`

. Its return value is a list of all the important pieces of information needed to plot and interpret a PCA.

In fact, PCA is such a fundamental technique that there are many different implementations of it in various R packages. Unfortunately, the input arguments and the formating and naming of their output is not standardized, and some even use different conventions for the scaling of their output. We will experiment with several different ones to familiarize ourselves with these choices.

`cor(scaledTurtles)`

```
## length width height
## length 1.0000000 0.9783116 0.9646946
## width 0.9783116 1.0000000 0.9605705
## height 0.9646946 0.9605705 1.0000000
```

```
= princomp(scaledTurtles)
pcaturtles pcaturtles
```

```
## Call:
## princomp(x = scaledTurtles)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3
## 1.6954576 0.2048201 0.1448180
##
## 3 variables and 48 observations.
```

Figure 7.17: The screeplot shows the eigenvalues for the standardized turtles data (`scaledTurtles`

): there is one large value and two small ones. The data are (almost) one-dimensional. We will see why this dimension is called an axis of size, a frequent phenomenon in biometric data (Jolicoeur, Mosimann, et al. 1960), .

```
library("factoextra")
fviz_eig(pcaturtles, geom = "bar", bar_width = 0.4) + ggtitle("")
```

► Question 133

Many PCA functions have been created by different teams who worked in different areas at different times. This can lead to confusion, especially because they have different naming conventions. Let’s compare three of them; run the following lines of code and look at the resulting objects:

```
svd(scaledTurtles)$v[, 1]
prcomp(turtles[, -1])$rotation[, 1]
princomp(scaledTurtles)$loadings[, 1]
library("ade4")
dudi.pca(turtles[, -1], nf = 2, scannf = FALSE)$c1[, 1]
```

What happens when you disable the scaling in the `prcomp`

and `princomp`

functions?

In what follows, we always suppose that the matrix \(X\) represents the centered and scaled matrix.

► Question 134

The coordinates of the observations in the new variables from the `prcomp`

function (call it `res`

) are in the `scores`

slot of the result. Take a look at PC1 for the `turtles`

and compare it to `res$scores`

. Compare the standard deviation sd1 to that in the `res`

object and to the standard deviation of the scores.

► Solution

► Question 135

Check the orthogonality of the `res$scores`

matrix. Why can’t we say that it is **orthonormal**?

Now we are going to combine both the PC scores (\(US\)) and the loadings-coefficients (\(V\)). The plots with both the samples and the variables represented are called **biplots**. This can be done in one line using the following **factoextra** package function.

```
fviz_pca_biplot(pcaturtles, label = "var", habillage = turtles[, 1]) +
ggtitle("")
```

Beware the aspect ratio when plotting a PCA. It is rare to have the two components be of similar norm, so square shaped plots will be the exception. More common are elongated plots, which show that the horizontal (first) principal component is more important than the second. This matters, e.g., for interpreting distances between points in the plots.

► Question 136

Is it possible to have a PCA plot with the PC1 as the horizontal axis whose height is longer than its width?

► Solution

► Question 137

Looking at Figure 7.18: a) Did the males or female turtles tend to be larger?

b) What do the arrows tell us about the correlations?

► Question 138

Compare the variance of each new coordinate to the eigenvalues returned by the PCA dudi.pca function.

► Solution

Now we look at the relationships between the variables, both old and new by drawing what is known as the correlation circle. The aspect ratio is 1 here and the variables are represented by arrows as shown in Figure 7.19. The lengths of the arrows indicate the quality of the projections onto the first principal plane:

Figure 7.19: Part of the “circle of correlations” showing the original variables. Their correlations with each other and with the new principal components are given by the angles between the vectors and between the axes and the vectors.

```
fviz_pca_var(pcaturtles, col.circle = "black") + ggtitle("") +
xlim(c(-1.2, 1.2)) + ylim(c(-1.2, 1.2))
```

► Question 139

Explain the relationships between the number of rows of our turtles data matrix and the following numbers:

```
svd(scaledTurtles)$d/pcaturtles$sdev
sqrt(47)
```

► Solution

These data are a good example of how sometimes almost all the variation in the data can be captured in a lower-dimensional space: here, three-dimensional data can be essentially replaced by a line. If we only want the first one then it is just \(c_1=s_1 u_1\). Notice that \(||c_1||^2=s_1^tu_1 u_1^t s_1= s_1^2 u_1^tu_1=s_1^2=\lambda_1\)

Keep in mind: \(X^tC=VSU^tUS=VS^2.\) The **principal components** are the columns of the matrix \(C=US\). The \(p\) columns of \(U\) (the matrix given as `USV$u`

in the output from the `svd`

function above) are rescaled to have norms \((s_1^2,s_2^2,...,s_p^2)\). Each column has a different variance it is *responsible* for explaining. Notice that these will be decreasing numbers.

If the matrix \(X\) comes from the study of \(n\) different samples or specimens, then the principal components provides new coordinates for these \(n\) points as in Figure 7.16. These are sometimes called the *scores* in the results of PCA functions.

Figure 7.20: Another great xkcd take: this time eigenvectors.

Before we go into more detailed examples, let’s summarize what SVD and PCA provide:

Each principal component has a variance measured by the corresponding eigenvalue, the square of the corresponding singular value.

The new variables are made to be orthogonal. Since they are also centered, this means they are uncorrelated. In the case of normal distributed data, this also means they are independent.

**Eigen Decomposition:**The crossproduct of X with itself verifies \[X^tX=VSU^tUSV^t=VS^2V^t=V\Lambda V^t\] where \(V\) is called the eigenvector matrix of the symmetric matrix \(X^tX\) and \(\Lambda\) is the diagonal matrix of eigenvalues of \(X^tX\).When the variables are have been rescaled, the sum of the variances of all the variables is the number of variables (\(=p\)). The sum of the variances is computed by adding the diagonal of the crossproduct matrix116 This sum of the diagonal elements is called

**the trace**of the matrix..The principal components are ordered by the size of their eigenvalues. We always check the screeplot before deciding how many components to retain. It is also best practice to do as we did in Figure 7.18 and annotate each PC axis with the proportion of variance it explains.

► Task

Look up eigenvalue in the Wikipedia. Try to find a sentence that defines it without using a formula. Why would eigenvectors come into use in Cinderella (at a stretch)? (See the xkcd cartoon in Figure 7.20.)

For help with the basics of linear algebra, a motivated student pressed for time may consult Khan's Academy. If you have more time and would like in depth coverage, Gil Strang's MIT course is a classic, and some of the book is available online (Strang 2009).

We looked briefly at part of this data earlier in the chapter; here we will follow step by step a complete multivariate analysis. First we take a second look at the rounded correlation matrix capturing the essential of the bivariate associations. This was represented with colors in Figure 7.3.

`cor(athletes) %>% round(1)`

```
## m100 long weight highj m400 m110 disc pole javel m1500
## m100 1.0 -0.5 -0.2 -0.1 0.6 0.6 0.0 -0.4 -0.1 0.3
## long -0.5 1.0 0.1 0.3 -0.5 -0.5 0.0 0.3 0.2 -0.4
## weight -0.2 0.1 1.0 0.1 0.1 -0.3 0.8 0.5 0.6 0.3
## highj -0.1 0.3 0.1 1.0 -0.1 -0.3 0.1 0.2 0.1 -0.1
## m400 0.6 -0.5 0.1 -0.1 1.0 0.5 0.1 -0.3 0.1 0.6
## m110 0.6 -0.5 -0.3 -0.3 0.5 1.0 -0.1 -0.5 -0.1 0.1
## disc 0.0 0.0 0.8 0.1 0.1 -0.1 1.0 0.3 0.4 0.4
## pole -0.4 0.3 0.5 0.2 -0.3 -0.5 0.3 1.0 0.3 0.0
## javel -0.1 0.2 0.6 0.1 0.1 -0.1 0.4 0.3 1.0 0.1
## m1500 0.3 -0.4 0.3 -0.1 0.6 0.1 0.4 0.0 0.1 1.0
```

Then we start looking at the screeplot, which will help us choose a rank \(k\) for these data.

```
= dudi.pca(athletes, scannf = FALSE)
pca.ath $eig pca.ath
```

```
## [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275
## [7] 0.4305952 0.3067981 0.2669494 0.1018542
```

Figure 7.21: The screeplot is the first thing to consult, it tells us that it is satisfactory to use a two-dimensional plot.

`fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")`

Figure 7.21 shows that eigenvalues in the screeplot make a clear drop after the second eigenvalue. This indicates a good approximation will be obtained at rank 2. Let’s look at an interpretation of the first two axes by projecting the loadings of the original (old) variables as they project onto the two new ones.

Figure 7.22: Correlation circle of the original variables.

`fviz_pca_var(pca.ath, col.circle = "black") + ggtitle("")`

The correlation circle Figure 7.22 displays the projection of the original variables onto the two first new principal axes. The angles between vectors are interpreted as correlations. On the right side of the plane, we have the track and field events (m110, m100, m400, m1500), and on the left, we have the throwing and jumping events. Maybe there is an opposition of skills as characterized in the correlation matrix. We did see the correlations were negative between variables from these two groups. How can we interpret this?

It seems that those who throw the best have lower scores in the track competitions. In fact, if we look at the original measurements, we can see what is happening. The athletes who run in short times are the stronger ones, as are the ones who throw or jump longer distances. We should probably change the scores of the track variables and redo the analysis.

► Question 140

What transformations of the variables induce the best athletic performances to vary in the same direction, i.e. be mostly positively correlated?

► Solution

Figure 7.23: Correlation circle after changing the signs of the running variables.

`fviz_pca_var(pcan.ath, col.circle="black") + ggtitle("")`

Figure 7.23 shows the correlation circle of the transformed variable. We now see we have an axis of **size**, all the arrows are pointing in the same direction.

Now all the negative correlations are quite small ones. The screeplot will show no change, as the eigenvalues are unchanged. The signs in the coefficients of the PC loadings for the \(m\) variables are the only ouput that changes.

We now plot the athletes projected in the first principal plane using:

Figure 7.24: First principal plane showing the projections of the athletes. Do you notice something about the organization of the numbers?

`fviz_pca_ind(pcan.ath) + ggtitle("") + ylim(c(-2.5,5.7))`

► Question 141

► Solution

It turns out there is complementary information available on the `olympic`

data. An extra variable called `score`

is a vector of the final scores at the competition (men’s decathlon at the 1988 Olympics).

```
data("olympic", package = "ade4")
$score olympic
```

```
## [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114
## [13] 8093 8083 8036 8021 7869 7860 7859 7781 7753 7745 7743 7623
## [25] 7579 7517 7505 7422 7310 7237 7231 7016 6907
```

We now make the scatterplot comparing the first principal component score of the athletes to this score from the data. This is shown in Figure 7.25; we can see a strong correlation between the two variables. We notice that the athlete number one (who in fact won the Olympic decathlon gold medal) has the highest score but not the highest value for the PC1 linear combination. Why do you think that is?

Figure 7.25: Scatterplot of the scores given as a supplementary variable and the first principal component. The points are labeled by their order in the data set. We can see a very strong correlation between this supplementary score variable and the first principal coordinate, why is it not a perfectly linear fit?

Figure 7.26: A screeplot showing ‘dangerously’ similar variances. Choosing to cutoff at a hard threshold of 80% of the variance would give unstable PC plots. With so such cutoff, the axes corresponding to the 3D subspace of 3 similar eigenvalues are unstable and cannot be individually interpreted.

We have seen in the examples that the first step in PCA is to make the screeplot of the variances of the new variables (equal to the **eigenvalues**). We cannot decide how many dimensions are needed before seeing this plot. The reason is that there are situations when the principal components are ill-defined: when two or three successive PCs have very similar variances, giving a screeplot as in Figure 7.26, the subspace corresponding to a group of similar eigenvalues exists. In this case this would be 3D space generated by \(u_2,u_3,u_4\). The vectors are not meaningful individually and one cannot interpret their loadings. This is because a very slight change in one observations could give a completely different set of three vectors. These would generate the same 3D space, but could have very different loadings. We say the PCs are unstable.

We have seen that unlike regression, PCA treats all variables equally (to the extent that they were preprocessed to have equivalent standard deviations). However, it is still possible to map other continuous variables or categorical factors onto the plots in order to help interpret the results. Often we have complementary information on the samples, for example, diagnostic labels in the diabetes data or the cell types in the T-cell gene expression data.

Here we see how we can use such extra variables to inform our interpretation. The best place to store such so-called *metadata* is in appropriate slots of the data object (such as in the Bioconductor *SummarizedExperiment* class); the second-best, in additional columns of the data frame that also contains the numeric data. In practice, such information is often stored in a more or less cryptic manner in the row names of the matrix. Below, we need to face the latter scenario, and we use `substr`

gymnastics to extract the cell types and show the screeplot in Figure 7.27 and the PCA in Figure 7.28.

Figure 7.27: T-cell expression PCA screeplot.

```
= dudi.pca(Msig3transp, center = TRUE, scale = TRUE,
pcaMsig3 scannf = FALSE, nf = 4)
fviz_screeplot(pcaMsig3) + ggtitle("")
```

```
= rownames(Msig3transp)
ids = factor(substr(ids, 7, 9))
celltypes = factor(substr(ids, 1, 3))
status table(celltypes)
```

```
## celltypes
## EFF MEM NAI
## 10 9 11
```

```
cbind(pcaMsig3$li, tibble(Cluster = celltypes, sample = ids)) %>%
ggplot(aes(x = Axis1, y = Axis2)) +
geom_point(aes(color = Cluster), size = 5) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
scale_color_discrete(name = "Cluster") + coord_fixed()
```

These data requires delicate preprocessing before we obtain our desired matrix with the relevant features as columns and the samples as rows. Starting with the raw mass spectroscopy readings, the steps involve extracting peaks of relevant features, aligning them across multiple samples and estimating peak heights. We refer the reader to the vignette of the Bioconductor **xcms** package for gruesome details. We load a matrix of data generated in such a way from the file `mat1xcms.RData`

. The output of the below code is in Figures 7.29 and 7.30.

```
load("../data/mat1xcms.RData")
dim(mat1)
```

`## [1] 399 12`

Figure 7.29: Screeplot showing the eigenvalues for the mice data.

```
= dudi.pca(t(mat1), scannf = FALSE, nf = 3)
pcamat1 fviz_eig(pcamat1, geom = "bar", bar_width = 0.7) + ggtitle("")
```

```
= cbind(pcamat1$li, tibble(
dfmat1 label = rownames(pcamat1$li),
number = substr(label, 3, 4),
type = factor(substr(label, 1, 2))))
= ggplot(dfmat1,
pcsplot aes(x=Axis1, y=Axis2, label=label, group=number, colour=type)) +
geom_text(size = 4, vjust = -0.5)+ geom_point(size = 3)+ylim(c(-18,19))
+ geom_hline(yintercept = 0, linetype = 2) +
pcsplot geom_vline(xintercept = 0, linetype = 2)
```

► Question 142

Looking at Figure 7.30, do the samples seem to be randomly placed in the plane? Do you notice any structure explained by the labels?

► Solution

In the previous example, the number of variables measured was too large to enable useful concurrent plotting of both variables and samples. In this example we plot the PCA biplot of a simple data set where chemical measurements were made on different wines for which we also have a categorical `wine.class`

variable. We start the analysis by looking at the two-dimensional correlations and a heatmap of the variables.

```
library("pheatmap")
load("../data/wine.RData")
load("../data/wineClass.RData")
1:2, 1:7] wine[
```

```
## Alcohol MalicAcid Ash AlcAsh Mg Phenols Flav
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76
```

Figure 7.31: The difference between 1 and the correlation can be used as a distance between variables and is used to make a heatmap of the associations between the variables.

`pheatmap(1 - cor(wine), treeheight_row = 0.2)`

```
= dudi.pca(wine, scannf=FALSE)
winePCAd table(wine.class)
```

```
## wine.class
## barolo grignolino barbera
## 59 71 48
```

```
fviz_pca_biplot(winePCAd, geom = "point", habillage = wine.class,
col.var = "violet", addEllipses = TRUE, ellipse.level = 0.69) +
ggtitle("") + coord_fixed()
```

A **biplot** is a simultaneous representation of both the space of observations and the space of variables. In the case of a PCA biplot like Figure 7.32 the arrows represent the directions of the old variables as they project onto the plane defined by the first two new axes. Here the observations are just colored dots, the color has been chosen according to which type of wine is being plotted. We can interpret the variables’ directions with regards to the sample points, for instance the blue points are from the barbera group and show higher Malic Acid content than the other wines.

Interpretation of multivariate plots requires the use of as much of the available information as possible; here we have used the samples and their groups as well as the variables to understand the main differences between the wines.

Sometimes we want to see variability between different groups or observations, but want to weight them. This can be the case if, e.g., the groups have very different sizes. Let’s re-examine the Hiiragi data we already saw in Chapter 3 (Ohnishi et al. 2014). In the code below, we select the wildtype (WT) samples and the top 100 features with the highest overall variance.

```
data("x", package = "Hiiragi2013")
= x[, x$genotype == "WT"]
xwt = order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
sel = xwt[sel, ]
xwt = table(xwt$sampleGroup)
tab tab
```

```
##
## E3.25 E3.5 (EPI) E3.5 (PE) E4.5 (EPI) E4.5 (PE)
## 36 11 11 4 4
```

Figure 7.33: Screeplot from the weighted PCA of the Hiiragi data. The drop after the second eigenvalue suggests that a two-dimensional PCA is appropriate.

```
$weight = 1 / as.numeric(tab[xwt$sampleGroup])
xwt= dudi.pca(as.data.frame(t(Biobase::exprs(xwt))),
pcaMouse row.w = xwt$weight,
center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
fviz_eig(pcaMouse) + ggtitle("")
```

```
fviz_pca_ind(pcaMouse, geom = "point", col.ind = xwt$sampleGroup) +
ggtitle("") + coord_fixed()
```

We see from `tab`

that the groups are represented rather unequally. To account for this, we reweigh each sample by the inverse of its group size. The function `dudi.pca`

in the **ade4** package has a `row.w`

argument into which we can enter the weights. The output of the code is in Figures 7.33 and 7.34.

**Preprocessing matrices** Multivariate data analyses require “conscious” preprocessing. After consulting all the means, variances and one-dimensional histograms we saw how to rescale and recenter the data.

**Projecting onto new variables** We saw how we can make projections into lower dimensions (2D planes and 3D are the most frequently used) of high dimensional data without losing too much information. PCA searches for new “more informative” variables that are linear combinations of the original (old) ones.

**Matrix decomposition** PCA is based on finding decompositions of the matrix \(X\) called SVD. This decomposition provides a lower rank approximation and is equivalent to the eigendecomposition of \(X^tX\). The squares of the singular values are equal to the eigenvalues and to the variances of the new variables. We systematically plotted these values before deciding how many axes are necessary to reproduce the signal in the data.

Two eigenvalues which are quite close can give rise to scores or PC scores which are highly unstable. It is always necessary to look at the screeplot of the eigenvalues and avoid separating the axes corresponding to the these close eigenvalues. This may require using interactive three or four-dimensional projections, which are available in several R packages.

**Biplot representations** The space of observations is naturally a \(p\)-dimensional space (the \(p\) original variables provide the coordinates). The space of variables is \(n\)-dimensional. Both decompositions we have studied (singular values / eigenvalues and singular vectors / eigenvectors) provide new coordinates for both of these spaces, sometimes we call one the dual of the other. We can plot the projection of both the observations and the variables onto the same eigenvectors. This provides a biplot that can be useful for interpreting the PCA output.

**Projecting other group variables** Interpretation of PCA can also be facilitated by redundant or contiguous data about the observations.

The best way to deepen your understanding of singular value decomposition is to read Chapter 7 of Strang (2009). The whole book sets the foundations for the linear algebra necessary to understanding the meaning of the rank of matrix and the duality between row spaces and column spaces (Holmes 2006).

Complete textbooks have been written on the subject of PCA and related methods. Mardia, Kent, and Bibby (1979) is a standard text that covers all multivariate methods in a classical way, with linear algebra and matrices. By making the parametric assumptions that the data come from multivariate normal distributions, Mardia, Kent, and Bibby (1979) also provide inferential tests for the number of components and limiting properties for principal components. Jolliffe (2002) is a book-long treatment of everything to do with PCA with extensive examples.

We can incorporate supplementary information into weights for the observations and the variables. This was introduced in the 1970’s by French data scientists, see Holmes (2006) for a review and chapter 9 for further examples.

Improvements to the interpretation and stability of PCA can be obtained by adding a penalty that minimizes the number of nonzero coefficients that appear in the linear combinations. Zou, Hastie, and Tibshirani (2006) and Witten, Tibshirani, and Hastie (2009) have developed sparse versions of principal component analysis, and their packages **elasticnet** and **PMA** provide implementations in R.

**► Exercise 7.1 **Revise the material about svd by reading sections 1, 2, and 3 of the Wikipedia article about SVD. It will also be beneficial to read about the related eigenvalue decomposition by reading sections 1, 2, and 2.1 of the Wikipedia article about eigendecomposition of a matrix. We know that we can decompose a \(n\) row by \(p\) column rank 1 matrix \(X\) as :

\[\begin{equation*} **X** = \begin{pmatrix} x_{11} & x_{12} &... & x_{1p}\\ x_{21} & x_{22} &... & x_{2p}\\ \vdots & \vdots &\vdots & \vdots \\ x_{n1} & x_{n2} &... & x_{np} \end{pmatrix} = \begin{pmatrix} u_{11} \\ u_{21} \\ \vdots \\ u_{n1} \\ \end{pmatrix} \times \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{p1} \end{pmatrix} \end{equation*}\]

7.1a: if \(X\) has no rows and no columns which are all zeros, then is this decomposition unique?

7.1b: generate a rank-one matrix. Start by taking a vector of length 15 with values from 2 to 30 in increments of 2, and a vector of length 4 with values 3, 6, 9, 12, then take their outer product.

```
= seq(2, 30, by = 2)
u = seq(3, 12, by = 3)
v = u %*% t(v) X1
```

Why do we have to take `t(v)`

?

7.1c: now we add some noise in the form a matrix we call `Materr`

so we have an “approximately rank-one” matrix.

```
= matrix(rnorm(60,1),nrow=15,ncol=4)
Materr = X1+Materr X
```

Visualize \(X\) using `ggplot`

.

7.1d: redo the same analyses with a rank 2 matrix.

▢

► Solution

Note that `X1`

can also be computed as

`outer(u, v)`

`ggplot(data = data.frame(X), aes(x = X1, y = X2, col = X3, size = X4)) + geom_point()`

Here we see that the data looks linear in all four dimensions. This is what it means to be of rank-one. Now let’s consider a rank 2 matrix.

```
= 100
n = 4
p = outer(rnorm(n), rnorm(p)) + outer(rnorm(n), rnorm(p))
Y2 head(Y2)
```

```
## [,1] [,2] [,3] [,4]
## [1,] 1.5203274 0.09229365 -1.2597163 0.1320842
## [2,] -1.7032438 -0.34644284 0.9929607 -0.9437504
## [3,] 5.6160688 1.33394309 -2.9442531 3.7392226
## [4,] -1.1374357 -0.98983024 -0.6423422 -3.1136265
## [5,] 0.6513664 0.47624255 0.2119162 1.4864286
## [6,] 0.5597983 -0.64450113 -1.6316123 -2.1728502
```

`ggplot(data=data.frame(Y2), aes(x=X1, y=X2, col=X3, size=X4)) + geom_point()`

Now there are obviously at least two dimensions because if we project the data onto the first two coordinates (by default called X1 and X2 when you convert a matrix into a data frame in **R**), then the data varies in both dimensions. So the next step is to try to decide if there are more than two dimensions. The top right points are the closest to you (they’re biggest) and as you go down and left in the plot those points are farther away. In the left are the bluest points and they seem to get darker linearly as you move right. As you can probably tell, it is very hard to visually discover a low dimensional space in higher dimensions, even when “high dimensions” only means 4! This is one reason why we rely on the singular value decomposition.

`svd(Y2)$d # two non-zero eigenvalues`

`## [1] 2.689135e+01 1.443597e+01 2.836391e-15 1.111141e-15`

```
= Y2 + matrix(rnorm(n*p, sd=0.01),n,p) # add some noise to Y2
Y svd(Y)$d # four non-zero eigenvalues (but only 2 big ones)
```

`## [1] 26.87816277 14.44632707 0.11840460 0.09038033`

Here we have two dimensions which are non-zero and two dimensions which are approximately 0 (for “Y2”, they are within square root of computer tolerance of 0).

**► Exercise 7.2 **7.2a: create a first a matrix of highly correlated bivariate data such as that shown in Figure 7.35.

Hint: Use the function `mvrnorm`

.

Check the rank of the matrix by looking at its singular values.

7.2b: perform a PCA and show the rotated principal component axes.

▢

► Solution

7.2a: we generate correlated bivariate normal data using:

```
= 1; mu2 = 2; s1=2.5; s2=0.8; rho=0.9;
mu1 = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2)
sigma library("MASS")
= data.frame(mvrnorm(50, mu = c(mu1,mu2), Sigma = sigma))
sim2d svd(scale(sim2d))$d
```

`## [1] 9.589139 2.459351`

`svd(scale(sim2d))$v[,1]`

`## [1] 0.7071068 0.7071068`

```
ggplot(data.frame(sim2d),aes(x=X1,y=X2)) +
geom_point()
=princomp(sim2d)
respc= data.frame(pc1=respc$scores[,1],
dfpc pc2 = respc$scores[,2])
ggplot(dfpc,aes(x=pc1,y=pc2)) +
geom_point() + coord_fixed(2)
```

7.2b: We use prcomp to perform a PCA and the scores provide the desired rotation.

**► Exercise 7.3 **Part (A) in Figure 7.35 shows a very elongated plotting region, why?

What happens if you do not use the `coord_fixed()`

option and have a square plotting zone? Why can this be misleading?

▢

**► Exercise 7.4 **Let’s revisit the Hiiragi data and compare the weighted and unweighted approaches.

7.4a: make a correlation circle for the unweighted Hiiragi data xwt. Which genes have the best projections on the first principal plane (best approximation)?

7.4b: make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the the sample-points.

▢

Page built: 2022-10-01 using R version 4.2.1 (2022-06-23)

Support for maintaining the online version of this book is provided by de.NBI

For website support please contact msmith@embl.de