```
= read.table("../data/PaintedTurtles.txt", header = TRUE)
turtles 1:4, ] turtles[
```

```
sex length width height
1 f 98 81 38
2 f 103 84 38
3 f 103 86 42
4 f 105 86 40
```

- 7.1 Goals for this chapter
- 7.2 What are the data? Matrices and their motivation
- 7.3 Dimension reduction
- 7.4 The new linear combinations
- 7.5 The PCA workflow
- 7.6 The inner workings of PCA: rank reduction
- 7.7 Plotting the observations in the principal plane
- 7.8 PCA as an exploratory tool: using extra information
- 7.9 Summary of this chapter
- 7.10 Further reading
- 7.11 Exercises

Many datasets consist of several variables measured on the same set of subjects: patients, samples, or organisms. For instance, we may have biometric characteristics such as height, weight, age as well as clinical variables such as blood pressure, blood sugar, heart rate, and genetic data for, say, a thousand patients. The *raison d’être* for multivariate analysis is the investigation of connections or associations between the different variables measured. Usually the data are reported in a tabular data structure with one row for each subject and one column for each variable. In the following, we will focus on the special case where each of the variables is numeric, so we can represent the data structure as a *matrix* in R.

If the columns of the matrix are all independent of each other (unrelated), we can simply study each column separately and do standard “univariate” statistics on them one by one; there would be no benefit in studying them as a matrix.

More often, there will be patterns and dependencies. For instance, in the biology of cells, we know that the proliferation rate will influence the expression of many genes simultaneously. Studying the expression of 25,000 gene (columns) on many samples (rows) of patient-derived cells, we notice that many of the genes act together, either that they are positively correlated or that they are anti-correlated. We would miss a lot of important information if we were to only study each gene separately. Important connections between genes are detectable only if we consider the data as a whole: each row representing the many measurements made on the same observational unit. However, having 25,000 dimensions of variation to consider at once is daunting; we will show how to reduce our data to a smaller number of most important dimensions^{1} without losing too much information.

^{1} We will elaborate this idea of dimension reduction in much more detail below. For the time being, remember we live in a four-dimensional world.

This chapter presents many examples of multivariate data matrices that we encounter in high-throughput experiments, as well as some more elementary examples that we hope will enhance your intuition. We will focus in this chapter on **P**rincipal **C**omponent **A**nalysis, abbreviated as **PCA**, a **dimension reduction** method. We will provide geometric explanations of the algorithm as well as visualizations that help interprete the output of PCA analyses.

In this chapter we will:

See examples of matrices that come up in the study of biological data.

Perform dimension reduction to understand correlations between variables.

Preprocess, rescale and center the data before starting a multivariate analysis.

Build new variables, called principal components (PC), that are more useful than the original measurements.

See what is “under the hood” of PCA: the singular value decomposition of a matrix.

Visualize what this decomposition achieves and learn how to choose the number of principal components.

Run through a complete PCA analysis from start to finish.

Project factor covariates onto the PCA map to enable a more useful interpretation of the results.

First, let’s look at a set of examples of rectangular **matrices** used to represent tables of measurements. In each matrix, the rows and columns represent specific entities.

**Turtles:** A simple data set that will help us understand the basic principles is a matrix of three dimensions of biometric measurements on painted turtles (Jolicoeur and Mosimann 1960).

```
sex length width height
1 f 98 81 38
2 f 103 84 38
3 f 103 86 42
4 f 105 86 40
```

The last three columns are length measurements (in millimetres), whereas the first column is a factor variable that tells us the sex of each animal.

**Athletes:** This matrix is an interesting example from the sports world. It reports the performances for 33 athletes in the ten disciplines of the decathlon: `m100`

, `m400`

and `m1500`

are times in seconds for the 100 meters, 400 meters, and 1500 meters respectively; `m110`

is the time to finish the 110 meters hurdles; `pole`

is the pole-vault height, and `high`

and `long`

are the results of the high and long jumps, all in meters; `weight`

, `disc`

, and `javel`

are the lengths in meters the athletes were able to throw the weight, discus and javelin. Here are these variables for the first three athletes:

```
data("olympic", package = "ade4")
athletes = setNames(olympic$tab,
c("m100", "long", "weight", "high", "m400", "m110", "disc", "pole", "javel", "m1500"))
athletes[1:3, ]
```

```
m100 long weight high m400 m110 disc pole javel m1500
1 11.25 7.43 15.48 2.27 48.90 15.13 49.28 4.7 61.32 268.95
2 10.87 7.45 14.97 1.97 47.71 14.46 44.36 5.1 61.76 273.02
3 11.18 7.44 14.20 1.97 48.29 14.81 43.66 5.2 64.16 263.20
```

**Cell Types:** Holmes et al. (2005) studied gene expression profiles of sorted T-cell populations from different subjects. The columns are a subset of gene expression measurements, they correspond to 156 genes that show differential expression between cell types.

```
X3968 X14831 X13492 X5108 X16348 X585
HEA26_EFFE_1 -2.61 -1.19 -0.06 -0.15 0.52 -0.02
HEA26_MEM_1 -2.26 -0.47 0.28 0.54 -0.37 0.11
HEA26_NAI_1 -0.27 0.82 0.81 0.72 -0.90 0.75
MEL36_EFFE_1 -2.24 -1.08 -0.24 -0.18 0.64 0.01
MEL36_MEM_1 -2.68 -0.15 0.25 0.95 -0.20 0.17
```

**Bacterial Species Abundances:** Matrices of counts are used in microbial ecology studies (as we saw in Chapter 4). Here the columns represent different species (or operational taxonomic units, OTUs) of bacteria, which are identified by numerical tags. The rows are labeled according to the samples in which they were measured, and the (integer) numbers represent the number of times of each of the OTUs was observed in each of the samples.

```
data("GlobalPatterns", package = "phyloseq")
GPOTUs = as.matrix(t(phyloseq::otu_table(GlobalPatterns)))
GPOTUs[1:4, 6:13]
```

```
OTU Table: [4 taxa and 8 samples]
taxa are rows
246140 143239 244960 255340 144887 141782 215972 31759
CL3 0 7 0 153 3 9 0 0
CC1 0 1 0 194 5 35 3 1
SV1 0 0 0 0 0 0 0 0
M31Fcsw 0 0 0 0 0 0 0 0
```

**mRNA reads:** RNA-Seq transcriptome data report the number of sequence reads matching each gene^{2} in each of several biological samples. We will study this type of data in detail in Chapter 8

^{2} Or sub-gene structures, such as exons.

```
SRR1039508 SRR1039509 SRR1039512 SRR1039513
ENSG00000000003 679 448 873 408
ENSG00000000005 0 0 0 0
ENSG00000000419 467 515 621 365
```

It is customary in the RNA-Seq field—and so it is for the `airway`

data above—to report the genes in the rows and the samples in the columns. Compared to the other matrices we look at here, this is *transposed*: rows and columns are swapped. Such different conventions easily lead to errors, so they are worthwhile paying attention to^{3}. **Proteomic profiles:** Here, the columns are aligned **mass spectroscopy** peaks or molecules identified through their \(m/z\)-ratios; the entries in the matrix are the measured intensities^{4}.

^{3} The Bioconductor project tries to help users and developers to avoid such ambiguities by defining data containers in which such conventions are explicitly fixed. In Chapter 8, we will see the example of the *SummarizedExperiment* class.

^{4} More details can be found, e.g., on Wikipedia.

```
146.0985388 148.7053275 310.1505057 132.4512963
KOGCHUM1 29932.36 17055.70 1132.82 785.5129
KOGCHUM2 94067.61 74631.69 28240.85 5232.0499
KOGCHUM3 146411.33 147788.71 64950.49 10283.0037
WTGCHUM1 229912.57 384932.56 220730.39 26115.2007
```

Task

When a peak was not detected for a particular \(m/z\) score in the mass spectrometry run, a zero was recorded in `metab`

. Similarly, zeros in `GPOTUs`

or in the `airway`

object occur when there were no matching sequence reads detected. Tabulate the frequencies of zeros in these data matrices.

What are the columns of these data matrices usually called?

In each of these examples, what are the rows of the matrix?

What does a cell in a matrix represent?

If the data matrix is called

`athletes`

and you want to see the value of the third variable for the fifth athlete, what do you type into R?

If we are studying only one variable, i.e., just the third column of the turtles matrix^{5}, we say we are looking at one-dimensional data. Such a vector, say all the turtle weights, can be visualized by plots such as those that we saw in Section 3.6, e.g., a histogram. If we compute a one number summary, say mean or median, we have made a zero-dimensional summary of our one-dimensional data. This is already an example of dimension reduction.

^{5} The third column of a matrix \(X\) is denoted mathematically by \({\mathbf x}_{\cdot 3}\) or accessed in R using `X[, 3]`

.

In Chapter 3 we studied two-dimensional scatterplots. We saw that if there are too many observations, it can be beneficial to group the data into (hexagonal) bins: these are *two-dimensional* histograms. When considering two variables (\(x\) and \(y\)) measured together on a set of observations, the **correlation coefficient** measures how the variables co-vary. This is a single number summary of two-dimensional data. Its formula involves the summaries \(\bar{x}\) and \(\bar{y}\):

\[ \hat{\rho}= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} \sqrt{\sum_{j=1}^n (y_j-\bar{y})^2}} \tag{7.1}\]

In R, we use the `cor`

function to calculate its value. Applied to a matrix this function computes all the two way correlations between continuous variables. In Chapter 9 we will see how to analyse multivariate categorical data.

Compute the matrix of all correlations between the measurements from the turtles data. What do you notice ?

Solution

We take out the categorical variable and compute the matrix.

```
length width height
length 1.0000000 0.9783116 0.9646946
width 0.9783116 1.0000000 0.9605705
height 0.9646946 0.9605705 1.0000000
```

We see that this square matrix is symmetric and the values are all close to 1. The diagonal values are always 1.

It is always beneficial to start a multidimensional analysis by checking these simple one-dimensional and two-dimensional summary statistics using visual displays such as those we look at in the next two questions.

- Produce all pairwise scatterplots, as well as the one-dimensional histograms on the diagonal, for the turtles data. Use the package
**GGally**.

- Guess the underlying or “true dimension” of these data?

Solution

From Figure 7.2, it looks like all three of the variables are highly correlated and mostly reflect the same “underlying” variable, which we might interpret as *the size* of the turtle.

Compute all pairwise correlations of the variables in the `athletes`

data and display the matrix in a heatmap. What do you notice?

Solution

Figure 7.3 shows how the 10 variables cluster into groups: running, throwing and jumping.

In many cases, different variables are measured in different units, so they have different baselines and different scales^{6}. These are not directly comparable in their original form.

^{6} Common measures of scale are the range and the standard deviation. For instance, the times for the 110 metres vary between 14.18 and 16.2, with a standard deviation of 0.51, whereas the times to complete the 1500 metres vary between 256.64 and 303.17, with a standard deviation of 13.66; more than an order of magnitude larger. Moreover, the `athletes`

data also contain measurements in different units (seconds, metres), whose choice is arbitrary (lengths could also be recorded in centimetres or feet, times in milliseconds).

For PCA and many other methods, we therefore need to transform the numeric values to some common scale in order to make comparisons meaningful. **Centering** means subtracting the mean, so that the mean of the centered data is at the origin. **Scaling** or **standardizing** then means dividing by the standard deviation, so that the new standard deviation is \(1\). In fact, we have already encountered these operations when computing the correlation coefficient (Equation 7.1): the correlation coefficient is simply the vector product of the centered and scaled variables. To perform these operations, there is the R function `scale`

, whose default behavior when given a matrix or a data frame is to make every column have a mean of zero and a standard deviation of \(1\).

- Compute the means and standard deviations of the
`turtle`

data, then use the`scale`

function to center and standardize the continuous variables. Call this`scaledTurtles`

, then verify the new values for mean and standard deviation of`scaledTurtles`

.

- Make a scatterplot of the scaled and centered width and height variables of the turtle data and color the points by their sex.

Solution

```
length width height
20.481602 12.675838 8.392837
```

```
length width height
124.68750 95.43750 46.33333
```

```
length width height
-1.432050e-18 1.940383e-17 -2.870967e-16
```

```
length width height
1 1 1
```

```
data.frame(scaledTurtles, sex = turtles[, 1]) %>%
ggplot(aes(x = width, y = height, group = sex)) +
geom_point(aes(color = sex)) + coord_fixed()
```

We have already encountered other data transformation choices in Chapters 4 and 5, where we used the `log`

and `asinh`

functions. The aim of these transformations is (usually) variance stabilization, i.e., to make the variances of replicate measurements of *one and the same variable* in different parts of its dynamic range more similar. In contrast, the standardizing transformation described above aims to make the scale (as measured by mean and standard deviation) of *different variables* the same.

Sometimes it is preferable to leave variables at different scales because they are truly of different importance. If their original scale is relevant, then we can (should) leave the data as is. In other cases, the variables have different precisions known a priori. We will see in Chapter 9 that there are several ways of weighting such variables.

After preprocessing the data, we are ready to undertake data *simplification* through **dimension reduction**.

We will explain dimension reduction from several different perspectives. It was invented in 1901 by Karl Pearson (Pearson 1901) as a way to reduce a two-variable scatterplot to a single coordinate. It was used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects (Hotelling 1933); thus providing overall scores that summarize many tested variables at once.

This idea of **principal** scores inspired the name Principal Component Analysis (abbreviated PCA). PCA is called an **unsupervised learning** technique because, as in clustering, it treats all variables as having the same **status**. We are not trying to predict or explain one particular variable’s value from the others; rather, we are trying to find a mathematical model for an underlying structure for all the variables. PCA is primarily an exploratory technique that produces maps that show the relations between variables and between observations in a useful way.

We first provide a flavor of what this multivariate analysis does to the data. There is an elegant mathematical formulation of these methods through linear algebra, although here we will try to minimize its use and focus on visualization and data examples.

We use geometrical **projections** that take points in higher-dimensional spaces and projects them down onto lower dimensions. Figure 7.5 shows the projection of the point \(A\) onto the line generated by the vector \({\mathbf v}\).

PCA is a **linear** technique, meaning that we look for linear relations between variables and that we will use new variables that are linear functions of the original ones (\(f(ax+by)=af(x)+b(y)\)). The linearity constraints makes computations particularly easy. We will see non-linear techniques in Chapter 9.

Here we show one way of projecting two-dimensional data onto a line using the `athletes`

data. The code below provides the preprocessing and plotting steps that were used to generate Figure 7.6:

```
athletes = data.frame(scale(athletes))
ath_gg = ggplot(athletes, aes(x = weight, y = disc)) +
geom_point(size = 2, shape = 21)
ath_gg + geom_point(aes(y = 0), colour = "red") +
geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")
```

Task

- Calculate the variance of the red points in Figure 7.6.

- Make a plot showing projection lines onto the \(y\) axis and projected points.

- Compute the variance of the points projected onto the vertical \(y\) axis.

In general, we lose information about the points when we project from two dimensions (a plane) to one (a line). If we do it just by using the original coordinates, as we did on the `weight`

variable in Figure 7.6, we lose all the information about the `disc`

variable. Our goal is to keep as much information as we can about *both* variables. There are actually many ways of projecting the point cloud onto a line. One is to use what are known as **regression lines**. Let’s look at these lines and how they are constructed in R.

If you have seen linear regression, you already know how to compute lines that summarize scatterplots; **linear regression** is a **supervised** method that gives preference minimizing the residual sum of squares in one direction: that of the response variable.

`disc`

variable on `weight`

.In Figure 7.7, we use the `lm`

(linear model) function to find the regression line. Its slope and intercept are given by the values in the `coefficients`

slot of the resulting object `reg1`

.

```
reg1 = lm(disc ~ weight, data = athletes)
a1 = reg1$coefficients[1] # intercept
b1 = reg1$coefficients[2] # slope
pline1 = ath_gg + geom_abline(intercept = a1, slope = b1,
col = "blue", linewidth = 1.5)
pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted),
colour = "red", arrow = arrow(length = unit(0.15, "cm")))
```

`weight`

on `discus`

.Figure 7.8 shows the line produced when reversing the roles of the two variables; `weight`

becomes the response variable.

```
reg2 = lm(weight ~ disc, data = athletes)
a2 = reg2$coefficients[1] # intercept
b2 = reg2$coefficients[2] # slope
pline2 = ath_gg + geom_abline(intercept = -a2/b2, slope = 1/b2,
col = "darkgreen", linewidth = 1.5)
pline2 + geom_segment(aes(xend=reg2$fitted, yend=disc),
colour = "orange", arrow = arrow(length = unit(0.15, "cm")))
```

Each of the regression lines in Figures 7.7 and 7.8 gives us an approximate linear relationship between `disc`

and `weight`

. However, the relationship differs depending on which of the variables we choose to be the predictor and which the response.

How large is the variance of the projected points that lie on the blue regression line of Figure 7.7? Compare this to the variance of the data when projected on the original axes, `weight`

and `disc`

.

Solution

Pythagoras’ theorem tells us that the squared length of the hypotenuse of a right-angled triangle is equal to the sum of the squared lengths of the other two sides, which we apply as follows:

The variances of the points along the original axes `weight`

and `disc`

are 1, since we scaled the variables.

Figure 7.9 shows the line chosen to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto it; we call this the **principal component** line. All our three ways of fitting a line (Figures 7.7–7.9) together in one plot are shown in Figure 7.10.

```
xy = cbind(athletes$disc, athletes$weight)
svda = svd(xy)
pc = xy %*% svda$v[, 1] %*% t(svda$v[, 1])
bp = svda$v[2, 1] / svda$v[1, 1]
ap = mean(pc[, 2]) - bp * mean(pc[, 1])
ath_gg + geom_segment(xend = pc[, 1], yend = pc[, 2]) +
geom_abline(intercept = ap, slope = bp, col = "purple", linewidth = 1.5)
```

- What is particular about the slope of the purple line?

- Redo the plots on the original (unscaled) variables. What happens?

Solution

The lines computed here depend on the choice of units. Because we have made the standard deviations equal to one for both variables, the PCA line is the diagonal that cuts exactly in the middle of both regression lines. Since the data were centered by subtracting their means, the line passes through the origin \((0,0)\).

Solution

We have computed the coordinates of the points when we made the plot, these are in the `pc`

vector:

We see that the variance along this axis is larger than the other variances we calculated in Question 7.6.

Pythagoras’ theorem tells us two interesting things here:

If we are minimizing in both horizontal and vertical directions we are in fact minimizing the orthogonal projections onto the line from each point.

The total variability of the points is measured by the sum of squares of the projection of the points onto the center of gravity, which is the origin (0,0) if the data are centered. This is called the

*total variance*or the**inertia**of the point cloud. This inertia can be decomposed into the sum of the squares of the projections onto the line plus the variances along that line. For a fixed variance, minimizing the projection distances also maximizes the variance along that line. Often we define the first principal component as the line with maximum variance.

The PC line we found in the previous section could be written

Image credit: Sara Holmes

\[ PC = \frac{1}{2} \mbox{disc} + \frac{1}{2} \mbox{weight}. \tag{7.2}\]

Principal components are *linear combinations* of the variables that were originally measured, they provide a *new coordinate system*. To understand what a **linear combination** really is, we can take an analogy. When making a healthy juice mix, you will follow a recipe like

\[ \begin{align} V &= 2 \times \text{Beet} + 1 \times \text{Carrot} \\ &+ \tfrac{1}{2} \text{Gala} + \tfrac{1}{2} \text{GrannySmith} \\ &+ 0.02 \times \text{Ginger} + 0.25 \times \text{Lemon}. \end{align} \]

This recipe is a linear combination of individual juice types (the original variables). The result is a new variable, \(V\), and the coefficients \((2,1,\frac{1}{2},\frac{1}{2},0.02,0.25)\) are called the **loadings**.

A linear combination of variables defines a line in higher dimensions in the same way we constructed lines in the scatterplot plane of two dimensions. As we saw in that case, there are many ways to choose lines onto which we project the data, there is however a `best’ line for our purpose.

The total variance of all the points in all the variables can de decomposed. In PCA, we use the fact that the total sums of squares of the distances between the points and any line can be decomposed into the distance to the line and the variance along the line.

We saw that the principal component minimizes the distance to the line, and it also maximizes the variance of the projections along the line.

Why is maximizing the variance along a line a good idea? Let’s look at another example of a projection from three dimensions into two. In fact, human vision depends on such dimension reduction:

In Figure 7.11, there is a two-dimensional projection of a three-dimensional object. What is the object?

Which of the two projections, Figure 7.11 or 7.13, do you find more informative, and why?

Solution

One can argue that the projection that maximizes the area of the shadow shows more `information’.

PCA is based on the principle of finding the axis showing the largest inertia/variability, removing the variability in that direction and then iterating to find the next best orthogonal axis, and so on. In fact, we do not have to run iterations, all the axes can be found in one linear algebra operation called the **S**ingular **V**alue **D**ecomposition (we will delve more deeply into the details below).

In the diagram in Figure 7.12, we see that first the means and variances are computed and the choice of whether to work directly with the covariance matrix or with the correlation matrix has to be made. The next step is the choice of \(k\), the number of components we deem relevant to the data. We say that \(k\) is the rank of the approximation. The best choice of \(k\) is a difficult question, and we discuss on how to approach it below. The choice of \(k\) requires looking at a plot of the variances explained by the successive principal components. Once we have chosen \(k\), we can proceed to the projections of the data in the new \(k\)-dimensional subspace.

The end results of the PCA workflow are useful maps of both the variables and the samples. Understanding how these maps are constructed will maximize the information we can gather from them.

This is a small section for those whose background in linear algebra is but a faint memory. It tries to give some intuition to the singular value decomposition method underlying PCA, without too much notation.

The singular value decomposition of a matrix finds horizontal and vertical vectors (called the singular vectors) and normalizing values (called singular values). As before, we start by giving the forward-generative explanation before doing the actual reverse engineering that is used in creating the decomposition. To calibrate the meaning of each step, we will start with an artificial example before moving to the complexity of real data.

A simple generative model demonstrates the meaning of the **rank of a matrix** and explains how we find it in practice. Suppose we have two vectors, \(u\) (a one-column matrix), and \(v^t=t(v)\) (a one-row matrix–the transpose of a one-column matrix \(v\)). For instance, \(u =\left( \begin{smallmatrix} 1\\2\\3\\4 \end{smallmatrix} \right)\) and \(v =\left( \begin{smallmatrix} 2\\4\\8 \end{smallmatrix} \right)\). The transpose of \(v\) is written \(v^t = t(v) = (2\; 4\; 8)\). We multiply a copy of \(u\) by each of the elements of \(v^t\) in turn as follows:

Step 0:

X | 2 | 4 | 8 |
---|---|---|---|

1 | |||

2 | |||

3 | |||

4 |

Step 1:

X | 2 | 4 | 8 |
---|---|---|---|

1 | 2 | ||

2 | 4 | ||

3 | 6 | ||

4 | 8 |

Step 2:

X | 2 | 4 | 8 |
---|---|---|---|

1 | 2 | 4 | |

2 | 4 | 8 | |

3 | 6 | 12 | |

4 | 8 | 16 |

Step 3:

X | 2 | 4 | 8 |
---|---|---|---|

1 | 2 | 4 | 8 |

2 | 4 | 8 | 16 |

3 | 6 | 12 | 24 |

4 | 8 | 16 | 32 |

Thus, the \((2,3)\) entry of the matrix \(X\), written \(x_{2,3}\), is obtained by multiplying \(u_2\) by \(v_3\). We can write this

\[ X=\begin{pmatrix} 2&4&8\\ 4&8&16\\ 6 &12&24\\ 8&16&32\\ \end{pmatrix} = u * t(v)= u * v^t \tag{7.3}\]

The matrix \(X\) we obtain here is said to be of rank 1, because both \(u\) and \(v\) have one column.

Why can we say that writing \(X = u*v^t\) is more economical than spelling out the full matrix \(X\)?

Solution

\(X\) has \(4\times3=12\) elements, while in terms of \(u\) and \(v\) it can be expressed by only \(4+3=7\) numbers. The compression is even more impressive when \(u\) or \(v\) are longer.

On the other hand, suppose that we want to reverse the process and simplify another matrix \(X\) given below with 3 rows and 4 columns (12 numbers). Can we always express it in a similar way as a product of vectors without loss of information? In the diagrams shown in Figures 7.14 and 7.15, the colored boxes have areas proportional to the numbers in the cells of the matrix (7.4).

Here is a matrix \(X\) we want to decompose.

\[ \begin{array}{rrrrr} \hline {\large X} & x_{.1} & x_{.2} & x_{.3} & x_{.4} \\ \hline x_{1.} & 780 & 936 & 1300 & 728\\ x_{2.} & 75 & 90 & 125 & 70\\ x_{3.} & 540 & 648 & 900 & 504\\ \hline \end{array} \tag{7.4}\]

\(X\) has been redrawn as series of rectangles in Figure 7.14. What numbers could we put in the white \(u\) and \(v\) boxes so that the values of the sides of the rectangle give the numbers as their product?

A matrix with the special property of being perfectly “rectangular” like \(X\) is said to be of rank 1. We can represent the numbers in \(X\) by the areas of rectangles, where the sides of rectangles are given by the values in the side vectors (\(u\) and \(v\)).

We see in Figure 7.15 that the decomposition of \(X\) is not unique: there are several candidate choices for the vectors \(u\) and \(v\). We will make the choice unique by requiring that the sum of the squares of each vector’s elements add to 1 (we say the vectors \(v\) and \(u\) have norm 1). Then we have to keep track of one extra number by which to multiply each of the products, and which represents the “overall scale” of \(X\). This is the value we have put in the upper left hand corner. It is called the singular value \(s_1\). In the R code below, we start by supposing we know the values in `u`

, `v`

and `s1`

; later we will see a function that finds them for us. Let’s check the multiplication and norm properties in R:

```
X = matrix(c(780, 75, 540,
936, 90, 648,
1300, 125, 900,
728, 70, 504), nrow = 3)
u = c(0.8196, 0.0788, 0.5674)
v = c(0.4053, 0.4863, 0.6754, 0.3782)
s1 = 2348.2
sum(u^2)
```

`[1] 1`

`[1] 1`

```
[,1] [,2] [,3] [,4]
[1,] 780 936 1300 728
[2,] 75 90 125 70
[3,] 540 648 900 504
```

```
[,1] [,2] [,3] [,4]
[1,] -0.03419 0.0745 0.1355 0.1221
[2,] 0.00403 0.0159 0.0252 0.0186
[3,] -0.00903 0.0691 0.1182 0.0982
```

Try `svd(X)`

in R. Look at the components of the output of the `svd`

function carefully. Check the norm of the columns of the matrices that result from this call. Where did the above value of `s1`

= 2348.2 come from?

In fact, in this particular case we were lucky: we see that the second and third singular values are 0 (up to the numeric precision we care about). That is why we say that \(X\) is of **rank** 1. For a more general matrix \(X\), it is rare to be able to write \(X\) exactly as this type of two-vector product. The next subsection shows how we can decompose \(X\) when it is not of rank 1: we will just need more pieces.

In the above decomposition, there were three elements: the horizontal and vertical singular vectors, and the diagonal corner, called the singular value. These can be found using the singular value decomposition function (`svd`

). For instance:

Look at the `USV`

object, the result of calling the `svd`

function. What are its components?

Solution

Check how each successive pair of singular vectors improves our approximation to `Xtwo`

. What do you notice about the third and fourth singular values?

Solution

Again, there are many ways to write a rank two matrix such as `Xtwo`

as a sum of rank one matrices: in order to ensure uniqueness, we impose yet another^{7} condition on the singular vectors. The output vectors of the singular decomposition do not only have their norms equal to 1, each column vector in the \(U\) matrix is orthogonal to all the previous ones. We write \(u_{\cdot 1} \perp u_{\cdot 2}\), this means that the sum of the products of the values in the same positions is \(0\): \(\sum_i u_{i1} u_{i2} = 0\). Ditto for the \(V\) matrix.

^{7} Above, we had chosen the norm of the vectors to be 1.

Task

Let’s submit our `scaledTurtles`

matrix to a singular value decomposition.

`[1] 11.746475 1.419035 1.003329`

```
[,1] [,2] [,3]
[1,] 0.5787981 0.3250273 0.74789704
[2,] 0.5779840 0.4834699 -0.65741263
[3,] 0.5752628 -0.8127817 -0.09197088
```

`[1] 48 3`

Solution

The first column of `turtles.svd$v`

shows that the coefficients for the three variables are practically equal. Other noticeable “coincidences” include:

We see that the coefficients are in fact \(\sqrt{1/3}\) and the sum of squares of the singular values is equal to \((n-1)\times p\).

\(X\) is decomposed additively into rank-one pieces. Each of the \(u\) vectors is combined into the \(U\) matrix, and each of \(v\) vectors into \(V\). The *Singular Value Decomposition* is

\[ {\mathbf X} = U S V^t, V^t V={\mathbb I}, U^t U={\mathbb I}, \tag{7.5}\]

where \(S\) is the diagonal matrix of singular values, \(V^t\) is the transpose of \(V\), and \({\mathbb I}\) is the Identity matrix. Expression 7.5 can be written elementwise as

\[ X_{ij} = u_{i1}s_1v_{1j} + u_{i2}s_2v_{2j} + u_{i3}s_3v_{3j} +... + u_{ir}s_rv_{rj}, \]

\(U\) and \(V\) are said to be orthonormal^{8}, because their self-crossproducts are the identity matrix.

^{8} Nothing to do with the normal distribution, it stands for orthogonal and having norm 1.

The singular vectors from the singular value decomposition (provided by the `svd`

function in R) contain the coefficients to put in front of the original variables to make the more informative ones we call the principal components. We write this as:

\[ Z_1=v_{11} X_{\cdot 1} +v_{21} X_{\cdot 2} + v_{31} X_{\cdot 3}+ \cdots + v_{p1} X_{\cdot p}. \]

If `usv = svd(X)`

, then \((v_{11},v_{21},v_{31},...)\) are given by the first column of `usv$v`

; similarly for \(Z_2\) with the second column of `usv$v`

, and so son. \(p\) is the number of columns of \(X\) and the number of rows of \(V\). These new variables \(Z_1, Z_2, Z_3, ...\) have variances that decrease in size: \(s_1^2 \geq s_2^2 \geq s_3^2 \geq ...\).

Compute the first principal component for the turtles data by multiplying by the first singular value `d[1]`

by `u[,1]`

. What is another way of computing it ?

Solution

We show this using the code:

We can also see using matrix algebra that \(XV\) and \(US\) are the same. Remember that \(V\) is orthogonal, so \(V^t V={\mathbb I}\) and \(XV = USV^tV=US\,{\mathbb I}\).

*Note:* The `drop = FALSE`

argument in the first line of the below code makes sure that the selected matrix column retains *matrix* / *array* class attributes and thus is eligible for the matrix multiplication operator. Alternatively, you could use the regular multiplication operator `*`

. In the second line, the `drop = FALSE`

is not strictly necessary, but we have it there for symmetry.

Here are two useful facts, first in words, then with the mathematical shorthand.

The number of principal components \(k\) is always chosen to be fewer than the number of original variables or the number of observations. We are “lowering” the dimension of the problem:

\[ k\leq \min(n,p). \]

The principal component transformation is defined so that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components:

\[ \max_{aX \perp bX}\mbox{var}(\mbox{Proj}_{aX} (X)), \qquad \mbox{where } bX=\mbox{previous components} \]

We revisit our two-variable athletes data with the `discus`

and the `weight`

variables. In Section 7.3.2, we computed the first principal component and represented it as the purple line in Figure 7.10. We showed that \(Z_1\) was the linear combination given by the diagonal. As the coefficients have to have their sum of squares add to \(1\), we have that \[Z_1=-0.707*\text{athletes\$disc}- 0.707*\text{athletes\$weight}.\]

This is the same as if the two coordinates were \(c_1=0.7071\) and \(c_2=0.7071\).

What part of the output of the `svd`

functions leads us to the first PC coefficients, also known as the PC **loadings** ?

Note that we use

`svda`

which was the svd applied to the two variables `discus`

and `weight`

.If we rotate the `(discus, weight)`

plane by making the purple line the horizontal \(x\) axis, we obtain what is know as the first **principal plane**.

```
ppdf = tibble(PC1n = -svda$u[, 1] * svda$d[1],
PC2n = svda$u[, 2] * svda$d[2])
gg = ggplot(ppdf, aes(x = PC1n, y = PC2n)) +
geom_point() +
geom_hline(yintercept = 0, color = "purple", linewidth = 1.5, alpha = 0.5) +
xlab("PC1 ")+ ylab("PC2") + xlim(-3.5, 2.7) + ylim(-2, 2) + coord_fixed()
gg + geom_point(aes(x = PC1n, y = 0), color = "red") +
geom_segment(aes(xend = PC1n, yend = 0), color = "red")
gg + geom_point(aes(x = 0, y = PC2n), color = "blue") +
geom_segment(aes(yend = PC2n, xend = 0), color = "blue") +
geom_vline(xintercept = 0, color = "skyblue", linewidth = 1.5, alpha = 0.5)
```

- What is the mean of the sums of squares of the red segments in Figure 7.16 equal to?

- How does this compare to the variance of the red points?

- Compute the ratio of the standard deviation of the blue segments to the red segments in Figure 7.16. Compare this to the ratio of singular values 1 and 2.

Solution

- The sum of squares of the red segments corresponds to the square of the second singular value:

Since the mean of the red segments is zero, the above quantities are also proportional to the variance:

- The variance of the red points is
`var(ppdf$PC1n)`

, which is larger than what we calculated in a) by design of the first PC.

- We take the ratios of the standard deviations explained by the points on the vertical and horizontal axes by computing:

Task

Use `prcomp`

to compute the PCA of the first two columns of the athletes data, look at the output. Compare to the singular value decomposition.

We now want to do a complete PCA analysis on the turtles data. Remember, we already looked at the summary statistics for the one- and two-dimensional data. Now we are going to answer the question about the “true” dimensionality of these rescaled data.

In the following code, we use the function `princomp`

. Its return value is a list of all the important pieces of information needed to plot and interpret a PCA.

```
length width height
length 1.0000000 0.9783116 0.9646946
width 0.9783116 1.0000000 0.9605705
height 0.9646946 0.9605705 1.0000000
```

```
Call:
princomp(x = scaledTurtles)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.6954576 0.2048201 0.1448180
3 variables and 48 observations.
```

`scaledTurtles`

): there is one large value and two small ones. The data are (almost) one-dimensional. We will see why this dimension is called an axis of size, a frequent phenomenon in biometric data (Jolicoeur and Mosimann 1960).
Many PCA functions have been created by different teams who worked in different areas at different times. This can lead to confusion, especially because they have different naming conventions. Let’s compare three of them; run the following lines of code and look at the resulting objects:

What happens when you disable the scaling in the `prcomp`

and `princomp`

functions?

In what follows, we always suppose that the matrix \(X\) represents the centered and scaled matrix.

The coordinates of the observations in the new variables from the `prcomp`

function (call it `res`

) are in the `scores`

slot of the result. Take a look at PC1 for the `turtles`

and compare it to `res$scores`

. Compare the standard deviation `sd1`

to that in the `res`

object and to the standard deviation of the scores.

Solution

Check the orthogonality of the `res$scores`

matrix. Why can’t we say that it is **orthonormal**?

Now we are going to combine both the PC scores (\(US\)) and the loadings-coefficients (\(V\)). The plots with both the samples and the variables represented are called **biplots**. This can be done in one line using the following **factoextra** package function.

Is it possible to have a PCA plot with the PC1 as the horizontal axis whose height is longer than its width?

Solution

The variance of points in the PC1 direction is \(\lambda_1=s_1^2\) which is always larger than \(\lambda_2=s_2^2\), so the PCA plot will always be wider than high.

Looking at Figure 7.18: a) Did the males or female turtles tend to be larger?

b) What do the arrows tell us about the correlations?

Compare the variance of each new coordinate to the eigenvalues returned by the PCA `dudi.pca`

function.

Solution

Now we look at the relationships between the variables, both old and new by drawing what is known as the correlation circle. The aspect ratio is 1 here and the variables are represented by arrows as shown in Figure 7.19. The lengths of the arrows indicate the quality of the projections onto the first principal plane:

Solution

When computing the variance covariance matrix, many implementations use \(1/(n-1)\) as the denominator. Here, \(n=48\) so the sum of the variances are off by a factor of 48/47.

These data are a good example of how sometimes almost all the variation in the data can be captured in a lower-dimensional space: here, three-dimensional data can be essentially replaced by a line. Keep in mind: \(X^tC=VSU^tUS=VS^2.\) The **principal components** are the columns of the matrix \(C=US\). The \(p\) columns of \(U\) (the matrix given as `USV$u`

in the output from the `svd`

function above) are rescaled to have norms \((s_1^2,s_2^2,...,s_p^2)\). Each column has a different variance it is *responsible* for explaining. Notice that these will be decreasing numbers.

If we only want the first one then it is just \(c_1=s_1 u_1\). Notice that \(||c_1||^2=s_1^tu_1 u_1^t s_1= s_1^2 u_1^tu_1=s_1^2=\lambda_1\)

If the matrix \(X\) comes from the study of \(n\) different samples or specimens, then the principal components provides new coordinates for these \(n\) points as in Figure 7.16. These are sometimes called the *scores* in the results of PCA functions.

Before we go into more detailed examples, let’s summarize what SVD and PCA provide:

Each principal component has a variance measured by the corresponding eigenvalue, the square of the corresponding singular value.

The new variables are made to be orthogonal. Since they are also centered, this means they are uncorrelated. In the case of normal distributed data, this also means they are independent.

When the variables are have been rescaled, the sum of the variances of all the variables is the number of variables (\(=p\)). The sum of the variances is computed by adding the diagonal of the crossproduct matrix

^{9}.The principal components are ordered by the size of their eigenvalues. We always check the screeplot before deciding how many components to retain. It is also best practice to do as we did in Figure 7.18 and annotate each PC axis with the proportion of variance it explains.

^{9} This sum of the diagonal elements is called **the trace** of the matrix.

Task

Look up eigenvalue in the Wikipedia. Try to find a sentence that defines it without using a formula. Why would eigenvectors come into use in Cinderella (at a stretch)? (See the xkcd cartoon in Figure 7.20.)

For help with the basics of linear algebra, a motivated student pressed for time may consult Khan’s Academy. If you have more time and would like in depth coverage, Gil Strang’s MIT course is a classic, and some of the book is available online (Strang 2009).

We started looking at these data earlier in this chapter. Here, we will follow step by step a complete multivariate analysis. First, let us have another look at the correlation matrix (rounded to 2 digits after the decimal point), which captures the bivariate associations. We already plotted it as a colored heatmap in Figure 7.3.

```
m100 long weight high m400 m110 disc pole javel m1500
m100 1.00 -0.54 -0.21 -0.15 0.61 0.64 -0.05 -0.39 -0.06 0.26
long -0.54 1.00 0.14 0.27 -0.52 -0.48 0.04 0.35 0.18 -0.40
weight -0.21 0.14 1.00 0.12 0.09 -0.30 0.81 0.48 0.60 0.27
high -0.15 0.27 0.12 1.00 -0.09 -0.31 0.15 0.21 0.12 -0.11
m400 0.61 -0.52 0.09 -0.09 1.00 0.55 0.14 -0.32 0.12 0.59
m110 0.64 -0.48 -0.30 -0.31 0.55 1.00 -0.11 -0.52 -0.06 0.14
disc -0.05 0.04 0.81 0.15 0.14 -0.11 1.00 0.34 0.44 0.40
pole -0.39 0.35 0.48 0.21 -0.32 -0.52 0.34 1.00 0.27 -0.03
javel -0.06 0.18 0.60 0.12 0.12 -0.06 0.44 0.27 1.00 0.10
m1500 0.26 -0.40 0.27 -0.11 0.59 0.14 0.40 -0.03 0.10 1.00
```

Then we look at the screeplot, which will help us choose a rank \(k\) for representing the essence of these data.

```
[1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
[8] 0.3067981 0.2669494 0.1018542
```

The screeplot in Figure 7.21 shows a clear drop in the eigenvalues after the second one. This indicates that a good approximation will be obtained at rank 2. Let’s look at an interpretation of the first two axes by projecting the loadings of the original variables onto the two new ones, the principal components.

The correlation circle Figure 7.22 displays the projection of the original variables onto the two first new principal axes. The angles between vectors are interpreted as correlations. On the right side of the plane, we have the track and field events (m110, m100, m400, m1500), and on the left, we have the throwing and jumping events. Maybe there is an opposition of skills as characterized in the correlation matrix. We did see the correlations were negative between variables from these two groups. How can we interpret this?

It seems that those who throw the best have lower scores in the track competitions. In fact, if we look at the original measurements, we can see what is happening. The athletes who run in short times are the stronger ones, as are the ones who throw or jump longer distances. We should probably change the scores of the track variables and redo the analysis.

What transformations of the variables induce the best athletic performances to vary in the same direction, i.e., be mostly positively correlated?

Solution

If we change the signs on the running performances, almost all the variables will be positively correlated.

`[1] "m100" "m400" "m110" "m1500"`

```
m100 long weight high m400 m110 disc pole javel m1500
m100 1.00 0.54 0.21 0.15 0.61 0.64 0.05 0.39 0.06 0.26
long 0.54 1.00 0.14 0.27 0.52 0.48 0.04 0.35 0.18 0.40
weight 0.21 0.14 1.00 0.12 -0.09 0.30 0.81 0.48 0.60 -0.27
high 0.15 0.27 0.12 1.00 0.09 0.31 0.15 0.21 0.12 0.11
m400 0.61 0.52 -0.09 0.09 1.00 0.55 -0.14 0.32 -0.12 0.59
m110 0.64 0.48 0.30 0.31 0.55 1.00 0.11 0.52 0.06 0.14
disc 0.05 0.04 0.81 0.15 -0.14 0.11 1.00 0.34 0.44 -0.40
pole 0.39 0.35 0.48 0.21 0.32 0.52 0.34 1.00 0.27 0.03
javel 0.06 0.18 0.60 0.12 -0.12 0.06 0.44 0.27 1.00 -0.10
m1500 0.26 0.40 -0.27 0.11 0.59 0.14 -0.40 0.03 -0.10 1.00
```

```
[1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
[8] 0.3067981 0.2669494 0.1018542
```

Now all the negative correlations are quite small. The screeplot will show no change, as the eigenvalues of the matrix are unaffected by the above sign flips. The only ouput that changes are the signs of the coefficients of the principal component loadings for the variables whose signs we flipped.

Figure 7.23 shows the correlation circle of the transformed variables. We now see we have a broad common overall axis: all the arrows are pointing broadly in the same direction.

We now plot the athletes projected in the principal plane using:

If we look at the athletes themselves as they are shown in Figure 7.24, we notice a slight ordering effect. Do you see a relation between the performance of the athletes and their numbering in Figure 7.24 ?

Solution

If you play join the dots following the order of the numbers, you will probably realize you are spending more time on one side of the plot than you would be if the numbers were randomly assigned.

It turns out there is complementary information available in the `olympic`

dataset. An extra vector variable called `score`

reports the final scores at the competition, the men’s decathlon at the 1988 Olympics.

```
[1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114 8093 8083 8036
[16] 8021 7869 7860 7859 7781 7753 7745 7743 7623 7579 7517 7505 7422 7310 7237
[31] 7231 7016 6907
```

So let us look at the scatterplot comparing the first principal component coordinate of the athletes to this score. This is shown in Figure 7.25. We can see a strong correlation between the two variables. We note that athlete number 1 (who in fact won the Olympic decathlon gold medal) has the highest score, but not the highest value in PC1. Why do you think that is?

We have seen in the examples that the first step in PCA is to make the screeplot of the variances of the new variables (equal to the **eigenvalues**). We cannot decide how many dimensions are needed before seeing this plot. The reason is that there are situations when the principal components are ill-defined: when two or three successive PCs have very similar variances, giving a screeplot as in Figure 7.26, the subspace corresponding to a group of similar eigenvalues exists. In this case this would be 3D space generated by \(u_2,u_3,u_4\). The vectors are not meaningful individually and one cannot interpret their loadings. This is because a very slight change in one observations could give a completely different set of three vectors. These would generate the same 3D space, but could have very different loadings. We say the PCs are unstable.

We have seen that unlike regression, PCA treats all variables equally (to the extent that they were preprocessed to have equivalent standard deviations). However, it is still possible to map other continuous variables or categorical factors onto the plots in order to help interpret the results. Often we have complementary information on the samples, for example, diagnostic labels in the diabetes data or the cell types in the T-cell gene expression data.

Here we see how we can use such extra variables to inform our interpretation. The best place to store such so-called *metadata* is in appropriate slots of the data object (such as in the Bioconductor *SummarizedExperiment* class); the second-best, in additional columns of the data frame that also contains the numeric data. In practice, such information is often stored in a more or less cryptic manner in the row names of the matrix. Below, we need to face the latter scenario, and we use `substr`

gymnastics to extract the cell types and show the screeplot in Figure 7.27 and the PCA in Figure 7.28.

```
ids = rownames(Msig3transp)
celltypes = factor(substr(ids, 7, 9))
status = factor(substr(ids, 1, 3))
table(celltypes)
```

```
celltypes
EFF MEM NAI
10 9 11
```

```
cbind(pcaMsig3$li, tibble(Cluster = celltypes, sample = ids)) %>%
ggplot(aes(x = Axis1, y = Axis2)) +
geom_point(aes(color = Cluster), size = 5) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
scale_color_discrete(name = "Cluster") + coord_fixed()
```

These data requires delicate preprocessing before we obtain our desired matrix with the relevant features as columns and the samples as rows. Starting with the raw mass spectroscopy readings, the steps involve extracting peaks of relevant features, aligning them across multiple samples and estimating peak heights. We refer the reader to the vignette of the Bioconductor **xcms** package for gruesome details. We load a matrix of data generated in such a way from the file `mat1xcms.RData`

. The output of the below code is in Figures 7.29 and 7.30.

`[1] 399 12`

```
dfmat1 = cbind(pcamat1$li, tibble(
label = rownames(pcamat1$li),
number = substr(label, 3, 4),
type = factor(substr(label, 1, 2))))
pcsplot = ggplot(dfmat1,
aes(x=Axis1, y=Axis2, label=label, group=number, colour=type)) +
geom_text(size = 4, vjust = -0.5)+ geom_point(size = 3)+ylim(c(-18,19))
pcsplot + geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2)
```

Looking at Figure 7.30, do the samples seem to be randomly placed in the plane? Do you notice any structure explained by the labels?

In the previous example, the number of variables measured was too large to enable useful concurrent plotting of both variables and samples. In this example we plot the PCA biplot of a simple data set where chemical measurements were made on different wines for which we also have a categorical `wine.class`

variable. We start the analysis by looking at the two-dimensional correlations and a heatmap of the variables.

```
Alcohol MalicAcid Ash AlcAsh Mg Phenols Flav
1 14.23 1.71 2.43 15.6 127 2.80 3.06
2 13.20 1.78 2.14 11.2 100 2.65 2.76
```

A **biplot** is a simultaneous representation of both the space of observations and the space of variables. In the case of a PCA biplot like Figure 7.32 the arrows represent the directions of the old variables as they project onto the plane defined by the first two new axes. Here the observations are just colored dots, the color has been chosen according to which type of wine is being plotted. We can interpret the variables’ directions with regards to the sample points, for instance the blue points are from the barbera group and show higher Malic Acid content than the other wines.

```
wine.class
barolo grignolino barbera
59 71 48
```

```
fviz_pca_biplot(winePCAd, geom = "point", habillage = wine.class,
col.var = "violet", addEllipses = TRUE, ellipse.level = 0.69) +
ggtitle("") + coord_fixed()
```

`Phenols`

, `Flav`

and `Proa`

indicate that they are strongly correlated, whereas `Hue`

and `Alcohol`

are uncorrelated.
Interpretation of multivariate plots requires the use of as much of the available information as possible; here we have used the samples and their groups as well as the variables to understand the main differences between the wines.

Sometimes we want to see variability between different groups or observations, but want to weight them. This can be the case if, e.g., the groups have very different sizes. Let’s re-examine the Hiiragi data we already saw in Chapter 3. In the code below, we select the wildtype (WT) samples and the top 100 features with the highest overall variance.

```
data("x", package = "Hiiragi2013")
xwt = x[, x$genotype == "WT"]
sel = order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
xwt = xwt[sel, ]
tab = table(xwt$sampleGroup)
tab
```

```
E3.25 E3.5 (EPI) E3.5 (PE) E4.5 (EPI) E4.5 (PE)
36 11 11 4 4
```

We see from `tab`

that the groups are represented rather unequally. To account for this, we reweigh each sample by the inverse of its group size. The function `dudi.pca`

in the **ade4** package has a `row.w`

argument into which we can enter the weights. The output of the code is in Figures 7.33 and 7.34.

**Preprocessing matrices** Multivariate data analyses require “conscious” preprocessing. After consulting all the means, variances and one-dimensional histograms we saw how to rescale and recenter the data.

**Projecting onto new variables** We saw how we can make projections into lower dimensions (2D planes and 3D are the most frequently used) of high dimensional data without losing too much information. PCA searches for new “more informative” variables that are linear combinations of the original (old) ones.

**Matrix decomposition** PCA is based on finding decompositions of the matrix \(X\) called SVD. This decomposition provides a lower rank approximation and is equivalent to the eigendecomposition of \(X^tX\). The squares of the singular values are equal to the eigenvalues and to the variances of the new variables. We systematically plotted these values before deciding how many axes are necessary to reproduce the signal in the data.

Two eigenvalues which are quite close can give rise to scores or PC scores which are highly unstable. It is always necessary to look at the screeplot of the eigenvalues and avoid separating the axes corresponding to the these close eigenvalues. This may require using interactive three or four-dimensional projections, which are available in several R packages.

**Biplot representations** The space of observations is naturally a \(p\)-dimensional space (the \(p\) original variables provide the coordinates). The space of variables is \(n\)-dimensional. Both decompositions we have studied (singular values / eigenvalues and singular vectors / eigenvectors) provide new coordinates for both of these spaces, sometimes we call one the dual of the other. We can plot the projection of both the observations and the variables onto the same eigenvectors. This provides a biplot that can be useful for interpreting the PCA output.

**Projecting other group variables** Interpretation of PCA can also be facilitated by redundant or contiguous data about the observations.

The best way to deepen your understanding of singular value decomposition is to read Chapter 7 of Strang (2009). The whole book sets the foundations for the linear algebra necessary to understanding the meaning of the rank of matrix and the duality between row spaces and column spaces (Holmes 2006).

Complete textbooks have been written on the subject of PCA and related methods. Mardia, Kent, and Bibby (1979) is a standard text that covers all multivariate methods in a classical way, with linear algebra and matrices. By making the parametric assumptions that the data come from multivariate normal distributions, Mardia, Kent, and Bibby (1979) also provide inferential tests for the number of components and limiting properties for principal components. Jolliffe (2002) is a book-long treatment of everything to do with PCA with extensive examples.

We can incorporate supplementary information into weights for the observations and the variables. This was introduced in the 1970’s by French data scientists, see Holmes (2006) for a review and Chapter 9 for further examples.

Improvements to the interpretation and stability of PCA can be obtained by adding a penalty that minimizes the number of nonzero coefficients that appear in the linear combinations. Zou, Hastie, and Tibshirani (2006) and Witten, Tibshirani, and Hastie (2009) have developed sparse versions of principal component analysis, and their packages **elasticnet** and **PMA** provide implementations in R.

Revise the material about svd by reading sections 1, 2, and 3 of the Wikipedia article about SVD. It will also be beneficial to read about the related eigenvalue decomposition by reading sections 1, 2, and 2.1 of the Wikipedia article about eigendecomposition of a matrix. We know that we can decompose a \(n\) row by \(p\) column rank 1 matrix \(X\) as:

\[ **X** = \begin{pmatrix} x_{11} & x_{12} & ... & x_{1p}\\ x_{21} & x_{22} & ... & x_{2p}\\ \vdots & \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & ... & x_{np} \end{pmatrix} = \begin{pmatrix} u_{11} \\ u_{21} \\ \vdots \\ u_{n1} \\ \end{pmatrix} \times \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{p1} \end{pmatrix} \]

If \(X\) has no rows and no columns which are all zeros, then is this decomposition unique?

Generate a rank-one matrix. Start by taking a vector of length 15 with values from 2 to 30 in increments of 2, and a vector of length 4 with values 3, 6, 9, 12, then take their outer product.

Why do we have to take `t(v)`

?

- Now we add some noise in the form a matrix we call
`Materr`

so we have an “approximately rank-one” matrix.

Visualize \(X\) using `ggplot`

.

- Redo the same analyses with a rank 2 matrix.

Solution

Note that `X1`

can also be computed as

Here we see that the data looks linear in all four dimensions. This is what it means to be of rank-one. Now let’s consider a rank 2 matrix.

```
[,1] [,2] [,3] [,4]
[1,] 1.1386707 -0.6722571 1.5302913 -0.20014004
[2,] -0.3574149 0.3252298 -0.1869863 -0.06003997
[3,] -0.1832390 0.1048434 -0.2548350 0.03579861
[4,] 2.7945843 -2.2562659 2.1982998 0.16108026
[5,] 0.9662820 -0.7035995 0.9567117 -0.02664564
[6,] -2.1883056 2.1717134 -0.6813383 -0.56172314
```

Now there are obviously at least two dimensions because if we project the data onto the first two coordinates (by default called `X1`

and `X2`

when you convert a matrix into a data frame in R), then the data varies in both dimensions. So the next step is to try to decide if there are more than two dimensions. The top right points are the closest to you (they’re biggest) and as you go down and left in the plot those points are farther away. In the left are the bluest points and they seem to get darker linearly as you move right. As you can probably tell, it is very hard to visually discover a low dimensional space in higher dimensions, even when “high dimensions” only means 4! This is one reason why we rely on the singular value decomposition.

`[1] 2.411503e+01 9.594192e+00 7.437314e-15 6.170418e-16`

```
Y = Y2 + matrix(rnorm(n*p, sd=0.01),n,p) # add some noise to Y2
svd(Y)$d # four non-zero eigenvalues (but only 2 big ones)
```

`[1] 24.11265533 9.58275554 0.10510423 0.09366624`

Here we have two dimensions which are non-zero and two dimensions which are approximately 0 (for “Y2”, they are within square root of computer tolerance of 0).

- create a first a matrix of highly correlated bivariate data such as that shown in Figure 7.35.

Hint: Use the function`mvrnorm`

.

Check the rank of the matrix by looking at its singular values.

- perform a PCA and show the rotated principal component axes.

Solution

- we generate correlated bivariate normal data using:

```
library("MASS")
mu1 = 1; mu2 = 2; s1=2.5; s2=0.8; rho=0.9;
sigma = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2)
sim2d = data.frame(mvrnorm(50, mu = c(mu1,mu2), Sigma = sigma))
svd(scale(sim2d))$d
```

`[1] 9.669028 2.123654`

`[1] -0.7071068 -0.7071068`

- We use
`prcomp`

to perform a PCA and the scores provide the desired rotation.

Part (a) in Figure 7.35 shows a very elongated plotting region, why?

What happens if you do not use the `coord_fixed()`

option and have a square plotting zone? Why can this be misleading?

Let’s revisit the Hiiragi data and compare the weighted and unweighted approaches.

make a correlation circle for the unweighted Hiiragi data

`xwt`

. Which genes have the best projections on the first principal plane (best approximation)?make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the the sample-points.

Abbott, Edwin A. 1884. *Flatland: A Romance of Many Dimensions*. OUP Oxford.

Flury, Bernard. 1997. *A First Course in Multivariate Statistics*. Springer.

Holmes, Susan. 2006. “Multivariate Analysis: The French way.” In *Probability and Statistics: Essays in Honor of David a. Freedman*, edited by D. Nolan and T. P. Speed. Vol. 56. IMS Lecture Notes–Monograph Series. Beachwood, OH: IMS. http://www.imstat.org/publications/lecnotes.htm.

Holmes, Susan, Michael He, Tong Xu, and Peter P Lee. 2005. “Memory t Cells Have Gene Expression Patterns Intermediate Between Naive and Effector.” *PNAS* 102 (15): 5519–23.

Hotelling, Harold. 1933. “Analysis of a Complex of Statistical Variables into Principal Components.” *Journal of Educational Psychology* 24 (6): 417–41.

Jolicoeur, Pierre, and James E Mosimann. 1960. “Size and Shape Variation in the Painted Turtle. A Principal Component Analysis.” *Growth* 24: 339–54.

Jolliffe, Ian. 2002. *Principal Component Analysis*. Wiley Online Library.

Mardia, Kanti, John T Kent, and John M Bibby. 1979. *Multiariate Analysis*. New York: Academic Press.

Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” *The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science* 2 (11): 559–72.

Strang, Gilbert. 2009. *Introduction to Linear Algebra*. Fourth. Wellesley-Cambridge Press.

Witten, Daniela M, Robert Tibshirani, and Trevor Hastie. 2009. “A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis.” *Biostatistics*, kxp008.

Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2006. “Sparse Principal Component Analysis.” *Journal of Computational and Graphical Statistics* 15 (2): 265–86.

Page built at 01:16 on 2024-09-15 using R version 4.4.1 (2024-06-14)

Support for maintaining the online version of this book is provided by de.NBI

For website support please contact msmith@embl.de