Many datasets consist of several variables measured on the same set of subjects: patients, samples, or organisms. For instance, we may have biometric characteristics such as height, weight, age as well as clinical variables such as blood pressure, blood sugar, heart rate, and genetic data for, say, a thousand patients. The raison d’être for multivariate analysis is the investigation of connections or associations between the different variables measured. Usually the data are reported in a tabular data structure with one row for each subject and one column for each variable. In the following, we will focus on the special case where each of the variables is numeric, so we can represent the data structure as a matrix in R.
If the columns of the matrix are all independent of each other (unrelated), we can simply study each column separately and do standard “univariate” statistics on them one by one; there would be no benefit in studying them as a matrix.
More often, there will be patterns and dependencies. For instance, in the biology of cells, we know that the proliferation rate will influence the expression of many genes simultaneously. Studying the expression of 25,000 gene (columns) on many samples (rows) of patientderived cells, we notice that many of the genes act together, either that they are positively correlated or that they are anticorrelated. We would miss a lot of important information if we were to only study each gene separately. Important connections between genes are detectable only if we consider the data as a whole: each row representing the many measurements made on the same observational unit. However, having 25,000 dimensions of variation to consider at once is daunting; we will show how to reduce our data to a smaller number of most important dimensions106 We will elaborate this idea of dimension reduction in much more detail below. For the time being, remember we live in a fourdimensional world. without losing too much information.
It might be a while since you saw the type of mathematics that underlies PCA, so we illustrate it more by pictures more than by formula notation. For a refresher, see the Khan academy material on Youtube, or textbooks such as that by Strang (2009). This chapter presents many examples of multivariate data matrices that we encounter in highthroughput experiments, as well as some more elementary examples that we hope will enhance your intuition. We will focus in this chapter on Principal Component Analysis, abbreviated as PCA, a dimension reduction method. We will provide geometric explanations of the algorithm as well as visualizations that help interprete the output of PCA analyses.
In this chapter we will:
See examples of matrices that come up in the study of biological data.
Perform dimension reduction to understand correlations between variables.
Preprocess, rescale and center the data before starting a multivariate analysis.
Build new variables, called principal components (PC), that are more useful than the original measurements.
See what is “under the hood” of PCA: the singular value decomposition of a matrix.
Visualize what this decomposition achieves and learn how to choose the number of principal components.
Run through a complete PCA analysis from start to finish.
Project factor covariates onto the PCA map to enable a more useful interpretation of the results.
First, let’s look at a set of examples of rectangular matrices used to represent tables of measurements. In each matrix, the rows and columns represent specific entities. Turtles: A very simple data set that will help us understand the basic principles is a matrix of three dimensions of biometric measurements on painted turtles.
turtles = read.table("../data/PaintedTurtles.txt", header = TRUE)
turtles[1:4, ]
## sex length width height
## 1 f 98 81 38
## 2 f 103 84 38
## 3 f 103 86 42
## 4 f 105 86 40
The last three columns are length measurements (in millimetres), whereas the first column is a factor variable that tells us the sex of each animal.
Athletes: This matrix is an interesting example from the sports world. It reports the performances for 33 athletes in the ten disciplines of the decathlon: m100
, m400
and m1500
are times in seconds for the 100 meters, 400 meters, and 1500 meters respectively; m110
is the time to finish the 110 meters hurdles; pole
is the polevault height, and highj
and long
are the results of the high and long jumps, all in meters; weight
, disc
, and javel
are the lengths in meters the athletes were able to throw the weight, discus and javelin. Here are these variables for the first three athletes:
load("../data/athletes.RData")
athletes[1:3, ]
## m100 long weight highj m400 m110 disc pole javel m1500
## 1 11.25 7.43 15.48 2.27 48.90 15.13 49.28 4.7 61.32 268.95
## 2 10.87 7.45 14.97 1.97 47.71 14.46 44.36 5.1 61.76 273.02
## 3 11.18 7.44 14.20 1.97 48.29 14.81 43.66 5.2 64.16 263.20
Cell Types: Holmes et al. (2005) studied gene expression profiles of sorted Tcell populations from different subjects. The columns are a subset of gene expression measurements, they correspond to 156 genes that show differential expression between cell types.
load("../data/Msig3transp.RData")
round(Msig3transp,2)[1:5, 1:6]
## X3968 X14831 X13492 X5108 X16348 X585
## HEA26_EFFE_1 2.61 1.19 0.06 0.15 0.52 0.02
## HEA26_MEM_1 2.26 0.47 0.28 0.54 0.37 0.11
## HEA26_NAI_1 0.27 0.82 0.81 0.72 0.90 0.75
## MEL36_EFFE_1 2.24 1.08 0.24 0.18 0.64 0.01
## MEL36_MEM_1 2.68 0.15 0.25 0.95 0.20 0.17
Bacterial Species Abundances: Matrices of counts are used in microbial ecology studies (as we saw in Chapter 4). Here the columns represent different species (or operational taxonomic units, OTUs) of bacteria, which are identified by numerical tags. The rows are labeled according to the samples in which they were measured, and the (integer) numbers represent the number of times of each of the OTUs was observed in each of the samples.
data("GlobalPatterns", package = "phyloseq")
GPOTUs = as.matrix(t(phyloseq::otu_table(GlobalPatterns)))
GPOTUs[1:4, 6:13]
## OTU Table: [8 taxa and 4 samples]
## taxa are columns
## 246140 143239 244960 255340 144887 141782 215972 31759
## CL3 0 7 0 153 3 9 0 0
## CC1 0 1 0 194 5 35 3 1
## SV1 0 0 0 0 0 0 0 0
## M31Fcsw 0 0 0 0 0 0 0 0
Notice the propensity of the matrix entries to be zero; we call such data sparse.
mRNA reads: RNASeq transcriptome data report the number of sequence reads matching each gene107 Or subgene structures, such as exons. in each of several biological samples. We will study this type of data in detail in Chapter 8.
library("SummarizedExperiment")
data("airway", package = "airway")
assay(airway)[1:3, 1:4]
## SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003 679 448 873 408
## ENSG00000000005 0 0 0 0
## ENSG00000000419 467 515 621 365
It is customary in the RNASeq field—and so it is for the airway
data above—to report the genes in the rows and the samples in the columns. Compared to the other matrices we look at here, this is transposed: rows and columns are swapped. Such different conventions easily lead to errors, so they are worthwhile paying attention to108 The Bioconductor project tries to help users and developers to avoid such ambiguities by defining data containers in which such conventions are explicitly fixed. In Chapter 8, we will see the example of the SummarizedExperiment class.. Proteomic profiles: Here, the columns are aligned mass spectroscopy peaks or molecules identified through their \(m/z\)ratios; the entries in the matrix are the measured intensities109 More details can be found, e.g., on Wikipedia..
metab = t(as.matrix(read.csv("../data/metabolites.csv", row.names = 1)))
metab[1:4, 1:4]
## 146.0985388 148.7053275 310.1505057 132.4512963
## KOGCHUM1 29932.36 17055.70 1132.82 785.5129
## KOGCHUM2 94067.61 74631.69 28240.85 5232.0499
## KOGCHUM3 146411.33 147788.71 64950.49 10283.0037
## WTGCHUM1 229912.57 384932.56 220730.39 26115.2007
In many of the matrices we have seen here, important information is stored in the row or column names. However, this is by far not the best place to store all available information on the subjects (columns) and features (rows). A much better approach is the Bioconductor SummarizedExperiment class.
► Task
When a peak was not detected for a particular \(m/z\) score in the mass spectrometry run, a zero was recorded in metab
. Similarly, zeros in GPOTUs
or in the airway
object occur when there were no matching sequence reads deteced. Tabulate the frequencies of zeros in these data matrices.
► Question 7.1
a) What are the columns of these data matrices usually called ?
b) In each of these examples, what are the rows of the matrix ?
c) What does a cell in a matrix represent ?
d) If the data matrix is called athletes
and you want to see the
value of the third variable for the fifth athlete, what do you type into R?
If we are studying only one variable, i.e., just the third column of the turtles matrix110 The third column of a matrix \(X\) is denoted mathematically by \({\mathbf x}_{\bullet 3}\) or accessed in R using X[, 3]
., we say we are looking at onedimensional data. Such a vector, say all the turtle weights, can be visualized by plots such as those that we saw in Section 3.6, e.g., a histogram. If we compute a one number summary, say mean or median, we have made a zerodimensional summary of our onedimensional data. This is already an example of dimension reduction.
In Chapter 3 we studied twodimensional scatterplots. We saw that if there are too many observations, it can be beneficial to group the data into (hexagonal) bins: these are twodimensional histograms. When considering two variables (\(x\) and \(y\)) measured together on a set of observations, the correlation coefficient measures how the variables covary. This is a single number summary of twodimensional data. Its formula involves the summaries \(\bar{x}\) and \(\bar{y}\):
\[\begin{equation} \hat{\rho}= \frac{\sum_{i=1}^n (x_i\bar{x})(y_i\bar{y})} {\sqrt{\sum_{i=1}^n (x_i\bar{x})^2} \sqrt{\sum_{j=1}^n (y_j\bar{y})^2}} \tag{7.1} \end{equation}\]
In R, we use the cor
function to calculate its value. Applied to a matrix this function computes all the two way correlations between continuous variables. In Chapter 9 we will see how to analyse multivariate categorical data.
► Question 7.2
Compute the matrix of all correlations between the measurements from the turtles data. What do you notice ?
► Solution
It is always beneficial to start a multidimensional analysis by checking these simple onedimensional and twodimensional summary statistics and make visual displays such as those in Figures 7.2 and 7.3.
► Question 7.3
► Solution
► Question 7.4
Make a pairs plot of the athletes
data. What do you notice?
► Solution
In many cases, different variables are measured in different units, so they have different baselines and different scales111 Common measures of scale are the range and the standard deviation. For instance, the times for the 110 metrex vary between 14.18 and 16.2, with a standard deviation of 0.51, whereas the times to complete the 1500 metres vary between 256.64 and 303.17, with a standard deviation of 13.66; more than an order of magnitude larger. Moreover, the athletes
data also contain measurements in different units (seconds, metres), whose choice is arbitrary (lengths could also be recorded in centimetres or feet, times in milliseconds).. These are not directly comparable in their original form.
For PCA and many other methods, we therefore need to transform the numeric values to some common scale in order to make comparisons meaningful. Centering means subtracting the mean, so that the mean of the centered data is at the origin. Scaling or standardizing then means dividing by the standard deviation, so that the new standard deviation is \(1\). In fact, we have already encountered these operations when computing the correlation coefficient (Equation (7.1)): the correlation coefficient is simply the vector product of the centered and scaled variables. To perform these operations, there is the R function scale
, whose default behavior when given a matrix or a data frame is to make every column have a mean of zero and a standard deviation of \(1\).
► Question 7.5
turtle
data, then use the scale
function to center and standardize the continuous variables. Call this scaledTurtles
, then verify the new values for mean and standard deviation of scaledTurtles
.
► Solution
We have already encountered other data transformation choices in Chapters 4 and 5, where we used the log and asinh functions. The aim of these transformations is (usually) variance stabilization, i.e., to make the variances of replicate measurements of one and the same variable in different parts of its dynamic range more similar. In contrast, the standardizing transformation described above aims to make the scale (as measured by mean and standard deviation) of different variables the same.
Sometimes it is preferable to leave variables at different scales because they are truly of different importance. If their original scale is relevant, then we can (should) leave the data as is. In other cases, the variables have different precisions known a priori. We will see in Chapter 9 that there are several ways of weighting such variables.
After preprocessing the data, we are ready to undertake data simplification through dimension reduction.
Useful books with relevant chapters are Flury (1997) for an introductory account and Mardia, Kent, and Bibby (1979) for a detailed mathematical approach.
We will explain dimension reduction from several different perspectives. It was invented in 1901 by Karl Pearson (Pearson 1901) as a way to reduce a twovariable scatterplot to a single coordinate. It was used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects (Hotelling 1933); thus providing overall scores that summarize many testvariables at once. This idea of principal scores inspired the name Principal Component Analysis (abbreviated PCA). PCA is called an unsupervised learning technique because, as in clustering, it treats all variables as having the same status. We are not trying to predict or explain one particular variable’s value from the others; rather, we are trying to find a mathematical model for an underlying structure for all the variables. PCA is primarily an exploratory technique that produces maps that show the relations between variables and between observations in a useful way.
We first provide a flavor of what this multivariate analysis does to the data. There is an elegant mathematical formulation of these methods through linear algebra, although here we will try to minimize its use and focus on visualization and data examples.
We use geometrical projections that take points in higherdimensional spaces and projects them down onto lower dimensions. Figure 7.5 shows the projection of the point \(A\) onto the line generated by the vector \({\mathbf v}\).
Figure 7.5: Point \(A\) is projected onto the red line generated by the vector \(v\). The dashed projection line is perpendicular (or orthogonal) to the red line. The intersection point of the projection line and the red line is called the orthogonal projection of A onto the red line generated by the vector \(v\).
PCA is a linear technique, meaning that we look for linear relations between variables and that we will use new variables that are linear functions of the original ones (\(f(ax+by)=af(x)+b(y)\)). The linearity constraints makes computations particularly easy. We will see nonlinear techniques in Chapter 9.
Here we show one way of projecting twodimensional data onto a line using the athletes
data. The code below provides the preprocessing and plotting steps that were used to generate Figure 7.6:
Figure 7.6: Scatterplot of two variables showing the projection on the horizontal x axis (defined by \(y=0\)) in red and the lines of projection appear as dashed.
athletes = data.frame(scale(athletes))
ath_gg = ggplot(athletes, aes(x = weight, y = disc)) +
geom_point(size = 2, shape = 21)
ath_gg + geom_point(aes(y = 0), colour = "red") +
geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")
► Task
a) Calculate the variance of the red points in Figure 7.6.
b) Make a plot showing projection lines onto the \(y\) axis and projected points.
c) Compute the variance of the points projected onto the vertical \(y\) axis.
In general, we lose information about the points when we project from two dimensions (a plane) to one (a line). If we do it just by using the original coordinates, as we did on the weight variable in Figure 7.6, we lose all the information about the disc variable. Our goal is to keep as much information as we can about both variables. There are actually many ways of projecting the point cloud onto a line. One is to use what are known as regression lines. Let’s look at these lines and how they are constructed in R.
If you have seen linear regression, you already know how to compute lines that summarize scatterplots; linear regression is a supervised method that gives preference minimizing the residual sum of squares in one direction: that of the response variable.
In Figure 7.7, we use the lm
(linear model) function to find the regression line. Its slope and intercept are given by the values in the coefficients
slot of the resulting object reg1
.
Figure 7.7: The blue line minimizes the sum of squares of the vertical residuals (in red).
reg1 = lm(disc ~ weight, data = athletes)
a1 = reg1$coefficients[1] # intercept
b1 = reg1$coefficients[2] # slope
pline1 = ath_gg + geom_abline(intercept = a1, slope = b1,
col = "blue", lwd = 1.5)
pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted),
colour = "red", arrow = arrow(length = unit(0.15, "cm")))
Figure 7.8 shows the line produced when reversing the roles of the two variables; weight becomes the response variable.
Figure 7.8: The green line minimizes the sum of squares of the horizontal residuals (in orange).
reg2 = lm(weight ~ disc, data = athletes)
a2 = reg2$coefficients[1] # intercept
b2 = reg2$coefficients[2] # slope
pline2 = ath_gg + geom_abline(intercept = a2/b2, slope = 1/b2,
col = "darkgreen", lwd = 1.5)
pline2 + geom_segment(aes(xend=reg2$fitted, yend=disc),
colour = "orange", arrow = arrow(length = unit(0.15, "cm")))
Each of the regression lines in Figures 7.7 and 7.8 gives us an approximate linear relationship between disc
and weight
. However, the relationship differs depending on which of the variables we choose to be the predictor and which the response.
► Question 7.6
How large is the variance of the projected points that lie on the blue regression line of Figure 7.7? Compare this to the variance of the data when projected on the original axes, weight
and disc
.
► Solution
Figure 7.9 shows the line chosen to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto it; we call this the principal component line. All our three ways of fitting a line (Figures 7.7–7.9) together in one plot are shown in Figure 7.10.
Figure 7.9: The purple principal component line minimizes the sums of squares of the orthogonal projections.
xy = cbind(athletes$disc, athletes$weight)
svda = svd(xy)
pc = xy %*% svda$v[, 1] %*% t(svda$v[, 1])
bp = svda$v[2, 1] / svda$v[1, 1]
ap = mean(pc[, 2])  bp * mean(pc[, 1])
ath_gg + geom_segment(xend = pc[, 1], yend = pc[, 2]) +
geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5)
Figure 7.10: The blue line minimizes the sum of squares of the vertical residuals, the green line minimizes the horizontal residuals, the purple line, called the principal component, minimizes the orthogonal projections. Notice the ordering of the slopes of the three lines.
► Question 7.7
► Solution
► Question 7.8
Compute the variance of the points on the purple line.
► Solution
Pythagoras’ theorem tells us two interesting things here:
If we are minimizing in both horizontal and vertical directions we are in fact minimizing the orthogonal projections onto the line from each point.
The total variability of the points is measured by the sum of squares of the projection of the points onto the center of gravity, which is the origin (0,0) if the data are centered. This is called the total variance or the inertia of the point cloud. This inertia can be decomposed into the sum of the squares of the projections onto the line plus the variances along that line. For a fixed variance, minimizing the projection distances also maximizes the variance along that line. Often we define the first principal component as the line with maximum variance.
small Image credit: Sara Holmes The PC line we found in the previous section could be written
\[\begin{equation} PC= \frac{1}{2} \mbox{disc} + \frac{1}{2} \mbox{weight}. \tag{7.2} \end{equation}\]
Principal components are linear combinations of the variables that were originally measured, they provide a new coordinate system. To understand what a linear combination really is, we can take an analogy. When making a healthy juice mix, you will follow a recipe like
\[\begin{equation*} V = 2 \times \text{Beet} + 1 \times \text{Carrot} + \frac{1}{2} \text{Gala} + \frac{1}{2} \text{GrannySmith} + 0.02 \times \text{Ginger} + 0.25 \times \text{Lemon}. \end{equation*}\]
This recipe is a linear combination of individual juice types (the original variables). The result is a new variable, \(V\), and the coefficients \((2,1,\frac{1}{2},\frac{1}{2},0.02,0.25)\) are called the loadings.
► Question 7.9
How would you compute the calories in a glass of juice?
A linear combination of variables defines a line in higher dimensions in the same way we constructed lines in the scatterplot plane of two dimensions. As we saw in that case, there are many ways to choose lines onto which we project the data, there is however a `best’ line for our purpose.
The total variance of all the points in all the variables can de decomposed. In PCA, we use the fact that the total sums of squares of the distances between the points and any line can be decomposed into the distance to the line and the variance along the line.
We saw that the principal component minimizes the distance to the line, and it also maximizes the variance of the projections along the line.
Why is maximizing the variance along a line a good idea? Let’s look at another example of a projection from three dimensions into two. In fact, human vision depends on such dimension reducation:
Figure 7.11: A mystery silhouette.
► Question 7.10
In Figure 7.11, there is a twodimensional projection of a threedimensional object. What is the object?
► Question 7.11
► Solution
Figure 7.12: Many choices have to be made during PCA processing.
PCA is based on the principle of finding the axis showing the largest inertia/variability, removing the variability in that direction and then iterating to find the next best orthogonal axis so on. In fact, we do not have to run iterations, all the axes can be found in one operation called the Singular Value Decomposition (we will delve more deeply into the details below).
In the diagram in Figure 7.12, we see that first the means are variances are computed and the choice of whether to work with rescaled covariances –the correlation matrix– or not has to be made. Then the next step is the choice of \(k\), the number of components relevant to the data. We say that \(k\) is the rank of the approximation we choose; we give a careful explanation of how we make that choice below. Unlike clustering, it is impossible to choose the number of components before doing part of the analysis. The choice of \(k\) requires looking at a plot of the variances explained by the successive principal components before proceeding to the projections of the data.
The end results of the PCA workflow are useful maps of both the variables and the samples. Understanding how these maps are constructed will maximize the information we can gather from them.
This is a small section for those whose background in linear algebra is but a faint memory. It tries to give some intuition to the singular value decomposition method underlying PCA, without too much notation.
Figure 7.13: Another twodimensional projection of the same object shown in Figure 7.11. Here, the perspective is more informative. Generally, choosing the perspective such that the spread (in other words, the variance) of the points is maximal generally provides most information. We want to see as much of the variation as possible, that’s what PCA does.
The singular value decomposition of a matrix finds horizontal and vertical vectors (called the singular vectors) and normalizing values (called singular values). As before, we start by giving the forwardgenerative explanation before doing the actual reverse engineering that is used in creating the decomposition. To calibrate the meaning of each step, we will start with an artificial example before moving to the complexity of real data.
\(u =\left(\begin{smallmatrix} 1\\2\\3\\4 \end{smallmatrix}\right)\) and \(v =\left(\begin{smallmatrix} 2\\4\\8 \end{smallmatrix}\right)\). The transpose of \(v\) is written \(v^t=t(v)=(2\; 4\; 8)\). A simple generative model demonstrates the meaning of the rank of a matrix and explains how we find it in practice. Suppose two vectors, \(u\) (a onecolumn matrix), and \(v^t=t(v)\) (a onerow matrix – the transpose of a onecolumn matrix \(v\)). We multiply a copy of \(u\) by each of the elements of \(v^t\) in turn as follows:
The \((2,3)\) entry of the matrix \(X\), written \(x_{2,3}\), is obtained by multiplying \(u_2\) by \(v_3\). We can write this \[X=\begin{pmatrix}
2&4&8\\ 4&8&16\\
6 &12&24\\
8&16&32\\
\end{pmatrix}
= u * t(v)= u * v^t\]




The matrix \(X\) we obtain here is said to be of rank 1, because both \(u\) and \(v\) have one column.
► Question 7.12
Why can we say that writing \(X = u*v^t\) is more economical than spelling out the full matrix \(X\)?
► Solution
On the other hand, suppose that we want to reverse the process and simplify another matrix \(X\) given below with 3 rows and 4 columns (12 numbers). Can we always express it in a similar way as a product of vectors without loss of information? In the diagrams shown in Figures 7.14 and 7.15, the colored boxes have areas proportional to the numbers in the cells of the matrix ((7.3)).
► Question 7.13
Here is a matrix \(X\) we want to decompose.
Figure 7.14: Some special matrices have numbers in them that make them easy to decompose. Each colored rectangle in this diagram has an area that corresponds to the number in it.
\[\begin{equation} \begin{array}{rrrrr} \hline {\large X} & x_{.1} & x_{.2} & x_{.3} & x_{.4} \\ \hline x_{1.} & 780 & 936 & 1300 & 728\\ x_{2.} & 75 & 90 & 125 & 70\\ x_{3.} & 540 & 648 & 900 & 504\\ \hline \end{array} \tag{7.3} \end{equation}\]
\(X\) has been redrawn as series of rectangles in Figure 7.14. What numbers could we put in the white \(u\) and \(v\) boxes so that the values of the sides of the rectangle give the numbers as their product?
A matrix with the special property of being perfectly “rectangular” like \(X\) is said to be of rank 1. We can represent the numbers in \(X\) by the areas of rectangles, where the sides of rectangles are given by the values in the side vectors (\(u\) and \(v\)).
We see in Figure 7.15 that the decomposition of \(X\) is not unique: there are several candidate choices for the vectors \(u\) and \(v\). We will choose these marginal vectors so that each vector has its coordinates’ sum of squares add to 1 (we say the vectors \(v\) and \(u\) have norm 1). Then we have to keep track of one extra number by which to multiply each of the products, and which represents the “overall scale” of \(X\). This is the value we have put in the upper left hand corner. It is called the singular value \(s_1\). In the R code below, we start by supposing we know the values in u
, v
and s1
; later we will see a function that finds them for us. Let’s check the multiplication and norm properties in R:
X = matrix(c(780, 75, 540,
936, 90, 648,
1300, 125, 900,
728, 70, 504), nrow = 3)
u = c(0.8196, 0.0788, 0.5674)
v = c(0.4053, 0.4863, 0.6754, 0.3782)
s1 = 2348.2
sum(u^2)
## [1] 1
sum(v^2)
## [1] 1
s1 * u %*% t(v)
## [,1] [,2] [,3] [,4]
## [1,] 780 936 1300 728
## [2,] 75 90 125 70
## [3,] 540 648 900 504
X  s1 * u %*% t(v)
## [,1] [,2] [,3] [,4]
## [1,] 0.03419 0.0745 0.1355 0.1221
## [2,] 0.00403 0.0159 0.0252 0.0186
## [3,] 0.00903 0.0691 0.1182 0.0982
► Question 7.14
Try svd(X)
in R. Look at the components of the output of the svd
function carefully. Check the norm of the columns of the matrices that result from this call. Where did the above value of s1
= 2348.2 come from?
► Solution
In fact, in this particular case we were lucky: we see that the second and third singular values are 0 (up to the numeric precision we care about). That is why we say that \(X\) is of rank 1. For a more general matrix \(X\), it is rare to be able to write \(X\) exactly as this type of twovector product. The next subsection shows how we can decompose \(X\) when it is not of rank 1: we will just need more pieces.
In the above decomposition, there were three elements: the horizontal and vertical singular vectors, and the diagonal corner, called the singular value. These can be found using the singular value decomposition function (svd
). For instance:
Xtwo = matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32,
18, 28, 52, 36, 18, 10.5, 64.5, 36), ncol = 4, byrow = TRUE)
USV = svd(Xtwo)
► Question 7.15
Look at the USV
object, the result of calling the svd
function. What are its components?
► Solution
► Question 7.16
Check how each successive pair of singular vectors improves our approximation to Xtwo
. What do you notice about the third and fourth singular values?
► Solution
Again, there are many ways to write a rank two matrix such as Xtwo
as a sum of rank one matrices: in order to ensure uniqueness, we impose yet another112 Above, we had chosen the norm of the vectors to be 1. condition on the singular vectors. The output vectors of the singular decomposition do not only have their norms equal to 1, each vector in the \(U\) matrix is orthogonal to all the previous ones. Ditto for the \(V\) matrix. \(u_{\bullet 1} \perp u_{\bullet 2}\), this means that the sum of the products of the values in the same positions is \(0\): \(\sum_i u_{i1} u_{i2} = 0\).
► Task
Check the orthonormality by computing the cross product of the \(U\) and \(V\) matrices:
t(USV$u) %*% USV$u
t(USV$v) %*% USV$v
Let’s submit our rescaled turtles
matrix to a singular value decomposition.
turtles.svd = svd(scaledTurtles)
turtles.svd$d
## [1] 11.746475 1.419035 1.003329
turtles.svd$v
## [,1] [,2] [,3]
## [1,] 0.5787981 0.3250273 0.74789704
## [2,] 0.5779840 0.4834699 0.65741263
## [3,] 0.5752628 0.8127817 0.09197088
dim(turtles.svd$u)
## [1] 48 3
► Question 7.17
What can you conclude about the turtles matrix from the svd
output?
► Solution
\(X\) is decomposed additively into rankone pieces. Each of the \(u\) vectors is combined into the \(U\) matrix, and each of \(v\) vectors into \(V\). The Singular Value Decomposition is
\[\begin{equation} {\mathbf X} = U S V^t, V^t V={\mathbb I}, U^t U={\mathbb I}, \tag{7.4} \end{equation}\]
where \(S\) is the diagonal matrix of singular values, \(V^t\) is the transpose of \(V\), and \({\mathbb I}\) is the Identity matrix. Expression (7.4) can be written elementwise as
\[\begin{equation*} X_{ij} = u_{i1}s_1v_{1j} + u_{i2}s_2v_{2j} + u_{i3}s_3v_{3j} +... + u_{ir}s_rv_{rj}, \end{equation*}\]
\(U\) and \(V\) are said to be orthonormal113 Nothing to do with the normal distribution, it stands for orthogonal and having norm 1., because their selfcrossproducts are the identity matrix.
The singular vectors from the singular value decomposition (provided by the svd
function in R) contain the coefficients to put in front of the original variables to make the more informative ones we call the principal components. We write this as:
\[\begin{equation*} Z_1=c_1 X_{\bullet 1} +c_2 X_{\bullet 2}+ c_3 X_{\bullet 3}+\cdots c_p X_{\bullet p}. \end{equation*}\]
If usv = svd(X)
, then \((c_1,c_2,c_3...)\) will be given by the first column of usv$v
. These new variables \(Z_1, Z_2, Z_3, ...\) will have variances that decrease in size \(s_1^2 \geq s_2^2 \geq s_3^2 \geq ...\).
► Question 7.18
Compute the first principal component for the turtles data by multiplying by the first singular value usv$d[1] by usv$u[,1].
What is another way of computing it ?
► Solution
Note: If you feel lost about which output matrices we are dealing with, remember the dimension of the principal components: each \({\mathbf C}=(c_1,c_2,...,c_p)\) is of length \(n\), the number of observations, which is also the number of rows of each of the vectors \(X_{\bullet 1},X_{\bullet 2},... X_{\bullet p}\).
Here are two useful facts, first in words, then with the mathematical shorthand.
is always chosen to be less than the number of original variables or the number of observations. We are “lowering” the dimension of the problem:
\[\begin{equation*} k\leq \min(n,p) \end{equation*}\]
The principal component transformation is defined so that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components:
\[\begin{equation*} \max_{aX \perp bX}\mbox{var}(\mbox{Proj}_{aX} (X)), \qquad \mbox{where } bX=\mbox{previous components} \end{equation*}\]
We revisit our twovariable athletes data with the discus
and the weight
variables. In section 7.3.2, we computed the first principal component and represented it as the purple line in Figure 7.10. We showed that \(Z_1\) was the linear combination given by the diagonal. As the coefficients have to have their sum of squares add to \(1\), we have that
Z1 = 0.707*athletes\(disc 0.707*athletes\)weight. This is the same as if the two coordinates were \(c_1=0.7071\) and \(c_2=0.7071\).
► Question 7.19
What part of the output of the svd functions leads us to the first PC coefficients, also known as the PC loadings ?
Note that we use svda
which was the svd applied to the two variables discus
and weight
.
► Solution
If we rotate the (discus,weight) plane making the purple line the horizontal \(x\) axis, we obtain what is know as the first principal plane.
ppdf = tibble(PC1n = svda$u[, 1] * svda$d[1],
PC2n = svda$u[, 2] * svda$d[2])
ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + xlab("PC1 ")+
ylab("PC2") + geom_point(aes(x=PC1n,y=0),color="red") +
geom_segment(aes(xend = PC1n, yend = 0), color = "red") +
geom_hline(yintercept = 0, color = "purple", lwd=1.5, alpha=0.5) +
xlim(3.5, 2.7) + ylim(2,2) + coord_fixed()
segm = tibble(xmin = pmin(ppdf$PC1n, 0), xmax = pmax(ppdf$PC1n, 0), yp = seq(1, 2, length = nrow(ppdf)), yo = ppdf$PC2n)
ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + ylab("PC2") + xlab("PC1") +
geom_hline(yintercept=0,color="purple",lwd=1.5,alpha=0.5) +
geom_point(aes(x=PC1n,y=0),color="red")+
xlim(3.5, 2.7)+ylim(2,2)+coord_fixed() +
geom_segment(aes(xend=PC1n,yend=0), color="red")+
geom_segment(data=segm,aes(x=xmin,xend=xmax,y=yo,yend=yo), color="blue",alpha=0.5)
► Question 7.20
► Solution
► Task
Use prcomp
to compute the PCA of the first two columns of the athletes data, look at the output. Compare to the singular value decomposition.
We now want to do a complete PCA analysis on the turtles data. Remember, we already looked at the summary statistics for the 1 and 2dimensional data. Now we are going to answer the question about the “true” dimensionality of these rescaled data. In the following code, we use the function princomp
. Its return value is a list of all the important pieces of information needed to plot and interpret a PCA.
In fact, PCA is such a fundamental technique that there are many different implementations of it in various R packages. Unfortunately, the input arguments and the formating and naming of their output is not standardized, and some even use different conventions for the scaling of their output. We will experiment with several different ones to familiarize ourselves with these choices.
cor(scaledTurtles)
## length width height
## length 1.0000000 0.9783116 0.9646946
## width 0.9783116 1.0000000 0.9605705
## height 0.9646946 0.9605705 1.0000000
pcaturtles = princomp(scaledTurtles)
pcaturtles
## Call:
## princomp(x = scaledTurtles)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3
## 1.6954576 0.2048201 0.1448180
##
## 3 variables and 48 observations.
Figure 7.17: The screeplot shows the eigenvalues for the standardized turtles data (scaledTurtles
): there is one large value and two small ones. The data are (almost) onedimensional. We will see why this dimension is called an axis of size, a frequent phenomenon in biometric data (Jolicoeur, Mosimann, and others 1960), .
library("factoextra")
fviz_eig(pcaturtles, geom = "bar", bar_width = 0.4) + ggtitle("")
► Question 7.21
Many PCA functions have been created by different teams who worked in different areas at different times. This can lead to confusion, especially because they have different naming conventions. Let’s compare three of them; run the following lines of code and look at the resulting objects:
svd(scaledTurtles)$v[, 1]
prcomp(turtles[, 1])$rotation[, 1]
princomp(scaledTurtles)$loadings[, 1]
dudi.pca(turtles[, 1], nf = 2, scannf = FALSE)$c1[, 1]
What happens when you disable the scaling in the prcomp
and princomp
functions?
In what follows, we always suppose that the matrix \(X\) represents the centered and scaled matrix.
► Question 7.22
The coordinates of the observations in the new variables from the prcomp
function (call it res
) are in the scores
slot of the result. Take a look at PC1 for the turtles
and compare it to res$scores
. Compare the standard deviation sd1 to that in the res
object and to the standard deviation of the scores.
► Solution
► Question 7.23
Check the orthogonality of the res$scores
matrix.Why can’t we say that it is orthonormal?
Now we are going to combine both the PC scores (\(US\)) and the loadingscoefficients (\(V\)). The plots with both the samples and the variables represented are called biplots. This can be done in one line using the following factoextra package function.
fviz_pca_biplot(pcaturtles, label = "var", habillage = turtles[, 1]) +
ggtitle("")
Beware the aspect ratio when plotting a PCA. It is rare to have the two components be of similar norm, so square shaped plots will be the exception. More common are elongated plots, which show that the horizontal (first) principal component is more important than the second. This matters, e.g., for interpreting distances between points in the plots.
► Question 7.24
Is it possible to have a PCA plot with the PC1 as the horizontal axis whose height is longer than its width?
► Solution
► Question 7.25
Looking at Figure 7.18: a) Did the males or female turtles tend to be larger?
b) What do the arrows tell us about the correlations?
► Question 7.26
Compare the variance of each new coordinate to the eigenvalues returned by the PCA dudi.pca function.
► Solution
Now we look at the relationships between the variables, both old and new by drawing what is known as the correlation circle. The aspect ratio is 1 here and the variables are represented by arrows as shown in Figure 7.19. The lengths of the arrows indicate the quality of the projections onto the first principal plane:
Figure 7.19: Part of the “circle of correlations” showing the original variables. Their correlations with each other and with the new principal components are given by the angles between the vectors and between the axes and the vectors.
fviz_pca_var(pcaturtles, col.circle = "black") + ggtitle("") +
xlim(c(1.2, 1.2)) + ylim(c(1.2, 1.2))
► Question 7.27
Explain the relationships between the number of rows of our turtles data matrix and the following numbers:
svd(scaledTurtles)$d/pcaturtles$sdev
sqrt(47)
► Solution
These data are a good example of how sometimes almost all the variation in the data can be captured in a lowerdimensional space: here, threedimensional data can be essentially replaced by a line. If we only want the first one then it is just \(c_1=s_1 u_1\). Notice that \(c_1^2=s_1^tu_1 u_1^t s_1= s_1^2 u_1^tu_1=s_1^2=\lambda_1\)
Keep in mind: \(X^tC=VSU^tUS=VS^2.\) The principal components are the columns of the matrix \(C=US\). The \(p\) columns of \(U\) (the matrix given as USV$u
in the output from the svd
function above) are rescaled to have norms \((s_1^2,s_2^2,...,s_p^2)\). Each column has a different variance it is responsible for explaining. Notice that these will be decreasing numbers.
If the matrix \(X\) comes from the study of \(n\) different samples or specimens, then the principal components provides new coordinates for these \(n\) points as in Figure 7.16. These are sometimes called the scores in the results of PCA functions.
Figure 7.20: Another great xkcd take: this time eigenvectors.
Before we go into more detailed examples, let’s summarize what SVD and PCA provide:
Each principal component has a variance measured by the corresponding eigenvalue, the square of the corresponding singular value.
The new variables are made to be orthogonal. Since they are also centered, this means they are uncorrelated. In the case of normal distributed data, this also means they are independent. Eigen Decomposition: The crossproduct of X with itself verifies \[X^tX=VSU^tUSV^t=VS^2V^t=V\Lambda V^t\] where \(V\) is called the eigenvector matrix of the symmetric matrix \(X^tX\) and \(\Lambda\) is the diagonal matrix of eigenvalues of \(X^tX\).
When the variables are have been rescaled, the sum of the variances of all the variables is the number of variables (\(=p\)). The sum of the variances is computed by adding the diagonal of the crossproduct matrix114 This sum of the diagonal elements is called the trace of the matrix..
The principal components are ordered by the size of their eigenvalues. We always check the screeplot before deciding how many components to retain. It is also best practice to do as we did in Figure 7.18 and annotate each PC axis with the proportion of variance it explains.
► Task
Look up eigenvalue in the Wikipedia. Try to find a sentence that defines it without using a formula. Why would eigenvectors come into use in Cinderella (at a stretch)? (See the xkcd cartoon in Figure 7.20.)
For help with the basics of linear algebra, a motivated student pressed for time may consult Khan's Academy. If you have more time and would like in depth coverage, Gil Strang's MIT course is a classic, and some of the book is available online (Strang 2009).
We looked briefly at part of this data earlier in the chapter; here we will follow step by step a complete multivariate analysis. First we take a second look at the rounded correlation matrix capturing the essential of the bivariate associations. This was represented with colors in Figure 7.3.
cor(athletes) %>% round(1)
## m100 long weight highj m400 m110 disc pole javel m1500
## m100 1.0 0.5 0.2 0.1 0.6 0.6 0.0 0.4 0.1 0.3
## long 0.5 1.0 0.1 0.3 0.5 0.5 0.0 0.3 0.2 0.4
## weight 0.2 0.1 1.0 0.1 0.1 0.3 0.8 0.5 0.6 0.3
## highj 0.1 0.3 0.1 1.0 0.1 0.3 0.1 0.2 0.1 0.1
## m400 0.6 0.5 0.1 0.1 1.0 0.5 0.1 0.3 0.1 0.6
## m110 0.6 0.5 0.3 0.3 0.5 1.0 0.1 0.5 0.1 0.1
## disc 0.0 0.0 0.8 0.1 0.1 0.1 1.0 0.3 0.4 0.4
## pole 0.4 0.3 0.5 0.2 0.3 0.5 0.3 1.0 0.3 0.0
## javel 0.1 0.2 0.6 0.1 0.1 0.1 0.4 0.3 1.0 0.1
## m1500 0.3 0.4 0.3 0.1 0.6 0.1 0.4 0.0 0.1 1.0
Then we start looking at the screeplot, which will help us choose a rank \(k\) for these data.
pca.ath = dudi.pca(athletes, scannf = FALSE)
pca.ath$eig
## [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275
## [7] 0.4305952 0.3067981 0.2669494 0.1018542
Figure 7.21: The screeplot is the first thing to consult, it tells us that it is satisfactory to use a twodimensional plot.
fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")
Figure 7.21 shows that eigenvalues in the screeplot make a clear drop after the second eigenvalue. This indicates a good approximation will be obtained at rank 2. Let’s look at an interpretation of the first two axes by projecting the loadings of the original (old) variables as they project onto the two new ones.
Figure 7.22: Correlation circle of the original variables.
fviz_pca_var(pca.ath, col.circle = "black") + ggtitle("")
The correlation circle Figure 7.22 displays the projection of the original variables onto the two first new principal axes. The angles between vectors are interpreted as correlations. On the right side of the plane, we have the track and field events (m110, m100, m400, m1500), and on the left, we have the throwing and jumping events. Maybe there is an opposition of skills as characterized in the correlation matrix. We did see the correlations were negative between variables from these two groups. How can we interpret this?
It seems that those who throw the best have lower scores in the track competitions. In fact, if we look at the original measurements, we can see what is happening. The athletes who run in short times are the stronger ones, as are the ones who throw or jump longer distances. We should probably change the scores of the track variables and redo the analysis.
► Question 7.28
What transformations of the variables induce the best athletic performances to vary in the same direction, i.e. be mostly positively correlated?
► Solution
Figure 7.23: Correlation circle after changing the signs of the running variables.
fviz_pca_var(pcan.ath, col.circle="black") + ggtitle("")
Figure 7.23 shows the correlation circle of the transformed variable. We now see we have an axis of size, all the arrows are pointing in the same direction.
Now all the negative correlations are quite small ones. The screeplot will show no change, as the eigenvalues are unchanged. The signs in the coefficients of the PC loadings for the \(m\) variables are the only ouput that changes.
We now plot the athletes projected in the first principal plane using:
Figure 7.24: First principal plane showing the projections of the athletes. Do you notice something about the organization of the numbers?
fviz_pca_ind(pcan.ath) + ggtitle("") + ylim(c(2.5,5.7))
► Question 7.29
► Solution
It turns out there is complementary information available on the olympic
data. An extra variable called score
is a vector of the final scores at the competition (men’s decathlon at the 1988 Olympics).
data("olympic", package = "ade4")
olympic$score
## [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114
## [13] 8093 8083 8036 8021 7869 7860 7859 7781 7753 7745 7743 7623
## [25] 7579 7517 7505 7422 7310 7237 7231 7016 6907
We now make the scatterplot comparing the first principal component score of the athletes to this score from the data. This is shown in Figure 7.25; we can see a strong correlation between the two variables. We notice that the athlete number one (who in fact won the Olympic decathlon gold medal) has the highest score but not the highest value for the PC1 linear combination. Why do you think that is?
Figure 7.25: Scatterplot of the scores given as a supplementary variable and the first principal component. The points are labeled by their order in the data set. We can see a very strong correlation between this supplementary score variable and the first principal coordinate, why is it not a perfectly linear fit?
Figure 7.26: A screeplot showing ‘dangerously’ similar variances. Choosing to cutoff at a hard threshold of 80% of the variance would give unstable PC plots. With so such cutoff, the axes corresponding to the 3D subspace of 3 similar eigenvalues are unstable and cannot be individually interpreted.
We have seen in the examples that the first step in PCA is to make the screeplot of the variances of the new variables (equal to the eigenvalues). We cannot decide how many dimensions are needed before seeing this plot. The reason is that there are situations when the principal components are illdefined: when two or three successive PCs have very similar variances, giving a screeplot as in Figure 7.26, the subspace corresponding to a group of similar eigenvalues exists. In this case this would be 3D space generated by \(u_2,u_3,u_4\). The vectors are not meaningful individually and one cannot interpret their loadings. This is because a very slight change in one observations could give a completely different set of three vectors. These would generate the same 3D space, but could have very different loadings. We say the PCs are unstable.
We have seen that unlike regression, PCA treats all variables equally (to the extent that they were preprocessed to have equivalent standard deviations). However, it is still possible to map other continuous variables or categorical factors onto the plots in order to help interpret the results. Often we have complementary information on the samples, for example, diagnostic labels in the diabetes data or the cell types in the Tcell gene expression data.
Here we see how we can use such extra variables to inform our interpretation. The best place to store such socalled metadata is in appropriate slots of the data object (such as in the Bioconductor SummarizedExperiment class); the secondbest, in additional columns of the data frame that also contains the numeric data. In practice, such information is often stored in a more or less cryptic manner in the row names of the matrix. Below, we need to face the latter scenario, and we use substr
gymnastics to extract the cell types and show the screeplot in Figure 7.27 and the PCA in Figure 7.28.
Figure 7.27: Tcell expression PCA screeplot.
pcaMsig3 = dudi.pca(Msig3transp, center = TRUE, scale = TRUE,
scannf = FALSE, nf = 4)
fviz_screeplot(pcaMsig3) + ggtitle("")
ids = rownames(Msig3transp)
celltypes = factor(substr(ids, 7, 9))
status = factor(substr(ids, 1, 3))
table(celltypes)
## celltypes
## EFF MEM NAI
## 10 9 11
cbind(pcaMsig3$li, tibble(Cluster = celltypes, sample = ids)) %>%
ggplot(aes(x = Axis1, y = Axis2)) +
geom_point(aes(color = Cluster), size = 5) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
scale_color_discrete(name = "Cluster") + coord_fixed()
These data requires delicate preprocessing before we obtain our desired matrix with the relevant features as columns and the samples as rows. Starting with the raw mass spectroscopy readings, the steps involve extracting peaks of relevant features, aligning them across multiple samples and estimating peak heights. We refer the reader to the vignette of the Bioconductor xcms package for gruesome details. We load a matrix of data generated in such a way from the file mat1xcms.RData
. The output of the below code is in Figures 7.29 and 7.30.
load("../data/mat1xcms.RData")
dim(mat1)
## [1] 399 12
Figure 7.29: Screeplot showing the eigenvalues for the mice data.
pcamat1 = dudi.pca(t(mat1), scannf = FALSE, nf = 3)
fviz_eig(pcamat1, geom = "bar", bar_width = 0.7) + ggtitle("")
dfmat1 = cbind(pcamat1$li, tibble(
label = rownames(pcamat1$li),
number = substr(label, 3, 4),
type = factor(substr(label, 1, 2))))
pcsplot = ggplot(dfmat1,
aes(x=Axis1, y=Axis2, label=label, group=number, colour=type)) +
geom_text(size = 4, vjust = 0.5)+ geom_point(size = 3)+ylim(c(18,19))
pcsplot + geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2)
► Question 7.30
Looking at Figure 7.30, do the samples seem to be randomly placed in the plane? Do you notice any structure explained by the labels?
► Solution
In the previous example, the number of variables measured was too large to enable useful concurrent plotting of both variables and samples. In this example we plot the PCA biplot of a simple data set where chemical measurements were made on different wines for which we also have a categorical wine.class
variable. We start the analysis by looking at the twodimensional correlations and a heatmap of the variables.
library("pheatmap")
load("../data/wine.RData")
load("../data/wineClass.RData")
wine[1:2, 1:7]
## Alcohol MalicAcid Ash AlcAsh Mg Phenols Flav
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76
Figure 7.31: The difference between 1 and the correlation can be used as a distance between variables and is used to make a heatmap of the associations between the variables.
pheatmap(1  cor(wine), treeheight_row = 0.2)
winePCAd = dudi.pca(wine, scannf=FALSE)
table(wine.class)
## wine.class
## barolo grignolino barbera
## 59 71 48
fviz_pca_biplot(winePCAd, geom = "point", habillage = wine.class,
col.var = "violet", addEllipses = TRUE, ellipse.level = 0.69) +
ggtitle("") + coord_fixed()
A biplot is a simultaneous representation of both the space of observations and the space of variables. In the case of a PCA biplot like Figure 7.32 the arrows represent the directions of the old variables as they project onto the plane defined by the first two new axes. Here the observations are just colored dots, the color has been chosen according to which type of wine is being plotted. We can interpret the variables’ directions with regards to the sample points, for instance the blue points are from the barbera group and show higher Malic Acid content than the other wines.
Interpretation of multivariate plots requires the use of as much of the available information as possible; here we have used the samples and their groups as well as the variables to understand the main differences between the wines.
Sometimes we want to see variability between different groups or observations, but want to weight them. This can be the case if, e.g., the groups have very different sizes. Let’s reexamine the Hiiragi data we already saw in Chapter 3 (Ohnishi et al. 2014). In the code below, we select the wildtype (WT) samples and the top 100 features with the highest overall variance.
data("x", package = "Hiiragi2013")
xwt = x[, x$genotype == "WT"]
sel = order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
xwt = xwt[sel, ]
tab = table(xwt$sampleGroup)
tab
##
## E3.25 E3.5 (EPI) E3.5 (PE) E4.5 (EPI) E4.5 (PE)
## 36 11 11 4 4
Figure 7.33: Screeplot from the weighted PCA of the Hiiragi data. The drop after the second eigenvalue suggests that a twodimensional PCA is appropriate.
xwt$weight = 1 / as.numeric(tab[xwt$sampleGroup])
pcaMouse = dudi.pca(as.data.frame(t(Biobase::exprs(xwt))),
row.w = xwt$weight,
center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
fviz_eig(pcaMouse) + ggtitle("")
fviz_pca_ind(pcaMouse, geom = "point", col.ind = xwt$sampleGroup) +
ggtitle("") + coord_fixed()
We see from tab
that the groups are represented rather unequally. To account for this, we reweigh each sample by the inverse of its group size. The function dudi.pca
in the ade4 package has a row.w
argument into which we can enter the weights. The output of the code is in Figures 7.33 and 7.34.
Preprocessing matrices Multivariate data analyses require “conscious” preprocessing. After consulting all the means, variances and onedimensional histograms we saw how to rescale and recenter the data.
Projecting onto new variables We saw how we can make projections on lower dimensions (planes and 3D are the most frequently used) of very high dimensional data without losing too much information. PCA searches for new more informative
variables which are linear combinations of the original (old) ones.
Matrix decomposition PCA is based on finding decompositions of the matrix \(X\) called SVD, this decomposition provides lower rank approximation and is equivalent to the eigenanalysis of \(X^tX\). The squares of the singular values are equal to the eigenvalues and to the variances of the new variables. We systematically plotted these values before deciding how many axes are necessary to reproduce the signal in the data.
Two eigenvalues which are quite close can give rise to scores or PC scores which are highly unstable. It is always necessary to look at the screeplot of the eigenvalues and avoid separating the axes corresponding to the these close eigenvalues. This may require using interactive three or fourdimensional projections, these are available in several R packages.
Biplot representations The space of observations is naturally a pdimensional space (the p original variables provide the coordinates). The space of variables is ndimensional. Both decompositions we have studies (singular values/eigenvalue and singular vectors/eigenvectors) provide new coordinates for both of these spaces, sometimes we call one the dual of the other. We can plot the projection of both the observations and the variables onto the same eigenvectors. This provides a biplot that can be useful for interpreting the PCA output.
Projecting other group variables Interpretation of PCA can also be facilitated by redundant or contiguous data about the observations.
The best way to deepen your understanding of singular value decomposition is to read Chapter 7 of Strang (2009). The whole book sets the foundations for the linear algebra necessary to understanding the meaning of the rank of matrix and the duality between row spaces and column spaces (Holmes 2006).
Complete textbooks have been written on the subject of PCA and related methods; Mardia, Kent, and Bibby (1979) is a standard text that covers all multivariate methods in a classical way; with linear algebra and matrices. By making the parametric assumptions that the data come from multivariate normal distributions, Mardia, Kent, and Bibby (1979) also provide inferential tests for the number of components and limiting properties for principal components. Jolliffe (2002) is a booklong treatment of everything to do with PCA with extensive examples.
We can incorporate supplementary information into weights for the observations and the variables. This was introduced in the 1970’s by French data scientists, see Holmes (2006) for a review and chapter 9 for further examples.
Improvements to the interpretation and stability of PCA can be obtained by adding a penalty that minimizes the number of nonzero coefficients that appear in the linear combinations. Zou, Hastie, and Tibshirani (2006) and Witten, Tibshirani, and Hastie (2009) have developed sparse versions of principal components, and their packages elasticnet and PMA provide implementations in R.
Revise the material about svd by reading sections 1, 2, and 3 of the Wikipedia article about SVD. It will also be beneficial to read about the related eigenvalue decomposition by reading sections 1, 2, and 2.1 of the Wikipedia article about eigendecomposition of a matrix. We know that we can decompose a \(n\) row by \(p\) column rank 1 matrix \(X\) as :
\[\begin{equation*} {\bf X} = \begin{pmatrix} x_{11} & x_{12} &... & x_{1p}\\ x_{21} & x_{22} &... & x_{2p}\\ \vdots & \vdots &\vdots & \vdots \\ x_{n1} & x_{n2} &... & x_{np} \end{pmatrix} = \begin{pmatrix} u_{11} \\ u_{21} \\ \vdots \\ u_{n1} \\ \end{pmatrix} \times \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{p1} \end{pmatrix} \end{equation*}\]
7.1a If \(X\) has no rows and no columns which are all zeros, then is this decomposition unique?
7.1b Generate a rankone matrix: Start by taking a vector of length 15 with values from 2 to 30 in increments of 2, and a vector of length 4 with values 3,6,9,12, take their ‘product’:
u = seq(2, 30, by = 2)
v = seq(3, 12, by = 3)
X1 = u %*% t(v)
Why do we have to take t(v)
?
7.1c Now we add some noise in the form a matrix we call Materr
so we have an “approximately rankone” matrix.
Materr = matrix(rnorm(60,1),nrow=15,ncol=4)
X = X1+Materr
Visualize \(X\) using ggplot
.
d Redo the same analyses with a rank 2 matrix.
7.2 a Create a first a matrix of highly correlated bivariate data such as that shown in Figure 7.35.
Hint: Use the function mvrnorm
.
Check the rank of the matrix by looking at its singular values.
7.2 b Perform a PCA and show the rotated principal component axes.
Part (A) in Figure 7.35 shows a very elongated plotting region, why?
What happens if you do not use the coord_fixed()
option and have a square plotting zone? Why can this be misleading?
Let’s revisit the Hiiragi data and compare the weighted and unweighted approaches. 7.4a Make a correlation circle for the unweighted Hiiragi data xwt. Which genes have the best projections on the first principal plane (best approximation)? 7.4b Make a biplot showing the labels of the extreme genevariables that explain most of the variance in the first plane. Add the the samplepoints.
Abbott, Edwin A. 1884. Flatland: A Romance of Many Dimensions. OUP Oxford.
Flury, Bernard. 1997. A First Course in Multivariate Statistics. Springer.
Holmes, Susan. 2006. “Multivariate Analysis: The French way.” In Probability and Statistics: Essays in Honor of David a. Freedman, edited by D. Nolan and T. P. Speed. Vol. 56. IMS Lecture Notes–Monograph Series. Beachwood, OH: IMS. http://www.imstat.org/publications/lecnotes.htm.
Holmes, Susan, Michael He, Tong Xu, and Peter P Lee. 2005. “Memory T Cells Have Gene Expression Patterns Intermediate Between Naive and Effector.” PNAS 102 (15). National Acad Sciences: 5519–23.
Hotelling, Harold. 1933. “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology 24 (6): 417–41.
Jolicoeur, Pierre, James E Mosimann, and others. 1960. “Size and Shape Variation in the Painted Turtle. A Principal Component Analysis.” Growth 24 (4): 339–54.
Jolliffe, Ian. 2002. Principal Component Analysis. Wiley Online Library.
Mardia, Kanti, John T Kent, and John M Bibby. 1979. Multiariate Analysis. New York: Academic Press.
Ohnishi, Y., W. Huber, A. Tsumura, M. Kang, P. Xenopoulos, K. Kurimoto, A. K. Oles, et al. 2014. “CelltoCell Expression Variability Followed by Signal Reinforcement Progressively Segregates Early Mouse Lineages.” Nature Cell Biology 16 (1): 27–37.
Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11). Taylor & Francis: 559–72.
Strang, Gilbert. 2009. Introduction to Linear Algebra. Fourth. WellesleyCambridge Press.
Witten, Daniela M, Robert Tibshirani, and Trevor Hastie. 2009. “A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis.” Biostatistics. Biometrika Trust, kxp008.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2006. “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics 15 (2). Taylor & Francis: 265–86.
Page built: 20190515
Support for maintaining the online version of this book is provided by de.NBI
For website support please contact msmith@embl.de