This post discusses how you can get horseshoe-looking patterns (sometimes also called cyclic, or U- or V-shaped) in common dimension reduction methods, from data that are generated by processes that have little to do with horseshoes, cycles, Us or Vs.
In their paper “Horseshoes in multidimensional scaling and local kernel methods” in The Annals of Applied Statistics (2008), Vol. 2, page 777 (DOI: 10.1214/08-AOAS165), Diaconis, Goel and Holmes looked at situations in which the distance matrix of a set of objects is concentrated along the diagonal (see Figure 1, upper left panel). They set out from the example of the voting records of the members of the 2005 United States House of Representatives. From these records, they computed all pairwise distances (or dissimilarities), applied classical multi-dimensional scaling, and observed a horseshoe pattern.
Generalizing this, and following the ideas of Diaconis, Goel and Holmes, consider a set of objects whose (dis)similarities are described by the matrix of pairwise distances \(D\). In particular, consider the idealised situation of matrices like the following one. The elements of \(D\) are 0 along the diagonal, and the values increase with the distance from the diagonal, with some rate \(\lambda^{-1}\), until they saturate at some maximum value, which w.l.o.g. we can set to 1:
\[ D_{ij} = 1 - e^{\displaystyle -\lambda | i-j |} \] where \(n\) is a positive integer, \(i,j=1,\ldots,n\), and \(D\) is an \(n\times n\) matrix.
set.seed(0xbedada)
library("conflicted")
library("dplyr")
library("ggplot2")
library("gridExtra")
library("reshape2")
library("ggfortify")
library("Rtsne")
library("umap")
plotMatrixAsImage <- function(x) {
ggplot(melt(x), aes(x = Var1, y = Var2, fill = value)) + geom_tile() +
scale_fill_gradient2() +
ggtitle(deparse(substitute(x))) + coord_fixed() + xlab("") + ylab("") +
theme(plot.title = element_text(hjust = 0.5))
}
plotMDS <- function(x, tit, flipy) {
dat <- as_tibble(`colnames<-`(x, c("P1", "P2")))
if (tit %in% flipy)
dat <- mutate(dat, P2 = -P2)
ggplot(dat) +
geom_point(aes(x = P1, y = P2)) + coord_fixed() +
ggtitle(tit) +
theme(plot.title = element_text(hjust = 0.5)) +
expand_limits(y = c(-1, 1) * 0.2 * max(abs(dat$P1)))
}
variousMDSPlots <- function(..., perplexity = 30, flipy = character(0)) {
grid.arrange(
plotMatrixAsImage(...),
plotMDS(stats::cmdscale(...), "cmdscale", flipy),
plotMDS(MASS::isoMDS(...)$points, "isoMDS", flipy),
plotMDS(Rtsne(..., is_distance = TRUE, perplexity = perplexity)$Y, "t-SNE", flipy),
plotMDS(umap::umap(..., input = "dist")$layout, "UMAP", flipy),
layout_matrix = rbind(c(1, 2, 3), c(1, 4, 5))
)
}
diaconisMatrix <- function(lambda, n)
1 - outer(seq(0, 1, length.out = n), seq(0, 1, length.out = n),
FUN = function(a, b) exp(-lambda * abs(a-b)))
D <- diaconisMatrix(lambda = 2, n = 100)
variousMDSPlots(D)
For good measure, let us see what happens if we increase lambda
, i.e., shorter range of distance increase away from the diagonal.
D2 <- diaconisMatrix(lambda = 6, n = 100)
variousMDSPlots(D2, flipy = c("cmdscale", "isoMDS"))
D3 <- diaconisMatrix(lambda = 20, n = 100)
variousMDSPlots(D3)
The horseshoe becomes even bendier. In some of these plots, the endpoints (1st and \(n\)-th object) are laid out closer to each other than to the objects in the middle, even though this disrespects the true distances D2
, D3
.
Let’s look at some of the eigenvectors of the distance matrix D
, after double centering. These are used by cmdscale
(albeit not directly by the other methods).
eigD <- eigen(multivariance:::double.center(D, normalize = FALSE))
colnames(eigD$vectors) <- paste0("EV", seq_len(ncol(D)))
pairs(eigD$vectors[, 1:5])
ggplot(melt(eigD$vectors[, 1:5], id = "x") %>% mutate(Var2 = factor(Var2)),
aes_string(x = "Var1", y = "value", colour = "Var2", group = "Var2")) +
geom_line(size = sqrt(2)) +
scale_colour_manual("Eigenvectors", values = unname(pals::alphabet(5)))
ev <- eigD$vectors[, 1:5] *
matrix(eigD$values[1:5], byrow = TRUE, ncol = 5, nrow = nrow(eigD$vectors))
ggplot(melt(ev, id = "x") %>% mutate(Var2 = factor(Var2)),
aes_string(x = "Var1", y = "value", colour = "Var2", group = "Var2")) +
geom_line(size = sqrt(2)) +
scale_colour_manual("Eigenvectors", values = unname(pals::alphabet(5)))
If we think of the (double centered) matrix \(D\) as a discretized version of a differential operator, we see (see Figure 7) that a dominating component is the negative of the second derivative, i.e., \(-d^2/dx^2\), and therefore it is not surprising to see eigenfunctions that resemble harmonic functions.
\[ \begin{align} \frac{d^2}{dx^2} f(x) &= \lim_{h\to0} \frac{f(x+h)-2f(x)+f(x-h)}{h^2}\\ -\frac{d^2}{dx^2} f(x) &= k^2 f(x) \Leftrightarrow f(x) \propto e^{ikx} = \cos kx + i \sin kx \end{align} \]
plotMatrixAsImage(multivariance:::double.center(D, normalize = FALSE)) +
ggtitle("double centered D")
A straight line in \({\mathbb R}^d\) is parameterized by
\[ x = at + b \quad\quad\text{for }t\in{\mathbb R}, \] where \(a\) and \(b\) are fixed vectors in \({\mathbb R}^n\) (slope and offset). In the code below, we also add some Gaussian noise, with \(\sigma=0.5\).
d <- 100 # dimension of the space
n <- 42 # number of points
t <- seq(0, n, by = 1) %>% matrix(nrow = 1) # row vector
a <- runif(d) %>% matrix(nrow = d, ncol = 1) # column vector
b <- runif(d) %>% matrix(nrow = d, ncol = length(t)) # recycle
x <- a %*% t + b
x <- x + matrix(rnorm(prod(dim(x)), sd = 0.5), ncol = ncol(x), nrow = nrow(x))
dm <- as.matrix(dist(t(x)))
variousMDSPlots(dm, perplexity = 10)
Instead of the Euclidean distances, add some saturation (underestimate long distances). For instance, we can take the function \(s(x) = 1-e^{-x/x_0}\), for which \(s(0)=0\) and \(\lim_{x\to\infty}s(x)=1\).
sat <- function(x, x0 = median(x)) { x0 * (1 - exp(-x/x0)) }
ggplot(tibble(x = as.vector(dm),
y = as.vector(sat(dm))), aes(x = x, y = y)) +
geom_line() + xlab("d") + ylab("sat(d)")
variousMDSPlots(sat(dm), perplexity = 11)
Back to the Euclidean distances, but with noise on the points’ coordinates.
dmnoisy <- as.matrix(dist(t(x + matrix(rnorm(prod(dim(x)), sd = 5), ncol = ncol(x), nrow = nrow(x)))))
variousMDSPlots(dmnoisy, perplexity = 11)
In the code below, we create a data matrix of independent, identically distributed random numbers (white.noise
), to which we apply a 2D smooothing operator (filter2
from the EBImage
package) with bandwidth b
. The resulting matrix x
has size n
times k
. (The matrix white.noise
has an additional b
rows and columns padded to the left, right, top and bottom to avoid having to make choices about how filter2
should deal with the boundaries.)
n <- 100
k <- 150
b <- 35
white.noise <- matrix(rnorm((k+2*b) * (n+2*b)), nrow = n+2*b, ncol = k+2*b)
smoothened <- EBImage::filter2(white.noise, EBImage::makeBrush(b))
x <- smoothened[ b+(1:n), b+(1:k) ]
Let’s do a PCA plot of x
, using the first two principal components
autoplot(prcomp(x)) + coord_fixed()
We see an (almost-closed) cycle. If the data were from a single-cell RNA-seq experiment (with, say, n
cells and k
genes), it could be tempting to see a cellular trajectory that we might think is related to a differentiation process and/or the cell cycle. There are software packages that help fit curves into such plots, which could be interpreted as “pseudo-time”. In many cases, this will be very well (and useful and true) - but in this case we know that the data-generating process had nothing to do with time or some other underlying one-dimensional process or phenomenon. What is going on?
Below, we plot the matrix in false color representation, as well as its covariance matrix.
grid.arrange(plotMatrixAsImage(x), plotMatrixAsImage(cov(x)), ncol = 2)
variousMDSPlots(as.matrix(dist(x)))
We can be even more minimalistic with the non-randomness of the input data. In the code below, the k
times n
matrix rw
is filled with the numbers from a random walk (a realisation of 1D Brownian motion).
rw <- matrix(cumsum(rnorm(k*n)), nrow = k, ncol = n)
grid.arrange(plotMatrixAsImage(rw), autoplot(prcomp(rw)) + coord_fixed(),
ncol = 2)
We again see the horseshoe. And interesting patterns in the first five principal components:
pairs(prcomp(rw)$x[, 1:5])
variousMDSPlots(as.matrix(dist(rw)))
Now, instead of a Brownian motion, let’s just take a series of independent random numbers and subject them to a low-pass filter of window size h
(i.e., a running-average smoother).
h <- 20
rw <-
matrix(
stats::filter(rnorm(k*n + h), rep(1, h))[ h/2 + 1:(n*k) ],
nrow = k, ncol = n)
grid.arrange(plotMatrixAsImage(rw), autoplot(prcomp(rw)) + coord_fixed(),
ncol = 2)
variousMDSPlots(as.matrix(dist(rw)))
Embeddings of high-dimensional data into lower dimensional spaces are useful; but they can also create apparent one-dimensional (“time-like”) patterns that have little to do with the data-generating process. Be aware.
devtools::session_info()
## ─ Session info ────────────────────────────────────────────────────────────
## setting value
## version R version 3.5.3 (2019-03-11)
## os macOS Mojave 10.14.4
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Berlin
## date 2019-04-04
##
## ─ Packages ────────────────────────────────────────────────────────────────
## package * version date lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 3.5.0)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
## backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0)
## BiocGenerics 0.28.0 2018-10-30 [1] Bioconductor
## BiocManager 1.30.4 2018-11-13 [1] CRAN (R 3.5.0)
## BiocStyle * 2.10.0 2018-10-30 [1] Bioconductor
## bitops 1.0-6 2013-08-17 [1] CRAN (R 3.5.0)
## bookdown 0.9 2018-12-21 [1] CRAN (R 3.5.0)
## callr 3.2.0 2019-03-15 [1] CRAN (R 3.5.2)
## cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.2)
## codetools 0.2-16 2018-12-24 [1] CRAN (R 3.5.3)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2)
## conflicted * 1.0.2 2019-03-29 [1] CRAN (R 3.5.3)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
## crosstalk 1.0.0 2016-12-21 [1] CRAN (R 3.5.0)
## desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
## devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.1)
## dichromat 2.0-0 2013-01-24 [1] CRAN (R 3.5.0)
## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0)
## dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
## EBImage 4.24.0 2018-10-30 [1] Bioconductor
## evaluate 0.13 2019-02-12 [1] CRAN (R 3.5.2)
## fftwtools 0.9-8 2017-03-25 [1] CRAN (R 3.5.0)
## fs 1.2.7 2019-03-19 [1] CRAN (R 3.5.2)
## ggfortify * 0.4.6 2019-03-20 [1] CRAN (R 3.5.2)
## ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.0)
## glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.2)
## gridExtra * 2.3 2017-09-09 [1] CRAN (R 3.5.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.5.3)
## highr 0.8 2019-03-20 [1] CRAN (R 3.5.3)
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
## htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.1)
## httpuv 1.5.0 2019-03-15 [1] CRAN (R 3.5.2)
## jpeg 0.1-8 2014-01-23 [1] CRAN (R 3.5.0)
## jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0)
## knitr 1.22 2019-03-08 [1] CRAN (R 3.5.2)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.5.0)
## later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2)
## lattice 0.20-38 2018-11-04 [1] CRAN (R 3.5.3)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.5.2)
## locfit 1.5-9.1 2013-04-20 [1] CRAN (R 3.5.0)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
## manipulateWidget 0.10.0 2018-06-11 [1] CRAN (R 3.5.0)
## mapproj 1.2.6 2018-03-29 [1] CRAN (R 3.5.0)
## maps 3.3.0 2018-04-03 [1] CRAN (R 3.5.0)
## MASS 7.3-51.3 2019-03-31 [1] CRAN (R 3.5.3)
## Matrix 1.2-17 2019-03-22 [1] CRAN (R 3.5.3)
## memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
## mime 0.6 2018-10-05 [1] CRAN (R 3.5.1)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.5.0)
## multivariance 2.1.0 2019-03-19 [1] CRAN (R 3.5.2)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0)
## pals 1.5 2018-01-22 [1] CRAN (R 3.5.0)
## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0)
## pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.5.3)
## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0)
## pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0)
## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0)
## png 0.1-7 2013-12-03 [1] CRAN (R 3.5.0)
## prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
## processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2)
## promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.0)
## ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0)
## purrr 0.3.2 2019-03-15 [1] CRAN (R 3.5.2)
## R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
## Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.5.2)
## RCurl 1.95-4.12 2019-03-04 [1] CRAN (R 3.5.2)
## remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0)
## reshape2 * 1.4.3 2017-12-11 [1] CRAN (R 3.5.0)
## reticulate 1.11.1 2019-03-06 [1] CRAN (R 3.5.2)
## rgl 0.100.19 2019-03-12 [1] CRAN (R 3.5.2)
## rlang 0.3.3 2019-03-29 [1] CRAN (R 3.5.3)
## rmarkdown 1.12 2019-03-14 [1] CRAN (R 3.5.2)
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
## RSpectra 0.13-1 2018-05-22 [1] CRAN (R 3.5.0)
## Rtsne * 0.15 2018-11-10 [1] CRAN (R 3.5.0)
## scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0)
## shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.0)
## stringi 1.4.3 2019-03-12 [1] CRAN (R 3.5.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
## testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.1)
## tibble 2.1.1 2019-03-16 [1] CRAN (R 3.5.2)
## tidyr 0.8.3 2019-03-01 [1] CRAN (R 3.5.2)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0)
## tiff 0.1-5 2013-09-04 [1] CRAN (R 3.5.0)
## umap * 0.2.0.0 2018-07-25 [1] CRAN (R 3.5.2)
## usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0)
## webshot 0.5.1 2018-09-28 [1] CRAN (R 3.5.0)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
## xfun 0.5 2019-02-20 [1] CRAN (R 3.5.2)
## xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.0)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0)
##
## [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library