Horseshoes in multi-dimensional scaling
This page contains R code that you can edit and run interactively in your browser – there is no server-side computing involved.
Points on a straight line
Let’s consider points on a straight line in the plane, y = a·x, where the slope a is some fixed constant.
We compute the pairwise distance matrix d1 between all points and use cmdscale to perform classical multidimensional scaling.
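A minimal sketch of this step; the number of points and the slope are assumptions, not the values from the original post:

```r
a <- 0.5                          # slope of the line (assumed value)
x <- seq(0, 10, length.out = 50)
pts <- cbind(x, a * x)            # points on the line y = a * x

d1 <- dist(pts)                   # pairwise Euclidean distances

# The points are exactly collinear, so only one eigenvalue of the
# double-centered matrix is positive and one MDS coordinate suffices.
mds1 <- cmdscale(d1, k = 1)
plot(x, mds1[, 1], xlab = "x", ylab = "MDS coordinate")
```

The plot shows a straight line: the single MDS coordinate recovers the positions along the line up to sign and shift.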
So far, so good. The embedding method faithfully reproduces the points’ relative positions.
With saturation
Now let’s consider that the distances aren’t measured perfectly, but saturate: small distances are faithful, while large distances are reported as smaller than they really are. We can model this with a saturating function, for instance f(d) = λ(1 − e^(−d/λ)). Here, λ sets the distance scale at which saturation sets in: for d ≪ λ we have f(d) ≈ d, while for d ≫ λ, f(d) approaches λ.
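A possible implementation of such a function — the exponential form and the name saturate are assumptions; any concave, bounded transform with f(0) = 0 shows the same effect:

```r
# Saturating transform: approximately the identity for small d,
# approaching lambda for large d. The exponential form is an assumption.
saturate <- function(d, lambda = 5) lambda * (1 - exp(-d / lambda))

saturate(0.1)    # close to 0.1: small distances are faithful
saturate(100)    # close to 5: large distances are squashed towards lambda
```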
Let’s run multidimensional scaling again:
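A self-contained sketch of this step; the point layout, the particular saturating transform, and the value of λ are all assumptions:

```r
a <- 0.5
x <- seq(0, 10, length.out = 50)
pts <- cbind(x, a * x)
saturate <- function(d, lambda = 2) lambda * (1 - exp(-d / lambda))

d2 <- as.dist(saturate(as.matrix(dist(pts))))  # saturated pairwise distances
mds2 <- cmdscale(d2, k = 2)                    # now a genuinely 2-D embedding
plot(mds2, asp = 1, xlab = "MDS 1", ylab = "MDS 2")
```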
There is a horseshoe!
Diaconis, Goel and Holmes
In their paper “Horseshoes in multidimensional scaling and local kernel methods”, Diaconis, Goel, and Holmes (2008) looked at such situations. More precisely, they considered distance matrices that are concentrated along the diagonal. They set out from the example of the voting records of the members of the 2005 United States House of Representatives. From these records, they computed all pairwise distances (or dissimilarities), applied classical multi-dimensional scaling, and observed a horseshoe pattern.
For good measure, let us see what happens if we increase the saturation:
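As a sketch, we can simply shrink λ so that saturation kicks in at smaller distances (again, all concrete values here are assumptions):

```r
a <- 0.5
x <- seq(0, 10, length.out = 50)
pts <- cbind(x, a * x)
saturate <- function(d, lambda) lambda * (1 - exp(-d / lambda))

# Stronger saturation: a smaller lambda flattens large distances earlier.
d3 <- as.dist(saturate(as.matrix(dist(pts)), lambda = 0.5))
mds3 <- cmdscale(d3, k = 2)
plot(mds3, asp = 1, xlab = "MDS 1", ylab = "MDS 2")
```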
The horseshoe becomes even bendier.
What is going on?
Let’s look at some of the eigenvectors of the distance matrix d2 after double centering. These are used by cmdscale.
If we think of the double-centered matrix d2 as a discretized version of a linear operator, we see that a dominating component is the negative of the second derivative, −d²/dx². Its eigenfunctions are trigonometric functions, cos(kπx) for k = 1, 2, …, and plotting the second of these against the first traces out exactly the horseshoe.
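We can double-center the squared distances ourselves, as cmdscale does internally, and inspect the leading eigenvectors (the point layout and saturating transform used here are assumptions):

```r
a <- 0.5
x <- seq(0, 10, length.out = 50)
saturate <- function(d, lambda) lambda * (1 - exp(-d / lambda))
D <- saturate(as.matrix(dist(cbind(x, a * x))), lambda = 2)

# Double centering, as done inside cmdscale: B = -(1/2) J D^2 J,
# where J = I - 11'/n is the centering matrix.
n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)
B <- -0.5 * J %*% (D^2) %*% J

e <- eigen(B, symmetric = TRUE)
# The leading eigenvectors look like cos(pi * t), cos(2 * pi * t), ...
matplot(e$vectors[, 1:3], type = "l", lty = 1,
        xlab = "point index", ylab = "eigenvector")
```

The first eigenvector resembles a half-period cosine, the second a full period, and so on; plotting the first two against each other gives the horseshoe.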
Conclusion
Embeddings of high-dimensional data into lower-dimensional spaces are useful. But they can also create apparent patterns that have little to do with the data-generating process. Be careful.