Nonlinear Dimensionality Reduction
September 23, 2020
Christian Bueno, University of California, Santa Barbara
Working with lower dimensional representations of data can be valuable for simplifying models, removing noise, and visualization. When data is distributed in geometrically complicated ways, tools such as PCA can quickly run into limitations due to their linear nature. In this tutorial, we will dive into dimension reduction for when data is distributed in ways that have nontrivial topology and curvature. We’ll build up our understanding of these approaches alongside classical ideas from topology and differential geometry and consider their interplay. Finally, we will explore some relationships between nonlinear dimension reduction and stochastic dynamics.
Speaker Bio: Christian Bueno is a mathematics PhD student at the University of California, Santa Barbara advised by Paul J. Atzberger. His research broadly focuses on theoretical properties of different machine learning methods for non-Euclidean problem domains. Previously, Christian has been a three-time intern at NASA Glenn Research Center and continues to collaborate on various projects.
AUDIENCE: We have Christian Bueno presenting. Christian is a mathematics PhD student at the University of California Santa Barbara advised by Paul Atzberger. His research broadly focuses on theoretical properties of different machine learning methods for non-Euclidean problem domains. Previously, Christian has been a three-time intern at NASA. And he still continues to collaborate with them on various projects. So thanks so much, Christian. Please take it away.
CHRISTIAN BUENO: Yeah. Thank you. So yeah, great introduction and just jumping right in. So this is kind of a big map of-- it's outdated. It's from 2009 but just a big map of all the different kinds of ways you could try to approach nonlinear dimensionality reduction.
And there's going to be the convex problems and the nonconvex problems, which you're going to have to iterate through, like autoencoders. And then you can go deeper down this branch into things that you can factor via linear algebra in nice ways, like PCA, Kernel PCA, [INAUDIBLE] map, these things. And we're going to try to touch a good amount of these. But it's not going to be possible to hit every single one of these.
So the first one we should focus on is not going to be as exciting as something like diffusion maps. We should start with PCA. So everyone here's probably very familiar, extremely familiar with PCA. But it's good to just go through the basics really quick.
So essentially, what it's trying to do is find the linear subspace that preserves the most variation of the data. And then you go hierarchically. You look for the next vectors if you want more variation preserved.
And so in order to do this, we break down the covariance matrix. And we can express it this way. And because this is symmetric and positive semidefinite, we can find an orthonormal basis of eigenvectors, and write down the eigenvalues, which all have to be greater than or equal to 0.
And the eigenvectors will form our principal directions. And the variation in those directions will actually be the eigenvalues. And the amounts we project onto those directions will be the principal components. So just writing those things down there, we can now diagonalize our covariance matrix. That's the eigendecomposition.
And we might get some 0 eigenvalues. That can happen, for example, if all your data lie on a plane -- you're not going to need that orthogonal direction to describe your data at all. And to reduce the dimensions, just keep the top d eigenvectors and eigenvalues.
We can also do this by SVD, which is just faster and better. We don't have to build the covariance matrix. And we can just read off the principal components right there. To do the projection, we just take that matrix of eigenvectors, multiply it that way. And this will apply to things that were not in your data set. This is going to be an important thing to observe as we go through the other methods.
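As a minimal sketch of these steps in code (NumPy, with a made-up toy data set; variable names are mine, not from the slides): center the data, take the SVD, read off principal directions and components, and project a point that was never in the data set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up toy data: 200 points in 3-D that mostly vary along a 2-D plane.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3)) \
    + 0.01 * rng.normal(size=(200, 3))

# Center, then SVD -- no covariance matrix needs to be built.
mu = X.mean(axis=0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 2                       # target dimension
components = Vt[:d]         # principal directions (rows)
scores = Xc @ components.T  # principal components of the data
# Eigenvalues of the covariance matrix would be S**2 / (n - 1).
# The same projection applies to points that were never in the data set:
x_new = rng.normal(size=(1, 3))
z_new = (x_new - mu) @ components.T
```

The scores here equal the left singular vectors scaled by the singular values, which is the shortcut the SVD buys you.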
I think I can see some stuff in the chat. Oh, choppy audio. Let me see if I can bring it closer to me. Hopefully that helps. Yeah, so hopefully bringing this closer will make the audio better.
Now where was I? Yeah, so you get the ability to extend outside your data set. You can also do the pre-image, going from the principal component space back up to data space. And if you want to do this with a singular value decomposition, you can do that as well.
OK, so the Gram matrix is kind of, in some sense, like the dual-- not really, not technically, but the dual to the covariance matrix. You build it by doing the multiplication the other way. And of course, this is going to be closely related. It is also symmetric positive semidefinite.
You get an orthonormal basis of eigenvectors with eigenvalues. And it turns out that you'll be able to relate the eigenvectors and eigenvalues of this in a formula like that, so long as it corresponds to a nonzero eigenvalue. And if you build your vector like that, you are getting, again, an eigenvector.
And the proof is really quick. You would just write out what it would be to get the norm, watch what you're doing, and then see that, oh, I have a K, also an eigenvalue. Great. For the covariance matrix, do the same thing. Manipulate until you get a K. Out pops an eigenvalue. And you have your b. And the bottom line is that the eigenpairs of K can be used to build the eigenpairs of C, and that they share the same positive eigenvalues -- the same number of them.
And so if we were to write it out in a matrix way, we'd see that the matrix of eigenvectors B is X transpose times A Lambda to the negative 1/2, where A is the matrix of eigenvectors for K and Lambda is the diagonal matrix of eigenvalues for K. And so to project our data, doing it from this perspective, we can just write it as A Lambda to the 1/2.
And we also get to realize something a little bit more interesting than that, which is that our matrix K, we can now factor into A Lambda to the 1/2 times its transpose. And if we remember how we built K in the first place, we did a bunch of dot products, which can be expressed as X times X transpose. And so what this is saying is we have found, just from looking at K itself -- even if we don't know that it was built from dot products -- a way of rewriting it as dot products by doing the spectral decomposition.
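We can check this relationship numerically. A quick sketch with NumPy on random centered data (the 1/n normalization of the covariance is dropped here to keep the algebra clean):

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(50, 4))
Xc -= Xc.mean(axis=0)          # mean-centered data

C = Xc.T @ Xc                  # 4 x 4 covariance (unnormalized)
K = Xc @ Xc.T                  # 50 x 50 Gram matrix of dot products

evals_C = np.sort(np.linalg.eigvalsh(C))[::-1]
lam, A = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]

# K and C share the same positive eigenvalues...
shared = np.allclose(evals_C, lam[:4])
# ...and b = X^T a / sqrt(lambda) is a unit eigenvector of C.
a, l = A[:, 0], lam[0]
b = Xc.T @ a / np.sqrt(l)
residual = np.linalg.norm(C @ b - l * b)
```

The residual is numerically zero, and b comes out with unit norm automatically, because a^T K a = lambda.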
And that trick is what's going to let us do multidimensional scaling. So here, it's a different setting. Unlike with PCA, we're not given the positions as vectors. We're just told, here are the distances. You don't know exactly what the original numbers inside those vectors were.
And the question is, can we recover the configuration up to translations and rigid motions? Because those aren't going to matter to us. Can we do it? And the answer is yes.
So first, you say, OK, suppose I had the positions I wanted, the x_i's, the ones I'm trying to discover. You can write out what that would mean for the distance squared. You get this expression from just basic vector algebra.
And then what you would need to do is-- you would want to center your data, so you remove the mean from everything, and you get what's called the double centering formula.
And the neat thing is that this only depends on the distances you are provided. So you can know this K, this Gram matrix, because it's just a bunch of dot products, and each entry comes out of these distances.
And so when you do this, you can now break it apart, like we just showed in the previous slide, into A Lambda to the 1/2 times its transpose, saying that the rows of A Lambda to the 1/2 would work perfectly well as the positions for your observations.
And so now we have our embedding.
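A minimal sketch of the classical MDS recipe in NumPy (the toy points are made up; since they really are Euclidean, the recovered configuration reproduces the given distances exactly, up to a rigid motion):

```python
import numpy as np

def classical_mds(D, d):
    """Embed points into R^d given only the pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    K = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    lam, A = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:d]          # keep top-d eigenpairs
    lam, A = lam[order], A[:, order]
    return A * np.sqrt(np.clip(lam, 0, None))  # rows = recovered positions

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))                   # made-up toy configuration
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, 2)
# Y matches X up to rotation/reflection/translation, so its pairwise
# distances match D.
DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```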
And this is pretty nice because sometimes all you're going to know is how different your observations are. You might not even know that precisely.
You're not always given features as just a bunch of numbers. So this is a good thing to have.
But there is an issue: not all data is going to come from Euclidean space. Basically, this example should sum it up. If we just look at this graph, we have four vertices and four edges. Each edge counts as distance one, which makes opposite vertices distance two, because you would walk two edges -- for example, from A to B, and then B to D.
And so this cannot be realized in Euclidean space while preserving those distances exactly.
And to see that, the distance from A to B is one, the distance from B to D is one, and the distance from A to D is two, so that means that B is a midpoint of A and D.
Likewise on the other side, C has to be a midpoint of A and D.
But together that would mean that B is C. At the same time, we know that the distance between B and C was two, so that's a contradiction. So this cannot be realized in Euclidean space, in a vector space with the usual distance.
But the algorithm is still useful. You can still build this matrix even though you may not know a priori whether the distances are Euclidean, and you just do the eigendecomposition.
Now there's a risk that you might have negative eigenvalues, but you just make sure you don't use any of those. We still want to keep the top d eigenpairs.
And it just might put a limit as to how high a dimension you could embed your data in. So let's say you only had one positive eigenvalue, then you cannot embed it now into a three dimensional space.
So we get this matrix. The rows are our points.
And if we actually were curious -- can we realize these distances as a Euclidean thing? -- it's enough to just check that there are no negative eigenvalues. Being symmetric positive semidefinite, if you remember, meant that all the eigenvalues were greater than or equal to zero.
So the presence of negative eigenvalues is exactly the obstruction.
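We can check this numerically on the square-graph example: feed its distances through the same double-centering step and look at the spectrum. A sketch (vertex order A, B, C, D; edges A-B, A-C, B-D, C-D):

```python
import numpy as np

# Graph distances on the square: edges have length 1, opposite corners 2.
D = np.array([[0., 1., 1., 2.],
              [1., 0., 2., 1.],
              [1., 2., 0., 1.],
              [2., 1., 1., 0.]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
K = -0.5 * J @ (D ** 2) @ J       # double-centered "Gram" matrix
eigs = np.linalg.eigvalsh(K)      # ascending order
# eigs contains a negative eigenvalue (-1 here), so no Euclidean
# configuration can realize these distances.
```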
Another theorem is that this whole procedure, even though we lost the guarantee of being able to--
As I was saying, we no longer have the guarantee that we can realize these as Euclidean points, but we do have the guarantee that we're minimizing this functional called strain. Which is basically saying: hey, if you were to choose candidate points, make the Gram matrix out of them, and then compare that to the double-centered Gram matrix we made up there -- how different are you in terms of mean square? That would be the strain.
If you didn't want to be stuck with that, which is a linear method, you can do something called stress minimization, where you look at this functional and now you really can't do linear algebra to get your answer like we did before -- which may be a pro, may be a con.
Yeah, it's not a convex thing, and you're just going to have to do it iteratively and get better and better solutions, and eventually you're going to settle. But one thing about this that's kind of nice is that it's, I think, more intuitively good, because you're actually trying to make the interpoint distances be the distances you were prescribed.
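A sketch of the iterative route, using plain gradient descent on the stress (the standard solver is SMACOF majorization, but gradient descent shows the idea; the square-graph distances from the earlier example serve as made-up input):

```python
import numpy as np

def stress(Y, D):
    DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.sum((DY - D) ** 2) / 2      # each pair counted once

def stress_grad(Y, D):
    diff = Y[:, None, :] - Y[None, :, :]
    DY = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(DY, 1.0)             # avoid 0/0 on the diagonal
    W = (DY - D) / DY
    np.fill_diagonal(W, 0.0)
    return 2 * (W[:, :, None] * diff).sum(axis=1)

# Square-graph distances: not exactly realizable in the plane,
# but we can get close by minimizing the stress.
D = np.array([[0., 1., 1., 2.],
              [1., 0., 2., 1.],
              [1., 2., 0., 1.],
              [2., 1., 1., 0.]])

rng = np.random.default_rng(3)
Y = rng.normal(size=(4, 2))               # random starting configuration
s0 = stress(Y, D)
for _ in range(500):                      # iterate; solutions keep improving
    Y -= 0.05 * stress_grad(Y, D)
s1 = stress(Y, D)                         # much smaller than s0
```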
OK, so we saw that multidimensional scaling has these optimization formulations. PCA has an optimization formulation as well. And when we look at it, it kind of suggests another way to generalize it. So, MDS is one way to generalize PCA.
We can also generalize PCA by realizing that what we've kind of done here -- project and then reconstruct -- is a linear autoencoder.
And so if we just stick in our nonlinearities, we're going to get a potentially more useful neural network -- well, assuming we train it well -- and if we make it deeper, we get deep autoencoders.
So, the good thing is it's more general than PCA. It can be trained in an online fashion, though there are variations of PCA that do that, too.
And it automatically has out-of-sample extension, and it automatically does this pre-image thing, where you can decode it.
And so, those are great things. The problem is, of course, you have to pick the architecture, the activation, all those other things, and how to choose the best one for the job is not always clear. But probably the most annoying thing is that if you wanted to view your data in a different number of dimensions, or you wanted to try with more latent dimensions or [INAUDIBLE] dimensions, you're going to have to retrain the whole thing, because you're going to have to pick a whole new architecture. And that's not great, but they're still super useful.
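As a sketch of a deep autoencoder (NumPy with manual backprop to stay self-contained; the curve data, layer sizes, and learning rate are all made up, and in practice you'd use an autodiff framework): encoder 3 -> 4 -> 1, decoder 1 -> 4 -> 3, tanh nonlinearities.

```python
import numpy as np

rng = np.random.default_rng(4)
# Made-up data: a 1-D nonlinear curve sitting in 3-D space.
t = rng.uniform(-1, 1, size=(256, 1))
X = np.hstack([t, t ** 2, np.sin(3 * t)])

W1 = rng.normal(scale=0.5, size=(3, 4))    # encoder weights
W2 = rng.normal(scale=0.5, size=(4, 1))
W3 = rng.normal(scale=0.5, size=(1, 4))    # decoder weights
W4 = rng.normal(scale=0.5, size=(4, 3))

def forward(X):
    H1 = np.tanh(X @ W1)
    C = H1 @ W2                            # 1-D latent code
    H3 = np.tanh(C @ W3)
    return H1, C, H3, H3 @ W4              # last entry = reconstruction

loss0 = np.mean((forward(X)[3] - X) ** 2)
lr = 0.1
for _ in range(5000):                      # plain gradient descent
    H1, C, H3, Xhat = forward(X)
    G = 2 * (Xhat - X) / X.size            # dLoss/dXhat
    gW4 = H3.T @ G
    gA3 = (G @ W4.T) * (1 - H3 ** 2)       # back through decoder tanh
    gW3 = C.T @ gA3
    gC = gA3 @ W3.T
    gW2 = H1.T @ gC
    gA1 = (gC @ W2.T) * (1 - H1 ** 2)      # back through encoder tanh
    gW1 = X.T @ gA1
    W1 -= lr * gW1; W2 -= lr * gW2; W3 -= lr * gW3; W4 -= lr * gW4
loss1 = np.mean((forward(X)[3] - X) ** 2)  # smaller than loss0
```

Note that changing the latent dimension changes the shapes of W2 and W3, so you retrain from scratch -- exactly the annoyance mentioned above.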
Of course, I'm sure everyone is aware. So another thing going down this neural network route is you can use the hidden representations learned during some task. A great example of that is word2vec. You're training it to do something like predict what words are going to be in a word's neighborhood, but that's not really what you necessarily care about. You want to get a hold of that hidden representation and use that for downstream processing.
And so, these kinds of things are really great, but sometimes this is actually going to take you to a higher dimension than maybe you started with at some point, and that's OK. Nonlinear dimensionality enlargement.
I guess that's fine, too. We're doing reduction, but we can also do enlargement.
And when we do this, we can take dot products still, assuming we're still in a finite dimensional thing, and even in an infinite dimensional thing, we'll be able to do this if we're in a Hilbert space taking our inner products.
And there, because we're still within linear algebra land, we can do PCA. We can do multi-dimensional scaling.
So if we do go down the PCA route, one thing that will help us is something called Mercer's theorem and the Mercer condition, which says that if you just give me a function with certain properties, it means that there is a feature map so that I can build this kernel by taking inner products of the embeddings of my points.
So if I have something continuous and symmetric, and when I evaluate my kernel on just-- so you pick an n-point set, evaluate your kernel on that n-point set in every which way, that builds you a matrix.
I just want that matrix to be symmetric positive semidefinite. If that matrix is symmetric positive semidefinite for every choice of finitely many points, then that, along with the other conditions, gives me the existence of this map.
That's the whole trick to it really because now by just evaluating the kernel, which can be an easy function to do, we're taking advantage of manipulations being done in higher dimensions for free.
And so basically, that's going to be what you end up doing, and fortunately we basically did all of the work, because the Gram matrix was just the matrix of dot products, and we showed that the eigenpairs for that -- for the nonzero eigenvalues -- can help us build the eigenvectors for the covariance matrix. And it might not look it, but you can actually make this all work in the infinite dimensional case, too. You just have to be a little bit more careful in how you interpret these symbols, but it still all goes through.
And so kernel PCA -- doing the eigendecomposition of the covariance matrix -- reduces to doing the eigendecomposition of the kernel matrix, or the Gram matrix. And the only thing we have to worry about is whether we mean-centered, because we're not really doing traditional PCA if our mean isn't zero. So you do something called double centering, which takes care of that.
Similar to in MDS. And then our projected data is given by A Lambda to the 1/2, which came from the decomposition of K.
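A minimal kernel PCA sketch in NumPy (Gaussian kernel; the bandwidth and the two-rings toy data are made up): build K by evaluating the kernel, double-center it, eigendecompose, and read the projection off A Lambda to the 1/2.

```python
import numpy as np

rng = np.random.default_rng(5)
# Made-up toy data: two concentric rings in the plane.
theta = rng.uniform(0, 2 * np.pi, size=100)
r = np.repeat([1.0, 3.0], 50)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Gaussian (RBF) kernel matrix -- continuous, symmetric, PSD on any point set.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

# Double centering: PCA needs a zero mean in feature space.
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

lam, A = np.linalg.eigh(Kc)
top = np.argsort(lam)[::-1][:2]                       # top-2 eigenpairs
Z = A[:, top] * np.sqrt(np.clip(lam[top], 0, None))   # projected data
```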
So, that gave us quite a few ways to get at these problems. And they each have pros and cons in different situations. But we haven't really incorporated any ideas from knowledge specifically about shapes and how shapes can behave like spaces -- topology, Riemannian geometry, differential geometry. That didn't come into play.
So, let's review a little bit about these ideas. Let's dive into that. So one of the big things that math people like to distinguish is intrinsic versus extrinsic dimension.
So, when you look at the sphere, you might say that's a three dimensional thing, but a mathematician is going to look at the sphere, and when we hear sphere, we're only thinking of the surface of the sphere, and that's a two dimensional object to us. Same for the torus. When we say torus, we mean the surface of a donut. That's 2D because you only have two degrees of freedom on the sphere and on the torus. And this middle object is supposed to be the universe. That's our 3D space.
We live inside of it, of course. So there we're truly trapped, we can't step outside our universe and look at it from the outside from the extrinsic dimensions and try to understand the shape like we could for the sphere or we could for the torus. We have to explore the geometry of our universe from within. And so we need a language that works from within and depending on how things go, we might just have an infinite universe or we could even in theory-- I mean, it's probably not true for our universe, but have a periodic universe where you go up in one direction, you come back to where you started.
And that's really what we care about in differential geometry and things like that a lot of times -- the intrinsic stuff.
So more terminology. So compact manifolds are the ones we're going to focus on because we're not really going to encounter unbounded-- like going off to infinity manifolds in machine learning because our data is always going to have to fit on a computer somewhere, so it's finite in that sense.
But compact manifolds are already diverse enough. So, we have manifolds with boundaries, like this hemisphere, which has an edge, the equator. You have a Mobius strip, which is non-orientable; it also has a boundary. And then you have closed manifolds, which is different from the notion of a closed set -- it just means that it's both compact and has no boundary.
So, that's the dichotomy. In the Pac-Man world, the three dimensional Pac-Man world I had introduced in the previous slide, that's considered a closed manifold because even though the way it's kind of drawn right now looks like it has edges, Pac-Man does not see an edge. Pac-Man doesn't see a boundary for him, and so that's what his universe is, no boundary.
So, just going through the basics. The 1D manifolds, the compact manifolds. We only really get curved segments and closed loops.
And from a topology standpoint -- or just pure intrinsic geometry, at least -- if you were an ant living on a segment, a curve, imagine an infinitesimal ant, you would not be able to tell the difference between whether you're on the top curved segment or the bottom curved segment. To you, both of those are the same. You can move forward, you can move back, and the only thing maybe you could pick up besides that is what's the limit to how much you can move.
Maybe if this is a string of length one, you would figure that out, and the other one might be a string of length two, so you'd be able to distinguish them, but that's it.
Same for the closed loops, if they're equal circumference or perimeter.
OK, equal length. Then you as an infinitesimal ant would never be able to tell the difference between them. It's just a closed one dimensional manifold to you. And so that's a classification there.
For two dimensional closed manifolds, we also to make life simpler for now, we'll throw in the word orientable.
Then the classification is as follows. You have spheres, and you have donuts, then you have double donuts, triple donuts, and so on and so forth. Basically you add more and more handles to your space, and that's the classification of those closed orientable surfaces.
If you wanted to include boundaries, then basically you have to say, OK, we'll cut out disks. Just remove a little disk from whichever place you want on this thing, or maybe remove two disks, remove three disks -- you're introducing more and more boundaries. And if you drop orientability, then things get kind of weird, because you get [INAUDIBLE] and stuff.
But I think this illustrates at least the idea.
Right now this classification, in both situations, has been in the sense of: I can deform these, I can stretch them, I can squeeze them, I just can't tear them -- I can't do violent things like that to them. And so long as I'm preserving those things, I'm going to consider them the same. There is a higher level of structure that we can consider, which we sort of alluded to in the 1D case, which is length. Length is a geometric idea; it goes beyond topology. And that can help us distinguish one kind of sphere from another kind of sphere, one kind of donut from another kind of donut.
And so to do this in differential geometry, we introduce this idea of a Riemannian metric.
It's basically just this idea of saying, hey, at every point, I have some notion of scale and angle. You give me two vectors, I can tell you their dot product. I can tell you the length of a vector tangent to my space.
And then you just have to prescribe this if you're building it from the ground up, or if you're inheriting it then you're told what this is.
But once you have this, it forces a lot of things to happen. It forces the notion of arc length to occur, because if you know the length of every tangent vector, you can integrate them to get the arc length of a curve. And once you have the notion of arc length of curves, you have this notion of geodesic distance. And that's this idea that if you're on a sphere, the shortest path between two points is not something you would-- let me phrase this better. The shortest path between two points on Earth would be to tunnel right through the Earth, but if you're stuck to the surface, then you should take a great circle.
And the length of the shortest great circle that you can take, that's the geodesic distance. And every manifold with the Riemannian metric is going to have some notion of geodesic distance between these two points on them.
And as it was mentioned earlier, this is not interesting on 1D manifolds. Curves are very, very simple.
So long as you know the length of the curve, and if you know it, whether it's closed or not closed, then you know everything about the curve as far as Riemannian geometry is concerned from the intrinsic point of view.
OK, so when we get to surfaces though, back to the donuts and the spheres, we have now actual interesting notions of curvature. We have negative curvature, zero curvature, positive curvature.
You can get mixes of these, so the traditional donut that you would eat has both kinds of curvature. Positive on the outside, negative on the inside. And so these things will prevent certain things from happening.
For example, if you have a sphere or a globe, like a map of the earth on a sphere, you cannot flatten that map out because of the curvature.
A sphere is positively curved everywhere and a sheet of paper has zero curvature everywhere. And because we know that, it means that there is no way to deform one onto the other while preserving all distances, all geodesics.
On the other hand -- and this would be good if I had my camera on, but you can imagine this; we've all done this -- if you have a sheet of paper, you can roll it up into a cylinder, and if you were an ant living on that sheet of paper that you're rolling up, the distances between two points -- between two of the parallel lines on the sheet of paper, for example -- those distances aren't changed. It's still going to take you the same amount of time to get from one part of the paper to the other, unless you take the shortcut from it being a circle now.
But locally speaking, the distances are preserved. And so we can do these kinds of things. So these provide obstructions on what's possible and not possible.
OK, so now that we have that set up, we can start trying to see how we can use these ideas to inform the dimensionality reduction we had been doing. So before, we had used the kernel trick, which is great, but that was more functional analysis. Multidimensional scaling was closer -- we were using distances and trying to realize them.
And Isomap takes advantage of this idea. Not the functional analysis thing, but the multidimensional scaling.
The idea is, say you have a point cloud, which represents your data. That would be right there in the top right. And notice that the two circled points are not that far apart from each other when considering how the crow flies, the straight line distance.
But as far as the intrinsic distance -- if, when we look at this, we imagine that this is coming from some surface in three dimensional space -- the intrinsic distance on this manifold is much, much longer. It's far away. And we would completely miss that if we were to do multidimensional scaling with the ambient distance. So what Isomap, introduced in 2000, does is it says: first, you build a neighborhood graph, because we trust local distances.
Manifolds. I forgot to mention this, but a manifold is basically a space where locally speaking it looks Euclidean.
And as we build this neighborhood graph, maybe connecting to our five nearest neighbors, we start getting a global picture, because then we can ask, OK, for any two points in this neighborhood graph, what is the shortest path between them? What is the geodesic, in essence? It's an estimate of the geodesic distance. And so we build this matrix of pairwise geodesic distances -- and these are actual distances, but not Euclidean distances potentially. But we can still do classical multidimensional scaling for this matrix.
And if you do that, go through that algorithm we described, you get the output of Isomap. And so that's what the three pictures here from Tenenbaum's paper are doing.
Neighborhood graph, use Dijkstra's algorithm or something like that to get geodesics, and then now you have your embedding for all those points.
And in this particular case of a Swiss roll, this is analogous to the paper rolling I had mentioned before. We're not distorting distances when we roll it up. So this is in this case a very good approach because we know we can actually realize, or approximately realize the geodesic distances by unrolling it into 2D space.
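The whole pipeline fits in a short sketch (NumPy only; Floyd-Warshall stands in for Dijkstra, and a made-up 2-D spiral plays the role of the Swiss roll, getting unrolled into one dimension):

```python
import numpy as np

def isomap(X, n_neighbors, d):
    """Minimal Isomap sketch: kNN graph -> shortest paths -> classical MDS."""
    n = X.shape[0]
    E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Keep only each point's k nearest neighbors (symmetrized).
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(E[i])[1:n_neighbors + 1]
        G[i, nbrs] = E[i, nbrs]
        G[nbrs, i] = E[i, nbrs]
    # Floyd-Warshall: graph shortest paths estimate geodesic distances.
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ (G ** 2) @ J
    lam, A = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:d]
    return A[:, order] * np.sqrt(np.clip(lam[order], 0, None))

# A curled-up 1-D curve: Isomap should unroll it onto a line.
t = np.linspace(0, 3 * np.pi, 60)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
Y = isomap(X, n_neighbors=2, d=1)
```

With enough samples, graph shortest paths converge to the true geodesic distances, which is what makes the MDS step meaningful here.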
Building off of that observation that there's a good graph to use here, we can take that idea and just move in a whole new direction with it. We can say, well, since we have this graph, and the graph itself kind of doesn't have any dimension to it -- it's dimension agnostic -- we can just ask ourselves, what's the best way to draw a graph? And spectral graph theory has answers to those questions.
So we look at our neighbor graph or even a slightly different graph maybe with weights, where the weights represent high weights mean very similar or close, low weights mean probably not similar. And then we ask ourselves, how do we draw this graph in two dimensions, three dimensions, so on and so forth.
And what we can do is we can try to borrow some ideas from physics and say, OK, how about if they're close, we imagine that to be a spring, a very strong spring. So high weights, very strong spring between them; low weights, kind of weak springs. And then things are pulling on each other, and then maybe this will settle to equilibrium. But we run into a problem, which is that if we let this simulation run and the springs don't have a natural length -- as in this formula, the natural length is zero -- then everything will collapse to a point.
Everything will just collapse to the center of mass, and so that's bad, because that's a terrible embedding for our points unless they were all one point to start with. But since we're not actually doing physics -- we're doing mathematics to try to get a good embedding -- we can throw in constraints to our optimization. And basically what these constraints are doing is -- the bottom one, where you have the vector of ones, that one is just saying, hey, you're not allowed to be a constant vector. The embedding we get is not allowed to be constant. It has to be orthogonal to the constant map. And then the other one above that is basically saying we also want our axes to be uncorrelated, kind of like in PCA.
And when you do this, it turns out that you can solve this exactly by finding the eigenvectors of something called the graph Laplacian matrix. Take the weight matrix -- the matrix of weights, that's the W --
and subtract that away from the degree matrix. The degree matrix is just: each vertex has a degree, which is the sum of all the weights incident to it. And so if you take this difference, you get something called the graph Laplacian. Find its eigenvectors. Those will actually tell you how to do this embedding.
So you would just stick your eigenvectors vertically into this r matrix, and then the rows are going to be your positions.
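A tiny concrete instance in NumPy (a 5-vertex path graph with unit spring weights as the made-up input): build D - W, take eigenvectors, and skip the constant one, exactly as the constraint demands.

```python
import numpy as np

# Made-up input: a path graph on 5 vertices with unit weights.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

Dg = np.diag(W.sum(axis=1))   # degree matrix
L = Dg - W                    # graph Laplacian

lam, V = np.linalg.eigh(L)    # eigenvalues in ascending order
# lam[0] is ~0 with a constant eigenvector -- exactly what the
# constraint rules out -- so the drawing uses the next eigenvectors.
Y = V[:, 1:3]                 # rows = 2-D positions of the vertices
```

For the path graph, the first nontrivial eigenvector (the Fiedler vector) orders the vertices along a line, which is the sensible drawing.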
So we introduced this graph Laplacian. But what is that?
We know the Laplacian from ordinary multivariable calculus. It's just the sum of second derivatives. But what's going on for a graph? It turns out this is related to something continuous called the Laplace-Beltrami operator, which is basically the generalization of the Laplacian to manifolds.
So that's a nice continuous land. The graph laplacian is for discrete land, but that's going to help us build this bridge between these two worlds. And so if you give me this Riemannian metric we had been describing before, this kind of local piece of information that allows us to create notions like length, volume, curvature, et cetera.
From there, you can actually -- in a coordinate-free way, because this is all intrinsic, it turns out -- define gradients, define divergences, and define the Laplacian as the divergence of the gradient, whatever that may mean.
If you want to use coordinates, like use a local coordinate system, it ends up looking like this.
And some examples would be the standard one, where it's just the sum of the second derivatives; but on a sphere, the Laplacian is different -- you have all these other interactions -- or at least it looks different when you read it in coordinates. The coordinate-free definition is the same for every manifold.
So now that we have the Laplace-Beltrami operator, we can start asking questions like, OK, well, what if we ask for eigenfunctions of the Laplace-Beltrami operator? And you kind of have to be specific: what kind of eigenfunctions? Here we're going to say Neumann eigenfunctions.
That is, if there is a boundary, when you take the directional derivative in the direction orthogonal to the boundary, you want that derivative to be zero at the boundary. That's all that means.
And so when we look for these kinds of things on a compact manifold, we're going to have a discrete spectrum of eigenvalues, and if we're on a closed manifold, we don't even have to worry about that boundary condition. And so one can ask, can we learn something by looking at these eigenvalues? Can we learn something about the manifold we're on? And Weyl's law, which was proven in 1911, says that for a closed manifold, the eigenvalues completely determine the volume of your space, which is kind of cool, because we've gone through a very circuitous route to get at the volume. And the next question from that would be, well, OK, if I have a way of getting the volume from the spectrum of the Laplacian, can I actually determine the entire manifold, the whole Riemannian structure of this thing?
And it turns out -- this is related to a question known as, "Can you hear the shape of a drum?" And it turns out the answer is no: you can have differently shaped drums that resonate the same way. In this case, the first example of such a thing actually ended up being 16-dimensional tori with the same eigenvalues, and people have found simpler examples since then.
And so, OK, we can't figure out what we're living on by looking at the eigenvalues alone. But maybe if we look at the eigenfunctions as well, we can learn a lot more. And so the eigenfunctions, if we put them all in a vector.
So that our input is a position on the manifold. We stick it inside of one of the eigenfunctions. Inside the first eigenfunction, inside the second eigenfunction, et cetera, et cetera, et cetera.
And so we make each component of the vector the evaluation of an eigenfunction at a point of M, and we scale it in the right way. It turns out that this mapping into this infinite-dimensional space continuously embeds our manifold into that space, and so we've realized it with coordinates coming from the Laplacian, essentially. And later people showed that locally good things are happening, too.
So this is positive news. Because remember, this all came about because we're trying to draw the graph associated to our data. And it turned out that doing that well, in a particular sense, boiled down to looking at the eigenvectors of the graph Laplacian. And the graph Laplacian, I promised, is related to the Laplace-Beltrami operator. And so now we have a picture here that can help us connect all these things. And so this leads to the technique known as Laplacian eigenmaps.
Or sometimes it's called spectral embeddings.
And you start with your data in R^n, you build a kernel weight matrix-- usually you use a Gaussian kernel, and you might restrict to nearest neighbors as well, but you can also connect everything to everything, because the locality of the Gaussian is going to take care of that anyway.
So you can get at this different ways-- there are equivalent ways of doing this algorithm, but this is one. You then degree-normalize: you sum up across the rows, which is basically saying sum up all the weights incident to that vertex, and then divide by that degree.
And so, once you have this matrix P, I want to get the eigenvalues. Because it's a stochastic matrix, it's going to have an eigenvalue 1 and then we're going to go down from there. I'm going to get the eigenvectors and then express the embedding this way-- I'm going to use the eigenvectors, but not the same way I would in PCA. Because notice that to embed the i-th data point, I'm looking at the i-th--
Sorry, I'm looking at the i-th row of the matrix whose columns are the eigenvectors, so it's a little different, but, yeah, that's how you would go about it. And so there is the scale parameter epsilon, and what is that doing? If you take something like the trefoil knot and try to embed it into two-dimensional space while varying the epsilon with this algorithm, you see that as epsilon gets smaller, it starts unfolding-- untangling-- and eventually you get to this almost triangular shape.
The reason that's happening is because as you make epsilon smaller, you're making the variance of this Gaussian smaller, and so basically only looking very, very locally. And if you're looking only very, very locally, you're starting to behave like that ant on the manifold. You can only really see what that ant would see to some extent.
And then how would you possibly ever know that you were knotted? If you're an ant on the trefoil knot, you can only look forward. You would never know that it's knotted, and that's why it unfolds.
So larger epsilon means coarser, more global features. Smaller epsilon means finer features.
And that basically summarizes what I said. There's one more thing here that I should mention, which is that the sampling density matters. Notice that as we make epsilon smaller, it still has that three-fold symmetry, and that's because of the way this trefoil was sampled-- some places are more dense than others. Another thing is that if you make epsilon too small, then this matrix becomes badly conditioned for factoring on the computer. You start getting numerical errors, so that's something to watch out for.
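As a rough sketch (my own illustrative NumPy construction, not the speaker's implementation), the pipeline just described, Gaussian kernel, degree normalization, then eigenvectors, might look like this. The dense all-to-all kernel and the symmetric-conjugate trick for diagonalizing the stochastic matrix are simplifying assumptions:

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=2, eps=1.0):
    """Sketch of Laplacian eigenmaps via a row-normalized Gaussian kernel."""
    # Pairwise squared distances and Gaussian kernel weights
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps)
    d = W.sum(axis=1)                       # degrees
    # Diagonalize the symmetric conjugate S = D^{-1/2} W D^{-1/2};
    # its eigenvectors give those of P = D^{-1} W after rescaling.
    S = W / np.sqrt(np.outer(d, d))
    vals, U = np.linalg.eigh(S)             # ascending eigenvalues
    V = U / np.sqrt(d)[:, None]             # eigenvectors of P itself
    # Skip the trivial eigenvalue 1 (constant eigenvector)
    idx = np.argsort(-vals)[1:n_components + 1]
    return V[:, idx]

# Embed equally spaced points on a circle; the result should again be
# (a rotation of) a round circle, since the top nontrivial eigenvectors
# are discrete sines and cosines.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]
Y = laplacian_eigenmaps(circle, n_components=2, eps=0.1)
```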
So, I had mentioned the Laplace-Beltrami operator and the connections to it. And so here's the punchline. Suppose your data came from a manifold, and you sampled it uniformly. Then the eigenvectors that come out of Laplacian eigenmaps are going to converge, in a sense that can be made precise, to the Neumann eigenfunctions of the Laplace-Beltrami operator. So, basically, you give me the eigenvector, and I can think of that as a function on my manifold, evaluated at discrete points. And the problem I also pointed out on the last slide is that this is very sensitive to the sampling distribution.
[AUDIO OUT] and then you can think of this just like a truncation of the continuum spectrum. So here's realizing that observation: if you just take the unit interval and do Laplacian eigenmaps on that, it's kind of a silly exercise because it's in one dimension-- you can't really reduce the dimension any further than one. So that's not the point here. The point is we want to look at the eigenfunctions, and indeed, they end up being cosines of the correct frequencies, which are the eigenfunctions of the second derivative. And so, that's cool.
If we do another example like the unit sphere-- we sample the unit sphere uniformly, do Laplacian eigenmaps, and look at the eigenvectors, at what they're telling us. I take the absolute value so it's easier to visualize. Then we can see that these correspond to spherical harmonics. If you've ever looked at a quantum mechanics book, you'd see things like this for the electron orbitals.
That's cool. If you take the flat torus-- this is the Pac-Man world, effectively-- you can embed it into four-dimensional Euclidean space by this little map here. It's a local isometry; you're not going to get a true, perfect isometry into R4, but you can again see that we're getting the right kinds of stuff.
So, going back to this trefoil example, PCA would do that, which is bad. That's not a good dimension reduction.
Laplacian eigenmaps is pretty good, but we have this three-fold symmetry, which is not really a problem as far as processing it goes down the line. But this next algorithm that we're going to talk about, diffusion maps, would actually get it perfectly and produce this perfect circle.
So there's lots of ways to get at diffusion maps, but one idea is that there's this notion called diffusion distance. Imagine you have your manifold with two points, p and q, and imagine you placed Dirac deltas there. You pretend each is an infinitely focused point of heat, and then you let that heat diffuse through your manifold. You can write the heat equation on a manifold because we have the Laplace-Beltrami operator, which lets us do that. And so you see how it evolves.
You get two functions, f_p,t and f_q,t, and you can take their L2 distance as functions-- the integral distance between functions. And that's going to be the diffusion distance between those points. It seems like a circuitous way to get a distance between two points, because, after all, if we have a manifold, we could ask what the geodesic distances are, like isomap did.
Instead, we're doing something very different here. Related but different.
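On a finite data set, the heat-diffusion picture has a simple discrete analog: run the random walk for t steps from each of the two points and compare the two resulting distributions. A minimal sketch (my own illustrative construction, not code from the talk):

```python
import numpy as np

def diffusion_distance(W, t, i, j):
    """Discrete diffusion distance between data points i and j,
    given a symmetric kernel weight matrix W and t walk steps."""
    P = W / W.sum(axis=1, keepdims=True)   # random-walk transition matrix
    pi = W.sum(axis=1) / W.sum()           # its stationary distribution
    Pt = np.linalg.matrix_power(P, t)
    # L2 distance between the two t-step distributions, density-weighted
    return np.sqrt((((Pt[i] - Pt[j]) ** 2) / pi).sum())

# Points along a line segment, Gaussian kernel weights
x = np.linspace(0, 1, 20)
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.01)
```

By construction the distance is symmetric and vanishes when the two points coincide, matching the continuum definition.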
And so if we go about this way and try to build our algorithm out of diffusion distance-- trying to preserve diffusion distance, we get nice things like we can denoise our noisy spiral quite nicely.
Of course, other algorithms denoise, too, but just pointing out that this has it for sure.
To build this algorithm, which I'll describe in a second, let's just recall how Laplacian eigenmaps goes. We have our data, we make the kernel matrix, we take the degrees, we normalize to get this probability matrix, and then we find the eigenpairs.
So for diffusion maps, we again start with data, but now we have two parameters. We have the epsilon from before, but we also have this alpha. And we're going to do what we did, but with a few more steps. So we make that matrix, we sum up the rows-- sum up across our data-- to get this kernel density estimate. We normalize, and notice that the alpha is occurring in the denominator. Sorry, I should be clicking next. So we get the kernel density estimate, we normalize. This creates an anisotropic kernel if alpha is bigger than zero.
And then we can normalize the anisotropic kernel against the estimate. We get this diffusion maps kernel, which is normalized a different way. And then if we normalize it a different way, we get the random walk kernel. And if we let alpha equal zero, that random walk kernel is exactly equal to the one that we had in Laplacian eigenmaps, so it's a direct generalization.
And so, we're going to have to diagonalize these things. To diagonalize P, it's actually easier to diagonalize a symmetric conjugate of it and then translate the information back. And so we get this-- we get our mapping. It's quite similar, except now we have the presence of this lambda to the t-th power.
Because diffusion distance actually-- we had one for every t. t is how long we let the heat equation evolve.
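Putting those steps together, a bare-bones diffusion maps might look like the following sketch; the dense kernel and the symmetric-conjugate diagonalization are my own simplifying choices:

```python
import numpy as np

def diffusion_maps(X, n_components=2, eps=0.5, alpha=1.0, t=1):
    """Sketch of diffusion maps: Gaussian kernel, alpha density
    normalization, then eigendecomposition of the random-walk matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / eps)
    q = K.sum(axis=1)                          # kernel density estimate
    K_alpha = K / np.outer(q, q) ** alpha      # divide out sampling density
    d = K_alpha.sum(axis=1)
    # Diagonalize the symmetric conjugate of P = D^{-1} K_alpha
    S = K_alpha / np.sqrt(np.outer(d, d))
    vals, U = np.linalg.eigh(S)
    V = U / np.sqrt(d)[:, None]                # eigenvectors of P itself
    idx = np.argsort(-vals)[1:n_components + 1]  # skip the trivial eigenpair
    return V[:, idx] * vals[idx] ** t            # weight by lambda^t

# Uniformly sampled circle: the embedding should again be a round circle
theta = np.linspace(0, 2 * np.pi, 80, endpoint=False)
Y = diffusion_maps(np.c_[np.cos(theta), np.sin(theta)], eps=0.2, t=2)
```

Setting alpha to 0 recovers the Laplacian eigenmaps normalization, matching the direct-generalization point above.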
AUDIENCE: I have a naive question.
CHRISTIAN BUENO: OK
AUDIENCE: From the slides you showed just a while before about the spiral, and the spiral data and the unfolding.
So I could see how there-- you objectively know because we live in this 3D world, and we look at the plot that you can kind of unfold this, and you have this neat, perfect pseudo reconstruction, or whatever and you recover the actual latent dimensionality. So, in a higher dimensional space when you pick your nonlinear dimensionality reduction technique, how would you assess that you're picking the right model and knowing if you've actually unfolded it the right way?
CHRISTIAN BUENO: Yeah, so that's a really good question.
So one thing you can look at, kind of similar to PCA, is that the eigenvalues are ordered. With PCA, if you keep the first q eigenvalues and eigenvectors, that accounts for a certain amount of your variance. With diffusion maps, if you keep the first q eigenvectors and eigenvalues, it's basically saying, I'm preserving this much of my diffusion distance. There's a diffusion distance amongst my data set, and when I'm running through this algorithm and getting my projection, I'm throwing away some of the information needed to reconstruct that distance.
I'm approximating it, just like I was approximating variance before. And so, kind of like with PCA where you would look for a big, sharp change in how much variance you're throwing away, you look for the same thing here: where is there a big change in the eigenvalues? Now, that only tells you, for diffusion maps itself, where the best place to cut off is-- you still kind of have to eyeball it. But what if you're comparing different algorithms? What if you wanted to compare diffusion maps to isomap or t-SNE or an autoencoder? And what if your latent dimension is bigger than you can see?
Then it gets a lot trickier to actually judge whether one is working better than the other.
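The scree-plot-style heuristic just described, keeping components up to the biggest drop in the ordered eigenvalues, could be sketched like this; the cutoff rule is just one reasonable convention, not a standard recipe:

```python
import numpy as np

def spectral_gap_cutoff(vals):
    """Suggest how many components to keep: sort eigenvalue magnitudes
    in descending order and cut at the largest consecutive drop."""
    v = np.sort(np.abs(np.asarray(vals)))[::-1]
    gaps = v[:-1] - v[1:]
    return int(np.argmax(gaps)) + 1

# A spectrum with a clear gap after two dominant eigenvalues
vals = [0.90, 0.85, 0.30, 0.25, 0.20]
```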
So, especially because things like diffusion maps and isomap and t-SNE have this problem, which is that there's not really a clear way to compute the preimage. You can't reconstruct your data in a natural way, so you can't even test reconstruction error in those situations.
It's a tricky thing. That's why in this part of machine learning, people like to try to prove theorems-- some sort of guarantee that in the limit, something happens the right way.
But yeah, case by case, it's a little difficult. I don't know if I answered your question.
AUDIENCE: Yeah, yeah, cool. Thanks a lot, Christian.
CHRISTIAN BUENO: OK, so, where was I? Yeah, we built this thing.
One nice thing about diffusion maps, Laplacian eigenmaps, and isomap, it turns out, is that even though at first glance it doesn't seem like there's a way to apply them to new data without redoing the whole process from scratch, there actually is.
So let's say I wanted one more data point. I've done diffusion maps to find the eigenvectors so that I can draw this in one dimension. What do I do with this new point? Well, a priori, you would just have to toss it into your data set and run the whole thing again, which is not great. But it turns out that there's something called the Nystrom extension, which basically comes about by looking at these expressions and realizing that this works as a Mercer kernel.
So, you can actually think of what's happening as a variation of kernel PCA. And when you think of it that way, you get the same projection formulas. All that you need is to evaluate the kernel and use the eigenvectors to project. And you can just quickly check to see that that works.
Obviously I went too fast for someone to visually check, but it's something you can do.
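In symbols, the Nystrom extension evaluates the kernel between a new point and the training set and projects onto the stored eigenvectors, dividing by the eigenvalues. A sketch under the same Gaussian-kernel assumptions as before (an illustration, not the talk's code):

```python
import numpy as np

def nystrom_extend(X_new, X, V, vals, eps):
    """Nystrom out-of-sample extension (sketch):
    psi_j(x) = (1 / lambda_j) * sum_i p(x, x_i) * phi_j(x_i),
    where p is the row-normalized Gaussian kernel against the
    training data X, and (V, vals) are stored eigenpairs."""
    sq = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    k = np.exp(-sq / eps)
    p = k / k.sum(axis=1, keepdims=True)   # normalize like a transition row
    return p @ V / vals
```

A quick sanity check: extending at a training point reproduces that point's original embedding coordinates exactly, since the stored eigenvectors satisfy P phi = lambda phi.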
So we can kind of see this now. If you color it with the first diffusion coordinate, you get this, which makes sense, and then you can extend it to the ambient space.
And it turns out that the way the Nystrom extension formula works, this actually extends smoothly to the whole space. It's infinitely differentiable, which is unusual-- those sharp changes are actually differentiable-- but in the limit of infinitely much data you would expect discontinuities to appear.
OK, so that's the Nystrom extension. So, I like diffusion maps. I think it's cool because it's kind of at the middle of all these things. So I wanted to throw that out there and to kind of see how all these things are connected in various different ways.
Brownian motion on manifolds is related to the heat equation, and that connects stochastic dynamics and differential geometry.
I didn't emphasize it too much, but basically we had this random walk matrix. So you can interpret diffusion maps and Laplacian eigenmaps as basically some way of decomposing random walks.
And then you also have spectral graph theory relating these things, too.
So, to wrap things up, I did say in the abstract that you can connect this stuff to stochastic dynamical systems. I kind of alluded to that just now.
But people have been looking at problems in molecular dynamics and things like that for a long time, and we know that sometimes, even though lots of crazy stuff is happening, most of the action is happening near a lower dimensional manifold. And so it can be useful to find out what that is.
In fact, you've probably seen, maybe at some point in high school chemistry, these pictures of reaction coordinates: if you have a chemical reaction, there's a potential barrier, but if you get over the potential barrier, then you're in a new, lower-energy state. This is like the diagram on the bottom, basically.
And then the 2D version is on the top, from this paper by Rohrdanz, Clementi, and coauthors, and so diffusion maps has been applied to these kinds of things.
It feels somewhat natural, because diffusion maps has all these manifold-like guarantees, and it also naturally works with diffusions and random walks, so it feels right. But maybe there are better things, of course.
And so, just as another little toy example of this, let's say we had something called an Ito diffusion, which is basically a differential equation with randomness-- Brownian motion-- involved. And so you can get geometric Brownian motion when you solve one of these, which is like what stock prices purportedly do.
At least in a simple model. Or Langevin dynamics, which comes from molecular dynamics. And just a fun little aside if you haven't been exposed to this: the chain rule is different for stochastic calculus-- there are additional terms.
And so the idea would be, hey, maybe we have a situation where, when we let the system evolve, there's a bunch of randomness, but it stays near some lower dimensional manifold. And can we figure out what the stochastic differential equation is like on the lower dimensional manifold from simulation and dimension reduction?
And so you would simulate your SDE and apply diffusion maps, and that denoises it, as you can see on the right. And in this particular case, we have a lot of control, because we know that since it's periodic, there's basically a circle in there somewhere. And intrinsically, all circles can be drawn as the nice round circle. So now all that matters is figuring out what that phase is.
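For concreteness, a simulation like this usually starts from a simple Euler-Maruyama integrator. The double-well potential below is a hypothetical stand-in for whatever system you care about, not the example from the slides:

```python
import numpy as np

def euler_maruyama(drift, sigma, x0, dt=1e-3, n_steps=5000, seed=0):
    """Integrate the Ito diffusion dX = drift(X) dt + sigma dW
    with the Euler-Maruyama scheme (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] + drift(x[k]) * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

# Overdamped Langevin dynamics in the double well V(x) = (x^2 - 1)^2,
# i.e. drift = -V'(x) = -4x(x^2 - 1)
path = euler_maruyama(lambda x: -4 * x * (x * x - 1), sigma=0.5, x0=1.0)
```

Feeding a trajectory like this (or a higher-dimensional analog of one) into diffusion maps is the kind of simulate-then-reduce pipeline the slide describes.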
To make things a little better, you can downsample because we have to factor a matrix here, so that's kind of slow.
And so, for example, you can look at these things and run your computation and actually recover what the underlying components of this stochastic differential equation were.
And yeah, so that's kind of a cool thing. However, this was a very special toy case, because it was a one dimensional example, and we know all the one dimensional manifolds, basically. We know what their eigenfunctions are, so you can wind these things up really nicely in those situations. When you have higher intrinsic dimensions, boundaries can get weird and the curvature can make things a little more unpredictable. So I think this stuff-- diffusion maps in particular-- really shines on 1D latent spaces. If for some reason you have one, it can pick that up really well.
A bunch of references that I used, and, yeah, thank you for your time. That's the end of the talk part.
AUDIENCE: Yeah, this is really cool to see all of these different methods, I guess presented in this format. I was wondering if you could give a little bit of intuition for how some of these methods relate to each other or which ones can sort of be embedded in other ones. So I guess one example of that would be like how you can if you have a particular type of autoencoder, then you can essentially capture [AUDIO OUT]. And so does that type of relationship, like a curve or some of these other methods?
CHRISTIAN BUENO: So, that was kind of the things I was trying to hint at along the way and I can make them explicit now that you ask.
So the starting point for sure is PCA, in my head, and from the Gram matrix, you saw that in the Euclidean case, PCA is classical multi-dimensional scaling. They are equivalent methods.
But then you can extend in a different direction because multi-dimensional scaling, kind of the algorithm works in a more general setting of, hey, I'm given distances and not vectors. And that's one relationship.
And the Gram matrix there is the connection-- the covariance matrix is not the only way you can do it, you can do it with the Gram matrix.
Another connection also comes from the Gram matrix, which is that since the Gram matrix only involves dot products, this is something you can kernelize. And that's how you make the bridge to kernel PCA.
And the next step from there to the other methods is kind of a bigger leap because the other methods I discussed here, they involve kind of trying to bring in this intuition from manifold theory or related ideas. And the thing that ties those together, though, is this idea of some sort of neighborhood graph, some sort of locality-- a representation of your data that forgets what dimension it's living in, which is why I like the graph methods.
And so amongst those, isomap is definitely related to multi-dimensional scaling, because it uses it. And Laplacian eigenmaps is basically a special case of diffusion maps.
And then something I didn't have time to cover: t-SNE, which you can also kind of see as an evolution of this thinking. I have neighborhoods and probabilities, because diffusion maps has a probabilistic interpretation, too.
Probability of jumping to my neighbor. And so t-SNE has almost the same expression up to a point, and then it deviates in a new direction and gets to a situation that you cannot solve just by diagonalizing a matrix-- you have to solve it iteratively.
And that has its cons and its pros, depends on what you need it for.
So I do definitely think there is relationships between all of them, and those are some of the ones that I think can be made explicit for sure.
AUDIENCE: Well, thank you, those are helpful.
CHRISTIAN BUENO: Yeah.
AUDIENCE: So, one nice thing about PCA is that we can project back from low dimension to high dimension to get an idea of what the principal components are. Which is quite useful in other fields because we want to know the nature of these principal components. But with isomap and perhaps a diffusion map, we can't do that here, so we can't go from-- we can't project back from low dimension to high dimension, yes?
CHRISTIAN BUENO: Yeah. I would agree with that. I haven't seen anything that feels natural or 100% right.
For example, sometimes people will take something like isomap-- something that goes in one direction but not the other-- and then try to learn the reverse mapping with a neural network. But that definitely feels like, OK, it's not within the same family; it kind of feels like you're cheating a little bit.
So I do think those are definitely big limitations for now at least. Kernel PCA with Gaussian kernel, you can do something that's a bit more convincing.
It turns out that in that case there is a pretty decent idea of a preimage, but once you move away from Gaussian kernels, it starts getting unclear how you would do that, I think.
AUDIENCE: And the same issue with the diffusion map? Yes, we can do that.
CHRISTIAN BUENO: Yeah, exactly.
And the same issue for t-SNE too. But t-SNE also has an additional issue, which is you can't even take new data and embed it with the same mapping. It only-- because it's iterative you kind of have to just do it again if you have more data.
AUDIENCE: Hyper isomap can be [INAUDIBLE].
CHRISTIAN BUENO: Isomap can be made to work with new data points. There's a paper by one of the Bengios, I forget which one, specifically showing that isomap, locally linear embeddings, Laplacian eigenmaps, diffusion maps, and I think one more, can all be written in a way so that they become kernel methods. And so then you can just use the ideas from kernel PCA projection to get the extension mapping.
AUDIENCE: OK, thank you.
CHRISTIAN BUENO: Yeah, no problem.
AUDIENCE: Hi, can I ask a question that's slightly extended from the talk?
CHRISTIAN BUENO: Yeah.
AUDIENCE: Yeah, so many of these methods take as input the positions of the points or their distance matrix. What if you actually have the Riemannian metric tensor around each point, or some sort of local distance notion around each point? What can that add to your picture or the dimension reduction method?
CHRISTIAN BUENO: That's interesting. I don't personally know.
AUDIENCE: I see.
CHRISTIAN BUENO: Of someone doing that. It seems like-- well, one thing that would mean is that if you actually had the tensor, you would then have the intrinsic dimension of your space. If it's a 2 by 2 tensor, you're on a surface. If it's a 3 by 3 tensor, you're in 3-space, kind of a thing. So intrinsic dimension estimation is, I think, probably just as hard as the dimension reduction.
To do it in a good way.
Yeah, that's an interesting question for sure.
Even as just a pure math question: if you give me all the tensors, is it even possible to realize that as a manifold?
AUDIENCE: Yeah. That's interesting. Another quick question: I think one of the recent methods, called UMAP, is trending up in data science.
CHRISTIAN BUENO: Yeah. Yeah.
Yeah, sorry what did you say, sorry?
AUDIENCE: How did that fit into the picture of this talk? Do you think?
CHRISTIAN BUENO: Yeah, so I personally don't fully understand how UMAP works, but the gist that I understand is that it's kind of like the next evolution of t-SNE, basically.
Where t-SNE has this issue that you can't actually embed new data. UMAP does not have this issue. You can embed new data with UMAP, so you can actually use it as a preprocessing tool, but also UMAP is at least as fast as t-SNE.
And in a lot of situations, the quality of the embeddings usually looks really good.
The advantage of UMAP, also from a theory standpoint, is that it has a lot more theory. The original paper on it uses very abstract mathematics, like category theory and fuzzy simplicial sets, where you're not 100% sure how connected you are.
And all these things to try to basically come up with a mathematically consistent way of realizing these things. I don't know what the limitations are, but it does seem to have a lot of grounding in theory for sure.
AUDIENCE: Thank you so much.
CHRISTIAN BUENO: Yeah, no problem. I'd also say that, from what I've read-- though I haven't gone into the details-- when you brush away all that higher level math, the core algorithm is a lot like t-SNE, which is nice.
AUDIENCE: If I can ask kind of like a related question talking about UMAP.
So I think that from what I understand, what UMAP actually ends up doing, like the implementation of it at least, not necessarily the theory, is like a one dimensional graph layout algorithm.
I don't know about the one dimensional part, but it's essentially a graph layout algorithm.
CHRISTIAN BUENO: I think that's a fair way to characterize it, so with t-SNE, you also kind of say, hey, each points-- there's a Gaussian kernel involved. Nearby points are going to be more involved with each other, and you do end up getting this kind of weighted graph out of it. How you end up drawing it is the magic of t-SNE. And I think how you end up drawing it is also the magic of UMAP. But I definitely think, yeah, you could think of it as a graph layout problem.
AUDIENCE: I guess the question that I have is that I think UMAP can actually be in theory instead of, I guess, drawing out a graph. Essentially you can construct higher dimensional simplicial complexes, and I don't really have an intuition of what kind of information or what kind of structure or relationships or something like that might you be able to represent in a higher dimensional kind of like graph analog, as opposed to just a one dimensional graph layout?
Are we actually missing something kind of interesting?
CHRISTIAN BUENO: Yeah, so, I'm not on the details, I definitely wouldn't know the answers, but it does strike me as kind of reminiscent of something called topological data analysis.
If you're familiar with that, there's a particular method there called persistent homology, where they build simplicial complexes out of the data, and from those simplicial complexes they perform some linear algebra, and what they extract is basically an idea of what kinds of holes are present in your data. If you have a donut, it would pick up that you have a donut-shaped hole, and so on and so forth. So graphs can definitely tell you about connectivity, right-- how many components do I have; you can just use a graph algorithm to find out. Higher simplicial complexes can tell you about the presence of more complicated holes, basically.
Formally in mathematics, like in algebraic topology, these concepts of homology and holes are basically generalizations of the idea of connected components of a graph.
So maybe UMAP is in some way leveraging that, but I wouldn't know. I've never done a tutorial in Colab before-- this is my first time-- so bear with me if I'm doing this in a silly way.
So we're just going to run each cell, one by one.
We are cloning this little repository just so I can get my implementation of diffusion maps and a little thing that has different manifolds I can sample from.
The next thing is just a bunch of standard libraries, plus some of the implementations of the other nonlinear dimensionality reduction algorithms, like t-SNE, isomap, MDS, et cetera.
OK. And so the first thing we're going to sample is-- I really like the trefoil knot because I feel it's as simple as you can get, while still being something that would just ruin PCA and at the same time look very pretty.
The first thing is just going to be to sample our trefoil knot and to visualize it.
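The sampling step amounts to evaluating a trefoil parametrization at random parameter values. The particular parametrization below is one standard choice, not necessarily the one in the notebook:

```python
import numpy as np

def sample_trefoil(n=500, noise=0.0, seed=0):
    """Sample n points on (or, with noise > 0, near) a trefoil knot in R^3."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, n)
    # A common trefoil parametrization
    X = np.c_[np.sin(t) + 2.0 * np.sin(2.0 * t),
              np.cos(t) - 2.0 * np.cos(2.0 * t),
              -np.sin(3.0 * t)]
    return X + noise * rng.normal(size=X.shape)

X = sample_trefoil(500)
```

The same function with a nonzero `noise` argument produces the noisy trefoil used later in the demo.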
There we go.
Why did that happen?
There we go.
So now, we're just going to fit all these methods.
We have our PCA, we have our isomap, we have MDS, we've got t-SNE, and Laplacian eigenmaps as implemented by scikit-learn.
Diffusion maps with alpha equals 0, and diffusion maps with alpha equals 1.
So, I didn't elaborate too much on this, I don't think, but the role alpha plays is that when alpha equals 1, you can ensure that in the limit of large data, as your data goes to infinity, and assuming you're sampling from a manifold, you can prove that the effect of the sampling distribution goes away.
So with laplacian eigenmaps and a lot of these other things, if I sample a sphere in different ways, it's going to affect my embedding, but with diffusion maps, alpha equals 1, it won't. You can expect to get the same thing if you have enough data and epsilon goes to 0.
So, that's done.
So now we're going to just plot all our things. So as expected, PCA cannot do much more than that, which is still nice as a picture, but not what we're looking for.
I scrolled too far. OK.
So isomap gets a pretty round circle, which is good, because isomap is just using the geodesic distances-- or a proxy for the geodesic distance-- and from manifold theory, we know that any loop is intrinsically indistinguishable from a perfect circle. And isomap does pretty much reflect that.
Multi-dimensional scaling gave us this.
t-SNE also gives us a pretty round looking circle.
So that's nice.
Laplacian eigenmaps, the scikit-learn implementation with the default parameters, gives us this kind of sharp looking trefoil knot. And then diffusion maps with alpha equals 0 gives us this nice little three-fold symmetric one, and diffusion maps with alpha equals 1 gives us a perfect circle.
And we could have sort of predicted this from the fact that we started with the trefoil knot, because if you remember, in diffusion maps the eigenvectors are converging to the eigenfunctions of the Laplacian for that manifold. The Laplacian, locally speaking, is just a second derivative on a circle. And so we're just getting things that give back a multiple of themselves when you take two derivatives-- we're getting sine and cosine, in terms of the arc length along the trefoil knot.
So as we traverse the trefoil knot, we're just doing sine and cosine, and that literally is how you draw a circle. And so that's why we get a perfect circle.
And we should expect that basically for any sort of knot.
This is just summarizing some of the things I said. t-SNE gave a good result here, but I've run this before and sometimes it breaks the knot into two components, because there's a little bit of chance involved-- that can happen due to the initialization and the optimization.
The next thing we're going to do is a noisy trefoil knot. Let's see how that plays out.
I'm also going to run the fitting.
Let's get a look at it.
PCA, of course, just projects it down along the directions that look nice. That's nice on some of these.
Isomap did something pretty interesting. I have no explanation for this off the top of my head.
Actually, here's probably what happened. Let me guess first. Right here, this blue is pretty close to the green. So I'm guessing that that's also the case in the original data, and that caused a short circuit when it built the neighborhood graph.
So remember, in isomap you build a graph of nearest neighbors and then use that to estimate geodesic distance. And if it just so happened that one of these noises bridged two parts that shouldn't have been bridged, that will completely affect what the shortest distances are.
It's kind of an all or nothing when it comes to the shortest distances. So where is the picture?
Blue and green.
You can see right there that there's a green blue that are very close to each other. And that probably short circuited the neighborhood graph.
AUDIENCE: So quick question on that, Christian.
CHRISTIAN BUENO: Yeah.
AUDIENCE: If that happens to be the case, is there a way that we can detect it?
CHRISTIAN BUENO: Ah, that's an interesting question. I think to be able to do it reliably, you kind of have to know what you're working with. Here I know that it shouldn't be doing what it did, but if this were some really, really high dimensional thing that I have no intuition for, and I just see the projection, it's hard to know from looking at it whether some problem occurred or whether that's truly what it looks like. I'm not sure, but maybe there is something you can do to sort that out. That's a good question.
Something you can maybe do is just run it with different settings, different number of neighbors and see how much it changes. If there's a drastic change at some point, maybe that gives a hint that something is happening.
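That stability check could be sketched like this (a hypothetical example, not code from the talk): rerun Isomap over a range of `n_neighbors` on a noisy knot and watch for an abrupt change in the embedding, which can hint that a short-circuit edge appeared in the neighborhood graph.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 300, endpoint=False)
# Noisy trefoil knot in R^3.
knot = np.column_stack([
    np.sin(t) + 2 * np.sin(2 * t),
    np.cos(t) - 2 * np.cos(2 * t),
    -np.sin(3 * t),
]) + 0.05 * rng.standard_normal((300, 3))

# Heuristic: if the embedding changes drastically as n_neighbors grows,
# the neighborhood graph may have gained a short-circuit edge.
for k in (3, 5, 10, 20):
    emb = Isomap(n_neighbors=k, n_components=2).fit_transform(knot)
    # The spread of the embedding is a crude one-number summary to compare.
    print(k, round(float(emb.std()), 3))
```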
AUDIENCE: OK, thank you.
CHRISTIAN BUENO: Yeah, t-SNE just tore it apart, and yeah, sometimes it will do that. It likes to find clusters, it seems, and you can also see that it looks like it was about to break off another piece, too.
That's the scikit-learn Laplacian eigenmaps, and this is diffusion maps with alpha equals 0. It broke, unfortunately. This one also kind of broke, but it's interesting that we see the effect of that short circuit here as well, in both of these.
It seems to have a little bit less of an effect because at least in my implementation, I connect everything to everything because I notice that it makes the eigenfunctions converge faster. And so that's just the design choice there.
Next we're going to do a noisy spiral.
So this is a 1D manifold with noise, and it has a boundary. Since it's that kind of thing, we can start making some predictions about what's going to happen with the different methods.
Obviously things are going to depend a little bit on the hyperparameters, but with Isomap, assuming we're not using too many neighbors (we're using 5, and it seems like it would be pretty difficult to short circuit here), we should be able to perfectly unroll this.
And diffusion maps is basically going to ask: what are the eigenfunctions of a line segment? Because intrinsically, there's no difference between a straight line segment and one that's been rolled up; the ant can't tell the difference. So asking for the eigenfunctions of the line segment with respect to the Laplacian tells us what the components are going to do. So this is too far down.
OK, so we get PCA. Isomap, kind of unrolled it in this blue part.
What's the blue part?
So I think here we are getting kind of close to this other side. So maybe that's where short circuit occurred.
Yeah, I think all the purples aren't here.
That is weird though that [INAUDIBLE].
So MDS isn't doing anything fancy. t-SNE did a pretty good job.
The laplacian eigenmaps looks like it started unfolding, but didn't finish.
Diffusion maps, if you just project onto the first diffusion coordinate, you can see that we have flattened it out. But with diffusion maps, when you use two components, it's never going to look flat.
And you can see it is more pronounced in the next one, which is with alpha equals 1.
We get this U-shaped curve.
And the reason it looks like this comes down to what the eigenfunctions on the line segment are: cosines, basically. If you draw them out, then assuming you have enough data to estimate these eigenfunctions well, you're going to get this kind of shape for anything that's an arc, anything that's a curved segment.
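The U shape can be checked directly. The Neumann eigenfunctions of the Laplacian on a unit segment are cos(k*pi*s), and the second one is exactly a parabola in the first (the identity cos(2x) = 2cos^2(x) - 1), which is why plotting the first two diffusion coordinates of any curved segment gives a U. A quick numerical check:

```python
import numpy as np

# Neumann eigenfunctions of the Laplacian on [0, 1] are cos(k * pi * s).
s = np.linspace(0, 1, 200)
phi1 = np.cos(np.pi * s)       # first nontrivial eigenfunction
phi2 = np.cos(2 * np.pi * s)   # second nontrivial eigenfunction

# phi2 = 2*phi1^2 - 1: the second coordinate is a parabola in the first,
# so the (phi1, phi2) plot of a segment is always a U-shaped curve.
print(np.allclose(phi2, 2 * phi1**2 - 1))  # True
```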
OK, so, at least when I wrote this, isomap did quite well, much better than it did this time.
The U shape with alpha equals 1 for a curved segment is still there.
So now let's do a helix. So a helix is also a curved segment.
So I've kind of spoiled it by telling you kind of what to expect for a curved segment and diffusion maps.
Assuming bad things don't happen. One bad thing that can happen is that your epsilon is too big: you can basically imagine one giant Gaussian sitting on top of your entire dataset, so you can't distinguish things well. If it's too small, you run into numerical instabilities. So there is a bit of a game you have to play to find the right level.
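To show where epsilon enters, here is a minimal diffusion-maps sketch of my own (not the speaker's implementation): the bandwidth sits inside the Gaussian kernel, before the alpha normalization and the Markov row-normalization.

```python
import numpy as np

def diffusion_coords(X, eps, n_coords=2, alpha=0.0):
    """Minimal diffusion-maps sketch; eps is the Gaussian kernel bandwidth."""
    # Pairwise squared distances and Gaussian kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / eps)
    if alpha > 0:
        # Density normalization (alpha = 1 removes sampling-density effects).
        q = K.sum(1)
        K = K / (q[:, None] ** alpha * q[None, :] ** alpha)
    # Row-normalize to get a Markov transition matrix.
    P = K / K.sum(1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    # Skip the trivial constant eigenvector; scale coordinates by eigenvalues.
    idx = order[1:n_coords + 1]
    return vecs.real[:, idx] * vals.real[idx]

# Too-large eps blurs everything into one Gaussian; too-small eps is unstable.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
print(diffusion_coords(circle, eps=0.1).shape)  # (100, 2)
```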
PCA didn't do much. It is slightly tilted, so I guess maybe the center of mass is more one way than the other.
OK. Isomap did really well. It completely flattened it out.
Yes, since there wasn't any noise, the geodesic distance was really well approximated.
t-SNE, on the other hand, doesn't know what is going on. It decided that the spiral is actually many different segments.
And so this is, I think, a good thing to do every now and then to realize that when you see components in a t-SNE plot, that doesn't necessarily mean that they are actually disjointed.
It's a bit tricky. Maybe the hyperparameters were not set up right, but sometimes it does the right job and sometimes it doesn't. That's what I've seen.
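You can probe that fragmentation behavior on a synthetic connected curve (a hypothetical example; results vary by scikit-learn version and random seed) by sweeping t-SNE's perplexity on a helix and inspecting each embedding for spurious clusters.

```python
import numpy as np
from sklearn.manifold import TSNE

# A single connected helix: at low perplexity, t-SNE may tear it into
# several apparent clusters even though the curve has one component.
t = np.linspace(0, 6 * np.pi, 300)
helix = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])

for perp in (5, 30, 100):
    emb = TSNE(n_components=2, perplexity=perp, init="pca",
               random_state=0).fit_transform(helix)
    print(perp, emb.shape)
```

Plot each `emb` colored by `t`: if consecutive colors land in separate blobs, t-SNE has fragmented the curve.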
So, then we get this. And then we get this. And then with alpha equals 1, we get that U shape.
So this kind of reinforces that, I think, with one-dimensional stuff, diffusion maps can be pretty illuminating.
So the next example is an interesting one: the flat torus, which I mentioned before. This is like taking Pac-Man's world and folding it into a cylinder; you can do that by identifying the left and right sides, but then you still have to identify the top and bottom of the video game screen. You can't do that in three dimensions without creasing it, but with a fourth dimension you can actually do it without introducing any curvature, and that's what this map does. It doesn't introduce any curvature, but we need the fourth dimension to realize the flat torus.
And again, it's a weird thing, but cylinders have no curvature. They're flat because you can unroll it to a flat space without distorting distances.
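The four-dimensional map being described is presumably the standard flat embedding (u, v) -> (cos u, sin u, cos v, sin v); a quick check shows it lands on a 3-sphere of radius sqrt(2), with no intrinsic curvature introduced.

```python
import numpy as np

# Flat torus: (u, v) -> (cos u, sin u, cos v, sin v) in R^4.
# This embedding is isometric to Pac-Man's world (zero curvature),
# which is impossible to achieve smoothly in R^3.
u, v = np.meshgrid(np.linspace(0, 2 * np.pi, 30, endpoint=False),
                   np.linspace(0, 2 * np.pi, 30, endpoint=False))
flat_torus = np.column_stack([np.cos(u).ravel(), np.sin(u).ravel(),
                              np.cos(v).ravel(), np.sin(v).ravel()])

# Every point has norm sqrt(cos^2 u + sin^2 u + cos^2 v + sin^2 v) = sqrt(2),
# so the flat torus sits inside the 3-sphere of radius sqrt(2).
print(np.allclose(np.linalg.norm(flat_torus, axis=1), np.sqrt(2)))  # True
```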
Locally speaking. OK, so I don't remember anything about randoms.
So another thing that's interesting is-- it's a known fact that we can't embed this thing in Euclidean spaces [AUDIO OUT].
Yeah, I'll wrap up.
So, I'll just finish this example and skip the digits. So we get this.
Going to fit all that stuff and then plot the embedding, and I'll stop there. The last one was just digits, but I can spoil it a bit. Diffusion maps and Laplacian eigenmaps don't do very well, but t-SNE does seem to do well, visually speaking, when we look at the classes.
It separates classes, is what I mean there specifically. Whereas diffusion maps, if you run it yourself, runs into some numerical instability issues even with large--
So you can play around with some of these hyperparameters, and you'll see that you have to play around with it quite a bit before diffusion maps does anything even vaguely good.
OK, so here's our first visualization of the flat torus, and it kind of looks spheroidal, but in this case all it did was project. So, it is definitely doing some collapsing, and it's definitely not flat. If you look at places, you can definitely see curvature.
So this one, as I go along inside, that's curving up, but going around the other way, it curves in the opposite direction, that's negative curvature, and on the outside we get positive curvature.
OK. So isomap did very similarly.
MDS did very similarly. t-SNE did something interesting.
It did this kind of really strange, but interesting folding of itself.
Laplacian eigenmaps didn't do much. Diffusion maps with alpha equals 0 didn't do much.
Diffusion maps with alpha equals 1 did something similar to t-SNE.
AUDIENCE: It's 2:30, so maybe you can wrap up.
CHRISTIAN BUENO: Yeah, that's it. Yeah, I just wanted to finish this example, I'm done.
AUDIENCE: OK, thanks so much again, Christian.
Maybe we can all give some virtual claps to thank Christian again.