Modal-set Estimation using kNN graphs, and Applications to Clustering
October 12, 2018
Brains, Minds and Machines Seminar Series
Samory Kpotufe, Princeton University
Estimating the mode or modal-sets (i.e. extrema points or surfaces) of an unknown density from samples is a basic problem in data analysis. Such estimation is relevant to other problems such as clustering and outlier detection, or can simply serve to identify low-dimensional structures in high-dimensional data (e.g. point-cloud data from medical imaging, astronomy, etc.). Theoretical work on mode estimation has largely concentrated on understanding its statistical difficulty, while less attention has been given to implementable procedures. Thus, theoretical estimators, which are often statistically optimal, are for the most part hard to implement. Furthermore, for more general modal-sets (general extrema of any dimension and shape) much less is known, although various existing procedures (e.g. for manifold denoising or density-ridge estimation) have similar practical aims. I'll present two related contributions of independent interest: (1) practical estimators of modal-sets – based on particular subgraphs of a k-NN graph – which attain minimax-optimal rates under surprisingly general distributional conditions; (2) high-probability finite sample rates for k-NN density estimation, which is at the heart of our analysis. Finally, I'll discuss recent successful work towards the deployment of these modal-set estimators for various clustering applications.
Much of the talk is based on a series of work with collaborators S. Dasgupta, K. Chaudhuri, U. von Luxburg, and Heinrich Jiang.
FREDERICO AZEVEDO: So welcome. My name is Frederico Azevedo. I'm a post-doc at the CBMM. And today it's my pleasure to have here Samory Kpotufe, a former colleague of mine from the Max Planck Institute for Biological Cybernetics.
So basically, Samory did his PhD at UCSD in computer science, and then he moved to Tübingen where he joined the department of Bernhard Schölkopf. And there we met. And after his post-doc there, he came back to the US as an assistant professor at the Toyota Technological Institute.
And after that, he became an assistant professor at Princeton in the Operations Research and Financial Engineering Center. And in January, he's going to be moving to Columbia in New York. And today he's going to tell us everything we can know about modal-set estimation using kNN graphs and applications to clustering. So without further ado, thank you very much, Samory.
SAMORY KPOTUFE: Thank you. Thank you for the introduction. And thanks for having me.
SAMORY KPOTUFE: So the talk today-- let me just adjust this. So yeah, so my talk is going to be about clustering in general, but a particular form of clustering called density-based clustering. The talk is actually based on a series of work over the last four or five years or so with Sanjoy Dasgupta, Kamalika Chaudhuri, Ulrike von Luxburg, Heinrich Jiang, and Jennifer Jang.
So hopefully this thing is working. OK, I guess my clicker is not working. So I'll start quickly by defining what density-based clustering is about. So clustering itself, most people here know what clustering is about. Clustering itself-- let me see if-- OK, thank you.
So clustering itself is just about grouping points together. You have a dataset. You have a bunch of points. And you want to group the most similar points together into regions of similar points. In density-based clustering, the idea is, how do we define similar?
We define similar by just saying that the points belong together to regions of very high density. So in density-based clustering, a cluster is essentially just that region of very high density of the data. And we [INAUDIBLE] grouping all the points together into these maybe four regions.
This is another cartoon where we have a ring here, another ring, another ring. And those will be, essentially, the centers or the cores of the clusters. And so the idea will be try to find these regions of high density and then bring all the points together. Assign all the points to regions of high density of the data. So that's the basis of density-based clustering.
The reason why it is quite appealing is because it's very flexible. Those who are familiar with clustering here, if you think, for instance, of k-means clustering, you need to specify a certain number of clusters a priori. In density-based clustering, you don't need to specify a number of clusters a priori, and it's very flexible in that sense: the data could have any number of regions of high density.
And so that flexibility is very appealing. And it could also be that regions of high density could have any shape. And so you could have clusters of any shape. And that's also very appealing in a number of applications.
Another appeal of it is that it allows you to define a mathematical object, a clear ground truth, a mathematical object that you're trying to estimate from the data. Most of the time in clustering, it's very hard to define what we even mean by cluster. What, technically, is a cluster?
If you define a cluster as a region where points are closest to each other, you have to define what you mean by closest. So here, by just defining things in terms of high density, we can actually mathematically define what it is that we are trying to estimate. So we can then talk about the performance of the estimation problem.
There are many applications in medical imaging, in text mining, in speech, in computer vision. And at the end of the talk, I'll try to get into some of these applications. So the main motivation for this series of work, for this line of work for us was that in density-based clustering, there are tons of heuristics that people came up with, because of its appeal.
There are tons of heuristics that people came up with. And the heuristics work very well in practice. However, it's very hard to say anything theoretical about these heuristics. It's very hard to give guarantees. Are they truly finding the regions of high density of the data?
It's very hard to say so. However, there are tons of theoretical methods for which we can give these type of guarantees. They would certainly, with high probability, find the regions of high density. However, when you try to apply the theoretical methods, they don't work as well as the heuristics.
So that was our main motivation. Can we come up with a method that is theoretically grounded with guarantees that compete with the heuristics? And so that's what I'll try to talk about today, a series of work that allowed us to eventually arrive at that, arrive at a method that is theoretically grounded and works just as well or better than the heuristics.
So first, I'll talk about the various formalisms of density-based clustering, the various approaches that people have used so far in density-based clustering. The common idea to all these approaches is you cluster the data around so-called high-density cores. You first define what you mean by a high-density core.
You find the high-density cores and then you cluster all the data around the high-density cores. The distinction between the various approaches is really in how they define high-density cores. What's common to all the approaches is that we assume a priori that there is an unknown density. Let's say this is the unknown density.
There is an unknown density from which your data is generated. You might view it as the population density. Your data is sampled randomly from this population density. And you define your clusters with respect to the population, not with respect to the data you observe. With respect to the population. And then you try to estimate with respect to that population.
So here is a simple cartoon. And let's say the data is sampled i.i.d. from some density f. The density itself is not known. It's density over the space. It could be a Gaussian, could be anything you want.
The density is not known. All we observe is the points from the density. So one line of work is essentially the assumption in DBSCAN. DBSCAN is one of the most popular heuristics for density-based clustering. And the main assumption made in DBSCAN and similar algorithms is that the cores, what we are going to call the centers of every cluster, the cores of a cluster, are given by a level set of this density.
So I have the density. I cut it off at some level. And then whatever points correspond to that cutoff, the points whose density is above this lambda, would be my core, would be the core of my cluster. So those will be the points of very high density.
So DBSCAN and a bunch of other algorithms use this sort of formulation for what high-density cores would be. The idea would then be: estimate these high-density cores from data. And then once you've estimated the high-density cores, you assign every other point to the high-density cores. And please feel free to stop me at any time if you need better explanation.
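To make the level-set idea concrete, here is a minimal sketch of this kind of cutoff, not DBSCAN itself; the toy data, the kernel bandwidth, and the level lambda are all made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy data: two Gaussian bumps (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

# Estimate a density and keep the sample points above a chosen level lambda.
f_hat = np.exp(KernelDensity(bandwidth=0.5).fit(X).score_samples(X))
lam = 0.02                      # density level (has to be chosen by the practitioner)
cores = X[f_hat >= lam]         # high-density cores at this level
```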
So the problem with DBSCAN right away is you have to specify the density level at which you're looking. So this here is one specification of the density level. But if I change the density level, if I put it here, all of a sudden the clustering changes completely. So that's one problem with it: the practitioner has to specify this density level, and the right choice is not quite clear a priori.
Another line of work is mean shift. And in mean shift, the cluster cores are essentially just modes of the density. By modes of the density, the maxima of the density, the points at which the density is maximal.
Mean shift sort of assumes that the density is maximal exactly at some points. In practice, it's not necessarily maximal at single points; it could be maximal on a whole set. But this is the main assumption within mean shift: it's maximal at some points.
And the problem is that mean shift can be very unstable for very general maxima. In practice, a maxima of an unknown density does not have to be at a single point. The unknown density can be flat or approximately flat in some regions of space.
And whenever that happens, if you think about this region for instance, whenever that happens, then the high-density core is not well-specified. Any of the mean shift procedures could be picking up any point in this region. And in that case, you generate a whole different clustering.
So this is also not very well-specified. And in the end, it gives you different answers depending on which sample you're getting from the unknown density. So that's sort of the motivation here.
The other problem is that mean shift, even though it's quite popular in practice, it's very hard to analyze. It's very hard to give guarantees to mean shift. It's also fairly hard to give guarantees to DBSCAN.
And these are the two methods that we are going to try to compete against. We are going to try to do much better and compete against these methods. The key idea for us was simple-- try to fix this instability within mean shift.
So the idea for fixing the instability is, rather than assuming that the unknown density is maximal at some point, we allow for the fact that the unknown density could be maximal over some very general subset of the entire space. And that place where the density is maximal is what we are going to try to estimate. And we'll just call that a modal set, as opposed to a mode.
And so the whole point now for us is to come up with a practical and optimal estimator of modal sets of the unknown density. So here again in the cartoon, the density here is maximal on this range here. And so the idea will be to try to discover the rings, and then cluster the rest of the points around the rings. In this other cartoon here, the density is maximal here. Try to discover all these points and assign everything else to those points.
Now, the difficulty here is that we don't know the location of the maxima, and we don't know the shape they have. We don't even know the dimension of the maxima. And yet, we want to do this very general estimation.
The side benefit of this type of estimation is that it applies much beyond the problem of clustering. It also applies to the problem, a very similar problem, called manifold denoising. So in manifold denoising, I'm observing a manifold plus noise. And this is sort of the case here. I'm observing manifold plus noise.
And I want to discover the manifold itself. I want to denoise the manifold. And we'll get that benefit by just understanding how to estimate modal sets. Any questions up to this point?
All right, so this will be the main goal for us. Once we can estimate modal sets, we'll just fall back to simple clustering by clustering everything around the modal sets. So the rough intuition as to how we are going to estimate modal sets: the key problem, as I said earlier, is that we don't know the location of the modal sets. We don't know their shape, and we don't know anything else.
At the very least, we'll try to isolate them. We'll try to get to their location. That's a first step-- try to get their location. And so to get their location, the idea is isolate points around the maxima.
If we can find roughly, approximately, all the points that are close to a maxima, then we can, within those points, try to estimate the modal set. So how do you isolate these points? We can isolate them if we can estimate level sets.
Suppose we can estimate the level set of the unknown density. Then we can isolate the regions where it's maximal. So a level set here is defined simply this way. It's all-- we're considering upper level set. This is the unknown f. We're considering all the points that are above a particular level.
So as here, if we can estimate all the points above this level-- these are the points-- we can then infer eventually that there is a single mode here. If we could isolate all the points here, then we could say eventually that there is actually a whole flat mode in here. So first step for us is isolate these things, and then learn the maxima in each of these level sets.
The subroutine that we are going to use to do the isolation is called a k-nearest neighbor graph. And the k-nearest neighbor graph is a simple graph, easy to describe. It starts with a parameter k, the number of neighbors.
And for any parameter k, let's say that k is 2, then I take every point and I attach it to the two closest points, to the two nearest neighbors. And over time, that builds an entire graph called the k-nearest neighbor graph. It will turn out that the k-nearest neighbor graph is the main object that we can parse quickly to obtain all the level sets of the unknown density. And from there, isolate the mode, or the modal set.
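As a rough illustration, here is a brute-force sketch of such a graph; the function name and the quadratic-memory distance computation are just for exposition, not how one would build it at scale:

```python
import numpy as np

def knn_graph(X, k):
    """Directed k-NN graph: an edge i -> j whenever j is one of the
    k nearest neighbors of i (brute force, for illustration only)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors
    edges = [(i, j) for i in range(len(X)) for j in neighbors[i]]
    return neighbors, edges
```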
OK, so the outline of the talk is the following. And this outline sort of follows the outline of our research program. So first thing for us is understanding how nearest neighbor graphs relate to the unknown density.
Once we understand how nearest neighbor graphs relate to the unknown density, we can then think about, OK, what if the unknown density only has point modes? So we're going to look at a simpler problem. What if the unknown density only has point modes? Can we estimate them all?
Once we are able to estimate them all, we then go to the harder problem of estimating more general modal set, more general maxima of the density. And I hope that sort of makes sense. Once we get intuition here, it becomes quite easy to understand how we do this last part.
So before I move on, because I'm going to get into much more technical stuff in a minute, I want to give you a small spoiler. What do we get by doing this? What sort of clustering result do we get by doing this?
And so here are results on a few data sets. QuickShift++ is the clustering procedure we get from this approach, estimating modal sets first. QuickShift++ is the procedure. DBSCAN is one of the heuristics. Mean Shift is another heuristic. And QuickShift is another very popular heuristic.
And on a lot of these data sets, we can see quickly that the approach works really well. We actually do much better than all the heuristics here by just doing this modal set estimation first. And here we didn't actually go and pick the data set so that I can show these results. We just tried a lot of different data sets, and the things kept working very well. So well that we thought at first that there might be a mistake in the code.
But there was no mistake. And I'll show you a lot more such results at the end. The scores here are called mutual information and the Rand index. Those are the two most common scores for assessing the performance of a clustering procedure. I can talk a bit more about them later.
So the first thing is, what do k-NN graphs encode about the unknown density f? All right, so I'll start off by giving you a bit of history. So how can we characterize the clustering structure given by a density?
So we have a density over R^d, over a general space. And Hartigan in '81 proposed characterizing the cluster structure through the so-called cluster tree of the density. And the cluster tree is fairly easy to explain.
I look at every upper level set of the density. And then, if I pick this level set, I will say there is one cluster. I just look at the connected components given by the level set. And there is this one connected component.
If I look at this level set, there are these two connected components. I look at this one, there are these other two. And the whole collection of disconnected components per level set forms a tree. It forms a tree in the sense that all the components here are subsets of the components at the lower level.
And that whole structure, that whole continuum, is called a cluster tree of the unknown density f. And so a first question that Hartigan posed back in '81 is, OK, once this structure is defined, is it estimable? Can we estimate it from data alone?
He showed that it is possible to estimate it from data if we are on R, on the line. But as soon as the dimension of the data is 2, he didn't have an answer. So Chaudhuri and Dasgupta in 2010 then showed the first consistent estimator. By consistency, I just mean that as the data size grows, can we at least approximate the cluster tree as well as possible?
And Chaudhuri and Dasgupta in 2010 showed the first consistent estimator based on a clustering algorithm called single-linkage. Then with von Luxburg in 2011, we showed that a nearest neighbor graph, actually subgraphs of a nearest neighbor graph, also estimate the cluster tree. And we showed a slightly stronger notion of consistency. And I'll explain that also in a minute.
So here first, I'll give you an intuition, why a nearest neighbor graph can estimate the cluster tree. So here is a k-NN graph built on data from a mixture of two Gaussians. And so I don't know if you can see here, all the way here. So we just took a mixture of two Gaussians in R^d.
And you imagine a mixture of two Gaussians in R^d has these two modes. If you draw points from the mixture, most of the points will be concentrated under the modes, and you'll have a few points in between. And that's essentially all this is. And then we build a nearest neighbor graph on this.
Now once the nearest neighbor graph is built on this, the whole idea is the following. Start removing points in regions of low density. And the way you do it is you start with the points whose nearest neighbors are farthest.
Those points must be in regions of low density. So if I go back to the drawing I did here, there are very few points here. So if I wanted to look at-- if I build the nearest neighbor graph on this, the points-- there are a lot more points here. And so the points whose nearest neighbors are farthest will be the points in the middle somehow.
So I take those points and I remove them first. And I keep removing points in that order. As you're removing points, you can imagine that the graph will start breaking up into subcomponents. And the subcomponents would actually correspond to the connected components of the original density.
So again, here, if I had looked at the level lambda, and I start breaking up here, after a certain time, these points don't exist anymore and I have these two components. And that's exactly all we are showing there. Over time, as you keep going, you might eventually get false breakups in there. But the false breakups themselves, in some sense, can be identified.
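Here is a small sketch of that pruning idea under simplifying assumptions-- one level at a time, pruning a fixed fraction of the points; the actual estimators in these papers tie the connection radius to the level, which this sketch glosses over:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def components_after_pruning(X, k, frac_removed):
    """Drop the fraction of points whose k-NN radius r_k is largest (these sit
    in low-density regions), then look at the connected components of the
    k-NN graph restricted to the surviving points."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn = np.argsort(dists, axis=1)[:, :k]                   # k nearest neighbors of each point
    rk = np.sort(dists, axis=1)[:, k - 1]                    # k-NN radius of each point
    keep = np.argsort(rk)[: int(n * (1 - frac_removed))]     # survivors: smallest r_k
    adj = csr_matrix((np.ones(n * k), (np.repeat(np.arange(n), k), knn.ravel())),
                     shape=(n, n))
    n_comp, labels = connected_components(adj[keep][:, keep], directed=False)
    return n_comp, keep, labels
```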
So quickly, we have this theorem. The theorem says the following. It says: let f, the unknown density, be uniformly continuous, plus some additional mild conditions. And let k, the number of nearest neighbors, be greater than log n, n being the number of points you have in your data set.
The reason you have this here is simply that you can show that, in order for the original graph to be connected, you need k to be at least this large. And then the statement says this. Let C1 and C2 be two disjoint connected components at some level of the original density. Then with probability going to 1, all the points coming from C1 and all the points coming from C2 are disjoint in some subgraph of the original k-NN graph, one of the subgraphs constructed this way.
So in that sense, the k-NN graph, or subgraphs of the k-NN graphs [INAUDIBLE] level set of the unknown density. So that's the main result. Hopefully, that sort of makes sense.
However, I told you a second ago about these breakups here that you could get. And that could become a problem. We have to be able to identify false breakups.
And we also had a consistency result that says that there is a simple way to identify when the breakup is false in some sense, when the breakup is just due to the variability within the data. But how exactly we identify it, I can explain it a bit more later.
But the key idea is that if we are here in the-- if we get to the subgraph here, we can look back down and see that in this procedure, these things were connected earlier, were connected [INAUDIBLE] ago. And then whenever we have something like that, that it's connected not so long ago, we reconnect them. But I'll explain it a bit more later.
The key idea is that we can identify these things and reconnect them. And that's actually the stronger form of consistency that I talked about earlier. So in words, let me put it this way.
If I were looking at this mixture of two Gaussians, this is the cluster tree. It looks like this. It's just a tree that goes up and has two branches.
The first theorem says that we'll recover a tree that might have a bit more than two branches. However, a subtree of that is the cluster tree. The second theorem is saying that I can actually identify this as a bad branch and remove it.
So after this work, there were many refinements by various authors. And some of them are very recognizable in the statistics community. But that's as much as I'll say for now about nearest neighbor graphs. The key idea is that nearest neighbor graphs encode the clustering structure of the unknown density. So that's the key thing to remember.
So now, getting into this problem, how do we estimate all modes of the unknown density? So how do we estimate all modes? And to make it clear, I have to first tell you a little bit about what people have done in order to estimate the modes of an unknown density.
So what was known is the following. So suppose this is the unknown density and this is a few points from the unknown density. Much of the work on mode estimation, so estimating the maxima of a density, was started by-- I mean, it was there before, but much of the theoretical work started with this work by Sasha Tsybakov in 1990.
And in this case here, he tried to understand how hard is it-- how hard is the problem of estimating the modes of a density. But he considered only the case of a single mode, that the density had a single mode. So it has a single maxima. It doesn't have so many.
And much of the theoretical work after that considered a case of a single mode. And there is no theoretical work that I know that considered a case of multiple modes, because estimating even a single mode happened to be that hard. And here we want to estimate all modes, and estimate all modes optimally.
The practical procedures, such as mean shift, actually estimate all modes. The theoretical procedures estimate a single mode; they assume that the density has a single mode.
The practical ones estimate all modes, but they are hard to analyze, as I said before. There are some consistency results, but that's about it. For mean shift, for instance, we still don't have a finite sample result.
So for a finite sample, how well do we estimate all modes? We don't know. But the consistency results say that if the sample size goes to infinity, then eventually mean shift will estimate all modes. So what we want is a rate-optimal estimator that will be based on the nearest neighbor graph.
The program of construction is the following. A first thing for us to understand is to understand how well can we even estimate the unknown density? So there is the density f. We don't observe it. All we observe is a sample from the unknown density.
And we will view f hat-- whenever I say f hat, I mean an estimate of this unknown density-- and we'll consider estimators that are based on nearest neighbors, since we're looking at the k nearest neighbor graph. And the question for us is, how good of an estimator is this estimator? It turns out that they [INAUDIBLE] great.
There were tons of analyses of this density estimator. But the analyses were not quite satisfactory for our purpose. The reason why is because they, for instance, try to understand how well globally does-- or how well does f hat approach f on average?
And here what we want is understanding how well f hat approaches f at every point simultaneously. Because we don't know the location of the mode, and we sort of needed to understand simultaneously how well we could estimate this unknown f. So that's sort of the first step. That was the first step in our analysis.
The next step was, can we even estimate a single mode? Can we estimate a single mode with a procedure that is practical? The usual estimator for a single mode is this theoretical estimator. It says: start with some estimate of the unknown density f, and then maximize this estimate over the entire space, over all of R^d.
This is an optimization problem which, even if we knew f hat well, is a hard problem in itself: finding the maximum over the entire space. So a practical estimator is the following: x tilde is just the maximizer over the sample. Just pick the maximizer over the sample points.
This one is known to be an optimal estimator; however, it's not easily implementable. This one is easily implementable, but we don't know if it's optimal. So we know that it's consistent, but we don't know if it's optimal. And here, in this work, we also show that this simple estimator, this practical estimator, is also optimal.
Then we get into the problem of estimating multiple modes. As I said before, practical procedures are hard to analyze. And there is no theoretical procedure for this case here.
All right, so let's start with which rates we could get for k-NN density estimation. These parts, I'll just go over it quickly, because I just want you to have the flavor of what the result is. So here, what is a nearest neighbor density estimator? What is the density estimator?
Define at any point x: let r_k(x) be the distance from x to its k-th nearest neighbor in the data. The density estimator is defined in terms of r_k(x): f_k(x) = k / (n · vol(B(x, r_k(x)))). All it's doing is taking the mass of this ball, k over n, divided by the volume of the ball B(x, r_k(x)), which is, in some sense, what a density is.
And then, therefore, in order to understand how far is fk of x from f of x, if you look here, you see that the only thing that depends on f of x here is rk of x. So we just need to understand how rk of x behaves in terms of f of x. And that's the main idea from the analysis.
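Here is a minimal sketch of this estimator, evaluated at the sample points; the brute-force distance computation is just for exposition:

```python
import numpy as np
from scipy.special import gamma

def knn_density(X, k):
    """k-NN density estimate at the sample points:
    f_k(x) = k / (n * vol(B(x, r_k(x))))."""
    n, d = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    rk = np.sort(dists, axis=1)[:, k - 1]                 # distance to the k-th nearest neighbor
    vol_unit_ball = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit ball in R^d
    return k / (n * vol_unit_ball * rk ** d)
```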
So this is a fairly complex slide, but I'll just say quickly what it says. It essentially says that f_k of x is as close to f of x as about 1 over square root k, where k is the number of nearest neighbors. But it says that this happens uniformly over all x simultaneously, as long as k is greater than log n and is less than some quantity.
What is this quantity? The main thing you need to understand about this quantity, the quantity says how fast f changes in a small neighborhood. That's all it says. So we can characterize how far fk is from f at every point in space, just in terms of how much f changes in those points in space.
So that's the key thing to remember from here. And this is a pretty general result. It happens that this result holds with high probability. And this simple result recovers what's known to be the optimal rate of estimation for an unknown density. And the optimal rates were known from before, but not using this particular estimator.
All right, so single mode rates-- how well can we estimate a single mode? And can we estimate a single mode without searching over all of R^d? Remember, the commonly studied estimator is this one that looks over all of R^d.
There is also some other type of estimators called recursive estimates. But the key question for us is, let's look at this simple estimator which maximizes fk, which I described earlier-- which maximizes fk over just a sample. Can we show that this thing is optimal in some sense? And it turns out to be optimal.
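Assuming the knn_density sketch from above, the practical estimator is essentially one line; the synthetic data and the choice k=20 here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))       # synthetic sample; stands in for the real data
fk = knn_density(X, k=20)           # knn_density is the sketch from above
x_tilde = X[np.argmax(fk)]          # practical mode estimate: argmax of f_k over the sample
```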
So the assumptions-- this is a bit more technical. The assumptions are the following. They are very general assumptions, however. The assumptions are the following.
Let x be the maximizer of f. And we also have some regularity assumptions about f in the neighborhood of x: the Hessian of f, the second derivative, is well-defined in a neighborhood. And not only is it well-defined in a neighborhood, it's strictly negative definite.
That's all it says. And this is a fairly common assumption in optimization. And then there is some simple technical assumption, a global assumption, but I can get into it later if someone wants to understand it better. So the result says the following.
The result says the following: let x tilde be the estimator here, the estimator from the sample. Then x tilde minus x is at most about k^(-1/2); x tilde is at most about k^(-1/2) away from x.
And this holds as long as k is between log n and n^(4/(4+d)). And this turns out to be optimal, for k set to n^(4/(4+d)). This rate was shown by Tsybakov in '90, but for a procedure that was not implementable. And here we show that, OK, this rate is also achieved by this practical procedure.
So we now know how to estimate just a single mode from data. And the key idea here in the proof-- and I won't get too much into the proof. The key idea is all we had to show-- go ahead, you have a question?
AUDIENCE: Does x minus x [INAUDIBLE] some kind of geometric distance?
SAMORY KPOTUFE: Just the Euclidean distance in the space, yeah.
AUDIENCE: But the right-hand side k is the number of points. So are they even in the same physical unit, or how do you think about that [INAUDIBLE]?
SAMORY KPOTUFE: The squiggle here just means up to a constant.
AUDIENCE: Oh, so there is something there.
SAMORY KPOTUFE: Yeah, there is something there. So the key idea to show this was to simply show that there is always, with high probability, a point that falls-- if I have a random sample, that there is, with high probability, a point that falls very close to the optimal, and that falls at a distance within the optimal rate of the optimum.
And then the other part is that the assumption, this assumption, [INAUDIBLE] optimization recognizes that this assumption just says that f behaves like a quadratic near the optimum. And that's good enough to have enough curvature and enough flat [INAUDIBLE] to get enough signal and estimate the maximizer. All right, I won't go into the math, but let's get into the multiple mode rates.
But here I'll stop for a second. What we've seen so far is: if I have an unknown f and I observe a bunch of data, we now know how well the unknown f is being estimated. So that was the first part-- how well is the unknown f being estimated just using this nearest neighbor approach?
The next thing is, if I have a single mode, if the unknown f had a single mode, can I estimate it with a practical algorithm, with one that is implementable? Can I estimate it efficiently? That was that part, that this simple estimator that just picks the maximizer of fk in a sample is optimal, is rate optimal. So that was the first thing-- that was the second thing. We can estimate a single mode well.
Now, the question is, can we now estimate multiple modes?
AUDIENCE: [INAUDIBLE] your f is always [INAUDIBLE] continuous. The assumption is that the f--
SAMORY KPOTUFE: The assumption throughout is that f is continuous and uniformly continuous over the space.
AUDIENCE: If you do something like a [INAUDIBLE] regression or something like that, would you get the second result? Could you estimate the mode?
SAMORY KPOTUFE: Here?
AUDIENCE: [INAUDIBLE] with local [INAUDIBLE] nodes.
SAMORY KPOTUFE: So you are asking, what if I had on a different estimator? Yeah, we'll get the same thing. Because the key point was to understand that there is a point close enough to the mode.
SAMORY KPOTUFE: Then for those, I could adapt to higher orders of smoothness. It turns out that I don't need exact affinity-- control for the rest of the result. Because I sort of need to have good enough estimation at the mode, and OK estimations in the tails.
So now, the setup is this. I'm assuming that f has multiple modes. And I want to be able to estimate all of them, again with a procedure that is implementable. That's the key thing, because remember, we want in the end to compete with the heuristics, which not only are implementable, but are very hard to beat.
So the assumption that we're going to make at every mode is similar. We assume that at every mode, f is twice differentiable with enough curvature. Then the idea is the following. Remember that for a single mode, all we did was look at all the points in the sample and pick the one with the highest fk value.
Now, as long as we can isolate the points around every mode, we can do the same thing, isolate a bunch of points around every mode, and pick the one with highest estimated density. So now the key problem is how to isolate the points around every mode. So that's where the main intuition comes.
Suppose this were the unknown f. This is what I will do. I'll start with the highest possible density level. If I start with the highest possible density level, then I get a single connected component. And I estimate the mode just in this component.
The key problem is I don't know the location of the other modes, but I don't really care. I just bring lambda down. And as I bring lambda down, another mode appears. And I'll just tell myself: I've already estimated a mode here, because I know this component is contained in that one. I've already estimated a mode here.
I haven't estimated a mode here yet. And so I estimate another mode in here. And I keep doing this thing. And eventually, all the modes appear. And I estimate just a mode at every location where there is a mode. I hope that sort of makes sense.
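Here is a rough sketch of that top-down sweep, reusing the k-NN density estimate f_k from before and ignoring the pruning of false components; the number of levels and the bookkeeping are simplifications, not the exact procedure from the paper:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def estimate_all_modes(X, k, fk, n_levels=50):
    """Top-down sweep: lower the level lambda; in every new connected component
    of the k-NN graph restricted to {f_k >= lambda} that does not yet contain
    an estimated mode, take the argmax of f_k as a new mode estimate."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn = np.argsort(dists, axis=1)[:, :k]
    adj = csr_matrix((np.ones(n * k), (np.repeat(np.arange(n), k), knn.ravel())),
                     shape=(n, n))
    modes = []
    for lam in np.linspace(fk.max(), fk.min(), n_levels):
        keep = np.where(fk >= lam)[0]                      # upper level set at lambda
        _, labels = connected_components(adj[keep][:, keep], directed=False)
        for c in np.unique(labels):
            members = keep[labels == c]
            if not any(m in members for m in modes):       # no mode estimated in here yet
                modes.append(members[np.argmax(fk[members])])
    return np.array(modes)
```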
The key thing, however, is I need to identify connected components. And how to identify the connected components was the first part of the talk. The connected components are embedded in the nearest neighbor graph.
So I can just parse the nearest neighbor graph. And from parsing the nearest neighbor graph, identify all the connected components, and I can implement this simple idea, this procedure. Now, here is another question.
The reality is that if this were the unknown f, the f hat that I compute is not going to be so nice and smooth. It will be jiggly. So again, that goes back to having false branches in here. So being jiggly means if this were the real thing, maybe I have this.
Maybe I have this in the f hat. This thing is the real f, but I have these false bumps in the f hat. All I have to do is identify the false bumps.
The good thing is, from the first part, I know how far f hat is from f at every point. So I know each bump is at most about 1 over square root k. By just knowing how big the bumps are, any bump that is below 1 over square root k I view as a false bump and I remove it. So it just allows me to smooth out the process and recover just the tree, not the false branches.
All right, now there is a small point I'll have to make here, which is the following, which is that in order to recover all these things, I also need a bit of separation between the modes. I need a bit of separation between the modes, because if there is no separation between the modes, then it's very hard to identify a mode by itself.
So all I mean is if I have two modes that turn out to be very close to each other, then it's hard to identify them from sample. Because I would only have a few samples in between to identify it. So the result relies on a separation.
And we say that a mode x is r-salient if there is a valley of width about r between it and any other mode. So the theorem says the following: if a mode is r-salient, then there is a sample size large enough, depending on r, such that every such r-salient mode is recovered at an optimal rate. And so just through that procedure, we now have a first algorithm that is optimal for all modes simultaneously, at once. Any questions up to here?
AUDIENCE: So how deep [INAUDIBLE] valleys [INAUDIBLE]?
SAMORY KPOTUFE: Yeah. So they need to be deep also. That's a very good question. The valleys need to be deep also. But because we made the assumption of curvature around every mode, that curvatures assumption gives us the depth.
OK, and here is just a theorem that says what I said earlier, that we can identify all the false modes and remove them above a certain level. Any other question at this point?
So what we've done so far is we now understand essentially how to estimate all modes of an unknown density. But the assumption so far has been that the density is maximized at points in the space, not maximized over general regions of the space, not maximized over general sets. And now, what we want to understand is: how can we estimate a modal set-- which is, by definition, a set on which the density is maximal, locally maximal-- how can we identify this set from data alone?
And so all we'll do is follow essentially the same intuition as here, but we'll have to do a bit more because we don't know, by default, a priori, what is the shape of these sets on which the density is maximal, or what is their dimension and [INAUDIBLE]? So here, again, to remember, this is the generic cartoon. Here is data from some density.
The density is maximal on these two rings. And these two rings are my modal sets. And the first thing I want to do is be able to identify the two rings.
Point modes, which we've discussed so far, can be viewed as zero-dimensional modal sets, while the more general ones are higher-dimensional modal sets. So the key changes to the previous procedure are simple.
Let s be a modal set. So here, this is s. This is a place where the density is maximal. So s is just this segment, this line segment.
The way we estimate s with s hat is the following. Let's say this is the density estimate of this density. I'll start off by just-- let's say I've isolated this mode, the points coming from this area. I start off by just first identifying the highest points, the points with the highest estimated density value.
Once I've identified that point, I just look at the points that are within about 1 over square root k of it. Why 1 over square root k? Because that's what we said is the maximum fluctuation at a point between the density estimate and the unknown, the true density.
And then I take all those points that are within the fluctuation distance and throw them in and call them s hat. That's all. And now, for multiple modes, modal sets, you can repeat what I explained before. I go down the levels. I isolate them on [INAUDIBLE], and do this procedure.
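Here is a minimal sketch of that step for one isolated component; the constant c and the exact scaling of the fluctuation band are placeholders, since in the actual procedure the band comes out of the density-estimation analysis:

```python
import numpy as np

def modal_set_estimate(component, fk, k, c=1.0):
    """Within an isolated component (given as an array of point indices),
    keep every point whose estimated density is within a ~1/sqrt(k)
    fluctuation band of the component's maximum."""
    top = fk[component].max()
    band = c * top / np.sqrt(k)       # fluctuation scale; c is a placeholder constant
    return component[fk[component] >= top - band]
```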
So there have to be some key changes to the analysis, to the previous analysis. In the previous analysis, I assumed that the unknown f was twice differentiable in a neighborhood of the point mode. Now we are looking at more general optima. And I cannot talk, necessarily, about differentiability at the boundary of the optima.
And so this is a technicality, but I cannot talk about differentiability at the boundary of the maxima. However, I can still talk about curvature and smoothness. And it turns out that as long as f is uniformly continuous in some neighborhood of the maxima s, there exist upper-bounding and lower-bounding functions that act as curvature and smoothness for this [INAUDIBLE].
And then we can express now a result just in terms of these functions. All this is saying is-- I guess I'll draw again here. All that is saying is that I have f like this and some maxima s-- this is s-- there exist two functions that envelope my entire f. And those functions are enough and serve as the curvature and smooth [INAUDIBLE].
So just the existence of those two functions is enough for us to establish a rate of convergence. And the rate now is in terms of the Hausdorff distance between the two sets. We can say that the Hausdorff distance between the two sets is in terms of this lower envelope, and it goes to 0. So we have consistency: we can estimate s consistently with s hat, provided again that k is greater than log n, but less than a function of u, the upper-bounding envelope.
And then we get similar pruning guarantees as before. By the way, if I let the lower bound and the upper bound be quadratic, I recover exactly what we had in the case of the point modes. The whole reason [INAUDIBLE] went through here is because if I have a set of this type, there might not be differentiability at the boundaries of this set.
All right. Any question here? OK, so the key [INAUDIBLE] is that-- if you want to forget everything and just remember this, the key point so far is that we can estimate all modal sets, very general sets, where a density is maximized. We can estimate all these modal sets at optimal rates. That's the key point.
AUDIENCE: So here, [INAUDIBLE]? Is it connective or not?
SAMORY KPOTUFE: The d is just the dimension of the space. It's the full dimension of the space. Because here, by default, we're just assuming that the density exists in the full space. So the support of the [INAUDIBLE] is full dimensional.
However, that's a good question, whether we might let d be just some dimension of the neighborhood. But in this analysis, no, because we're caring about neighborhoods of the sets. And the neighborhoods are full dimensional.
All right, so the resulting procedure is the following. QuickShift++ is just the resulting clustering procedure. QuickShift++, I'll explain it quickly. First, you estimate modal sets, s1 through sk, of the unknown density.
You estimate these modal sets of the unknown density. And then once you've estimated the modal sets, you're going to cluster every point around each modal set. And the way you cluster every point around each modal set is just by doing a gradient ascent to the modal set.
So by doing a gradient ascent, I assign every point to the closest modal set. And rather than an exact gradient ascent, you do an approximate gradient ascent. An approximate gradient ascent just says the following. I'm not going to do a gradient ascent over the entire space, but I'm going to do a gradient ascent just over the points in my data set.
So it says follow a sample path. Starting at any x0, I follow a sample path to si, to some modal set si, by just seeing whether f hat is growing along that direction. And that's all. And so that whole procedure, we call it QuickShift++. And that's sort of the result that I was showing earlier.
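Here is a rough sketch of that sample-path ascent, assuming we already have the k-NN lists, the density estimate f_k, and labels marking which points belong to which estimated modal set; this illustrates the idea rather than the exact QuickShift++ rule:

```python
import numpy as np

def assign_by_sample_ascent(fk, knn, modal_labels):
    """Assign every point to a modal set by hill-climbing over the sample:
    from each point, repeatedly step to its highest-density k-NN neighbor
    until a point already labeled as part of a modal set is reached.
    modal_labels: cluster id for modal-set points, -1 elsewhere."""
    labels = modal_labels.copy()
    for i in range(len(fk)):
        j, path = i, [i]
        while labels[j] < 0:
            best = knn[j][np.argmax(fk[knn[j]])]   # highest-density neighbor of j
            if fk[best] <= fk[j]:                  # cannot climb any further: give up
                break
            j = best
            path.append(j)
        for p in path:                             # propagate the reached label along the path
            labels[p] = labels[j]
    return labels
```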
And this is a very recent work at ICML with Jang and [INAUDIBLE]. Here, again, is now an expanded version of the result that I had shown earlier. And this is DBSCAN, this is Mean Shift, this is QuickShift, and this is QuickShift++.
And again, we didn't go and pick the data sets on which we happened to work well. We just picked a bunch of data sets from the UCI database. And we keep doing so much better than a lot of these heuristics, a lot of the competitors.
In a few cases, Mean Shift did somewhat better. In a few other cases, QuickShift did essentially just as well. In some of the cases, such as page blocks, all the procedures, for some reason, were not doing so well at all. And yet, we still do reasonably well in terms of clustering.
And so what this is saying is that additional stability is brought by this approach: rather than just trying to find single points at which the density is maximal, we are first finding whole regions where the density seems to be maximal, and then we cluster around those regions. What Mean Shift would have done is find single points where it thinks the density is maximal and cluster around those points. The added stability here is giving us a lot better performance.
AUDIENCE: So what is the dimension of these data sets?
SAMORY KPOTUFE: Sorry.
AUDIENCE: What is the dimension of these samples in these experiments?
SAMORY KPOTUFE: So the highest here was about 200, 300 dimensions.
AUDIENCE: So [INAUDIBLE] one in the--
SAMORY KPOTUFE: That's a very good point. The rates are exponential in dimension. That's what you're getting at, right? Yeah, the rates are exponential in dimension. And they are sort of the worst case rates for the assumptions that we made.
It turned out that in practice, even up to 200, to 300, we did fairly well. But I believe that the reason why is the following. The rates of estimation are the rates of estimation of modal sets. However, the problem here is that of clustering.
Even if we don't estimate the modal sets that well, as long as we estimate it approximately well, it's good enough, as long as the clusters are far from each other. Does that sort of make sense? So that's why we believe that we're doing well, even when the dimension is relatively high.
AUDIENCE: But do you think it has to do with the [INAUDIBLE] of the inherent--
SAMORY KPOTUFE: It could be. It could be. I haven't thought about that, but my first impression was that, potentially, the clusters are far from each other, and I don't need to do an exact estimation of the modal sets.
So here is a different picture that I want to show. What is this picture? OK, here I could offend some of my colleagues. But in clustering, the picture is the following, is that in a lot of results on clustering, people just give you this.
They tell you, this is how we do compared to the competitors. They don't tell you how they tuned it. It's very possible that I tuned this as well as I could, because I had to publish my paper. And I didn't tune these things so properly.
And so this here is actually just showing you the behavior as I go through all the tuning parameters, all the ranges of the tuning parameters of the other algorithms. So I take DBSCAN, and the main parameter is epsilon. I go through the entire range of how it behaves. Same thing with Mean Shift. And same thing with QuickShift++.
I go through the range. And what I'm showing on the other page is the best possible result over the tuning parameter range of each competitor.
AUDIENCE: So this is cross validation for [INAUDIBLE]?
SAMORY KPOTUFE: Sorry.
AUDIENCE: Cross validation [INAUDIBLE]?
SAMORY KPOTUFE: That's the problem in unsupervised learning, right? We cannot do cross-validation, because we don't have a ground truth a priori that we can cross-validate against. The scoring functions themselves are not easy to cross-validate against, unlike in classification and such things.
So that's why I did this. I didn't do cross-validation. We do have labels, labels for the testing data. And so with the labels from the testing data, we're computing the score. But we go through all the possible choices of parameter settings.
But we don't know how to choose the parameters a priori. And so for all the possible parameter settings, I'm reporting the best possible that was done. The good thing with QuickShift++ is that in a lot of cases, the range of parameters on which we are doing well is very large. k is just the number of neighbors.
And it turns out that when we look back at the theory, the theoretical range is also very large. It goes from log n to a root [INAUDIBLE]. And so, for that whole theoretical range, and for a lot of the data sets we get, that for a whole theoretical range, we plateau somehow. The performance remains fairly good. So the procedure in itself is very stable in that sense.
We started also investigating a few other applications. In medical imaging, image segmentation, in computer vision, and also problems having to do with outlier detection in smart cities, internet of things, and such things. Here I'll just show you some quick results for image segmentation.
QuickShift-- OK, before I even start there, we all know that today the go-to procedure is deep neural nets for almost everything. However, in image segmentation, deep neural nets don't work as well unless you also have labels. So there is also what's called semi-supervised image segmentation, in which deep neural nets work very well.
But in purely unsupervised image segmentation, one of the best things is still QuickShift and Mean Shift. And so we're comparing against QuickShift. And what we're reporting here is the best tradeoff between over segmentation and under segmentation.
And some of our results here are very qualitative, but we get pretty good results so far in image segmentation for this problem.
AUDIENCE: What's that density function?
SAMORY KPOTUFE: Sorry. Oh, the density function here, you actually just pretend [INAUDIBLE] the intensity is a form of density.
AUDIENCE: The color doesn't matter?
SAMORY KPOTUFE: So you could use color. But often, you do a first transformation. That gives you a general intensity over seven dimensions [INAUDIBLE]. And then you perform this thing on there. But in black and white, it's easier to understand.
All right, so some future questions are about adaptation. I think you were asking such questions. Can we automatically choose the parameters of the procedure? Even though the procedure is very stable to the choice of parameters, we would still like to be able to just look at the data and make the optimal choice of parameter. So that one is a fairly hard problem so far.
High-dimensional clustering, we would like to be able to go beyond 300 dimensions to much higher. And in those cases, we need to come up with some notion of feature selection within unsupervised learning. Or maybe go through spectral and projection approaches. But the right approach is not so clear so far. All right. That will be all, thank you.