Lorenzo Rosasco: Learning Theory, Part 3 (variable selection (OMP), dimensionality reduction (PCA))
May 30, 2014
May 30, 2014
All Captioned Videos Brains, Minds and Machines Summer Course 2014
Topics: Determining which variables are important for prediction (e.g. given n patients and p genes, which genes are most important for prediction), sparsity (only some coefficients are non-zero), brute force, greedy approaches/matching pursuit, basis pursuit/lasso, unsupervised learning, dimensionality reduction, PCA
LORENZO ROSASCO: So that's where we stand right now. So part one was about-- they give you a very simple algorithm that might have problem in high dimensions, but give an opportunity to show the bias trade-off, which is the basic idea behind parameter tuning, which is the problem in most-- machine learning. One of the biggest problems of machine learning. [INAUDIBLE], and then essentially looking at dimensionality as a motivation when we moved from local methods to global methods. Produced linear models, the computation for linear models, in the course framework. And then now on to dictionary learning-- sorry, dictionary learning. Dictionaries and kernels as ways to move away from the linear model to something more complicated.
My little sketch here when I do this, when I do this, I'm aware that it looks like a shapeless trick. And that's kind of how I want to present it to you. Because the theory of this stuff is actually pretty big, what I wanted to point out is mostly-- like believe me and check that you can do it, because you can run the code and see how it works. And just appreciate how you can use it.
The math behind is actually very nice, but it really would require, like, an hour just by itself. In the course, we do it from in the fall. And we have three hours on this internet right now. So it's long. And I just talk about the first type and don't even talk about the last three. So this-- I'm OK if you see it kind of as a trick. And if you don't believe me, just try on the data today and get it code to see how it works in practice, and convince yourself it does the job.
One thing I didn't mention here which might be interesting is that, again, there is this big problem of how you choose the kernels. And there is a whole lot of literature here related to kernel learning and multiple kernel learning. So how you can-- instead of one, you can mix more than one kernel, and burn one kernel out of a set of kernels, and so on. Again, it is kind of a variation over this theme. You have to add on top of this, or you don't have to start from scratch.
Is there any question about these first parts? So the last section somewhat has a slightly different flavor. The first part is something very recent. This one I'm telling you know is something very old. And I'm just showing to you now because I think it's-- I don't know, I like it. It's refreshing. It's still the main building blocks from any publication. If you understand it right, you can use it to develop a lot of different [INAUDIBLE].
Both these two topics, variable selection, OMP stands for orthogonal matching pursuit, or matching pursuit, is the one algorithm I'm going to tell you about. And dimensionality reduction and PCA are basically somewhat related, at least in my view, which is biased by the kind of machine learning application I have admired. They move away from black box kind of approaches where you basically say, I just want to make a prediction, and I want the prediction to be good. I don't want to know how you make the predictions, but just make the predictions.
They move away from this into a slightly more exploratory world, where you say, I want to get the prediction, but I also want you to tell me how you've made them. And I want to understand more about the data. There are a lot of applications now where this has become the method. Maybe if you want to recognize faces, you don't really care, you just want to do well. But a lot of applications, especially in biology-- and this is kind of perhaps a bit outdated, but it's just to make a point-- are really like that.
So this is the application-- this is a classical problem. Is there any of you working on genomic data? So this was a classical example. So this-- suppose that you have a bunch of patients. And you have disease type one and disease type two. One is much more aggressive than the other. And so when a patient arrives, you want to check whether he's type one or type two to treat him the best possible way. Say this is-- for now, I think it's kind of a dream, but that's kind of the story.
If patient arrive and you take a sample for, say, a tissue, and then what you produce is this kind of stuff here. This is what is called a micro-array, which basically gives you the expression of each gene in the DNA of the person in the affected region.
So basically, this is literally an image. And you can unroll into a big vector. p here is, roughly speaking, the number of gene in the DNA. And each number, x1 is the expression of the first gene in the first person. These are just first person, second person, and so on and so forth. So it's just another instance of the problem I showed you before. Before, there was faces. Now, it's patient and their genes.
Let's say this is, again, plus minus 1, type 1, type 2. And so one question is again, I give you the patient. Give me a rule that given your patient is going to tell me what is going to be more likely, type 1 or type 2?
But a lot of times when you go to a doctor, actually if you give them this rule, he just doesn't believe it. They're like, all right. How did you do that?
You say, well, I have the super-vector machine. They're like, right. And then they ask you, OK, which kind of information did you look at? Because I want to check it out. I want to go and check and double check that that gene is actually important.
They typically ask you a list of genes. They want to know which gene is important for what.
One reason because, one, they don't trust the black magic of statistics and machine learning. But another reason is for example, that these measurements are awful. They're borderline reproducible. Maybe they're a bit better now, but at the time when I would look into this kind of stuff, if you do the same experiments twice, the amount of our data was ginormous. So what they wanted to do is, if I tell you that this one gene is very important, I'm going to go in, check with another kind of technology, which is more expensive. So I cannot do it for 20,000 genes, but I can do it for 5, 10, 20. And I'm really going to get much more careful measurement, and then I'm going to check if this guy was really active.
Because maybe here it looks like it was active, but it was not. Maybe this bunch of [INAUDIBLE]. So they typically want to know which variables, which genes, which of these columns were important to actually make this decision. And they want to make the prediction.
Important theory is important as-- with respect to make predictions. But the two questions are slightly different. And one way to see this is they want an interpretable model. They just don't want to make a prediction. They want a little bit more. How do you solve this problem?
So turns out that the problem is possible in statistics and it's called variable or subset selection. And the simplest way to do this is to just assume the model we'll be assuming so far, the linear model.
Notice that now-- so if you just unroll it, this is our weight. And these are exactly-- say, if this is patient 1, these are going to be his gene expression. And those measurement do matter for us. We want to know which of those. We don't want to combine them. We don't want to put them in [INAUDIBLE]. Which of those is important?
And the way we think of importance here is if the associated w is big, then it is important. If it's not, it's not important. So we kind of look a bit into this and we basically say, to say if a person is likely to be sick or not, we're going to look at this gene expression and weight them. And let them somewhat vote for whether this guy is important or not. And some of the weight of the vote is related to this quantity.
So this is the other definition of variable selection, which is in a classical book. I like it. You can point out one point, which is this is a hard problem. It's a harder problem. Detecting which variable is in play is a much harder problem. And so one new thing in the last few years that we start to have kind of clean protocols and way of doing, which are provably effective at this under certain conditions and efficient. Whereas, I would say until 5, 10 years ago, it was basically just a series of heuristic techniques that would not be-- can't be guaranteed.
And it's easy when you have no guidance in such a hard problem. Imagine to just fool yourselves in believing that you find a question. Again, the classical case of this was the one where you had something like tens of patients at most. And thousands, tens of thousands-- 2, 3, 4, tens of thousands of genes. So in some sense, you have a very, very high-dimensional problem. And there is no sparsity in this matrix. It's a dense matrix. It's a big, sad, dense matrix. So how do you do that?
So the keyword in the picture is sparsity. How many of you know what sparsity means in this context? Good enough?
Basically, we say that this function or this vector is sparse if some of the positions are 0. The sparsity level is how many coefficients are not 0. So the game now is going to be-- OK? I'm hoping that the best possible function that allow me to predict whether a person is going to be sick or not is going to be sparse. So I don't actually have to look at all the possible genes, but I just have to look at a subset.
It's a prior. We hope it's true. And we try to enforce it. And we check and we try to enforce this kind of assumption.
LORENZO ROSASCO: Yes.
AUDIENCE: Sparsity with respect to the weight.
LORENZO ROSASCO: Yes, the weight. Each is a gene measurement. And then you say, if the associated weight is 0, then it doesn't matter. And this is specific for each input, but this is just for any possible input. So whenever you find that, you get new input. If wj is 0, you would never look at that measurement at all.
So there are some disease that turns out to be controlled by a handful of genes like 1, 2, 3. There are other where you have complicated pathways and process that are not just related to one before. But this is also something. OK.
So I'm prepared be burned out. So now you help me out.
Suppose that I give you an infinite notational power. So let's not worry about notational, per se. You have infinite computer and now you want to solve the problem. How would you do it?
AUDIENCE: All possible combinations of variables and run least squares.
LORENZO ROSASCO: Right. So he just take all possible combination of variables. And so I would say take all individual variables and say run all least squares or just least squares. And if they are couples, and if they are triplets, and just keep on going. And clearly-- well, I'm not sure about clearly. But one [INAUDIBLE] is actually optimal.
If there is a sparse solution to the above problem, you're going to get it. But clearly, this combinatorially explosive and it get complete unfeasible, even, I think, after-- you might go up to 5 or 10-some tricks, but [INAUDIBLE] soon. And this is kind of the brute force approach. You try all possible combination. And it turns out that this can be actually written in this way. So as a least square problem where this is not just a linear function. And this is the supposed 0 norm.
Basically, it's not really a norm and it just count the number of coefficients which are different from 0. So give me a vector. 0 norm of the vector is how many w, j are different from 0 in that vector. It's just good to know that it exist for what we're concerned right now.
The other thing it's good to know is that the non-feasibility of the problem is essentially concerning the fact that this norm it's not a convex function of w. So this is in a non-convex problem and the associated-- so you cannot be solved. And it turns out that it cannot be solved in polynomial.
So essentially, once you're fine accepting-- and nobody complained, but you should have-- linear model. If you're willing to accept the linear model, then you know that you could approach the problem this way in theory, but you cannot do it in practice. And the problem becomes, can we find practical approaches for which we can try to find an approximate solution?
And then ideally, derive some kind of guarantee to tell me how good the solution is. So how would you do it? You cannot try all possible combinations. So how would you do it?
AUDIENCE: Would you fit normally, then start cutting off weight below a certain pressure?
LORENZO ROSASCO: One possibility is do least squares, like the one I show you before, the linear least squares. Fit the best way you can. And then you already have a lambda. And then you put another parameter. It could be some kind of threshold. And basically you say, I get the w-vector of my linear model and I look at each entry. And if it's small, I throw in two. And then after this, I want small. And now I have two values. So that's kind of one way of doing it. And I'm going to try to point out in a second.
It's kind of hard to prove anything about this. And you have two parameters. You have a parameter. In some sense, you have this one parameter that controls the stability of the problem. Then, this other parameter controls how many coefficient you want to throw away. And that's not completely clear, how they interact.
So for example, I'm guessing it to be a similar thing. Why did you put the stability one to be sure that you were not sifting too much? Maybe it's the case that the other one could put also the same thing.
AUDIENCE: That was going to be the other idea, [INAUDIBLE].
LORENZO ROSASCO: Right. And this is one approach. And it's exactly the approach I'm not going to talk about. Show it to you. So basically, remember that somewhat a flattering way to look at this from a probability point of view is basically with an exponential here instead of [INAUDIBLE]. And then this is basically the prior.
If you just find not with probabilistic theory, think of this still as a prior. And so what he just said is, can you put the prior and it's going to force the coefficient to be 0?
Notice that when you look at [INAUDIBLE], it's kind of doing something like this. Because it's saying that you want to look at something like this. And this is just wj squared.
So when you look at [INAUDIBLE], you're basically saying, I want this thing to be small. And so this is basically putting a budget on my w.
The main point is that you can check. I'm now going to give you an explanation of this, that most of the thing will be different from 0. So you do a shrinkage, but you don't make them 0. You just make them small. And then you do have the perfect threshold. So the question is, can I actually make them 0 like this would do?
The 0 norm, the 0 product would do. And it turns out that yes, you can do it. There are a bunch of different ways to show it.
You basically have to replace this 0 norm not with the squared norm like this, but with the l1 norm. It's the sum of the absolute value of the coefficient. And it turns out that one can prove a couple of things.
One is this actually does shrink the coefficients to be exactly 0. How much 0 depends on lambda. If lambda is very big, most of the coefficients will be 0. If lambda is small, just drifting along here.
The other cool thing is that this is actually-- when you replace your 0 norm with the 1 norm, the problem you get is convex. So people describe this operation as a convex relaxation of the original problem. It's not strictly convex. So it might have a flat region. It's not just a nice smile. It's flat region. So there might be more than one minimizer. But it is convex. And so it's not computational and feasible anymore. And there are really a lot of methods.
In the last few years, a class of first order methods called proximal methods, [INAUDIBLE] method emerged as one of the best way of doing this because they're simple and they do scale for very large data sets. These have different names because introduced in signal processing and in statistics around the same time, '95 and '96. The terms called basis pursuit in signal processing or lasso in statistics. It's one of the biggest topics. It's been in the last few years. They've been studying all possible flavors. And it turned out that this is something that people in [INAUDIBLE]. Anyway, people have been doing this kind of stuff from a practical way since the late '70s and perhaps even before. Ben Logan is the name that people mention every once in a while in one of the [INAUDIBLE].
The other thing, this kind of stuff became very popular lately also because all of a sudden people were not only saying, I'm going to replacing this with this new product. It becomes convex, and I can do something. But how far I am from this 0 norm, which is the true solution. How much? How big is the price I pay for my approximation?
And the bottom line is that there are results in the last few years that basically quantify exactly how much you pay and tells you the condition in which you pay nothing. You can actually prove that not for any possible problem, but there is a large class of problems for which relaxing the problem this way allow you to get the exact solution. So this is not for-- the hardness of the problem is not for any problem, but it is for a large class of them.
And people that contributed to this are [INAUDIBLE] is one of the main people. And there is a bunch of similar result by [INAUDIBLE], where they kind of laid down the theory of this kind of stuff. And here, I'm tipping the iceberg of this. Although, I chose not to give you more detail about this because to go on from this, I have to either do some kind of geometric idea, which are kind of abstract, or into computation that would require me to introduce causes of innovation. OK.
So I decided not to do that. And instead, to concentrate on another class of methods, which are so-called [INAUDIBLE] methods. And they're much easier to explain. And they, more or less, enjoy the same kind of properties that convex relaxation do have.
So remember that the way we describe this was, OK, you see I have a least square machinery. I can take all possible subsets of variables and run it-- single, double, triplets, and going. Still, you cannot do that.
If you want to try to approach a problem in a similar, what can you do? So want to go in this incremental, inclusive direction.
LORENZO ROSASCO: I was thinking of something [INAUDIBLE]. So one thing you can do-- you start the same way. You start to find the first single variable that matters. And then you basically check whatever is not explained by that variable. And then you start again. And then you start. And then you start.
The hope is that you're shaving off information by including variables. And then there are all kinds of flavors. Because you can select one variable, then you can put it back in the bunch or you can keep it out.
And then once you have selected one variable, you have one coefficient. Then you select another variable, you have a second coefficient. And you can decide, should I recompute this solution, yes or no?
But the general flavor-- and I'm going to show you one instance right now. So the general flavor [INAUDIBLE] is, I look at the first variable which explain my data. It gives me the best prediction in the least square sense.
And then I just basically look at the residual. Whatever I'm not able to explain. And then I rerun the same procedure on the residual, which is whatever is left out of my prediction. These kind of methods or matching pursuit techniques. And with respect to the convex relaxation approach, they're basically easy to explain. They're elementary from a computational point of view. And that's why I chose to show you this guy.
They're kind of slightly annoyed right now. But this is what's going on. So remember that you have this matrix, which is n rows and p columns, which columns is size n. And then you have the vector above, which is also [INAUDIBLE] n. So what you do is you basically say, OK, let me see which direction gives me the most explanation of my vector. This would be this one.
And then you can say, OK, if I do use this as the prediction of this Yn, what's left? Basically, it would be something like this. Then you keep on going. So the general idea is that you have a notion of residual. And then the first round, the residual is just the vector itself. Then you find the variable most related with the residual. Again, at the first round, you find the variable, which is most related or most expressive of the all.
Then, once you have the variable, you compute the associated solution. You put the index of the active sets of the variables that you like. You compute the new residual, which is whatever is left. And then, you start again. This clear?
Just-- and wait.
AUDIENCE: And so for stopping rule then, would you just keep-- you add in the cost function as well and do it until you find something where the total [INAUDIBLE]?
LORENZO ROSASCO: So it turns out that the-- let's look at this better. The only three parameters you have is the number of iteration. OK? There's nothing else. So you literally just put this.
And it turns out that the number of iteration you do is actually related to the number of variables that you're going to include. So that's actually the regularization parameter. So it's not an explicit penalty. It's not an explicit prior regularization. It's just a procedure itself that by including new variables is going to potentially increase the complexity of the model. So by letting the algorithm run, you're actually including more and more variables and including more complexity. How do you choose it?
Well, same as before. How do you choose k and kn? How do you choose lambda or sigma in least squares?
Cross-validation. That's the same. In some sense, I think good question is always, what is the regulation parameter? Once I know it, then I know that I don't know how to choose it, but I can do cross-validation.
Pick one relation parameter which is here, use it. In some sense, the number of iteration here plays the same role of lambda. Implicit. Not [INAUDIBLE] penalization, but that's it.
AUDIENCE: When you're doing this, do you have the axes that you're comparing along pre-defined? Can you choose the first one and then just go orthogonal to that afterwards?
LORENZO ROSASCO: Just this. Just this guy. So this is the data matrix. The x's in that picture are really the column of the data matrix. And in fact, I put them orthogonal. They're not. They're just a bunch of vectors. Typically, you have many.
The [INAUDIBLE] is that you have something like this, and then this is the Yn. So this is kind of misleading. I just did it for the sake of simplicity. But you have many, many possible-- in this equation, if you think about it, you have all of these possible directions, which are many more than n. So it's a very crowded space. You're going to have an orthogonal axis. You actually have very crowded whirlpool of directions.
Then, you look at one. Then, you look another one and go on.
And for this to make sense, you have to normalize the column of the matrix. Otherwise, the length of the column will tell you something about the size of the coefficients. And you don't want [INAUDIBLE]. OK.
And this is actually the code of the whole thing. So don't stare at it yet, just notice how long it is. This is it. This is really the algorithm. You just run this.
And the only part where you have to turn on your brain is this. And I have to explain it to you quickly here. The rest, [INAUDIBLE] algorithm. This is the selection of the one variable, so let's wait a second to look at it.
It's included in the active set. This is the update of the w,or the solution. And this is the computational [INAUDIBLE]. Then you start from scratch. So I have to tell you how I select the solution and how I update the coefficient. It makes sense?
And at the end of this, I will have a set, which will be an important one. And a w, which is a vector I can use to make predictions. OK.
Now, the part that is slightly annoying, this is how they arrived to the selection rule. And it's kind of unrecognizable. But down here, I wrote you-- very small. Again, this is just a little exercise you can do and this is the solution.
You can show that what you're doing here is actually just computing which variables gives you the best least square approximation of the residual. This expression here is actually the expression of the best approximation of the residual by a single variable.
So what you do is I solve the problem with single variables and I select the variable to give me the best possible solution. The usual trick here that one thing is fixed, so you can go from a min to a max. So that's what's going on there. Nothing else. So don't stare at this too much. This is find me the single solution, give me the best least squares. And this is just one way to write it. And this is how long the calculation has to do to show.
If you solve a one-dimensional problem, then you have one variable, one column, and one coefficient, one number. Once you have to update the solution, what you do is that this decay is the coefficient associated to that one variable that you chose.
This is just the orthonormal element. So you have the vector of 0 with 1 in the position of the variables that you chose. And so now you put this number in that position. And then you add to the vector you had before. So this is a number. This is a vector with 1 in the position of the variable that you chose. You form this actual. You form a long vector, which is going to be all 0's, but on position. And then you add it to the one you have before. So this is the simplest, simplest, simplest flavor of this algorithm is what it's called matching pursuit. It's one example for forward stage-wise regression, which is an idea that was around since forever. And it's this forward method.
So you forward because you include one variable at a time. And this is the way you do it. There's a whole set of things called backward method where you basically start from the whole thing and shave it off.
Forget about this. This doesn't work great. The version which is somewhat used to-- and I didn't say. Sorry, didn't say what you do for 0, but it's kind of trivial.
Once you have the coefficient at that point, you check. You just make your prediction. And then you subtract your prediction to the residual. So the first round, you just have the vector and then you just subtract your prediction. The second round, you have what's left and then you keep on subtracting your prediction.
There are a bunch of different flavors where the main difference is really the way you update your coefficients. For example, one thing you could do-- well, you tell me what you could do to update the coefficient. And there's lots of different way, for example.
AUDIENCE: So your e squared is the variables in your set?
LORENZO ROSASCO: Yeah. So you have an active set of variables here. And instead of doing this, you just do least squares on that variable. What do you probably gain and what do you lose?
What you gain is that now, suppose that you have variable 1. Then you put variable 2. Maybe you want to change the weight on variable 1 now that there's another variable. And that's what you're doing, you re-compute.
The problem is that each time while the solution of this one-dimensional problem is trivial, is that each time you have to increase and build a bigger system, you are solving a bigger and bigger and bigger system.So you pay a price computationally.
This thing we just mentioned is essentially what is called orthogonal matching pursuit. This can be shown in practice to be much more stable and works better. And I think today we're going to give you the code for both, or just the second one. OK? And you can play with it.
The key point here is really that by doing this kind of procedure, you're actually approximating the solution of this. The condition under which you can prove that there are classes of problems for which the previous procedure gives an exact solution to this problem.
So these are all I want to tell you about this. This is really a [INAUDIBLE] talk about here. Literally, I think in the last 10 years, there have been an explosion of work in this direction. So there are tons of-- some of it is a bit more theoretical. Some of it's a bit more practical.
From the theoretical point of view, this drawing here is the one that if you have gone to a talk with the sparsity you have seen before. Because to some extent, I didn't push this perspective because I just want to keep it computational. But behind of this is the idea of how you can solve large classes of undetermined system where you basically have a matrix, a vector of measurements, and a vector you have to look at, which is much longer than the vector of measurements. So there is no hope to solve this problem unless you make assumption.
So maybe with the assumption we made is that only a few of the entries of this vector are different from 0. That's what we've been doing so far with our methods. Then you can ask, well clearly, if this number of entries are smaller than the number of this, and I tell you where they are, there is no problem. This becomes an easy problem. But I don't tell you where they are. So now the question is, what is the price you pay no knowing where they are, or can't find them is another way to say the variable selection problem is kind of in our algebra set.
And basically, the theorem I've been referring somewhat vaguely to tell you that if the column of these matrix are not too collinear in a precise sense, then it naturally solved the problem. So you can show that the price of finding the position of these entries is finite and it's actually not too bad.
And there is a whole perspective in what I just said. The point kind of new way of looking at sampling theorems beyond the [INAUDIBLE] sampling. Just the one I said if you have a band, you have to go twice that frequency if you want to have exact reconstruction. And this new theorem says, what if my band is this big, but is almost empty? Do I still have to go twice the frequency, or I can do much better than that?
And this is basically saying you can do much better. It's not exactly as if you just know that you basically pay a price of knowing where the location of the frequencies are. And this is basically, a whole new idea [INAUDIBLE] theory. There is this whole idea in which basically if you have a vector, you can actually sketch it with a very short vector and still be able to get back to it.
And this whole idea of impress sensing and sketching out rhythms and so on and forth. And this is kind of going out of machine learning in a broader domain which you kind of do the same stuff. Slightly different [INAUDIBLE]. But there is also another direction where literally, you actually keep on staying in machine learning.
And for example, one example is, what if I tell you that rather than having individual variables that are different. I know that in my gene example before, there are a group of genes that are actually associated pathways. And they're actually important. I want to keep them grouped. And what I want to set is not individual genes, but pathways. A pathway is a group of genes that are related to each other. Can I do that? Can I do that if my groups are not overlapping? But can I also do it if there is an overlap? And what is my group's actually hierarchy? A big group that is split in other groups. Can I impose this kind of priors? And this is what people are calling structured sparsity. Group lasso.
And in fact, you can mix this idea with kernel and get multiple kernel learning ideas. And this is all somewhat complication of ideas where lots of the tools will remain the same.
A more substantial departure from these kind of ideas is when you move from vector to matrices. You go from a vector to a matrix. And so it's kind of all the same, but you would put the matrix here and a matrix here. And the idea here is now what you have to retrieve is not the vector, but it's a matrix. And you can prove that this simple problem allow you, for example, to describe many vector [INAUDIBLE] regression problem. Many problems that people call matrix factorization.
So the idea is, again, I have a matrix of same thing called measurement. I have measurements, which are themselves a matrix. But then, my unknown is also a matrix. And in this sense, rather than sparsity, oftentimes people think-- so lower rank. They say if my matrix is big, but it's low range. So in some sense, you detect the degree of freedom of the matrix small. Can I use ideas related to this to try to recover from much fewer samples than that.
The answer is, again, yes. And the theory somewhat [INAUDIBLE] of the theory. So that should be exciting [INAUDIBLE]. Here you have four vectors. So what I'm showing to you is just the tip of ice.
All right. Any questions about this?
So the take-home message is it's big. A lot of stuff I can tell you about. The one thing I would like you to remember is if you have a linear model, and if you want to know which variables are important, you can use this kind of technique where you really start to look at one variable, then include another one, and then another one.
You can try to use that. Try to understand why it works. And then if you're curious, you can use that as an entry point to the whole world of different kinds of methods. And find a lot of beautiful math. That's kind of the hope for this.
Yeah, I want to say one thing. The one thing I didn't say, didn't make that I probably should have done, is linear models, [INAUDIBLE] models. What if my genes are not interacting, or are not banging linear way to the out?
That's complete nonsense. Why the gene should [INAUDIBLE] in a linear way to the out? Can they get polynomial square?
Yes. Do we know how to do it?
No. What we know how to do is they assume that each gene depends nonlinearly to the output. But then just assuming that the gene are linearly interacting with each other. That's kind of easy. They already speak to each.
But the full-- so sparsity with no parametric, we don't know. Almost nothing there. The algorithm that stopped working where a dimension is bigger than [INAUDIBLE]. So it's a big direction. But we do linear models because we don't know how to do anything else. And because it's good in a lot of different situation. OK.
I don't know what to do with this. OK. How many of you know what PCA is?
So just go to lunch. So I just give you a quick presentation of PCA, mostly because I liked it. And when I put it together, it was slightly different from the one I found. And I find it was instructive to do other stuff.
We have to go. Exploding.
So what's PCA? You have data, high-dimensional .You don't have labels now. Different game. You don't know [INAUDIBLE], just stuff. [INAUDIBLE] of numbers. And you want to basically put down the dimension. Throw down the dimension, whatever it is [INAUDIBLE]. Why?
A bunch of different reasons. One is, for example, you want to look at the data. You want to take three-dimensional description of the data rather than, say, 120 dimensional. And you want to plot the digit. And you would like to see that some digit are here, some are her, and so on and so forth. What kind of shape they have. Are they all mixed or are all they naturally separated? How do you do that?
Here's the formula dimensionality reduction. It says if you give me vectors of size D, I want to do a map and put them down, so say K, where K is much more than D. Here's my question now.
There are no labels. I'm going to do this on the basis of a bunch of points. In the cases, how about one dimension? I want to project in one dimension.
They give you a bunch of vectors and you want to project them in one dimension. What do you think it's a good idea? What would be a reasonable idea to find-- and let's also make the case that we are fine with linear projections. We want to find a linear projection. We have a bunch of vectors. We want to linearly project them now in just one dimension. And I want to find the one dimension which is best. What would you say?
LORENZO ROSASCO: Maximize the variables. That's one. What does it mean?
You take all the vectors. You project them down in every possible direction. You look at how they are spread out. If they're spread out a lot, you're happy. And you choose that direction. What else?
One point of view is to say if you give me vectors and you don't tell me anything else and I project them down, I might want to be able to reconstruct them. To describe each vector using the least possible amount of information.
Variance is one way, by variance. You can say maybe I just want to keep as much information about this one vector as I can. Make sense?
So the way you would do it is basically saying pick a vector. Now, pick a unit vector. So this is the vector I want to compress. This is the unit vector. And what I'm going to do is project it in one dimension. OK. I just go in one direction w. And then I check how much I'm shaving and how much I'm losing. I computed the difference between the vectors I started from and it's one-dimensional projection. It makes sense?
So this is the picture I showed you before. You take a vector, you project it in one direction, and you check what's left. Only here, the vector are one input vector and it's one-dimensional projection. It makes sense?
This time, it should make sense. This is just a one-dimensional projection on a unit vector. So take this out. You project it here. You get a number. This is the length of this one-dimensional direction, and then you check how far you are from the original vector. So one idea to find this one-dimensional direction. So this is the one-dimensional projection, linear projection. So the idea is I want to do this for all the points at once. And I want to find the direction which, on average, is the best for all the points. That allows me to get the best reconstruction.
AUDIENCE: What's the difference between this and finding maximum variance?
LORENZO ROSASCO: None.
LORENZO ROSASCO: This is just the geometric point of view. They're the same. And that's what's coming in one. Just the same, just a different point of view. OK.
This thing here is a sphere which has unit vectors, nothing else. And to restrict myself to unit vectors, I want to have scaling somewhat.
One thing is about computation, one thing is about statistics. So statistics exactly what y just said.
If you do nothing but taking this square, expanding it, just notice that because it's our unit vector, some nice things happen. No, actually-- yes, less than that. Because when you do the mixed product of these two, it simplifies with the square of this. And then you can go from this min to this max.
Now when you look at this max, you see that if these variables are centered, otherwise you do have to add the centering. Maximizing this is actually maximizing exactly what he said. Just look at one direction which we have the maximum variance. But this theory, the one direction gives the best reconstruction.
The thing which is [INAUDIBLE]. Why reconstruction in variance?
It's here so it's quite convincing. It's not obvious that [INAUDIBLE]. And this is the one picture. Basically, once you have the principal components, well, is this direction. Why?
Because in this perspective, it's a projector. This is the kind of spread I have.
The thing which is kind of cute, I think, is the computations. Now, let's speak to this formulation. And basically, what you can do is that you can just expand here. It's very simple. So you just take the square. You just write it down explicitly. You do symmetry. You get this. And then you use linearity to plug the summation in. And then you get the quadratic form, w transpose times this matrix w.
This matrix here is just a second moment of variance of the data. And so it turns out that you can rewrite this maximization problem as the product of the maximum of this quadratic form under the constraint of unit vectors.
And either you remember or you can just use Lagrange's theorem to prove that the solution of this problem is just a maximum eigenvector of the matrix Cn. I think the beauty of this is that there's nothing else to know about this here. So you start by asking, what is the one-- the best one-dimensional projection?
You write it down. Just put the max problem and it takes one line. Just this one calculation, actually it is quadratic form. Then, let's say for the computation. Because either you pick it up from a book or you prove it in one line. This is the solution, the one direction of this product is the maximum eigenvector of the matrix of the [INAUDIBLE].
And so for k equal to 1, this gives you, at least in this [INAUDIBLE] reconstruction framework, the solution to the problem. What about k equal 2?
With a little effort, what you can show is that this is using the same idea. If you don't make any assumption on the second vector, you say, OK, I found the first vector this way. I just want to find another one. It's a max. But if you ask the other vector to be orthogonal to the first one, then if you get to this problem-- and it turns out that the solution to this probably is the second eigenvector of the variance matrix. You just keep on going.
And you can even do-- of course, you can go from k equal to 2, k equal to 3, and so on and so forth. So the whole problem of dimensionality reduction in this case just reduces to find the eigenvector of this matrix.
There is a big difference with the variable selection problem. The eigenvector of this matrix is going to be linear combination of the vectors. But in both cases, you're reducing the dimensionality of the problem. But in this case, what you find is a mixture of variables.
Suppose you go to the doctor before and you use this. He's going to ask you, OK, nice. You can show me the patients in low dimension. Which variables are important? Which genes characterize these patients?
You don't know because the first principal component, this guy, might look-- might as well be a combination of all the gene describing patients. So it's very different.
This motivated expansion of this trying to find sparse PCA that would give you sparse eigenvectors, and so on and so forth. Because clearly, you cannot just say, oh, let me go back to the other method. Because you need labels for the other method. Whereas, all here it is unsupervised.
And this is, indeed, one possible direction of extension. There are many, many, many others. One direction is basically saying I keep the idea of going down linearly, project down linearly. But instead of reconstructing or maximizing variance, I say that what I want to do is to preserve at least a little distance among my points. I have vector in high dimension. I believe that to some extent, distance do matter and they're reliable. And I want to preserve them when I project them down.
And this is a classical question. And one possible answer to this is one given by Johnson-Lindenstrauss lemma. If you take a random matrix. For example, a matrix with many entries or normal Gaussian entries, then you can achieve this with [INAUDIBLE]. You can project down the data from dimension D with a dimension K that depends on the number on points and number of dimension so that you preserve this low.
Not only that, if you keep the same idea of projecting down in such a way to preserve distances, but you drop the linearity assumption. You don't ask this map be linear anymore-- pretty quickly you get into ideas related to manifold learning. And idea stuff like Laplacian eigenmaps, Essen eigenmaps, a lot of eigenmaps stuff. Fusion maps and so on and so forth. And basically, the idea there is that you try to give them the data that might lie on something which is curvy. You try to find a linear map, this is out. So that the difference between these two points should not be preserved. The distance between these two points should be preserved. And you allow yourself to do it in a nonlinear way.
From a practical perspective, a lot of-- this is easy because it just a random matrix and you apply. A lot of these methods are related essentially to looking at this and doing a kernel version of this. OK.
Most of the medical learning method from a computational point of view, just reduces to look at this. And basically say, this is x transpose x. I want to do x-x transpose. And then I just [INAUDIBLE]. That's more or less [INAUDIBLE].
And again, the list of things you can do just building on this. So you can do a lot of thing where you start from PCA and go in the direction of sparsity. You can do a lot of things you start from PCA and go linear using kernels. [INAUDIBLE] we seen before.
The thing to remember is that you don't get variable selection but you do reduce the dimensionality in an unsupervised. And that's more or less what I wanted to say.
Again, the purpose of this was mostly just to give you-- if you're on a desert island and could bring five machine learning algorithms with you, which one would you pick?
And I would say these are what I would take. I might take kernel k means just in case. I didn't have the chance to tell you about it. I will say least squares, linear, nonlinear, variable selection, PCA, nearest neighbor should be [INAUDIBLE] island.
So again, if you are new to this, this should have been kind of confusing, but also giving you enough pointers that today you can sit down and try to use some of this.
If you already seen this, hopefully you got enough pointer see the overall picture and look for further development. And I'll be around anyway, so you can ask me all kinds of questions. If you have any questions, [INAUDIBLE]. You can go for lunch.
LORENZO ROSASCO: Yeah.
AUDIENCE: I want to ask about matching pursuit. You said one of the limitations is that it's linear--
LORENZO ROSASCO: The standard matching pursuit-- at each step, you just solve a one-dimensional problem. And you-- and that's OK, but turns out at the end there is not enough coupling between the variables you are selecting.
The orthogonal matching pursuit is the one where whenever you select new variables, you add it to your set and you re- the solution. That one is much more stable. But each time you introduce a variables, you have to solve a linear system. And this costs you whatever it is. pq OK.
So if the variables that matter are five, perfect. It's much better than anything else. You start from 1, 2, 3, 4, 5 in linear system, you're good. But if by any chance you start grow systems bigger, you have as many linear systems so going much bigger, that's the main limitation.
In practice, I would say it seems that in hard problem, the convex relaxation approach seems to be somewhat more stable. But [INAUDIBLE].
AUDIENCE: Solving a bigger linear system each time, but you might be able to reuse the partial solution to the last? That might be a closed solution?
LORENZO ROSASCO: Yeah. I have to say again, for some reason, myself particularly, I've been focusing a lot on the convex relaxation approach. Mostly because there is some very nice optimization theory. There has been a lot of interesting first-order methods with the gradient of the squares and then you project it down according to some [INAUDIBLE].
In fact, when you have to start thinking about choosing the parameters and looking at the level of sparsity, to some extent the fact that so much of this is iterative, warm up built-in property is nice. But I am not aware of as many-- as you look more into it, I think lately there has been some things in direction you're saying. But for a while, people didn't look to much.
I think now that you have this data and you have to choose lambda, you cannot just run it many times, matching pursuit will become more interesting.
Associated Research Thrust: