Lorenzo Rosasco: Learning Theory, Part 1 (local methods, bias-variance, cross validation) and Part 2 (regularization: linear least squares, kernel least squares)
May 30, 2014
All Captioned Videos Brains, Minds and Machines Summer Course 2014
Topics: Supervised learning, nearest neighbor methods and overfitting, k-nearest neighbors - choosing k, bias-variance tradeoff, cross validation, regularization, least squares, linear systems, computational complexity, kernel least squares using linear, polynomial, and Gaussian kernels
LORENZO ROSASCO: This is going to be three hours of machine learning. I have no idea how diverse the backgrounds are, and this is kind of the problem in preparing three hours of slides. Just give me an idea: how many of you have had a machine learning course before? OK, so for you, quite a lot of the stuff I'm going to say might be a bit introductory.
The whole plan I thought of was to give you kind of a kit, a set of basic tools that you'll be able to use. These are tools that you can further develop, so I'm going to start from basic things and give pointers to more sophisticated solutions.
So you already had quite a bit of an introduction to machine learning yesterday. I just want to say, one way to look at machine learning is as sitting somewhere in between developing intelligent systems and, in general, just making sense of data. And it turns out that a lot of approaches to developing intelligent systems are actually data driven: a lot of problems you cannot solve by a simple set of rules. You need to actually look at data, at examples of potential solutions to the task, to devise your own solution.
Data and the techniques to deal with data link these two seemingly different worlds. And what I want to do today is, again, try to introduce basic key algorithms that you can use. So I'm not going to do much theory. Yesterday we covered most of the conceptual parts I would need today. I'm just going to show you algorithms, getting to the point where you can actually use them. And this afternoon there will be a lab session where you'll have the algorithms I showed you this morning implemented, and you can play with them. Toy data, but also a real data problem.
So this is probably a superset of what I'll be able to talk about. It's divided into three parts.
The first part, the keyword is [INAUDIBLE]. This was only mildly touched upon yesterday. If some of you remember the whole story about regularization, fitting, smoothness, this is going to be about how you find the best lambda, the best parameter trading off fitting and smoothness.
So a key idea was that you have to learn from data, but data can be incomplete and noisy. You balance the data against [INAUDIBLE] some form of prior, but then how much you believe the data, how much you want to enforce the prior, is something you have to decide. In the first part we're going to look at very simple methods, so I don't have to explain much, and then deal a bit with how you actually tune the hyperparameters of your algorithm. This is basic, but it's the one question you always have to ask of an algorithm: how many hyperparameters do I have to choose, and how do I choose them?
In the second part, we'll move on and actually see an instance of regularization. Pretty much what you saw yesterday, but we're going to zoom in a little bit and see simple methods. It's going to be least squares solutions. And I'm going to tell you how you can turn linear models into nonlinear models, and parametric stuff into nonparametric stuff. I'll tell you enough about nonparametric [INAUDIBLE]
And I should be able to get here; from here on it depends, also given how much [INAUDIBLE], how much we will be able to cover. The last part is really moving away from black box machine learning methods into asking the question: which information is important in my data? What are the factors that are actually affecting my decisions and my predictions? So we'll try to understand a bit more, not only how to get a good prediction, but what are the factors that are actually fundamental to doing so.
The first part of that is going to be a glimpse into variable selection, this is sparsity. You're still looking at supervised learning, where you have labels. And then I'm just going to briefly remind you what PCA is and give you pointers on how to make it more complicated.
And that would be the only unsupervised part. And last, this afternoon we're going to have the Matlab session. There might be an extra hour and a half tomorrow; I'm not sure if it's confirmed. So here we go.
So in this first part, the main idea is that I want to talk about the trade-off between stability and fitting, and how you decide, in a principled way, how to strike the best trade-off between these two principles.
And because I didn't want to focus on this part, I'm just going to look at extremely simple methods. The setting we're going to discuss for quite a while is that of supervised learning, which is basically learning a function from samples.

So my problem is, I want to learn a function from an input to an output, and it should be good for every new input I might get. And what's at my disposal is just a bunch of input-output pairs. This is my training set. This is the data. Make sense?
So this is the kind of problem I want to look at for quite a while. And here is just one example, so that we all have at least one example in mind to which we can refer. Pretty much every camera these days has this kind of feature that puts a little circle on a face. How does it do that? I don't know the exact algorithm on the iPhone, for example, but pretty much the overall idea is, you're going to get an image like this. If you look closely, imagine it's just black and white, then each pixel will be a number between 0 and 256 or something. So this gray map will be a matrix, and then you can unroll it into a long vector.
Same here. This is another image. This one does not contain a face, but it's still a set of pixels, and I unroll it into a vector too. So I stack all the images, each one described by a bunch of pixels, as rows in a matrix. And this is going to be my input matrix.
For each of these rows, which again is just an image, I'm going to have a label, in this case either plus one, say, face, or minus one, not a face.
So we'll see in a minute that you can actually use fairly simple ideas. One thing that makes everything complicated is that here n is the number of images and p is the number of pixels in an image, and even for relatively small images, this number is pretty big. Machine learning is very similar to classical statistics in a lot of respects.

One main difference from classical statistics is that the number of variables you have to deal with is extremely big, even in the simplest cases. Hundreds of variables are just normal, and 50,000, 500,000, or a million variables is actually pretty normal too.
So this is kind of a new game. We'll see in a bit whether it makes a big difference. So this is the one example, just to keep something in mind: you want to recognize whether an image contains a face or not. You're given a bunch of images where there are faces and a bunch of images where there are no faces. You encode them in this naive way. Of course, just taking the pixels is not the best idea, but it's a beginning, and then you just have [INAUDIBLE]. This is your data, and now, given a new image, you want to be able to say face or not. Is this clear to everybody?
All right, so, due to a limitation in my drawing capabilities, we're going to think of images which are made of essentially two pixels, just to be [INAUDIBLE]. So, think of each of these as a face and each of these as a non-face. And the game is going to be, we want to find a rule, we want to draw here a line that will divide the space into blue and orange, and then we're going to say, if a point falls into one region it's going to be orange, so face, and otherwise it's going to be non-face.
And the first question is, suppose I give you this one triangle here, and I don't know whether it's orange or blue, what would you say?
LORENZO ROSASCO: Why?
AUDIENCE: It's in a cluster of other oranges.
LORENZO ROSASCO: Yeah. Even simpler, it's immersed in a bunch of other guys. In particular, one way to check this is that if you look at the closest one, it's an orange point. This is like the simplest rule you can imagine; it's called nearest neighbor. It's the intuitive thing that you do when you look at the picture: all the closest ones are actually orange. As simple as it might seem, this actually works nicely in a lot of situations. Algorithms in this family are based on this very simple intuition: nearby points should have the same label.
And there is a large class of such methods; you'll see that, as simple as it is, this is actually a good building block for more complicated solutions. So the simplest version is: when you get a new point, compute the distance to all the training points and find the closest one. So these colored points are the training set, and the black one is a new point. I check which is the closest point; in my drawing it's probably this guy, it's orange, so I say the new point is orange.
For this kind of method, this one is still okay, but it's usually a bit of a pain to write them down. So what I do is: if you give me x bar as the new point, I compute the distance to all the xi's, which are the rows of the matrix from before, so each of them is an image. I find the index of the closest point, and then I assign to the new point x bar the label of that nearest point. Don't worry if you don't fully digest all the equations I'm showing; most of this stuff you'll actually see implemented, and you can go back to this to check how it works. This one is so simple you might digest it right away; some others will not be as [INAUDIBLE].
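To make the rule concrete, here is a minimal sketch in plain Python. The function name `nearest_neighbor`, the Euclidean distance, and the toy data are my own assumptions; the lab code may look different.

```python
import math

def nearest_neighbor(X, Y, x_bar):
    """Predict the label of x_bar as the label of its closest
    training point (1-nearest neighbor, Euclidean distance)."""
    # index of the training point closest to x_bar
    j = min(range(len(X)), key=lambda i: math.dist(X[i], x_bar))
    return Y[j]

# toy training set: two points per class in the plane
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
Y = [-1, -1, +1, +1]

print(nearest_neighbor(X, Y, (0.05, 0.1)))  # lands in the -1 cluster
print(nearest_neighbor(X, Y, (0.95, 1.0)))  # lands in the +1 cluster
```

Note that there is no training phase at all: the "model" is the data itself, which is exactly why the method is nonparametric.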
So much for how it's written down. Let's see if we can make it work.
So I'm just going to build a very simple data set. I take 20 points sampled from this data set, two moons; some of you might have already seen it. So this is the data set. Forget about this other plot for a second; this is your training set. Two dimensional points: these points here are one class, these other points here are the other class. These are 20 points.
What I show you here is just many more points. I'm cheating a bit in these slides: typically the training set is all you have in practice, but since in my simulation I'm god, I'm actually also going to plot here an enormous number of points, the population these guys are coming from.
Make an imagination effort: these you have, these you typically don't have. So what we're going to try to do now is see a bit how, given this data, you can find a solution, and how this solution will work on new data. So these are the new data, the future data. So the game is: find something based on this that will work well for that. Make sense?
So I'm going to run nearest neighbor. This is what I get. So, what do you think? Is it a good solution or a bad solution? What is good and bad in nearest neighbor? Would you like it, would you use it every day, would you build iPhone apps with it or not?
AUDIENCE: The outlier in the bottom left probably might be a misclassification that is cutting the region in two. Would it actually work on the test set?
LORENZO ROSASCO: Right. So I guess the comment is that a single point might have quite a bit of importance and might pull your line all the way down here. Turn yourself into god just for a few seconds and check the density right there: you see, all the blue points that fall in that orange region would be mistakes, and the other way around.
AUDIENCE: Most of the train you test--
LORENZO ROSASCO: Right.
LORENZO ROSASCO: I guess what I'm pointing at, essentially, is that if you look at this curve, you can say that it's kind of jagged, and also potentially very, very complicated. It follows every point, because we gave the algorithm the power to actually do it. Which is kind of good. This is the first example of a non-parametric method. You don't know exactly how to parametrize this curve, but you do know you can build very complicated stuff, as complicated as it gets. It can follow pretty much any point.
So from the point of view of the stability/fitting trade-off, this is definitely biased towards fitting. We unleashed it to follow any point. It can be arbitrarily complicated, but it can also be extremely unstable. And the comment about the outlier down here is the first example. Just to give you an idea, let me generate another data set. It's the same; it looks very similar.
So you see, this is another manifestation. As simple as it is, this gives you a feeling: this should be the solution to the exact same problem.
So what I did is take another 20 points and just compute the solution again. This is what I get. So, even more than just following that one outlier, if I give you another sample, the solution is completely different. This is reflecting the fact, and this is actually worse, that you don't get to pick the data. Remember, the only thing you have is this, and this is what's happening on the other side.
So even more than just looking at the one outlier, what you see here is what happens as soon as you start to resample the data, and here we're in two dimensions; you'll see that in higher dimensions things get much worse. I just gave you 20 points. The way this data set works is that I have basically this cloud here and this cloud here, and then I decide a certain fraction of labels that I will flip. In this case, it was 5%.
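As a minimal sketch of how such data can be generated: the function name `two_moons`, the exact moon geometry, and the noise levels below are my own assumptions, not the course code, but the two ingredients described above (two clouds plus a fraction of flipped labels) are there.

```python
import math
import random

def two_moons(n, flip=0.05, noise=0.1, seed=0):
    """Sample n points from two interleaved half-circles, label them
    +1 / -1, then flip a fraction `flip` of the labels at random."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        t = rng.uniform(0.0, math.pi)
        if rng.random() < 0.5:                       # upper moon -> +1
            x, y, lab = math.cos(t), math.sin(t), +1
        else:                                        # lower moon -> -1
            x, y, lab = 1.0 - math.cos(t), 0.5 - math.sin(t), -1
        x += rng.gauss(0.0, noise)                   # coordinate noise
        y += rng.gauss(0.0, noise)
        if rng.random() < flip:                      # label noise
            lab = -lab
        X.append((x, y))
        Y.append(lab)
    return X, Y

X, Y = two_moons(20)
print(len(X), sorted(set(Y)))
```

Rerunning with a different `seed` gives the "another 20 points" experiment from the lecture: same distribution, visibly different sample.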
So, how shall we fix this? If nearest neighbor is the dumbest method I can think of, what is the second dumbest you can think of?
AUDIENCE: Add more neighbors.
LORENZO ROSASCO: So this is it, boom boom boom. This is my reminder, and this is the new point. If you get a point, for example, this point here, which class would you say it is?
LORENZO ROSASCO: Circle, probably orange. It's not completely clear, because now we start to get close to the boundary. I made an effort to put it close to this guy; what you kind of see here is that the nearest neighbor is probably wrong, but most of the guys around it are actually orange, so you probably feel like saying it's orange. That's what your brain does looking at the picture. So you can see what happens.
So this method is kind of a pain to write down, or at least I didn't find a very nice notation, but in words it's easy, and it's easy to code. Take a point, compute the distance to everybody, sort them, and then take the first k neighbors.
You probably want your k here to be odd, so that you can take a vote: you have all the neighbors voting orange or blue, and majority wins. And this is called k nearest neighbors. And these are my [INAUDIBLE]; the basic point is, you take all the distances, you compute them, you rank them, you find the indices, and you take the first k. Look at this quickly, don't get too confused, but the main point here is: take the indices of the k nearest neighbors and take the average of their labels.
Since these labels are just plus 1 or minus 1, if k is odd, I'm just taking a vote: majority wins. I could normalize here; it doesn't really matter. Does this make sense so far?
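In code, the voting rule just described might look like this minimal sketch. The name `knn_predict` and the toy data are mine; odd k avoids ties, as noted above.

```python
import math

def knn_predict(X, Y, x_bar, k):
    """k-nearest-neighbor classification by majority vote:
    sort training points by distance to x_bar, take the first k,
    and return the sign of the summed labels (+1 / -1)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_bar))
    vote = sum(Y[i] for i in order[:k])   # odd k => never a tie
    return 1 if vote > 0 else -1

# toy data: the closest point to the query is a (possibly mislabeled)
# -1 "outlier" sitting inside a +1 neighborhood
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (1.1, 0.9)]
Y = [-1, +1, +1, -1, -1]

print(knn_predict(X, Y, (0.05, 0.05), 1))  # the lone nearest point decides
print(knn_predict(X, Y, (0.05, 0.05), 3))  # the two +1 neighbors outvote it
```

This is exactly the effect from the slides: k = 1 follows the single outlier, k = 3 smooths it away.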
This is exactly the same data as before. So this is one nearest neighbor, and this is five nearest neighbors. Let's see if we can do eleven. It starts to become more regular. It's still jagged, so this is still an extremely nonparametric estimator; again, it's very complicated. When I say nonparametric, it means that to parametrize this, to find the number of parameters, is not easy. But you do see that the solution is way more regular.
We can check perhaps [INAUDIBLE] data sets of this kind, small data sets. It's pretty much the same. It still changes, but a bit less. Take another one and you will see; I think it's [INAUDIBLE]. Not always, OK.
This one, for example, actually looks pretty good. This is one nearest neighbor, and it happens to be pretty good. For some reason I don't have a single outlier here; it's just the sampling. Again, don't fool yourself; I ran it twice on the same thing, and this is what I get. The same thing with this guy. This one actually is right.
So anyway, this is not rocket science; it's just to give you a feeling. This algorithm is just an excuse for me to show you, in a simple situation, the simplest algorithm that manifests the trade-off between fitting and getting something regular.
AUDIENCE: How do you plot your code? How do you get the [INAUDIBLE]?
LORENZO ROSASCO: I don't. So the question is, how do you get the black line? I don't. What I do for this is take a uniform grid of points and classify them. So I get plus 1 and minus 1, and then I trace where the prediction switches. I cheat. It's just one way to plot it; I definitely don't have an equation for that curve.
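The plotting trick just described can be sketched as follows; `grid_labels` and the toy linear classifier are my own hypothetical names, standing in for whatever trained predictor you have.

```python
def grid_labels(predict, lo, hi, steps):
    """Evaluate a classifier on a uniform steps-by-steps grid over
    [lo, hi]^2; plotting where the label flips sign between
    neighboring grid points traces the decision boundary."""
    h = (hi - lo) / (steps - 1)
    pts = [lo + i * h for i in range(steps)]
    return [[predict((x, y)) for x in pts] for y in pts]

# toy classifier: +1 strictly above the line x + y = 1, -1 otherwise
labels = grid_labels(lambda p: 1 if p[0] + p[1] > 1 else -1, 0.0, 1.0, 5)
print(labels[0])   # bottom row (y = 0): all -1
print(labels[-1])  # top row (y = 1): -1 only at x = 0
```

A real plot would just color the grid cells by label, or draw a contour where the sign changes.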
But anyway, this is just to make the point that this is the simplest algorithm for which you can see this kind of effect. I think this is the first algorithm to know, because it's so simple. Whenever you try something fancy, some important new method in the world of machine learning, you might want to try this first, because it takes 15 seconds to code, and then you can check whether your fancy method works much better. That said, this algorithm can actually be made much fancier. One intuition is that closer points should count more. There are neighbors and neighbors: there are the closest neighbors, which we like more, and the neighbors that are far away.
So what you do instead is basically put a window around each point, so that points farther away count less. And this is the idea behind so-called Parzen windows, or kernel estimators. Instead of just looking at the distance, you put a Gaussian on top of each point, and then you can control the width of the Gaussian. If you make the Gaussian very small, each point is just going to look at a very small number of neighbors. If you make the Gaussian very big, you're going to be able to see everything.
Here there is no number-of-neighbors parameter, but the width of the Gaussian basically plays the same role. You could implement this and play the same game I did, and you would see that this sigma has exactly the same effect. If you make it very small, so you basically get delta functions over the points, you can get reasonably complicated surfaces. If you make the Gaussians very big, in the end you're basically just counting the points in your whole training set.
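A minimal sketch of such a Gaussian-weighted rule; the name `parzen_predict` is mine, and this is one common form of the estimator, not necessarily the exact one on the slides.

```python
import math

def parzen_predict(X, Y, x_bar, sigma):
    """Gaussian-weighted (Parzen window) prediction: every training
    point votes, weighted by exp(-||x_i - x_bar||^2 / (2 sigma^2)),
    so closer points count more."""
    weights = [math.exp(-math.dist(x, x_bar) ** 2 / (2 * sigma ** 2))
               for x in X]
    score = sum(w * y for w, y in zip(weights, Y))
    return 1 if score > 0 else -1

# same toy data as before: a -1 "outlier" inside a +1 neighborhood
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (1.1, 0.9)]
Y = [-1, +1, +1, -1, -1]

print(parzen_predict(X, Y, (0.05, 0.05), 0.01))  # tiny sigma ~ 1-NN
print(parzen_predict(X, Y, (0.05, 0.05), 0.2))   # wider sigma smooths
```

Playing with `sigma` here reproduces the same small-k/large-k behavior seen with nearest neighbors.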
The other thing you can do, and you can already see it from this equation, but it's even clearer in this case: I wrote everything assuming that my data are the ones I showed you, so they are vectors and you just use the Euclidean distance. There's no reason why you should. If you have a special kind of vector, for example binary strings, and you do have a distance, for example the Hamming distance, then you can just use that.
You can also have more complicated similarity measures that come from your friend who is an expert on face recognition, or whatever problem you're looking at. Together you build a good way to measure similarity, and once you have it, you can just plug it in here.
This is not an easy problem; finding a good similarity is probably much harder than the rest. But if you have one, you can just plug it in.
So the simple nearest neighbor, k nearest neighbor, can actually be made fairly sophisticated. And at this point one could keep going: instead of taking a piecewise constant approximation locally, you can do a linear interpolation locally, or a polynomial interpolation locally, and make this more and more complicated.
All this, for me, is an excuse to get to this one remark here. You always have at least one parameter that takes you from a complicated to a simple solution. And there's one obvious question: how do you choose it? If you're mathematically inclined, you can ask if there is actually an optimal value. Should I just make it as big as possible, or shall I choose it in a careful way?
And if you're less mathematically inclined, perhaps, you can ask: can I compute a good value in practice? If it exists but I can't compute it, maybe it's not that interesting. So I guess this is one possible answer to the first question: is there an optimal value? We could go back to the plots, but I won't do it.
So what is the basic idea here? If I could, and a lot of people, when they submit papers to conferences, actually do this, you would optimize your k not here, but here. If you had at your disposal the data that represents the future predictions, that's what you would do. You would choose the k, among all the possible ones we showed before, that is best here, right?
Yes, we are happy if we make good predictions here, but this is what we really care about. Not recognizing the pictures already on my mobile phone; I should be able to recognize faces in any picture I might take in the future.
And so, mathematically, the way you write this down is basically saying: I'm going to take any possible pair (x, y), not just the ones in my training set, and take the expectation of the squared error of my solution against the true output. So this is the average error of my estimator over all possible (x, y) pairs.
AUDIENCE: What was that?
LORENZO ROSASCO: I haven't talked about that, but I will. What is S? S is the training set, the set of (x, y) pairs from the beginning. The question is, why do I have to put that there? What happens if I don't?
Postulate for a minute that you have access to all the x and y; you compute this quantity and minimize over k. Then what can you say? You find the best k, but for what? For this guy, this one training set that you have. But then you have to say, well, maybe that training set is not the only one I can get; I can get other training sets in the future. I want a choice that's good for all of them, so the simple thing we can do is average out with respect to many training sets.
There are more complicated and fancy ways of doing it, controlling not only the mean but actually the whole distribution of this quantity, but today I'm just going to stick to this.
So I take the first expectation over all possible pairs in the future, and then the expectation over the training set I'm given to solve the problem. Make sense?
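In symbols, the double expectation just described might be written as follows (a sketch; $\hat f_k^{\,S}$ denotes the k-NN estimator trained on the set $S$, and the notation is my own):

```latex
% Expected risk of the k-NN estimator \hat f_k^S, averaged over
% future pairs (x, y) and over the draw of the training set S:
\mathcal{E}(k) \;=\;
  \mathbb{E}_{S}\,\mathbb{E}_{(x,y)}
  \bigl[\,\bigl(y - \hat f_k^{\,S}(x)\bigr)^2\,\bigr],
\qquad
k^{\ast} \;=\; \arg\min_{k}\; \mathcal{E}(k)
```

Neither expectation is computable in practice, which is exactly the problem discussed next.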
Do we all agree that this is a reasonable quantity to look at? There is an obvious problem, whether or not you're mathematically inclined: you're not able to compute either of these expectations.
So whatever I'm going to say in these next few slides is purely theoretical, and it answers this question: is there an optimal value? The idea is, if there is one, then I can sit down and try to compute it. If there's not one, then go home. So we're going to do a couple of computations first to see that one exists.
So I have a bunch of equations, and I'm just going to try to point out what you should look at. The basic point is that by studying [INAUDIBLE] this minimization problem, you see that it has a solution. What's the shape of the objective as a function of k? How does it look?
So as a function of k, how does it look? Does it have a minimum? What does it depend on?
And by looking at this problem, we unveil one of the most fundamental aspects of machine learning, in more general terms, the trade-off between approximation and estimation.
What I'm going to show you now is a bunch of equations; I'm basically just going to give you a feeling for how you get certain results. All the math, in pretty much everything I have, doesn't go beyond linear algebra. So I'm not going to prove anything, but every step where you see a calculation, if you've not done it before, you can just spend some time on it; it's an exercise.
There's nothing more complicated than averages and sums of independent variables; it's all linear algebra and basic probability.
So what I want to do here is rewrite this quantity in a few ways. We do it by looking at a particular model for the data, which is not a classification model but a regression model. The idea is that you have a true function, and then there is some noise. You get an xi, which is random, and a delta i, which is also random, and these produce yi. This is the most standard regression model; each data point is indexed by a small i.
So basically, this noise is zero mean and has a certain variance. Just think of the one dimensional situation, and maybe you really want to picture it.
So this would be my x, this would be my y. There is a function that is fixed, but I don't have it. I have a bunch of points.
These are all drawn in green because they're what I don't have. What I do have are these coordinates: x and y, x and y, and so on. The idea is that because of the noise, I will most likely find each point near the function, but spread out a little.
Now, to do this computation it is much easier, not needed, but much easier, to condition on x, and you can prove that you can interchange the expectations. Forget about that for a second; I'm basically fixing the x's. The key point is that we want to compare this guy to this function here, and to do it we're going to introduce an intermediate object, which is still theoretical.
It's not this, and it's not this, but something in between. What is it? It is, at each point, the average of these guys. It's a theoretical object, because you're not going to have it in practice.
Remember that this f hat k was just an average of the yi. So if you take the expectation of the yi, it comes inside, the noise expectation is 0, and all you get is the average of f* at the neighboring points.
The important thing to remember: I'm introducing the expectation of my estimator, which is something I don't have in practice. It's like saying, if I fix k but you give me an infinite number of points, what kind of function do I get? I won't get f*, because I fixed k, but I get something which is based on the whole distribution. So it's an intermediate object.
And the basic questions are: how is this guy related to f hat k, and how is it related to f*? We can start to think about how choosing k is going to affect my ability to approximate f* or to stay close to a stable solution.
So remember that this is just the same expression with yi instead of f* of xi.
So this is f hat k. When do you think f hat k will be close to this guy, for small k or for big k?
AUDIENCE: Big k.
LORENZO ROSASCO: You think?
AUDIENCE: Big k.
LORENZO ROSASCO: Why?
AUDIENCE: You're averaging across your samples.
LORENZO ROSASCO: So let's see why for small k they would be different.
AUDIENCE: They're not averaging.
LORENZO ROSASCO: They're not averaging. Right. So you're averaging more; that's one way to say it. You're averaging more, so [INAUDIBLE] you get a bit more stable.
But this is just an average over the noise, not over the whole training set. The points are fixed.
The basic point here is that if you let k grow, you average more. The other way around, if you take k very small, you will jump around: whenever there's a little noise, your solution changes completely. So in this case, if I give you this point, or this one, or this one, my f hat changes completely. Whereas if I give you the expected version, the noise averages out, so it's very stable. If I now increase k, as you said, I'm averaging out a bit more in the neighborhood.
So even if I had a single point with all the noise, there are other points which might somewhat smooth out my prediction. So averaging is basically the answer.
The point here is that you can literally add and subtract this quantity in the expression. So you have f* minus the expectation, and the estimator minus the expectation. If you do the computation, there would be a cross term, but you can prove it's missing because it equals 0.
You get two pieces. One is the bias, and one is the variance.
The variance is called variance for an obvious reason: it is the variance of the estimator, itself minus its expectation, squared.
This other term is called the bias, because you are not allowing your method to reach the right solution: you're fixing k. If you let k change, you might be able to approximate it better; but if you fix it, you introduce a gap between your potential solution and the best possible solution.
Again, you can keep going; all these quantities are extremely simple, I gave you the equations, and they're really simple to study. And what you get is that this quantity here turns out to be just proportional to the noise variance divided by k.
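Putting the pieces together, the decomposition just described can be sketched as follows (conditioning on the inputs; $\sigma^2$ is the noise variance, and the exact constants are not the ones from the slides):

```latex
% Add and subtract E[\hat f_k](x); the cross term vanishes and the
% error at a point x splits into (squared) bias plus variance:
\mathbb{E}\bigl[\bigl(f_*(x) - \hat f_k(x)\bigr)^2\bigr]
  \;=\;
  \underbrace{\bigl(f_*(x) - \mathbb{E}[\hat f_k](x)\bigr)^2}_{\text{bias}}
  \;+\;
  \underbrace{\mathbb{E}\bigl[\bigl(\hat f_k(x)
      - \mathbb{E}[\hat f_k](x)\bigr)^2\bigr]}_{\text{variance}},
\qquad
\text{variance} \;=\; \frac{\sigma^2}{k}
```

The variance term shrinks as k grows, while the bias term grows; the next slides study exactly this competition.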
And this is basically what we guessed just thinking about it: as k increases, this term decreases.
The bias is a bit more complicated to study analytically. Intuitively, you're now averaging the right solution: you have the right solution at a bunch of points, and if you average a few of them, and they're dense enough, that's probably a good estimate of the value at this point.
So the idea here is, I have these points, and now you give me a new point here. If I take k sufficiently small and do a local averaging, then if I don't have many points it's going to be not so good, but if I have enough points, I'll get something reasonable for sure.
If instead I take a very big k, I'm going to do a ginormous averaging across all possible values, which this time are noiseless. And I somewhat dilute the information contained in the true function values by averaging over a lot of things. So this term is harder to study, but we can guess that if you let k increase, you're going to make your prediction worse, because you're doing a global averaging of the correct function. It's literally like assigning to this one point the mean of the whole function. Noiseless.
The main point of the story is: you take your estimator, you compare it to the best solution, you add and subtract this expectation, and then you study the terms you get analytically. You find that the error, as a function of k, is made of two quantities: one decreasing in k and the other increasing. And you can prove that these are actually not [INAUDIBLE] actually [INAUDIBLE].
Here I just turn it into a drawing. If you look at the complexity of the model, in our case k controls the complexity: the smaller k is, the more complicated my model.
And what you get is that the variance and the bias move in opposite directions, and you can choose as the optimal value of k the one that strikes the best balance between these two parts.
On the one hand, you're given a limited, noisy amount of information, so you want to choose a k large enough to control the variance. But eventually, you want to keep k small enough that you can capture the whole complexity of the model. In the example I showed you before, if you give me an infinite amount of data, I will just use k equal to 1. That's OK. I can just get every single point.
Each of my points will have a neighbor which is essentially indistinguishable. The problem here is really that because we have a finite amount of data, we want to strike a balance between controlling [INAUDIBLE] and the bias.
So mathematically, this is one instance of the idea that you have to balance the stability and the fitting of your data. And this answers the first question. Is there an optimal value? Yes. It's this guy. Now you can sit down and try to find it.
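To make the bias-variance picture concrete, here is a minimal sketch in Python (not from the lecture; the true function, noise level, and sample size are all made-up assumptions): a 1-D k-nearest-neighbor regressor whose test error is typically worst at the extremes of k and best somewhere in between.

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k):
    """Predict by averaging the labels of the k nearest training points (1-D)."""
    preds = []
    for x in x_test:
        idx = np.argsort(np.abs(x_train - x))[:k]  # indices of the k closest points
        preds.append(y_train[idx].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
f = np.sin                                   # the "true" function (an assumption)
x_train = rng.uniform(0, 2 * np.pi, 200)
y_train = f(x_train) + 0.3 * rng.normal(size=200)  # noisy labels
x_test = np.linspace(0, 2 * np.pi, 100)

# k = 1 is unstable (variance), k = 100 averages half the domain (bias);
# an intermediate k usually does best.
for k in (1, 10, 100):
    err = np.mean((knn_predict(x_train, y_train, x_test, k) - f(x_test)) ** 2)
    print(k, round(float(err), 3))
```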
And the next question is, can we compute it? And the answer is, not quite.
Because you are basically introducing a lot of quantities that you don't have access to. You're introducing the average of the true solutions. [INAUDIBLE]
So one way to do this is to try to find an analytic bound that holds for any possible data set. Historically, this has proven very hard. Finding sharp [INAUDIBLE]. And it has consistently been suboptimal with respect to another approach, which is basically to try to approximate something from data, with some kind of data-driven procedure.
And the one method which I would say is the one that 90% of people use in 95% of the problems is cross validation. There are many alternatives. In inverse problems, for example, not so much. In some inverse problems, you might have a very good knowledge of the noise, and you might use something like an L-curve or [INAUDIBLE] principles. There are a bunch of different methods.
In machine learning, I would say, I don't know anybody who doesn't use this. I would like to know if there's anybody who uses anything else.
I'm going to show you what it is. Does everybody use it because it's fantastic? No. Because we are terribly ignorant and unable to choose parameters in a good way.
You'll see that we're very good at inventing priors and loss functions and data fit terms, and we're terrible at finding the balance between these two ideas. And cross validation is kind of the one thing you can do.
Yesterday we introduced the probabilistic perspective. From that perspective, cross validation is replaced by [INAUDIBLE] marginalization and related ideas. Does this work that great? Not really. But it's pretty much what you can do.
AUDIENCE: Maybe I'm missing something. but is the bias the difference between the function and the average estimator?
LORENZO ROSASCO: Yes, of course. So this class here in my answer. How long has this been going on?
AUDIENCE: The whole time.
LORENZO ROSASCO: Always. Always. What else goes on? Three out of four.
All right. What's cross validation? There are different flavors, but the idea is to try to somewhat mimic that situation. Unfortunately, you cannot produce the kind of data on the right. So what you're going to do is take these points and split them in two. Use half as if they were training, the other half as if they were a test. And so you now optimize k over the half you took out.
So whatever number of points you have, you split and you say, a percentage for training and a percentage to choose your parameters.
There are immediately a bunch of questions. If you give me the data, I split this part for training, compute the nearest neighbor solution for many possible k here, and then choose the k that gives me the best solution here. And then when I give you new data, you just use that k that you chose.
What's the problem with this? Do you think this is going to work or not? It almost might sound a little stupid [INAUDIBLE]. If you answer quickly, we can keep going, otherwise I don't know if you're all asleep.
AUDIENCE: But wouldn't it highly depend on where you draw the line?
LORENZO ROSASCO: Right. The key point here is whether your data set is huge-- this whole thing depends on where I draw the line.
If my data set is huge, you can say, well, it probably doesn't matter that much, because the two halves are going to be pretty similar. But we've seen before that if this guy is just 20 points, if I draw the line in different places, they're going to be very different. You're going to have completely different solutions. And so the basic idea is, in this big data kind of regime, where you have tons of points, maybe this is just fine. You can just draw one line.
If it's not, well, then typically you want to repeat this. The idea is that you're going to do it once, then draw the line in a different way, in a different way, in a different way. You average out, and then you take that as your estimate of the future error, in the attempt to reduce the variance.
And the first one, with a single split, is basically holdout cross validation. The rule of thumb is, if you have a lot of data, just take that. If you don't have that much data, just split more than once. And this is called k-fold or v-fold cross validation.
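A minimal sketch of the holdout version for choosing k, assuming the same kind of toy 1-D k-NN setup (the data, the 70% split, and the grid of k values are illustrative, not from the lecture):

```python
import numpy as np

def knn_predict_1d(x_train, y_train, x_query, k):
    """k-nearest-neighbor average in one dimension."""
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

def holdout_choose_k(x, y, ks, train_frac=0.7, seed=0):
    """Split once into train/validation; return the k with lowest validation error."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    tr, va = perm[:n_train], perm[n_train:]
    errs = {}
    for k in ks:
        preds = np.array([knn_predict_1d(x[tr], y[tr], xv, k) for xv in x[va]])
        errs[k] = float(np.mean((preds - y[va]) ** 2))
    return min(errs, key=errs.get)

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + 0.3 * rng.normal(size=200)
k_star = holdout_choose_k(x, y, ks=range(1, 41))
print(k_star)
```

Repeating the split several times and averaging the validation errors before picking k turns this into k-fold / v-fold cross validation at the cost of proportionally more computation.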
The extreme case is when you have so few points that you're basically going to take one point out at a time. I don't know if this is still true, but until five years ago there was a lot of biology [INAUDIBLE], for example, microarray data. Terrible diseases where you have few samples and you might have 50,000 features. And [INAUDIBLE]. That's kind of good; we don't see this kind of data much anymore.
At this point you don't really know how to split. So what you can do is take one person out, train, do it again, take another person out, and keep on going.
And what you clearly see is that, in general, you pay a price for your variance reduction in terms of computation. If you split just once, you compute one solution for each k, and then you're done. But if you repeat many times, you have to compute many solutions for k here, many solutions for k here, many solutions for k here. If you take one point out at a time, it might be too much. In some sense, there is really a trade-off between big data sets and small data sets.
AUDIENCE: Is there any analysis of what [INAUDIBLE] in general?
LORENZO ROSASCO: What, sorry?
AUDIENCE: How much your data [INAUDIBLE] to hold up.
LORENZO ROSASCO: So what I know is the following. For the holdout, so if you just split once, the only question is, I just draw the line once, but I'd like to know where. Basically, anything which is not crazy works, and you can provably show that it's as good as if you knew the bias-variance [INAUDIBLE]. So basically any rule, anything proportional to the number of points. Half, 70%, 80%, for example, are choices that do work, and that's kind of what you would do in practice, and you can prove that it works.
The question for which I believe there are some results-- and I've always been postponing looking carefully into it-- is an analysis that tells you how much you'd reduce the variance by choosing many splits of different sizes. And this is actually-- I think I do have pointers, but I'm not sure.
Cross validation is this thing where most people use it, and most people have no idea of the theory behind it [INAUDIBLE]. I'm [INAUDIBLE] better because I recently studied the holdout [INAUDIBLE]. But for k-fold, I'm still with that bunch of people. [INAUDIBLE] There are a few results. There are a few more than you think.
To some extent, this laziness is justified by the fact that none of these methods are working great. So even if you knew the theory, [INAUDIBLE]. At the end of the day, you have two, three choices and you kind of use those. I think the big problem is we need to come up with substantially new ways of doing this and try a different way.
Funny enough, this story is either very long or very short. It's very long because, I would say, a substantial amount of [INAUDIBLE] deals with this problem: finding adaptive choices that work as well as the bias-variance trade-off but are based only on the data.
On the other hand, the story is very short because in machine learning, it's cross validation. And so to some extent, there is this big gap, one of the biggest gaps between theory and practice ever.
AUDIENCE: One more question. To the general points that you're making for non-least-squares--
LORENZO ROSASCO: Yep. So basically, in this stuff here, I chose least squares, I chose this method, and I chose everything because of this simplicity-- aside from this term, you need to be a bit fancier to compute.
So here, you just have to intuitively convince yourself that it is actually increasing with k. But everything else, you can literally do as an exercise. So that's why I chose it. In general, you can definitely choose other loss functions. You can keep going, but you might not be able to get equalities. You might get inequalities. And then immediately, you're into the game of finding how well your upper bound matches your lower bound, and stuff like that.
Everything I said holds for general models of the data-- classification, regression of any kind, any loss function. But beyond that, it holds for density estimation. It holds for 90% of the estimation problems that I know of.
I wanted to show it to you because, even if you've seen it before, this is a very important pillar of pretty much everything you do. You want to know that an optimal value exists, and you want to know that in practice, cross validation is one way of finding it. And there is a big question of how to do it [INAUDIBLE].
And this was really the point of this first section of the class.
AUDIENCE: Is it OK to use a different loss function for setting the model and then choosing the hyperparameters? Do people do that?
LORENZO ROSASCO: The second question is easier. Yes. I do it all the time. For example, one question nobody asked, but which is asked quite often is-- why would you use least squares for classification, for example?
Turns out there are tons of reasons to do it. It would probably take me a bit more than a few seconds to explain. But typically, for example, I use least squares-- the one I'm going to show you in a second-- all the time for classification. But then I choose the parameter using another loss function, like just counting errors.
Why? Because typically, it works better. The part for which you really want to have a [INAUDIBLE] function is the training, because you want to try to get a convex, or at least easy, optimization problem from a computational point of view. But when you go and choose the model, you just have to compute it. So unless you think [INAUDIBLE], you just have to compute it and check if it's the best value. So you don't care [INAUDIBLE]. You really want to use the best one you can.
Do we have any analysis of this? I'm not sure. I'm actually kind of sure there isn't anything reasonable. But you do it all the time.
In some sense, especially in classification, the choice of the loss is not so much the best possible thing you can do, but the one you're forced to do. Because the one you would like to use is [INAUDIBLE]. So you just go to one you can solve in practice.
So as soon as you can go back to the real one, you do it.
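A sketch of that recipe: train with the square loss (the regularized least squares he is about to show), but select the regularization parameter by counting errors on a validation split. The synthetic data, noise level, and grid of lambdas are all illustrative assumptions, not from the lecture:

```python
import numpy as np

def rls_fit(X, y, lam):
    """Regularized least squares: solve (X^T X + n*lam*I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true + 0.5 * rng.normal(size=n))   # binary labels in {-1, +1}

tr, va = np.arange(140), np.arange(140, 200)         # single holdout split
errs = {}
for lam in 10.0 ** np.arange(-6, 2):
    w = rls_fit(X[tr], y[tr], lam)                    # train with the square loss
    errs[lam] = float(np.mean(np.sign(X[va] @ w) != y[va]))  # select with the 0-1 loss
best_lam = min(errs, key=errs.get)
print(best_lam, errs[best_lam])
```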
So this kind of ends the first part. And as I was saying, I just want to show you an instance, the simplest algorithm to show you this trade off between stability and fitting, and try to give you a glimpse of the theory that tells you how to resolve this trade-off in theory and actually how you can do it in practice.
And at this point, you can wonder-- if you tell me that they're simple to study, if you tell me they work reasonably well in some situations-- shall we just stop here and go home? And so one possible answer comes just from looking at the simple exercise. Kind of boring to read, but simple.
Take a cube. The edge of the cube is 1, so the whole volume is 1. Now, picture a smaller cube inside the big cube. Don't admire my amazing [INAUDIBLE]. I want to know how big this small cube has to be to occupy 1% of the volume of the big cube. The big cube has edge 1, so its volume is 1. I want the volume of this one to be 1%. And so I'm asking you how big its edge has to be. What do you have to do?
AUDIENCE: Take the cubic [INAUDIBLE]?
LORENZO ROSASCO: Right, you just have to take the p-th root or whatever it is.
So let's plot it. On this axis, p-- 1, 2, 3, 4-- is the actual dimension of the space. So if the space is two dimensional, you take the square root; if it's three dimensional, you take the third root, and so on and so forth. So what do you see?
If you take p equals 10, how big does the edge of the little cube have to be to occupy 1% of the volume of the big cube? How do you read it out?
You go here. You check the fraction of the volume you want. Then you go up. And you check the [INAUDIBLE] of the edge.
And you see that basically, in 10 dimensions-- not exactly a huge dimension for us-- the small cube, against our perception, is actually this big. And if you go to 100, the small cube is as big as the big cube. More or less, it will be 0.99 times something.
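The curve he's describing is just s = v^(1/p): the edge a sub-cube needs in order to occupy a fraction v of the unit cube in p dimensions. A quick check (illustrative, not lecture code):

```python
# Edge s of a sub-cube occupying fraction v of the unit cube in p dimensions: s = v**(1/p)
def edge_for_volume_fraction(v, p):
    return v ** (1.0 / p)

# In 1-D the 1% sub-cube is tiny; by p = 100 it is almost the whole cube.
for p in (1, 2, 3, 10, 100):
    print(p, round(edge_for_volume_fraction(0.01, p), 3))
```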
Did you ever see this before? Understanding what's happening in high dimension is a mess. And stuff like this is not meant to just give you a perfect picture of what's going on, but just to warn you.
When you go to high dimensions, stuff like [INAUDIBLE] being close, or distances, behaves completely differently. So you move in a space where, in some sense, everything is close, and everything is far away.
And the very notion of locality becomes tricky. If you just take locality for granted, you might get in trouble. This does not mean that k-nearest neighbors will not work.
If you can look at nearby points, it should work. If you can keep the distance small enough, whatever that means, it might work. But in general, small enough is a tricky concept. In general, the space is, in some sense, very empty. When you look at the number of points-- if you have 20 points, is it a lot or a little? Well, it depends on the dimension, in some sense. If my space is very small, one dimensional like here, maybe some 20, 50 points can be a reasonable amount. I don't know. But for sure, if I go and put my set in 10,000 dimensions, 50 points are going to be very few. And all this kind of stuff is going to have a dramatic effect.
And the point about local methods is that not only does distance become a tricky concept, but in some sense you never have enough points to do the local voting.
So oftentimes what I want to do is, I want to move from local methods to something global. So that the prediction here is going to exploit, to some extent, the presence of other points. I'm going to assume that the function here has some regularity so that if you know the value here, this actually might be able to inform the value somewhere else. Exactly because, as we said yesterday, we're going to make a smoothness assumption.
And then all of a sudden, maybe for the prediction here, I'm not going to look only at these five points here. I might be able to look at the whole 50 points. And then I borrow strength from far-away points. And so from local, I want to go global, to try to beat the terrible curse of dimensionality-- this idea that in high dimension, distance acts in a weird way, and the space is mostly empty.
And that's why we're going to move and look at something more complicated from an algorithmic point of view.
The first part is going to be basically the version for [INAUDIBLE] for you yesterday. So I'm just going to stick to matrices and just give you the simplest version.
And then the interesting part will be when I replace the simplest model with a slightly more complicated model. Good news: all the computations will be the simplest you can get, but the models will actually get kind of interesting. You can actually get state-of-the-art machine learning algorithms with three lines of code.
So the general gist here is going global and trying to impose smoothness. And this is going to be our main, main character.
I give you the data as before. For the time being, and for a little while, we're going to assume that the functions we look at are actually linear. And then we're going to take this kind of minimization. We're going to discuss it quite a bit. Basically, we saw this already yesterday.
Just as a historical remark, this kind of idea came out around the same time in different places. It's kind of funny, [INAUDIBLE] regularization.
It looks like this guy Phillips-- and this is the best picture you can find on the web, even though he's everywhere-- had exactly the same idea around the same time. In fact, some people do call it Tikhonov-Phillips. If you check the papers, they're very similar.
In some sense, Tikhonov had already developed other related mathematical results, so it was given a bit more context. But in terms of this method, Phillips had the same idea. [INAUDIBLE] statisticians had the same idea. They actually had [INAUDIBLE] what I'm showing here. As Mahadevan showed you yesterday, Tikhonov set everything up in a much more general [INAUDIBLE] theoretic framework.
Statisticians were more concerned with this problem. They call it ridge regression. But this is one of the first instances of the name game in machine learning, where the same thing gets invented about 15 different times, each time with a different name. At some point, you have to go back and do some theory to show that they're all the same.
The other thing I was reading yesterday is about Legendre, who was a big fan of least squares. I am a big fan of least squares. Mostly because they are very nice [INAUDIBLE] to study and they work pretty well. There are very few instances where they are destroyed by something else.
I think Mahadevan commented yesterday, and it's true-- there's an extra parenthesis here-- that least squares are less robust, which is definitely true. But when you regularize, you can recover a lot of the non-robustness in practice. The key issue is just that it's much more painful to choose the parameters. We're going to comment on this a bit.
To some extent, oftentimes in practice, least squares are not as bad as they could be, even though it is the square here that makes you pay so much for big errors.
So apparently, Legendre came up with the idea of least squares in 1805. I really thought that Gauss came up with this, but yesterday I was reading that it looks like Legendre came out with it in 1805. And Gauss, from the records [INAUDIBLE] a shy guy, claimed a few years later that he already had it in 1795.
And then there was a big fight, blah blah blah, but it looks like historians now believe that perhaps Gauss was not lying and he invented it before. I like the photo anyway, so good boy Legendre.
So what is this doing? Well, for the time being, that's what we're going to do. We're going to take our points as before, and we're going to draw a straight line. And we're not going to look at this because it's going to be interesting from a statistical point of view, but because [INAUDIBLE] everything we need from a computational point of view. Once we've got the computation right for this, everything else will be very easy.
One word of warning. I'll often draw this line. This line is not w transpose x. This is w transpose x equal to 0.
This is my attempt to [INAUDIBLE] transpose x. And the Italian version of this is that the data lies here, and then you have the plane cutting through. This is w transpose x, and this is just the zero level set.
So, if you do classification, you compute w, you apply it to a point, and then you take the sign. If it's positive, you say plus 1; if it's negative, you say minus 1.
This is a very simple remark. I remember the first time I saw it, I would often forget that this line here is not w transpose x. And if you make that mistake, then it becomes kind of complicated to recover. You get confused. So just keep that in mind.
So what we're going to do at the beginning is just a quick reminder: what are the computations underlying this kind of method? They're very simple. And next we're going to look into the statistics, but a lot of that was basically covered by Tommy and Mahadevan yesterday, so I'm not going to do that too much.
So, computations. This is why least squares is nice. If you wish, just write this in vector notation. Remember Xn is my data matrix: n is the number of points, d the number of dimensions, or d [INAUDIBLE] dimensions. And this is my vector of labels. [INAUDIBLE] write it in this vector notation, so that perhaps even more easily you can check that this is just the gradient of the first term, and this is the gradient of the second term, up to lambda.
And so now if you just set that gradient equal to 0, you get the normal equations as a [INAUDIBLE] problem.
Again, take-home message here: if you've seen this before, sorry; if you have not seen it, you just have to take the gradient and set it equal to 0, and that's everything you have to do.
And then what you have to do is solve this linear system. And we want to look at this linear system a little bit.
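As a sketch, the whole method really is a few lines: setting the gradient to zero gives the linear system (XᵀX + nλI)w = Xᵀy, which you can solve directly. The data, noise level, and choice of λ here are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def rls_fit(X, y, lam):
    """Solve the Tikhonov / ridge normal equations (X^T X + n*lam*I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 100, 5
w_true = rng.normal(size=d)                # ground-truth linear function (assumption)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)  # noisy labels

w_hat = rls_fit(X, y, lam=1e-3)
print(np.max(np.abs(w_hat - w_true)))      # small recovery error on this toy problem
```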
Again, this is going to be a sketch of what you saw yesterday. What is this thing doing from a numeric point of view, and from a statistical point of view?
So the simplest case to understand is basically when you look at a linear system. Let's say it's square, let's say it's easy. And then what you do, if you really want to make it simple, is you just take it to be a diagonal matrix.
It's not quite the identity; you put some numbers on the diagonal, and they go to 0. And then to solve this problem, we'll have to invert. But what's the inverse of a diagonal matrix?
AUDIENCE: 1 over the diagonal?
LORENZO ROSASCO: Just one over the diagonal. And I think you immediately see the problem. You already saw it yesterday. If an eigenvalue is small, even in this simple case, everything explodes.
A little perturbation here, a little perturbation there can make everything change. So what is Tikhonov doing in this case? Well, Tikhonov basically tells you, do this, but add a lambda. If this is big, lambda will not matter too much. But if this is small, lambda is going to kick in and keep this stable. And nothing is going to explode.
And this diagonal case is basically giving you the whole picture of what you're doing. You're basically doing filtering, and you can do this kind of filtering, as Mahadevan said yesterday, on modes, if you wish. So this would be something like low frequency, and this would be something like high frequency. And you can view this as a low-pass filter, if you want, from a filtering perspective.
You like the low frequencies and want to let them be, but if the frequencies are too high and are going to give potentially oscillatory behavior, you want to kill them. And this is one way of doing it. And in fact, if you look at this, you can probably think of another five ways of doing it, just by defining a different filter. Each gives a learning algorithm, but this is by far the simplest.
Of course, most of the time we don't have diagonal matrices. But if they aren't diagonal, in our case you can take the eigendecomposition of the matrix. And basically the same story holds; you just have to squeeze the inverse of this diagonal matrix of eigenvalues in between V and V transpose, the eigenvectors and their transpose.
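A quick numerical check of that statement (the matrix here is random, purely for illustration): inverting A + λI is the same as keeping the eigenvectors and replacing each eigenvalue σ by 1/(σ + λ), a low-pass filter on the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T          # symmetric positive semi-definite, plays the role of X^T X
lam = 0.1

# Tikhonov inverse computed directly...
direct = np.linalg.inv(A + lam * np.eye(5))

# ...equals filtering the spectrum: each eigenvalue sigma becomes 1/(sigma + lam),
# so large eigenvalues pass almost unchanged and small ones are damped, not exploded.
sigma, V = np.linalg.eigh(A)
filtered = V @ np.diag(1.0 / (sigma + lam)) @ V.T

print(np.allclose(direct, filtered))
```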
And this is just the way we're going to think of this. So notice that this matrix here, because of the way I built it, is in fact square. And how big is it? You have n points and you have d variables. How big is that?
LORENZO ROSASCO: This is d times d. And how big is this? d times 1. The basic idea here, from a numerical point of view, is that I'm stabilizing the problem. And [INAUDIBLE] this for the same reason we are [INAUDIBLE] the statistical problem. The sampling of the y or the perturbations of the x are not going to affect my problem, because I'm actually making the problem stable, numerically.
And this is one window over this interesting interaction between stability and generalization. Tommy told you about related ideas in much more general settings. But this is kind of an instance where you really see the interplay between statistics and computations.
And it might be worth mentioning that right now, because we have these huge data sets, you typically cannot just assume that you have perfect computations. This idea that you might want to trade off the stability of your computations with model accuracy is becoming one big trend. And the idea that you basically have to do estimation under computational budgets, [INAUDIBLE] budgets, memory budgets-- so you really want to know how these aspects interact with each other-- is one of the most active and exciting fields, from a practical and a theoretical point of view, in machine learning.
And to me, this is the simplest instance of these interactions in computation and statistical models.
One thing I don't want to tell you about, but I want to tell you exists, is a whole different point of view that I personally just discovered in the last couple of years. But Stein actually discovered it in '66. And it's a slightly different point of view on why you do regularization, which is not really taking an inverse problem point of view, but is definitely taking a stability point of view-- the interplay between stability and estimation.
Here, I'm just going to give you the key words: [INAUDIBLE] estimators in statistics, the so-called Stein effect, named after the person who first figured it out. And this idea of shrinkage-- that's what a lot of statisticians would call something like this. They would call it shrinkage, and the justification is not numerical stability, but in fact it was resolved by Stein. So I hope I've made you curious enough.
Plotting the solution for least squares is not the most exciting thing ever, but let's do it. What do you expect? You expect a straight line. Very straight.
You can probably check this. You can check that lambda still does what lambda is supposed to do, especially if I take fewer points and add noise. [INAUDIBLE] Lambda's this big. Let's see if I can move it around a bit. I get this guy.
I just fit linear least squares, and I'm just changing lambda a little bit. And what you expect is that if lambda is small, it should be a bit more unstable; if lambda is big, a bit less unstable. [INAUDIBLE] Take another data set. In this case, you can run it a few times. Basically here, this is the kind of stuff between [INAUDIBLE]. I don't know how much you've seen this kind of stuff. If this is your bread and butter, [INAUDIBLE]. But if you have not seen this before, if you just play around with it a bit, you get a decent intuition of how things will change.
And instead of doing it twice as I'm doing now, you can basically check-- if you want to force this to move around, you will probably have to take lambda very small, take the number of points very small, or increase the noise. So you get the instability popping out more easily. Here I'll stop because [INAUDIBLE]. Then I get all excited [INAUDIBLE]. That's kind of what this is all about.
OK, so that's about it. In the next couple of slides, what I want to show you is a substantial extension of this. So first of all, is this clear to everybody? I'm going to build on this for quite a while now. Does this make sense to everybody, both from a computational and a modelling perspective? This is a great moment to stop. Take this [INAUDIBLE], it's perfect.
We want to understand a couple of things. One is a question related to the computational complexity of the problem and how to [INAUDIBLE] as best we can. And it turns out this question is related to another question, which is: why should we look only at linear decision rules? Maybe I want to look at something like this.
As I told you, the good news is that from a computational point of view, we won't need to learn anything else. You can just play around and actually be able to look at stuff like this.
How do you do that? The simplest way is this. Somewhat rich conceptually, and very simple mathematically.
You take a vector x. This is x1, x2, x3 up to xd-- its components, OK? Say it's an image and these are its pixels. But then, instead of using just the pixels, what you do is define a bunch of measurements. If you want, functions that take the image-- maybe the whole image-- and return a number. One function, another function, another function. For example, I can take an image, take [INAUDIBLE], and take the average.
Then I take another [INAUDIBLE] of the image, then another patch, and I take another average. And then I take the same two patches-- the same two [INAUDIBLE], an average-- but in another image. And I define this dictionary of measurements.
You can think of them as patches. That's how they're sometimes called. You have the whole image, and then you extract these numbers.
For now we're not going to discuss how. In fact, I'm never going to discuss how. What I'm going to discuss is what you do with that. What you do is very simple.
You can just call this x tilde if you want. It's just a name for another representation of x. It's just another vector-- instead of x, another one, defined by these measurements. But it looks very similar conceptually if you now write w transpose x tilde. Notice that while x is d-dimensional, this vector might be p-dimensional, and typically you want p to be much bigger. And you'll see in a second why.
And if you write down this sum, you can actually expand the function this way. And then you see why I want to call it a dictionary. Because if you look at it from that perspective, what you see is that you can think of these, in the simplest case, as a basis. And this is just a linear combination over a basis.
So you're choosing a basis for this function space, and then you're just trying to find the best expansion in that space. This is the analytic perspective, and it's the one a lot of mathematicians would like.
The other perspective is actually more geometrical, and it's basically the following. It's the one that made the fortune of support vector machines.
And the idea is that you basically do the following: you actually map the data, and you work linearly there. So from a geometric point of view, you proceed as follows.
Suppose that this is your data set. Clearly not linearly separable. But now, this is x1, x2 mapped into three variables-- z1, z2, z3. This will be my x tilde, defined this way by the first two variables. [INAUDIBLE]
So I take x1. And I take the square. I take x2, and I take the square. And then I take x1 times x2, square root. I take each point-- x1, x2-- and I map it in a corresponding point defined by these three coordinates. Then I look for a linear solution. So this would be z1, z2, or z3 or whatever. I look for a linear solution in this new space here. And here I recycle everything I knew before. This is just the same as before.
But then I can go back. And what is linear here-- the linear separation here-- is no longer linear there. And this functional form will clearly show it to you. If you put, instead of the cj, these numbers, what do you get?
Instead of cj, just put this. You put it here, so you get x1 squared times w1, x2 squared times w2, x1 times x2 times w3. What is that function?
AUDIENCE: An ellipse? Ellipse.
LORENZO ROSASCO: Yeah, it's a polynomial. It's just a polynomial of degree 2. Remember, this is not the function; it's just the zero level set. What does the function look like? Exactly what we were saying. It's like a paraboloid coming out of here. Something that is positive here and goes down there.
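A sketch of this example in Python (the ring data and the regularization constant are made-up assumptions): mapping (x1, x2) to the degree-2 monomials makes two concentric rings, which no line can separate in the plane, linearly separable in the mapped space.

```python
import numpy as np

def feature_map(X):
    """Map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2): the degree-2 monomials."""
    return np.stack([X[:, 0] ** 2,
                     X[:, 1] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1]], axis=1)

rng = np.random.default_rng(0)
# Two rings: +1 inside radius 0.7, -1 outside radius 1.5 -- not linearly separable in the plane.
r = np.concatenate([rng.uniform(0.0, 0.7, 100), rng.uniform(1.5, 2.0, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = np.concatenate([np.ones(100), -np.ones(100)])

# Regularized linear least squares in the mapped space (plus an intercept column).
Z = np.hstack([feature_map(X), np.ones((200, 1))])
w = np.linalg.solve(Z.T @ Z + 1e-6 * np.eye(4), Z.T @ y)
acc = np.mean(np.sign(Z @ w) == y)
print(acc)
```

The learned separating plane in (z1, z2, z3) corresponds, back in the original coordinates, exactly to the degree-2 zero level set discussed above.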
AUDIENCE: I have a question. Do you have to know how to transform your data into this thing that's then linearly separable? This kind of seems like a chicken and the egg problem.
LORENZO ROSASCO: It is and it is not. It is, because you would like to have a principled way of doing it. So far I'm hiding how you choose this. It's a good comment, so let's have this discussion.
Whenever you have to choose stuff like this, you would love to have a theory that tells you how to do it-- or even better, how to learn this kind of stuff. And a lot of what you will see [INAUDIBLE]. A lot of ideas, for example, that bring different areas of science into machine learning are about how you actually build stuff like this. A whole lot of deep learning ideas that are around these days are basically related to how you do this. Right now, it's an exciting moment, because people are exactly thinking about this question. We are thinking about how you choose this.
I would say that until five years ago, there were two main trends. One was: this is not a theoretical problem, this is just an engineering problem. So what I'm going to do is just give a different name to the same thing I did before. If this was the name for the raw data, the pixels, this is just another name. And how do I get there? I go to an expert and I say, when you recognize a face, what are the things that you look at? And they say, maybe I'm going to look for a [INAUDIBLE]. And then I'm going to look for something else.
So you just try to build this by hand, and it becomes this hand crafting of interesting information. For example, in the case of text, you might do word counts for a given dictionary, and stuff like this. And the literature on this is ginormous. And just putting some [INAUDIBLE] is very complicated.
The other point of view is really coming from this perspective, and this is another answer to the question. So basically, what you do here is take a dictionary of non-linear functions and try to make it as large and rich as possible, so that no matter what this function is, the dictionary might capture it. You remember the other day we discussed, for example, the results about neural nets. You can prove that, for a proper choice, neural nets do correspond to some kind of choice of a model like this. And they're so-called universal estimators, meaning that you can approximate everything. And many other choices will have properties like this.
You can prove that with [INAUDIBLE] and many other choices of phi_j, you can approximate everything. You don't know how fast you will do it, but you can. The main point is that p might need to be very large-- possibly infinite.
So to some extent, the simple answer to the question is: choosing the phi_j is a big problem. The typical [INAUDIBLE] is, I'm going to just throw a lot of phi_j at the problem. I'm going to make this dictionary ginormous. And hopefully I'm going to let [INAUDIBLE] handle this high dimensionality and somewhat extract the kind of information which is useful for the problem.
For example, you can take every possible feature the dictionary has and just sort of shove them together, and then see if this gives you something that works.
So typically, the idea is that you really go in the direction where you first explode the number of dimensions. So you really map your data into very, very high dimensions. And then you let the supervised learning step somewhat cut down that unreasonably sized model.
The main point here is to move away from linear models into non-linear models. The simplest way is to look at dictionaries. And you have these two different perspectives.
The first one is more geometric, where you say, OK, I take the data and map them into a new space. I look at a linear model there, and then this model will no longer be linear in the original space.
And then the other perspective which is completely equivalent is, I take the set of atoms or dictionary elements, and I take linear combinations of this dictionary element. And this is how I define my function. These two things are completely equivalent.
What about the computations? And in some sense, the answer is here. If you just take this p to be finite, nothing changes. You're just changing the name.
You have to spot the difference here. The difference is just that there's a little tilde here, but everything else is the same. Of course, the dimensions are not the same. How big is this matrix? We said it was n by d; now that I map the data into a p dimensional space, it will be n by p. My data were two dimensional-- as they are, for example, in the stuff I'm showing you now. If I now map them into a very high dimensional space, I might go from two dimensions to, say, 2,000 dimensions or even more.
So that's the price you pay to get a more flexible statistical model: the potential degrees of freedom and the memory requirements of your model are much bigger.
And the same shows up if you look at computational complexity. Do you know what is the complexity of forming this vector? How many of you know the price of computing the inner product of two vectors? [INAUDIBLE]. You know? Good, because I prepared one slide that shows it, and I skipped it.
AUDIENCE: Isn't it unsolved?
LORENZO ROSASCO: I'm just thinking of a simple, stylized setting where you just have two vectors and have to multiply them. You have to do p operations, which are just multiplications. It's a computational model [INAUDIBLE] where we have infinite precision: multiplying two numbers costs one, and you just count the number of operations in this sense. Makes sense? So this is just a quick summary. You have two vectors of length p, you multiply them, and this costs you p. If you have a matrix times a vector, you have to do this n times, so it's np.
Matrix times matrix is np squared. And the inverse of a matrix-- not exactly, but at least in this case, roughly speaking-- is n cubed.
This is not [INAUDIBLE], but this is kind of [INAUDIBLE] one way of thinking, just to have an idea of the complexity of the methods we're looking at. And this is kind of what you know.
So these are the ones that are most important for us. And the basic idea really is to think of two vectors. Count the number of operations, that's how we think of it.
And so what is the complexity of this? Well, the main complexity is actually building this matrix. The data matrix is n by p; I take its transpose times itself, so I have to build a p by p matrix. And this is going to cost me p squared n.
And I'm thinking of a situation where n is big. I have many points and not so many dimensions. And so this is the leading term.
If p is much larger than n, then this would be the leading term. Does it make sense? This is the price to build this matrix. And this is, roughly speaking, the price of some solver to invert this matrix or find a solution to this problem.
Typically, inverting this matrix is not a good idea. You want to do a left divide-- you just want to solve this linear system. And the classical way of doing it in this case, because this matrix is square, [INAUDIBLE], and symmetric, is something like a Cholesky decomposition. And that's what the backslash command in Matlab will do for you.
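As a sketch of what that backslash-style solve looks like-- here in Python with NumPy rather than Matlab, with illustrative names of my own-- you form the p by p system once and solve it directly, never computing an explicit inverse.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y without forming an inverse.

    Building X^T X costs O(p^2 n); the solve itself is O(p^3), which is
    the regime you want when n >> p."""
    n, p = X.shape
    A = X.T @ X + lam * n * np.eye(p)   # p x p, symmetric positive definite
    b = X.T @ y
    return np.linalg.solve(A, b)        # plays the role of Matlab's backslash

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true                          # noiseless toy data
w = ridge_primal(X, y, lam=1e-8)
print(np.allclose(w, w_true, atol=1e-3))  # True: tiny lambda recovers w_true
```

With a symmetric positive definite system like this one, a dedicated Cholesky solver (e.g. `scipy.linalg.cho_solve`) would be the more idiomatic choice; `np.linalg.solve` keeps the sketch dependency-free.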
Now one question here is-- we just said, does it make sense? This is the complexity we get for this kind of problem.
Now, one hope is the following: can we do anything better if p is much, much larger than n? Because we kind of pointed at that situation as the interesting one.
Because I don't know what kind of model I want-- I just want a very rich model. So can I do anything better than this when p is much larger than n? What do you think?
So you have to do a one line calculation, which is very simple. And again, this is just a simple exercise. Let's just see what I did. So x transpose x, and here there's an x transpose. I want to take the x transpose and move it in front of everything. And the rule is that you can do it, but you have to switch the order of these guys inside.
So if you just write the spectral decomposition-- the singular value decomposition-- of this, it's immediate to see. Just use the SVD of the matrix Xn: write this, write this, and it's done.
So believe me, and try it as a little exercise to remember what the SVD is. Why do we care about this? Because this is nice: it used to be x transpose here, and now I can write it like this.
Because I can write it like this, it's now x transpose of this quantity. But this quantity is how big? How big is this matrix? This was d by d, or p by p. But this is just n by n.
And so this vector here-- when I apply this to the n by 1 vector, I just get an n dimensional vector.
But then I can write this. If I write this matrix-vector multiplication explicitly-- [INAUDIBLE] forget about the [INAUDIBLE] there-- the point is that w is a linear combination of the points in the training set, weighted by n coefficients.
But then immediately, we can check what the new computational price is. The computational price now is the one for solving whatever is written there. But what's the cost of this?
Before, building the matrix was p squared n, and then inverting it was p cubed. This one is n by n. And to get it, I have to squeeze together an n by p and a p by n matrix. What's the price? It's just n squared p. So p just enters linearly into the game. And then I have to do n cubed to invert it.
And then what's left is not an inversion, just a multiplication. So what I have is that from this complexity, p squared n plus p cubed, I go to this complexity, n squared p plus n cubed.
As you hoped, when the number of points is much smaller than the size of the dictionary, this is a big gain. And you get it just by playing around a bit with linear algebra.
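The one-line identity behind this can be checked numerically. The sketch below (Python with NumPy, my own setup) solves the same ridge problem both ways-- the p by p "primal" system and the n by n "dual" system with w = X^T c-- and confirms they give the same w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 2000, 0.1              # few points, huge dictionary
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Primal: a p x p system, cost ~ p^2 n + p^3.
w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)

# Dual: an n x n system, cost ~ n^2 p + n^3.  The same w, written as a
# linear combination of the training points: w = X^T c.
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
w_dual = X.T @ c

print(np.allclose(w_primal, w_dual))   # True: the two formulations agree
```

The identity being used is (X^T X + lam*n*I)^{-1} X^T = X^T (X X^T + lam*n*I)^{-1}, which is exactly the "move x transpose in front" manipulation from the lecture.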
The situation right now, unfortunately, is that there actually start to be some specific problems where this is huge and this is huge-- both n and p-- and so none of these tricks works. We have to do something better. And in some sense, the other big trend right now, because of the new, ginormous data sets we have at our disposal, is to find exact and approximate ways to kill this complexity-- to try to be linear in both p and n, or perhaps even sublinear for some restricted class of problems.
But again, this is as much as you need to know. You can make it more complicated, and really simple methods can actually be built on top of this kind of stuff.
For example, there is some recent work showing that if you have a huge problem, you can actually split it into pieces, solve each piece separately, and then just average. And as long as you choose the parameter for each piece by looking at the performance of the aggregated solution, this actually works pretty well. You split, split, split, split, and you take the average of what you get in the end. But to choose the parameter, you don't look at each individual split-- you just look at the aggregate solution. And this can be analyzed from a statistical point of view: because of the independence of the splits, it actually works pretty well.
So that's just one possible way; there are a bunch of others. The one that I actually like is where you [INAUDIBLE] dip into the details of how you compute this, and basically show that you can use first order methods like [INAUDIBLE] gradient, gradient descent, accelerated gradient descent, and a bunch of others. And in those cases, in fact, you can do something cool, which is to put lambda equal to 0 and just let the iteration itself regularize. This is called early stopping. It's just a different point of view on the way you regularize. This is something I think is cool, and I can discuss it offline with you.
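A minimal sketch of early stopping, under my own setup (Python with NumPy): run plain gradient descent on the unregularized least squares objective, and let the number of iterations play the role of lambda-- few iterations give a heavily regularized, small-norm solution; many iterations approach the least squares fit.

```python
import numpy as np

def gd_least_squares(X, y, n_iters, step=None):
    """Gradient descent on the plain least squares objective (lambda = 0).

    The number of iterations acts as the regularization parameter:
    stopping early gives a smaller-norm, smoother solution; running
    longer fits the data more closely (early stopping)."""
    n, p = X.shape
    if step is None:
        # A safe step size: n / (largest singular value of X)^2,
        # i.e. 1 over the Lipschitz constant of the gradient.
        step = n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(p)
    for _ in range(n_iters):
        w -= step * (X.T @ (X @ w - y)) / n   # gradient of (1/2n)||Xw - y||^2
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ np.ones(5) + 0.1 * rng.standard_normal(200)

w_few = gd_least_squares(X, y, n_iters=2)     # heavily "regularized"
w_many = gd_least_squares(X, y, n_iters=500)  # close to the least squares fit
print(np.linalg.norm(w_few), np.linalg.norm(w_many))
```

Starting from zero, the norm of the iterate grows monotonically and the training error shrinks monotonically, which is exactly the regularization-path behavior that makes the iteration count usable as a tuning knob.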
So this is the basic idea, and these are like the building blocks for a lot of different solutions. The cool thing is that along the way of doing this, you actually discover something else. w is written this way, but oftentimes you don't care about w. All you care about is x transpose w, which is the prediction.
So because w is just this, x transpose w-- our f of x-- is just this quantity. And this is really a dense slide, but in some sense, to compute this quantity-- as long as you can compute this inner product-- you can replace this with an infinite dimensional vector. Because if you look at everything here, this quantity is n dimensional, and you just have to compute inner products and sums over n points.
So if you have a way to compute this inner product efficiently, you can actually move away from a finite dimensional dictionary, potentially looking at an infinite dimensional dictionary. And this is somewhat what people do implicitly. Again, suppose for a second that you have a way to compute something like this. Then you replace this expression-- I just give the name Kn to this matrix here, which is obtained by replacing the inner product of x and xi with k of x and xi.
So what is this k? For example, it could be any of these guys. The simplest choice is just the one we discussed in the last few minutes.
But another choice is, for example, this guy: you take x transpose x prime plus 1, and you take the d-th power. And the last example is the one which is really more powerful. For each of these, you can prove that it is in fact equivalent to choosing a dictionary phi_j as I showed you before, which is in fact finite dimensional. This one is trivial; this one is a bit more complicated. But you can do it.
The last example is the one where, instead of just taking the inner product, you actually replace it with this Gaussian. So my function becomes a combination of Gaussians, centered at the training set points. One can prove that in this case, there is no finite dimensional characterization: this is not a finite dictionary. It is in fact an infinite dictionary. In the simplest case, in one dimension, you can basically show this explicitly. Also-- and here I've kind of abused the terminology-- I didn't point out at all the point of view of Gaussian processes, which is another big, classic perspective that can be used in a lot of modeling problems.
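A sketch of kernel least squares with the Gaussian kernel (Python with NumPy; all function names here are mine): you solve an n by n system for coefficients c, and the learned function is a combination of Gaussians centered at the training points, evaluated by the same kernel.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma):
    """Solve (K + lam * n * I) c = y; then f(x) = sum_i c_i k(x, x_i)."""
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, c, X_test, sigma):
    return gaussian_kernel(X_test, X_train, sigma) @ c

# Two classes that are not linearly separable: inside vs outside a
# circle.  A combination of Gaussians centered at the training points
# separates them easily.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(1) < 1, 1.0, -1.0)

c = kernel_ridge_fit(X, y, lam=1e-6, sigma=0.5)
pred = np.sign(kernel_ridge_predict(X, c, X, sigma=0.5))
print((pred == y).mean())  # training accuracy
```

Note that the dictionary never appears: only the kernel matrix does, which is why an infinite dictionary is affordable here.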
And there is a beautiful connection with ideas in [INAUDIBLE] equations which is completely invisible here-- sampling theorems, inverse problems. I didn't go there just because, again, I'm trying to give you algorithms and the way to use them; I want to stay away from too much math and theory. Other stuff that I didn't tell you about at all is the loss, which is actually more important for what I'm trying to achieve today: there are a lot of different generalizations of this. For example, instead of the squared loss, you can take something else.
You take the logistic loss function, and you get logistic regression and kernel logistic regression. You take the hinge loss, and you get support vector machines. Once you know this, you just twist it a little bit, and you know all these other algorithms.
Another thing you can do is to ask: why binary labels or real valued functions? Can I learn vector valued functions? Sure. It's a bit harder-- you don't have a very good idea of how to get function spaces-- but you can do it. And I didn't tell you how, but you can use [INAUDIBLE] functions, for example, to go from binary classification to multi-category classification.
And again, the basic solution will be a minor twist over the solution I just showed you.
From a practical perspective, the thing you should remember is that now you can build your own dictionary, but you can also use these kernels, which will give you a bunch of different solutions. [INAUDIBLE] Just one second. Wow, we drew a lot of stuff.
So now let's see what happens if I switch from this to a Gaussian [INAUDIBLE]. This game I'm playing is kind of the one game all machine learning practitioners spend most of their time on when they have a problem. They run this kind of algorithm; they try the linear kernel first. And it's not so good here, but in high dimensions, it might do well.
Let's try the Gaussian kernel, and let's play a bit with the sigma. This was [INAUDIBLE] because that's exactly what I'm doing. So let's take the Gaussian kernel. I take the regularization parameters-- this is the lambda, this is the sigma, the width. Train it. Isn't that great? Pretty cool.
Of course, for all these parameters that I'm throwing in there, I could and should do cross validation. This is just to show you what you get. This is kind of the general idea: this is what you get with least squares with a linear kernel, and this is what you get with the Gaussian kernel.
How should you think about this? Remember, this is not f of x; this is f of x equals 0. What does f of x look like? It looks like this.
You sit on each point and put a Gaussian of width 0.04, or whatever I chose, and do it on every single point. Then you have to choose the height of the Gaussian, and the sign: if it's positive, it comes out; if it's negative, it goes down. Then you connect all the tips of these mountains, and you get something up here and something down there.
And then you look at the level set 0, and this is what you get.
And here I was lucky-- it looks like I chose a sigma which was reasonable. Let's see what happens if I try to make it much smaller. Nothing.
I'm really going a bit random here because I don't know the numerical [INAUDIBLE]. I didn't look at the eigenvalues, I don't know what it is. [INAUDIBLE]
I just changed sigma and made it very small-- very, very small, because there is a Gaussian here. And basically, I can get to stuff like this. And what you see is that this is really probably a mountain and a valley, and it goes up again. And for some reason, it wants to go up here again. This is really something going up and down. I wanted to show you what happens if I try to make this smaller [INAUDIBLE].
So what I'm going to show you is that, if you start to play around a bit with sigma, what you can do is really put a bunch of Gaussians-- I didn't prepare any setting, I'm just playing live on this. But you can literally go and start to-- look at this solution. This is really clear: there is one Gaussian here, one Gaussian here, and one Gaussian here, pretty much.
So you can circle out points. These Gaussians are small; they don't see each other. This starts to behave very similarly to a local method: you just put a Gaussian around each point. So using the Gaussian kernel, you can get this whole range of complexity. Does that make sense?
So if you go on the other side, and you put here [INAUDIBLE]. If you go on the other side-- if you do regularize-- the point I want to show you is that the more you make the Gaussian big, and the more you regularize, the more-- this is exactly the same data set as before. But if I keep on going, in the end it's going to be essentially like the linear kernel. You make the Gaussian so big that essentially you're going to have a function which is very smooth.
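The two extremes of sigma can be seen directly in the kernel matrix itself. A small sketch (Python with NumPy, my own grid of points): with a tiny sigma the kernel matrix tends to the identity-- each point only "sees" itself, the local-method regime-- while with a huge sigma every entry tends to 1 and the fitted function becomes extremely smooth.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# A 7 x 7 grid of points in [-2, 2]^2 (nearest neighbors are 2/3 apart).
g = np.linspace(-2, 2, 7)
X = np.array(np.meshgrid(g, g)).reshape(2, -1).T

# Tiny sigma: off-diagonal entries vanish and K -> identity.  Each point
# only sees itself, so the method can circle out individual points.
K_small = gaussian_kernel(X, X, sigma=0.05)
print(np.allclose(K_small, np.eye(49)))                 # True

# Huge sigma: every entry -> 1 and K -> the all-ones matrix.  All points
# look alike, heading toward the very rigid, smooth regime.
K_large = gaussian_kernel(X, X, sigma=100.0)
print(np.allclose(K_large, np.ones((49, 49)), atol=1e-2))  # True
```

In between these extremes sits the useful range of sigma, which is exactly what cross validation should pick.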
These algorithms are really the workhorses of most applications. It might not be least squares-- it might be support vector machines or logistic regression-- but this is what's going on. A lot of state of the art solutions for supervised learning are based on these methods, either in intelligent systems or just to make predictions in all kinds of practical problems. We worked for about an hour on this, and you learned that you can play around and switch from one complexity to another. Essentially using the same code-- just changing the preprocessing, choosing different dictionaries or different kernels-- we can actually get a whole range of different kinds of models, from linear to something more complicated.
So we can actually start to ask what else we need to solve supervised learning problems. It's a pretty big toolbox already. But before asking what else we need, I think we do need to take a break. Goodbye. And think of what else you want to see.