Tomaso Poggio: Learning as the Prototypical Inverse Problem
May 29, 2014
May 29, 2014
All Captioned Videos Brains, Minds and Machines Summer Course 2014
Topics: Overview of learning tasks and methods, ill-posedness and regularization, basic concepts and notation; supervised learning: given training set of labeled examples drawn from a probability distribution, find a function that predicts the outputs from the inputs; noise and sampling issues; goal is to make predictions about future data (generalization); loss function measures the error between actual and predicted values; examples of loss functions for regression and binary classification; expected risk measures loss averaged over unknown distribution; empirical risk as proxy for expected risk; hypothesis space of functions or models to search (e.g. linear functions, polynomial, RBFs, Sobolev spaces); minimizing empirical risk; learning algorithm should generalize and be well-posed, e.g. stable; regularization: classical way to restore well-posedness and ensure generalization; Tikhonov regularization; intelligent behavior optimises under constraints that are critical to problem solving and generalization
TOMASO POGGIO: Yeah, I think we'll pair people. Is that the idea?
AUDIENCE: Given a group code, we can download the [INAUDIBLE]. But I mean, these with grouping [INAUDIBLE]. Yeah.
TOMASO POGGIO: So I think we ran a bit over and were constrained by lunch, which ends at 1:00, I think. So let me be a bit faster and [INAUDIBLE]. This is kind of a introduction to what we need for tomorrow, supervised learning.
So there are different-- learning is a big word. But there are different types of learning. Here is only a partial list. Tomorrow we see some more. Today I am speaking only about supervised learning. You have examples, which are the input/output, x and y. We are trying to learn a relation, a model, a theory, a function that will allow you to predict new output for new inputs.
And one of the themes, as you saw today, for learning is imposedness and generalization. You heard about imposedness. I'll tell you more about generalization, which is another word for prediction, predictability. And these are-- so between today and tomorrow, you'll hear quite a bit about ideas that are at the core of this classifier, regressors, RBS, support vector machines and so on, which are stuff that works. These are kind of things that are used in many of the applications I mentioned earlier, like the mobilized systems.
So I'll tell you only about the first two parts. The third one will be essentially tomorrow. Just some basic concepts on supervised learning. And then just an overview, an idea of the foundational results. Some of them are quite philosophically deep. Have to do-- if one once about the nature of science and what is scientific.
So the problem of supervised learning, as we said, given a training set. And that's a number, and possibly very large, of input/output pairs. And your goal is to find a function which depends on the training set. So it'll change as you change up the training sets, which can predict y for mu x.
Now, for this to be, there's not always need to be possible. It may be impossible. There must be some relations between x and y for you to be able to make some predictions. For instance, one counter-example I had. Suppose I give you, as inputs, you know, names of people. And then as outputs, I give you their telephone numbers.
I give you as many example as you want. Telephone book after telephone book. And I bet you cannot find the [INAUDIBLE]. The telephone number given the names. Right? It's more or less completely random function. If somebody's willing to try the experiment, let me know.
So that's the kind of notation. We assume that the training set is thrown with a IIP. And so it's an independent and identically distributed random variables. And the identically, it's a crucial assumption. And there is a probability distribution, but you don't know it. What you have is only this training set, which depends on the underlying probability distribution. But in a way, you don't know. So you don't know the probability px, comma y. If you'd know that, problem would be solved. So you are trying to infer something without knowing this probability distribution, but only samples out of--
For instance, if you know this py of x, that's really what you'd like to know. Now, we assume that our probability is, in general, because even if the underlying process is deterministic, there is certainly noise in the measurements of all the underlying processes make. [INAUDIBLE]. So there are many different phases.
For instance, suppose this is the deterministical relation between x and y, and sine [INAUDIBLE]. If the probability distribution of these for p of x, for our input part, as we know, then this means you're getting many more samples where there are peaks in the probability distributions and [INAUDIBLE]. Or in this case, vice versa. And so there is always noise involved.
For instance, if you-- let's see. Let me use this blackboard for a second. If you have a spring and you have Hooke's law, that the force depends linearly on the extension of the spring, then the probability of f given x will be a data of f minus a of x. No noise. OK?
But if you have any noise, like additive Gaussian noise, then this probability would be a Gaussian limit.
There is a key thing you want to do in learning, which is you want to make prediction. You want to generalize. And this is kind of different from traditional explanatory statistics. It's just not simply-- you're not happy just to explain the data. But you want to use the data to make predictions. The ability to generalize is also closely related to over-fitting to avoiding over-fitting.
Now, if I want to measure how well I'm generalizing, I have to first decide about how I measure the arrows. And so typical notations in [INAUDIBLE] learning is loss function that measure the error between your prediction and the actual true value one. So usual loss function, the traditional one is the one we have seen [INAUDIBLE] before is the square loss. Yet two loss.
But there are other choices. Like, for instance, the absolute value or the so-called [INAUDIBLE] insensitive loss, introduced by Vladimir Vapnik. And I may have it. Let's see. Yeah. So for instance, this is a graphic loss. Here is the error. And this is the absolute value. And this is the absolute insensitive loss. This is all in the cases of regression, which y can be a regular number.
Now, in the case of classification, when y is either at zero or one, then I can have the other types of losses, which are one is the hinge loss, which means you pay a lot. One, if you are wrong, your sign's wrong in your f. Or you pay the-- this is the analogs of the square loss. You pay the square of your error. So y is 0, 1. And this is what you pay. And as you can see here, this hinge loss, which is the positive part of this error-- 1 minus y times f. y times your prediction.
If you look here, you see that in the hinge loss, would be this kind of loss here. So they are all closely related. They are kind of one an approximation of the other.
But for instance, when we see tomorrow this square loss re-polarization and support vector machines, the real difference-- they are very similar to each other. The only difference is that a different loss function today is square loss versus the [INAUDIBLE]. And this gives rise to a different optimization problem. But the basic structure of the solution and the basic properties are exactly the same.
So now, how can I define generalization in a quantitative way? The key notion is the one of expected loss. So this is the expectation. This is the unknown probability distribution. This is my loss function. Think of f of x minus y squared, for instance. And this is the probability distribution. If I would have the probability distribution, this is a measure of the expected loss, whatever they expect in the future, in the average.
And so this would be very ideal if I want to measure my future error, error of a function I have. Suppose I have an algorithm. I'm hedge fund. I want to predict the market tomorrow. And this would be the expected value of, for instance, how much I'm losing or making tomorrow. Right? So the error I'm making in the error in the future.
The problem is that I don't have the probability distribution, as we said. So if I would add the probability distribution, of course, a good idea would be to find the minimum of that loss, to find the f, the function that minimizes the error that it predicts best. This would give me what somebody called the target function-- the minimizer of the error, the best function.
Now, so these are in summary. The distribution I don't know. The training set I label n, because that's the number of examples we can change. This is the loss function. This is the expected risk. And this is what, since I don't know this, a proxy for [INAUDIBLE]. It's the empirical is what I can measure on my training set. I have these 10 examples. And I just take the loss-- so the difference between the prediction, according to the metric, the loss function I use, and the actual true value. And I take the average of that.
And using this, using this proxy, I can measure this empirical error for a specific function. So I take a [INAUDIBLE], and I can try also to think, well, I could minimize it. That's the most natural thing. Look for the f that minimizes this empirical.
Now, the key question is, if I want to do these presets again, I'm a hedge fund, I found a [INAUDIBLE] from past data about the stock market. I find that this is the error of the past. Find it's pretty good. Doing well. But what I would like, really, to know before I bet $1 billion on this, I want to know that this empirical error that I measured is actually a good proxy for the expected.
This is the kind of warranty that I want to know it on my algorithm. I want to know that, as I increase the number of data in my training set, the probability of the difference between the two being less than epsilon is very large.
And of course, I would like to know, in reality, some bounds on it, you know, for how many data I need to know to have in order to have this epsilon [INAUDIBLE] be good enough for that $1 billion. OK.
So it turns out that minimization problem, finding the best f just from the empirical-- minimize the empirical error, it's actually not even well-- it's not well defined. It's completely well posed. I have to define the space of function of the [INAUDIBLE]. I do the minimization. What is the set of models of functions of relations?
So this is called the hypothesis space. And the intuition is that I have to restrict this. I cannot allow everything to be there, right? I have to say which is the class of functions, of models, of theories. It turns out this is-- restricting this appropriately is the key to making the problem well-posed and simultaneously, also to have predictability. OK?
So now, for instance, I could restrict this space to be linear functions or polynomials. How many of you know what the Sobolev space is? OK. Yeah, and the Sobolev space, you define it in terms of norms of derivatives-- bounds of the norms of derivatives. You don't really need to know. But I'll give you a little insight data, that it's this type relation between restricting the hypothesis space appropriately and smoothness of the functions which are in this hypothesis space.
So let's look at this class of algorithm, which are called Empirical Risk Minimization. And already, describing what it is. It's finding an f that minimizes the empirical risk. And the key here is to define the space of function here, h, in such a way that the problem is well posed and gives you the function it can predict. So that's, again, repeating what I want.
I want functions that minimizes this proxy for the expected error, is the empirical risk. And what I would like to have is generalization. So number of examples that goes to infinity, you want to have that this proxy becomes the same as the expected error. Right?
And I'll give you example. It seems something kind of trivial, but it is not. In most cases, that's not true for h arbitrary, for [INAUDIBLE]. You'd say maybe f is a well-behaved function, like continuous functions. That does not work. So you need something more in terms of which hypothesis space. So you want that.
And you also-- of course also-- well-posed, essentially, stability. Again, if I am a hedge fund, I have used historical data on the S&P for capturing this algorithm that the Greeks wonder if tomorrow, it'd go up or down. I find that if I perturb a little bit, one of the component of my x vectors in the examples-- so for instance, one of the components is the closing price at 4:00 PM the very day of the NASDAQ. That's one of my regressors. I perturb it a little bit. And the function I get out now as a predictor changes a lot. Well, that's very [INAUDIBLE], because the closing prices is quite noisy.
So you want to have stability of your solution. And so this is really the key property of well-posedness if you weren't aware of when you saw, for instance, your minimization.
So of course, you want exist unique, these are typically, you can ensure those. But this one is not trivial at all. And it's key [INAUDIBLE]. Those are the two conditions-- generalization, stability.
Here are examples. Suppose, again, it's the one-dimensional case of function interpolation, or approximation, actually. Here are some sparse data. And suppose these are samples from this function. Now, suppose you have your algorithm giving this solution. And the question is, if I get more data, will it converge to the correct solution?
This is a condition for generalization. That's a case where I have 10 sample. My hypothesis space is polynomial of the 10th degree. So it fits exactly the data, because it has enough degrees of freedom.
But when I perturb the points like this, the solution changes a lot in certain points. It's very unstable. If I restrict, make this hypothesis space simpler-- so now you have the [INAUDIBLE] polynomials. Then I don't fit exactly the data, but the solution is much more stable when I perturb.
So now this is an interesting question-- that there is a lot of learning theory typically Vapnik have been concerned about the problem of generalization, predictability. But independently-- and so one of the conditions that is known guarantees generalization is for the hypothesis space to be compact. And I'll tell you in a second what it is. And it turned out that it was well known since Tikhonov that the same property guarantees what in many problems.
So the question which was open until a few years ago is whether or not these conditions of stability on one hand and generalization on the other are related or not. It turns out they are equivalent and that the appropriate definition of stability, which is, I find, very interesting.
So if you ensure one of the two, the other is automatically false.
And essentially, it's easy to see, in the case of-- let me see if I can let me give a feeling for-- OK. So if we look at this problem, one could think, how can I have an H space of function that could do a reasonable job? It could ensure stability. And generalization. Let's say generalization. One could say, continuous functions. But as you can see, I can make many, many infinite number of continuous functions go through those [AUDIO OUT]. So continuing it by itself is not enough.
Now, on the other hand, if I impose some condition of smoothness, the intuition is if I, for instance, say that the [INAUDIBLE] function cannot have, say, a derivative which is higher than something at endpoint, then the intuition is this may do the job, simply restrict the number of function I can put through the [AUDIO OUT].
And there is this definition, this property of a space of function to be compact. And this means-- can you see, actually?
TOMASO POGGIO: OK. That's a good idea. That's my space of hypothesis. They're a function f na. And if I have that f is uniformly bound for all x, all f are such that f of x is less than a for all x. OK? They cannot be larger or something.
And then if I have that the space H is equi-continuous-- anybody knows what it is? OK. You know what continuity is. So f is continuous if, for any, let's see, delta, there is an epsilon such that-- let's see. I can take x prime minus x2 plus then delta and then f of x is f and epsilon. OK? And if this is true for-- I wanted this to be true for all x prime and x double.
Now, this is continuity. We saw it's not enough to impose continuity. But equi-continuity means that this is true for all f in the space, uniformly. So using the same delta epsilon. That's the trick.
Now, if you have these two properties, equi-continuity-- so this is for all f. And bound to this. There is a theorem from [INAUDIBLE] that say that space is compact. Compactness essentially says that this is probably the best way to describe it. Connect the space to smoothness, to how smooth the functions in the space are. Because it's like saying the derivative, that cannot change too much. And no function in the space can change too much. Uniform. There is a bound for all. Yeah?
AUDIENCE: I'm a little confused. So the rest of this is an expectation, right?
TOMASO POGGIO: Sorry?
AUDIENCE: The [INAUDIBLE], that's an expectation, right? And so if you're looking for--
TOMASO POGGIO: We are not speaking about minimization now.
AUDIENCE: Sorry, expectation.
TOMASO POGGIO: Yeah. But we're not speaking-- I'm just telling you which is the property of the space of functions, right?
AUDIENCE: Right, right.
TOMASO POGGIO: The hypothesis space.
AUDIENCE: So I-- my memory is that it's something like we started to roll functions or something like that. I don't know. Like, this isn't specific to learning, is it?
TOMASO POGGIO: This is not specific to learning, but a condition for the Empirical Risk Minimization to generalize. In other words, this condition. Yeah.
TOMASO POGGIO: This, for instance, is guaranteed if you do Empirical Risk Minimization in a space of functions which is compact.
AUDIENCE: OK. And what you're doing right now is just trying to characterize which functions have disproved that--
TOMASO POGGIO: I'll telling you both are, yeah, a property of a space of function which is compact.
TOMASO POGGIO: These functions is compact, and this is equivalent to equi-continuity. So that's a connection with smoothness. Why smoothness is important. Equi-continuity is very similar to-- it's essentially a smoothness condition on all the functions in the space. Is that-- He seems pretty puzzled. [LAUGHS]
AUDIENCE: Well, yeah. I think I remember [INAUDIBLE] is one way of getting convergence of expectations [INAUDIBLE].
TOMASO POGGIO: OK. So in fact, I mentioned this very quickly. This compactness is a condition which is sufficient for generalization. It's sufficient also for stability. But there are conditions that are more general that are necessary and sufficient. And this is a characterization in terms of what is called Uniform [INAUDIBLE] Function. I just want to mention it.
Now, you don't need it, not even for tomorrow. And there is a corresponding theorem that defines the [INAUDIBLE] in a certain way, which I'm not going to tell you much about it, apart from the fact that it's something like deleting one of the [INAUDIBLE] points and measuring how much your prediction changes.
But that shows that the conditions is condition of an hypothesis space being, you can see, is equivalent to stability. So ensuring stability and generalization are completely equivalent under these definitions. They correspond to the same constraints on H. They're both guaranteed by stability.
So now it's nice that you can get to those two conditions if you'd like to have the ability to predict and stability with respect to the data, you can get them simultaneously. And you can look at either it's having a stable algorithm or restricting the hypothesis space in empirical [INAUDIBLE].
And so the kind of deep philosophical application is something like this cartoon of what science is about. It's kind of starting with Newton taking measurements of the apple falling. And try to find a role, a function that could fit the data and be predicted for future experiments.
So a kind of learning theory, like [INAUDIBLE] that you, in order to do this, in order not to simply feed the data, but also be predicted, you need to restrict the space of possible models, theories, before you'll see the data. Only at the end, a theory can be predicted. If you just feed the data, then you have been astrology, but not astronomy, right? You are just explaining the data. That's often what the financial newspapers do. They tell you, today the market went down, because people sold [INAUDIBLE]. And that's not something you can use to predict tomorrow.
But the other part of the results tells you theories should be stable. So what does that mean is that if you change, you add new data, most of the time, they should not change much. In other words, you cannot expect relativity theory to come very often. It's a kind of tautology. Science has to be predictive, then it cannot be violated all the time. So there is those connections between prediction and stability.
Now, tomorrow, you hear about regularization and Mann spoke about regularization, Tikhonov regularization. This r of f, it's a function that you minimize together with the empirical risk. And one way to look at it is to think of minimizing the empirical risk, ERM. But under some constraints on the [INAUDIBLE] space. Right? Or else it could be the norm of S, this r of x.
And so one way to solve the problem like this is to use Lagrangian multipliers, which means minimizing something like-- turns out Tikhonov was not exactly the same as completely equivalent to [INAUDIBLE]. But that's one way to look at it. And so this is the connection between minimizing the empirical risk under the constraint on the hypothesis space, and looking at Tikhonov regularization as a way to do this.
AUDIENCE: Sorry, can I just ask the question probably in clarification to myself. So basically, what Ivanov regularization is doing is saying there's this hypothesis space, and some parts of it are revoking. And therefore, the parts that remain will ensure stability. Whereas, Tikhonov was saying, any part of the hypothesis space is possible, but the complexity of the hypothesis, in a way, is going to penalize you, pushing you more towards the stable parts of a hypothesis space.
TOMASO POGGIO: Yeah. But just from the point of view of optimization, one way to solve this constraint problem is to solve this, thinking of gamma as a Lagrangian multiplier. And solve for this, right?
AUDIENCE: I can understand now.
TOMASO POGGIO: There's another interpretation that was mentioned before intense probabalistic interpretation. It's this is related to the prior on f. And this is a probabilistic model of your measurements process. In that sense, this would correspond to have the constraints. It would be easier to see the [INAUDIBLE] here. OK.
So that's essentially summarizing what I just said-- Tikhonov regularization. And even of ensured [INAUDIBLE], especially stability, and ensured generalization. And there is this close connection between ERM on a constrained hypothesis space.
The story, of course, is that you have to put some constraints in order to solve the inverse problem. You have to put some constraints in order to solve the problem of learning. You have to put some constraints in order to get generalization. Constraints may be a lot of different things. It turns out that if you have very specific constraints, like for instance, you know that you have a linear mapping between input and output, that's a very strong constraints. You should use it.
But if you are ignorant, then there are these minimal constraints you have to assume if you want a general. So by the generalized-- I should say that-- does not mean that your solution is good. Generalization means that the empirical error, what you measure on the training set is a good proxy for what will happen in the future.
So if I'm the hedge fund, I have to find an algorithm that works well of my data. And I want to have a guarantee on this type of algorithms that, in general, when I will know that my small error [INAUDIBLE] will be within a certain probabilities small error in the future. But generalization does not guarantee that automatically, what I find is a good predictor on the empirical side. I have to measure. But that's easy to do.
AUDIENCE: So it's just the same sort of [INAUDIBLE] decision and accuracy?
TOMASO POGGIO: No, it's more a distinction. You know, one of the standard preoccupation of all the statisticians is consistently, universal consistency. And in our case, this would mean that our algorithm converges to the best possible solution. You can add those guarantees here, but I think in practice, that's my personal [INAUDIBLE], much less important than generalization. Because I can measure the error, the empiric. And I do in practice all the time. What I need to know is that this is a good proxy for the future. That's generalization. It's not consistency. OK.
I think we have a window for lunch, which is 40-- no, 50 minutes. So probably better to stop here. If you have some questions, urgent questions after now. Otherwise, this will be around for the next--
Associated Research Thrust: