Tutorial: Optimization (56:10)
Date Posted:
August 12, 2018
Date Recorded:
August 12, 2018
CBMM Speaker(s):
Kevin Smith
Description:
Kevin Smith, MIT
Introduction to optimization focused on basic concepts, terminology, and its applications. Topics include likelihood and cost functions, single and multi-variable optimization, and optimization for machine learning (stochastic gradient descent, regularization, sparse coding, and momentum).
Online notes, GitHub: slides and notes
KEVIN SMITH: To start out with, I want to focus on this idea that this is the introduction to optimization. What we're going to be learning in this course are things like, well, what is optimization used for? It underlies a lot of stuff that's done in statistics and machine learning.
So as you go through some of the future tutorials or the lectures, you're going to hear a lot of this terminology thrown around. And it's good to have a conceptual knowledge of, what are you trying to do? Different concepts from optimization and how they're used, things like this commonly used terminology that you'll hopefully hear a lot, helpfully marked with a CBMM logo, so you know it's good.
And then I have in the notes some pointers to how to do this optimization in different programming languages, including R, Python, and MATLAB. This is not something we're going to focus on for this tutorial, though. It's more the concepts that I want to impart to you rather than, here's how you actually do this optimization yourself. So this is the set of the important terms that we are going to go over today.
If you look at this and you think, oh, I know all of these, I've got these down pat, this tutorial is probably not for you. It's to give an introduction. In that case, you can go back to your Facebook. You can go back to your Instagram, whatever it is you kids do these days.
But if you're still with us and you'd like to learn more about these terms, you can get the slides and some of the materials from this web page. And I think it's easier to grab the notes from the Resource page just because of the way that GitHub causes HTML files to be downloaded. So I'll give you about 10 seconds to get through that. I'm not going to sing the Jeopardy song during it.
So what we're going to go through is we're going to start with just some of the foundational things that you need to know before you can even run optimization, so things like the likelihood and cost functions. Then I'm going to talk about how to optimize over a single variable, bump that up to optimizing over multiple variables, and then finally talk about some of the ways this is used very specifically in machine learning, things that you're probably going to see a lot of over the next few days.
So to start with this question of likelihood and cost functions, I'm going to begin by defining what a likelihood is. And a likelihood is very simply defined as the probability of observing your data given a particular model. So I can write the probability of some data, the set of data points, given a model. And that includes, what is the structure of the model, or its parameters?
And usually when we write the likelihood, we write it the other way around. Because what this means is you know the model. What's the probability of observing your data? But when we care about the likelihood, oftentimes we have a set of data, and we want to know, what would it be like under different models? So that's why the d and theta are flipped here.
So this is very abstract, I realize. So let's make this a little bit more concrete. Imagine you have this urn. It's got black balls in it. It's got white balls in it. You pull them out with replacement, and you get the sequence of balls that includes five black and seven white.
Now, if you want to know the likelihood of various things, you have to define your model. So you can ask questions like, well, let's assume our model is just every draw has the same kind of probability of giving you a black ball. What if that probability were 50%? Well, what would the likelihood be then?
We can expand this and say, OK, there is a 50% chance of drawing black, 50%, 50%, 50%, 50% and so on and so forth and expand this out and get this number for the likelihood. But we can try different parameterizations of that model as well. We can say, what if the urn had 75% black balls? Well, now the probability is 75% chance of getting black, 75%, 25% chance of getting white, 75%, so on and so forth, and expand it out and get this lower number.
A nice thing about this is now that you have the structure of there's this binomial distribution, you can define the likelihood as a function of, well, what do you think the probability of getting a black ball is? And we can write that out, in a more generalized form, as this. And now that we have this function, we can draw this out and get this likelihood graph, so all allowable values of theta just range between 0 and 1, and you can see how the likelihood changes as that value of theta itself changes.
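As a rough sketch in Python of the likelihood function just described (the actual code is in the notes; the names here are just illustrative):

```python
def likelihood(theta, n_black=5, n_white=7):
    """Probability of this particular sequence of draws, if each draw
    comes up black with probability theta."""
    return theta ** n_black * (1 - theta) ** n_white

print(likelihood(0.5))    # every draw at 50%: about 2.4e-4
print(likelihood(0.75))   # a 75% black urn: about 1.4e-5, a lower likelihood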
This is all well and good, but what does it actually tell us about the urn? To really get at this, we want to start with this idea of what's called the maximum likelihood estimator. And this is defined as the value of a parameter for which you get the greatest likelihood, or for people who like mathematical notation, it's this arg max over theta of the likelihood. And the reason this is important is because, when we look at this point here where the likelihood is highest, these are actually the settings of our model that are most likely to give us the data by definition and, therefore, probably the best description of the generating process in the world.
So if we want to say, what is our best guess at the proportion of black balls in this urn, well, it's this maximum likelihood estimator, which here is just 5/12. We have five black balls out of 12 draws. So if we want to find what is the best setting of our model to help us describe the world, we can rely on the maximum likelihood estimator for that. Any questions so far?
All right. So the whole point of optimization is to ask this question of, how do we find this value more generally? Here we can just count up the balls and say the proportion is the same as what you've drawn, but generally you don't get that nice matching. Before I do that, though, I want to take a slight detour into this idea of cost functions.
So just like I described likelihood functions in these dry academic terms and then went on to make it a little bit more concrete, I'm going to do the same thing for cost functions. Here a cost function is just a function that maps any set of events onto a cost of that event occurring. And the reason I say this is because you can translate likelihood into cost very easily by just making it the negative log likelihood.
So you take the negative log of it. And what you can think of this is, if you say, well, my value of theta is 50%, well, then if you draw a white ball, there's a moderate cost. You hedged your bets. If you draw a black ball, there's a moderate cost. It's the same. You've hedged your bets.
But if you said I want my theta to be, say, 90% chance of drawing a black ball, if you draw a black ball now, there's not much of a cost to that. You've got it pretty right. If you draw a white ball on the other hand, there's only a 10% chance of that, so the cost is much higher. So in this way, you can very directly map this likelihood onto this cost function.
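Spelling that mapping out as a sketch, the cost is just the negative log of the likelihood above (again, hypothetical names, not the code from the notes):

```python
import numpy as np

def cost(theta, n_black=5, n_white=7):
    """Negative log likelihood: the lower the likelihood, the higher the cost."""
    return -(n_black * np.log(theta) + n_white * np.log(1 - theta))

print(cost(0.5))   # hedged bet: moderate cost, about 8.3
print(cost(0.9))   # confident in black: the seven white draws cost a lot, about 16.6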
So why would we want to actually use cost instead of likelihood? Well, there are some nice properties of the logarithm that I won't get into, but they make the math easier. But on top of this, cost functions are a lot more general than likelihoods. So right now we're talking about fitting parameters where you can get a probability distribution over them. But oftentimes you also want to optimize over things like, in robotics, moving a robot arm subject to some constraints around the joints. We can define that as a cost of the path of motion, a cost of getting outside of some comfort areas of the joints. That's not something you can write as a likelihood, but you can very easily define it as a cost function.
So generally, when we talk about optimization, we do it over these cost terms. But very often you can also think of it just as maximizing likelihood. And that's the nice thing, is because there's this negative monotonic relationship between cost and likelihood, the maximum likelihood estimator is guaranteed to be also at the point where the cost is at its minimum. So that's what we're going to be doing in the future, is we're going to be taking these cost functions and we're going to be trying to find, where is the lowest point on them? Any questions so far?
All right. Either I'm very clear today or you guys are still tired. So let's get back to this question of we have this urn, we now have this cost function, we want to figure out where is the minimum of this cost function, and how do we find this more generally? So if we don't know something about, oh, you just take the proportion of the balls you've pulled, how do you get to this point down here?
Well, there's a few options. The first one we're going to go through is grid search. This is a brute force method. Because the way it works is you just calculate the cost function at a lot of different values, and you say, hey, here's the minimum. This is very, very simple. That's one of the pros of it.
It's guaranteed to get you at least close to the minimum, and a global minimum as well. It's really easy to implement. All I have to do is run this function over and over. The problem is it's pretty inefficient. If you want to get a precision to 1,000th of the range of the parameters you're looking over, you've got to run your function 1,000 times. If you have two parameters, though, now you've got to run your function 1,000 times for one parameter times 1,000 for another. That gets to a million, starts getting more expensive.
Three parameters, it's a billion. Four, a trillion. Quickly this blows up. So grid search really doesn't work when you've got multiple parameters. It's really only if you're looking over a single parameter, a simple range.
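For the one-parameter urn problem, a grid search really is just a couple of lines; here's a minimal sketch using the cost function above (illustrative only, not the code from the notes):

```python
import numpy as np

# Evaluate the cost at 1,000 values of theta and keep the smallest.
thetas = np.linspace(0.001, 0.999, 1000)
costs = -(5 * np.log(thetas) + 7 * np.log(1 - thetas))
best_theta = thetas[np.argmin(costs)]
print(best_theta)   # close to 5/12, about 0.417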
So are we out of luck? Well, no. I've got a lot more time to talk to you guys, so clearly there's more to this. Instead, what we can go to is this technique called gradient descent. And the way you can think about gradient descent is that this cost function is some hill you're climbing.
So you're up here somewhere on a hill arbitrarily. It doesn't really matter where. And you know you've made your camp down at the lowest point in the valley. And there's this thick, thick fog, so you can't actually see where your camp is. Instead of just thinking, hey, I'm going to sit down, wait for the fog to pass, you really want to make it back before nightfall. What do you do?
Well, if all you know about this hill is you can look at the ground and say which way is down, well, that gives you some sense of where to go because you're trying to get to the lowest point. So you just look. You say, this way is downhill. If it's really steep, you want to go a ways, because it tends to flatten out at the bottom.
And so you walk downhill. And then you repeat the process. You say, this way is downhill. It's a little less steep, so I'm not going to go as far before checking again. And you can go on and on and on.
And that's basically what gradient descent does. It's a process that's iterative. And what you do is you start at a parameter value, you look at where is downhill, you take a step in that direction, and you keep repeating until you're close enough to the point where it's flat. So you can say at this point, it's an iterative process. The parameters you have at time t plus 1 are the parameters you had the last time, minus this gradient here, scaled by some step size parameter. And the gradient is just defined as the slope or the derivative of the cost function with respect to the parameter.
Now, for now, we're just going to assume it's arbitrarily chosen. There's a lot more complex techniques to adaptively set this that are outside the scope of this lecture. But you just keep repeating this process until you get to the bottom. So this is what it looks like. You start up here. You very quickly take some steps. And then as it gets flatter and flatter, you take smaller and smaller and smaller steps. And you could see the value of this parameter getting very close to the bottom of this valley.
Makes sense? So you can write this up. This is in the notes. You don't have to pay too much attention to it. You get it out. You get this gradient descent value. And this is very close to the 5/12 that we calculated by just the proportion of black balls we drew.
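The full write-up is in the notes; as a minimal sketch of that loop for the urn problem, with an arbitrarily chosen starting point and step size, it looks roughly like this:

```python
def gradient(theta, n_black=5, n_white=7):
    """Analytic derivative of the negative log likelihood for the urn."""
    return -n_black / theta + n_white / (1 - theta)

theta = 0.9          # arbitrary starting point on the "hill"
step_size = 0.01     # arbitrarily chosen; adaptive schemes exist
for _ in range(1000):
    theta = theta - step_size * gradient(theta)
print(theta)         # ends up very close to 5/12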
But one thing about this is we actually have to know what the gradient is to solve it. Now, because we're using the binomial distribution for the urns-- this is well-studied-- we can easily calculate the derivative. Not a problem. But there are cases where that's not possible, where you don't necessarily know the derivative.
In this case, we're still not out of luck, though. We can approximate it. And the way we do this is remember slope is just defined as rise over run. So we take our function, we step in both directions just a little ways, we figure out what is the change in y, and divide that by the step size as well. So we can get this approximation of a gradient fairly easily. Now, one thing to note is, even for the single parameter optimization, you have to call this cost function twice instead of the gradient function once, so it's slightly more expensive, but it does save you from needing to analytically calculate the gradient.
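A sketch of that rise-over-run approximation, with the two cost-function calls just mentioned (the step size h here is an arbitrary small number):

```python
def approx_gradient(cost_fn, theta, h=1e-5):
    """Step a little way in both directions and divide the change
    in cost by the total distance stepped."""
    return (cost_fn(theta + h) - cost_fn(theta - h)) / (2 * h)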
Mkay. So let's move on to another function, this guy up here that's just kind of arbitrarily chosen. And we can write the exact same gradient descent function that we had before, except here now we're kind of taking the approximate gradient. I didn't actually do the derivatives here. I was lazy.
And we get out this value of theta is equal to minus 1.294. And so we're done, right? Everybody good with this answer? All right. It seems you're still tired. I'm not great at explaining. Everybody happy with this answer?
AUDIENCE: No.
KEVIN SMITH: All right. I'm getting some no's, and that's good. You don't necessarily just want to trust a gradient descent because it can get you into weird places. For instance, this value of theta gets you in this little saddle point here, whereas where you really want to be to say, what is the best parameter that lowers this cost function, is this point down here around between one and two.
And the way we differentiate those two points is this is called up here a local minimum, and this is called a global minimum. The difference being that a local minimum is the bottom of some sort of local dip. So if you go in either direction, your cost function is going to increase. But it is not necessarily guaranteed to be the best-fitting parameter. That's the global minimum. And that's the value of the parameter where the cost is the lowest across all possible values of your parameterization.
Excuse me. So to help understand this, we can also define what are called convex and non-convex functions. And the most intuitive way I've found to define a convex function is that it's one where, if you take any two arbitrary points on the graph and you draw a line between them, that line will always go above the graph. A non-convex function, the most intuitive way I've understood it, is it's just not a convex function.
So we can take this graph. And we can see, if we take the line between the local minima and the global minima, this goes under the function, and therefore this isn't convex. And the reason I bring this point up, this definition, is because of a nice property. Any minimum on a convex function is guaranteed to be a global minimum. So if you know your cost function is convex, then you can just do gradient descent and you don't really have to worry about it. You're going to end up at the best point.
If you don't have a convex function, on the other hand, you don't get that same guarantee. And that's where a lot of work in optimization is, is trying to figure out how do you get to these global minima instead of these local minima, because a lot of interesting functions are not convex. But this is a potential definition you will hear a lot in the future, is talking about convex functions.
The reason for that is there's really nice, simple, trivial algorithms you can use for convex optimization. Non-convex optimization is much more computationally costly, difficult, and gives, at least for me, huge headaches when I have to deal with it. So if you can, stay in convex world. If not, panic a little bit and then deal with it. Any questions so far? Yeah.
AUDIENCE: Is there a way to know whether it's convex or non-convex [INAUDIBLE]?
KEVIN SMITH: Not that I know of. I mean, I think there are some ways of testing for convexity. But generally, I mean, I found the best way is to just kind of take slices through your parameters and try to look at it. Or if you can even run gradient descent a few times from different starting points and if you always end up in the same place, maybe it's not completely convex, but it's a good approximation. But if you're ending up in different local minima all the time, then it's basically guaranteed that it's not convex. Any other questions?
All right. So if you want to implement it, there are various ways of doing single variable optimization in these different programming languages. One thing to note is this gradient descent I've told you about is actually much more simplistic than the way that these algorithms work. They're going to do things much more efficiently and faster for you.
But the general idea still holds. You're trying to find downhill and you're trying to find the bottom. They just have various approximations to get there faster. I'm not going to talk about how to actually use these. They're in the notes if you do need to look at them, though.
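The notes cover the per-language details; as one illustration in Python, a library routine such as SciPy's bounded scalar minimizer can be handed the cost function directly (a sketch, assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cost(theta):
    return -(5 * np.log(theta) + 7 * np.log(1 - theta))

result = minimize_scalar(cost, bounds=(0.001, 0.999), method='bounded')
print(result.x)   # about 0.417, i.e. 5/12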
OK. So we finished with single variable optimization. How about multivariable optimization? Well, let's start with a problem where, at the beginning of the course, people are filling up the lectures. Everyone's bright and chipper. And then as time goes by-- it looks like we're not quite there yet, but maybe-- the attendance starts to drop off.
And I'm hypothesizing that this is for one of two reasons. It's either that you guys are busy working on your projects and being very studious and just oversleep because you've stayed up so late, or you're out partying and are too hung over to wake up and make it there on time. And I want to know, well, what proportion of people who are studying and what proportion of people who are hung over are skipping the lecture?
So I can build this model that basically says-- it's this noisy-or model that says, if you're working, there's a probability of skipping. If you're hung over, there's a probability of skipping. These combine. But we don't know exactly how.
However, what we can do is we can survey people and say, well, what were they doing, and try to build this model out. So Andre, for instance, he's very studious. He's working all night but he still makes the lecture. Kelsey, working and also drinking while still studious still makes the lecture. Javier doesn't drink. He chilled all night but he's there in the morning. And me, I'm just kind of checked out. I'll see you guys later.
So we can take all of these different observations and build this table of who's attending, are they working or hung over, who's skipping, are they working or hung over, and then we can define our forward model as this noisy-or. So if you're working and not hung over, there's one parameter. If you're hung over and not working, there's another. And there's this kind of anding of the other two.
And then just as we did for the urns problem, we can say that the cost of each observation, if you skip, is the negative log probability of skipping. If not, it's negative log 1 minus that probability. So this gets us to this cost function over all of these, over these different parameters.
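As a sketch of what that cost function could look like in code, assuming a standard noisy-or forward model (the exact model and the survey data are in the notes; everything here is hypothetical, with `working`, `hungover`, and `skipped` as 0/1 arrays over people):

```python
import numpy as np

def p_skip(p_work, p_hungover, working, hungover):
    """Noisy-or sketch: each active cause independently produces a skip
    with its own probability."""
    return 1 - (1 - p_work * working) * (1 - p_hungover * hungover)

def attendance_cost(params, working, hungover, skipped):
    p_work, p_hungover = params
    p = p_skip(p_work, p_hungover, working, hungover)
    p = np.clip(p, 1e-9, 1 - 1e-9)   # avoid log(0)
    # Negative log probability of each observation, summed over people
    return -np.sum(skipped * np.log(p) + (1 - skipped) * np.log(1 - p))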
So we now have a cost function over two parameters. And we can try to say, what is the maximum likelihood estimate of these two? Oh, I'm just going to tell you now, this is what the cost function actually looks like. So red areas are places where the cost is lower. And these dotted lines are just contour lines in the cost function.
Now, if we could just get this, that would be brilliant. We wouldn't need to worry about it. Unfortunately, the way I had to do this was by grid search, and it took about five minutes, even for this really simple problem just because you have to run it a million times. So instead, how would we get to the lowest point here if we don't know where we're starting from?
Well, hey, gradient descent. It's not just for single variable optimization, except here we have to define our gradients, not as this single number but instead as a vector. So it's not just, how far is it along one dimension? It's, what is the direction and magnitude? And here you can define this again as the partial derivative of the cost function with respect to each of these different parameters.
So what you can do is you can take these points and you can calculate, what do the gradients look like? And here the vectors are starting from certain points with their magnitudes attached. So up here the gradient's pretty, pretty steep and in this direction. Down here it's relatively steep in this direction, pretty steep in this direction, and relatively shallow and pointing that way.
Then we can run the exact same gradient descent procedure where you take your parameter and you iteratively update it by moving in the direction of the gradient. And you can see it moved from here, and turned around, and is very slowly but surely hitting the lowest point in the cost function. Makes sense?
So we can run this forward. You can see the actual code for the gradient descent is relatively similar to the stuff you've seen before. You can get out these values for these two parameters. And actually, these are pretty close to the parameters I put in to generate the data, about the best you're going to do given just the noise in observations. So again, what we did here, though, was calculated the gradient by approximation.
In this case, though, when you're trying to look at a two dimensional gradient, you can't just go in one direction and another and take two function calls. You now have to look in both directions for both parameters, so in this case four function calls. And if you do this for a function of three parameters, that's six function calls. Four parameters, eight. It goes up and up. It basically scales with the number of parameters.
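Here's a minimal sketch of that multivariable version, with the gradient approximated by finite differences, so two cost-function calls per parameter, which is exactly the scaling being described (names and defaults are my own):

```python
import numpy as np

def approx_gradient_multi(cost_fn, params, h=1e-5):
    """Finite-difference gradient vector: two cost calls per parameter."""
    grad = np.zeros_like(params, dtype=float)
    for i in range(len(params)):
        bump = np.zeros_like(params, dtype=float)
        bump[i] = h
        grad[i] = (cost_fn(params + bump) - cost_fn(params - bump)) / (2 * h)
    return grad

def gradient_descent(cost_fn, params, step_size=0.01, n_iter=2000):
    """Same loop as before, just with a vector of parameters."""
    params = np.asarray(params, dtype=float)
    for _ in range(n_iter):
        params = params - step_size * approx_gradient_multi(cost_fn, params)
    return params
```

With the attendance cost sketched earlier, the call would be something like `gradient_descent(lambda p: attendance_cost(p, working, hungover, skipped), [0.5, 0.5])`.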
Now, this is fine for these sorts of simple problems. Four function calls is not great, but it's doable. But when you start getting into things like neural networks where you've got thousands and thousands of parameters for the weights, you don't want to have to run your function multiple thousands of times just to get a gradient out. That is costly. It's wasteful. It's not great.
So that's why we care about this attribute of functions called differentiability. And a function is differentiable just if there's an analytic way of calculating the gradient at each point. And the reason this is nice is because it pretty much means that there is a constant cost to getting the gradient out no matter how many parameters you use. So if you have thousands of parameters in a neural network, you only need to run the gradient function once to figure out what the gradient is, and use backprop on that, instead of the multiple thousands of times you'd need to run the actual cost function to approximate the gradient if you didn't know it. Any questions there?
All right. Again, there's various ways of implementing this multivariable optimization across the different programming languages. Just as with single variable optimization but even more so, there's a lot of complexities to these that I didn't go over. If you just do naive gradient descent on most multivariable optimization problems, it's not going to work very well. It's going to be very slow. You're going to overshoot because you get the wrong step size.
There's lots of different tricks that make this much more efficient. Again, outside the scope of this lecture, but luckily it all works. You can just take these functions, and they're helpful.
OK. So moving into some of the ways this is actually used for machine learning. So there's four different things that I'm going to talk about that you will probably hear a lot more about in the rest of this course. First one that we're going to get into is stochastic gradient descent.
Mkay. So here the question is, what happens when calculating the cost function is really expensive? So for instance, take ImageNet. Let's say you're trying to fit a convolutional neural network, and you want to take the 14 million images of ImageNet. And you want to say, well, what is the cost of classifying or misclassifying all of these different images across my network?
Well, the way you've got to calculate that is by classifying each image individually, comparing it to the actual classification in ImageNet, and then summing over all of those different images. And that means that you have to do this, if you're doing this just naively, 14 million times each time you calculate the cost function. That's a lot of computation.
And so if you wanted to actually fit a model-- these neural networks take tens or hundreds of thousands of iterations to converge, and you've got to run these 14 million images through each iteration-- you're never going to get anywhere. That's just going to take forever, basically. And yet somehow we actually have these models that work. They do fit to ImageNet. They do get these gradients out. And they are able to optimize.
So how is that? Well, for this, there's this very nice property that, if your cost function is decomposable into individual parts, you can approximate the gradient by just getting the gradient out of one or a small subset of your data set. So you don't need to calculate the cost function, the gradient, across the entire 14 million images. You can take one or a small batch of images and calculate the gradient from that.
And it's not going to be perfectly right, but it's going to be directionally correct. And so if you've done that and you adapt your step size, so you're more willing to tolerate going in the wrong direction early but you take smaller and smaller steps as you get closer to the minimum, you're guaranteed in the limit to actually reach the point of the lowest cost. Yeah?
AUDIENCE: In terms of breaking it into smaller parts that probably aren't perfect or reasonable, are you talking about breaking it into categories?
KEVIN SMITH: Usually, you don't want to break it into single categories because you're going to introduce bias that way. It's kind of randomly slicing up the data set and just saying, if there's 14 million images, I'm going to calculate gradient on 10,000, then another 10,000, then another 10,000, and then another 10,000. If you did it on, say, just birds, you're going to find the gradient that fits birds the best but not overall the best. So generally, this works only if you kind of assume independence between the individual observations, so you don't want to do that in a categorical way. Yeah?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Yeah.
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Sequential sorts of things are something that this doesn't work on. So if there is kind of autocorrelation in your sequence, you can't break it up like that, because your observation at time point one is going to affect your observation at time point two. And therefore, if you're building that structure into your model, then you can't tease those two apart. You have to know about one before getting to two. So I think that's the most basic example here. Any other questions?
All right. So with stochastic gradient descent, we get these kind of noisy gradients out. And the way this sort of works-- I can't remember where I heard this example, but I love it. It's basically asking drunk people for directions.
So if I want to know how to get to Boston from here, and everybody was really drunk and could only approximately point in one direction, well, I'd say, which way's Boston from here? They're going to be like, that way, OK? I'm going to travel 20 miles in that direction. I'm going to find somebody else and ask them again. And they'd point in another direction. And I walk 18 miles, ask another person, and so on and so forth.
Now, each person is going to not-- if I just followed that person's direction the entire way, I'd definitely miss Boston. But by asking a number of people who are going to be sort of wrong but, in general, kind of right and taking smaller and smaller steps, eventually I'll end up where I want to be. So we can do that with, for instance, the lecture attendance problem.
And here, what I did was I took this and I said, well, let's calculate the gradient out of each individual observation here. So we got 200 observations. We can get the gradient each time and just follow that, and then go on to the next observation, so on and so forth, and then repeat that.
Oops. Yes, I talked about this and defined that, but it's an important term. And so this is what it looks like. And so you start here in the same place, you end up in the same place. And the reason why this line is purple is because I've color coded this so that each iteration through the data set you get a different color.
So just going through those 200 observations and using the noisy gradients for each of those gets you pretty close. You've got to zoom in pretty far before even seeing the second iteration. And the third iteration is kind of-- I don't even want to zoom in on that. So while this will make you need to run your cost function, say, in this case, 600 times to get to the bottom, whereas with the normal gradient descent we only needed to run it about 200 times, in this case calculating the cost function is cheaper because you're only doing it on one observation instead of 200. So there are many cases where, by decomposing your cost function, you may need to take more steps, but you can do that with a much cheaper gradient calculation. Yeah?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: I mean, I guess there's going to be the same amount of noise throughout. I guess there's going to be more noise as compared to the flatness of the gradient, the steepness of the gradient, as you get closer. But that's why you take smaller and smaller step sizes, is because you're more willing to go in the wrong direction to start. Once you're very close, it's OK to take a tiny step in the wrong direction if, in general, you're going to be taking lots of tiny steps in the right direction. So it still works out as long as you temper your step size as you get closer and closer. Any other questions on stochastic gradient descent?
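Before moving on, here is a minimal sketch of that stochastic loop, reusing the finite-difference gradient helper from the multivariable sketch above; the shuffling and the halving step-size schedule are my own choices, not the version in the notes:

```python
import numpy as np

def sgd(cost_one_obs, observations, params, step_size=0.05, n_epochs=3):
    """One noisy gradient per observation, with the step size tempered
    after each pass through the shuffled data."""
    params = np.asarray(params, dtype=float)
    for epoch in range(n_epochs):
        order = np.random.permutation(len(observations))   # random order, not by category
        for i in order:
            obs = observations[i]
            grad = approx_gradient_multi(lambda p: cost_one_obs(p, obs), params)
            params = params - step_size * grad
        step_size = step_size * 0.5   # smaller steps as we get closer
    return params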
All right. So we can move on to regularization. So now I'm going to give you this problem where we have this polynomial machine, which is just this arbitrary polynomial-- I'm not going to tell you what it is just yet-- and you get out these, say, five points from it. And what we want to do is we want to recover, what was the generating function of this, right?
So who thinks it's kind of a straight line through all of these five points? OK. Who here is listening to me and is responding? About half the room, good. That's a good calibration.
Who here thinks it's maybe a quadratic equation? Quartic? Who thinks it requires more than that, some arbitrary up-to-x-to-the-seventh polynomial? All right. So it's not many people. I think some people might think I'm trying to mess with you.
Well, the actual function looks like this. It's just a polynomial-- cubic, quartic-- plus some noise. And I'm telling you that just to kind of give you some sense of what's going to go on when we start fitting things. Because now that we have these five points, we can take some arbitrary function here, which is the sum of these betas times x to the n, and we can say, well, that gives us our prediction. And we get a cost function that's just the sum squared difference between our observations for that point and the prediction.
Now, the nice thing about sum squared differences is this is actually equivalent to the likelihood if you assume unknown but constant-variance Gaussian noise. So you can still think of this as, what is the likelihood of observing this data under this sort of model plus Gaussian noise? And when we do that just naively, we get out a function that looks like this.
Who thinks that this is a good approximation of the underlying function? All right, good. Nobody. This looks pretty bad. It doesn't seem to do very well in the middle. And it definitely falls apart outside of the data.
This is a problem. And the reason for this is we've got this problem of overfitting. We've got eight different parameters to fit five different data points. So very clearly you can fit any sort of line through all of these data points, but it's not necessarily going to generalize very well. And often when we're trying to do this functional fitting, we want stuff that generalizes.
So how do we fix this? Because this is a problem, again, with a lot of neural networks. You have a limited set of data but tens of thousands of weights. So you could, in theory, learn just to memorize the data the same way that this function is just memorizing the points and not fitting outside of that.
Well, one way to solve this is with regularization, which is to say I want an implicit bias that my models are going to be simple. And the way I do that is by adding an additional cost to the magnitude of the parameters. So if I've got a lot of function parameters that are large, that's not good. I'd rather have a few parameters that are larger and a lot of smaller, less important ones.
And the way we do this is we kind of take our base cost, which is defined the way it has been before, and we add this regularization term. And there's various ways of doing this. One of the more common ones is called L2 regularization, because it's differentiable. And this is just kind of the sum square of the magnitude of all of the parameters. And of course there's the strength of the regularization as well, which is a little bit of an art to choose, because it's this question of, how much do you want to trade between trusting your data and trusting your prior for simplicity?
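As a sketch of that regularized cost for the polynomial example, with the betas ordered from the constant term up and the regularization strength as the knob that's a bit of an art to choose (names are illustrative, not the code from the notes):

```python
import numpy as np

def regularized_cost(betas, x, y, reg_strength=1.0):
    """Sum-squared error plus an L2 penalty on the parameter magnitudes."""
    prediction = np.polyval(betas[::-1], x)      # sum of beta_n * x**n
    base_cost = np.sum((y - prediction) ** 2)
    l2_penalty = np.sum(np.asarray(betas) ** 2)
    return base_cost + reg_strength * l2_penalty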
But once we do that, we can fit a line that looks like this blue line here. Now, who thinks the blue line is a reasonable fit to the data? Yeah, it definitely looks a lot better than the brown line. Sure, it's not perfect as compared to what we know our actual function is. It's a little bit too linear. But that's a good thing.
We don't have much data. We don't have any reason to believe that things are going to fall off over here or up here. We should just trust it's going to be roughly linear. Seem reasonable?
So this kind of gives us too simple of a model with only five data points, but that's not a bad thing. We can up this to 50 data points, and we can fit both the regularized and the non-regularized functions to it. And you can see here now the non-regularized function does a really good job of fitting the function within the bounds where we have data but does a really bad job of generalizing outside of those bounds, whereas the regularized function, the blue, does a much better job of extending outside because it's not trying to fit absolutely everything. It's trying to be simple as well.
And this sort of overfitting and non-generalization for the non-regularized function, just when I was playing around with this, it happens up to 250, 500 data points before you really started getting good generalization outside of the parameters. So regularization can help us recover this generalizability very well by imposing this simplicity prior. Yeah?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Yes, but be--
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Yes. So it's only limiting the size, but it's kind of implicitly limiting the number. Because what you're going to want to do is you're going to want to load more on a few parameters and then have the rest be very small. So there's also things like L0 regularization, which, again, is just limiting-- you've got a cost for the number of parameters you use. Often, that isn't used because it's just not mathematically nice. It's not differentiable.
But L2 regularization kind of has the same effect in that you're never going to get zero weight parameters. But everything's going to kind of-- a lot of stuff that's unimportant is going to be small. Yeah?
AUDIENCE: Can you speak a little more about what the size of the parameter actually is?
KEVIN SMITH: I mean, you can think of it as, if you go back to this, just the value of b here. So if this y is equal to b naught plus b one, blah, blah, blah, blah, blah, blah, you just take the value of each of these, you square it, and you sum it.
AUDIENCE: So it's just always trying to [INAUDIBLE]?
KEVIN SMITH: Yes. Yeah. Oops. Any other questions? All right. So let's move on to sparse coding.
So for this, I'm going to go back to a kind of classic paper in neuroscience and machine learning where what they tried to do is they tried to take a large set of naturalistic black and white images and say, could we decompose these into basis functions, which means you have these different patches of different kind of black and white intensities. And what you want to do is you want to say, can I recreate natural images by just weighting those patches up and down, smashing them together? Will I actually be able to recover the natural image?
The answer is yes. They actually did a very good job with this. But this is what those basis functions look like. So if you take these 64 different image patches, you upweight them, downweight them, invert them, and smash them together in the right way, or overlay them the right way, you can recreate a vast majority of naturalistic images incredibly well from their data set.
So that's great. But there are some problems with this, because how we create each natural image is kind of uninterpretable. Every single image that you recreate is going to take some var-- is going to use all of these different basis functions in various ways. It's not really clear what each of the parameters means, because you can't really recover edges from this. An edge is some combination of a lot of different basis functions in a very specific way, so it's just not great for interpretability. And it's also kind of inefficient.
So if you want to model this, say, as a model of the visual system, well, you can think of there being a cost for neural spiking. Might not be huge, but there's a metabolic cost to it. So you don't want to have some sort of coding that requires everything spiking all the time. You want to kind of save energy and build things in an efficient way.
And the other thing is, as I mentioned, you want things to be composable. So you want to be able to look at these basis functions and you want to pull out things like edges, parts of features that you think are going to actually make up those images in an interpretable way. That does not come from these basis images, as you can see. So what can we do instead?
Well, we can do what's called sparse coding, which, just like regularization imposed a cost on the parameters of the model, imposes a cost on the activation of each of those different basis functions, or each of those different parameters, when run through. So you can define it again as this base cost function, which is just, how well does the recreated image match the natural image that I'm trying to recreate, plus this sparsity penalty, which is often defined as things like L1 sparsity, just how much activation is there, or this log penalty sparsity, and again some weighting to say, how much of a sparsity prior do you want in there? Yeah?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Sorry. I'm not sure what you mean by having a weight for the basis.
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Yeah.
AUDIENCE: [INAUDIBLE] right?
KEVIN SMITH: Yeah.
AUDIENCE: And one way [INAUDIBLE].
KEVIN SMITH: OK. So in that case it's kind of the same thing. It's assigning weights on a per image basis.
AUDIENCE: Yes.
KEVIN SMITH: So the activation is those weights for each image. But as you want to recover a new image, those weights are going to change entirely. And so, yes, in this case the weights you're talking about are the activations that I'm referring to here. But whereas weights typically are considered things that are consistent across a model in many cases, activations, we're assuming, are things that change per data point. Does that make sense? Yeah?
AUDIENCE: From [INAUDIBLE] can change the digitalization of the function that we learned today or the function in the world. [INAUDIBLE]
KEVIN SMITH: Yeah.
AUDIENCE: --of a function. But it's possible it doesn't do anything like that. We can learn a fairly generalized function without [INAUDIBLE].
KEVIN SMITH: Yeah.
AUDIENCE: So how would different functions look like that are sparse and not sparse?
KEVIN SMITH: Well, I mean, one thing we can look at is, once you start adding sparse coding, instead of getting this set of noisy pixelated images, we get out things that look like V1 receptive fields. And the reason for that is because many of these natural images are decomposable into things like edges. And so what you want to be able to do is kind of create them out of a subset of a few different edges.
So that's where sparse coding really helps you, is now you can get these interpretable features. So when you see the activation code across an image, like if you've got this thing and this thing, that's kind of like a junction like that. As opposed to, how would you recover that sort of junction in the non-sparse basis functions? I don't know that. That is too complex for my brain.
AUDIENCE: So it's [INAUDIBLE]
KEVIN SMITH: Yes, it's very much about interpretability. But the interpretability then also helps you build up as you kind of go down the line. So if you have these sorts of basis functions being fed into V2, for instance, then now you're working on a combination, like a combination of these features, as opposed to a combination of these noisy patches. So as you build up the model more and more and more, the interpretability, the fact that you're kind of decomposing the end image or the end state into smaller, interpretable bits helps you model it down the line as well.
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Yeah. I actually don't know how many people have tried to look at building up models of, say, image classification with non-sparse features that don't have these sorts of beginning things, because I think they generally don't work well. So I'd have to get back to you on that about who's actually looked at this.
But it works better to have these as you're trying to model things on top of this. It works better to have these interpretable features than image patches. Any other questions here?
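Before the last topic, here is a minimal sketch of the sparse-coding objective described above, using the L1 version of the penalty; the array shapes and names are my own, not the original paper's code:

```python
import numpy as np

def sparse_coding_cost(activations, basis, image, sparsity_strength=0.1):
    """How badly the weighted sum of basis patches misses the image,
    plus a penalty on how much activation was used to get there."""
    reconstruction = basis @ activations        # basis: (n_pixels, n_patches)
    reconstruction_cost = np.sum((image - reconstruction) ** 2)
    l1_penalty = np.sum(np.abs(activations))
    return reconstruction_cost + sparsity_strength * l1_penalty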
All right. OK, so the final topic we're going to talk about is momentum. So for this we're going to go to a data set where we're trying to predict college senior GPA from a student's ACT scores. I guess this is kind of outdated with some of the recent changes in education. But, hey, it's a nice data set for this.
So we have this set of individual students' ACTs and their senior year GPA. And we want to say, is there a relationship, and what is it? If we know the ACT of a student coming in, can we predict what their GPA is going to be like?
Well, what we can do is we can define this in kind of this linear regression way. We say our predicted GPA is some intercept plus a slope times the ACT, and then do the sum squared cost function again. We want to minimize the squared distance between our actual observations of the GPA and our predictions, right? Makes sense? So when we do this, we can, again, do grid search and we get a function like this, and we want to end up somewhere in there.
So what happens when we run gradient descent on this? Well, we can run this, and we get out this value. The intercept's 0.8. The slope's about 0.1. But suddenly it hits us. We've taken basic statistics. There's ways of running linear regression. We can fit these models without needing to do gradient descent.
So we do that instead and we come out with a slightly different answer here. The intercept's a bit higher. The slope's a bit lower. And furthermore, this took a long time to run. If you actually count up the number of iterations before it kind of hit below some arbitrary threshold, it's like 17 and a half thousand. It's not very efficient. This runs instantaneously.
So what's going on here? Well, we can look back at this and we can trace out, where did we go in the path of gradient descent? And this is what it looks like. We can zoom in on this and just look at the first 10 points where we stopped while trying this gradient descent. And you see 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. There's a lot of bouncing back and forth there. And you can see from here it does that for most of the time before kind of slowly crawling along this valley in the cost function.
And that's not great. Because what we're doing when we're doing gradient descent is we keep finding a really steep direction when what we want to do is go down a shallow depression. But instead, we're just bouncing back and forth. So to solve this, we can use what's called momentum. And-- sorry-- the reason this is important is because with a lot of fitting of things like neural networks, you do get these sorts of valleys in the cost function where, if you just do this sort of stochastic gradient descent or gradient descent, you are going to be bouncing back and forth.
And so you want to make the fitting process more efficient. So the way we can do this is with momentum. And here what we're doing is we're modifying the gradient descent algorithm itself such that the step we take at each point in time is not just the gradient; we also keep some of the oomph from the last gradient step we took. So it's combining the current gradient step, the change in parameters the gradient gives you, plus this momentum contribution, which is some sort of weight times the amount we changed in the last step.
And the reason why this is helpful is because, if we're going back and forth, momentum is going to dampen that motion. So if we just went this way and now we're supposed to go a lot the other way, well, the momentum is going to cancel out a lot of that. And so you're not going to see as many back and forth swings. On the other hand, if you keep going in a single direction, if the gradient keeps pointing you the same way, you're going to build momentum, and it's going to push you further and faster in that direction. Makes sense?
Whoop, there it goes. So we have this, but now we can write gradient descent with momentum, and we get out a value that's much closer to this, and importantly, about 10 times faster. It takes 10 times fewer iterations. Because what happens is we don't spend as much time bouncing back and forth across the valley. We find our way to the valley fairly quickly and then build momentum and kind of quickly go down the more shallow depression. Any questions there?
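A sketch of that modified update; the momentum weight here is an arbitrary choice, like the step size, and `grad_fn` is whatever gradient function (analytic or approximated) you have for your cost:

```python
import numpy as np

def gradient_descent_momentum(grad_fn, params, step_size=0.01,
                              momentum=0.9, n_iter=5000):
    """Each step mixes the current gradient with some of the oomph
    carried over from the previous step."""
    params = np.asarray(params, dtype=float)
    velocity = np.zeros_like(params)
    for _ in range(n_iter):
        velocity = momentum * velocity - step_size * grad_fn(params)
        params = params + velocity
    return params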
All right. Well, those are all the topics I have for you. Remember, these are all the important terms. You're probably going to hear a lot more about those in the future. So if you've got any questions on those, I guess I'm finishing up about 15 minutes early. So I'll be here to answer questions if you've got them. Yeah?
AUDIENCE: How do you typically express uncertainty in your [INAUDIBLE]?
KEVIN SMITH: Sorry. When you say uncertainty, do you mean like--
AUDIENCE: So for example, the urn example you gave, best estimate's 45%.
KEVIN SMITH: Yeah.
AUDIENCE: 55% also seems like a good estimate.
KEVIN SMITH: Yeah.
AUDIENCE: I just want to know, how do you express that kind of idea, where if your urn was 10 million balls and you got 45% [INAUDIBLE] not?
KEVIN SMITH: So the way you typically do this in frequentist procedures, which is this maximum likelihood estimation, all of that, you can build confidence intervals as well so that, especially for like the binomial distribution, there's ways of saying, given what I've observed, here's my maximum likelihood estimate. But also, here's kind of the 95% confidence or the sampling error that I would expect from that mean as well. So that's one way.
Another way is to go to Bayesian statistics, which is to use the full likelihood function. In that case, we're getting out of maximum likelihood estimation and into things like maximum a posteriori estimation. Because what you're doing, instead of saying, I'm just picking out a point, you want to say, what is my belief distribution over this parameter given what I've observed?
So I think that gets a little bit closer to your question of, how do you deal with uncertainty? Because it's no longer saying, this is my best estimate. It's saying, here's my distribution of probability over what I think the parameter is. And I can pick out what the highest probability is there. But I can also talk about I have more belief that it's 50% than that it's 90%, too. But that's a different kind of realm of statistics and makes different sorts of claims. Yeah?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: So in terms of things like sparse coding and regularization, I don't think there's a way to optimize them, because it's kind of a qualitative choice. Basically, when you think about what those mean, with regularization or sparse coding, when you do those, you're putting priors on your model or building in inductive biases. So with, say, the regularization parameter, you're trying to choose between, how much do I care about fitting this data that I've seen versus believing that my function is going to be simple? And if you have a really high parameter, that means you're really putting a lot of weight on simplicity.
If you have a really low parameter, you're putting a low weight on simplicity. If the parameter is zero, it goes back to normal optimization, you're saying I shouldn't care about simplicity at all. So it's more of a judgment call. Some things work better or worse for certain problems and helping generalization that way. I think that's much more domain-specific, though.
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Sorry?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: What do you mean by different terms interact?
AUDIENCE: [INAUDIBLE]
KEVIN SMITH: Oh, if you're doing both regularization and sparse coding? I mean, my guess is yes. I have never personally encountered instances where they're combined, though I'm betting there are a lot of cases. So I don't have particular expertise in that.
So I'm going to punt on that and say, yes, there probably is interaction. You still have to care, again, about the strength of your simplicity prior and the strength of your sparsity prior. But exactly how that's going to work out, empirical question. Any other questions?
You guys are just really raring for that extra 10 minute break. All right. Well, thanks, everyone. And I will see you guys later.