Beyond Empirical Risk Minimization: the lessons of deep learning
Date Posted:
October 29, 2019
Date Recorded:
October 28, 2019
Speaker(s):
Mikhail Belkin
All Captioned Videos CBMM Special Seminars
Description:
Mikhail Belkin, Professor, The Ohio State University - Department of Computer Science and Engineering, Department of Statistics, Center for Cognitive Science
Abstract: "A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance. This apparent contradiction points to troubling cracks in the conceptual foundations of machine learning. While classical analyses of Empirical Risk Minimization rely on balancing the complexity of predictors with training error, modern models are best described by interpolation. In that paradigm a predictor is chosen by minimizing (explicitly or implicitly) a norm corresponding to a certain inductive bias over a space of functions that fit the training data exactly. I will discuss the nature of the challenge to our understanding of machine learning and point the way forward to first analyses that account for the empirically observed phenomena. Furthermore, I will show how classical and modern models can be unified within a single "double descent" risk curve, which subsumes the classical U-shaped bias-variance trade-off.
Finally, as an example of a particularly interesting inductive bias, I will show evidence that deep over-parametrized auto-encoders networks, trained with SGD, implement a form of associative memory with training examples as attractor states.
TOMASO POGGIO: I'm Tomaso Poggio. And this is a special CBMM Seminar. Mikhail Belkin is going to speak here. Mikhail, technically speaking, is an academic grandchild of mine because Partha Niyogi was a student of mine. And for the record, [INAUDIBLE]. And Mikhail was a student of Partha Niyogi, at University Chicago. And then, he went from there to become a professor at Ohio State.
Partha, his advisor, was the purest and clearest mind I have known in machine learning and not only machine learning. He died of brain tumor, in 2010. But fortunately, we have some of his students.
And we have here, the best of his students, I think, to speak with us with some interesting properties, very interesting properties of the [? sealed ?] inverse, from the point of view of machine learning. And I think this could bring about small revolutions in machine learning in the next year, in terms of revising some of the foundations of it, so far based on generalization. And we see that mentioned and so on.
Mikhail.
MIKHAIL BELKIN: Thank you.
[APPLAUSE]
Thank you, very much, Tom. It's great to be here. I actually gave a talk here two years, or just under two years ago, a little less than two years ago. And I think, since then, we actually understand some key missing pieces. So I'm very happy to present the picture the way we see it now.
I mean, it's no secret that neural networks are complex objects. And there are two opposing kind of tendencies. So in terms of engineering, there is a tendency to make the things more complex, to get extra performance from that. From a scientific point of view, of course, we would like to take them apart to actually understand how they work and what are the building blocks.
And I think the question is that what lessons can we learn from the amazing successes of deep learning? And I think one lesson, that I will primarily talk about today, is that really, I think it's quite a powerful lesson, is that we really need a new theoretical foundation for certain aspects of machine learning. And we have to [? sort ?] [? proof ?] revisit the assumptions that we had.
Well, before getting into what's new, we have to understand what is that that we have. And the basic premise for more theoretical analysis for machine learning has been empirical risk minimization.
How does it work? You take a class of functions, H. And you choose a function from the class to minimize the loss, or the risk, on the training data. Now, the analysis of this are based on the following connection. And to see how this connection works, we have to compare the goal of machine learning to the goal of empirical risk minimization.
For the goal of machine learning, at least in the usual statistical setting I'm assuming, [? I ?] [? ID ?] here, is that you would like to minimize the loss on the data, which we have not seen. And in the statistical setting, that means that I would like to find a function which minimizes the expected loss over all the data. And this function, of course, doesn't have to belong to any class. It's just defined by this equation.
Now, the goal of empirical risk minimization, in comparison, is that minimize the empirical loss over some class of functions. So to say something about the connection we basically have to go through the following two points. And this is due to Vapnik.
So he said that the theory of induction is based on first, uniform law of large numbers. And second, effective methods of inference must include capacity control. So a law of large numbers, capacity control.
How does it work? Well, first, what is the law of large numbers? The law of large numbers basically is that empirical loss for any function in the class H approximate expected loss of f, so loss of unseen data.
Now, what is capacity control? Capacity control essentially said that H not contain functions that approximate f star. If we have one and two, then it is easy to see that the ERM solution approximates the true solution. And we are good. So that is a foundation according to Vapnik.
Now, to unroll this a little bit, let's look at the laws of large numbers. And laws of large numbers are what you may call WYSIWYG, what you see is what you get, bounds. And there are many bounds of this form.
They are of the following type. On the left you have the expected risk. This is what you have. And this is what you get. This is the risk in the future.
And that is bounded by the empirical risk, which is what you see, plus some sort of model complexity term. And this [? c ?] can be norm, it can be margin. There are many different variations of this, the meaning of this [? c. ?]
Now, if you look at a posteriori bounds, such as margin bound, they still of the same type. But they can allow [? c ?] to be data dependent. So [? c, ?] for example, can be some sort of margin. So that is uniform laws of large number.
Now, what about capacity control? So again, this is from the book of Vapnik. Basically, they did that [? s ?] we increase the number of functions in the space capacity. The training loss goes down because I have more functions now. So I can fit data better. The complexity goes up. So we have this kind of two opposing curves. One goes down, which is empirical risk. The other is confidence.
Now, the bound is a [? summan. ?] You can see it has this U-shape. So we need to look for the bottom of the U, which is this kind of optimal solution. Now, the way this is usually represented is a slightly more heuristical version of this.
Basically, the idea is that, as we increase in capacity of the space, from low to high model complexity, training risk goes down. Test risk goes initially down and then, up again. The left-hand side is underfitting. The right-hand side is overfitting. And the goal within this line of thinking is to find the sweet spot. And that's basically the optimal model.
Now, an informal corollary of this if that model with zero training error is overfit of the training data it will generalize poorly. And that makes sense, since the model with zero training error is on the very right. And it should correspond to this high part of the U-shaped curve. We will call this interpolation, since in mathematics this is interpolation.
So this is the way we have thought about this. But recently-- well, in fact, not so recently, there have been some cracks in this. And let me point out a influential paper by [? John ?] [? Little ?] when they thought they observed the following.
There are a couple of points there. But this is, at least to me, seems the key point. They noticed that you can train a neural network to have 100% accuracy on training data set. And the test accuracy is still very good. So if you are over fitting, you are not over fitting significantly. Maybe that is normal. I don't know how you quantify this.
Now, this is very suggestive that something is going on here. On its own, this is not quite enough because you may still find ways where bounds can explain something like that. But the next experiment I think erases that possibility.
And let me point out this curve. This is a little more complex. This is from our paper with my students. See you on mine [INAUDIBLE]. We do the following experiment.
We randomize some percentage, although this is [INAUDIBLE]. It's a 10 class problem. And what we do we randomize say 10% of the training set and 10% of the test set. It's matched. And then we can randomize 10%, 20%, and 30%, and so on.
And of course, if we're randomize generally performance gets worse. So now, let's just concentrate on the Laplace kernel. Neural networks and other things are similar, slightly worse.
Now, what is happening here? So first, notice that at 90% you have random. So if it's above 90%, it's a random guess. Now, what is a green line? The green line is the optimal.
So if you're randomized 50% of your data no classifier can possibly give you much better than-- I think the best is about 45% because accidentally, sometimes you guess correctly. And so the green line is the smallest loss you can possibly have with any method.
Now, what is the red line? The red line is Laplace kernel. And I don't necessarily want to define what that is exactly. But it is basically a training method. And the important thing here is that it is trained to have numerically zero loss. So by numerically zero, I mean it's like 10 to the minus 26th. So it's as close to interpolation as you can get, within reasonable computer architecture.
Now, you can see that even though this is, by classical standards, would be extremely overfit. But we're paying no penalty. So interpolation doesn't overfit. Even on very noisy data, with something like 70%, this is not much worse than the best possible.
And this causes serious problem for bounds. And moreover, this is, as you will see, it's hard to see how bounds can explain this. And let me point out what is going on here, why this is difficult to explain from the bounds point of view.
Well, if we have a bound, the training loss of that bond has to be zero. The test loss cannot be zero because there is a large element of randomization there. Now, that means that this capacity term must completely control the test loss.
And let's see what this implies for that case. And if you actually look at, say a reasonably high level, like 80% or 70%, you will see that my bound has to bound this red line. The red line is between 70% and 90% here.
Now, if it's more than 90% the bound is trivial because I can just say it's random. So more than 90% is useless. Smaller than 70% is wrong because we know that 70% is as good as I can possibly get.
So that means that if I want to have a bound, which explains what we're seeing empirically, it has to be between 0.7 and 0.9. And if it's a usual bound, we would like to keep those as n goes to infinity.
Now there are two issues here. And the first issue is that the constant in this-- I put this [? all-star. ?] So [? all-star ?] typically has maybe a log fact. There's other things.
You cannot have any log factors. You cannot even have a constant, which is off by 2 here because if it's between 0.7 and 0.5, if you multiply by 2 you're [? null. ?] That's it. So it has to be exact constant to explain this. And there are really no bounds like that.
But perhaps you can say, well, even though there are no bounds, maybe we can come up with some bound. Maybe just our analysis is not tight enough. And the problem here, I think, is conceptual, in that somehow this 0.7 there is a base risk. That's the risk of the optimal classifier.
So somehow magically, this quantity, CFN, which is some notion of complexity, like the norm, must know about the base risk. And it must be independent of how exactly that base risk is generated, like you can concentrate all there is here or there. There many ways.
And it seems like it would require some sort of magic for this to be true. And there is really no reason for mathematically that something like that should be true. In fact, there are some reasons to think that it's impossible, mathematically impossible. So that suggests that this analysis don't apply.
Now, interestingly enough, this is actually the best practice of deep learning is exactly interpolation. And this is from Ruslan's tutorial. He basically said, well, you start by making the network big enough and get zero training error.
Well, this is not the end. You may tune the parameters after this, and so on. But this is already pretty good. You get zero training error. It does something reasonable. Then, you improve it.
So the whole practice of deep learning is essentially that. And we see that this theoretical analysis just don't apply. They give three real results.
So well, I should point out that there has been recognition of this fact. Yann LeCun said, on several occasions that, deep learning breaks the rules of statistics. Now, well, it actually goes much further back. Leo Breiman has a nice note, which is reflections of the refereeing papers for [? NIPS. ?]
And there are several questions. I'm just giving you the first one. The first one is, why don't heavily parametrized neural network cover feeds of data. And this is from '95, so 24 years. I think now we have at least some answers to this.
So from my point of view, here is the first key lesson of deep learning is that the new theory of induction cannot be based on uniform loss of large numbers with capacity control. Now, of course, well, the question is well, where do we go next?
And let me first point out where we shouldn't go. And then, I'll give at least one possibility for a way we can. So first, these capacity control ideas don't really apply for that reason. Algorithmic stability is a slightly different version of that. But it has the same reliance on WYSIWYG bounds, or the same kind of criticism. The same exact argument would apply to algorithmic stability the same.
Now, regularization type analysis, for example taken [INAUDIBLE] and a number of others. They're actually of a slightly different kind. They're not necessarily based on those bounds. But the problem is that there are not many which don't diverge. Most of them actually diverge as the regularization parameters go to zero, or as you step. Your number of steps of gradient descent or stochastic gradient descent goes to infinity.
However, there is at least one example of something, which is really like what we want. And I don't know. I think it's extremely useful to think about it. And the example is really one nearest neighbor.
Actually, there is a really interesting Hilbert regression scheme, by [? Dave ?] [? Roy, ?] which really is very insightful. But I don't have time to discuss it. But those things based on oracle bounds in this, or what you might call oracle bounds. Basically, the analysis that you get connect expected loss to the optimal loss directly, without going through empirical loss at all.
Now, let me just think about one nearest neighbor because I think that's really the most suggestive. It's so you take your data. You choose the one nearest neighbor. And you assign that label. And it's really simple.
And the main thing is that there is this classical analysis, by Cover and Hart, which says that, yeah, you have a bound. And that's, at most, twice the base risk. It's not great. But it's OK.
And the nice thing is that analysis is not based on this type of bounds. And we directly estimate the expected loss. And the question is, what can we do better than one nearest neighbor? And the answer is yes.
And indeed, we can. And here is a scheme, which actually [? provably ?] better than one nearest neighbor. And in fact, it seems to cast a statistical [? optimally ?] guarantees.
And the scheme is very simple. It's actually closely related to something called Shepard's interpolation. You basically take the weights, which are inverse distance weights, so they single out. They go to infinity as you approach. And you do which, if you look at this, this is classical smoothing. But instead of doing like Gaussian, or some other smart kernel, like what we usually do, we use the singular kernel.
Now, you can see that this scheme was interpolating. It's easy to see. And remarkably, even though it under interpolates, you can actually prove-- And this is joint work with Daniel [? Hsu ?] and Partha Mitra and also with [? Sasha ?] Rakhlin and [? Sasha ?] Tsybakov.-- That these schemes are-- I'm hiding some of the conditions on alpha, and so on-- but these schemes are actually optimal, statistically. So they're consistent. And they're optimal. And this is, in fact, true in one dimension.
And here is an example of that. If you look, for the true model here is linear plus noise. And you can see that the red line, which is what you get with this scheme, is quite far from the optimal model at every data point. At every data point, it's not close to the blue line at all.
But if you actually take a point at random from this interval you will end up very close. And in fact, the more data you get the closer you get to the thing. And remarkably perhaps, this gets better in high dimension. There is some sort of blessing of dimensionality. It's somehow easier to do this in high dimension.
So maybe in one dimension, if you just look at it, you'll think, OK, this is terrible. But it's not so bad. And in high dimension somehow the peaks become more localized.
Now a simple corollary of this is these adversarial examples. And basically, with adversarial examples you take in the image. And you have some neural net and it classifies as correctly, as a dog. You add some invisible noise, specially selected, and now this is an ostrich.
Well, it' is strange. But, in fact, with that kind of analysis it's pretty obvious that, if you have any labeled noise, you will expect to see a set of adversarial example, by that I mean just misclassified points, which are everywhere dense.
I'm not saying that this is really the mechanism by which this adversarial examples are formed in real life, but they do follow directly from the analysis, assuming level voice.
Yeah.
AUDIENCE: [INAUDIBLE] dense?
MIKHAIL BELKIN: Just like rationals are dense, and reals, that kind of dense. So around every point you have unlimited number of examples.
So now, let me sort of-- so this talk so far, I showed empirical effectiveness of interpolation, at least on some examples and argued that theory of interpolation cannot be based on uniform bounds. And c, that there are some other methods, like nearest neighbor methods, which are statistically valid, optimal.
Now, you can say, well, here is actually a disconnect. And what the disconnect is that this nearest neighbor method is not really something that we use in practice. They are quite different type of algorithm. And well, how does it connect to things that we actually use and care about, kernel machines, neural network, that type of thing.
And I think there are two key questions here. First, what is a dependence of generalization on model complexity? And what is the role of optimization? Where does optimization come in?
[? Though, ?] we do have partial answers to this. The story is not complete. And there is a lot to understand. But at least now we understand the connection.
And I think the key observation is the following. If you look at the classical risk curve you see this U-shape. And where does this curve end? This curve ends at a point when the loss is zero, the empirical losses zero.
Now, why does it end at that point? Well, once you think about this complex models, like neural nets, you realize that there is nothing connecting complexity of the model to the loss. So the complexity of the model can be chosen arbitrarily.
This is actually even true for linear regression. But it's kind of an alien way of thinking about this. But for neural nets, it's pretty clear. You just add some neurons.
So you can continue this indefinitely. And you can continue to grow the model. And if you continue growing the model you realize that you have this second regime of the model, when the asterisk goes down again. We call this double descent because that's the second stage. There is a classical dissent, the first part of the u. And this is the second descent.
And the interesting thing here is that every model on the right of this curve is interpolated. So they all have zero loss. But yet, somehow getting more complex model improves performance.
Now, we have done a bunch of experiments on various fully-connected neural networks, random forest, L2-boost, various artificial data and real data. This is a rather robust kind of phenomenon. You can observe it very consistently.
Now, I should also point out that other people also observe this curve. But the key is well, the question is why? Why are we we observing this? And what is the mechanism underlying this second descent?
And I think this picture actually explains. I think that's like a lot of intuition, just in this picture. So let me describe what this is. The blue line here is I am using 40 value features. So I am taking values. and I am generating a two-layer neural network.
The first layer is values. But I am choosing the coefficients at random. So I am not training those coefficients. I am just using it as a feature map.
And, as you can see, 40 value features is enough because it's a 40 dimensional kind of parameter space. It's enough to fit the data exactly. Now the black line is 4,000 value features. And you can see that there are a couple of observations here.
First, I think it is visually obvious that the blue line is much worse than the black line. Now, how do you want to say what is worse, what is better. It's somewhat philosophical question. But it is clearly, I think nobody will choose the blue line over the red one.
And another thing to notice is that when you go to 4,000 value features, this is still a piecewise linear function. But with 4,000 value features, you cannot actually visually observe that it is linear. It's smooth. It's essentially smooth. So you converge into something, which is smooth, and presumably better than what you have when you have fewer features.
And, in fact, if I draw this with 4 billion features it will look exactly the same. You wouldn't be able to see the difference. So you can see, at least, that having more features doesn't actually hurt you.
Now, let me maybe describe this mechanism in a little bit more granularity. So consider random Fourier features, which were proposed by Rahimi and Recht in 2007.
And what are random Fourier features? It's very similar to this random value features, but you just take sines and cosines with random coefficients, or complex exponential. It doesn't really matter. And so, it's a neural network is one hidden layer and cosine linearity.
Now, we take the hidden layer of size n, capital F, and data size f. And the key properties of this, which was observed in their paper, if that as capital N goes to infinity, yeah, I'm increasing the number of features. This actually converges to a kernel machine. And kernel machine is some sort of functional minimum-norm solution.
So if you look at this, you see what is happening. This is actually on [INAUDIBLE] which is a real data set. Oh, yeah, you may say, well 50% error rate, it's pretty bad. But it's like a 30 something class. It's way above the baseline. Actually, the state of that would be about 30%. This is not bad at all.
So what is happening here is that this is 10,000 data points. And you can see that I have this U This U peaks at 10,000 data points, which is exactly when you get zero loss. And then, as I am increasing the number of features, that's 1,000, I convert to this red line, at least visually converge to this red line. And this one is exactly the kernel machine, which I can compute differently by using the kernel itself with Gaussian kernel.
And so why now am I getting better performance here? And the way to understand this is really to look at the picture of the norm. So what is happening is I am first increasing the number of features. My norm initially increases. And then, after this point, it actually starts to decrease?
Why? Because I have now many solutions, which all have zero loss. I have this huge space of solutions with zero loss. And by choosing the minimal norm, it allows me to choose one, which is best. So the more features I have the smaller norm solution I have.
It kind of works the other way around from the-- so in the classical [? regiment, ?] of course, when I have more features I have larger norm. In this regiment, I have more features. I have smaller norm. It's opposite.
And as you see actually, it works quite nicely. You really see that it converges to this kernel machine norm, which again, I can compute explicitly. So more features basically is better approximation to minimum norm solution.
Let me skip this. So in at least in the noiseless case, we can show that infinite norm is actually optimal. But it's a noiseless case. So maybe think it's not super interesting. But actually, there is something interesting because this form of this bound is very different. It's obtained by approximation and the very different form from the usual bounds that we have and machine learning.
I think a kind of interesting outcome of that is that, even in the linear case, or even in the kernel case, this really [? leads ?] to new analysis. And even for the simplest linear cases, it's something that we can now observe. That hasn't been fully observed before. I'm reluctant to say it hasn't been observed because it probably has.
And I think there is significant evidence that deep neural networks exhibit similar properties. Again, the evidence for neural networks is only partial. But it's pretty consistent.
Now, so this is maybe the connection. This is ERM, so classical ERM. We choose a space of function. And we try to find function, which minimizes a loss. This modern interpolation view is that, instead of doing this, we should minimize the norm over the [? sub-space ?] of function, which fits the constraints exactly.
And the interesting thing is that we never actually do this explicitly, at least not a neural networks. The minimization of the norm is hidden somehow within the dynamics of the stochastic gradient methods, or gradient method, so that there is something built in, which do this, while we are pretending we are doing that.
Now, I should point out that the norm is not the only aspect of this. You can get something very similar by averaging, and what the picture here. So imagine I take my data. And I take a tree. And I build this tree by doing random splits.
With enough random splits I can fit my data exactly. In fact, this is very closely related to something called PERT, which Cutler and Zhao proposed, in 2001. Essentially, same thing.
And now so, but the tree that you get is actually not good. It's interpolates the data. But it's not good. But if you average a bunch of those trees you'll get the solution, which is much better than any individual tree. So again, each tree, it's a zero loss. But it's not good. It's at this borderline of the interpolation threshold.
When you average those things, you're getting something, which is functionally much smoother, subject to the same interpolation constraint. So [? that ?] [? if ?] kind of general picture. And maybe we can think of it as a form of Occam's razor, in that the simplicity here is measured by some functional smoothness. And you want to choose the function which is most smooth, subject to the constraints of exactly fitting the data.
Now, there are three ways to increase smoothness. There are explicit minimum functional norm solutions, like what we did. That's the easiest way to understand. There implicit bias GD optimization, when somehow it's hidden in the process. And there is also averaging, which seems quite different.
Interestingly enough, for kernel machine, they all three coincide. With the explicit, there's just solving the system. Implicit doing gradient descent or SGD. And averaging is the Bayesian view of Gaussian process. You're averaging trajectories.
So that is the generalization landscape. On the left, we have with classical WYSIWYG bounds. On the right, we have this modern regime, which are based on inductive biases. And now, we're starting to have first analysis. And the [? Sasha, ?] in particular has recently worked out an interesting example of a kernel machine.
So here is an interesting thing, here. I think that's maybe a remarkable outcome of this. If you think about overfitting, classically what we have is that when we overfit we should reduce the number of parameters.
Here, you can see that overfitting actually is a band. It's a range of parameters. And now, there are two ways to deal with overfitting. One way is to go left, reduce the number of parameters. The other way is actually to increase the number of parameters. So you can deal with overfitting by increasing the number of parameters. That's, you know, deep learning.
[LAUGHTER]
Now, let me make a couple of points about optimization. I think it's important. I don't have time to talk about it in detail. There is more to say about it, much more to say about it. But let me point out two important things.
In the classical regime, there are many local minima. And SGD with fixed ties does not converge. It oscillates. The modern regime is different. First, every local minimum is global, well, at least for networks wide enough. There are some caveats, of course, to every one of these statements, but.
Second, local methods converge to global optima. And third-- and this is the joint work with Siyuan Ma and Raef Bassily-- you can show that small batch SGD converges as fast as gradient descent per iteration.
And what does it mean, per iteration? An iteration of gradient descent is an epoch. An iteration of if you have one million data points, that's like 1 million. You have to go through one million. An iteration of small batch SGD [? think ?] hundred. So it's like a million versus 100. So it's overwhelming computational advantage of stochastic gradient descent.
So this is the summary of that. And then, at the end, I will give a couple of points about the inductive bias. But this is the summary of that part of the talk. So first, classical models, you need careful parameter selection. There are many non-global minima of optimization, at least for things like neural networks, not, of course, for linear regression. And SGD oscillates.
For the modern model, you get good generalization from inductive bias. And it's really not that sensitive to the number of parameters or minima global. And SGD actually converges fast, very efficient computationally. It's an amazing picture.
All right, so. Now, I mean, you can say, well, there are probably a number of questions about this. But I mean, this is far from a complete picture. But one key question here is that I said, well, generalization from inductive bias.
But what is this inductive bias? Where is it coming from? And for linear regression we actually understand that the norm. But in general, what is this norm? It's some sort of functional norm.
So maybe, I'll describe a little bit what it can be. And I should point out that really, this is the key to this understanding inductive bias it is kind of the key to understanding generalization because before, we had this capacity of the space. Now, it's all replaced by this inductive bias.
And I should point out there have been a lot of progress toward understanding this. And for example, neural tangent kernel, this recent work by Jacot et al. As they basically say that in certain regimes we can view a neural network as implementing some kernel. And this kernel maybe we can understand what they are.
Kernels are not easy, but. So just because something reduces to a kernel doesn't mean that it's easy to understand. But at least it's much easier than the original neural net.
But let me describe some of our work with [INAUDIBLE] and Caroline Uhler or an understanding inductive bias in some special cases of neural networks, with deep neural networks, when I think something pretty remarkable happens. And before I describe the actual result, Let me first point out the difference. I think it's important here.
And I would like to differentiate "memorization" and "interpolation." and they're used interchangeably in the literature. And there is significant confusion. I think they're quite different things.
So interpolation is just zero loss. So for example, this curve interpolated four data points. Now, if I remove those points there is no way to extract the data just from the structure of this curve itself.
So this curve interpolates but it doesn't memorize. There is no way I can figure out now where those points were. So memorization need something else. It has to have a retrieval mechanism.
Now, if you have something like this, I can remove the data. And I can still reconstruct the actual data from the structure. If I know that these peaks correspond to my data I can still reconstruct. So that's I think that distinction. And it's important. It will be important in a second.
First, we consider the case of deep autoencoders. And what's an autoencoder? It's simply a map from [? RG ?] to [? RG. ?] So it's just a map from a space to itself. And its parametrized by this w. And this could be some complicated thing from deep neural net.
Now, how do you train this? You train phi of w xi minus xi. So you basically train each point to map into itself. And you use the usual gradient descent to train this, for some standard architectures. They take five-layer neural net, or something.
Now, once we train this, if it interpolates, if it has enough parameter it will interpolate meaning that phi w fxi will be equal to xi. That means that you can view us as a discrete dynamical system with fixed points xi because it maps each xi to itself.
Now, what kind of point is this? And it turns out and I think it's quite remarkable if I'll explain in a second why this would be difficult to expect this, is that fixed points are actually attractors.
This is not always true, but this is mostly true. I would have to give a lot of caveats to say when exactly this is true, but certainly there are some things. We can even make 500 points to be attractors, like you train with 500 points. And they're all attractors.
So now, how do you memorize? And apparently at least for many of those things there are no other attractors. So how do you memorize? You simply train the neural network to minimize this loss, the difference.
And how do you retrieve? Your start with a random input. And you iterate this thing. Now, this is what it actually looks like. This is input. And this is output after one iteration. So you see after one iteration you already get one of this. And this is CIFAR10 images.
So I start with noise. After one iteration I get this. Well, this is visually close to one of the images. It's not actually, exactly one of the images. But if we iterate this two or three more times it becomes numerically exact.
So now, why this surprising? Well, I think this is a picture to explain why this is surprising, actually. And in general, so what's an attractor? An attractor is a picture like this. It's that you have these points, these lines of the dynamical system, and they all point inside.
What's not on an attractor? Not an attractor is a point like that. When some of these arrows go in. And some of them go out. Or, they can all go out, of course. And the trajectories are now like this. You can see that none of the these points will not actually converge unless they exactly on this flow lines.
Now, if you think about it, how many of these arrows I have? The number of these arrows is equal to the number of the dimensions. So if I'm in hundred dimen-- like this image is, of course, all pretty highly dimensional. If I have 100 dimensional data, I have 100 of these arrows. If each arrow is random, I mean it's not, but just humor me, if each of them was chosen at random, then it's the probability that doesn't attract there is 1 over 2 d-th, which is 1 over 2 to the 100th So we basically would never see an attractor here if they were really random.
And the fact that we see these attractors all the time I think it's of a powerful indication that something really interesting is happening when you train this network using gradient descent. It's a strong hint about what inductive bias of gradient descent is for this type of systems.
And in fact, you can even do a movie. So you can encode this sequentially. And then, instead of an attractor you get an attractor cycle. And you can-- let me see. It starts with a noise. And very soon it becomes a movie.
This is just attractor cycling. This is each image gets encoded into the next one. And I mean, maybe I'll end with a speculation that maybe this could be a model for biological memory. It's certainly tempting to say that.
I'll think I'll stop here. And I would like to acknowledge my wonderful collaborators.
[APPLAUSE]