Banach Space Representer Theorems for Neural Networks
June 9, 2021
June 8, 2021
All Captioned Videos CBMM Special Seminars
Prof. Robert D. Nowak, University of Wisconsin-Madison
Abstract: This talk presents a variational framework to understand the properties of functions learned by neural networks fit to data. The framework is based on total variation semi-norms defined in the Radon domain, which is naturally suited to the analysis of neural activation functions (ridge functions). Finding a function that fits a dataset while having a small semi-norm is posed as an infinite dimensional variational optimization. We derive a representer theorem showing that finite-width neural networks are solutions to the variational problem. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that neural networks are solutions in a non-Hilbertian Banach space. While the learning problems are posed in an infinite dimensional function space, similar to kernel methods, they can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers which are related to the well-known weight decay and path-norm regularizers. Thus, the results provide new insight into functional characteristics of overparameterized neural networks and also into the design neural network regularizers. Our results also provide new theoretical support for a number of empirical findings in deep learning architectures including the benefits of “skip connections”, sparsity, and low-rank structures.
This is joint work with Rahul Parhi.
RAHUL PARHI: I'm Rahul Parhi. And I'm happy to introduce Robert Nowak, who is a [INAUDIBLE] professor of engineering at Wisconsin-Madison. The early research of Rob is originally signal processing. He has done extensive and elegant work in wavelets, signal compression, with even as far as I an NPR demo for compression of audio and music. Right, Rob?
ROBERT NOWAK: That's right.
RAHUL PARHI: And tomography, matrix completion, sparsity and, of course, machine learning. And his papers are always, I find, are nice, insightful, formal but also with practical depth. I will give you an example to follow.
He has been often in the NSF site visiting the reviews and advises CBMM. And I'm very grateful for it. And so it's a great pleasure to introduce him to you today. He will describe a very recent result that I find quite surprising and mathematically quite important likely for the future theory of deep learning.
The result is kind of representative theorem for deep networks. As a reminder, I would just mentioned that the representative theorem that [INAUDIBLE] introduced for reproducing kernel Hilbert spaces is the cold property of kernel and machines connecting an infinite dimensional problem to a finite measured representation of functions. And with that, Rob, you can start.
ROBERT NOWAK: Awesome.
RAHUL PARHI: Thanks for coming.
ROBERT NOWAK: Yes. Thank you. Thank you for having me. Thanks for that kind introduction. And thanks, everybody, for coming out today to listen to my talk. Just before I begin, I just wanted to mention what this kind of artwork is here that you're looking at.
This is a strange attracter of a simple recurrent neural network. My dad kind of fools around with fractals and things like that. And he started working with rectified linear units and, like I said, really small recurrent neural networks.
And this is the strange attracter that comes out of one of them. So I thought it's a super cool video and picture to look at, but it also is a nice way to show how rich and complex neural networks can behave.
This is just three neurons in a feedback configuration, so it's kind of amazing. So that's Dave Nowak, my dad. He's the artist in the family, and I love his artwork. The way I'm going to look at things today is what I'll call a non-parametric perspective about overparameterization in neural networks.
So I think everybody is familiar with the idea of overparameterization. It just means that we're going to use neural networks that have very, very wide layers of neurons in them-- many more neurons than there are even training examples that we're fitting to.
So that's the so-called overparameterized regime of machine learning in neural networks. And by non-parametric, what I mean is this. So we're going to think about having a training data set. These are these xy pairs you get in kind of the standard setup.
There'll be a space f of real value functions that map our features x to our labels y. And then we'll look at learning problems that take on a pretty standard form. We're going to try to find a function that minimizes a combination of the data fitting error, which is the first term here, and then a regularization which is lambda times the norm of that function.
So that's a kind of standard setup. We're going to try to find a function that fits our data but a function that is in some sense well behaved, and that's what's being measured by that regularization term.
And I'll give more examples about regularizers in their various forms. The Lambda that you see there is the regularization parameter. So I think this is pretty familiar setup for most people. Is that right, Tommy?
ROBERT NOWAK: Good. Good. So I call this overparameterized when the function space is infinite dimensional. And kind of a nice way to think about why infinite dimensional function spaces are basically overparameterized models is to think about a basis expansion for functions in that function space.
So if I have a basis for a function space like this pk set. These are basis functions. Then any function in the function space can be represented as a weighted combination of those basis functions as I'm showing there at the bottom.
And so the coefficients in this expansion, the theta ks, are playing the role of parameters. There would be generally an infinite number of these in an infinite dimensional function space. And then the norm that we were using in the regularizer is some way of measuring the size of those parameters.
And so we'll look at specific cases as we go today, but this kind of series representation gives you this kind of direct connection between infinite dimensional function space and a parameterization in that function space.
So what am I going to really-- the core question that I want to look at today is, what kind of functions do neural networks learn? And so I'm just having a little picture of a neural network there on the left. The neural network function is denoted by f sub theta, and theta is just a collection of all the weights and biases in that network.
If we train a neural network then using gradient descent with weight decay, which is a standard type of regularization used in neural network training. I'll say more about that too as we go. The question is, what sort of functions will you learn when you fit one of these to data?
And specifically if theta hat's the solution to an optimization of this form where we're just trying to minimize data fitting plus a lambda times the sum of squared values of all the weights and biases, what sort of functional properties will the solution have?
Can we say something about those functions? That's the core question that I want to look at. So the talk will be divided into three parts. The first part is just sort of univariate one dimensional neural networks that are shallow-- just a single hidden layer.
And the reason for starting here is that it allows us to relate what we're doing in neural networks to all sorts of classical methods, including things like reproducing kernel Hilbert spaces that we just talked about before.
Then I'll move on to multidimensional shallow networks, and then finally we'll talk about general deep neural network architectures. And all of this is joint work with a brilliant student of mine, Rahul Parhi. So let's start off with the one-dimensional and shallow.
So the way I want to begin is just to tell you what I think are three of the most remarkable ideas in sort of the history of data science, if you will. One is the reproducing kernel Hilbert space idea. Another is wavelet representation-- something else I will have worked a lot on over the years.
And then finally neural networks are the third really remarkable idea. And what I'll show you in this introductory part of the talk is that wavelets and neural networks actually share a lot in common in that they're distinct from the way that reproducing kernel Hilbert space methods work.
In RKHS, as they're called, the models are linear in parameters, and they basically involve L2, or squared norm type of regularization, whereas when we move to wavelets the neural networks, these are nonlinear in parameter models.
And they're effectively trained, if you will, using L1, or absolute value of type regularization as opposed to squared kind of regularizers. And those two features, the linear at L2 versus nonlinear at L1, are really what make for some profound differences between what you can do with wavelets in neural networks versus some of the limitations that we find with kernel methods.
And I'll go through and explain this. So to compare those three things-- the kernel methods, wavelets, and neural networks-- I'm going to focus on two function spaces. The first one, H2, is called a Sobolev space. The second one, BV2, is a bound bounded variation type space.
And what's really kind of important here to understand is simply that they're almost the same type of thing. They're both defined in terms of the second derivatives of function. So we're going to look at the set of all functions whose second derivatives satisfy some sort of bound.
In the case of the Sobolev space, H2, it's that the integral of the square of the second derivative is finite, whereas in BV2 it's that the integral of the absolute value of the second derivative is finite. So it's like the L2 norm of the second derivative is bounded in H2, whereas in BV2 we're only asking for the L1 norm to be bounded.
So what that means-- the L1 norm bounding is a little less restrictive. So we actually have a nesting like this. The Sobolev space is the smaller space, H2 in yellow. BV2 is a bigger space that contains the Sobolev space, and then L2 just means all functions that are square-integrable.
And that contains both of them. So we have this sort of nesting. And just to give you an example of a simple function that has this bounded variation property but doesn't belong to a Sobolev space is this little kind of triangular wave function.
So this is an example of a continuous function that's very smooth except at a finite number of points. And functions like this are in this BV2 space but not in the Sobolev space because of those points where the derivative is discontinuous.
The main point here is that there are kind of some simple functions that we can capture with BV2 that can't be represented using a Sobolev space. And the long story short, what I'll walk you through in a little bit more detail as we go along here.
The long story short is that all three methods-- the kernels, wavelets, and neural networks-- are optimal if we're learning in the Sobolev space, H2. But kernel methods are suboptimal in the space of bounded variation BV2.
So they have a limitation whereas kernels and wavelet methods are also optimal in BV2. And I'll explain why this happens and what the key difference is in a minute here. OK. So let's, first of all, talk about learning in this Sobolev space.
Again, that's the space where the second derivative, F2-- that's the notation here. F with a superscript 2 is the second derivative. If you square it and integrate it, to be in the Sobolev space means that that squared integrated second derivative is finite.
So naturally what you would think to do here is, if you're trying to find a function in H2 that fits your data well, you would search overall functions. And then your regularizer would just be this squared integral of the second derivative, and that's what you see in the equation, the optimization, at the top there.
The solution to this is a classical cubic smoothing spline. This goes back to that famous work of Kimeldorf and Wahba that Tommy mentioned at the beginning. This is the sort of first representer theorem in machine learning and data science and statistics.
And Grace is a colleague of mine and a great friend, so I always have loved her work. And it inspired a lot of what I'm talking about here. So the solution f hat to that optimization above has a really simple form. It's just a weighted combination, these alpha j hats, of a kernel function k where the kernel function is located at each of the training data points-- the x sub j's.
And so the only kind of parameters that you need to tune or adjust here are these alpha hats. And you can see that the overall function that we're learning, f hat, is just linear in those alpha hats. Those are just weights on the kernel function.
The Alpha hat's the solution of a simple quadratic optimization. And because of that, we know that, if we have a quadratic equation-- or a quadratic optimization objective. We take the derivative of a quadratic. We get linear.
And so the solution is just linear in the data. The data here would be y but the given labels that we're trying to fit to. So alphas are linear in y. That's why we call this a linear method.
And one of the nice properties about smoothing splines is that, if our data really were generated according to a function f star in this Sobolev space, then the smoothing spline solution with a properly chosen regularization parameter Lambda satisfies this rate of convergence.
So f hat is converging to f star in a mean square error sense at a rate of n to the minus 4/5, where n is the number of training examples that you give with the method. And it turns out this n to the minus 4/5 is the optimal rate of convergence.
No other method can give you a faster rate of convergence for estimating or learning a function in this Sobolev space. So smoothing splines and reproducing kernel Hilbert space methods are optimal in this setting.
So just kind of go back to this idea that we can also think about things in terms of a basis expansion, or series expansion. If we have a function in the Sobolev space, then we can actually expand it in terms of its Fourier series. That's what I'm showing in here.
So these are just the complex exponentials-- the theta k's are the Fourier series coefficients. And if you're in a Sobolev space, that just means that the Fourier series coefficients have to decay sufficiently quickly. That's what that k to the 4 theta squared business is over there in the right in the top.
So being in a Sobolev space basically just means that your Fourier series converges sufficiently quickly. The coefficients tend to zero at a certain rate. And we can express that smoothing spline solution in terms of the Fourier series.
Another alternative approach is just to use a truncated Fourier series estimator. And what you do here is you just estimate the Fourier series coefficients from the data. Those will give us estimates, the theta hat. You plug those into the Fourier series formula.
And then all you're going to do is just truncate the tail of that series. So basically throwing out all the high-frequency terms above a certain frequency. And if you do this, this very simple truncated Fourier series estimator, it also enjoys this optimal rate of convergence of n to the minus 4/5.
So we can solve this problem equally well in terms of rates of convergence just by taking the Fourier series of our data and truncating it. And another important kind of point to make here is that the Fourier coefficients, of course, are also just linear functions of your data because the Fourier transform is a linear operation. And so this is another linear estimator.
So let's see what are the potential limitations. I've told you that they're going to be optimal for your functions are in these Sobolev spaces. But what would a function look like that might not be in the Sobolev space? It would be maybe a triangular function.
And to really kind of highlight the limitations, I'm looking at this triangular kind of waveform above-- the blue curve that you see there. It has some slowly varying triangular oscillations and then a rapidly varying there in the middle.
And the red dots are data points-- sort of samples of that function plus a little bit of noise. So the reason I like this little test function is that it shows that there's spatially varying smoothness. It varies slowly in some portions and rapidly in others.
Now if we use a smoothing spline-- here are two different smoothing inspired solutions with different choices of that regularization parameter Lambda. And what you can see is that, if we choose Lambda to be large, the smoothing spline does a pretty good job of fitting to the data without totally overfitting in the slowly varying portions.
But it kind of loses all the high-frequency information there in the middle. On the other hand, if we chose a smaller regularization parameter, we can start to recover some of those high-frequency oscillations. But then we start totally overfitting the data in the smooth areas.
And so there's this sort of fundamental tradeoff. Kernel methods cannot adapt to the local smoothness in the function or the data. You sort have to pick what smoothness you want, and it's going to be that general level of smoothness everywhere in the function.
You can't have it be very smooth in one area and less smooth in another. It's just not something you can achieve with a kernel method. Now let's move on to this space of bounded variation, BV2. And here I have an optimization problem almost exactly what we were looking at just a minute ago.
We have a data-fitting term and then this regularization term. But notice that, in the regularization term, the integral there, it's the absolute value of the second derivative of f rather than the square of the second derivative of f. So that's the difference between what we were doing just a minute ago and what we're doing now.
And that would be the natural regularizer you would use for this BV2 space. So the solution to this optimization is not a cubic smoothing spline. It turns out that the solution is what's known as a locally adaptive regression spline, and it has a particular form that looks like this.
It basically looks like a linear portion, or a linear function. That's the alpha 0 plus alpha 1 times x at the front of that equation. And then a sum of essentially truncated linear functions.
These are effectively the form of a rectified linear unit. These x minus xj's with a subscript plus, that means just take the positive part of the argument inside the parentheses. So this is like a rectified linear unit.
And this kind of spline goes back to some classic work of Fisher and Jerome who were looking at L1 regularization in splines. And then Mammen and van de Geer sort of developed the full statistical theory for these methods in '97.
And one really interesting fact that was sort of fleshed out by Ryan Tibshirani not too long ago was that you can actually recast this whole locally adaptive regression spline as sort of what we would call a LASSO optimization.
So it's basically you take that parametric form f sub alpha that I have above. And then you're going to try to choose the parameters alpha so that you fit the data plus the regularization term, which is just the L1 norm of those alpha coefficients.
And so that is one way to solve this for this locally adaptive regression spline. And what's cool is that, if the true data-generating function now was in BV2, which, of course, also means it's in that Sobolev space because we're where you have that nesting feature.
But for any function in that BV2 space, this locally adaptive regression spline also gets this rate the same rate as we were seeing before-- n to the minus 4/5. So this is kind of remarkable because BV2 is much larger than the Sobolev space, but it still is just as effective in terms of how many training examples you'll need to get a good estimate that the rate of this means wherever decay is, like, n to the minus 4/5.
So there's also a series viewpoint in this BV2 space, and that would be another very familiar thing to people at least in signal processing and statistics, which is wavelet thresholding. And so the idea here is instead of looking at a Fourier series or sum of sinusoids, we're going to look at a sum of wavelet functions.
And I'm showing Daubechie for a wavelet there on the right. That's a wavelet basis function. You can represent a function in terms of a weighted combination of these wavelets. Those are the theta jk's that you see in the series expression.
In a wavelet representation, j corresponds to the scale or how big, or wide, or narrow that wavelet function is. And k is the location where the wavelet function is located on the line.
So if we use the notation f sub m to denote the wavelet series approximation to a target function f using the m-terms with the largest coefficients. So the idea here is you have some function f. You look at its wavelet series.
You just look at which m coefficients, the thetas, which subset of size m are the largest. Keep those and throw away everything else. So it's kind of like a truncation, but you're truncating referring to magnitude of the coefficients.
This is what's called a nonlinear approximation. So if we do that, use that kind of m-term nonlinear approximation, if f is in this BV2 space, then the rate of convergence of this m-term approximation is very good. It's, like, n to the minus 4.
And what that means is there's a pretty straightforward back-of-the-envelope calculation you can make then to show that if your true data-generating function f star is actually in this BV2 space, then empirical wavelet coefficient thresholding, which means just throw away all the-- set all the way the coefficients that are below a certain threshold to zero that that resulting estimator f hat also enjoys this n to the minus 4/5 rate of decay in terms of mean square error.
And that starts with the classical work of Donoho and Johnstone back in the '90s. And so we have another kind of parameterized form but with an infinite number of parameters, these wavelet coefficients, which give us the same performance as the locally adaptive regression spline.
So here is what this looks like-- how these things can be spatially adapted. That's kind of the important idea. They're both nonlinear in the data. So either we're doing this LASSO regularization in the case of the locally adaptive regression spline, or we're doing wavelet thresholding in the wavelet case.
And you can see both of those methods, as you see below, are able to do a good job of smoothing the very smooth parts of the signaling function, whereas in the more vastly varying portions, they're still able to maintain and track that variation.
So you sort have this ability to change effectively how much smoothing or regularization you're doing in different parts of the function depending on how regular or smooth the data are in those regions-- something that we couldn't achieve with the kernel methods.
So that's what I mean by adapting to local smoothness. Wavelet and locally adaptive regression spline methods were able to do this, but the kernel methods were not. So to connect this to neural networks, I'm just looking here at a single layer rectified linear unit neural network.
It has a form mathematically that I'm showing there, and just pictorially the neural network looks like what you see there in the upper right. This plus sign again means to take the argument inside the parentheses and threshold it at 0.
So if it goes negative, just set it to zero. So that's the rectified linear unit action. And you'll recognize that this is essentially the same form as what we had for the locally adaptive regression spline. So in fact, it is the exact same form.
And how do people train neural networks? Well, what they would do is they would take a neural network parameterized in terms of the weights, which are the v's and w's here-- the input and output layer weights-- and then the b's which are the biases of each neuron.
So you take that parametric model. You try to find the parameters that give you a good fit to your data-- that's the first term here-- plus a regularizer which is proportional to just the squares of the weights in your neural network.
So that's the sort of objective function that would lead to the well used and common practice of using gradient descent or stochastic gradient descent plus this idea of weight decay. This L2, the squared regularizer term that you see there, is what leads to weight decay in that gradient descended steps.
Now I claim that the solution to this optimization is exactly the same as the locally adaptive regression spline. And to see that, here is again what weight training with weight decay is aiming to minimize.
And sort of the key observation is that if we took any constant c, a positive constant, we could rescale the input and output layer weights. And so we could map each neuron's input and output weight, vj and wj, to a new pair vj over c and c times wj.
And that rescaling on the input and output doesn't affect at all the overall function because of the piecewise linearity of the rectified linear unit. And so because we can rescale like this arbitrarily, it's easy to see that, at the solution to the optimization, we must have the magnitude of the input and output layer weights exactly equal.
And that's just what would be the thing that would-- for any given function, that would make the regularization term the smallest. And this observation was first popularized and worked by Neyshabur and others back in '15. I think it was known before that in some other context, but that's the key observation.
So what does that mean? It means that, effectively, what we're doing when we use weight decay is it's equivalent to regularizing with something that's called the path-norm. And the path-norm is just a reparameterization where we introduce parameters alpha j, which are the product of vj-- the output layer weight with wj, the input layer weight.
And if we do that, we can rewrite it again in that kind of LASSO like form that we saw at the locally adaptive regression spline. And Rahul and I took this work a little further and showed that, then if you do solve this optimization, the solutions are exactly the form of a single hidden layer neural network, which again is also equivalent to one of these locally adaptive progression splines.
Oh yes. Sorry. I just wanted to point this out. So again I made this point a couple of times now that, effectively with the neural networks or at the locally adaptive regression spline, you're sort of using this L1 regularization of the parameters-- these alphas.
And that's what you see at the bottom there. And for those familiar with L1 regularization and LASSO type methods, you know that that gives you sparse solutions. And I just wanted to show you that's exactly what happened.
So here what I'm looking at is a overparameterized rectified linear unit neural network. There are 10 times more neurons than we have training data points. So up in the upper left, you see the data points. Those are the blue dots.
The black function that's superimposed there is the learned function that the trained neural network produces. And again we're using way more neurons, 10 times more neurons, than we have training points.
And so if we run just standard stochastic gradient descent with weigh decay, we do indeed find that the solutions are sparse. Most of the neurons are effectively inactive. And we can check that by just looking at which input and output layer weights have large values.
Most of them are very close to 0. In fact, the total number of active neurons, if you will, is proportional to the number of training examples. So no matter how wide you make this network, SGD with weight decay tends to find sparse solutions, where the width of the effectively learned neural network is proportional to the number of training examples in the data set.
So that sort of follows exactly from some of the theory, and I'll be going into this in more detail throughout the talk. Let me just wrap up this first section by pointing out the connections again between neural networks and wavelet thresholding.
So both of these approaches can be viewed as selecting the best m-terms from a dictionary of atoms. The atoms, or building blocks that we're using here, are either wavelet functions or rectified linear unit neurons.
In the case of the wavelet dictionary, it's a discrete dictionary. We could have an infinite number of elements, all the wavelet basis functions, at different scales and locations. But it's discrete countable. You could just enumerate them all with j and k.
How do we train effectively in wavelet thresholding? We're just thresholding the empirical wavelet coefficients. If we move to the ReLU dictionary, the Rectified Linear Units, now this dictionary's continuously parameterized by the weight's biases.
So it's continuous. So it's different than the wavelet dictionary. And we train it differently too. We use stochastic gradient descent plus weight decay. So they're very different approaches, but they're kind of leading to the same sorts of capabilities and performance in this simple one-dimensional setting that I was just talking about.
And I just wanted to mention that this idea of selecting the best atoms, or building blocks, from a huge dictionary of elements really goes back to the seminal work of Andrew Barron, who sort of formulated neural network as learning over continuous dictionaries using L1 or total variation types of regularizations.
And I'll come back to that again at the end of the talk. But I do think that Barron '93 really had an incredible idea, which now even all these years later turns out to be really important, I think. And maybe should have been-- people should have even pursued it further back then when he first did it.
OK. So here's just a picture summarizing the main points here. The kernel methods cannot give you this spatial adaptation to different levels of smoothness in different places in your function, whereas wavelets in neural networks can.
OK. Let's move on to part two. Now this is where things get a little more complicated. So I'm going to be explaining a lot of it through some pictures that I hope will make it clear.
And just to make some other notation clear going forward and the rest of the talk, we'll be looking at neural networks that have combinations of just linear summing units and also these rectified linear unit activation function.
So I'll use orange for the rectified linear units and gray for just summing nodes. And then phi will denote the rectified linear unit activation function. To begin with, I just want to recall this idea of representer theorems. Tommy mentioned it at the beginning.
A representer tells us that the solutions to certain learning problems in infinite dimensional function spaces can be expressed in terms of finite dimensional parametric functions, and this goes back to the famous work of Kimeldorf and Wahba as I said.
And here's the general reproducing kernel Hilbert space representer theorem. It says if f is a reproducing kernel Hilbert space with a kernel k, then for any training data set and any lower semicontinuous loss function-- you can just think of that as squared error if you want-- the solution of minimized overall functions, the sum of your data fitting losses, that's the first term, plus a regularizer Lambda times the reproducing kernel Hilbert space norm of the function f.
If you solve that optimization, the solution has a representation of the form of a weighted combination of the kernel functions, where again the alphas are these parameters. And so this is what we saw when we looked at the smoothing spline to begin with.
That was in a very special reproducing kernel Hilbert space-- the Sobolev space that we were considering. But this follows generally for any RKHS and any kernel function. So what Rahul and I set out to do is we were sort of interested in this question.
Is there a representer theorem for neural networks? And the answer is yes. But unfortunately, or maybe fortunately depending on how you think about it, it's not in a Hilbert space.
And so I'll kind of-- for people who don't know what the difference is between Hilbert spaces and non-Hilbertian finite spaces, maybe it's not so important. But it's a different class of function spaces. I will explain what these function spaces are.
But they can't-- they are not reproducing kernel Hilbert open space is the important message. So this is an informal statement of our representer theorem. There's a non-parametric Banach space that has a particular norm, which I'll give you in a few minutes here.
Such that for any data set and any loss function-- again it could just be squared error. If you solve a similar regularization problem, or you tried to fit the data, but now instead of using a reproducing kernel Hilbert space norm, you use this special Banach space norm that I'll tell you about.
Then the solutions have a representation that are exactly the form of a single hidden layer rectified linear unit neural network, where the number of neurons can be less than or equal to the number of training points.
So here k is the number of neurons in your neural network and the number that you need is less than or equal to the number of training data. So the phi here, that's the rectified linear unit. You, again, see the weights-- the input and output weights, w's and v's, and the bias is b.
The other term, c of x, is just a linear function-- just a simple line. And I'll explain why that line term comes in there, but it's basically just a skip connection. It turns out that's a natural byproduct of this theory that we need skip connections.
So how do we kind of approach this problem? And it didn't come out of nowhere, or how do we prove something? It didn't come out of nowhere. Really, over the last decade or so, there's been a lot of interesting work that has focused on learning over infinite dictionaries of atoms.
So this really started to take off with people looking at things like compressed sensing-- especially what's known as off-the-grid compressed sensing. I won't go into what that is, but it was a very popular area about 10 years ago.
Also, generalized splines, this is some great work by Michael Unser and his colleagues over recent years. And then more generally people investigated using total variation semi-norms in regularized data-fitting problems.
And one paper in particular kind of brings this all together very nicely. I really like it. It's a paper by Bredies and Carioni, where they kind of considered a general class of problems where you're trying to find a function, in their notation u, that fits data plus a regularizer that's one of these tonal variation semi-norms.
And they have this really nice statement that I think sums up what we were seeing in all these different areas. That is, in an infinite dimensional setting when the domain is a Banach space, then there's a clear evidence that the action of these total variation regularizers is promoting different notions of sparsity.
But there hasn't been a comprehensive theory explaining this. That's exactly what they set out to do in this really beautiful paper of theirs. And after kind of looking through this literature and coming across their paper, we're like, aha. We see how we can use this to study neural networks.
And so really what Rahul and I did is try to take the general theory that came out of this total variation type of analysis and specialize it to the case where we were looking at rectified linear units as our atoms, or our dictionary elements.
So how do we proceed then? So what's really important is this idea of sparsity again. So let's talk about sparsifying transformations-- transformations that take functions and make them sparse.
So in particular, let's look at atoms. So an atom phi, that could be like a rectified linear unit, or it could be something else. We want to find an operator, a linear function or operator, that sparsifies that. So what that notation L phi equal delta means, it means that if you take an atom phi, you apply the linear operator L, it produces a Dirac delta function.
So let's just look at some really common examples. If I have a sinusoid, and I take the Fourier transfer, I get a Dirac delta that's located exactly at the frequency omega 0 of that sinusoid. So that's the sparsifying transformation to use if you have your atoms being sinusoids.
We could also look at localized waves like wavelets. What would be the natural sparsifying transform there? Obviously, the wavelet transform. The wavelet transform of the little localized wave results in a product of delta functions-- one indicating the scale, the j0, and the other indicating the location of that localized wave, k0.
So if we have a ReLU building block, if that's the atom that we're using in our dictionary, what's the right sparsifying transformation? Well, again we can say, well, what are the sort of parameters of that type of atom?
It's the bias and the weight, or the slope in the univariate case. And as we sort of saw, taking the second derivative was the thing that we were doing before with the locally adaptive regression spline. Why does that make sense?
Well, if I have a ramp function like this, I take one derivative. I get a step function. I take another derivative. I get a delta function. Right? It's just kind of standard undergraduate signal processing.
And so specifically what you're going to get, if you take the second derivative of a rectified linear unit, is you get a delta function that's located at the ratio of the bias to the input weight w and then the scale, or size of weight on that delta function, is the absolute value of the slope.
That's what we are doing before effectively when we were talking about the univariate, or one-dimensional, neural networks. How do we move to the multi-dimensional setting, or multivariate setting?
I'm going to show everything in two dimensions. So imagine now I have a two-dimensional input x1 and x2. What I'm showing you here is what does a rectified linear unit looked like in two dimensions. It just looks like a two-dimensional ramp function. That's what I'm showing there on the right.
Now these are what are called generally-- so what's important here is, that this rectified linear unit, it only varies in one direction-- the direction indicated by the weights w1 and w2, and then it's constant in the other direction orthogonal to that direction of the ramp.
And so this is what's known as a ridge function. More generally, ridge functions are parameterized by an angle and offset. And the angle is given again by the weights. You could just think of this sine of the angle and cosine of the a bar the weights w1 and w2.
And so just to illustrate this, let's look at that ramp function from a top-down perspective. What's going on is, if I change the orientation, or the weights, I just sort rotate that ramp function around. And then if I change the bias, I'm shifting it in the plane.
OK. So that's how the weights and biases sort of locate and orient ReLU functions in the plane if we were looking in two dimensions. And, of course, we could generalize this to multiple dimensions.
So whenever we have functions that are sort of-- it's important to look at from different angles and offsets-- the natural tool is what's known as the Radon transform. So probably most people, if they've encountered this before, have seen it in the context of computed tomography where we're trying to shine X-rays through some of these.
It would probably take projections from different angles. And we get these projections, and then we try to [INAUDIBLE] construct a cross section through the body. We're sort of doing the same thing here, but we're going to apply to studying these ridge functions.
So let's look at how this works out then. So here's a two-dimensional rectified linear unit. If I take the second derivative, in two dimensions, we can do that by using the Laplacian operator, which basically takes the second derivative in the first variable x1 plus the second derivative of the function and the second variable x2.
If we did that, you can think about, just by looking at that function, again, if I take one derivative, I kind of get a step function. If I take two derivatives, the second derivatives can be 0 everywhere except along the sort of activation threshold of that ReLU.
So that gives us-- well, I'll call it Dirac line. And then if you take the Radon transform of a line, you get a point. So that's kind of again something that would be familiar to most people if they've ever encountered computed tomography.
And what's beautiful about this in particular is that the location of this point, or Dirac impulse in the Radon transfer domain, exactly reveals the orientation and offset, or bias, of the neuron in question.
And this idea of using the Radon transform like this really first appeared in this really great paper by Greg Ongie and his collaborators just about a year ago. So here's what's going on. We have a ReLU activation function.
We take the second derivative. That's the Laplacian-- that blue delta. Then we apply the Radon transform. That's the stuff in magenta there.
And what pops out if we do that is a Dirac impulse whose location tells us the bias and orientation and whose height tells us how steep the slope is of that ReLU activation function. These are the two norm of the weights.
So here's what happens then if you apply this to a neural network. So what I'm showing on the right-- or sorry. Rather the left-- is a 7-neuron ReLU neural network. So that's a function that's represented by the neural network.
First of all, we take the second derivative, the Laplacian operator. And what I'm showing you now is a top-down look at the function, but it's superimposed by these black lines. Those are the activation thresholds of each of the seven neurons in that neural network.
That's all that we get after we take the second derivative are those black lines. And then if we take the Radon transform of that, we get a bunch of Dirac impulses. And those seven impulses are located exactly at the sort of biases and weights associated with each of the neurons in the neural network.
So we have this beautiful thing that takes a neural network and just maps it into a series of delta functions that we can easily inspect and figure out where the neurons are located. With that sort of visual setup of things, now what we're going to do is this.
We're going to use that whole idea to define a function space, or a space of functions. We're going to look at multivariate functions mapping from d dimensions to one dimension-- so one-dimensional outputs.
And then what we're going to do is we're going to say, let's look at that overall operator, which is the second derivative followed by the Radon transform. That's the colored things in that norm there. We're going to apply that to F and then look at a particular norm.
The norm is what's called the total variation norm in the sense of measures. If you sort of just think of this as an L1 norm again, but for technical reasons, you really have to work in the space of measures. I won't go into that, but you could just think of it as like an L1 norm that's effectively measuring how sparse the derivatives of f are in this Radon domain representation.
And again the reason for everything-- the derivatives wipe out the ReLU thing. The Radon transform naturally sparsifies everything for us. And so it's really a good way to measure the complexity of the function is to look at how sparse is the derivative of the function in the Radon domain that gives us a natural measure of complexity.
And furthermore, this norm that we have here is equivalent to this neural network path-norm that's been popularized and introduced in other works. So that path-norm in this case is just the sum over all the neurons in your network, the absolute value of the output weight of that neuron times the 2 norm of the input weights to that neuron.
The upshot of all this is then, if we use the norm that I just introduced in that function space, and then we solve one of these regularized optimization problems where we're optimizing over this infinite dimensional function space with that regularization given by that Radon BV norm, the solutions are neural networks.
The important thing to note here is that we have ReLU neurons in the solution, but we also have skip connections. And the skip connections came back to the question from Tommy earlier. These are coming up because, if I take the second derivative, of course, I wipe out any linear component in my overall function.
So those linear component of your function is annihilated by the second derivative operator, so those need to be there in the solution to this variational problem. So there is this explicit requirement to have skip connections.
I'm not going to go over some of the theory that's involved here, but bottom line is that this operator that we were looking at, which is derivative followed by Radon transform, you kind of inspect what are the Green's functions of that operator and then take it from there.
We can come back to that if people have questions. But this is the sort of solution that you get. You get a neural network that has ReLU neurons. It also has this skip connection.
And one of the important kind of takeaway messages here is that I already showed you that this regularization was equivalent to path-norm regularization, but for the same reasons as we saw before, this piecewise linearity of the rectified linear units in this rescaling business that we talked about-- we can show that this is exactly equivalent to just minimizing the sum of squares of all the weights in the neural network, which is exactly what you're doing with weight decay.
So if you train a neural network using gradient descent sgd and weight decay, you're effectively optimizing in this Radon BV space. I'll skip over the generalization about stuff. We can talk about that if people are interested. And then I'll wrap up by talking about the extension to the deep neural networks.
So to do this, we kind of tackle it in a couple of steps. So the first step is just to go from what we were talking about before, which was a scalar output to a multidimensional output. So here it's not very much going on. We're just going to say, well, each component in this multidimensional output-- say, they were d outputs of our neural network-- we're going to represent each of those by a function f sub j that lives in these Radon domain BV spaces that we just defined.
And then that naturally kind of gives us a product space and then also naturally gives us a norm over that product space, which is what we saw before this total variation of each component function applying this operator differentiation Radon transform.
And then another term, that L1 term that you see there, I think I have a little note of this. This is sort of an L1 norm of the linear components, or skip connections associated with the neural network. You need to have that there as well to make this as well-defined function space and norm.
So then everything precedes the same way. You have a similar representer theorem for solutions to this variational problem, but now it's just a multidimensional rectified linear neural network. And again there's connections to the path-norm regularization.
That's what I'm showing there at the bottom. And now how do you go to deep networks? So here again we're going to try to do the natural thing. We're going to say, well, let's define a compositional function space where we can take, say, l compositions of functions f1 through f capital L, where each component function in this composition is exactly one of the functions.
We were just looking at it a moment ago in these Radon BV spaces. And then we can sort of define a regularizer in this compositional space, which is just a product of the Radon BV norms of each component function in the composition.
And with not a lot of work, you can show that the solution to regularized data-fitting problems using that compositional kind of deep norm, if you will, is also a neural network-- in this case, a multilayered deep ReLU network.
Let's see. I have just a couple of minutes. Let me just say a couple of things.
So one thing is, how do we actually define these spaces? So in these compositional spaces for each component in this composition, we define its important output dimensions. So those d sub l's.
So what this actually suggests is a slightly unorthodox neural network structure that looks like this. It has ReLU layers which can be as wide as you want to make them, but then these specified dimensions that you choose at the beginning for each function in the composition, those show up as these linear summing unit layers in between.
So you sort have these bow-tie like structures which leads to natural kind of low rank kind of structure in the weight matrices. So you can relate this also to weight decay. I won't say much more about that. We can come back to it. But let me kind of wrap up with these takeaway messages and maybe just one set of experiments in the last two minutes.
So basically here is the final takeaway. The deep ReLU networks that people commonly use are optimal solutions to data-fitting problems in these deep bounded variation Radon domain compositional spaces.
So we sort have introduced these spaces and then showed that neural networks are indeed the solutions to them. So that kind of gives us a lot of new information about what the properties of trained neural networks are. These deep BV spaces contain functions that have spatially varying smoothness, which we saw was important even when we were looking in one dimension.
And deep trained neural networks therefore can adapt to local smoothness in different orientations and everything. So it provides a lot of new insight and support for many key properties that have been observed experimentally.
So what are those? Skip connections. We're saying skip connections actually just pop out as a natural requirement of the variational problem. The solutions are sparse. The ReLU widths don't need to be infinitely wide. They just need to be proportional to the number of training in data.
The solutions are overparameterized though still. The widths are always greater than whatever the prescribed functional widths that you have in your specification of the function space is, which means you have these low rank weight matrices-- something that people have actually used and engineered into neural networks to give better generalization training properties.
And then finally it turns out that these Radon BV norms actually naturally tend to balance the norms of each layer. So there may be some nice regularization that happens there that might avoid some of the other problems that sometimes you see, like vanishing gradients and all that sort of stuff.
And here is just a picture. I'll finish with this. So I'm looking at just a two-layer neural network. And here are the solutions of training with a stochastic gradient descent with and without regularization.
So what I'm visualizing is this overall connection matrix between the two ReLU layers. So that's what I'm highlighting with that shaded blue area in the upper right there.
And what I'm showing you is a picture of the weight matrices in the first case on the left using this BV type regularization, then standard weight decay both with this kind of bow-tie architecture, then standard weight decay without the bow-tie just letting the ReLU layers be fully connected, and then finally no weight decay at all and fully connected.
And what you can see is we definitely see sparse neural network solutions. So there's a sparse subnet lurking inside this bigger neural network. That's where all the action is happening when we use either our new slightly different weight decay, which is this Radon BV thing, or standard weight decay.
And then once, if we remove that special architecture just fully connected, we still see a relatively sparse and low ranked solution using just weight decay. But of course, if you take regularization out of the picture, there nothing's sparse, and nothing's low rank.
So I really think this is kind of tapping into some of the key kind of characteristics of convolutional neural network training. And I will stop there. And happy to take questions.