Improving Generalization by Self-Training & Self Distillation
June 10, 2020
June 9, 2020
Hossein Mobahi, Google Research
In supervised learning we often seek a model which minimizes (to epsilon optimality) a loss function over a training set, possibly subject to some (implicit or explicit) regularization. Suppose you train a model this way and read out the predictions it makes over the training inputs, which may slightly differ from the training targets due to the epsilon optimality. Now suppose you treat these predictions as new target values, and retrain another model from scratch using those predictions instead of the original target values. Surprisingly, the second model can often outperform the original model in terms of accuracy on the test set. Actually, we may repeat this loop a few times, and each time see an increase in the generalization performance. This might sound strange as such a supervised self-training process (aka self-distillation) does not receive any new information about the task and solely evolves by retraining itself. In this talk, I argue that such a self-training process induces additional regularization, which gets amplified in each round of retraining. In fact, I will rigorously characterize such regularization effects when learning the function in a Hilbert space. The latter setting can relate to neural networks with infinite width. I will conclude by discussing some open problems in the area of self-training and self-distillation.
PRESENTER: Hossein is a research scientist at Google Research. His research interest lies at the intersection of generalization and optimization with emphasis on deep learning. Prior to joining Google in 2016, he was a postdoctoral researcher in the Computer Science and Artificial Intelligence Lab at MIT. When he was at MIT, he was also-- he collaborated with Professor Poggio and published a paper in NIPS 2015.
He obtained his PhD in computer science from the University of Illinois at Urbana-Champaign. And he's an active member of the machine learning community, serving as an editor of ICML and [INAUDIBLE] and ICLR.
HOSSEIN MOBAHI: Thanks, everyone, for attending my talk. Today, I'll be talking about self-training and self-distillation and why they can improve generalization. This is joint work with my collaborators, Mehrdad Farajtabar from DeepMind and Peter Bartlett from Google Research and also UC Berkeley.
So I was asked by the organizers to keep this talk high level. As a result of that, I'll skip some of the details during the talk, but I wanted to let you know that all the details are available in this preprint, which I share on the screen. So you can find it on arXiv, and all the details are there.
All right, so let me first start with self-training. It's actually quite an old technique used in unsupervised and semi-supervised learning. So let me first give you an example in the semi-supervised case. Suppose you have some labeled data and some unlabeled data. What people do is first train a model on the labeled data, and then use that model to make predictions on the unlabeled data.
So, at this point, your unlabeled data have some labels attached to them. These are not real labels. They are generated by the model, predicted by the model. That's why people sometimes call them pseudo labels. But now you can train another classifier on the full set, including these pseudo labels and the original labeled set, and get a second classifier. And then this second classifier is now doing a good job.
You can also use this in a fully unsupervised setting. The idea is, since all the data is unlabeled in this setting, you need to have access to some other data set that is somewhat similar, or has some commonality with your data, and is labeled. You train a classifier on that labeled data. Then you take that classifier, apply it to your unlabeled data, and, again, generate pseudo labels for them and use those pseudo labels to train a model in a supervised fashion.
And there are slight tweaks and variations of this whole concept. For example, in the unsupervised case, sometimes people do not use all of the unlabeled data that now have pseudo labels. They only take the portion on which the model is more confident and introduce that into the next round of training.
And, once they get their second model, then, again, they generate pseudo labels on the remaining unlabeled data and, again, only take the more confident subset. But the overall idea is this sort of self-training, where you have a model that learns from its own predictions.
Now there is a more restricted version of this, a more special case of this, which people in the deep learning community call self-distillation. It's used in a completely supervised setup. So the idea is as follows. I'm illustrating it in the figure at the top.
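The confidence-filtered variant described above can be sketched in a few lines. This is only an illustration under stated assumptions: the nearest-centroid classifier, the softmax-of-negative-distance confidence score, and the names `fit_centroids`, `predict_with_confidence`, `self_train`, and `keep_fraction` are all hypothetical choices made for the example and are not taken from the talk.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Toy "model": one mean vector per class (assumes every class is present in y).
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_with_confidence(centroids, X):
    # Distance of every point to every centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Crude confidence: softmax over negative distances (an assumption for this sketch).
    p = np.exp(-d)
    p /= p.sum(axis=1, keepdims=True)
    return p.argmax(axis=1), p.max(axis=1)

def self_train(X_lab, y_lab, X_unlab, n_classes, rounds=3, keep_fraction=0.5):
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    for _ in range(rounds):
        if len(remaining) == 0:
            break
        model = fit_centroids(X_train, y_train, n_classes)
        pseudo, conf = predict_with_confidence(model, remaining)
        # Keep only the most confident pseudo-labeled points for the next round.
        n_keep = max(1, int(keep_fraction * len(remaining)))
        top = np.argsort(-conf)[:n_keep]
        X_train = np.vstack([X_train, remaining[top]])
        y_train = np.concatenate([y_train, pseudo[top]])
        remaining = np.delete(remaining, top, axis=0)
    # Final model trained on true labels plus all accepted pseudo labels.
    return fit_centroids(X_train, y_train, n_classes)

# Toy data: two well-separated blobs, only three labeled points per class.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
X_lab = np.vstack([X0[:3], X1[:3]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unlab = np.vstack([X0[3:], X1[3:]])
centroids = self_train(X_lab, y_lab, X_unlab, n_classes=2)
```

Any classifier that exposes a confidence score could take the place of the centroid model; the loop structure is the part that corresponds to the description.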
So we have some input-output pairs, x0, y0. We train a model. We get a classifier. Or, if it's regression, we get some continuous predictions, f0.
And what we do now is pretend these predictions are new target values. So we replace the labels with those while keeping the inputs the same. And then this creates a second data set, right? We train the model from scratch with this new data and get a second model. And we can repeat this process because this second model gives its own predictions. Again, you can treat these predictions as the next labels.
What people in the deep learning community observed, and you see one reference here-- "Born Again Neural Networks" was one of the main references showing this-- was that the test performance of these models can actually improve over these iterations. And, in my opinion, at first glance, it was very weird and surprising because there is no external information coming from anywhere.
It's the same network architecture. It's the same training procedure that I am using in this block. And it's the same data set. At least, the input part is the same. And, somehow, this internal loop is able to generate models that perform better. And, again, at first glance, it was surprising to me why this happens.
But, once we studied this problem more closely, we realized that, actually, this is a more profound phenomenon. It's not unique to deep learning. Although, we should be thankful to deep learning community because, to the best of my knowledge, they first observed this. At least, the name self-distillation is exactly attached to deep learning community when they observed this. But now we know that this also happens in more basic regression scenarios, like even simpler setups. You don't need the fancy deep learning for observing this phenomenon.
And, actually, I really use this regression, simple regression setup, today to present my analysis, which explains what's happening and why self-distillation can improve generalization. And the reason is obvious because now we have a problem that is mathematically easier to analyze. And we can get concrete answers to some of the questions about this phenomenon.
So, in this slide, you see the regression setup. We are given a training set of input-output pairs, which I denote x_k, y_k, and you have capital K of these points. I'm using a simple L2 loss, and I'm just demanding the total loss to be smaller than some tolerance epsilon.
For this setup, of course, there are many functions in the function space that either pass exactly through these points or pass close enough to them to maintain this epsilon accuracy. So, in order to single out the solutions that we care about, we have to use regularization, which, basically, is the bias we introduce into this regression about which solutions we prefer.
And here you see a specific form of the regularization that I'm using. I think, because of the broad audience, if you haven't seen this sort of regularization, it may look a little bit complex, but you really don't need to worry about it because I have some slides soon that will at least provide some intuition of what's going on with this form of regularization.
For now, just think of it as a regularizer. It takes a function, outputs a number. And this number, basically, assigns some sort of complexity to this function or some sort of smoothness of f.
So we can use a Lagrange multiplier, very straightforward, to convert this into an unconstrained optimization. So this will be what I will focus on. It's easier to present. We have this coefficient c, which [? really is ?] the trade-off between fitting accuracy and regularization.
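For readers without the slide, the constrained problem and its unconstrained counterpart can be written roughly as follows. This is a reconstruction, not a copy of the slide: the symbol $x^\dagger$ for the second argument of the kernel $u$, the $1/K$ normalization of the loss, and the exact placement of $c$ are assumptions.

```latex
% Constrained form: the smoothest f (as measured by the kernel u)
% among all functions that fit the data to tolerance epsilon.
f^* = \arg\min_{f} \int\!\!\int u(x, x^\dagger)\, f(x)\, f(x^\dagger)\, dx\, dx^\dagger
\quad \text{s.t.} \quad \frac{1}{K} \sum_{k=1}^{K} \bigl(f(x_k) - y_k\bigr)^2 \le \epsilon

% Unconstrained (Lagrangian) form with trade-off coefficient c > 0.
f^* = \arg\min_{f} \; \frac{1}{K} \sum_{k=1}^{K} \bigl(f(x_k) - y_k\bigr)^2
      + c \int\!\!\int u(x, x^\dagger)\, f(x)\, f(x^\dagger)\, dx\, dx^\dagger
```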
And, actually, this form of regression problem is very well studied in machine learning literature. In fact, Tommy has been a pioneer in this area from the machine learning perspective. He and Federico Girosi in the '90s published a series of interesting papers on problems of this sort.
And I think those are great references if anybody is interested to learn more about some of the detailed aspects of how this regularization framework works. These are great papers here. I just have one of them with more than 4,000 citations as an example.
OK, so I promised that I'd provide some intuition about what this regularization is doing. And I think eigendecomposition will be a very, very appropriate tool for achieving that goal. So here the regularization was characterized by a kernel u in the earlier slide. You can think of this integral, u times a function, as a linear operator that takes a function and returns a function, right? Because the dummy variable, x dagger, gets integrated out, and you get another function in x.
So, in much the same way that we think about matrices and vectors and matrices having eigenvectors, we can have operators similar to matrices and functions similar to vectors. And then, for matrices or, in this case, operators, we can have eigenfunctions and eigenvalues. And they basically satisfy a very similar property.
So, if phi_i is an eigenfunction of this operator, it means that, when I apply this linear operator to it, I get the same function phi_i times a constant, which is the eigenvalue of this eigenfunction. Now, without loss of generality, I can represent my f using this basis because this gives me a complete orthonormal basis.
So now my goal is just to identify the coefficients a_i, since the basis is given by the regularizer, to find my solution. So now this helps me to reduce the variational problem that you saw earlier to something that-- OK, I just have a quick question because there is this portion that's blocking part of the formula. Is it just on my screen, or you also have it? Maybe I can--
AUDIENCE: We see the full thing.
HOSSEIN MOBAHI: Oh, you see the full, OK, great, great. So, yeah, what happens is that the regularization now greatly simplifies. You can see that the regularization part is now a sum of squares of these coefficients, although each coefficient is penalized by its eigenvalue, right?
So, essentially, it says, OK, if my regularization is like introducing some eigenvectors so that you can-- or eigenfunctions so that you can build your solution by a weighted sum of those, I also assign a cost for you to pick each of these eigenfunctions. And the cost is determined by this operator.
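The eigenfunction argument above can be summarized in three lines. The symbols $\phi_i$, $\lambda_i$, and $a_i$ follow the spoken description; writing the kernel's second argument as $x^\dagger$ is an assumption about the slide's notation.

```latex
% Eigenfunctions of the regularization operator
\int u(x, x^\dagger)\, \phi_i(x^\dagger)\, dx^\dagger = \lambda_i\, \phi_i(x)

% Expansion of f in the (complete, orthonormal) eigenbasis
f(x) = \sum_i a_i\, \phi_i(x)

% The regularizer becomes an eigenvalue-weighted sum of squared coefficients
\int\!\!\int u(x, x^\dagger)\, f(x)\, f(x^\dagger)\, dx\, dx^\dagger
  = \sum_{i,j} a_i a_j \lambda_j \int \phi_i(x)\, \phi_j(x)\, dx
  = \sum_i \lambda_i\, a_i^2
```

The last equality uses orthonormality of the $\phi_i$, which is why each basis function carries a cost proportional to its own eigenvalue.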
So here is a very simple illustrative example. Suppose I'm considering functions that map the interval of 0 to 1 to real numbers. And suppose my regularization is just penalizing the second-order derivative of my function over the entire domain of 0 to 1. So I square it and integrate it.
Of course, because I introduced this differential operator, I also need to talk about what happens at the boundaries, but really those are the details. You don't need to worry about those. OK, so I just need to get my parrot; he is getting upset. So he will be listening throughout the talk as well.
So now, for this operator, for this second-order derivative, I have plotted the eigenfunctions. So the blue one is the first eigenfunction. Then you have orange, green, and red. And then, on the right, you see the eigenvalues.
So you can see that the red one is actually the largest, and the blue one is-- I think, if you read the scales, the blue one may be less than 10, while the largest one, the red one, is about 120. So it means that, if I want to use the red one to contribute to my solution, I have to pay a cost of about 120, more than 10 times what I pay for using the first one. So that's how it biases us to use some bases versus others.
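A rough numerical version of this toy example is sketched below. It discretizes the second-derivative penalty on a grid and computes the eigenvalues of the resulting quadratic form. The grid size and the boundary handling (the function and its second derivative both vanish at 0 and 1) are assumptions, so the numbers need not match the scale shown on the slide; the point is only that the cost grows quickly from one eigenfunction to the next.

```python
import numpy as np

n = 200                      # number of interior grid points on (0, 1); arbitrary choice
h = 1.0 / (n + 1)

# Second-difference matrix approximating f'' with f(0) = f(1) = 0 (assumed boundary).
D2 = (np.diag(-2.0 * np.ones(n))
      + np.diag(np.ones(n - 1), 1)
      + np.diag(np.ones(n - 1), -1)) / h**2

# The penalty "integral of (f'')^2" is approximately h * f^T (D2^T D2) f,
# so the eigenvalues of R play the role of the costs lambda_i.
R = D2.T @ D2
eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues in ascending order

first_four = eigvals[:4]
cost_ratio = first_four[3] / first_four[0]
print("first four eigenvalues:", first_four)
print("cost of 4th basis relative to 1st:", cost_ratio)
```

With this particular boundary choice the eigenvectors are sampled sine waves, and smoother (lower-frequency) ones are cheaper, which is the bias the talk describes.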
All right, now this is a key intuition. We will use it across the talk. And one more thing I should say before we move on is that this kind of regularization problem that we are studying has a closed-form solution.
Again, the details of how we get this closed form are not crucial for this talk. I can just present the final form, but, if you're interested, you can refer to our paper or, if you want more details, as I said, Tommy has a series of papers on this, which you can refer to.
But, if I just want to give you the gist of it, associated with this kernel of the regularization operator, we can identify another kernel, which is called the Green's function. And it satisfies this identity that you see. Applying our operator to it should produce the delta function.
So, once you have the Green's function, then you need to form two quantities, one matrix, capital G, and one vector, small g. So the capital G is obtained by taking all pairs of training points and evaluating the Green's function on them. That's why you get a matrix.
The other one also uses the Green's function, but only one argument is a training point. The other argument is free. So you get a vector, but each component is now a function of x.
And now, if you put these together with your labels y-- I also arrange all the labels as a column vector-- then I can express my solution in this form. So this is a very well known result. I don't need to emphasize it too much. So we just take it for granted, but we will use this form for our analysis.
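The closed form described here, a prediction of the form g(x) transposed times the inverse of (c I + G) times y, can be sketched as follows. The Gaussian kernel standing in for the Green's function, its width, the noise level, and the function names `green` and `fit_predict` are assumptions made for illustration; the actual Green's function depends on the regularizer.

```python
import numpy as np

def green(x, z, width=0.1):
    # Stand-in for the Green's function g(x, z). A Gaussian kernel is an
    # assumption for this sketch, not the one used on the slides.
    return np.exp(-((x[:, None] - z[None, :]) ** 2) / (2.0 * width**2))

def fit_predict(x_train, y, x_query, c):
    G = green(x_train, x_train)     # K x K matrix over all pairs of training points
    g = green(x_train, x_query)     # K x M: column j is the vector g evaluated at query j
    alpha = np.linalg.solve(c * np.eye(len(x_train)) + G, y)
    return g.T @ alpha              # f(x) = g_x^T (c I + G)^{-1} y

# Toy data: 11 noisy samples of a sinusoid on [0, 1].
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 11)
y = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(11)

x_query = np.linspace(0.0, 1.0, 101)
f_query = fit_predict(x_train, y, x_query, c=1e-3)

# A larger c trades fitting accuracy for regularity, so the training residual grows.
resid_small_c = np.linalg.norm(fit_predict(x_train, y, x_train, c=1e-3) - y)
resid_large_c = np.linalg.norm(fit_predict(x_train, y, x_train, c=1.0) - y)
```

Evaluating the same function at the training inputs gives the vector of predictions that a later round of self-distillation would use as its targets.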
OK, now, again, before we move on, up to this point, I want to make some connections that are interesting and important. One of them is that you can clearly see there is a close connection here between this kind of regularization problem and a kernel regression problem.
Essentially, g is a kernel function. So you can think of this g as kernel in the sense of kernels that we use in SVM. So it's the same thing. So this is essentially like this regularization problem can be written equivalently as a kernel regression problem. So that's one point.
And, now that we see this as a kernel regression problem, it provides another interesting connection, and that's to wide neural networks, which operate in the NTK, Neural Tangent Kernel, regime because, for these neural networks, it's been shown that the problem boils down to just a kernel regression.
So all the biases, the fancy biases you have in a deep architecture, like convolution, pooling, hierarchical representation, all of that is, at the end of the day, encoded into a single function, and that's the kernel function for the case of wide neural networks. So we will get back to these intuitions.
OK, now that we have the regression problem set up, let's talk about self-distillation. So, again, some notation here, as I said, the vector of y is just stacking all the training scalar points y. I have K training points. So I have y1 to yK.
And suppose also, if I have a model trained on this data, I get predictions. I can again form a vector of predictions over the training points from x1 to xK. So this is the bold f.
All right, so now, as I showed earlier, the solution of this regression has this form. So, for the first round, f0, the solution has this form, as you see, where y0 is the initial ground truth label. Now what we do in self-distillation is pretend that this f0 is going to be a label.
So we set our next round of-- the labels for the next round, which I denote by y1, equal to f0. And then I write the formula for f1, which is exactly the same because we are not changing anything other than changing the label from y0 to y1. And then I replace the definition of y1.
So you can actually keep repeating this and see that, at the end, after t steps, your solution actually evolves according to this formula at the bottom. The important part is that product. So you get the product of t plus 1 terms. Everything is the same, except the regularization coefficient c_i, which varies in each round of self-distillation.
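Written out, the substitution described above unrolls as follows. This is a reconstruction from the spoken description rather than a copy of the slide; in particular, the $G^{-1}$ factor in the last line assumes that $G$ is invertible, and $\mathbf{g}_x$ denotes the vector of Green's function values between $x$ and the training points.

```latex
% Round 0: fit the original labels y_0
f_0(x) = \mathbf{g}_x^\top (c_0 I + G)^{-1} \mathbf{y}_0

% Next labels: the round-0 predictions on the training inputs
\mathbf{y}_1 = G\, (c_0 I + G)^{-1} \mathbf{y}_0

% Round 1: same formula with y_1 in place of y_0
f_1(x) = \mathbf{g}_x^\top (c_1 I + G)^{-1}\, G\, (c_0 I + G)^{-1} \mathbf{y}_0

% After t rounds: a product of t + 1 factors that differ only in c_i
f_t(x) = \mathbf{g}_x^\top G^{-1} \Bigl[ \prod_{i=0}^{t} G\, (c_i I + G)^{-1} \Bigr] \mathbf{y}_0
```

All factors are functions of the same matrix $G$, so they commute and the order of the product does not matter.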
So, at first glance, this may look like a very simple form. It's like a power iteration, but, actually, it's not a conventional power iteration. Why? Because, in power iteration, this thing, this linear operator-- so I'm basically grouping everything after this pi as an operator. So the matrix or the operator A is G times this parenthesized term, inverted. Take that as the linear operator.
So we are applying this linear operator over and over to y0, t times, but this is not really a standard power iteration because this linear operator is changing over time: it depends on this iteration-dependent coefficient c_i. And this actually greatly complicates the analysis because c_i itself has a very complicated dependency on the norm of the solution in the previous rounds and creates a messy recurrence relation, which doesn't have a closed form.
But, in the main paper, we basically provide bounds here and there to get control over this thing. But, for the sake of this talk, you can just pretend that this coefficient c_t is constant over time because it still enables us to make our point. It's not exactly correct, but it's enough for making the point. And, as I said, the full exposition is in the paper.
OK, so, if this coefficient c_i is constant-- it doesn't change over time-- now things greatly simplify because now I can say, OK, this product thing inside this pi is just essentially raising the same thing to the power t plus 1, right? And that's what I'm doing. And, on top of that, I can just use the eigendecomposition of the matrix G because everything you see is either G or the identity matrix or some matrix inversion, and none of these change the eigenvectors. So I can push the eigenvectors out and write the thing, the meat that is in the middle, just in terms of diagonals.
OK, now let's look at this diagonal. Things are very clear now. You can see that, as t gets larger and larger, what happens is that the smaller components of this diagonal shrink faster and faster, to the point that, after some iterations, you can completely consider the initially small ones as negligible because they shrunk so much relative to the magnitude of the larger values that you can actually consider them as nonexistent.
So, as actually t goes to infinity, all these components will die, become like super negligible relative to the largest one. And so you can think of it as only having like one non-zero component on the diagonals, everything else as 0. And here I'm not talking about absolute numbers, absolute value numbers. I'm talking about the relative scale of these things. So, compared to the largest one, the smaller ones are becoming smaller and smaller.
So that's when t goes to infinity. You end up with one significant component, but, of course, during this path, as you increase t, you get gradual elimination or weakening of these small values. So it kind of like progressively sparsifies this diagonal matrix, OK?
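The constant-coefficient simplification can be checked numerically. The sketch below compares literally refitting on the previous round's predictions with the eigendecomposition form, and then tracks how many diagonal entries stay non-negligible relative to the largest one. The Gaussian stand-in kernel, the value of `c`, the noise level, and the 0.01 threshold are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 11
x = np.linspace(0.0, 1.0, K)
y0 = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(K)

# Stand-in Gram matrix G (Gaussian kernel is an assumption, not the talk's Green's function).
G = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1**2))
c = 0.1   # regularization coefficient, held constant across rounds (the talk's simplification)
t = 5     # number of self-distillation rounds after the initial fit

# (a) Literal loop: each round refits on the previous round's training predictions.
A = G @ np.linalg.inv(c * np.eye(K) + G)
y = y0.copy()
for _ in range(t + 1):
    y = A @ y
f_loop = y

# (b) Same result via the eigendecomposition G = Q diag(d) Q^T.
d, Q = np.linalg.eigh(G)
B = d / (c + d)                               # diagonal of D (c I + D)^{-1}
f_eig = Q @ (B ** (t + 1) * (Q.T @ y0))

def relative_diagonal(rounds):
    # Diagonal entries after `rounds` rounds, scaled so the largest equals 1.
    return (B / B.max()) ** (rounds + 1)

def effective_count(rounds, threshold=0.01):
    # Number of entries that are still non-negligible relative to the largest one.
    return int((relative_diagonal(rounds) > threshold).sum())

count_first = effective_count(0)
count_late = effective_count(20)
print("loop and eigen forms agree:", np.allclose(f_loop, f_eig))
print("non-negligible entries at round 0 and round 20:", count_first, count_late)
```

The shrinking count is the progressive sparsification of the diagonal that the talk describes, measured on a relative scale.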
So why is this important? Because this is exactly how it's imposing the capacity control that can improve generalization. Here I just do a simple renaming to, I think, make it slightly more clear. Let's take these rotated labels, y0 times the eigenvector matrix, which is a rotation matrix, and just call them z0. These are the labels.
And the part of the solution on the left-hand side of z0 doesn't depend on the labels at all, except possibly through the scalar c, but none of the vectors or matrices depend on the labels. So this is saying that all the information of the labels, at least in vector form, goes into this z0. And now, when this middle term is very sparse, it means that I'm only allowed to use a small number of these rotated labels to construct my solution.
So, initially, I didn't have this restriction, right? As it gets sparser and sparser, I'm losing my degrees of freedom to make use of more points in my representation. So this is exactly how the capacity of this model is being controlled. And the sparsity level of this matrix is basically determining the effective number of basis functions that you're using to represent your solution.
OK, so, actually, you can use this sparsity pattern to even come up with generalization guarantees. Because I want to keep the talk high level, I will not get into the details of this part. But the high-level idea is that you can bound the so-called Rademacher complexity, which is a complexity measure of function classes, for models of this type depending on their sparsity level.
And then, once you have a bound on the Rademacher complexity, you can use standard generalization bounds. There are standard theorems that, based on a bound on the Rademacher complexity, give you a generalization bound.
All right, so now let's revisit our toy example that I showed earlier. Although, this time, we want to study it within the self-distillation loop and see what happens there. So the first part of this slide is the same as before. We are penalizing the second-order derivative, and we'll have the same boundary condition.
But here I also provided the analytical form of the Green's function, although you really don't need to have it. Like it's just for completeness of the presentation. You don't need to know how to derive the Green's function from the regularization operator for the kind of results we present in this talk, but I have it just for completeness.
OK, so, for this example, the first figure a is showing just the shape of the Green's function. And the middle figure is showing our setup, regression setup. So the orange curve in the middle is the sinusoid that we are trying to fit. It's like the underlying ground truth function that we don't see.
Instead, we see some noisy samples from this function, and those noisy samples are shown by these small blue dots. I hope you can see them. I'm still talking about the middle figure. I have 11 of those points if I'm not mistaken, and the goal is to use these points to recover a function that's as close as possible to the sinusoid.
All right, so, if you just go ahead with the first round of training, you get this blue curve in the middle, which is clearly overfitting to the data, right? You don't get anything close to the sinusoid.
But, on the right, I'm showing what happens if you do additional rounds of training by just taking the predictions of this original training and doing another round of training and repeating this loop. So you can see that the functions-- you go from blue to orange to green and, eventually, red-- are becoming smoother and smoother.
Actually, I think the orange one, which is just one round of self-distillation, is already very close to a sinusoid. So you can stop there. But one thing you can see is that the further rounds are doing additional smoothing plus shrinking the size of the function. So you see that the function is also becoming closer to zero. So this is basically confirming what we were discussing within the toy example.
All right, now let's look at these diagonal components also, how they evolve. So, because we have 11 training points in this example, the matrix that we have, it's K by K if you remember the notation. So it's 11 by 11. It's diagonal. So we have only 11 components.
Initially, these components are distributed as shown on the left. So you'll have some larger ones, some smaller ones. But, after one round of self-distillation, you see that the smaller ones quickly shrink, and only the larger ones are significant relative to others. And, if you keep doing this, of course, this process exaggerates. And, at the end, I think you can fairly say it's only two or three bases that remain for you to represent your function.
So one thing that perhaps you saw in the figure, but I didn't explain much was that, as you increase the self-distillation rounds, because, as we showed, it amplifies this regularization effect, it shrinks the function also. Like the value of the function is shrinking toward zero.
And, at some point, the process will collapse. You will just get the zero function. And, from that point on, there is no more self-distillation. It's just a fixed point. You just produce the zero function from that point on.
And the reason that this happens is very obvious because the labels-- because the predictions are shrinking, at some point, you can easily satisfy the error tolerance being smaller than epsilon because the labels are very small. So you can choose very small functions.
And, at this point, the zero function may even be within your error tolerance. So the zero function is a solution, and it also minimizes your regularization. So, at that point, the solution that minimizes the regularization and is valid for the constraint is the zero function. So you get the zero function, and that's the collapse, and then there is nothing interesting going on after.
But, actually, we can bound the number of rounds for which you get meaningful self-distillation before the solution collapses. So, again, I will not go into the derivation. I'll just present the end result.
So K is the number of examples. Epsilon is your training error tolerance, that is, in each training round, the point at which I stop training once the training error reaches epsilon. And kappa is the condition number of the Gram matrix that you get from that kernel, the big matrix G.
So that is one, I think, interesting result that we can show. First of all, it collapses. And, second, we can bound the number of meaningful iterations. Another thing is the advantage of having a small epsilon, which allows these models to move near the interpolation regime. So that means the models are attaining an error very close to zero, but not exactly zero.
So we can actually show that, by reducing epsilon, the error tolerance, you can increase the sparsity level of the solution that you ultimately get after repeating this for whatever number of iterations we have here that guarantees no collapse.
At the end, before the collapse happens, you get some solution. And the sparsity level of that solution is affected by the error tolerance that you choose for these intermediate self-distillation problems. And smaller is better. So it suggests that, in order to get the highest sparsity, it's best to choose a smaller epsilon.
Of course, this is in theory. In practice, you cannot make epsilon too small because there are numerical issues, right? But, in theory, the smaller gives you the sparser [INAUDIBLE].
Another thing that I want to discuss is comparing this with early stopping because, OK, everybody knows that early stopping is providing some kind of regularization. And here we are saying self-distillation is also providing some kind of regularization. Is there a connection between the two? Are they similar?
So, although people use the name early stopping in the field a lot, I don't think there is a concrete and crisp definition for it. So here maybe I first give a general definition for early stopping. I call it any procedure that cuts convergence short of the optimal solution. And then there are different instances of this.
For example, if you're using a numerical optimizer to minimize your loss, such as SGD, you could, for example, limit the number of iterations, say, OK, after this many iterations, done. Or you can do early stopping by increasing your training error loss tolerance. So, instead of like getting to near zero training error, you can stop at some epsilon that's slightly bigger. And that's also another way of doing early stopping.
But the first definition is not applicable for our analysis because here there is no numerical optimization. We're looking at the closed form. And our analysis is independent of how you parameterize the function. So there's no numerical optimizer. So we can only look at the second definition.
So, under the second definition, let's see what happens. In the yellow box, what I'm doing is again listing the solution after the first round of self-distillation. We are not doing self-distillation for early stopping. So it's just a first round. So you stop after getting this solution.
And, if you play with the error tolerance, epsilon, it's going to affect the Lagrange multiplier, the c0, but nothing else in this form of solution. So, if we know what happens for like range of c0 from very small to very large, then early stopping will be somewhere in there. So it cannot be outside of this range, right?
And we can actually see that, in both cases of very large and very small c0, this is never able to sparsify the matrix that we have in between. For example, if c0 is very large, then you get this approximation in the first bullet point, which gives the matrix D, which, you know, is the initial sparsity pattern we have in D. And, if c0 is very small, then, actually, this whole thing becomes close to identity, which is a full rank matrix. So, actually, you go against sparsity.
So this is saying that, with the early stopping, at best, you can maintain the original sparsity pattern. You cannot make it more sparse, but you can make it more dense, depending on how you choose this epsilon. So it's not-- it's not actually doing anything similar to self-distillation.
And one thing also I think I can say about this is, when you're-- early stopping means that you choose a bigger epsilon. So that means that you choose a smaller c0. And, for a smaller c0, you get close to identity here. So that's actually showing that it's doing the opposite of self-distillation. So, if you do early stopping, you will end up with something that's even like less sparse than your initial solution. So you're moving toward the identity matrix.
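The two extremes discussed above can be checked on a toy diagonal. The sketch assumes the first-round diagonal has entries d_j / (c0 + d_j), with d_j the eigenvalues of G; the specific eigenvalues and the constant c used for the self-distillation comparison are hypothetical values chosen only for illustration.

```python
import numpy as np

d = np.array([4.0, 2.0, 1.0, 0.1, 0.01])   # hypothetical eigenvalues of G

def first_round_diagonal(c0):
    # Diagonal of D (c0 I + D)^{-1}: the only place c0 enters the first-round solution.
    return d / (c0 + d)

# Very large c0: the diagonal is proportional to D, so its relative pattern is that of D.
large = first_round_diagonal(1e4)
large_rel = large / large.max()

# Very small c0: the diagonal approaches the identity, i.e. it becomes denser.
small = first_round_diagonal(1e-6)
small_rel = small / small.max()

# For contrast: repeated rounds with a fixed c (self-distillation) shrink the small entries.
c = 1.0
distilled_rel = ((d / (c + d)) / (d / (c + d)).max()) ** 10

print("large c0, relative diagonal:", np.round(large_rel, 4))
print("small c0, relative diagonal:", np.round(small_rel, 4))
print("10 rounds at fixed c:       ", distilled_rel)
```

Changing c0 alone moves the relative pattern between that of D and the identity, whereas repeated rounds push the small entries toward zero relative to the largest one.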
All right, so now let's move on to some experiments on deep learning. I should say that our theory so far was only about this specific regression setup. OK, so I'm not claiming that the results that we have here clearly carry over to deep learning, but there is some hope there that could at least make us feel, OK, maybe this is providing some approximation to what's happening in deep learning.
And that connection was through, again, wide neural networks and neural tangent kernels because, in the NTK regime, we know that the problem of deep learning is equivalent to a kernel regression. And kernel regression is what we studied. So, at least in that regime, these are related very closely. Of course, as you move away from wide networks, then this becomes a noisier and noisier approximation.
The second thing I want to say is the beauty of self-distillation. So, throughout the talk, we had to think about this regularization, underlying regularization, and its Green's function and all that. But, in fact, you don't need to know any of those in order to like use this kind of regularization, to get the effect of this regularization. All of that, you can think of it as a black box.
So, as I said, all these biases of a deep neural network are now encoded into that kernel, obviously, in NTK regime. And we don't need to even know the kernel because what we show is that, whatever that kernel or Green's function is-- and we don't know-- we just provide input, read the predictions, and then feed in these predictions again as new target values to the system.
And it sparsifies that underlying like regularizer, which we don't know, and we don't see, and we don't need to know, OK? So I think that's very beautiful because we can claim that we are sparsifying the representation that is induced by that regularizer without even needing to know that regularizer.
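The black-box view described here only needs a training routine and the ability to read predictions on the training inputs. A minimal sketch is given below; the function name `self_distill`, the `train` callback signature, and the toy shrunken-mean learner are hypothetical and stand in for whatever model and training procedure is actually used.

```python
from typing import Callable, List, Sequence

def self_distill(train: Callable[[Sequence[float], Sequence[float]], Callable[[float], float]],
                 inputs: Sequence[float],
                 targets: Sequence[float],
                 rounds: int) -> List[Callable[[float], float]]:
    """Run the initial fit plus `rounds` rounds of self-distillation.

    `train` is treated as a black box: it takes inputs and targets, trains a
    fresh model from scratch, and returns a prediction function.
    """
    models = []
    current_targets = list(targets)
    for _ in range(rounds + 1):
        model = train(inputs, current_targets)           # retrain from scratch
        models.append(model)
        current_targets = [model(x) for x in inputs]     # predictions become the next targets
    return models

def train_shrunken_mean(inputs, targets, lam=1.0):
    # Toy regularized learner (an assumption for this demo): predicts a constant,
    # the ridge-shrunken mean of the targets.
    m = sum(targets) / (len(targets) + lam)
    return lambda x: m

models = self_distill(train_shrunken_mean, [0.0, 1.0, 2.0], [1.0, 2.0, 3.0], rounds=2)
predictions_at_zero = [m(0.0) for m in models]
print(predictions_at_zero)
```

In this toy learner each round shrinks the prediction further, a crude stand-in for the amplified regularization; with a real network, `train` would wrap the full training run.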
So is this clear? Because I think that's a very interesting point. I want to make sure. Or are there any questions so far in general? Because we have time. If not, I can move to some experimental results on deep learning. We have experimented with both VGG and ResNet architectures. I only show the slide for ResNet, but, for VGG, it's similar.
And we have used this for CIFAR-10 and CIFAR-100. ImageNet was not really an option for us because we wanted to do multiple rounds of self-distillation. It's not just one-time training. For example, here it's 12 times retraining the model from scratch. On top of that, we wanted to get the variances, so we repeated each experiment 10 times. So we couldn't really go beyond CIFAR-100.
So what are these plots saying? The left two plots are the train and test for CIFAR-10. The right two are for CIFAR-100. Maybe I'll start from the second plot from the left. That is CIFAR-10, but the training accuracy.
So this is the training accuracy with respect to the original labels, y0. We see that the training accuracy is going consistently down. And this is very consistent with a regularization viewpoint because, as we amplify the regularization effect, we know that it's limiting the ways that you can fit your training data. So we expect the training accuracy to get worse.
Now let's look at the leftmost plot, which has the test accuracy. We see that, up to I think three rounds of self-distillation, in each round, we are able to benefit from this regularization. Each time, it amplifies it a little bit.
But, after about three rounds, it becomes too much. We start to over-regularize because this process just keeps amplifying the regularization forever, right? So, at some point, you start to over-regularize, and then you see a decline of performance in the test accuracy.
And then, on the two right plots, it's a very similar trend. It's the CIFAR-100 data set, but the trends are quite similar. You can see that the training accuracy, which is on the rightmost plot, is declining, which is not surprising and well aligned with the theory. And, also, the test accuracy is increasing. I think you get the peak after maybe 10 iterations. But, after 10 iterations, it starts to decline.
Now one thing I need to say here is that people have empirically observed that self-distillation improves test performance by running it for one or two rounds, but here we ran it over a longer window, so we can actually see this decline. Previous results just show that you improve for the first one or two rounds.
You don't know what happens after that. Does it saturate, so you get a flat curve after that? Or does it start to decline? Here we show that it declines, and we now know why it declines: it's over-regularization.
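The "keeps amplifying forever" behavior can be illustrated in a simplified spectral view. As a sketch, assume kernel ridge regression with a fixed penalty `lam`, so that each round multiplies the component along the j-th kernel eigendirection by d_j / (d_j + lam). The talk's setup instead has a coefficient that changes per round, and the eigenvalues, `lam`, and the 0.05 threshold below are arbitrary illustrative choices.

```python
import numpy as np

def filter_after_rounds(d, lam, t):
    """Per-eigendirection weight (d_j / (d_j + lam))**t after t rounds."""
    d = np.asarray(d, dtype=float)
    return (d / (d + lam)) ** t

d = np.array([10.0, 1.0, 0.1, 0.01])  # hypothetical kernel eigenvalues
lam = 0.1                              # assumed fixed ridge penalty

active_counts = []
for t in (1, 3, 12, 50):
    w = filter_after_rounds(d, lam, t)
    # Count directions that still carry non-negligible weight
    # (0.05 is an arbitrary cutoff for illustration).
    active = int(np.sum(w > 0.05))
    active_counts.append(active)
    print(t, np.round(w, 4), active)
```

Directions with small eigenvalues are suppressed first, which is the sparsifying effect; with enough rounds even the strong directions decay, which corresponds to the over-regularized regime where performance declines.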
All right, so we are getting close to the end of the talk. So maybe I can talk about some of the open problems here that can be pursued by anyone interested. I categorize them into applied and theoretical.
So, from an applied side, there is actually a very important direction to pursue. And that is whether we can use the understanding that we now have based on this theory. We know what happens. We know how self-distillation is regularizing. And what exactly is that regularization doing? It's gradually sparsifying the matrix that we discussed.
So the question is whether we can use this understanding to construct new algorithms that achieve a similar regularization effect, but more efficiently, because the current approach is kind of insane. You want to do self-distillation for 10 rounds on a big model? I mean, that's not possible, right? It's not interesting. People will not use it.
But the hope is that maybe, by just understanding this regularization, you can now use it more directly, maybe somehow change your training loop. And, in the first cycle of training, you somehow enforce this regularization effect simultaneously. So you don't need to run training multiple times from scratch. So I think that's a very important direction.
And then, from a theoretical viewpoint, there are some I think interesting directions. One of them is extending this analysis to cross-entropy loss. So the plots I showed for deep learning were based on L2 loss. But, in the main paper, we have plots also for cross-entropy. And cross-entropy shows a similar trend.
If you train the models with self-distillation using cross-entropy loss, you see a similar trend: the training accuracy goes down and the test accuracy goes up. But we don't have a theory for why that happens. So extending the analysis from L2 to cross-entropy is very important because cross-entropy is what people use more often.
The other question is whether we can take this analysis beyond the RKHS, which was the key setup. With the regularization that we studied here, things ended up being a kernel regression problem.
So, at the end, it's a problem of studying a reproducing kernel Hilbert space. That was a setup in which we could prove everything. But whether we can relax this a little bit, go beyond it, and still show what happens with self-distillation, that's another interesting direction for future study.
With that, I conclude the talk. And thank you again, everyone, for being here and attending the talk. If there are questions, we have I think some time to discuss and answer questions.