Hossein Mobahi, Google Research
Improving Generalization Performance by Self-Training and Self-Distillation
In supervised learning we often seek a model which minimizes (to epsilon optimality) a loss function over a training set, possibly subject to some (implicit or explicit) regularization. Suppose you train a model this way and read out the predictions it makes over the training inputs, which may slightly differ from the training targets due to the epsilon optimality. Now suppose you treat these predictions as new target values, and retrain another model from scratch using those predictions instead of the original target values. Surprisingly, the second model can often outperform the original model in terms of accuracy on the test set. Actually, we may repeat this loop a few times, and each time see an increase in the generalization performance. This might sound strange as such a supervised self-training process (aka self-distillation) does not receive any new information about the task and solely evolves by retraining itself. In this talk, I argue such self-training process induces additional regularization, which gets amplified in each round of retraining. In fact, I will rigorously characterize such regularization effects when learning the function in Hilbert space. The latter setting can relate to neural networks with infinite width. I will conclude by discussing some open problems in the area of self-training and self-distillation.