Classical generalization bounds for classification suggest maximization of the margin of a deep network under the constraint of unit Frobenius norm of the weight matrix at each layer. We show that this goal can be achieved by gradient algorithms enforcing a unit norm constraint. We describe three algorithms of this kind and their relation with existing weight normalization and batch normalization algorithms, thus explaining their effectivenss. We also show that continuous standard gradient descent with normalization at the end is equivalent to gradient descent with norm constraint. We conjecture that this surprising property corresponds to the elusive implicit regularization of gradient descent in deep networks responsible for generalization despite overparametrization.

^{1}This replaces previous versions of Theory IIIa and Theory IIIb.