Here we consider a simplified model of the dynamics of gradient flow under the square loss in ReLU networks. We show that convergence to a solution with the absolute minimum "norm" -- defined as the product of the Frobenius norms of the layer weight matrices -- is expected when normalization by a Lagrange multiplier (LN) is used together with Weight Decay (WD). In the absence of LN+WD, good solutions for classification may still be achieved because of the implicit bias towards small-norm solutions in the gradient descent dynamics induced by close-to-zero initialization of the weight norms. The main property of the minimizers that bounds their expected binary classification error is the norm: we prove that among all close-to-interpolating solutions, those with smaller norm have larger margin and better bounds on the expected classification error. We also prove that quasi-interpolating solutions obtained by gradient descent in the presence of WD exhibit the recently discovered phenomenon of Neural Collapse, and we describe related predictions. Our analysis supports the idea that the advantage of deep networks over other standard classifiers is restricted to specific deep architectures, such as CNNs, and is due to their good approximation properties for target functions that are locally compositional.
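As a minimal numerical sketch (not from the paper, and only illustrative) of the quantity discussed above, the snippet below trains a small two-layer ReLU network on a toy binary task with the square loss, once with and once without weight decay, from the same initialization, and reports the "norm" in the sense used here: the product of the Frobenius norms of the two weight matrices. The architecture, data, and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: n points in d dimensions, labels in {-1, +1}.
d, h, n = 5, 16, 40
X = rng.standard_normal((d, n))
w_true = rng.standard_normal(d)
y = np.sign(w_true @ X)

def train(weight_decay, steps=2000, lr=0.05):
    """Gradient descent on the square loss for a 2-layer ReLU net,
    optionally with weight decay. Returns the product of the
    Frobenius norms of the two weight matrices after training."""
    init = np.random.default_rng(1)               # identical init for both runs
    W1 = 0.5 * init.standard_normal((h, d))
    W2 = 0.5 * init.standard_normal((1, h))
    for _ in range(steps):
        H = np.maximum(W1 @ X, 0.0)               # ReLU hidden activations
        out = (W2 @ H).ravel()
        g = (out - y) / n                         # d(loss)/d(out) for square loss
        gW2 = g[None, :] @ H.T
        gH = W2.T @ g[None, :]
        gW1 = (gH * (H > 0)) @ X.T                # backprop through the ReLU
        W1 -= lr * (gW1 + weight_decay * W1)
        W2 -= lr * (gW2 + weight_decay * W2)
    return np.linalg.norm(W1) * np.linalg.norm(W2)

rho_wd = train(weight_decay=1e-2)
rho_plain = train(weight_decay=0.0)
print(f"norm product with WD:    {rho_wd:.3f}")
print(f"norm product without WD: {rho_plain:.3f}")
```

On a run like this, the weight-decay solution is expected to end with a smaller norm product than the unregularized one, consistent with the bias towards small-norm quasi-interpolating solutions described above.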