Deep Learning offers many puzzles to the theoretically inclined researcher: How can deep neural networks — optimized by stochastic gradient descent (SGD), agnostic to notions of invariance, minimality, and disentanglement — somehow manage to learn representations that exhibit those traits? How do these networks, which can overfit random labels, somehow manage to generalize? How can the “flatness” of minima of the training loss be related to generalization, when flatness is coordinate-dependent?
I will present theoretical results that can not only explain but quantify these phenomena by showing: 1) a bound between the amount of information in the weights and the total correlation (a measure of disentanglement), minimality, and invariance of the resulting representation; 2) if complexity is measured not by the number of parameters but by the information they contain, deep networks follow the bias-variance tradeoff faithfully, and there is no need to “rethink” generalization; 3) the nuclear norm of the Hessian (a measure of flatness) bounds the information in the weights, which is the regularizer that guarantees the representation to be minimal, sufficient, invariant, and maximally disentangled. The resulting information-theoretic framework, ten years in the making, finally yields a theory with predictive power: one can predict a sharp phase transition between overfitting and underfitting — verified empirically — and quantify the amount of information needed to overfit a given dataset with a given network to within a fraction of a nat. The theory has many connections to variational Bayesian inference, the Information Bottleneck principle, PAC-Bayes bounds, and Kolmogorov complexity.
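One way to sketch the objective behind these results (the notation here is mine, modeled on the Information Bottleneck principle mentioned above, not a formula from the talk) is as a Lagrangian on the weights $w$ given the dataset $\mathcal{D}$, where $H_{p,q}$ denotes a cross-entropy and $\beta$ a tradeoff parameter:

```latex
\mathcal{L}(w) \;=\; H_{p,q}(\mathcal{D} \mid w) \;+\; \beta \, I(w; \mathcal{D})
```

The first term measures how well the weights fit the data; the second penalizes the information the weights store about the dataset. Bounding $I(w;\mathcal{D})$, rather than counting parameters, is what restores the bias-variance tradeoff in point 2).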
Once a theoretically sound regularized loss is in place, learning a representation amounts to solving a high-dimensional, non-convex optimization problem. In the second part of the talk, I will discuss some peculiarities of the loss surface and introduce Entropy-SGD, an algorithm designed to exploit them using insights from statistical physics. As it turns out, Entropy-SGD computes the solution of a viscous Hamilton-Jacobi PDE, which leads to a non-greedy, stochastic optimal control version of SGD that is provably faster. In the non-viscous limit, the PDE leads to the classical proximal point iteration via the Hopf-Lax formula. The analysis establishes previously unknown connections between statistical physics, non-convex optimization, and the theory of PDEs. Moreover, Entropy-SGD includes as special cases some of the most popular distributed algorithms in Deep Learning, e.g., Elastic-SGD, which it outperforms, achieving state-of-the-art generalization error with optimal convergence rates, all without any extra hyper-parameters to tune.
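A minimal toy sketch of the Entropy-SGD idea (my own illustration, not the authors' implementation; all names and hyper-parameters are assumptions): an inner Langevin loop estimates the mean of a local Gibbs measure around the current iterate, and the outer loop descends the resulting smoothed "local entropy" loss.

```python
import numpy as np

def entropy_sgd(grad_f, x0, gamma=0.1, eta=0.1, eta_inner=0.02,
                n_outer=200, n_inner=20, eps=1e-4, alpha=0.75, seed=0):
    """Toy Entropy-SGD sketch: the inner Langevin (SGLD) loop samples the
    local Gibbs measure ~ exp(-f(x') - |x' - x|^2 / (2*gamma)) and tracks
    its mean mu; the outer loop follows the local-entropy gradient
    (x - mu) / gamma."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_outer):
        xp = x.copy()          # inner (Langevin) iterate
        mu = x.copy()          # running estimate of the Gibbs mean
        for _ in range(n_inner):
            # gradient of the modified loss f(x') + |x' - x|^2 / (2*gamma)
            g = grad_f(xp) + (xp - x) / gamma
            xp = (xp - eta_inner * g
                  + np.sqrt(eta_inner * eps) * rng.standard_normal(x.shape))
            mu = alpha * mu + (1.0 - alpha) * xp   # exponential average
        x = x - eta * (x - mu) / gamma             # outer (local-entropy) step
    return x

# Toy non-convex loss f(x) = (x^2 - 1)^2 + 0.2x, with minima near +/-1.
grad_f = lambda x: 4.0 * x * (x**2 - 1.0) + 0.2
x_star = entropy_sgd(grad_f, [2.0])
```

With the step sizes chosen here (eta/gamma = 1), each outer update is exactly the proximal point step x ← mu, mirroring the non-viscous Hopf-Lax limit mentioned above; the averaging of inner iterates is also what relates Entropy-SGD to the elastic coupling in Elastic-SGD.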
Stefano Soatto, Amazon Web Services and University of California, Los Angeles