A Bayesian Perspective on Generalization and Stochastic Gradient Descent

ICLR (2018)

Abstract

Zhang et al. (2016) argued that understanding deep learning requires rethinking
generalization. To justify this claim, they showed that deep networks can easily
memorize randomly labeled training data, despite generalizing well when shown
real labels of the same inputs. We show here that the same phenomenon occurs
in small linear models with fewer than a thousand parameters; however, there is
no need to rethink anything, since our observations are explained by evaluating
the Bayesian evidence in favor of each model. This Bayesian evidence penalizes
sharp minima. We also explore the “generalization gap” observed between small-batch
and large-batch training, identifying an optimum batch size that scales linearly
with both the learning rate and the size of the training set. Surprisingly,
regularizing the model closed the generalization gap in our experiments.
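
To make the scaling claim explicit (the notation here is ours, not fixed by the abstract: ε for the learning rate, N for the number of training examples, B_opt for the batch size that maximizes test accuracy), the stated linear dependence can be summarized as

\[
  B_{\mathrm{opt}} \;\propto\; \epsilon N ,
\]

so that, under this reading, doubling either the learning rate or the amount of training data roughly doubles the batch size at which test accuracy peaks.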