Memory-Efficient Adaptive Optimization

Rohan Anil
Yoram Singer
NeurIPS 2019

Abstract

Adaptive gradient-based optimizers such as AdaGrad and Adam are among the
methods of choice in machine learning. These methods maintain
second-moment statistics for each entry of the gradient, roughly doubling
the memory footprint of the model state. In large-scale applications, this
memory overhead restricts the size of the model being used as well as the
number of examples in a mini-batch. We describe a novel, simple, and flexible
adaptive optimization method with a sublinear memory cost that retains the
benefits of per-parameter adaptivity while allowing for larger models and
mini-batches. We give convergence guarantees for our method and demonstrate
its effectiveness in training very large deep architectures.
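To make the memory argument concrete, here is a minimal NumPy sketch (not code from the paper): a standard AdaGrad step, whose second-moment accumulator has the same shape as the parameters, next to a hypothetical row/column variant for an m x n weight matrix that stores only O(m + n) statistics. The second function is one illustrative way a sublinear memory cost could be obtained; its name and exact update rule are assumptions, not the paper's method.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    # AdaGrad keeps one second-moment statistic per parameter entry,
    # so `accum` has the same shape as `param` -- the memory overhead
    # discussed in the abstract.
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

def row_col_step(param, grad, row_acc, col_acc, lr=0.1, eps=1e-8):
    # Hypothetical sublinear-memory sketch for an m x n weight matrix:
    # only O(m + n) statistics are kept (one per row, one per column),
    # and each entry's statistic is bounded by the smaller of the two.
    # Illustration of the general idea only, not the paper's exact rule.
    nu = np.minimum(row_acc[:, None], col_acc[None, :]) + grad ** 2
    param -= lr * grad / (np.sqrt(nu) + eps)
    row_acc = nu.max(axis=1)  # tightest per-row bound seen so far
    col_acc = nu.max(axis=0)  # tightest per-column bound seen so far
    return param, row_acc, col_acc

# Example usage: the row/column statistics need m + n floats instead of m * n.
m, n = 1024, 512
w = np.zeros((m, n))
row_acc, col_acc = np.zeros(m), np.zeros(n)
g = np.random.randn(m, n)
w, row_acc, col_acc = row_col_step(w, g, row_acc, col_acc)
```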
