Yasaman Bahri
Research Scientist, Brain.
Research Areas
Authored Publications
Sort By
Preview abstract
The test loss of well-trained neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both model and dataset size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pre-trained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents: super-classing image classifiers does not change exponents, while changing input distribution (via changing datasets or adding noise) has a strong effect. We further explore the effect of architecture aspect ratio on scaling exponents.
View details
Preview abstract
Although machine learning models typically experience a drop in performance
on out-of-distribution data, accuracies on in- versus out-of-distribution data are
widely observed to follow a single linear trend when evaluated across a testbed of
models. Models that are more accurate on the out-of-distribution data relative to this
baseline exhibit “effective robustness” and are exceedingly rare. Identifying such
models, and understanding their properties, is key to improving out-of-distribution
performance. We conduct a thorough empirical investigation of effective robustness
during fine-tuning and surprisingly find that models pre-trained on larger datasets
exhibit effective robustness during training that vanishes at convergence. We study
how properties of the data influence effective robustness, and we show that it
increases with the larger size, more diversity, and higher example difficulty of
the dataset. We also find that models that display effective robustness are able to
correctly classify 10% of the examples that no other current testbed model gets
correct. Finally, we discuss several strategies for scaling effective robustness to the
high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art
models.
View details
Preview abstract
The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics that exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.
View details
Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron
Jascha Sohl-dickstein
Roman Novak
International Conference on Machine Learning 2020 (2020) (to appear)
Preview abstract
There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.
View details
Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes
Roman Novak
Jaehoon Lee
Greg Yang
Jiri Hron
Dan Abolafia
Jeffrey Pennington
Jascha Sohl-dickstein
ICLR (2019)
Preview abstract
There is a previously identified equivalence between wide fully connected neural
networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for
instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but
by instead evaluating the corresponding GP. In this work, we derive an analogous
equivalence for multi-layer convolutional neural networks (CNNs) both with and
without pooling layers, and achieve state of the art results on CIFAR10 for GPs
without trainable kernels. We also introduce a Monte Carlo method to estimate
the GP corresponding to a given neural network architecture, even in cases where
the analytic form has too many terms to be computationally feasible.
Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs
with and without weight sharing are identical. As a consequence, translation
equivariance, beneficial in finite channel CNNs trained with stochastic gradient
descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit – a qualitative difference between the two regimes that is not
present in the FCN case. We confirm experimentally, that while in some scenarios
the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs
can significantly outperform their corresponding GPs, suggesting advantages from
SGD training compared to fully Bayesian parameter estimation.
View details
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
Jaehoon Lee
Sam Schoenholz
Roman Novak
Jascha Sohl-dickstein
Jeffrey Pennington
NeurIPS (2019)
Preview abstract
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
View details
Deep Neural Networks as Gaussian Processes
Jaehoon Lee
Roman Novak
Sam Schoenholz
Jeffrey Pennington
Jascha Sohl-dickstein
ICLR (2018)
Preview abstract
It has long been known that a single-layer fully-connected neural network with an
i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit
of infinite network width. This correspondence enables exact Bayesian inference
for infinite width neural networks on regression tasks by means of evaluating the
corresponding GP. Recently, kernel functions which mimic multi-layer random
neural networks have been developed, but only outside of a Bayesian framework.
As such, previous work has not identified that these kernels can be used as covariance
functions for GPs and allow fully Bayesian prediction with a deep neural
network.
In this work, we derive the exact equivalence between infinitely wide deep networks
and GPs. We further develop a computationally efficient pipeline to compute
the covariance function for these GPs. We then use the resulting GPs to perform
Bayesian inference for wide deep neural networks on MNIST and CIFAR10.
We observe that trained neural network accuracy approaches that of the corresponding
GP with increasing layer width, and that the GP uncertainty is strongly
correlated with trained network prediction error. We further find that test performance
increases as finite-width trained networks are made wider and more similar
to a GP, and thus that GP predictions typically outperform those of finite-width
networks. Finally we connect the performance of these GPs to the recent theory
of signal propagation in random neural networks.
View details
Preview abstract
In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
View details
Preview abstract
In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets.
We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization − such as full-batch training or using random labels − correspond to lower robustness, while factors associated with good generalization − such as data augmentation and ReLU non-linearities − give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
View details
Preview abstract
In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing 1000 layers or more. Optimizing networks of such depth is extremely challenging and has up until now been possible only when the architectures incorporate special residual connections and batch normalization. In this work, we demonstrate that it is possible to train vanilla CNNs of depth 1500 or more simply by careful choice of initialization. We derive this initialization scheme theoretically, by developing a mean field theory for the dynamics of signal propagation in random CNNs with circular boundary conditions. We show that the order-to-chaos phase transition of such CNNs is similar to that of fully-connected networks, and we provide empirical evidence that ultra-deep vanilla CNNs are trainable if the weights and biases are initialized near the order-to-chaos transition.
View details