The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning
Abstract
Although machine learning models typically experience a drop in performance
on out-of-distribution data, accuracies on in- versus out-of-distribution data are
widely observed to follow a single linear trend when evaluated across a testbed of
models. Models that are more accurate on the out-of-distribution data relative to this
baseline exhibit “effective robustness” and are exceedingly rare. Identifying such
models, and understanding their properties, is key to improving out-of-distribution
performance. We conduct a thorough empirical investigation of effective robustness
during fine-tuning and surprisingly find that models pre-trained on larger datasets
exhibit effective robustness during training that vanishes at convergence. We study
how properties of the data influence effective robustness, and we show that it
increases with larger dataset size, greater diversity, and higher example difficulty.
We also find that models that display effective robustness are able to
correctly classify 10% of the examples that no other current testbed model gets
correct. Finally, we discuss several strategies for scaling effective robustness to the
high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art
models.
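
As a rough illustration of the quantity described above, effective robustness can be sketched as the gap between a model's out-of-distribution accuracy and the accuracy predicted for it by the linear trend fitted to a testbed of models. The function names and the accuracy values below are hypothetical and chosen only for illustration. The fit here uses raw accuracies with ordinary least squares, which is an assumption of this sketch; the trend may instead be fitted on rescaled accuracies (for example, after a logit transform).

```python
def fit_linear_trend(id_acc, ood_acc):
    """Ordinary least-squares fit of OOD accuracy against ID accuracy.

    Returns (slope, intercept) of the baseline trend across the testbed.
    """
    n = len(id_acc)
    mean_x = sum(id_acc) / n
    mean_y = sum(ood_acc) / n
    sxx = sum((x - mean_x) ** 2 for x in id_acc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(id_acc, ood_acc))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def effective_robustness(model_id_acc, model_ood_acc, slope, intercept):
    """OOD accuracy in excess of what the baseline trend predicts."""
    predicted_ood = slope * model_id_acc + intercept
    return model_ood_acc - predicted_ood


# Hypothetical testbed accuracies (illustrative values, not results from the paper).
testbed_id = [0.60, 0.70, 0.80, 0.90]
testbed_ood = [0.40, 0.50, 0.60, 0.70]
slope, intercept = fit_linear_trend(testbed_id, testbed_ood)

# A hypothetical model that sits above the trend line.
er = effective_robustness(0.75, 0.62, slope, intercept)
```

With these made-up numbers the baseline predicts an out-of-distribution accuracy of 0.55 at an in-distribution accuracy of 0.75, so the hypothetical model has an effective robustness of about 0.07. A model lying exactly on the trend would score zero.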
