Measuring and Reducing Gendered Correlations in Pre-trained Models

Abstract

Large pre-trained models have revolutionized natural language understanding.
However, researchers have found that these models can encode correlations that are undesirable in many applications, such as \emph{surgeon} being associated more strongly with \emph{he} than with \emph{she}.
We explore such \emph{gendered correlations} as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations.
We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy.
Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization.
Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too.
Both techniques yield models whose accuracy is comparable to that of unmitigated analogues, and the mitigated models still resist re-learning correlations during fine-tuning.
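
To make the idea of counterfactual data augmentation concrete, the following minimal sketch pairs each training sentence with a copy in which gendered words are swapped. It is an illustration under stated assumptions, not the procedure used in this work: the word-pair list `GENDER_PAIRS` and the helper names `counterfactual` and `augment` are hypothetical, and ambiguous words such as \emph{her} (which can map to either \emph{his} or \emph{him}) are deliberately left out.

```python
import re

# Hypothetical, deliberately tiny word-pair list (an assumption for
# illustration only; a real list would be far larger and curated).
GENDER_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother")]

# Build a two-way lookup so each word maps to its counterpart.
SWAP = {}
for a, b in GENDER_PAIRS:
    SWAP[a] = b
    SWAP[b] = a

# Match whole words only, so "woman" is never treated as containing "man".
_WORD = re.compile(r"\b\w+\b")


def counterfactual(sentence):
    """Return the sentence with every listed gendered word swapped."""
    def swap(match):
        word = match.group(0)
        target = SWAP.get(word.lower())
        if target is None:
            return word  # not a listed gendered word; keep unchanged
        # Preserve an initial capital letter (e.g. sentence-initial "He").
        return target.capitalize() if word[0].isupper() else target

    return _WORD.sub(swap, sentence)


def augment(corpus):
    """Keep each original sentence and add its counterfactual if it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented
```

Keeping the original sentences alongside the swapped ones is one possible design choice: the augmented corpus then presents both gendered variants of each context, which is the balancing effect the technique relies on.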