Measuring and Mitigating Unintended Bias in Text Classification

John Li
AAAI/ACM Conference on AI, Ethics, and Society (2018)

Abstract

We introduce and illustrate a new approach to measuring and
mitigating unintended bias in machine learning models. Our
definition of unintended bias is parameterized by a test set
and a subset of input features. We illustrate how this can
be used to evaluate text classifiers using a synthetic test set
and a public corpus of comments annotated for toxicity from
Wikipedia Talk pages. We also demonstrate how imbalances
in training data can lead to unintended bias in the resulting
models, and therefore potentially unfair applications. We use
a set of common demographic identity terms as the subset of
input features on which we measure bias. This technique permits
analysis in the common scenario where demographic information
on authors and readers is unavailable, so that bias
mitigation must focus on the content of the text itself. The
mitigation method we introduce is an unsupervised approach
based on balancing the training dataset. We demonstrate that
this approach reduces the unintended bias without compromising
overall model quality.
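To make the evaluation idea concrete, the sketch below shows one possible per-identity-term check of the kind described above: short toxic and non-toxic template sentences are instantiated with each identity term, and a per-term AUC is computed for a given classifier. The template wording, term list, and the `score_comment` scoring function are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of per-identity-term bias evaluation on a synthetic,
# template-based test set (illustrative; not the paper's actual data).
from sklearn.metrics import roc_auc_score

IDENTITY_TERMS = ["gay", "muslim", "feminist", "american"]  # example subset

# (template, toxicity label): 1 = toxic, 0 = non-toxic
TEMPLATES = [
    ("I am a <term> person.", 0),
    ("Being <term> is part of who I am.", 0),
    ("All <term> people are awful.", 1),
    ("I hate every <term> person I meet.", 1),
]

def synthetic_examples(term):
    """Instantiate every template with a single identity term."""
    return [(t.replace("<term>", term), label) for t, label in TEMPLATES]

def per_term_auc(score_comment):
    """Compute the classifier's AUC on each term's synthetic slice.

    `score_comment` is assumed to map a comment string to a toxicity
    score. Large differences in AUC across terms suggest the model is
    keying on the identity term itself rather than on actual toxicity.
    """
    results = {}
    for term in IDENTITY_TERMS:
        texts, labels = zip(*synthetic_examples(term))
        scores = [score_comment(text) for text in texts]
        results[term] = roc_auc_score(labels, scores)
    return results
```

A mitigation along the lines described above would then add non-toxic training examples containing the under-represented identity terms until the toxic/non-toxic ratio is comparable across terms, and re-run the same per-term evaluation to check that the gaps narrow.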