Ben Hutchinson
Ben Hutchinson is a Research Scientist at Google Research, in the Responsible AI and Human-Centered Technology team. His research includes learning from various disciplines to inform the ethical development of AI. Prior to joining Google Research, he spent ten years working on a variety of products such as Google Wave, Google Maps, Knowledge Graph, Google Search, Social Impact, and others. He now uses this experience to work closely with product teams as a consultant on responsible practices and the development of responsible data sets and machine learning models. He has a PhD in Natural Language Processing from the University of Edinburgh, and undergraduate degrees in linguistics and mathematics.
Authored Publications
Preview abstract
Testing, within the machine learning (ML) community, has been predominantly about assessing a learned model's predictive performance measured against a test dataset. This test dataset is often a held-out subset of the dataset used to train the model, and hence expected to follow the same data distribution as the training dataset. While recent work on robustness testing within ML has pointed to the importance of testing against distributional shifts, these efforts also focus on estimating the likelihood of the model making an error against a reference dataset/distribution. In this paper, we argue that this view of testing actively discourages researchers and developers from looking into many other sources of robustness failures, for instance corner cases. We draw parallels with decades of work within software engineering testing focused on assessing a software system against various stress conditions, including corner cases, as opposed to solely focusing on average-case behaviour. Finally, we put forth a set of recommendations to broaden the view of machine learning testing to a rigorous practice.
View details
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yuanzhong Xu
arXiv (2022)
Preview abstract
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
View details
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arXiv:2204.02311 (2022)
Preview abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
View details
Preview abstract
Questions regarding implicitness, ambiguity and underspecification are crucial for multimodal image+text systems, but have received little attention to date. This paper maps out a conceptual framework to address this gap for systems which generate images from text inputs, specifically for systems which generate images depicting scenes from descriptions of those scenes. In doing so, we account for how texts and images convey different forms of meaning. We then outline a set of core challenges concerning textual and visual ambiguity and specificity tasks, as well as risks that may arise from improper handling of ambiguous and underspecified elements. We propose and discuss two strategies for addressing these challenges: a) generating a visually ambiguous output image, and b) generating a set of diverse output images.
View details
Preview abstract
In order to build trust that a machine learned model is appropriate and responsible within a systems context involving technical and human components, a broad range of factors typically need to be considered. However in practice model evaluations frequently focus on only a narrow range of expected predictive behaviours. This paper examines the critical evaluation gap between the idealized breadth of concerns and the observed narrow focus of actual evaluations. In doing so, we demonstrate which values are centered, and which are marginalized, within the machine learning community. Through an empirical study of machine learning papers from recent high profile conferences, we demonstrate the discipline’s general focus on a small set of evaluation methods. By considering the mathematical formulations of evaluation metrics and the test datasets over which they are calculated, we draw attention to which properties of models are centered in the field. This analysis also reveals an important gap: the properties of models which are frequently neglected or sidelined during evaluation. By studying the structure of this gap, we demonstrate the machine learning discipline’s implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the irrelevance of non-predictive features, and the equivalence of different failure modes. Shedding light on these assumptions and commitments enables us to question their appropriateness for different ML system contexts, and points the way towards more diverse and contextualized evaluation methodologies which can be used to more robustly examine the trustworthiness of ML models.
View details
Towards Accountability for Machine Learning Datasets
Alex Hanna
Christina Greer
Margaret Mitchell
Proceedings of FAccT 2021 (2021) (to appear)
Preview abstract
Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational biases measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency which supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, as well as drawing attention to the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems, by making visible the often overlooked work that goes into dataset creation.
View details
Preview abstract
Conventional algorithmic fairness is West-centric, as seen in its sub-groups, values, and optimisations. In this paper, we de-center algorithmic fairness and analyse AI power in India. Based on 36 qualitative interviews and a discourse analysis of algorithmic deployments in India, we find that several assumptions of algorithmic fairness are challenged in India. We find that data is not always reliable due to socio-economic factors, users are given third world treatment by ML makers, and AI signifies unquestioning aspiration. We contend that localising model fairness alone can be window dressing in India, where the distance between models and oppressed communities is large. Instead, we re-imagine algorithmic fairness in India and provide a roadmap to re-contextualise data and models, empower oppressed communities, and enable Fair-ML ecosystems.
View details
Social Biases in NLP Models as Barriers for Persons with Disabilities
Stephen Craig Denuyl
Proceedings of ACL 2020, ACL (to appear)
Preview abstract
Building equitable and inclusive technologies demands paying attention to how social attitudes towards persons with disabilities are represented within technology. Representations perpetuated by NLP models often inadvertently encode undesirable social biases from the data on which they are trained. In this paper, first we present evidence of such undesirable biases towards mentions of disability in two different NLP models: toxicity prediction and sentiment analysis. Next, we demonstrate that neural embeddings that are critical first steps in most NLP pipelines also contain undesirable biases towards mentions of disabilities. We then expose the topical biases in the social discourse about some disabilities which may explain such biases in the models; for instance, terms related to gun violence, homelessness, and drug addiction are over-represented in discussions about mental illness.
View details
Fairness Preferences, Actual and Hypothetical: A Study of Crowdworker Incentives
Angie Peng
Jeff Naecker
Nyalleng Moorosi
Proceedings of ICML 2020 Workshop on Participatory Approaches to Machine Learning (to appear)
Preview abstract
How should we decide which fairness criteria or definitions to adopt in machine learning systems? To answer this question, we must study the fairness preferences of actual users of machine learning systems. Stringent parity constraints on treatment or impact can come with trade-offs, and may not even be preferred by the social groups in question (Zafar et al., 2017). Thus it might be beneficial to elicit what the group’s preferences are, rather than rely on a priori defined mathematical fairness constraints. Simply asking for self-reported rankings of users is challenging because research has shown that there are often gaps between people’s stated and actual preferences (Bernheim et al., 2013).
This paper outlines a research program and experimental designs for investigating these questions. Participants in the experiments are invited to perform a set of tasks in exchange for a base payment; they are told upfront that they may receive a bonus later on, and the bonus could depend on some combination of output quantity and quality. The same group of workers then votes on a bonus payment structure, to elicit preferences. The voting is hypothetical (not tied to an outcome) for half the group and actual (tied to the actual payment outcome) for the other half, so that we can understand the relation between a group’s actual preferences and hypothetical (stated) preferences. Connections and lessons from fairness in machine learning are explored.
View details
Diversity and Inclusion Metrics for Subset Selection
Margaret Mitchell
Dylan Baker
Nyalleng Moorosi
Alex Hanna
Timnit Gebru
Jamie Morgenstern
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), ACM (2020)
Preview abstract
The concept of fairness has recently been applied in machine learning settings to describe a wide range of constraints and objectives. When applied to ranking, recommendation, or subset selection problems for an individual, it becomes less clear that fairness goals are more applicable than goals that prioritize diverse outputs and instances that represent the individual's goals well. In this work, we discuss the relevance of the concept of fairness to the concepts of diversity and inclusion, and introduce metrics that quantify the diversity and inclusion of an instance or set. Diversity and inclusion metrics can be used in tandem, including additional fairness constraints, or may be used separately, and we detail how the different metrics interact. Results from human subject experiments demonstrate that the proposed criteria for diversity and inclusion are consistent with social notions of these two concepts, and human judgments on the diversity and inclusion of example instances are correlated with the defined metrics.
View details