Ben Hutchinson
Ben Hutchinson is a Research Scientist in Google Research's Responsible AI and Human-Centered Technology team. His research includes learning from various disciplines to inform the ethical development of AI. Prior to joining Google Research, he spent ten years working on a variety of products such as Google Wave, Google Maps, Knowledge Graph, Google Search, Social Impact, and others. He now uses this experience to work closely with product teams as a consultant on responsible practices and the development of responsible data sets and machine learning models. He has a PhD in Natural Language Processing from the University of Edinburgh, and undergraduate degrees in linguistics and mathematics.
Authored Publications
Envisioning Aboriginal and Torres Strait Islander AI Futures
Journal of Global Indigeneity (2025)
Preview abstract
In January 2025, over forty Aboriginal and Torres Strait Islander researchers, practitioners, community members, and allies, gathered at the Centre for Global Indigenous Futures at the Wallumattagal Campus of Macquarie University in Sydney to envisage Aboriginal and Torres Strait Islander AI futures. This publication reports on attendees' vision for the future of AI for Aboriginal and Torres Strait Islander people.
View details
Preview abstract
Settler colonialism has led to ancestral language endangerment and extinction on a mass scale. It has also forced 'global' languages such as English on Indigenous communities worldwide. In Australia, post-contact languages, including creoles, and local varieties of international languages emerged as a result of forced contact with English speakers. These contact varieties are widely used, but to date they have been poorly supported by language technologies. This oversight presents barriers to participation in civil and economic society for Indigenous communities using these languages. It also reproduces minoritisation of contemporary Indigenous sociolinguistic identities. This paper concerns the question of whether (and, if so, how) Indigenous people may be supported by technologies for their non-ancestral languages. We argue that multiple real-world opportunities exist, and explore this position through a case study of a project which aims to improve Automated Speech Recognition for Australian Aboriginal English. We discuss how we integrated culturally appropriate processes into the project. We call for increased support for languages used by Indigenous communities, including contact varieties, providing practical economic and socio-cultural benefits.
View details
Preview abstract
Indigenous languages are historically under-served by Natural Language Processing (NLP) technologies, but this is changing for some languages with the recent scaling of large multilingual models and an increased focus by the NLP community on endangered languages. This position paper explores ethical considerations in building NLP technologies for Indigenous languages, based on the premise that such projects should primarily serve Indigenous communities. We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities on language technology projects in Australia. Drawing on insights from the interviews, we recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities, rather than focusing only on decontextualised artefacts.
View details
Preview abstract
As Generative AI (GenAI) systems increasingly enter our daily lives, reshaping social norms and practices, we must examine the norms and practices we use to evaluate the systems themselves. Recent scholarship has started to make explicit the normative dimensions of Machine Learning (ML) development and evaluation. Birhane et al. (2022) demonstrate that particular normative values are encoded in ML practice. Hutchinson et al. (2022), in a review of ML evaluation practices, identify several commitments implicit in the way ML models are evaluated. These include a commitment to consequentialism, the assumptions that evaluations can be undertaken acontextually and that model inputs need only play a limited role during model evaluation, and the expectations that impacts can be quantified and that ML failure modes are commensurable. In this provocation, we extend this line of inquiry by arguing two points: we need to attend to the implicit assumptions and values reflected in how societal impacts are conceptualised and constructed through ML evaluations; and doing so reveals that many of the problems that societal impact evaluations attempt to address would be better conceptualised as governance, rather than evaluation, issues.
View details
Socially Responsible Data for Large Multilingual Language Models
Zara Wudiri
Mbangula Lameck Amugongo
Alex
Stanley Uwakwe
João Sedoc
Edem Wornyo
Seyi Olojo
Amber Ebinama
Suzanne Dikker
2024
Preview abstract
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in language inclusivity in LLMs, and various efforts are striving for models to accommodate language communities outside of the Global North, which include many languages that have been historically underrepresented digitally. These languages have been termed “low resource languages” or “long tail languages”, and LLM performance on these languages is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized people, indigenous people, and non-Western languages raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship, and our own work, and outlines several relevant social, cultural, and ethical considerations and potential ways to mitigate them through qualitative research, community partnerships and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.
View details
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yanping Huang
Yuanzhong Xu
Zhifeng Chen
arXiv (2022)
Preview abstract
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
View details
Preview abstract
Testing, within the machine learning (ML) community, has been predominantly about assessing a learned model's predictive performance measured against a test dataset. This test dataset is often a held-out subset of the dataset used to train the model, and hence expected to follow the same data distribution as the training dataset. While recent work on robustness testing within ML has pointed to the importance of testing against distributional shifts, these efforts also focus on estimating the likelihood of the model making an error against a reference dataset/distribution. In this paper, we argue that this view of testing actively discourages researchers and developers from looking into many other sources of robustness failures, for instance corner cases. We draw parallels with decades of work within software engineering testing focused on assessing a software system against various stress conditions, including corner cases, as opposed to solely focusing on average-case behaviour. Finally, we put forth a set of recommendations to broaden the view of machine learning testing to a rigorous practice.
View details
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Jacob Austin
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Andrew M. Dai
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Katherine Lee
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Preview abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
View details