Ben Hutchinson
Ben Hutchinson is a Research Scientist in Google Research's Responsible AI and Human-Centered Technology team. His research draws on a range of disciplines to inform the ethical development of AI. Prior to joining Google Research, he spent ten years working on a variety of products, such as Google Wave, Google Maps, Knowledge Graph, Google Search, and Social Impact. He now uses this experience to work closely with product teams as a consultant on responsible practices and the development of responsible datasets and machine learning models. He has a PhD in Natural Language Processing from the University of Edinburgh, and undergraduate degrees in linguistics and mathematics.
Authored Publications
Envisioning Aboriginal and Torres Strait Islander AI Futures
Journal of Global Indigeneity (2025)
Abstract
In January 2025, over forty Aboriginal and Torres Strait Islander researchers, practitioners, community members, and allies gathered at the Centre for Global Indigenous Futures at the Wallumattagal Campus of Macquarie University in Sydney to envisage Aboriginal and Torres Strait Islander AI futures. This publication reports on attendees' vision for the future of AI for Aboriginal and Torres Strait Islander people.
Abstract
Settler colonialism has led to ancestral language endangerment and extinction on a mass scale. It has also forced 'global' languages such as English on Indigenous communities worldwide. In Australia, post-contact languages, including creoles, and local varieties of international languages emerged as a result of forced contact with English speakers. These contact varieties are widely used, but to date they have been poorly supported by language technologies. This oversight presents barriers to participation in civil and economic society for Indigenous communities using these languages. It also reproduces minoritisation of contemporary Indigenous sociolinguistic identities. This paper concerns the question of whether (and, if so, how) Indigenous people may be supported by technologies for their non-ancestral languages. We argue that multiple real-world opportunities exist, and explore this position through a case study of a project which aims to improve Automated Speech Recognition for Australian Aboriginal English. We discuss how we integrated culturally appropriate processes into the project. We call for increased support for languages used by Indigenous communities, including contact varieties, which would provide practical economic and socio-cultural benefits.
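As a rough, hypothetical illustration of the support gap described in this abstract (my sketch, not material from the paper), one simple diagnostic is to compare word error rates on a standard-variety test set and a contact-variety test set using the jiwer library; the file paths and line-aligned transcript format below are assumptions.

from jiwer import wer


def corpus_wer(ref_path: str, hyp_path: str) -> float:
    """Corpus-level word error rate from line-aligned reference/hypothesis files."""
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f if line.strip()]
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f if line.strip()]
    return wer(references, hypotheses)


# Hypothetical evaluation sets: one Standard Australian English, one Australian Aboriginal English.
print("WER, Standard Australian English  :", corpus_wer("sae_refs.txt", "sae_hyps.txt"))
print("WER, Australian Aboriginal English:", corpus_wer("abe_refs.txt", "abe_hyps.txt"))

A large gap between the two scores would be one concrete, measurable signal of the barriers the abstract describes.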
Abstract
As Generative AI (GenAI) systems increasingly enter our daily lives, reshaping social norms and practices, we must examine the norms and practices we use to evaluate the systems themselves. Recent scholarship has started to make explicit the normative dimensions of Machine Learning (ML) development and evaluation. Birhane et al. (2022) demonstrate that particular normative values are encoded in ML practice. Hutchinson et al. (2022), in a review of ML evaluation practices, identify several commitments implicit in the way ML models are evaluated. These include a commitment to consequentialism, the assumptions that evaluations can be undertaken acontextually and that model inputs need only play a limited role during model evaluation, and the expectations that impacts can be quantified and that ML failure modes are commensurable. In this provocation, we extend this line of inquiry by arguing two points: we need to attend to the implicit assumptions and values reflected in how societal impacts are conceptualised and constructed through ML evaluations; and doing so reveals that many of the problems that societal impact evaluations attempt to address would be better conceptualised as governance, rather than evaluation, issues.
Abstract
Indigenous languages are historically under-served by Natural Language Processing (NLP) technologies, but this is changing for some languages with the recent scaling of large multilingual models and an increased focus by the NLP community on endangered languages. This position paper explores ethical considerations in building NLP technologies for Indigenous languages, based on the premise that such projects should primarily serve Indigenous communities. We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities on language technology projects in Australia. Drawing on insights from the interviews, we recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities, rather than focusing only on decontextualised artefacts.
Socially Responsible Data for Large Multilingual Language Models
Zara Wudiri, Mbangula Lameck Amugongo, Alex, Stanley Uwakwe, João Sedoc, Edem Wornyo, Seyi Olojo, Amber Ebinama, Suzanne Dikker
2024
Abstract
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in language inclusivity in LLMs, and various efforts are striving for models to accommodate language communities outside of the Global North, which include many languages that have been historically underrepresented digitally. These languages are often termed “low resource languages” or “long tail languages”, and LLMs' performance on them is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized people, Indigenous people, and non-Western languages raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship, and our own work, and outlines several relevant social, cultural, and ethical considerations, along with potential ways to address them through qualitative research, community partnerships and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.
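As a hedged aside (an illustration of the underrepresentation problem, not one of the paper's recommendations), subword tokenizers trained largely on English tend to fragment text in underrepresented languages into many more tokens per word; measuring that fragmentation on small parallel samples is one rough proxy for the gap. The tokenizer choice and file paths below are assumptions.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer only


def fragmentation(path: str) -> float:
    """Average number of subword tokens per whitespace-separated word in a text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    return len(tokenizer.tokenize(text)) / max(len(words), 1)


# Hypothetical parallel samples of the same content in different languages.
for language, path in [("English", "sample.en.txt"), ("isiZulu", "sample.zu.txt")]:
    print(f"{language}: {fragmentation(path):.2f} tokens per word")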
Abstract
Questions regarding implicitness, ambiguity and underspecification are crucial for multimodal image+text systems, but have received little attention to date. This paper maps out a conceptual framework to address this gap for systems which generate images from text inputs, specifically for systems which generate images depicting scenes from descriptions of those scenes. In doing so, we account for how texts and images convey different forms of meaning. We then outline a set of core challenges concerning textual and visual ambiguity and specificity, as well as risks that may arise from improper handling of ambiguous and underspecified elements. We propose and discuss two strategies for addressing these challenges: a) generating a visually ambiguous output image, and b) generating a set of diverse output images.
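A minimal sketch of strategy (b), assuming a Stable Diffusion checkpoint accessed through the diffusers library (the model, prompt, and sampling settings are illustrative choices rather than the paper's): sample several images under different random seeds for an underspecified description, so that distinct plausible interpretations can surface.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Underspecified scene description: the bird's species, pose and setting are all left open.
prompt = "a bird sitting near water at dusk"

for seed in range(4):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"interpretation_{seed}.png")

Strategy (a), by contrast, would require steering a single generation so that the underspecified elements remain visually uncommitted.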
Abstract
In order to build trust that a machine learned model is appropriate and responsible within a systems context involving technical and human components, a broad range of factors typically need to be considered. However, in practice model evaluations frequently focus on only a narrow range of expected predictive behaviours. This paper examines the critical evaluation gap between the idealized breadth of concerns and the observed narrow focus of actual evaluations. In doing so, we demonstrate which values are centered, and which are marginalized, within the machine learning community. Through an empirical study of machine learning papers from recent high-profile conferences, we demonstrate the discipline's general focus on a small set of evaluation methods. By considering the mathematical formulations of evaluation metrics and the test datasets over which they are calculated, we draw attention to which properties of models are centered in the field. This analysis also reveals an important gap: the properties of models which are frequently neglected or sidelined during evaluation. By studying the structure of this gap, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the irrelevance of non-predictive features, and the equivalence of different failure modes. Shedding light on these assumptions and commitments enables us to question their appropriateness for different ML system contexts, and points the way towards more diverse and contextualized evaluation methodologies which can be used to more robustly examine the trustworthiness of ML models.
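As one hedged illustration of the more contextualized evaluation methodologies gestured at above (disaggregated reporting; my example rather than the paper's own protocol), per-slice metrics can be reported alongside the usual aggregate score; the data frame, labels and slice names below are hypothetical.

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical evaluation results: true label, model prediction, and a context slice.
eval_df = pd.DataFrame({
    "label":      [1, 0, 1, 1, 0, 1, 0, 0],
    "prediction": [1, 0, 0, 1, 0, 1, 1, 0],
    "slice":      ["A", "A", "B", "B", "B", "C", "C", "C"],
})

print("Aggregate accuracy:", accuracy_score(eval_df["label"], eval_df["prediction"]))

# Disaggregated view: the same predictions, reported per slice.
for name, group in eval_df.groupby("slice"):
    acc = accuracy_score(group["label"], group["prediction"])
    f1 = f1_score(group["label"], group["prediction"], zero_division=0)
    print(f"slice={name}: accuracy={acc:.2f} f1={f1:.2f}")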
Abstract
Testing, within the machine learning (ML) community, has been predominantly about assessing a learned model's predictive performance measured against a test dataset. This test dataset is often a held-out subset of the dataset used to train the model, and hence expected to follow the same data distribution as the training dataset. While recent work on robustness testing within ML has pointed to the importance of testing against distributional shifts, these efforts also focus on estimating the likelihood of the model making an error against a reference dataset/distribution. In this paper, we argue that this view of testing actively discourages researchers and developers from looking into many other sources of robustness failures, for instance, corner cases. We draw parallels with decades of work in software engineering testing, which focuses on assessing a software system against various stress conditions, including corner cases, rather than solely on average-case behaviour. Finally, we put forth a set of recommendations to broaden the view of machine learning testing into a rigorous practice.
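The sketch below illustrates that broader view (it is my example, not the paper's recommendations): alongside held-out accuracy, a model can be exercised with software-engineering style behavioural tests, here an invariance check and a corner-case stress input on a toy classifier whose construction and thresholds are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy setup: only feature 0 determines the label; feature 1 is irrelevant by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)


def test_invariance_to_irrelevant_feature():
    # Perturbing the irrelevant feature should not flip the prediction.
    x = np.array([[2.0, 0.0]])
    x_perturbed = np.array([[2.0, -2.0]])
    assert model.predict(x)[0] == model.predict(x_perturbed)[0]


def test_corner_case_extreme_input():
    # Stress condition: an extreme out-of-range input should still yield the expected label.
    x_extreme = np.array([[1e6, 0.0]])
    assert model.predict(x_extreme)[0] == 1


if __name__ == "__main__":
    test_invariance_to_irrelevant_feature()
    test_corner_case_extreme_input()
    print("behavioural tests passed")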