Ian Tenney
Authored Publications
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
Michael Xieyang Liu
Krystal Kallarackal
Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024)
Preview abstract
Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
Retrieval-guided Counterfactual Generation for QA
Bhargavi Paranjape
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics (2022), pp. 1670-1686 (to appear)
Preview abstract
Deep NLP models have been shown to be brittle to input perturbations. Recent work has shown that data augmentation using counterfactuals (i.e., minimally perturbed inputs) can help ameliorate this weakness. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model’s robustness to local perturbations.
The MultiBERTs: BERT Reproductions for Robustness Analysis
Steve Yadlowsky
Jason Wei
Naomi Saphra
Iulia Raluca Turc
2022
Preview abstract
Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this question, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters similar to those of the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.
What Happens To BERT Embeddings During Fine-tuning?
Amil Merchant
Elahe Rahimtoroghi
Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics (to appear)
Preview abstract
While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.
Asking without Telling: Exploring Latent Ontologies in Contextual Representations
Julian Michael
Jan A. Botha
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (to appear)
Preview abstract
The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.
The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
Andy Coenen
Sebastian Gehrmann
Ellen Jiang
Carey Radebaugh
Ann Yuan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics (to appear)
Preview abstract
We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models (including classification, seq2seq, and structured prediction) and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.
Measuring and Reducing Gendered Correlations in Pre-trained Models
Alex Beutel
Emily Pitler
arXiv (2020)
Preview abstract
Large pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode correlations undesired in many applications, like "surgeon" being associated more with "he" than "she". We explore such gendered correlations as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations. We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy. Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization. Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too. Both techniques yield models with comparable accuracy to unmitigated analogues, and still resist re-learning correlations in fine-tuning.
Preview abstract
We show that embedding-based language models capture a significant amount of information about the scalar magnitudes of objects but are short of the capability required for general common-sense reasoning. We identify ambiguity and numeracy as the key factors limiting their performance, and show that a simple reversible transformation of the pre-training corpus can have a significant effect on the results. We identify the best models and metrics to use when doing zero-shot transfer across tasks in this domain.
Preview abstract
Pre-trained sentence encoders such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have rapidly advanced the state-of-the-art on many NLP tasks, and have been shown to encode contextual information that can resolve many aspects of language structure. We extend the edge probing suite of Tenney et al. (2019) to explore the computation performed at each layer of the BERT model, and find that tasks derived from the traditional NLP pipeline appear in a natural progression: part-of-speech tags are processed earliest, followed by constituents, dependencies, semantic roles, and coreference. We trace individual examples through the encoder and find that while this order holds on average, the encoder occasionally inverts the order, revising low-level decisions after deciding higher-level contextual relations.
What do you learn from context? Probing for sentence structure in contextualized word representations
Patrick Xia
Berlin Chen
Alex Wang
Adam Poliak
R. Thomas McCoy
Najoung Kim
Benjamin Van Durme
Samuel R. Bowman
International Conference on Learning Representations (2019)
Preview abstract
Contextualized representation models such as CoVe (McCann et al., 2017) and ELMo (Peters et al., 2018a) have recently achieved state-of-the-art results on a broad suite of downstream NLP tasks. Building on recent token-level probing work (Peters et al., 2018a; Blevins et al., 2018; Belinkov et al., 2017b; Shi et al., 2016), we introduce a broad suite of sub-sentence probing tasks derived from the traditional structured-prediction pipeline, including parsing, semantic role labeling, and coreference, and covering a range of syntactic, semantic, local, and long-range phenomena. We use these tasks to examine the word-level contextual representations and investigate how they encode information about the structure of the sentence in which they appear. We probe three recently released contextual encoder models, and find that ELMo better encodes linguistic structure at the word level than do other comparable models. We find that the existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer small improvements on semantic tasks over a non-contextual baseline.