James Wexler
Authored Publications
Abstract
Large language models (LLMs) are highly capable at a variety of tasks given the right prompt, but writing one is still a difficult and tedious process. In this work, we introduce ConstitutionalExperts, a method for learning a prompt consisting of constitutional principles (i.e., rules), given a training dataset. Unlike prior methods that optimize the prompt as a single entity, our method incrementally improves the prompt by surgically editing individual principles. We also show that we can improve overall performance by learning unique prompts for different semantic regions of the training data and using a mixture-of-experts (MoE) architecture to route inputs at inference time. We compare ConstitutionalExperts to other state-of-the-art prompt-optimization techniques across six benchmark datasets, and also investigate whether MoE improves these other techniques. Our results suggest that ConstitutionalExperts outperforms other prompt-optimization techniques and that mixture-of-experts improves all techniques on average by 4.7%, suggesting broader applicability.
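The paper's prompt-editing and routing procedures are not reproduced here, but the mixture-of-experts routing idea can be sketched roughly as below, assuming hypothetical embed() and call_llm() helpers and precomputed region centroids (these names are illustrative, not the paper's API).

```python
# Minimal sketch: route an input to the prompt learned for its semantic region.
# embed(), call_llm(), centroids, and expert_prompts are illustrative assumptions.
import numpy as np

def route_and_classify(x, centroids, expert_prompts, embed, call_llm):
    """Pick the expert prompt whose training-region centroid is closest to x."""
    v = embed(x)                                    # embed the input text
    distances = [np.linalg.norm(v - c) for c in centroids]
    expert = int(np.argmin(distances))              # nearest semantic region
    prompt = expert_prompts[expert]                 # that region's learned principles
    return call_llm(f"{prompt}\n\nInput: {x}\nLabel:")
```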
ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles
Aaron Donsbach
ACM IUI 2024 (2024)
Abstract
Large language model (LLM) prompting is a promising new approach for users to create and customize their own chatbots. However, current methods for steering a chatbot's outputs, such as prompt engineering and fine-tuning, do not support users in converting their natural feedback on the model's outputs to changes in the prompt or model. In this work, we explore how to enable users to interactively refine model outputs through their feedback, by helping them convert their feedback into a set of principles (i.e. a constitution) that dictate the model's behavior. From a formative study, we (1) found that users needed support converting their feedback into principles for the agent and (2) classified the different principle types desired by users. Inspired by these findings, we developed ConstitutionMaker, an interactive tool for converting user feedback into principles, to steer LLM-based chatbots. With ConstitutionMaker, users can provide either positive or negative feedback in natural language, select auto-generated feedback, or rewrite the agent’s response; each mode of feedback automatically generates a principle that is inserted into the chatbot’s prompt. In a user study with 14 participants, we compare ConstitutionMaker to an ablated version, where users write their own principles. With ConstitutionMaker, participants felt that their principles could better guide the agent, that they could more easily convert their feedback into principles, and that they could write principles more efficiently, with less mental demand. ConstitutionMaker helped users identify ways to improve the chatbot, formulate their intuitive responses to the model into feedback, and convert this feedback into specific and clear principles. Together, these findings inform future tools that support the interactive critiquing of LLM outputs.
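A rough sketch of the feedback-to-principle loop described above follows; feedback_to_principle(), add_principle(), and call_llm() are illustrative helpers, not ConstitutionMaker's implementation.

```python
# Illustrative sketch: rewrite one piece of natural-language feedback as a
# general principle, then append it to the chatbot's prompt ("constitution").
def feedback_to_principle(feedback, last_user_msg, last_bot_msg, call_llm):
    """Ask an LLM to restate feedback as an imperative rule for the chatbot."""
    request = (
        "Rewrite the user's feedback as a single imperative principle the "
        "chatbot should always follow.\n"
        f"User message: {last_user_msg}\n"
        f"Chatbot response: {last_bot_msg}\n"
        f"Feedback: {feedback}\n"
        "Principle:"
    )
    return call_llm(request).strip()

def add_principle(chatbot_prompt, principle):
    """Insert the new principle into the prompt so it steers future responses."""
    return chatbot_prompt + f"\n- {principle}"
```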
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
Minsuk Kahng
Michael Xieyang Liu
Krystal Kallarackal
Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024)
Abstract
Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
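The tool itself is visual and interactive, but the aggregation it surfaces can be sketched simply: given automatic side-by-side ratings, compute per-slice win rates for one model against the baseline. The field names below are assumptions for the example, not the tool's data format.

```python
# Hedged sketch: per-slice win rates from side-by-side ratings.
from collections import defaultdict

def win_rates_by_slice(ratings):
    """ratings: list of dicts like {"slice": "coding", "winner": "model_a"}."""
    counts = defaultdict(lambda: {"model_a": 0, "model_b": 0, "tie": 0})
    for r in ratings:
        counts[r["slice"]][r["winner"]] += 1
    rates = {}
    for s, c in counts.items():
        total = sum(c.values())
        rates[s] = c["model_a"] / total if total else float("nan")
    return rates
```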
Automatic Histograms: Leveraging Language Models for Text Dataset Exploration
Minsuk Kahng
Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM, Honolulu, HI, USA (2024), 9 pages
Abstract
Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with Large Language Models. Data practitioners often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain specific, e.g., instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data practitioners often run custom analyses for each dataset, which is cumbersome and difficult, or use unsupervised methods. We present AutoHistograms, a visualization tool leveraging LLMs. AutoHistograms automatically identifies relevant entity-based features, visualizes their distributions, and allows the user to interactively query the dataset for new categories of entities. In a user study with 10 data practitioners (n=10), we observe that participants were able to quickly onboard to AutoHistograms, use the tool to identify actionable insights, and conceptualize a broad range of applicable use cases. We also describe a variety of usage scenarios from different types of users to highlight how this tool can provide value in many different contexts. Finally, we present a quantitative evaluation of the tool. Together, this tool and user study contribute to the growing field of LLM-assisted sensemaking tools.
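As a rough illustration of the core loop (not the paper's implementation), an LLM-backed extractor tags entities for a domain-specific category and the counts become a histogram; extract_entities() is a hypothetical helper.

```python
# Illustrative sketch: tag domain-specific entities with an LLM, then count them.
from collections import Counter

def entity_histogram(documents, category, extract_entities):
    """Count entities of `category` (e.g. 'instrument') across a text dataset."""
    counts = Counter()
    for doc in documents:
        for entity in extract_entities(doc, category):  # LLM-backed, hypothetical
            counts[entity.lower()] += 1
    return counts.most_common()
```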
Take it, Leave it, or Fix it: Measuring Productivity and Trust in Human-AI Collaboration
29th International Conference on Intelligent User Interfaces (IUI ’24), ACM, New York, NY, USA (2024)
Abstract
Although recent developments in generative AI have greatly enhanced the capabilities of conversational agents such as Google's Bard or OpenAI's ChatGPT, it is unclear whether the usage of these agents aids users across various contexts. To better understand how access to conversational AI affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers (N=76) as they completed a programming exam with and without access to Bard. Effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" questions vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). Our findings include evidence of automation complacency, increased reliance on the AI over the course of the task, and increased performance for novices on "solve"-type questions when using the AI. We discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational AI.
Best of both worlds: local and global explanations with human-understandable concepts
Jessica Schrouff
Sebastien Baur
Shaobo Hou
Eric Loreaux
Diana Mincu
Ralph Blanes
(2021)
Abstract
Interpretability techniques aim to provide the rationale behind a model's decision, typically by explaining either an individual prediction (local explanation, e.g. 'why is this patient diagnosed with this condition') or a class of predictions (global explanation, e.g. 'why is this set of patients diagnosed with this condition in general'). While there are many methods focused on one or the other, few frameworks can provide both local and global explanations in a consistent manner. In this work, we combine two powerful existing techniques, one local (Integrated Gradients, IG) and one global (Testing with Concept Activation Vectors, TCAV), to provide local and global concept-based explanations. We first sanity-check our idea using two synthetic datasets with a known ground truth, and further demonstrate it on a benchmark natural image dataset. We test our method with various concepts, target classes, model architectures and IG parameters (e.g. baselines). We show that our method improves global explanations over vanilla TCAV when compared to ground truth, and provides useful local insights. Finally, a user study demonstrates the usefulness of the method compared to no explanations or global explanations only. We hope our work provides a step towards building bridges between many existing local and global methods to get the best of both worlds.
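A minimal sketch of the local side of this combination, under the simplifying assumption that Integrated Gradients attributions have already been computed in the activation space of the layer where the concept activation vector was learned:

```python
# Sketch: score one prediction against a concept by projecting layer-level
# Integrated Gradients attributions onto the concept activation vector (CAV).
# Baseline choice, layer choice, and aggregation details are omitted.
import numpy as np

def local_concept_importance(ig_attributions_at_layer, cav):
    """Dot the per-unit IG attributions with the (normalized) concept direction."""
    cav_unit = cav / np.linalg.norm(cav)
    return float(np.dot(ig_attributions_at_layer, cav_unit))
```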
The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
Tolga Bolukbasi
Andy Coenen
Sebastian Gehrmann
Ellen Jiang
Carey Radebaugh
Ann Yuan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics (to appear)
Abstract
We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models, including classification, seq2seq, and structured prediction, and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.
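As a flavor of the declarative API, a model wrapper exposes input and output specs plus a prediction function. The sketch below follows the LIT Python API around the time of this paper; method and type names may differ in the current release, so treat it as approximate and consult the repository for the authoritative interface.

```python
# Approximate sketch of a LIT model wrapper; verify names against the repo.
from lit_nlp.api import model as lit_model
from lit_nlp.api import types as lit_types

class ToySentimentModel(lit_model.Model):
  """Wraps a classifier so LIT can introspect its inputs and outputs."""

  def input_spec(self):
    return {"text": lit_types.TextSegment()}

  def output_spec(self):
    return {"probas": lit_types.MulticlassPreds(vocab=["negative", "positive"])}

  def predict_minibatch(self, inputs):
    # A real wrapper would run the underlying model; constants keep this runnable.
    return [{"probas": [0.5, 0.5]} for _ in inputs]
```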
Bach Doodle: Approachable music composition with machine learning at scale
Curtis Hawthorne
Monica Dinculescu
Leon Hong
Jacob Howcroft
Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) (2019)
Abstract
Many of us like music, but composing can feel intimidating when we do not know where to begin. Even when we have a melody, without sufficient skills in harmony we are deterred from developing it into a composition. Machine learning could potentially extend our creative abilities by offering generative models that can fill in the missing parts of our composition.
To make music composition more approachable, we designed a composition web app where users can create their own melody and have it harmonized by a machine learning model. For inputting melodies, we designed a simplified sheet music interface that facilitates easy trial and error, and found that users adapted to it quickly even when they were not familiar with Western music notation. Users can rapidly explore different possibilities in harmonization by tweaking their melody and requesting new harmonizations.
The harmonizations are provided by Coconet, a flexible generative model of counterpoint. Several technical challenges had to be overcome to support an interactive experience at scale. First, as most users do not have dedicated hardware to run machine learning models, we re-implemented Coconet in TensorFlow.js so that it could run in the browser. Second, our initial re-implementation took more than 40 seconds to generate two measures of music. By adopting dilated depth-wise separable convolutions and model quantization, we reduced this to 2 seconds. Third, to prepare for large-scale deployment, we calibrated a speed test to determine whether a user's device is fast enough to run the model in the browser; if not, harmonization requests were sent to remote TPU servers.
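The decision logic behind that speed test can be sketched roughly as follows; the real system ran in TensorFlow.js in the browser, and the threshold and helper below are illustrative assumptions.

```python
# Hedged sketch: time a warm-up harmonization and fall back to a server if slow.
import time

def choose_backend(run_local_warmup, max_seconds=5.0):
    """Return 'local' if a small warm-up inference finishes quickly enough."""
    start = time.time()
    run_local_warmup()                 # e.g. harmonize a tiny fixed melody locally
    elapsed = time.time() - start
    return "local" if elapsed <= max_seconds else "remote_tpu"
```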
In three days, the web app received more than 50 million queries for harmonization from around the world. Users could choose to rate their compositions and contribute them to a public dataset, which we are releasing with this paper. We hope that the community might find this dataset useful for applications ranging from ethnomusicological studies, to music education, to improving machine learning models. We end with a quote from a user: “It's really fun to play with. This might be the first time in my life I feel competent at music.”
Abstract
Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Most current explanation methods provide explanations through feature importance scores, which identify features that are salient for each individual input. However, systematically summarizing and interpreting such per-sample feature importance scores is itself challenging. In this work, we propose principles and desiderata for concept-based explanation, which goes beyond per-sample features to identify higher-level, human-understandable concepts that apply across the entire dataset. We develop a new algorithm, ACE, to automatically extract visual concepts. Our systematic experiments demonstrate that ACE discovers concepts that are human-meaningful, coherent, and salient for the neural network's predictions.
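A simplified sketch of the concept-extraction step (segmentation and the downstream importance scoring are omitted) might cluster the network's activations for image segments so that recurring clusters serve as candidate concepts:

```python
# Simplified sketch: cluster segment activations into candidate concepts.
from sklearn.cluster import KMeans

def candidate_concepts(segment_activations, n_concepts=10):
    """segment_activations: (num_segments, hidden_dim) array from one layer."""
    km = KMeans(n_clusters=n_concepts, n_init=10).fit(segment_activations)
    # Each cluster groups similar-looking segments and acts as one concept.
    return km.labels_, km.cluster_centers_
```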
The What-If Tool: Interactive Probing of Machine Learning Models
Tolga Bolukbasi
(to appear)
Abstract
A key challenge in developing and deploying Machine Learning (ML) systems is understanding their performance across a wide range of inputs. To address this challenge, we created the What-If Tool, an open-source application that allows practitioners to probe, visualize, and analyze ML systems, with minimal coding. The What-If Tool lets practitioners test performance in hypothetical situations, analyze the importance of different data features, and visualize model behavior across multiple models and subsets of input data. It also lets practitioners measure systems according to multiple ML fairness metrics. We describe the design of the tool, and report on real-life usage at different organizations.
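One concrete example of the kind of slice-level analysis such a tool supports is comparing a simple fairness metric across groups; the column names below are assumptions for the example, not requirements of the tool.

```python
# Illustrative sketch: positive-prediction rate per data slice at a threshold.
import pandas as pd

def positive_rate_by_slice(df, score_col="score", slice_col="group", threshold=0.5):
    """Fraction of examples predicted positive, per slice."""
    preds = df[score_col] >= threshold
    return preds.groupby(df[slice_col]).mean()
```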