Asma Ghandeharioun

Asma Ghandeharioun, Ph.D., is a senior research scientist with the People + AI Research team at Google DeepMind. She works on aligning AI with human values by better understanding and controlling (language) models, in particular by demystifying their inner workings and correcting collective misconceptions along the way. While her current research focuses mostly on machine learning interpretability, her previous work spans conversational AI, affective computing, and, more broadly, human-centered AI. She holds a doctorate and a master's degree from MIT and a bachelor's degree from Sharif University of Technology. Trained as a computer scientist and engineer, she has research experience at MIT, Google Research, Microsoft Research, and Ecole Polytechnique Fédérale de Lausanne (EPFL), among others.

Her work has been published in premier peer-reviewed machine learning venues such as ICLR, NeurIPS, ICML, EMNLP, AAAI, ACII, and AISTATS. She has received awards at NeurIPS, and her work has been featured in Wired, the Wall Street Journal, and New Scientist.

For more details, visit her personal page.
Authored Publications
    The profound success of transformer-based language models can largely be attributed to their ability to integrate relevant contextual information from an input sequence in order to generate a response or complete a task. However, we know very little about the algorithms that a model employs to implement this capability, nor do we understand their failure modes. For example, given the prompt "John is going fishing, so he walks over to the bank. Can he make an ATM transaction?", a model may incorrectly respond "Yes" if it has not properly contextualized "bank" as a geographical feature rather than a financial institution. We propose the LLM Race Conditions Hypothesis as an explanation of contextualization errors of this form. This hypothesis identifies dependencies between tokens (e.g., "bank" must be properly contextualized before the final token, "?", integrates information from "bank"), and claims that contextualization errors are a result of violating these dependencies. Using a variety of techniques from mechanistic interpretability, we provide correlational and causal evidence in support of the hypothesis, and suggest inference-time interventions to address it.
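A minimal sketch of the kind of layer-wise contextualization check this abstract alludes to: compare the hidden states of "bank" in the ambiguous fishing prompt against an unambiguous financial prompt, layer by layer. This is only an illustrative, correlational probe, not the paper's methodology; the model (gpt2), prompts, and similarity measure are all assumptions.

```python
# Hypothetical contextualization probe, assuming a GPT-2-style Hugging Face model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def bank_states(text):
    ids = tok(text, return_tensors="pt")
    # locate the " bank" token in the prompt (single BPE token in GPT-2)
    pos = ids["input_ids"][0].tolist().index(tok(" bank")["input_ids"][0])
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return torch.stack([h[0, pos] for h in hs])   # (num_layers + 1, hidden)

ambiguous = "John is going fishing, so he walks over to the bank."
financial = "John needs cash, so he walks over to the bank."
sims = torch.nn.functional.cosine_similarity(
    bank_states(ambiguous), bank_states(financial), dim=-1)
for layer, s in enumerate(sims.tolist()):
    print(f"layer {layer}: cosine similarity {s:.2f}")
```

If "bank" is being contextualized as a riverbank, one would expect the similarity to the financial reading to drop in later layers; whether that happens before the final "?" token reads from "bank" is exactly the dependency the hypothesis is about.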
    Why do models respond to harmful queries in some cases but not others? Despite significant investments in improving model safety, it has been shown that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. First, we show that even when model generations are safe, harmful content persists in hidden representations, and this content can be extracted by decoding from earlier layers. Then, we show that whether the model divulges such content depends significantly on who it is talking to, which we refer to as the user persona. We study both natural language prompting and activation steering as methods for manipulating the inferred user persona and show that the latter is significantly more effective at bypassing safety filters. In fact, we find it is even more effective than direct attempts to control a model's refusal tendency. This suggests that when it comes to deciding whether to respond to harmful queries, the model is deeply biased with respect to user persona. We leverage the generative capabilities of the language model itself to investigate why certain personas break model safeguards, and discover that they enable the model to form more charitable interpretations of otherwise dangerous queries. Finally, we show that we can predict a persona's effect on refusal given only the geometry of its steering vector.
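The activation-steering recipe mentioned here generally amounts to adding a direction to the residual stream at some layer during generation. Below is a hedged sketch of that general recipe, not the paper's setup: gpt2 is a stand-in model, and the steering vector, layer, and scale are placeholder assumptions (in practice a persona vector would be derived from contrasting activations).

```python
# Sketch of activation steering via a forward hook, assuming a GPT-2-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer_idx = 6                                   # intervention layer (assumption)
steering_vector = torch.randn(model.config.hidden_size)  # placeholder persona direction
steering_vector = steering_vector / steering_vector.norm()
scale = 4.0                                     # steering strength (assumption)

def add_persona_direction(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual-stream hidden states.
    hidden_states = output[0] + scale * steering_vector.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_persona_direction)
ids = tok("Tell me about yourself.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```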
    An Explorable explaining the concept of Patchscopes for an external audience. Patchscopes is an interpretability tool that allows researchers to better understand an LLM's output representations through natural language experiments.
    Inspecting the hidden representations of large language models (LLMs) is of growing interest, not only to understand a model's behavior and verify its alignment with human values, but also to control it before it goes awry. Given the capabilities of LLMs in generating coherent, human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer various kinds of questions; each specific configuration is referred to, in the singular, as a Patchscope. We show that many prior inspection methods based on projecting representations into the vocabulary space, such as logit lens, tuned lens, and linear shortcuts, can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model. Finally, we demonstrate the utility of Patchscopes for practical applications, such as harmful belief extraction and self-correction in multi-hop reasoning.
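The core mechanic is representation patching: take a hidden state from a source prompt and layer, insert it at a position of a separate "inspection" prompt, and let the model verbalize it. The sketch below is a simplified illustration under stated assumptions (gpt2 as both source and target model, arbitrary layer choices, a toy identity-style target prompt); it is not the authors' code or exact configuration.

```python
# Simplified Patchscope-style patching sketch with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

source_prompt = "The Eiffel Tower is located in the city of Paris"
src_layer, tgt_layer = 8, 2                       # layer choices are assumptions
src_ids = tok(source_prompt, return_tensors="pt")
with torch.no_grad():
    src_out = model(**src_ids, output_hidden_states=True)
source_vec = src_out.hidden_states[src_layer][0, -1]   # last-token representation

# Target prompt nudges the model to verbalize whatever is patched into the last slot.
target_prompt = "cat -> cat; hello -> hello; x ->"
tgt_ids = tok(target_prompt, return_tensors="pt")
patch_pos = tgt_ids["input_ids"].shape[1] - 1

def patch_hidden(module, inputs, output):
    hidden_states = output[0]
    if hidden_states.shape[1] > patch_pos:        # only patch on the full prompt pass
        hidden_states[:, patch_pos] = source_vec.to(hidden_states.dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[tgt_layer].register_forward_hook(patch_hidden)
with torch.no_grad():
    gen = model.generate(**tgt_ids, max_new_tokens=10, do_sample=False)
handle.remove()
print(tok.decode(gen[0][tgt_ids["input_ids"].shape[1]:]))
```

Logit lens and tuned lens can be seen as the special case where the "target prompt" is just a direct readout through the unembedding, which is why they fall out as instances of this framework.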
    With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods' practical utility and efficacy. To address these issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework, enabling interventions on interpretable features that can be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics, intervention success rate and the coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) while current methods allow for intervention, their effectiveness is inconsistent across features and models, (2) lens-based methods outperform SAEs and probes in achieving simple, concrete interventions, and (3) mechanistic interventions often compromise model coherence, underperforming simpler alternatives such as prompting, highlighting a critical shortcoming of current interpretability approaches in applications requiring control.
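A toy sketch of the abstract encoder-decoder intervention loop described here: encode a latent into interpretable features, edit one feature, and decode the edit back into latent space. The random linear map and its pseudo-inverse below are stand-ins for a lens, probe, or sparse autoencoder; dimensions and the edited feature are arbitrary assumptions.

```python
# Encoder-decoder intervention sketch on toy tensors (no real LLM involved).
import torch

torch.manual_seed(0)
d_model, d_feat = 64, 16

encoder = torch.randn(d_feat, d_model) / d_model ** 0.5   # latent -> interpretable features
decoder = torch.linalg.pinv(encoder)                      # features -> latent (approximate inverse)

h = torch.randn(d_model)              # a latent representation from some layer
features = encoder @ h                # "interpret" step

edited = features.clone()
edited[3] += 5.0                      # intervene on one feature (placeholder choice)

h_new = h + decoder @ (edited - features)   # map the feature edit back into latent space

# Intervention success rate would then ask whether downstream behavior changes as
# intended, and the coherence-intervention tradeoff whether generations stay fluent.
print("latent change norm:", (h_new - h).norm().item())
```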
    LLMs based on the transformer architecture can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks, which involve counting how many times a token in the vocabulary has appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size-limited transformer to implement this task. Our empirical results demonstrate the same phase transition in performance anticipated by the theoretical argument. These results underscore the importance of understanding how transformers solve even simple tasks.
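To convey the flavor of why state dimension matters for counting, here is a toy histogram construction: with one dimension per vocabulary item, summing (or uniformly attending over) one-hot token vectors recovers exact per-token counts, but this breaks down once the available dimension is too small. This is only an illustrative sketch, not the paper's exact transformer construction or its bound.

```python
# Toy counting-by-one-hot sketch.
import torch

vocab_size, context_len = 10, 32
tokens = torch.randint(0, vocab_size, (context_len,))

one_hot = torch.nn.functional.one_hot(tokens, vocab_size).float()   # (T, V)
counts_from_sum = one_hot.sum(dim=0)                  # exact counts via summation
counts_from_mean = one_hot.mean(dim=0) * context_len  # what uniform attention would recover

assert torch.equal(counts_from_sum,
                   torch.bincount(tokens, minlength=vocab_size).float())
print(counts_from_sum)
```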
    A common method to study deep learning systems is to create simplified representations, for example, using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the simplified model is faithful to the original model. Here, we illustrate an important caveat to this assumption: even if a simplified representation can accurately approximate the original model on the training set, it may fail to match its behavior out of distribution; the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, focusing on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and find clear patterns in the resulting representations. We then explicitly test how well these simplified proxy models match the original model's behavior on various out-of-distribution test sets. Generally, the simplified proxies are less faithful out of distribution. For example, in cases where the original model generalizes to novel structures or deeper depths, the simplified model may fail to generalize, or may generalize too well. We then show the generality of these results: even model simplifications that do not directly use data can be less faithful out of distribution, and other tasks can also yield generalization gaps. Our experiments raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
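A hedged sketch of the kind of faithfulness check the abstract describes: fit a low-rank (SVD) basis to a model's hidden states on in-distribution inputs, build a proxy that routes activations through that subspace, and compare proxy and full-model predictions on shifted inputs. The tiny MLP and Gaussian "OOD shift" below are illustrative assumptions; the paper uses Transformers on Dyck languages.

```python
# SVD-proxy faithfulness sketch on a toy network.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))

x_train = torch.randn(512, 20)            # "in distribution" inputs
x_ood = torch.randn(512, 20) * 3 + 2      # shifted "out of distribution" inputs

with torch.no_grad():
    h_train = model[1](model[0](x_train))              # hidden states on training data
    _, _, vh = torch.linalg.svd(h_train, full_matrices=False)
    basis = vh[:8]                                      # keep top-8 directions (assumption)

    def proxy_logits(x):
        h = model[1](model[0](x))
        h_low = (h @ basis.T) @ basis                   # project onto the simplified subspace
        return model[2](h_low)

    for name, x in [("in-dist", x_train), ("OOD", x_ood)]:
        full = model(x).argmax(-1)
        prox = proxy_logits(x).argmax(-1)
        print(name, "agreement:", (full == prox).float().mean().item())
```

A drop in agreement on the OOD split is precisely the "illusion" the abstract warns about: the simplified proxy explains the training behavior while mispredicting the model in novel situations.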
    Post Hoc Explanations of Language Models Can Improve Language Models
    Satyapriya Krishna
    Jiaqi Ma
    Dylan Z Slack
    Sameer Singh
    Himabindu Lakkaraju
    NeurIPS 2023 (2023)
    Large Language Models (LLMs) have shown remarkable capabilities in performing complex tasks, excelling at in-context learning, and providing step-by-step reasoning. However, incorporating human-annotated rationales such as Chain-of-Thought for enhancing model performance faces challenges in scalability and can sometimes adversely affect performance. In this work, we present a novel approach, AMPLIFY: Advanced Model Performance Leveraging In-Context Learning with Post Hoc Explanations, which addresses these challenges by replacing human-annotated rationales with rationales generated automatically using post hoc explanation methods. Post hoc explanation techniques have gained popularity for determining attribution scores for input features in model predictions, deepening our understanding of model behavior and helping pinpoint errors in complex models. We leverage these explanations to provide corrective signals to large language models, reducing prediction errors and augmenting in-context learning with automatically generated rationales. Our findings demonstrate that AMPLIFY yields performance improvements of 10-25% across a wide range of tasks, including those where prompting techniques such as Chain-of-Thought, which rely on human-annotated explanations, fall short. This highlights the potential of post hoc explanation methods as a valuable tool for enhancing the efficiency and effectiveness of large language models on various tasks. Furthermore, we conduct an extensive empirical analysis examining the impact of and improvements attributed to each step of AMPLIFY, offering critical insights for refining in-context learning while addressing the limitations posed by methods dependent on human-annotated rationales.
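A hedged sketch of the prompt-construction idea behind this approach: attribution scores from a post hoc explainer are turned into a short, automatically generated rationale and spliced into the few-shot prompt. The scores, helper name, and rationale template below are placeholders, not the paper's exact pipeline; in practice the scores come from post hoc explanation methods applied to a proxy model.

```python
# Illustrative construction of an explanation-augmented in-context prompt.
def rationale_from_attributions(tokens, scores, top_k=2):
    """Turn per-token attribution scores into a one-line rationale (hypothetical helper)."""
    ranked = sorted(zip(tokens, scores), key=lambda ts: -ts[1])
    keywords = ", ".join(t for t, _ in ranked[:top_k])
    return f"The key words: {keywords} are important clues for the prediction."

example_text = "the movie was painfully slow and dull"
tokens = example_text.split()
scores = [0.1, 0.2, 0.05, 0.9, 0.3, 0.15, 0.85]   # placeholder attribution scores

prompt = (
    f"Review: {example_text}\n"
    f"{rationale_from_attributions(tokens, scores)}\n"
    "Sentiment: negative\n\n"
    "Review: a warm, funny, and moving film\n"
    "Sentiment:"
)
print(prompt)
```

The resulting prompt plays the role that a human-written Chain-of-Thought demonstration would otherwise play, but it can be generated automatically for any labeled example.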
    In 2021, researchers made a striking discovery while training a series of tiny models on toy tasks [1]. They found a set of models that suddenly flipped from memorizing their training data to correctly generalizing on unseen inputs after training for much longer. This phenomenon, where generalization seems to happen abruptly and long after fitting the training data, is called grokking and has sparked a flurry of interest [2, 3, 4, 5, 6]. Do more complex models also suddenly generalize after they're trained longer? Large language models can certainly seem like they have a rich understanding of the world, but they might just be regurgitating memorized bits of the enormous amount of text they've been trained on [7, 8]. How can we tell if they're generalizing or memorizing? In this article we'll examine the training dynamics of a tiny model and reverse engineer the solution it finds, and in the process provide an illustration of the exciting emerging field of mechanistic interpretability [9, 10]. While it isn't yet clear how to apply these techniques to today's largest models, starting small makes it easier to develop intuitions as we progress towards answering these critical questions about large language models.
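For readers who want to poke at grokking themselves, here is a small sketch of the standard toy setup in the spirit of these experiments: a tiny network trained on a fraction of all (a + b) mod p pairs, with train and test accuracy logged so the memorization-then-generalization gap can be observed. The architecture, data split, and hyperparameters are illustrative assumptions, and actual grokking typically requires far more steps than shown here.

```python
# Tiny modular-addition training loop (illustrative grokking-style setup).
import torch

torch.manual_seed(0)
p = 23
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

embed = torch.nn.Embedding(p, 32)
mlp = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, p))
params = list(embed.parameters()) + list(mlp.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)  # strong weight decay is often key

def accuracy(idx):
    x = embed(pairs[idx]).flatten(1)          # concatenate the two operand embeddings
    return (mlp(x).argmax(-1) == labels[idx]).float().mean().item()

for step in range(2001):
    opt.zero_grad()
    x = embed(pairs[train_idx]).flatten(1)
    loss = torch.nn.functional.cross_entropy(mlp(x), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            print(step, "train acc", round(accuracy(train_idx), 3),
                  "test acc", round(accuracy(test_idx), 3))
```

Watching the gap between the two curves close (or not) is the starting point for the reverse-engineering described in the article.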