Craig Boutilier

Craig Boutilier

Craig Boutilier is Principal Scientist at Google. He works on various aspects of decision making under uncertainty, with a current focus on sequential decision models: reinforcement learning, Markov decision processes, temporal models, etc.

Positions and Appointments:
He was a Professor in the Department of Computer Science at the University of Toronto (on leave) and Canada Research Chair in Adaptive Decision Making for Intelligent Systems. He received his Ph.D. in Computer Science from the University of Toronto in 1992, and worked as an Assistant and Associate Professor at the University of British Columbia from 1991 until his return to Toronto in 1999. He served as Chair of the Department of Computer Science at Toronto from 2004-2010. He was co-founder (with Tyler Lu) of Granata Decision Systems from 2012-2015, until his move to Google in 2015.

Boutilier was a consulting professor at Stanford University from 1998-2000, an adjunct professor at the University of British Columbia from 1999-2010, and a visiting professor at Brown University in 1998, at the University of Toronto in 1997-98, at Carnegie Mellon University in 2008-09, and at Université Paris-Dauphine (Paris IX) in the spring of 2011. He served on the Technical Advisory Board of CombineNet, Inc. from 2001 to 2010.

Research:
Boutilier's current research efforts focus on various aspects of decision making under uncertainty, including the use of generative models and LLMs, in areas such as: recommender systems, preference modeling and elicitation, mechanism design, game theory and multiagent decision processes, economic models, social choice, computational advertising, Markov decision processes, reinforcement learning and probabilistic inference. His research interests have spanned a wide range of topics, from knowledge representation, belief revision, default reasoning, and philosophical logic, to probabilistic reasoning, decision making under uncertainty, multiagent systems, and machine learning.

Research & Academic Service:
Boutilier is a past Editor-in-Chief of the Journal of Artificial Intelligence Research (JAIR). He was a past Associate Editor with the ACM Transactions on Economics and Computation (TEAC), the Journal of Artificial Intelligence Research (JAIR), the Journal of Machine Learning Research (JMLR), and Autonomous Agents and Multiagent Systems (AAMAS); and he has sat on the editorial/advisory boards of several other journals. Boutilier has organized several international conferences and workshops, including his work as Program Chair of the Twenty-first International Joint Conference on Artificial Intelligence (IJCAI-09) and Program Chair of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-2000). He has also served on the conference program committees of roughly 75 leading international conferences.

He will serve as Conference Chair of the Thirty-seventh International Joint Conference on Artificial Intelligence (IJCAI-28).

Awards and Honors:
Boutilier is a Fellow of the Royal Society of Canada (RSC), the Association for Computing Machinery (ACM) and the Association for the Advancement of Artificial Intelligence (AAAI). He was the recipient of the 2018 ACM/SIGAI Autonomous Agents Research Award, He was awarded a Tier I Canada Research Chair, an Isaac Walton Killam Research Fellowship, and an IBM Faculty Award. He received the Killam Teaching Award from the University of British Columbia in 1997. He has also received a number of Best Paper awards including:

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
    Guy Tennenholtz
    Jihwan Jeong
    Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026), pp. 5270-5304
    Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
    ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
    Guy Tennenholtz
    Jihwan Jeong
    The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026)
    Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
    Asking Clarifying Questions for Preference Elicitation with Large Language Models
    Ali Montazer
    Guy Tennenholtz
    1st Workshop on Next Generation of IR and Recommender Systems with Language Agents, Generative Models, and Conversational AI (GENNEXT@SIGIR'25), Padua, IT (2025)
    Preview abstract Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history. One way to get more information is to present clarifying questions to the user. However, generating effective sequential clarifying questions across various domains remains a challenge, as even advanced LLMs still struggle with this task. To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences. Our method follows a two-stage process inspired by diffusion models: starting from a user profile, in a forward process we generate clarifying questions, obtain answers, and then remove the corresponding information from the user profile, which is analogous to adding noise to the user profile. In the reverse process, zour model learns to “denoise” the user profile by learning to ask effective clarifying questions. Our results show that our method significantly boosts the LLM’s proficiency in asking funnel questions and elicit user preferences effectively. View details
    Preview abstract We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems. View details
    Preview abstract Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history. One way to get more information is to present clarifying questions to the user. However, generating effective sequential clarifying questions across various domains remains a challenge, as even advanced LLMs still struggle with this task. To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences. Our method follows a two-stage process inspired by diffusion models: starting from a user profile, in a forward process we generate clarifying questions, obtain answers, and then remove the corresponding information from the user profile, which is analogous to adding noise to the user profile. In the reverse process, zour model learns to “denoise” the user profile by learning to ask effective clarifying questions. Our results show that our method significantly boosts the LLM’s proficiency in asking funnel questions and elicit user preferences effectively. View details
    Preference Adaptive and Sequential Text-to-Image Generation
    Ofir Nabati
    Guy Tennenholtz
    Moonkyung Ryu
    Sean Li
    42nd International Conference on Machine Learning (ICML-25), Vancouver (2025), pp. 45362-45394
    Preview abstract We consider the problem of sequential text-to-image generation. Specifically, we formulate a personalized interactive framework, where an agent iteratively improves a user's prompt through a series of sequential prompt expansions. We formulate the problem as a sequential decision-making task. Using human raters, we create a dataset of sequential preferences for this problem. We then leverage our sequential data, together with large-scale open-source non-sequential datasets to construct user-preference and user-choice models. Particularly, we employ an EM strategy to develop a personalized sequential user model. We then leverage a multi-modal large language model (MM-LLM) and a value-based reinforcement learning (RL) agent to suggest a personalized and diverse slate of prompt expansions to the user. Our Personalized And Sequential Text-to-image Agent (PASTA) empowers diffusion models with personalized multi-turn capabilities, fostering collaborative co-creation, and addressing uncertainties or under-specifications in user intent. We evaluate our agent using human raters, showing significant improvement compared to baseline methods. We also release our sequential rater dataset and additional simulated data of user-agent interactions to advance future research in personalized multi-turn text-to-image generation. View details
    Preview abstract We address the problem of interactive text-toimage (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with largescale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varyinguser preference types. We then leverage a large multimodal language model (LMM) and a valuebased RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-toimage Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user’s intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems. View details
    Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER)
    Moonkyung Ryu
    Mohammad Ghavamzadeh
    GENNEXT@SIGIR’25: The 1st Workshop on Next Generation of IR and Recommender Systems with Language Agents, Generative Models, and Conversational AI (2025)
    Preview abstract Large language models (LLM), despite their success in conducting natural conversations and solving various challenging NLP tasks, may not aim to understand and solicit a user’s preferences and suggest recommendations using the learned user preferences through the interactions. One primary challenge for pursuing this task is the lack of rich conversational recommendation data sets (of user/agent dialog conversations) that contain diverse preference elicitation scenarios on top of item recommendations. While several standard conversational recommender datasets do contain dialogs that the agents query users for preferences, they are often restricted to the use of particular key-phrases that limit users’ preference expressions w.r.t. particular item features (e.g., location, price, rating, etc). On the other hand, hiring human raters to create a personalized conversational recommender dataset that consists of rich preference queries (with the use of various abstract preference-describing attributes) and recommendations is a very expensive, error-prone, and time-consuming process because it requires data collection on numerous personalized recommendation scenarios, where each of them may involve exponential possibilities of preference elicitation interactions (with many different queries). We propose a synthetic data generation methodology that utilizes a synthetic recommendation dialog simulator that generates templatized dialogs and uses Gemini Ultra LLM to in-paint the templatized dialogs so that the dialogs become linguistically more natural. The simulator generates recommendation dialogs that align to the user preferences that are represented by embeddings. View details
    Demystifying Embedding Spaces using Large Language Models
    Guy Tennenholtz
    Jihwan Jeong
    Lior Shani
    Aza Tulepbergenov
    Martin Mladenov
    The Twelfth International Conference on Learning Representations (2024)
    Preview abstract Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. View details
    Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors
    Christina Göpfert
    Alex Haig
    Ivan Vendrov
    Tyler Lu
    Hubert Pham
    Mohammad Ghavamzadeh
    ACM Transactions on Recommender Systems (2024)
    Preview abstract Interactive recommender systems have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional recommender systems (e.g., clicks, item consumption, ratings). They allow users to express intent, preferences, constraints, and contexts in a richer fashion, often using natural language (including faceted search and dialogue). Yet more research is needed to find the most effective ways to use this feedback. One challenge is inferring a user's semantic intent from the open-ended terms or attributes often used to describe a desired item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) (Kim, et al., 2018) a recently developed approach for model interpretability in machine learning, we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in recommender systems. One novel feature of our approach is its ability to distinguish objective and subjective attributes (both subjectivity of degree and of sense), and associate different senses of subjective attributes with different users. We demonstrate on both synthetic and real-world data sets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations through interactive item critiquing. View details
    ×