Tal Schuster
Tal Schuster is a Staff Research Scientist at Google AI working on Machine Learning and Natural Language Processing. He is developing robust and efficient models that leverage uncertainty-aware methods, and focusing on information-seeking applications.
For more details and full list of publications see his Personal Website and Google Scholar profile.
Authored Publications
Sort By
SEMQA: Semi-Extractive Multi-Source Question Answering
Haitian Sun
NAACL (2024) (to appear)
Preview abstract
Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge.
In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing between factual quoted spans---copied verbatim from given input sources---and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities.
View details
Preview abstract
We extend conformal prediction to control the expected value of any monotone loss function. The
algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal
prediction, the conformal risk control procedure is tight up to an O(1/n) factor. Worked examples from
computer vision and natural language processing demonstrate the usage of our algorithm to bound the
false negative rate, graph distance, and token-level F1-score.
View details
Preview abstract
Recent efforts to address hallucinations in Large Language Models (LLMs) have focused on attributed text generation, which supplements generated texts with citations of supporting sources for post-generation fact-checking and corrections. Yet, these citations often point to entire documents or paragraphs, burdening users with extensive verification work. In this paper, we introduce a locally-attributable text generation approach, prioritizing concise attributions. Our method, named ``Attribute First, then Generate'', breaks down the conventional end-to-end generation process into three intuitive steps: content selection, sentence planning, and sequential sentence generation. By initially identifying relevant source segments (``select first'') and then conditioning the generation process on them (``then generate''), we ensure these segments also act as the output's fine-grained attributions (``select'' becomes ``attribute''). Tested on Multi-document Summarization and Long-form Question-answering, our method not only yields more concise citations than the baselines but also maintains - and in some cases enhances - both generation quality and attribution accuracy. Furthermore, it significantly reduces the time required for fact verification by human assessors.
View details
Conformal Language Modeling
Victor Quach
Adam Fisch
Adam Yala
Jae Ho Sohn
Tommi Jaakkola
Regina Barzilay
ICLR (2024)
Preview abstract
In this paper, we propose a novel approach to conformal prediction (CP) that is adapted to generative, large language models (LLMs). Conformal prediction is a popular technique for deriving prediction sets from machine learning models that have rigorous, statistical performance guarantees. We extend conformal techniques to a broad class of language models that sample from a conditional distribution over the combinatorial, unbounded space of possible text outputs, given some input prompt. Specifically, we translate the process of constructing prediction sets into calibrating a \emph{stopping rule}, under which we draw diverse samples from our model until we are confident that the growing set of candidate answers includes at least one high-quality response. At the same time, we calibrate a \emph{rejection rule} to selectively discard low-quality or redundant responses to reduce sample noise. Under minimal assumptions, we theoretically prove that our resulting output sets contain at least one high-quality answer with some desired probability that a user can set (such as $90\%$), while still remaining empirically precise on average. Furthermore, within this set of sampled candidate answers, we show that we can also accurately identify subsets of individual components (e.g., phrases or sentences) that are each independently correct (e.g., that are not ``hallucinations'')---again, with provably high probability. We demonstrate the effectiveness of our approach on multiple types of large language models applied to tasks in open-domain question answering, text summarization, and radiology report generation.
View details
UL2: Unifying Language Learning Paradigms
Yi Tay
Xavier Garcia
Jason Wei
Hyung Won Chung
Steven Zheng
Neil Houlsby
ICLR (2023)
Preview abstract
Existing pre-trained models are generally geared towards a particular class of
problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for
pre-training models that are universally effective across datasets and setups. We
begin by disentangling architectural archetypes with pre-training objectives – two
concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training
objectives can be cast as one another and how interpolating between different
objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pretraining objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning
is associated with specific pre-training schemes. We conduct extensive ablative
experiments to compare multiple pre-training objectives and find that our method
pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across
multiple diverse setups. Finally, by scaling our model up to 20B parameters, we
achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language
understanding, text classification, question answering, commonsense reasoning,
long text reasoning, structured knowledge grounding and information retrieval.
Our model also achieve strong results at in-context learning, outperforming 175B
GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on oneshot summarization. Finally, we show that UL2 20B works well with chain-ofthought prompting and reasoning tasks, making it an appealing choice for research
into reasoning at a small to medium scale of 20B parameters. We publicly release
Flax-based T5X model checkpoints for the 20B model.
View details
Preview abstract
Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be seen as a sequence of related segments (e.g., the sequence of sentences within a passage, or the hypothesis and premise in NLI). While attending across these segments is highly beneficial for many tasks, we hypothesize that this interaction can be delayed until later encoding stages. To this end, we introduce Layer-adjustable Interactions in Transformers (LAIT). Within LAIT, segmented inputs are first encoded independently, and then jointly. This partial two-tower architecture bridges the gap between a Dual Encoder's ability to pre-compute representations for segments and a fully self-attentive Transformer's capacity to model cross-segment attention. Also, LAIT can be introduced only when finetuning, effectively converting an existing pretrained Transformer into the hybrid of the two aforementioned architectures, and providing an intuitive control over the performance-efficiency tradeoff. Experimenting on a wide range of NLP tasks, we find LAIT to significantly improve efficiency while preserving accuracy.
View details
Preview abstract
The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e. whether the entirety of its meaning can be inferred from the other. In current NLI corpora and models, the textual entailment relation is typically defined on the sentence- or paragraph- level. However, even a simple sentence often contains multiple propositions, i.e. distinct units of meaning conveyed by the sentence. These propositions can carry different truth values in the context of a given premise, and we argue for the need to identify such fine-grained textual entailment relations.
To facilitate the study on proposition-level segmentation and entailment, we propose PropSegmEnt, a corpus of over 35K propositions annotated by trained expert annotators.
Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity.
We establish strong baselines for the segmentation and entailment tasks.
We demonstrate that our conceptual framework is potentially useful for understanding and explaining the compositionality of NLI labels.
View details
Preview abstract
The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with predefined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures.
To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question.
Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and to more accurately reflect the performance of systems.
Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to ×2.6.
View details
Preview abstract
Accurate estimates of uncertainty are important for many difficult or sensitive prediction tasks in natural language processing (NLP). Though large-scale pre-trained models have vastly improved the accuracy of applied machine learning models throughout the field, there still are many instances in which they fail. The ability to precisely quantify uncertainty while handling the challenging scenarios that modern models can face when deployed in the real world is critical for reliable, consequential-decision making. This tutorial is intended for both academic researchers and industry practitioners alike, and provides a comprehensive introduction to uncertainty estimation for NLP problems---from fundamentals in probability calibration, Bayesian inference, and confidence set (or interval) construction, to applied topics in modern out-of-distribution detection and selective inference.
Tutorial website
View details