Jennimaria Palomaki
Authored Publications
QED: A Linguistically Principled Framework for Explainable Question Answering
Eunsol Choi
TACL (2021)
A question answering system that, in addition to providing an answer, provides an explanation of the reasoning that leads to that answer has potential advantages in terms of debuggability, extensibility, and trust. To this end, we propose QED, a linguistically informed, extensible framework for explanations in question answering. A QED explanation specifies the relationship between a question and answer according to formal semantic notions such as referential equality, sentencehood, and entailment. We describe and publicly release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset, and report baseline models on two tasks: post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, a promising result suggests that training on a relatively small amount of QED data can improve question answering. In addition to describing the formal, language-theoretic motivations for the QED approach, we describe a large user study showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.
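To make the structure of a QED explanation concrete, the sketch below models the notions the abstract names (a selected sentence, referential equality links between question and passage mentions, and the entailed answer) as a small Python data structure. The class and field names are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative sketch only: names are assumptions, not the released QED schema.

@dataclass
class ReferentialEqualityLink:
    """Marks a question span and a passage span as referring to the same entity."""
    question_span: Tuple[int, int]  # character offsets within the question
    passage_span: Tuple[int, int]   # character offsets within the selected sentence

@dataclass
class QEDExplanation:
    """A selected sentence, its alignment links to the question, and the answer it entails."""
    selected_sentence: str
    referential_links: List[ReferentialEqualityLink] = field(default_factory=list)
    answer_span: Tuple[int, int] = (0, 0)  # answer offsets within the selected sentence
```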
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Eunsol Choi
Transactions of the Association for Computational Linguistics (2020)
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages. Until recently, most multilingual research in natural language processing has been limited to machine translation or to technical tasks such as tagging and parsing. Question answering offers a scenario that is natural in that non-technical users intuitively understand the task, allowing high-quality data collection in the absence of abundant annotators with expertise in both linguistics and the language of interest. This allows us to select languages that are diverse with regard to their typology -- the set of linguistic features that each language expresses. We expect that models that can perform well on this set will generalize across a large number of the languages in the world. To encourage a more realistic distribution, the data is collected entirely in each native language without the use of translation (human or otherwise) and question creation is performed without seeing the answers. We present a quantitative analysis of the data quality, we provide example-level linguistic analyses and glosses of language phenomena that would not be found in English-only corpora, and we measure the performance of baseline systems.
New Protocols and Negative Results for Textual Entailment Data Collection
Sam Bowman
Emily Blythe Pitler
EMNLP 2020 - Conference on Empirical Methods in Natural Language Processing
Natural language inference (NLI) data has proven useful in benchmarking and, especially, as pretraining data for tasks requiring language understanding. However, the crowdsourcing protocol that was used to collect this data has known issues and was not explicitly optimized for either of these purposes, so it is likely far from ideal. We propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a fifth baseline protocol, we collect and compare five new 8.5k-example training sets. In evaluations focused on transfer learning applications, our results are solidly negative, with models trained on our baseline dataset yielding good transfer performance to downstream tasks, but none of our four new methods (nor the recent ANLI) showing any improvements over that baseline. In a small silver lining, we observe that all four new protocols, especially those where annotators edit pre-filled text boxes, reduce previously observed issues with annotation artifacts.
Natural Questions: a Benchmark for Question Answering Research
Olivia Redfield
Danielle Epstein
Illia Polosukhin
Matthew Kelcey
Jacob Devlin
Llion Jones
Ming-Wei Chang
Jakob Uszkoreit
Transactions of the Association for Computational Linguistics (2019)
We present the Natural Questions corpus, a question answering dataset. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 5-way annotated examples sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
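The annotation scheme described above (a long answer, optional short answers, or a null marking) can be sketched as a small record type. The field names here are assumptions made for illustration, not the official release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative sketch of the annotation scheme described above; field names are
# assumptions, not the official Natural Questions release format.

@dataclass
class NQAnnotation:
    long_answer: Optional[Tuple[int, int]] = None  # paragraph span on the page, or None
    short_answers: List[Tuple[int, int]] = field(default_factory=list)  # entity spans, possibly empty

def answer_type(annotation: NQAnnotation) -> str:
    """Classify a single annotation as 'null', 'long_only', or 'long_and_short'."""
    if annotation.long_answer is None:
        return "null"
    return "long_and_short" if annotation.short_answers else "long_only"

print(answer_type(NQAnnotation(long_answer=(120, 480), short_answers=[(200, 215)])))  # long_and_short
```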
A Case for a Range of Acceptable Annotations
Olivia Rhinehart
Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, AAAI (HCOMP 2018) (2018)
Multi-way annotation is often used to ensure data quality in crowdsourced annotation tasks. Each item is annotated redundantly and the contributors’ judgments are converted into a single “ground truth” label or more complex annotation through a resolution technique (e.g., on the basis of majority or plurality). Recent crowdsourcing research has argued against the notion of a single “ground truth” annotation for items in semantically oriented tasks—that is, we should accept the aggregated judgments of a large pool of crowd contributors as “crowd truth.” While we agree that many semantically oriented tasks are inherently subjective, we do not go so far as to trust the judgments of the crowd in all cases. We recognize that there may be items for which there is truly only one acceptable response, and that there may be divergent annotations that are truly of unacceptable quality. We propose that there exists a class of annotations between these two categories that exhibit acceptable variation, which we define as the range of annotations for a given item that meet the standard of quality for a task. We illustrate acceptable variation within existing annotated data sets, including a labeled sound corpus and a medical relation extraction corpus. Finally, we explore the implications of acceptable variation on annotation task design and annotation quality evaluation.
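As a rough illustration of the contrast drawn here, the sketch below compares collapsing redundant judgments into a single plurality label with keeping every label that clears a task-specific acceptance threshold. The threshold and the example judgments are made up for illustration; this is not the paper's procedure.

```python
from collections import Counter
from typing import Iterable, Set

# Minimal sketch contrasting single-label resolution with a range of acceptable labels.
# The 0.2 threshold and the example judgments are arbitrary illustrations.

def plurality_label(labels: Iterable[str]) -> str:
    """Collapse redundant judgments into one 'ground truth' label by plurality."""
    return Counter(labels).most_common(1)[0][0]

def acceptable_labels(labels: Iterable[str], min_share: float = 0.2) -> Set[str]:
    """Keep every label chosen by at least min_share of contributors."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label for label, n in counts.items() if n / total >= min_share}

judgments = ["dog_bark", "dog_bark", "animal_sound", "dog_bark", "animal_sound"]
print(plurality_label(judgments))    # -> 'dog_bark'
print(acceptable_labels(judgments))  # -> {'dog_bark', 'animal_sound'}
```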