Bernd Bohnet
Authored Publications
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Alon Jacovi
Or Honovich
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024), pp. 4615–4634
Preview abstract
Prompting language models to provide step-by-step answers (e.g., “Chain-of-Thought”) is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning chains in order to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model’s answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains, in particular at verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/.
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Alon Jacovi
Findings of EMNLP (2023)
Preview abstract
A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work -- incurring additional costs by orders of magnitude -- which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
Preview abstract
Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly, which simplifies coreference resolution by eliminating the search for both mentions and coreference links. We implement the coreference system as a transition system and use multilingual T5 as the language model. We obtain state-of-the-art accuracy with an F1-score of 83.3 on the CoNLL-2012 data set. We use the SemEval-2010 data sets to evaluate on languages other than English and get substantially higher zero-shot F1-scores for 3 out of 4 languages than previous approaches and significantly exceed previous supervised state-of-the-art results for all five tested languages.
Preview abstract
This work explores techniques to predict Part-of-Speech (PoS) tags from neural signals measured at millisecond resolution with electroencephalography (EEG) during text reading. We show that information about word length, frequency and word class is encoded by the brain at different post-stimulus latencies. We then demonstrate that pretraining on averaged EEG data and data augmentation techniques boost PoS single-trial EEG decoding accuracy for Transformers (but not linear SVMs). Applying optimised temporally-resolved decoding techniques, we show that Transformers outperform linear SVMs on PoS tagging of unigram and bigram data more strongly when information requires integration across longer time windows.
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Preview abstract
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
Preview abstract
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing, concerned with identifying spans of text expressing references to entities. NER research is often focused on flat entities only (flat NER), ignoring the fact that entity references can be nested, as in [Bank of [China]] (Finkel and Manning, 2009). In this paper, we use ideas from graph-based dependency parsing to provide our model a global view on the input via a biaffine model (Dozat and Manning, 2017). The biaffine model scores pairs of start and end tokens in a sentence which we use to explore all spans, so that the model is able to predict named entities accurately. We show that the model works well for both nested and flat NER through evaluation on 8 corpora and achieving SoTA performance on all of them, with accuracy gains of up to 2.2 percentage points.
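The idea of scoring every (start, end) token pair can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-dimensional toy vectors, the hand-picked weights `U`, `w`, `b`, the helper names, and the simple threshold. It omits entity categories and any conflict resolution between overlapping spans, and it is not the paper's implementation.

```python
# Illustrative sketch only: toy vectors and hand-picked weights (all hypothetical).


def biaffine_score(start_vec, end_vec, U, w, b):
    """Score one (start, end) token pair: s^T U e + w . [s; e] + b."""
    bilinear = sum(
        start_vec[i] * U[i][j] * end_vec[j]
        for i in range(len(start_vec))
        for j in range(len(end_vec))
    )
    linear = sum(wi * xi for wi, xi in zip(w, start_vec + end_vec))
    return bilinear + linear + b


def score_all_spans(start_reprs, end_reprs, U, w, b):
    """Return {(i, j): score} for every span whose start i is <= its end j."""
    n = len(start_reprs)
    return {
        (i, j): biaffine_score(start_reprs[i], end_reprs[j], U, w, b)
        for i in range(n)
        for j in range(i, n)
    }


def select_spans(span_scores, threshold=0.0):
    """Keep spans scoring above the threshold, highest score first."""
    kept = [(span, s) for span, s in span_scores.items() if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


tokens = ["Bank", "of", "China"]
# Hypothetical "start" and "end" representations, one vector per token.
start_reprs = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
end_reprs = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
w = [0.0, 0.0, 0.0, 0.0]
b = -0.5

spans = select_spans(score_all_spans(start_reprs, end_reprs, U, w, b))
for (i, j), score in spans:
    print(" ".join(tokens[i : j + 1]), round(score, 2))
```

With these toy values, both the outer span "Bank of China" and the nested span "China" score above the threshold, which is the kind of nested output a flat tagger cannot represent.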
On Faithfulness and Factuality in Abstractive Summarization
Ryan Thomas Mcdonald
Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics (ACL) (2020)
Preview abstract
It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models are fundamentally flawed and lead to dull and repetitive responses. We found that these models, when tested on abstractive summarization, are highly prone to hallucinate content that is either unfaithful to the input document, completely irrelevant or gibberish. We conduct a large-scale human evaluation of several state-of-the-art neural abstractive summarization systems, including pretrained language models, to better understand the types of hallucinations. Furthermore, we study the extent to which the hallucinated content (i) co-occurs with common linguistic irregularities such as repetition and incoherence, and (ii) can be measured by NLU measures such as textual entailment, question answering and OpenIE-based fact checking.
A Gold Standard Dependency Treebank for Turkish
Proceedings of The 12th Language Resources and Evaluation Conference, European Language Resources Association (2020), pp. 5156-5163
Preview abstract
We introduce TWT, a new treebank for Turkish which consists of web and Wikipedia sentences that are annotated for segmentation, morphology, part-of-speech and dependency relations. To date, it is the largest publicly available human-annotated morpho-syntactic Turkish treebank in terms of the annotated word count. It is also the first large Turkish dependency treebank that has a dedicated Wikipedia section. We present the tagsets and the methodology that are used in annotating the treebank and also the results of the baseline experiments on Turkish dependency parsing with this treebank.
Recursive LSTM Tree Representation for Arc-Standard Transition-Based Dependency Parsing
Mohab El-karef
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019) (2019)
Preview abstract
We propose a method to represent dependency trees as dense vectors through the recursive application of Long Short-Term Memory networks to build Recursive LSTM Trees (RLTs). We show that the dense vectors produced by Recursive LSTM Trees replace the need for structural features by using them as feature vectors for a greedy Arc-Standard transition-based dependency parser. We also show that RLTs have the ability to incorporate useful information from the bi-LSTM positional representation used by Cross and Huang (2016) and Kiperwasser and Goldberg (2016). The resulting dense vectors are able to express both structural information relating to the dependency tree, as well as sequential information relating to the position in the sentence. The resulting parser only requires the vector representations of the top two items on the parser stack, which is, to the best of our knowledge, the smallest feature set ever published for Arc-Standard parsers to date, while still managing to achieve competitive results.
82 Treebanks, 34 Models: Universal Dependency Parsing with Cross-Treebank Models
Aaron Smith
Joakim Nivre
Miryam de Lhoneux
Sara Stymne
Yan Shao
Conference on Computational Natural Language Learning (2018)
Preview abstract
We present the Uppsala system for the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies. Our system is a pipeline consisting of three components: the first performs joint word and sentence segmentation; the second predicts part-of-speech tags and morphological features; the third predicts dependency trees from words and tags. Instead of training a single parsing model for each treebank, we trained models with multiple treebanks for the same language or closely related languages, greatly reducing the number of models. On the official test run, we achieved a macro-averaged LAS F1 of 72.37 and a macro-averaged MLAS F1 of 59.20, ranking 7th of 27 teams for both of these metrics.
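For readers unfamiliar with the metric, a macro-averaged labeled attachment score (LAS) can be sketched as follows. This is a simplified illustration, not the shared task's evaluation script: it assumes system and gold tokenization are identical (in which case precision, recall, and F1 all equal plain accuracy), and the treebank names and toy annotations are hypothetical.

```python
def las(gold, predicted):
    """Fraction of tokens whose predicted (head, label) pair matches gold."""
    if len(gold) != len(predicted):
        raise ValueError("this sketch assumes identical tokenization")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def macro_average(score_by_treebank):
    """Unweighted mean over treebanks, so each treebank counts equally."""
    return sum(score_by_treebank.values()) / len(score_by_treebank)


# Hypothetical toy data: one (head index, dependency label) pair per token.
gold = {
    "treebank_a": [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")],
    "treebank_b": [(0, "root"), (1, "obj")],
}
predicted = {
    "treebank_a": [(2, "nsubj"), (0, "root"), (4, "det"), (2, "nmod")],
    "treebank_b": [(0, "root"), (1, "nsubj")],
}

scores = {name: las(gold[name], predicted[name]) for name in gold}
print(scores)                 # per-treebank LAS
print(macro_average(scores))  # macro-averaged LAS
```

Macro-averaging weights the small treebank as heavily as the large one: here the per-treebank scores are 0.75 and 0.5, giving 0.625, whereas pooling all six tokens would give 4/6.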