Felix Stahlberg

Felix Stahlberg

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences. View details
    Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
    Ilia Kulikov
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 8634-8645
    Preview abstract A widely used approach for neural machine translation (NMT) is to train an autoregressive model by maximizing the probability of training sentence pairs in conjunction with a mode-seeking decoding strategy for inference. The ultimate goal is to reduce the system error, i.e. to achieve a high translation quality of unseen sentences. However, this high-level perspective is oblivious to potential pitfalls within the training and decoding pipeline. In this work we propose to measure mode and search errors in addition to the system error in order to better understand the connections amongst them. We study how these errors change when we vary both the decoding strategy and the degree of sparsity of the learned distribution. First, we empirically confirm the high prevalence of modeling errors in NMT, and that the relation between search error and system error is highly non-monotonic. Second, we show that adding sparsity to the model can effectively reduce both mode and search error. Analyzing the mode translations shows that the qualitative improvements are partially due to better length modeling. However, the overall system error slowly increases as we make the decoder sparse suggesting that the current choice of decoding strategy can be further improved in the context of sparse models. View details
    Preview abstract Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen ninety five" in "born in 1995" or as "one thousand nine hundred ninety five" in "page 1995". We present an experimental comparison of various Transformer-based sequence-to-sequence (seq2seq) models of text normalization for speech and evaluate them on a variety of datasets of written text aligned to its normalized spoken form. These models include variants of the 2-stage RNN-based tagging/seq2seq architecture introduced by Zhang et al (2019) where we replace the RNN with a Transformer in one or more stages. We evaluate the performance when initializing the encoder with a pre-trained BERT model. We compare these model variants with a vanilla Transformer that outputs string representations of edit sequences. Of our approaches, using Transformers for sentence context encoding within the 2-stage model proved most effective, with the fine-tuned BERT model yielding the best performance. View details
    Preview abstract Text-editing models have recently become a prominent alternative to seq2seq models for monolingual natural language generation (NLG) tasks such as grammatical error correction, text simplification, and style transfer. These tasks exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this trait and learn to generate the output by predicting edit operations applied to the source sequence in contrast to seq2seq models that generate the output from scratch. Text-editing models provide several benefits over seq2seq models including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of the text-edit based approaches and current state-of-the-art models, analyzing the pros and cons of different methods. We discuss challenges related to productionization and how these models can to help mitigate hallucination and bias, both pressing challenges in the field of text generation. View details
    Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES
    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 4950-4961
    Preview abstract The softmax layer in neural machine translation is designed to model the distribution over mutually exclusive tokens. Machine translation, however, is intrinsically uncertain: the same source sentence can have multiple semantically equivalent translations. Therefore, we propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively. We call our loss function Single-label Contrastive Objective for Non-Exclusive Sequences (SCONES). We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function. SCONES yields consistent BLEU score gains across six translation directions, particularly for medium-resource language pairs and small beam sizes. By using smaller beam sizes and avoiding the expensive softmax partition function we can speed up inference by a factor of X without any degradation in BLEU score. Furthermore, we demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations, thus mitigating the "beam search curse". Additional experiments on synthetic language pairs with varying levels of uncertainty suggest that the improvements from SCONES can be attributed to better handling of ambiguity. View details
    Conciseness: An Overlooked Language Task
    Aashish Kumar
    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Abu Dhabi
    Preview abstract We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five raters, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with giant neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets. View details
    Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models
    Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications (2021)
    Preview abstract Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to realistically generate the broad range of grammatical errors made by human writers in practice. In this work, we use explicit error-type tags from automatic annotation tools like ERRANT to guide synthetic data generation. We compare several models that can produce ungrammatical sentences given a clean sentence and an error type tag, and use these models to build a new large synthetic pre-training set that matches the tag frequency distributions in a development set. Our synthetic data set yields large and consistent gains, leading to state-of-the-art performance on the BEA-test and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system that has been trained on mixed native and non-native English to a native English test set, even surpassing real training data consisting of high-quality sentence pairs. View details
    Data Strategies for Low-Resource Grammatical Error Correction
    Simon Flachs
    Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, ACL, https://sig-edu.org/bea/current (2021)
    Preview abstract Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However for other low-resource languages the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high quality training data. In particular, we compare methods for generating artificial error data to train GEC systems, and show that these methods can benefit from including morphological errors. We then look into the usefulness of noisy error correction data gathered from Wikipedia and the language learning website Lang8, and demonstrate that despite their inherent noise, these are valuable data sources. Finally, we show that GEC systems pre-trained on the noisy data sources can be fine-tuned effectively using small amounts of high quality, human-annotated data. View details
    Neural Machine Translation: A Review
    Journal of Artificial Intelligence Research, 69 (2020), pp. 343-418
    Preview abstract The field of machine translation (MT), the automatic translation of written text from one natural language into another, has experienced a major paradigm shift in recent years. Statistical MT, which mainly relies on various count-based models and which used to dominate MT research for decades, has largely been superseded by neural machine translation (NMT), which tackles translation with a single neural network. In this work we will trace back the origins of modern NMT architectures to word and sentence embeddings and earlier examples of the encoder-decoder network family. We will conclude with a survey of recent trends in the field. View details
    Sequence Transduction Using Span-level Edit Operations
    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 5147-5159
    Preview abstract We propose an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. We represent sequence-to-sequence transduction as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We test our method on five NLP tasks (text normalization, sentence fusion, sentence splitting and rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. We show that our method has clear speed advantages over full sequence models for grammatical error correction because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, we associate each edit operation with a task-specific tag to improve explainability. View details