Markus Freitag

Authored Publications
    Collecting high-quality translations is crucial for the development and evaluation of machine translation systems. However, traditional human-only approaches are costly and slow. This study presents a comprehensive investigation of 11 approaches for acquiring translation data, including human-only, machine-only, and hybrid approaches. Our findings demonstrate that human-machine collaboration can match or even exceed the quality of human-only translations while being more cost-efficient. Error analysis reveals the complementary strengths of human and machine contributions, highlighting the effectiveness of collaborative methods. Cost analysis further demonstrates the economic benefits of human-machine collaboration, with some approaches achieving top-tier quality at around 60% of the cost of traditional methods. We release a publicly available dataset containing nearly 18,000 segments of varying translation quality with corresponding human ratings to facilitate future research.
    Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT’23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT’23 training dataset. We also find that performing self-distillation by finetuning the LLM which generated this dataset outperforms the LLM’s strong few-shot baseline. These findings corroborate the quality of our dataset and demonstrate the value of high-quality machine-generated data in improving the performance of NMT models.
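    A minimal sketch of the QE-reranking step mentioned in the abstract above: for each source segment, keep the candidate translation that a reference-free QE model scores highest. The `qe_score` callable and the toy heuristic are illustrative placeholders, not code or APIs from the paper.

```python
# Minimal sketch of QE reranking: pick, for each source segment, the candidate
# translation that a (hypothetical) reference-free QE model scores highest.
# `qe_score(source, hypothesis)` is a placeholder for any neural QE metric.
from typing import Callable, Sequence


def qe_rerank(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest QE score for this source."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))


# Toy usage with a stand-in scorer (a real system would call a neural QE model).
toy_score = lambda src, hyp: -abs(len(hyp) - len(src))  # placeholder heuristic
print(qe_rerank("ein kleiner Test", ["a small test", "test"], toy_score))
```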
    Findings of the WMT24 General Machine Translation Shared Task: The LLM Era is Here but MT is Not Solved Yet
    Tom Kocmi
    Eleftherios Avramidis
    Rachel Bawden
    Ondrej Bojar
    Anton Dvorkovich
    Christian Federmann
    Mark Fishel
    Thamme Gowda
    Roman Grundkiewicz
    Barry Haddow
    Marzena Karpinska
    Philipp Koehn
    Benjamin Marie
    Christof Monz
    Kenton Murray
    Masaaki Nagata
    Martin Popel
    Maja Popovic
    Mariya Shmatova
    Steinþór Steingrímsson
    Vilém Zouhar
    2024
    This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).
    Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
    Nitika Mathur
    Chi-kiu Lo
    Eleftherios Avramidis
    Ricardo Rei
    Brian Thompson
    Frédéric Blain
    Tom Kocmi
    Jiayi Wang
    David Adelani
    Marianna Buchicchio
    Chrysoula Zerva
    Alon Lavie
    2024
    The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Translation Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from the previous year. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric's ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system and segment levels. We present an extensive analysis of how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results confirm last year's findings: fine-tuned neural metrics remain strong, including for LLM-based translation systems.
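    A minimal sketch of the system-level pairwise-accuracy meta-evaluation mentioned in the abstract above: the fraction of system pairs that a metric orders the same way as the human scores. The data and names are illustrative, and both score types are assumed to be higher-is-better.

```python
# Minimal sketch of system-level pairwise accuracy: the fraction of system pairs
# that the metric orders the same way as the human (e.g. MQM-derived) scores.
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    systems = sorted(metric)
    pairs = list(combinations(systems, 2))
    # Count pairs where the metric difference and the human difference agree in sign.
    agree = sum(
        1 for a, b in pairs if (metric[a] - metric[b]) * (human[a] - human[b]) > 0
    )
    return agree / len(pairs)


# Toy usage: three systems scored by a metric and by human evaluation.
metric_scores = {"sysA": 0.81, "sysB": 0.77, "sysC": 0.74}
human_scores = {"sysA": 92.0, "sysB": 88.0, "sysC": 90.0}
print(pairwise_accuracy(metric_scores, human_scores))  # 2 of 3 pairs agree
```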
    Mitigating Metric Bias in Minimum Bayes Risk Decoding
    Proceedings of the Ninth Conference on Machine Translation (2024), pp. 1063-1094
    Minimum Bayes Risk (MBR) decoding has been shown to improve translation quality on both automatic metrics and human evaluations. In this paper we show that MBR decoding tends to yield larger improvements on the utility metric and closely related metrics than on unrelated metrics. To mitigate this metric bias, we explore MBR decoding with ensembles of multiple metrics as the utility function, as well as QE filtering followed by MBR decoding. Human evaluations show that using an ensemble of metrics improves quality over MBR or QE decoding with a single metric.
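    A minimal sketch of MBR decoding with an ensemble utility, in the spirit of the abstract above: each hypothesis is scored by a weighted average of several metrics, each computed against the other candidates acting as pseudo-references. The metric functions and weights here are illustrative placeholders, not the paper's actual setup.

```python
# Minimal sketch of MBR decoding with an ensemble utility function.
from typing import Callable, Sequence

Metric = Callable[[str, str], float]  # metric(hypothesis, reference) -> score


def mbr_decode_ensemble(
    candidates: Sequence[str],
    metrics: Sequence[Metric],
    weights: Sequence[float],
) -> str:
    """Pick the candidate with the highest weighted-ensemble expected utility."""
    def utility(hyp: str) -> float:
        total = 0.0
        for metric, w in zip(metrics, weights):
            # Expected utility: average metric score against the other candidates
            # acting as pseudo-references.
            refs = [c for c in candidates if c is not hyp]
            total += w * sum(metric(hyp, r) for r in refs) / max(len(refs), 1)
        return total

    return max(candidates, key=utility)


# Toy usage with two stand-in metrics (a real system would use neural metrics).
overlap = lambda h, r: float(len(set(h.split()) & set(r.split())))
neg_len_diff = lambda h, r: -abs(len(h) - len(r))
cands = ["the cat sat", "a cat sat down", "the cat is sitting"]
print(mbr_decode_ensemble(cands, [overlap, neg_len_diff], [1.0, 0.1]))
```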
    Reliable human evaluation is critical to the development of successful natural language generation models, but achieving it is notoriously difficult. Stability is a crucial requirement when ranking systems by quality: consistent ranking of systems across repeated evaluations is not just desirable, but essential. Without it, there is no reliable foundation for hill-climbing or product launch decisions. In this paper, we use machine translation and its state-of-the-art human evaluation framework, MQM, as a case study to understand how to set up reliable human evaluations that yield stable conclusions. We investigate the optimal configurations for item allocation to raters, number of ratings per item, and score normalization. Our study on two language pairs provides concrete recommendations for designing replicable human evaluation studies. We also collect and release the largest publicly available dataset of multi-segment translations rated by multiple professional translators, consisting of nearly 140,000 segment annotations across two language pairs.
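    A minimal sketch of one configuration choice the study above examines, score normalization: here, plain per-rater z-scoring so that differences in rater harshness do not dominate system comparisons. The data layout and the choice of z-scoring are assumptions for illustration, not necessarily the scheme the paper recommends.

```python
# Minimal sketch of per-rater score normalization (z-scoring), an assumed example
# of the "score normalization" configuration the study investigates.
from collections import defaultdict
from statistics import mean, pstdev

# (rater, system, segment_id, score) tuples; the scores here are illustrative.
ratings = [
    ("rater1", "sysA", 1, 2.0), ("rater1", "sysB", 1, 5.0),
    ("rater2", "sysA", 2, 0.0), ("rater2", "sysB", 2, 3.0),
]

# Collect each rater's scores, then z-score against that rater's own statistics.
by_rater = defaultdict(list)
for rater, _, _, score in ratings:
    by_rater[rater].append(score)
stats = {r: (mean(s), pstdev(s) or 1.0) for r, s in by_rater.items()}

normalized = [
    (rater, system, seg, (score - stats[rater][0]) / stats[rater][1])
    for rater, system, seg, score in ratings
]
print(normalized)
```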
    WMT23 Metrics shared task Submission: Quality Estimation using Minimum Bayes Risk
    Subhajit Naskar
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 806-811
    This report describes the Minimum Bayes Risk Quality Estimation (MBR-QE) submission to the Workshop on Machine Translation's 2023 Metrics Shared Task. MBR decoding with neural utility metrics (e.g. BLEURT) is known to be very effective at generating high-quality machine translations. We build on the underlying assumption of MBR decoding and develop an MBR-based reference-free quality estimation metric. Our method uses an evaluator machine translation system and a reference-based utility metric (BLEURT, MetricX) to calculate a quality estimation score for a model. We report results comparing different MBR configurations and utility metrics.
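    A minimal sketch of the MBR-based quality estimation idea described above: score a translation against samples from an evaluator MT system, used as pseudo-references, with a reference-based metric, and average. `sample_translations` and `ref_metric` are placeholders, not code from the submission.

```python
# Minimal sketch of MBR-QE: average a reference-based metric between the
# hypothesis and pseudo-references sampled from an evaluator MT system.
from typing import Callable, Sequence


def mbr_qe_score(
    source: str,
    hypothesis: str,
    sample_translations: Callable[[str, int], Sequence[str]],
    ref_metric: Callable[[str, str], float],
    num_samples: int = 8,
) -> float:
    pseudo_refs = sample_translations(source, num_samples)
    return sum(ref_metric(hypothesis, ref) for ref in pseudo_refs) / len(pseudo_refs)


# Toy usage with stand-ins for the evaluator system and the metric.
toy_sampler = lambda src, n: ["a small test"] * n
toy_metric = lambda hyp, ref: float(hyp == ref)
print(mbr_qe_score("ein kleiner Test", "a small test", toy_sampler, toy_metric))
```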
    There's No Data Like Better Data: Using QE Metrics for MT Data Filtering
    Jan-Thorsten Peter
    Mara Finkelstein
    Jurik Juraska
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 561-577
    Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen substantial improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems. While most corpus filtering methods focus on detecting noisy examples in collections of text, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training data size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches.
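    A minimal sketch of QE-based training-data filtering as described above: score every sentence pair with a QE model and keep the best-scoring half. The `qe_score` callable and the toy heuristic stand in for a neural QE metric.

```python
# Minimal sketch of QE filtering: rank sentence pairs by a (hypothetical) QE
# score and keep only the top fraction for NMT training.
from typing import Callable, List, Tuple


def filter_by_qe(
    pairs: List[Tuple[str, str]],
    qe_score: Callable[[str, str], float],
    keep_fraction: float = 0.5,
) -> List[Tuple[str, str]]:
    scored = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]


# Toy usage with a stand-in scorer (a real pipeline would call a neural QE model).
toy_qe = lambda src, tgt: -abs(len(src.split()) - len(tgt.split()))
data = [("guten morgen", "good morning"), ("hallo welt", "hello hello hello world")]
print(filter_by_qe(data, toy_qe))
```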
    Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
    Behrooz Ghorbani
    Patrick Fernandes
    Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 9198-9209
    Recent advances in machine translation (MT) have shown that Minimum Bayes Risk (MBR) decoding can be a powerful alternative to beam search decoding, especially when combined with neural-based utility functions. However, the performance of MBR decoding depends heavily on how and how many candidates are sampled from the model. In this paper, we explore how different sampling approaches for generating candidate lists for MBR decoding affect performance. We evaluate popular sampling approaches, such as ancestral, nucleus, and top-k sampling. Based on our insights into their limitations, we experiment with the recently proposed epsilon-sampling approach, which prunes away all tokens with a probability smaller than epsilon, ensuring that each token in a sample receives a fair probability mass. Through extensive human evaluations, we demonstrate that MBR decoding based on epsilon-sampling significantly outperforms not only beam search decoding, but also MBR decoding with all other tested sampling methods across four language pairs.
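    A minimal sketch of epsilon sampling as described above: at each decoding step, tokens whose probability falls below epsilon are pruned and the rest are renormalized before sampling. This standalone illustration runs over a toy next-token distribution rather than real model logits.

```python
# Minimal sketch of epsilon sampling: prune tokens below a probability threshold,
# renormalize, and sample from what remains.
import random


def epsilon_sample(probs: dict, epsilon: float = 0.02) -> str:
    kept = {tok: p for tok, p in probs.items() if p >= epsilon}
    if not kept:  # fall back to the full distribution if everything was pruned
        kept = dict(probs)
    total = sum(kept.values())
    r = random.random() * total
    cumulative = 0.0
    for tok, p in kept.items():
        cumulative += p
        if r <= cumulative:
            return tok
    return tok  # guard against floating-point rounding at the tail


# Toy next-token distribution: the lowest-probability token is pruned before sampling.
next_token_probs = {"cat": 0.55, "dog": 0.40, "zebra": 0.04, "qux": 0.01}
print(epsilon_sample(next_token_probs, epsilon=0.02))
```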
    Results of WMT23 Metrics Shared Task: Metrics might be Guilty but References are not Innocent
    Nitika Mathur
    Chi-kiu Lo
    Eleftherios Avramidis
    Ricardo Rei
    Brian Thompson
    Tom Kocmi
    Frédéric Blain
    Craig Stewart
    Chrysoula Zerva
    Sheila Castilho
    Alon Lavie
    George Foster
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 576-626
    This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year's success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks. We present an extensive analysis of how well metrics perform on three language pairs: Chinese-English, Hebrew-English on the sentence level, and English-German on the paragraph level. The results confirm last year's findings that neural-based metrics correlate significantly better with human judgments than non-neural metrics. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate the bad-reference issues we observed this year for some language pairs. Finally, we also study the connection between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics.