Cibu C Johny
Authored Publications
Context-aware Transliteration of Romanized South Asian Languages
Christo Kirov
Computational Linguistics, 50 (2) (2024), 475–534
While most transliteration research is focused on single tokens such as named entities -- e.g., transliteration of "અમદાવાદ" from the Gujarati script to the Latin script "Ahmedabad" -- the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 4.8% absolute (27.1% relative) mean word-error rate reduction.
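The distinction between absolute and relative word-error rate reduction in the abstract above can be illustrated with a short sketch. This is not the paper's evaluation code: the function names are hypothetical, tokenization is assumed to be plain whitespace splitting, and the numbers in the usage note are invented for illustration.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace tokens, divided by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def absolute_reduction(baseline: float, improved: float) -> float:
    """Difference between two error rates, in the same units as the inputs."""
    return baseline - improved


def relative_reduction(baseline: float, improved: float) -> float:
    """Absolute reduction expressed as a fraction of the baseline error rate."""
    return (baseline - improved) / baseline
```

With invented rates, a drop from 0.20 to 0.15 is a 0.05 absolute reduction and a 0.25 relative reduction, which is why the two percentages reported for the same improvement differ.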
Criteria for Useful Automatic Romanization in South Asian Languages
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20–25 June, Marseille, France (2022), 6662–6673
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script. This process is also known as romanization. These criteria are related to either fidelity to human linguistic behavior (pronunciation transparency, naturalness and conventionality) or processing utility for people (ease of input) as well as under the hood in systems (invertibility and stability across languages and scripts). When addressing these differing criteria, several linguistic considerations, such as modeling of prominent phonological processes and their relation to orthography, need to be taken into account. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in the finite-state transducer (FST) formalism, that address differing criteria. Implementation of these algorithms will be released as part of the Nisaba finite-state script processing library.
Graphemic Normalization of the Perso-Arabic Script
Raiomond Doctor
Richard Sproat
Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Paris, France
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources.
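The point about visually ambiguous yet canonically nonequivalent letters can be seen with Python's standard `unicodedata` module. The sketch below is not the normalization described in the paper. The mapping table and the `normalize_letters` helper are hypothetical, chosen only to show two well-known letter pairs that look alike in some positional forms but that standard Unicode normalization leaves distinct.

```python
import unicodedata

ARABIC_KAF = "\u0643"   # ARABIC LETTER KAF
KEHEH = "\u06a9"        # ARABIC LETTER KEHEH
ARABIC_YEH = "\u064a"   # ARABIC LETTER YEH
FARSI_YEH = "\u06cc"    # ARABIC LETTER FARSI YEH

# Hypothetical, illustrative mapping toward the forms commonly used in
# Persian text; a real table would be language-specific and much larger.
ILLUSTRATIVE_MAP = {
    ord(ARABIC_KAF): KEHEH,
    ord(ARABIC_YEH): FARSI_YEH,
}


def unicode_equivalent(a: str, b: str, form: str = "NFKC") -> bool:
    """True if two strings are identical after a standard normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def normalize_letters(text: str) -> str:
    """Apply NFC, then the illustrative letter mapping defined above."""
    return unicodedata.normalize("NFC", text).translate(ILLUSTRATIVE_MAP)


# Standard normalization forms do not unify these letter pairs.
print(unicode_equivalent(ARABIC_KAF, KEHEH))     # False
print(unicode_equivalent(ARABIC_YEH, FARSI_YEH)) # False
```

Two spellings of the same word that differ only in such letters remain different strings for a language model or translation system unless a language-aware mapping of this kind is applied first.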
Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
Raiomond Doctor
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20–25 June, Marseille, France (2022), 6450–6460
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters.
Beyond Arabic: Software for Perso-Arabic Script Manipulation
Raiomond Doctor
Richard Sproat
Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP2022) at EMNLP, Association for Computational Linguistics (ACL), Abu Dhabi, United Arab Emirates (Hybrid), pp. 381-387
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for ten contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
Finite-state script normalization and processing utilities: The Nisaba Brahmic library
The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations, Association for Computational Linguistics, [Online], Kyiv, Ukraine, April, 2021, pp. 14-23
This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details.
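As a small illustration of the NFC operation mentioned above, the sketch below uses Python's standard `unicodedata` module rather than the library's own finite-state implementation, whose API is not shown here. The helper names are hypothetical. It shows two Brahmic cases: a Bengali vowel sign typed as two parts that NFC composes into one code point, and a precomposed Devanagari letter that NFC decomposes because it is on Unicode's composition exclusion list.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the Unicode NFC form of the input text."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Bengali KA followed by vowel signs E and AA (two-part spelling of vowel sign O).
bengali_split = "\u0995\u09c7\u09be"
# Bengali KA followed by the single vowel sign O.
bengali_single = "\u0995\u09cb"

# Devanagari letter QA as one precomposed code point.
devanagari_qa = "\u0958"
# Devanagari KA followed by the nukta sign.
devanagari_ka_nukta = "\u0915\u093c"

print(codepoints(to_nfc(bengali_split)))  # U+0995 U+09CB
print(codepoints(to_nfc(devanagari_qa)))  # U+0915 U+093C
```

Strings that render identically can therefore differ at the code point level before normalization, which is one way such script issues can affect downstream NLP tasks.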
Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Fei He
Shan Hui Cathy Chu
Clara E. Rivera
Martin Jansche
Supheakmungkol Sarin
Knot Pipatsrisawat
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11–16 May, Marseille, France, 6494–6503
We present free high-quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty-two official languages of India. The corpora are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or being used for speaker or language adaptation. The data can also be useful for automatic speech recognition (ASR) in various multilingual scenarios. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring the data for more languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in developing speech applications for the Indic languages and aid corpora development for other, smaller, languages of India and beyond.
Processing South Asian languages written in the Latin script: the Dakshina dataset
Christo Kirov
Sabrina J. Mielke
Keith Hall
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC) (2020), 2413–2423
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.
Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
Alena Butryna
Shan Hui Cathy Chu
Linne Ha
Fei He
Martin Jansche
Chen Fang Li
Tatiana Merkulova
Yin May Oo
Knot Pipatsrisawat
Clara E. Rivera
Supheakmungkol Sarin
Pasindu De Silva
Keshan Sodimana
Richard Sproat
Jaka Aris Eko Wibawa
2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4–6 December, Paris, France, pp. 91-94
This paper presents an overview of a program designed to address the growing need for developing free speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Cross-Lingual Consistency of Phonological Features: An Empirical Study
Martin Jansche
Proc. of Interspeech 2019 (20th Annual Conference of the International Speech Communication Association), International Speech Communication Association (ISCA), September 15–19, Graz, Austria, pp. 1741-1745
The concept of a phoneme arose historically as a theoretical abstraction that applies language-internally. Using phonemes and phonological features in cross-linguistic settings raises an important question of conceptual validity: Are contrasts that are meaningful within a language also empirically robust across languages? This paper develops a method for assessing the cross-linguistic consistency of phonological features in phoneme inventories. The method involves training separate binary neural classifiers for several phonological contrasts in audio spans centered on particular segments within continuous speech. To assess cross-linguistic consistency, these classifiers are evaluated on held-out languages and classification quality is reported. We apply this method to several common phonological contrasts, including vowel height, vowel frontness, and retroflex consonants, in the context of multi-speaker corpora for ten languages from three language families (Indo-Aryan, Dravidian, and Malayo-Polynesian). We empirically evaluate and discuss the consistency of phonological contrasts derived from features found in phonological ontologies such as PanPhon and PHOIBLE.
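Evaluation on held-out languages can be organized as in the sketch below. The abstract does not specify the exact split protocol, so the leave-one-language-out scheme here is an assumption, and the function name and data layout (a dictionary from language code to a list of examples) are hypothetical. Classifier training and scoring are deliberately left out.

```python
from typing import Dict, Iterator, List, Sequence, Tuple, TypeVar

Example = TypeVar("Example")


def held_out_language_splits(
    examples_by_language: Dict[str, Sequence[Example]],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_language, train_examples, test_examples) triples.

    Each language is held out once; training data pools every other language,
    so the test language is never seen during training.
    """
    languages = sorted(examples_by_language)
    for held_out in languages:
        train = [
            example
            for language in languages
            if language != held_out
            for example in examples_by_language[language]
        ]
        test = list(examples_by_language[held_out])
        yield held_out, train, test


# Toy data: integers stand in for labeled audio spans.
toy = {"hi": [1, 2], "ta": [3], "jv": [4]}
for language, train, test in held_out_language_splits(toy):
    print(language, len(train), len(test))
```

A binary classifier for one contrast would be trained on each `train` pool and scored on the matching `test` set, giving one quality figure per held-out language.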