Arindrima Datta
Authored Publications
Automated speech recognition (ASR) coverage of the world's languages continues to expand. Yet, as data-demanding neural network models revolutionize the field, data-scarce languages face a growing challenge. Multilingual models allow joint training on data-scarce and data-rich languages, enabling data and parameter sharing. One of the main goals of multilingual ASR is to build a single model for all languages that benefits data-scarce languages through sharing without degrading performance on the data-rich ones. However, most state-of-the-art multilingual models require the encoding of language information and are therefore less flexible and scalable when expanding to new languages. Language-independent multilingual models address this, and are also better suited to multicultural societies such as India, where languages overlap and are frequently used together by native speakers. In this paper, we propose a new approach to building a language-agnostic multilingual ASR system using transliteration. Our training strategy maps all languages to one writing system through a many-to-one transliteration transducer that maps similar-sounding acoustics to one target sequence, such as graphemes, phonemes, or wordpieces, resulting in improved data sharing and reduced phonetic confusion. We show with four Indic languages, namely Hindi, Bengali, Tamil, and Kannada, that the resulting multilingual model achieves performance comparable to a language-dependent multilingual model, with an improvement of up to 15% relative on the data-scarce language.
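The core idea of the many-to-one transliteration transducer — similar-sounding graphemes from different Indic scripts collapsing to one shared target sequence — can be illustrated with a minimal sketch. The tiny hand-picked mapping table below is an assumption for illustration only, not the paper's actual transducer:

```python
# Illustrative many-to-one transliteration: similar-sounding graphemes
# from Devanagari, Bengali, Tamil, and Kannada map to one shared Latin
# target, so acoustically similar units share model output targets.
# This toy table is an assumption, not the paper's transducer.
TRANSLIT = {
    "क": "ka",  # Hindi (Devanagari)
    "ক": "ka",  # Bengali
    "க": "ka",  # Tamil
    "ಕ": "ka",  # Kannada
    "म": "ma",  # Hindi (Devanagari)
    "ম": "ma",  # Bengali
}

def to_common_script(text: str) -> str:
    """Map each grapheme to the shared writing system; pass unknowns through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

print(to_common_script("कम"))  # Devanagari -> "kama"
print(to_common_script("কম"))  # Bengali   -> "kama" (same target sequence)
```

Because both inputs yield the identical target sequence, training data from either language contributes to the same output units, which is the data-sharing effect the abstract describes.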
Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
Interspeech 2019 (2019) (to appear)
Preview abstract
Multilingual end-to-end (E2E) models have shown great promise as a means to expand coverage of the world's languages by automatic speech recognition systems. They improve over monolingual E2E systems, especially on low-resource languages, and simplify training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work aims to develop an E2E multilingual system equipped to operate in low-latency interactive applications and to handle the challenges of real-world imbalanced data. First, we present a streaming E2E multilingual model. Second, we compare techniques for dealing with imbalance across languages. We find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual system achieves a word error rate (WER) at least 10% relative lower than state-of-the-art conventional monolingual models on every language.
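The winning combination — conditioning on a language vector plus per-language residual adapter layers — can be sketched in a few lines of NumPy. The dimensions, the one-hot conditioning, and the down-project/up-project adapter shape are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

# Sketch: a shared encoder layer conditioned on a one-hot language vector,
# followed by a small language-specific adapter (bottleneck + residual).
# All sizes and the adapter design are illustrative assumptions.
LANGS = ["hi", "bn", "ta", "kn"]
D, BOTTLENECK = 8, 4
rng = np.random.default_rng(0)

# Shared weights: input features concatenated with the language one-hot.
W_shared = rng.standard_normal((D + len(LANGS), D)) * 0.1

# Small per-language adapter weights (down-project, then up-project).
adapters = {
    lang: (rng.standard_normal((D, BOTTLENECK)) * 0.1,
           rng.standard_normal((BOTTLENECK, D)) * 0.1)
    for lang in LANGS
}

def encode(x: np.ndarray, lang: str) -> np.ndarray:
    """One conditioned shared layer plus a residual language adapter."""
    one_hot = np.eye(len(LANGS))[LANGS.index(lang)]
    h = np.tanh(np.concatenate([x, one_hot]) @ W_shared)
    W_down, W_up = adapters[lang]
    return h + np.tanh(h @ W_down) @ W_up  # residual adapter output

out = encode(rng.standard_normal(D), "hi")
print(out.shape)  # (8,)
```

The adapters add only a small number of parameters per language on top of the shared model, which is what makes this approach attractive for imbalanced multilingual data.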