Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

Alena Butryna
Shan Hui Cathy Chu
Linne Ha
Fei He
Martin Jansche
Chen Fang Li
Tatiana Merkulova
Yin May Oo
Knot Pipatsrisawat
Clara E. Rivera
Supheakmungkol Sarin
Pasindu De Silva
Keshan Sodimana
Richard Sproat
Jaka Aris Eko Wibawa
2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4-6 December, Paris, France, pp. 91-94

Abstract

This paper presents an overview of a program designed to address the growing need for freely available speech resources for under-represented languages. To date, we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used to develop these corpora and presents some of our findings that could benefit under-represented language communities.