Training Keyword Spotters with Limited and Synthesized Speech Data

Dominik Roblek
James Lin
International Conference on Acoustics, Speech, and Signal Processing, IEEE, Barcelona, Spain (2020)

Abstract

With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of high-quality training data. In this paper, we explore the effectiveness of synthesized speech data in training small spoken term detection models. Instead of training such models directly on the audio or on low-level features such as MFCCs, we use a small speech embedding model trained to extract useful features for keyword spotting models. Using this embedding, we show that a model for detecting 10 keywords, trained only on synthetic speech, is equivalent to a model trained on over 50 real examples, or to a model trained on 4000 real examples when the speech embeddings are not used.
