Regularizing Word Segmentation by Creating Misspellings

Hainan Xu
Jesse Emond
Yinghui Huang
Interspeech 2021 (to appear)

Abstract

This work focuses on improving subword segmentation algorithms for end-to-end speech recognition models, and makes two major contributions. Firstly, we propose a novel word segmentation algorithm. The algorithm uses the same vocabulary file generated by a regular wordpiece model, is easily extensible, supports a variety of regularization techniques in the segmentation space, and outperforms the regular wordpiece model. Secondly, we propose a number of novel regularization methods that introduce randomness into the tokenization algorithm and bring further gains in speech recognition performance. A noteworthy discovery from this work is that creating artificial misspellings in words results in the best performance among all the methods, which could inspire future research into regularization strategies in this area.
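The abstract does not spell out the segmentation algorithm or the exact form of the misspellings, so the following is only a toy illustration of the general idea, not the method proposed in the paper. It assumes a greedy longest-match segmenter over a small made-up vocabulary and a hypothetical `misspell` helper that applies one random character edit (drop, swap, or repeat) before segmentation. All function names, the edit types, and the `prob` parameter are assumptions introduced for this sketch.

```python
import random


def segment(word, vocab, unk="<unk>"):
    """Greedy longest-match-first segmentation of a single word.

    Assumption: a simplified stand-in for a wordpiece-style segmenter,
    without word-boundary or continuation markers.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches here
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


def misspell(word, rng, prob=0.1):
    """With probability `prob`, apply one random character edit.

    Assumption: drop/swap/repeat are illustrative edit types only.
    """
    if len(word) < 2 or rng.random() >= prob:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(("drop", "swap", "repeat"))
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + word[i] + word[i:]  # repeat the character at i


def regularized_segment(word, vocab, rng, prob=0.1):
    """Segment a possibly misspelled copy of `word` (training-time only)."""
    return segment(misspell(word, rng, prob), vocab)


# Toy vocabulary (hypothetical); single characters guarantee coverage.
VOCAB = {"s", "p", "e", "c", "h", "spe", "ech", "speech"}
```

In a setup like this, the perturbation would be applied only to training transcripts, so the model sees several token sequences for the same word, while inference uses the unperturbed `segment` call.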