Global Normalization for Streaming Speech Recognition in a Modular Framework
Abstract
We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
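
As a rough illustration of what sequence-level normalization involves, the sketch below computes a global normalization denominator exactly on a toy problem. It is not the GNAT model: it assumes, purely for illustration, that the score of a label sequence decomposes into first-order transition scores, so the sum over all sequences can be accumulated step by step with a forward recursion instead of by enumeration. The names `INIT`, `SCORES`, and all functions are hypothetical and the numbers are made up.

```python
import itertools
import math

# Toy, made-up scores (assumption): 2 labels, sequences of length 3.
# INIT[q] scores the first label; SCORES[t][p][q] scores moving from p to q.
INIT = [0.5, -1.0]
SCORES = [
    [[1.0, 0.0], [-0.5, 2.0]],
    [[0.3, 0.3], [1.5, -2.0]],
]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(scores, init, path):
    """Unnormalized log-score of one complete label sequence."""
    total = init[path[0]]
    for t in range(1, len(path)):
        total += scores[t - 1][path[t - 1]][path[t]]
    return total


def log_denominator_forward(scores, init):
    """Exact log-denominator via a forward recursion (linear in length)."""
    alpha = list(init)
    for step in scores:
        alpha = [
            logsumexp([alpha[p] + step[p][q] for p in range(len(alpha))])
            for q in range(len(step[0]))
        ]
    return logsumexp(alpha)


def log_denominator_brute_force(scores, init):
    """Same quantity by enumerating every sequence (exponential in length)."""
    length = len(scores) + 1
    paths = itertools.product(range(len(init)), repeat=length)
    return logsumexp([path_score(scores, init, p) for p in paths])


def global_log_prob(scores, init, path):
    """Globally normalized log-probability of a whole sequence."""
    return path_score(scores, init, path) - log_denominator_forward(scores, init)
```

Under this assumption the forward recursion and the brute-force sum agree, and the globally normalized probabilities of all sequences sum to one. The point of the sketch is only that a single sequence-level denominator can be computed exactly when the scores have limited context.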