Unary Data Structures for Language Models

Interspeech 2011, International Speech Communication Association, pp. 1425-1428

Abstract

Language models are important components of speech recognition and machine translation systems.
Trained on billions of words and consisting of billions of parameters, language models are often the
single largest component of these systems. Many techniques have been proposed to reduce the
storage requirements of language models. A technique based upon pointer-free compact storage of
ordinal trees achieves compression competitive with the best proposed systems, while retaining the full
finite-state structure and without relying on computationally expensive block compression schemes or
lossy quantization techniques.
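
The abstract does not spell out the encoding itself. As one plausible illustration of a pointer-free, compact representation of an ordinal tree, the sketch below shows a LOUDS-style unary degree encoding in Python: each node's out-degree d is written as d one-bits followed by a zero-bit, in level order, so the tree topology is captured in a single bit vector without child pointers. The function name louds_encode and the toy tree are hypothetical and for illustration only; they are not taken from the paper.

```python
from collections import deque


def louds_encode(children):
    """Encode an ordinal tree rooted at node 0 as a LOUDS-style bit vector.

    children: dict mapping node id -> list of child ids (in order).
    Returns (bits, order) where bits is the unary-degree bit vector and
    order is the level-order listing of nodes it describes.
    """
    bits = [1, 0]          # super-root convention: a single child, the real root
    order = []             # level-order node ids
    queue = deque([0])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.extend([1] * len(kids) + [0])   # this node's degree in unary
        queue.extend(kids)
    return bits, order


# Example: root 0 with children 1 and 2; node 1 with child 3.
bits, order = louds_encode({0: [1, 2], 1: [3]})
print(bits)    # [1, 0, 1, 1, 0, 1, 0, 0, 0]  -> 2n + 1 bits for n = 4 nodes
print(order)   # [0, 1, 2, 3]
```

In such representations, navigation (first child, next sibling, parent) is typically recovered from the bit vector with rank/select queries rather than pointers, which is what keeps the storage close to the information-theoretic minimum.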