Cyril Allauzen
Cyril Allauzen is a research scientist at Google in New York. His main research interests are in finite-state methods and their applications to text, speech and natural language processing and machine learning. Before joining Google, he worked as a researcher at AT&T Labs Research and at NYU's Courant Institute of Mathematical Sciences. Cyril received his Ph.D. in computer science from the Université de Marne-la-Vallée in 2001.
Cyril is an author of the OpenFst Library, the OpenKernel Library and the GRM Library.
Authored Publications
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
David Rybach
Cal Peyser
Zhiyun Lu
Interspeech 2022 (2022) (to appear)
Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition.
A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock").
Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute.
In experiments on real-world long-form audio (YouTube) up to 30 minutes long, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
On Weight Interpolation of the Hybrid Autoregressive Transducer Model
David Rybach
Interspeech 2022 (2022) (to appear)
This paper explores ways to improve a two-pass speech recognition system in which the first pass is a hybrid autoregressive transducer model and the second pass is a neural language model. The main focus is on the scores provided by each of these models, their quantitative analysis, how to improve them, and the best way to integrate them with the objective of better recognition accuracy. Several analyses are presented to show the importance of the choice of the integration weights for combining the first-pass and the second-pass scores. A sequence-level weight estimation model along with four training criteria are proposed, which allow adaptive integration of the scores per acoustic sequence. The effectiveness of this algorithm is demonstrated by constructing and analyzing models on the Librispeech data set.
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
David Johannes Rybach
James Qin
Quoc-Nam Le-The
Anmol Gulati
Cal Peyser
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. In addition, to further improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
Hybrid Autoregressive Transducer (HAT)
David Rybach
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6139-6143
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
Federated Learning of N-gram Language Models
Adeline Wong
The SIGNLL Conference on Computational Natural Language Learning (2019)
We propose algorithms to train production-quality n-gram language models using federated learning. Federated learning is a machine learning technique to train global models to be used on portable devices such as smart phones, without the users' data ever leaving their devices. This is especially relevant for applications handling privacy-sensitive data, such as virtual keyboards. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by n-gram language models, mostly for latency reasons.
We propose to train a recurrent neural network language model using the decentralized "FederatedAveraging" algorithm and to approximate this federated model server-side with an n-gram model that can be deployed to devices for fast inference.
Our technical contributions include novel ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite state transducer algorithms to convert word language models to word-piece language models and vice versa.
The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard.
Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that high-quality n-gram language models can be trained directly on client mobile devices without sensitive training data ever leaving the device.
Contextual Recovery of Out-of-Lattice Named Entities in Automatic Speech Recognition
Jack Serrino
ISCA Interspeech 2019, ISCA, Graz, Austria (2019), pp. 3830-3834
As voice-driven intelligent assistants become commonplace, adaptation to user context becomes critical for Automatic Speech Recognition (ASR) systems. For example, ASR systems may be expected to recognize a user’s contact names containing improbable or out-of-vocabulary (OOV) words.
We introduce a method to identify contextual cues in a first-pass ASR system’s output and to recover out-of-lattice hypotheses that are contextually relevant. Our proposed module is agnostic to the architecture of the underlying recognizer, provided it generates a word lattice of hypotheses; it is sufficiently compact for use on device. The module identifies subgraphs in the lattice likely to contain named entities (NEs), recovers phoneme hypotheses over corresponding time spans, and inserts NEs that are phonetically close to those hypotheses. We measure a decrease in the mean word error rate (WER) of word lattices from 11.5% to 4.9% on a test set of NEs.
Algorithms for Weighted Finite Automata with Failure Transitions
International Conference on Implementation and Application of Automata (CIAA) (2018), pp. 46-58
In this paper we extend some key weighted finite automata (WFA) algorithms to automata with failure transitions (phi-WFAs). Failure transitions, which are taken only when no immediate match is possible at a given state, are used to compactly represent automata and have many applications. An efficient intersection algorithm and a shortest distance algorithm (over R+) are presented as well as a related algorithm to remove failure transitions from a phi-WFA.
Transliterated mobile keyboard input via weighted finite-state transducers
Lars Hellsten
Prasoon Goyal
David Rybach
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP) (2017)
We present an extension to a mobile keyboard input decoder based on finite-state transducers that provides general transliteration support, and demonstrate its use for input of South Asian languages using a QWERTY keyboard. On-device keyboard decoders must operate under strict latency and memory constraints, and we present several transducer optimizations that allow for high accuracy decoding under such constraints. Our methods yield substantial accuracy improvements and latency reductions over an existing baseline transliteration keyboard approach. The resulting system was launched for 22 languages in Google Gboard in the first half of 2017.
Distributed representation and estimation of WFST-based n-gram models
Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM) (2016), pp. 32-41
We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be merged into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the OpenGrm library.
Improved recognition of contact names in voice commands
David Elson
Aleks Kracun
Diego Melendo Casado
Pedro J. Moreno
ICASSP 2015