Optimal completion distillation for sequence learning

William Chan
Mohammad Norouzi
ICLR (2019)

Abstract

We present Optimal Completion Distillation (OCD), a training procedure for
optimizing sequence-to-sequence models based on edit distance. OCD is efficient,
has no hyper-parameters of its own, and does not require pretraining or joint
optimization with conditional log-likelihood. Given a partial sequence generated
by the model, we first identify the set of optimal suffixes that minimize the total
edit distance, using an efficient dynamic programming algorithm. Then, for each
position of the generated sequence, we define a target distribution that puts an equal
probability on the first token of each optimal suffix. OCD achieves state-of-the-art
performance on end-to-end speech recognition on both the Wall Street Journal and
Librispeech datasets, achieving 9.3% and 4.5% word error rates, respectively.
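The per-position target distributions can be sketched directly from this description. The snippet below is an illustrative Python sketch, not the authors' released code: the function name ocd_target_sets, the end-of-sequence token string, and the character-level example are our own choices. It computes, for every prefix of a model-generated sequence, the set of tokens that begin an edit-distance-optimal completion of the ground truth; OCD then places uniform probability on each such set.

```python
from typing import List, Set

def ocd_target_sets(generated: List[str], target: List[str],
                    eos: str = "</s>") -> List[Set[str]]:
    """For every prefix length t of the model-generated sequence, return the
    set of next tokens that begin an edit-distance-optimal completion of the
    ground-truth target. Runs in O(|generated| * |target|) via the standard
    Levenshtein dynamic program, keeping only one row at a time."""
    m = len(target)

    def next_tokens(row: List[int]) -> Set[str]:
        # A token is optimal iff it extends a target prefix whose edit
        # distance to the generated prefix is minimal; EOS is optimal iff
        # the full target is among those minimizers.
        best = min(row)
        opts = {target[j] for j in range(m) if row[j] == best}
        if row[m] == best:
            opts.add(eos)
        return opts

    prev = list(range(m + 1))   # distances for the empty generated prefix
    sets = [next_tokens(prev)]
    for t, tok in enumerate(generated, start=1):
        cur = [t] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                              # drop generated token
                         cur[j - 1] + 1,                           # insert target token
                         prev[j - 1] + int(tok != target[j - 1]))  # substitute / match
        sets.append(next_tokens(cur))
        prev = cur
    return sets

# Character-level toy example: ground truth "cat", model sampled "cut".
print(ocd_target_sets(list("cut"), list("cat")))
# -> [{'c'}, {'a'}, {'a', 't'}, {'</s>'}]
```

In the toy example, after the model has emitted "cu", both "a" (substituting the "u") and "t" (treating the "u" as an insertion) start completions with edit distance 1, so each would receive probability 1/2 in the OCD target distribution; after "cut", the full target is matched up to one error and only the end-of-sequence token is optimal.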

Research Areas