Faster Transformer Decoding: N-gram Masked Self-Attention

Ankur Bapna
Noam Shazeer
ArXiv, Google Research (2020)

Abstract

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S = s_1, ..., s_S, we propose truncating the target-side context used for incremental predictions by making a Markov (N-gram) assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4, ..., 8, depending on the task.
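
The abstract states the idea only at a high level. As an illustration (not code from the paper), the following NumPy sketch shows one way such an N-gram mask could be applied in single-head, unbatched self-attention: each target position attends only to itself and the N-1 preceding target tokens. The function names ngram_causal_mask and ngram_masked_self_attention, and the toy dimensions, are assumptions made here for clarity.

```python
import numpy as np

def ngram_causal_mask(seq_len, n):
    """Boolean (seq_len, seq_len) mask: position i may attend only to
    positions i-n+1 .. i (itself plus the n-1 preceding tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - n)

def ngram_masked_self_attention(x, wq, wk, wv, n):
    """Single-head self-attention whose scores are masked so each query
    sees at most the last n target tokens (the Markov assumption).
    Illustrative sketch only; not the paper's implementation."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = ngram_causal_mask(x.shape[0], n)
    scores = np.where(mask, scores, -1e9)      # block positions outside the window
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over the allowed positions
    return weights @ v

# Toy usage: 10 target positions, model width 16, N = 4
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = ngram_masked_self_attention(x, wq, wk, wv, n=4)
print(out.shape)  # (10, 16)
```

During incremental decoding, a mask of this form means only the keys and values of the last N-1 target tokens need to be cached per layer, rather than the full growing prefix, which is the source of the decoding speedup referred to in the title.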