Understanding How Encoder-Decoder Architectures Attend

Kyle Aitken
Niru Maheswaranathan
NeurIPS (2021)

Abstract

Encoder-decoder networks with attention have proven to be a powerful way to solve
many sequence-to-sequence tasks. In these networks, attention aligns encoder and
decoder states and is often used for visualizing network behavior. However, the
mechanisms used by networks to generate appropriate attention matrices are still
mysterious. Moreover, how these mechanisms vary depending on the particular
architecture used for the encoder and decoder (recurrent, feed-forward, etc.) is also
not well understood. In this work, we investigate how encoder-decoder networks
solve different sequence-to-sequence tasks. We introduce a way of decomposing
hidden states over a sequence into temporal (independent of input) and input-driven (independent of sequence position) components. This reveals how attention
matrices are formed: depending on the task requirements, networks rely more
heavily on either the temporal or input-driven components. These findings hold
across both recurrent and feed-forward architectures despite their differences in
forming the temporal components. Overall, our results provide new insight into the
inner workings of attention-based encoder-decoder networks.
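
The abstract does not spell out the decomposition precisely; as an illustrative sketch only (the symbols $\alpha_t$, $\beta$, and $r_t$ below are assumed notation, not the paper's), the split described above could take an additive form:

$$h_t(x) \;=\; \alpha_t \;+\; \beta(x_t) \;+\; r_t(x), \qquad \alpha_t = \mathbb{E}_x\!\left[h_t(x)\right], \qquad \beta(x_t) = \mathbb{E}_t\!\left[h_t(x)\mid x_t\right] - \mathbb{E}_{t,x}\!\left[h_t(x)\right],$$

where $h_t(x)$ is the hidden state at position $t$ for input sequence $x$, $\alpha_t$ is the temporal component (varying with position but not the input), $\beta(x_t)$ is the input-driven component (varying with the current token but not its position), and $r_t(x)$ is a residual.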
