LEARNING THE TRANSFORMER KERNEL

Sankalan Chowdhury
Adamos Solomou
Mrinmaya Sachan
Transactions on Machine Learning Research (2022)

Abstract

In this work we introduce KL-Transformer, a generic, scalable, data driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of the Transformers from quadratic to linear. We show that KL-Transformers achieve performance comparable to existing efficient Transformer architectures, both in terms of accuracy and computational efficiency. Our study also demonstrates that the choice of the kernel has a substantial impact on performance, and kernel learning variants are competitive alternatives to fixed kernel Transformers, both in long as well as short sequence tasks.

Research Areas