Efficient Convolution Optimisation by Composing Microkernels

Nicolas Tollenaere
Auguste Olivry
Guillaume Iooss
Hugo Brunie
P Sadayappan
Fabrice Rastello
INRIA (2021)

Abstract

Optimizing the implementation of tensor computations is essential to exploiting the full capacity
of a given processor architecture on a wide range of scientific and machine learning applications.
However, the complexity of the microarchitectural features that come into play when approaching
the peak performance of the processor makes this task very difficult. Focusing on 2D convolutions,
we observe a weakness common to all tensor compilers and libraries: none of them efficiently covers
the wide variety of problem sizes occurring in real-world applications.
We propose TTile, a domain-specific code generator and autotuner for implementing efficient
convolutions. Similarly to BLIS, TTile nests multiple levels of tiling above a vectorized tensor
contraction microkernel. But unlike traditional approaches, we explore a variety of microkernels
and compose them to fit exactly the tensor shapes of a convolution. While this helps achieve
consistently high performance on virtually all possible tensor sizes, our method also introduces more
degrees of freedom in the optimization space, which makes it challenging for autotuning strategies.
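The core idea of composing microkernels can be illustrated with a small sketch: given the extents of the available vectorized microkernels, pick a combination whose sizes sum exactly to the problem dimension, so no iterations are wasted on padding or scalar remainder loops. This is a minimal illustration of the composition idea only; the function name, the search strategy, and the example sizes are assumptions, not TTile's actual implementation.

```python
from functools import lru_cache

def compose_microkernels(dim, kernel_sizes):
    """Return a list of microkernel extents summing exactly to `dim`,
    or None if no exact cover exists. Larger extents are tried first,
    since wider microkernels usually amortize loop overhead better.
    (Hypothetical helper for illustration.)"""
    @lru_cache(maxsize=None)
    def solve(remaining):
        if remaining == 0:
            return ()          # covered exactly
        for size in sorted(kernel_sizes, reverse=True):
            if size <= remaining:
                rest = solve(remaining - size)
                if rest is not None:
                    return (size,) + rest
        return None            # no exact cover from this point
    result = solve(dim)
    return list(result) if result is not None else None

# Example: cover an output width of 14 with 6-wide and 4-wide microkernels.
print(compose_microkernels(14, (6, 4)))   # one exact cover: [6, 4, 4]
print(compose_microkernels(5, (6, 4)))    # no exact cover: None
```

A single fixed microkernel size (say, 6) would leave a remainder of 2 on this example and force either padding or a slow scalar epilogue; composing 6- and 4-wide kernels covers the dimension exactly.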
To address this, we leverage an analytical model of data movement, and combine it with
feedback-directed autotuning. We evaluate TTile as a stand-alone compiler and also as a complement
to TVM on recent Intel x86 microarchitectures.
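To make the combination of an analytical model with measurement-based autotuning concrete, the sketch below prunes a space of candidate tilings with a simple data-movement estimate before any code is run. Everything here is a deliberately simplified assumption for illustration (a matrix-multiply-like contraction, a one-level cache, element-counted capacity); it is not TTile's actual cost model.

```python
def traffic(I, J, K, ti, tj, tk):
    """Modeled memory traffic (in elements) for computing
    C[i,j] += A[i,k] * B[k,j] with tile sizes (ti, tj, tk):
    each tile touches ti*tk + tk*tj + ti*tj elements, and there are
    (I/ti)*(J/tj)*(K/tk) tiles. (Illustrative model, not TTile's.)"""
    tiles = (I // ti) * (J // tj) * (K // tk)
    return tiles * (ti * tk + tk * tj + ti * tj)

def rank_tilings(I, J, K, candidates, cache_capacity):
    """Keep only tilings whose per-tile working set fits in cache,
    ranked by modeled traffic; a feedback-directed autotuner would
    then time only the top few survivors on the real machine."""
    fits = [(ti, tj, tk) for (ti, tj, tk) in candidates
            if ti * tk + tk * tj + ti * tj <= cache_capacity]
    return sorted(fits, key=lambda t: traffic(I, J, K, *t))

# Hypothetical candidate tilings for a 256^3 contraction and a cache
# holding 4096 elements.
cands = [(8, 8, 8), (16, 16, 16), (32, 32, 4), (64, 4, 64)]
best = rank_tilings(256, 256, 256, cands, cache_capacity=4096)
print(best[0])   # (16, 16, 16): lowest modeled traffic among fitting tilings
```

The point of the two-stage scheme is that the cheap model discards most of the search space (here, (64, 4, 64) overflows the cache and is never timed), so the expensive measurement step only sees a handful of promising candidates.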