Transformers learn in-context by gradient descent

Johannes von Oswald

Eyvind Niklasson

Ettore Randazzo

João Sacramento

Alexander Mordvintsev

Andrey Zhmoginov

Max Vladymyrov

International Conference on Machine Learning (2023), pp. 35151-35174

Download Google Scholar

Abstract

Transformers have become the state-of-the-art neural network architecture across numerous
domains of machine learning. This is partly due to their celebrated ability to transfer and
to learn in-context based on a few examples. Nevertheless, the mechanism of why and
how Transformers become in-context learners is not well understood and remains mostly an
intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely
related to well-known gradient-based meta-learning formulations. We do so by providing
a simple construction that shows the equivalence of data transformations induced by 1) a
single linear self-attention layer and by 2) gradient-descent on a regression loss. Motivated by
that construction, we show empirically that when training self-attention only Transformers
on simple regression tasks either the models learned by GD and Transformers show great
similarity or, remarkably, the solutions found by gradient descent converge in weight space to
our construction. This allows us, at least on our simple regression tasks, to mechanistically
understand the inner workings of Transformers that enables in-context learning within.
Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context
learning termed induction-head (Olsson et al., 2022) and show how it could be generalized
by in-context learning by gradient descent within Transformers.

Research Areas

Machine Intelligence

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Transformers learn in-context by gradient descent

Abstract

Research Areas

Meet the teams driving innovation

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Transformers learn in-context by gradient descent

Abstract

Research Areas

Meet the teams driving innovation

AI/ML Foundations  & Capabilities