
Johannes von Oswald
My research is focused on AI, neural network architectures, learning algorithms, mechanistic interpretability, mesa-optimization and meta-learning as well as reinforcement learning.
Authored Publications
Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms by large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks’ computation and predictions.
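The amplification by repetition and majority voting mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a binary task, independent runs that are each correct with the same probability `p`, and an odd number of runs `k`, and the function name `majority_vote_success` is hypothetical.

```python
from math import comb


def majority_vote_success(p: float, k: int) -> float:
    """Probability that a strict majority of k independent runs is correct.

    Assumptions (illustrative, not from the paper): each run is correct
    independently with probability p, and k is odd so ties cannot occur.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Sum the binomial probabilities of having more than k/2 correct runs.
    return sum(
        comb(k, i) * p**i * (1.0 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


if __name__ == "__main__":
    # With p > 0.5, more repetitions give a higher success probability.
    for k in (1, 3, 11, 51):
        print(k, majority_vote_success(0.6, k))
```

Under these assumptions, any per-run success probability above one half is pushed towards one as the number of repetitions grows, which is the sense in which repetition "amplifies" a randomized algorithm.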
The effect of regularizers such as weight decay when training deep neural networks is not well understood. We study the influence of weight decay as well as $L2$-regularization when training neural network models in which parameter matrices interact multiplicatively. This combination is of particular interest as this parametrization is common in attention layers, the workhorse of transformers. Here, key-query, as well as value-projection parameter matrices, are multiplied directly with each other: $W_K^TW_Q$ and $PW_V$.
We extend previous results and show, on the one hand, that any local minimum of an $L2$-regularized loss of the form $L(AB^\top) + \lambda (\|A\|^2 + \|B\|^2)$ coincides with a minimum of the nuclear norm-regularized loss $L(AB^\top) + \lambda\|AB^\top\|_*$, and, on the other hand, that the two losses become identical exponentially quickly during training. We thus complement existing works linking $L2$-regularization with low-rank regularization and, in particular, explain why such regularization on the matrix product affects early stages of training.
Based on these theoretical insights, we verify empirically that optimizing with weight decay, as is usually done in vision tasks and language modelling, indeed induces a significant reduction in the rank of the key-query and value-projection matrix products $W_K^TW_Q$ and $PW_V$ within attention layers, even in fully online training.
We find that, in accordance with existing work, inducing low rank in attention matrix products can damage language model performance, and observe advantages when decoupling weight decay in attention layers from the rest of the parameters.
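The link between the $L2$ penalty on the factors and the nuclear norm of their product can be checked numerically with a small sketch. It is not the paper's code: it uses the standard fact that a "balanced" factorization built from the singular value decomposition attains $\|A\|_F^2 + \|B\|_F^2 = 2\|AB^\top\|_*$, where the constant factor of 2 is assumed here to be absorbed into $\lambda$. The helper names `balanced_factors`, `l2_penalty` and `nuclear_norm` are hypothetical.

```python
import numpy as np


def balanced_factors(W: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return A, B with A @ B.T == W and equal per-factor Frobenius norms."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    # Scale the columns of U and V by the square roots of the singular values.
    A = U * root
    B = Vt.T * root
    return A, B


def l2_penalty(A: np.ndarray, B: np.ndarray) -> float:
    """Squared Frobenius norms of both factors, as in the L2-regularized loss."""
    return float(np.sum(A**2) + np.sum(B**2))


def nuclear_norm(W: np.ndarray) -> float:
    """Sum of the singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 4))
    A, B = balanced_factors(W)
    print(l2_penalty(A, B), 2.0 * nuclear_norm(W))  # the two values agree
    # An unbalanced factorization of the same product pays a larger penalty.
    print(l2_penalty(3.0 * A, B / 3.0))
```

Rescaling the factors leaves the product $AB^\top$ unchanged but increases the $L2$ penalty, which gives some intuition for why minimizing that penalty acts like a nuclear norm regularizer on the product.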
Transformers learn in-context by gradient descent
João Sacramento
International Conference on Machine Learning (2023), pp. 35151-35174
Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context based on a few examples. Nevertheless, the mechanism of why and how Transformers become in-context learners is not well understood and remains mostly an intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. We do so by providing a simple construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient descent on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks, either the models learned by GD and Transformers show great similarity or, remarkably, the solutions found by gradient descent converge in weight space to our construction. This allows us, at least on our simple regression tasks, to mechanistically understand the inner workings of Transformers that enable in-context learning. Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context learning, termed induction-head (Olsson et al., 2022), and show how it could be generalized by in-context learning by gradient descent within Transformers.
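The equivalence between a single linear self-attention layer and one gradient descent step on a regression loss can be illustrated with a small numerical sketch. It is a simplified illustration rather than the authors' implementation, and it rests on several assumptions: the initial regression weights are zero, the loss is $\frac{1}{2N}\sum_i \|Wx_i - y_i\|^2$, tokens are the concatenations $(x_i, y_i)$, the query token is $(x_{\text{query}}, 0)$, only context tokens serve as keys and values, and the prediction is read out as the negated target part of the updated query token. All function and variable names are hypothetical.

```python
import numpy as np


def gd_step_prediction(X, Y, x_query, lr):
    """Prediction for x_query after one GD step from W0 = 0."""
    n = X.shape[0]
    # The gradient at W0 = 0 is -(1/n) * sum_i y_i x_i^T,
    # so one step gives W1 = (lr / n) * Y^T X.
    W1 = (lr / n) * Y.T @ X
    return W1 @ x_query


def linear_attention_prediction(X, Y, x_query, lr):
    """Same prediction from one softmax-free self-attention update."""
    n, dx = X.shape
    dy = Y.shape[1]
    d = dx + dy
    E = np.hstack([X, Y])                        # context tokens (x_i, y_i)
    e_q = np.concatenate([x_query, np.zeros(dy)])  # query token (x_query, 0)

    W_K = np.zeros((d, d))
    W_K[:dx, :dx] = np.eye(dx)                   # keys read the input part
    W_Q = W_K.copy()                             # queries read the input part
    W_V = np.zeros((d, d))
    W_V[dx:, dx:] = -np.eye(dy)                  # values carry -y_i (W0 = 0)
    P = (lr / n) * np.eye(d)                     # projection sets the step size

    scores = (E @ W_K.T) @ (W_Q @ e_q)           # x_i . x_query for each i
    values = E @ W_V.T
    e_q_new = e_q + P @ (values.T @ scores)      # residual attention update
    return -e_q_new[dx:]                         # negated target part


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=(8, 3)), rng.normal(size=(8, 2))
    x_query = rng.normal(size=3)
    print(gd_step_prediction(X, Y, x_query, 0.1))
    print(linear_attention_prediction(X, Y, x_query, 0.1))
```

Under these assumptions both functions compute $\frac{\eta}{N}\sum_i y_i\,(x_i^\top x_{\text{query}})$, so the two printed vectors agree up to floating-point error.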