Anton Raichuk
Authored Publications
The Q-function is a central quantity in many Reinforcement Learning (RL) algorithms in which agents follow a (soft-)greedy policy w.r.t. Q. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks with small numbers of actions, as the softmax over actions cannot be computed exactly otherwise. More specifically, the use of function approximation to deal with continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the Q-function implicitly, as the sum of a log-policy and a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm that is suitable for large action spaces and that enforces the softmax relation between the policy and the Q-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show its equivalence to a regularized version of value iteration that accounts for both entropy and Kullback-Leibler regularization and that enjoys beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.
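The implicit parametrization described in this abstract can be illustrated with a small numerical check. The sketch below assumes a finite action set, a temperature `tau` scaling the log-policy, and toy values for the logits and the value; none of these come from the abstract, and the code is not the authors' implementation. It only shows that when Q is built as `tau * log_pi + value`, the softmax of `Q / tau` gives back the policy exactly.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_softmax(logits):
    """Log-probabilities of the softmax distribution over the logits."""
    lse = logsumexp(logits)
    return [x - lse for x in logits]


def implicit_q(policy_logits, value, tau):
    """Q(s, a) = tau * log pi(a|s) + V(s) for one state (tau is an assumption)."""
    return [tau * lp + value for lp in log_softmax(policy_logits)]


# Toy numbers chosen for illustration only.
policy_logits = [0.3, -1.2, 2.0]
value = 0.7
tau = 0.5

q = implicit_q(policy_logits, value, tau)

# The policy read off directly, and the policy recovered as softmax(Q / tau).
pi_direct = [math.exp(lp) for lp in log_softmax(policy_logits)]
pi_from_q = [math.exp(lp) for lp in log_softmax([x / tau for x in q])]

# The soft value tau * logsumexp(Q / tau) recovers V.
soft_value = tau * logsumexp([x / tau for x in q])
```

In this toy setting `pi_from_q` matches `pi_direct` and `soft_value` matches `value` up to floating-point error, which is the softmax relation the abstract says the parametrization enforces.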
Continuous Control with Action Quantization from Demonstrations
Léonard Hussenot
Damien Vincent
Sertan Girgin
Matthieu Geist
Olivier Pietquin
International Conference on Machine Learning (ICML) (2022)
In this paper, we propose a novel Reinforcement Learning (RL) framework for problems with continuous action spaces: Action Quantization from Demonstrations (AQuaDem). The proposed approach consists in learning a discretization of continuous action spaces from human demonstrations. This discretization returns a set of plausible actions (in light of the demonstrations) for each input state, thus capturing the priors of the demonstrator and their multimodal behavior. By discretizing the action space, any discrete action deep RL technique can be readily applied to the continuous control problem. Experiments show that the proposed approach outperforms state-of-the-art methods such as SAC in the RL setup, and GAIL in the Imitation Learning setup. We provide a website with interactive videos: https://google-research.github.io/aquadem/ and make the code available: https://github.com/google-research/google-research/tree/master/aquadem.
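A rough sketch of how a state-dependent discretization lets a discrete-action agent drive a continuous control problem is given below. Everything in it is hypothetical: `candidate_actions` is a hand-written stand-in for a discretization that would be learned from demonstrations, and `toy_q_values` replaces a trained discrete-action critic. It is not the AQuaDem code, which is available at the repository linked in the abstract.

```python
from typing import Callable, List, Sequence, Tuple

Action = List[float]


def candidate_actions(state: Sequence[float]) -> List[Action]:
    """Hypothetical quantizer: returns K=3 continuous candidates for this state.

    In the approach described in the abstract, this mapping would be learned
    from human demonstrations; here it is hard-coded for illustration.
    """
    x = state[0]
    return [[-1.0 * x], [0.0], [1.0 * x]]


def select_action(
    state: Sequence[float],
    q_values_fn: Callable[[Sequence[float]], List[float]],
) -> Tuple[int, Action]:
    """Pick the candidate with the highest discrete Q-value (greedy choice)."""
    candidates = candidate_actions(state)
    q_values = q_values_fn(state)
    if len(q_values) != len(candidates):
        raise ValueError("need one Q-value per candidate action")
    best = max(range(len(q_values)), key=lambda k: q_values[k])
    return best, candidates[best]


def toy_q_values(state: Sequence[float]) -> List[float]:
    """Hypothetical critic output: one value per candidate index."""
    return [0.1, 0.5, 0.9]


index, action = select_action([2.0], toy_q_values)
```

The agent only ever reasons over the index `k`; the environment receives the continuous action stored at that index, so any discrete-action method can be plugged in for `q_values_fn`.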
What Matters for Adversarial Imitation Learning?
Manu Orsini
Léonard Hussenot
Damien Vincent
Sertan Girgin
Matthieu Geist
Olivier Pietquin
Marcin Andrychowicz
NeurIPS (2021)
Adversarial imitation learning has become a standard framework for imitation in continuous control. Over the years, several variations of its components were proposed to enhance the performance of the learned policies as well as the sample complexity of the algorithm. In practice, many of these choices are rarely tested together in rigorous empirical studies. It is therefore difficult to discuss and understand which choices, among the high-level algorithmic options as well as the low-level implementation details, matter.
To tackle this issue, we implement more than 50 of these choices in a generic adversarial imitation learning framework and investigate their impacts in a large-scale study (>500k trained agents) with both synthetic and human-generated demonstrations. We analyze the key results and highlight the most surprising findings.
What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
Marcin Andrychowicz
Piotr Michal Stanczyk
Manu Orsini
Sertan Girgin
Léonard Hussenot
Matthieu Geist
Olivier Pietquin
Marcin Michalski
Sylvain Gelly
ICLR (2021)
In recent years, reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancies between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [Engstrom'20]. As a step towards filling that gap, we implement >50 such "choices" in a unified on-policy deep actor-critic framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents.
Hyperparameter Selection for Imitation Learning
Léonard Hussenot
Marcin Andrychowicz
Damien Vincent
Lukasz Piotr Stafiniak
Sertan Girgin
Nikola M Momchev
Manu Orsini
Matthieu Geist
Olivier Pietquin
ICML (2021)
We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature on imitation learning mostly assumes this reward function to be available for HP selection, although this is not a realistic setting. Indeed, were this reward function available, it should be used directly for policy training, and imitation would not make sense. To tackle this mostly ignored problem, we propose and study, for different representative agents and benchmarks, a number of possible proxies for the return, within an extensive empirical study. We observe that, depending on the algorithm and the environment, some methods allow good performance to be achieved without using the unknown return.
Episodic Curiosity through Reachability
Nikolay Savinov
Damien Vincent
Marc Pollefeys
Timothy Lillicrap
Sylvain Gelly
ICLR (2019)
Rewards are sparse in the real world, and most of today's reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself, thus making rewards dense and more suitable for learning. In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such a bonus is summed with the real task reward, making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is based on how many environment steps it takes to reach the current observation from those in memory, which incorporates rich information about environment dynamics. This allows us to overcome the known “couch-potato” issues of prior work, in which the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences. We test our approach in visually rich 3D environments in VizDoom, DMLab and MuJoCo. In navigational tasks from VizDoom and DMLab, our agent outperforms the state-of-the-art curiosity method ICM. In MuJoCo, an ant equipped with our curiosity module learns locomotion from first-person-view curiosity alone. The code is available at https://github.com/google-research/episodic-curiosity.
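The memory-based novelty bonus described in this abstract can be sketched in a toy setting. The code below is an illustration under strong assumptions, not the authors' method: observations are 1-D integer positions, the number of steps between two observations is taken to be their absolute difference (a hypothetical stand-in for a learned comparison), and the names `EpisodicMemory`, `novelty_threshold`, and `bonus_scale` are invented for this example.

```python
from typing import Callable, List


class EpisodicMemory:
    """Toy episodic memory that pays a bonus for hard-to-reach observations."""

    def __init__(
        self,
        steps_between: Callable[[int, int], int],
        novelty_threshold: int,
        bonus_scale: float = 1.0,
    ) -> None:
        self.steps_between = steps_between
        self.novelty_threshold = novelty_threshold
        self.bonus_scale = bonus_scale
        self.memory: List[int] = []

    def bonus_for(self, observation: int) -> float:
        """Return the bonus and store the observation if it is novel.

        An observation counts as novel when it is more than
        `novelty_threshold` steps away from every stored observation.
        An empty memory makes the first observation novel.
        """
        is_novel = all(
            self.steps_between(stored, observation) > self.novelty_threshold
            for stored in self.memory
        )
        if is_novel:
            self.memory.append(observation)
            return self.bonus_scale
        return 0.0


def toy_steps(a: int, b: int) -> int:
    """Hypothetical step estimate: positions on a line, one unit per step."""
    return abs(a - b)


memory = EpisodicMemory(steps_between=toy_steps, novelty_threshold=3)
trajectory = [0, 1, 2, 5, 6, 10]
task_rewards = [0.0] * len(trajectory)  # sparse task reward, all zero here

bonuses = [memory.bonus_for(obs) for obs in trajectory]
combined = [r + b for r, b in zip(task_rewards, bonuses)]
```

Because the bonus depends on reachability from what is already in memory, revisiting nearby positions earns nothing, while moving far from all stored observations is rewarded even though the task reward stays at zero.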
Google Research Football: A Novel Reinforcement Learning Environment
Karol Kurach
Piotr Michal Stanczyk
Michał Zając
Lasse Espeholt
Carlos Riquelme
Damien Vincent
Marcin Michalski
Sylvain Gelly
AAAI (2019)
Recent progress in the field of reinforcement learning has been accelerated by virtual learning environments such as video games, where novel algorithms and ideas can be quickly tested in a safe and reproducible manner. We introduce the Google Research Football Environment, a new reinforcement learning environment where agents are trained to play football in an advanced, physics-based 3D simulator.
The resulting environment is challenging, easy to use and customize, and available under a permissive open-source license. We further propose three full-game scenarios of varying difficulty with the Football Benchmarks, report baseline results for three commonly used reinforcement learning algorithms (Impala, PPO, and Ape-X DQN), and provide a diverse set of simpler scenarios with the Football Academy.