Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

Ayush Sekhari
Karthik Sridharan
NeurIPS (2021)

Abstract

There have been many recent advances on provably efficient reinforcement learning in problems with rich observation spaces and general function classes. Unfortunately, all such approaches share a realizability assumption, which requires the function class to contain the optimal value function of the true MDP model and which rarely holds in real-world settings. In this work, we consider the more realistic setting of agnostic reinforcement learning with a policy class (which may not contain any near-optimal policy). We provide an algorithm for this setting and prove instance-dependent regret bounds when the MDP has small rank $d$. Our bounds scale exponentially with the rank $d$ in the worst case but, importantly, are polynomial in the horizon, the number of actions, and the log of the number of policies. We further show, through a nearly matching lower bound, that this exponential dependence on the rank is unavoidable.
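As a rough, hedged illustration (a schematic form only, not the paper's exact statement), a bound that is exponential in the rank $d$ but polynomial in the horizon $H$, the number of actions $K$, and $\log |\Pi|$ (with $\Pi$ the given policy class) can be written as

$$ \widetilde{O}\!\left( \frac{\mathrm{poly}(H, K)^{\,d} \cdot \log |\Pi|}{\epsilon^{2}} \right) $$

episodes to reach suboptimality $\epsilon$, where $\mathrm{poly}(H, K)^{\,d}$ is a placeholder for the instance-dependent factors whose exact exponents are not stated in this abstract.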
