Csaba Szepesvari

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice
    Toshinori Kitamura
    Tadashi Kozuno
    Yunhao Tang
    Nino Vieillard
    Michal Valko
    Wenhao Yang
    Jincheng Mei
    Pierre Menard
    Mo Azar
    Remi Munos
    Olivier Pietquin
    Matthieu Geist
    Wataru Kumagai
    Yutaka Matsuo
    International Conference on Machine Learning (ICML) (2023)
    Preview abstract Mirror descent value iteration (MDVI), an abstraction of Kullback--Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy with probability $1-\delta$ under the settings of an infinite-horizon linear MDP, generative model, and G-optimal design. We demonstrate that least-squares regression weighted by the variance of an estimated optimal value function of the next state is crucial to achieving minimax optimality. Based on this observation, we present Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs. Furthermore, we propose a practical VWLS algorithm for value-based deep RL, Deep Variance Weighting (DVW). Our experiments demonstrate that DVW improves the performance of popular value-based deep RL algorithms on a set of MinAtar benchmarks. View details
    On the Optimality of Batch Policy Optimization Algorithms
    Chenjun Xiao
    Yifan Wu
    Tor Lattimore
    Jincheng Mei
    Lihong Li
    ICML 2021 (2021)
    Preview abstract Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization. View details
    Preview abstract We study stochastic policy optimization in the on-policy case and make the following four contributions. \textit{First}, we show that the ordering of optimization algorithms by their efficiency gets reversed when they have or they not to the true gradient information. In particular, this finding implies that, unlike in the true gradient scenario, geometric information cannot be easily exploited without detrimental consequences in stochastic policy optimization. \textit{Second}, to explain these findings we introduce the concept of \textit{committal rate} for stochastic policy optimization, and show that this can serve as a criterion for determining almost sure convergence to global optimality. \textit{Third}, we show that if there is no external mechanism that allows an algorithm to determine the difference between optimal and sub-optimal actions using only on-policy samples, then there must be an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely. That is, an algorithm either converges to a globally optimal policy with probability $1$ but at a rate no better than $O(1/t)$, or it achieves a faster than $O(1/t)$ convergence rate but then must fail to converge to the globally optimal deterministic policy with some positive probability. \textit{Finally}, we use our committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, and how an ensemble method with parallelism can be guaranteed to achieve near-optimal solutions with high probability. View details
    Preview abstract We propose to use Thompson sampling to adapt to a sequence of bandit tasks that the algorithm interacts with in a sequential manner. Our algorithm, which we call AdaTS, achieves lower regret than the alternatives that do not adapt by maintaining a distribution over the parameters of the prior, which drives additional exploration. AdaTS is a natural fully-Bayesian algorithm, which can be implemented efficiently in a number of scenarios, as we demonstrate in this work. Our bounds on its Bayes regret show benefits of the extra layer of adaptivity. Our theoretical findings are also supported by experiments, where AdaTS outperforms prior algorithms and is applied to a challenging real-world problem. View details
    Meta-Thompson Sampling
    Branislav Kveton
    Michael Konobeev
    Martin Mladenov
    Proceedings of the 38th International Conference on Machine Learning (ICML 2021), pp. 5884-5893
    Preview abstract Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore over time by interacting with problem instances sampled from an unknown prior distribution. This algorithm meta-learns the prior and therefore we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis captures the improvement due to learning the prior and is of a broader interest, because we derive the first prior-dependent upper bound on the Bayes regret. Our regret bound is complemented by empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior. View details
    Preview abstract Classical global convergence results for first-order methods rely on uniform smoothness and the Łojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to \emph{Non-uniform Smoothness} (NS) and \emph{Non-uniform Łojasiewicz inequality} (NŁ). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical Ω(1/t2) lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to O(e−t) while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings. View details
    Differentiable Meta-Learning of Bandit Policies
    Branislav Kveton
    Martin Mladenov
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 2122-2134
    Preview abstract Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement. We derive effective gradient estimators and propose novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances. View details
    Randomized Exploration in Generalized Linear Bandits
    Branislav Kveton
    Lihong Li
    Mohammad Ghavamzadeh
    23rd International Conference on Artificial Intelligence and Statistics (2020)
    Preview abstract We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the number of features and $K$ is the number of arms. The former improves on prior work while the latter is the first for Gaussian noise perturbations in non-linear models. We empirically evaluate both GLM-TSL and GLM-FPL in logistic bandits, and apply GLM-FPL to neural network bandits. Our work showcases the role of randomization, beyond posterior sampling, in exploration. View details
    Escaping the Gravitational Pull of Softmax
    Jincheng Mei
    Chenjun Xiao
    Lihong Li
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
    Preview abstract The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, the softmax poses serious drawbacks for gradient descent optimization. We establish two negative results for this transform: (1) optimizing any expectation with respect to the softmax must exhibit extreme sensitivity to parameter initialization (``the softmax gravity well''), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (``softmax damping''). Both findings are based on an analysis of convergence rates using the Lojasiewicz inequality. To circumvent these shortcomings we investigate an alternative transformation, the escort (p-norm) mapping, that demonstrates better optimization properties. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation. View details
    Preview abstract We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy’s value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods View details