Tomer Koren
I'm a Senior Research Scientist at Google Research, Tel Aviv, and an Assistant Professor at the Blavatnik School of Computer Science at Tel Aviv University. My research interests are in machine learning and optimization.
Research Areas
Authored Publications
Sort By
Preview abstract
An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable.
Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure.
In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents.
The bounds we obtain are for \emph{swap regret},
and thus, along the way, imply convergence to a \emph{correlated} equilibrium.
Our algorithm is decentralized, computationally efficient, and does not require any communication between agents.
Our key observation is that
online learning via policy optimization in Markov games essentially reduces to a form of \emph{weighted} regret minimization, with \emph{unknown} weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.
View details
Preview abstract
We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses.
Our algorithm obtains a $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting.
In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$.
View details
Preview abstract
A natural, seemingly undisputed understanding is that a learning algorithm must obtain a good fit to the training data in order to perform well on independent test data.
This view is reflected both in the traditional bias-complexity trade-off, and in the newly emerging theory of generalization of interpolating learning rules. In this work, we ask to what extent may stochastic gradient descent (SGD) be similarly understood to generalize by means of fit to the training data.
We consider the fundamental stochastic convex optimization framework, and seek bounds on the empirical risk and generalization gap of one-pass SGD.
Surprisingly, we discover there exist convex learning problems where the output of SGD exhibits empirical risk and generalization gap both $\Omega(1)$, but generalizes well nonetheless with population risk of $O(1/\sqrt n)$, as guaranteed by the classical analysis. Consequently, it turns out SGD is not algorithmically stable in \emph{any} sense, and cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis).
We then continue to analyze the variant of SGD given by sampling \emph{with}-replacement from the training set, for which we prove that in counter to what might be expected, overfitting does not occur and the population risk converges at the optimal rate.
Finally, we study empirical risk guaranties of multiple epochs of without-replacement SGD, and derive nearly tight upper and lower bounds significantly improving over previously known results in this setting.
View details
Preview abstract
We study the stochastic Multi-Armed Bandit~(MAB) problem with
random delays in the feedback received by the algorithm. We consider two settings: the {\it reward-dependent} delay setting, where realized delays may depend on the stochastic rewards, and the {\it reward-independent} delay setting. Our main contribution is algorithms that achieve near-optimal regret in each of the settings, with an additional additive dependence on the quantiles of the delay distribution.
Our results do not make any assumptions on the delay distributions: in particular, we do not assume they come from any parametric family of distributions and allow for unbounded support and expectation; we further allow for the case of infinite delays where the algorithm might occasionally not observe any feedback.
View details
Preview abstract
We study online learning of finite-horizon Markov Decision Processes (MDPs) with adversarially changing loss functions and unknown dynamics.
In each episode, the learner observes a trajectory realized by her policy chosen for this episode. In addition, the learner suffers and observes the loss accumulated along the trajectory which we call aggregate bandit feedback. The learner, however, never observes any additional information about the loss; in particular, the individual losses suffered along the trajectory.
Our main result is a computationally-efficient algorithm with \sqrt{K} regret for this setting, where K is the number of episodes.
We efficiently reduce \emph{Online MDPs with Aggregate Bandit Feedback} to a novel setting: Distorted Linear Bandits (DLB). This setting is a robust generalization of linear bandits in which selected actions are adversarially perturbed.
We give a computationally-efficient online learning algorithm for DLB and prove a \sqrt{T} regret bound, where T is the number of time steps.
Our algorithm is based on a schedule of increasing learning rates used in Online Mirror Descent with a self-concordant barrier regularization.
We use the DLB algorithm to derive our main result of \sqrt{K} regret.
View details
Preview abstract
We study online convex optimization in the random order model, recently proposed by \citet{garber2020online}, where the loss functions may be chosen by an adversary, but are then presented to the online algorithm in a uniformly random order.
We focus on the scenario where the cumulative loss function is (strongly) convex, yet individual loss functions are smooth but might be non-convex.
Our algorithms achieve the optimal bounds and significantly outperform the results of \citet{garber2020online}, completely removing the dimension dependence and improving the dependence on the strong convexity parameter.
Our analysis relies on novel connections between algorithmic stability and generalization for sampling without-replacement analogous to those studied in the with-replacement i.i.d.~setting, as well as on a refined average stability analysis of stochastic gradient descent.
View details
Preview abstract
We address the problem of convex optimization with preference (dueling) feedback. Like the traditional optimization objective, the goal is to find the optimal point with least possible query complexity, however without the luxury of even a zeroth order feedback (function value at the queried point). Instead, the learner can only observe a single noisy bit which is a win-loss feedback for a pair of queried points based on the difference of their function values.
The problem is undoubtedly of great practical relevance as in many real world scenarios, such as recommender systems or learning from customer preferences, where the system feedback is often restricted to just one binary-bit preference information.
We consider the problem of online convex optimization (OCO) solely by actively querying $\{0,1\}$ noisy-comparison feedback of decision point pairs, with the objective of finding a near-optimal point (function minimizer) with the least possible number of queries.
For the non-stationary OCO setup, where the underlying convex function may change over time, we prove an impossibility result towards achieving the above objective.
We next focus only the on stationary OCO problem and our main contribution lies in designing a normalized-gradient-descent based algorithm towards finding a $\epsilon$-best optimal point. Towards this our algorithm is shown to yield a convergence rate of $\tilde O(\nicefrac{d\beta}{\epsilon \nu^2})$ ($\nu$ being the noise parameter) when the underlying function is $\beta$-smooth. Further we show an improved convergence rate of just $\tilde O(\nicefrac{d\beta}{\alpha \nu^2} \log \frac{1}{\epsilon})$ when the function is additionally also $\alpha$-strongly convex.
View details
Preview abstract
We consider stochastic optimization with delayed gradients where, at each time step~$t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$.
This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time.
In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $\sigma^2$ is the variance of the stochastic gradients.
This improves over previous work, which showed that stochastic gradient decent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, that can be significantly larger than the average delay especially in heterogeneous distributed systems.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
View details
Preview abstract
We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially.
Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound.
We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors).
View details
Stochastic Optimization with Laggard Data Pipelines
Cyril Zhang
Kunal Talwar
Naman Agarwal
Rohan Anil
Thirty-fourth Conference on Neural Information Processing Systems, 2020 (2020) (to appear)
Preview abstract
State-of-the-art optimization has increasingly moved toward massively parallel pipelines with extremely large batches. As a consequence, the performance bottleneck is shifting towards the CPU- and disk-bound data loading and preprocessing, as opposed to hardware-accelerated backpropagation. In this regime, a recently proposed approach is data echoing (Choi et al. '19), which takes repeated gradient steps on the same batch. We provide the first convergence analysis of data echoing-based extensions of ubiquitous optimization methods, exhibiting provable improvements over their synchronous counterparts. Specifically, we show that asynchronous batch reuse can magnify the gradient signal in a stochastic batch, without harming the statistical rate.
View details