Yishay Mansour

Yishay Mansour

Prof. Yishay Mansour got his PhD from MIT in 1990, following it he was a postdoctoral fellow in Harvard and a Research Staff Member in IBM T. J. Watson Research Center. Since 1992 he is at Tel-Aviv University, where he is currently a Professor of Computer Science and has serves as the first head of the Blavatnik School of Computer Science during 2000-2002. He was the founder and first director of the Israeli Center of Research Excellence in Algorithms. Prof. Mansour has published over 100 journal papers and over 200 proceeding papers in various areas of computer science with special emphasis on machine learning, algorithmic game theory, communication networks, and theoretical computer science and has supervised over a dozen graduate students in those areas. Prof. Mansour was named as an ACM fellow 2014, and he is currently an associate editor in a number of distinguished journals and has been on numerous conference program committees. He was both the program chair of COLT (1998) and the STOC (2016) and served twice on the COLT steering committee and is a member of the ALT steering committee.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Principal-Agent Reward Shaping in MDPs
    Omer Ben-Porat
    Michal Moshkovitz
    Boaz Taitler
    AAAI 2024
    Preview abstract Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon. View details
    Partially Interpretable Models with Guarantees on Coverage and Accuracy
    Nave Frost
    Zachary Lipton
    Michal Moshkovitz
    Algorithmic Learning Theory (ALT) (2024)
    Preview abstract Simple, sufficient explanations furnished by short decision lists can be useful for guiding stakeholder actions. Unfortunately, this transparency can come at the expense of the higher accuracy enjoyed by black box methods, like deep nets. To date, practitioners typically either (i) insist on the simpler model, forsaking accuracy; or (ii) insist on maximizing accuracy, settling for post-hoc explanations of dubious faithfulness. In this paper, we propose a hybrid \emph{partially interpretable model} that represents a compromise between the two extremes. In our setup, each input is first processed by a decision list that can either execute a decision or abstain, handing off authority to the opaque model. The key to optimizing the decision list is to optimally trade off the accuracy of the composite system against coverage (the fraction of the population that receives explanations). We contribute a new principled algorithm for constructing partially interpretable decision lists, providing theoretical guarantees addressing both interpretability and accuracy. As an instance of our result, we prove that when the optimal decision list has length $k$, coverage $c$, and $b$ mistakes, our algorithm will generate a decision list that has length no greater than $4k$, coverage at least $c/2$, and makes at most $4b$ mistakes. Finally, we empirically validate the effectiveness of the new model. View details
    Preview abstract Blackwell's celebrated theory measures approachability using the $\ell_2$ (Euclidean) distance. In many applications such as regret minimization, it is often more useful to study approachability under other distance metrics, most commonly the $\ell_\infty$ metric. However, the time and space complexity of the algorithms designed for $\ell_\infty$ approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. We present a framework for converting high-dimensional $\ell_\infty$ approachability problems to low-dimensional \emph{pseudonorm} approachability problems, thereby resolving such issues. We first show that the $\ell_\infty$ distance between the average payoff and the approachability set can be equivalently defined as a \emph{pseudodistance} between a lower-dimensional average vector payoff and a new convex convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability analogous to previous work norm approachability showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use that to show, modulo mild normalization assumptions, that there exists an $\ell_\infty$ approachability algorithm whose convergence is independent of the dimension of the original vector payoff. We further show that that algorithm admits a polynomial-time complexity, assuming that the original $\ell_\infty$-distance can be computed efficiently. We also give an $\ell_\infty$ approachability algorithm whose convergence is logarithmic in that dimension using an FTRL algorithm with a maximum-entropy regularizer. Finally, we illustrate the benefits of our framework by applying it to several problems in regret minimization. View details
    Preview abstract There is often a great degree of freedom in the reward design when formulating a task as a reinforcement learning (RL) problem. The choice of reward function has significant impact on the learned policy and how fast the algorithm converges to it. Typically several iterations of specifying and learning with the reward function are necessary to find one that leads to sample-efficient learning of desired behavior. In this work, we instead propose to directly pass multiple alternate reward formulations of the task to the RL agent. We show that natural extensions of action-elimination algorithms to multiple rewards achieve more favorable instance-dependent regret bounds than their single-reward counterparts, both in multi-armed bandits and in tabular Markov decision processes. Specifically our bounds scale for each state-action pair with the inverse of the most favorable gap among all reward functions. This suggests that learning with multiple rewards can indeed be more sample-efficient, as long as the rewards agree on an optimal policy. We further prove that when rewards do not agree on the optimal policy, multi-reward action elimination in multi-armed bandits still learns a policy that is good across all reward functions. View details
    Preview abstract We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains a $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$. View details
    Preview abstract We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffle model. Specifically, we give a summation algorithm with error $\Tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\Tilde{\Theta}(n^{1/3})$ which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal $\Tilde{O}(\sqrt{n})$ regret with $k= \Tilde{\Omega}(\log n)$ concurrent shufflers. View details
    Preview abstract We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being “slightly better than random guessing”. We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of the number of classes. Furthermore, we utilize our new boosting technique in two fundamental settings: multiclass PAC learning and List PAC learning, resulting in simplified algorithms compared to previous works. View details
    Preview abstract Given a policy, we define a {\newprob}%\emph{safe zone} as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the {\newprob }%safe zone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. Safe zones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal {\textsc{safeZones}}, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate {\textsc{safeZones}.} Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and safe zone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm. View details
    Preview abstract In classic reinforcement Learning (RL) problems, policies are evaluated with respect to some reward function and all optimal policies obtain the same expected return. However, when considering real-world dynamic environments in which different users have different preferences, a policy that is optimal for one user might sub-optimal for another. In this work, we propose a multi-objective reinforcement learning framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our setup consists of a Markov Decision Process with a multi-objective reward function, in which each user corresponds to (unknown) personal preferences vector and their reward in each state-action is the inner product of their preference vector with the multi-objective reward at that state-action. Our goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and the user feedback is their preferred policy. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries. View details
    Preview abstract An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for \emph{swap regret}, and thus, along the way, imply convergence to a \emph{correlated} equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of \emph{weighted} regret minimization, with \emph{unknown} weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees. View details