Zhe Feng

Zhe Feng

I am a Research Scientist in the Market Algorithms group at Google Research, Mountain View. I got my PhD in Computer Science at Harvard University in 2021, where I was fortunately advised by David C. Parkes. My research broadly lies in the intersection between economics and computer science, and mainly focus on understanding how to design better mechanisms through machine learning approaches. See my personal website for more information.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We propose a new Markov Decision Process (MDP) model for ad auctions to capture the user response to the quality of ads, with the objective of maximizing the long-term discounted revenue. By incorporating user response, our model takes into consideration all three parties involved in the auction (advertiser, auctioneer, and user). The state of the user is modeled as a user-specific click-through rate (CTR) with the CTR changing in the next round according to the set of ads shown to the user in the current round. We characterize the optimal mechanism for this MDP as a Myerson’s auction with a notion of modified virtual value, which relies on the value distribution of the advertiser, the current user state, and the future impact of showing the ad to the user. Leveraging this characterization, we design a sample-efficient and computationally-efficient algorithm which outputs an approximately optimal policy that requires only sample access to the true MDP and the value distributions of the bidders. Finally, we propose a simple mechanism built upon second price auctions with personalized reserve prices and show it can achieve a constant-factor approximation to the optimal long term discounted revenue. View details
    Learning Thresholds with Latent Value and Censored Feedback
    Jiahao Zhang
    Tao Lin
    Weiqiang Zheng
    Xiaotie Deng
    ICLR (2024)
    Preview abstract In this paper, we investigate a problem of \emph{actively} learning threshold in latent space, where the \emph{unknown} reward $g(\gamma, v)$ depends on the proposed threshold $\gamma$ and latent value $v$ and it can be \emph{only} achieved if the threshold is lower than or equal to the \emph{unknown} latent value. This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignments in crowdsourcing, setting recruiting bars in hiring, etc. We first characterize the query complexity of learning a threshold with the expected reward at most $\eps$ smaller than the optimum and prove that the number of queries needed can be infinitely large even when $g(\gamma, v)$ is monotone with respect to both $\gamma$ and $v$. On the positive side, we provide a tight query complexity $\Tilde{\Theta}(1/\eps^3)$ when $g$ is monotone and the CDF of value distribution is Lipschitz. Moreover, we show a tight $\Tilde{\Theta}(1/\eps^3)$ query complexity can be achieved as long as $g$ satisfies one-sided Lipschitzness, which provides a complete characterization for this problem. Finally, we extend this model to an online learning setting and demonstrate a tight $\Theta(T^{2/3})$ regret bound using continuous-arm bandit techniques and the aforementioned query complexity results. View details
    Preview abstract In this survey, we summarize recent developments in research fueled by the growing adoption of automated bidding strategies in online advertising. We explore the challenges and opportunities that have arisen as markets embrace this autobidding and cover a range of topics in this area, including bidding algorithms, equilibrium analysis and efficiency of common auction formats, and optimal auction design. View details
    Preview abstract We study the design of loss functions for click-through rates (CTR) to optimize (social) welfare in ad auctions. Existing works either only focus on CTR predictions without consideration of business objectives (e.g. welfare) in auctions or assume that the distribution over the participants' expected cost-per-impression (eCPM) is known a priori and uses various additional assumptions on the parametric form of the distribution to derive loss functions for predicting CTRs. In this work, we bring back the welfare objectives of ad auctions into CTR predictions and propose a novel weighted rankloss to train CTR model. Compared with the previous literature, our approach provides a provable guarantee on welfare but without any assumptions on the eCPMs' distribution, while also avoiding the intractability of naively applying existing learning-to-rank methods. We further propose a theoretically justifiable technique for calibrating the losses using labels generated from a teacher network, only assuming that the teacher network has bounded $\ell_2$ generalization error. Finally, we demonstrate the practical advantages of the proposed loss on both synthetic and real-world data. View details
    Learning to Bid in Contextual First Price Auctions
    Ashwinkumar Badanidiyuru Varadaraja
    Guru Prashanth Guruganesh
    The Proceedings of the ACM Web Conference 2023
    Preview abstract In this paper, we investigate the problem about how to bid in repeated contextual first price auctions. We consider a single bidder (learner) who repeatedly bids in the first price auctions: at each time t, the learner observes a context xt ∈ R d and decides the bid based on historical information and xt. We assume a structured linear model of the maximum bid of all the others mt = α0 · xt + zt, where α0 ∈ R d is unknown to the learner and zt is randomly sampled from a noise distribution F with log-concave density function f. We consider both binary feedback (the learner can only observe whether she wins or not) and full information feedback (the learner can observe mt) at the end of each time t. For binary feedback, when the noise distribution F is known, we propose a bidding algorithm, by using maximum likelihood estimation (MLE) method to achieve at most Oe( p log(d)T ) regret. Moreover, we generalize this algorithm to the setting with binary feedback and the noise distribution is unknown but belongs to a parametrized family of distributions. For the full information feedback with unknown noise distribution, we provide an algorithm that achieves regret at most Oe( √ dT). Our approach combines an estimator for log-concave density functions and then MLE method to learn the noise distribution F and linear weight α0 simultaneously. We also provide a lower bound result such that any bidding policy in a broad class must achieve regret at least Ω(√ T), even when the learner receives the full information feedback and F is known. View details
    Follow-ups Matter: Improving Contextual Bandits via Post-serving Features
    Chaoqi Wang
    Ziyu Ye
    Ashwinkumar Badanidiyuru Varadaraja
    Haifeng Xu
    NeurIPS-23 (Spotlight) (2023)
    Preview abstract In the realm of contextual bandit algorithms, the standard framework involves observing a context, selecting an arm, and then observing the reward. This approach, while functional, often falls short when dealing with the complexity of real-world scenarios where additional valuable contexts are revealed after arm-selection. We introduce a new algorithm, pLinUCB, designed to incorporate these post-serving contexts effectively, thereby achieving sublinear regret. Our key technical contribution is a robustified and generalized version of the well-known Elliptical Potential Lemma (EPL), which allows us to handle the noise in the context vectors, a crucial aspect in a practical setting. This generalized EPL is of independent interest as it has potential applications beyond the scope of this work. Through extensive empirical tests on both synthetic and real-world datasets, we demonstrate that our proposed algorithm outperforms the state-of-the-art, thus establishing its practical potential. Our work underscores the importance of post-serving contexts in the contextual bandit setting and lays the groundwork for further research in this field. View details
    Preview abstract In this work, we investigate the online learning problem of revenue maximization in ad auctions, where the seller needs to learn the click-through rates (CTRs) of each ad candidate and charge the price of the winner through a pay-per-click manner. We focus on two models of the advertisers’ strategic behaviours. First, we assume that the advertiser is completely myopic; i.e. in each round, they aim to maximize their utility only for the current round. In this setting, we develop an online mechanism based on upper-confidence bounds that achieves a tight O(√T) regret in the worst-case and negative regret when the values are static across all the auctions. Next, we assume that the advertiser is non-myopic but cares about their γ-discounted long term utility. This setting is much more complex since an advertiser is incentived to “game” the mechanism by bidding strategically in earlier rounds. Advertisers may place strategic bids in earlier rounds to influence the algorithm in later rounds. In this setting, we develop an online mechanism that allows us to bound how often the advertiser misreports. Further, this mechanism achieves O(√T) regret against the optimal mechanism with truthful bids, when γ < 1. This complements the prior work that shows that Ω(T^{2/3}) regret is necessary when γ = 1, i.e. when the advertisers do not discount their utility. Furthermore, we provide an algorithm to achieve constant regret for the static valuation setting, even when γ = 1. View details
    Online Bidding Algorithms for Return-on-Spend Constrained Advertisers
    Swati Padmanabhan
    The Proceedings of the ACM Web Conference 2023
    Preview abstract Online advertising has recently become a highly competitive, complex, multi-billion-dollar industry with advertisers bidding for ad slots at large scales and high frequencies. This has resulted in the growing need for efficient ``auto-bidding'' algorithms to determine the bids for incoming queries to maximize advertisers' targets subject to their specified constraints. Our work focuses on designing efficient online algorithms for a single value-maximizing advertiser under a commonly seen constraint: Return-on-Spend (RoS). We quantify the efficiency in terms of \emph{regret} in comparison with the optimal algorithm that knows all the queries a priori. Our main contribution is an algorithm that achieves near-optimal regret while respecting the specified RoS constraint. We are able to also incorporate our results with the pre-existing work of Balseiro, Lu, and Mirrokni~\cite{BLM20} to achieve near-optimal regret while respecting \emph{both} RoS and fixed budget constraints. Our main technical insight is in leveraging ideas from the literature on (offline) packing linear programs, in addition to the generic structural properties arising from our auction model. View details
    Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning
    Jibang Wu
    Zixuan Zhang
    Zhaoran Wang
    Zhuoran Yang
    Michael I. Jordan
    Haifeng Xu
    The Twenty-Third ACM Conference on Economics and Computation (EC'22) (2022)
    Preview abstract In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align its long term interest with incentives of the gig service providers (e.g., drivers, hosts). This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), in which a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation that additionally takes persuasiveness into consideration. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. For such a problem, we design a provably efficient no-regret learning algorithm, the \fullmodel{} (\ORAI), which features a novel combination of both optimism and pessimism principles. In particular, we obtain optimistic estimates of the value functions to encourage exploration under the unknown environment, and additionally robustify the signaling policy with respect to the uncertainty of prior estimation to prevent receiver's detrimental equilibrium behavior. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting. View details
    Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards
    Ashwinkumar Badanidiyuru Varadaraja
    Tianxi Li
    Haifeng Xu
    Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022) (2022)
    Preview abstract Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central problem for advertisers in advertising systems. In this paper, we investigate the problem about how the advertiser can decide the bid through learning conversion incrementality, in an online manner. We formulate this problem as an episodic Markov Decision Process (MDP) and propose a novel reinforcement learning (RL) algorithm that can achieve regret at most $O(H^2\sqrt{T})$, which doesn't rely on the number of actions (bids), where $H$ is the number of rounds in each episode and $T$ is the total number of episodes in MDP. In sharp contrast with standard RL framework, conversion incrementality feedback are delayed and mixed. To handle this difficulty we propose a novel pairwise moment-matching algorithm to learn conversion incrementality, which is of independent of interest. View details