Approximate Linear Programming for Logistic Markov Decision Processes

Martin Mladenov
Tyler Lu
Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia (2017), pp. 2486-2493

Abstract

This is an extended version of the paper Logistic Markov Decision Processes that appeared in the Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 2486-2493, Melbourne (2017).

Online and mobile interactions with users, in areas such as advertising and product or content
recommendation, have been transformed by machine learning techniques. However, such methods have largely focused on myopic prediction, i.e., predicting immediate user response to system
actions (e.g., ads or recommendations), without explicitly accounting for the long-term impact
on user behavior, or the potential need to plan sequences of actions. In this work, we propose
the use of Markov decision processes (MDPs) to formulate the long-term decision problem and
address two key questions that emerge in their application to user interaction.

The first focuses on model formulation, specifically, how best to construct MDP models of
user interaction in a way that exploits the great successes of myopic prediction models. To this
end, we propose a new model called logistic MDPs, an MDP formulation that allows the concise specification of transition dynamics. It does so by augmenting the natural factored form of
dynamic Bayesian networks (DBNs) with user response variables that are captured by a logistic
regression model (the latter being precisely the model used for myopic prediction of user responses).
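As a rough illustration of this construction (a toy sketch, not the paper's implementation), the transition model mixes the DBN's conditional probability entries by conditioning on a binary user-response variable whose distribution is given by a logistic regression over state/action features. All weights and CPD entries below are made up for the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def response_prob(weights, features):
    # Logistic regression over (sparse, binarized) state/action features:
    # the same kind of model used for myopic response prediction.
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def transition_prob(p_next_given_r, weights, features):
    """Mix the DBN transition CPD entries by conditioning on the binary
    response variable r (e.g., click / no-click)."""
    p_r = response_prob(weights, features)
    return p_r * p_next_given_r[1] + (1.0 - p_r) * p_next_given_r[0]

# Toy example: 3 binary features; CPD entries for r=0 and r=1.
p = transition_prob({0: 0.2, 1: 0.7}, [1.5, -0.5, 0.25], [1, 0, 1])
```

Conditioning on the response variable in this way is what restores the compact factored form that the ALP machinery later exploits.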

The second question we address is how best to solve large logistic MDPs of this type. A
variety of methods have been proposed for solving MDPs that exploit the conditional independence reflected in the DBN representations, including approximate linear programming (ALP).
Despite their compact form, logistic MDPs do not admit the same conditional independence as
DBNs, nor do they satisfy the linearity requirements for standard ALP. We propose a constraint
generation approach to ALP for logistic MDPs that circumvents these problems by: (a) recovering
compactness by conditioning on the logistic response variable; and (b) devising two procedures,
one exact and one approximate, that linearize the search for violated constraints in the master LP.
For the approximation procedure, we also derive error bounds on the quality of the induced policy. We demonstrate the effectiveness of our approach on advertising problems with up to several
thousand sparse binarized features (up to 2^54 states and 2^39 actions).
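The constraint-generation scheme can be illustrated on an ordinary toy MDP with a tabular (per-state indicator) basis, where the most violated Bellman constraint can be found by brute-force enumeration; the paper's contribution is precisely replacing this enumeration with exact and approximate linearized searches that scale to logistic MDPs. A minimal cutting-plane sketch using SciPy (all MDP parameters invented for the example):

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP: 2 states, 2 actions.
R = np.array([[1.0, 0.0], [0.0, 2.0]])            # R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],           # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
gamma, mu = 0.9, np.array([0.5, 0.5])             # discount, state-relevance weights

A_ub, b_ub = [], []                               # constraints found so far
for _ in range(10):                               # master LP / cutting-plane loop
    res = linprog(mu,
                  A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  bounds=[(-100, 100)] * 2)
    V = res.x
    # Search for the most violated Bellman constraint
    # V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    # (by enumeration here; the paper linearizes this search instead).
    worst, row, rhs = 1e-6, None, None
    for s in range(2):
        for a in range(2):
            viol = R[s, a] + gamma * P[s, a] @ V - V[s]
            if viol > worst:
                # In A_ub x <= b_ub form: gamma*P(.|s,a) - e_s <= -R(s,a).
                worst, row, rhs = viol, gamma * P[s, a] - np.eye(2)[s], -R[s, a]
    if row is None:
        break                                     # no violated constraint remains
    A_ub.append(row)
    b_ub.append(rhs)
```

On this toy instance there are only four Bellman constraints in total, so the loop terminates after adding at most four cuts; the point of the procedure is that in the large logistic-MDP setting one never enumerates them explicitly.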
