Daniel Golovin

Daniel Golovin

Daniel Golovin currently leads the Google DeepMind group located in Pittsburgh. He works broadly on machine learning driven optimization and automated experimental design -- looking at how we can leverage ML to automatically design superior systems and products across the board.

He founded Vizier which is widely used across Google, including what might be the largest AI-driven culinary optimization in history - the ML Cookie experiment.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We are living in a golden age of machine learning. Powerful models are being trained to perform many tasks far better than is possible using traditional software engineering approaches alone. However, developing and deploying those models in existing software systems remains difficult. In this paper we present SmartChoices, a novel approach to incorporating machine learning into mature software stacks easily, safely, and effectively. We explain the overall design philosophy and present case studies using SmartChoices within large scale industrial systems. View details
    Preview abstract Single-objective black box optimization (also known as zeroth-order optimization) is the process of minimizing a scalar objective $f(x)$, given evaluations at adaptively chosen inputs $x$. In this paper, we consider multi-objective optimization, where $f(x)$ outputs a vector of possibly competing objectives and the goal is to converge to the Pareto frontier. Quantitatively, we wish to maximize the standard \emph{hypervolume indicator} metric, which measures the dominated hypervolume of the entire set of chosen inputs. In this paper, we introduce a novel scalarization function, which we term the \emph{hypervolume scalarization}, and show that drawing random scalarizations from an appropriately chosen distribution can be used to efficiently approximate the \emph{hypervolume indicator} metric. We utilize this connection to show that Bayesian optimization with our scalarization via common acquisition functions, such as Thompson Sampling or Upper Confidence Bound, provably converges to the whole Pareto frontier by deriving tight \emph{hypervolume regret} bounds on the order of $\widetilde{O}(\sqrt{T})$. Furthermore, we highlight the general utility of our scalarization framework by showing that any provably convergent single-objective optimization process can be converted to a multi-objective optimization process with provable convergence guarantees. View details
    Preview abstract Zeroth-order optimization is the process of minimizing an objective $f(x)$ given oracle access to evaluations at adaptively chosen inputs $x$. In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithm from a novel geometric perspective and we derive two invariance properties of our algorithm: monotone and affine invariance. Specifically, for {\it any monotone transform} of a smooth and strongly convex objective with latent dimension $k$, then we present a novel analysis that shows convergence within an $\epsilon$-ball of the optimum in $O(kQ\log(n)\log(R/\epsilon))$ evaluations, where the input dimension is $n$, $R$ is the diameter of the input space and $Q$ is the condition number. Our rates are the first of its kind to be both 1) poly-logarithmically dependent on dimensionality and 2) invariant under monotone transformations. From our geometric perspective, we can further show that our analysis is optimal. We emphasize that monotone and affine invariance are key to the empirical success of gradientless algorithms, as demonstrated on BBOB and MuJoCo benchmarks. View details
    Bayesian Optimization for a Better Dessert
    Subhodeep Moitra
    D. Sculley
    Proceedings of the 2017 NIPS Workshop on Bayesian Optimization, December 9, 2017, Long Beach, USA (to appear)
    Preview abstract We present a case study on applying Bayesian Optimization to a complex real-world system; our challenge was to optimize chocolate chip cookies. The process was a mixed-initiative system where both human chefs, human raters, and a machine optimizer participated in 144 experiments. This process resulted in highly rated cookies that deviated from expectations in some surprising ways -- much less sugar in California, and cayenne in Pittsburgh. Our experience highlights the importance of incorporating domain expertise and the value of transfer learning approaches. View details
    Preview abstract We present a simple and robust optimization algorithm related to genetic algorithms, and with analogies to the popular CMA-ES search algorithm, that serves as a cheap alternative to Bayesian Optimization. The algorithm is robust against both monotonic transforms of the objective function value and affine transformations of the feasible region. It is fast and easy to implement, and has performance comparable to CMA-ES on a suite of benchmarks while spending less CPU in the optimization algorithm, and can exhibit better overall performance than Bayesian Optimization when the objective function is cheap. View details
    Preview abstract Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google's Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides. View details
    Machine Learning: The High Interest Credit Card of Technical Debt
    D. Sculley
    Eugene Davydov
    Dietmar Ebner
    Vinay Chaudhary
    Michael Young
    SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)
    Preview abstract Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns. View details
    Ad Click Prediction: a View from the Trenches
    D. Sculley
    Michael Young
    Dietmar Ebner
    Julian Grady
    Lan Nie
    Eugene Davydov
    Sharat Chikkerur
    Dan Liu
    Arnar Mar Hrafnkelsson
    Tom Boulos
    Jeremy Kubica
    Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2013)
    Preview abstract Predicting ad click--through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for assessing and visualizing performance, practical methods for providing confidence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be beneficial for us, despite promising results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system. View details
    Large-Scale Learning with Less RAM via Randomization
    D. Sculley
    Michael Young
    Proceedings of the 30 International Conference on Machine Learning (ICML) (2013), pp. 10
    Preview abstract We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs. View details