Fabian Pedregosa

Fabian Pedregosa

I am a research scientist in the Brain group in Montreal. My main area of expertise is optimization. My previous work includes analysis and development of parallel stochastic methods, the development of more practical methods for optimizing non-smooth functions and new methods for smooth hyperparameter optimization. More broadly, I'm interested in optimization, machine learning and scientific software.

I am also one of the founders of the scikit-learn machine learning library, of which I have been core developer and maintainer from 2010 to 2012, and I have been an active contributor to many open source projects including the computer algebra system sympy and the python profiler memory-profiler.

My current editorial involvement includes being the managing editor of the journal of machine learning research (JMLR) and reviewer for the conferences NIPS and ICML and the journals Mathematical Programming, JMLR.

Finally, I maintain a blog about my research at http://fa.bianp.net.

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    GradMax: Growing Neural Networks using Gradient Information
    Bart van Merriënboer
    Thomas Unterthiner
    The International Conference on Learning Representations (2022)
    Preview abstract The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in variety of vision tasks and architectures. View details
    Super-Acceleration with Cyclical Step-sizes
    Baptiste Goujaud
    Damien Scieur
    Aymeric Dieuleveut
    Adrien B. Taylor
    Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR (2022)
    Preview abstract Cyclical step-sizes have become increasingly popular in deep learning. Motivated by recent observations on the spectral gaps of Hessians in machine learning, we show that these step-size schedules offer a simple way to exploit such properties. More precisely, we develop a convergence rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal first order methods for quadratic minimization with given spectral structure. Finally, we provide a local convergence rate analysis beyond quadratic minimization for those methods, and illustrate these findings through benchmarks on least squares and logistic regression problems. View details
    Average-case Acceleration for Bilinear Games and Normal Matrices
    Carles Domingo-Enrich
    Damien Scieur
    International Conference on Learning Representations (2021)
    Preview abstract Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through benchmarks with a varying degree of mismatch with our assumptions View details
    SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality
    Courtney Paquette
    Elliot Paquette
    Kiwon Lee
    Proceedings of Machine Learning Research (2021)
    Preview abstract We propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both number of samples and dimensions are large. This framework applies to any fixed stepsize and the finite sum setting. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. View details
    On the interplay between noise and curvature and its effect on optimization and generalization
    Valentin Thomas
    Bart van Merriënboer
    Yoshua Bengio
    Nicolas Le Roux
    Proceedings of the 23rdInternational Conference on Artificial Intelligence and Statistics (AISTATS) (2020)
    Preview abstract This work revisits the notion of \textit{information criterion} to characterize generalization for modern deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi Information Criterion, an extension of the Akaike Information Criterion for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss $\rmH$ and the covariance matrix of the gradients $\rmSS$. By exploring the semantic and numerical similarities and differences between these two matrices as well as the Fisher information matrix $\rmF$, we bring further evidence that flatness cannot in itself predict generalization. We also address the question of when is $\rmSS$ a reasonable approximation to $\rmF$, as commonly assumed. View details
    Average-case Acceleration Through Spectral Density Estimation
    Damien Scieur
    Proceedings of the 37th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research (2020)
    Preview abstract We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic, logistic regression and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods. View details
    Universal Average-Case Optimality of Polyak Momentum
    Damien Scieur
    Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research (2020)
    Preview abstract Polyak momentum (PM), also known as the heavy-ball method, is a widely used optimization method that enjoys an optimal worst-case complexity on quadratic objectives. However, it's remarkable empirical success is not fully explained by this optimality, as the worst-case analysis is not representative of the typical complexity of an algorithm. In this work we establish a novel link between PM and the average-case analysis, which contrary to worst-case, is representative of the typical complexity of an algorithm. Our main contribution is to prove that \emph{any} method that has optimal complexity in the average-case analysis, converges under mild assumptions in the number of iterations to PM. This brings a new perspective on this classical method, showing that PM is both worst-case and (asymptotically) average-case optimal. View details
    Linearly Convergent Frank-Wolfe with Backtracking Line-Search
    Geoffrey Negiar
    Armin Askari
    Martin Jaggi
    Proceedings of the 23rdInternational Conference on Artificial Intelligence and Statistics (AISTATS) (2020)
    Preview abstract Structured constraints in Machine Learning have recently brought the Frank-Wolfe (FW) family of algorithms back in the spotlight. While the classical FW algorithm has poor local convergence properties, the Away-steps and Pairwise FW variants have emerged as improved variants with faster convergence. However, these improved variants suffer from two practical limitations: they require at each iteration to solve a 1-dimensional minimization problem to set the step-size and also require the Frank-Wolfe linear subproblems to be solved exactly. In this paper, we propose variants of Away-steps and Pairwise FW that lift both restrictions simultaneously. The proposed methods set the step-size based on a sufficient decrease condition, and do not require prior knowledge of the objective. Furthermore, they inherit all the favorable convergence properties of the exact line-search version, including linear convergence for strongly convex functions over polytopes. Benchmarks on different machine learning problems illustrate large performance gains of the proposed variants. View details
    Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization
    Geoffrey Negiar
    Gideon Dresdner
    Alicia Tsai
    Laurent El Ghaoui
    Francesco Locatello
    Proceedings of the 37th International Conference on Machine Learning (ICML) (2020)
    Preview abstract We propose a novel Stochastic Frank-Wolfe ($\equiv$ Conditional Gradient) algorithm with fixed batch size tailored to the constrained optimization of a finite sum of smooth objectives. We make use of a primal-dual interpretation of the Frank-Wolfe algorithm. Recent work to design stochastic variants of the Frank-Wolfe algorithm fall into two categories: algorithms with increasing batch size, and algorithms with constant batch size. The former have faster convergence rates but are impractical; the latter are practical but slower. Our method combines the advantages of both: it converges for any constant batch size, and has faster theoretical worst case rates than previous fixed batch size algorithms. Our experiments also show faster empirical convergence than previous fixed batch methods for several tasks. Finally, we construct a stochastic estimator of the Frank-Wolfe gap. It allows us to bound the true Frank-Wolfe gap, which is a primal-dual gap in the convex case and a measure of stationarity in general. Our gap estimator can therefore be used as a practical stopping criterion in all cases. View details