Zachary Charles
Researcher in federated optimization and federated learning. Interested in distributed learning, communication-efficient learning, robustness, fairness, and applied mathematics. Received a PhD in applied mathematics from the University of Wisconsin-Madison.
Authored Publications
VaultGemma
Lynn Chua
Prem Eruvbetine
Chiyuan Zhang
Thomas Mesnard
Borja De Balle Pigem
Daogao Liu
Amer Sinha
Pritish Kamath
Yangsibo Huang
Christopher A. Choquette-Choo
George Kaissis
Armand Joulin
Da Yu
Ryan McKenna
arXiv (2025)
In this work, we present VaultGemma 1B, a model in the Gemma family trained entirely with differential privacy. VaultGemma 1B is a 1-billion-parameter pretrained model based on the Gemma 2 series of models and uses the same training dataset. We will be releasing a tech report and the weights of this model.
Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
Gabriel Teston
Lucio Dery
Nova Fallen
Arthur Szlam
Arthur Douillard
(2025) (to appear)
As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, that work does not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
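DiLoCo, as described above, alternates many local optimizer steps on each model replica with infrequent synchronization, where the averaged change in parameters is applied as an outer gradient. The toy sketch below (plain Python with NumPy) illustrates that inner/outer structure on a simple quadratic objective; the replica count, step counts, and use of SGD with heavy-ball momentum for the outer update are assumptions of the sketch, not the configuration studied in the paper.

# Toy DiLoCo-style loop: each replica takes many local steps, and the averaged
# parameter change is applied as an outer gradient with heavy-ball momentum.
# Replica count, step counts, and optimizer choices are assumptions of this sketch.
import numpy as np

rng = np.random.default_rng(0)
dim, num_replicas = 10, 4
targets = [rng.normal(size=dim) for _ in range(num_replicas)]  # stand-in for per-replica data

def local_grad(w, target):
    # Gradient of the toy per-replica loss 0.5 * ||w - target||^2.
    return w - target

w_global = np.zeros(dim)
outer_velocity = np.zeros(dim)
inner_lr, outer_lr, outer_momentum, inner_steps = 0.1, 0.7, 0.8, 20

for _ in range(100):
    deltas = []
    for target in targets:
        w_local = w_global.copy()
        for _ in range(inner_steps):              # many local steps, no communication
            w_local -= inner_lr * local_grad(w_local, target)
        deltas.append(w_global - w_local)         # local change, used as an outer gradient
    outer_grad = np.mean(deltas, axis=0)          # a single synchronization per outer round
    outer_velocity = outer_momentum * outer_velocity + outer_grad
    w_global -= outer_lr * outer_velocity

print("distance to mean target:", np.linalg.norm(w_global - np.mean(targets, axis=0)))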
Leveraging Function Space Aggregation for Federated Learning at Scale
Nikita Dhawan
Karolina Dziugaite
Transactions on Machine Learning Research (2024)
The federated learning paradigm has motivated the development of methods for aggregating multiple client updates into a global server model, without sharing client data. Many federated learning algorithms, including the canonical Federated Averaging (FedAvg), take a direct (possibly weighted) average of the client parameter updates, motivated by results in distributed optimization. In this work, we adopt a function space perspective and propose a new algorithm, FedFish, that aggregates local approximations to the functions learned by clients, using an estimate based on their Fisher information. We evaluate FedFish on realistic, large-scale cross-device benchmarks. While the performance of FedAvg can suffer as client models drift further apart, we demonstrate that FedFish is more robust to longer local training. Our evaluation across several settings in image and language benchmarks shows that FedFish outperforms FedAvg as local training epochs increase. Further, FedFish results in global networks that are more amenable to efficient personalization via local fine-tuning on the same or shifted data distributions. For instance, federated pretraining on the C4 dataset, followed by few-shot personalization on Stack Overflow, results in a 7% improvement in next-token prediction by FedFish over FedAvg.
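As a rough illustration of aggregating clients by Fisher information rather than by plain parameter averaging, the sketch below merges linear-model clients with an elementwise diagonal-Fisher weighting. This is a generic Fisher-weighted average for intuition only; the linear-Gaussian model and the exact weighting are assumptions of the sketch and do not reproduce the FedFish update.

# Generic Fisher-weighted averaging of linear-model clients. The linear-Gaussian
# model and elementwise diagonal weighting are assumptions of this illustration;
# this is not the FedFish update as defined in the paper.
import numpy as np

rng = np.random.default_rng(0)
dim, num_clients = 5, 3

def diagonal_fisher(inputs):
    # For a linear model f(x) = w @ x with unit-variance Gaussian likelihood,
    # the Fisher information is E[x x^T]; its diagonal is the mean of x**2.
    return np.mean(inputs**2, axis=0)

client_params, client_fishers = [], []
for _ in range(num_clients):
    inputs = rng.normal(scale=rng.uniform(0.5, 2.0), size=(100, dim))
    w_k = rng.normal(size=dim)          # stand-in for a locally trained model
    client_params.append(w_k)
    client_fishers.append(diagonal_fisher(inputs))

params, fishers = np.stack(client_params), np.stack(client_fishers)
fedavg = params.mean(axis=0)                                      # plain parameter average
fisher_weighted = (fishers * params).sum(axis=0) / fishers.sum(axis=0)
print("FedAvg-style average: ", fedavg)
print("Fisher-weighted merge:", fisher_weighted)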
Federated Automatic Differentiation
Journal of Machine Learning Research (JMLR), 25 (2024), pp. 1-39
Federated learning (FL) is a framework for learning across group-partitioned data (heterogeneous clients) while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (e.g., at each client), typically using automatic differentiation (AD) techniques. In this work, we consider the problem of applying AD to federated computations while preserving compatibility with privacy-enhancing technologies. We propose a framework, federated automatic differentiation (federated AD), that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. We show, in analogy with AD, that federated AD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these modes of federated AD, implying that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using the same primitives. We then show how federated AD can be used to create algorithms that dynamically learn components of the algorithm itself. We demonstrate that the performance of FedAvg-style algorithms can be significantly improved by using federated AD in this manner.
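One way to see the closure property described above: the gradient of a computation built from broadcast, local per-client computation, and aggregation can itself be written with the same three primitives. The toy sketch below illustrates this with hand-written derivatives for scalar quadratic client losses; the losses and the mean aggregator are assumptions of the illustration, not the federated AD framework itself.

# Toy illustration: the gradient of a broadcast -> local-compute -> aggregate
# computation is again a broadcast -> local-compute -> aggregate computation.
# The quadratic per-client losses and mean aggregator are assumptions of this sketch.

def broadcast(value, num_clients):
    return [value] * num_clients

def aggregate_mean(values):
    return sum(values) / len(values)

client_targets = [1.0, 2.0, 5.0]

def global_loss(w):
    local_w = broadcast(w, len(client_targets))
    local_losses = [0.5 * (wi - t) ** 2 for wi, t in zip(local_w, client_targets)]
    return aggregate_mean(local_losses)

def global_loss_grad(w):
    # The derivative uses the same primitives: broadcast the point, differentiate
    # each client's loss locally, then aggregate the per-client derivatives.
    local_w = broadcast(w, len(client_targets))
    local_grads = [wi - t for wi, t in zip(local_w, client_targets)]
    return aggregate_mean(local_grads)

w = 0.0
for _ in range(100):
    w -= 0.5 * global_loss_grad(w)
print(w, global_loss(w))  # w approaches the mean of the client targets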
Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning
Krishna Pillutla
Michael Reneer
37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track
We introduce Dataset Grouper, a library to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library facilitates the creation of group-structured versions of existing datasets based on user-specified partitions and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper enables large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work, allowing for federated training of language models with hundreds of millions, and even billions, of parameters. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation. Dataset Grouper is available at https://github.com/google-research/dataset_grouper.
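Purely to illustrate what group-structured (partitioned) data looks like, the generic sketch below buckets records by a user-specified key and streams one group at a time. The record format and helper functions are hypothetical and are not the Dataset Grouper API; see the linked repository for the actual library.

# Generic sketch of group-structured partitioning: bucket records by a
# user-specified key so each group (e.g., a client) can be iterated separately.
# The record format and helper functions are hypothetical; this is not the
# Dataset Grouper API (see https://github.com/google-research/dataset_grouper).
from collections import defaultdict

records = [
    {"author": "alice", "text": "first post"},
    {"author": "bob", "text": "a reply"},
    {"author": "alice", "text": "second post"},
]

def partition_by(records, key_fn):
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return groups

def iter_group(groups, group_id):
    # A real pipeline would stream each group from disk, so that even a single
    # group's data never needs to fit in memory at once.
    yield from groups[group_id]

groups = partition_by(records, key_fn=lambda r: r["author"])
for example in iter_group(groups, "alice"):
    print(example["text"])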
Compressing model updates is critical for reducing communication costs in federated learning. We examine the problem using rate-distortion theory to present a compression method that is near-optimal in many use cases. We empirically show that common transforms applied to model updates in standard compression algorithms, such as normalization in QSGD and random rotation in DRIVE, yield sub-optimal compressed representations in practice.
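For context on the transforms mentioned above, the sketch below shows QSGD-style stochastic quantization of a model update: scale by the update's norm, then randomly round each coordinate onto a small grid so the quantizer is unbiased. The 15-level grid and the absence of an entropy coder are simplifications of this sketch.

# QSGD-style stochastic quantization of a model update: scale by the update's
# norm, then randomly round each coordinate onto a small grid so the quantizer
# is unbiased. The 15-level grid and the absence of an entropy coder are
# simplifications of this sketch.
import numpy as np

def qsgd_quantize(update, num_levels=15, rng=None):
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    if norm == 0.0:
        return np.zeros(update.shape, dtype=np.int8), 0.0
    scaled = np.abs(update) / norm * num_levels
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part (unbiased rounding).
    levels = lower + (rng.random(update.shape) < (scaled - lower))
    return (np.sign(update) * levels).astype(np.int8), norm

def qsgd_dequantize(levels, norm, num_levels=15):
    return levels.astype(np.float64) * norm / num_levels

rng = np.random.default_rng(0)
update = rng.normal(size=1000)
levels, norm = qsgd_quantize(update, rng=rng)
recovered = qsgd_dequantize(levels, norm)
print("relative error:", np.linalg.norm(recovered - update) / np.linalg.norm(update))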
Model sizes are limited in Federated Learning due to communication bandwidth constraints and on-device memory constraints. The success of scaling model sizes in other machine learning domains, especially when it comes to generalizing to new data distributions, motivates the development of methods for training large-scale models in Federated Learning. Inspired by dropout, [3] proposed Federated Dropout as a way of scaling up model sizes: clients train randomly selected subsets of the larger server model. In spite of the promising empirical results and the many other works that build on it [1, 8, 13], we argue in this paper that the metrics used to measure the performance of Federated Dropout and its variants are misleading. We propose and perform new experiments which suggest that Federated Dropout is actually detrimental to scaling efforts. We show how a simple ensembling technique outperforms Federated Dropout and other baselines. We perform ablations which suggest that the best-performing variations of Federated Dropout attempt to approximate ensembling. The simplicity of ensembling allows for easy, practical implementations. Furthermore, our ensembling technique naturally leverages the parallelizable nature of Federated Learning: it is easy to train several models independently because there are many clients and server compute is not the bottleneck. Ensembling's strong performance against our baselines suggests that Federated Learning models may be more easily scaled than previously thought, e.g., via boosting.
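The ensembling baseline referenced above is straightforward to express: train several models independently (cheap in federated settings, where clients are plentiful and server compute is not the bottleneck) and average their predictions at inference time. The sketch below uses small logistic-regression members on synthetic data; the member count, model family, and data are assumptions of the illustration.

# The simple ensembling baseline: train a few models independently and average
# their predicted probabilities at inference time. The logistic-regression
# members and synthetic data are assumptions of this illustration.
import numpy as np

rng = np.random.default_rng(0)
dim, n_per_member, num_members = 10, 500, 4
true_w = rng.normal(size=dim)

def make_data(n):
    x = rng.normal(size=(n, dim))
    y = (x @ true_w + 0.5 * rng.normal(size=n) > 0).astype(float)
    return x, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(x, y, lr=0.1, steps=300):
    w = np.zeros(dim)
    for _ in range(steps):
        w -= lr * x.T @ (sigmoid(x @ w) - y) / len(y)
    return w

# Each member is trained on its own shard (e.g., its own set of clients).
members = [train_logreg(*make_data(n_per_member)) for _ in range(num_members)]

x_test, y_test = make_data(2000)
ensemble_probs = np.mean([sigmoid(x_test @ w) for w in members], axis=0)
single_probs = sigmoid(x_test @ members[0])
print("single-model accuracy:", np.mean((single_probs > 0.5) == y_test))
print("ensemble accuracy:    ", np.mean((ensemble_probs > 0.5) == y_test))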
The federated learning (FL) framework trains a machine learning model using decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular optimization algorithms of FL use vanilla (stochastic) gradient descent for both local updates at clients and global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have been studied for server updates. However, the effect of using adaptive optimization methods for local updates at clients is not yet understood. We show in both theory and practice that while local adaptive methods can accelerate convergence, they can cause a non-vanishing solution bias, where the final converged solution may be different from the stationary point of the global objective function. We propose correction techniques to overcome this inconsistency and complement the local adaptive methods for FL. Extensive experiments on realistic federated training tasks show that the proposed algorithms can achieve faster convergence and higher test accuracy than the baselines without local adaptivity.
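To make the setup concrete, the sketch below runs FedAvg with Adagrad-style local client updates on heterogeneous quadratic objectives and reports the gap to the global stationary point, illustrating the kind of solution bias discussed above. The objectives and hyperparameters are assumptions of the sketch, and the paper's correction techniques are not reproduced here.

# FedAvg with Adagrad-style local client updates on heterogeneous quadratic
# objectives. The objectives and hyperparameters are assumptions of this
# illustration, and the paper's correction techniques are not reproduced.
import numpy as np

rng = np.random.default_rng(0)
dim, num_clients, local_steps = 5, 4, 10
client_targets = [rng.normal(size=dim) for _ in range(num_clients)]
client_scales = [rng.uniform(0.5, 3.0, size=dim) for _ in range(num_clients)]

def client_grad(w, target, scale):
    # Gradient of the heterogeneous quadratic 0.5 * sum(scale * (w - target)**2).
    return scale * (w - target)

w_global = np.zeros(dim)
for _ in range(200):
    client_models = []
    for target, scale in zip(client_targets, client_scales):
        w, accum = w_global.copy(), np.zeros(dim)
        for _ in range(local_steps):
            g = client_grad(w, target, scale)
            accum += g**2                               # local Adagrad accumulator
            w -= 0.1 * g / (np.sqrt(accum) + 1e-8)
        client_models.append(w)
    w_global = np.mean(client_models, axis=0)           # plain server averaging

# The global objective's stationary point is the scale-weighted mean of targets.
weighted_sum = sum(s * t for s, t in zip(client_scales, client_targets))
stationary = weighted_sum / sum(client_scales)
print("gap to global stationary point:", np.linalg.norm(w_global - stationary))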
A Field Guide to Federated Optimization
Jianyu Wang
Zheng Xu
Gauri Joshi
Maruan Al-Shedivat
Galen Andrew
A. Salman Avestimehr
Katharine Daly
Deepesh Data
Suhas Diggavi
Hubert Eichner
Advait Gadhikar
Antonious M. Girgis
Filip Hanzely
Chaoyang He
Samuel Horvath
Martin Jaggi
Tara Javidi
Satyen Chandrakant Kale
Sai Praneeth Karimireddy
Jakub Konečný
Sanmi Koyejo
Tian Li
Peter Richtarik
Karan Singhal
Virginia Smith
Mahdi Soltanolkotabi
Weikang Song
Sebastian Stich
Ameet Talwalkar
Hongyi Wang
Blake Woodworth
Honglin Yuan
Manzil Zaheer
Mi Zhang
Tong Zhang
Chunxiang (Jake) Zheng
Chen Zhu
arXiv (2021)
Federated learning and analytics are distributed approaches for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating, and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
Adaptive Federated Optimization
Manzil Zaheer
Jakub Konečný
(2021)
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Yogi and Adam, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can improve the performance of federated learning.
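Concretely, these methods keep simple SGD for local client updates but treat the averaged client update as a pseudo-gradient for an adaptive server optimizer. The sketch below shows an Adam-style server update on toy quadratic client objectives; the objectives and hyperparameters are assumptions of the sketch, not the paper's experimental setup.

# FedAdam-style round: clients run local SGD, and the server applies an
# Adam-style update to the averaged client delta treated as a pseudo-gradient.
# The quadratic client objectives and hyperparameters are assumptions of this
# sketch, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)
dim, num_clients, local_steps = 5, 8, 5
client_targets = [rng.normal(size=dim) for _ in range(num_clients)]

w = np.zeros(dim)
m, v = np.zeros(dim), np.zeros(dim)
client_lr, server_lr, beta1, beta2, tau = 0.05, 0.05, 0.9, 0.99, 1e-3

for _ in range(300):
    deltas = []
    for target in client_targets:
        w_local = w.copy()
        for _ in range(local_steps):
            w_local -= client_lr * (w_local - target)   # local SGD on a quadratic loss
        deltas.append(w_local - w)
    pseudo_grad = np.mean(deltas, axis=0)               # averaged client update
    m = beta1 * m + (1 - beta1) * pseudo_grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * pseudo_grad**2        # second-moment estimate
    w += server_lr * m / (np.sqrt(v) + tau)             # Adam-style server step

print("distance to mean of client targets:", np.linalg.norm(w - np.mean(client_targets, axis=0)))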