Peter Kairouz
Peter Kairouz is a researcher interested in machine learning, security, and privacy. At Google, he is a Research Scientist working on decentralized and privacy-preserving machine learning algorithms. Prior to Google, his doctoral and postdoctoral research have largely focused on building decentralized technologies for anonymous broadcasting over complex networks, understanding the fundamental trade-off between data privacy and utility, and leveraging state-of-the-art deep generative models for data-driven privacy. You can learn more about his background and research by visiting his Stanford webpage. Some of his recent Google publications are listed below.
Authored Publications
Sort By
Preview abstract
Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. With the rise of foundation models, a number of new synthetic data algorithms privately finetune the weights of foundation models to improve over existing approaches to generating private synthetic data. In this work, we propose two algorithms for using API access only to generate DP tabular synthetic data. We extend the Private Evolution algorithm \citep{lin2023differentially, xie2024differentially} to the tabular data domain, define a workload-based distance measure, and propose a family of algorithms that use one-shot API access to LLMs.
View details
Differentially Private Insights into AI Use
Daogao Liu
Pritish Kamath
Alexander Knop
Adam Sealfon
Da Yu
Chiyuan Zhang
Conference on Language Modeling (COLM) 2025 (2025)
Preview abstract
We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private method inspired by CLIO (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
View details
Preview abstract
Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. With the rise of foundation models, a number of new synthetic data algorithms privately finetune the weights of foundation models to improve over existing approaches to generating private synthetic data. In this work, we propose two algorithms for using API access only to generate DP tabular synthetic data. We extend the Private Evolution algorithm \citep{lin2023differentially, xie2024differentially} to the tabular data domain, define a workload-based distance measure, and propose a family of algorithms that use one-shot API access to LLMs.
View details
Preview abstract
Service providers of large language model (LLM) applications collect user instructions in the wild and use them in further aligning LLMs with users’ intentions. These instructions, which potentially contain sensitive information, are annotated by human workers in the process. This poses a new privacy risk not addressed by the typical private optimization. To this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. Formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. Crucial in achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. In both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing comparable results to real instructions. In supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as Vicuna
View details
Improved Communication-Privacy Trade-offs in L2 Mean Estimation under Streaming Differential Privacy
Wei-Ning Chen
Albert No
Sewoong Oh
Zheng Xu
International Conference on Machine Learning (ICML) (2024)
Preview abstract
We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square errors (MSEs); secondly, schemes achieving order-optimal communication-privacy trade-offs do not extend seamlessly to streaming differential privacy (DP) settings (e.g., tree aggregation or matrix factorization), rendering them incompatible with DP-FTRL type optimizers.
In this work, we tackle these issues by introducing a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP noise. Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that fast converge to those of the uncompressed Gaussian mechanism. Additionally, we extend the sparsification scheme to the matrix factorization framework under streaming DP and provide a precise accountant tailored for DP-FTRL type optimizers. Empirically, our method demonstrates at least a 100x improvement of compression for DP-SGD across various FL tasks.
View details
Federated Learning of Gboard Language Models with Differential Privacy
Zheng Xu
Yanxiang Zhang
Galen Andrew
Christopher Choquette
Jesse Rosenstock
Yuanbo Zhang
ACL industry track (2023) (to appear)
Preview abstract
We train language models (LMs) with federated learning (FL) and differential privacy (DP) in the Google Keyboard (Gboard). We apply the DP-Follow-the-Regularized-Leader (DP-FTRL)~\citep{kairouz21b} algorithm to achieve meaningfully formal DP guarantees without requiring uniform sampling of client devices.
To provide favorable privacy-utility trade-offs, we introduce a new client participation criterion and discuss the implication of its configuration in large scale systems. We show how quantile-based clip estimation~\citep{andrew2019differentially} can be combined with DP-FTRL to adaptively choose the clip norm during training or reduce the hyperparameter tuning in preparation for training.
With the help of pretraining on public data, we train and deploy more than twenty Gboard LMs that achieve high utility and $\rho-$zCDP privacy guarantees with $\rho \in (0.2, 2)$, with two models additionally trained with secure aggregation~\citep{bonawitz2017practical}.
We are happy to announce that all the next word prediction neural network LMs in Gboard now have DP guarantees, and all future launches of Gboard neural network LMs will require DP guarantees.
We summarize our experience and provide concrete suggestions on DP training for practitioners.
View details
A Field Guide to Federated Optimization
Jianyu Wang
Zheng Xu
Gauri Joshi
Maruan Al-Shedivat
Galen Andrew
A. Salman Avestimehr
Katharine Daly
Deepesh Data
Suhas Diggavi
Hubert Eichner
Advait Gadhikar
Antonious M. Girgis
Filip Hanzely
Chaoyang He
Samuel Horvath
Martin Jaggi
Tara Javidi
Satyen Chandrakant Kale
Sai Praneeth Karimireddy
Jakub Konečný
Sanmi Koyejo
Tian Li
Peter Richtarik
Karan Singhal
Virginia Smith
Mahdi Soltanolkotabi
Weikang Song
Sebastian Stich
Ameet Talwalkar
Hongyi Wang
Blake Woodworth
Honglin Yuan
Manzil Zaheer
Mi Zhang
Tong Zhang
Chunxiang (Jake) Zheng
Chen Zhu
arxiv (2021)
Preview abstract
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
View details
Privacy-first Health Research with Federated Learning
Adam Sadilek
Dung Nguyen
Methun Kamruzzaman
Benjamin Rader
Stefan Mellem
Elaine O. Nsoesie
Jamie MacFarlane
Anil Vullikanti
Madhav Marathe
Paul C. Eastham
John S. Brownstein
npj Digital Medicine (2021)
Preview abstract
Privacy protection is paramount in conducting health research. However, studies often rely on data stored in a centralized repository, where analysis is done with full access to the sensitive underlying content. Recent advances in federated learning enable building complex machine-learned models that are trained in a distributed fashion. These techniques facilitate the calculation of research study endpoints such that private data never leaves a given device or healthcare system. We show—on a diverse set of single and multi-site health studies—that federated models can achieve similar accuracy, precision, and generalizability, and lead to the same interpretation as standard centralized statistical models while achieving considerably stronger privacy protections and without significantly raising computational costs. This work is the first to apply modern and general federated learning methods that explicitly incorporate differential privacy to clinical and epidemiological research—across a spectrum of units of federation, model architectures, complexity of learning tasks and diseases. As a result, it enables health research participants to remain in control of their data and still contribute to advancing science—aspects that used to be at odds with each other.
View details
Preview abstract
Building privacy-preserving systems for machine learning and data science on decentralized data
View details
Practical and Private (Deep) Learning without Sampling or Shuffling
Preview
Om Thakkar
Abhradeep Thakurta
Zheng Xu
38th International Conference on Machine Learning (ICML 2021) (2021) (to appear)