Peter Kairouz

Peter Kairouz is a researcher interested in machine learning, security, and privacy. At Google, he is a Research Scientist working on decentralized and privacy-preserving machine learning algorithms. Before joining Google, his doctoral and postdoctoral research largely focused on building decentralized technologies for anonymous broadcasting over complex networks, understanding the fundamental trade-off between data privacy and utility, and leveraging state-of-the-art deep generative models for data-driven privacy. You can learn more about his background and research by visiting his Stanford webpage. Some of his recent Google publications are listed below.
Authored Publications
    Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. With the rise of foundation models, a number of new synthetic data algorithms privately finetune the weights of foundation models to improve over existing approaches to generating private synthetic data. In this work, we propose two algorithms for using API access only to generate DP tabular synthetic data. We extend the Private Evolution algorithm (Lin et al., 2023; Xie et al., 2024) to the tabular data domain, define a workload-based distance measure, and propose a family of algorithms that use one-shot API access to LLMs.
    Differentially Private Insights into AI Use
    Daogao Liu
    Pritish Kamath
    Alexander Knop
    Adam Sealfon
    Da Yu
    Chiyuan Zhang
    Conference on Language Modeling (COLM) 2025 (2025)
    We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private method inspired by CLIO (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
    Confidential Federated Computations
    Hubert Eichner
    Dzmitry Huba
    Brett McLarnon
    Timon Van Overveldt
    Nova Fallen
    Albert Cheu
    Katharine Daly
    Adria Gascon
    Marco Gruteser
    ArXiv (2024)
    Federated Learning and Analytics (FLA) have seen widespread adoption by technology platforms for processing sensitive on-device data. However, basic FLA systems have privacy limitations: they do not necessarily require anonymization mechanisms like differential privacy (DP), and provide limited protections against a potentially malicious service provider. Adding DP to a basic FLA system currently requires either adding excessive noise to each device's updates, or assuming an honest service provider that correctly implements the mechanism and only uses the privatized outputs. Secure multiparty computation (SMPC)-based oblivious aggregations can limit the service provider's access to individual user updates and improve DP tradeoffs, but the tradeoffs are still suboptimal, and they suffer from scalability challenges and susceptibility to Sybil attacks. This paper introduces a novel system architecture that leverages trusted execution environments (TEEs) and open-sourcing to both ensure confidentiality of server-side computations and provide externally verifiable privacy properties, bolstering the robustness and trustworthiness of private federated computations.
    Privacy-Preserving Instructions for Aligning Large Language Models
    Da Yu
    Sewoong Oh
    Zheng Xu
    International Conference on Machine Learning (ICML) (2024)
    Service providers of large language model (LLM) applications collect user instructions in the wild and use them in further aligning LLMs with users’ intentions. These instructions, which potentially contain sensitive information, are annotated by human workers in the process. This poses a new privacy risk not addressed by the typical private optimization. To this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. Formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. Crucial in achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. In both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing comparable results to real instructions. In supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as Vicuna
    Improved Communication-Privacy Trade-offs in L2 Mean Estimation under Streaming Differential Privacy
    Wei-Ning Chen
    Albert No
    Sewoong Oh
    Zheng Xu
    International Conference on Machine Learning (ICML) (2024)
    We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: first, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square errors (MSEs); second, schemes achieving order-optimal communication-privacy trade-offs do not extend seamlessly to streaming differential privacy (DP) settings (e.g., tree aggregation or matrix factorization), rendering them incompatible with DP-FTRL type optimizers. In this work, we tackle these issues by introducing a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP noise. Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that converge quickly to those of the uncompressed Gaussian mechanism. Additionally, we extend the sparsification scheme to the matrix factorization framework under streaming DP and provide a precise accountant tailored for DP-FTRL type optimizers. Empirically, our method demonstrates at least a 100x improvement in compression for DP-SGD across various FL tasks.
    Cascades are a common type of machine learning system where a larger, remote model can be queried if a local model is not able to handle a user’s query by itself. They are becoming an increasingly popular design choice for Large Language Model (LLM) serving stacks due to their ability to preserve task performance while dramatically reducing inference costs. However, applying cascade systems in situations where the local model has access to sensitive data constitutes a significant privacy risk for users, since any such data could be forwarded to the remote model. In this work, we show the feasibility of applying cascade systems in such setups, equipping the local model with privacy-preserving techniques that reduce the risk of leaking private information when querying the remote model. To analyze the privacy of such a setup, we introduce a novel privacy measure that quantifies sensitive information leakage. We then propose a system that leverages the recently introduced social learning paradigm, in which LLMs collaboratively learn from each other by exchanging natural language, and demonstrate on several datasets that our methods minimize the privacy loss while at the same time improving task performance compared to a non-cascade baseline.
    Federated Learning of Gboard Language Models with Differential Privacy
    Zheng Xu
    Yanxiang Zhang
    Galen Andrew
    Christopher Choquette
    Jesse Rosenstock
    Yuanbo Zhang
    ACL industry track (2023) (to appear)
    We train language models (LMs) with federated learning (FL) and differential privacy (DP) in the Google Keyboard (Gboard). We apply the DP-Follow-the-Regularized-Leader (DP-FTRL) algorithm (Kairouz et al., 2021) to achieve meaningfully formal DP guarantees without requiring uniform sampling of client devices. To provide favorable privacy-utility trade-offs, we introduce a new client participation criterion and discuss the implication of its configuration in large scale systems. We show how quantile-based clip estimation (Andrew et al., 2019) can be combined with DP-FTRL to adaptively choose the clip norm during training or reduce the hyperparameter tuning in preparation for training. With the help of pretraining on public data, we train and deploy more than twenty Gboard LMs that achieve high utility and $\rho$-zCDP privacy guarantees with $\rho \in (0.2, 2)$, with two models additionally trained with secure aggregation (Bonawitz et al., 2017). We are happy to announce that all the next word prediction neural network LMs in Gboard now have DP guarantees, and all future launches of Gboard neural network LMs will require DP guarantees. We summarize our experience and provide concrete suggestions on DP training for practitioners.
    Privacy-first Health Research with Federated Learning
    Adam Sadilek
    Dung Nguyen
    Methun Kamruzzaman
    Benjamin Rader
    Stefan Mellem
    Elaine O. Nsoesie
    Jamie MacFarlane
    Anil Vullikanti
    Madhav Marathe
    Paul C. Eastham
    John S. Brownstein
    npj Digital Medicine (2021)
    Privacy protection is paramount in conducting health research. However, studies often rely on data stored in a centralized repository, where analysis is done with full access to the sensitive underlying content. Recent advances in federated learning enable building complex machine-learned models that are trained in a distributed fashion. These techniques facilitate the calculation of research study endpoints such that private data never leaves a given device or healthcare system. We show, on a diverse set of single and multi-site health studies, that federated models can achieve similar accuracy, precision, and generalizability, and lead to the same interpretation as standard centralized statistical models while achieving considerably stronger privacy protections and without significantly raising computational costs. This work is the first to apply modern and general federated learning methods that explicitly incorporate differential privacy to clinical and epidemiological research, across a spectrum of units of federation, model architectures, complexity of learning tasks and diseases. As a result, it enables health research participants to remain in control of their data and still contribute to advancing science, aspects that used to be at odds with each other.
    Practical and Private (Deep) Learning without Sampling or Shuffling
    Om Thakkar
    Abhradeep Thakurta
    Zheng Xu
    38th International Conference on Machine Learning (ICML 2021) (2021) (to appear)